googleapis / python-bigquery: Client.load_dataframe_to_table fails with PyArrow "TypeError: an integer is required (got type str)"
【Posted】2021-10-04 02:00:41
【Question】Given the following code:
import csv
import json

import pandas as pd
from google.cloud import bigquery

# Fragment from inside a method: `self.client` is a bigquery.Client,
# `LOG` is a logger, and `schema` appears to be a list of field dicts.
try:
    dest_table = bigquery.table.Table(table_id, schema)
    job = self.client.load_table_from_dataframe(
        df_data,  # pd.DataFrame
        dest_table,
        job_config=bigquery.job.LoadJobConfig(schema=schema),
    )
    job.result()
except TypeError:
    # Dump the DataFrame, its dtypes, and the schema to help debugging.
    with pd.option_context("display.max_rows", None, "display.max_columns", None, "display.width", None):
        LOG.error("Failed to upload dataframe: \n\n%s\n", df_data.to_csv(header=True, index=False, quoting=csv.QUOTE_NONNUMERIC))
        LOG.error("\n%s\n", df_data.dtypes)
    if schema:
        LOG.error(
            "schema: \n\n%s",
            ('[\n ' + ',\n '.join(json.dumps(field) for field in schema) + '\n]\n')
        )
    LOG.error(f"dest_table_id: {table_id}")
    raise
BigQuery's Client.load_table_from_dataframe raises the following error from pyarrow:
dags/utils/database/_bigquery.py:257: in load_file_to_table
job = self.client.load_table_from_dataframe(
lib/python3.8/site-packages/google/cloud/bigquery/client.py:2233: in load_table_from_dataframe
_pandas_helpers.dataframe_to_parquet(
lib/python3.8/site-packages/google/cloud/bigquery/_pandas_helpers.py:486: in dataframe_to_parquet
arrow_table = dataframe_to_arrow(dataframe, bq_schema)
lib/python3.8/site-packages/google/cloud/bigquery/_pandas_helpers.py:450: in dataframe_to_arrow
bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
lib/python3.8/site-packages/google/cloud/bigquery/_pandas_helpers.py:224: in bq_to_arrow_array
return pyarrow.Array.from_pandas(series, type=arrow_type)
pyarrow/array.pxi:859: in pyarrow.lib.Array.from_pandas
???
pyarrow/array.pxi:265: in pyarrow.lib.array
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E TypeError: an integer is required (got type str)
However, when I investigate using the ERROR-level debug output, the supplied DataFrame appears to match the expected schema for every field of the table:
ERROR utils.database._bigquery:_bigquery.py:265 Failed to upload dataframe:
"clinic_key","schedule_template_time_interval_key","schedule_template_key","date_key","schedule_owner_key","schedule_template_schedule_track_key","schedule_content_label_key","start_time_key","end_time_key","priority"
"clitest11111111111111111111111","1","1","2021-01-01","1","1","1","19:00:00","21:00:00",1
"clitest11111111111111111111111","1","1","2021-01-01","1","1","2","20:00:00","20:30:00",2
"clitest11111111111111111111111","1","1","2021-01-01","1","1","3","20:20:00","20:30:00",3
ERROR utils.database._bigquery:_bigquery.py:266
clinic_key object
schedule_template_time_interval_key object
schedule_template_key object
date_key object
schedule_owner_key object
schedule_template_schedule_track_key object
schedule_content_label_key object
start_time_key object
end_time_key object
priority int64
dtype: object
ERROR utils.database._bigquery:_bigquery.py:268 schema:
[
 {"name": "clinic_key", "type": "STRING", "mode": "NULLABLE"},
 {"name": "schedule_template_time_interval_key", "type": "STRING", "mode": "NULLABLE"},
 {"name": "schedule_template_key", "type": "STRING", "mode": "NULLABLE"},
 {"name": "date_key", "type": "DATE", "mode": "NULLABLE"},
 {"name": "schedule_owner_key", "type": "STRING", "mode": "NULLABLE"},
 {"name": "schedule_template_schedule_track_key", "type": "STRING", "mode": "NULLABLE"},
 {"name": "schedule_content_label_key", "type": "STRING", "mode": "NULLABLE"},
 {"name": "start_time_key", "type": "TIME", "mode": "NULLABLE"},
 {"name": "end_time_key", "type": "TIME", "mode": "NULLABLE"},
 {"name": "priority", "type": "INT64", "mode": "NULLABLE"}
]
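Before calling load_table_from_dataframe, I tried converting the (start_time_key, end_time_key) fields to INT (seconds since the start of the day), but that did not fix the problem. Beyond that, I am stumped: I cannot tell which field is expected to be an integer but is a string instead.
How can I solve this?
P.S.: I also tried a different approach using load_file_to_table, but ran into another problem. Here is the link to the other issue.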
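【Comments】:
【Answer 1】: I tried to replicate your use case, loading data from a DataFrame into BigQuery with the Python client library, by following this document.
The sample provided by Google works as expected. When I tried it with the schema you provided, I got the same error: "TypeError: an integer is required (got type str)"
The error occurs because the data types declared in the schema do not match the values supplied in the corresponding columns, namely: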
"date_key", "type": "DATE"
"start_time_key", "type": "TIME"
"end_time_key", "type": "TIME"
For these columns the schema declares the types DATE and TIME, but the values are passed as strings, which is why you get the TypeError:
"date_key": "2021-01-01", "start_time_key": "19:00:00", "end_time_key": "21:00:00"
You can refer to the code snippet and sample DataFrame below. I reproduced this on my end and it works.
load_data.py:
import datetime

import pandas
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
table_id = "myproject.dataset1.tab4"

# DATE and TIME values are native datetime.date / datetime.time objects,
# not strings.
records = [
    {
        "clinic_key": "cli101",
        "schedule_template_time_interval_key": "1",
        "schedule_template_key": "1",
        "date_key": datetime.date(2021, 10, 5),
        "schedule_owner_key": "1",
        "schedule_template_schedule_track_key": "1",
        "schedule_content_label_key": "1",
        "start_time_key": datetime.time(19, 5, 0),
        "end_time_key": datetime.time(20, 0, 0),
        "priority": 1,
    },
    {
        "clinic_key": "cli102",
        "schedule_template_time_interval_key": "2",
        "schedule_template_key": "2",
        "date_key": datetime.date(2021, 10, 6),
        "schedule_owner_key": "2",
        "schedule_template_schedule_track_key": "2",
        "schedule_content_label_key": "2",
        "start_time_key": datetime.time(16, 10, 0),
        "end_time_key": datetime.time(16, 50, 0),
        "priority": 2,
    },
    {
        "clinic_key": "cli103",
        "schedule_template_time_interval_key": "3",
        "schedule_template_key": "3",
        "date_key": datetime.date(2021, 10, 7),
        "schedule_owner_key": "3",
        "schedule_template_schedule_track_key": "3",
        "schedule_content_label_key": "3",
        "start_time_key": datetime.time(19, 10, 0),
        "end_time_key": datetime.time(20, 0, 0),
        "priority": 1,
    },
    {
        "clinic_key": "cli104",
        "schedule_template_time_interval_key": "4",
        "schedule_template_key": "4",
        "date_key": datetime.date(2021, 10, 8),
        "schedule_owner_key": "4",
        "schedule_template_schedule_track_key": "4",
        "schedule_content_label_key": "4",
        "start_time_key": datetime.time(20, 40, 0),
        "end_time_key": datetime.time(21, 15, 0),
        "priority": 3,
    },
]

dataframe = pandas.DataFrame(
    records,
    # In the loaded table, the column order reflects the order of the
    # columns in the DataFrame.
    columns=[
        "clinic_key",
        "schedule_template_time_interval_key",
        "schedule_template_key",
        "date_key",
        "schedule_owner_key",
        "schedule_template_schedule_track_key",
        "schedule_content_label_key",
        "start_time_key",
        "end_time_key",
        "priority",
    ],
)

job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are always written to the
    # table. The schema is used to assist in data type definitions.
    schema=[
        # Specify the type of columns whose type cannot be auto-detected. For
        # example the "clinic_key" column uses pandas dtype "object", so its
        # data type is ambiguous. The date and time columns are not listed
        # here; their types are left to be detected from the values.
        bigquery.SchemaField("clinic_key", "STRING"),
        bigquery.SchemaField("schedule_template_time_interval_key", "STRING"),
        bigquery.SchemaField("schedule_template_key", "STRING"),
        bigquery.SchemaField("schedule_owner_key", "STRING"),
        bigquery.SchemaField("schedule_template_schedule_track_key", "STRING"),
        bigquery.SchemaField("schedule_content_label_key", "STRING"),
        bigquery.SchemaField("priority", "INTEGER"),
    ],
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)
Sample DataFrame row:
{
    "clinic_key": "cli101",
    "schedule_template_time_interval_key": "1",
    "schedule_template_key": "1",
    "date_key": datetime.date(2021, 10, 5),
    "schedule_owner_key": "1",
    "schedule_template_schedule_track_key": "1",
    "schedule_content_label_key": "1",
    "start_time_key": datetime.time(19, 5, 0),
    "end_time_key": datetime.time(20, 0, 0),
    "priority": 1,
}
When the schema uses DATE and TIME, the values in those columns must be datetime.date and datetime.time objects, for example datetime.date(2021, 2, 21) and datetime.time(16, 0, 0).
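To see which columns are at fault before uploading, the rule above can be checked row by row. The following is only a minimal sketch using the standard library: the helper name find_type_mismatches and the EXPECTED_PYTHON_TYPES mapping are hypothetical, not part of google-cloud-bigquery, and the mapping covers only the types that appear in this question's schema.

```python
import datetime

# Assumed mapping from BigQuery type names to the Python types used in the
# working example above; it only covers the types in this question's schema.
EXPECTED_PYTHON_TYPES = {
    "STRING": str,
    "INT64": int,
    "INTEGER": int,
    "DATE": datetime.date,
    "TIME": datetime.time,
}


def find_type_mismatches(records, schema):
    """Return (row_index, column, bq_type, actual_type_name) for each bad value."""
    mismatches = []
    for row_index, row in enumerate(records):
        for field in schema:
            expected = EXPECTED_PYTHON_TYPES.get(field["type"])
            value = row.get(field["name"])
            # Skip NULLs and BigQuery types this sketch does not know about.
            if expected is None or value is None:
                continue
            if not isinstance(value, expected):
                mismatches.append(
                    (row_index, field["name"], field["type"], type(value).__name__)
                )
    return mismatches


schema = [
    {"name": "clinic_key", "type": "STRING", "mode": "NULLABLE"},
    {"name": "date_key", "type": "DATE", "mode": "NULLABLE"},
    {"name": "start_time_key", "type": "TIME", "mode": "NULLABLE"},
    {"name": "priority", "type": "INT64", "mode": "NULLABLE"},
]
records = [
    {
        "clinic_key": "clitest11111111111111111111111",
        "date_key": "2021-01-01",  # a string, as in the question
        "start_time_key": datetime.time(19, 0, 0),  # a native time object
        "priority": 1,
    }
]

for mismatch in find_type_mismatches(records, schema):
    print(mismatch)
```

Here only date_key is reported, because its value is a str while the schema declares DATE. For a DataFrame, the rows could probably be obtained with df_data.to_dict("records"), though numeric values may then come back as NumPy scalars rather than int depending on the pandas version, so treat integer mismatches from such a check with care.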
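【Discussion】:
So the trick is to use native Python types instead of formatting the values as strings? That is a bit unexpected, but it makes sense! Thank you very much for the in-depth analysis, Sandeep!
For those who want to do the above programmatically, I would try converting the fields with: df["date_key"].apply(lambda v: datetime.datetime.strptime(v, "%Y-%m-%d").date())
and df["time_key"].apply(lambda v: datetime.datetime.strptime(v, "%H:%M:%S").time())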
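The conversion suggested in this comment can also be sketched without pandas, on plain row dicts. This is an illustration under assumptions: the helper name parse_date_and_time_columns is hypothetical, and it uses datetime.date.fromisoformat / datetime.time.fromisoformat (Python 3.7+) in place of the strptime calls above, which should be equivalent for values in "YYYY-MM-DD" and "HH:MM:SS" form.

```python
import datetime


def parse_date_and_time_columns(records, date_columns, time_columns):
    """Return copies of the rows with ISO strings replaced by date/time objects."""
    converted = []
    for row in records:
        new_row = dict(row)  # leave the caller's rows untouched
        for column in date_columns:
            if isinstance(new_row.get(column), str):
                # "2021-01-01" -> datetime.date(2021, 1, 1)
                new_row[column] = datetime.date.fromisoformat(new_row[column])
        for column in time_columns:
            if isinstance(new_row.get(column), str):
                # "19:00:00" -> datetime.time(19, 0)
                new_row[column] = datetime.time.fromisoformat(new_row[column])
        converted.append(new_row)
    return converted


rows = [
    {
        "date_key": "2021-01-01",
        "start_time_key": "19:00:00",
        "end_time_key": "21:00:00",
        "priority": 1,
    },
]
fixed = parse_date_and_time_columns(
    rows, ["date_key"], ["start_time_key", "end_time_key"]
)
print(fixed[0])
```

On a DataFrame the same callables can be passed to apply, for example df["date_key"] = df["date_key"].apply(datetime.date.fromisoformat). Note that the result has to be assigned back to the column, and that the question's time columns are named start_time_key and end_time_key.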