googleapis / python-bigquery: Client.load_dataframe_to_table fails with PyArrow "TypeError: an integer is required (got type str)"

Posted: 2021-10-04 02:00:41

Question:

Given the following code:

try:
    dest_table = bigquery.table.Table(table_id, schema)
    job = self.client.load_table_from_dataframe(
        df_data, # pd.DataFrame
        dest_table,
        job_config=bigquery.job.LoadJobConfig(schema=schema)
    )
    job.result()
except TypeError:
    with pd.option_context("display.max_rows", None, "display.max_columns", None, "display.width", None):
        LOG.error("Failed to upload dataframe: \n\n%s\n", df_data.to_csv(header=True, index=False, quoting=csv.QUOTE_NONNUMERIC))
        LOG.error("\n%s\n", df_data.dtypes)
        if schema:
            LOG.error(
                "schema: \n\n%s", 
                ('[\n  ' + ',\n  '.join(json.dumps(field) for field in schema) + '\n]\n')
            )
        LOG.error(f"dest_table_id: dest_table_id")
    raise

BigQuery's Client.load_table_from_dataframe raises from pyarrow:

dags/utils/database/_bigquery.py:257: in load_file_to_table
    job = self.client.load_table_from_dataframe(
lib/python3.8/site-packages/google/cloud/bigquery/client.py:2233: in load_table_from_dataframe
    _pandas_helpers.dataframe_to_parquet(
lib/python3.8/site-packages/google/cloud/bigquery/_pandas_helpers.py:486: in dataframe_to_parquet
    arrow_table = dataframe_to_arrow(dataframe, bq_schema)
lib/python3.8/site-packages/google/cloud/bigquery/_pandas_helpers.py:450: in dataframe_to_arrow
    bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
lib/python3.8/site-packages/google/cloud/bigquery/_pandas_helpers.py:224: in bq_to_arrow_array
    return pyarrow.Array.from_pandas(series, type=arrow_type)
pyarrow/array.pxi:859: in pyarrow.lib.Array.from_pandas
    ???
pyarrow/array.pxi:265: in pyarrow.lib.array
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: an integer is required (got type str)

However, when investigating with the ERROR debug prints, the supplied DataFrame appears to match the expected schema for every field provided for the table:

ERROR    utils.database._bigquery:_bigquery.py:265 Failed to upload dataframe:

"clinic_key","schedule_template_time_interval_key","schedule_template_key","date_key","schedule_owner_key","schedule_template_schedule_track_key","schedule_content_label_key","start_time_key","end_time_key","priority"
"clitest11111111111111111111111","1","1","2021-01-01","1","1","1","19:00:00","21:00:00",1
"clitest11111111111111111111111","1","1","2021-01-01","1","1","2","20:00:00","20:30:00",2
"clitest11111111111111111111111","1","1","2021-01-01","1","1","3","20:20:00","20:30:00",3

ERROR    utils.database._bigquery:_bigquery.py:266
clinic_key                              object
schedule_template_time_interval_key     object
schedule_template_key                   object
date_key                                object
schedule_owner_key                      object
schedule_template_schedule_track_key    object
schedule_content_label_key              object
start_time_key                          object
end_time_key                            object
priority                                 int64
dtype: object

ERROR    utils.database._bigquery:_bigquery.py:268 schema:

[
  {"name": "clinic_key", "type": "STRING", "mode": "NULLABLE"},
  {"name": "schedule_template_time_interval_key", "type": "STRING", "mode": "NULLABLE"},
  {"name": "schedule_template_key", "type": "STRING", "mode": "NULLABLE"},
  {"name": "date_key", "type": "DATE", "mode": "NULLABLE"},
  {"name": "schedule_owner_key", "type": "STRING", "mode": "NULLABLE"},
  {"name": "schedule_template_schedule_track_key", "type": "STRING", "mode": "NULLABLE"},
  {"name": "schedule_content_label_key", "type": "STRING", "mode": "NULLABLE"},
  {"name": "start_time_key", "type": "TIME", "mode": "NULLABLE"},
  {"name": "end_time_key", "type": "TIME", "mode": "NULLABLE"},
  {"name": "priority", "type": "INT64", "mode": "NULLABLE"}
]

Before calling load_table_from_dataframe I tried converting the (start_time_key, end_time_key) fields to INT (seconds since the start of the day), but that did not solve the problem. Beyond that I am stumped; I do not understand which field is supposed to be an integer but is arriving as a string.
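
To narrow down which column pyarrow is rejecting, each schema field can be pushed through the same conversion helper that appears in the traceback. This is only a rough diagnostic sketch: bq_to_arrow_array is a private, version-dependent helper of google-cloud-bigquery, and the names df_data, schema and LOG are taken from the snippet above (schema being the list of {"name", "type", "mode"} dicts).

# Rough diagnostic sketch; relies on a private helper seen in the traceback,
# so it may differ between google-cloud-bigquery versions.
from google.cloud import bigquery
from google.cloud.bigquery import _pandas_helpers

for field_dict in schema:
    field = bigquery.SchemaField.from_api_repr(field_dict)
    try:
        # Same call the library makes internally for each column.
        _pandas_helpers.bq_to_arrow_array(df_data[field.name], field)
    except TypeError as exc:
        LOG.error("column %r (%s) fails pyarrow conversion: %s",
                  field.name, field.field_type, exc)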

How can I fix this?

P.S.: I tried a different approach using load_file_to_table, but I ran into another problem. Here is the link to the other issue.

Answer 1:

I tried to replicate your use case of loading data from a dataframe into BigQuery with the Python client library, referring to this document.

The sample provided by Google works as expected. When I tried it with the schema you provided, I got the same error, i.e. "TypeError: an integer is required (got type str)".

This error occurs because of a mismatch between the data types you declare in the schema and the values you supply in those columns, i.e.

"date_key", "type": "DATE"
"start_time_key", "type": "TIME"
"end_time_key", "type": "TIME"

In these columns you declare the data types as Date and Time respectively, but the values you pass are Strings, which is why you get the TypeError, i.e.

"Date_key": "2021-01-01", "start_time_key": "19:00:00" , "end_time_key" : "21:00:00"

You can refer to the code snippet and DataFrame below. I have reproduced it on my end and it works.

load_data.py:

import datetime

from google.cloud import bigquery
import pandas
import pytz

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
table_id = "myproject.dataset1.tab4"

records = [
   {
      "clinic_key": u"cli101",
      "schedule_template_time_interval_key": "1",
      "schedule_template_key": "1",
      "date_key": datetime.date(2021, 10, 5),
      "schedule_owner_key": "1",
      "schedule_template_schedule_track_key": "1",
      "schedule_content_label_key": "1",
      "start_time_key": datetime.time(19, 5, 0),
      "end_time_key": datetime.time(20, 0, 0),
      "priority": 1,
   },
   {
      "clinic_key": u"cli102",
      "schedule_template_time_interval_key": "2",
      "schedule_template_key": "2",
      "date_key": datetime.date(2021, 10, 6),
      "schedule_owner_key": "2",
      "schedule_template_schedule_track_key": "2",
      "schedule_content_label_key": "2",
      "start_time_key": datetime.time(16, 10, 0),
      "end_time_key": datetime.time(16, 50, 0),
      "priority": 2,
   },
   {
      "clinic_key": u"cli103",
      "schedule_template_time_interval_key": "3",
      "schedule_template_key": "3",
      "date_key": datetime.date(2021, 10, 7),
      "schedule_owner_key": "3",
      "schedule_template_schedule_track_key": "3",
      "schedule_content_label_key": "3",
      "start_time_key": datetime.time(19, 10, 0),
      "end_time_key": datetime.time(20, 0, 0),
      "priority": 1,
   },
   {
      "clinic_key": u"cli104",
      "schedule_template_time_interval_key": "4",
      "schedule_template_key": "4",
      "date_key": datetime.date(2021, 10, 8),
      "schedule_owner_key": "4",
      "schedule_template_schedule_track_key": "4",
      "schedule_content_label_key": "4",
      "start_time_key": datetime.time(20, 40, 0),
      "end_time_key": datetime.time(21, 15, 0),
      "priority": 3,
   },
]
dataframe = pandas.DataFrame(
   records,
   # In the loaded table, the column order reflects the order of the
   # columns in the DataFrame.
   columns=[
"clinic_key",                            
"schedule_template_time_interval_key",  
"schedule_template_key",                  
"date_key",                             
"schedule_owner_key",                     
"schedule_template_schedule_track_key", 
"schedule_content_label_key",             
"start_time_key",                        
"end_time_key",                          
"priority",                               
   ],
   # Optionally, set a named index, which can also be written to the
   # BigQuery table.
  
)
job_config = bigquery.LoadJobConfig(
   # Specify a (partial) schema. All columns are always written to the
   # table. The schema is used to assist in data type definitions.
   schema=[
       # Specify the type of columns whose type cannot be auto-detected. For
       # example the "title" column uses pandas dtype "object", so its
       # data type is ambiguous.
       bigquery.SchemaField("clinic_key", "STRING"),
       bigquery.SchemaField("schedule_template_time_interval_key","STRING"),
       bigquery.SchemaField("schedule_template_key","STRING"),
       bigquery.SchemaField("schedule_owner_key","STRING"),
       bigquery.SchemaField("schedule_template_schedule_track_key","STRING"),
       bigquery.SchemaField("schedule_content_label_key","STRING"),
       bigquery.SchemaField("priority","INTEGER"),   
      
   ],
      write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(
   dataframe, table_id, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
   "Loaded  rows and  columns to ".format(
       table.num_rows, len(table.schema), table_id
   )
)

Sample DataFrame record:

      "clinic_key": u"cli101",
      "schedule_template_time_interval_key":"1",
      "schedule_template_key":"1",
      "date_key":datetime.date(2021,10,5),
      "schedule_owner_key":"1",
      "schedule_template_schedule_track_key":"1",
      "schedule_content_label_key":"1",
      "start_time_key":datetime.time(19,5,00),
      "end_time_key":datetime.time(20,00,00),
      "priority":1,

When you use DATE and TIME in the schema, the values in those columns should be datetime.date(2021, 2, 21) and datetime.time(16, 0, 0) objects respectively, not strings.
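
If the DataFrame already holds these columns as strings (as in the question), they can be converted in place before calling load_table_from_dataframe. A minimal sketch, assuming the string formats shown in the question's debug output ("YYYY-MM-DD" dates and "HH:MM:SS" times); the helper name is made up here:

import pandas as pd

def coerce_date_and_time_columns(df: pd.DataFrame) -> pd.DataFrame:
    # DATE column: parse "YYYY-MM-DD" strings into datetime.date objects.
    df["date_key"] = pd.to_datetime(df["date_key"], format="%Y-%m-%d").dt.date
    # TIME columns: parse "HH:MM:SS" strings into datetime.time objects.
    for col in ("start_time_key", "end_time_key"):
        df[col] = pd.to_datetime(df[col], format="%H:%M:%S").dt.time
    return df

The dtypes remain object, but each cell now holds a datetime.date / datetime.time instance, which the client can map onto the DATE / TIME Arrow types instead of strings.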

Schema output:

Query output:

Comments:

So the trick is to use native Python types rather than formatting the values as strings? That is a bit unexpected, but it makes sense! Thank you very much for the thorough analysis, Sandeep!

For those looking to do the above programmatically, I would try converting the fields with: df["date_key"].apply(lambda v: datetime.datetime.strptime(v, "%Y-%m-%d").date()) and df["time_key"].apply(lambda v: datetime.datetime.strptime(v, "%H:%M:%S").time())
