How to automate BigQuery schema generation using Python?


Posted: 2019-03-10 01:29:57

Question:

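I have a MySQL database on Google Cloud, and I want to generate a schema automatically so that I can insert the data into BigQuery. I need to create the following lines automatically: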

schema = [bigquery.SchemaField('EmployeeID', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('LastName', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('FirstName', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('Title', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('TitleOfCourtesy', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('BirthDate', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('HireDate', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('Address', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('City', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('Region', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('PostalCode', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('Country', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('HomePhone', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('Extension', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('Photo', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('Notes', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('ReportsTo', 'STRING', mode='NULLABLE'),
bigquery.SchemaField('PhotoPath', 'STRING', mode='NULLABLE')]

To achieve this, I tried the following. First, I used a function to get the column names; this is the output:

print(table_schema_name_column)
['EmployeeID', 'LastName', 'FirstName', 'Title', 'TitleOfCourtesy', 'BirthDate', 'HireDate', 'Address', 'City', 'Region', 'PostalCode', 'Country', 'HomePhone', 'Extension', 'Photo', 'Notes', 'ReportsTo', 'PhotoPath']

Then I tried:

schema2=[]
for element in table_schema_name_column:
    base2="bigquery.SchemaField("+'\''+element+"\', \'STRING\', mode=\'NULLABLE\')"
    tmp=base2
    #print(base2)
    schema2.append(base2)

print(schema2)

This is the corresponding output:

["bigquery.SchemaField('EmployeeID', 'STRING', mode='NULLABLE')", 
"bigquery.SchemaField('LastName', 'STRING', mode='NULLABLE')", 
"bigquery.SchemaField('FirstName', 'STRING', mode='NULLABLE')", "bigquery.SchemaField('Title', 'STRING', mode='NULLABLE')", 
"bigquery.SchemaField('TitleOfCourtesy', 'STRING', mode='NULLABLE')", "bigquery.SchemaField('BirthDate', 'STRING', mode='NULLABLE')",
"bigquery.SchemaField('HireDate', 'STRING', mode='NULLABLE')", "bigquery.SchemaField('Address', 'STRING', mode='NULLABLE')", 
"bigquery.SchemaField('City', 'STRING', mode='NULLABLE')", "bigquery.SchemaField('Region', 'STRING', mode='NULLABLE')",
 "bigquery.SchemaField('PostalCode', 'STRING', mode='NULLABLE')", "bigquery.SchemaField('Country', 'STRING', mode='NULLABLE')", 
 "bigquery.SchemaField('HomePhone', 'STRING', mode='NULLABLE')", "bigquery.SchemaField('Extension', 'STRING', mode='NULLABLE')",
  "bigquery.SchemaField('Photo', 'STRING', mode='NULLABLE')", "bigquery.SchemaField('Notes', 'STRING', mode='NULLABLE')", 
  "bigquery.SchemaField('ReportsTo', 'STRING', mode='NULLABLE')", "bigquery.SchemaField('PhotoPath', 'STRING', mode='NULLABLE')"]

The problem with schema2 is that when I try to use it to create a table, I get the following error:

table_ref = dataset_ref.table("my_table_aut")
table = bigquery.Table(table_ref, schema=schema2)
table = client.create_table(table)  # API request

assert table.table_id == "my_table_aut"

Error output:

ValueError                                Traceback (most recent call last)
<ipython-input-13-ce1fc2c637fe> in <module>
      4 ]
      5 table_ref = dataset_ref.table("my_table_aut")
----> 6 table = bigquery.Table(table_ref, schema=schema2)
      7 table = client.create_table(table)  # API request
      8 

~/.local/lib/python3.6/site-packages/google/cloud/bigquery/table.py in __init__(self, table_ref, schema)
    371         # Let the @property do validation.
    372         if schema is not None:
--> 373             self.schema = schema
    374 
    375     @property

~/.local/lib/python3.6/site-packages/google/cloud/bigquery/table.py in schema(self, value)
    420             self._properties["schema"] = None
    421         elif not all(isinstance(field, SchemaField) for field in value):
--> 422             raise ValueError("Schema items must be fields")
    423         else:
    424             self._properties["schema"] = {"fields": _build_schema_resource(value)}

ValueError: Schema items must be fields

I would appreciate any help in completing this task.

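Comments:

I don't understand why there are quotes around the SchemaField objects. It looks like you are building an array of strings...

That is exactly my problem: I can't convert the array of strings into the corresponding objects.

Answer 1: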

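This should work. It appends SchemaField objects to the list instead of strings that merely look like constructor calls: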

schema2=[]
for element in table_schema_name_column:
    schema2.append(bigquery.SchemaField(element, 'STRING', mode='NULLABLE'))

table_ref = dataset_ref.table("my_table_aut")
table = bigquery.Table(table_ref, schema=schema2)
table = client.create_table(table)
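
If the columns should not all be STRING, the same loop can pick a type per column. The following is only a sketch under stated assumptions: the `MYSQL_TO_BQ` mapping, the `build_schema_dicts` helper, and the `(name, mysql_type)` input pairs are hypothetical and not part of the original question. It builds plain dictionaries in BigQuery's schema resource format, so it runs without the client library installed.

```python
import json

# Hypothetical, incomplete mapping from MySQL column types to BigQuery types.
# This is an assumption for illustration; extend it to match your own tables.
MYSQL_TO_BQ = {
    "int": "INTEGER",
    "bigint": "INTEGER",
    "float": "FLOAT",
    "double": "FLOAT",
    "date": "DATE",
    "datetime": "DATETIME",
    "varchar": "STRING",
    "text": "STRING",
}


def build_schema_dicts(columns):
    """Turn (name, mysql_type) pairs into schema-resource dictionaries.

    A length suffix such as "(20)" is ignored, and any type missing from
    the mapping falls back to STRING.
    """
    schema = []
    for name, mysql_type in columns:
        base_type = mysql_type.lower().split("(")[0].strip()
        schema.append({
            "name": name,
            "type": MYSQL_TO_BQ.get(base_type, "STRING"),
            "mode": "NULLABLE",
        })
    return schema


# Example input; in practice these pairs would come from the MySQL table metadata.
columns = [("EmployeeID", "int"), ("LastName", "varchar(20)"), ("BirthDate", "datetime")]
schema_dicts = build_schema_dicts(columns)
print(json.dumps(schema_dicts, indent=2))

# With google-cloud-bigquery installed, each dictionary can be converted to a field object:
# schema2 = [bigquery.SchemaField.from_api_repr(d) for d in schema_dicts]
```

The resulting `schema2` list can then be passed to `bigquery.Table(table_ref, schema=schema2)` exactly as in the answer above.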
