Bigquery python SchemaField() with ARRAY of STRUCTS
【Asked】: 2018-03-28 19:48:37
【Question】: I am trying to create a table in BigQuery through the Python client. The documentation uses bigquery.SchemaField('name', 'TYPE') to define a field, but this does not seem to work for ARRAYs or STRUCTs. This is the array-of-STRUCTs field I am trying to create:
bigquery.SchemaField('owners', 'ARRAY<STRUCT<emailAddress STRING, displayName STRING>>', 'REPEATABLE'),
If I use the field definition above, I get the following API error:
400 POST https://www.googleapis.com/bigquery/v2/projects/import-sheet/datasets/sheetgo/tables: Invalid value for: ARRAY<STRUCT<emailAddress STRING, displayName STRING>> is not a valid value
The full code (which fails with the error above):
schema = [
    bigquery.SchemaField('user', 'STRING'),
    bigquery.SchemaField('id', 'STRING'),
    bigquery.SchemaField('service_origin', 'STRING'),
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('mimeType', 'STRING'),
    bigquery.SchemaField('createdAt', 'DATETIME'),
    bigquery.SchemaField('ownedByMe', 'BOOLEAN'),
    bigquery.SchemaField('owners', 'ARRAY<STRUCT<emailAddress STRING, displayName STRING>>', 'REPEATABLE'),
    bigquery.SchemaField('parents', 'ARRAY<STRING>', 'REPEATABLE'),
    bigquery.SchemaField('teamDriveId', 'STRING'),
    bigquery.SchemaField('permissions', 'STRING'),
    bigquery.SchemaField('shared', 'BOOLEAN'),
    bigquery.SchemaField('writersCanShare', 'BOOLEAN'),
    bigquery.SchemaField('sharingUser', 'STRING'),
    bigquery.SchemaField('version', 'STRING'),
    bigquery.SchemaField('size', 'FLOAT'),
    bigquery.SchemaField('data_properties', 'ARRAY<STRUCT<'
                                            'rows INTEGER,'
                                            'cells_with_importrange ARRAY<'
                                            'STRUCT<'
                                            'row_index INTEGER,'
                                            'col_index INTEGER,'
                                            'importrange STRING'
                                            '>'
                                            '>,'
                                            'tab_name STRING,'
                                            'cell_count FLOAT,'
                                            'header_rows ARRAY<STRING>,'
                                            '>>', 'REPEATABLE'),
    bigquery.SchemaField('timezone', 'STRING'),
    bigquery.SchemaField('locale', 'STRING'),
    bigquery.SchemaField('last_scansheet', 'STRING'),
]
bigquery_client = bigquery.Client(PROJECT_ID)
dataset_ref = bigquery_client.dataset("eita")
table_ref = dataset_ref.table(table_id)
table = bigquery.Table(table_ref, schema=schema)
table = bigquery_client.create_table(table)
Update
Thanks to Willian Fuks, I got this working. The final schema looks like this:
schema = [
    bigquery.SchemaField('user', 'STRING'),
    bigquery.SchemaField('id', 'STRING'),
    bigquery.SchemaField('service_origin', 'STRING'),
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('mimeType', 'STRING'),
    bigquery.SchemaField('createdAt', 'DATETIME'),
    bigquery.SchemaField('ownedByMe', 'BOOLEAN'),
    bigquery.SchemaField('owners', 'RECORD', mode='REPEATED',
        fields=(
            bigquery.SchemaField('emailAddress', 'STRING'),
            bigquery.SchemaField('displayName', 'STRING')
        )
    ),
    bigquery.SchemaField('parents', 'STRING', mode='REPEATED'),
    bigquery.SchemaField('teamDriveId', 'STRING'),
    bigquery.SchemaField('permissions', 'STRING'),
    bigquery.SchemaField('shared', 'BOOLEAN'),
    bigquery.SchemaField('writersCanShare', 'BOOLEAN'),
    bigquery.SchemaField('sharingUser', 'STRING'),
    bigquery.SchemaField('version', 'STRING'),
    bigquery.SchemaField('size', 'FLOAT'),
    bigquery.SchemaField('data_properties', 'RECORD', mode='REPEATED',
        fields=(
            bigquery.SchemaField('rows', 'INTEGER'),
            bigquery.SchemaField('cells_with_importrange', 'RECORD', mode='REPEATED',
                fields=(
                    bigquery.SchemaField('row_index', 'INTEGER'),
                    bigquery.SchemaField('col_index', 'INTEGER'),
                    bigquery.SchemaField('importrange', 'STRING'),
                )
            ),
            bigquery.SchemaField('tab_name', 'STRING'),
            bigquery.SchemaField('cell_count', 'FLOAT'),
            bigquery.SchemaField('header_rows', 'STRING', mode='REPEATED')
        )
    ),
    bigquery.SchemaField('timezone', 'STRING'),
    bigquery.SchemaField('locale', 'STRING'),
    bigquery.SchemaField('last_scansheet', 'STRING'),
]
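To make the working schema more concrete, here is a rough sketch of what a single row for it might look like as plain Python data. All values are made up for illustration, and to_ndjson is a hypothetical helper, not part of the BigQuery client. The point is only the shape: a REPEATED field maps to a list, and a RECORD field maps to a dict keyed by its sub-field names.

```python
import json

# Hypothetical example row (all values invented). Columns that are left out
# are assumed to be NULLABLE in the schema.
row = {
    "user": "alice@example.com",
    "id": "file-123",
    "ownedByMe": True,
    # REPEATED RECORD -> list of dicts
    "owners": [
        {"emailAddress": "alice@example.com", "displayName": "Alice"},
    ],
    # REPEATED STRING -> list of strings
    "parents": ["folder-1", "folder-2"],
    "data_properties": [
        {
            "rows": 10,
            # REPEATED RECORD nested inside a REPEATED RECORD
            "cells_with_importrange": [
                {"row_index": 0, "col_index": 2, "importrange": "sheet-key!A1:B2"},
            ],
            "tab_name": "Sheet1",
            "cell_count": 260.0,
            "header_rows": ["name", "email"],
        },
    ],
}


def to_ndjson(rows):
    """Serialize rows as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(r) for r in rows)


ndjson = to_ndjson([row])
print(ndjson)
```

Dicts shaped like this are what the client's insert_rows_json method takes, and the newline-delimited form is one of the formats a load job can read; neither call is shown here because both need a live project.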
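【Comments】:
It would be great if you could include the full code, not just the schema part, so that others have a good reference. :)
@YanniCao Yes, I am even more confused about dumping struct data using Python.
【Answer 1】: The contract of the SchemaField constructor expects different inputs than the ones you are passing: the field type is a single type name such as 'RECORD', repetition is expressed through mode='REPEATED', and nested fields go in the fields argument.
Try this: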
schema = [
    (...),
    SchemaField('owners', 'RECORD', mode='REPEATED',
                fields=(SchemaField('emailAddress', 'STRING'),
                        SchemaField('displayName', 'STRING')
                        )
                ),
    (...)
]
The main idea is to define the fields inside a record field by using other SchemaField definitions.
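The nesting idea can also be sketched without the client library, using plain dicts that mirror the name/type/mode/fields keys of the REST API's field representation. The helper leaf_paths below is hypothetical and exists only to show that a record is a tree of field definitions, each of which may contain further fields.

```python
# Plain-dict mirror of the nested 'owners' definition. The keys follow the
# BigQuery REST API field representation (name, type, mode, fields).
owners_field = {
    "name": "owners",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
        {"name": "emailAddress", "type": "STRING", "mode": "NULLABLE"},
        {"name": "displayName", "type": "STRING", "mode": "NULLABLE"},
    ],
}


def leaf_paths(field, prefix=""):
    """Return the dotted paths of all non-RECORD columns under `field`.

    Hypothetical helper for illustration; it recurses into RECORD fields
    the same way nested SchemaField definitions nest inside each other.
    """
    path = prefix + field["name"]
    if field["type"] != "RECORD":
        return [path]
    paths = []
    for sub in field.get("fields", []):
        paths.extend(leaf_paths(sub, path + "."))
    return paths


print(leaf_paths(owners_field))
# → ['owners.emailAddress', 'owners.displayName']
```

If the client library is installed, a dict in this shape can be converted with SchemaField.from_api_repr, which can be handy when the schema is kept in a JSON file rather than in code.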
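【Comments】:
Perfect! I will update my question to include the solution.
【Answer 2】: If you want to use standard SQL type names instead of legacy SQL types and SchemaField, you can execute a query to create the table instead: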
CREATE TABLE dataset.table_name
(
  user STRING,
  id STRING,
  service_origin STRING,
  name STRING,
  mimeType STRING,
  createdAt DATETIME,
  ownedByMe BOOL,
  owners ARRAY<STRUCT<emailAddress STRING, displayName STRING>>,
  parents ARRAY<STRING>,
  teamDriveId STRING,
  permissions STRING,
  shared BOOL,
  writersCanShare BOOL,
  sharingUser STRING,
  version STRING,
  size FLOAT64,
  data_properties
    ARRAY<STRUCT<`rows` INT64,
      cells_with_importrange ARRAY<STRUCT<row_index INT64, col_index INT64, importrange STRING>>,
      tab_name STRING, cell_count FLOAT64, header_rows ARRAY<STRING>>>,
  timezone STRING,
  locale STRING,
  last_scansheet STRING
);
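For running such a statement from Python rather than the web console, a minimal sketch follows. It assumes client is a google.cloud.bigquery.Client whose query method runs standard SQL (older client versions may need legacy SQL disabled explicitly in the job configuration). The function name create_table_with_ddl and the shortened DDL string are illustrative only.

```python
def create_table_with_ddl(client, ddl):
    """Run a CREATE TABLE statement and block until the job finishes.

    `client` is assumed to be a google.cloud.bigquery.Client, or any object
    with a compatible `query` method. It is passed in so that this sketch
    has no import-time dependency on the library.
    """
    job = client.query(ddl)  # submits the statement as a query job
    job.result()             # waits for completion; raises if the job failed
    return job


# Shortened version of the statement above, kept small for illustration.
DDL = """
CREATE TABLE dataset.table_name
(
  owners ARRAY<STRUCT<emailAddress STRING, displayName STRING>>,
  parents ARRAY<STRING>
)
"""
```

With real credentials this would be called as create_table_with_ddl(bigquery.Client(), DDL). Waiting on result() matters because query only submits the job, so an error in the DDL would otherwise go unnoticed.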
【Comments】:
What command did you use to get this query to work? We found that it works in the web console, but we could not get the standard Python "client.query(sql)" command to run the SQL statement and create the table.
You should probably create a separate question showing the code you are using.