使用 Pandas 或命令行上传到 BigQuery 时出现奇怪的重复字段错误。所有字段唯一
Posted
技术标签:
【中文标题】使用 Pandas 或命令行上传到 BigQuery 时出现奇怪的重复字段错误。所有字段唯一【英文标题】:Strange repeated field error when uploading to BigQuery with Pandas or command line. All fields unique 【发布时间】:2020-10-22 17:24:54 【问题描述】:我有一个 pandas 数据框,我也已写入文件。我还为 json 格式的数据创建了一个模式。我将它存储为 python 字典,并写入文件。
我尝试使用to_gpq
和命令行进行上传,但在这两种情况下,我都会收到关于字段重复的错误,即相同的字段。
这是有关数据的信息:
代码
df.shape
len(clinvar_variant_schema)
schema_fields = [x['name'] for x in clinvar_variant_schema]
schema_fields.sort()
json.dumps(schema_fields)
colnames = df.columns.tolist()
colnames.sort()
json.dumps(colnames)
set(schema_fields).difference(set(colnames))
set(colnames).difference(set(schema_fields))
输出
(1000, 24)
24
'["AF_ESP", "AF_EXAC", "AF_TGP", "ALLELEID", "ALT", "CHROM", "CLNDISDB", "CLNDN", "CLNHGVS", "CLNREVSTAT", "CLNSIG", "CLNSIGCONF", "CLNVC", "CLNVCSO", "CLNVI", "FILTER", "GENEINFO", "ID", "MC", "ORIGIN", "POS", "QUAL", "REF", "RS"]'
'["AF_ESP", "AF_EXAC", "AF_TGP", "ALLELEID", "ALT", "CHROM", "CLNDISDB", "CLNDN", "CLNHGVS", "CLNREVSTAT", "CLNSIG", "CLNSIGCONF", "CLNVC", "CLNVCSO", "CLNVI", "FILTER", "GENEINFO", "ID", "MC", "ORIGIN", "POS", "QUAL", "REF", "RS"]'
set()
set()
Colnames Schema_Names
0 AF_ESP AF_ESP
1 AF_EXAC AF_EXAC
2 AF_TGP AF_TGP
3 ALLELEID ALLELEID
4 ALT ALT
5 CHROM CHROM
6 CLNDISDB CLNDISDB
7 CLNDN CLNDN
8 CLNHGVS CLNHGVS
9 CLNREVSTAT CLNREVSTAT
10 CLNSIG CLNSIG
11 CLNSIGCONF CLNSIGCONF
12 CLNVC CLNVC
13 CLNVCSO CLNVCSO
14 CLNVI CLNVI
15 FILTER FILTER
16 GENEINFO GENEINFO
17 ID ID
18 MC MC
19 ORIGIN ORIGIN
20 POS POS
21 QUAL QUAL
22 REF REF
23 RS RS
使用熊猫
project_id = "my_project_id"
table_id = "my_dataset.clinvar_vcf"
df.to_gbq(
destination_table = table_id,
project_id = project_id,
if_exists = "replace",
table_schema = clinvar_variant_schema
)
IPython will make a temporary file named: /tmp/ipython_edit_pw1v55fy/ipython_edit_k3v1q7m5.py
1it [00:00, 2.93it/s]
---------------------------------------------------------------------------
InvalidResponse Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/google/cloud/bigquery/client.py in load_table_from_file(self, file_obj, destination, rewind, size, num_retries, job_id, job_id_prefix, location, project, job_config)
1829 response = self._do_resumable_upload(
-> 1830 file_obj, job_resource, num_retries
1831 )
/opt/conda/lib/python3.7/site-packages/google/cloud/bigquery/client.py in _do_resumable_upload(self, stream, metadata, num_retries)
2153 upload, transport = self._initiate_resumable_upload(
-> 2154 stream, metadata, num_retries
2155 )
/opt/conda/lib/python3.7/site-packages/google/cloud/bigquery/client.py in _initiate_resumable_upload(self, stream, metadata, num_retries)
2195 upload.initiate(
-> 2196 transport, stream, metadata, _GENERIC_CONTENT_TYPE, stream_final=False
2197 )
/opt/conda/lib/python3.7/site-packages/google/resumable_media/requests/upload.py in initiate(self, transport, stream, metadata, content_type, total_bytes, stream_final, timeout)
411 )
--> 412 self._process_initiate_response(response)
413 return response
/opt/conda/lib/python3.7/site-packages/google/resumable_media/_upload.py in _process_initiate_response(self, response)
505 self._get_status_code,
--> 506 callback=self._make_invalid,
507 )
/opt/conda/lib/python3.7/site-packages/google/resumable_media/_helpers.py in require_status_code(response, status_codes, get_status_code, callback)
110 u"Expected one of",
--> 111 *status_codes
112 )
InvalidResponse: ('Request failed with status code', 400, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.CREATED: 201>)
During handling of the above exception, another exception occurred:
BadRequest Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas_gbq/gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize, schema, progress_bar)
628 chunks = tqdm.tqdm(chunks)
--> 629 for remaining_rows in chunks:
630 logger.info(
/opt/conda/lib/python3.7/site-packages/tqdm/std.py in __iter__(self)
1173 try:
-> 1174 for obj in iterable:
1175 yield obj
/opt/conda/lib/python3.7/site-packages/pandas_gbq/load.py in load_chunks(client, dataframe, dataset_id, table_id, chunksize, schema, location)
81 job_config=job_config,
---> 82 location=location,
83 ).result()
/opt/conda/lib/python3.7/site-packages/google/cloud/bigquery/client.py in load_table_from_file(self, file_obj, destination, rewind, size, num_retries, job_id, job_id_prefix, location, project, job_config)
1836 except resumable_media.InvalidResponse as exc:
-> 1837 raise exceptions.from_http_response(exc.response)
1838
BadRequest: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/eacri-genomics/jobs?uploadType=resumable: Cannot load CSV data with a repeated field. Field: FILTER
During handling of the above exception, another exception occurred:
GenericGBQException Traceback (most recent call last)
<ipython-input-55-19cb6dc0a4ee> in <module>
6 project_id = project_id,
7 if_exists = "replace",
----> 8 table_schema = clinvar_variant_schema
9 )
/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in to_gbq(self, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials)
1655 location=location,
1656 progress_bar=progress_bar,
-> 1657 credentials=credentials,
1658 )
1659
/opt/conda/lib/python3.7/site-packages/pandas/io/gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials, verbose, private_key)
226 credentials=credentials,
227 verbose=verbose,
--> 228 private_key=private_key,
229 )
/opt/conda/lib/python3.7/site-packages/pandas_gbq/gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials, verbose, private_key)
1206 chunksize=chunksize,
1207 schema=table_schema,
-> 1208 progress_bar=progress_bar,
1209 )
1210
/opt/conda/lib/python3.7/site-packages/pandas_gbq/gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize, schema, progress_bar)
634 )
635 except self.http_error as ex:
--> 636 self.process_http_error(ex)
637
638 def schema(self, dataset_id, table_id):
/opt/conda/lib/python3.7/site-packages/pandas_gbq/gbq.py in process_http_error(ex)
433 # <https://cloud.google.com/bigquery/troubleshooting-errors>`__
434
--> 435 raise GenericGBQException("Reason: 0".format(ex))
436
437 def run_query(
GenericGBQException: Reason: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/eacri-genomics/jobs?uploadType=resumable: Cannot load CSV data with a repeated field. Field: FILTER
使用命令行bq
bq load --source_format=CSV --field_delimiter=tab --replace --schema clinvar_variant_schema.json my_project_id:my_dataset.clinvar_vcf clinvar_expanded_vcf
BigQuery error in load operation: Cannot load CSV data with a repeated field.
Field: FILTER
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
<ipython-input-45-be7b46a6694c> in <module>
----> 1 get_ipython().run_cell_magic('bash', '', 'bq load --source_format=CSV --field_delimiter=tab --replace --schema clinvar_variant_schema.json eacri-genomics:clinvar_9302020.clinvar_vcf clinvar_expanded_vcf\n')
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2379 with self.builtin_trap:
2380 args = (magic_arg_s, cell)
-> 2381 result = fn(*args, **kwargs)
2382 return result
2383
/opt/conda/lib/python3.7/site-packages/IPython/core/magics/script.py in named_script_magic(line, cell)
140 else:
141 line = script
--> 142 return self.shebang(line, cell)
143
144 # write a basic docstring:
<decorator-gen-103> in shebang(self, line, cell)
/opt/conda/lib/python3.7/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
185 # but it's overkill for just that one bit of state.
186 def magic_deco(arg):
--> 187 call = lambda f, *a, **k: f(*a, **k)
188
189 if callable(arg):
/opt/conda/lib/python3.7/site-packages/IPython/core/magics/script.py in shebang(self, line, cell)
243 sys.stderr.flush()
244 if args.raise_error and p.returncode!=0:
--> 245 raise CalledProcessError(p.returncode, cell, output=out, stderr=err)
246
247 def _run_script(self, p, cell, to_close):
CalledProcessError: Command 'b'bq load --source_format=CSV --field_delimiter=tab --replace --schema clinvar_variant_schema.json eacri-genomics:clinvar_9302020.clinvar_vcf clinvar_expanded_vcf\n'' returned non-zero exit status 1.
【问题讨论】:
【参考方案1】:确保您的clinvar_variant_schema
中没有'mode': 'REPEATED'
。还要检查表 clinvar_vcf
是否已存在于 BigQuery 中及其列的模式。
【讨论】:
上传中断为什么会有重复模式?那么,如果schema说repeated,数据是csv,但是数据没有重复数据,还是失败了? REPEATED 就像“数组”的同义词。考虑使用部分支持 REPEATED 的client.load_table_from_dataframe。 Client.load_table_from_json 完全支持 REPEATED。 是的,csv 与架构中的重复列不兼容。如果没有重复数据,从clinvar_variant_schema
中删除'mode': 'REPEATED'
可能是有意义的。【参考方案2】:
看起来 CSV 不支持嵌套或重复数据。
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#limitations
我相信默认情况下 to_gbq 会转换为 CSV 然后加载。因此,您可能希望使用 CSV 以外的其他格式。
【讨论】:
我没有任何嵌套数据。 所以 FILTER 列不重复,我的意思是它是否包含任何类型的列表? 我将数据从一个平面 CSV 文件读入 pandas。有没有办法让我在哪里检查此类问题? 我会在 FILTER 列中的值中搜索方括号,或者将其放入工作表中并执行相同操作。你能分享一些 FILTER 数据的样子吗 你可以做的另一个选择是使用 to_parquet 函数并加载它而不是 csv【参考方案3】:基于Sergey的答案
你确实应该使用“REPEATED”模式
bq_schema = [
SchemaField(name="sp_zipcode", field_type="STRING", mode="REPEATED", description="Search query parameter: zipcode")
]
这仍然导致Cannot load CSV data with a repeated field. Field: sp_zipcode
通过升级要求为我解决了这个问题
pip install google-cloud-bigquery --upgrade
pip install pandas-gbq --upgrade
google-cloud-bigquery==2.32.0
pandas-gbq==0.17.0
这是安装 2 个软件包后的整个 pip 冻结:
cachetools==5.0.0
certifi==2021.10.8
charset-normalizer==2.0.11
db-dtypes==0.3.1
google-api-core==2.4.0
google-auth==2.6.0
google-auth-oauthlib==0.4.6
google-cloud-bigquery==2.32.0
google-cloud-bigquery-storage==2.11.0
google-cloud-core==2.2.2
google-crc32c==1.3.0
google-resumable-media==2.1.0
googleapis-common-protos==1.54.0
grpcio==1.43.0
grpcio-status==1.43.0
idna==3.3
libcst==0.4.1
mypy-extensions==0.4.3
numpy==1.21.5
oauthlib==3.2.0
packaging==21.3
pandas==1.3.5
pandas-gbq==0.17.0
proto-plus==1.19.9
protobuf==3.19.4
pyarrow==6.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pydata-google-auth==1.3.0
pyparsing==3.0.7
python-dateutil==2.8.2
pytz==2021.3
PyYAML==6.0
requests==2.27.1
requests-oauthlib==1.3.1
rsa==4.8
six==1.16.0
typing-extensions==4.0.1
typing-inspect==0.7.1
urllib3==1.26.8
【讨论】:
以上是关于使用 Pandas 或命令行上传到 BigQuery 时出现奇怪的重复字段错误。所有字段唯一的主要内容,如果未能解决你的问题,请参考以下文章
execel pandas 减少复制黏贴的代码行 自动匹配单号