Pandas read_gbq init error using Dataflow

Posted 2023-03-24


【Title】 Pandas read_gbq init error using Dataflow 【Posted】 2019-05-20 19:13:16 【Question】:

I have been running a Dataflow job in Python that uses the pandas library. It suddenly started failing with the following error:

  File "/usr/local/lib/python2.7/dist-packages/pandas_gbq/auth.py", line 305, in _try_credentials
    client = bigquery.Client(project=project_id, credentials=credentials)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/client.py", line 161, in __init__
    self._connection = Connection(self, client_info=client_info)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/_http.py", line 33, in __init__
    super(Connection, self).__init__(client, client_info)
TypeError: __init__() takes exactly 2 arguments (3 given)

This is the step that fails:

import pandas as pd  
data = pd.read_gbq(query=query, project_id=project, dialect='standard', private_key=credentials)

My setup file looks like this:

install_requires=[
    'google-cloud-storage==1.11.0',
    'requests==2.19.1',
    'urllib3==1.23',
    'pandas-gbq==0.6.1',
    'pandas==0.23.4',
    'protobuf==3.6.0'
]

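These are the same versions I have installed locally, where the code runs. Nothing had been changed in the job when it started failing. It runs successfully locally, but I see this problem when I run it with the DataflowRunner. I think this is a dependency problem. Are there any documented issues with any of the package versions I am using? Or do I need to add a specific package version to my setup file?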


【Answer 1】:

I had to add the BigQuery version to my setup file.

'google-cloud-bigquery==1.6.0'
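
For reference, here is a minimal sketch of the dependency list from the question with that pin added. The variable name REQUIRED_PACKAGES is only an illustrative choice, not something from the original setup file; all the other version numbers are copied unchanged from the question.

```python
# Sketch only: the dependency list from the question plus the BigQuery pin.
# REQUIRED_PACKAGES is a hypothetical name chosen for this example.
REQUIRED_PACKAGES = [
    'google-cloud-storage==1.11.0',
    'google-cloud-bigquery==1.6.0',  # the pin added in this answer
    'requests==2.19.1',
    'urllib3==1.23',
    'pandas-gbq==0.6.1',
    'pandas==0.23.4',
    'protobuf==3.6.0',
]

# In setup.py this list would be passed as
# setuptools.setup(..., install_requires=REQUIRED_PACKAGES).
```

Listing the pin explicitly means the version no longer depends on whatever happens to be preinstalled on the worker.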

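According to the Google documentation for Python SDK 2.5, the Dataflow workers already have BigQuery 0.25.0 installed. Since I had not pinned a version before, I assume that is what my job was running with. If that version of BigQuery has a problem, I am still not sure why the error only started appearing recently. In any case, pinning 1.6.0 fixed it.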
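
One way to check, rather than assume, which versions an environment actually has is to print them and compare the local output with the worker's. The sketch below is not part of the original fix. It uses importlib.metadata, which needs Python 3.8 or later; on the Python 2.7 workers shown in the traceback, pkg_resources.get_distribution(name).version would be the rough equivalent. The package list is my guess at what is relevant here, and google-cloud-core and google-api-core in particular are assumptions, included because the failing call in the traceback is a super().__init__ into a base class defined outside google-cloud-bigquery.

```python
from importlib import metadata  # standard library, Python 3.8+

# Assumed list of distributions worth comparing; adjust as needed.
PACKAGES_TO_CHECK = [
    "google-cloud-bigquery",
    "google-cloud-core",
    "google-api-core",
    "pandas-gbq",
    "pandas",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_report(versions):
    """Return one requirements-style line per package, sorted by name."""
    lines = []
    for name in sorted(versions):
        version = versions[name]
        if version is None:
            lines.append("{} (not installed)".format(name))
        else:
            lines.append("{}=={}".format(name, version))
    return lines


if __name__ == "__main__":
    for line in format_report(installed_versions(PACKAGES_TO_CHECK)):
        print(line)
```

Running this locally and logging the same report from inside the pipeline code should show whether the two environments resolve to different BigQuery versions.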
