How to access Google Bigquery data from pandas, based on where clause from a local csv

Posted 2023-03-25


【Title】: How to access Google Bigquery data from pandas, based on where clause from a local csv
【Originally posted】: 2016-12-25 04:37:52
【Question】:

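I have a local DataFrame called Frames with a column item. I want to pull information from the Google BigQuery dataset Sales. Sales has a column itemnumber, and I only want the rows whose itemnumber value exists in Frames.item.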

I need to do something like the following:

frames=pd.DataFrame.from_csv(path,index_col=None)
df = gbq.read_gbq('SELECT * FROM Usales.Sales where itemnumber in frames.item LIMIT 1000', project_id='Project')

【Comments on the question】:

Could you load Frames into BigQuery and then run: SELECT * FROM Usales.Sales where itemnumber in (select distinct item from frames)?

In theory, yes, I could, but there are access restrictions, so that is not feasible.

【Answer 1】:

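import pandas as pd
from pandas.io import gbq  # import style of that era; newer setups use the pandas_gbq package
frames = pd.read_csv(path)  # pd.DataFrame.from_csv was removed in pandas 1.0
df = gbq.read_gbq('SELECT * FROM Usales.Sales where itemnumber in ({}) LIMIT 1000'.format(', '.join('"{0}"'.format(item) for item in frames['item'].tolist())), project_id='project')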
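
The quoting step in this answer can be tested on its own, without touching BigQuery. Below is a minimal sketch of that idea. The helper name `build_in_query` is hypothetical (it is not part of pandas or pandas-gbq), and the backslash escaping assumes BigQuery's standard SQL string-literal rules. In practice you would pass it `frames['item'].tolist()`.

```python
def build_in_query(items, limit=1000):
    """Return a SELECT whose IN list holds each distinct item as a quoted literal.

    Hypothetical helper for illustration; table and column names come from the question.
    """
    seen = []
    for item in items:
        text = str(item)
        if text not in seen:  # keep the first occurrence, drop repeats
            seen.append(text)
    if not seen:
        # An empty IN () list is not valid SQL, so fail early instead.
        raise ValueError("no items to filter on")
    # Escape backslashes and double quotes so a value cannot end the literal early
    # (assumes standard SQL escaping rules).
    literals = ['"{0}"'.format(t.replace('\\', '\\\\').replace('"', '\\"')) for t in seen]
    return 'SELECT * FROM Usales.Sales WHERE itemnumber IN ({0}) LIMIT {1}'.format(
        ', '.join(literals), int(limit))


query = build_in_query(['abc', 'cde', 'abc', 'xyz'])
print(query)
# SELECT * FROM Usales.Sales WHERE itemnumber IN ("abc", "cde", "xyz") LIMIT 1000
```

The resulting string can then be handed to gbq.read_gbq as in the answer. For values that come from untrusted input, query parameters would likely be a safer route than string formatting.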

【Comments】:

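The part that uses join needs adjusting. As it stands, format(','.join(frames['item'].tolist())) produces a list like (abc,cde,efg,xyz) instead of one like ("abc","cde","xyz").

You're right. I edited my answer, and it should work now.

Cool, this worked for me. It was just missing a pesky bracket somewhere in the middle: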

df = gbq.read_gbq('SELECT * FROM Usales.Sales where itemnumber in ({}) LIMIT 1000'.format(', '.join('"{0}"'.format(item) for item in frames['item'].tolist())), project_id='project')

【Answer 2】:

You need to separate the part that queries GBQ from the part that applies it across the pandas DataFrame.

For example:

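import pandas as pd
from pandas.io import gbq  # import style of that era; newer setups use the pandas_gbq package

def getDataForAnItem(item):
  # query BigQuery for the rows that match this single item
  print(item)
  return gbq.read_gbq('SELECT * FROM Usales.Sales where itemnumber = "' + str(item) + '" LIMIT 1000', project_id='Project')

frames = pd.read_csv(path)  # pd.DataFrame.from_csv was removed in pandas 1.0
# run one query per value in the item column
resultDF = frames['item'].apply(getDataForAnItem)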

【Comments】:

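This works, but instead of passing an array of strings to BigQuery and requesting 1000 rows in total, it walks through the Frames.Item column one value at a time and fetches up to 1000 rows for each item.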
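
If the per-item approach is kept, one way to get back to an overall cap is to track how many rows are still wanted and stop once the cap is reached. The sketch below only illustrates that bookkeeping. `collect_rows` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real gbq.read_gbq call so the logic can run without BigQuery access. It still issues one query per item, so the single IN query from the first answer remains the more direct route.

```python
def collect_rows(items, fetch_rows, total_limit=1000):
    """Query item by item, but stop once total_limit rows were gathered overall."""
    rows = []
    for item in items:
        remaining = total_limit - len(rows)
        if remaining <= 0:
            break  # overall cap reached, skip the remaining items
        # Ask for at most `remaining` rows; slice defensively in case more come back.
        rows.extend(fetch_rows(item, remaining)[:remaining])
    return rows


def fake_fetch(item, limit):
    # Hypothetical stand-in for a BigQuery call: pretends each item has 600 matching rows.
    return [{'itemnumber': item, 'row': i} for i in range(min(limit, 600))]


rows = collect_rows(['abc', 'cde', 'xyz'], fake_fetch)
print(len(rows))  # 1000: 600 rows for 'abc', 400 for 'cde', none for 'xyz'
```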
