Parse a JSON file to get the right columns to insert into BigQuery

Posted: 2019-08-27 09:43:22

Question:

I'm relatively new to Python, and I'm trying to get some exchange rate data from the ECB's free API:

GET https://api.exchangeratesapi.io/latest?base=GBP

I want this data to end up in a BigQuery table. Loading the data into BQ is fine, but transforming it into the right column/row format before sending it is the problem.

I'd like to end up with a table like this:

Currency    Rate      Date
CAD         1.629..   2019-08-27
HKD         9.593..   2019-08-27
ISK         152.6..   2019-08-27
...         ...       ...
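That target shape is just the nested "rates" object flattened into one row per currency, with the top-level date repeated on each row. A minimal sketch of that flattening with plain Python (the payload below is a canned stand-in for the live API response, with only three illustrative rates):

```python
import json

# Canned response in the same shape as the API's JSON (values are
# illustrative; the real response has ~30 currencies).
payload = json.loads("""
{"rates": {"CAD": 1.6296861353, "HKD": 9.593490542, "ISK": 152.6759753684},
 "base": "GBP", "date": "2019-08-27"}
""")

# Flatten the nested "rates" object into (currency, rate, date) rows.
rows = [
    {"Currency": currency, "Rate": rate, "Date": payload["date"]}
    for currency, rate in payload["rates"].items()
]

for row in rows:
    print(row["Currency"], row["Rate"], row["Date"])
```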

I've tried a few things, but haven't quite got there yet:

import requests
import json

# api-endpoint
URL = "https://api.exchangeratesapi.io/latest?base=GBP"

# sending get request and saving the response as response object
r = requests.get(url=URL)

# extracting data in json format
data = r.json()

with open('data.json', 'w') as outfile:
    json.dump(data['rates'], outfile)

a_dict = {'date': '2019-08-26'}

with open('data.json') as f:
    data = json.load(f)

data.update(a_dict)

with open('data.json', 'w') as f:
    json.dump(data, f)

print(data)
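The write-then-read-then-update round-trip above can also be done entirely in memory before anything is written to disk. A minimal sketch, using a canned dict in place of the live `r.json()` response:

```python
import json

# Canned data standing in for r.json() from the live API.
data = {
    "rates": {"CAD": 1.6296861353, "HKD": 9.593490542},
    "base": "GBP",
    "date": "2019-08-26",
}

# Merge the date into a copy of the rates dict in memory -- no need
# for the intermediate write/read round-trip through data.json.
merged = dict(data["rates"])
merged.update({"date": data["date"]})

with open("data.json", "w") as f:
    json.dump(merged, f)

print(merged)
```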

This is the original JSON file:

{
   "rates": {
      "CAD":1.6296861353,
      "HKD":9.593490542,
      "ISK":152.6759753684,
      "PHP":64.1305429339,
      "DKK":8.2428443501,
      "HUF":363.2604778172,
      "CZK":28.4888284523,
      "GBP":1.0,
      "RON":5.2195062629,
      "SEK":11.8475893558,
      "IDR":17385.9684034803,
      "INR":87.6742617713,
      "BRL":4.9997236134,
      "RUB":80.646191945,
      "HRK":8.1744110201,
      "JPY":130.2223254066,
      "THB":37.5852652759,
      "CHF":1.2042718318,
      "EUR":1.1055465269,
      "MYR":5.1255348081,
      "BGN":2.1622278974,
      "TRY":7.0550451616,
      "CNY":8.6717964026,
      "NOK":11.0104695256,
      "NZD":1.9192287707,
      "ZAR":18.6217151449,
      "USD":1.223287232,
      "MXN":24.3265563331,
      "SGD":1.6981194654,
      "AUD":1.8126540855,
      "ILS":4.3032293014,
      "KRW":1482.7479464473,
      "PLN":4.8146551248
   },
   "base":"GBP",
   "date":"2019-08-23"
}
Comments:

Answer 1:

Welcome! How about this as one way to solve the problem:

# import the pandas library so we can use its from_dict function:
import pandas as pd

# subset the json to a dict of exchange rates and country codes:
d = data['rates']

# create a dataframe from this data, using pandas from_dict function:
df = pd.DataFrame.from_dict(d,orient='index')

# add a column for date (this value is taken from the json data):
df['date'] = data['date']

# name our columns, to keep things clean
df.columns = ['rate','date']

This gives you:

    rate    date
CAD 1.629686    2019-08-23
HKD 9.593491    2019-08-23
ISK 152.675975  2019-08-23
PHP 64.130543   2019-08-23
...      

In this case the currency is the index of the dataframe; if you'd like it as its own column, just add: df['currency'] = df.index
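An alternative to copying the index is `reset_index`, which promotes the currency codes to a regular column and lets you put the columns in the Currency / Rate / Date order from the question. A sketch using a small sample of the rates:

```python
import pandas as pd

# Sample subset of the API's rates object.
rates = {"CAD": 1.6296861353, "HKD": 9.593490542, "ISK": 152.6759753684}

df = pd.DataFrame.from_dict(rates, orient="index")
df["date"] = "2019-08-23"
df.columns = ["rate", "date"]

# Promote the index (currency codes) to a regular column and reorder.
df = df.reset_index().rename(columns={"index": "currency"})
df = df[["currency", "rate", "date"]]

print(df)
```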

You can then write this dataframe out to a .csv file, or write it into BigQuery.

For that, I'd recommend looking at The BigQuery Client library, which can be a little hard to follow at first, so you may also want to look at pandas.DataFrame.to_gbq, which is easier but less robust (see this link for more details on the client library vs. the pandas function).
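A rough sketch of the `to_gbq` route is below. The dataset/table and project IDs are placeholders, and the call itself needs the `pandas-gbq` package plus authenticated GCP credentials, so it is left commented out here:

```python
import pandas as pd

# Sample subset of the rates, in the same shape as the answer's dataframe.
rates = {"CAD": 1.6296861353, "HKD": 9.593490542}
df = pd.DataFrame.from_dict(rates, orient="index", columns=["rate"])
df["date"] = "2019-08-23"

# Placeholder identifiers -- substitute your own dataset.table and project.
destination_table = "Testing.Exchange_Rates"
project_id = "my-gcp-project"

# Requires the pandas-gbq package and GCP credentials:
# df.to_gbq(destination_table, project_id=project_id, if_exists="replace")
```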

Comments:

Thanks for your help, that worked great! Pandas got the data into the format I wanted nicely. I ended up writing the dataframe to a csv and loading it into the BQ table without pandas. I'll post my final script in the comments below for anyone interested. And thanks for the welcome!

Answer 2:

Thanks to Ben P for the help.

Here is my script, for anyone interested. It uses an internal library my team uses for BQ loading, but the rest is pandas and requests:

from aa.py.gcp import GCPAuth, GCPBigQueryClient
from aa.py.log import StandardLogger
import requests, os, pandas as pd

# Connect to BigQuery
logger = StandardLogger('test').logger
auth = GCPAuth(logger=logger)
credentials_path = 'XXX'
credentials = auth.get_credentials(credentials_path)
gcp_bigquery = GCPBigQueryClient(logger=logger)
gcp_bigquery.connect(credentials)

# api-endpoint
URL = "https://api.exchangeratesapi.io/latest?base=GBP"

# sending get request and saving the response as response object
r = requests.get(url=URL)

# extracting data in json format
data = r.json()

# extract rates object from json
d = data['rates']

# split currency and rate for dataframe
df = pd.DataFrame.from_dict(d,orient='index')

# add date element to dataframe
df['date'] = data['date']

#column names
df.columns = ['rate', 'date']

# print dataframe
print(df)

# write dataframe to csv
df.to_csv('data.csv', sep='\t', encoding='utf-8')

#########################################
# write csv to BQ table
file_path = os.getcwd()
file_name = 'data.csv'
dataset_id = 'Testing'
table_id = 'Exchange_Rates'

response = gcp_bigquery.load_file_into_table(file_path, file_name, dataset_id, table_id, source_format='CSV', field_delimiter="\t", create_disposition='CREATE_NEVER', write_disposition='WRITE_TRUNCATE',skip_leading_rows=1)
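Since the CSV is the hand-off point between pandas and BigQuery, it can be worth sanity-checking that the tab-delimited file round-trips cleanly before loading it. A small standalone check, with a toy dataframe standing in for the real rates:

```python
import pandas as pd

# Toy dataframe in the same shape the script writes out
# (currency codes as the index, rate and date columns).
df = pd.DataFrame(
    {"rate": [1.6296861353, 9.593490542], "date": ["2019-08-23", "2019-08-23"]},
    index=["CAD", "HKD"],
)
df.to_csv("data.csv", sep="\t", encoding="utf-8")

# Read it back the same way BigQuery will see it: tab-delimited,
# one header row, with the index as the unnamed first column.
check = pd.read_csv("data.csv", sep="\t", index_col=0)
print(check)
```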

Comments:
