Parse a JSON file to get the right columns to insert into BigQuery
[Posted]: 2019-08-27 09:43:22
[Question]: I'm relatively new to Python and I'm trying to get some exchange rate data from the free ECB API:
GET https://api.exchangeratesapi.io/latest?base=GBP
I want this data to end up in a BigQuery table. Loading the data into BQ is fine; the problem is getting it into the right column/row format before sending it to BQ.
I'd like to end up with a table like this:
Currency Rate Date
CAD 1.629.. 2019-08-27
HKD 9.593.. 2019-08-27
ISK 152.6.. 2019-08-27
... ... ...
I've tried a few things but haven't quite got there:
import json
import requests

# api-endpoint
URL = "https://api.exchangeratesapi.io/latest?base=GBP"
# sending get request and saving the response as response object
r = requests.get(url=URL)
# extracting data in json format
data = r.json()
with open('data.json', 'w') as outfile:
    json.dump(data['rates'], outfile)
a_dict = {'date': '2019-08-26'}
with open('data.json') as f:
    data = json.load(f)
data.update(a_dict)
with open('data.json', 'w') as f:
    json.dump(data, f)
print(data)
This is the original JSON file:
{
  "rates": {
    "CAD":1.6296861353,
    "HKD":9.593490542,
    "ISK":152.6759753684,
    "PHP":64.1305429339,
    "DKK":8.2428443501,
    "HUF":363.2604778172,
    "CZK":28.4888284523,
    "GBP":1.0,
    "RON":5.2195062629,
    "SEK":11.8475893558,
    "IDR":17385.9684034803,
    "INR":87.6742617713,
    "BRL":4.9997236134,
    "RUB":80.646191945,
    "HRK":8.1744110201,
    "JPY":130.2223254066,
    "THB":37.5852652759,
    "CHF":1.2042718318,
    "EUR":1.1055465269,
    "MYR":5.1255348081,
    "BGN":2.1622278974,
    "TRY":7.0550451616,
    "CNY":8.6717964026,
    "NOK":11.0104695256,
    "NZD":1.9192287707,
    "ZAR":18.6217151449,
    "USD":1.223287232,
    "MXN":24.3265563331,
    "SGD":1.6981194654,
    "AUD":1.8126540855,
    "ILS":4.3032293014,
    "KRW":1482.7479464473,
    "PLN":4.8146551248
  },
  "base":"GBP",
  "date":"2019-08-23"
}
[Answer 1]: Welcome! How about this, as one way to approach the problem:
# import the pandas library so we can use its from_dict function:
import pandas as pd
# subset the json to a dict of exchange rates keyed by currency code:
d = data['rates']
# create a dataframe from this data, using pandas' from_dict function:
df = pd.DataFrame.from_dict(d, orient='index')
# add a column for date (this value is taken from the json data):
df['date'] = data['date']
# name our columns, to keep things clean
df.columns = ['rate', 'date']
This gives you:
rate date
CAD 1.629686 2019-08-23
HKD 9.593491 2019-08-23
ISK 152.675975 2019-08-23
PHP 64.130543 2019-08-23
...
In this case the currency is the index of the dataframe. If you want it as its own column, just add:
df['currency'] = df.index
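As a variation on the line above, `rename_axis` followed by `reset_index` moves the index into a leading `currency` column in one step, which matches the Currency / Rate / Date order the question asks for. This is only a sketch, assuming pandas is installed; `sample` is a trimmed copy of the JSON from the question and `rates_frame` is a hypothetical helper name.

```python
import pandas as pd

# Trimmed copy of the API response shown in the question.
sample = {
    "rates": {"CAD": 1.6296861353, "HKD": 9.593490542, "ISK": 152.6759753684},
    "base": "GBP",
    "date": "2019-08-23",
}


def rates_frame(payload):
    """Return a DataFrame with columns currency, rate, date (hypothetical helper)."""
    df = pd.DataFrame.from_dict(payload["rates"], orient="index", columns=["rate"])
    df["date"] = payload["date"]
    # Name the index, then turn it into a regular first column.
    return df.rename_axis("currency").reset_index()


df = rates_frame(sample)
print(df)
```

The result has a plain integer index, so writing it out later does not need any special handling of the index.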
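You can then write this dataframe to a .csv file, or write it to BigQuery.
For that, I suggest looking at the BigQuery client library. It can be a little hard to get your head around at first, so you may also want to look at pandas.DataFrame.to_gbq, which is easier but less robust (see this link for more detail on the client library versus the pandas function).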
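For the client-library route, a load might look roughly like the sketch below. It assumes the `google-cloud-bigquery` package (plus `pyarrow`, which the dataframe load relies on as far as I know) is installed and that application default credentials are configured. The project, dataset, and table names are placeholders, and `full_table_id` and `load_dataframe` are hypothetical helper names, not part of the library.

```python
def full_table_id(project, dataset, table):
    """Build the 'project.dataset.table' string BigQuery accepts as a table ID."""
    return f"{project}.{dataset}.{table}"


def load_dataframe(df, table_id):
    """Load df into table_id, replacing existing rows, and wait for the job."""
    # Imported here so the helper above works without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up application default credentials
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE
    )
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    return job.result()  # blocks until the load job finishes


# Placeholder names: substitute your own project, dataset, and table.
table_id = full_table_id("my-project", "my_dataset", "exchange_rates")
# load_dataframe(df, table_id)  # run once credentials and the dataset exist
```

The pandas route is shorter: with the `pandas-gbq` package installed, something like `pandas_gbq.to_gbq(df, "my_dataset.exchange_rates", project_id="my-project", if_exists="replace")` should do the equivalent, again with placeholder names.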
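[Comments]:
Thanks for your help, that works well! Pandas got me the format I wanted nicely. I ended up writing the dataframe to a csv and loading that into the BQ table without pandas. I'll post my final script in a comment below for anyone interested. And thanks for the welcome!
[Answer 2]: Thanks to Ben P for the help.
Here is my script, for anyone interested. It uses an internal library that my team uses for BQ loads, but the rest is pandas and requests: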
from aa.py.gcp import GCPAuth, GCPBigQueryClient
from aa.py.log import StandardLogger
import os
import requests
import pandas as pd
# Connect to BigQuery
logger = StandardLogger('test').logger
auth = GCPAuth(logger=logger)
credentials_path = 'XXX'
credentials = auth.get_credentials(credentials_path)
gcp_bigquery = GCPBigQueryClient(logger=logger)
gcp_bigquery.connect(credentials)
# api-endpoint
URL = "https://api.exchangeratesapi.io/latest?base=GBP"
# sending get request and saving the response as response object
r = requests.get(url=URL)
# extracting data in json format
data = r.json()
# extract rates object from json
d = data['rates']
# split currency and rate for dataframe
df = pd.DataFrame.from_dict(d,orient='index')
# add date element to dataframe
df['date'] = data['date']
# column names
df.columns = ['rate', 'date']
# print dataframe
print(df)
# write dataframe to a tab-delimited csv (the currency index becomes the first column)
df.to_csv('data.csv', sep='\t', encoding='utf-8')
#########################################
# write csv to BQ table
file_path = os.getcwd()
file_name = 'data.csv'
dataset_id = 'Testing'
table_id = 'Exchange_Rates'
response = gcp_bigquery.load_file_into_table(
    file_path, file_name, dataset_id, table_id,
    source_format='CSV', field_delimiter="\t",
    create_disposition='CREATE_NEVER', write_disposition='WRITE_TRUNCATE',
    skip_leading_rows=1)
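Since the script uses pandas only to produce a tab-delimited file, the same file could probably be written with the standard library alone. The sketch below assumes the payload has the shape shown in the question; `write_rates_tsv` is a hypothetical name, and the sample is trimmed to two currencies. The header row it writes is the one that `skip_leading_rows=1` would skip on load.

```python
import csv
import io

# Trimmed copy of the API response shown in the question.
sample = {
    "rates": {"CAD": 1.6296861353, "HKD": 9.593490542},
    "base": "GBP",
    "date": "2019-08-23",
}


def write_rates_tsv(payload, fileobj):
    """Write one tab-delimited row per currency: currency, rate, date."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerow(["currency", "rate", "date"])  # header row
    for currency, rate in payload["rates"].items():
        writer.writerow([currency, rate, payload["date"]])


# Write to an in-memory buffer here; pass an open file to write data.csv instead.
buf = io.StringIO()
write_rates_tsv(sample, buf)
tsv_text = buf.getvalue()
print(tsv_text)
```

To write to disk, open the target with `open('data.csv', 'w', newline='', encoding='utf-8')` and pass that file object in place of `buf`.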