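How to normalize a nested JSON key into a pandas dataframe
Posted: 2021-04-18 22:43:56
Question:
I am new to Python and to APIs in general, so this may be a basic question with a simple answer. I am trying to use Python to pull data on congressional representatives from ProPublica's API. I can get the REST API call to run, but I am having trouble structuring the resulting JSON data as a dataframe. I think this is because the data has several levels of nesting. I tried normalizing the data, but I could only get it to work on the first level of nesting.
Here is my code. Note that I have removed my API key, but you can get one quickly and easily here.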
# Import libraries
import pandas as pd
from pandas.io.json import json_normalize
import requests
import json
import time
import csv
### Index 0
# Requesting data through the API
payload = {'X-API-Key': 'a876543211234'}
terms = '"trade war"AND"China"'
index = str(0) # 440 is last offset for this call
response = requests.get('https://api.propublica.org/congress/v1/116/house/members.json', headers=payload)
print(response.status_code)
# Formatting the JSON data
json_data = json.loads(response.content.decode("utf-8"))
# Writing Data as String
json_string = json.dumps(json_data)
# Creating Stage 1 dataframe
jdata = json.loads(json_string)
df = pd.DataFrame(jdata)
df2 = pd.DataFrame(df.results)
# Normalizing Data - converts nested data into a regular looking dataframe
normal_data_0 = json_normalize(data=df['results'])
This is what the JSON data looks like. Note that the data for all representatives is nested under 'results' and 'members':
{'status': 'OK',
'copyright': ' Copyright (c) 2021 Pro Publica Inc. All Rights Reserved.',
'results': [{'congress': '116',
'chamber': 'House',
'num_results': 451,
'offset': 0,
'members': [{'id': 'A000374',
'title': 'Representative',
'short_title': 'Rep.',
'api_uri': 'https://api.propublica.org/congress/v1/members/A000374.json',
'first_name': 'Ralph',
'middle_name': None,
'last_name': 'Abraham',
'suffix': None,
'date_of_birth': '1954-09-16',
'gender': 'M',
'party': 'R',
'leadership_role': '',
'twitter_account': 'RepAbraham',
'facebook_account': 'CongressmanRalphAbraham',
'youtube_account': None,
'govtrack_id': '412630',
'cspan_id': '76236',
'votesmart_id': '155414',
'icpsr_id': '21522',
'crp_id': 'N00036633',
'google_entity_id': '/m/012dwd7_',
'fec_candidate_id': 'H4LA05221',
'url': 'https://abraham.house.gov',
'rss_url': 'https://abraham.house.gov/rss.xml',
'contact_form': None,
'in_office': False,
'cook_pvi': 'R+15',
'dw_nominate': 0.541,
'ideal_point': None,
'seniority': '6',
'next_election': '2020',
'total_votes': 954,
'missed_votes': 377,
'total_present': 0,
'last_updated': '2020-12-31 18:30:50 -0500',
'ocd_id': 'ocd-division/country:us/state:la/cd:5',
'office': '417 Cannon House Office Building',
'phone': '202-225-8490',
'fax': None,
'state': 'LA',
'district': '5',
'at_large': False,
'geoid': '2205',
'missed_votes_pct': 39.52,
'votes_with_party_pct': 94.93,
'votes_against_party_pct': 4.9},
{'id': 'A000370',
'title': 'Representative',
...
This is what my "dataset" looks like. All of the JSON data ends up stored as a string in the 'members' column of the single row:
normal_data_0
congress chamber num_results offset members
0 116 House 451 0 [{'id': 'A000374', 'title': 'Representative', ...
I have tried running the data through json_normalize twice, and I have tried passing both keys [results, members]. Nothing I have tried has worked.
Any suggestions?
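The behaviour described above can be reproduced offline with a small hand-made payload. The sketch below is only an illustration: the `sample` dict is a hypothetical, heavily trimmed stand-in for the API response (two members, three fields each), not real output. It suggests that normalizing 'results' flattens nested dicts but leaves the 'members' list as a single cell, and that the cell holds a Python list of dicts rather than a string.

```python
import pandas as pd

# Hypothetical, trimmed stand-in for the API response (not real output)
sample = {
    "status": "OK",
    "results": [
        {
            "congress": "116",
            "chamber": "House",
            "num_results": 2,
            "offset": 0,
            "members": [
                {"id": "A000374", "last_name": "Abraham", "state": "LA"},
                {"id": "A000370", "last_name": "Adams", "state": "NC"},
            ],
        }
    ],
}

# Normalizing 'results' yields one row per element of the list.
# Lists such as 'members' are not expanded; they stay in one cell.
flat = pd.json_normalize(sample["results"])

print(flat.shape)                     # one row, five columns
print(type(flat.loc[0, "members"]))   # a list, not a string
```
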
Answer 1:
The 'results' key holds a 1-element list, so 'members' can be normalized by selecting the 'members' key from the dict at index 0.
import pandas as pd
import requests
# Requesting data through the API
payload = {'X-API-Key': '...'}
terms = '"trade war"AND"China"'
index = str(0) # 440 is last offset for this call
response = requests.get('https://api.propublica.org/congress/v1/116/house/members.json', headers=payload)
# extract the json data from the response
json_data = response.json()
# normalize only members
members = pd.json_normalize(data=json_data['results'][0]['members'])
# alternatively: normalize members and the preceding keys
members = pd.json_normalize(data=json_data['results'][0], record_path=['members'], meta=['congress', 'chamber', 'num_results', 'offset'])
# display() is provided by IPython/Jupyter; use print(members) in a plain script
display(members)
id title short_title api_uri first_name middle_name last_name suffix date_of_birth gender party leadership_role twitter_account facebook_account youtube_account govtrack_id cspan_id votesmart_id icpsr_id crp_id google_entity_id fec_candidate_id url rss_url contact_form in_office cook_pvi dw_nominate ideal_point seniority next_election total_votes missed_votes total_present last_updated ocd_id office phone fax state district at_large geoid missed_votes_pct votes_with_party_pct votes_against_party_pct
0 A000374 Representative Rep. https://api.propublica.org/congress/v1/members/A000374.json Ralph None Abraham None 1954-09-16 M R RepAbraham CongressmanRalphAbraham None 412630 76236 155414 21522 N00036633 /m/012dwd7_ H4LA05221 https://abraham.house.gov https://abraham.house.gov/rss.xml None False R+15 0.541 None 6 2020 954.0 377.0 0.0 2020-12-31 18:30:50 -0500 ocd-division/country:us/state:la/cd:5 417 Cannon House Office Building 202-225-8490 None LA 5 False 2205 39.52 94.93 4.90
1 A000370 Representative Rep. https://api.propublica.org/congress/v1/members/A000370.json Alma None Adams None 1946-05-27 F D None RepAdams CongresswomanAdams None 412607 76386 5935 21545 N00035451 /m/02b45d H4NC12100 https://adams.house.gov https://adams.house.gov/rss.xml None False D+18 -0.465 None 8 2020 954.0 26.0 0.0 2020-12-31 18:30:55 -0500 ocd-division/country:us/state:nc/cd:12 2436 Rayburn House Office Building 202-225-1510 None NC 12 False 3712 2.73 99.24 0.65
2 A000055 Representative Rep. https://api.propublica.org/congress/v1/members/A000055.json Robert B. Aderholt None 1965-07-22 M R None Robert_Aderholt RobertAderholt RobertAderholt 400004 45516 441 29701 N00003028 /m/024p03 H6AL04098 https://aderholt.house.gov https://aderholt.house.gov/rss.xml None False R+30 0.369 None 24 2020 954.0 71.0 0.0 2020-12-31 18:30:49 -0500 ocd-division/country:us/state:al/cd:4 1203 Longworth House Office Building 202-225-4876 None AL 4 False 0104 7.44 93.60 6.29
3 A000371 Representative Rep. https://api.propublica.org/congress/v1/members/A000371.json Pete None Aguilar None 1979-06-19 M D None reppeteaguilar reppeteaguilar None 412615 79994 70114 21506 N00033997 /m/0jwv0xf H2CA31125 https://aguilar.house.gov https://aguilar.house.gov/rss.xml None False D+8 -0.291 None 6 2020 954.0 9.0 0.0 2020-12-31 18:30:52 -0500 ocd-division/country:us/state:ca/cd:31 109 Cannon House Office Building 202-225-3201 None CA 31 False 0631 0.94 97.45 2.44
4 A000372 Representative Rep. https://api.propublica.org/congress/v1/members/A000372.json Rick None Allen None 1951-11-07 M R None reprickallen CongressmanRickAllen None 412625 62545 136062 21516 N00033720 /m/0127y9dk H2GA12121 https://allen.house.gov None None False R+9 0.679 None 6 2020 954.0 15.0 0.0 2020-12-31 18:30:49 -0500 ocd-division/country:us/state:ga/cd:12 2400 Rayburn House Office Building 202-225-2823 None GA 12 False 1312 1.57 92.26 7.63
5 A000376 Representative Rep. https://api.propublica.org/congress/v1/members/A000376.json Colin None Allred None 1983-04-15 M D None RepColinAllred None None 412828 None 177357 None N00040989 /m/03d066b H8TX32098 https://allred.house.gov None None False R+5 NaN None 2 2020 954.0 29.0 0.0 2020-12-31 18:30:52 -0500 ocd-division/country:us/state:tx/cd:32 328 Cannon House Office Building 202-225-2231 None TX 32 False 4832 3.04 97.72 2.17
6 A000367 Representative Rep. https://api.propublica.org/congress/v1/members/A000367.json Justin None Amash None 1980-04-18 M I justinamash repjustinamash repjustinamash 412438 1033767 105566 21143 N00031938 /m/0c00p_n https://amash.house.gov https://amash.house.gov/rss.xml None False R+6 NaN None 10 2020 524.0 0.0 10.0 2020-12-31 18:30:47 -0500 ocd-division/country:us/state:mi/cd:3 None None None MI 3 False 2603 0.00 58.49 41.51
7 A000367 Representative Rep. https://api.propublica.org/congress/v1/members/A000367.json Justin None Amash None 1980-04-18 M R justinamash repjustinamash repjustinamash 412438 1033767 105566 21143 N00031938 /m/0c00p_n H0MI03126 https://amash.house.gov https://amash.house.gov/rss.xml None False None 0.654 None 10 2020 430.0 0.0 5.0 2020-12-28 21:04:36 -0500 ocd-division/country:us/state:mi/cd:3 106 Cannon House Office Building 202-225-3831 None MI 3 False 2603 0.00 61.97 37.79
8 A000369 Representative Rep. https://api.propublica.org/congress/v1/members/A000369.json Mark None Amodei None 1958-06-12 M R None MarkAmodeiNV2 MarkAmodeiNV2 markamodeinv2 412500 62817 12537 21196 N00031177 /m/03bzdkn H2NV02395 https://amodei.house.gov https://amodei.house.gov/rss/news-releases.xml None False R+7 0.384 None 10 2020 954.0 36.0 0.0 2020-12-31 18:30:49 -0500 ocd-division/country:us/state:nv/cd:2 104 Cannon House Office Building 202-225-6155 None NV 2 False 3202 3.77 92.63 7.26
9 A000377 Representative Rep. https://api.propublica.org/congress/v1/members/A000377.json Kelly None Armstrong None 1976-10-08 M R None RepArmstrongND None None 412794 None 139338 None N00042868 /g/11hcszksh3 H8ND00096 https://armstrong.house.gov None None False R+16 NaN None 2 2020 954.0 33.0 0.0 2020-12-31 18:30:49 -0500 ocd-division/country:us/state:nd/cd:1 1004 Longworth House Office Building 202-225-2611 None ND At-Large True 3800 3.46 93.31 6.58
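For readers without an API key, the two calls above can be tried on a small inline payload. This is a minimal sketch under stated assumptions: `sample` is a hypothetical, trimmed stand-in for the real response, and the variable names `block`, `members_only`, `with_meta`, and `all_blocks` are chosen here for illustration. The last call passes the whole 'results' list instead of element 0, which may be convenient if a response ever contains more than one element in that list.

```python
import pandas as pd

# Hypothetical, trimmed stand-in for the API response (not real output)
sample = {
    "status": "OK",
    "results": [
        {
            "congress": "116",
            "chamber": "House",
            "num_results": 2,
            "offset": 0,
            "members": [
                {"id": "A000374", "last_name": "Abraham", "state": "LA"},
                {"id": "A000370", "last_name": "Adams", "state": "NC"},
            ],
        }
    ],
}

meta_keys = ["congress", "chamber", "num_results", "offset"]

# The single dict stored at index 0 of the 'results' list
block = sample["results"][0]

# Option 1: one row per member, member fields only
members_only = pd.json_normalize(block["members"])

# Option 2: one row per member, with the parent keys repeated on every row
with_meta = pd.json_normalize(block, record_path=["members"], meta=meta_keys)

# Variation: pass the whole 'results' list; each element's members are expanded
all_blocks = pd.json_normalize(sample["results"], record_path=["members"], meta=meta_keys)

print(members_only.shape)
print(list(with_meta.columns))
```

With `record_path`, the record fields come first and the `meta` columns are appended after them, so the parent keys appear at the right-hand side of the frame.
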