How to normalize a nested JSON key into a pandas dataframe

【Title】: How to normalize a nested JSON key into a pandas dataframe 【Posted】: 2021-04-18 22:43:56 【Question】:

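I'm new to Python and to APIs in general, so this may be a basic question with a simple answer. I'm trying to use Python to pull data on congressional representatives from ProPublica's API. I can get the REST API call to work, but I'm having trouble structuring the resulting JSON data properly as a dataframe. I think this is because the data has multiple levels of nesting. I tried normalizing the data, but I can only get it to work on the first level of nesting.

Here is my code. Note that I have removed my API key, but you can get one quickly and easily here.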

# Import libraries
import pandas as pd
import requests
import json
import time
import csv

### Index 0

# Requesting data through the API
payload = {'X-API-Key': 'a876543211234'}
terms = '"trade war"AND"China"'
index = str(0) # 440 is last offset for this call

response = requests.get('https://api.propublica.org/congress/v1/116/house/members.json', headers=payload)
print(response.status_code)

# Parsing the response body as JSON
json_data = json.loads(response.content.decode("utf-8"))

# Writing data as a string
json_string = json.dumps(json_data)

# Creating Stage 1 dataframe
jdata = json.loads(json_string)
df = pd.DataFrame(jdata)
df2 = pd.DataFrame(df.results)

# Normalizing data - converts nested data into a regular-looking dataframe
# (pd.json_normalize replaces the deprecated pandas.io.json.json_normalize import)
normal_data_0 = pd.json_normalize(data=df['results'])

This is what the JSON data looks like. Note that the data for all representatives is nested under 'results' and 'members':

{'status': 'OK',
 'copyright': ' Copyright (c) 2021 Pro Publica Inc. All Rights Reserved.',
 'results': [{'congress': '116',
   'chamber': 'House',
   'num_results': 451,
   'offset': 0,
   'members': [{'id': 'A000374',
     'title': 'Representative',
     'short_title': 'Rep.',
     'api_uri': 'https://api.propublica.org/congress/v1/members/A000374.json',
     'first_name': 'Ralph',
     'middle_name': None,
     'last_name': 'Abraham',
     'suffix': None,
     'date_of_birth': '1954-09-16',
     'gender': 'M',
     'party': 'R',
     'leadership_role': '',
     'twitter_account': 'RepAbraham',
     'facebook_account': 'CongressmanRalphAbraham',
     'youtube_account': None,
     'govtrack_id': '412630',
     'cspan_id': '76236',
     'votesmart_id': '155414',
     'icpsr_id': '21522',
     'crp_id': 'N00036633',
     'google_entity_id': '/m/012dwd7_',
     'fec_candidate_id': 'H4LA05221',
     'url': 'https://abraham.house.gov',
     'rss_url': 'https://abraham.house.gov/rss.xml',
     'contact_form': None,
     'in_office': False,
     'cook_pvi': 'R+15',
     'dw_nominate': 0.541,
     'ideal_point': None,
     'seniority': '6',
     'next_election': '2020',
     'total_votes': 954,
     'missed_votes': 377,
     'total_present': 0,
     'last_updated': '2020-12-31 18:30:50 -0500',
     'ocd_id': 'ocd-division/country:us/state:la/cd:5',
     'office': '417 Cannon House Office Building',
     'phone': '202-225-8490',
     'fax': None,
     'state': 'LA',
     'district': '5',
     'at_large': False,
     'geoid': '2205',
     'missed_votes_pct': 39.52,
     'votes_with_party_pct': 94.93,
     'votes_against_party_pct': 4.9},
    {'id': 'A000370',
     'title': 'Representative',
      ...

This is what my "dataset" looks like. All of the JSON data is stored as a string in the 'members' column of the only row:

normal_data_0

    congress    chamber num_results offset  members
0   116 House   451 0   [{'id': 'A000374', 'title': 'Representative', ...

I have tried running the data through json_normalize twice, and I have tried passing both keys [results, members]. Nothing I have tried has worked.

Any suggestions?


【Answer 1】: The 'results' key is a 1-element list, so 'members' can be normalized by selecting the 'members' key from the dict at index 0.
import pandas as pd
import requests

# Requesting data through the API
payload = {'X-API-Key': '...'}
terms = '"trade war"AND"China"'
index = str(0)  # 440 is last offset for this call

response = requests.get('https://api.propublica.org/congress/v1/116/house/members.json', headers=payload)

# extract the json data from the response
json_data = response.json()

# normalize only members
members = pd.json_normalize(data=json_data['results'][0]['members'])

# alternatively: normalize members and the preceding keys
members = pd.json_normalize(data=json_data['results'][0], record_path=['members'], meta=['congress', 'chamber', 'num_results', 'offset'])

display(members)

        id           title short_title                                                      api_uri first_name middle_name  last_name suffix date_of_birth gender party leadership_role  twitter_account         facebook_account youtube_account govtrack_id cspan_id votesmart_id icpsr_id     crp_id google_entity_id fec_candidate_id                          url                                         rss_url contact_form  in_office cook_pvi  dw_nominate ideal_point seniority next_election  total_votes  missed_votes  total_present               last_updated                                  ocd_id                                office         phone   fax state  district  at_large geoid  missed_votes_pct  votes_with_party_pct  votes_against_party_pct
0  A000374  Representative        Rep.  https://api.propublica.org/congress/v1/members/A000374.json      Ralph        None    Abraham   None    1954-09-16      M     R                       RepAbraham  CongressmanRalphAbraham            None      412630    76236       155414    21522  N00036633      /m/012dwd7_        H4LA05221    https://abraham.house.gov               https://abraham.house.gov/rss.xml         None      False     R+15        0.541        None         6          2020        954.0         377.0            0.0  2020-12-31 18:30:50 -0500   ocd-division/country:us/state:la/cd:5      417 Cannon House Office Building  202-225-8490  None    LA         5     False  2205             39.52                 94.93                     4.90
1  A000370  Representative        Rep.  https://api.propublica.org/congress/v1/members/A000370.json       Alma        None      Adams   None    1946-05-27      F     D            None         RepAdams       CongresswomanAdams            None      412607    76386         5935    21545  N00035451        /m/02b45d        H4NC12100      https://adams.house.gov                 https://adams.house.gov/rss.xml         None      False     D+18       -0.465        None         8          2020        954.0          26.0            0.0  2020-12-31 18:30:55 -0500  ocd-division/country:us/state:nc/cd:12    2436 Rayburn House Office Building  202-225-1510  None    NC        12     False  3712              2.73                 99.24                     0.65
2  A000055  Representative        Rep.  https://api.propublica.org/congress/v1/members/A000055.json     Robert          B.   Aderholt   None    1965-07-22      M     R            None  Robert_Aderholt           RobertAderholt  RobertAderholt      400004    45516          441    29701  N00003028        /m/024p03        H6AL04098   https://aderholt.house.gov              https://aderholt.house.gov/rss.xml         None      False     R+30        0.369        None        24          2020        954.0          71.0            0.0  2020-12-31 18:30:49 -0500   ocd-division/country:us/state:al/cd:4  1203 Longworth House Office Building  202-225-4876  None    AL         4     False  0104              7.44                 93.60                     6.29
3  A000371  Representative        Rep.  https://api.propublica.org/congress/v1/members/A000371.json       Pete        None    Aguilar   None    1979-06-19      M     D            None   reppeteaguilar           reppeteaguilar            None      412615    79994        70114    21506  N00033997       /m/0jwv0xf        H2CA31125    https://aguilar.house.gov               https://aguilar.house.gov/rss.xml         None      False      D+8       -0.291        None         6          2020        954.0           9.0            0.0  2020-12-31 18:30:52 -0500  ocd-division/country:us/state:ca/cd:31      109 Cannon House Office Building  202-225-3201  None    CA        31     False  0631              0.94                 97.45                     2.44
4  A000372  Representative        Rep.  https://api.propublica.org/congress/v1/members/A000372.json       Rick        None      Allen   None    1951-11-07      M     R            None     reprickallen     CongressmanRickAllen            None      412625    62545       136062    21516  N00033720      /m/0127y9dk        H2GA12121      https://allen.house.gov                                            None         None      False      R+9        0.679        None         6          2020        954.0          15.0            0.0  2020-12-31 18:30:49 -0500  ocd-division/country:us/state:ga/cd:12    2400 Rayburn House Office Building  202-225-2823  None    GA        12     False  1312              1.57                 92.26                     7.63
5  A000376  Representative        Rep.  https://api.propublica.org/congress/v1/members/A000376.json      Colin        None     Allred   None    1983-04-15      M     D            None   RepColinAllred                     None            None      412828     None       177357     None  N00040989       /m/03d066b        H8TX32098     https://allred.house.gov                                            None         None      False      R+5          NaN        None         2          2020        954.0          29.0            0.0  2020-12-31 18:30:52 -0500  ocd-division/country:us/state:tx/cd:32      328 Cannon House Office Building  202-225-2231  None    TX        32     False  4832              3.04                 97.72                     2.17
6  A000367  Representative        Rep.  https://api.propublica.org/congress/v1/members/A000367.json     Justin        None      Amash   None    1980-04-18      M     I                      justinamash           repjustinamash  repjustinamash      412438  1033767       105566    21143  N00031938       /m/0c00p_n                       https://amash.house.gov                 https://amash.house.gov/rss.xml         None      False      R+6          NaN        None        10          2020        524.0           0.0           10.0  2020-12-31 18:30:47 -0500   ocd-division/country:us/state:mi/cd:3                                  None          None  None    MI         3     False  2603              0.00                 58.49                    41.51
7  A000367  Representative        Rep.  https://api.propublica.org/congress/v1/members/A000367.json     Justin        None      Amash   None    1980-04-18      M     R                      justinamash           repjustinamash  repjustinamash      412438  1033767       105566    21143  N00031938       /m/0c00p_n        H0MI03126      https://amash.house.gov                 https://amash.house.gov/rss.xml         None      False     None        0.654        None        10          2020        430.0           0.0            5.0  2020-12-28 21:04:36 -0500   ocd-division/country:us/state:mi/cd:3      106 Cannon House Office Building  202-225-3831  None    MI         3     False  2603              0.00                 61.97                    37.79
8  A000369  Representative        Rep.  https://api.propublica.org/congress/v1/members/A000369.json       Mark        None     Amodei   None    1958-06-12      M     R            None    MarkAmodeiNV2            MarkAmodeiNV2   markamodeinv2      412500    62817        12537    21196  N00031177       /m/03bzdkn        H2NV02395     https://amodei.house.gov  https://amodei.house.gov/rss/news-releases.xml         None      False      R+7        0.384        None        10          2020        954.0          36.0            0.0  2020-12-31 18:30:49 -0500   ocd-division/country:us/state:nv/cd:2      104 Cannon House Office Building  202-225-6155  None    NV         2     False  3202              3.77                 92.63                     7.26
9  A000377  Representative        Rep.  https://api.propublica.org/congress/v1/members/A000377.json      Kelly        None  Armstrong   None    1976-10-08      M     R            None   RepArmstrongND                     None            None      412794     None       139338     None  N00042868    /g/11hcszksh3        H8ND00096  https://armstrong.house.gov                                            None         None      False     R+16          NaN        None         2          2020        954.0          33.0            0.0  2020-12-31 18:30:49 -0500   ocd-division/country:us/state:nd/cd:1  1004 Longworth House Office Building  202-225-2611  None    ND  At-Large      True  3800              3.46                 93.31                     6.58
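
The same approach can be tried without an API key. The following is a minimal sketch, assuming a hand-built dict (`sample`, a hypothetical and heavily trimmed stand-in for the real response) with the same 'results' -> 'members' nesting. It also shows that `record_path` can be applied to the whole 'results' list, which may be convenient if that list ever held more than one element.

```python
import pandas as pd

# Hypothetical, trimmed-down stand-in for the API response (not real output)
sample = {
    "status": "OK",
    "results": [
        {
            "congress": "116",
            "chamber": "House",
            "num_results": 2,
            "offset": 0,
            "members": [
                {"id": "A000374", "first_name": "Ralph", "last_name": "Abraham", "party": "R"},
                {"id": "A000370", "first_name": "Alma", "last_name": "Adams", "party": "D"},
            ],
        }
    ],
}

# Normalize only the members: one row per member, one column per key
members_only = pd.json_normalize(sample["results"][0]["members"])

# Normalize members across every element of 'results', carrying the
# parent-level keys listed in `meta` along as extra columns
members_with_meta = pd.json_normalize(
    sample["results"],
    record_path="members",
    meta=["congress", "chamber"],
)

print(members_only)
print(members_with_meta)
```

Here the `meta` columns are repeated on every member row, so each row still records which congress and chamber it came from.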
