使用 pandas json_normalize 扁平化包含多个嵌套列表的字典列表

Posted

技术标签:

【中文标题】使用 pandas json_normalize 扁平化包含多个嵌套列表的字典列表【英文标题】:Flattening List of Dict containing multiple nested lists using pandas json_normalize 【发布时间】:2022-01-17 11:43:33 【问题描述】:

我们有以下列表 -

users_info = [
    
        "Id": "21",
        "Name": "ABC",
        "Country": 
            "Name": "Country 1"
        ,
        "Addresses": 
            "records": [
                
                    "addressId": "12",
                    "line1": "xyz, 102",
                    "city": "PQR"
                ,
                
                    "addressId": "13",
                    "line1": "YTR, 102",
                    "city": "NMS"
                
            ]
        ,
        "Education": 
            "records": [
                
                    "Id": "45",
                    "Degree": "Bachelors"
                ,
                
                    "Id": "49",
                    "Degree": "Masters"
                
            ]
        
        
    ,
    
        "Id": "26",
        "Name": "PEW",
        "Country": 
            "Name": "Country 2"
        ,
        "Addresses": 
            "records": [
                
                    "addressId": "10",
                    "line1": "BTR, 12",
                    "city": "UYT"
                ,
                
                    "addressId": "123",
                    "line1": "MEQW, 6",
                    "city": "KJH"
                
            ]
        ,
        "Education": 
            "records": [
                
                    "Id": "45",
                    "Degree": "Bachelors"
                ,
                
                    "Id": "49",
                    "Degree": "Masters"
                
            ]
        
        
    ,
    
        "Id": "214",
        "Name": "TUF",
        "Country": None,
        "Addresses": 
            "records": None
        ,
        "Education": 
            "records": [
                
                    "Id": "45",
                    "Degree": "Bachelors"
                ,
                
                    "Id": "49",
                    "Degree": "Masters"
                
            ]
        
        
    ,
    
        "Id": "2609",
        "Name": "JJU",
        "Country": 
            "Name": "Country 2"
        ,
        "Addresses": 
            "records": [
                
                    "addressId": "10",
                    "line1": "BTR, 12",
                    "city": "UYT"
                ,
                
                    "addressId": "123",
                    "line1": "MEQW, 6",
                    "city": "KJH"
                
            ]
        ,
        "Education": None       
           
]

我想展平这个字典列表并创建 pandas 数据框 -

Id | Name | Country.Name | Addresses.addressId | Addresses.line1 | Addresses.city | Education.Id | Education.Degree

有两个字典列表 - 地址和教育。这些(国家、地址和教育)中的任何一个都可能

如何使用上述数据创建数据框以获得所需的格式?

这是我尝试过的 -

dataset_1 = pandas.json_normalize([user for user in users_info],
                                record_path = ["Addresses"],
                                meta = ["Id", "Name", ["Country", "Name"]],
                                errors="ignore"
                               )
dataset_2 = pandas.json_normalize([user for user in users_info],
                                record_path = ["Education"],
                                meta = ["Id", "Name", ["Country", "Name"]],
                                errors="ignore"
                               )
dataset = pandas.concat([dataset_1, dataset_2], axis=1)

当 dataset_1 执行时,我得到 -

NoneType object is not iterable error

【问题讨论】:

There is possibilty that any of these (Country, Addresses and Education) can be None - 您可以将其添加到示例数据中吗?样本中没有None 为无添加示例数据 【参考方案1】:

你可以使用:

df = pd.json_normalize(users_info)
df_addresses = df['Addresses.records'].explode().apply(pd.Series)
df_addresses.rename(columns=col:f'Addresses.col' for col in df_addresses.columns, inplace=True)

df_education = df['Education.records'].explode().apply(pd.Series)
df_education.rename(columns=col:f'Education.col' for col in df_education.columns, inplace=True)

cols = [col for col in df.columns if col not in ['Addresses.records', 'Education.records']]
df = df[cols].join(df_addresses).join(df_education)
df.dropna(axis=1, how='all', inplace=True)
print(df)

OUTPUT

Id Name Country.Name Addresses.addressId Addresses.line1 Addresses.city Education.Degree Education.Id
0    21  ABC    Country 1                  12        xyz, 102            PQR        Bachelors           45
0    21  ABC    Country 1                  12        xyz, 102            PQR          Masters           49
0    21  ABC    Country 1                  13        YTR, 102            NMS        Bachelors           45
0    21  ABC    Country 1                  13        YTR, 102            NMS          Masters           49
1    26  PEW    Country 2                  10         BTR, 12            UYT        Bachelors           45
1    26  PEW    Country 2                  10         BTR, 12            UYT          Masters           49
1    26  PEW    Country 2                 123         MEQW, 6            KJH        Bachelors           45
1    26  PEW    Country 2                 123         MEQW, 6            KJH          Masters           49
2   214  TUF          NaN                 NaN             NaN            NaN        Bachelors           45
2   214  TUF          NaN                 NaN             NaN            NaN          Masters           49
3  2609  JJU    Country 2                  10         BTR, 12            UYT              NaN          NaN
3  2609  JJU    Country 2                 123         MEQW, 6            KJH              NaN          NaN

【讨论】:

以上是关于使用 pandas json_normalize 扁平化包含多个嵌套列表的字典列表的主要内容,如果未能解决你的问题,请参考以下文章

Pandas json_normalize 会产生令人困惑的“KeyError”消息?

Pandas json_normalize 的逆

使用 pandas json_normalize 扁平化包含多个嵌套列表的字典列表

在带有数组的嵌套 Json 上使用 Pandas json_normalize

如何防止 json_normalize 在 Pandas 中重复列标题?

pandas json_normalize KeyError