Flattening a dictionary with pd.json_normalize

Posted 2023-03-11

【Title】: Flattening dictionary with pd.json_normalize 【Posted】: 2021-06-02 11:59:08 【Question】:

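I'm currently flattening this dictionary and have hit a roadblock. I'm trying to use json_normalize to flatten the data. It works when I test with a single instance, but when I try to flatten all of the data it raises an error stating key error '0', and I don't know how to fix it.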

Example data:

data = {
    1: {
        'Name': "Thrilling Tales of Dragon Slayers",
        'IDs': {
            "StoreID": ['123445452543'],
            "BookID": ['543533254353'],
            "SalesID": ['543267765345'],
        },
    },
    2: {
        'Name': "boring Tales of Dragon Slayers",
        'IDs': {
            "StoreID": ['111111', '1121111'],
            "BookID": ['543533254353', '4324232342'],
            "SalesID": ['543267765345', '4353543'],
        },
    },
}

My code:

d_flat = pd.io.json.json_normalize(data, meta=['Title', 'StoreID', 'BookID', 'SalesID'])

【Answer 1】:

Setup

Your data is structured inconveniently. I want to focus on the 'IDs' entry of each record.

Your data:

{1: {'Name': 'Thrilling Tales of Dragon Slayers',
     'IDs': {'StoreID': ['123445452543'],
             'BookID': ['543533254353'],
             'SalesID': ['543267765345']}},
 2: {'Name': 'boring Tales of Dragon Slayers',
     'IDs': {'StoreID': ['111111', '1121111'],
             'BookID': ['543533254353', '4324232342'],
             'SalesID': ['543267765345', '4353543']}}}

What I want it to look like:

[{'Name': 'Thrilling Tales of Dragon Slayers',
  'IDs': [{'StoreID': '123445452543',
           'BookID': '543533254353',
           'SalesID': '543267765345'}]},
 {'Name': 'boring Tales of Dragon Slayers',
  'IDs': [{'StoreID': '111111',
           'BookID': '543533254353',
           'SalesID': '543267765345'},
          {'StoreID': '1121111',
           'BookID': '4324232342',
           'SalesID': '4353543'}]}]

Restructuring the data

The reasonable way

A simple loop, nothing fancy. This gets us what I showed above.

new = []

for v in data.values():
    temp = {**v}           # Shallow copy; this is intended to keep all the other data that might be there
    ids = temp.pop('IDs')  # I have to focus on this to create the records
    temp['IDs'] = [dict(zip(ids, x)) for x in zip(*ids.values())]
    new.append(temp)
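The step doing the real work in that loop is the double zip, which turns a dictionary of lists into a list of dictionaries. A minimal sketch of just that step on the second record, assuming the sample data from the question with its braces restored; the names rows and records are mine, not part of the answer:

```python
# Sample data from the question, with its braces restored.
data = {
    1: {'Name': 'Thrilling Tales of Dragon Slayers',
        'IDs': {'StoreID': ['123445452543'],
                'BookID': ['543533254353'],
                'SalesID': ['543267765345']}},
    2: {'Name': 'boring Tales of Dragon Slayers',
        'IDs': {'StoreID': ['111111', '1121111'],
                'BookID': ['543533254353', '4324232342'],
                'SalesID': ['543267765345', '4353543']}},
}

ids = data[2]['IDs']

# zip(*ids.values()) pairs up the i-th entry of every list,
# giving one tuple per record.
rows = list(zip(*ids.values()))

# dict(zip(ids, row)) puts the key names back on each tuple;
# iterating a dict yields its keys in insertion order.
records = [dict(zip(ids, row)) for row in rows]

print(rows)
print(records)
```

Note that plain zip stops at the shortest list, so this only works as-is when every ID type has the same number of entries.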

The cute one-liner

new = [{**v, 'IDs': [dict(zip(v['IDs'], x)) for x in zip(*v['IDs'].values())]} for v in data.values()]

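Creating the `DataFrame` with `pd.json_normalize`

In the call to json_normalize we need to specify the path to the records, i.e. the list of ID dictionaries found under the 'IDs' key. json_normalize creates one row in the DataFrame for every item in that list. This is done with the record_path parameter, which takes either a list of keys describing the path (if the records sit deeper in the structure) or a string (if the key is at the top level, which for us it is).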

record_path = 'IDs'
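For comparison, here is a rough sketch of the deeper case, where record_path is a list of keys. The 'Info' wrapper and the trimmed-down records are hypothetical, invented only to give the records a nested path; they are not part of the question's data.

```python
import pandas as pd

# Hypothetical variation: the records sit one level deeper, under 'Info' -> 'IDs'.
nested = [
    {'Name': 'Thrilling Tales of Dragon Slayers',
     'Info': {'IDs': [{'StoreID': '123445452543', 'BookID': '543533254353'}]}},
    {'Name': 'boring Tales of Dragon Slayers',
     'Info': {'IDs': [{'StoreID': '111111', 'BookID': '543533254353'},
                      {'StoreID': '1121111', 'BookID': '4324232342'}]}},
]

# The list names each key to follow, outermost first, to reach the records.
df_nested = pd.json_normalize(nested, record_path=['Info', 'IDs'])

print(df_nested)
```

Each dictionary found at the end of the path becomes one row, so the two books above produce three rows.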

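Next we tell json_normalize which keys are metadata for the records. If a parent has more than one record, as ours does, the metadata is repeated for each of its records.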

meta = 'Name'
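meta also accepts a list when more than one key should be carried along. A small sketch of the repetition, assuming a hypothetical extra 'Year' field that does not exist in the question's data:

```python
import pandas as pd

# 'Year' is a hypothetical extra field, added only to show a list of meta keys.
books = [
    {'Name': 'boring Tales of Dragon Slayers', 'Year': 2020,
     'IDs': [{'StoreID': '111111'}, {'StoreID': '1121111'}]},
]

# One row per item in 'IDs'; 'Name' and 'Year' are copied onto every row.
df_meta = pd.json_normalize(books, record_path='IDs', meta=['Name', 'Year'])

print(df_meta)
```

The record columns come first and the meta columns are appended after them, which is why Name ends up on the right in the result shown further down.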

So the final solution looks like this:

pd.json_normalize(new, record_path='IDs', meta='Name')

        StoreID        BookID       SalesID                               Name
0  123445452543  543533254353  543267765345  Thrilling Tales of Dragon Slayers
1        111111  543533254353  543267765345     boring Tales of Dragon Slayers
2       1121111    4324232342       4353543     boring Tales of Dragon Slayers

However

If we're restructuring anyway, we might as well do it in a way that lets us pass the result straight to the DataFrame constructor.

pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values() for x in zip(*r['IDs'].values())
])

                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543

Bonus content

While we're at it: the data is ambiguous about whether every ID type has the same number of IDs. Suppose they don't.

data = {
    1: {
        'Name': "Thrilling Tales of Dragon Slayers",
        'IDs': {
            "StoreID": ['123445452543'],
            "BookID": ['543533254353'],
            "SalesID": ['543267765345'],
        },
    },
    2: {
        'Name': "boring Tales of Dragon Slayers",
        'IDs': {
            "StoreID": ['111111', '1121111'],
            "BookID": ['543533254353', '4324232342'],
            "SalesID": ['543267765345', '4353543', 'extra id'],
        },
    },
}

Then we can use zip_longest from itertools.

from itertools import zip_longest

pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values() for x in zip_longest(*r['IDs'].values())
])

                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543
3     boring Tales of Dragon Slayers          None          None      extra id
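To make the difference concrete, here is a small comparison of zip and zip_longest on the uneven 'IDs' dictionary alone. The variable names short, full and padded are mine, and the fillvalue line is an optional extra rather than something the answer uses.

```python
from itertools import zip_longest

ids = {
    'StoreID': ['111111', '1121111'],
    'BookID': ['543533254353', '4324232342'],
    'SalesID': ['543267765345', '4353543', 'extra id'],
}

# zip stops at the shortest list, so 'extra id' is silently dropped.
short = [dict(zip(ids, x)) for x in zip(*ids.values())]

# zip_longest runs to the longest list and pads the gaps with None.
full = [dict(zip(ids, x)) for x in zip_longest(*ids.values())]

# fillvalue replaces the None padding, here with an empty string.
padded = [dict(zip(ids, x)) for x in zip_longest(*ids.values(), fillvalue='')]

print(len(short), len(full))
print(full[-1])
print(padded[-1])
```

Plain zip would have produced a three-row frame with no warning that an ID was lost, which is the reason to reach for zip_longest whenever the list lengths are not guaranteed to match.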

【Answer 2】:

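Use pandas.DataFrame.from_dict to read data.

Convert the values in the 'IDs' column to separate columns: .pop removes the old column from df, pd.DataFrame(df.pop('IDs').values.tolist()) turns each dict key into its own column, and .join adds the new columns back to df.

Use pd.Series.explode with .apply to expand the list in each column into separate rows.

Depending on the data, sometimes the solution is to reshape the data, as shown by piRSquared.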

import pandas as pd

# test data
data = {
    1: {'IDs': {'BookID': ['543533254353'],
                'SalesID': ['543267765345'],
                'StoreID': ['123445452543']},
        'Name': 'Thrilling Tales of Dragon Slayers'},
    2: {'IDs': {'BookID': ['543533254353', '4324232342'],
                'SalesID': ['543267765345', '4353543'],
                'StoreID': ['111111', '1121111']},
        'Name': 'boring Tales of Dragon Slayers'},
}

# load the data using from_dict
df = pd.DataFrame.from_dict(data, orient='index').reset_index(drop=True)

# convert IDs to separate columns
df = df.join(pd.DataFrame(df.pop('IDs').values.tolist()))

# explode the list in each column
df = df.apply(pd.Series.explode).reset_index(drop=True)

# display(df)
                                Name        BookID       SalesID       StoreID
0  Thrilling Tales of Dragon Slayers  543533254353  543267765345  123445452543
1     boring Tales of Dragon Slayers  543533254353  543267765345        111111
2     boring Tales of Dragon Slayers    4324232342       4353543       1121111
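As a possible alternative to the .apply step, DataFrame.explode accepts a list of columns in pandas 1.3 and later. This is a different call from the one in the answer, sketched here on the same test data; id_cols and df_exploded are my own names, and it assumes the lists in each row have matching lengths.

```python
import pandas as pd

data = {
    1: {'IDs': {'BookID': ['543533254353'],
                'SalesID': ['543267765345'],
                'StoreID': ['123445452543']},
        'Name': 'Thrilling Tales of Dragon Slayers'},
    2: {'IDs': {'BookID': ['543533254353', '4324232342'],
                'SalesID': ['543267765345', '4353543'],
                'StoreID': ['111111', '1121111']},
        'Name': 'boring Tales of Dragon Slayers'},
}

# Same loading and column-splitting steps as in the answer.
df = pd.DataFrame.from_dict(data, orient='index').reset_index(drop=True)
df = df.join(pd.DataFrame(df.pop('IDs').values.tolist()))

# Explode the three list columns together (requires pandas >= 1.3);
# ignore_index=True renumbers the rows from 0.
id_cols = ['BookID', 'SalesID', 'StoreID']
df_exploded = df.explode(id_cols, ignore_index=True)

print(df_exploded)
```

Naming the columns explicitly leaves Name untouched and avoids the separate reset_index call.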