How to create a multi-index DataFrame from a nested dictionary?

Posted 2023-02-23

【Original title】: How to create a multi index dataframe from a nested dictionary? 【Published】: 2021-11-24 09:29:53 【Problem description】:

I have a nested dictionary whose first-level keys are [0, 1, 2...], and the value for each key looks like this:


{
    "geometry": {
        "type": "Point",
        "coordinates": [75.4516454, 27.2520587]
    },
    "type": "Feature",
    "properties": {
        "state": "Rajasthan",
        "code": "BDHL",
        "name": "Badhal",
        "zone": "NWR",
        "address": "Kishangarh Renwal, Rajasthan"
    }
}

I want to build a pandas DataFrame of this form:

   Geometry              Type      Properties
   Type   Coordinates              State      Code  Name  Zone  Address
0  Point  [..., ...]     Feature   Rajasthan  BDHL  ...   ...   ...
1
2

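I cannot make sense of the examples online about multi-index / nested DataFrames / pivoting. None of them seem to use the first-level keys as the main index of the resulting DataFrame.

How do I get from the data I have to a DataFrame in this format?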

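【Comments】:

See whether you need further clarification on the answers below. If you have no more questions, please accept the answer of your choice and upvote any answers you found helpful, so we know which one best fits your needs. Thanks!

【Answer 1】: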

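I suggest creating columns named "geometry_type", "geometry_coord", and so on, to distinguish them from the column you named "type". In other words, use the first-level key as a prefix and the sub-key as the name, which gives each column a new, unambiguous name. After that, simply parse the data and fill your DataFrame like this: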

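import json
import pandas as pd

# json.load reads a file object; json.loads would try to parse the file name itself
with open("your_json.json") as f:
    j = json.load(f)

row = {}
for k, v in j.items():
    if k == "geometry":
        # Prefix each sub-key with its parent key
        row["geometry_type"] = v.get("type")
        row["geometry_coord"] = v.get("coordinates")
    ...  # handle the remaining keys the same way

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df = pd.DataFrame([row])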

Your output might look like this:

    geometry_type    geometry_coord              ...
0   Point            [75.4516454, 27.2520587]    ...

PS: If you really want to go with your original layout, you can take a look here: Giving a column multiple indexes/headers
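
The prefix naming scheme described in this answer can be sketched for the whole dictionary at once. This is a minimal sketch, not the answerer's code: the helper `flatten`, its `sep` parameter, and the sample `data` (the question's single record reused under keys 0 and 1) are assumptions made for illustration, and the generated names use the full sub-key (`geometry_coordinates` rather than `geometry_coord`).

```python
import pandas as pd

# Sample record copied from the question; reusing it for two keys is only
# an illustration of the {0: ..., 1: ..., 2: ...} layout described there.
record = {
    "geometry": {"type": "Point", "coordinates": [75.4516454, 27.2520587]},
    "type": "Feature",
    "properties": {
        "state": "Rajasthan",
        "code": "BDHL",
        "name": "Badhal",
        "zone": "NWR",
        "address": "Kishangarh Renwal, Rajasthan",
    },
}
data = {0: record, 1: record}


def flatten(rec, sep="_"):
    """Flatten one level of nesting, prefixing sub-keys with the parent key."""
    flat = {}
    for key, value in rec.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# One flat dict per record; the first-level keys become the row index.
df = pd.DataFrame([flatten(v) for v in data.values()], index=list(data.keys()))
print(df.columns.tolist())
```

Passing the first-level keys as `index` keeps 0, 1, 2... as the row labels, which is what the question asks for.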

【Answer 2】:

I assume you have a list of nested dictionaries.

Use json_normalize to read the JSON data, then use str.partition to split the current column index into two parts:

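import json
import pandas as pd

# Load the list of nested dictionaries from disk
with open('data.json') as f:
    data = json.load(f)

# Flatten nested keys into dotted column names such as 'geometry.type'
df = pd.json_normalize(data)
# Split each name at the first '.', then drop the level that holds the separator
df.columns = df.columns.str.partition('.', expand=True).droplevel(level=1)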

Output:

>>> df.columns
MultiIndex([(      'type',            ''),
            (  'geometry',        'type'),
            (  'geometry', 'coordinates'),
            ('properties',       'state'),
            ('properties',        'code'),
            ('properties',        'name'),
            ('properties',        'zone'),
            ('properties',     'address')],
           )

>>> df
      type geometry                           properties                     
               type               coordinates      state  code    name zone                        address   
0  Feature    Point  [75.4516454, 27.2520587]  Rajasthan  BDHL  Badhal  NWR   Kishangarh Renwal, Rajasthan
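
If the data is a dictionary keyed by 0, 1, 2... as in the question, rather than a list, the same approach might be adapted as below. This is a sketch under assumptions: `record` copies the question's sample and is reused under two keys purely to produce two rows, and the variable names are illustrative.

```python
import pandas as pd

# Sample record copied from the question, reused for keys 0 and 1.
record = {
    "geometry": {"type": "Point", "coordinates": [75.4516454, 27.2520587]},
    "type": "Feature",
    "properties": {
        "state": "Rajasthan",
        "code": "BDHL",
        "name": "Badhal",
        "zone": "NWR",
        "address": "Kishangarh Renwal, Rajasthan",
    },
}
data = {0: record, 1: record}

# json_normalize expects a dict or a list of dicts, so pass the values
# and restore the first-level keys as the row index afterwards.
df = pd.json_normalize(list(data.values()))
df.index = list(data.keys())

# Same column split as above: partition at the first '.', drop the separator level.
df.columns = df.columns.str.partition(".", expand=True).droplevel(level=1)

# Selecting a first-level label returns the sub-table beneath it.
properties = df["properties"]
print(properties)
```

Columns without a dot, such as `type`, end up with an empty string as their second-level label, so they can be addressed as `df[("type", "")]`.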

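【Answer 3】:

You can use pd.json_normalize() to normalize the nested dictionary into a DataFrame df.

Then call Index.str.split on df.columns with the parameter expand=True to split the dotted column names into a MultiIndex, as follows:

Step 1: normalize the nested dict into a DataFrame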

import pandas as pd

j = {
    "geometry": {
        "type": "Point",
        "coordinates": [75.4516454, 27.2520587]
    },
    "type": "Feature",
    "properties": {
        "state": "Rajasthan",
        "code": "BDHL",
        "name": "Badhal",
        "zone": "NWR",
        "address": "Kishangarh Renwal, Rajasthan"
    }
}

df = pd.json_normalize(j)

Step 1 result:

print(df)

      type geometry.type      geometry.coordinates properties.state properties.code properties.name properties.zone            properties.address
0  Feature         Point  [75.4516454, 27.2520587]        Rajasthan            BDHL          Badhal             NWR  Kishangarh Renwal, Rajasthan

Step 2: create the multi-index column labels

df.columns = df.columns.str.split('.', expand=True)

Step 2 (final) result:

print(df)

      type geometry                           properties                                                 
       NaN     type               coordinates      state  code    name zone                       address
0  Feature    Point  [75.4516454, 27.2520587]  Rajasthan  BDHL  Badhal  NWR  Kishangarh Renwal, Rajasthan
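
In the result above, the `type` column has NaN as its second-level label. If an empty string is preferred there, one possible variation is to build the column tuples by hand instead of calling `str.split`. This is a hedged sketch rather than part of the original answer; the names `tuples` and `props` are illustrative.

```python
import pandas as pd

j = {
    "geometry": {"type": "Point", "coordinates": [75.4516454, 27.2520587]},
    "type": "Feature",
    "properties": {
        "state": "Rajasthan",
        "code": "BDHL",
        "name": "Badhal",
        "zone": "NWR",
        "address": "Kishangarh Renwal, Rajasthan",
    },
}

df = pd.json_normalize(j)

# Split each dotted name once; pad names without a dot with an empty string
# so every tuple has exactly two entries.
tuples = [tuple(c.split(".", 1)) if "." in c else (c, "") for c in df.columns]
df.columns = pd.MultiIndex.from_tuples(tuples)

# A full tuple selects a single column; a first-level label selects a sub-table.
coords = df[("geometry", "coordinates")]
props = df["properties"]
print(props)
```

With two-entry tuples throughout, every column can be addressed uniformly as `df[(first_level, second_level)]`.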
