json_normalize a JSON file with multiple levels of lists containing dictionaries (sample included)

【Posted】: 2018-12-16 14:35:45 【Question】:

(Originally from a previous question, but reworked into a more general one.)

Here is my sample JSON file with 2 records:

[{"Time":"2016-01-10",
"ID":13567,
"Content":{
    "Event":"UPDATE",
    "Id":{"EventID":"ABCDEFG"},
    "Story":[{
        "@ContentCat":"News",
        "Body":"Related Meeting Memo: Engagement with target firm for potential M&A.  Please be on call this weekend for news updates.",
        "BodyTextType":"PLAIN_TEXT",
        "DerivedId":{"Entity":[{"Id":"Amy","Score":70}, {"Id":"Jon","Score":70}]},
        "DerivedTopics":{"Topics":[
                            {"Id":"Meeting","Score":70},
                            {"Id":"Performance","Score":70},
                            {"Id":"Engagement","Score":100},
                            {"Id":"Salary","Score":70},
                            {"Id":"Career","Score":100}]
                        },
        "HotLevel":0,
        "LanguageString":"ENGLISH",
        "Metadata":{"ClassNum":50,
                    "Headline":"Attn: Weekend",
                    "WireId":2035,
                    "WireName":"IIS"},
        "Version":"Original"
                }]},
"yyyymmdd":"20160110",
"month":201601},
{"Time":"2016-01-12",
"ID":13568,
"Content":{
    "Event":"DEAL",
    "Id":{"EventID":"ABCDEFG2"},
    "Story":[{
        "@ContentCat":"Details",
        "Body":"Test email contents",
        "BodyTextType":"PLAIN_TEXT",
        "DerivedId":{"Entity":[{"Id":"Bob","Score":100}, {"Id":"Jon","Score":70}, {"Id":"Jack","Score":60}]},
        "DerivedTopics":{"Topics":[
                            {"Id":"Meeting","Score":70},
                            {"Id":"Engagement","Score":100},
                            {"Id":"Salary","Score":70},
                            {"Id":"Career","Score":100}]
                        },
        "HotLevel":0,
        "LanguageString":"ENGLISH",
        "Metadata":{"ClassNum":70,
                    "Headline":"Attn: Weekend",
                    "WireId":2037,
                    "WireName":"IIS"},
        "Version":"Original"
                }]},
"yyyymmdd":"20160112",
"month":201602}]

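I am trying to get a DataFrame at the entity Id level (extracting Amy and Jon from record 1, and Bob, Jon, and Jack from record 2). How can I do this? To clarify, the nesting is (Content > Story > DerivedId > Entity > Id).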

【Answer 1】:

Using a list comprehension, you can reach into that structure like this:

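import json
import pandas as pd

# The 'rU' mode was removed in Python 3.11; the default text mode is enough
with open('test.json') as f:
    data = json.load(f)

# Concatenate each record's entity list (first story only) into one flat list
df = pd.DataFrame(sum([i['Content']['Story'][0]['DerivedId']['Entity']
                       for i in data], []))

print(df)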

Alternatively, if you have a lot of data and don't want to use the clunky sum(), use itertools.chain.from_iterable, for example:

import itertools as it
df = pd.DataFrame.from_records(it.chain.from_iterable(
    i['Content']['Story'][0]['DerivedId']['Entity'] for i in data))

Result:

     Id  Score
0   Amy     70
1   Jon     70
2   Bob    100
3   Jon     70
4  Jack     60
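Both snippets above read only the first story (`Story[0]`) of each record. If a record could hold several stories, or if the record's `ID` should travel with each entity, a nested comprehension is one way to do it. The following is a minimal sketch, not part of the original answer; the inline `data` list is a hypothetical, trimmed-down copy of the sample from the question so that the block runs on its own.

```python
import pandas as pd

# Hypothetical, trimmed-down records shaped like the sample in the question.
data = [
    {"ID": 13567, "Content": {"Story": [
        {"DerivedId": {"Entity": [{"Id": "Amy", "Score": 70},
                                  {"Id": "Jon", "Score": 70}]}}]}},
    {"ID": 13568, "Content": {"Story": [
        {"DerivedId": {"Entity": [{"Id": "Bob", "Score": 100},
                                  {"Id": "Jon", "Score": 70},
                                  {"Id": "Jack", "Score": 60}]}}]}},
]

# One row per entity, across every story of every record,
# with the parent record's ID copied onto each row.
rows = [
    {"ID": rec["ID"], **entity}
    for rec in data
    for story in rec["Content"]["Story"]
    for entity in story["DerivedId"]["Entity"]
]

df = pd.DataFrame(rows)
print(df)
```

The order of the `for` clauses mirrors the nesting (record, then story, then entity), so rows come out in the same order as in the file.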

【Comments】:

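Thanks, Stephen. If I want to add the metadata using the pandas json_normalize function, how would I do that?

I don't know that function well, and you haven't shown your desired output, so... I don't know.

【Answer 2】:
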
df = pd.json_normalize(data, ['Content', 'Story', 'DerivedId', 'Entity'])
print(df)

Remember that the last key in the record path must point to a list in the JSON.

Result:

     Id  Score
0   Amy     70
1   Jon     70
2   Bob    100
3   Jon     70
4  Jack     60

If you only want the Id column:

df[['Id']]
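On the earlier comment about adding metadata with json_normalize: the function takes a `meta` argument that copies fields from the enclosing levels onto every extracted row. Below is a minimal sketch, assuming the fields wanted are `ID`, `Time`, and `Content > Event`; that choice is an assumption, since the thread never states the desired output. The inline `data` list is a hypothetical, trimmed-down copy of the question's sample.

```python
import pandas as pd

# Hypothetical, trimmed-down records shaped like the sample in the question.
data = [
    {"Time": "2016-01-10", "ID": 13567, "Content": {
        "Event": "UPDATE",
        "Story": [{"DerivedId": {"Entity": [{"Id": "Amy", "Score": 70},
                                            {"Id": "Jon", "Score": 70}]}}]}},
    {"Time": "2016-01-12", "ID": 13568, "Content": {
        "Event": "DEAL",
        "Story": [{"DerivedId": {"Entity": [{"Id": "Bob", "Score": 100},
                                            {"Id": "Jon", "Score": 70},
                                            {"Id": "Jack", "Score": 60}]}}]}},
]

df = pd.json_normalize(
    data,
    # Path down to the list of entity dicts.
    record_path=["Content", "Story", "DerivedId", "Entity"],
    # Fields to copy onto each row; a nested field is given as a list of keys
    # and must lie on the way down the record path.
    meta=["ID", "Time", ["Content", "Event"]],
)
print(df)
```

The nested meta field appears as a column named `Content.Event`. Meta paths follow the record path level by level, so a field on a sibling branch (for example the `Metadata` dict inside each story) is not reachable this way; for that case a plain comprehension is a simpler route.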
