Extract information from nested JSON in Python

【Title】: Extract information from nested Json in Python 【Posted】: 2021-11-05 05:39:34 【Question】:

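I have a dataset that contains nested JSON objects. I want to extract information from this nested JSON and put it into a DataFrame in Python. I used the json_normalize method, but beyond a certain level of nesting I cannot parse it. Please help. Thanks.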

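【Comments】:

Can you elaborate? Can you provide a sample of the data? What should the DataFrame look like? Which fields of the JSON should end up in the DataFrame?

【Answer 1】: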

I have been working on a function that expands all embedded lists and dictionaries.

import json
from pathlib import Path

import pandas as pd

# read the raw JSON text from the sample file
with open(Path.home().joinpath("Downloads").joinpath("Sample Json.txt")) as f:
    js = f.read()


def normalize(js, expand_all=False):
    # accept either a JSON string or already-parsed JSON (dict / list of dicts)
    df = pd.json_normalize(json.loads(js) if isinstance(js, str) else js)
    # get first column that contains lists
    col = df.applymap(type).astype(str).eq("<class 'list'>").all().idxmax()
    # explode list and expand embedded dictionaries
    df = df.explode(col).reset_index(drop=True)
    # clashing column names from the expanded part get the source column as suffix
    df = df.drop(columns=[col]).join(df[col].apply(pd.Series), rsuffix=f".{col}")
    # any dictionary to expand?
    if df.applymap(type).astype(str).eq("<class 'dict'>").any().any():
        col = df.applymap(type).astype(str).eq("<class 'dict'>").all().idxmax()
        df = df.drop(columns=[col]).join(df[col].apply(pd.Series), rsuffix=f".{col}")

    # any lists left? run the records through normalize() again until none remain
    while expand_all and df.applymap(type).astype(str).eq("<class 'list'>").any().any():
        df = normalize(df.to_dict("records"))
    return df


df = normalize(js, expand_all=True)

Output:

cfs ctin fldtr1 cfs3b flprdr1 dtcancel val inv_typ pos idt rchrg inum chksum num csamt samt rt txval camt iamt
0 Y 03AZX 10-Aug-20 Y Jul-20 nan 2390 R 03 27-07-2020 N TI/20-21/111 24ea1a46933dd7c6f130cc7ddce3ad89f42194d84e358746f66716d0f1b8aef0 101 0 182.25 18 2025 182.25 0
1 Y 03AZY 02-Sep-20 Y Jul-20 nan 10756 R 03 20-07-2020 N 70 164777293c8ce80595cd4803c3d0287bc544772fb9e5331602ed3d7d0534e82f 1801 0 820.35 18 9115 820.35 nan
2 Y 03A00P1Z7 10-Aug-20 Y Jul-20 nan 411.82 R 03 01-07-2020 N 18IPB06013580804 0560d2b220de53f458ac65594f50bfa5ba736f95061c88201d91371fbeccabf8 1 0 31.41 18 349 31.41 nan
3 Y 03A00P1Z7 10-Aug-20 Y Jul-20 nan 411.82 R 03 01-07-2020 N 18IPB06013580805 08ae71bcb591723318796e797da586ef9b8e5b6b920e9877be6afc9223486760 1 0 31.41 18 349 31.41 nan
4 Y 03A00P1Z7 10-Aug-20 Y Jul-20 nan 383.5 R 03 01-07-2020 N 18IPB06013580806 4d22ddd1d05d22cc4707a89dd80e76a271b99a7ba2610e3b111489fd4f7950fc 1 0 29.25 18 325 29.25 nan
5 Y 03A00P1Z7 10-Aug-20 Y Jul-20 nan 496.78 R 03 01-07-2020 N 18IPB06013580807 73e6e787493276151783d5ab1107bd0bac53780a5840964f7953bf3ba8a4efb0 1 0 37.89 18 421 37.89 nan
6 Y 03A00P1Z7 10-Aug-20 Y Jul-20 nan 411.82 R 03 21-07-2020 N 18IPB07013893564 52ef0e7269de052c0353580cad5092ff1cc7a3c454318b2df1041a62a32f033f 1 0 31.41 18 349 31.41 nan
7 Y 03A00P1Z7 10-Aug-20 Y Jul-20 nan 411.82 R 03 21-07-2020 N 18IPB07013893565 ab44c119f3db614dccfd3bc63c036eaca22a41c99e3e5090904e38aee056f4ac 1 0 31.41 18 349 31.41 nan
8 Y 03CAZD 10-Aug-20 Y Jul-20 nan 162840 R 03 13-07-2020 N T/20-21/56 92e52e48e812bb0bb2e34d9e400248730fdc40363459d05c4e9d6ebb7fe6165d 101 0 12420 18 138000 12420 0
9 Y 03AAE 22-Aug-20 Y Jul-20 nan 46556 R 03 30-07-2020 N S20/21-359 8138e35895114ae412e8256f3ce8382cdd8ae771f2780781085134618bb033c9 1801 0 3550.87 18 39454.2 3550.87 0
10 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 8417.98 R 03 02-07-2020 N 0000030301011976 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb 1 0 0 0 1024.84 0 nan
11 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 8417.98 R 03 02-07-2020 N 0000030301011976 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb 2 0 233.58 18 2595.37 233.58 nan
12 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 8417.98 R 03 02-07-2020 N 0000030301011976 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb 3 0 89.34 5 3573.99 89.34 nan
13 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 8417.98 R 03 02-07-2020 N 0000030301011976 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb 4 0 30.96 12 516.02 30.96 nan
14 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 2824.88 R 03 06-07-2020 N 0000030301012348 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 1 0 116.46 18 1293.94 116.46 nan
15 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 2824.88 R 03 06-07-2020 N 0000030301012348 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 2 0 37.27 12 621.18 37.27 nan
16 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 2824.88 R 03 06-07-2020 N 0000030301012348 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 3 0 0 0 85.26 0 nan
17 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 2824.88 R 03 06-07-2020 N 0000030301012348 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 4 0 12.31 5 492.42 12.31 nan
18 Y 03AA1ZQ 17-Aug-20 Y Jul-20 nan 39294 R 03 02-07-2020 N TI/20-21/43 69f7931986ad9274d9595ca5221e3ce82aa389d659e83376ff1ec34571057670 101 0 2997 18 33300 2997 0
19 Y 03AGG3Z5 18-Aug-20 Y Jul-20 22-Jan-20 593583 R 03 31-07-2020 N 25 623dcb5b65e34be4d0453c1783915bb8e66684a2e33a3c8a547e38754c4f1af9 1 0 45273.3 18 503036 45273.3 nan
20 Y 03AGG3Z5 18-Aug-20 Y Jul-20 22-Jan-20 601409 R 03 31-07-2020 N 26 ef8b99f99fe090f0a2374d8d6c0b15c265740e6c6487ff68d510382ec21d8ce4 1 0 45870.2 18 509668 45870.2 nan
21 Y 03AGG3Z5 18-Aug-20 Y Jul-20 22-Jan-20 767358 R 03 31-07-2020 N 27 9c1257eddeb8cdc7e6a832a3646969b71e49eeeb7d6742b26cfc6e0e3630438a 1 0 58527.3 18 650303 58527.3 nan
22 Y 03AGG3Z5 18-Aug-20 Y Jul-20 22-Jan-20 597886 R 03 31-07-2020 N 28 29fc1b28aedd1545e7ea0fd8b67b8332a83f1ac3f62af9398af2dfa26c9f1d90 1 0 45601.4 18 506683 45601.4 nan
23 Y 03AA9 18-Aug-20 Y Jul-20 nan 41914 R 03 29-07-2020 N 2020-21/K-916 d112ad384eb291d49509bdf4a005d509424fefee4caf3443bc9726cf41665295 1801 0 3196.8 18 35520 3196.8 nan
24 Y 03A1Z8 12-Aug-20 Y Jul-20 nan 274893 R 03 20-07-2020 N T/20-21/10 e5851fcc6b370714d7523080582a678a212f5dde90f5c2618880376018221f38 101 0 20966.4 18 232960 20966.4 0
25 Y 03AD1ZL 11-Aug-20 Y Jul-20 nan 125375 R 03 03-07-2020 N T/20-21/155 2bb398c7a0fedf11f1f1c1d196c43ad79910be52e6892f88915671025528eb2b 101 0 9562.5 18 106250 9562.5 0
26 Y 03AA3Z9 14-Aug-20 Y Jul-20 nan 529.99 R 03 31-07-2020 N 0301072000000650 ad1e1d1572c9058fabd6d23fb5dc4b68f1a2a10d3dd3d7e73d73d3c502d92151 1 nan 40.42 18 449.15 40.42 nan
27 Y 03AA3Z9 14-Aug-20 Y Jul-20 nan 1201 R 03 31-07-2020 N 0303072000000025 5a69229d907957c1d95eb464684891c202102b8589f5603b8ae14b07607f1655 1 nan 91.5 18 1018 91.5 nan
28 Y 03AB1ZV 11-Aug-20 Y Jul-20 nan 30976 R 03 10-07-2020 N 70 69bbeb088634a88b30c6e6046b63b1977f5534b2f676b984ef78f2c3bad8ca35 1800 nan 2362.5 18 26250 2362.5 nan
29 Y 03AD1Z1 13-Aug-20 Y Jul-20 nan 8968 R 03 01-07-2020 N B25 5b98b819ca14a377c9304e7eab21957152c4819e82e37f2619fb2c547fb84ba6 1801 0 684 18 7600 684 nan
30 Y 03AAO 10-Aug-20 Y Jul-20 nan 38940 R 03 13-07-2020 N TI/20-21/30 bae339e580c2ab9ffee90533650e4e2acdc47310230ed54aabbb96f89d3fc7c4 101 0 2970 18 33000 2970 0
31 Y 07AH1ZU 11-Aug-20 Y Jul-20 nan 13836.5 R 03 31-07-2020 N DELR/EXP/12176 cb34f329adcd88c9e8794db9892fe47bd0a7afc0373a20860de046934f7923fa 1 0 nan 18 11725.9 nan 2110.65
32 Y 03A1ZT 18-Aug-20 Y Jul-20 nan 41820 R 03 07-07-2020 N TI/20-21/68 ad61c4dd8227b214dbe4bba24b57a2c976ce8438e53cf15b3530480116ca64da 101 0 3189.69 18 35441 3189.69 0
33 Y 03A1ZT 18-Aug-20 Y Jul-20 nan 69773 R 03 10-07-2020 N TI/20-21/71 1deca4741b91716bfabc8b2ab826be76342b0fd3e698b128c927f4b426c064d0 101 0 5321.7 18 59130 5321.7 0
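
To make the individual steps of the function easier to follow, here is a minimal sketch of the same explode-then-expand idea on a tiny input. The sample data is hypothetical (the field names ctin, inv, inum, rt and txval are only borrowed from the column names in the output above, and det is made up), and the column names are hard-coded here instead of being detected as in normalize().

```python
import pandas as pd

# Hypothetical sample: one supplier with two invoices, each holding a nested dict.
data = [
    {
        "ctin": "03AZX",
        "inv": [
            {"inum": "A-1", "det": {"rt": 18, "txval": 100.0}},
            {"inum": "A-2", "det": {"rt": 5, "txval": 40.0}},
        ],
    }
]

# Step 1: json_normalize flattens nested dicts, but a list stays in a single cell.
df = pd.json_normalize(data)  # columns: ctin, inv

# Step 2: explode gives every list element its own row.
df = df.explode("inv").reset_index(drop=True)

# Step 3: turn the dicts in "inv" into columns and join them back.
flat = df.drop(columns=["inv"]).join(df["inv"].apply(pd.Series))

# Step 4: the expansion exposed another dict column ("det"); expand it the same way.
flat = flat.drop(columns=["det"]).join(flat["det"].apply(pd.Series))

print(flat)
```

The result has one row per invoice with the columns ctin, inum, rt and txval. The function above does the same thing, except that it finds the list and dict columns itself and repeats the process while lists remain.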

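【Comments】:

Thank you very much. I can see the result is as expected. May I ask one more thing: if we have to import the JSON from a URL, what modifications would the code above need? Regards.

Just pass the JSON to the function; it accepts it either as a string or as a dict. So something like normalize(requests.get("http://someservice.local").json(), expand_all=True) will work.

【Answer 2】: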

To "flatten" a nested JSON file, you can use the following function:

def flatten_json(nested_json):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out

Assuming your parsed JSON is called myjson (and pandas is imported as pd):

df = pd.Series(flatten_json(myjson)).to_frame()
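
As a quick illustration of what this produces, here is a small self-contained sketch. The sample myjson is hypothetical, and the function is restated with isinstance and enumerate, which behaves the same way for plain dicts and lists. Note that to_frame() gives a single column with the flattened keys in the index; wrapping the dict in a list is one way to get a single row instead.

```python
import pandas as pd


def flatten_json(nested_json):
    """Flatten nested dicts/lists into one dict with underscore-joined keys."""
    out = {}

    def flatten(x, name=""):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, name + key + "_")
        elif isinstance(x, list):
            # list positions become part of the key: c_0_d, c_1_d, ...
            for i, item in enumerate(x):
                flatten(item, name + str(i) + "_")
        else:
            # leaf value: drop the trailing underscore from the key
            out[name[:-1]] = x

    flatten(nested_json)
    return out


# Hypothetical sample data
myjson = {"a": {"b": 1}, "c": [{"d": 2}, {"d": 3}]}

flat = flatten_json(myjson)
print(flat)  # {'a_b': 1, 'c_0_d': 2, 'c_1_d': 3}

as_column = pd.Series(flat).to_frame()  # 3 rows x 1 column, keys in the index
as_row = pd.DataFrame([flat])           # 1 row x 3 columns, keys as column names
```

Because list indices end up in the key names, this approach yields one wide record per JSON document rather than one row per list element, so it suits small or fixed-shape documents better than long lists of records.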
