Extracting information from nested JSON in Python
【Title】: Extract information from nested Json in Python 【Posted】: 2021-11-05 05:39:34 【Question】: I have a dataset that contains nested JSON objects. I want to extract information from this nested JSON and put it into a DataFrame in Python. I used the json_normalize method, but it stops parsing beyond a certain nesting level. Please help. Thank you.
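【Question comments】:
Can you elaborate? Can you provide a data sample? What should the DataFrame look like, and which fields of the JSON should appear in it?
【Answer 1】:
I have been developing a function that expands all embedded lists and dictionaries.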
```python
import json
from pathlib import Path

import pandas as pd

# read the raw JSON text from disk
with open(Path.home().joinpath("Downloads").joinpath("Sample Json.txt")) as f:
    js = f.read()


def normalize(js, expand_all=False):
    # accept either a JSON string or an already-parsed dict / list of dicts
    df = pd.json_normalize(json.loads(js) if isinstance(js, str) else js)
    # get first column that contains lists
    # (DataFrame.applymap is named DataFrame.map from pandas 2.1 onward)
    col = df.applymap(type).astype(str).eq("<class 'list'>").all().idxmax()
    # explode list and expand embedded dictionaries
    df = df.explode(col).reset_index(drop=True)
    df = df.drop(columns=[col]).join(df[col].apply(pd.Series), rsuffix=f".{col}")
    # any dictionary to expand?
    if df.applymap(type).astype(str).eq("<class 'dict'>").any().any():
        col = df.applymap(type).astype(str).eq("<class 'dict'>").all().idxmax()
        df = df.drop(columns=[col]).join(df[col].apply(pd.Series), rsuffix=f".{col}")
    # any lists left? flatten again until none remain
    while expand_all and df.applymap(type).astype(str).eq("<class 'list'>").any().any():
        df = normalize(df.to_dict("records"))
    return df


df = normalize(js, expand_all=True)
```
(index) | cfs | ctin | fldtr1 | cfs3b | flprdr1 | dtcancel | val | inv_typ | pos | idt | rchrg | inum | chksum | num | csamt | samt | rt | txval | camt | iamt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Y | 03AZX | 10-Aug-20 | Y | Jul-20 | nan | 2390 | R | 03 | 27-07-2020 | N | TI/20-21/111 | 24ea1a46933dd7c6f130cc7ddce3ad89f42194d84e358746f66716d0f1b8aef0 | 101 | 0 | 182.25 | 18 | 2025 | 182.25 | 0 |
1 | Y | 03AZY | 02-Sep-20 | Y | Jul-20 | nan | 10756 | R | 03 | 20-07-2020 | N | 70 | 164777293c8ce80595cd4803c3d0287bc544772fb9e5331602ed3d7d0534e82f | 1801 | 0 | 820.35 | 18 | 9115 | 820.35 | nan |
2 | Y | 03A00P1Z7 | 10-Aug-20 | Y | Jul-20 | nan | 411.82 | R | 03 | 01-07-2020 | N | 18IPB06013580804 | 0560d2b220de53f458ac65594f50bfa5ba736f95061c88201d91371fbeccabf8 | 1 | 0 | 31.41 | 18 | 349 | 31.41 | nan |
3 | Y | 03A00P1Z7 | 10-Aug-20 | Y | Jul-20 | nan | 411.82 | R | 03 | 01-07-2020 | N | 18IPB06013580805 | 08ae71bcb591723318796e797da586ef9b8e5b6b920e9877be6afc9223486760 | 1 | 0 | 31.41 | 18 | 349 | 31.41 | nan |
4 | Y | 03A00P1Z7 | 10-Aug-20 | Y | Jul-20 | nan | 383.5 | R | 03 | 01-07-2020 | N | 18IPB06013580806 | 4d22ddd1d05d22cc4707a89dd80e76a271b99a7ba2610e3b111489fd4f7950fc | 1 | 0 | 29.25 | 18 | 325 | 29.25 | nan |
5 | Y | 03A00P1Z7 | 10-Aug-20 | Y | Jul-20 | nan | 496.78 | R | 03 | 01-07-2020 | N | 18IPB06013580807 | 73e6e787493276151783d5ab1107bd0bac53780a5840964f7953bf3ba8a4efb0 | 1 | 0 | 37.89 | 18 | 421 | 37.89 | nan |
6 | Y | 03A00P1Z7 | 10-Aug-20 | Y | Jul-20 | nan | 411.82 | R | 03 | 21-07-2020 | N | 18IPB07013893564 | 52ef0e7269de052c0353580cad5092ff1cc7a3c454318b2df1041a62a32f033f | 1 | 0 | 31.41 | 18 | 349 | 31.41 | nan |
7 | Y | 03A00P1Z7 | 10-Aug-20 | Y | Jul-20 | nan | 411.82 | R | 03 | 21-07-2020 | N | 18IPB07013893565 | ab44c119f3db614dccfd3bc63c036eaca22a41c99e3e5090904e38aee056f4ac | 1 | 0 | 31.41 | 18 | 349 | 31.41 | nan |
8 | Y | 03CAZD | 10-Aug-20 | Y | Jul-20 | nan | 162840 | R | 03 | 13-07-2020 | N | T/20-21/56 | 92e52e48e812bb0bb2e34d9e400248730fdc40363459d05c4e9d6ebb7fe6165d | 101 | 0 | 12420 | 18 | 138000 | 12420 | 0 |
9 | Y | 03AAE | 22-Aug-20 | Y | Jul-20 | nan | 46556 | R | 03 | 30-07-2020 | N | S20/21-359 | 8138e35895114ae412e8256f3ce8382cdd8ae771f2780781085134618bb033c9 | 1801 | 0 | 3550.87 | 18 | 39454.2 | 3550.87 | 0 |
10 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 8417.98 | R | 03 | 02-07-2020 | N | 0000030301011976 | 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb | 1 | 0 | 0 | 0 | 1024.84 | 0 | nan |
11 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 8417.98 | R | 03 | 02-07-2020 | N | 0000030301011976 | 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb | 2 | 0 | 233.58 | 18 | 2595.37 | 233.58 | nan |
12 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 8417.98 | R | 03 | 02-07-2020 | N | 0000030301011976 | 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb | 3 | 0 | 89.34 | 5 | 3573.99 | 89.34 | nan |
13 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 8417.98 | R | 03 | 02-07-2020 | N | 0000030301011976 | 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb | 4 | 0 | 30.96 | 12 | 516.02 | 30.96 | nan |
14 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 2824.88 | R | 03 | 06-07-2020 | N | 0000030301012348 | 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 | 1 | 0 | 116.46 | 18 | 1293.94 | 116.46 | nan |
15 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 2824.88 | R | 03 | 06-07-2020 | N | 0000030301012348 | 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 | 2 | 0 | 37.27 | 12 | 621.18 | 37.27 | nan |
16 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 2824.88 | R | 03 | 06-07-2020 | N | 0000030301012348 | 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 | 3 | 0 | 0 | 0 | 85.26 | 0 | nan |
17 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 2824.88 | R | 03 | 06-07-2020 | N | 0000030301012348 | 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 | 4 | 0 | 12.31 | 5 | 492.42 | 12.31 | nan |
18 | Y | 03AA1ZQ | 17-Aug-20 | Y | Jul-20 | nan | 39294 | R | 03 | 02-07-2020 | N | TI/20-21/43 | 69f7931986ad9274d9595ca5221e3ce82aa389d659e83376ff1ec34571057670 | 101 | 0 | 2997 | 18 | 33300 | 2997 | 0 |
19 | Y | 03AGG3Z5 | 18-Aug-20 | Y | Jul-20 | 22-Jan-20 | 593583 | R | 03 | 31-07-2020 | N | 25 | 623dcb5b65e34be4d0453c1783915bb8e66684a2e33a3c8a547e38754c4f1af9 | 1 | 0 | 45273.3 | 18 | 503036 | 45273.3 | nan |
20 | Y | 03AGG3Z5 | 18-Aug-20 | Y | Jul-20 | 22-Jan-20 | 601409 | R | 03 | 31-07-2020 | N | 26 | ef8b99f99fe090f0a2374d8d6c0b15c265740e6c6487ff68d510382ec21d8ce4 | 1 | 0 | 45870.2 | 18 | 509668 | 45870.2 | nan |
21 | Y | 03AGG3Z5 | 18-Aug-20 | Y | Jul-20 | 22-Jan-20 | 767358 | R | 03 | 31-07-2020 | N | 27 | 9c1257eddeb8cdc7e6a832a3646969b71e49eeeb7d6742b26cfc6e0e3630438a | 1 | 0 | 58527.3 | 18 | 650303 | 58527.3 | nan |
22 | Y | 03AGG3Z5 | 18-Aug-20 | Y | Jul-20 | 22-Jan-20 | 597886 | R | 03 | 31-07-2020 | N | 28 | 29fc1b28aedd1545e7ea0fd8b67b8332a83f1ac3f62af9398af2dfa26c9f1d90 | 1 | 0 | 45601.4 | 18 | 506683 | 45601.4 | nan |
23 | Y | 03AA9 | 18-Aug-20 | Y | Jul-20 | nan | 41914 | R | 03 | 29-07-2020 | N | 2020-21/K-916 | d112ad384eb291d49509bdf4a005d509424fefee4caf3443bc9726cf41665295 | 1801 | 0 | 3196.8 | 18 | 35520 | 3196.8 | nan |
24 | Y | 03A1Z8 | 12-Aug-20 | Y | Jul-20 | nan | 274893 | R | 03 | 20-07-2020 | N | T/20-21/10 | e5851fcc6b370714d7523080582a678a212f5dde90f5c2618880376018221f38 | 101 | 0 | 20966.4 | 18 | 232960 | 20966.4 | 0 |
25 | Y | 03AD1ZL | 11-Aug-20 | Y | Jul-20 | nan | 125375 | R | 03 | 03-07-2020 | N | T/20-21/155 | 2bb398c7a0fedf11f1f1c1d196c43ad79910be52e6892f88915671025528eb2b | 101 | 0 | 9562.5 | 18 | 106250 | 9562.5 | 0 |
26 | Y | 03AA3Z9 | 14-Aug-20 | Y | Jul-20 | nan | 529.99 | R | 03 | 31-07-2020 | N | 0301072000000650 | ad1e1d1572c9058fabd6d23fb5dc4b68f1a2a10d3dd3d7e73d73d3c502d92151 | 1 | nan | 40.42 | 18 | 449.15 | 40.42 | nan |
27 | Y | 03AA3Z9 | 14-Aug-20 | Y | Jul-20 | nan | 1201 | R | 03 | 31-07-2020 | N | 0303072000000025 | 5a69229d907957c1d95eb464684891c202102b8589f5603b8ae14b07607f1655 | 1 | nan | 91.5 | 18 | 1018 | 91.5 | nan |
28 | Y | 03AB1ZV | 11-Aug-20 | Y | Jul-20 | nan | 30976 | R | 03 | 10-07-2020 | N | 70 | 69bbeb088634a88b30c6e6046b63b1977f5534b2f676b984ef78f2c3bad8ca35 | 1800 | nan | 2362.5 | 18 | 26250 | 2362.5 | nan |
29 | Y | 03AD1Z1 | 13-Aug-20 | Y | Jul-20 | nan | 8968 | R | 03 | 01-07-2020 | N | B25 | 5b98b819ca14a377c9304e7eab21957152c4819e82e37f2619fb2c547fb84ba6 | 1801 | 0 | 684 | 18 | 7600 | 684 | nan |
30 | Y | 03AAO | 10-Aug-20 | Y | Jul-20 | nan | 38940 | R | 03 | 13-07-2020 | N | TI/20-21/30 | bae339e580c2ab9ffee90533650e4e2acdc47310230ed54aabbb96f89d3fc7c4 | 101 | 0 | 2970 | 18 | 33000 | 2970 | 0 |
31 | Y | 07AH1ZU | 11-Aug-20 | Y | Jul-20 | nan | 13836.5 | R | 03 | 31-07-2020 | N | DELR/EXP/12176 | cb34f329adcd88c9e8794db9892fe47bd0a7afc0373a20860de046934f7923fa | 1 | 0 | nan | 18 | 11725.9 | nan | 2110.65 |
32 | Y | 03A1ZT | 18-Aug-20 | Y | Jul-20 | nan | 41820 | R | 03 | 07-07-2020 | N | TI/20-21/68 | ad61c4dd8227b214dbe4bba24b57a2c976ce8438e53cf15b3530480116ca64da | 101 | 0 | 3189.69 | 18 | 35441 | 3189.69 | 0 |
33 | Y | 03A1ZT | 18-Aug-20 | Y | Jul-20 | nan | 69773 | R | 03 | 10-07-2020 | N | TI/20-21/71 | 1deca4741b91716bfabc8b2ab826be76342b0fd3e698b128c927f4b426c064d0 | 101 | 0 | 5321.7 | 18 | 59130 | 5321.7 | 0 |
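When the nesting is known in advance, pandas' built-in `pd.json_normalize` with `record_path` and `meta` may be enough on its own. The following is only a minimal sketch under assumptions: the sample records and the three-level layout (supplier, then `inv`, then `itms`) are hypothetical and are not taken from the asker's file.

```python
import pandas as pd

# Hypothetical sample data: supplier -> invoices ("inv") -> line items ("itms")
sample = [
    {
        "ctin": "S1",
        "inv": [
            {
                "inum": "A-1",
                "itms": [
                    {"num": 1, "txval": 100.0},
                    {"num": 2, "txval": 50.0},
                ],
            },
            {"inum": "A-2", "itms": [{"num": 1, "txval": 20.0}]},
        ],
    }
]

# record_path walks down to the innermost list (one output row per line item);
# meta pulls values from the enclosing levels onto each row.
flat = pd.json_normalize(
    sample,
    record_path=["inv", "itms"],
    meta=["ctin", ["inv", "inum"]],
)
print(flat)
```

Each line item becomes one row, with the supplier and invoice fields repeated alongside it. The generic `normalize` function above is the better fit when the structure is not known ahead of time.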
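【Comments】:
Thank you very much. The result is as expected. May I ask one more thing: if we need to load the JSON from a URL, what changes to the code above are required? Regards.
Just pass the JSON to the function; it accepts either a string or a dict. So something like `normalize(requests.get("http://someservice.local").json(), expand_all=True)` would work.
【Answer 2】: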
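To "flatten" a nested JSON document, you can use the following function:
```python
def flatten_json(nested_json):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            # descend into each key, extending the prefix with the key name
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            # descend into each element, extending the prefix with its position
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            # scalar value: store it under the prefix, minus the trailing '_'
            out[name[:-1]] = x

    flatten(nested_json)
    return out
```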
Assuming your parsed JSON is stored in a variable called myjson:
```python
import pandas as pd

df = pd.Series(flatten_json(myjson)).to_frame()
```
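Note that `pd.Series(...).to_frame()` yields a single column with one row per flattened key. If the input is a list of records and one row per record is wanted instead, each record can be flattened separately. A minimal sketch, assuming made-up sample records; this `flatten_json` is a restated variant of the function above with a hypothetical `sep` parameter, included only so the example runs on its own:

```python
import pandas as pd


def flatten_json(nested_json, sep="_"):
    """Flatten nested dicts/lists into a single dict with joined key paths."""
    out = {}

    def flatten(x, name=""):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, f"{name}{key}{sep}")
        elif isinstance(x, list):
            for i, value in enumerate(x):
                flatten(value, f"{name}{i}{sep}")
        else:
            # strip the trailing separator from the accumulated path
            out[name[: -len(sep)]] = x

    flatten(nested_json)
    return out


# Hypothetical records, for illustration only
records = [
    {"id": 1, "tags": ["a", "b"], "owner": {"name": "x"}},
    {"id": 2, "tags": ["c"], "owner": {"name": "y"}},
]

# One flattened dict per record gives one DataFrame row per record;
# keys missing from a record are filled with NaN.
wide = pd.DataFrame([flatten_json(r) for r in records])
print(wide)
```

With this approach list elements become numbered columns (`tags_0`, `tags_1`, ...), so it suits short lists of fixed length better than long, variable-length ones.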