Normalize/Flatten a very deeply nested JSON (in which names and properties are the same across levels)
Posted: 2020-03-03 04:42:37
Question: I am trying to flatten or normalize this very deeply nested JSON into a dataframe using pandas.
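The problem is that the names and properties are the same at every level.
I have not found any similar question for pandas. I did find 2 similar questions, but they are for R and JavaScript: Normalize deeply nested objects and Normalize deeply nested objects. Maybe they can provide some inspiration.
My original file is 40M, so here is a sample of it: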
data = [{
    "id": "haha",
    "type": "table",
    "composition": [{
        "id": "AO",
        "type": "basket",
    }, {
        "id": "KK",
        "type": "basket",
        # "isAutoDiv": false,
        "composition": [{
            "id": "600",
            "type": "apple",
            "num": 1.116066714
        }, {
            "id": "605",
            "type": "apple",
            "num": 1.1166976714
        }]
    }]
}, {
    "id": "hoho",
    "type": "table",
    "composition": [{
        "id": "KT",
        "type": "basket"
    }, {
        "id": "OT",
        "type": "basket"
    }, {
        "id": "CL",
        "type": "basket",
        # "isAutoDiv": false,
        "composition": [{
            "id": "450",
            "type": "apple"
        }, {
            "id": "630",
            "type": "apple"
        }, {
            "id": "023",
            "type": "index",
            "composition": [{
                "id": "AAOAAOAOO",
                "type": "applejuice"
            }, {
                "id": "MMNMMNNM",
                "type": "applejuice"
            },
            ]
        }]
    }]
}]
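Do you see? The names and properties are the same at every level.
I used this line to normalize it, but I do not know how to normalize the nested objects when they have the same names and properties: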
df = json_normalize(data, record_path = ['composition'], meta = ['id', 'type'], record_prefix = 'compo_')
                                   compo_composition compo_id compo_type    id   type
0                                                NaN       AO     basket  haha  table
1  [{'id': '600', 'type': 'apple', 'num': 1.11606...       KK     basket  haha  table
2                                                NaN       KT     basket  hoho  table
3                                                NaN       OT     basket  hoho  table
4  [{'id': '450', 'type': 'apple'}, {'id': '630',...       CL     basket  hoho  table
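You can see that the "compo_composition" column still contains nested objects.
Now I would like it to have these columns:
compo_compo_compo__id compo_compo_compo_type compo_compo__id compo_compo_type compo_id compo_type id type
Thank you very much. This has frustrated me for days, and I have not found an answer anywhere.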
Answer 1: You have to write a custom parser. This assumes (a) your JSON is very deep, and (b) every element on the path is unique (à la table > basket > index, not table > table > basket).
# Make a copy so we do not change the original data
tmp = data.copy()
compositions = []
while len(tmp) > 0:
    item = tmp.pop(0)
    if 'composition' in item:
        # If a level has children, add that level's `id`
        # to the path and process its children
        path = item.get('path', {})
        path[item['type'] + '_id'] = item['id']
        children = [
            {'path': path, **child} for child in item.get('composition', [])
        ]
        tmp += children
    else:
        # If a level has no child, we are done
        compositions += [item]
The final dataframe:
df = pd.DataFrame([c['path'] for c in compositions]) \
.join(pd.DataFrame(compositions)) \
.drop(columns='path')
Result:
  table_id basket_id index_id         id        type       num
0     haha        KK      NaN         AO      basket       NaN
1     hoho        CL      023         KT      basket       NaN
2     hoho        CL      023         OT      basket       NaN
3     haha        KK      NaN        600       apple  1.116067
4     haha        KK      NaN        605       apple  1.116698
5     hoho        CL      023        450       apple       NaN
6     hoho        CL      023        630       apple       NaN
7     hoho        CL      023  AAOAAOAOO  applejuice       NaN
8     hoho        CL      023   MMNMMNNM  applejuice       NaN
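One thing worth noticing in the result: AO, KT and OT are listed under a basket_id (KK or CL), although in the sample data they are siblings of those baskets, not their children. This appears to happen because all children of one item receive the same `path` dict object, so an id written later for one sibling shows up on the others. Below is a minimal sketch of a variant that builds a new path dict per level instead. It assumes the goal is for each leaf row to carry only the ids of its own ancestors; the function name `flatten_leaves` and the trimmed `sample` are made up for illustration and are not part of the answer.

```python
from collections import deque

import pandas as pd


def flatten_leaves(data):
    """Return one row per leaf item, tagged with the ids of its own ancestors.

    Hypothetical helper: same breadth-first walk as the parser above, but the
    path is carried next to the item and never mutated in place.
    """
    # Each queue entry is (ancestor ids so far, item).
    queue = deque(({}, item) for item in data)
    rows = []
    while queue:
        path, item = queue.popleft()
        if 'composition' in item:
            # Build a new dict, so siblings never see each other's ids.
            child_path = {**path, item['type'] + '_id': item['id']}
            queue.extend((child_path, child) for child in item['composition'])
        else:
            # A leaf: combine the ancestor ids with the leaf's own fields.
            rows.append({**path, **item})
    return pd.DataFrame(rows)


# Trimmed copy of the sample data from the question.
sample = [{
    "id": "haha",
    "type": "table",
    "composition": [{
        "id": "AO",
        "type": "basket",
    }, {
        "id": "KK",
        "type": "basket",
        "composition": [
            {"id": "600", "type": "apple", "num": 1.116066714},
            {"id": "605", "type": "apple", "num": 1.1166976714},
        ],
    }],
}]

df = flatten_leaves(sample)
print(df)
```

With this variant AO should end up with an empty basket_id, while 600 and 605 keep basket_id KK. Assumption (b) from the answer still applies: if a type repeats along one path (table > table > basket), the later id overwrites the earlier one under the same column name.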