Convert a 4-level nested JSON file to 1-level nesting with Python
【Title】Convert 4 level nested JSON file to 1 level nested with Python 【Posted】2019-10-13 15:08:28
【Question】I have the 4-level nested JSON file below, and I want to normalize it to a single level of nesting.
The input file looks like this:
{
  "@index": "40",
  "row": [
    {
      "column": [
        {
          "text": {
            "@fontName": "Times New Roman",
            "@fontSize": "12.0",
            "@x": "85.10",
            "@y": "663.12",
            "@width": "250.01",
            "@height": "12.00",
            "#text": "text 1"
          }
        }
      ]
    },
    {
      "column": [
        {
          "text": {
            "@fontName": "Times New Roman",
            "@fontSize": "8.0",
            "@x": "121.10",
            "@y": "675.36",
            "@width": "348.98",
            "@height": "8.04",
            "#text": "text 2"
          }
        },
        {
          "text": {
            "@fontName": "Times New Roman",
            "@fontSize": "12.0",
            "@x": "473.30",
            "@y": "676.92",
            "@width": "42.47",
            "@height": "12.00",
            "#text": "text 3"
          }
        }
      ]
    },
    {
      "column": [
        {
          "text": {
            "@fontName": "Times New Roman",
            "@fontSize": "12.0",
            "@x": "85.10",
            "@y": "690.72",
            "@width": "433.61",
            "@height": "12.00",
            "#text": "text 4"
          }
        }
      ]
    }
  ]
}
The desired output looks like this:
{
  "@index": "40",
  "row": [
    {
      "@fontName": "Times New Roman",
      "@fontSize": "12.0",
      "@x": "85.10",
      "@y": "663.12",
      "@width": "250.01",
      "@height": "12.00",
      "#text": "Text 1"
    },
    {
      "@fontName": "Times New Roman",
      "@fontSize": "8.0",
      "@x": "121.10",
      "@y": "675.36",
      "@width": "348.98",
      "@height": "8.04",
      "#text": "Text 2"
    },
    {
      "@fontName": "Times New Roman",
      "@fontSize": "12.0",
      "@x": "473.30",
      "@y": "676.92",
      "@width": "42.47",
      "@height": "12.00",
      "#text": "Text 3"
    },
    {
      "@fontName": "Times New Roman",
      "@fontSize": "12.0",
      "@x": "85.10",
      "@y": "690.72",
      "@width": "433.61",
      "@height": "12.00",
      "#text": "Text 4"
    }
  ]
}
This is my code so far, using pandas, but I don't know how to go on and normalize down to one level.
import json
import pandas as pd  # pd.json_normalize (pandas >= 1.0) flattens JSON into a DataFrame

# Load the JSON object; the raw string keeps "\4" from being read as an escape sequence
with open(r'D:\Files\JSON\4Level.json') as f:
    d = json.load(f)

nycphil = pd.json_normalize(d['row'])
print(nycphil.head(4))
This is the current output, which shows that column is still a nested element:
                                              column
0  [{'text': {'@fontName': 'Times New Roman', '@f...
1  [{'text': {'@fontName': 'Times New Roman', '@f...
2  [{'text': {'@fontName': 'Times New Roman', '@f...
The printout with one level of nesting would be:
text.#text text.@fontName text.@fontSize ... text.@width text.@x text.@y
0 Text 1 Times New Roman 12.0 ... 250.01 85.10 663.12
1 Text 2 Times New Roman 8.0 ... 348.98 121.10 675.36
2 Text 3 Times New Roman 12.0 ... 42.47 473.30 676.92
3 Text 4 Times New Roman 12.0 ... 433.61 85.10 690.72
A comparison of input and output is shown below:
Maybe someone can help me with this. Thanks for any help.
Update
To keep the first sample input small, I removed some elements that apparently matter for whether your scripts work. So now I am showing exactly the same structure as the real file, but with this input your scripts do not work. I think they need some adjustment; I have been trying, but I can't figure out how to change them to get the same output from this new input. Maybe you can help me. Sorry for not showing the correct input from the start.
{
  "document": {
    "page": [
      {
        "@index":"0",
        "image": {
          "@data":"ABC",
          "@format":"png",
          "@height":"620.00",
          "@type":"base64encoded",
          "@width":"450.00",
          "@x":"85.00",
          "@y":"85.00"
        }
      },
      {
        "@index":"1",
        "row": [
          {
            "column": [
              {
                "text":""
              },
              {
                "text": {
                  "#text":"Text1",
                  "@fontName":"Arial",
                  "@fontSize":"12.0",
                  "@height":"12.00",
                  "@width":"71.04",
                  "@x":"121.10",
                  "@y":"83.42"
                }
              }
            ]
          },
          {
            "column": [
              {
                "text":""
              },
              {
                "text": {
                  "#text":"Text2",
                  "@fontName":"Arial",
                  "@fontSize":"12.0",
                  "@height":"12.00",
                  "@width":"101.07",
                  "@x":"121.10",
                  "@y":"124.82"
                }
              }
            ]
          }
        ]
      },
      {
        "@index":"2",
        "row": [
          {
            "column": {
              "text": {
                "#text":"Text3",
                "@fontName":"Arial",
                "@fontSize":"12.0",
                "@height":"12.00",
                "@width":"363.44",
                "@x":"85.10",
                "@y":"69.62"
              }
            }
          },
          {
            "column": {
              "text": {
                "#text":"Text4",
                "@fontName":"Arial",
                "@fontSize":"12.0",
                "@height":"12.00",
                "@width":"382.36",
                "@x":"85.10",
                "@y":"83.42"
              }
            }
          },
          {
            "column": {
              "text": {
                "#text":"Text5",
                "@fontName":"Arial",
                "@fontSize":"12.0",
                "@height":"12.00",
                "@width":"435.05",
                "@x":"85.10",
                "@y":"97.22"
              }
            }
          }
        ]
      },
      {
        "@index":"3"
      }
    ]
  }
}
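【Comments】
Check out flatten_json() from here. I have tested it and it works.
Possible duplicate of Python flatten multilevel JSON
【Answer 1】As an alternative to json_normalize(), you can also use a comprehension:
my_dict["row"] = [{k: v for k, v in col_entry["text"].items()} for entry in my_dict["row"] for col_entry in entry["column"]]
Edit: fixed the code to cover multiple entries in each column list. Admittedly, this gets close to the pain threshold as far as nesting comprehensions goes...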
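The comprehension above assumes the JSON has already been parsed into my_dict. As a minimal sketch of that step: json.loads() parses a string that is already in memory, while json.load() reads from an open file object. The shortened sample data and the file name sample.json are placeholders chosen for illustration, not the asker's real file.

```python
import json
import os
import tempfile

# Hypothetical, shortened sample with the same shape as the question's input.
sample = """
{"@index": "40",
 "row": [
   {"column": [{"text": {"@x": "85.10", "#text": "text 1"}}]},
   {"column": [{"text": {"@x": "121.10", "#text": "text 2"}},
               {"text": {"@x": "473.30", "#text": "text 3"}}]}
 ]}
"""

# json.loads() parses a JSON string.
my_dict = json.loads(sample)

# json.load() reads from an open file object instead.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")  # placeholder file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    with open(path, encoding="utf-8") as f:
        from_file = json.load(f)

# Same comprehension as in the answer: one dict per "text" entry, in document order.
my_dict["row"] = [
    {k: v for k, v in col_entry["text"].items()}
    for entry in my_dict["row"]
    for col_entry in entry["column"]
]
print(json.dumps(my_dict, indent=4))
```

Both loading routes produce the same dict, so the comprehension works the same way on either result.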
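【Comments】
Hi John. I am new to Python. In this case, do I need to load the JSON file into a dictionary? If so, how do I load the file into a dictionary? Thanks
json.loads(). You can find more information here: docs.python.org/3/library/json.html
This will only extract the first element of each "column" list. You will get 3 font sections in the output instead of 4.
Ah yes, sorry, I missed the two at the same level. Will fix the code when I get home.
Hi JohnO, could you please look at the new input below my update? I haven't been able to modify your script to make it work with input that has the same structure as the real file. Thanks
【Answer 2】You can use a list comprehension: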
import json

d = {'@index': '40', 'row': [
    {'column': [{'text': {'@fontName': 'Times New Roman', '@fontSize': '12.0', '@x': '85.10', '@y': '663.12', '@width': '250.01', '@height': '12.00', '#text': 'text 1'}}]},
    {'column': [{'text': {'@fontName': 'Times New Roman', '@fontSize': '8.0', '@x': '121.10', '@y': '675.36', '@width': '348.98', '@height': '8.04', '#text': 'text 2'}},
                {'text': {'@fontName': 'Times New Roman', '@fontSize': '12.0', '@x': '473.30', '@y': '676.92', '@width': '42.47', '@height': '12.00', '#text': 'text 3'}}]},
    {'column': [{'text': {'@fontName': 'Times New Roman', '@fontSize': '12.0', '@x': '85.10', '@y': '690.72', '@width': '433.61', '@height': '12.00', '#text': 'text 4'}}]},
]}
# Keep the top-level keys and replace 'row' with the flat list of 'text' dicts
new_d = {**d, 'row': [c['text'] for b in d['row'] for c in b['column']]}
print(json.dumps(new_d, indent=4))
Output:
{
    "@index": "40",
    "row": [
        {
            "@fontName": "Times New Roman",
            "@fontSize": "12.0",
            "@x": "85.10",
            "@y": "663.12",
            "@width": "250.01",
            "@height": "12.00",
            "#text": "text 1"
        },
        {
            "@fontName": "Times New Roman",
            "@fontSize": "8.0",
            "@x": "121.10",
            "@y": "675.36",
            "@width": "348.98",
            "@height": "8.04",
            "#text": "text 2"
        },
        {
            "@fontName": "Times New Roman",
            "@fontSize": "12.0",
            "@x": "473.30",
            "@y": "676.92",
            "@width": "42.47",
            "@height": "12.00",
            "#text": "text 3"
        },
        {
            "@fontName": "Times New Roman",
            "@fontSize": "12.0",
            "@x": "85.10",
            "@y": "690.72",
            "@width": "433.61",
            "@height": "12.00",
            "#text": "text 4"
        }
    ]
}
Edit: to flatten the more deeply nested structure, you can use recursion with a generator:
def flatten(d, t=("image", "text")):
    # Walk the structure and yield the value of every key listed in t
    for a, b in d.items():
        if a in t:
            yield b
        elif isinstance(b, dict):
            yield from flatten(b, t)
        elif isinstance(b, list):
            for i in b:
                yield from flatten(i, t)

d = {'document': {'page': [
    {'@index': '0', 'image': {'@data': 'ABC', '@format': 'png', '@height': '620.00', '@type': 'base64encoded', '@width': '450.00', '@x': '85.00', '@y': '85.00'}},
    {'@index': '1', 'row': [
        {'column': [{'text': ''}, {'text': {'#text': 'Text1', '@fontName': 'Arial', '@fontSize': '12.0', '@height': '12.00', '@width': '71.04', '@x': '121.10', '@y': '83.42'}}]},
        {'column': [{'text': ''}, {'text': {'#text': 'Text2', '@fontName': 'Arial', '@fontSize': '12.0', '@height': '12.00', '@width': '101.07', '@x': '121.10', '@y': '124.82'}}]}]},
    {'@index': '2', 'row': [
        {'column': {'text': {'#text': 'Text3', '@fontName': 'Arial', '@fontSize': '12.0', '@height': '12.00', '@width': '363.44', '@x': '85.10', '@y': '69.62'}}},
        {'column': {'text': {'#text': 'Text4', '@fontName': 'Arial', '@fontSize': '12.0', '@height': '12.00', '@width': '382.36', '@x': '85.10', '@y': '83.42'}}},
        {'column': {'text': {'#text': 'Text5', '@fontName': 'Arial', '@fontSize': '12.0', '@height': '12.00', '@width': '435.05', '@x': '85.10', '@y': '97.22'}}}]},
    {'@index': '3'},
]}}
# filter(None, ...) drops the empty 'text': '' entries
print(json.dumps(list(filter(None, flatten(d))), indent=4))
Output:
[
    {
        "@data": "ABC",
        "@format": "png",
        "@height": "620.00",
        "@type": "base64encoded",
        "@width": "450.00",
        "@x": "85.00",
        "@y": "85.00"
    },
    {
        "#text": "Text1",
        "@fontName": "Arial",
        "@fontSize": "12.0",
        "@height": "12.00",
        "@width": "71.04",
        "@x": "121.10",
        "@y": "83.42"
    },
    {
        "#text": "Text2",
        "@fontName": "Arial",
        "@fontSize": "12.0",
        "@height": "12.00",
        "@width": "101.07",
        "@x": "121.10",
        "@y": "124.82"
    },
    {
        "#text": "Text3",
        "@fontName": "Arial",
        "@fontSize": "12.0",
        "@height": "12.00",
        "@width": "363.44",
        "@x": "85.10",
        "@y": "69.62"
    },
    {
        "#text": "Text4",
        "@fontName": "Arial",
        "@fontSize": "12.0",
        "@height": "12.00",
        "@width": "382.36",
        "@x": "85.10",
        "@y": "83.42"
    },
    {
        "#text": "Text5",
        "@fontName": "Arial",
        "@fontSize": "12.0",
        "@height": "12.00",
        "@width": "435.05",
        "@x": "85.10",
        "@y": "97.22"
    }
]
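The generator returns one flat list and drops the page boundaries. If the goal is instead to keep one entry per page, as in the first desired output, a sketch along the following lines might work. It is only a sketch under assumptions: the thread never states the desired output for the updated input, the helper names as_list and flatten_page are hypothetical, and the sample below is a shortened stand-in for the real file. It handles the two irregularities visible in the update: "column" can be a single dict or a list, and "text" can be an empty string.

```python
import json


def as_list(value):
    """Wrap a single item in a list so dicts and lists can be handled alike."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def flatten_page(page):
    """Return a copy of one page with 'row' reduced to a flat list of text dicts."""
    flat = {k: v for k, v in page.items() if k != "row"}
    if "row" in page:
        flat["row"] = [
            cell["text"]
            for row in as_list(page["row"])
            for cell in as_list(row.get("column"))
            if isinstance(cell.get("text"), dict)  # skip empty "text": "" cells
        ]
    return flat


# Shortened, hypothetical sample with the same structure as the updated input.
doc = {"document": {"page": [
    {"@index": "0", "image": {"@format": "png"}},
    {"@index": "1", "row": [
        {"column": [{"text": ""}, {"text": {"#text": "Text1"}}]},
        {"column": [{"text": ""}, {"text": {"#text": "Text2"}}]},
    ]},
    {"@index": "2", "row": [
        {"column": {"text": {"#text": "Text3"}}},
    ]},
    {"@index": "3"},
]}}

pages = [flatten_page(p) for p in doc["document"]["page"]]
print(json.dumps(pages, indent=4))
```

Pages without a "row" key, such as the image page, pass through unchanged, so every page keeps its "@index".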
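【Comments】
Thanks for your help. It works for me in Python 3.
@GerCas Glad to help!
Hi Ajax1234, could you please look at the new input below my update? I haven't been able to modify your script to make it work with input that has the same structure as the real file. Thanks
@GerCas No problem. What output do you expect from the new sample? Is it still the data associated with row?
@GerCas Yes, of course, you can use pd.DataFrame(result).
【Answer 3】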
Here is working code:
(56336255.json is the sample data you posted)
import json
import pprint

flat_data = dict()
with open('56336255.json') as f:
    data = json.load(f)
for k, v in data.items():
    if k == '@index':
        flat_data[k] = data[k]
    else:
        # 'row': collect the 'text' dict of every column cell into one flat list
        flat_data[k] = []
        for row in v:
            for cell in row['column']:
                flat_data[k].append(cell['text'])
pprint.pprint(flat_data)
Output
{'@index': '40',
 'row': [{'#text': 'text 1',
          '@fontName': 'Times New Roman',
          '@fontSize': '12.0',
          '@height': '12.00',
          '@width': '250.01',
          '@x': '85.10',
          '@y': '663.12'},
         {'#text': 'text 2',
          '@fontName': 'Times New Roman',
          '@fontSize': '8.0',
          '@height': '8.04',
          '@width': '348.98',
          '@x': '121.10',
          '@y': '675.36'},
         {'#text': 'text 3',
          '@fontName': 'Times New Roman',
          '@fontSize': '12.0',
          '@height': '12.00',
          '@width': '42.47',
          '@x': '473.30',
          '@y': '676.92'},
         {'#text': 'text 4',
          '@fontName': 'Times New Roman',
          '@fontSize': '12.0',
          '@height': '12.00',
          '@width': '433.61',
          '@x': '85.10',
          '@y': '690.72'}]}
【Comments】
【Answer 4】This will do it:
data = json.load(json_file)
flat = [column['text'] for entry in data['row'] for column in entry['column']]
Complete working example:
import json
import sys
import os.path

def main(argv):
    # Load JSON from input.json located next to this script
    current_folder = os.path.dirname(os.path.realpath(__file__))
    with open(os.path.join(current_folder, 'input.json')) as json_file:
        data = json.load(json_file)

    # Flatten (using for loops)
    flat = []
    for entry in data['row']:
        for column in entry['column']:
            flat.append(column['text'])

    # OR, flatten the pythonic way (using a list comprehension)
    # looks strange at first but notice
    # 1. we start with the item we want to keep in the list
    # 2. the loop order is the same, we just write them inline
    flat2 = [column['text'] for entry in data['row'] for column in entry['column']]

    # Format data for saving to JSON
    output = {}
    output['@index'] = data['@index']
    output['row'] = flat  # or flat2

    # Save to JSON
    with open('flat.txt', 'w') as outfile:
        json.dump(output, outfile, indent=4)

if __name__ == "__main__":
    main(sys.argv[1:])
【Comments】
【Answer 5】Try this:
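#!/usr/bin/python
# -*- coding: utf-8 -*-
def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            # Descend into each key, extending the name prefix
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            # Descend into each element, using its position as the name part
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            # Leaf value: store it under the accumulated name, minus the trailing '_'
            out[name[:-1]] = x

    flatten(y)
    return out

expected_output = flatten_json(input_data)  # flattens the parsed JSON into a single-level dict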
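【Comments】
@GerCas Please check now and confirm.
Same problem. I have tried both directly in the Python console and with a script, python script.py; in that case I only get the output {'': 'input.json'}
@GerCas I have tested it again on my machine with your input_data, and it works.
I'm not sure what is happening. I replaced input_data with 'input.json' and got that printout. Also, maybe you can take a look at my update.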
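The {'': 'input.json'} printout is what flatten_json() returns when it receives the file name as a string rather than the parsed JSON: a string is neither a dict nor a list, so it is stored as a single leaf under an empty key. A minimal sketch of the difference follows; the one-cell sample is a hypothetical stand-in for the real data, and in practice the data would come from json.load() on the opened file.

```python
import json


def flatten_json(y):
    """Same logic as the answer above, repeated so the sketch is self-contained."""
    out = {}

    def flatten(x, name=""):
        if isinstance(x, dict):
            for key in x:
                flatten(x[key], name + key + "_")
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, name + str(i) + "_")
        else:
            out[name[:-1]] = x

    flatten(y)
    return out


# Passing the file name: the string itself is treated as the only leaf value.
from_name = flatten_json("input.json")
print(from_name)  # {'': 'input.json'}

# Passing parsed data: every leaf gets a key built from its path.
input_data = json.loads('{"row": [{"column": [{"text": {"#text": "text 1"}}]}]}')
from_data = flatten_json(input_data)
print(from_data)  # {'row_0_column_0_text_#text': 'text 1'}
```

Note that this approach yields a single dict with path-like keys, which differs from the list-of-rows output requested in the question.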