How to build a JSON file with nested records from a flat data table?

【Posted】: 2016-10-09 08:39:46 【Question】:

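I'm looking for a Python technique to build a nested JSON file from a flat table held in a pandas DataFrame. For example, how could a pandas DataFrame table such as: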

  teamname  member firstname lastname  orgname         phone        mobile
0        1       0      John      Doe     Anon  916-555-1234
1        1       1      Jane      Doe     Anon  916-555-4321  916-555-7890
2        2       0    Mickey    Moose  Moosers  916-555-0000  916-555-1111
3        2       1     Minny    Moose  Moosers  916-555-2222

be taken and exported to JSON that looks like this:

{
"teams": [
{
"teamname": "1",
"members": [
  {
    "firstname": "John",
    "lastname": "Doe",
    "orgname": "Anon",
    "phone": "916-555-1234",
    "mobile": ""
  },
  {
    "firstname": "Jane",
    "lastname": "Doe",
    "orgname": "Anon",
    "phone": "916-555-4321",
    "mobile": "916-555-7890"
  }
]
},
{
"teamname": "2",
"members": [
  {
    "firstname": "Mickey",
    "lastname": "Moose",
    "orgname": "Moosers",
    "phone": "916-555-0000",
    "mobile": "916-555-1111"
  },
  {
    "firstname": "Minny",
    "lastname": "Moose",
    "orgname": "Moosers",
    "phone": "916-555-2222",
    "mobile": ""
  }
]
}
]
}

I have tried doing this by creating a dict of dicts and dumping it to JSON. This is my current code:

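# Assumes pandas and json are imported, and that inputExcel and columnList
# (the list of column headings) are defined earlier in the script.
# Current pandas spells the keyword sheet_name (it was sheetname before 0.21);
# the original encoding='utf8' argument is omitted because read_excel no longer accepts it.
data = pandas.read_excel(inputExcel, sheet_name='SCAT Teams')
memberDictTuple = []

for index, row in data.iterrows():
    # Member fields: every column after the team name and the member id
    rowDict = dict(zip(columnList[2:], row.iloc[2:]))

    teamRowDict = {columnList[0]: int(row.iloc[0])}  # built but never used

    memberId = int(row.iloc[1])
    teamName = int(row.iloc[0])

    memberDict1 = {memberId: rowDict}
    memberDict2 = {teamName: memberDict1}

    # Each row is appended as its own dict, so rows of the same team are never merged
    memberDictTuple.append(memberDict2)

memberDictTuple = tuple(memberDictTuple)
formattedJson = json.dumps(memberDictTuple, indent=4, sort_keys=True)
print(formattedJson)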

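This produces the following output. Each item is nested at the correct level under "teamname" 1 or 2, but records that share a team name should be nested together. How can I fix this so that teamname 1 and teamname 2 each have their 2 records nested inside them?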

[
    {
        "1": {
            "0": {
                "email": "john.doe@wildlife.net",
                "firstname": "John",
                "lastname": "Doe",
                "mobile": "none",
                "orgname": "Anon",
                "phone": "916-555-1234"
            }
        }
    },
    {
        "1": {
            "1": {
                "email": "jane.doe@wildlife.net",
                "firstname": "Jane",
                "lastname": "Doe",
                "mobile": "916-555-7890",
                "orgname": "Anon",
                "phone": "916-555-4321"
            }
        }
    },
    {
        "2": {
            "0": {
                "email": "mickey.moose@wildlife.net",
                "firstname": "Mickey",
                "lastname": "Moose",
                "mobile": "916-555-1111",
                "orgname": "Moosers",
                "phone": "916-555-0000"
            }
        }
    },
    {
        "2": {
            "1": {
                "email": "minny.moose@wildlife.net",
                "firstname": "Minny",
                "lastname": "Moose",
                "mobile": "none",
                "orgname": "Moosers",
                "phone": "916-555-2222"
            }
        }
    }
]

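【Comments】:

Unfortunately, questions about whether a high-level approach to a problem is good/correct/possible/etc. are off-topic here. That said, I think the dict-of-dicts approach does look promising. You should use separate questions to work out the remaining details, but remember to update both the error message you receive and the code you are using so that they stay in sync (otherwise your problem is not reproducible).

I also tried adapting this answer: ***.com/questions/24374062/…, but still no dice.

【Answer 1】: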

Based on some input from @root, I took a different tack and came up with the following code, which seems to get most of the way there:

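import json
from collections import defaultdict

import pandas

inputExcel = 'E:\\teamsMM.xlsx'
exportJson = 'E:\\teamsMM.json'

# Current pandas spells the keyword sheet_name (it was sheetname before 0.21);
# the original encoding='utf8' argument is omitted because read_excel no longer accepts it.
data = pandas.read_excel(inputExcel, sheet_name='SCAT Teams')

# One row per (teamname, members) pair; both columns move into a MultiIndex
grouped = data.groupby(['teamname', 'members']).first()

results = defaultdict(lambda: defaultdict(dict))

for t in grouped.itertuples():
    # t.Index is the (teamname, members) pair: descend one level per key
    for i, key in enumerate(t.Index):
        if i == 0:
            nested = results[key]
        elif i == len(t.Index) - 1:
            nested[key] = t  # stores the whole row tuple, index pair included
        else:
            nested = nested[key]

formattedJson = json.dumps(results, indent=4)

# Wrap the dump in an outer object that holds a "teams" array
formattedJson = '{\n"teams": [\n' + formattedJson + '\n]\n }'

with open(exportJson, "w") as parsed:
    parsed.write(formattedJson)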

The resulting JSON file looks like this:

{
"teams": [
{
    "1": {
        "0": [
            [
                1,
                0
            ],
            "John",
            "Doe",
            "Anon",
            "916-555-1234",
            "none",
            "john.doe@wildlife.net"
        ],
        "1": [
            [
                1,
                1
            ],
            "Jane",
            "Doe",
            "Anon",
            "916-555-4321",
            "916-555-7890",
            "jane.doe@wildlife.net"
        ]
    },
    "2": {
        "0": [
            [
                2,
                0
            ],
            "Mickey",
            "Moose",
            "Moosers",
            "916-555-0000",
            "916-555-1111",
            "mickey.moose@wildlife.net"
        ],
        "1": [
            [
                2,
                1
            ],
            "Minny",
            "Moose",
            "Moosers",
            "916-555-2222",
            "none",
            "minny.moose@wildlife.net"
        ]
    }
}
]
}

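This format is very close to the desired end product. The remaining issues are: removing the redundant array [1, 0] that appears above each first name, and making the heading of each nest {"teamname": "1", "members": rather than {"1": {"0":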

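Also, I don't know why each record is stripped of its headings during the conversion. For example, why is the dictionary entry "firstname": "John" exported as just "John"?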
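A likely explanation for the lost headings, offered as a sketch rather than a confirmed diagnosis: itertuples() yields namedtuples, and json.dumps serializes any tuple as a plain JSON array, so the field names never reach the output. The Row type and its values below are hypothetical stand-ins for one row of the grouped dataframe.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for one row produced by DataFrame.itertuples()
Row = namedtuple("Row", ["firstname", "lastname", "orgname"])
row = Row("John", "Doe", "Anon")

# A namedtuple is still a tuple, so json.dumps writes a plain array
as_array = json.dumps(row)

# _asdict() pairs each field name with its value before serializing
as_object = json.dumps(row._asdict())

print(as_array)
print(as_object)
```

Storing t._asdict() instead of t in the loop would keep the headings, although the Index pair would still need to be dropped separately.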

【Comments】:

Note that I had to upgrade from pandas 0.16.1 to 0.18.1 for this code to work.

【Answer 2】:

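Here is a working solution that creates the desired JSON format. First, I grouped the dataframe by the appropriate columns. Then, instead of creating a dictionary for each column heading/record pair (and losing the data order), I created them as a list of tuples and converted the list to an OrderedDict. Another OrderedDict holds the two columns that everything else is grouped by. This precise layering of lists and ordered dicts is necessary for the JSON conversion to produce the right format. Also note that sort_keys must be set to False when dumping to JSON, or all of your OrderedDicts will be rearranged into alphabetical order.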
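The layering described above can be sketched without pandas. This is a minimal illustration under stated assumptions, not the answer's own code: the rows list is hypothetical sample data (roughly what data.to_dict('records') might return), and nest_by_team is a hypothetical helper name.

```python
import json
from collections import OrderedDict

# Hypothetical flat records, one dict per table row
rows = [
    {"teamname": 1, "members": 0, "firstname": "John", "lastname": "Doe"},
    {"teamname": 1, "members": 1, "firstname": "Jane", "lastname": "Doe"},
    {"teamname": 2, "members": 0, "firstname": "Mickey", "lastname": "Moose"},
]


def nest_by_team(records):
    """Collect the records of each team into one 'members' list."""
    teams = OrderedDict()  # team name -> list of member dicts, first-seen order
    for record in records:
        # Keep every field except the two grouping columns, in column order
        member = OrderedDict(
            (key, value) for key, value in record.items()
            if key not in ("teamname", "members")
        )
        teams.setdefault(str(record["teamname"]), []).append(member)
    return {
        "teams": [
            OrderedDict([("teamname", name), ("members", members)])
            for name, members in teams.items()
        ]
    }


nested = nest_by_team(rows)
# sort_keys stays False so "teamname" is written before "members"
print(json.dumps(nested, indent=1, sort_keys=False))
```

Keying the intermediate mapping by team name is what merges rows of the same team, which is the step the code in the question was missing.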

import json
from collections import OrderedDict

import pandas

inputExcel = 'E:\\teams.xlsx'
exportJson = 'E:\\teams.json'

# Current pandas spells the keyword sheet_name (it was sheetname before 0.21);
# the original encoding='utf8' argument is omitted because read_excel no longer accepts it.
data = pandas.read_excel(inputExcel, sheet_name='SCAT Teams')

# Tuple of column headings, used later to pair each heading with its column data
columnList = tuple(str(col) for col in data.columns)

# Group the dataframe by the 'teamname' and 'members' columns
grouped = data.groupby(['teamname', 'members']).first()

# Unique team names, taken from the first level of the grouped index
tm = grouped.index.levels[0]

# List that collects one record per team
teamsList = []

for teamN in tm:
    teamN = int(teamN)  # prevents "TypeError: 1 is not JSON serializable"
    tempList = []  # temporary list holding the members of this team
    for index, row in grouped.iterrows():
        if index[0] == teamN:  # keep only the rows whose index matches the team number
            # Pair each remaining heading with its value in column order; an
            # OrderedDict keeps the JSON fields in the same order as the table
            rowDict = OrderedDict(zip(columnList[2:], row))
            tempList.append(rowDict)
    # A second OrderedDict keeps 'teamname' ahead of the list of members
    t = OrderedDict([('teamname', str(teamN)), ('members', tempList)])

    # Append the team record to the list of teams created earlier
    teamsList.append(t)

# Final dictionary with a single item: the list of teams
teams = {"teams": teamsList}

# Dump to JSON; sort_keys MUST be False, or every dictionary is alphabetized
formattedJson = json.dumps(teams, indent=1, sort_keys=False)
# Missing values are dumped as bare NaN, which is not valid JSON, so replace them with "NULL"
formattedJson = formattedJson.replace("NaN", '"NULL"')
print(formattedJson)

# Export to JSON file
with open(exportJson, "w") as parsed:
    parsed.write(formattedJson)

print("\n\nExport to JSON Complete")
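One caveat about replacing "NaN" in the dumped string: it also rewrites any text value that happens to contain those three letters. A possible alternative, sketched here under assumptions rather than taken from the answer, is to convert missing values to None before dumping so that json.dumps writes a real JSON null. The nan_to_none helper and the sample record are hypothetical.

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None so json.dumps writes null."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: nan_to_none(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [nan_to_none(item) for item in value]
    return value


# Hypothetical record: a missing mobile number and a name containing "NaN"
member = {"firstname": "NaNcy", "mobile": float("nan")}

cleaned = json.dumps(nan_to_none(member))
print(cleaned)
```

The name is left untouched and the missing number becomes null, which standard JSON parsers accept.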
