Add column from pandas DataFrame into deeply nested JSON at a specific object level

Posted 2023-03-12

[Published] 2020-06-15 01:53:00 [Question]:

Suppose I have a DataFrame df such as:

source      tables      columns   data_type   length    RecordCount
src1        table1      col1      INT         4         71
src1        table1      col2      CHAR        2         71
src1        table2      col1      CHAR        2         43
src2        table1      col1      INT         4         21
src2        table1      col2      DATE        3         21
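
For anyone wanting to reproduce the problem, the table above could be built as a DataFrame along these lines. This is only a sketch: the integer types for length and RecordCount are assumed from how the values appear in the table.

```python
import pandas as pd

# Sample data copied from the table in the question.
# Assumption: length and RecordCount are integer columns.
df = pd.DataFrame(
    {
        "source": ["src1", "src1", "src1", "src2", "src2"],
        "tables": ["table1", "table1", "table2", "table1", "table1"],
        "columns": ["col1", "col2", "col1", "col1", "col2"],
        "data_type": ["INT", "CHAR", "CHAR", "INT", "DATE"],
        "length": [4, 2, 2, 4, 3],
        "RecordCount": [71, 71, 43, 21, 21],
    }
)
```

Note that the column order matters here, because the code in this thread addresses columns by position rather than by name.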

I need output similar to the following:


{
  "src1": {
    "table1": {
      "Record Count": 71,  # missing in my current code output
      "col1": {
        "type": "INT",
        "length": 4
      },
      "col2": {
        "type": "CHAR",
        "length": 2
      }
    },
    "table2": {
      "Record Count": 43,  # missing in my current code output
      "col1": {
        "type": "CHAR",
        "length": 2
      }
    }
  },
  "src2": {
    "table1": {
      "Record Count": 21,  # missing in my current code output
      "col1": {
        "type": "INT",
        "length": 4
      },
      "col2": {
        "type": "DATE",
        "length": 3
      }
    }
  }
}

Current code:

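from collections import defaultdict

def make_nested(df):
    f = lambda: defaultdict(f)
    data = f()

    for row in df.to_numpy().tolist():
        t = data
        # Walk down the first two levels: source, then table
        for index, r in enumerate(row[:-4]):
            t = t[r]
            if index == 1:
                # Note: the colon makes this an annotation, not an assignment,
                # so nothing is stored here
                t[row[-5]]: {
                    "Record Count": row[-1]
                }
        # Leaf entry for the column
        t[row[-4]] = {
            "type": row[-3],
            "length": row[-2]
        }

    return data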

[Comments]:

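for index, r in enumerate(row[:-4]): should replace for r in row[:-4]:, not be nested inside it.

I edited the code. It looks like I get the same output as before, still without the new record count information in the JSON file.

[Answer 1]: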

Here is another solution that uses a two-step groupby approach.

# First, groupby ['source','tables'] to deal with the annoying 'Record Count'
# Need python 3.5+
# Otherwise, another method to merge two dicts should be used 
df_new = df.groupby(['source', 'tables']).apply(
    lambda x: {
        **{'Record Count': x.iloc[0, -1]},
        **{x.iloc[i, -4]: {'type': x.iloc[i, -3], 'length': x.iloc[i, -2]}
           for i in range(len(x))},
    }
).reset_index()

See Merge dicts

After the first step, df_new looks like this:

    source  tables  0
0   src1    table1  {'Record Count': 71, 'col1': {'type': 'INT', 'length': 4}, 'col2': {'type': 'CHAR', 'length': 2}}
1   src1    table2  {'Record Count': 43, 'col1': {'type': 'CHAR', 'length': 2}}
2   src2    table1  {'Record Count': 21, 'col1': {'type': 'INT', 'length': 4}, 'col2': {'type': 'DATE', 'length': 3}}

# Second groupby
df_final = df_new.groupby('source').apply(lambda x: {x.iloc[i,-2]: x.iloc[i,-1] for i in range(len(x))})
output = df_final.to_json()

output is a JSON-encoded string. To get an indented version, parse it and dump it again with indent:

import json
temp = json.loads(output)
with open('somefile','w') as f:
    json.dump(temp,f,indent=4)
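
Why the round trip through json.loads is needed can be seen in a small sketch. Because output is already a JSON string, dumping it directly would encode it a second time, which produces a single line with a backslash before every quote. The short output string below is a made-up stand-in for the real result of to_json(), used only for illustration.

```python
import json

# Hypothetical stand-in for the string returned by df_final.to_json()
output = '{"src1": {"table1": {"Record Count": 71}}}'

# Dumping the string itself encodes it a second time:
# the result is one line, wrapped in quotes, with escaped inner quotes.
double_encoded = json.dumps(output)

# Parsing first gives back nested dicts, which can then be indented.
pretty = json.dumps(json.loads(output), indent=4)

print(double_encoded)
print(pretty)
```

The same applies when writing to a file: pass the parsed object, not the string, to json.dump.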

[Discussion]:

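Thanks, this works. However, when I dump this to a file, everything appears on one line instead of the indented output shown in my post. How can I fix the code to allow that? There is also a backslash before every ", which I need to remove.

@weovibewvoibweoivwoiv I added something to change the formatting. Also, to fix your current code directly, try changing t[row[-5]]: {"Record Count": row[-1]} to t["Record Count"] = row[-1]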
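
Applied to the question's function, the one-line fix suggested in the last comment might look like the sketch below. The tuple unpacking and the small sample df are illustrative choices rather than part of the original post, and the column order (source, tables, columns, data_type, length, RecordCount) is assumed to match the question's table.

```python
import json
from collections import defaultdict

import pandas as pd


def make_nested(df):
    """Nest rows as source -> table -> column, with one "Record Count" per table."""
    f = lambda: defaultdict(f)  # dict that creates nested dicts on demand
    data = f()

    # Assumes exactly six columns, in the order shown in the question.
    for source, table, column, data_type, length, record_count in df.to_numpy().tolist():
        t = data[source][table]
        t["Record Count"] = record_count  # a real assignment this time
        t[column] = {"type": data_type, "length": length}

    return data


# Illustrative sample: the first three rows of the question's table.
df = pd.DataFrame(
    {
        "source": ["src1", "src1", "src1"],
        "tables": ["table1", "table1", "table2"],
        "columns": ["col1", "col2", "col1"],
        "data_type": ["INT", "CHAR", "CHAR"],
        "length": [4, 2, 2],
        "RecordCount": [71, 71, 43],
    }
)

nested = make_nested(df)
print(json.dumps(nested, indent=4))
```

The record count is written once per row, so repeated rows for the same table simply overwrite it with the same value.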
