Deeply nested JSON response to pandas dataframe

Posted 2023-03-11

Asked: 2017-12-23 10:11:20

Question:

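I'm new to Python and pandas, and I'm having trouble converting nested JSON into a pandas dataframe. I send a query to a database and get a JSON string back.

It is a deeply nested JSON string containing multiple arrays. The response from the database contains thousands of rows. Here is the general structure of a row in the JSON string (two rows shown):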


{
  "ID": "123456",
  "profile": {
    "criteria": [
      {
        "type": "type1",
        "name": "name1",
        "value": "7",
        "properties": []
      },
      {
        "type": "type2",
        "name": "name2",
        "value": "6",
        "properties": [
          {
            "type": "MAX",
            "name": "",
            "value": "100"
          },
          {
            "type": "MIN",
            "name": "",
            "value": "5"
          }
        ]
      },
      {
        "type": "type3",
        "name": "name3",
        "value": "5",
        "properties": []
      }
    ]
  }
}

{
  "ID": "456789",
  "profile": {
    "criteria": [
      {
        "type": "type4",
        "name": "name4",
        "value": "6",
        "properties": []
      }
    ]
  }
}

I want to flatten this JSON string with pandas. I'm having trouble with json_normalize because the JSON is so deeply nested:

from cassandra.cluster import Cluster
import pandas as pd
from pandas.io.json import json_normalize

def pandas_factory(colnames, rows):
    return pd.DataFrame(rows, columns=colnames)

cluster = Cluster(['xxx.xx.x.xx'], port=yyyy)
session = cluster.connect('nnnn')

session.row_factory = pandas_factory

json_string = session.execute('select json ......')
df = json_string._current_rows
df_normalized = json_normalize(df)
print(df_normalized)

When I run this code, I get a KeyError:

KeyError: 0

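I need help converting this JSON string into a dataframe with only a few selected columns, like this (the rest of the data can be skipped):

ID     | criteria | type  | name  | value
123456 | 1        | type1 | name1 | 7
123456 | 2        | type2 | name2 | 6
123456 | 3        | type3 | name3 | 5
456789 | 1        | type4 | name4 | 6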

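I looked for similar questions here, but I can't seem to apply them to my JSON string.

Any help is appreciated! :)

Edit:

The returned JSON string is actually a query response object, a ResultSet. I think that is why I run into problems when using: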

json_string= session.execute('select json profile from visning')
temp = json.loads(json_string)

and get the error:

TypeError: the JSON object must be str, not 'ResultSet'

Edit #2:

To see what I'm dealing with, I printed the query result using:

for line in session.execute('select json.....'):
    print(line)

and got something like this:

Row(json='{"ID": null, "profile": null}')
Row(json='{"ID": "123", "profile": {"criteria": [{"type": "type1", "name": "name1", "value": "10", "properties": []}, {"type": "type2", "name": "name2", "value": "50", "properties": []}, {"type": "type3", "name": "name3", "value": "40", "properties": []}]}}')
Row(json='{"ID": "456", "profile": {"criteria": []}}')
Row(json='{"ID": "789", "profile": {"criteria": [{"type": "type4", "name": "name4", "value": "5", "properties": []}]}}')
Row(json='{"ID": "987", "profile": {"criteria": [{"type": "type5", "name": "name5", "value": "70", "properties": []}, {"type": "type6", "name": "name6", "value": "60", "properties": []}, {"type": "type7", "name": "name7", "value": "2", "properties": []}, {"type": "type8", "name": "name8", "value": "7", "properties": []}]}}')

My problem is turning this structure into a JSON string that json.loads() can consume:

json_string= session.execute('select json profile from visning')
json_list = list(json_string)
string= ''.join(list(map(str, json_list)))
temp = json.loads(string)  # raises json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
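A likely reason for this error is that str() on a row object gives its Python repr, which starts with "Row(json=" rather than with JSON. The sketch below reproduces that with a namedtuple standing in for the driver's row type. The stand-in `Row` and the sample data are assumptions for illustration, not the actual driver classes.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for the rows returned by the query;
# only the `json` attribute seen in the printed output is assumed.
Row = namedtuple("Row", ["json"])
rows = [Row(json='{"ID": "456", "profile": {"criteria": []}}')]

# str(row) yields the repr "Row(json='...')", not the JSON payload itself.
joined = "".join(map(str, rows))

try:
    json.loads(joined)
    failed = False
except json.JSONDecodeError:
    # The first character is "R", so the decoder gives up at char 0.
    failed = True

# Parsing the attribute that actually holds the JSON text works.
parsed = json.loads(rows[0].json)
```

Even with the repr problem removed, joining several JSON objects back to back would not form one valid JSON document, so parsing each row on its own is usually the simpler route.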

Edit #3:

As requested in the comments below, printing

for line in session.execute('select json.....'):
    print((line.json))

gives the output:

{"ID": null, "profile": null}
{"ID": "123", "profile": {"criteria": [{"type": "type1", "name": "name1", "value": "10", "properties": []}, {"type": "type2", "name": "name2", "value": "50", "properties": []}, {"type": "type3", "name": "name3", "value": "40", "properties": []}]}}
{"ID": "456", "profile": {"criteria": []}}
{"ID": "789", "profile": {"criteria": [{"type": "type4", "name": "name4", "value": "5", "properties": []}]}}
{"ID": "987", "profile": {"criteria": [{"type": "type5", "name": "name5", "value": "70", "properties": []}, {"type": "type6", "name": "name6", "value": "60", "properties": []}, {"type": "type7", "name": "name7", "value": "2", "properties": []}, {"type": "type8", "name": "name8", "value": "7", "properties": []}]}}
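Assuming each `line.json` holds one complete JSON object, as the output above suggests, the rows can be parsed one at a time and flattened without joining them. This is a minimal sketch: `Row`, `result_set` and `flatten_rows` are hypothetical names, and the sample rows are shortened versions of the printed ones.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for the driver's row objects.
Row = namedtuple("Row", ["json"])

# Sample rows modelled on the printed output (shortened).
result_set = [
    Row(json='{"ID": null, "profile": null}'),
    Row(json='{"ID": "123", "profile": {"criteria": ['
             '{"type": "type1", "name": "name1", "value": "10", "properties": []}, '
             '{"type": "type2", "name": "name2", "value": "50", "properties": []}]}}'),
    Row(json='{"ID": "456", "profile": {"criteria": []}}'),
]


def flatten_rows(rows):
    """Yield one (ID, position, type, name, value) tuple per criterion."""
    for row in rows:
        record = json.loads(row.json)            # parse each row separately
        profile = record.get("profile") or {}    # tolerate "profile": null
        criteria = profile.get("criteria") or []
        for position, criterion in enumerate(criteria, start=1):
            yield (
                record["ID"],
                position,
                criterion["type"],
                criterion["name"],
                criterion["value"],
            )


flat = list(flatten_rows(result_set))
```

Rows with a null profile or an empty criteria list simply contribute nothing. The resulting list of tuples can then be passed to pd.DataFrame with the column names ID, criteria, type, name and value.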

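Comments:

- Can you provide a JSON with at least two rows?
- I updated the question and added another row.
- @stovfl I added the output of print((line.json)), see Edit #3.
- Your output shows that line.json gives you the JSON; no further manipulation is needed.
- @stovfl I'm still having trouble using json.loads with this JSON, see the code added in Edit #3.

Answer 1:

A hard-coded example...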

import pandas as pd

temp = [
    {
        "ID": "123456",
        "profile": {
            "criteria": [
                {
                    "type": "type1",
                    "name": "name1",
                    "value": "7",
                    "properties": []
                },
                {
                    "type": "type2",
                    "name": "name2",
                    "value": "6",
                    "properties": [
                        {
                            "type": "MAX",
                            "name": "",
                            "value": "100"
                        },
                        {
                            "type": "MIN",
                            "name": "",
                            "value": "5"
                        }
                    ]
                },
                {
                    "type": "type3",
                    "name": "name3",
                    "value": "5",
                    "properties": []
                }
            ]
        }
    },
    {
        "ID": "456789",
        "profile": {
            "criteria": [
                {
                    "type": "type4",
                    "name": "name4",
                    "value": "6",
                    "properties": []
                }
            ]
        }
    }
]

cols = ['ID', 'criteria', 'type', 'name', 'value']

rows = []
for data in temp:
    data_id = data['ID']
    criteria = data['profile']['criteria']
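    # enumerate gives the 1-based position directly; criteria.index(d) would
    # return the first match and mis-number duplicate criteria
    for position, d in enumerate(criteria, start=1):
        rows.append([data_id, position, d['type'], d['name'], d['value']])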

df = pd.DataFrame(rows, columns=cols)

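This is by no means elegant. It is more of a quick-and-dirty solution, since I don't know exactly how the JSON data is formatted. However, based on what you've provided, the code above produces the desired DataFrame.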

       ID  criteria   type   name value
0  123456         1  type1  name1     7
1  123456         2  type2  name2     6
2  123456         3  type3  name3     5
3  456789         1  type4  name4     6
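Since the question asked about json_normalize, here is a hedged alternative that produces the same columns. It assumes pandas 1.0 or later, where the function is exposed as pd.json_normalize, and it assumes every criterion carries a "properties" key. The `records` list is made-up sample data in the shape shown in the question.

```python
import pandas as pd

# Sample records in the question's shape, already parsed into Python objects.
records = [
    {"ID": "123456", "profile": {"criteria": [
        {"type": "type1", "name": "name1", "value": "7", "properties": []},
        {"type": "type2", "name": "name2", "value": "6",
         "properties": [{"type": "MAX", "name": "", "value": "100"}]},
    ]}},
    {"ID": "456789", "profile": {"criteria": [
        {"type": "type4", "name": "name4", "value": "6", "properties": []},
    ]}},
    {"ID": None, "profile": None},
]

# record_path cannot descend into a null profile, so drop those records first.
usable = [r for r in records if r.get("profile")]

# One output row per element of profile.criteria, with the parent ID attached.
df = pd.json_normalize(usable, record_path=["profile", "criteria"], meta=["ID"])

# Skip the nested properties and number the criteria within each ID.
df = df.drop(columns=["properties"])
df["criteria"] = df.groupby("ID").cumcount() + 1
df = df[["ID", "criteria", "type", "name", "value"]]
```

The explicit loop above is easier to adapt when the structure varies between rows; json_normalize is more compact when every record has the same shape.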

Also, if you need to "load" the JSON data, you can use the json library like this:

import json

temp = json.loads(json_string)

# Or from a file...
with open('some_json.json') as json_file:
    temp = json.load(json_file)

Note the difference between json.loads and json.load: json.loads parses a string, while json.load reads from an open file object.
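A small sketch of that difference, using io.StringIO as a stand-in for an open file so the example needs nothing on disk (the sample text is made up):

```python
import io
import json

text = '{"ID": "456789", "profile": {"criteria": []}}'

# json.loads ("load string") parses JSON text held in memory.
from_string = json.loads(text)

# json.load reads from any file-like object with a .read() method.
from_file = json.load(io.StringIO(text))
```

Both calls return the same Python dict; only the input type differs.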
