Flattening Deeply Nested, Complex JSON into Rows and Columns
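Background
Our company's data middle-platform product has to collect the JSON returned by external API endpoints and load it into the data lake. The JSON returned by an external API is sometimes deeply nested; the sample payload used in the test class further down is one example.
In that JSON, the outermost level is the response object, and data holds the business data, which is nested by school / class / student.
When loading into the lake, the data has to be split into rows and columns from the point of view of the innermost level, the student. The result is one row per student, with the response-level, school-level and class-level fields repeated on each row (three rows for the sample payload).
Because the JSON structures returned by the external APIs are neither uniform nor fixed, an algorithm is needed that traverses and drills into every level of objects and arrays in order to flatten the JSON.
I looked at some existing JSON-flattening libraries, but they fell short. Json2Flat, for example, does not support array nesting that spans levels.
So I decided to implement the flattening myself.
The key code is as follows: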
// LinkedNode.java
import java.util.Map;

public class LinkedNode {
    private LinkedNode parent;          // enclosing object, null at the root
    private String parentName;          // key under which this object was found
    private Map<String, Object> data;

    public LinkedNode(LinkedNode parent, String parentName, Map<String, Object> data) {
        this.parent = parent;
        this.parentName = parentName;
        this.data = data;
    }

    // Getters used by JSONFlatProcessor
    public LinkedNode getParent() { return parent; }
    public String getParentName() { return parentName; }
    public Map<String, Object> getData() { return data; }
}

// JSONFlatProcessor.java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class JSONFlatProcessor {
    private LinkedList<LinkedNode> nodes;   // leaf objects, one per output row
    private LinkedList<String> column;
    private List<Object[]> data;

    // Walk one object: a leaf (no nested object or object array) becomes a row,
    // otherwise recurse into every nested object and object array.
    @SuppressWarnings("unchecked")
    public void find(LinkedNode parent, String parentName, Map<String, Object> data) {
        LinkedNode node = new LinkedNode(parent, parentName, data);
        if (!hasObjectOrArray(data)) {
            nodes.add(node);
        } else {
            for (Map.Entry<String, Object> entry : data.entrySet()) {
                if (entry.getValue() instanceof Map) {
                    find(node, entry.getKey(), (Map<String, Object>) entry.getValue());
                } else if (isObjectArray(entry.getValue())) {
                    find(node, entry.getKey(), (List<Map<String, Object>>) entry.getValue());
                }
            }
        }
    }

    public void find(LinkedNode parent, String parentName, List<Map<String, Object>> data) {
        for (Map<String, Object> item : data) {
            find(parent, parentName, item);
        }
    }

    protected Boolean hasObjectOrArray(Map<String, Object> item) {
        for (Map.Entry<String, Object> entry : item.entrySet()) {
            Object field = entry.getValue();
            if (field instanceof Map || isObjectArray(field)) {
                return Boolean.TRUE;
            }
        }
        return Boolean.FALSE;
    }

    // A non-empty list whose first element is an object
    protected Boolean isObjectArray(Object object) {
        return object instanceof List
                && !CollectionUtils.isEmpty((List<?>) object)
                && ((List<?>) object).get(0) instanceof Map;
    }

    public JSONFlatProcessor process(List<Map<String, Object>> data) {
        nodes = new LinkedList<>();
        find(null, null, data);
        return this;
    }

    public JSONFlatProcessor process(Map<String, Object> data) {
        nodes = new LinkedList<>();
        find(null, null, data);
        return this;
    }

    public LinkedList<LinkedNode> getNodes() {
        return nodes;
    }

    // Column names are taken from the first leaf and its ancestors
    public List<String> getColumn() {
        if (CollectionUtils.isEmpty(nodes)) {
            return Collections.emptyList();
        }
        column = new LinkedList<>();
        collectColumn(nodes.getFirst());
        return column;
    }

    protected void collectColumn(LinkedNode node) {
        List<String> innerColumn = new ArrayList<>(node.getData().size());
        for (Map.Entry<String, Object> entry : node.getData().entrySet()) {
            if (!(entry.getValue() instanceof Map || isObjectArray(entry.getValue()))) {
                String columnBuilder = null == node.getParentName()
                        ? entry.getKey()
                        : String.format("%s.%s", node.getParentName(), entry.getKey());
                innerColumn.add(columnBuilder);
            }
        }
        column.addAll(0, innerColumn);      // ancestors go in front
        if (null != node.getParent()) {
            collectColumn(node.getParent());
        }
    }

    public List<Object[]> getData() {
        if (CollectionUtils.isEmpty(nodes)) {
            return Collections.emptyList();
        }
        data = new ArrayList<>(nodes.size());
        for (LinkedNode node : nodes) {
            LinkedList<Object> container = new LinkedList<>();
            collectData(node, container);
            data.add(container.toArray());
        }
        return data;
    }

    protected void collectData(LinkedNode node, LinkedList<Object> container) {
        List<Object> innerData = new ArrayList<>(node.getData().size());
        for (Map.Entry<String, Object> entry : node.getData().entrySet()) {
            if (!(entry.getValue() instanceof Map || isObjectArray(entry.getValue()))) {
                innerData.add(entry.getValue());
            }
        }
        container.addAll(0, innerData);     // ancestors go in front
        if (null != node.getParent()) {
            collectData(node.getParent(), container);
        }
    }

    protected static class CollectionUtils {
        public static boolean isEmpty(Collection<?> collection) {
            return (collection == null || collection.isEmpty());
        }
    }
}
// MainTests.java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

public class MainTests {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        // Sample response: school / class / student nested inside "data"
        String jsonStr = "{\"code\":200,\"requestId\":\"1680177848458\",\"data\":["
                + "{\"school\":\"xxx市第一实验小学\",\"no\":\"1001\",\"class\":["
                + "{\"name\":\"一(1)班\",\"teacher\":\"吴老师\",\"student\":["
                + "{\"name\":\"张同学\",\"age\":6},{\"name\":\"王同学\",\"age\":7}]}]},"
                + "{\"school\":\"xxx市第二实验小学\",\"no\":\"1002\",\"class\":["
                + "{\"name\":\"一(2)班\",\"teacher\":\"陈老师\",\"student\":["
                + "{\"name\":\"欧阳同学\",\"age\":6}]}]}]}";
        ObjectMapper jsonMapper = new ObjectMapper();
        // If the root of the payload is an array, read it as a list instead:
        // List<Map<String, Object>> map = jsonMapper.readValue(jsonStr, List.class);
        Map<String, Object> map = jsonMapper.readValue(jsonStr, Map.class);
        JSONFlatProcessor processor = new JSONFlatProcessor().process(map);
        System.out.println("Row count: " + processor.getNodes().size());
        System.out.println("Columns: " + processor.getColumn());
        System.out.println("First row: " + new ObjectMapper().writeValueAsString(processor.getData().get(0)));
    }
}

Output:
Row count: 3
Columns: [code, requestId, data.school, data.no, class.name, class.teacher, student.name, student.age]
First row: [200,"1680177848458","xxx市第一实验小学","1001","一(1)班","吴老师","张同学",6]
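For readers who want to try the idea without a Java toolchain, here is a minimal Python sketch of the same traversal. It is not the code from this post: the names flatten and is_nested are made up for illustration, the sample values are English stand-ins for the payload above, and instead of keeping parent pointers it passes the ancestors' cells down the recursion. Like the Java version, it treats a list as an object array only when it is non-empty and its first element is an object.

```python
from typing import Any, Dict, Iterator, List, Sequence, Tuple

Cell = Tuple[str, Any]  # (column name, value)


def is_nested(value: Any) -> bool:
    """True for an object, or a non-empty array whose first element is an object."""
    if isinstance(value, dict):
        return True
    return isinstance(value, list) and bool(value) and isinstance(value[0], dict)


def flatten(obj: Dict[str, Any], parent_name: str = "",
            inherited: Sequence[Cell] = ()) -> Iterator[List[Cell]]:
    """Yield one row per leaf object, prefixed with the scalar fields of its ancestors."""
    # Scalar fields of this object, named "<parent key>.<field>" below the root.
    own = [
        (f"{parent_name}.{key}" if parent_name else key, value)
        for key, value in obj.items()
        if not is_nested(value)
    ]
    cells = list(inherited) + own
    nested = [(key, value) for key, value in obj.items() if is_nested(value)]
    if not nested:
        yield cells  # leaf: one output row
        return
    for key, value in nested:
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, dict):
                yield from flatten(child, key, cells)


# Stand-in payload with the same shape as the sample response (values are illustrative).
payload = {
    "code": 200,
    "requestId": "1680177848458",
    "data": [
        {"school": "School A", "no": "1001", "class": [
            {"name": "Class 1", "teacher": "Ms. Wu", "student": [
                {"name": "Zhang", "age": 6}, {"name": "Wang", "age": 7}]}]},
        {"school": "School B", "no": "1002", "class": [
            {"name": "Class 2", "teacher": "Ms. Chen", "student": [
                {"name": "Ouyang", "age": 6}]}]},
    ],
}

rows = list(flatten(payload))
columns = [name for name, _ in rows[0]]
print(len(rows), columns)
```

One consequence shared with the Java code: column names are read from the first row only, so payloads whose leaf objects have differing fields would need extra handling.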
Querying deeply nested and complex JSON data with multiple levels
Posted: 2021-09-26 16:35:10
Question: I am struggling to work out the approach needed to extract data from deeply nested, complex JSON. I have the following code to fetch the JSON.
import requests
import pandas as pd
import json
import pprint
import seaborn as sns
import matplotlib.pyplot as plt
base_url="https://data.sec.gov/api/xbrl/companyfacts/CIK0001627475.json"
headers={'User-Agent': 'Myheaderdata'}
first_response=requests.get(base_url,headers=headers)
response_dic=first_response.json()
print(response_dic)
base_df=pd.DataFrame(response_dic)
base_df.head()
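This prints the JSON and shows a pandas DataFrame. The DataFrame has the columns cik, entityName and facts, and the third column (facts) holds a large amount of nested data.
What I want to understand is how to navigate into that nested structure to retrieve specific data. For example, I might want to go to the dei level or the us-gaap level and retrieve particular attributes: say, dei > EntityCommonStockSharesOutstanding, and get the "label", "val" and "fy" details.
When I try to use the get function as follows: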
data=[]
for response in response_dic:
    data.append({"EntityCommonStockSharesOutstanding":response.get('EntityCommonStockSharesOutstanding')})
new_df=pd.DataFrame(data)
new_df.head()
I end up with the following AttributeError:
AttributeError                            Traceback (most recent call last)
<ipython-input-15-15c1685065f0> in <module>
      1 data=[]
      2 for response in response_dic:
----> 3     data.append({"EntityCommonStockSharesOutstanding":response.get('EntityCommonStockSharesOutstanding')})
      4 base_df=pd.DataFrame(data)
      5 base_df.head()
AttributeError: 'str' object has no attribute 'get'
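Comments:
Have you looked at the structure of response_dic? It is a nested dictionary. Your loop, for response in response_dic:, only iterates over its keys, which are the strings cik, entityName and facts (not sure why you are doing that). To navigate to the label under dei, just use response_dic['facts']['dei']['EntityCommonStockSharesOutstanding']['label'], which returns 'Entity Common Stock, Shares Outstanding'.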
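The point made in the comment can be checked offline with a small stand-in for response_dic. In the sketch below, the entity name is made up, only a fragment of the real payload is reproduced, and dig is a hypothetical helper (not part of requests or pandas) that walks a key path and returns a default when a level is missing.

```python
from typing import Any, Mapping

# Stand-in for the API response: same nesting, only a fragment of the data.
response_dic = {
    "cik": 1627475,
    "entityName": "Example Corp",  # made-up name for illustration
    "facts": {
        "dei": {
            "EntityCommonStockSharesOutstanding": {
                "label": "Entity Common Stock, Shares Outstanding",
                "units": {"shares": [{"end": "2018-10-31", "val": 106299106}]},
            }
        }
    },
}

# Iterating over a dict yields its keys (strings), which is why .get() failed.
keys = list(response_dic)


def dig(obj: Any, *path: str, default: Any = None) -> Any:
    """Follow a sequence of keys through nested mappings; return default if any is missing."""
    current = obj
    for key in path:
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    return current


label = dig(response_dic, "facts", "dei", "EntityCommonStockSharesOutstanding", "label")
missing = dig(response_dic, "facts", "dei", "NoSuchConcept", "label")
print(keys, label, missing)
```

Plain chained indexing, as in the comment, is enough when every key is known to exist; a helper like dig only matters when some concepts may be absent for a given company.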
Answer 1:
Use pd.json_normalize. For example:
entity1 = response_dic['facts']['dei']['EntityCommonStockSharesOutstanding']
entity2 = response_dic['facts']['dei']['EntityPublicFloat']
df1 = pd.json_normalize(entity1, record_path=['units', 'shares'],
                        meta=['label', 'description'])
df2 = pd.json_normalize(entity2, record_path=['units', 'USD'],
                        meta=['label', 'description'])
>>> df1
end val accn ... frame label description
0 2018-10-31 106299106 0001564590-18-028629 ... CY2018Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
1 2019-02-28 106692030 0001627475-19-000007 ... NaN Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
2 2019-04-30 107160359 0001627475-19-000015 ... CY2019Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
3 2019-07-31 110803709 0001627475-19-000025 ... CY2019Q2I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
4 2019-10-31 112020807 0001628280-19-013517 ... CY2019Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
5 2020-02-28 113931825 0001627475-20-000006 ... NaN Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
6 2020-04-30 115142604 0001627475-20-000018 ... CY2020Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
7 2020-07-31 120276173 0001627475-20-000031 ... CY2020Q2I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
8 2020-10-31 122073553 0001627475-20-000044 ... CY2020Q3I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
9 2021-01-31 124962279 0001627475-21-000015 ... CY2020Q4I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
10 2021-04-30 126144849 0001627475-21-000022 ... CY2021Q1I Entity Common Stock, Shares Outstanding Indicate number of shares or other units outst...
[11 rows x 10 columns]
>>> df2
end val accn fy fp form filed frame label description
0 2018-10-03 900000000 0001627475-19-000007 2018 FY 10-K 2019-03-07 CY2018Q3I Entity Public Float The aggregate market value of the voting and n...
1 2019-06-28 1174421292 0001627475-20-000006 2019 FY 10-K 2020-03-02 CY2019Q2I Entity Public Float The aggregate market value of the voting and n...
2 2020-06-30 1532720862 0001627475-21-000015 2020 FY 10-K 2021-02-24 CY2020Q2I Entity Public Float The aggregate market value of the voting and n...
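To make the roles of record_path and meta concrete, here is a plain-Python sketch of what they do in this case: follow record_path down to a list of records, then copy the meta fields from the enclosing entity onto every record. normalize_records is a hypothetical helper written for illustration, not pandas code, and the entity dict is a trimmed stand-in with an abbreviated description.

```python
from typing import Any, Dict, List, Sequence


def normalize_records(entity: Dict[str, Any], record_path: Sequence[str],
                      meta: Sequence[str]) -> List[Dict[str, Any]]:
    """Return one flat dict per record under record_path, with meta fields appended."""
    records: Any = entity
    for key in record_path:  # e.g. entity['units']['USD']
        records = records[key]
    meta_values = {name: entity[name] for name in meta}
    # Record fields come first, followed by the repeated meta columns.
    return [{**record, **meta_values} for record in records]


# Trimmed stand-in for response_dic['facts']['dei']['EntityPublicFloat'].
entity = {
    "label": "Entity Public Float",
    "description": "(description text)",  # abbreviated placeholder
    "units": {
        "USD": [
            {"end": "2018-10-03", "val": 900000000, "fy": 2018},
            {"end": "2019-06-28", "val": 1174421292, "fy": 2019},
        ]
    },
}

rows = normalize_records(entity, record_path=["units", "USD"], meta=["label", "description"])
for row in rows:
    print(row)
```

This only covers the single-entity case shown in the answer; pd.json_normalize itself accepts more input shapes and options than this sketch handles.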