Flattening Multi-Level Nested JSON into Row-and-Column Data

Posted codest


Background

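Our company's middle-platform product needs to collect the JSON returned by external API endpoints and load it into the data lake. Sometimes the JSON returned by an external API is nested quite deeply. For example (reconstructed from the test string used later in this post):

{
  "code": 200,
  "requestId": "1680177848458",
  "data": [
    {
      "school": "xxx市第一实验小学",
      "no": "1001",
      "class": [
        {
          "name": "一(1)班",
          "teacher": "吴老师",
          "student": [
            {"name": "张同学", "age": 6},
            {"name": "王同学", "age": 7}
          ]
        }
      ]
    },
    {
      "school": "xxx市第二实验小学",
      "no": "1002",
      "class": [
        {
          "name": "一(2)班",
          "teacher": "陈老师",
          "student": [
            {"name": "欧阳同学", "age": 6}
          ]
        }
      ]
    }
  ]
}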

In the JSON above, the outermost layer is the response envelope. The data field holds the business payload, which is nested as school / class / student.

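When loading into the data lake, the data must be split into rows and columns from the viewpoint of the innermost level, the student. The final result looks like this (column names as produced by the code later in this post):

code | requestId | data.school | data.no | class.name | class.teacher | student.name | student.age
200 | 1680177848458 | xxx市第一实验小学 | 1001 | 一(1)班 | 吴老师 | 张同学 | 6
200 | 1680177848458 | xxx市第一实验小学 | 1001 | 一(1)班 | 吴老师 | 王同学 | 7
200 | 1680177848458 | xxx市第二实验小学 | 1002 | 一(2)班 | 陈老师 | 欧阳同学 | 6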

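Because the JSON structures returned by the external APIs we integrate with are neither uniform nor fixed, we need an algorithm that traverses and drills into every level of objects and arrays in order to flatten the JSON.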

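I looked at some existing JSON-flattening libraries. Json2Flat, for example, did not flatten this kind of data cleanly: it does not support array structures nested across levels.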

So I decided to implement the flattening myself.

The key code is as follows:

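import java.util.Map;

/** One JSON object plus a link to the object that encloses it. */
public class LinkedNode {

    private final LinkedNode parent;          // enclosing object; null at the root

    private final String parentName;          // key under which this object sits in its parent

    private final Map<String, Object> data;   // the object's own fields

    public LinkedNode(LinkedNode parent, String parentName, Map<String, Object> data) {
        this.parent = parent;
        this.parentName = parentName;
        this.data = data;
    }

    // Getters used by JSONFlatProcessor
    public LinkedNode getParent() { return parent; }

    public String getParentName() { return parentName; }

    public Map<String, Object> getData() { return data; }
}
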
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class JSONFlatProcessor {

    private LinkedList<LinkedNode> nodes;   // leaf nodes; each one becomes an output row

    private LinkedList<String> column;

    private List<Object[]> data;

    // Descend into nested objects and object arrays; an object with no nested children is a leaf.
    @SuppressWarnings("unchecked")
    public void find(LinkedNode parent, String parentName, Map<String, Object> data) {
        LinkedNode node = new LinkedNode(parent, parentName, data);
        if (!hasObjectOrArray(data)) {
            nodes.add(node);
        } else {
            for (Map.Entry<String, Object> entry : data.entrySet()) {
                if (entry.getValue() instanceof Map) {
                    find(node, entry.getKey(), (Map<String, Object>) entry.getValue());
                } else if (isObjectArray(entry.getValue())) {
                    find(node, entry.getKey(), (List<Map<String, Object>>) entry.getValue());
                }
            }
        }
    }

    // Every element of an object array shares the same parent and parent key.
    public void find(LinkedNode parent, String parentName, List<Map<String, Object>> data) {
        for (Map<String, Object> item : data) {
            find(parent, parentName, item);
        }
    }

    protected boolean hasObjectOrArray(Map<String, Object> item) {
        for (Map.Entry<String, Object> entry : item.entrySet()) {
            Object field = entry.getValue();
            if (field instanceof Map || isObjectArray(field)) {
                return true;
            }
        }
        return false;
    }

    // Only the first element is inspected, so arrays are assumed to be homogeneous.
    protected boolean isObjectArray(Object object) {
        return object instanceof List
                && !CollectionUtils.isEmpty((List<?>) object)
                && ((List<?>) object).get(0) instanceof Map;
    }

    public JSONFlatProcessor process(List<Map<String, Object>> data) {
        nodes = new LinkedList<>();
        find(null, null, data);
        return this;
    }

    public JSONFlatProcessor process(Map<String, Object> data) {
        nodes = new LinkedList<>();
        find(null, null, data);
        return this;
    }

    public LinkedList<LinkedNode> getNodes() {
        return nodes;
    }

    // Column names are derived from the first leaf only, so all rows are assumed to share its shape.
    public List<String> getColumn() {
        if (CollectionUtils.isEmpty(nodes)) {
            return Collections.emptyList();
        }

        column = new LinkedList<>();
        collectColumn(nodes.getFirst());

        return column;
    }

    // Walk from the leaf up to the root, prepending each level's scalar field names.
    protected void collectColumn(LinkedNode node) {
        List<String> innerColumn = new ArrayList<>(node.getData().size());
        for (Map.Entry<String, Object> entry : node.getData().entrySet()) {
            if (!(entry.getValue() instanceof Map || isObjectArray(entry.getValue()))) {
                String name = null == node.getParentName()
                        ? entry.getKey()
                        : String.format("%s.%s", node.getParentName(), entry.getKey());
                innerColumn.add(name);
            }
        }
        column.addAll(0, innerColumn);

        if (null != node.getParent()) {
            collectColumn(node.getParent());
        }
    }

    public List<Object[]> getData() {
        if (CollectionUtils.isEmpty(nodes)) {
            return Collections.emptyList();
        }

        data = new ArrayList<>(nodes.size());

        for (LinkedNode node : nodes) {
            LinkedList<Object> container = new LinkedList<>();
            collectData(node, container);
            data.add(container.toArray());
        }

        return data;
    }

    // Same upward walk as collectColumn, but collecting the scalar values.
    protected void collectData(LinkedNode node, LinkedList<Object> container) {
        List<Object> innerData = new ArrayList<>(node.getData().size());
        for (Map.Entry<String, Object> entry : node.getData().entrySet()) {
            if (!(entry.getValue() instanceof Map || isObjectArray(entry.getValue()))) {
                innerData.add(entry.getValue());
            }
        }
        container.addAll(0, innerData);

        if (null != node.getParent()) {
            collectData(node.getParent(), container);
        }
    }

    protected static class CollectionUtils {
        public static boolean isEmpty(Collection<?> collection) {
            return (collection == null || collection.isEmpty());
        }
    }
}

// Assumes Jackson's ObjectMapper; the post does not name the JSON library it uses.
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

public class MainTests {

    public static void main(String[] args) throws Exception {
        String jsonStr = "{\"code\":200,\"requestId\":\"1680177848458\",\"data\":[{\"school\":\"xxx市第一实验小学\",\"no\":\"1001\",\"class\":[{\"name\":\"一(1)班\",\"teacher\":\"吴老师\",\"student\":[{\"name\":\"张同学\",\"age\":6},{\"name\":\"王同学\",\"age\":7}]}]},{\"school\":\"xxx市第二实验小学\",\"no\":\"1002\",\"class\":[{\"name\":\"一(2)班\",\"teacher\":\"陈老师\",\"student\":[{\"name\":\"欧阳同学\",\"age\":6}]}]}]}";
        ObjectMapper jsonMapper = new ObjectMapper();
        // For a top-level array, read a List instead:
        // List<Map<String, Object>> list = jsonMapper.readValue(jsonStr, List.class);
        Map<String, Object> map = jsonMapper.readValue(jsonStr, Map.class);

        JSONFlatProcessor processor = new JSONFlatProcessor().process(map);
        System.out.println("Record count: " + processor.getNodes().size());
        System.out.println("Columns: " + processor.getColumn());
        System.out.println("First row: " + jsonMapper.writeValueAsString(processor.getData().get(0)));
    }
}

Output:

Record count: 3
Columns: [code, requestId, data.school, data.no, class.name, class.teacher, student.name, student.age]
First row: [200,"1680177848458","xxx市第一实验小学","1001","一(1)班","吴老师","张同学",6]
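
For readers who want to experiment with the idea interactively, the same parent-linked traversal can be sketched in Python. This is a rough port under a few assumptions rather than the author's implementation: the function names (`find_leaves`, `walk_up`, `flatten`) are made up for illustration, a node is represented as a plain `(parent, parent_name, data)` tuple, and the sample payload is reconstructed from the test string in the post.

```python
from typing import Any, Optional


def is_object_array(value: Any) -> bool:
    # Like the Java version, only the first element is inspected.
    return isinstance(value, list) and len(value) > 0 and isinstance(value[0], dict)


def is_nested(value: Any) -> bool:
    return isinstance(value, dict) or is_object_array(value)


def find_leaves(data: dict, parent: Optional[tuple] = None,
                parent_name: Optional[str] = None,
                leaves: Optional[list] = None) -> list:
    """Collect the innermost objects as (parent, parent_name, data) tuples."""
    if leaves is None:
        leaves = []
    node = (parent, parent_name, data)
    if not any(is_nested(v) for v in data.values()):
        leaves.append(node)  # no nested children: this object becomes one row
        return leaves
    for key, value in data.items():
        if isinstance(value, dict):
            find_leaves(value, node, key, leaves)
        elif is_object_array(value):
            for item in value:
                find_leaves(item, node, key, leaves)
    return leaves


def walk_up(node: tuple):
    """Yield (column_name, value) pairs for a leaf, ordered from root to leaf."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node[0]  # follow the parent link
    for _parent, name, data in reversed(chain):
        for key, value in data.items():
            if not is_nested(value):
                yield (key if name is None else f"{name}.{key}", value)


def flatten(payload: dict):
    leaves = find_leaves(payload)
    # Column names come from the first leaf only (same assumption as getColumn).
    columns = [c for c, _ in walk_up(leaves[0])] if leaves else []
    rows = [[v for _, v in walk_up(leaf)] for leaf in leaves]
    return columns, rows


payload = {
    "code": 200,
    "requestId": "1680177848458",
    "data": [
        {"school": "xxx市第一实验小学", "no": "1001", "class": [
            {"name": "一(1)班", "teacher": "吴老师", "student": [
                {"name": "张同学", "age": 6}, {"name": "王同学", "age": 7}]}]},
        {"school": "xxx市第二实验小学", "no": "1002", "class": [
            {"name": "一(2)班", "teacher": "陈老师", "student": [
                {"name": "欧阳同学", "age": 6}]}]},
    ],
}

columns, rows = flatten(payload)
print(columns)
for row in rows:
    print(row)
```

Keeping a link to the parent instead of copying ancestor fields into every child means each ancestor object is stored once, and a row is assembled only when it is requested by walking from the leaf back to the root.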

 

Querying deeply nested and complex JSON data with multiple levels

Posted: 2021-09-26 16:35:10

Question:

I am struggling to work out the approach needed to extract data from deeply nested, complex JSON. I have the following code to fetch the JSON.

import requests
import pandas as pd
import json
import pprint
import seaborn as sns
import matplotlib.pyplot as plt

base_url="https://data.sec.gov/api/xbrl/companyfacts/CIK0001627475.json"
headers={'User-Agent': 'Myheaderdata'}
first_response=requests.get(base_url,headers=headers)
response_dic=first_response.json()   
print(response_dic)
base_df=pd.DataFrame(response_dic)
base_df.head()

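It produces output showing the JSON and a pandas DataFrame. The DataFrame has three columns (cik, entityName, facts), and the third column (facts) contains a large amount of nested data.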

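What I want to understand is how to navigate into that nested structure to retrieve specific data. For example, I might want to go to the DEI level or the US GAAP level and retrieve particular attributes, say DEI > EntityCommonStockSharesOutstanding, and get the "label", "value", and "FY" details.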

When I try to use the get function as follows:

data=[]
for response in response_dic:
    data.append({"EntityCommonStockSharesOutstanding":response.get('EntityCommonStockSharesOutstanding')})
new_df=pd.DataFrame(data)
new_df.head()

I end up with the following attribute error:

AttributeError                            Traceback (most recent call last)
<ipython-input-15-15c1685065f0> in <module>
      1 data=[]
      2 for response in response_dic:
----> 3     data.append({"EntityCommonStockSharesOutstanding":response.get('EntityCommonStockSharesOutstanding')})
      4 base_df=pd.DataFrame(data)
      5 base_df.head()

AttributeError: 'str' object has no attribute 'get'
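
The error can be reproduced without network access using a small stand-in dictionary. In the sketch below the top-level keys mirror those of the response, but every value is invented for illustration. Iterating over a dict yields its string keys, which is why `.get` fails inside the loop, whereas indexing by key descends into the nesting.

```python
# Stand-in for response_dic: the shape follows the thread, the values are made up.
response_dic = {
    "cik": 1627475,
    "entityName": "Example Corp",
    "facts": {
        "dei": {
            "EntityCommonStockSharesOutstanding": {
                "label": "Entity Common Stock, Shares Outstanding",
                "units": {
                    "shares": [
                        {"end": "2018-10-31", "val": 100, "fy": 2018},
                        {"end": "2019-02-28", "val": 110, "fy": 2018},
                    ]
                },
            }
        }
    },
}

# Iterating over a dict gives its keys, which here are plain strings.
keys = [key for key in response_dic]
print(keys)

# Descend one key at a time to reach the nested fact.
fact = response_dic["facts"]["dei"]["EntityCommonStockSharesOutstanding"]

# Build one flat record per entry in the innermost list.
records = [
    {"label": fact["label"], "val": item["val"], "fy": item["fy"]}
    for item in fact["units"]["shares"]
]
print(records)
```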

Comments:

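Have you looked at the structure of response_dic? It is a nested dictionary. Your loop, for response in response_dic:, only iterates over its keys, which are the strings cik, entityName, and facts (not sure why you are doing that). To navigate to the "label" under "dei", simply use response_dic['facts']['dei']['EntityCommonStockSharesOutstanding']['label'], which gives "Entity Common Stock, Shares Outstanding".

Answer 1: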

Use pd.json_normalize:

For example:

entity1 = response_dic['facts']['dei']['EntityCommonStockSharesOutstanding']
entity2 = response_dic['facts']['dei']['EntityPublicFloat']

df1 = pd.json_normalize(entity1, record_path=['units', 'shares'],
                        meta=['label', 'description'])

df2 = pd.json_normalize(entity2, record_path=['units', 'USD'],
                        meta=['label', 'description'])
>>> df1
           end        val                  accn  ...      frame                                    label                                        description
0   2018-10-31  106299106  0001564590-18-028629  ...  CY2018Q3I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
1   2019-02-28  106692030  0001627475-19-000007  ...        NaN  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
2   2019-04-30  107160359  0001627475-19-000015  ...  CY2019Q1I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
3   2019-07-31  110803709  0001627475-19-000025  ...  CY2019Q2I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
4   2019-10-31  112020807  0001628280-19-013517  ...  CY2019Q3I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
5   2020-02-28  113931825  0001627475-20-000006  ...        NaN  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
6   2020-04-30  115142604  0001627475-20-000018  ...  CY2020Q1I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
7   2020-07-31  120276173  0001627475-20-000031  ...  CY2020Q2I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
8   2020-10-31  122073553  0001627475-20-000044  ...  CY2020Q3I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
9   2021-01-31  124962279  0001627475-21-000015  ...  CY2020Q4I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...
10  2021-04-30  126144849  0001627475-21-000022  ...  CY2021Q1I  Entity Common Stock, Shares Outstanding  Indicate number of shares or other units outst...

[11 rows x 10 columns]


>>> df2
          end         val                  accn    fy  fp  form       filed      frame                label                                        description
0  2018-10-03   900000000  0001627475-19-000007  2018  FY  10-K  2019-03-07  CY2018Q3I  Entity Public Float  The aggregate market value of the voting and n...
1  2019-06-28  1174421292  0001627475-20-000006  2019  FY  10-K  2020-03-02  CY2019Q2I  Entity Public Float  The aggregate market value of the voting and n...
2  2020-06-30  1532720862  0001627475-21-000015  2020  FY  10-K  2021-02-24  CY2020Q2I  Entity Public Float  The aggregate market value of the voting and n...
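
To see what `record_path` and `meta` do without calling the API, here is a minimal sketch on a hand-made dictionary. The shape follows the entities used in the answer, but all values are invented. `record_path` names the path to the list whose items become rows, and `meta` lists parent-level fields to copy onto every row.

```python
import pandas as pd

# Hypothetical stand-in for one entity under response_dic['facts']['dei'].
entity = {
    "label": "Entity Common Stock, Shares Outstanding",
    "description": "Made-up description text",
    "units": {
        "shares": [
            {"end": "2018-10-31", "val": 100},
            {"end": "2019-02-28", "val": 110},
        ]
    },
}

df = pd.json_normalize(
    entity,
    record_path=["units", "shares"],  # each item of this list becomes one row
    meta=["label", "description"],    # repeated on every row
)
print(df)
```

The record fields come first in the resulting columns, followed by the meta fields in the order given.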
