How to read and normalize the following JSON in pandas?

Posted 2023-02-15

【Asked】: 2020-04-22 19:21:32

【Question】:

I have seen many questions about reading JSON with pandas, but I still cannot solve this simple problem.

Data

{"session_id":{"0":["X061RFWB06K9V"],"1":["5AZ2X2A9BHH5U"]},"unix_timestamp":{"0":[1442503708],"1":[1441353991]},"cities":{"0":["New York NY, Newark NJ"],"1":["New York NY, Jersey City NJ, Philadelphia PA"]},"user":{"0":[[{"user_id":2024,"joining_date":"2015-03-22","country":"UK"}]],"1":[[{"user_id":2853,"joining_date":"2015-03-28","country":"DE"}]]}}

My attempt

import numpy as np
import pandas as pd
import json
from pandas.io.json import json_normalize

# attempt1
df = pd.read_json('a.json')

# attempt2
with open('a.json') as fi:
    data = json.load(fi)
    df = json_normalize(data,record_path='user',meta=['session_id','unix_timestamp','cities'])

Neither of them gives me the required output.

Required output

      session_id unix_timestamp       cities  user_id joining_date country 
0  X061RFWB06K9V     1442503708  New York NY     2024   2015-03-22      UK   
0  X061RFWB06K9V     1442503708    Newark NJ     2024   2015-03-22      UK

Preferred method

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html

I would love to see an implementation that uses pd.io.json.json_normalize:

pandas.io.json.json_normalize(data: Union[Dict, List[Dict]], record_path: Union[str, List, NoneType] = None, meta: Union[str, List, NoneType] = None, meta_prefix: Union[str, NoneType] = None, record_prefix: Union[str, NoneType] = None, errors: Union[str, NoneType] = 'raise', sep: str = '.', max_level: Union[int, NoneType] = None)

Related links

Pandas explode list of dictionaries into rows
How to normalize json correctly by Python Pandas
JSON to pandas DataFrame

【Answer 1】:

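Just thought I would share another way to get data from nested JSON into pandas, for anyone visiting this question in the future. The idea is to extract each column before reading the data into pandas. jmespath comes in handy here, since it makes traversing JSON data easy: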

import jmespath
from pprint import pprint

# data is the parsed JSON (a dict), e.g. loaded with json.load
# the braces build a multiselect hash: one output key per column
expression = jmespath.compile('''{session_id: session_id.*[],
                                  unix_timestamp: unix_timestamp.*[],
                                  cities: cities.*[],
                                  user_id: user.*[][].user_id,
                                  joining_date: user.*[][].joining_date,
                                  country: user.*[][].country
                                 }''')
res = expression.search(data)
pprint(res)

{'cities': ['New York NY, Newark NJ',
            'New York NY, Jersey City NJ, Philadelphia PA'],
 'country': ['UK', 'DE'],
 'joining_date': ['2015-03-22', '2015-03-28'],
 'session_id': ['X061RFWB06K9V', '5AZ2X2A9BHH5U'],
 'unix_timestamp': [1442503708, 1441353991],
 'user_id': [2024, 2853]}

Read the data into pandas, then split the cities into separate rows:

df = (pd.DataFrame(res)
      .assign(cities = lambda x: x.cities.str.split(','))
      .explode('cities')
     )
df

      session_id  unix_timestamp           cities  user_id joining_date country
0  X061RFWB06K9V      1442503708      New York NY     2024   2015-03-22      UK
0  X061RFWB06K9V      1442503708        Newark NJ     2024   2015-03-22      UK
1  5AZ2X2A9BHH5U      1441353991      New York NY     2853   2015-03-28      DE
1  5AZ2X2A9BHH5U      1441353991   Jersey City NJ     2853   2015-03-28      DE
1  5AZ2X2A9BHH5U      1441353991  Philadelphia PA     2853   2015-03-28      DE
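
One detail that is easy to miss: splitting on a bare comma leaves a leading space on every city after the first (for example ' Newark NJ'). Here is a minimal sketch of one way to clean that up. The two-row frame is a made-up stand-in for the frame built from res, not part of the original answer.

```python
import pandas as pd

# Illustrative stand-in for pd.DataFrame(res), reduced to two columns
df = pd.DataFrame({
    "session_id": ["X061RFWB06K9V", "5AZ2X2A9BHH5U"],
    "cities": ["New York NY, Newark NJ",
               "New York NY, Jersey City NJ, Philadelphia PA"],
})

out = (df.assign(cities=lambda x: x["cities"].str.split(","))
         .explode("cities")
         # strip the space left behind after each comma
         .assign(cities=lambda x: x["cities"].str.strip()))
print(out)
```

Stripping after the explode keeps the split pattern simple and also covers inputs where the spacing after the comma is inconsistent.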

【Answer 2】:

Here is another way:

df = pd.read_json(r'C:\path\file.json')

# every cell holds a one-element list; unwrap it
final = df.stack().str[0].unstack()
# expand the user dicts into columns before exploding, while the index is
# still unique, so that each user lines up with its own session
users = pd.DataFrame(final.pop('user').str[0].tolist(), index=final.index)
final = final.join(users)
# one row per city
final = final.assign(cities=final['cities'].str.split(',')).explode('cities')
print(final)

      session_id unix_timestamp            cities  user_id joining_date country
0  X061RFWB06K9V     1442503708       New York NY     2024   2015-03-22      UK
0  X061RFWB06K9V     1442503708         Newark NJ     2024   2015-03-22      UK
1  5AZ2X2A9BHH5U     1441353991       New York NY     2853   2015-03-28      DE
1  5AZ2X2A9BHH5U     1441353991    Jersey City NJ     2853   2015-03-28      DE
1  5AZ2X2A9BHH5U     1441353991   Philadelphia PA     2853   2015-03-28      DE
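
The stack / .str[0] / unstack chain is terse, so here is a small sketch of what it does on its own. The frame below is a hand-built stand-in for what read_json returns for this file (only two columns, for brevity), and the name unwrapped is illustrative.

```python
import pandas as pd

# Stand-in for the read_json result: every cell is a one-element list
df = pd.DataFrame({
    "session_id": [["X061RFWB06K9V"], ["5AZ2X2A9BHH5U"]],
    "unix_timestamp": [[1442503708], [1441353991]],
})

# stack() turns the frame into one long Series, so the .str accessor can
# index into every list at once; unstack() restores the original shape.
unwrapped = df.stack().str[0].unstack()
print(unwrapped)
```

The resulting columns have object dtype, so a follow-up astype or infer_objects call may be useful if numeric dtypes matter.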

【Discussion】:

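Why were cities and user chosen here?

@Jonnyboi I don't quite remember, since that was over a year ago, but from the looks of it read_json returned cities as a list whose items we want as separate rows sharing the same session_id and unix_timestamp, so we exploded it. The user value (also a list, but one whose fields we want as columns) was converted to a DataFrame and assigned back.

【Answer 3】: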

Once you have df, you can merge the two parts:

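df = pd.read_json('a.json')
df1 = df.drop('user', axis=1)
# each user cell is [[{...}]]; unwrap both list levels before normalizing
df2 = pd.json_normalize(df['user'].str[0].str[0].tolist())

# the remaining columns in df1 still hold one-element lists
df = df1.merge(df2, left_index=True, right_index=True)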
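
Since the question asks for json_normalize with record_path and meta, here is a minimal sketch of one way to get there. It assumes the data has the shape shown in the question (with the outer braces restored) and first pivots the column-oriented dict into one record per row key. The names records and meta are illustrative, and pd.json_normalize is the top-level name for this function in pandas 1.0 and later.

```python
import pandas as pd

# Data from the question, with the braces restored
data = {
    "session_id": {"0": ["X061RFWB06K9V"], "1": ["5AZ2X2A9BHH5U"]},
    "unix_timestamp": {"0": [1442503708], "1": [1441353991]},
    "cities": {"0": ["New York NY, Newark NJ"],
               "1": ["New York NY, Jersey City NJ, Philadelphia PA"]},
    "user": {"0": [[{"user_id": 2024, "joining_date": "2015-03-22", "country": "UK"}]],
             "1": [[{"user_id": 2853, "joining_date": "2015-03-28", "country": "DE"}]]},
}

meta = ["session_id", "unix_timestamp", "cities"]

# Pivot into one record per row key ("0", "1"), unwrapping the
# single-element list around every value.
records = [
    {col: data[col][key][0] for col in meta + ["user"]}
    for key in data["session_id"]
]

# "user" is now a list of dicts per record, the shape record_path expects.
df = pd.json_normalize(records, record_path="user", meta=meta)

# One row per city, then the column order the question asks for.
df = (df.assign(cities=df["cities"].str.split(", "))
        .explode("cities")[meta + ["user_id", "joining_date", "country"]])
print(df)
```

The cities value is kept as a plain string until after json_normalize and only then split, which avoids passing list-valued metadata through the meta argument.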

【Answer 4】:

I am using explode and join:

# j is the parsed JSON (a dict)
s=pd.DataFrame(j).apply(lambda x : x.str[0])
s['cities']=s.cities.str.split(',')
s=s.explode('cities')
s.reset_index(drop=True,inplace=True)
# sum(..., []) flattens the per-row lists of user dicts into one list
s=s.join(pd.DataFrame(sum(s.user.tolist(),[])))

      session_id  unix_timestamp  ... joining_date country
0  X061RFWB06K9V      1442503708  ...   2015-03-22      UK
1  X061RFWB06K9V      1442503708  ...   2015-03-22      UK
2  5AZ2X2A9BHH5U      1441353991  ...   2015-03-28      DE
3  5AZ2X2A9BHH5U      1441353991  ...   2015-03-28      DE
4  5AZ2X2A9BHH5U      1441353991  ...   2015-03-28      DE
[5 rows x 7 columns]
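
The sum(s.user.tolist(), []) call flattens a list of one-element lists into a single list of dicts. Below is a small sketch of the same step using itertools.chain.from_iterable, a common alternative that avoids building a new intermediate list for every addition. The name user_lists is a made-up stand-in for s.user.tolist().

```python
from itertools import chain

import pandas as pd

# Illustrative stand-in for s.user.tolist(): one single-element list per row
user_lists = [
    [{"user_id": 2024, "joining_date": "2015-03-22", "country": "UK"}],
    [{"user_id": 2853, "joining_date": "2015-03-28", "country": "DE"}],
]

# chain.from_iterable yields the inner items in order, giving the same
# flat list as sum(user_lists, [])
flat = list(chain.from_iterable(user_lists))

users = pd.DataFrame(flat)
print(users)
```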

【Answer 5】:

Here is one way:

import pandas as pd

# let's say d is your parsed JSON (a dict)
df = pd.DataFrame.from_dict(d, orient='index').T.reset_index(drop=True)

# unlist each element
df = df.applymap(lambda x: x[0])

# convert user column to multiple cols
df = pd.concat([df.drop('user', axis=1), df['user'].apply(lambda x: x[0]).apply(pd.Series)], axis=1)

      session_id  unix_timestamp  \
0  X061RFWB06K9V      1442503708   
1  5AZ2X2A9BHH5U      1441353991   

                                         cities  user_id joining_date country  
0                        New York NY, Newark NJ     2024   2015-03-22      UK  
1  New York NY, Jersey City NJ, Philadelphia PA     2853   2015-03-28      DE
