How to read and normalize the following JSON in pandas?
Posted: 2020-04-22 19:21:32
Question: I have seen many questions on *** about reading JSON with pandas, but I still cannot solve this simple problem.
Data
{"session_id":{"0":["X061RFWB06K9V"],"1":["5AZ2X2A9BHH5U"]},"unix_timestamp":{"0":[1442503708],"1":[1441353991]},"cities":{"0":["New York NY, Newark NJ"],"1":["New York NY, Jersey City NJ, Philadelphia PA"]},"user":{"0":[[{"user_id":2024,"joining_date":"2015-03-22","country":"UK"}]],"1":[[{"user_id":2853,"joining_date":"2015-03-28","country":"DE"}]]}}
My attempt
import numpy as np
import pandas as pd
import json
from pandas.io.json import json_normalize
# attempt1
df = pd.read_json('a.json')
# attempt2
with open('a.json') as fi:
    data = json.load(fi)
df = json_normalize(data,record_path='user',meta=['session_id','unix_timestamp','cities'])
Neither of them gives me the required output.
Required output
session_id unix_timestamp cities user_id joining_date country
0 X061RFWB06K9V 1442503708 New York NY 2024 2015-03-22 UK
0 X061RFWB06K9V 1442503708 Newark NJ 2024 2015-03-22 UK
Preferred method
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
I would love to see a solution that uses pd.io.json.json_normalize
pandas.io.json.json_normalize(data: Union[Dict, List[Dict]], record_path: Union[str, List, NoneType] = None, meta: Union[str, List, NoneType] = None, meta_prefix: Union[str, NoneType] = None, record_prefix: Union[str, NoneType] = None, errors: Union[str, NoneType] = 'raise', sep: str = '.', max_level: Union[int, NoneType] = None)
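As a rough illustration of how the signature above could be applied to this data, here is a minimal sketch. It assumes the JSON has the column -> row key -> list layout shown in the Data section, and it uses the top-level `pd.json_normalize` name (available since pandas 1.0) rather than `pandas.io.json.json_normalize`. The `records` reshaping step is an illustrative choice, not something the asker wrote.

```python
import pandas as pd

# Same structure as the question's data (assumed layout: column -> row key -> list).
data = {
    "session_id": {"0": ["X061RFWB06K9V"], "1": ["5AZ2X2A9BHH5U"]},
    "unix_timestamp": {"0": [1442503708], "1": [1441353991]},
    "cities": {"0": ["New York NY, Newark NJ"],
               "1": ["New York NY, Jersey City NJ, Philadelphia PA"]},
    "user": {"0": [[{"user_id": 2024, "joining_date": "2015-03-22", "country": "UK"}]],
             "1": [[{"user_id": 2853, "joining_date": "2015-03-28", "country": "DE"}]]},
}

# json_normalize expects a list of records, so build one record per row key
# and unwrap the single-element lists along the way.
records = [
    {
        "session_id": data["session_id"][key][0],
        "unix_timestamp": data["unix_timestamp"][key][0],
        "cities": data["cities"][key][0],
        "user": data["user"][key][0],  # a list of user dicts
    }
    for key in data["session_id"]
]

# record_path points at the list of dicts; meta carries the scalar fields along.
df = pd.json_normalize(records, record_path="user",
                       meta=["session_id", "unix_timestamp", "cities"])

# Split the comma-separated cities and give each city its own row.
df = df.assign(cities=df["cities"].str.split(", ")).explode("cities")
df = df[["session_id", "unix_timestamp", "cities",
         "user_id", "joining_date", "country"]]
print(df)
```

The reshaping is needed because `record_path='user'` on the raw dict points at a dict keyed by "0" and "1" rather than at a list of records.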
Related links
Pandas explode list of dictionaries into rows
How to normalize json correctly by Python Pandas
JSON to pandas DataFrame
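[Answer 1]: Just thought I would share another way of extracting data from nested JSON into pandas, for anyone who visits this question in the future. Each column is extracted before the data is read into pandas. jmespath comes in handy here, since it makes traversing JSON data easy: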
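import jmespath
from pprint import pprint
# data is the parsed JSON dict (e.g. from json.load)
# the outer braces build a JMESPath multiselect hash: one output key per column
expression = jmespath.compile('''{session_id:session_id.*[],
                                  unix_timestamp : unix_timestamp.*[],
                                  cities:cities.*[],
                                  user_id : user.*[][].user_id,
                                  joining_date : user.*[][].joining_date,
                                  country : user.*[][].country
                                  }''')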
res = expression.search(data)
pprint(res)
{'cities': ['New York NY, Newark NJ',
            'New York NY, Jersey City NJ, Philadelphia PA'],
 'country': ['UK', 'DE'],
 'joining_date': ['2015-03-22', '2015-03-28'],
 'session_id': ['X061RFWB06K9V', '5AZ2X2A9BHH5U'],
 'unix_timestamp': [1442503708, 1441353991],
 'user_id': [2024, 2853]}
Read the data into pandas and split the cities into separate rows:
df = (pd.DataFrame(res)
        # split on ', ' so the city names carry no leading space
        .assign(cities = lambda x: x.cities.str.split(', '))
        .explode('cities')
     )
df
session_id unix_timestamp cities user_id joining_date country
0 X061RFWB06K9V 1442503708 New York NY 2024 2015-03-22 UK
0 X061RFWB06K9V 1442503708 Newark NJ 2024 2015-03-22 UK
1 5AZ2X2A9BHH5U 1441353991 New York NY 2853 2015-03-28 DE
1 5AZ2X2A9BHH5U 1441353991 Jersey City NJ 2853 2015-03-28 DE
1 5AZ2X2A9BHH5U 1441353991 Philadelphia PA 2853 2015-03-28 DE
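For readers who would rather not install jmespath, the same column-wise extraction can be sketched in plain Python. This is only an illustration under the assumption that the data has the column -> row key -> list layout from the question. `flatten_column` is a hypothetical helper name, meant to mirror what the `.*[]` part of the expression does.

```python
# Same structure as the question's data (assumed layout: column -> row key -> list).
data = {
    "session_id": {"0": ["X061RFWB06K9V"], "1": ["5AZ2X2A9BHH5U"]},
    "unix_timestamp": {"0": [1442503708], "1": [1441353991]},
    "cities": {"0": ["New York NY, Newark NJ"],
               "1": ["New York NY, Jersey City NJ, Philadelphia PA"]},
    "user": {"0": [[{"user_id": 2024, "joining_date": "2015-03-22", "country": "UK"}]],
             "1": [[{"user_id": 2853, "joining_date": "2015-03-28", "country": "DE"}]]},
}


def flatten_column(column):
    """Concatenate the per-row lists of one column (hypothetical helper)."""
    return [item for row in column.values() for item in row]


# The user column is nested one level deeper, so flatten it twice.
users = [user for group in flatten_column(data["user"]) for user in group]

res = {
    "session_id": flatten_column(data["session_id"]),
    "unix_timestamp": flatten_column(data["unix_timestamp"]),
    "cities": flatten_column(data["cities"]),
    "user_id": [u["user_id"] for u in users],
    "joining_date": [u["joining_date"] for u in users],
    "country": [u["country"] for u in users],
}
print(res)
```

The resulting `res` dict has the same keys and values as the jmespath result, so it can be passed to `pd.DataFrame(res)` in the same way.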
[Answer 2]: Here is another way:
df = pd.read_json(r'C:\path\file.json')
# every cell is a one-element list, so take its first element
final=df.stack().str[0].unstack()
# one row per city; explode repeats the index label for each new row
final=final.assign(cities=final['cities'].str.split(', ')).explode('cities')
# build the user columns on the same (repeated) index so each row keeps its own user
users=pd.DataFrame(final.pop('user').str[0].tolist(), index=final.index)
final=final.assign(**users)
print(final)
session_id unix_timestamp cities user_id joining_date \
0 X061RFWB06K9V 1442503708 New York NY 2024 2015-03-22
0 X061RFWB06K9V 1442503708 Newark NJ 2024 2015-03-22
1 5AZ2X2A9BHH5U 1441353991 New York NY 2853 2015-03-28
1 5AZ2X2A9BHH5U 1441353991 Jersey City NJ 2853 2015-03-28
1 5AZ2X2A9BHH5U 1441353991 Philadelphia PA 2853 2015-03-28
country
0 UK
0 UK
1 DE
1 DE
1 DE
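One detail worth keeping in mind with this approach is that `explode` repeats the original index label for every row it creates, and pandas aligns on index labels when a Series is assigned as a new column. The following is a small sketch with made-up data (the `session`, `cities`, and `user` names are illustrative only) showing both behaviours:

```python
import pandas as pd

# Made-up frame: two sessions, the first with two cities.
df = pd.DataFrame({"session": ["a", "b"],
                   "cities": [["x", "y"], ["z"]]})

exploded = df.explode("cities")
# The index label is repeated for rows that came from the same list.
print(exploded.index.tolist())

# A Series indexed by the original labels is aligned by label on assignment,
# so both rows with label 0 receive the value stored under label 0.
per_session = pd.Series(["A", "B"], index=[0, 1])
exploded["user"] = per_session
print(exploded)
```

A frame built from a plain list gets a fresh 0..n-1 index instead, so its rows would be matched against the repeated labels rather than by position.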
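[Comments]:
Why were cities and user chosen here?
@Jonnyboi I don't quite remember, since that was more than a year ago, but by the looks of it, read_json returned a list whose items we wanted as rows sharing the same session_id and unix_timestamp, so we exploded it. We then converted user (which is also a list, but one we want as columns) into a dataframe and assigned it back.
[Answer 3]:
Once you have df, you can merge the two parts: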
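df = pd.read_json('a.json')
df1 = df.drop('user',axis=1)
# each cell of df['user'] is a nested list like [[{...}]], so unwrap it to the dict first
# (pd.json_normalize is the top-level name since pandas 1.0)
df2 = pd.json_normalize([u[0][0] for u in df['user']])
df = df1.merge(df2,left_index=True,right_index=True)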
[Answer 4]: I am using explode and join
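# j is the parsed JSON (a dict), e.g. the result of json.load
s=pd.DataFrame(j).apply(lambda x : x.str[0])
s['cities']=s.cities.str.split(',')
s=s.explode('cities')
# reset to a unique 0..n-1 index so the join below lines up row by row
s.reset_index(drop=True,inplace=True)
# sum(..., []) concatenates the one-element user lists into a flat list of dicts
s=s.join(pd.DataFrame(sum(s.user.tolist(),[])))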
session_id unix_timestamp ... joining_date country
0 X061RFWB06K9V 1442503708 ... 2015-03-22 UK
1 X061RFWB06K9V 1442503708 ... 2015-03-22 UK
2 5AZ2X2A9BHH5U 1441353991 ... 2015-03-28 DE
3 5AZ2X2A9BHH5U 1441353991 ... 2015-03-28 DE
4 5AZ2X2A9BHH5U 1441353991 ... 2015-03-28 DE
[5 rows x 7 columns]
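The `sum(..., [])` call above concatenates a list of lists by repeated `+`. A common alternative for the same flattening is `itertools.chain.from_iterable`. The sketch below uses made-up, shortened user records purely for illustration and checks that both give the same result:

```python
from itertools import chain

# Made-up stand-in for s.user.tolist(): one single-element list per row.
nested = [[{"user_id": 2024}], [{"user_id": 2024}], [{"user_id": 2853}]]

flat_sum = sum(nested, [])                     # list concatenation via +
flat_chain = list(chain.from_iterable(nested)) # lazy iteration, then one list

print(flat_sum == flat_chain)
```

Both produce one dict per row, which is the shape `pd.DataFrame` needs to build the user columns.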
[Answer 5]: Here is one way:
import pandas as pd
# let's say d is your parsed JSON (a dict)
df = pd.DataFrame.from_dict(d, orient='index').T.reset_index(drop=True)
# unlist each element
df = df.applymap(lambda x: x[0])
# convert user column to multiple cols
df = pd.concat([df.drop('user', axis=1), df['user'].apply(lambda x: x[0]).apply(pd.Series)], axis=1)
session_id unix_timestamp \
0 X061RFWB06K9V 1442503708
1 5AZ2X2A9BHH5U 1441353991
cities user_id joining_date country
0 New York NY, Newark NJ 2024 2015-03-22 UK
1 New York NY, Jersey City NJ, Philadelphia PA 2853 2015-03-28 DE
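The result above still has one row per session, with all cities in a single string, whereas the required output has one row per city. A minimal sketch of that remaining step follows, using a made-up frame that mirrors a few columns of the output above (the `long_df` name is illustrative):

```python
import pandas as pd

# Made-up frame with the same shape as the result above (one row per session).
df = pd.DataFrame({
    "session_id": ["X061RFWB06K9V", "5AZ2X2A9BHH5U"],
    "cities": ["New York NY, Newark NJ",
               "New York NY, Jersey City NJ, Philadelphia PA"],
    "country": ["UK", "DE"],
})

long_df = (
    df.assign(cities=df["cities"].str.split(","))         # string -> list of names
      .explode("cities")                                   # one row per list element
      .assign(cities=lambda x: x["cities"].str.strip())    # drop the space after each comma
      .reset_index(drop=True)
)
print(long_df)
```

Because the user columns are already expanded before the explode, every city row simply inherits the values of its session.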