Extract JSON arrays into dataframe columns
Posted: 2021-10-06 17:25:51
Question:
I have a project.json file that contains data like the following, one JSON object per line:
{"student_id": "ST0001", "project": [{"subject_id": "S003", "date_of_submission": "2021-05-23 20:03:05"}, {"subject_id": "S004", "date_of_submission": "2021-05-24 21:03:05"}, {"subject_id": "S005", "date_of_submission": "2021-05-30 05:09:30"}], "project_year": "Second"}
{"student_id": "ST0002", "project": [{"subject_id": "S003", "date_of_submission": "2021-06-02 15:05:05"}, {"subject_id": "S007", "date_of_submission": "2021-04-28 21:03:01"}], "project_year": "Second"}
{"student_id": "ST0002", "project": [{"subject_id": "S0018", "date_of_submission": "2020-06-03 08:15:21"}], "project_year": "First"}
I need to extract the nested subject_id and date_of_submission values into separate columns, like this:
student_id | subject_id | date_of_submission | project_year |
---|---|---|---|
ST0001 | S003 | 23/05/2021 20:03 | Second |
ST0001 | S004 | 24/05/2021 21:03 | Second |
ST0001 | S005 | 30/05/2021 05:09 | Second |
ST0002 | S003 | 02/06/2021 15:05 | Second |
ST0002 | S007 | 28/04/2021 21:03 | Second |
ST0002 | S0018 | 03/06/2020 08:15 | First |
I think json_normalize can be used to extract one level. Could someone help me finish this?
import pandas as pd
df=pd.read_json('project.json', lines=True)
df = pd.DataFrame(df).explode('project')
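One way the explode attempt above could be carried through is sketched below. It is only an illustration: the inline `records` list stands in for project.json, the names `exploded`, `details` and `flat` are made up for this example, and it assumes every `project` list is non-empty.

```python
import pandas as pd

# Inline sample mirroring the question's records (stands in for project.json).
records = [
    {"student_id": "ST0001",
     "project": [{"subject_id": "S003", "date_of_submission": "2021-05-23 20:03:05"},
                 {"subject_id": "S004", "date_of_submission": "2021-05-24 21:03:05"}],
     "project_year": "Second"},
    {"student_id": "ST0002",
     "project": [{"subject_id": "S0018", "date_of_submission": "2020-06-03 08:15:21"}],
     "project_year": "First"},
]
df = pd.DataFrame(records)

# One row per element of the "project" list; reset the index so it is unique again.
exploded = df.explode("project").reset_index(drop=True)

# Expand each dict in "project" into its own columns.
details = pd.json_normalize(exploded["project"].tolist())

# Put the expanded columns next to the remaining ones and order them as in the question.
flat = pd.concat([exploded.drop(columns="project"), details], axis=1)
flat = flat[["student_id", "subject_id", "date_of_submission", "project_year"]]
```

After explode, each cell of `project` holds a single dict, which is why `json_normalize` can expand it in one step.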
Comments:
Have you checked ***.com/questions/39899005/…? Did it solve your problem?
Answer 1:
You can try the record_path and meta parameters of json_normalize():
# here s is your JSON data, as a list of records
s = pd.read_json('project.json', lines=True).melt()['value'].tolist()
df = pd.json_normalize(s, record_path=['project'], meta=['student_id', 'project_year'])
Output of df:
subject_id date_of_submission student_id project_year
0 S003 2021-05-23 20:03:05 ST0001 Second
1 S004 2021-05-24 21:03:05 ST0001 Second
2 S005 2021-05-30 05:09:30 ST0001 Second
3 S003 2021-06-02 15:05:05 ST0002 Second
4 S007 2021-04-28 21:03:01 ST0002 Second
5 S0018 2020-06-03 08:15:21 ST0002 First
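The output above still has the original timestamp format and column order. A minimal sketch of getting from there to the table requested in the question follows. It applies `record_path` and `meta` to an in-memory list; the variable names `data` and `out` are illustrative, and the `%d/%m/%Y %H:%M` format is inferred from the question's example table.

```python
import pandas as pd

# In-memory copy of the question's three records.
data = [
    {"student_id": "ST0001",
     "project": [{"subject_id": "S003", "date_of_submission": "2021-05-23 20:03:05"},
                 {"subject_id": "S004", "date_of_submission": "2021-05-24 21:03:05"},
                 {"subject_id": "S005", "date_of_submission": "2021-05-30 05:09:30"}],
     "project_year": "Second"},
    {"student_id": "ST0002",
     "project": [{"subject_id": "S003", "date_of_submission": "2021-06-02 15:05:05"},
                 {"subject_id": "S007", "date_of_submission": "2021-04-28 21:03:01"}],
     "project_year": "Second"},
    {"student_id": "ST0002",
     "project": [{"subject_id": "S0018", "date_of_submission": "2020-06-03 08:15:21"}],
     "project_year": "First"},
]

# record_path selects the nested list to unpack; meta repeats the parent fields on each row.
out = pd.json_normalize(data, record_path=["project"], meta=["student_id", "project_year"])

# Parse the timestamps, then render them as day/month/year hour:minute strings.
out["date_of_submission"] = (
    pd.to_datetime(out["date_of_submission"]).dt.strftime("%d/%m/%Y %H:%M")
)

# Reorder the columns to match the layout asked for in the question.
out = out[["student_id", "subject_id", "date_of_submission", "project_year"]]
```

Note that `strftime` turns the column back into plain strings, so this step is best left until after any date arithmetic or sorting.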
Comments:
I changed data to df in your answer, and it raises TypeError: 'int' object is not subscriptable. Can you tell me what is going wrong?
@MNAf It works for the sample JSON you gave... Sorry, without the data I can't tell.
@MNAf We read the JSON file with pd.read_json() like this... I have updated the answer; I hope it is clear now :)
user:14289892 I have multiple JSON files appended together, and they do not have square brackets like the data variable. Is there a way to use df=pd.read_json('project.json', lines=True)?
I was still getting an error, so I tweaked your answer slightly and it worked:
import json
read_json_to_df = pd.read_json('project.json', lines=True)
json_struct = json.loads(read_json_to_df.to_json(orient="records"))
df = pd.json_normalize(json_struct, record_path=['project'], meta=['student_id', 'project_year'])
df
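The workaround in the last comment serializes the frame back to JSON and parses it again. Two shorter routes are sketched below, assuming the file holds one JSON object per line with no surrounding square brackets. The temporary file only makes the example self-contained, and names such as `parsed`, `via_pandas` and `result` are illustrative rather than taken from the thread.

```python
import json
import os
import tempfile

import pandas as pd

# Two sample lines in JSON Lines form: one object per line, no enclosing brackets.
sample_lines = [
    '{"student_id": "ST0001", "project": [{"subject_id": "S003", '
    '"date_of_submission": "2021-05-23 20:03:05"}, {"subject_id": "S004", '
    '"date_of_submission": "2021-05-24 21:03:05"}], "project_year": "Second"}',
    '{"student_id": "ST0002", "project": [{"subject_id": "S0018", '
    '"date_of_submission": "2020-06-03 08:15:21"}], "project_year": "First"}',
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "project.json")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(sample_lines) + "\n")

    # Route 1: parse each non-empty line with the standard json module.
    with open(path, encoding="utf-8") as fh:
        parsed = [json.loads(line) for line in fh if line.strip()]

    # Route 2: let pandas read the lines, then convert the rows back to dicts.
    via_pandas = pd.read_json(path, lines=True).to_dict(orient="records")

# Either list of dicts can be passed to json_normalize as in the answer.
result = pd.json_normalize(parsed, record_path=["project"],
                           meta=["student_id", "project_year"])
result_pandas = pd.json_normalize(via_pandas, record_path=["project"],
                                  meta=["student_id", "project_year"])
```

Route 1 avoids pandas type inference on the top-level fields; route 2 stays closer to the `pd.read_json(..., lines=True)` call the asker wanted to keep.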