Building a pandas DataFrame from a nested structure in Python
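I want to do machine learning on a dataset whose structure is somewhat complex. I would like to work with pandas and then use some of the built-in models in scikit-learn.
The data comes as a JSON file. A sample looks like this: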
{
  "demo_Profile": {
    "sex": "male",
    "age": 98,
    "height": 160,
    "weight": 139,
    "bmi": 5,
    "someinfo1": [
      "some_more_info1"
    ],
    "someinfo2": [
      "some_more_inf2"
    ],
    "someinfo3": [
      "some_more_info3"
    ]
  },
  "event": {
    "info_personal": {
      "info1": 219.59,
      "info2": 129.18,
      "info3": 41.15,
      "info4": 94.19
    },
    "symptoms": [
      {
        "name": "name1",
        "socrates": {
          "associations": [
            "associations1"
          ],
          "onsetType": "onsetType1",
          "timeCourse": "timeCourse1"
        }
      },
      {
        "name": "name2",
        "socrates": {
          "timeCourse": "timeCourse2"
        }
      },
      {
        "name": "name3",
        "socrates": {
          "onsetType": "onsetType2"
        }
      },
      {
        "name": "name4",
        "socrates": {
          "onsetType": "onsetType3"
        }
      },
      {
        "name": "name5",
        "socrates": {
          "associations": [
            "associations2"
          ]
        }
      }
    ],
    "labs": [
      {
        "name": "name1 ",
        "value": "valuelab"
      }
    ]
  }
}
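I want to create a pandas DataFrame that takes this nested data into account, but I don't know how to build one that handles the nested parameters in addition to the single-valued ones.
For example, I don't know how to combine "demo_Profile", which holds single values, with "symptoms", which is a list of entries whose fields are single values in some cases and lists in others.
Does anyone know a way to handle this?
Edit:
The JSON shown above is only an example. In other cases the number of values in the lists will differ, and so will the number of symptoms. The structure shown above is therefore not fixed for every record.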
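A quick and easy way to flatten JSON data is the flatten_json package, which can be installed with pip:
pip install flatten_json
I assume you have a list containing many entries like the one you posted. In that case, the following code gives you the desired result: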
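import pandas as pd
from flatten_json import flatten
# Placeholder: a list with one nested dict per patient
json_data = [{...patient1...}, {patient2...}, ...]
# Flatten each entry into a single-level dict, one DataFrame row per patient
flattened = (flatten(entry) for entry in json_data)
df = pd.DataFrame(flattened)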
In the flattened data, list entries get a numeric suffix (I added a second patient with an additional entry in the "labs" list):
(Shown transposed for readability: one line per DataFrame column, one column per DataFrame row.)
+------------------------------------------+-----------------+-----------------+
| column                                   | row 0           | row 1           |
+------------------------------------------+-----------------+-----------------+
| demo_Profile_age                         | 98              | 98              |
| demo_Profile_bmi                         | 5               | 5               |
| demo_Profile_height                      | 160             | 160             |
| demo_Profile_sex                         | male            | male            |
| demo_Profile_someinfo1_0                 | some_more_info1 | some_more_info1 |
| demo_Profile_someinfo2_0                 | some_more_inf2  | some_more_inf2  |
| demo_Profile_someinfo3_0                 | some_more_info3 | some_more_info3 |
| demo_Profile_weight                      | 139             | 139             |
| event_info_personal_info1                | 219.59          | 219.59          |
| event_info_personal_info2                | 129.18          | 129.18          |
| event_info_personal_info3                | 41.15           | 41.15           |
| event_info_personal_info4                | 94.19           | 94.19           |
| event_labs_0_name                        | name1           | name1           |
| event_labs_0_value                       | valuelab        | valuelab        |
| event_labs_1_name                        | NaN             | name2           |
| event_labs_1_value                       | NaN             | valuelabr2      |
| event_symptoms_0_name                    | name1           | name1           |
| event_symptoms_0_socrates_associations_0 | associations1   | associations1   |
| event_symptoms_0_socrates_onsetType      | onsetType1      | onsetType1      |
| event_symptoms_0_socrates_timeCourse     | timeCourse1     | timeCourse1     |
| event_symptoms_1_name                    | name2           | name2           |
| event_symptoms_1_socrates_timeCourse     | timeCourse2     | timeCourse2     |
| event_symptoms_2_name                    | name3           | name3           |
| event_symptoms_2_socrates_onsetType      | onsetType2      | onsetType2      |
| event_symptoms_3_name                    | name4           | name4           |
| event_symptoms_3_socrates_onsetType      | onsetType3      | onsetType3      |
| event_symptoms_4_name                    | name5           | name5           |
| event_symptoms_4_socrates_associations_0 | associations2   | associations2   |
+------------------------------------------+-----------------+-----------------+
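To make the numeric suffixes less mysterious, here is a minimal pure-Python sketch of the same idea. The helper `flatten_record` is hypothetical and written only for illustration; it is not the flatten_json implementation, and details such as the handling of empty lists may differ from the real package.

```python
def flatten_record(obj, parent_key="", sep="_"):
    """Recursively flatten nested dicts and lists into a single-level dict.

    Dict keys and list positions are joined with `sep`, so the first
    element of a list under "labs" ends up under the key "labs_0".
    """
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        # Scalar leaf: store it under the key path built so far.
        return {parent_key: obj}

    flat = {}
    for key, value in items:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        flat.update(flatten_record(value, new_key, sep))
    return flat


# A shortened version of the sample record from the question.
patient = {
    "demo_Profile": {"sex": "male", "age": 98, "someinfo1": ["some_more_info1"]},
    "event": {
        "symptoms": [
            {"name": "name1", "socrates": {"onsetType": "onsetType1"}},
            {"name": "name2", "socrates": {"timeCourse": "timeCourse2"}},
        ]
    },
}

flat = flatten_record(patient)
for key, value in flat.items():
    print(key, "=", value)
```

Because every patient can have a different number of symptoms, the set of keys differs between records. Passing a list of such dicts to `pd.DataFrame` fills the missing columns with NaN, which is what the table above shows for the second "labs" entry.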
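The flatten function accepts additional parameters for removing unwanted columns or prefixes.
Note: although this approach gives you the flattened DataFrame you asked for, I expect you will run into other problems when feeding the dataset to a machine learning algorithm, depending on what your prediction target is and how you want to encode the data as features.
Consider pandas's json_normalize. However, because the data contains even deeper nesting, consider processing each part separately and then concatenating the results, forward-filling the "normalized" columns.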
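As one illustration of the encoding question raised in the note, the sketch below turns a variable-length symptom list into a fixed-width multi-hot vector (one 0/1 column per symptom name). This is only one possible encoding and not something the answer prescribes; the `patients` list and the `encode` helper are made up for the example.

```python
# Three illustrative patients with a different number of symptoms each.
patients = [
    {"event": {"symptoms": [{"name": "name1"}, {"name": "name3"}]}},
    {"event": {"symptoms": [{"name": "name2"}]}},
    {"event": {"symptoms": []}},
]

# Vocabulary: every symptom name seen in the data, in a stable order.
vocab = sorted({s["name"] for p in patients for s in p["event"]["symptoms"]})


def encode(patient):
    """Return one 0/1 flag per vocabulary entry for this patient."""
    present = {s["name"] for s in patient["event"]["symptoms"]}
    return [1 if name in present else 0 for name in vocab]


rows = [encode(p) for p in patients]
print(vocab)
for row in rows:
    print(row)
```

Every patient now has the same number of features regardless of how many symptoms were recorded, which is the shape most scikit-learn estimators expect. The result can be wrapped with `pd.DataFrame(rows, columns=vocab)` if a DataFrame is needed.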
import json
import pandas as pd

# Note: the original answer imported json_normalize from pandas.io.json;
# since pandas 1.0 the function is available as pd.json_normalize.
with open('myfile.json', 'r') as f:
    data = json.load(f)

# Normalize each part separately, then place the results side by side
final_df = pd.concat([pd.json_normalize(data['demo_Profile']),
                      pd.json_normalize(data['event']['symptoms']),
                      pd.json_normalize(data['event']['info_personal']),
                      pd.json_normalize(data['event']['labs'])], axis=1)

# FLATTEN NESTED LISTS (keeps only the first element of each list)
n_list = ['someinfo1', 'someinfo2', 'someinfo3', 'socrates.associations']
final_df[n_list] = final_df[n_list].apply(lambda col:
    col.apply(lambda x: x[0] if isinstance(x, list) else x))

# FILLING FORWARD (repeat the single-row values on every symptom row)
norm_list = ['age', 'bmi', 'height', 'weight', 'sex', 'someinfo1', 'someinfo2', 'someinfo3',
             'info1', 'info2', 'info3', 'info4', 'name', 'value']
final_df[norm_list] = final_df[norm_list].ffill()
Output:
print(final_df)
# age bmi height sex someinfo1 someinfo2 someinfo3 weight name socrates.associations socrates.onsetType socrates.timeCourse info1 info2 info3 info4 name value
# 0 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name1 associations1 onsetType1 timeCourse1 219.59 129.18 41.15 94.19 name1 valuelab
# 1 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name2 NaN NaN timeCourse2 219.59 129.18 41.15 94.19 name1 valuelab
# 2 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name3 NaN onsetType2 NaN 219.59 129.18 41.15 94.19 name1 valuelab
# 3 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name4 NaN onsetType3 NaN 219.59 129.18 41.15 94.19 name1 valuelab
# 4 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name5 associations2 NaN NaN 219.59 129.18 41.15 94.19 name1 valuelab
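The same "one row per symptom" layout can also be built for several patients without forward-filling, by normalizing each patient's symptom list and attaching the patient-level values as constant columns. The sketch below is a variation on the answer, not the answer's own code; it assumes pandas 1.0 or later, and the second patient and the `patient_id` column are invented for the example.

```python
import pandas as pd

# Two illustrative patients; the second one is made up for this example.
patients = [
    {
        "demo_Profile": {"sex": "male", "age": 98},
        "event": {
            "symptoms": [
                {"name": "name1", "socrates": {"onsetType": "onsetType1"}},
                {"name": "name2", "socrates": {"timeCourse": "timeCourse2"}},
            ]
        },
    },
    {
        "demo_Profile": {"sex": "female", "age": 40},
        "event": {
            "symptoms": [
                {"name": "name3", "socrates": {"onsetType": "onsetType2"}},
            ]
        },
    },
]

frames = []
for patient_id, patient in enumerate(patients):
    # One row per symptom; nested "socrates" keys become "socrates.<key>".
    symptoms = pd.json_normalize(patient["event"]["symptoms"])
    profile = patient["demo_Profile"]
    # Scalars passed to assign() are repeated on every row of this patient.
    symptoms = symptoms.assign(
        patient_id=patient_id, sex=profile["sex"], age=profile["age"]
    )
    frames.append(symptoms)

# Stack all patients; columns missing for a patient are filled with NaN.
long_df = pd.concat(frames, ignore_index=True)
print(long_df)
```

Keeping an explicit `patient_id` makes it possible to group the rows back per patient later, which the side-by-side concatenation above does not offer once several patients are involved.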