How to join JSON with Pandas
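[Title]: How to join json with Pandas [Posted]: 2021-04-05 08:39:34 [Question]: The goal here is to count the number of patients for each diagnosis. In the patient records the visit ID is unique, whereas in the diagnosis records the same visit ID can appear with several diagnosis IDs, because one visit may produce more than one diagnosis.
To do this, I think the two DataFrames need to be linked on the field Visit id. Can anyone shed light on how to link the two JSON documents with Pandas and count the number of patients per diagnosis? Many thanks.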
Patient records
JSON [Patient record]
[
  {
    "Doctor id": "AU1254",
    "Patient": [
      {
        "Patient id": "BK1221",
        "Patient name": "Tim"
      }
    ],
    "Visit id": "B0001"
  },
  {
    "Doctor id": "AU8766",
    "Patient": [
      {
        "Patient id": "BK1209",
        "Patient name": "Sue"
      }
    ],
    "Visit id": "B0002"
  },
  {
    "Doctor id": "AU1254",
    "Patient": [
      {
        "Patient id": "BK1323",
        "Patient name": "Sary"
      }
    ],
    "Visit id": "B0003"
  }
]
Diagnosis records
JSON [Diagnosis record]
[
  {
    "Visit id": "B0001",
    "Diagnosis": [
      {
        "diagnosis id": "D1001",
        "diagnosis name": "fever"
      },
      {
        "diagnosis id": "D1987",
        "diagnosis name": "cough"
      },
      {
        "diagnosis id": "D1265",
        "diagnosis name": "running nose"
      }
    ]
  },
  {
    "Visit id": "B0002",
    "Diagnosis": [
      {
        "diagnosis id": "D1987",
        "diagnosis name": "cough"
      },
      {
        "diagnosis id": "D1453",
        "diagnosis name": "stomach ache"
      }
    ]
  }
]
[Answer 1]: You can use a left merge() on Visit id.
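import json
import pandas as pd

# Paste each JSON document between the triple quotes so it is a valid Python string
json1 = '''<your first json here>'''
json2 = '''<your second json here>'''
# Flatten the nested lists; the meta fields are repeated on every flattened row
df1 = pd.json_normalize(data=json.loads(json1), record_path='Patient', meta=['Doctor id', 'Visit id'])
df2 = pd.json_normalize(data=json.loads(json2), record_path='Diagnosis', meta=['Visit id'])
# Left merge on the visit ID, then drop visits that have no diagnosis (here B0003)
df3 = df1.merge(df2, on='Visit id', how='left').dropna()
print(df3)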
  Patient id Patient name Doctor id Visit id diagnosis id diagnosis name
0     BK1221          Tim    AU1254    B0001        D1001          fever
1     BK1221          Tim    AU1254    B0001        D1987          cough
2     BK1221          Tim    AU1254    B0001        D1265   running nose
3     BK1209          Sue    AU8766    B0002        D1987          cough
4     BK1209          Sue    AU8766    B0002        D1453   stomach ache
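A left merge followed by dropna() keeps the same rows as an inner merge here, because the only missing values come from visits without a diagnosis. The sketch below illustrates this on small hand-built frames that stand in for the flattened df1 and df2 (the frame names patients and diagnoses are made up for this example), and uses merge's indicator=True option to list the visits that found no match.

```python
import pandas as pd

# Hypothetical stand-ins for the flattened patient and diagnosis frames
patients = pd.DataFrame({
    "Patient id": ["BK1221", "BK1209", "BK1323"],
    "Patient name": ["Tim", "Sue", "Sary"],
    "Visit id": ["B0001", "B0002", "B0003"],
})
diagnoses = pd.DataFrame({
    "Visit id": ["B0001", "B0001", "B0001", "B0002", "B0002"],
    "diagnosis name": ["fever", "cough", "running nose", "cough", "stomach ache"],
})

# indicator=True adds a "_merge" column telling where each row came from
left = patients.merge(diagnoses, on="Visit id", how="left", indicator=True)

# Patients whose visit has no diagnosis row at all
unmatched = left.loc[left["_merge"] == "left_only", "Patient name"].tolist()

# An inner merge skips those unmatched visits directly
inner = patients.merge(diagnoses, on="Visit id", how="inner")

print(unmatched)
print(len(left), len(inner))
```

Keeping the left merge and inspecting "_merge" is useful when a missing diagnosis should be noticed rather than silently dropped.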
With the merged frame stored as df3, you can also do some fancier grouping/printing:
pd.pivot_table(df3, index=['Patient id', 'Patient name', 'Doctor id', 'Visit id'], values=['diagnosis id', 'diagnosis name'], aggfunc=list)
                                                    diagnosis id               diagnosis name
Patient id Patient name Doctor id Visit id
BK1209     Sue          AU8766    B0002           [D1987, D1453]        [cough, stomach ache]
BK1221     Tim          AU1254    B0001    [D1001, D1987, D1265]  [fever, cough, running nose]
Count per diagnosis, with the patients listed:
df3.groupby(['diagnosis id', 'diagnosis name']).agg({'Patient name': [list, 'count']})
                            Patient name
                                    list count
diagnosis id diagnosis name
D1001        fever                 [Tim]     1
D1265        running nose          [Tim]     1
D1453        stomach ache          [Sue]     1
D1987        cough            [Tim, Sue]     2
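Note that 'count' counts rows, not distinct patients. If the same patient could come back for another visit with the same diagnosis, nunique() on the patient ID may be closer to "number of patients per diagnosis". A minimal sketch, assuming a merged frame shaped like the one above plus one made-up repeat visit (B0004 is hypothetical and not part of the question's data):

```python
import pandas as pd

# Merged rows as in the output above, plus one hypothetical repeat visit (B0004)
merged = pd.DataFrame({
    "Patient id": ["BK1221", "BK1221", "BK1221", "BK1209", "BK1209", "BK1221"],
    "Patient name": ["Tim", "Tim", "Tim", "Sue", "Sue", "Tim"],
    "Visit id": ["B0001", "B0001", "B0001", "B0002", "B0002", "B0004"],
    "diagnosis name": ["fever", "cough", "running nose", "cough", "stomach ache", "cough"],
})

by_diagnosis = merged.groupby("diagnosis name")["Patient id"]

rows_per = by_diagnosis.count()        # one per (visit, diagnosis) row
patients_per = by_diagnosis.nunique()  # one per distinct patient

print(rows_per["cough"], patients_per["cough"])
```

On the question's original data both numbers agree, since every patient has exactly one visit.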
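[Discussion]:
Hi @Danail, I'm getting a syntax error: EOL while scanning string literal. May I know whether json1 is a string enclosed in ' '?
Just paste the JSON content as a string and it will work. Like this: json1 = ''' copy -> paste '''
[Answer 2]: Try this (x --> JSON [Patient record], y --> JSON [Diagnosis record], both already loaded as Python lists):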
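import pandas as pd

df = pd.DataFrame(x)
# Each 'Patient' cell is a one-element list; expand its dict into columns
df = pd.concat([df.pop('Patient').apply(lambda x: pd.Series(x[0])), df], axis=1)
# One row per diagnosis; reset the index so the concat below aligns row by row
df1 = pd.DataFrame(y).explode('Diagnosis').reset_index(drop=True)
df1 = pd.concat([df1.pop('Diagnosis').apply(pd.Series), df1], axis=1)
# A right merge keeps every diagnosis row
df_merge = pd.merge(df, df1, on='Visit id', how='right')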
df_merge:
  Patient id Patient name Doctor id Visit id diagnosis id diagnosis name
0     BK1221          Tim    AU1254    B0001        D1001          fever
1     BK1221          Tim    AU1254    B0001        D1987          cough
2     BK1221          Tim    AU1254    B0001        D1265   running nose
3     BK1209          Sue    AU8766    B0002        D1987          cough
4     BK1209          Sue    AU8766    B0002        D1453   stomach ache
Count:
df_merge.groupby('diagnosis name')['Patient id'].count()
Edit:
Try:
df_merge.groupby('diagnosis name').agg({'Patient name': [list, 'count']}).reset_index()
diagnosis name  Patient name
                        list  count
cough             [Tim, Sue]      2
fever                  [Tim]      1
running nose           [Tim]      1
stomach ache           [Sue]      1
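For reference, the whole path from the two parsed JSON lists to the per-diagnosis counts can be sketched as one small function. This is only one possible arrangement: it swaps the explode/concat steps for pd.json_normalize, uses an inner merge, and counts distinct patient IDs. The function name count_patients_per_diagnosis is made up for this example.

```python
import pandas as pd

def count_patients_per_diagnosis(patient_records, diagnosis_records):
    """Return the number of distinct patients for each diagnosis name."""
    # Flatten the nested lists; "Visit id" is carried onto every flattened row
    patients = pd.json_normalize(patient_records, record_path="Patient", meta=["Visit id"])
    diagnoses = pd.json_normalize(diagnosis_records, record_path="Diagnosis", meta=["Visit id"])
    # Keep only visits that appear in both inputs
    merged = patients.merge(diagnoses, on="Visit id", how="inner")
    return merged.groupby("diagnosis name")["Patient id"].nunique()

# x and y mirror the question's data, already parsed into Python lists
x = [
    {"Doctor id": "AU1254", "Patient": [{"Patient id": "BK1221", "Patient name": "Tim"}], "Visit id": "B0001"},
    {"Doctor id": "AU8766", "Patient": [{"Patient id": "BK1209", "Patient name": "Sue"}], "Visit id": "B0002"},
    {"Doctor id": "AU1254", "Patient": [{"Patient id": "BK1323", "Patient name": "Sary"}], "Visit id": "B0003"},
]
y = [
    {"Visit id": "B0001", "Diagnosis": [
        {"diagnosis id": "D1001", "diagnosis name": "fever"},
        {"diagnosis id": "D1987", "diagnosis name": "cough"},
        {"diagnosis id": "D1265", "diagnosis name": "running nose"},
    ]},
    {"Visit id": "B0002", "Diagnosis": [
        {"diagnosis id": "D1987", "diagnosis name": "cough"},
        {"diagnosis id": "D1453", "diagnosis name": "stomach ache"},
    ]},
]

counts = count_patients_per_diagnosis(x, y)
print(counts)
```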
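[Discussion]:
This may help: ***.com/questions/50839737/…
Hi @Pygirl, for the diagnosis "cough" the count is not 2.
I was grouping by id and then by name. I have updated my answer :)
Hi, since I want to count the number of patients per diagnosis, and the number of patients diagnosed with cough is 2 (Tim and Sue), could you advise?
@epiphany: check now.
[Answer 3]: Try the following for the patient records.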
import pandas as pd

patients_df = pd.read_json('patients.json')
patient_id = []
patient_name = []
# Collect the attributes nested inside the 'Patient' column (a one-element list per row)
for patient in patients_df['Patient']:
    patient_id.append(patient[0]['Patient id'])
    patient_name.append(patient[0]['Patient name'])
# Add the collected values to the DataFrame as new columns
patients_df['Patient name'] = patient_name
patients_df['Patient id'] = patient_id
# Drop the nested 'Patient' column
patients_df = patients_df.drop(columns='Patient')
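The same flattening can be written without the explicit bookkeeping lists. The sketch below assumes, like the answer, that every 'Patient' cell is a list holding exactly one dict, and it builds the frame from an in-memory list instead of reading patients.json so that it runs on its own.

```python
import pandas as pd

# In-memory copy of the patient records (stands in for patients.json)
records = [
    {"Doctor id": "AU1254", "Patient": [{"Patient id": "BK1221", "Patient name": "Tim"}], "Visit id": "B0001"},
    {"Doctor id": "AU8766", "Patient": [{"Patient id": "BK1209", "Patient name": "Sue"}], "Visit id": "B0002"},
    {"Doctor id": "AU1254", "Patient": [{"Patient id": "BK1323", "Patient name": "Sary"}], "Visit id": "B0003"},
]
patients_df = pd.DataFrame(records)

# Turn the first (only) dict of each 'Patient' list into its own columns
first_patients = pd.DataFrame(
    [p[0] for p in patients_df["Patient"]],
    index=patients_df.index,
)

# Replace the nested column with the flattened ones
flat = pd.concat([patients_df.drop(columns="Patient"), first_patients], axis=1)
print(flat)
```

If a record could hold several patients, taking only element 0 would silently drop the rest; in that case a record_path-based pd.json_normalize call is the safer choice.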