I need to un-nest JSON array elements and ensure correct mapping with the 'ID' column
【Title】: I need to Un-nest JSON array elements AND ensure correct mapping with 'ID' column
【Posted】: 2018-10-02 09:42:06
【Question】: The input DataFrame "df" looks like this (note the values in the 'id' column):
| id   | name                                                                         |
|------|------------------------------------------------------------------------------|
| a1xy | [{"event": "sports", "start": "100"}, {"event": "lunch", "start": "121"}]    |
| a7yz | [{"event": "lunch", "start": "109"}, {"event": "movie", "start": "97"}]      |
| bx4y | [{"event": "dinner", "start": "78"}, {"event": "sleep", "start": "25"}]      |
I want to flatten the JSON array elements so that the resulting output is:
| id | name.event | name.start |
|-------|------------|------------|
| a1xy | sports | 100 |
| a1xy | lunch | 121 |
| a7yz | lunch | 109 |
| a7yz | movie | 97 |
| bx4y | dinner | 78 |
| bx4y | sleep | 25 |
The values in the 'id' column need to be mapped correctly to each flattened row. How can I do this in Python?
I tried:
k = df.name.map(json.loads).apply(pd.DataFrame).tolist()
final_df = pd.concat(k)
But this does not carry the values of the 'id' column over to the result.
【Comments】:
pandas.pydata.org/pandas-docs/stable/generated/… Is the input JSON? Can you use json_normalize?
【Answer 1】:
Assuming you have a list of JSON objects as the following input:
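# pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize
data = [{'id': 'a1xy', 'name': [{'event': 'sports', 'start': '100'}, {'event': 'lunch', 'start': '121'}]},
        {'id': 'a7yz', 'name': [{'event': 'lunch', 'start': '109'}, {'event': 'movie', 'start': '97'}]},
        {'id': 'bx4y', 'name': [{'event': 'dinner', 'start': '78'}, {'event': 'sleep', 'start': '25'}]}]
# record_path picks the nested list to un-nest; meta repeats 'id' on every resulting row
df = json_normalize(data, record_path='name', meta='id', record_prefix='name.')
print(df)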
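If the starting point is the question's DataFrame rather than a list of dicts, the same call can be fed from it. The following is only a sketch, assuming the 'name' column holds JSON strings (as the question's use of json.loads suggests) and that pandas 1.0 or later is installed so that pd.json_normalize is available; the names records and flat are illustrative and not from the original answer.

```python
import json

import pandas as pd

# Sample frame shaped like the question's input: 'name' holds JSON strings (assumption).
df = pd.DataFrame({
    "id": ["a1xy", "a7yz", "bx4y"],
    "name": [
        '[{"event": "sports", "start": "100"}, {"event": "lunch", "start": "121"}]',
        '[{"event": "lunch", "start": "109"}, {"event": "movie", "start": "97"}]',
        '[{"event": "dinner", "start": "78"}, {"event": "sleep", "start": "25"}]',
    ],
})

# Parse each JSON string, then turn every row into {'id': ..., 'name': [...]}.
records = df.assign(name=df["name"].map(json.loads)).to_dict("records")

# Un-nest the 'name' list and repeat 'id' on every resulting row.
flat = pd.json_normalize(records, record_path="name", meta="id", record_prefix="name.")

# Reorder the columns to match the layout asked for in the question.
flat = flat[["id", "name.event", "name.start"]]
print(flat)
```

Because to_dict("records") keeps each id next to its own list, the mapping between 'id' and the un-nested rows is preserved.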
【Comments】:
【Answer 2】: You can use a list comprehension with flattening, update each dictionary with the id value, and finally call the DataFrame constructor:
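import json
import pandas as pd
# parse the JSON strings into lists of dicts
df['name'] = df['name'].map(json.loads)
# one output row per dict, with the row's id added to it
df = pd.DataFrame([dict(y, id=i) for i, x in zip(df['id'], df['name']) for y in x])
print(df)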
event id start
0 sports a1xy 100
1 lunch a1xy 121
2 lunch a7yz 109
3 movie a7yz 97
4 dinner bx4y 78
5 sleep bx4y 25
However, if the input is JSON, it is better to use json_normalize.
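The list comprehension above yields columns named event, id and start. If the exact layout from the question (id, name.event, name.start) is wanted, the keys can be prefixed while the rows are built. This is a minimal sketch, again assuming 'name' holds JSON strings; rows and result are illustrative names.

```python
import json

import pandas as pd

# Sample frame shaped like the question's input: 'name' holds JSON strings (assumption).
df = pd.DataFrame({
    "id": ["a1xy", "a7yz", "bx4y"],
    "name": [
        '[{"event": "sports", "start": "100"}, {"event": "lunch", "start": "121"}]',
        '[{"event": "lunch", "start": "109"}, {"event": "movie", "start": "97"}]',
        '[{"event": "dinner", "start": "78"}, {"event": "sleep", "start": "25"}]',
    ],
})

# One dict per nested item: the row's id plus every key prefixed with 'name.'.
rows = [
    {"id": i, **{f"name.{key}": value for key, value in item.items()}}
    for i, raw in zip(df["id"], df["name"])
    for item in json.loads(raw)
]

# Passing columns= fixes the column order explicitly.
result = pd.DataFrame(rows, columns=["id", "name.event", "name.start"])
print(result)
```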
Timings:
df = pd.DataFrame([
    ['a1xy', [{"event": "sports", "start": "100"}, {"event": "lunch", "start": "121"}]],
    ['a7yz', [{"event": "lunch", "start": "109"}, {"event": "movie", "start": "97"}]],
    ['bx4y', [{"event": "dinner", "start": "78"}, {"event": "sleep", "start": "25"}]]],
    columns=['id', 'name'])
print(df)
#3k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [276]: %%timeit
...: pd.DataFrame([dict(y, id=i) for i, x in zip(df['id'],df['name']) for y in x])
9.49 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [277]: %%timeit
...: finalArray=[]
...: df.apply(lambda x: addtoArray(x,finalArray),axis=1)
...: pd.DataFrame(finalArray,columns=['col1','event','start'])
...:
1.81 s ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The list comprehension solution is roughly 190x faster (9.49 ms vs. 1.81 s per loop).
【Comments】:
How can I "programmatically" add the elements of the 'id' column into the 'name' column? I want to use json_normalize.
@Symphony - What does the json look like?
@Symphony - Is it like [{'id': 'a1xy', 'name': [{'event': 'sports', 'start': '100'}, {'event': 'lunch', 'start': '121'}]}, {'id': 'a7yz', 'name': [{'event': 'lunch', 'start': '109'}, {'event': 'movie', 'start': '97'}]}, {'id': 'bx4y', 'name': [{'event': 'dinner', 'start': '78'}, {'event': 'sleep', 'start': '25'}]}]?
[{"event": "sports", "start": "100"}, {"event": "lunch", "start": "121"}]
[{"id": "a1xy", "event": "sports", "start": "100"}, {"id": "a1xy", "event": "lunch", "start": "121"}]
【Answer 3】:
You can also call an external function from within apply:
import numpy as np
import pandas as pd
data = pd.DataFrame([
    ['a1xy', [{"event": "sports", "start": "100"}, {"event": "lunch", "start": "121"}]],
    ['a7yz', [{"event": "lunch", "start": "109"}, {"event": "movie", "start": "97"}]],
    ['bx4y', [{"event": "dinner", "start": "78"}, {"event": "sleep", "start": "25"}]]], columns=['id', 'name'])
def addtoArray(x, finalArray):
    # build a small frame from the row's list of dicts, prepend the id as column 0, collect the rows
    finalArray.extend(np.insert(pd.DataFrame(x['name']).values, 0, x['id'], axis=1).tolist())
finalArray = []
# apply is used only for its side effect of filling finalArray, row by row
data.apply(lambda x: addtoArray(x, finalArray), axis=1)
finalArray = pd.DataFrame(finalArray, columns=['col1', 'event', 'start'])
print(finalArray)
col1 event start
0 a1xy sports 100
1 a1xy lunch 121
2 a7yz lunch 109
3 a7yz movie 97
4 bx4y dinner 78
5 bx4y sleep 25
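A further alternative, not part of the original answers, is to avoid the row-wise apply by using DataFrame.explode. This is only a sketch: it assumes pandas 1.0 or later (explode itself was added in 0.25) and a 'name' column that already holds parsed lists of dicts; exploded, details and out are illustrative names.

```python
import pandas as pd

# 'name' already holds parsed lists of dicts here (assumption).
df = pd.DataFrame({
    "id": ["a1xy", "a7yz", "bx4y"],
    "name": [
        [{"event": "sports", "start": "100"}, {"event": "lunch", "start": "121"}],
        [{"event": "lunch", "start": "109"}, {"event": "movie", "start": "97"}],
        [{"event": "dinner", "start": "78"}, {"event": "sleep", "start": "25"}],
    ],
})

# One row per list element; 'id' is repeated automatically for each element.
exploded = df.explode("name").reset_index(drop=True)

# Expand each dict into its own columns and prefix them with 'name.'.
details = pd.json_normalize(exploded["name"].tolist()).add_prefix("name.")

# Both frames share the same 0..n-1 index, so they line up row by row.
out = pd.concat([exploded[["id"]], details], axis=1)
print(out)
```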
【Comments】: