How to split a pandas column with a list of dicts into separate columns for each key
Posted: 2021-04-13 17:15:09

Question: I'm analyzing political ads from Facebook, a dataset released by ProPublica (dataset here).
Here's what I mean. I have an entire column of targets to analyze, but its format is very inaccessible for someone at my skill level.
This is from just one cell:
[{"target": "NAge", "segment": "21 and older"}, {"target": "MinAge", "segment": "21"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]
And another:
[{"target": "NAge", "segment": "18 and older"}, {"target": "Location Type", "segment": "HOME"}, {"target": "Interest", "segment": "Hispanic culture"}, {"target": "Interest", "segment": "Republican Party (United States)"}, {"target": "Location Granularity", "segment": "country"}, {"target": "Country", "segment": "the United States"}, {"target": "MinAge", "segment": 18}]
What I need to do is separate each "target" item so it becomes a column label, with each corresponding "segment" becoming a possible value in that column.
Alternatively, could I create a function that calls each dict key in each row to count frequencies?
Comments:

See the update for getting the counts.

Answer 1:

The column is a list of dicts.
Use pandas.explode() to move each dict in the list into a separate row.
Convert the column of dicts to a dataframe, where the keys become the column headers and the values the observations, by using pandas.json_normalize(), and .join() this back to df.
Use .drop() to remove the unneeded column.
If the column contains lists of dicts that are strings (e.g. "[{key: value}]"), refer to this solution in Splitting dictionary/list inside a Pandas Column into Separate Columns, and use df.col2 = df.col2.apply(literal_eval), with from ast import literal_eval.
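As a minimal sketch of that string-to-list conversion, using a hypothetical one-row column stored as a string:

```python
from ast import literal_eval

import pandas as pd

# hypothetical column where the list of dicts was saved as a string
df = pd.DataFrame({'col2': ['[{"target": "NAge", "segment": "21 and older"}]']})

# literal_eval safely parses each string into a real list of dicts
df.col2 = df.col2.apply(literal_eval)
```

After this, df.col2 holds Python lists and can be passed to explode() and json_normalize() as below.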
import pandas as pd

# create sample dataframe
df = pd.DataFrame({'col1': ['x', 'y'], 'col2': [[{"target": "NAge", "segment": "21 and older"}, {"target": "MinAge", "segment": "21"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}], [{"target": "NAge", "segment": "18 and older"}, {"target": "Location Type", "segment": "HOME"}, {"target": "Interest", "segment": "Hispanic culture"}, {"target": "Interest", "segment": "Republican Party (United States)"}, {"target": "Location Granularity", "segment": "country"}, {"target": "Country", "segment": "the United States"}, {"target": "MinAge", "segment": 18}]]})
# display(df)
  col1                                                                                                                                                                                                                                                                                                                                                                   col2
0    x                                                                                 [{'target': 'NAge', 'segment': '21 and older'}, {'target': 'MinAge', 'segment': '21'}, {'target': 'Retargeting', 'segment': 'people who may be similar to their customers'}, {'target': 'Region', 'segment': 'the United States'}]
1    y  [{'target': 'NAge', 'segment': '18 and older'}, {'target': 'Location Type', 'segment': 'HOME'}, {'target': 'Interest', 'segment': 'Hispanic culture'}, {'target': 'Interest', 'segment': 'Republican Party (United States)'}, {'target': 'Location Granularity', 'segment': 'country'}, {'target': 'Country', 'segment': 'the United States'}, {'target': 'MinAge', 'segment': 18}]
# use explode to give each dict in a list a separate row
df = df.explode('col2').reset_index(drop=True)
# normalize the column of dicts, join back to the remaining dataframe columns, and drop the unneeded column
df = df.join(pd.json_normalize(df.col2)).drop(columns=['col2'])
display(df)
col1 target segment
0 x NAge 21 and older
1 x MinAge 21
2 x Retargeting people who may be similar to their customers
3 x Region the United States
4 y NAge 18 and older
5 y Location Type HOME
6 y Interest Hispanic culture
7 y Interest Republican Party (United States)
8 y Location Granularity country
9 y Country the United States
10 y MinAge 18
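The result above is in long format, while the question asked for one column per "target". One way to get there, not part of the original answer, is to pivot the normalized result wide; this sketch assumes duplicate targets within a row (e.g. two "Interest" entries) should be joined into a single string:

```python
import pandas as pd

# hypothetical long-format frame, shaped like the explode/json_normalize output above
long_df = pd.DataFrame({
    'col1': ['x', 'x', 'y', 'y'],
    'target': ['NAge', 'Region', 'NAge', 'Interest'],
    'segment': ['21 and older', 'the United States', '18 and older', 'Hispanic culture'],
})

# pivot so each distinct target becomes a column; the join aggfunc
# collapses repeated targets in the same row into one comma-separated string
wide = long_df.pivot_table(index='col1', columns='target', values='segment',
                           aggfunc=', '.join).reset_index()
```

Rows that lack a given target get NaN in that column, so the wide frame stays rectangular even though ads use different target sets.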
Get the counts

If the goal is to get a count for each 'target' and the associated 'segment':

counts = df.groupby(['target', 'segment']).count()
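As a small sketch of what that groupby yields, on a hypothetical frame (using .size(), which counts rows without needing a third column to aggregate):

```python
import pandas as pd

# hypothetical long-format frame with a repeated (target, segment) pair
df = pd.DataFrame({
    'col1': ['x', 'y', 'y'],
    'target': ['Interest', 'Interest', 'Region'],
    'segment': ['Hispanic culture', 'Hispanic culture', 'the United States'],
})

# one row per unique (target, segment) pair, with its frequency
counts = df.groupby(['target', 'segment']).size().reset_index(name='counts')
```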
Update

This update implements the solution for the complete file.

import pandas as pd
from ast import literal_eval
# load the file
df = pd.read_csv('en-US.csv')
# replace NaNs with '[]', otherwise literal_eval will error
df.targets = df.targets.fillna('[]')
# replace null with None, otherwise literal_eval will error
df.targets = df.targets.str.replace('null', 'None')
# convert the strings to lists of dicts
df.targets = df.targets.apply(literal_eval)
# use explode to give each dict in a list a separate row
df = df.explode('targets').reset_index(drop=True)
# fillna with {} is required for json_normalize
df.targets = df.targets.fillna({i: {} for i in df.index})
# normalize the column of dicts, join back to the remaining dataframe columns, and drop the unneeded column
normalized = pd.json_normalize(df.targets)
# get the counts
counts = normalized.groupby(['target', 'segment']).segment.count().reset_index(name='counts')
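The fillna-with-empty-dicts step above can be sketched in isolation; the Series contents here are hypothetical:

```python
import numpy as np
import pandas as pd

# after explode, rows whose original 'targets' was empty become NaN
targets = pd.Series([{'target': 'NAge', 'segment': '18 and older'}, np.nan])

# json_normalize cannot handle NaN entries, so map each missing
# index position to an empty dict via fillna's dict argument
targets = targets.fillna({i: {} for i in targets.index})

# the empty-dict row survives as a row of all-NaN values
normalized = pd.json_normalize(targets)
```

Passing a dict to fillna is what allows a non-scalar fill value: the dict maps each index label to the object to insert at that position.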