Multi-index pandas DataFrame to a dictionary
【Title】: multi-index pandas dataframe to a dictionary 【Posted】: 2017-12-03 01:17:20 【Question】: I have a DataFrame like the following:
import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
If I group by two columns and compute the group sizes,
df.groupby(['regiment','company']).size()
I get the following:
regiment    company
Dragoons    1st        2
            2nd        2
Nighthawks  1st        2
            2nd        2
Scouts      1st        2
            2nd        2
dtype: int64
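The output I want is a dictionary like this:
{'Dragoons': {'1st': 2, '2nd': 2},
 'Nighthawks': {'1st': 2, '2nd': 2},
 ...}
I have tried different approaches, without success. Is there a relatively clean way to achieve this?
Thank you very much!
【Comments】: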
【Answer 1】: You can chain Series.unstack and DataFrame.to_dict:
d = df.groupby(['regiment','company']).size().unstack().to_dict(orient='index')
print (d)
{'Dragoons': {'2nd': 2, '1st': 2},
 'Nighthawks': {'2nd': 2, '1st': 2},
 'Scouts': {'2nd': 2, '1st': 2}}
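Another solution, very similar to the other answer, counts the companies within each regiment group:
from collections import Counter
# Store the result in d rather than df, so the original DataFrame is not overwritten
d = {i: dict(Counter(x['company'])) for i, x in df.groupby('regiment')}
print (d)
{'Dragoons': {'2nd': 2, '1st': 2},
 'Nighthawks': {'2nd': 2, '1st': 2},
 'Scouts': {'2nd': 2, '1st': 2}}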
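However, the first solution can produce NaN values, depending on the data: unstack creates a column for every company, so a regiment that lacks one of them gets NaN in that cell.
Example: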
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '3rd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
print (df)
      regiment company      name  preTestScore  postTestScore
0   Nighthawks     1st    Miller             4             25
1   Nighthawks     1st  Jacobson            24             94
2   Nighthawks     2nd       Ali            31             57
3   Nighthawks     2nd    Milner             2             62
4     Dragoons     1st     Cooze             3             70
5     Dragoons     1st     Jacon             4             25
6     Dragoons     2nd    Ryaner            24             94
7     Dragoons     2nd      Sone            31             57
8       Scouts     1st     Sloan             2             62
9       Scouts     1st     Piger             3             70
10      Scouts     2nd     Riani             2             62
11      Scouts     3rd       Ali             3             70
df1 = df.groupby(['regiment','company']).size().unstack()
print (df1)
company     1st  2nd  3rd
regiment
Dragoons    2.0  2.0  NaN
Nighthawks  2.0  2.0  NaN
Scouts      2.0  1.0  1.0
d = df1.to_dict(orient='index')
print (d)
{'Dragoons': {'3rd': nan, '2nd': 2.0, '1st': 2.0},
 'Nighthawks': {'3rd': nan, '2nd': 2.0, '1st': 2.0},
 'Scouts': {'3rd': 1.0, '2nd': 1.0, '1st': 2.0}}
In that case, use the Counter-based solution instead:
d = {i: dict(Counter(x['company'])) for i, x in df.groupby('regiment')}
print (d)
{'Dragoons': {'2nd': 2, '1st': 2},
 'Nighthawks': {'2nd': 2, '1st': 2},
 'Scouts': {'3rd': 1, '2nd': 1, '1st': 2}}
Or use the other answer, by John Galt.
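If you would rather stay with the grouped Series, the following is a minimal sketch of two ways around the NaN problem. It assumes a trimmed-down copy of the second example (only the regiment and company columns), and the variable names sizes, filled and nested are illustrative, not from the original answer. Option A keeps unstack but passes fill_value=0, so missing combinations appear with a count of 0. Option B walks the MultiIndex Series directly, so missing combinations get no key at all.

```python
import pandas as pd

# Trimmed-down version of the second example: only Scouts has a '3rd' company
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd',
                '1st', '1st', '2nd', '2nd',
                '1st', '1st', '2nd', '3rd'],
})

# Series indexed by (regiment, company) holding the group sizes
sizes = df.groupby(['regiment', 'company']).size()

# Option A: keep unstack, but fill missing combinations with 0 instead of NaN
filled = sizes.unstack(fill_value=0).to_dict(orient='index')

# Option B: build the nested dict straight from the MultiIndex Series,
# so combinations that never occur do not appear as keys
nested = {}
for (regiment, company), count in sizes.items():
    nested.setdefault(regiment, {})[company] = int(count)

print(filled)
print(nested)
```

Which option fits depends on whether a zero count is meaningful to the consumer of the dictionary or whether absent keys are preferable.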
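【Comments】:
I found a problem with the first solution: it only works when every regiment contains every company (as in your sample data). So the second solution, or the other answers, are more general.
I see. I ended up using the second solution because it does not produce keys with NaN values.
【Answer 2】: You can reset the index after grouping and pivot the data as needed. The code below gives the desired output.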
df = df.groupby(['regiment','company']).size().reset_index()
print(pd.pivot_table(df, values=0, index='regiment', columns='company').to_dict(orient='index'))
Output:
{'Nighthawks': {'2nd': 2, '1st': 2}, 'Scouts': {'2nd': 2, '1st': 2}, 'Dragoons': {'2nd': 2, '1st': 2}}
【Comments】:
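The group-then-pivot idea can also be written as a single cross-tabulation. This is a swapped-in technique, not the code from the answer above: the sketch below uses pd.crosstab on a trimmed-down copy of the second example data (an assumption made so the block runs on its own). crosstab counts co-occurrences, so a regiment with no '3rd' company gets 0 rather than NaN.

```python
import pandas as pd

# Trimmed-down copy of the second example: only Scouts has a '3rd' company
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd',
                '1st', '1st', '2nd', '2nd',
                '1st', '1st', '2nd', '3rd'],
})

# Rows are regiments, columns are companies, cells are occurrence counts
counts = pd.crosstab(df['regiment'], df['company'])

# One inner dict per regiment, keyed by company
d = counts.to_dict(orient='index')
print(d)
```

Note that every regiment ends up with a key for every company, so this suits cases where an explicit zero is acceptable.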
【Answer 3】: How about building the dictionary with a comprehension over the groups?
In [409]: {g: v['company'].value_counts().to_dict() for g, v in df.groupby('regiment')}
Out[409]:
{'Dragoons': {'1st': 2, '2nd': 2},
 'Nighthawks': {'1st': 2, '2nd': 2},
 'Scouts': {'1st': 2, '2nd': 2}}
【Comments】:
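The comprehension above can be wrapped in a small reusable function. The sketch below is only an illustration: nested_counts is a hypothetical helper name, and the data is a trimmed-down copy of the example in which only Scouts has a '3rd' company. Because each group is counted on its own, companies that a regiment lacks never appear as keys.

```python
import pandas as pd

def nested_counts(frame, outer, inner):
    """Hypothetical helper: map each value of `outer` to a dict of `inner` counts."""
    return {
        key: group[inner].value_counts().to_dict()
        for key, group in frame.groupby(outer)
    }

# Trimmed-down copy of the example data: only Scouts has a '3rd' company
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd',
                '1st', '1st', '2nd', '2nd',
                '1st', '1st', '2nd', '3rd'],
})

result = nested_counts(df, 'regiment', 'company')
print(result)
```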