How to do group by on a multiindex in pandas?

Posted: 2013-11-16 20:56:16

Question:

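Below is my dataframe. I did some transformations to create the category column and dropped the original column it was derived from. Now I need to do a group-by to remove the duplicates, e.g. the Love and Fashion rows can each be rolled up with a groupby sum.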

df.columns = array([category, clicks, revenue, date, impressions, size], dtype=object)
df.values=
[[Love 0 0.36823 2013-11-04 380 300x250]
 [Love 183 474.81522 2013-11-04 374242 300x250]
 [Fashion 0 0.19434 2013-11-04 197 300x250]
 [Fashion 9 18.26422 2013-11-04 13363 300x250]]

This is the index that was created when I built the dataframe:

print df.index
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48])

I assume I want to drop the index, make date and category a multiindex, and then do a groupby sum of the metrics. How do I do this in a pandas dataframe?

df.head(15).to_dict() = {'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'}, 'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229}, 'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}}

Python is 2.7 and pandas is 0.7.0, on Ubuntu 12.04. If I run the following, I get the error shown below it:

import pandas
print pandas.__version__
df = pandas.DataFrame.from_dict(
    {
     'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'},
     'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229},
     'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'},
     'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002},
     'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2},
     'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
    }
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()


Traceback (most recent call last):
  File "/home/ubuntu/workspace/devops/reports/groupby_sub.py", line 9, in <module>
    df.set_index(['date', 'category'], inplace=True)
  File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 1927, in set_index
    raise Exception('Index has duplicate keys: %s' % duplicates)
Exception: Index has duplicate keys: [('2013-11-04', 'Celebs'), ('2013-11-04', 'Fashion'), ('2013-11-04', 'Health'), ('2013-11-04', 'Love'), ('2013-11-04', 'Movies')]

Answer 1:

You can create the index on the existing dataframe. With the subset of data you provided, this works for me:

import pandas
df = pandas.DataFrame.from_dict(
    {
     'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'},
     'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229},
     'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'},
     'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002},
     'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2},
     'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
    }
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()

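If you run into duplicate-index problems with your full dataset, you will need to clean up the data a bit. If it is feasible, remove the duplicate rows. If the duplicate rows are valid, then what distinguishes them from one another? If you can add that to the dataframe and include it in the index, that is ideal. If not, just create a dummy column that defaults to 1 but can be 2 or 3 or ... N when a key has N duplicates, and then include that field in the index as well.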
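The dummy-column idea can be sketched as follows. This is only an illustration under assumptions: it uses a small made-up sample modeled on the question's data, the column name `dup` is hypothetical, and it relies on `GroupBy.cumcount`, which exists in recent pandas releases but not in 0.7.0.

```python
import pandas as pd

# Hypothetical sample modeled on the question's data (not the real dataset).
df = pd.DataFrame({
    "date": ["2013-11-04"] * 4,
    "category": ["Love", "Love", "Fashion", "Hair"],
    "clicks": [0, 183, 0, 0],
})

# Number the rows that share a (date, category) key as 1, 2, 3, ...
# "dup" is an illustrative column name, not something from the original post.
df["dup"] = df.groupby(["date", "category"]).cumcount() + 1

# With the counter included, every index key is unique.
indexed = df.set_index(["date", "category", "dup"])
print(indexed.index.is_unique)
```

A counter like this only makes the keys unique; it says nothing about why the rows are duplicated, so a real distinguishing column is still preferable when one exists.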

Alternatively, I'm pretty sure you can skip the index creation and group by the columns directly:

df.groupby(by=['date', 'category']).sum()

Again, this works on the subset of data that you posted.
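As a rough sketch of the column-based group-by on a recent pandas version: the sample rows below are the four Love and Fashion rows from the question, and picking the metric columns explicitly (the `metrics` list is an assumption of this example) keeps non-numeric columns such as `size` out of the sum.

```python
import pandas as pd

# The four Love/Fashion rows shown in the question.
df = pd.DataFrame({
    "date": ["2013-11-04"] * 4,
    "category": ["Love", "Love", "Fashion", "Fashion"],
    "clicks": [0, 183, 0, 9],
    "impressions": [380, 374242, 197, 13363],
    "size": ["300x250"] * 4,
})

# Sum only the numeric metrics; as_index=False keeps date and category
# as ordinary columns instead of turning them into a MultiIndex.
metrics = ["clicks", "impressions"]
totals = df.groupby(["date", "category"], as_index=False)[metrics].sum()
print(totals)
```

Dropping `as_index=False` would instead return the result indexed by a (date, category) MultiIndex, which is the shape the question was originally aiming for.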

Comments:

raise Exception('Index has duplicate keys: %s' % duplicates) Exception: Index has duplicate keys: [('2013-11-04', 'Beauty'), ('2013-11-04', 'Celebs'), ('2013-11-04', 'Diet'), ('2013-11-04', 'Fashion'), ('2013-11-04', 'Health'), ('2013-11-04', 'Inspiration'), ('2013-11-04', 'Lifestyle'), ('2013-11-04', 'Love'), ('2013-11-04', 'Movies'), ('2013-11-04', 'Parenting')]

@Tampa It looks like you may need to clean up your data a bit. The portion you posted works for me (see my edit).

That works... df.groupby(by=['date', 'category']).sum() Thanks!

Answer 2:

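I usually run into this when I try to unstack a multi-index and it fails because there are duplicate values.

Here is the simple command I run to find the offending items: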

df.groupby(level=df.index.names).count()
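To narrow that count down to just the problem keys, something along these lines may help. It is a sketch on a made-up sample, assuming a recent pandas version in which setting a non-unique index does not raise and `Index.duplicated` is available.

```python
import pandas as pd

# Hypothetical sample with one duplicated (date, category) key.
df = pd.DataFrame({
    "date": ["2013-11-04"] * 4,
    "category": ["Love", "Love", "Fashion", "Hair"],
    "clicks": [0, 183, 0, 0],
}).set_index(["date", "category"])

# Count rows per index key, then keep only the keys that occur more than once.
counts = df.groupby(level=df.index.names).size()
duplicated_keys = counts[counts > 1]
print(duplicated_keys)

# Pull out every row that sits behind a duplicated key.
duplicate_rows = df[df.index.duplicated(keep=False)]
print(duplicate_rows)
```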
