python pandas: one column is missing after grouping and summing every column


The commands are:
df = pd.read_excel('top8.xlsx')
df
which displays:
a b c d e f g custom product
0 429.60 19.87 32.2 17.12931 27.75 20 429.6 1 1
1 429.60 19.87 32.2 17.12931 27.75 20 429.6 1 2
2 214.80 19.87 16.1 17.12931 13.88 10 214.8 1 1
3 1,074.00 19.87 80.5 17.12931 69.39 50 1074.0 1 2
4 214.80 19.87 16.1 17.12931 13.88 10 214.8 2 1
5 429.60 19.87 32.2 17.12931 27.75 20 429.6 2 2
6 1,074.00 19.87 80.5 17.12931 69.39 50 1074.0 2 1
7 429.60 19.87 32.2 17.12931 27.75 20 429.6 2 2

I want to sum columns a through g, using custom and product as the grouping keys.
The command is:
df.groupby(['custom','product']).sum()
which displays:
custom product b c d e f g
1 1 39.74 48.3 34.25862 41.63 30 644.4
2 39.74 112.7 34.25862 97.14 70 1503.6
2 1 39.74 96.6 34.25862 83.27 60 1288.8
2 39.74 64.4 34.25862 55.50 40 859.2

The result has no sum for column a. How can I fix this?

I tried summing column a on its own:
df.groupby(['custom','product'])[['a']].sum()
The output is:
custom product a
1 1 429.60214.80
2 429.601,074.00
2 1 214.801,074.00
2 429.60429.60
pandas simply concatenated the values of column a as strings. How can I fix this?

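Because column a is not entirely numeric:
1,074.00 19.87 80.5 17.12931 69.39 50 1074.0 1 2
1,074.00 19.87 80.5 17.12931 69.39 50 1074.0 2 1
The value 1,074.00 in these two rows contains a thousands separator, so the whole column is not recognized as numeric. Either edit the data in the xlsx file so the cells are stored as numbers with no non-numeric characters, or convert the column to a numeric type directly.
Alternatively, process df after reading it and exclude such records.

After that, sum works as expected.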
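As a minimal sketch of the "convert the column" option, the snippet below rebuilds a four-row miniature of the table in memory (the small DataFrame is a hypothetical stand-in for top8.xlsx, not the real file), strips the thousands separators from column a, and converts it with pd.to_numeric before grouping.

```python
import pandas as pd

# Hypothetical miniature of the question's data: column "a" holds text
# because some cells contain a thousands separator.
df = pd.DataFrame({
    "a": ["429.60", "429.60", "214.80", "1,074.00"],
    "g": [429.6, 429.6, 214.8, 1074.0],
    "custom": [1, 1, 1, 1],
    "product": [1, 2, 1, 2],
})

# Diagnose: a text column is not numeric, so sum() cannot add it up.
print(pd.api.types.is_numeric_dtype(df["a"]))

# astype(str) also covers columns that mix real floats with strings,
# which is a common result when reading such a sheet from Excel.
cleaned = df["a"].astype(str).str.replace(",", "", regex=False)
df["a"] = pd.to_numeric(cleaned)

totals = df.groupby(["custom", "product"]).sum()
print(totals)
```

For the "exclude such records" option, pd.to_numeric(df["a"], errors="coerce") turns unparseable cells into NaN, which can then be dropped with dropna. Note that this discards those rows' contribution to the sums, so it only makes sense if the rows really are invalid.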
Reference answer A: With this much data, I would suggest using MATLAB.

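Sum Two Pandas Group By Objects

[Title]: Sum Two Pandas Group By Objects [Posted]: 2018-12-24 06:31:08 [Question]:

I have two pandas group by objects and I want to sum their values. I cannot figure out how to combine the two dataframes so that the CALL_BLOCK column has all ten call blocks for that DOW, with the values summed. I have tried several approaches, such as resetting the index and merging the two dataframes, but I still cannot get all ten call blocks in the CALL_BLOCKS column. I would appreciate your help. Thanks very much in advance.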

Edited

df1 = {('1-100019B', 'a_8:00AM to 9:00AM'): 0.6493506493506493,
 ('1-100019B', 'b_9:00AM to 10:00AM'): 0.7272727272727273,
 ('1-100019B', 'c_10:00AM to 11:00AM'): 0.16883116883116883,
 ('1-100019B', 'd_11:00AM to 12:00PM'): 0.025974025974025976,
 ('1-100019B', 'e_12:00PM to 1:00PM'): 0.38961038961038963,
 ('1-100019B', 'f_1:00PM to 2:00PM'): 0.14285714285714285,
 ('1-100019B', 'g_2:00PM to 3:00PM'): 0.0,
 ('1-100019B', 'h_3:00PM to 4:00PM'): 0.12987012987012986,
 ('1-100019B', 'i_4:00PM to 5:00PM'): 0.0,
 ('1-100019B', 'j_After 5PM'): 0.0}

df2 = {('1-100019B', 0, 'a_8:00AM to 9:00AM'): 0.5,
 ('1-100019B', 0, 'b_9:00AM to 10:00AM'): 0.6666666666666666,
 ('1-100019B', 0, 'c_10:00AM to 11:00AM'): 0.25,
 ('1-100019B', 0, 'e_12:00PM to 1:00PM'): 0.3333333333333333,
 ('1-100019B', 0, 'f_1:00PM to 2:00PM'): 0.0,
 ('1-100019B', 0, 'h_3:00PM to 4:00PM'): 1.0}

Expected output:

df = 
CONTACT_ID  DOW  CALL_BLOCKS         
1-100019B   0    a_8:00AM to 9:00AM      1.149
                 b_9:00AM to 10:00AM     1.380
                 c_10:00AM to 11:00AM    0.410
                 d_11:00AM to 12:00PM    0.026
                 e_12:00PM to 1:00PM     0.710
                 f_1:00PM to 2:00PM      0.140
                 g_2:00PM to 3:00PM      0.000
                 h_3:00PM to 4:00PM      1.120
                 i_4:00PM to 5:00PM      0.000
                 j_After 5PM             0.000

[Comments]:

Could you add df1.to_dict() and df2.to_dict() to the question?
Hi Scott, edited. Does this help?

[Answer 1]:

Using @jpp's setup,

df1.merge(df2.reset_index('DOW'), on=['CONTACTS_ID','CALL_BLOCKS'], how='outer')\
   .set_index('DOW', append=True).sum(axis=1)

Output:

CONTACTS_ID  CALL_BLOCKS           DOW
1-100019B    a_8:00AM to 9:00AM    0.0    1.149351
             b_9:00AM to 10:00AM   0.0    1.393939
             c_10:00AM to 11:00AM  0.0    0.418831
             d_11:00AM to 12:00PM  NaN    0.025974
             e_12:00PM to 1:00PM   0.0    0.722944
             f_1:00PM to 2:00PM    0.0    0.142857
             g_2:00PM to 3:00PM    NaN    0.000000
             h_3:00PM to 4:00PM    0.0    1.129870
             i_4:00PM to 5:00PM    NaN    0.000000
             j_After 5PM           NaN    0.000000
dtype: float64
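For readers who want to try this merge without the original data, here is a small self-contained sketch. The shortened block labels, the values, and the column names v1 and v2 are made up for illustration. Only the index level names follow the answer.

```python
import pandas as pd

# Hypothetical stand-ins for the two groupby results.
idx1 = pd.MultiIndex.from_tuples(
    [("1-100019B", "a"), ("1-100019B", "b"), ("1-100019B", "c")],
    names=["CONTACTS_ID", "CALL_BLOCKS"],
)
df1 = pd.DataFrame({"v1": [0.5, 0.25, 0.0]}, index=idx1)

idx2 = pd.MultiIndex.from_tuples(
    [("1-100019B", 0, "a"), ("1-100019B", 0, "b")],
    names=["CONTACTS_ID", "DOW", "CALL_BLOCKS"],
)
df2 = pd.DataFrame({"v2": [0.5, 1.0]}, index=idx2)

# Move DOW out of the index so both frames share the same two index levels,
# merge on those level names, then put DOW back and add the value columns.
merged = df1.merge(
    df2.reset_index("DOW"), on=["CONTACTS_ID", "CALL_BLOCKS"], how="outer"
)
out = merged.set_index("DOW", append=True).sum(axis=1)
print(out)
```

As in the output above, blocks that exist only in df1 end up with NaN in the DOW level, because the outer merge has no DOW value to carry over for them.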

[Discussion]:

This helped. Thank you.
@KrishnangKDalal I am glad it helped. You are welcome. Happy coding!

[Answer 2]:

Drop the unused MultiIndex level from the second dataframe, then use pd.Series.add:

df2.index = df2.index.droplevel(1)

res = df1.add(df2, fill_value=0)

print(res)

                                0
idx1      idx3                          
1-100019B a_8:00AM to 9:00AM    1.149351
          b_9:00AM to 10:00AM   1.393939
          c_10:00AM to 11:00AM  0.418831
          d_11:00AM to 12:00PM  0.025974
          e_12:00PM to 1:00PM   0.722944
          f_1:00PM to 2:00PM    0.142857
          g_2:00PM to 3:00PM    0.000000
          h_3:00PM to 4:00PM    1.129870
          i_4:00PM to 5:00PM    0.000000
          j_After 5PM           0.000000

Setup

This is the code I used to get from your input dictionaries to MultiIndex series, which is what you would see as the output of a groupby operation.

import pandas as pd

# Turn the tuple keys into columns, then into a two-level index.
df1 = pd.DataFrame.from_dict(df1, orient='index').reset_index()
df1 = df1.join(pd.DataFrame(df1['index'].values.tolist(), columns=['idx1', 'idx3'])).drop(columns='index')
df1 = df1.set_index(['idx1', 'idx3'])

# Same for the second dictionary, whose keys have three parts.
df2 = pd.DataFrame.from_dict(df2, orient='index').reset_index()
df2 = df2.join(pd.DataFrame(df2['index'].values.tolist(), columns=['idx1', 'idx2', 'idx3'])).drop(columns='index')
df2 = df2.set_index(['idx1', 'idx2', 'idx3'])
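A compact, self-contained variant of this answer's approach is sketched below. It builds the two inputs directly as Series rather than going through the dictionaries. The shortened labels and values are illustrative assumptions, not the question's data.

```python
import pandas as pd

# Hypothetical two-level and three-level Series, mimicking groupby output.
s1 = pd.Series(
    [0.5, 0.25, 0.0],
    index=pd.MultiIndex.from_tuples(
        [("1-100019B", "a"), ("1-100019B", "b"), ("1-100019B", "c")],
        names=["idx1", "idx3"],
    ),
)
s2 = pd.Series(
    [0.5, 1.0],
    index=pd.MultiIndex.from_tuples(
        [("1-100019B", 0, "a"), ("1-100019B", 0, "b")],
        names=["idx1", "idx2", "idx3"],
    ),
)

# Drop the middle level so both indexes have the same shape, then add.
# fill_value=0 treats a label missing from one side as 0 instead of NaN.
res = s1.add(s2.droplevel("idx2"), fill_value=0)
print(res)
```

This only works cleanly when the dropped level carries a single value per remaining key. With several idx2 values, the flattened index would contain duplicates.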

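[Discussion]:

Thanks for the answer. I cannot drop level=1 (DOW), because I want the specific values in the DOW column, similar to what I described under the expected output.
Would it be simpler to use reset_index() on these group by objects to convert them to pandas dataframes and work with those? In that case the output would be a dataframe in the format described?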
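One way the reset_index() idea from the last comment could look is sketched here. It is an illustration under assumptions, not a confirmed solution from the thread: the shortened labels, the values, and the helper column names v1, v2 and total are all hypothetical. Every call block from the first object is repeated for each (CONTACT_ID, DOW) pair seen in the second, so DOW is kept for all blocks.

```python
import pandas as pd

# Hypothetical stand-ins for the two groupby results.
s1 = pd.Series(
    [0.5, 0.25, 0.0],
    index=pd.MultiIndex.from_tuples(
        [("1-100019B", "a"), ("1-100019B", "b"), ("1-100019B", "c")],
        names=["CONTACT_ID", "CALL_BLOCKS"],
    ),
)
s2 = pd.Series(
    [0.5, 1.0],
    index=pd.MultiIndex.from_tuples(
        [("1-100019B", 0, "a"), ("1-100019B", 0, "b")],
        names=["CONTACT_ID", "DOW", "CALL_BLOCKS"],
    ),
)

# Flatten both to ordinary dataframes with named value columns.
left = s1.rename("v1").reset_index()
right = s2.rename("v2").reset_index()

# Pair every call block with each DOW that occurs for its contact.
dows = right[["CONTACT_ID", "DOW"]].drop_duplicates()
full = left.merge(dows, on="CONTACT_ID")

# Bring in the second object's values; blocks it lacks count as 0.
full = full.merge(right, on=["CONTACT_ID", "DOW", "CALL_BLOCKS"], how="left")
full["total"] = full["v1"] + full["v2"].fillna(0)

result = full.set_index(["CONTACT_ID", "DOW", "CALL_BLOCKS"])["total"]
print(result)
```

Contacts that appear only in the first object have no DOW to pair with and drop out of the inner merge. Whether that is acceptable depends on the data.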
