Pandas DataFrame groupby and summation within a group, across row values rather than by columns

Posted 2023-02-16

Original title: Panda dataframe groupby and summation, within group, across row values rather than by columns
Asked: 2021-05-30 06:53:16

Question:

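There seem to be plenty of online examples of the DataFrame groupby() method, but they all appear to describe grouping by column and aggregating data down multiple rows (Series), i.e. "top to bottom".

Given two DataFrames, df_1 and df_2: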

df_1:
                  Instru_1  Instru_2  Instru_3  Instru_5  Instru_6  Instru_7
2020-10-01        10        10        20        20        10        30

where the row values are classification IDs, and

df_2:
                   Instru_1  Instru_2  Instru_3  Instru_5  Instru_6  Instru_7
2020-10-01         0.1       0.2       0.2       0.2       0.2       0.1

where the row values are weights that sum to 1.0.

If I need to group across the row (values) of df_1, where the number of instruments may be indeterminate, is groupby() still the way forward to obtain the result df_result:

df_result:

                  10         20        30
2020-10-01        0.5        0.4       0.1

where: The columns are the classification IDs from df_1 record 
       The values are the sum for each classification ID from df_2

(For example, for classification ID=10 the element value = 0.1 + 0.2 + 0.2 = 0.5; for ID=20 the element = 0.2 + 0.2 = 0.4; and for ID=30 the element = 0.1.)

Is the quickest way still to perform multiple steps (merge df_1 and df_2 and process per row)?
Step 1: Enum row 1 classification Ids and create df_result
Step 2: Enum row 2 and perform the summation per classification (this looks tricky!)
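
For reference, the two steps above could be sketched naively as a plain loop. This is only an illustration of the multi-step idea, not a recommended solution; the helper name sum_weights_by_class is hypothetical, and the frames are rebuilt by hand from the data shown in the question.

```python
import pandas as pd

# Data copied from the question.
cols = ["Instru_1", "Instru_2", "Instru_3", "Instru_5", "Instru_6", "Instru_7"]
df_1 = pd.DataFrame([[10, 10, 20, 20, 10, 30]], index=["2020-10-01"], columns=cols)
df_2 = pd.DataFrame([[0.1, 0.2, 0.2, 0.2, 0.2, 0.1]], index=["2020-10-01"], columns=cols)


def sum_weights_by_class(ids: pd.DataFrame, weights: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: loop over rows, summing weights per classification ID."""
    rows = {}
    for date in ids.index:
        totals = {}
        for instrument in ids.columns:
            # Step 1: read the classification ID for this instrument.
            class_id = int(ids.at[date, instrument])
            # Step 2: add the matching weight to that ID's running total.
            totals[class_id] = totals.get(class_id, 0.0) + weights.at[date, instrument]
        rows[date] = totals
    # One row per date, one column per ID; IDs absent on a date become 0.
    result = pd.DataFrame.from_dict(rows, orient="index").fillna(0.0)
    return result.sort_index(axis=1)


df_result = sum_weights_by_class(df_1, df_2)
print(df_result)
```

The loop touches every cell in Python, so it will be slow on large frames; it is shown only to make the target computation explicit.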

Any advice on the best approach would be much appreciated (or pointers on grouping across row values). Thanks in advance.

Comments on the question:

Answer 1:

You could try concat with reshaping and groupby:

import pandas as pd

# Stack both frames, labelling df1's values 'cols' and df2's values 'rows'
u = pd.concat((df1,df2),keys=['cols','rows'])
out = (u.unstack().T.reset_index(-1)
       .groupby(['level_1','cols'])['rows'].sum().unstack(fill_value=0))

print(out)

             10   20   30
2020-10-01  0.5  0.4  0.1

Sample run with multiple columns:
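
As a sketch of how the snippet above behaves with more than one row, the run below reuses the same concat/unstack/groupby chain on two rows. The first row is the question's data; the second row, (40, 40, 50, 50, 60, 60) with made-up weights, is illustration data only and does not come from the original post.

```python
import pandas as pd

cols = ["Instru_1", "Instru_2", "Instru_3", "Instru_5", "Instru_6", "Instru_7"]
dates = ["2020-10-01", "2020-10-02"]

# Row 1 is from the question; row 2 is hypothetical illustration data.
df1 = pd.DataFrame([[10, 10, 20, 20, 10, 30],
                    [40, 40, 50, 50, 60, 60]], index=dates, columns=cols)
df2 = pd.DataFrame([[0.1, 0.2, 0.2, 0.2, 0.2, 0.1],
                    [0.3, 0.1, 0.2, 0.1, 0.2, 0.1]], index=dates, columns=cols)

# Same chain as in the answer: label the two frames, reshape to long form,
# then sum the weights per (date, classification ID).
u = pd.concat((df1, df2), keys=["cols", "rows"])
out = (u.unstack().T.reset_index(-1)
       .groupby(["level_1", "cols"])["rows"].sum().unstack(fill_value=0))

print(out)
```

The result has one column per distinct classification ID across all rows (six here), with 0 where an ID does not occur on a given date. Because concat stacks the integer IDs together with the float weights, the ID column labels come out as floats (10.0, 20.0, ...).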

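Comments:

If the second row is completely different from the first, does get_dummies still work? Will it become 12 columns after get_dummies?

Thanks for the explanation, but what I mean is: if the second row is (40,40,50,50,60,60), won't get_dummies return 12 columns?

Answer 2: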

It is a bit ugly, but here is one way: unstack the dataframes and join them, then group by, sum, and pivot:

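# Long form: one row per (instrument, date), with ID in '0l' and weight in '0r'
df3 = df1.unstack().to_frame().join(df2.unstack().to_frame(), lsuffix='l', rsuffix='r')

# Select the weight column before summing, then pivot the IDs into columns
df4 = df3.reset_index().groupby(['level_1', '0l'])['0r'].sum().reset_index().pivot_table('0r', 'level_1', '0l')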

df4.index.name = None
df4.columns.name = None

print(df4)
             10   20   30
2020-10-01  0.5  0.4  0.1

Comments:

Answer 3:

Let's try:

s1, s2 = df1.stack(), df2.stack()
out = s2.groupby([s2.droplevel(1).index, s2.index.map(s1)]).sum().unstack()

Details:

stack the dataframes df1 and df2, creating the MultiIndex Series s1 and s2:

>>> s1
2020-10-01  Instru_1    10
            Instru_2    10
            Instru_3    20
            Instru_5    20
            Instru_6    10
            Instru_7    30
dtype: int64

>>> s2
2020-10-01  Instru_1    0.1
            Instru_2    0.2
            Instru_3    0.2
            Instru_5    0.2
            Instru_6    0.2
            Instru_7    0.1
dtype: float64

map the index of s2 through the Series s1 to get the columns of the new dataframe, i.e. 10, 20, 30...:

>>> s2.index.map(s1)
Int64Index([10, 10, 20, 20, 10, 30], dtype='int64')
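
To make the lookup semantics concrete, here is a minimal sketch of Index.map with a Series as the mapper. It uses a simplified flat index rather than the MultiIndex from the answer, and the names lookup and labels are illustrative only.

```python
import pandas as pd

# The Series acts as a lookup table: index label -> classification ID.
lookup = pd.Series({"Instru_1": 10, "Instru_2": 10, "Instru_3": 20})

# Each label in the Index is replaced by the value stored under it in `lookup`.
labels = pd.Index(["Instru_1", "Instru_2", "Instru_3"])
mapped = labels.map(lookup)

print(list(mapped))
```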

Finally, group the Series s2 on level=0 together with the mapped columns above, aggregate using sum, then unstack to reshape:

>>> out
             10   20   30
2020-10-01  0.5  0.4  0.1
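
A variant of the same idea can be wrapped in a small function. This is a sketch under a stated assumption, not the answer's exact code: it requires both frames to share an identical index and column order, so the stacked IDs can be used as a grouping key directly instead of going through index mapping. The helper name weights_by_class is hypothetical.

```python
import pandas as pd


def weights_by_class(ids: pd.DataFrame, weights: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: sum `weights` per (row label, classification ID)."""
    s1, s2 = ids.stack(), weights.stack()
    # Assumption: both frames have the same labels in the same order.
    if not s1.index.equals(s2.index):
        raise ValueError("ids and weights must share the same index and columns")
    # Group each weight by its row label and its positionally aligned ID.
    grouped = s2.groupby([s2.index.get_level_values(0), s1.to_numpy()]).sum()
    # Move the ID level into the columns; IDs absent from a row become 0.
    return grouped.unstack(fill_value=0)


cols = ["Instru_1", "Instru_2", "Instru_3", "Instru_5", "Instru_6", "Instru_7"]
df1 = pd.DataFrame([[10, 10, 20, 20, 10, 30]], index=["2020-10-01"], columns=cols)
df2 = pd.DataFrame([[0.1, 0.2, 0.2, 0.2, 0.2, 0.1]], index=["2020-10-01"], columns=cols)

out = weights_by_class(df1, df2)
print(out)
```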

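Comments:

Answer 4:

1. Prepare the dataframes by naming the row and column indexes.
2. unstack() to change to row-based data.
3. join() to merge the two unstacked DFs row by row.
4. Now it is a simple groupby().
5. unstack() to change back to column-based, as required.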

import io
import pandas as pd

df_1 = pd.read_csv(io.StringIO("""                  Instru_1  Instru_2  Instru_3  Instru_5  Instru_6  Instru_7
2020-10-01        10        10        20        20        10        30"""), sep=r"\s+")

df_2 = pd.read_csv(io.StringIO("""                   Instru_1  Instru_2  Instru_3  Instru_5  Instru_6  Instru_7
2020-10-01         0.1       0.2       0.2       0.2       0.2       0.1"""), sep=r"\s+")

# Name the axes so the later groupby can refer to "date" by name
df_1.columns.set_names("instrument", inplace=True)
df_1.index.set_names("date", inplace=True)
df_2.columns.set_names("instrument", inplace=True)
df_2.index.set_names("date", inplace=True)

# Long form, join IDs with weights, sum per (date, ID), then back to wide form
(df_1.unstack().to_frame().rename(columns={0:"classification"})
 .join(df_2.unstack().to_frame().rename(columns={0:"weight"}))
 .groupby(["date","classification"]).sum()
 .unstack(1).droplevel(0, axis=1)
)

date	10	20	30
2020-10-01	0.5	0.4	0.1

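Comments:

Thanks. The explicit steps help in understanding some of the other solutions and the underlying principles, c.f. "teach someone how to fish", and may then lead to further study of the other "map" and "crosstab" solution proposals.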
