Add unique groups to DF for each row including sum from other columns

【Title】:Add unique groups to DF for each row including sum from other columns 【Posted】:2020-10-24 07:39:27 【Question】:

I have a DataFrame that looks like this:

ID     field_1     area_1    field_2       area_2    field_3     area_3    field_4      area_4
1      scoccer     500       basketball    200       swimming    100       basketball   50
2      volleyball  100       np.nan        np.nan    np.nan      np.nan    np.nan       np.nan
3      basketball  1000      football      10        np.nan      np.nan    np.nan       np.nan
4      swimming    280       swimming      200       basketball  320       np.nan       np.nan
5      volleyball  110       football      160       volleyball  30        np.nan       np.nan 

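The original DataFrame has the same structure, but its columns run from field_1 to field_30 and from area_1 to area_30.

I want to add columns to the DF that hold horizontal groups, one per distinct value found in the 'field_x' columns of each row, with the corresponding areas summed up. The added columns should look like this: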

ID   group_1     area_1     group_2     area_2     group_3    area_3
        
1    scoccer     500        basketball  250        swimming   100
2    volleyball  100 
3    basketball  1000       football    10
4    swimming    480        basketball  320         
5    volleyball  140        football    160

Is there a simple way to achieve this?

【Answer 1】:

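Use pd.wide_to_long to reshape the DataFrame into long format, which lets you group by ID and field and sum the areas. Then, after creating the column labels with cumcount, use pivot_table to get back to wide format.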

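import pandas as pd

# Wide -> long, then sum the areas per (ID, field).
# groupby drops NaN keys by default, so the empty field/area pairs disappear.
df = (pd.wide_to_long(df, i='ID', j='num', stubnames=['field', 'area'], sep='_')
        .groupby(['ID', 'field'])['area'].sum()
        .reset_index())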
#   ID       field    area
#0   1  basketball   250.0
#1   1     scoccer   500.0
#2   1    swimming   100.0
#3   2  volleyball   100.0
#4   3  basketball  1000.0
#5   3    football    10.0
#6   4  basketball   320.0
#7   4    swimming   480.0
#8   5    football   160.0
#9   5  volleyball   140.0

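# Number the groups within each ID; these numbers become the column suffixes.
df['idx'] = df.groupby('ID').cumcount()+1
# Long -> wide; each (ID, idx) pair is unique, so 'first' just picks that value.
df = (pd.pivot_table(df, index='ID', columns='idx', values=['field', 'area'], 
                     aggfunc='first')
        .sort_index(axis=1, level=1))
# Flatten the (name, idx) MultiIndex into names such as 'area_1', 'field_1'.
df.columns = ['_'.join(map(str, tup)) for tup in df.columns]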

    area_1     field_1  area_2     field_2  area_3   field_3
ID                                                          
1    250.0  basketball   500.0     scoccer   100.0  swimming
2    100.0  volleyball     NaN         NaN     NaN       NaN
3   1000.0  basketball    10.0    football     NaN       NaN
4    320.0  basketball   480.0    swimming     NaN       NaN
5    160.0    football   140.0  volleyball     NaN       NaN
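Note that the output above lists the groups alphabetically within each ID (groupby sorts its keys by default), whereas the desired output in the question lists them in order of first appearance. A minimal sketch of one way to keep that order, assuming the sample data from the question rebuilt by hand; the names long_df, sums and wide, and the group_N / area_N column naming, are illustrative choices rather than part of the original answer:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample data from the question, rebuilt by hand.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "field_1": ["scoccer", "volleyball", "basketball", "swimming", "volleyball"],
    "area_1": [500, 100, 1000, 280, 110],
    "field_2": ["basketball", nan, "football", "swimming", "football"],
    "area_2": [200, nan, 10, 200, 160],
    "field_3": ["swimming", nan, nan, "basketball", "volleyball"],
    "area_3": [100, nan, nan, 320, 30],
    "field_4": ["basketball", nan, nan, nan, nan],
    "area_4": [50, nan, nan, nan, nan],
})

# Wide -> long, drop the empty pairs, and order rows by ID and column number.
long_df = (pd.wide_to_long(df, stubnames=["field", "area"], i="ID", j="num", sep="_")
             .reset_index()
             .dropna(subset=["field"])
             .sort_values(["ID", "num"]))

# sort=False keeps groups in order of first appearance instead of alphabetical order.
sums = (long_df.groupby(["ID", "field"], sort=False)["area"].sum()
               .reset_index())
sums["idx"] = sums.groupby("ID").cumcount() + 1

# Long -> wide; unstack keeps the dtype of each column (text vs. numeric).
wide = sums.set_index(["ID", "idx"])[["field", "area"]].unstack("idx")

# Interleave the columns as field 1, area 1, field 2, area 2, ...
order = [(name, i) for i in range(1, int(sums["idx"].max()) + 1)
         for name in ("field", "area")]
wide = wide.reindex(columns=pd.MultiIndex.from_tuples(order))
wide.columns = [f"{'group' if name == 'field' else 'area'}_{i}"
                for name, i in wide.columns]
print(wide)
```

For the real data with 30 field/area pairs nothing needs to change, since wide_to_long picks up every column matching the stub names.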

Just for fun, you could use the undocumented pd.lreshape instead of wide_to_long

# df here is the original wide DataFrame, before any of the reshaping above.
# Change range to (1,31) for your real data.
pd.lreshape(df, {'area': [f'area_{i}' for i in range(1,5)],
                 'field': [f'field_{i}' for i in range(1,5)]})

#    ID    area       field
#0    1   500.0     scoccer
#1    2   100.0  volleyball
#2    3  1000.0  basketball
#3    4   280.0    swimming
#4    5   110.0  volleyball
#5    1   200.0  basketball
#....
#10   4   320.0  basketball
#11   5    30.0  volleyball
#12   1    50.0  basketball
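The long frame returned by pd.lreshape can go through the same groupby step as before. A minimal sketch, assuming the sample data from the question rebuilt by hand; n, long_df and sums are illustrative names, and n would be 30 for the real data:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample data from the question, rebuilt by hand.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "field_1": ["scoccer", "volleyball", "basketball", "swimming", "volleyball"],
    "area_1": [500, 100, 1000, 280, 110],
    "field_2": ["basketball", nan, "football", "swimming", "football"],
    "area_2": [200, nan, 10, 200, 160],
    "field_3": ["swimming", nan, nan, "basketball", "volleyball"],
    "area_3": [100, nan, nan, 320, 30],
    "field_4": ["basketball", nan, nan, nan, nan],
    "area_4": [50, nan, nan, nan, nan],
})

n = 4  # number of field/area column pairs (assumed; 30 for the real data)

# lreshape stacks the listed columns and, by default, drops rows with NaN.
long_df = pd.lreshape(df, {
    "area": [f"area_{i}" for i in range(1, n + 1)],
    "field": [f"field_{i}" for i in range(1, n + 1)],
})

# Same aggregation as in the wide_to_long version.
sums = long_df.groupby(["ID", "field"])["area"].sum().reset_index()
print(sums)
```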

【Comments】:

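wide_to_long, I would never have thought of that one. Nice!

@Ben.T For some reason I like it better than melt. It is especially useful if you care about the suffix after the stub name, though that rarely comes up.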
