Python: Incorrect index after group by with different aggregation to set of columns

Posted 2023-03-11

Asked: 2020-01-18 18:11:43

Question:

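I want to group by the CurrentDate and Car fields and apply the following functions:

np.mean to the list of columns ['Attr1', ..., 'Attr5'];

a random pick (using np.random) for the AttrFactory column.

Here is an example of df: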

Index   Car       Attr1    Attr2  Attr3  Attr4  Attr5  AttrFactory  CurrentDate                           
0      Nissan     0.0       1.7    3.7    0.0    6.8      F1          01/07/18
1      Nissan     0.0       1.7    3.7    0.0    6.8      F2          01/07/18
2      Nissan     0.0       1.7    3.7    0.0    6.8      F3          03/08/18
3      Porsche    10.0      0.0    2.8    3.5    6.5      F2          05/08/18
4      Porsche    10.0      2.0    0.8    3.5    6.5      F1          05/08/18   
5      Golf       0.0       1.7    3.0    2.0    6.3      F4          07/09/18       
6      Tiguan     1.0       0.0    3.0    5.2    5.8      F5          10/09/18         
7      Porsche    0.0       0.0    3.0    4.2    7.8      F4          12/09/18     
8      Tiguan     0.0       0.0    0.0    7.2    9.0      F3          13/09/18    
9      Golf       0.0       3.0    0.0    0.0    4.8      F5          25/09/18 
10     Golf       0.0       3.0    0.0    0.0    4.8      F1          25/09/18  
11     Golf       0.0       3.0    0.0    0.0    4.8      F3          25/09/18

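I tried to do this with the following code:

metric_cols = df.filter(regex='^Attr', axis=1).columns  # all Attr columns, including AttrFactory

# Note: list.remove() modifies the list in place and returns None, so addt_col is None here
addt_col = list(df.filter(regex='^Attr', axis=1).columns).remove('AttrFactory')


# addt_col (None) is used as a dictionary key, which is likely where the NaN label comes from
df_gr = df.groupby(['CurrentDate', 'Car'], as_index=False)[metric_cols].agg({addt_col: np.mean, 'AttrFactory': lambda x: x.iloc[np.random.choice(range(0, len(x)))]})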

As a result, I get a df with an incorrect index:

CurrentDate     Car          NaN
                         CurrentDate   Car    Attr1  Attr2  Attr3  Attr4  Attr5 AttrFactory                           
01/07/18      Nissan       01/07/18   Nissan    0.0   1.7    3.7    0.0    6.8      F1                   
03/08/18      Nissan       03/08/18   Nissan    0.0   1.7    3.7    0.0    6.8      F3          
05/08/18      Porsche      05/08/18   Porsche   10.0  1.0    1.8    3.5    6.5      F1                    
  ...           ...         ...        ...      ...   ...    ...    ...    ...      ...  
13/09/18      Tiguan       13/09/18   Tiguan    0.0   0.0    0.0    7.2    9.0      F3          
25/09/18      Golf         25/09/18   Golf      0.0   1.0    0.0    0.0    4.8      F3

The expected output, df_gr, is:

                           Attr1  Attr2  Attr3  Attr4  Attr5  AttrFactory                           
01/07/18      Nissan        0.0    1.7    3.7    0.0    6.8       F1                   
03/08/18      Nissan        0.0    1.7    3.7    0.0    6.8       F3          
05/08/18      Porsche       10.0   1.0    1.8    3.5    6.5       F1                    
  ...         ...           ...    ...    ...    ...    ...       ...      
13/09/18      Tiguan        0.0    0.0    0.0    7.2    9.0       F3          
25/09/18      Golf          0.0    1.0    0.0    0.0    4.8       F3

How can I fix the incorrect CurrentDate Car NaN index at the top of the result? I'd appreciate any ideas, thanks.

Answer 1:

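You can build a dictionary of aggregations and pass it to agg. Because each numeric column is mapped to a list (['mean']), the result has two-level columns; droplevel(1, axis=1) removes the aggregator level again.

In:

metric_cols = df.filter(regex=r'^Attr\d', axis=1).columns  # raw string; matches Attr1..Attr5 but not AttrFactory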

d = dict.fromkeys(metric_cols, ['mean'])
d['AttrFactory'] = lambda x: x.iloc[np.random.choice(range(0,len(x)))]

df = df.groupby(['CurrentDate', 'Car'], as_index=False).agg(d).droplevel(1, axis=1)

Output:

|   | CurrentDate | Car     | Attr1 | Attr2 | Attr3              | Attr4 | Attr5 | AttrFactory |
|---|-------------|---------|-------|-------|--------------------|-------|-------|-------------|
| 0 | 01/07/18    | Nissan  | 0.0   | 1.7   | 3.7                | 0.0   | 6.8   | F2          |
| 1 | 03/08/18    | Nissan  | 0.0   | 1.7   | 3.7                | 0.0   | 6.8   | F3          |
| 2 | 05/08/18    | Porsche | 10.0  | 1.0   | 1.7999999999999998 | 3.5   | 6.5   | F1          |
| 3 | 07/09/18    | Golf    | 0.0   | 1.7   | 3.0                | 2.0   | 6.3   | F4          |
| 4 | 10/09/18    | Tiguan  | 1.0   | 0.0   | 3.0                | 5.2   | 5.8   | F5          |
| 5 | 12/09/18    | Porsche | 0.0   | 0.0   | 3.0                | 4.2   | 7.8   | F4          |
| 6 | 13/09/18    | Tiguan  | 0.0   | 0.0   | 0.0                | 7.2   | 9.0   | F3          |
| 7 | 25/09/18    | Golf    | 0.0   | 3.0   | 0.0                | 0.0   | 4.8   | F1          |
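As a variation on the dictionary approach above, here is a minimal sketch that maps each column to a single aggregator (a string rather than a list), so the result should come back with flat columns and no droplevel step. The sample frame reuses a few rows from the question, and the names agg_map and rng are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Small sample in the shape of the question's data (a few of its rows).
df = pd.DataFrame({
    "Car": ["Nissan", "Nissan", "Nissan", "Porsche", "Porsche"],
    "Attr1": [0.0, 0.0, 0.0, 10.0, 10.0],
    "Attr2": [1.7, 1.7, 1.7, 0.0, 2.0],
    "AttrFactory": ["F1", "F2", "F3", "F2", "F1"],
    "CurrentDate": ["01/07/18", "01/07/18", "03/08/18", "05/08/18", "05/08/18"],
})

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Numeric Attr columns only; the \d keeps AttrFactory out of the selection.
metric_cols = df.filter(regex=r"^Attr\d").columns

# One aggregator per column (a string, not a list) keeps the columns single-level.
agg_map = {col: "mean" for col in metric_cols}
# Pick one random AttrFactory value from each group.
agg_map["AttrFactory"] = lambda s: s.iloc[rng.integers(len(s))]

df_gr = df.groupby(["CurrentDate", "Car"], as_index=False).agg(agg_map)
print(df_gr)
```

With as_index=False the group keys stay as ordinary columns; passing as_index=True instead would put CurrentDate and Car into the row index, which is closer to the layout shown in the question's expected output.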

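Answer 2:

Your aggregators are applied per column, so they are stored in the second level of the column index, while the column names are stored in the first level (to prevent overwriting). This is especially useful when applying multiple aggregators per column.

The solution is as follows: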

# Merge the aggregator with the column name
df_gr.columns = ['_'.join(x) for x in df_gr.columns.values.reshape(-1)]
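To illustrate the two-level columns described above, here is a small sketch under assumed sample data (the three-row frame is made up for illustration). It applies two aggregators to one column and then flattens the labels. Unlike the one-liner above, it skips empty second-level labels, so the group keys should stay CurrentDate and Car rather than becoming CurrentDate_ and Car_.

```python
import pandas as pd

# Hypothetical three-row sample, not taken from the question.
df = pd.DataFrame({
    "Car": ["Golf", "Golf", "Tiguan"],
    "Attr1": [0.0, 2.0, 1.0],
    "CurrentDate": ["25/09/18", "25/09/18", "10/09/18"],
})

# List-valued aggregators produce two-level columns such as ("Attr1", "mean").
df_gr = df.groupby(["CurrentDate", "Car"], as_index=False).agg({"Attr1": ["mean", "max"]})
print(df_gr.columns.tolist())

# Join the two levels, dropping empty labels so the group keys keep their plain names.
df_gr.columns = ["_".join(part for part in col if part) for col in df_gr.columns]
print(df_gr)
```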
