Pandas pivot and subtotals

【Title】: Pandas pivot and subtotals 【Posted】: 2021-11-03 00:13:37 【Question】:

Using this data:

import pandas as pd

d2 = {'Division': ['DIV1', 'DIV2', 'DIV1', 'DIV3', 'DIV2'], 'Region': ['DIV1-South', 'DIV2-North', 'DIV1-North', 'DIV3-East', 'DIV2-South'],
      'MD': ['Susie', 'Martha', 'Jane', 'Nichole', 'Randall'], 'Month': ['JAN', 'JAN', 'FEB', 'MAR', 'APR']}
df2 = pd.DataFrame(d2)

which looks like this:

  Division      Region       MD Month
0     DIV1  DIV1-South    Susie   JAN
1     DIV2  DIV2-North   Martha   JAN
2     DIV1  DIV1-North     Jane   FEB
3     DIV3   DIV3-East  Nichole   MAR
4     DIV2  DIV2-South  Randall   APR

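Thanks to the community here, I was able to pivot this data to get the counts for each month, using this line of code:

# index uses the 'MD' column from the data above; Month values become the columns
pivoted = df2.pivot_table(index=['Division', 'Region', 'MD'], columns='Month', aggfunc=len, fill_value=0)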

Month                         APR  FEB  JAN  MAR
Division Region     MD
DIV1     DIV1-North Jane        0    1    0    0
         DIV1-South Susie       0    0    1    0
DIV2     DIV2-North Martha      0    0    1    0
         DIV2-South Randall     1    0    0    0
DIV3     DIV3-East  Nichole     0    0    0    1

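This may not be possible, but I have found only one reference online for producing a pivot result that includes subtotals for each section. Unfortunately, that example does not work.

The ideal result would be: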

Month                                      APR  FEB  JAN  MAR
Division Region                MD
DIV1     DIV1-North            Jane          0    1    0    0
         DIV1-North SubTotal                 0    1    0    0
         DIV1-South            Susie         0    0    1    0
         DIV1-South SubTotal                 0    0    1    0
         DIV1 TOTAL                          0    1    1    0
DIV2     DIV2-North            Martha        0    0    1    0
         DIV2-North SubTotal                 0    0    1    0
         DIV2-South            Randall       1    0    0    0
         DIV2-South SubTotal                 1    0    0    0
         DIV2 TOTAL                          1    0    1    0
DIV3     DIV3-East             Nichole       0    0    0    1
         DIV3-East SubTotal                  0    0    0    1
         DIV3 TOTAL                          0    0    0    1

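This is a bit baffling and may not even be possible, but since it is fairly easy in an Excel pivot table, I was hoping Pandas had this capability somewhere; I just cannot find it. (That is still true after several days of searching and testing.)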

【Answer 1】:
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})

Output

     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9

# Sum D for each (A, B) row group, with one column per value of C
table = pd.pivot_table(df, values='D', index=['A', 'B'],
                    columns=['C'], aggfunc='sum')

Pivot table output

table
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
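
For comparison, pivot_table does have a built-in margins option, but it only appends a grand total row and column (labelled "All" by default), not a subtotal per group, which is presumably why it does not answer the question. A minimal sketch on the same example frame (the variable name with_totals is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7]})

# margins=True adds one "All" row and one "All" column holding grand totals.
# There is still no per-A subtotal row (for example, no "bar total").
with_totals = pd.pivot_table(df, values="D", index=["A", "B"],
                             columns=["C"], aggfunc="sum", margins=True)
print(with_totals)
```

Per-group subtotal rows therefore have to be built separately and merged back in.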

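【Comments】:

This looks like a standard pivot table and does not seem to address the question about subtotals.

Thanks for your answer, but my actual data is above. I added the code for the sample dataframe to make it easier to use. However, your code is built around a new set of data that you supplied, whereas the data I am working with is the multi-index df produced by the first pivot of the original data.

@SeaBean Your reply works perfectly. Thank you for taking the time to add it. I upvoted your answer. Definitely a complicated case, and you nailed it!

【Answer 2】: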

You can create the Division Total and Region SubTotal rows by grouping on the corresponding levels with .groupby() + GroupBy.sum(), as follows:

pivoted2 = pivoted.reset_index()

# Create `Division` Total
df_Div_sum = pivoted2.groupby('Division', as_index=False).sum()
df_Div_sum['Region'] = '_' + df_Div_sum['Division'] + ' Total'
df_Div_sum['MD'] = ''

# Create `Region` SubTotal
df_Reg_sum = pivoted2.groupby(['Division', 'Region'], as_index=False).sum()
df_Reg_sum['MD'] = '_' + df_Reg_sum['Region'] + ' SubTotal'

# Concat results and set index + sort index
df_out = (pd.concat([pivoted2,
                     df_Reg_sum,
                     df_Div_sum
                    ])
            .set_index(['Division', 'Region', 'MD'])
            .sort_index()
         )         

Input setup

import pandas as pd

d2 = {'Division': ['DIV1', 'DIV2', 'DIV1', 'DIV3', 'DIV2'], 'Region': ['DIV1-South', 'DIV2-North', 'DIV1-North', 'DIV3-East', 'DIV2-South'],
      'MD': ['Susie', 'Martha', 'Jane', 'Nichole', 'Randall'], 'Month': ['JAN', 'JAN', 'FEB', 'MAR', 'APR']}
df = pd.DataFrame(d2)

pivoted = df.pivot_table(index=['Division', 'Region', 'MD'], columns='Month', aggfunc=len, fill_value=0)

Output

print(df_out)


Month                                      APR  FEB  JAN  MAR
Division Region      MD                                      
DIV1     DIV1-North  Jane                    0    1    0    0
                     _DIV1-North SubTotal    0    1    0    0
         DIV1-South  Susie                   0    0    1    0
                     _DIV1-South SubTotal    0    0    1    0
         _DIV1 Total                         0    1    1    0
DIV2     DIV2-North  Martha                  0    0    1    0
                     _DIV2-North SubTotal    0    0    1    0
         DIV2-South  Randall                 1    0    0    0
                     _DIV2-South SubTotal    1    0    0    0
         _DIV2 Total                         1    0    1    0
DIV3     DIV3-East   Nichole                 0    0    0    1
                     _DIV3-East SubTotal     0    0    0    1
         _DIV3 Total                         0    0    0    1
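
The leading underscore in the total labels is what makes sort_index() place each total after the rows it summarizes. In plain string ordering, "_" (code 95) comes after every uppercase letter (65 to 90) but before every lowercase letter (97 to 122). A small sketch of that ordering, using made-up labels rather than the answer's actual frame:

```python
import pandas as pd

# Labels that start with an uppercase letter sort before the "_" labels.
labels = ["Susie", "_DIV1-South SubTotal", "Jane", "_DIV1 Total", "DIV1-North"]
ordered = sorted(labels)
print(ordered)

# pandas sorts string index values the same way.
idx = pd.Index(["_DIV1 Total", "DIV1-South", "DIV1-North"]).sort_values()
print(list(idx))

# Caveat: a label starting with a lowercase letter would sort AFTER the
# "_" rows, so the trick assumes capitalized names such as those in the sample.
mixed = sorted(["_SubTotal", "alice", "Bob"])
print(mixed)
```

If the real data can contain lowercase names, a prefix that sorts later (or an explicit sort key) would be needed instead; that is outside what the answer shows.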

Extended test data

Because your sample data has only one entry per Region, I added more test data for a more complete test:

Input setup

data = {'Division': ['DIV1', 'DIV1', 'DIV2', 'DIV2', 'DIV1', 'DIV1', 'DIV3', 'DIV3', 'DIV2', 'DIV2', 'DIV2'],
 'Region': ['DIV1-South', 'DIV1-South', 'DIV2-North', 'DIV2-North', 'DIV1-North', 'DIV1-North', 'DIV3-East', 'DIV3-East', 'DIV2-South', 'DIV2-South', 'DIV2-South'],
 'MD': ['Susie', 'Susie2', 'Martha', 'Martha2', 'Jane', 'Jane2', 'Nichole', 'Nichole2', 'Randall2', 'Randall3', 'Randall'],
 'Month': ['JAN', 'FEB', 'JAN',  'MAR', 'FEB', 'APR', 'MAR', 'APR', 'FEB', 'MAR', 'APR']}
df = pd.DataFrame(data)

pivoted = df.pivot_table(index=['Division', 'Region', 'MD'], columns='Month', aggfunc=len, fill_value=0)

print(pivoted)

Month                         APR  FEB  JAN  MAR
Division Region     MD                          
DIV1     DIV1-North Jane        0    1    0    0
                    Jane2       1    0    0    0
         DIV1-South Susie       0    0    1    0
                    Susie2      0    1    0    0
DIV2     DIV2-North Martha      0    0    1    0
                    Martha2     0    0    0    1
         DIV2-South Randall     1    0    0    0
                    Randall2    0    1    0    0
                    Randall3    0    0    0    1
DIV3     DIV3-East  Nichole     0    0    0    1
                    Nichole2    1    0    0    0

Output

print(df_out)

Month                                      APR  FEB  JAN  MAR
Division Region      MD                                      
DIV1     DIV1-North  Jane                    0    1    0    0
                     Jane2                   1    0    0    0
                     _DIV1-North SubTotal    1    1    0    0
         DIV1-South  Susie                   0    0    1    0
                     Susie2                  0    1    0    0
                     _DIV1-South SubTotal    0    1    1    0
         _DIV1 Total                         1    2    1    0
DIV2     DIV2-North  Martha                  0    0    1    0
                     Martha2                 0    0    0    1
                     _DIV2-North SubTotal    0    0    1    1
         DIV2-South  Randall                 1    0    0    0
                     Randall2                0    1    0    0
                     Randall3                0    0    0    1
                     _DIV2-South SubTotal    1    1    0    1
         _DIV2 Total                         1    1    1    2
DIV3     DIV3-East   Nichole                 0    0    0    1
                     Nichole2                1    0    0    0
                     _DIV3-East SubTotal     1    0    0    1
         _DIV3 Total                         1    0    0    1
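
The steps of this answer can also be collected into one self-contained script. The sketch below is a variation, not the answerer's exact code: the helper name add_subtotals is hypothetical, the counts are built with groupby().size().unstack() instead of pivot_table, and only the month columns are summed so that the text columns are never aggregated. It assumes the three index levels are named Division, Region, and MD as in the question.

```python
import pandas as pd


def add_subtotals(pivoted: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: append Region SubTotal and Division Total rows.

    Expects a row index with levels Division / Region / MD and one
    numeric column per month.
    """
    value_cols = list(pivoted.columns)
    flat = pivoted.reset_index()

    # One total row per Division; the "_" prefix sorts it after the regions.
    div_total = flat.groupby("Division", as_index=False)[value_cols].sum()
    div_total["Region"] = "_" + div_total["Division"] + " Total"
    div_total["MD"] = ""

    # One subtotal row per (Division, Region) pair.
    reg_total = flat.groupby(["Division", "Region"], as_index=False)[value_cols].sum()
    reg_total["MD"] = "_" + reg_total["Region"] + " SubTotal"

    # Stack detail rows and totals, then let the sorted index interleave them.
    return (
        pd.concat([flat, reg_total, div_total])
        .set_index(["Division", "Region", "MD"])
        .sort_index()
    )


data = {
    "Division": ["DIV1", "DIV1", "DIV2", "DIV2", "DIV1", "DIV1",
                 "DIV3", "DIV3", "DIV2", "DIV2", "DIV2"],
    "Region": ["DIV1-South", "DIV1-South", "DIV2-North", "DIV2-North",
               "DIV1-North", "DIV1-North", "DIV3-East", "DIV3-East",
               "DIV2-South", "DIV2-South", "DIV2-South"],
    "MD": ["Susie", "Susie2", "Martha", "Martha2", "Jane", "Jane2",
           "Nichole", "Nichole2", "Randall2", "Randall3", "Randall"],
    "Month": ["JAN", "FEB", "JAN", "MAR", "FEB", "APR",
              "MAR", "APR", "FEB", "MAR", "APR"],
}
df = pd.DataFrame(data)

# Count rows per (Division, Region, MD, Month), then move Month into columns.
counts = (
    df.groupby(["Division", "Region", "MD", "Month"])
    .size()
    .unstack("Month", fill_value=0)
)

df_out = add_subtotals(counts)
print(df_out)
```

Selecting value_cols explicitly keeps the sums independent of how a given pandas version treats string columns in GroupBy.sum().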
