Pandas Cookbook -- 07 Grouping for Aggregation, Filtering, and Transformation

Posted shiyushiyu


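Grouping for aggregation, filtering, and transformation

This is based on the translation by SeanCheney on Jianshu. I adjusted some of the formatting and reorganized the section structure so that it suits my own reading and is easier to search when I come back to it later.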

import pandas as pd
import numpy as np

Set the maximum number of columns and rows to display

pd.set_option('display.max_columns', 8, 'display.max_rows', 8)

1 Aggregation

Read the flights dataset and inspect the first few rows

flights = pd.read_csv('data/flights.csv')
flights.head()
   MONTH  DAY  WEEKDAY AIRLINE  ...  SCHED_ARR  ARR_DELAY  DIVERTED  CANCELLED
0      1    1        4      WN  ...       1905       65.0         0          0
1      1    1        4      UA  ...       1333      -13.0         0          0
2      1    1        4      MQ  ...       1453       35.0         0          0
3      1    1        4      AA  ...       1935       -7.0         0          0
4      1    1        4      WN  ...       2225       39.0         0          0

5 rows × 14 columns

1.1 Single-column aggregation

Group by AIRLINE and call agg with a dictionary that maps the column to aggregate to its aggregation function

flights.groupby('AIRLINE').agg({'ARR_DELAY': 'mean'}).head()
         ARR_DELAY
AIRLINE
AA        5.542661
AS       -0.833333
B6        8.692593
DL        0.339691
EV        7.034580

Alternatively, select the column with the indexing operator and pass the aggregation function to agg as a string

flights.groupby('AIRLINE')['ARR_DELAY'].agg('mean').head()
AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64

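You can also pass NumPy's mean function to agg (recent pandas releases emit a FutureWarning for this and suggest the string 'mean' instead)

flights.groupby('AIRLINE')['ARR_DELAY'].agg(np.mean).head()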
AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64

Or call the mean() method directly

flights.groupby('AIRLINE')['ARR_DELAY'].mean().head()
AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64
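
The spellings above can be checked against each other on a tiny hand-made frame. This is only a sketch: the toy DataFrame stands in for flights.csv and its values are made up for illustration. The np.mean variant is left out so the snippet runs without warnings.

```python
import pandas as pd

# Hypothetical miniature of the flights data: two airlines, a few delays.
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'AS', 'AS'],
    'ARR_DELAY': [10.0, 20.0, -5.0, 5.0],
})

by_dict = toy.groupby('AIRLINE').agg({'ARR_DELAY': 'mean'})  # DataFrame
by_string = toy.groupby('AIRLINE')['ARR_DELAY'].agg('mean')  # Series
by_method = toy.groupby('AIRLINE')['ARR_DELAY'].mean()       # Series

# The dictionary form keeps a DataFrame; the other two return a Series.
print(type(by_dict).__name__, type(by_string).__name__)
print(by_dict['ARR_DELAY'].equals(by_string))
print(by_string.equals(by_method))
```

All three compute the same per-airline means; the only difference is whether the result comes back as a one-column DataFrame or as a Series.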

1.2 Multi-column aggregation

Number of cancelled flights for each airline on each day of the week

flights.groupby(['AIRLINE', 'WEEKDAY'])['CANCELLED'].agg('sum').head(7)
AIRLINE  WEEKDAY
AA       1          41
         2           9
         3          16
         4          20
         5          18
         6          21
         7          29
Name: CANCELLED, dtype: int64

You can group by several columns, select several columns, and apply several aggregation functions at once.
Total number and proportion of cancelled or diverted flights for each airline on each day of the week

flights.groupby(['AIRLINE', 'WEEKDAY'])[['CANCELLED', 'DIVERTED']].agg(['sum', 'mean']).head(7)
                CANCELLED           DIVERTED
                      sum      mean      sum      mean
AIRLINE WEEKDAY
AA      1              41  0.032106        6  0.004699
        2               9  0.007341        2  0.001631
        3              16  0.011949        2  0.001494
        4              20  0.015004        5  0.003751
        5              18  0.014151        1  0.000786
        6              21  0.018667        9  0.008000
        7              29  0.021837        1  0.000753

Use a list of grouping columns and a dictionary of lists to group by and aggregate several columns
For each route, find the total number of flights, the number and proportion of cancellations, and the mean and variance of the air time

group_cols = ['ORG_AIR', 'DEST_AIR']
agg_dict = {'CANCELLED': ['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']}
flights.groupby(group_cols).agg(agg_dict).head()
                 CANCELLED              AIR_TIME
                       sum mean size        mean        var
ORG_AIR DEST_AIR
ATL     ABE              0  0.0   31   96.387097  45.778495
        ABQ              0  0.0   16  170.500000  87.866667
        ABY              0  0.0   19   28.578947   6.590643
        ACY              0  0.0    6   91.333333  11.466667
        AEX              0  0.0   40   78.725000  47.332692
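
To see what the dictionary-of-lists form produces, here is a small sketch on invented data (the toy frame and its values are assumptions, not rows from flights.csv). It shows that the resulting columns are (column, function) pairs and that a single cell is addressed with a row tuple and a column tuple.

```python
import pandas as pd

# Hypothetical routes; values are invented for illustration.
toy = pd.DataFrame({
    'ORG_AIR': ['ATL', 'ATL', 'ATL', 'DEN'],
    'DEST_AIR': ['ABQ', 'ABQ', 'ABE', 'ABQ'],
    'CANCELLED': [0, 1, 0, 0],
    'AIR_TIME': [170.0, 172.0, 96.0, 60.0],
})

agg_dict = {'CANCELLED': ['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']}
result = toy.groupby(['ORG_AIR', 'DEST_AIR']).agg(agg_dict)

# The columns form a two-level index of (original column, function) pairs.
print(result.columns.tolist())

# A single cell is addressed by a row tuple and a column tuple.
print(result.loc[('ATL', 'ABQ'), ('CANCELLED', 'sum')])
```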

1.3 The DataFrameGroupBy object

The groupby method returns a DataFrameGroupBy object

college = pd.read_csv('data/college.csv')
grouped = college.groupby(['STABBR', 'RELAFFIL'])

Check the type of the grouped object

type(grouped)
pandas.core.groupby.groupby.DataFrameGroupBy

Use the dir function to list all public attributes and methods of the object

print([attr for attr in dir(grouped) if not attr.startswith('_')])
['CITY', 'CURROPER', 'DISTANCEONLY', 'GRAD_DEBT_MDN_SUPP', 'HBCU', 'INSTNM', 'MD_EARN_WNE_P10', 'MENONLY', 'PCTFLOAN', 'PCTPELL', 'PPTUG_EF', 'RELAFFIL', 'SATMTMID', 'SATVRMID', 'STABBR', 'UG25ABV', 'UGDS', 'UGDS_2MOR', 'UGDS_AIAN', 'UGDS_ASIAN', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_NHPI', 'UGDS_NRA', 'UGDS_UNKN', 'UGDS_WHITE', 'WOMENONLY', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']

Use the ngroups attribute to get the number of groups

grouped.ngroups
112

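Look at the label that uniquely identifies each group
The groups attribute is a dictionary that maps each group label to the index labels of the rows in that group

groups = list(grouped.groups.keys())
groups[:6]
[('AK', 0), ('AK', 1), ('AL', 0), ('AL', 1), ('AR', 0), ('AR', 1)]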

Pass a group label as a tuple to get_group
For example, get all religiously affiliated schools in Florida

grouped.get_group(('FL', 1)).head()
     INSTNM                                       CITY           STABBR  HBCU  ...  PCTFLOAN  UG25ABV    MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
712  The Baptist College of Florida               Graceville     FL       0.0  ...    0.5602   0.3531              30800               20052
713  Barry University                             Miami          FL       0.0  ...    0.6733   0.4361              44100               28250
714  Gooding Institute of Nurse Anesthesia        Panama City    FL       0.0  ...       NaN      NaN                NaN   PrivacySuppressed
715  Bethune-Cookman University                   Daytona Beach  FL       1.0  ...    0.8867   0.0647              29400               36250
724  Johnson University Florida                   Kissimmee      FL       0.0  ...    0.7384   0.2185              26300               20199

5 rows × 27 columns
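
The three inspection tools used so far (ngroups, groups, get_group) can be tried on a toy frame. This is a minimal sketch with made-up school names; only the column names STABBR, RELAFFIL, and INSTNM are taken from the dataset.

```python
import pandas as pd

# Hypothetical schools; INSTNM values are placeholders.
toy = pd.DataFrame({
    'INSTNM': ['A', 'B', 'C', 'D', 'E'],
    'STABBR': ['AK', 'AK', 'AL', 'FL', 'FL'],
    'RELAFFIL': [0, 1, 0, 1, 1],
})
grouped = toy.groupby(['STABBR', 'RELAFFIL'])

# Number of distinct (state, affiliation) combinations.
print(grouped.ngroups)

# The keys of .groups are the group labels, one tuple per group.
labels = sorted(grouped.groups.keys())
print(labels)

# get_group takes one of those tuples and returns the matching rows.
florida_religious = grouped.get_group(('FL', 1))
print(florida_religious)
```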

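A groupby object is iterable, so you can walk through the groups one at a time

from IPython.display import display  # already available inside Jupyter

i = 0
for name, group in grouped:
    print(name)
    display(group.head(2))
    i += 1
    if i == 5:
        break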
('AK', 0)
     INSTNM                                       CITY           STABBR  HBCU  ...  PCTFLOAN  UG25ABV    MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
 60  University of Alaska Anchorage               Anchorage      AK       0.0  ...    0.2647   0.4386              42500             19449.5
 62  University of Alaska Fairbanks               Fairbanks      AK       0.0  ...    0.2550   0.4519              36200               19355

2 rows × 27 columns

('AK', 1)
     INSTNM                                       CITY           STABBR  HBCU  ...  PCTFLOAN  UG25ABV    MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
 61  Alaska Bible College                         Palmer         AK       0.0  ...    0.2857   0.4286                NaN   PrivacySuppressed
 64  Alaska Pacific University                    Anchorage      AK       0.0  ...    0.5297   0.4910              47000               23250

2 rows × 27 columns

('AL', 0)
     INSTNM                                       CITY           STABBR  HBCU  ...  PCTFLOAN  UG25ABV    MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
  0  Alabama A & M University                     Normal         AL       1.0  ...    0.8284   0.1049              30300               33888
  1  University of Alabama at Birmingham          Birmingham     AL       0.0  ...    0.5214   0.2422              39700             21941.5

2 rows × 27 columns

('AL', 1)
     INSTNM                                       CITY           STABBR  HBCU  ...  PCTFLOAN  UG25ABV    MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
  2  Amridge University                           Montgomery     AL       0.0  ...    0.7795   0.8540              40100               23370
 10  Birmingham Southern College                  Birmingham     AL       0.0  ...    0.4809   0.0152              44200               27000

2 rows × 27 columns

('AR', 0)
     INSTNM                                       CITY           STABBR  HBCU  ...  PCTFLOAN  UG25ABV    MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
128  University of Arkansas at Little Rock        Little Rock    AR       0.0  ...    0.4775   0.4062              33900               21736
129  University of Arkansas for Medical Sciences  Little Rock    AR       0.0  ...    0.6144   0.5133              61400               12500

2 rows × 27 columns

Calling head on a groupby object returns the first rows of every group together in a single DataFrame

grouped.head(2).head(6)
     INSTNM                                       CITY           STABBR  HBCU  ...  PCTFLOAN  UG25ABV    MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
  0  Alabama A & M University                     Normal         AL       1.0  ...    0.8284   0.1049              30300               33888
  1  University of Alabama at Birmingham          Birmingham     AL       0.0  ...    0.5214   0.2422              39700             21941.5
  2  Amridge University                           Montgomery     AL       0.0  ...    0.7795   0.8540              40100               23370
 10  Birmingham Southern College                  Birmingham     AL       0.0  ...    0.4809   0.0152              44200               27000
 43  Prince Institute-Southeast                   Elmhurst       IL       0.0  ...    0.9375   0.6569  PrivacySuppressed               20992
 60  University of Alaska Anchorage               Anchorage      AK       0.0  ...    0.2647   0.4386              42500             19449.5

6 rows × 27 columns

The nth method selects rows at given positions within each group; positions are zero-based, so the call below picks the second row (position 1) and the last row (position -1) of every group

grouped.nth([1, -1]).head(8)
                       CITY  CURROPER  DISTANCEONLY GRAD_DEBT_MDN_SUPP  ...  UGDS_NRA  UGDS_UNKN  UGDS_WHITE  WOMENONLY
STABBR RELAFFIL
AK     0          Fairbanks         1           0.0              19355  ...    0.0110     0.3060      0.4259        0.0
       0             Barrow         1           0.0  PrivacySuppressed  ...    0.0183     0.0000      0.1376        0.0
       1          Anchorage         1           0.0              23250  ...    0.0000     0.0873      0.5309        0.0
       1           Soldotna         1           0.0  PrivacySuppressed  ...    0.0000     0.1324      0.0588        0.0
AL     0         Birmingham         1           0.0            21941.5  ...    0.0179     0.0100      0.5922        0.0
       0             Dothan         1           0.0  PrivacySuppressed  ...       NaN        NaN         NaN        0.0
       1         Birmingham         1           0.0              27000  ...    0.0000     0.0051      0.7983        0.0
       1         Huntsville         1           NaN            36173.5  ...       NaN        NaN         NaN        NaN

8 rows × 25 columns

2 Aggregation functions

college = pd.read_csv('data/college.csv')
college.head()
     INSTNM                                       CITY           STABBR  HBCU  ...  PCTFLOAN  UG25ABV    MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
  0  Alabama A & M University                     Normal         AL       1.0  ...    0.8284   0.1049              30300               33888
  1  University of Alabama at Birmingham          Birmingham     AL       0.0  ...    0.5214   0.2422              39700             21941.5
  2  Amridge University                           Montgomery     AL       0.0  ...    0.7795   0.8540              40100               23370
  3  University of Alabama in Huntsville          Huntsville     AL       0.0  ...    0.4596   0.2640              45500               24097
  4  Alabama State University                     Montgomery     AL       1.0  ...    0.7554   0.1270              26600             33118.5

5 rows × 27 columns

2.1 Custom aggregation functions

Find the mean and standard deviation of the undergraduate population for each state

college.groupby('STABBR')['UGDS'].agg(['mean', 'std']).round(0).head()
          mean      std
STABBR
AK      2493.0   4052.0
AL      2790.0   4658.0
AR      1644.0   3143.0
AS      1276.0      NaN
AZ      4130.0  14894.0

Write a custom function that returns the largest number of standard deviations by which any value lies from the mean

def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

Pass the custom function itself (its name, without parentheses) to agg

college.groupby('STABBR')['UGDS'].agg(max_deviation).round(1).head()
STABBR
AK    2.6
AL    5.8
AR    6.3
AS    NaN
AZ    9.9
Name: UGDS, dtype: float64
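
The custom function is easy to verify by hand on a few numbers. The sketch below reuses max_deviation unchanged on an invented frame: for the values 1, 2, 3 the mean is 2 and the sample standard deviation is 1, so the largest standardized distance is exactly 1.

```python
import pandas as pd

def max_deviation(s):
    # Largest absolute z-score within the group.
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Hypothetical enrollments, chosen so the result is easy to check by hand.
toy = pd.DataFrame({
    'STABBR': ['AK', 'AK', 'AK', 'AL', 'AL', 'AL'],
    'UGDS': [1.0, 2.0, 3.0, 10.0, 10.0, 40.0],
})

result = toy.groupby('STABBR')['UGDS'].agg(max_deviation)
print(result)
```

For AL the mean is 20 and the sample standard deviation is the square root of 300, so the value 40 lies 20 / sqrt(300), roughly 1.15, standard deviations from the mean.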

Custom aggregation functions also work on several numeric columns at once

college.groupby('STABBR')[['UGDS', 'SATVRMID', 'SATMTMID']].agg(max_deviation).round(1).head()
        UGDS  SATVRMID  SATMTMID
STABBR
AK       2.6       NaN       NaN
AL       5.8       1.6       1.8
AR       6.3       2.2       2.3
AS       NaN       NaN       NaN
AZ       9.9       1.9       1.4

Custom aggregation functions can be combined with the built-in ones

college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATVRMID', 'SATMTMID']].agg([max_deviation, 'mean', 'std']).round(1).head()
                         UGDS                      SATVRMID  ...           SATMTMID
                max_deviation    mean     std max_deviation  ...   std max_deviation   mean   std
STABBR RELAFFIL
AK     0                  2.1  3508.9  4539.5           NaN  ...   NaN           NaN    NaN   NaN
       1                  1.1   123.3   132.9           NaN  ...   NaN           NaN  503.0   NaN
AL     0                  5.2  3248.8  5102.4           1.6  ...  56.5           1.7  515.8  56.7
       1                  2.4   979.7   870.8           1.5  ...  53.0           1.4  485.6  61.4
AR     0                  5.8  1793.7  3401.6           1.9  ...  37.9           2.0  503.6  39.0

5 rows × 9 columns

Pandas uses the function's name as the name of the resulting column; you can change it afterwards with the rename method, or beforehand by setting the function's __name__ attribute

max_deviation.__name__
'max_deviation'
max_deviation.__name__ = 'Max Deviation'
college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATVRMID', 'SATMTMID']].agg([max_deviation, 'mean', 'std']).round(1).head()
                         UGDS                      SATVRMID  ...           SATMTMID
                Max Deviation    mean     std Max Deviation  ...   std Max Deviation   mean   std
STABBR RELAFFIL
AK     0                  2.1  3508.9  4539.5           NaN  ...   NaN           NaN    NaN   NaN
       1                  1.1   123.3   132.9           NaN  ...   NaN           NaN  503.0   NaN
AL     0                  5.2  3248.8  5102.4           1.6  ...  56.5           1.7  515.8  56.7
       1                  2.4   979.7   870.8           1.5  ...  53.0           1.4  485.6  61.4
AR     0                  5.8  1793.7  3401.6           1.9  ...  37.9           2.0  503.6  39.0

5 rows × 9 columns

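2.2 Custom aggregation functions with *args and **kwargs

Define a function that returns the proportion of schools whose undergraduate population is between 1,000 and 3,000

def pct_between_1_3k(s):
    return s.between(1000, 3000).mean()

Group by state and religious affiliation, then aggregate

college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between_1_3k).head(9)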
STABBR  RELAFFIL
AK      0           0.142857
        1           0.000000
AL      0           0.236111
        1           0.333333
                      ...   
AR      1           0.111111
AS      0           1.000000
AZ      0           0.096774
        1           0.000000
Name: UGDS, Length: 9, dtype: float64

This function does not let the caller choose the lower and upper bounds, so write a new one that does

def pct_between(s, low, high):
    return s.between(low, high).mean()

Use this custom aggregation function, passing the lower and upper bounds as positional arguments

college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, 1000, 10000).head(9)
STABBR  RELAFFIL
AK      0           0.428571
        1           0.000000
AL      0           0.458333
        1           0.375000
                      ...   
AR      1           0.166667
AS      0           1.000000
AZ      0           0.233871
        1           0.111111
Name: UGDS, Length: 9, dtype: float64

Specify the bounds explicitly by keyword

college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, high=10000, low=1000).head(9)
STABBR  RELAFFIL
AK      0           0.428571
        1           0.000000
AL      0           0.458333
        1           0.375000
                      ...   
AR      1           0.166667
AS      0           1.000000
AZ      0           0.233871
        1           0.111111
Name: UGDS, Length: 9, dtype: float64

Positional and keyword arguments can also be mixed, as long as the positional arguments come first

college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, 1000, high=10000).head(9)
STABBR  RELAFFIL
AK      0           0.428571
        1           0.000000
AL      0           0.458333
        1           0.375000
                      ...   
AR      1           0.166667
AS      0           1.000000
AZ      0           0.233871
        1           0.111111
Name: UGDS, Length: 9, dtype: float64
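
The three calling styles can be compared side by side on a small frame. This is a sketch on invented enrollment numbers; pct_between is the same function as above, and everything after the function is forwarded by agg to it.

```python
import pandas as pd

def pct_between(s, low, high):
    return s.between(low, high).mean()

# Hypothetical enrollments.
toy = pd.DataFrame({
    'STABBR': ['AK', 'AK', 'AL', 'AL', 'AL', 'AL'],
    'UGDS': [500, 2000, 1500, 2500, 2800, 20000],
})
ugds = toy.groupby('STABBR')['UGDS']

positional = ugds.agg(pct_between, 1000, 3000)        # low, high by position
keyword = ugds.agg(pct_between, low=1000, high=3000)  # both by keyword
mixed = ugds.agg(pct_between, 1000, high=3000)        # positional first, then keyword

print(positional)
print(positional.equals(keyword) and keyword.equals(mixed))
```

One of the two AK schools and three of the four AL schools fall inside the range, so all three calls give 0.5 for AK and 0.75 for AL.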

Pandas does not support passing extra arguments when several aggregation functions are applied at once

Use a closure to build the customized aggregation functions instead

def make_agg_func(func, name, *args, **kwargs):
    def wrapper(x):
        return func(x, *args, **kwargs)
    wrapper.__name__ = name
    return wrapper

my_agg1 = make_agg_func(pct_between, 'pct_1_3k', low=1000, high=3000)
my_agg2 = make_agg_func(pct_between, 'pct_10_30k', 10000, 30000)
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg([my_agg1, my_agg2])
                 pct_1_3k  pct_10_30k
STABBR RELAFFIL
AK     0         0.142857    0.142857
       1         0.000000    0.000000
AL     0         0.236111    0.083333
       1         0.333333    0.000000
...                   ...         ...
WI     1         0.360000    0.000000
WV     0         0.246154    0.015385
       1         0.375000    0.000000
WY     0         0.545455    0.000000

112 rows × 2 columns
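
As an alternative to the closure, the same result can be sketched with named aggregation, where each keyword passed to agg becomes the output column name. This is not the approach used above; it assumes pandas 0.25 or newer, and the toy data is invented.

```python
import pandas as pd

def pct_between(s, low, high):
    return s.between(low, high).mean()

# Hypothetical enrollments.
toy = pd.DataFrame({
    'STABBR': ['AK', 'AK', 'AL', 'AL', 'AL', 'AL'],
    'UGDS': [500, 2000, 1500, 2500, 2800, 20000],
})

# Each keyword names an output column; each lambda fixes the bounds.
result = toy.groupby('STABBR')['UGDS'].agg(
    pct_1_3k=lambda s: pct_between(s, 1000, 3000),
    pct_10_30k=lambda s: pct_between(s, 10000, 30000),
)
print(result)
```

The closure version keeps the bounds and the column name together in one reusable object, while the keyword version is shorter when the functions are only needed once.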

3 Removing the MultiIndex after aggregation

Read the data

flights = pd.read_csv('data/flights.csv')
flights.head()
   MONTH  DAY  WEEKDAY AIRLINE  ...  SCHED_ARR  ARR_DELAY  DIVERTED  CANCELLED
0      1    1        4      WN  ...       1905       65.0         0          0
1      1    1        4      UA  ...       1333      -13.0         0          0
2      1    1        4      MQ  ...       1453       35.0         0          0
3      1    1        4      AA  ...       1935       -7.0         0          0
4      1    1        4      WN  ...       2225       39.0         0          0

5 rows × 14 columns

Group by AIRLINE and WEEKDAY, then aggregate DIST and ARR_DELAY separately

airline_info = (flights.groupby(['AIRLINE', 'WEEKDAY'])
                       .agg({'DIST': ['sum', 'mean'], 'ARR_DELAY': ['min', 'max']})
                       .astype(int))
airline_info.head()
                    DIST       ARR_DELAY
                     sum  mean       min  max
AIRLINE WEEKDAY
AA      1        1455386  1139       -60  551
        2        1358256  1107       -52  725
        3        1496665  1117       -45  473
        4        1452394  1089       -46  349
        5        1427749  1122       -41  732

Both the rows and the columns now have a two-level index

3.1 Concatenating the column levels

get_level_values(0) returns the first (outer) level of the column index

level0 = airline_info.columns.get_level_values(0)

get_level_values(1) returns the second (inner) level

level1 = airline_info.columns.get_level_values(1)

Join the two levels to form the new, single-level column index

airline_info.columns = level0 + '_' + level1
airline_info.head(7)
                 DIST_sum  DIST_mean  ARR_DELAY_min  ARR_DELAY_max
AIRLINE WEEKDAY
AA      1         1455386       1139            -60            551
        2         1358256       1107            -52            725
        3         1496665       1117            -45            473
        4         1452394       1089            -46            349
        5         1427749       1122            -41            732
        6         1265340       1124            -50            858
        7         1461906       1100            -49            626
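
Two other ways to end up with flat column names are sketched below on invented data: joining each (column, function) tuple with a list comprehension, and avoiding the two-level columns altogether with named aggregation (assumed available, pandas 0.25 or newer). Neither is the method used above; they are shown only for comparison.

```python
import pandas as pd

# Hypothetical flights.
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'AS'],
    'DIST': [100, 300, 50],
    'ARR_DELAY': [5.0, -3.0, 12.0],
})

# Option 1: aggregate as before, then join each column tuple into one string.
info = toy.groupby('AIRLINE').agg({'DIST': ['sum', 'mean'],
                                   'ARR_DELAY': ['min', 'max']})
info.columns = ['_'.join(col) for col in info.columns]

# Option 2: name the output columns up front with (column, function) pairs.
named = toy.groupby('AIRLINE').agg(
    DIST_sum=('DIST', 'sum'),
    DIST_mean=('DIST', 'mean'),
    ARR_DELAY_min=('ARR_DELAY', 'min'),
    ARR_DELAY_max=('ARR_DELAY', 'max'),
)

print(info.columns.tolist())
print(named.columns.tolist())
```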

3.2 Resetting the row index

reset_index() moves the row index levels back into regular columns, leaving a single-level index

airline_info.reset_index().head(7)
  AIRLINE  WEEKDAY  DIST_sum  DIST_mean  ARR_DELAY_min  ARR_DELAY_max
0      AA        1   1455386       1139            -60            551
1      AA        2   1358256       1107            -52            725
2      AA        3   1496665       1117            -45            473
3      AA        4   1452394       1089            -46            349
4      AA        5   1427749       1122            -41            732
5      AA        6   1265340       1124            -50            858
6      AA        7   1461906       1100            -49            626

By default, pandas places all grouping columns in the index after a groupby operation; setting as_index to False prevents this.
Calling reset_index after grouping achieves the same effect

flights.groupby(['AIRLINE'], as_index=False)['DIST'].agg('mean').round(0)
   AIRLINE    DIST
0       AA  1114.0
1       AS  1066.0
2       B6  1772.0
3       DL   866.0
..     ...     ...
10      UA  1231.0
11      US  1181.0
12      VX  1240.0
13      WN   810.0

14 rows × 2 columns
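
The claim that as_index=False and reset_index lead to the same table can be checked on a toy frame. A minimal sketch with invented distances:

```python
import pandas as pd

# Hypothetical flights.
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'AS'],
    'DIST': [100.0, 300.0, 50.0],
})

# Grouping column kept as a regular column from the start.
flat = toy.groupby('AIRLINE', as_index=False)['DIST'].agg('mean')

# Grouping column placed in the index first, then moved back out.
reset = toy.groupby('AIRLINE')['DIST'].agg('mean').reset_index()

print(flat)
print(reset)
```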

4 Filtering groups

college = pd.read_csv('data/college.csv', index_col='INSTNM')
grouped = college.groupby('STABBR')
grouped.ngroups
59

This is the number of distinct states; nunique() gives the same result

college['STABBR'].nunique()
59

Define a function that computes the overall proportion of minority (non-white) students in a group and returns True if that proportion exceeds the threshold

def check_minority(df, threshold):
    minority_pct = 1 - df['UGDS_WHITE']
    total_minority = (df['UGDS'] * minority_pct).sum()
    total_ugds = df['UGDS'].sum()
    total_minority_pct = total_minority / total_ugds
    return total_minority_pct > threshold

The grouped object has a filter method that accepts a custom function; the function's True or False result decides whether each group is kept

college_filtered = grouped.filter(check_minority, threshold=.5)
college_filtered.head()
                                             CITY      STABBR  HBCU  MENONLY  ...  PCTFLOAN  UG25ABV  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
INSTNM
Everest College-Phoenix                      Phoenix   AZ       0.0      0.0  ...    0.7151   0.6700            28600                9500
Collins College                              Phoenix   AZ       0.0      0.0  ...    0.8228   0.4764            25700               47000
Empire Beauty School-Paradise Valley         Phoenix   AZ       0.0      0.0  ...    0.5873   0.4651            17800                9588
Empire Beauty School-Tucson                  Tucson    AZ       0.0      0.0  ...    0.6615   0.4229            18200                9833
Thunderbird School of Global Management      Glendale  AZ       0.0      0.0  ...    0.0000   0.0000           118900   PrivacySuppressed

5 rows × 26 columns

Comparing the shapes shows that about 60% of the rows were filtered out; only 20 states have a minority majority

college.shape
(7535, 26)
college_filtered.shape
(3028, 26)
college_filtered['STABBR'].nunique()
20

Try a few different thresholds and check the shape and the number of distinct states

college_filtered_20 = grouped.filter(check_minority, threshold=.2)
college_filtered_20.shape, college_filtered_20['STABBR'].nunique()
((7461, 26), 57)
college_filtered_70 = grouped.filter(check_minority, threshold=.7)
college_filtered_70.shape, college_filtered_70['STABBR'].nunique()
((957, 26), 10)
college_filtered_95 = grouped.filter(check_minority, threshold=.95)
college_filtered_95.shape, college_filtered_95['STABBR'].nunique()
((156, 26), 7)
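
How filter decides group by group can be traced on a four-row example. The sketch below reuses check_minority on invented numbers: in AZ the weighted minority share is (1000 * 0.6 + 3000 * 0.8) / 4000 = 0.75, in VT it is (2000 * 0.1 + 2000 * 0.3) / 4000 = 0.2, so only the AZ rows survive a threshold of 0.5.

```python
import pandas as pd

def check_minority(df, threshold):
    minority_pct = 1 - df['UGDS_WHITE']
    total_minority = (df['UGDS'] * minority_pct).sum()
    total_ugds = df['UGDS'].sum()
    return total_minority / total_ugds > threshold

# Hypothetical schools in two states.
toy = pd.DataFrame({
    'STABBR': ['AZ', 'AZ', 'VT', 'VT'],
    'UGDS': [1000, 3000, 2000, 2000],
    'UGDS_WHITE': [0.4, 0.2, 0.9, 0.7],
})

# Extra keyword arguments are forwarded to the filtering function.
kept = toy.groupby('STABBR').filter(check_minority, threshold=0.5)
print(kept)
```

Note that filter returns the original rows of the surviving groups with their original index, not one row per group.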

5 The apply method

apply is the most flexible of all pandas methods

Read the college dataset and drop every row that has a missing value in any of the UGDS, SATMTMID, or SATVRMID columns

college = pd.read_csv('data/college.csv')
subset = ['UGDS', 'SATMTMID', 'SATVRMID']
college2 = college.dropna(subset=subset)
college.shape, college2.shape
((7535, 27), (1184, 27))

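5.1 apply versus agg

Define a function that computes the average SAT math score weighted by undergraduate population

def weighted_math_average(df):
    weighted_math = df['UGDS'] * df['SATMTMID']
    return int(weighted_math.sum() / df['UGDS'].sum())

5.1.1 Using the function with apply

Group by state and call apply, passing in the custom function

college2.groupby('STABBR').apply(weighted_math_average).head()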
STABBR
AK    503
AL    536
AR    529
AZ    569
CA    564
dtype: int64

5.1.2 Using the function with agg

college2.groupby('STABBR').agg(weighted_math_average).head()
        INSTNM  CITY  HBCU  MENONLY  ...  PCTFLOAN  UG25ABV  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
STABBR
AK         503   503   503      503  ...       503      503              503                 503
AL         536   536   536      536  ...       536      536              536                 536
AR         529   529   529      529  ...       529      529              529                 529
AZ         569   569   569      569  ...       569      569              569                 569
CA         564   564   564      564  ...       564      564              564                 564

5 rows × 26 columns

Restricting the selection to the SATMTMID column raises an error, because the function then receives only that column and can no longer access UGDS.

# college2.groupby('STABBR')['SATMTMID'].agg(weighted_math_average)
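
The difference comes down to what each method hands to the function. The sketch below, on invented data, runs the weighted average through apply on an explicit two-column selection, and records what a function sees when agg is called on a single column. Selecting the columns before apply is a choice made here so the snippet behaves the same across pandas versions; it is not how the call above is written.

```python
import pandas as pd

def weighted_math_average(df):
    weighted_math = df['UGDS'] * df['SATMTMID']
    return int(weighted_math.sum() / df['UGDS'].sum())

# Hypothetical schools.
toy = pd.DataFrame({
    'STABBR': ['AK', 'AK', 'AL'],
    'UGDS': [1000, 3000, 500],
    'SATMTMID': [500, 600, 450],
})

# apply passes each group as a DataFrame, so both columns are reachable.
result = toy.groupby('STABBR')[['UGDS', 'SATMTMID']].apply(weighted_math_average)
print(result)

# agg on one column passes a Series per group; record the type that arrives.
kinds = set()
toy.groupby('STABBR')['SATMTMID'].agg(
    lambda s: kinds.add(type(s).__name__) or s.mean()
)
print(kinds)
```

For AK the weighted average is (1000 * 500 + 3000 * 600) / 4000 = 575, which differs from the plain mean of 550 because the larger school counts for more.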

5.2 Creating new columns with apply

A nice feature of apply is that returning a Series from the function creates several new columns at once

from collections import OrderedDict
def weighted_average(df):
    data = OrderedDict()
    weight_m = df['UGDS'] * df['SATMTMID']
    weight_v = df['UGDS'] * df['SATVRMID']

    data['weighted_math_avg'] = weight_m.sum() / df['UGDS'].sum()
    data['weighted_verbal_avg'] = weight_v.sum() / df['UGDS'].sum()
    data['math_avg'] = df['SATMTMID'].mean()
    data['verbal_avg'] = df['SATVRMID'].mean()
    data['count'] = len(df)
    # Truncate to integers after construction; newer pandas versions may
    # reject a lossy float-to-int cast inside the Series constructor.
    return pd.Series(data).astype(int)
college2.groupby('STABBR').apply(weighted_average).head(10)
        weighted_math_avg  weighted_verbal_avg  math_avg  verbal_avg  count
STABBR
AK                    503                  555       503         555      1
AL                    536                  533       504         508     21
AR                    529                  504       515         491     16
AZ                    569                  557       536         538      6
...                   ...                  ...       ...         ...    ...
CT                    545                  533       522         517     14
DC                    621                  623       588         589      6
DE                    569                  553       495         486      3
FL                    565                  565       521         529     38

10 rows × 5 columns
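
The mechanism is that each key of the returned Series becomes a column and each group becomes a row. A reduced sketch on invented data, with only three of the statistics and an explicit column selection added here for portability across pandas versions:

```python
import pandas as pd

def weighted_average(df):
    # Each key of the returned Series becomes a column of the result.
    return pd.Series({
        'weighted_math_avg': (df['UGDS'] * df['SATMTMID']).sum() / df['UGDS'].sum(),
        'math_avg': df['SATMTMID'].mean(),
        'count': len(df),
    })

# Hypothetical schools.
toy = pd.DataFrame({
    'STABBR': ['AK', 'AK', 'AL'],
    'UGDS': [1000, 3000, 500],
    'SATMTMID': [500, 600, 450],
})

result = toy.groupby('STABBR')[['UGDS', 'SATMTMID']].apply(weighted_average)
print(result)
```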

5.3 Returning a DataFrame from apply

Define a function that returns a DataFrame
It uses NumPy's average function for the weighted mean and SciPy's gmean and hmean for the geometric and harmonic means

from scipy.stats import gmean, hmean
def calculate_means(df):
    df_means = pd.DataFrame(index=['Arithmetic', 'Weighted', 'Geometric', 'Harmonic'])
    cols = ['SATMTMID', 'SATVRMID']
    for col in cols:
        arithmetic = df[col].mean()
        weighted = np.average(df[col], weights=df['UGDS'])
        geometric = gmean(df[col])
        harmonic = hmean(df[col])
        df_means[col] = [arithmetic, weighted, geometric, harmonic]
    df_means['count'] = len(df)
    return df_means.astype(int)
(college2.groupby('STABBR')
         .filter(lambda x: len(x) != 1)
         .groupby('STABBR')
         .apply(calculate_means)
         .head(10))
                   SATMTMID  SATVRMID  count
STABBR
AL     Arithmetic       504       508     21
       Weighted         536       533     21
       Geometric        500       505     21
       Harmonic         497       502     21
...                     ...       ...    ...
AR     Geometric        514       489     16
       Harmonic         513       487     16
AZ     Arithmetic       536       538      6
       Weighted         569       557      6

10 rows × 3 columns
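
When the function returns a DataFrame, the group label is stacked on top of that DataFrame's own index, which is where the two-level row index above comes from. A trimmed sketch on invented data, keeping only the arithmetic and weighted means so that SciPy is not needed:

```python
import numpy as np
import pandas as pd

def calculate_means(df):
    # One row per kind of mean; the group label is added by apply.
    out = pd.DataFrame(index=['Arithmetic', 'Weighted'])
    out['SATMTMID'] = [
        df['SATMTMID'].mean(),
        np.average(df['SATMTMID'], weights=df['UGDS']),
    ]
    return out

# Hypothetical schools.
toy = pd.DataFrame({
    'STABBR': ['AK', 'AK', 'AL'],
    'UGDS': [1000, 3000, 500],
    'SATMTMID': [500, 600, 450],
})

result = toy.groupby('STABBR')[['UGDS', 'SATMTMID']].apply(calculate_means)
print(result)

# Rows are addressed by (group label, row label of the returned frame).
print(result.loc[('AK', 'Weighted'), 'SATMTMID'])
```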








