Pandas Cookbook -- 07 Grouping: Aggregation, Filtering, and Transformation
Posted by shiyushiyu
This chapter is based on SeanCheney's translation (published on Jianshu) of the Pandas Cookbook. I have adjusted the formatting and section structure to make it easier to read and to look things up later.
import pandas as pd
import numpy as np
Set the maximum numbers of columns and rows to display (the fully qualified 'display.' option names avoid ambiguity in newer pandas versions)
pd.set_option('display.max_columns', 8, 'display.max_rows', 8)
1 Aggregation
Read the flights dataset and inspect the first few rows
flights = pd.read_csv('data/flights.csv')
flights.head()
MONTH | DAY | WEEKDAY | AIRLINE | ... | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 4 | WN | ... | 1905 | 65.0 | 0 | 0 |
1 | 1 | 1 | 4 | UA | ... | 1333 | -13.0 | 0 | 0 |
2 | 1 | 1 | 4 | MQ | ... | 1453 | 35.0 | 0 | 0 |
3 | 1 | 1 | 4 | AA | ... | 1935 | -7.0 | 0 | 0 |
4 | 1 | 1 | 4 | WN | ... | 2225 | 39.0 | 0 | 0 |
5 rows × 14 columns
1.1 Aggregating a single column
Group by AIRLINE and call the agg method, passing a dict that maps the column to aggregate to the aggregating function
flights.groupby('AIRLINE').agg({'ARR_DELAY': 'mean'}).head()
ARR_DELAY | |
---|---|
AIRLINE | |
AA | 5.542661 |
AS | -0.833333 |
B6 | 8.692593 |
DL | 0.339691 |
EV | 7.034580 |
Alternatively, select the column with the indexing operator and pass the aggregating function to agg as a string
flights.groupby('AIRLINE')['ARR_DELAY'].agg('mean').head()
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580
Name: ARR_DELAY, dtype: float64
You can also pass NumPy's mean function to agg (newer pandas versions emit a FutureWarning for this and prefer the string 'mean')
flights.groupby('AIRLINE')['ARR_DELAY'].agg(np.mean).head()
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580
Name: ARR_DELAY, dtype: float64
Or simply call the mean method directly
flights.groupby('AIRLINE')['ARR_DELAY'].mean().head()
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580
Name: ARR_DELAY, dtype: float64
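All three spellings above compute the same result; a quick sanity check on a made-up frame (the column names below are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'airline': ['AA', 'AA', 'UA'],
                   'delay': [10.0, 20.0, 5.0]})
g = df.groupby('airline')['delay']
a = g.agg('mean')   # string spelling
b = g.mean()        # direct method call
print(a.equals(b))  # True
```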
1.2 Aggregating multiple columns
The total number of cancelled flights for each airline on each weekday
flights.groupby(['AIRLINE', 'WEEKDAY'])['CANCELLED'].agg('sum').head(7)
AIRLINE WEEKDAY
AA 1 41
2 9
3 16
4 20
5 18
6 21
7 29
Name: CANCELLED, dtype: int64
The grouping columns, the selected columns, and the aggregating functions can all be lists.
For each airline on each weekday, find the total number and the proportion of cancelled and diverted flights (note the double brackets: selecting multiple columns requires a list)
flights.groupby(['AIRLINE', 'WEEKDAY'])[['CANCELLED', 'DIVERTED']].agg(['sum', 'mean']).head(7)
CANCELLED | DIVERTED | ||||
---|---|---|---|---|---|
sum | mean | sum | mean | ||
AIRLINE | WEEKDAY | ||||
AA | 1 | 41 | 0.032106 | 6 | 0.004699 |
2 | 9 | 0.007341 | 2 | 0.001631 | |
3 | 16 | 0.011949 | 2 | 0.001494 | |
4 | 20 | 0.015004 | 5 | 0.003751 | |
5 | 18 | 0.014151 | 1 | 0.000786 | |
6 | 21 | 0.018667 | 9 | 0.008000 | |
7 | 29 | 0.021837 | 1 | 0.000753 |
Use a list of grouping columns and a dict mapping columns to lists of functions.
For each route, find the total number of flights, the number and proportion of cancellations, and the mean and variance of the air time
group_cols = ['ORG_AIR', 'DEST_AIR']
agg_dict = {'CANCELLED': ['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']}
flights.groupby(group_cols).agg(agg_dict).head()
CANCELLED | AIR_TIME | |||||
---|---|---|---|---|---|---|
sum | mean | size | mean | var | ||
ORG_AIR | DEST_AIR | |||||
ATL | ABE | 0 | 0.0 | 31 | 96.387097 | 45.778495 |
ABQ | 0 | 0.0 | 16 | 170.500000 | 87.866667 | |
ABY | 0 | 0.0 | 19 | 28.578947 | 6.590643 | |
ACY | 0 | 0.0 | 6 | 91.333333 | 11.466667 | |
AEX | 0 | 0.0 | 40 | 78.725000 | 47.332692 |
1.3 The DataFrameGroupBy object
The groupby method returns a DataFrameGroupBy object
college = pd.read_csv('data/college.csv')
grouped = college.groupby(['STABBR', 'RELAFFIL'])
Check the type of the grouped object
type(grouped)
pandas.core.groupby.groupby.DataFrameGroupBy
Use dir to list the object's public attributes and methods
print([attr for attr in dir(grouped) if not attr.startswith('_')])
['CITY', 'CURROPER', 'DISTANCEONLY', 'GRAD_DEBT_MDN_SUPP', 'HBCU', 'INSTNM', 'MD_EARN_WNE_P10', 'MENONLY', 'PCTFLOAN', 'PCTPELL', 'PPTUG_EF', 'RELAFFIL', 'SATMTMID', 'SATVRMID', 'STABBR', 'UG25ABV', 'UGDS', 'UGDS_2MOR', 'UGDS_AIAN', 'UGDS_ASIAN', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_NHPI', 'UGDS_NRA', 'UGDS_UNKN', 'UGDS_WHITE', 'WOMENONLY', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']
Use the ngroups attribute to see the number of groups
grouped.ngroups
112
Inspect the labels that identify each group.
The groups attribute is a dict mapping each group label to the corresponding row index labels
groups = list(grouped.groups.keys())
groups[:6]
[('AK', 0), ('AK', 1), ('AL', 0), ('AL', 1), ('AR', 0), ('AR', 1)]
Use get_group with the tuple that labels a group.
For example, retrieve all religiously affiliated schools in Florida
grouped.get_group(('FL', 1)).head()
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
712 | The Baptist College of Florida | Graceville | FL | 0.0 | ... | 0.5602 | 0.3531 | 30800 | 20052 |
713 | Barry University | Miami | FL | 0.0 | ... | 0.6733 | 0.4361 | 44100 | 28250 |
714 | Gooding Institute of Nurse Anesthesia | Panama City | FL | 0.0 | ... | NaN | NaN | NaN | PrivacySuppressed |
715 | Bethune-Cookman University | Daytona Beach | FL | 1.0 | ... | 0.8867 | 0.0647 | 29400 | 36250 |
724 | Johnson University Florida | Kissimmee | FL | 0.0 | ... | 0.7384 | 0.2185 | 26300 | 20199 |
5 rows × 27 columns
A groupby object is iterable, so you can step through the groups one by one
i = 0
for name, group in grouped:
    print(name)
    display(group.head(2))
    i += 1
    if i == 5:
        break
('AK', 0)
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
60 | University of Alaska Anchorage | Anchorage | AK | 0.0 | ... | 0.2647 | 0.4386 | 42500 | 19449.5 |
62 | University of Alaska Fairbanks | Fairbanks | AK | 0.0 | ... | 0.2550 | 0.4519 | 36200 | 19355 |
2 rows × 27 columns
('AK', 1)
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
61 | Alaska Bible College | Palmer | AK | 0.0 | ... | 0.2857 | 0.4286 | NaN | PrivacySuppressed |
64 | Alaska Pacific University | Anchorage | AK | 0.0 | ... | 0.5297 | 0.4910 | 47000 | 23250 |
2 rows × 27 columns
('AL', 0)
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | Normal | AL | 1.0 | ... | 0.8284 | 0.1049 | 30300 | 33888 |
1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | ... | 0.5214 | 0.2422 | 39700 | 21941.5 |
2 rows × 27 columns
('AL', 1)
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
2 | Amridge University | Montgomery | AL | 0.0 | ... | 0.7795 | 0.8540 | 40100 | 23370 |
10 | Birmingham Southern College | Birmingham | AL | 0.0 | ... | 0.4809 | 0.0152 | 44200 | 27000 |
2 rows × 27 columns
('AR', 0)
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
128 | University of Arkansas at Little Rock | Little Rock | AR | 0.0 | ... | 0.4775 | 0.4062 | 33900 | 21736 |
129 | University of Arkansas for Medical Sciences | Little Rock | AR | 0.0 | ... | 0.6144 | 0.5133 | 61400 | 12500 |
2 rows × 27 columns
Calling head on a groupby object returns the first few rows of every group in a single DataFrame
grouped.head(2).head(6)
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | Normal | AL | 1.0 | ... | 0.8284 | 0.1049 | 30300 | 33888 |
1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | ... | 0.5214 | 0.2422 | 39700 | 21941.5 |
2 | Amridge University | Montgomery | AL | 0.0 | ... | 0.7795 | 0.8540 | 40100 | 23370 |
10 | Birmingham Southern College | Birmingham | AL | 0.0 | ... | 0.4809 | 0.0152 | 44200 | 27000 |
43 | Prince Institute-Southeast | Elmhurst | IL | 0.0 | ... | 0.9375 | 0.6569 | PrivacySuppressed | 20992 |
60 | University of Alaska Anchorage | Anchorage | AK | 0.0 | ... | 0.2647 | 0.4386 | 42500 | 19449.5 |
6 rows × 27 columns
The nth method selects rows at the given positions within each group; below, the second row (position 1) and the last row of each group are selected
grouped.nth([1, -1]).head(8)
CITY | CURROPER | DISTANCEONLY | GRAD_DEBT_MDN_SUPP | ... | UGDS_NRA | UGDS_UNKN | UGDS_WHITE | WOMENONLY | ||
---|---|---|---|---|---|---|---|---|---|---|
STABBR | RELAFFIL | |||||||||
AK | 0 | Fairbanks | 1 | 0.0 | 19355 | ... | 0.0110 | 0.3060 | 0.4259 | 0.0 |
0 | Barrow | 1 | 0.0 | PrivacySuppressed | ... | 0.0183 | 0.0000 | 0.1376 | 0.0 | |
1 | Anchorage | 1 | 0.0 | 23250 | ... | 0.0000 | 0.0873 | 0.5309 | 0.0 | |
1 | Soldotna | 1 | 0.0 | PrivacySuppressed | ... | 0.0000 | 0.1324 | 0.0588 | 0.0 | |
AL | 0 | Birmingham | 1 | 0.0 | 21941.5 | ... | 0.0179 | 0.0100 | 0.5922 | 0.0 |
0 | Dothan | 1 | 0.0 | PrivacySuppressed | ... | NaN | NaN | NaN | 0.0 | |
1 | Birmingham | 1 | 0.0 | 27000 | ... | 0.0000 | 0.0051 | 0.7983 | 0.0 | |
1 | Huntsville | 1 | NaN | 36173.5 | ... | NaN | NaN | NaN | NaN |
8 rows × 25 columns
2 Aggregation functions
college = pd.read_csv('data/college.csv')
college.head()
INSTNM | CITY | STABBR | HBCU | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | Normal | AL | 1.0 | ... | 0.8284 | 0.1049 | 30300 | 33888 |
1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | ... | 0.5214 | 0.2422 | 39700 | 21941.5 |
2 | Amridge University | Montgomery | AL | 0.0 | ... | 0.7795 | 0.8540 | 40100 | 23370 |
3 | University of Alabama in Huntsville | Huntsville | AL | 0.0 | ... | 0.4596 | 0.2640 | 45500 | 24097 |
4 | Alabama State University | Montgomery | AL | 1.0 | ... | 0.7554 | 0.1270 | 26600 | 33118.5 |
5 rows × 27 columns
2.1 Custom aggregation functions
Find the mean and standard deviation of the undergraduate population (UGDS) for each state
college.groupby('STABBR')['UGDS'].agg(['mean', 'std']).round(0).head()
mean | std | |
---|---|---|
STABBR | ||
AK | 2493.0 | 4052.0 |
AL | 2790.0 | 4658.0 |
AR | 1644.0 | 3143.0 |
AS | 1276.0 | NaN |
AZ | 4130.0 | 14894.0 |
Write a custom function that returns the largest number of standard deviations any value in the group lies from the mean
def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()
Pass the custom function itself (not a string) to agg
college.groupby('STABBR')['UGDS'].agg(max_deviation).round(1).head()
STABBR
AK 2.6
AL 5.8
AR 6.3
AS NaN
AZ 9.9
Name: UGDS, dtype: float64
Custom aggregation functions also work across multiple numeric columns
college.groupby('STABBR')[['UGDS', 'SATVRMID', 'SATMTMID']].agg(max_deviation).round(1).head()
UGDS | SATVRMID | SATMTMID | |
---|---|---|---|
STABBR | |||
AK | 2.6 | NaN | NaN |
AL | 5.8 | 1.6 | 1.8 |
AR | 6.3 | 2.2 | 2.3 |
AS | NaN | NaN | NaN |
AZ | 9.9 | 1.9 | 1.4 |
Custom functions can also be combined with built-in aggregating functions
college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATVRMID', 'SATMTMID']].agg([max_deviation, 'mean', 'std']).round(1).head()
UGDS | SATVRMID | SATMTMID | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
max_deviation | mean | std | max_deviation | ... | std | max_deviation | mean | std | ||
STABBR | RELAFFIL | |||||||||
AK | 0 | 2.1 | 3508.9 | 4539.5 | NaN | ... | NaN | NaN | NaN | NaN |
1 | 1.1 | 123.3 | 132.9 | NaN | ... | NaN | NaN | 503.0 | NaN | |
AL | 0 | 5.2 | 3248.8 | 5102.4 | 1.6 | ... | 56.5 | 1.7 | 515.8 | 56.7 |
1 | 2.4 | 979.7 | 870.8 | 1.5 | ... | 53.0 | 1.4 | 485.6 | 61.4 | |
AR | 0 | 5.8 | 1793.7 | 3401.6 | 1.9 | ... | 37.9 | 2.0 | 503.6 | 39.0 |
5 rows × 9 columns
Pandas uses the function's name as the resulting column name; you can change it afterwards with rename, or by modifying the function's __name__ attribute
max_deviation.__name__
'max_deviation'
max_deviation.__name__ = 'Max Deviation'
college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATVRMID', 'SATMTMID']].agg([max_deviation, 'mean', 'std']).round(1).head()
UGDS | SATVRMID | SATMTMID | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Max Deviation | mean | std | Max Deviation | ... | std | Max Deviation | mean | std | ||
STABBR | RELAFFIL | |||||||||
AK | 0 | 2.1 | 3508.9 | 4539.5 | NaN | ... | NaN | NaN | NaN | NaN |
1 | 1.1 | 123.3 | 132.9 | NaN | ... | NaN | NaN | 503.0 | NaN | |
AL | 0 | 5.2 | 3248.8 | 5102.4 | 1.6 | ... | 56.5 | 1.7 | 515.8 | 56.7 |
1 | 2.4 | 979.7 | 870.8 | 1.5 | ... | 53.0 | 1.4 | 485.6 | 61.4 | |
AR | 0 | 5.8 | 1793.7 | 3401.6 | 1.9 | ... | 37.9 | 2.0 | 503.6 | 39.0 |
5 rows × 9 columns
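As an aside, pandas 0.25+ also supports named aggregation, which controls the output column names directly without touching __name__. A minimal sketch on synthetic data (the column, group, and output names here are invented for illustration):

```python
import pandas as pd

def max_deviation(s):
    # maximum absolute z-score within the group
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

df = pd.DataFrame({'state': ['AK', 'AK', 'AL', 'AL'],
                   'ugds': [100.0, 300.0, 200.0, 600.0]})
# named aggregation: output_name=(column, function)
res = df.groupby('state').agg(max_dev=('ugds', max_deviation),
                              avg=('ugds', 'mean'))
print(res.columns.tolist())  # ['max_dev', 'avg']
```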
2.2 Custom aggregation functions with *args and **kwargs
Define a function that returns the proportion of schools with an undergraduate population between 1,000 and 3,000
def pct_between_1_3k(s):
    return s.between(1000, 3000).mean()
Group by state and religious affiliation, then aggregate
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between_1_3k).head(9)
STABBR RELAFFIL
AK 0 0.142857
1 0.000000
AL 0 0.236111
1 0.333333
...
AR 1 0.111111
AS 0 1.000000
AZ 0 0.096774
1 0.000000
Name: UGDS, Length: 9, dtype: float64
This function hard-codes the bounds; write a new one that lets the caller choose them
def pct_between(s, low, high):
    return s.between(low, high).mean()
Use the new function, passing the bounds as extra positional arguments to agg
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, 1000, 10000).head(9)
STABBR RELAFFIL
AK 0 0.428571
1 0.000000
AL 0 0.458333
1 0.375000
...
AR 1 0.166667
AS 0 1.000000
AZ 0 0.233871
1 0.111111
Name: UGDS, Length: 9, dtype: float64
Specify the bounds explicitly as keyword arguments
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, high=10000, low=1000).head(9)
STABBR RELAFFIL
AK 0 0.428571
1 0.000000
AL 0 0.458333
1 0.375000
...
AR 1 0.166667
AS 0 1.000000
AZ 0 0.233871
1 0.111111
Name: UGDS, Length: 9, dtype: float64
Positional and keyword arguments can also be mixed, as long as the keyword arguments come last
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, 1000, high=10000).head(9)
STABBR RELAFFIL
AK 0 0.428571
1 0.000000
AL 0 0.458333
1 0.375000
...
AR 1 0.166667
AS 0 1.000000
AZ 0 0.233871
1 0.111111
Name: UGDS, Length: 9, dtype: float64
When a list of multiple aggregating functions is passed, agg does not accept extra arguments.
Work around this with a closure that bakes the arguments into the function
def make_agg_func(func, name, *args, **kwargs):
    def wrapper(x):
        return func(x, *args, **kwargs)
    wrapper.__name__ = name
    return wrapper
my_agg1 = make_agg_func(pct_between, 'pct_1_3k', low=1000, high=3000)
my_agg2 = make_agg_func(pct_between, 'pct_10_30k', 10000, 30000)
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg([my_agg1, my_agg2])
pct_1_3k | pct_10_30k | ||
---|---|---|---|
STABBR | RELAFFIL | ||
AK | 0 | 0.142857 | 0.142857 |
1 | 0.000000 | 0.000000 | |
AL | 0 | 0.236111 | 0.083333 |
1 | 0.333333 | 0.000000 | |
... | ... | ... | ... |
WI | 1 | 0.360000 | 0.000000 |
WV | 0 | 0.246154 | 0.015385 |
1 | 0.375000 | 0.000000 | |
WY | 0 | 0.545455 | 0.000000 |
112 rows × 2 columns
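The closure pattern above can be verified on a tiny frame; the data below is invented purely to exercise make_agg_func:

```python
import pandas as pd

def pct_between(s, low, high):
    # proportion of the group's values within [low, high]
    return s.between(low, high).mean()

def make_agg_func(func, name, *args, **kwargs):
    # bake extra arguments into a wrapper and give it a display name
    def wrapper(x):
        return func(x, *args, **kwargs)
    wrapper.__name__ = name
    return wrapper

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                   'v': [500, 1500, 2500, 3500]})
res = df.groupby('g')['v'].agg([make_agg_func(pct_between, 'pct_1_3k', 1000, 3000)])
# the result column is named 'pct_1_3k'; each group has half its values in range
print(res)
```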
3 Removing the MultiIndex after aggregation
Read the data
flights = pd.read_csv('data/flights.csv')
flights.head()
MONTH | DAY | WEEKDAY | AIRLINE | ... | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 4 | WN | ... | 1905 | 65.0 | 0 | 0 |
1 | 1 | 1 | 4 | UA | ... | 1333 | -13.0 | 0 | 0 |
2 | 1 | 1 | 4 | MQ | ... | 1453 | 35.0 | 0 | 0 |
3 | 1 | 1 | 4 | AA | ... | 1935 | -7.0 | 0 | 0 |
4 | 1 | 1 | 4 | WN | ... | 2225 | 39.0 | 0 | 0 |
5 rows × 14 columns
Group by AIRLINE and WEEKDAY, aggregating DIST and ARR_DELAY separately
airline_info = (flights.groupby(['AIRLINE', 'WEEKDAY'])
                       .agg({'DIST': ['sum', 'mean'], 'ARR_DELAY': ['min', 'max']})
                       .astype(int))
airline_info.head()
DIST | ARR_DELAY | ||||
---|---|---|---|---|---|
sum | mean | min | max | ||
AIRLINE | WEEKDAY | ||||
AA | 1 | 1455386 | 1139 | -60 | 551 |
2 | 1358256 | 1107 | -52 | 725 | |
3 | 1496665 | 1117 | -45 | 473 | |
4 | 1452394 | 1089 | -46 | 349 | |
5 | 1427749 | 1122 | -41 | 732 |
Both the rows and the columns now have two index levels
3.1 Concatenating the column index levels
get_level_values(0) extracts the first level of the column index
level0 = airline_info.columns.get_level_values(0)
get_level_values(1) extracts the second level
level1 = airline_info.columns.get_level_values(1)
Concatenate the two levels into a new flat column index
airline_info.columns = level0 + '_' + level1
airline_info.head(7)
DIST_sum | DIST_mean | ARR_DELAY_min | ARR_DELAY_max | ||
---|---|---|---|---|---|
AIRLINE | WEEKDAY | ||||
AA | 1 | 1455386 | 1139 | -60 | 551 |
2 | 1358256 | 1107 | -52 | 725 | |
3 | 1496665 | 1117 | -45 | 473 | |
4 | 1452394 | 1089 | -46 | 349 | |
5 | 1427749 | 1122 | -41 | 732 | |
6 | 1265340 | 1124 | -50 | 858 | |
7 | 1461906 | 1100 | -49 | 626 |
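An equivalent way to flatten a MultiIndex of columns is a list comprehension over the level tuples; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'k': ['x', 'x', 'y'], 'a': [1, 2, 3]})
agg = df.groupby('k').agg({'a': ['sum', 'mean']})
# each column label is a (level0, level1) tuple; join the pieces with '_'
agg.columns = ['_'.join(col) for col in agg.columns]
print(agg.columns.tolist())  # ['a_sum', 'a_mean']
```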
3.2 Resetting the row index
reset_index() moves the row index levels back into ordinary columns, leaving a single-level index
airline_info.reset_index().head(7)
AIRLINE | WEEKDAY | DIST_sum | DIST_mean | ARR_DELAY_min | ARR_DELAY_max | |
---|---|---|---|---|---|---|
0 | AA | 1 | 1455386 | 1139 | -60 | 551 |
1 | AA | 2 | 1358256 | 1107 | -52 | 725 |
2 | AA | 3 | 1496665 | 1117 | -45 | 473 |
3 | AA | 4 | 1452394 | 1089 | -46 | 349 |
4 | AA | 5 | 1427749 | 1122 | -41 | 732 |
5 | AA | 6 | 1265340 | 1124 | -50 | 858 |
6 | AA | 7 | 1461906 | 1100 | -49 | 626 |
By default, pandas moves the grouping columns into the index after a groupby operation; setting as_index to False prevents this.
Calling reset_index after grouping achieves the same result
flights.groupby(['AIRLINE'], as_index=False)['DIST'].agg('mean').round(0)
AIRLINE | DIST | |
---|---|---|
0 | AA | 1114.0 |
1 | AS | 1066.0 |
2 | B6 | 1772.0 |
3 | DL | 866.0 |
... | ... | ... |
10 | UA | 1231.0 |
11 | US | 1181.0 |
12 | VX | 1240.0 |
13 | WN | 810.0 |
14 rows × 2 columns
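The equivalence claimed above, as_index=False versus a trailing reset_index, can be checked directly on a toy frame (column names here are invented):

```python
import pandas as pd

df = pd.DataFrame({'k': ['x', 'x', 'y'], 'v': [1.0, 2.0, 3.0]})
a = df.groupby('k', as_index=False)['v'].mean()
b = df.groupby('k')['v'].mean().reset_index()
# both give a flat DataFrame with 'k' as an ordinary column
print(a.equals(b))  # True
```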
4 Filtering groups
college = pd.read_csv('data/college.csv', index_col='INSTNM')
grouped = college.groupby('STABBR')
grouped.ngroups
59
This equals the number of distinct states; nunique gives the same answer
college['STABBR'].nunique()
59
Define a function that computes the overall proportion of minority students in a group and returns True when it exceeds a threshold
def check_minority(df, threshold):
    minority_pct = 1 - df['UGDS_WHITE']
    total_minority = (df['UGDS'] * minority_pct).sum()
    total_ugds = df['UGDS'].sum()
    total_minority_pct = total_minority / total_ugds
    return total_minority_pct > threshold
The groupby object's filter method accepts a custom function that decides, group by group, whether to keep that group's rows
college_filtered = grouped.filter(check_minority, threshold=.5)
college_filtered.head()
CITY | STABBR | HBCU | MENONLY | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
INSTNM | |||||||||
Everest College-Phoenix | Phoenix | AZ | 0.0 | 0.0 | ... | 0.7151 | 0.6700 | 28600 | 9500 |
Collins College | Phoenix | AZ | 0.0 | 0.0 | ... | 0.8228 | 0.4764 | 25700 | 47000 |
Empire Beauty School-Paradise Valley | Phoenix | AZ | 0.0 | 0.0 | ... | 0.5873 | 0.4651 | 17800 | 9588 |
Empire Beauty School-Tucson | Tucson | AZ | 0.0 | 0.0 | ... | 0.6615 | 0.4229 | 18200 | 9833 |
Thunderbird School of Global Management | Glendale | AZ | 0.0 | 0.0 | ... | 0.0000 | 0.0000 | 118900 | PrivacySuppressed |
5 rows × 26 columns
Comparing the shapes shows that about 60% of the rows were filtered out; only 20 states have a minority-student majority
college.shape
(7535, 26)
college_filtered.shape
(3028, 26)
college_filtered['STABBR'].nunique()
20
Try a few different thresholds and check the resulting shape and number of states
college_filtered_20 = grouped.filter(check_minority, threshold=.2)
college_filtered_20.shape, college_filtered_20['STABBR'].nunique()
((7461, 26), 57)
college_filtered_70 = grouped.filter(check_minority, threshold=.7)
college_filtered_70.shape, college_filtered_70['STABBR'].nunique()
((957, 26), 10)
college_filtered_95 = grouped.filter(check_minority, threshold=.95)
college_filtered_95.shape, college_filtered_95['STABBR'].nunique()
((156, 26), 7)
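The threshold sweep can be reproduced in miniature; the frame below is a synthetic stand-in that only borrows the college dataset's column names:

```python
import pandas as pd

def check_minority(df, threshold):
    # overall minority share of the group's enrollment
    minority_pct = 1 - df['UGDS_WHITE']
    total_minority = (df['UGDS'] * minority_pct).sum()
    return total_minority / df['UGDS'].sum() > threshold

data = pd.DataFrame({'STABBR': ['AA', 'AA', 'BB', 'BB'],
                     'UGDS': [100.0, 100.0, 100.0, 100.0],
                     'UGDS_WHITE': [0.2, 0.4, 0.8, 0.9]})
grouped = data.groupby('STABBR')
# 'AA' has a 0.70 minority share, 'BB' only 0.15
low = grouped.filter(check_minority, threshold=0.1)   # both states pass
high = grouped.filter(check_minority, threshold=0.3)  # only 'AA' passes
print(low['STABBR'].nunique(), high['STABBR'].nunique())  # 2 1
```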
5 The apply method
apply is the most flexible of all the groupby methods
Read the college data and drop rows that have missing values in any of the UGDS, SATMTMID, or SATVRMID columns
college = pd.read_csv('data/college.csv')
subset = ['UGDS', 'SATMTMID', 'SATVRMID']
college2 = college.dropna(subset=subset)
college.shape, college2.shape
((7535, 27), (1184, 27))
5.1 apply vs. agg
Define a function that computes the enrollment-weighted average SAT math score
def weighted_math_average(df):
    weighted_math = df['UGDS'] * df['SATMTMID']
    return int(weighted_math.sum() / df['UGDS'].sum())
5.1.1 Applying the function with apply
Group by state and call apply, passing the custom function
college2.groupby('STABBR').apply(weighted_math_average).head()
STABBR
AK 503
AL 536
AR 529
AZ 569
CA 564
dtype: int64
5.1.2 Applying the same function with agg
college2.groupby('STABBR').agg(weighted_math_average).head()
INSTNM | CITY | HBCU | MENONLY | ... | PCTFLOAN | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP | |
---|---|---|---|---|---|---|---|---|---|
STABBR | |||||||||
AK | 503 | 503 | 503 | 503 | ... | 503 | 503 | 503 | 503 |
AL | 536 | 536 | 536 | 536 | ... | 536 | 536 | 536 | 536 |
AR | 529 | 529 | 529 | 529 | ... | 529 | 529 | 529 | 529 |
AZ | 569 | 569 | 569 | 569 | ... | 569 | 569 | 569 | 569 |
CA | 564 | 564 | 564 | 564 | ... | 564 | 564 | 564 | 564 |
5 rows × 26 columns
Restricting the selection to SATMTMID raises an error, because the function can no longer access UGDS
# college2.groupby('STABBR')['SATMTMID'].agg(weighted_math_average)
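To see why, note that agg on a single selected column hands the function a Series rather than a DataFrame, so indexing it with a column label fails. A miniature sketch (the sample frame is invented; only the column names match the college data):

```python
import pandas as pd

def weighted_math_average(df):
    # expects a DataFrame with both UGDS and SATMTMID columns
    weighted = df['UGDS'] * df['SATMTMID']
    return weighted.sum() / df['UGDS'].sum()

df = pd.DataFrame({'g': ['a', 'a'],
                   'UGDS': [100.0, 300.0],
                   'SATMTMID': [500.0, 600.0]})
# apply passes each sub-DataFrame, so both columns are available
ok = df.groupby('g')[['UGDS', 'SATMTMID']].apply(weighted_math_average)
print(ok.loc['a'])  # 575.0

# agg on one column passes a Series; df['UGDS'] becomes a label lookup and fails
failed = False
try:
    df.groupby('g')['SATMTMID'].agg(weighted_math_average)
except Exception:
    failed = True
print(failed)  # True
```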
5.2 Creating new columns with apply
A nice feature of apply is that by returning a Series it can create several new columns at once
from collections import OrderedDict
def weighted_average(df):
    data = OrderedDict()
    weight_m = df['UGDS'] * df['SATMTMID']
    weight_v = df['UGDS'] * df['SATVRMID']
    data['weighted_math_avg'] = weight_m.sum() / df['UGDS'].sum()
    data['weighted_verbal_avg'] = weight_v.sum() / df['UGDS'].sum()
    data['math_avg'] = df['SATMTMID'].mean()
    data['verbal_avg'] = df['SATVRMID'].mean()
    data['count'] = len(df)
    return pd.Series(data, dtype='int')
college2.groupby('STABBR').apply(weighted_average).head(10)
weighted_math_avg | weighted_verbal_avg | math_avg | verbal_avg | count | |
---|---|---|---|---|---|
STABBR | |||||
AK | 503 | 555 | 503 | 555 | 1 |
AL | 536 | 533 | 504 | 508 | 21 |
AR | 529 | 504 | 515 | 491 | 16 |
AZ | 569 | 557 | 536 | 538 | 6 |
... | ... | ... | ... | ... | ... |
CT | 545 | 533 | 522 | 517 | 14 |
DC | 621 | 623 | 588 | 589 | 6 |
DE | 569 | 553 | 495 | 486 | 3 |
FL | 565 | 565 | 521 | 529 | 38 |
10 rows × 5 columns
5.3 Returning a DataFrame from apply
Define a function that returns a DataFrame.
Use NumPy's average to compute the weighted mean, and SciPy's gmean and hmean for the geometric and harmonic means
from scipy.stats import gmean, hmean
def calculate_means(df):
    df_means = pd.DataFrame(index=['Arithmetic', 'Weighted', 'Geometric', 'Harmonic'])
    cols = ['SATMTMID', 'SATVRMID']
    for col in cols:
        arithmetic = df[col].mean()
        weighted = np.average(df[col], weights=df['UGDS'])
        geometric = gmean(df[col])
        harmonic = hmean(df[col])
        df_means[col] = [arithmetic, weighted, geometric, harmonic]
    df_means['count'] = len(df)
    return df_means.astype(int)
(college2.groupby('STABBR')
         .filter(lambda x: len(x) != 1)
         .groupby('STABBR')
         .apply(calculate_means)
         .head(10))
SATMTMID | SATVRMID | count | ||
---|---|---|---|---|
STABBR | ||||
AL | Arithmetic | 504 | 508 | 21 |
Weighted | 536 | 533 | 21 | |
Geometric | 500 | 505 | 21 | |
Harmonic | 497 | 502 | 21 | |
... | ... | ... | ... | ... |
AR | Geometric | 514 | 489 | 16 |
Harmonic | 513 | 487 | 16 | |
AZ | Arithmetic | 536 | 538 | 6 |
Weighted | 569 | 557 | 6 |
10 rows × 3 columns