Pandas Cookbook -- 06 Index Alignment
Posted by shiyushiyu
This is based on SeanCheney's translation on Jianshu; I have adjusted the formatting and reorganized the table of contents so it is easier for me to read and to look things up later.
import pandas as pd
import numpy as np
1 Index Methods
college = pd.read_csv('data/college.csv')
college.iloc[:5,:5]
| | INSTNM | CITY | STABBR | HBCU | MENONLY |
|---|---|---|---|---|---|
| 0 | Alabama A & M University | Normal | AL | 1.0 | 0.0 |
| 1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | 0.0 |
| 2 | Amridge University | Montgomery | AL | 0.0 | 0.0 |
| 3 | University of Alabama in Huntsville | Huntsville | AL | 0.0 | 0.0 |
| 4 | Alabama State University | Montgomery | AL | 1.0 | 0.0 |
1.1 Basic Methods
Extract all of the column labels
columns = college.columns
columns
Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
       'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
       'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
       'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
       'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
      dtype='object')
Use the values attribute to access the underlying NumPy array
columns.values
array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
       'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
       'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
       'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
       'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
       'GRAD_DEBT_MDN_SUPP'], dtype=object)
Select the sixth value of the index (position 5)
columns[5]
'WOMENONLY'
Select the 2nd, 9th, and 11th values, i.e. positions 1, 8, and 10
columns[[1,8,10]]
Index(['CITY', 'SATMTMID', 'UGDS'], dtype='object')
Select a slice using negative positions
columns[-7:-4]
Index(['PPTUG_EF', 'CURROPER', 'PCTPELL'], dtype='object')
Index objects share many methods with Series and DataFrames
columns.min(), columns.max(), columns.isnull().sum()
('CITY', 'WOMENONLY', 0)
String-style operations on an Index return a new copy. The Index object itself is immutable; trying to modify it in place raises an error (see the sketch after the output below).
columns + '_A'
Index(['INSTNM_A', 'CITY_A', 'STABBR_A', 'HBCU_A', 'MENONLY_A', 'WOMENONLY_A',
       'RELAFFIL_A', 'SATVRMID_A', 'SATMTMID_A', 'DISTANCEONLY_A', 'UGDS_A',
       'UGDS_WHITE_A', 'UGDS_BLACK_A', 'UGDS_HISP_A', 'UGDS_ASIAN_A',
       'UGDS_AIAN_A', 'UGDS_NHPI_A', 'UGDS_2MOR_A', 'UGDS_NRA_A',
       'UGDS_UNKN_A', 'PPTUG_EF_A', 'CURROPER_A', 'PCTPELL_A', 'PCTFLOAN_A',
       'UG25ABV_A', 'MD_EARN_WNE_P10_A', 'GRAD_DEBT_MDN_SUPP_A'],
      dtype='object')
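A minimal sketch of the immutability claim above: assigning to a position of an Index raises a TypeError (the exact message varies by pandas version)

try:
    columns[0] = 'NEW_NAME'   # an Index does not support item assignment
except TypeError as e:
    print(e)                  # e.g. "Index does not support mutable operations"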
Comparison operators on an Index produce a boolean array
columns > 'G'
array([ True, False, True, True, True, True, True, True, True,
False, True, True, True, True, True, True, True, True,
True, True, True, False, True, True, True, True, True])
1.2 Set Methods
Index objects support set operations: union, intersection, difference, and symmetric difference
c1 = columns[:4]
c2 = columns[2:5]
1.2.1 Union
c1.union(c2)
Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR'], dtype='object')
c1 | c2
Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR'], dtype='object')
1.2.2 Symmetric Difference
c1.symmetric_difference(c2)
Index(['CITY', 'INSTNM', 'MENONLY'], dtype='object')
c1 ^ c2
Index(['CITY', 'INSTNM', 'MENONLY'], dtype='object')
2 Cartesian Products and Index Explosion
Create two Series with different indexes that share some of the same labels
s1 = pd.Series(index=list('aaab'), data=np.arange(4))
s2 = pd.Series(index=list('cababb'), data=np.arange(6))
2.1 Producing a Cartesian Product
Unless the two indexes contain exactly the same labels in exactly the same order, aligning them produces a Cartesian product: every occurrence of a label in one index pairs with every occurrence of that label in the other. Here the three 'a' rows of s1 pair with the two 'a' rows of s2, giving six 'a' rows in the result.
s1 + s2
a 1.0
a 3.0
a 2.0
a 4.0
a 3.0
a 5.0
b 5.0
b 7.0
b 8.0
c NaN
dtype: float64
2.2 Avoiding the Cartesian Product
When the indexes are identical, the data aligns element by element. In the example below the two Series are exactly the same, so the result keeps the integer dtype.
s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('aaabb'), data=np.arange(5))
s1 + s2
a 0
a 2
a 4
b 6
b 8
dtype: int64
If the indexes contain the same labels but in a different order, a Cartesian product is still produced
s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('bbaaa'), data=np.arange(5))
s1 + s2
a 2
a 3
a 4
a 3
a 4
a 5
a 4
a 5
a 6
b 3
b 4
b 4
b 5
dtype: int64
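A sketch of one way to avoid the explosion when the labels match but the order does not: sort both indexes so they become identical, after which alignment is element-wise (the result should then have five rows instead of thirteen)

s1_sorted = s1.sort_index()
s2_sorted = s2.sort_index()
(s1_sorted + s2_sorted).shape   # expected (5,) -- no Cartesian product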
2.2.1 Index Explosion
Read the employee dataset, setting RACE as the row index
employee = pd.read_csv('data/employee.csv', index_col='RACE')
2.2.1.1 Preparing the Data
The copy method -- copy the data rather than create another reference
salary1 = employee['BASE_SALARY']
salary2 = employee['BASE_SALARY']
salary1 is salary2
True
The result is True, which means the two names point to the same object; modifying one would modify the other. To get an entirely new copy of the data, use the copy method
salary1 = employee['BASE_SALARY'].copy()
salary2 = employee['BASE_SALARY'].copy()
salary1 is salary2
False
2.2.1.2 Indexes in a Different Order
salary1 = salary1.sort_index()
salary_add1 = salary1 + salary2
salary_add1.shape
(1175424,)
2.2.1.3 Identical Indexes
salary_add2 = salary1 + salary1
2.2.1.4 Comparing the Results
len(salary1), len(salary2), len(salary_add1), len(salary_add2)
(2000, 2000, 1175424, 2000)
Comparing the lengths of the results shows that the length has exploded from 2,000 to over 1.17 million
2.2.1.5 Verifying the Result
Because the Cartesian product pairs up equal index labels, and salary1 and salary2 contain the same labels with the same counts, each label that occurs n times contributes n squared rows. Summing the squared counts should therefore reproduce the length of salary_add1
index_vc = salary2.index.value_counts(dropna=False)
index_vc
Black or African American 700
White 665
Hispanic/Latino 480
Asian/Pacific Islander 107
NaN 35
American Indian or Alaskan Native 11
Others 2
Name: RACE, dtype: int64
index_vc.pow(2).sum()
1175424
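A small sanity check, sketched with the objects already defined above: the sum of squared counts should equal the length of the exploded result

assert index_vc.pow(2).sum() == len(salary_add1)   # 1175424 == 1175424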
3 Filling Values When Indexes Differ
3.1 Series with Different Indexes
Read three baseball datasets, setting playerID as the row index
baseball_14 = pd.read_csv('data/baseball14.csv', index_col='playerID')
baseball_15 = pd.read_csv('data/baseball15.csv', index_col='playerID')
baseball_16 = pd.read_csv('data/baseball16.csv', index_col='playerID')
baseball_14.iloc[:5,:]
| playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | 3B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| altuvjo01 | 2014 | 1 | HOU | AL | 158 | 660 | 85 | 225 | 47 | 3 | ... | 59.0 | 56.0 | 9.0 | 36 | 53.0 | 7.0 | 5.0 | 1.0 | 5.0 | 20.0 |
| cartech02 | 2014 | 1 | HOU | AL | 145 | 507 | 68 | 115 | 21 | 1 | ... | 88.0 | 5.0 | 2.0 | 56 | 182.0 | 6.0 | 5.0 | 0.0 | 4.0 | 12.0 |
| castrja01 | 2014 | 1 | HOU | AL | 126 | 465 | 43 | 103 | 21 | 2 | ... | 56.0 | 1.0 | 0.0 | 34 | 151.0 | 1.0 | 9.0 | 1.0 | 3.0 | 11.0 |
| corpoca01 | 2014 | 1 | HOU | AL | 55 | 170 | 22 | 40 | 6 | 0 | ... | 19.0 | 0.0 | 0.0 | 14 | 37.0 | 0.0 | 3.0 | 1.0 | 2.0 | 3.0 |
| dominma01 | 2014 | 1 | HOU | AL | 157 | 564 | 51 | 121 | 17 | 0 | ... | 57.0 | 0.0 | 1.0 | 29 | 125.0 | 2.0 | 5.0 | 2.0 | 7.0 | 23.0 |

5 rows × 21 columns
3.1.1 Index Analysis
Use the difference method of the index to find the labels that are in baseball_14 but not in baseball_15 or baseball_16
baseball_14.index.difference(baseball_15.index)
Index(['corpoca01', 'dominma01', 'fowlede01', 'grossro01', 'guzmaje01',
       'hoeslj01', 'krausma01', 'preslal01', 'singljo02'],
      dtype='object', name='playerID')
baseball_14.index.difference(baseball_16.index)
Index(['cartech02', 'corpoca01', 'dominma01', 'fowlede01', 'grossro01',
       'guzmaje01', 'hoeslj01', 'krausma01', 'preslal01', 'singljo02',
       'villajo01'],
      dtype='object', name='playerID')
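A complementary sketch using intersection, the counterpart of difference: the players with records in all three seasons, assuming the three indexes loaded above

in_all_three = baseball_14.index.intersection(baseball_15.index).intersection(baseball_16.index)
in_all_three   # playerIDs present in 2014, 2015 and 2016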
3.1.2 Data Analysis
Find the number of hits per player over the three seasons; the H column holds this data
hits_14 = baseball_14['H']
hits_15 = baseball_15['H']
hits_16 = baseball_16['H']
Add hits_14 and hits_15 together
(hits_14 + hits_15).head()
playerID
altuvjo01 425.0
cartech02 193.0
castrja01 174.0
congeha01 NaN
corpoca01 NaN
Name: H, dtype: float64
3.1.3 Missing Values
3.1.3.1 Missing Values That Can Be Repaired
congeha01 and corpoca01 each have a record in one of the two seasons, yet the plain addition produced NaN for them. Use the add method with the fill_value parameter to avoid these missing values
hits_14.add(hits_15, fill_value=0).head()
playerID
altuvjo01 425.0
cartech02 193.0
castrja01 174.0
congeha01 46.0
corpoca01 40.0
Name: H, dtype: float64
Now add in the 2016 data as well
hits_total = hits_14.add(hits_15, fill_value=0).add(hits_16, fill_value=0)
hits_total.head()
playerID
altuvjo01 641.0
bregmal01 53.0
cartech02 193.0
castrja01 243.0
congeha01 46.0
Name: H, dtype: float64
3.1.3.2 Missing Values That Cannot Be Repaired
If an element is missing in both Series, the result of the addition is still missing even when fill_value is used
s = pd.Series(index=['a', 'b', 'c', 'd'], data=[np.nan, 3, np.nan, 1])
s1 = pd.Series(index=['a', 'b', 'c'], data=[np.nan, 6, 10])
s.add(s1, fill_value=5)
a NaN
b 9.0
c 15.0
d 6.0
dtype: float64
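One possible workaround, sketched here rather than taken from the original recipe: reindex both Series to the union of their indexes and fill the NaN values first, so every label has a number before adding. Whether treating a doubly-missing value as 0 is appropriate depends on the data.

idx = s.index.union(s1.index)
s.reindex(idx).fillna(0) + s1.reindex(idx).fillna(0)   # 'a' becomes 0.0 instead of NaN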
3.2 DataFrames with Different Indexes
Select some columns from baseball_14
df_14 = baseball_14[['G', 'AB', 'R', 'H']]
df_14.head()
| playerID | G | AB | R | H |
|---|---|---|---|---|
| altuvjo01 | 158 | 660 | 85 | 225 |
| cartech02 | 145 | 507 | 68 | 115 |
| castrja01 | 126 | 465 | 43 | 103 |
| corpoca01 | 55 | 170 | 22 | 40 |
| dominma01 | 157 | 564 | 51 | 121 |
Now select some columns from baseball_15; some match df_14 and some do not
df_15 = baseball_15[['AB', 'R', 'H', 'HR']]
df_15.head()
| playerID | AB | R | H | HR |
|---|---|---|---|---|
| altuvjo01 | 638 | 86 | 200 | 15 |
| cartech02 | 391 | 50 | 78 | 24 |
| castrja01 | 337 | 38 | 71 | 11 |
| congeha01 | 201 | 25 | 46 | 11 |
| correca01 | 387 | 52 | 108 | 22 |
When the two are added, any row or column that cannot be aligned produces a missing value. The highlight_null method of the style attribute can highlight them
(df_14 + df_15).head(10).style.highlight_null('yellow')
| playerID | AB | G | H | HR | R |
|---|---|---|---|---|---|
| altuvjo01 | 1298 | nan | 425 | nan | 171 |
| cartech02 | 898 | nan | 193 | nan | 118 |
| castrja01 | 802 | nan | 174 | nan | 81 |
| congeha01 | nan | nan | nan | nan | nan |
| corpoca01 | nan | nan | nan | nan | nan |
| correca01 | nan | nan | nan | nan | nan |
| dominma01 | nan | nan | nan | nan | nan |
| fowlede01 | nan | nan | nan | nan | nan |
| gattiev01 | nan | nan | nan | nan | nan |
| gomezca01 | nan | nan | nan | nan | nan |
Even with fill_value=0 some values remain missing, because some row-column combinations do not exist in either input
df_14.add(df_15, fill_value=0).head(10).style.highlight_null('yellow')
| playerID | AB | G | H | HR | R |
|---|---|---|---|---|---|
| altuvjo01 | 1298 | 158 | 425 | 15 | 171 |
| cartech02 | 898 | 145 | 193 | 24 | 118 |
| castrja01 | 802 | 126 | 174 | 11 | 81 |
| congeha01 | 201 | nan | 46 | 11 | 25 |
| corpoca01 | 170 | 55 | 40 | nan | 22 |
| correca01 | 387 | nan | 108 | 22 | 52 |
| dominma01 | 564 | 157 | 121 | nan | 51 |
| fowlede01 | 434 | 116 | 120 | nan | 61 |
| gattiev01 | 566 | nan | 139 | 27 | 66 |
| gomezca01 | 149 | nan | 36 | 4 | 19 |
4 Appending Columns from a Different DataFrame
Read the employee data and select the DEPARTMENT and BASE_SALARY columns
employee = pd.read_csv('data/employee.csv')
dept_sal = employee[['DEPARTMENT', 'BASE_SALARY']]
4.1 Adding a Series with No Duplicate Index Labels
Sort by DEPARTMENT, and by BASE_SALARY in descending order within each department
dept_sal = dept_sal.sort_values(['DEPARTMENT', 'BASE_SALARY'], ascending=[True, False])
Use drop_duplicates to keep the first row of each department, which after the sort holds that department's maximum salary
max_dept_sal = dept_sal.drop_duplicates(subset='DEPARTMENT')
max_dept_sal.head()
| | DEPARTMENT | BASE_SALARY |
|---|---|---|
| 1494 | Admn. & Regulatory Affairs | 140416.0 |
| 149 | City Controller's Office | 64251.0 |
| 236 | City Council | 100000.0 |
| 647 | Convention and Entertainment | 38397.0 |
| 1500 | Dept of Neighborhoods (DON) | 89221.0 |
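An equivalent way to get the same per-department maximum, shown only as an alternative sketch to the sort-then-drop_duplicates approach (the name max_dept_sal_alt is mine, not from the recipe)

max_dept_sal_alt = employee.groupby('DEPARTMENT')['BASE_SALARY'].max()   # Series indexed by DEPARTMENT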
Set DEPARTMENT as the row index of both DataFrames
max_dept_sal = max_dept_sal.set_index('DEPARTMENT')
employee = employee.set_index('DEPARTMENT')
Now that the row indexes contain matching values, a new column can be added to the employee DataFrame
employee['MAX_DEPT_SALARY'] = max_dept_sal['BASE_SALARY']
employee.head()
| DEPARTMENT | UNIQUE_ID | POSITION_TITLE | BASE_SALARY | RACE | EMPLOYMENT_TYPE | GENDER | EMPLOYMENT_STATUS | HIRE_DATE | JOB_DATE | MAX_DEPT_SALARY |
|---|---|---|---|---|---|---|---|---|---|---|
| Municipal Courts Department | 0 | ASSISTANT DIRECTOR (EX LVL) | 121862.0 | Hispanic/Latino | Full Time | Female | Active | 2006-06-12 | 2012-10-13 | 121862.0 |
| Library | 1 | LIBRARY ASSISTANT | 26125.0 | Hispanic/Latino | Full Time | Female | Active | 2000-07-19 | 2010-09-18 | 107763.0 |
| Houston Police Department-HPD | 2 | POLICE OFFICER | 45279.0 | White | Full Time | Male | Active | 2015-02-03 | 2015-02-03 | 199596.0 |
| Houston Fire Department (HFD) | 3 | ENGINEER/OPERATOR | 63166.0 | White | Full Time | Male | Active | 1982-02-08 | 1991-05-25 | 210588.0 |
| General Services Department | 4 | ELECTRICIAN | 56347.0 | White | Full Time | Male | Active | 1989-06-19 | 1994-10-22 | 89194.0 |
Now use query to check whether any BASE_SALARY is greater than MAX_DEPT_SALARY
employee.query('BASE_SALARY > MAX_DEPT_SALARY')
| DEPARTMENT | UNIQUE_ID | POSITION_TITLE | BASE_SALARY | RACE | EMPLOYMENT_TYPE | GENDER | EMPLOYMENT_STATUS | HIRE_DATE | JOB_DATE | MAX_DEPT_SALARY |
|---|---|---|---|---|---|---|---|---|---|---|

(empty -- no rows satisfy the condition)
4.2 Adding a Series with Duplicate Index Labels
Use the sample method to pick 10 random rows from dept_sal, without replacement
random_salary = dept_sal.sample(n=10).set_index('DEPARTMENT')
random_salary
| DEPARTMENT | BASE_SALARY |
|---|---|
| Houston Police Department-HPD | 86534.0 |
| Fleet Management Department | 49088.0 |
| Houston Airport System (HAS) | 76097.0 |
| Houston Police Department-HPD | 66614.0 |
| Library | 59748.0 |
| Houston Airport System (HAS) | 29286.0 |
| Houston Police Department-HPD | 61643.0 |
| Houston Fire Department (HFD) | 52644.0 |
| Solid Waste Management | 36712.0 |
| Houston Police Department-HPD | NaN |
random_salary has duplicate index labels, so a single label in the employee DataFrame would have to align with multiple labels in random_salary.
Adding a RANDOM_SALARY column therefore raises an error (see the sketch below)
# employee['RANDOM_SALARY'] = random_salary['BASE_SALARY']
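A sketch of what happens if the commented-out assignment is actually run; the exception type and message depend on the pandas version, but it fails because the duplicate index labels make the reindexing ambiguous

try:
    employee['RANDOM_SALARY'] = random_salary['BASE_SALARY']
except Exception as e:
    print(e)   # typically a ValueError such as "cannot reindex from a duplicate axis"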
4.3 Adding a Series That Covers Only Part of the Index
Take the first three rows of max_dept_sal['BASE_SALARY'] and assign them to employee['MAX_SALARY2']
employee['MAX_SALARY2'] = max_dept_sal['BASE_SALARY'].head(3)
employee.MAX_SALARY2.value_counts()
140416.0 29
100000.0 11
64251.0 5
Name: MAX_SALARY2, dtype: int64
Because only three departments received a value (29 + 11 + 5 = 45 of the 2,000 employees), all other departments are missing in the result, which is why 97.75% of the values are null
employee.MAX_SALARY2.isnull().mean()
0.9775