Pandas Cookbook -- 06索引对齐

Posted shiyushiyu

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Pandas Cookbook -- 06索引对齐相关的知识,希望对你有一定的参考价值。

索引对齐

简书大神SeanCheney的译作,我作了些格式调整和文章目录结构的变化,更适合自己阅读,以后翻阅是更加方便自己查找吧

import pandas as pd
import numpy as np

1 索引方法

college = pd.read_csv(‘data/college.csv‘)
college.iloc[:5,:5]
INSTNM CITY STABBR HBCU MENONLY
0 Alabama A & M University Normal AL 1.0 0.0
1 University of Alabama at Birmingham Birmingham AL 0.0 0.0
2 Amridge University Montgomery AL 0.0 0.0
3 University of Alabama in Huntsville Huntsville AL 0.0 0.0
4 Alabama State University Montgomery AL 1.0 0.0

1.1 基础方法

提取所有的列索引

columns = college.columns
columns
Index([‘INSTNM‘, ‘CITY‘, ‘STABBR‘, ‘HBCU‘, ‘MENONLY‘, ‘WOMENONLY‘, ‘RELAFFIL‘,
       ‘SATVRMID‘, ‘SATMTMID‘, ‘DISTANCEONLY‘, ‘UGDS‘, ‘UGDS_WHITE‘,
       ‘UGDS_BLACK‘, ‘UGDS_HISP‘, ‘UGDS_ASIAN‘, ‘UGDS_AIAN‘, ‘UGDS_NHPI‘,
       ‘UGDS_2MOR‘, ‘UGDS_NRA‘, ‘UGDS_UNKN‘, ‘PPTUG_EF‘, ‘CURROPER‘, ‘PCTPELL‘,
       ‘PCTFLOAN‘, ‘UG25ABV‘, ‘MD_EARN_WNE_P10‘, ‘GRAD_DEBT_MDN_SUPP‘],
      dtype=‘object‘)

用values属性,访问底层的NumPy数组

columns.values
array([‘INSTNM‘, ‘CITY‘, ‘STABBR‘, ‘HBCU‘, ‘MENONLY‘, ‘WOMENONLY‘,
       ‘RELAFFIL‘, ‘SATVRMID‘, ‘SATMTMID‘, ‘DISTANCEONLY‘, ‘UGDS‘,
       ‘UGDS_WHITE‘, ‘UGDS_BLACK‘, ‘UGDS_HISP‘, ‘UGDS_ASIAN‘, ‘UGDS_AIAN‘,
       ‘UGDS_NHPI‘, ‘UGDS_2MOR‘, ‘UGDS_NRA‘, ‘UGDS_UNKN‘, ‘PPTUG_EF‘,
       ‘CURROPER‘, ‘PCTPELL‘, ‘PCTFLOAN‘, ‘UG25ABV‘, ‘MD_EARN_WNE_P10‘,
       ‘GRAD_DEBT_MDN_SUPP‘], dtype=object)

取出该数组的第6个值

columns[5]
‘WOMENONLY‘

取出该数组的第2911

columns[[1,8,10]]
Index([‘CITY‘, ‘SATMTMID‘, ‘UGDS‘], dtype=‘object‘)

逆序切片选取

columns[-7:-4]
Index([‘PPTUG_EF‘, ‘CURROPER‘, ‘PCTPELL‘], dtype=‘object‘)

索引有许多和Series和DataFrame相同的方法

columns.min(), columns.max(), columns.isnull().sum()
(‘CITY‘, ‘WOMENONLY‘, 0)

索引对象可以直接通过字符串方法修改,返回一个copy,索引对象本身是不可变类型,修改其本身会导致报错

columns + ‘_A‘
Index([‘INSTNM_A‘, ‘CITY_A‘, ‘STABBR_A‘, ‘HBCU_A‘, ‘MENONLY_A‘, ‘WOMENONLY_A‘,
       ‘RELAFFIL_A‘, ‘SATVRMID_A‘, ‘SATMTMID_A‘, ‘DISTANCEONLY_A‘, ‘UGDS_A‘,
       ‘UGDS_WHITE_A‘, ‘UGDS_BLACK_A‘, ‘UGDS_HISP_A‘, ‘UGDS_ASIAN_A‘,
       ‘UGDS_AIAN_A‘, ‘UGDS_NHPI_A‘, ‘UGDS_2MOR_A‘, ‘UGDS_NRA_A‘,
       ‘UGDS_UNKN_A‘, ‘PPTUG_EF_A‘, ‘CURROPER_A‘, ‘PCTPELL_A‘, ‘PCTFLOAN_A‘,
       ‘UG25ABV_A‘, ‘MD_EARN_WNE_P10_A‘, ‘GRAD_DEBT_MDN_SUPP_A‘],
      dtype=‘object‘)

索引对象也可以通过比较运算符,得到布尔索引

columns > ‘G‘
array([ True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True])

1.2 集合方法

索引对象支持集合运算:联合、交叉、求差、对称差

c1 = columns[:4]
c2 = columns[2:5]

1.2.1 联合

c1.union(c2)
Index([‘CITY‘, ‘HBCU‘, ‘INSTNM‘, ‘MENONLY‘, ‘STABBR‘], dtype=‘object‘)
c1 | c2
Index([‘CITY‘, ‘HBCU‘, ‘INSTNM‘, ‘MENONLY‘, ‘STABBR‘], dtype=‘object‘)

1.2.2 对称差

c1.symmetric_difference(c2)
Index([‘CITY‘, ‘INSTNM‘, ‘MENONLY‘], dtype=‘object‘)
c1 ^ c2
Index([‘CITY‘, ‘INSTNM‘, ‘MENONLY‘], dtype=‘object‘)

2 笛卡尔积与索引爆炸

创建两个有不同索引、但包含一些相同值的Series

s1 = pd.Series(index=list(‘aaab‘), data=np.arange(4))
s2 = pd.Series(index=list(‘cababb‘), data=np.arange(6))

2.1 产生笛卡尔积

除非当两组索引元素完全相同、顺序也相同时,不会生成笛卡尔积,其它情况都会产生笛卡尔积

s1 + s2
a    1.0
a    3.0
a    2.0
a    4.0
a    3.0
a    5.0
b    5.0
b    7.0
b    8.0
c    NaN
dtype: float64

2.2 避免笛卡尔积

索引一致是,数据会按照它们的索引对齐。下面的例子,两个Series完全相同,结果也是整数

s1 = pd.Series(index=list(‘aaabb‘), data=np.arange(5))
s2 = pd.Series(index=list(‘aaabb‘), data=np.arange(5))
s1 + s2
a    0
a    2
a    4
b    6
b    8
dtype: int64

如果索引元素相同,但顺序不同,是能产生笛卡尔积的

 s1 = pd.Series(index=list(‘aaabb‘), data=np.arange(5))
 s2 = pd.Series(index=list(‘bbaaa‘), data=np.arange(5))
 s1 + s2
a    2
a    3
a    4
a    3
a    4
a    5
a    4
a    5
a    6
b    3
b    4
b    4
b    5
dtype: int64

2.2.1 索引爆炸

读取employee数据集,设定行索引是RACE

employee = pd.read_csv(‘data/employee.csv‘, index_col=‘RACE‘)

2.2.1.1 数据准备

copy方法 -- 复制数据,而非创造引用

salary1 = employee[‘BASE_SALARY‘]
salary2 = employee[‘BASE_SALARY‘]
salary1 is salary2
True

结果是True,表明二者指向的同一个对象。这意味着,如果修改一个,另一个也会去改变。为了收到一个全新的数据,使用copy方法

salary1 = employee[‘BASE_SALARY‘].copy()
salary2 = employee[‘BASE_SALARY‘].copy()
salary1 is salary2
False

2.2.1.2 索引顺序不一致

salary1 = salary1.sort_index()
salary_add1 = salary1 + salary2
salary_add1.shape
(1175424,)

2.2.1.3 索引完全一致

salary_add2 = salary1 + salary1

2.2.1.4 比较结果

len(salary1), len(salary2), len(salary_add1), len(salary_add2)
(2000, 2000, 1175424, 2000)

查看几个所得结果的长度,可以看到长度从2000到达了117万

2.2.1.5 验证结果

因为笛卡尔积是作用在相同索引元素上的,可以对其平方值求和

index_vc = salary2.index.value_counts(dropna=False)
index_vc
Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
NaN                                   35
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64
index_vc.pow(2).sum()
1175424

3 不等索引填充数值

3.1 不同索引的series

读取三个baseball数据集,行索引设为playerID

baseball_14 = pd.read_csv(‘data/baseball14.csv‘, index_col=‘playerID‘)
baseball_15 = pd.read_csv(‘data/baseball15.csv‘, index_col=‘playerID‘)
baseball_16 = pd.read_csv(‘data/baseball16.csv‘, index_col=‘playerID‘)
baseball_14.iloc[:5,:]
yearID stint teamID lgID G AB R H 2B 3B ... RBI SB CS BB SO IBB HBP SH SF GIDP
playerID
altuvjo01 2014 1 HOU AL 158 660 85 225 47 3 ... 59.0 56.0 9.0 36 53.0 7.0 5.0 1.0 5.0 20.0
cartech02 2014 1 HOU AL 145 507 68 115 21 1 ... 88.0 5.0 2.0 56 182.0 6.0 5.0 0.0 4.0 12.0
castrja01 2014 1 HOU AL 126 465 43 103 21 2 ... 56.0 1.0 0.0 34 151.0 1.0 9.0 1.0 3.0 11.0
corpoca01 2014 1 HOU AL 55 170 22 40 6 0 ... 19.0 0.0 0.0 14 37.0 0.0 3.0 1.0 2.0 3.0
dominma01 2014 1 HOU AL 157 564 51 121 17 0 ... 57.0 0.0 1.0 29 125.0 2.0 5.0 2.0 7.0 23.0

5 rows × 21 columns

3.1.1 索引分析

用索引方法difference,找到哪些索引标签在baseball_14中,却不在baseball_15、baseball_16中

baseball_14.index.difference(baseball_15.index)
Index([‘corpoca01‘, ‘dominma01‘, ‘fowlede01‘, ‘grossro01‘, ‘guzmaje01‘,
       ‘hoeslj01‘, ‘krausma01‘, ‘preslal01‘, ‘singljo02‘],
      dtype=‘object‘, name=‘playerID‘)
baseball_14.index.difference(baseball_16.index)
Index([‘cartech02‘, ‘corpoca01‘, ‘dominma01‘, ‘fowlede01‘, ‘grossro01‘,
       ‘guzmaje01‘, ‘hoeslj01‘, ‘krausma01‘, ‘preslal01‘, ‘singljo02‘,
       ‘villajo01‘],
      dtype=‘object‘, name=‘playerID‘)

3.1.2 数据分析

找到每名球员在过去三个赛季的击球数,H列包含了这个数据

hits_14 = baseball_14[‘H‘]
hits_15 = baseball_15[‘H‘]
hits_16 = baseball_16[‘H‘]

将hits_14和hits_15两列相加

(hits_14 + hits_15).head()
playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01      NaN
corpoca01      NaN
Name: H, dtype: float64

3.1.3 缺失值

3.1.3.1 可修补的缺失值

congeha01 和 corpoca01 在2015年是有记录的,但是结果缺失了。使用add方法和参数fill_value,避免产生缺失值

hits_14.add(hits_15, fill_value=0).head()
playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01     46.0
corpoca01     40.0
Name: H, dtype: float64

再将2016的数据也加上

hits_total = hits_14.add(hits_15, fill_value=0).add(hits_16, fill_value=0)
hits_total.head()
playerID
altuvjo01    641.0
bregmal01     53.0
cartech02    193.0
castrja01    243.0
congeha01     46.0
Name: H, dtype: float64

3.1.3.2 无法修补的缺失值

如果一个元素在两个Series都是缺失值,即便使用了fill_value,相加的结果也仍是缺失值

s = pd.Series(index=[‘a‘, ‘b‘, ‘c‘, ‘d‘], data=[np.nan, 3, np.nan, 1])
s1 = pd.Series(index=[‘a‘, ‘b‘, ‘c‘], data=[np.nan, 6, 10])
s.add(s1, fill_value=5)
a     NaN
b     9.0
c    15.0
d     6.0
dtype: float64

3.2 不同索引的dataframe

从baseball_14中选取一些列

df_14 = baseball_14[[‘G‘,‘AB‘, ‘R‘, ‘H‘]]
df_14.head()
G AB R H
playerID
altuvjo01 158 660 85 225
cartech02 145 507 68 115
castrja01 126 465 43 103
corpoca01 55 170 22 40
dominma01 157 564 51 121

再从baseball_15中选取一些列,有相同的、也有不同的

df_15 = baseball_15[[‘AB‘, ‘R‘, ‘H‘, ‘HR‘]]
df_15.head()
AB R H HR
playerID
altuvjo01 638 86 200 15
cartech02 391 50 78 24
castrja01 337 38 71 11
congeha01 201 25 46 11
correca01 387 52 108 22

将二者相加的话,只要行或列不能对齐,就会产生缺失值。style属性的highlight_null方法可以高亮缺失值

(df_14 + df_15).head(10).style.highlight_null(‘yellow‘)
AB G H HR R
playerID
altuvjo01 1298 nan 425 nan 171
cartech02 898 nan 193 nan 118
castrja01 802 nan 174 nan 81
congeha01 nan nan nan nan nan
corpoca01 nan nan nan nan nan
correca01 nan nan nan nan nan
dominma01 nan nan nan nan nan
fowlede01 nan nan nan nan nan
gattiev01 nan nan nan nan nan
gomezca01 nan nan nan nan nan

即便使用了fill_value=0,有些值也会是缺失值,这是因为一些行和列的组合根本不存在输入的数据中

df_14.add(df_15, fill_value=0).head(10).style.highlight_null(‘yellow‘)
AB G H HR R
playerID
altuvjo01 1298 158 425 15 171
cartech02 898 145 193 24 118
castrja01 802 126 174 11 81
congeha01 201 nan 46 11 25
corpoca01 170 55 40 nan 22
correca01 387 nan 108 22 52
dominma01 564 157 121 nan 51
fowlede01 434 116 120 nan 61
gattiev01 566 nan 139 27 66
gomezca01 149 nan 36 4 19

4 从不同的DataFrame追加列

读取employee数据,选取‘DEPARTMENT‘, ‘BASE_SALARY‘这两列

employee = pd.read_csv(‘data/employee.csv‘)
dept_sal = employee[[‘DEPARTMENT‘, ‘BASE_SALARY‘]]

4.1 加入无重复索引的series

在每个部门内,对BASE_SALARY进行排序

dept_sal = dept_sal.sort_values([‘DEPARTMENT‘, ‘BASE_SALARY‘],ascending=[True, False])

用drop_duplicates方法保留每个部门的第一行

max_dept_sal = dept_sal.drop_duplicates(subset=‘DEPARTMENT‘)
max_dept_sal.head()
DEPARTMENT BASE_SALARY
1494 Admn. & Regulatory Affairs 140416.0
149 City Controller‘s Office 64251.0
236 City Council 100000.0
647 Convention and Entertainment 38397.0
1500 Dept of Neighborhoods (DON) 89221.0

使用DEPARTMENT作为行索引

max_dept_sal = max_dept_sal.set_index(‘DEPARTMENT‘)
employee = employee.set_index(‘DEPARTMENT‘)

现在行索引包含匹配值了,可以向employee的DataFrame新增一列

employee[‘MAX_DEPT_SALARY‘] = max_dept_sal[‘BASE_SALARY‘]
employee.head()
UNIQUE_ID POSITION_TITLE BASE_SALARY RACE EMPLOYMENT_TYPE GENDER EMPLOYMENT_STATUS HIRE_DATE JOB_DATE MAX_DEPT_SALARY
DEPARTMENT
Municipal Courts Department 0 ASSISTANT DIRECTOR (EX LVL) 121862.0 Hispanic/Latino Full Time Female Active 2006-06-12 2012-10-13 121862.0
Library 1 LIBRARY ASSISTANT 26125.0 Hispanic/Latino Full Time Female Active 2000-07-19 2010-09-18 107763.0
Houston Police Department-HPD 2 POLICE OFFICER 45279.0 White Full Time Male Active 2015-02-03 2015-02-03 199596.0
Houston Fire Department (HFD) 3 ENGINEER/OPERATOR 63166.0 White Full Time Male Active 1982-02-08 1991-05-25 210588.0
General Services Department 4 ELECTRICIAN 56347.0 White Full Time Male Active 1989-06-19 1994-10-22 89194.0

现在可以用query查看是否有BASE_SALARY大于MAX_DEPT_SALARY的

employee.query(‘BASE_SALARY > MAX_DEPT_SALARY‘)
UNIQUE_ID POSITION_TITLE BASE_SALARY RACE EMPLOYMENT_TYPE GENDER EMPLOYMENT_STATUS HIRE_DATE JOB_DATE MAX_DEPT_SALARY
DEPARTMENT

4.2 加入有重复索引的series

用random从dept_sal随机取10行,不做替换

random_salary = dept_sal.sample(n=10).set_index(‘DEPARTMENT‘)
random_salary
BASE_SALARY
DEPARTMENT
Houston Police Department-HPD 86534.0
Fleet Management Department 49088.0
Houston Airport System (HAS) 76097.0
Houston Police Department-HPD 66614.0
Library 59748.0
Houston Airport System (HAS) 29286.0
Houston Police Department-HPD 61643.0
Houston Fire Department (HFD) 52644.0
Solid Waste Management 36712.0
Houston Police Department-HPD NaN

random_salary中是有重复索引的,employee DataFrame的标签要对应random_salary中的多个标签

新增RANDOM_SALARY列,会引起报错

# employee[‘RANDOM_SALARY‘] = random_salary[‘BASE_SALARY‘]

4.3 加入部分索引的series

选取max_dept_sal[‘BASE_SALARY‘]的前三行,赋值给employee[‘MAX_SALARY2‘]

employee[‘MAX_SALARY2‘] = max_dept_sal[‘BASE_SALARY‘].head(3)
employee.MAX_SALARY2.value_counts()
140416.0    29
100000.0    11
64251.0      5
Name: MAX_SALARY2, dtype: int64

因为只填充了三个部门的值,所有其它部门在结果中都是缺失值

employee.MAX_SALARY2.isnull().mean()
0.9775

以上是关于Pandas Cookbook -- 06索引对齐的主要内容,如果未能解决你的问题,请参考以下文章

《Pandas CookBook》---- 第五章 布尔索引

Pandas 将多个数据帧与时间戳索引对齐

全网最完整Python数据分析笔记系列工具篇:Pandas

Pandas的对齐运算和函数

[Python Cookbook] Pandas Groupby

pandas DataFrame 从不规则时间序列索引中重新采样