Pandas Cookbook -- 08 Tidying Data
Based on SeanCheney's translation on Jianshu; I adjusted the formatting and reorganized the table of contents to suit my own reading, so it is easier to look things up later.
import pandas as pd
import numpy as np
Set the maximum number of columns and rows to display
pd.set_option('display.max_columns', 5, 'display.max_rows', 5)
1 Wide to Long Format
state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
state_fruit
| | Apple | Orange | Banana |
|---|---|---|---|
Texas | 12 | 10 | 40 |
Arizona | 9 | 7 | 12 |
Florida | 0 | 14 | 190 |
1.1 stack
DataFrame.stack(level=-1, dropna=True)
- Stack the prescribed level(s) from columns to index.
- That is, pivot the column labels into the row index
Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe:
- if the columns have a single level, the output is a Series;
- if the columns have multiple levels, the new index level(s) is (are) taken from the prescribed level(s) and the output is a DataFrame.
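A minimal sketch of those two bullets, using a tiny made-up frame:
single = pd.DataFrame([[1, 2]], index=['r0'], columns=['a', 'b'])
single.stack()          # single-level columns -> returns a Series
multi = single.copy()
multi.columns = pd.MultiIndex.from_tuples([('x', 'a'), ('x', 'b')])
multi.stack(level=0)    # stacking one level of MultiIndex columns -> returns a DataFrame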
1.1.1 Using stack to pivot column labels into the row index
The stack method turns all the column names into a new innermost level of a vertical row index
state_fruit.stack()
Texas Apple 12
Orange 10
...
Florida Orange 14
Banana 190
Length: 9, dtype: int64
Use reset_index() to turn the result into a DataFrame
state_fruit_tidy = state_fruit.stack().reset_index()
state_fruit_tidy
| | level_0 | level_1 | 0 |
|---|---|---|---|
0 | Texas | Apple | 12 |
1 | Texas | Orange | 10 |
... | ... | ... | ... |
7 | Florida | Orange | 14 |
8 | Florida | Banana | 190 |
9 rows × 3 columns
Rename the columns
state_fruit_tidy.columns = ['state', 'fruit', 'weight']
state_fruit_tidy
| | state | fruit | weight |
|---|---|---|---|
0 | Texas | Apple | 12 |
1 | Texas | Orange | 10 |
... | ... | ... | ... |
7 | Florida | Orange | 14 |
8 | Florida | Banana | 190 |
9 rows × 3 columns
rename_axis can also name the individual row index levels
state_fruit.stack().rename_axis(['state', 'fruit'])
state fruit
Texas Apple 12
Orange 10
...
Florida Orange 14
Banana 190
Length: 9, dtype: int64
Chain reset_index again, this time naming the values column directly
state_fruit.stack().rename_axis(['state', 'fruit']).reset_index(name='weight')
| | state | fruit | weight |
|---|---|---|---|
0 | Texas | Apple | 12 |
1 | Texas | Orange | 10 |
... | ... | ... | ... |
7 | Florida | Orange | 14 |
8 | Florida | Banana | 190 |
9 rows × 3 columns
1.1.2 Stacking several groups of variables at once
That is, after stacking, the column names are split by some rule into two or more columns
movie = pd.read_csv('data/movie.csv')
actor = movie[[
    'movie_title', 'actor_1_name', 'actor_2_name',
    'actor_3_name', 'actor_1_facebook_likes',
    'actor_2_facebook_likes', 'actor_3_facebook_likes']]
Define a custom function to rename the columns: wide_to_long requires each group of variables to end with the same numeric suffix
def change_col_name(col_name):
    col_name = col_name.replace('_name', '')
    if 'facebook' in col_name:
        fb_idx = col_name.find('facebook')
        col_name = (col_name[:5] + col_name[fb_idx - 1:]
                    + col_name[5:fb_idx - 1])
    return col_name
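A quick sanity check of the renamer before applying it, on two sample names from the actor frame:
for name in ['actor_1_name', 'actor_1_facebook_likes']:
    print(name, '->', change_col_name(name))
# actor_1_name -> actor_1
# actor_1_facebook_likes -> actor_facebook_likes_1
Both groups now end with the numeric suffix, which is the shape wide_to_long expects.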
actor2 = actor.rename(columns=change_col_name)
actor2.iloc[:5, :5]
| | movie_title | actor_1 | actor_2 | actor_3 | actor_facebook_likes_1 |
|---|---|---|---|---|---|
0 | Avatar | CCH Pounder | Joel David Moore | Wes Studi | 1000.0 |
1 | Pirates of the Caribbean: At World‘s End | Johnny Depp | Orlando Bloom | Jack Davenport | 40000.0 |
2 | Spectre | Christoph Waltz | Rory Kinnear | Stephanie Sigman | 11000.0 |
3 | The Dark Knight Rises | Tom Hardy | Christian Bale | Joseph Gordon-Levitt | 27000.0 |
4 | Star Wars: Episode VII - The Force Awakens | Doug Walker | Rob Walker | NaN | 131.0 |
Use wide_to_long to stack the actor and actor_facebook_likes groups at the same time
stubs = ['actor', 'actor_facebook_likes']
actor2_tidy = pd.wide_to_long(actor2,
                              stubnames=stubs,
                              i=['movie_title'],
                              j='actor_num',
                              sep='_')
actor2_tidy.head(10)
| | | actor | actor_facebook_likes |
|---|---|---|---|
| movie_title | actor_num | | |
Avatar | 1 | CCH Pounder | 1000.0 |
Pirates of the Caribbean: At World‘s End | 1 | Johnny Depp | 40000.0 |
... | ... | ... | ... |
Avengers: Age of Ultron | 1 | Chris Hemsworth | 26000.0 |
Harry Potter and the Half-Blood Prince | 1 | Alan Rickman | 25000.0 |
10 rows × 2 columns
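Note that wide_to_long is picky: every stubname must occur with the same set of suffixes, and i must uniquely identify rows. A small self-contained sketch on a made-up frame:
toy = pd.DataFrame({'id': ['a', 'b'], 'score_1': [10, 20], 'score_2': [11, 21]})
pd.wide_to_long(toy, stubnames='score', i='id', j='round', sep='_')
# indexed by (id, round) with a single 'score' column; by default only numeric suffixes are matched (suffix='\d+')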
1.2 melt
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
- This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
- That is, this function reshapes a DataFrame so that one or more columns serve as identifiers (id_vars), while the remaining measured columns are unpivoted into a 'variable' column holding the former column names and a 'value' column holding their values
- id_vars : tuple, list, or ndarray, optional
- Column(s) to use as identifier variables.
- value_vars : tuple, list, or ndarray, optional
- Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
- var_name : scalar
- Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
- value_name : scalar, default ‘value’
- Name to use for the ‘value’ column.
- col_level : int or string, optional
- If columns are a MultiIndex then use this level to melt.
Read the state_fruit2 dataset
state_fruit2 = pd.read_csv('data/state_fruit2.csv')
state_fruit2
| | State | Apple | Orange | Banana |
|---|---|---|---|---|
0 | Texas | 12 | 10 | 40 |
1 | Arizona | 9 | 7 | 12 |
2 | Florida | 0 | 14 | 190 |
melt moves the original column names into a 'variable' column and the original values into a 'value' column
state_fruit2.melt(id_vars=['State'], value_vars=['Apple', 'Orange', 'Banana'])
| | State | variable | value |
|---|---|---|---|
0 | Texas | Apple | 12 |
1 | Arizona | Apple | 9 |
... | ... | ... | ... |
7 | Arizona | Banana | 12 |
8 | Florida | Banana | 190 |
9 rows × 3 columns
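As a hedged aside (not part of the original recipe), melt is essentially set_index plus stack; after aligning column names and row order, the check below should print True:
melted = state_fruit2.melt(id_vars='State')
stacked = state_fruit2.set_index('State').stack().reset_index()
stacked.columns = ['State', 'variable', 'value']
key = ['State', 'variable']
melted.sort_values(key).reset_index(drop=True).equals(stacked.sort_values(key).reset_index(drop=True))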
Assign an arbitrary row index
state_fruit2.index = list('abc')
state_fruit2.index.name = 'letter'
state_fruit2
| | State | Apple | Orange | Banana |
|---|---|---|---|---|
| letter | | | | |
a | Texas | 12 | 10 | 40 |
b | Arizona | 9 | 7 | 12 |
c | Florida | 0 | 14 | 190 |
var_name and value_name rename the newly created columns:
var_name labels the column holding the former column names (default 'variable')
value_name labels the column holding the values (default 'value')
state_fruit2.melt(id_vars=['State'],
                  value_vars=['Apple', 'Orange', 'Banana'],
                  var_name='Fruit',
                  value_name='Weight')
| | State | Fruit | Weight |
|---|---|---|---|
0 | Texas | Apple | 12 |
1 | Arizona | Apple | 9 |
... | ... | ... | ... |
7 | Arizona | Banana | 12 |
8 | Florida | Banana | 190 |
9 rows × 3 columns
To stack every value into one column with the old column labels in another, call melt with no arguments
state_fruit2.melt()
| | variable | value |
|---|---|---|
0 | State | Texas |
1 | State | Arizona |
... | ... | ... |
10 | Banana | 12 |
11 | Banana | 190 |
12 rows × 2 columns
To keep identifier columns, just pass them as id_vars
state_fruit2.melt(id_vars='State')
| | State | variable | value |
|---|---|---|---|
0 | Texas | Apple | 12 |
1 | Arizona | Apple | 9 |
... | ... | ... | ... |
7 | Arizona | Banana | 12 |
8 | Florida | Banana | 190 |
9 rows × 3 columns
2 Long to Wide Format
2.1 unstack
- DataFrame.unstack(level=-1, fill_value=None)
- level : int, string, or list of these, default -1 (last level)
- Level(s) of index to unstack, can pass level name
- That is, which level(s) of the row index to pivot out into columns
- fill_value : replace NaN with this value if the unstack produces missing values
- That is, what to fill the resulting missing values with (a small sketch follows)
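A minimal sketch of level and fill_value on a made-up Series, before turning to the college data:
s = pd.Series([1, 2, 3],
              index=pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'x')],
                                              names=['outer', 'inner']))
s.unstack('inner', fill_value=0)   # the missing ('b', 'y') cell becomes 0 instead of NaN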
Read the college dataset with the school name as the row index, keeping only the undergraduate race columns
usecol_func = lambda x: 'UGDS_' in x or x == 'INSTNM'
college = pd.read_csv('data/college.csv', index_col='INSTNM', usecols=usecol_func)
Use stack to pivot all the horizontal column names into a vertical row index
college_stacked = college.stack()
college_stacked.head(18)
INSTNM
Alabama A & M University UGDS_WHITE 0.0333
UGDS_BLACK 0.9353
...
University of Alabama at Birmingham UGDS_NRA 0.0179
UGDS_UNKN 0.0100
Length: 18, dtype: float64
unstack reverses the operation
college_stacked.unstack().head()
| | UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN |
|---|---|---|---|---|---|
| INSTNM | | | | | |
Alabama A & M University | 0.0333 | 0.9353 | ... | 0.0059 | 0.0138 |
University of Alabama at Birmingham | 0.5922 | 0.2600 | ... | 0.0179 | 0.0100 |
Amridge University | 0.2990 | 0.4192 | ... | 0.0000 | 0.2715 |
University of Alabama in Huntsville | 0.6988 | 0.1255 | ... | 0.0332 | 0.0350 |
Alabama State University | 0.0158 | 0.9208 | ... | 0.0243 | 0.0137 |
5 rows × 9 columns
2.2 pivot
DataFrame.pivot(index=None, columns=None, values=None)
- Return reshaped DataFrame organized by given index / column values.
Returns a DataFrame reshaped according to the given index / column values
- index : string or object, optional
- Column to use to make new frame's index. If None, uses existing index.
- That is, which column of the original DataFrame supplies the new frame's index
- columns : string or object
- Column to use to make new frame's columns.
- That is, which column of the original DataFrame supplies the new frame's column labels
- values : string, object or a list of the previous, optional
- Column(s) to use for populating new frame's values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.
- That is, which column of the original DataFrame supplies the values of the new frame (toy sketch below)
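A toy sketch of these three parameters (frame made up). pivot is a pure reshape: if any (index, columns) pair occurred twice it would raise a ValueError, which is what pivot_table and its aggfunc are for (see section 3):
long_df = pd.DataFrame({'row': ['r1', 'r1', 'r2'],
                        'col': ['a', 'b', 'a'],
                        'val': [1, 2, 3]})
long_df.pivot(index='row', columns='col', values='val')   # ('r2', 'b') has no value, so that cell is NaN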
Another route is melt followed by pivot. Reload the data without setting the row index
college2 = pd.read_csv('data/college.csv', usecols=usecol_func)
college_melted = college2.melt(id_vars='INSTNM', var_name='Race', value_name='Percentage')
college_melted.head()
| | INSTNM | Race | Percentage |
|---|---|---|---|
0 | Alabama A & M University | UGDS_WHITE | 0.0333 |
1 | University of Alabama at Birmingham | UGDS_WHITE | 0.5922 |
2 | Amridge University | UGDS_WHITE | 0.2990 |
3 | University of Alabama in Huntsville | UGDS_WHITE | 0.6988 |
4 | Alabama State University | UGDS_WHITE | 0.0158 |
Use pivot to invert the melt
melted_inv = college_melted.pivot(index='INSTNM', columns='Race', values='Percentage')
melted_inv.head()
| Race | UGDS_2MOR | UGDS_AIAN | ... | UGDS_UNKN | UGDS_WHITE |
|---|---|---|---|---|---|
| INSTNM | | | | | |
A & W Healthcare Educators | 0.0000 | 0.0 | ... | 0.0000 | 0.0000 |
A T Still University of Health Sciences | NaN | NaN | ... | NaN | NaN |
ABC Beauty Academy | 0.0000 | 0.0 | ... | 0.0000 | 0.0000 |
ABC Beauty College Inc | 0.0000 | 0.0 | ... | 0.0000 | 0.2895 |
AI Miami International University of Art and Design | 0.0018 | 0.0 | ... | 0.4644 | 0.0324 |
5 rows × 9 columns
3 Pivot Tables
3.1 Using pivot_table
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame
Similar to pivot, but with aggregation
- data : DataFrame
- values : column to aggregate, optional
- index : column, Grouper, array, or list of the previous
- If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values.
- columns : column, Grouper, array, or list of the previous
- If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values.
- That is, when the list passed to columns has several elements, a MultiIndex is built on the columns, just as a multi-element index list builds one on the rows
- aggfunc : function, list of functions, dict, default numpy.mean
- If list of functions passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves) If dict is passed, the key is column to aggregate and value is function or list of functions
- That is, how the selected values are aggregated
- fill_value : scalar, default None
- Value to replace missing values with
flights = pd.read_csv('data/flights.csv')
flights.head()
| | MONTH | DAY | ... | DIVERTED | CANCELLED |
|---|---|---|---|---|---|
0 | 1 | 1 | ... | 0 | 0 |
1 | 1 | 1 | ... | 0 | 0 |
2 | 1 | 1 | ... | 0 | 0 |
3 | 1 | 1 | ... | 0 | 0 |
4 | 1 | 1 | ... | 0 | 0 |
5 rows × 14 columns
Use pivot_table to total the cancelled flights for each airline at each origin airport
fp = flights.pivot_table(index='AIRLINE',
                         columns='ORG_AIR',
                         values='CANCELLED',
                         aggfunc='sum',
                         fill_value=0).round(2)
fp.head()
| ORG_AIR | ATL | DEN | ... | PHX | SFO |
|---|---|---|---|---|---|
| AIRLINE | | | | | |
AA | 3 | 4 | ... | 4 | 2 |
AS | 0 | 0 | ... | 0 | 0 |
B6 | 0 | 0 | ... | 0 | 1 |
DL | 28 | 1 | ... | 1 | 2 |
EV | 18 | 6 | ... | 0 | 0 |
5 rows × 10 columns
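pivot_table can also append subtotal rows and columns via margins (named by margins_name, 'All' by default); a short optional variation on the same call:
flights.pivot_table(index='AIRLINE', columns='ORG_AIR',
                    values='CANCELLED', aggfunc='sum',
                    fill_value=0, margins=True).tail(3)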
3.2 Using groupby
A groupby aggregation cannot reproduce this table directly; first group by all of the index and columns variables
fg = flights.groupby(['AIRLINE', 'ORG_AIR'])['CANCELLED'].sum()
fg.head()
AIRLINE ORG_AIR
AA ATL 3
DEN 4
DFW 86
IAH 3
LAS 3
Name: CANCELLED, dtype: int64
Then unstack, turning the ORG_AIR index level into column names
fg_unstack = fg.unstack('ORG_AIR', fill_value=0)
fg_unstack.head()
| ORG_AIR | ATL | DEN | ... | PHX | SFO |
|---|---|---|---|---|---|
| AIRLINE | | | | | |
AA | 3 | 4 | ... | 4 | 2 |
AS | 0 | 0 | ... | 0 | 0 |
B6 | 0 | 0 | ... | 0 | 1 |
DL | 28 | 1 | ... | 1 | 2 |
EV | 18 | 6 | ... | 0 | 0 |
5 rows × 10 columns
3.3 Comparing the two approaches
fp.equals(fg_unstack)
True
fp2 = flights.pivot_table(index=['AIRLINE', 'MONTH'],
                          columns=['ORG_AIR', 'CANCELLED'],
                          values=['DEP_DELAY', 'DIST'],
                          aggfunc=[np.mean, np.sum],
                          fill_value=0)
fp2
| | | mean | | ... | sum | |
|---|---|---|---|---|---|---|
| | | DEP_DELAY | | ... | DIST | |
| ORG_AIR | | ATL | | ... | SFO | |
| CANCELLED | | 0 | 1 | ... | 0 | 1 |
| AIRLINE | MONTH | | | | | |
| AA | 1 | -3.250000 | 0 | ... | 33483 | 0 |
| | 2 | -3.000000 | 0 | ... | 32110 | 2586 |
| ... | ... | ... | ... | ... | ... | ... |
| WN | 11 | 5.932203 | 0 | ... | 23235 | 784 |
| | 12 | 15.691589 | 0 | ... | 30508 | 0 |
149 rows × 80 columns
Reproduce the table above with groupby and unstack
(flights.groupby(['AIRLINE', 'MONTH', 'ORG_AIR', 'CANCELLED'])[['DEP_DELAY', 'DIST']]
    .agg(['mean', 'sum'])
    .unstack(['ORG_AIR', 'CANCELLED'], fill_value=0)
    .swaplevel(0, 1, axis='columns')
    .head())
| | | mean | | ... | sum | |
|---|---|---|---|---|---|---|
| | | DEP_DELAY | | ... | DIST | |
| ORG_AIR | | ATL | | ... | SFO | |
| CANCELLED | | 0 | 1 | ... | 0 | 1 |
| AIRLINE | MONTH | | | | | |
| AA | 1 | -3.250000 | NaN | ... | 33483.0 | NaN |
| | 2 | -3.000000 | NaN | ... | 32110.0 | 2586.0 |
| | 3 | -0.166667 | NaN | ... | 43580.0 | NaN |
| | 4 | 0.071429 | NaN | ... | 51054.0 | NaN |
| | 5 | 5.777778 | NaN | ... | 40233.0 | NaN |
5 rows × 80 columns
4 Tidying Tips
Assorted data-tidying cases and techniques
4.1 Renaming index levels for easier reshaping
Read the college dataset, group by state and religious affiliation, then aggregate the undergraduate population and SAT math columns
college = pd.read_csv('data/college.csv')
cg = college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']].agg(['count', 'min', 'max']).head(6)
cg
| | | UGDS | | ... | SATMTMID | |
|---|---|---|---|---|---|---|
| | | count | min | ... | min | max |
| STABBR | RELAFFIL | | | | | |
| AK | 0 | 7 | 109.0 | ... | NaN | NaN |
| | 1 | 3 | 27.0 | ... | 503.0 | 503.0 |
| ... | ... | ... | ... | ... | ... | ... |
| AR | 0 | 68 | 18.0 | ... | 427.0 | 565.0 |
| | 1 | 14 | 20.0 | ... | 495.0 | 600.0 |
6 rows × 6 columns
Both row index levels are named, but the column levels are not. Use rename_axis to name them
cg = cg.rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
cg
| AGG_COLS | | UGDS | | ... | SATMTMID | |
|---|---|---|---|---|---|---|
| AGG_FUNCS | | count | min | ... | min | max |
| STABBR | RELAFFIL | | | | | |
| AK | 0 | 7 | 109.0 | ... | NaN | NaN |
| | 1 | 3 | 27.0 | ... | 503.0 | 503.0 |
| ... | ... | ... | ... | ... | ... | ... |
| AR | 0 | 68 | 18.0 | ... | 427.0 | 565.0 |
| | 1 | 14 | 20.0 | ... | 495.0 | 600.0 |
6 rows × 6 columns
Move the AGG_FUNCS level into the row index
cg.stack('AGG_FUNCS').head()
| AGG_COLS | | | UGDS | SATMTMID |
|---|---|---|---|---|
| STABBR | RELAFFIL | AGG_FUNCS | | |
| AK | 0 | count | 7.0 | 0.0 |
| | | min | 109.0 | NaN |
| | | max | 12865.0 | NaN |
| | 1 | count | 3.0 | 1.0 |
| | | min | 27.0 | 503.0 |
stack inserts the column level at the innermost position of the row index by default; swaplevel reorders the levels
cg.stack('AGG_FUNCS').swaplevel('AGG_FUNCS', 'STABBR', axis='index').head()
| AGG_COLS | | | UGDS | SATMTMID |
|---|---|---|---|---|
| AGG_FUNCS | RELAFFIL | STABBR | | |
count | 0 | AK | 7.0 | 0.0 |
min | 0 | AK | 109.0 | NaN |
max | 0 | AK | 12865.0 | NaN |
count | 1 | AK | 3.0 | 1.0 |
min | 1 | AK | 27.0 | 503.0 |
Then sort both axes with sort_index
(cg.stack('AGG_FUNCS')
    .swaplevel('AGG_FUNCS', 'STABBR', axis='index')
    .sort_index(level='RELAFFIL', axis='index')
    .sort_index(level='AGG_COLS', axis='columns')
    .head(6))
| AGG_COLS | | | SATMTMID | UGDS |
|---|---|---|---|---|
| AGG_FUNCS | RELAFFIL | STABBR | | |
| count | 0 | AK | 0.0 | 7.0 |
| | | AL | 13.0 | 71.0 |
| ... | ... | ... | ... | ... |
| min | 0 | AL | 420.0 | 12.0 |
| | | AR | 427.0 | 18.0 |
6 rows × 2 columns
Some levels can be stacked while others are unstacked
cg.stack('AGG_FUNCS').unstack(['RELAFFIL', 'STABBR'])
| AGG_COLS | UGDS | | ... | SATMTMID | |
|---|---|---|---|---|---|
| RELAFFIL | 0 | 1 | ... | 0 | 1 |
| STABBR | AK | AK | ... | AR | AR |
| AGG_FUNCS | | | | | |
count | 7.0 | 3.0 | ... | 9.0 | 7.0 |
min | 109.0 | 27.0 | ... | 427.0 | 495.0 |
max | 12865.0 | 275.0 | ... | 565.0 | 600.0 |
3 rows × 12 columns
Stacking all column levels returns a Series
cg.stack(['AGG_FUNCS', 'AGG_COLS']).head(12)
STABBR RELAFFIL AGG_FUNCS AGG_COLS
AK 0 count UGDS 7.0
SATMTMID 0.0
...
AL 0 count UGDS 71.0
SATMTMID 13.0
Length: 12, dtype: float64
Remove the level names from both the row and column indexes
cg.rename_axis([None, None], axis='index').rename_axis([None, None], axis='columns')
| | | UGDS | | ... | SATMTMID | |
|---|---|---|---|---|---|---|
| | | count | min | ... | min | max |
| AK | 0 | 7 | 109.0 | ... | NaN | NaN |
| | 1 | 3 | 27.0 | ... | 503.0 | 503.0 |
| ... | ... | ... | ... | ... | ... | ... |
| AR | 0 | 68 | 18.0 | ... | 427.0 | 565.0 |
| | 1 | 14 | 20.0 | ... | 495.0 | 600.0 |
6 rows × 6 columns
4.2 Tidying when multiple variables are stored as column names
Read the weightlifting dataset
weightlifting = pd.read_csv('data/weightlifting_men.csv')
weightlifting
| | Weight Category | M35 35-39 | ... | M75 75-79 | M80 80+ |
|---|---|---|---|---|---|
0 | 56 | 137 | ... | 62 | 55 |
1 | 62 | 152 | ... | 67 | 57 |
... | ... | ... | ... | ... | ... |
6 | 105 | 210 | ... | 95 | 80 |
7 | 105+ | 217 | ... | 100 | 85 |
8 rows × 11 columns
Use melt to gather the sex_age column headers into a single column
wl_melt = weightlifting.melt(id_vars='Weight Category',
                             var_name='sex_age',
                             value_name='Qual Total')
wl_melt.head()
| | Weight Category | sex_age | Qual Total |
|---|---|---|---|
0 | 56 | M35 35-39 | 137 |
1 | 62 | M35 35-39 | 152 |
2 | 69 | M35 35-39 | 167 |
3 | 77 | M35 35-39 | 182 |
4 | 85 | M35 35-39 | 192 |
Use str.split to break sex_age into two columns
sex_age = wl_melt['sex_age'].str.split(expand=True)
sex_age.head()
| | 0 | 1 |
|---|---|---|
0 | M35 | 35-39 |
1 | M35 | 35-39 |
2 | M35 | 35-39 |
3 | M35 | 35-39 |
4 | M35 | 35-39 |
sex_age.columns = ['Sex', 'Age Group']
sex_age.head()
| | Sex | Age Group |
|---|---|---|
0 | M35 | 35-39 |
1 | M35 | 35-39 |
2 | M35 | 35-39 |
3 | M35 | 35-39 |
4 | M35 | 35-39 |
Keep only the leading M of each string
sex_age['Sex'] = sex_age['Sex'].str[0]
sex_age.head()
| | Sex | Age Group |
|---|---|---|
0 | M | 35-39 |
1 | M | 35-39 |
2 | M | 35-39 |
3 | M | 35-39 |
4 | M | 35-39 |
Use concat to join sex_age with wl_cat_total
wl_cat_total = wl_melt[['Weight Category', 'Qual Total']]
wl_tidy = pd.concat([sex_age, wl_cat_total], axis='columns')
wl_tidy.head()
| | Sex | Age Group | Weight Category | Qual Total |
|---|---|---|---|---|
0 | M | 35-39 | 56 | 137 |
1 | M | 35-39 | 62 | 152 |
2 | M | 35-39 | 69 | 167 |
3 | M | 35-39 | 77 | 182 |
4 | M | 35-39 | 85 | 192 |
The same result can be produced like this
cols = ['Weight Category', 'Qual Total']
sex_age[cols] = wl_melt[cols]
Or use assign to attach the new columns dynamically
age_group = wl_melt.sex_age.str.extract('(\d{2}[-+](?:\d{2})?)', expand=False)
sex = wl_melt.sex_age.str[0]
new_cols = {'Sex': sex, 'Age Group': age_group}
wl_tidy2 = wl_melt.assign(**new_cols).drop('sex_age', axis='columns')
wl_tidy2.head()
| | Weight Category | Qual Total | Sex | Age Group |
|---|---|---|---|---|
0 | 56 | 137 | M | 35-39 |
1 | 62 | 152 | M | 35-39 |
2 | 69 | 167 | M | 35-39 |
3 | 77 | 182 | M | 35-39 |
4 | 85 | 192 | M | 35-39 |
4.3 Tidying when multiple variables are stored as column values
Read the restaurant_inspections dataset, parsing the Date column as datetime64
inspections = pd.read_csv('data/restaurant_inspections.csv', parse_dates=['Date'])
inspections.head(10)
| | Name | Date | Info | Value |
|---|---|---|---|---|
0 | E & E Grill House | 2017-08-08 | Borough | MANHATTAN |
1 | E & E Grill House | 2017-08-08 | Cuisine | American |
... | ... | ... | ... | ... |
8 | PIZZA WAGON | 2017-04-12 | Grade | A |
9 | PIZZA WAGON | 2017-04-12 | Score | 10.0 |
10 rows × 4 columns
4.3.1 The stack/unstack approach
inspections.set_index(['Name', 'Date', 'Info']).unstack('Info').head()
| | | Value | | | | |
|---|---|---|---|---|---|---|
| Info | | Borough | Cuisine | Description | Grade | Score |
| Name | Date | | | | | |
3 STAR JUICE CENTER | 2017-05-10 | BROOKLYN | Juice, Smoothies, Fruit Salads | Facility not vermin proof. Harborage or condit... | A | 12.0 |
A & L PIZZA RESTAURANT | 2017-08-22 | BROOKLYN | Pizza | Facility not vermin proof. Harborage or condit... | A | 9.0 |
AKSARAY TURKISH CAFE AND RESTAURANT | 2017-07-25 | BROOKLYN | Turkish | Plumbing not properly installed or maintained;... | A | 13.0 |
ANTOJITOS DELI FOOD | 2017-06-01 | BROOKLYN | Latin (Cuban, Dominican, Puerto Rican, South &... | Live roaches present in facility‘s food and/or... | A | 10.0 |
BANGIA | 2017-06-16 | MANHATTAN | Korean | Covered garbage receptacle not provided or ina... | A | 9.0 |
Use reset_index with col_level=-1 so the former index names land in the innermost column level
insp_tidy = inspections.set_index(['Name', 'Date', 'Info']).unstack('Info').reset_index(col_level=-1)
insp_tidy.head()
| | | | ... | Value | |
|---|---|---|---|---|---|
| Info | Name | Date | ... | Grade | Score |
0 | 3 STAR JUICE CENTER | 2017-05-10 | ... | A | 12.0 |
1 | A & L PIZZA RESTAURANT | 2017-08-22 | ... | A | 9.0 |
2 | AKSARAY TURKISH CAFE AND RESTAURANT | 2017-07-25 | ... | A | 13.0 |
3 | ANTOJITOS DELI FOOD | 2017-06-01 | ... | A | 10.0 |
4 | BANGIA | 2017-06-16 | ... | A | 9.0 |
5 rows × 7 columns
Drop the outermost column level and clear the remaining level's name
insp_tidy.columns = insp_tidy.columns.droplevel(0).rename(None)
insp_tidy.head()
| | Name | Date | ... | Grade | Score |
|---|---|---|---|---|---|
0 | 3 STAR JUICE CENTER | 2017-05-10 | ... | A | 12.0 |
1 | A & L PIZZA RESTAURANT | 2017-08-22 | ... | A | 9.0 |
2 | AKSARAY TURKISH CAFE AND RESTAURANT | 2017-07-25 | ... | A | 13.0 |
3 | ANTOJITOS DELI FOOD | 2017-06-01 | ... | A | 10.0 |
4 | BANGIA | 2017-06-16 | ... | A | 9.0 |
5 rows × 7 columns
4.3.2 The pivot_table approach
pivot_table needs an aggregation function that reduces each cell's values to a single value
(inspections.pivot_table(index=['Name', 'Date'],
                         columns='Info',
                         values='Value',
                         aggfunc='first')
    .reset_index()
    .rename_axis(None, axis='columns'))
| | Name | Date | ... | Grade | Score |
|---|---|---|---|---|---|
0 | 3 STAR JUICE CENTER | 2017-05-10 | ... | A | 12.0 |
1 | A & L PIZZA RESTAURANT | 2017-08-22 | ... | A | 9.0 |
... | ... | ... | ... | ... | ... |
98 | WANG MANDOO HOUSE | 2017-08-29 | ... | A | 12.0 |
99 | XIAOYAN YABO INC | 2017-08-29 | ... | Z | 49.0 |
100 rows × 7 columns
# inspections.pivot(index=['Name', 'Date'], columns='Info', values='Value')
# Running pivot here raises an error: pivot takes no aggregation function (and older pandas also rejects a list for index), so the ['Name', 'Date'] pairs plus columns='Info' cannot be resolved
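A hedged sanity check that aggfunc='first' is not actually collapsing anything: if no (Name, Date, Info) triple repeats, each cell holds exactly one value and pivot_table acts as a pure reshape.
inspections.duplicated(subset=['Name', 'Date', 'Info']).any()
# expected to print False for this dataset: every triple is unique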
4.4 Tidying when two or more values are stored in a single cell
Read the texas_cities dataset
cities = pd.read_csv('data/texas_cities.csv')
cities
| | City | Geolocation |
|---|---|---|
0 | Houston | 29.7604° N, 95.3698° W |
1 | Dallas | 32.7767° N, 96.7970° W |
2 | Austin | 30.2672° N, 97.7431° W |
Split Geolocation into four separate columns. The pattern '. ' is a regular expression: any character followed by a space, which matches both '° ' and ', '
geolocations = cities.Geolocation.str.split(pat='. ', expand=True)
geolocations.columns = ['latitude', 'latitude direction', 'longitude', 'longitude direction']
geolocations
| | latitude | latitude direction | longitude | longitude direction |
|---|---|---|---|---|
0 | 29.7604 | N | 95.3698 | W |
1 | 32.7767 | N | 96.7970 | W |
2 | 30.2672 | N | 97.7431 | W |
Convert the data types
geolocations = geolocations.astype({'latitude': 'float', 'longitude': 'float'})
geolocations.dtypes
latitude float64
latitude direction object
longitude float64
longitude direction object
dtype: object
Concatenate the new columns with the original City column
cities_tidy = pd.concat([cities['City'], geolocations], axis='columns')
cities_tidy
| | City | latitude | latitude direction | longitude | longitude direction |
|---|---|---|---|---|---|
0 | Houston | 29.7604 | N | 95.3698 | W |
1 | Dallas | 32.7767 | N | 96.7970 | W |
2 | Austin | 30.2672 | N | 97.7431 | W |
pd.to_numeric converts each column to integer or float automatically; errors='ignore' leaves the non-numeric direction columns untouched
temp = geolocations.apply(pd.to_numeric, errors='ignore')
temp
| | latitude | latitude direction | longitude | longitude direction |
|---|---|---|---|---|
0 | 29.7604 | N | 95.3698 | W |
1 | 32.7767 | N | 96.7970 | W |
2 | 30.2672 | N | 97.7431 | W |
A regex | alternation splits on several delimiters at once
cities.Geolocation.str.split(pat='° |, ', expand=True)
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
0 | 29.7604 | N | 95.3698 | W |
1 | 32.7767 | N | 96.7970 | W |
2 | 30.2672 | N | 97.7431 | W |
A more elaborate approach with str.extract (the '.' again stands in for the degree symbol)
cities.Geolocation.str.extract('([0-9.]+). (N|S), ([0-9.]+). (E|W)', expand=True)
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
0 | 29.7604 | N | 95.3698 | W |
1 | 32.7767 | N | 96.7970 | W |
2 | 30.2672 | N | 97.7431 | W |
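A variant sketch, not in the original recipe: the same extraction with named regex groups, so str.extract names the result columns itself (the group names below are just illustrative choices):
cities.Geolocation.str.extract('(?P<latitude>[0-9.]+). (?P<lat_dir>N|S), (?P<longitude>[0-9.]+). (?P<lon_dir>E|W)')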
4.5 Tidying when variables are stored in both column names and column values
Read the sensors dataset
sensors = pd.read_csv('data/sensors.csv')
sensors
| | Group | Property | ... | 2015 | 2016 |
|---|---|---|---|---|---|
0 | A | Pressure | ... | 973 | 870 |
1 | A | Temperature | ... | 1036 | 1042 |
... | ... | ... | ... | ... | ... |
4 | B | Temperature | ... | 1002 | 1013 |
5 | B | Flow | ... | 824 | 873 |
6 rows × 7 columns
Tidy with melt
sensors.melt(id_vars=['Group', 'Property'], var_name='Year').head(6)
| | Group | Property | Year | value |
|---|---|---|---|---|
0 | A | Pressure | 2012 | 928 |
1 | A | Temperature | 2012 | 1026 |
... | ... | ... | ... | ... |
4 | B | Temperature | 2012 | 1008 |
5 | B | Flow | 2012 | 887 |
6 rows × 4 columns
Use pivot_table to turn the Property values into new column names
(sensors.melt(id_vars=['Group', 'Property'], var_name='Year')
    .pivot_table(index=['Group', 'Year'], columns='Property', values='value')
    .reset_index()
    .rename_axis(None, axis='columns'))
| | Group | Year | Flow | Pressure | Temperature |
|---|---|---|---|---|---|
0 | A | 2012 | 819 | 928 | 1026 |
1 | A | 2013 | 806 | 873 | 1038 |
... | ... | ... | ... | ... | ... |
8 | B | 2015 | 824 | 806 | 1002 |
9 | B | 2016 | 873 | 942 | 1013 |
10 rows × 5 columns
The same result with stack and unstack
(sensors.set_index(['Group', 'Property'])
    .stack()
    .unstack('Property')
    .rename_axis(['Group', 'Year'], axis='index')
    .rename_axis(None, axis='columns')
    .reset_index())
| | Group | Year | Flow | Pressure | Temperature |
|---|---|---|---|---|---|
0 | A | 2012 | 819 | 928 | 1026 |
1 | A | 2013 | 806 | 873 | 1038 |
... | ... | ... | ... | ... | ... |
8 | B | 2015 | 824 | 806 | 1002 |
9 | B | 2016 | 873 | 942 | 1013 |
10 rows × 5 columns
4.6 Tidying when multiple observational units are stored in the same table
That is, splitting one table into several smaller ones
Read the movie_altered dataset
movie = pd.read_csv('data/movie_altered.csv')
movie.head()
| | title | rating | ... | actor_fb_likes_2 | actor_fb_likes_3 |
|---|---|---|---|---|---|
0 | Avatar | PG-13 | ... | 936.0 | 855.0 |
1 | Pirates of the Caribbean: At World‘s End | PG-13 | ... | 5000.0 | 1000.0 |
2 | Spectre | PG-13 | ... | 393.0 | 161.0 |
3 | The Dark Knight Rises | PG-13 | ... | 23000.0 | 23000.0 |
4 | Star Wars: Episode VII - The Force Awakens | NaN | ... | 12.0 | NaN |
5 rows × 12 columns
Insert a new column that uniquely identifies each movie
movie.insert(0, 'id', np.arange(len(movie)))
Use wide_to_long to gather the directors/actors into one column each and their Facebook likes into another
stubnames = ['director', 'director_fb_likes', 'actor', 'actor_fb_likes']
movie_long = pd.wide_to_long(movie, stubnames=stubnames, i='id', j='num', sep='_').reset_index()
movie_long['num'] = movie_long['num'].astype(int)
movie_long.head(9)
| | id | num | ... | actor | actor_fb_likes |
|---|---|---|---|---|---|
0 | 0 | 1 | ... | CCH Pounder | 1000.0 |
1 | 0 | 2 | ... | Joel David Moore | 936.0 |
... | ... | ... | ... | ... | ... |
7 | 2 | 2 | ... | Rory Kinnear | 393.0 |
8 | 2 | 3 | ... | Stephanie Sigman | 161.0 |
9 rows × 10 columns
movie.columns
Index(['id', 'title', 'rating', 'year', 'duration', 'director_1',
       'director_fb_likes_1', 'actor_1', 'actor_2', 'actor_3',
       'actor_fb_likes_1', 'actor_fb_likes_2', 'actor_fb_likes_3'],
      dtype='object')
movie_long.columns
Index(['id', 'num', 'year', 'duration', 'rating', 'title', 'director',
       'director_fb_likes', 'actor', 'actor_fb_likes'],
      dtype='object')
Decompose the data into several smaller tables
movie_table = movie_long[['id', 'title', 'year', 'duration', 'rating']]
director_table = movie_long[['id', 'director', 'num', 'director_fb_likes']]
actor_table = movie_long[['id', 'actor', 'num', 'actor_fb_likes']]
Deduplicate and drop missing values
movie_table = movie_table.drop_duplicates().reset_index(drop=True)
director_table = director_table.dropna().reset_index(drop=True)
actor_table = actor_table.dropna().reset_index(drop=True)
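One payoff of the split is less redundancy. A rough, illustrative comparison of memory footprints (the exact numbers depend on the dataset):
before = movie_long.memory_usage(deep=True).sum()
after = sum(t.memory_usage(deep=True).sum() for t in (movie_table, director_table, actor_table))
print(before, after)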