Pandas Cookbook, Chapter 5: Boolean Indexing
Posted by shiyushiyu
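These notes are based on the translation by SeanCheney on Jianshu. I adjusted the formatting and reorganized the table of contents so that it suits my own reading and is easier to search when I come back to it.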
import pandas as pd
import numpy as np
Set the maximum number of displayed columns and rows
pd.set_option('display.max_columns', 5, 'display.max_rows', 5)
1 Boolean statistics
movie = pd.read_csv('data/movie.csv', index_col='movie_title')
movie.head()
movie_title | color | director_name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
Avatar | Color | James Cameron | ... | 1.78 | 33000 |
Pirates of the Caribbean: At World's End | Color | Gore Verbinski | ... | 2.35 | 0 |
Spectre | Color | Sam Mendes | ... | 2.35 | 85000 |
The Dark Knight Rises | Color | Christopher Nolan | ... | 2.35 | 164000 |
Star Wars: Episode VII - The Force Awakens | NaN | Doug Walker | ... | NaN | 0 |
5 rows × 27 columns
1.1 Basic approach
Check whether each movie runs longer than two hours
movie_2_hours = movie['duration'] > 120
movie_2_hours.head(10)
movie_title
Avatar True
Pirates of the Caribbean: At World's End True
...
Avengers: Age of Ultron True
Harry Potter and the Half-Blood Prince True
Name: duration, Length: 10, dtype: bool
Count the movies longer than two hours
movie_2_hours.sum()
1039
Proportion of movies longer than two hours
movie_2_hours.mean()
0.2113506916192026
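In fact, the duration column has missing values, and a comparison against a missing value evaluates to False. To get the true proportion of movies longer than two hours, drop the missing values first
movie['duration'].dropna().gt(120).mean()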
0.21199755152009794
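The gap between the two proportions can be reproduced on a tiny example. The sketch below uses a made-up Series named `durations` (an assumption for illustration, not rows from movie.csv) to show that a missing value stays in the denominator unless it is dropped first.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes; one value is missing.
durations = pd.Series([150.0, 90.0, np.nan, 130.0, 100.0])

over_2_hours = durations > 120          # NaN > 120 evaluates to False
count_over = int(over_2_hours.sum())    # True counts as 1, False as 0
share_all_rows = over_2_hours.mean()    # denominator includes the NaN row
share_known = durations.dropna().gt(120).mean()  # denominator excludes it

print(count_over, share_all_rows, share_known)
```

With five rows and one missing value, the first share divides by 5 and the second by 4, which is the same effect seen in the two movie proportions above.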
1.2 Summary statistics
Use describe() to summarize the boolean Series
movie_2_hours.describe()
count 4916
unique 2
top False
freq 3877
Name: duration, dtype: object
Proportions of False and True values
movie_2_hours.value_counts(normalize=True)
False 0.788649
True 0.211351
Name: duration, dtype: float64
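2 Boolean indexing
2.1 Boolean conditions
In Python, and therefore in Pandas, the bitwise operators (&, |, ~) bind more tightly than the comparison operators, so each comparison must be wrapped in parentheses before it is combined with another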
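A short sketch of what the precedence rule means in practice, using a hypothetical two-column frame (`score` and `year` are made-up names, not columns of movie.csv). It shows the parenthesized form working and the unparenthesized form failing.

```python
import pandas as pd

# Hypothetical scores and years, not taken from movie.csv.
df = pd.DataFrame({"score": [8.5, 7.0, 9.1], "year": [1999, 2005, 2012]})

# Parenthesized comparisons produce boolean Series that & can combine.
both = (df["score"] > 8) & (df["year"] >= 2010)

# Without parentheses Python groups this as
# df["score"] > (8 & df["year"]) >= 2010, a chained comparison that
# needs the truth value of a whole Series and therefore raises.
try:
    df["score"] > 8 & df["year"] >= 2010
    unparenthesized_failed = False
except ValueError:
    unparenthesized_failed = True

print(both.tolist(), unparenthesized_failed)
```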
2.1.1 Create several boolean conditions
criteria1 = movie.imdb_score > 8
criteria2 = movie.content_rating == 'PG-13'
criteria3 = (movie.title_year < 2000) | (movie.title_year >= 2010)
criteria3.head()
movie_title
Avatar False
Pirates of the Caribbean: At World's End False
Spectre True
The Dark Knight Rises True
Star Wars: Episode VII - The Force Awakens False
Name: title_year, dtype: bool
2.1.2 Combine the conditions into one
criteria_final = criteria1 & criteria2 & criteria3
criteria_final.head()
movie_title
Avatar False
Pirates of the Caribbean: At World's End False
Spectre False
The Dark Knight Rises True
Star Wars: Episode VII - The Force Awakens False
dtype: bool
2.2 Boolean filtering
Create the first set of conditions
crit_a1 = movie.imdb_score > 8
crit_a2 = movie.content_rating == 'PG-13'
crit_a3 = (movie.title_year < 2000) | (movie.title_year > 2009)
final_crit_a = crit_a1 & crit_a2 & crit_a3
Create the second set of conditions
crit_b1 = movie.imdb_score < 5
crit_b2 = movie.content_rating == 'R'
crit_b3 = (movie.title_year >= 2000) & (movie.title_year <= 2010)
final_crit_b = crit_b1 & crit_b2 & crit_b3
Combine the two sets of conditions
final_crit_all = final_crit_a | final_crit_b
final_crit_all.head()
movie_title
Avatar False
Pirates of the Caribbean: At World's End False
Spectre False
The Dark Knight Rises True
Star Wars: Episode VII - The Force Awakens False
dtype: bool
Filter the data
movie[final_crit_all].head()
movie_title | color | director_name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
The Dark Knight Rises | Color | Christopher Nolan | ... | 2.35 | 164000 |
The Avengers | Color | Joss Whedon | ... | 1.85 | 123000 |
Captain America: Civil War | Color | Anthony Russo | ... | 2.35 | 72000 |
Guardians of the Galaxy | Color | James Gunn | ... | 2.35 | 96000 |
Interstellar | Color | Christopher Nolan | ... | 2.35 | 349000 |
5 rows × 27 columns
Verify the filter
cols = ['imdb_score', 'content_rating', 'title_year']
movie_filtered = movie.loc[final_crit_all, cols]
movie_filtered.head(10)
movie_title | imdb_score | content_rating | title_year |
---|---|---|---|
The Dark Knight Rises | 8.5 | PG-13 | 2012.0 |
The Avengers | 8.1 | PG-13 | 2012.0 |
... | ... | ... | ... |
Sex and the City 2 | 4.3 | R | 2010.0 |
Rollerball | 3.0 | R | 2002.0 |
10 rows × 3 columns
2.3 Comparison with label-based indexing
college = pd.read_csv('data/college.csv')
college2 = college.set_index('STABBR')
2.3.1 Single label
college2 uses STABBR as its row index, so rows can be selected with loc
college2.loc['TX'].head()
STABBR | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|---|---|---|---|---|
TX | Abilene Christian University | Abilene | ... | 40200 | 25985 |
TX | Alvin Community College | Alvin | ... | 34500 | 6750 |
TX | Amarillo College | Amarillo | ... | 31700 | 10950 |
TX | Angelina College | Lufkin | ... | 26900 | PrivacySuppressed |
TX | Angelo State University | San Angelo | ... | 37700 | 21319.5 |
5 rows × 26 columns
In college, use boolean indexing to select all schools in Texas
college[college['STABBR'] == 'TX'].head()
 | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|---|---|---|---|---|
3610 | Abilene Christian University | Abilene | ... | 40200 | 25985 |
3611 | Alvin Community College | Alvin | ... | 34500 | 6750 |
3612 | Amarillo College | Amarillo | ... | 31700 | 10950 |
3613 | Angelina College | Lufkin | ... | 26900 | PrivacySuppressed |
3614 | Angelo State University | San Angelo | ... | 37700 | 21319.5 |
5 rows × 27 columns
Compare the speed of the two approaches
Method 1
%timeit college[college['STABBR'] == 'TX']
937 μs ± 58.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 2
%timeit college2.loc['TX']
520 μs ± 21.2 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Time needed to build the index that method 2 relies on
%timeit college2 = college.set_index('STABBR')
2.11 ms ± 185 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.3.2 Multiple labels
Select several states with boolean indexing and with labels
states = ['TX', 'CA', 'NY']
college[college['STABBR'].isin(states)]
 | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|---|---|---|---|---|
192 | Academy of Art University | San Francisco | ... | 36000 | 35093 |
193 | ITT Technical Institute-Rancho Cordova | Rancho Cordova | ... | 38800 | 25827.5 |
... | ... | ... | ... | ... | ... |
7533 | Bay Area Medical Academy - San Jose Satellite ... | San Jose | ... | NaN | PrivacySuppressed |
7534 | Excel Learning Center-San Antonio South | San Antonio | ... | NaN | 12125 |
1704 rows × 27 columns
college2.loc[states].head()
STABBR | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|---|---|---|---|---|
TX | Abilene Christian University | Abilene | ... | 40200 | 25985 |
TX | Alvin Community College | Alvin | ... | 34500 | 6750 |
TX | Amarillo College | Amarillo | ... | 31700 | 10950 |
TX | Angelina College | Lufkin | ... | 26900 | PrivacySuppressed |
TX | Angelo State University | San Angelo | ... | 37700 | 21319.5 |
5 rows × 26 columns
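The two selections return the same rows but not necessarily in the same order. A minimal sketch on a made-up miniature of the college table (the school names are hypothetical) that puts the two results side by side:

```python
import pandas as pd

# Hypothetical miniature of the college table.
schools = pd.DataFrame({
    "INSTNM": ["A College", "B University", "C Institute", "D School"],
    "STABBR": ["TX", "CA", "FL", "TX"],
})
states = ["TX", "CA"]

# Boolean indexing keeps the original row order and the integer index.
by_mask = schools[schools["STABBR"].isin(states)]

# Label selection groups the rows by the order of the requested labels.
by_label = schools.set_index("STABBR").loc[states]

print(by_mask["INSTNM"].tolist())
print(by_label["INSTNM"].tolist())
```

This matches the outputs above: the isin result starts with California rows in file order, while the loc result starts with all Texas rows because 'TX' is the first label in states.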
3 The query method
Use the query method to make boolean indexing more readable
# Read the employee data and choose the departments and columns to keep
employee = pd.read_csv('data/employee.csv')
depts = ['Houston Police Department-HPD', 'Houston Fire Department (HFD)']
select_columns = ['UNIQUE_ID', 'DEPARTMENT', 'GENDER', 'BASE_SALARY']
# Build the query string and run the query method
qs = "DEPARTMENT in @depts and GENDER == 'Female' and 80000 <= BASE_SALARY <= 120000"
emp_filtered = employee.query(qs)
emp_filtered[select_columns].head()
 | UNIQUE_ID | DEPARTMENT | GENDER | BASE_SALARY |
---|---|---|---|---|
61 | 61 | Houston Fire Department (HFD) | Female | 96668.0 |
136 | 136 | Houston Police Department-HPD | Female | 81239.0 |
367 | 367 | Houston Police Department-HPD | Female | 86534.0 |
474 | 474 | Houston Police Department-HPD | Female | 91181.0 |
513 | 513 | Houston Police Department-HPD | Female | 81239.0 |
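To see what the query string stands for, the sketch below runs the same kind of query on a few made-up employee rows (department names and salaries are hypothetical) and compares it with the equivalent boolean mask. In the query string, `@depts` refers to the Python variable and bare names refer to columns.

```python
import pandas as pd

# Hypothetical employee records, not rows from employee.csv.
staff = pd.DataFrame({
    "DEPARTMENT": ["Police", "Fire", "Library", "Police"],
    "GENDER": ["Female", "Female", "Female", "Male"],
    "BASE_SALARY": [85000.0, 130000.0, 90000.0, 95000.0],
})
depts = ["Police", "Fire"]

# @depts is looked up in the surrounding Python scope.
qs = "DEPARTMENT in @depts and GENDER == 'Female' and 80000 <= BASE_SALARY <= 120000"
via_query = staff.query(qs)

# The same filter written with boolean indexing.
mask = (
    staff["DEPARTMENT"].isin(depts)
    & (staff["GENDER"] == "Female")
    & staff["BASE_SALARY"].between(80000, 120000)
)
via_mask = staff[mask]

print(via_query.equals(via_mask))
```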
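4 Unique and sorted indexes
4.1 Single-column index
college = pd.read_csv('data/college.csv')
college2 = college.set_index('STABBR')
college2.index.is_monotonic_increasing
False
Sort college2, store the result in another object, and check whether its index is ordered
college3 = college2.sort_index()
college3.index.is_monotonic_increasing
True
Use INSTNM as the row index and check whether the index is unique
college_unique = college.set_index('INSTNM')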
college_unique.index.is_unique
True
4.2 Concatenated index
Build the row index from the CITY and STABBR columns, then sort it
college.index = college['CITY'] + ', ' + college['STABBR']
college = college.sort_index()
college.head()
 | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|---|---|---|---|---|
ARTESIA, CA | Angeles Institute | ARTESIA | ... | NaN | 16850 |
Aberdeen, SD | Presentation College | Aberdeen | ... | 35900 | 25000 |
Aberdeen, SD | Northern State University | Aberdeen | ... | 33600 | 24847 |
Aberdeen, WA | Grays Harbor College | Aberdeen | ... | 27000 | 11490 |
Abilene, TX | Hardin-Simmons University | Abilene | ... | 38700 | 25864 |
5 rows × 27 columns
college.index.is_unique
False
Select all colleges in Miami, FL
Method 1
college.loc['Miami, FL'].head()
 | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|---|---|---|---|---|
Miami, FL | New Professions Technical Institute | Miami | ... | 18700 | 8682 |
Miami, FL | Management Resources College | Miami | ... | PrivacySuppressed | 12182 |
Miami, FL | Strayer University-Doral | Miami | ... | 49200 | 36173.5 |
Miami, FL | Keiser University- Miami | Miami | ... | 29700 | 26063 |
Miami, FL | George T Baker Aviation Technical College | Miami | ... | 38600 | PrivacySuppressed |
5 rows × 27 columns
Method 2
crit1 = college['CITY'] == 'Miami'
crit2 = college['STABBR'] == 'FL'
college[crit1 & crit2]
 | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
---|---|---|---|---|---|
Miami, FL | New Professions Technical Institute | Miami | ... | 18700 | 8682 |
Miami, FL | Management Resources College | Miami | ... | PrivacySuppressed | 12182 |
... | ... | ... | ... | ... | ... |
Miami, FL | Advanced Technical Centers | Miami | ... | PrivacySuppressed | PrivacySuppressed |
Miami, FL | Lindsey Hopkins Technical College | Miami | ... | 29800 | PrivacySuppressed |
50 rows × 27 columns
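The index checks in this section can be tried on a small frame. The sketch below uses three made-up rows with labels in the "CITY, STABBR" form (school names are hypothetical) and relies on `is_monotonic_increasing`, the property name available in current pandas.

```python
import pandas as pd

# Hypothetical index labels in the "CITY, STABBR" form used above.
frame = pd.DataFrame(
    {"INSTNM": ["School A", "School B", "School C"]},
    index=["Miami, FL", "Abilene, TX", "Miami, FL"],
)

has_unique_index = frame.index.is_unique               # duplicates present
sorted_before = frame.index.is_monotonic_increasing    # not sorted yet

ordered = frame.sort_index()
sorted_after = ordered.index.is_monotonic_increasing

# A duplicated label selects every matching row.
miami = ordered.loc["Miami, FL"]

print(has_unique_index, sorted_before, sorted_after, len(miami))
```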
5 Booleans in loc/iloc
movie = pd.read_csv('data/movie.csv', index_col='movie_title')
5.1 Rows
c1 = movie['content_rating'] == 'G'
c2 = movie['imdb_score'] < 4
criteria = c1 & c2
bool_movie = movie[criteria]
bool_movie
movie_title | color | director_name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
The True Story of Puss'N Boots | Color | Jérôme Deschamps | ... | NaN | 90 |
Doogal | Color | Dave Borthwick | ... | 1.85 | 346 |
... | ... | ... | ... | ... | ... |
Justin Bieber: Never Say Never | Color | Jon M. Chu | ... | 1.85 | 62000 |
Sunday School Musical | Color | Rachel Goldenberg | ... | 1.85 | 777 |
6 rows × 27 columns
Booleans with loc
Method 1
movie_loc = movie.loc[criteria]
Check that the DataFrame produced by loc is identical to the one produced by plain boolean indexing
movie_loc.equals(movie[criteria])
True
Method 2
movie_loc2 = movie.loc[criteria.values]
movie_loc2.equals(movie[criteria])
True
Booleans with iloc
Because criteria is a Series that carries a row index, iloc cannot take it directly; pass its underlying ndarray instead
movie_iloc = movie.iloc[criteria.values]
movie_iloc.equals(movie_loc)
True
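A minimal sketch of the difference between loc and iloc here, on a made-up three-row frame (titles and scores are hypothetical). It uses `to_numpy()`, which returns the same underlying array as `.values` in this case.

```python
import pandas as pd

# Hypothetical ratings indexed by title.
ratings = pd.DataFrame(
    {"score": [3.5, 8.2, 2.9]},
    index=["Film A", "Film B", "Film C"],
)
low = ratings["score"] < 4  # boolean Series carrying the title index

by_loc = ratings.loc[low]                 # .loc aligns the Series on labels
by_iloc = ratings.iloc[low.to_numpy()]    # .iloc needs a plain boolean array

# Passing the Series itself to .iloc is rejected by pandas.
try:
    ratings.iloc[low]
    series_accepted = True
except (ValueError, NotImplementedError):
    series_accepted = False

print(by_loc.equals(by_iloc), series_accepted)
```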
5.2 Columns
Boolean indexing can also be used to select columns
criteria_col = movie.dtypes == np.int64
criteria_col.head()
color False
director_name False
num_critic_for_reviews False
duration False
director_facebook_likes False
dtype: bool
movie.loc[:, criteria_col].head()
movie_title | num_voted_users | cast_total_facebook_likes | movie_facebook_likes |
---|---|---|---|
Avatar | 886204 | 4834 | 33000 |
Pirates of the Caribbean: At World's End | 471220 | 48350 | 0 |
Spectre | 275868 | 11700 | 85000 |
The Dark Knight Rises | 1144337 | 106759 | 164000 |
Star Wars: Episode VII - The Force Awakens | 8 | 143 | 0 |
movie.iloc[:, criteria_col.values].head()
movie_title | num_voted_users | cast_total_facebook_likes | movie_facebook_likes |
---|---|---|---|
Avatar | 886204 | 4834 | 33000 |
Pirates of the Caribbean: At World's End | 471220 | 48350 | 0 |
Spectre | 275868 | 11700 | 85000 |
The Dark Knight Rises | 1144337 | 106759 | 164000 |
Star Wars: Episode VII - The Force Awakens | 8 | 143 | 0 |
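The column selection above can be checked on a small frame. The sketch below uses a made-up three-column frame and also shows `select_dtypes`, an alternative that is not used in the original notes but gives the same columns here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column of each kind.
frame = pd.DataFrame({
    "title": ["a", "b"],
    "votes": np.array([10, 20], dtype=np.int64),
    "score": [7.5, 8.0],
})

is_int = frame.dtypes == np.int64            # boolean Series indexed by column name
via_loc = frame.loc[:, is_int]               # .loc accepts the labelled Series
via_iloc = frame.iloc[:, is_int.to_numpy()]  # .iloc needs the plain array
via_select = frame.select_dtypes(include="int64")

print(list(via_loc.columns))
```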
6 Using boolean values - where/mask
mask() is the inverse boolean operation of where.
DataFrame.where(cond, other=nan, inplace=False, **kwargs)
Parameters:
cond : boolean NDFrame, array-like, or callable
- Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn't check it).
- Put differently, cond has the same shape as df: wherever cond is True the original value of df is kept, and everywhere else the corresponding value from other is used.
other : scalar, NDFrame, or callable
inplace : boolean, default False
- Whether to perform the operation in place on the data
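A short sketch of how where and mask mirror each other, using a made-up Series of three integers:

```python
import pandas as pd

values = pd.Series([1, 5, 10])
cond = values > 4

kept_where_true = values.where(cond, other=0)   # replaces where cond is False
kept_where_false = values.mask(cond, other=0)   # replaces where cond is True

# mask(cond) gives the same result as where(~cond).
same_as_inverted = kept_where_false.equals(values.where(~cond, other=0))

print(kept_where_true.tolist(), kept_where_false.tolist(), same_as_inverted)
```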
6.1 where on a Series
movie = pd.read_csv('data/movie.csv', index_col='movie_title')
fb_likes = movie['actor_1_facebook_likes'].dropna()
fb_likes.head()
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End 40000.0
Spectre 11000.0
The Dark Knight Rises 27000.0
Star Wars: Episode VII - The Force Awakens 131.0
Name: actor_1_facebook_likes, dtype: float64
Use describe to get a feel for the data
fb_likes.describe(percentiles=[.1, .25, .5, .75, .9]).astype(int)
count 4909
mean 6494
...
90% 18000
max 640000
Name: actor_1_facebook_likes, Length: 10, dtype: int64
Compute the proportion of values with fewer than 20,000 likes
criteria_high = fb_likes < 20000
criteria_high.mean().round(2)
0.91
where returns a Series of the same size, with every value for which the condition is False replaced by a missing value
fb_likes.where(criteria_high).head()
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End NaN
Spectre 11000.0
The Dark Knight Rises NaN
Star Wars: Episode VII - The Force Awakens 131.0
Name: actor_1_facebook_likes, dtype: float64
The second parameter, other, lets you control the replacement value
fb_likes.where(criteria_high, other=20000).head()
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End 20000.0
Spectre 11000.0
The Dark Knight Rises 20000.0
Star Wars: Episode VII - The Force Awakens 131.0
Name: actor_1_facebook_likes, dtype: float64
Chain two where calls to set both an upper and a lower bound
criteria_low = fb_likes > 300
fb_likes_cap = fb_likes.where(criteria_high, other=20000).where(criteria_low, 300)
fb_likes_cap.head()
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End 20000.0
Spectre 11000.0
The Dark Knight Rises 20000.0
Star Wars: Episode VII - The Force Awakens 300.0
Name: actor_1_facebook_likes, dtype: float64
The original Series and the modified Series have the same length
len(fb_likes), len(fb_likes_cap)
(4909, 4909)
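The chained where calls amount to capping the values at both ends. The sketch below reuses the five values printed by fb_likes.head() above as a stand-alone Series and compares the chain with `Series.clip`, an alternative not used in these notes.

```python
import pandas as pd

# The first five actor_1_facebook_likes values shown above.
likes = pd.Series([1000.0, 40000.0, 11000.0, 27000.0, 131.0])

# Upper bound first, then lower bound, as in the chained where calls.
capped_where = likes.where(likes < 20000, other=20000).where(likes > 300, other=300)

# clip applies both bounds in one call.
capped_clip = likes.clip(lower=300, upper=20000)

print(capped_where.tolist())
print(capped_where.equals(capped_clip))
```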
6.2 where on a DataFrame
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'], 'ids2': ['a', 'n', 'c', 'n']})
print(df)
print(df < 2)
df.where(df<2,1000)
vals ids ids2
0 1 a a
1 2 b n
2 3 f c
3 4 n n
vals ids ids2
0 True True True
1 False True True
2 False True True
3 False True True
 | vals | ids | ids2 |
---|---|---|---|
0 | 1 | a | a |
1 | 1000 | b | n |
2 | 1000 | f | c |
3 | 1000 | n | n |
The code below is equivalent to df.where(df < 2, 1000), except that the intermediate NaN values turn vals into floats.
print(df[df < 2])
df[df < 2].fillna(1000)
vals ids ids2
0 1.0 a a
1 NaN b n
2 NaN f c
3 NaN n n
 | vals | ids | ids2 |
---|---|---|---|
0 | 1.0 | a | a |
1 | 1000.0 | b | n |
2 | 1000.0 | f | c |
3 | 1000.0 | n | n |
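The equivalence can be checked on a purely numeric frame. The sketch below uses a made-up DataFrame named `nums` (numeric columns only, an assumption made so that the comparison with 2 is well defined for every column) and compares the values produced by both routes.

```python
import pandas as pd

# Hypothetical numeric-only frame.
nums = pd.DataFrame({"a": [1, 2, 3], "b": [0, 5, 1]})

via_where = nums.where(nums < 2, 1000)      # replace in one step
via_fillna = nums[nums < 2].fillna(1000)    # mask to NaN, then fill

# Values agree even if the column dtypes differ between the two results.
same_values = bool((via_where == via_fillna).all().all())

print(same_values)
print(via_where.dtypes.tolist(), via_fillna.dtypes.tolist())
```

One caveat with the fillna route: it would also fill values that were already missing in the original data, which where leaves to the condition.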