Python机器学习(九十五)Pandas 有条件访问
Posted 大码王
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Python机器学习(九十五)Pandas 有条件访问相关的知识,希望对你有一定的参考价值。
有时需要根据条件,访问DataFrame中的数据。例如,找出电影数据集中,某个导演的电影:
# 加载数据 movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title") movies_df.columns = [‘rank‘, ‘genre‘, ‘description‘, ‘director‘, ‘actors‘, ‘year‘, ‘runtime‘, ‘rating‘, ‘votes‘, ‘revenue_millions‘, ‘metascore‘] # 有条件访问 condition = (movies_df[‘director‘] == "Ridley Scott") condition.head()
输出
Title
Guardians of the Galaxy False
Prometheus True
Split False
Sing False
Suicide Squad False
Name: director, dtype: bool
与isnull()
类似,它返回一系列真值和假值: 如果是该导演的电影为真,否则为假。
上面的方法,没有过滤掉不属于该导演的电影。下面的例子,选出某个导演的所有电影,过滤掉不属于他的电影:
movies_df[movies_df[‘director‘] == "Ridley Scott"]
输出
rank genre ... revenue_millions metascore Title ... Prometheus 2 Adventure,Mystery,Sci-Fi ... 126.46 65.0 The Martian 103 Adventure,Drama,Sci-Fi ... 228.43 80.0 Robin Hood 388 Action,Adventure,Drama ... 105.22 53.0 American Gangster 471 Biography,Crime,Drama ... 130.13 76.0 Exodus: Gods and Kings 517 Action,Adventure,Drama ... 65.01 52.0 The Counselor 522 Crime,Drama,Thriller ... 16.97 48.0 A Good Year 531 Comedy,Drama,Romance ... 7.46 47.0 Body of Lies 738 Action,Drama,Romance ... 39.38 57.0 [8 rows x 11 columns]
此代码可以理解为下面的sql语句:
Select * from movies_df where director = ‘Ridley Scott‘
让我们看看使用数值的条件选择,通过评分过滤DataFrame:
movies_df[movies_df[‘rating‘] >= 8.6].head(3)
输出
rank genre ... revenue_millions metascore Title ... Interstellar 37 Adventure,Drama,Sci-Fi ... 187.99 74.0 The Dark Knight 55 Action,Crime,Drama ... 533.32 82.0 Inception 81 Action,Adventure,Sci-Fi ... 292.57 74.0 [3 rows x 11 columns]
可以使用逻辑运算符|
(或),&
(与)来生成更复杂的条件语句。
例如,只显示克里斯托弗·诺兰(Christopher Nolan)或雷德利·斯科特(Ridley Scott)的电影:
movies_df[(movies_df[‘director‘] == ‘Christopher Nolan‘) | (movies_df[‘director‘] == ‘Ridley Scott‘)].head()
输出
rank genre ... revenue_millions metascore Title ... Prometheus 2 Adventure,Mystery,Sci-Fi ... 126.46 65.0 Interstellar 37 Adventure,Drama,Sci-Fi ... 187.99 74.0 The Dark Knight 55 Action,Crime,Drama ... 533.32 82.0 The Prestige 65 Drama,Mystery,Sci-Fi ... 53.08 66.0 Inception 81 Action,Adventure,Sci-Fi ... 292.57 74.0 [5 rows x 11 columns]
可以使用isin()
方法,代码更简洁:
movies_df[movies_df[‘director‘].isin([‘Christopher Nolan‘, ‘Ridley Scott‘])].head()
输出
rank genre ... revenue_millions metascore Title ... Prometheus 2 Adventure,Mystery,Sci-Fi ... 126.46 65.0 Interstellar 37 Adventure,Drama,Sci-Fi ... 187.99 74.0 The Dark Knight 55 Action,Crime,Drama ... 533.32 82.0 The Prestige 65 Drama,Mystery,Sci-Fi ... 53.08 66.0 Inception 81 Action,Adventure,Sci-Fi ... 292.57 74.0 [5 rows x 11 columns]
下面的例子,选取所有在2005年到2010年间上映的电影,其评级都在8.0以上,但票房收入却低于25%。
movies_df[ ((movies_df[‘year‘] >= 2005) & (movies_df[‘year‘] <= 2010)) & (movies_df[‘rating‘] > 8.0) & (movies_df[‘revenue_millions‘] < movies_df[‘revenue_millions‘].quantile(0.25)) ]
输出
rank genre ... revenue_millions metascore Title ... 3 Idiots 431 Comedy,Drama ... 6.52 67.0 The Lives of Others 477 Drama,Thriller ... 11.28 89.0 Incendies 714 Drama,Mystery,War ... 6.86 80.0 Taare Zameen Par 992 Drama,Family,Music ... 1.20 42.0 [4 rows x 11 columns]
这里只有4部电影符合这个标准。
回想一下,前面章节,使用.describe()
时,收入的第25个百分位数大约是17.4,所以可以使用方法quantile(0.25)
直接访问这个值。
以上是关于Python机器学习(九十五)Pandas 有条件访问的主要内容,如果未能解决你的问题,请参考以下文章
Python机器学习(九十一)Pandas 填充(Imputation)空值