《Pandas CookBook》---- DataFrame基础操作
Posted shiyushiyu
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了《Pandas CookBook》---- DataFrame基础操作相关的知识,希望对你有一定的参考价值。
简书大神SeanCheney的译作,我作了些格式调整和文章目录结构的变化,更适合自己阅读,以后翻阅是更加方便自己查找吧
import pandas as pd
import numpy as np
设定最大列数和最大行数
pd.set_option(‘max_columns‘,5 , ‘max_rows‘, 5)
选取多个DataFrame列
用列表选取多个列
movie = pd.read_csv(‘data/movie.csv‘)
cols =[‘actor_1_name‘, ‘actor_2_name‘, ‘actor_3_name‘, ‘director_name‘]
movie_actor_director = movie[cols]
movie_actor_director
actor_1_name | actor_2_name | actor_3_name | director_name | |
---|---|---|---|---|
0 | CCH Pounder | Joel David Moore | Wes Studi | James Cameron |
1 | Johnny Depp | Orlando Bloom | Jack Davenport | Gore Verbinski |
... | ... | ... | ... | ... |
4914 | Alan Ruck | Daniel Henney | Eliza Coupe | Daniel Hsia |
4915 | John August | Brian Herzlinger | Jon Gunn | Jon Gunn |
4916 rows × 4 columns
使用select_dtypes选取类型
select_dtypes(include=None, exclude=None)
- To select all numeric types, use np.number or ‘number‘
- To select strings you must use the object dtype, but note that this will return all object dtype columns,See the numpy dtype hierarchy
- To select datetimes, use np.datetime64, ‘datetime‘ or ‘datetime64‘
- To select timedeltas, use np.timedelta64, ‘timedelta‘ or ‘timedelta64‘
- To select Pandas categorical dtypes, use ‘category‘
movie.shape
(4916, 28)
选取整数列
movie.select_dtypes(include=[‘int‘]).head()
num_voted_users | cast_total_facebook_likes | movie_facebook_likes | |
---|---|---|---|
0 | 886204 | 4834 | 33000 |
1 | 471220 | 48350 | 0 |
2 | 275868 | 11700 | 85000 |
3 | 1144337 | 106759 | 164000 |
4 | 8 | 143 | 0 |
选取非整数列
movie.select_dtypes(exclude=[‘int‘]).head()
color | director_name | ... | imdb_score | aspect_ratio | |
---|---|---|---|---|---|
0 | Color | James Cameron | ... | 7.9 | 1.78 |
1 | Color | Gore Verbinski | ... | 7.1 | 2.35 |
2 | Color | Sam Mendes | ... | 6.8 | 2.35 |
3 | Color | Christopher Nolan | ... | 8.5 | 2.35 |
4 | NaN | Doug Walker | ... | 7.1 | NaN |
5 rows × 25 columns
通过filter函数过滤选取多列
filter(items=None, like=None, regex=None, axis=None)
- items : list-like
- List of info axis to restrict to (must not all be present)
- 传递个列名或行名列表
- like : string
- Keep info axis where “arg in col == True”
- 类似Python里面字符串的find()函数,col.find(arg)
- regex : string (regular expression)
- Keep info axis with re.search(regex, col) == True
通过filter()函数过滤选取多列
movie.filter(like=‘facebook‘).head()
director_facebook_likes | actor_3_facebook_likes | ... | actor_2_facebook_likes | movie_facebook_likes | |
---|---|---|---|---|---|
0 | 0.0 | 855.0 | ... | 936.0 | 33000 |
1 | 563.0 | 1000.0 | ... | 5000.0 | 0 |
2 | 0.0 | 161.0 | ... | 393.0 | 85000 |
3 | 22000.0 | 23000.0 | ... | 23000.0 | 164000 |
4 | 131.0 | NaN | ... | 12.0 | 0 |
5 rows × 6 columns
通过正则表达式选取多列
movie.filter(regex=‘d‘).head()
actor_3_facebook_likes | actor_2_name | ... | actor_3_name | actor_2_facebook_likes | |
---|---|---|---|---|---|
0 | 855.0 | Joel David Moore | ... | Wes Studi | 936.0 |
1 | 1000.0 | Orlando Bloom | ... | Jack Davenport | 5000.0 |
2 | 161.0 | Rory Kinnear | ... | Stephanie Sigman | 393.0 |
3 | 23000.0 | Christian Bale | ... | Joseph Gordon-Levitt | 23000.0 |
4 | NaN | Rob Walker | ... | NaN | 12.0 |
5 rows × 6 columns
filter()函数,传递列表到参数items,选取多列
movie.filter(items=[‘actor_1_name‘, ‘actor_3_name‘]).head()
actor_1_name | actor_3_name | |
---|---|---|
0 | CCH Pounder | Wes Studi |
1 | Johnny Depp | Jack Davenport |
2 | Christoph Waltz | Stephanie Sigman |
3 | Tom Hardy | Joseph Gordon-Levitt |
4 | Doug Walker | NaN |
DataFrame上操作
基本方法
数据的个数 数据集的维度 数据集的长度
movie.shape,movie.size,movie.ndim
((4916, 28), 137648, 2)
各个列的非空值的个数
movie.count()
color 4897
director_name 4814
...
aspect_ratio 4590
movie_facebook_likes 4916
Length: 28, dtype: int64
统计信息
movie.shape
(4916, 28)
最大 最小值
数值类型
# min max quantile
movie_min = movie.min()
movie_min.name = ‘最小值‘
movie_min
num_critic_for_reviews 1.00
duration 7.00
...
aspect_ratio 1.18
movie_facebook_likes 0.00
Name: 最小值, Length: 16, dtype: float64
计算是默认会跳过缺失值的,可设置skipna=False使其包含缺失,但这样不具有意义
movie.min(skipna=False)
num_critic_for_reviews NaN
duration NaN
...
aspect_ratio NaN
movie_facebook_likes 0.0
Length: 16, dtype: float64
字符串类型
当字符串类型的列包含缺失值时,聚合方法min、max、sum,不会返回任何值。
movie[[‘color‘, ‘movie_title‘, ‘color‘]].max()
Series([], dtype: float64)
要让pandas强行返回每列的值,必须填入缺失值。下面填入的是空字符串
movie[[‘color‘, ‘movie_title‘, ‘color‘]].fillna(‘‘).max()
color Color
movie_title ?on Flux
color Color
dtype: object
统计信息
数值型
使用percentiles参数指定分位数
movie.describe(percentiles=[.01, .3, .99])
num_critic_for_reviews | duration | ... | aspect_ratio | movie_facebook_likes | |
---|---|---|---|---|---|
count | 4867.000000 | 4901.000000 | ... | 4590.000000 | 4916.000000 |
mean | 137.988905 | 107.090798 | ... | 2.222349 | 7348.294142 |
... | ... | ... | ... | ... | ... |
99% | 546.680000 | 189.000000 | ... | 4.000000 | 93850.000000 |
max | 813.000000 | 511.000000 | ... | 16.000000 | 349000.000000 |
9 rows × 16 columns
字符串型
movie.select_dtypes(include=‘object‘).describe()
color | director_name | ... | country | content_rating | |
---|---|---|---|---|---|
count | 4897 | 4814 | ... | 4911 | 4616 |
unique | 2 | 2397 | ... | 65 | 18 |
top | Color | Steven Spielberg | ... | USA | R |
freq | 4693 | 26 | ... | 3710 | 2067 |
4 rows × 12 columns
方法的组合
使用isnull方法将每个值转变为布尔值
movie.isnull().head()
color | director_name | ... | aspect_ratio | movie_facebook_likes | |
---|---|---|---|---|---|
0 | False | False | ... | False | False |
1 | False | False | ... | False | False |
2 | False | False | ... | False | False |
3 | False | False | ... | False | False |
4 | True | False | ... | True | False |
5 rows × 28 columns
sum统计布尔值,返回的是Series
movie.isnull().sum().head()
color 19
director_name 102
num_critic_for_reviews 49
duration 15
director_facebook_likes 102
dtype: int64
对这个Series再使用sum,返回整个DataFrame的缺失值的个数,返回值是个标量
movie.isnull().sum().sum()
2654
判断整个DataFrame有没有缺失值,方法是连着使用两个any
movie.isnull().any().any()
True
运算符
行索引名设为INSTNM,用UGDS_过滤出本科生的种族比例
college = pd.read_csv(‘data/college.csv‘, index_col=‘INSTNM‘)
college_ugds_ = college.filter(like=‘UGDS_‘)
college_ugds_
UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN | |
---|---|---|---|---|---|
INSTNM | |||||
Alabama A & M University | 0.0333 | 0.9353 | ... | 0.0059 | 0.0138 |
University of Alabama at Birmingham | 0.5922 | 0.2600 | ... | 0.0179 | 0.0100 |
... | ... | ... | ... | ... | ... |
Bay Area Medical Academy - San Jose Satellite Location | NaN | NaN | ... | NaN | NaN |
Excel Learning Center-San Antonio South | NaN | NaN | ... | NaN | NaN |
7535 rows × 9 columns
college_ugds_的数值类型都是float,可以进行整数运算
college_ugds_.dtypes
UGDS_WHITE float64
UGDS_BLACK float64
...
UGDS_NRA float64
UGDS_UNKN float64
Length: 9, dtype: object
加减乘除
college_ugds_.head() + .00501
UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN | |
---|---|---|---|---|---|
INSTNM | |||||
Alabama A & M University | 0.03831 | 0.94031 | ... | 0.01091 | 0.01881 |
University of Alabama at Birmingham | 0.59721 | 0.26501 | ... | 0.02291 | 0.01501 |
Amridge University | 0.30401 | 0.42421 | ... | 0.00501 | 0.27651 |
University of Alabama in Huntsville | 0.70381 | 0.13051 | ... | 0.03821 | 0.04001 |
Alabama State University | 0.02081 | 0.92581 | ... | 0.02931 | 0.01871 |
5 rows × 9 columns
计算样例数据的百分比
方式一
college_ugds_op_round = (college_ugds_ + .00501) // .01 / 100
college_ugds_op_round.head()
UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN | |
---|---|---|---|---|---|
INSTNM | |||||
Alabama A & M University | 0.03 | 0.94 | ... | 0.01 | 0.01 |
University of Alabama at Birmingham | 0.59 | 0.26 | ... | 0.02 | 0.01 |
Amridge University | 0.30 | 0.42 | ... | 0.00 | 0.27 |
University of Alabama in Huntsville | 0.70 | 0.13 | ... | 0.03 | 0.04 |
Alabama State University | 0.02 | 0.92 | ... | 0.02 | 0.01 |
5 rows × 9 columns
方式二
college_ugds_round = (college_ugds_ + .00001).round(2)
college_ugds_round.head()
UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN | |
---|---|---|---|---|---|
INSTNM | |||||
Alabama A & M University | 0.03 | 0.94 | ... | 0.01 | 0.01 |
University of Alabama at Birmingham | 0.59 | 0.26 | ... | 0.02 | 0.01 |
Amridge University | 0.30 | 0.42 | ... | 0.00 | 0.27 |
University of Alabama in Huntsville | 0.70 | 0.13 | ... | 0.03 | 0.04 |
Alabama State University | 0.02 | 0.92 | ... | 0.02 | 0.01 |
5 rows × 9 columns
方式三
college_ugds_op_round_methods = college_ugds_.add(.00501).floordiv(.01).div(100)
college_ugds_op_round_methods.head()
UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN | |
---|---|---|---|---|---|
INSTNM | |||||
Alabama A & M University | 0.03 | 0.94 | ... | 0.01 | 0.01 |
University of Alabama at Birmingham | 0.59 | 0.26 | ... | 0.02 | 0.01 |
Amridge University | 0.30 | 0.42 | ... | 0.00 | 0.27 |
University of Alabama in Huntsville | 0.70 | 0.13 | ... | 0.03 | 0.04 |
Alabama State University | 0.02 | 0.92 | ... | 0.02 | 0.01 |
5 rows × 9 columns
比较缺失值
Pandas使用NumPy NaN(np.nan)对象表示缺失值。这是一个不等于自身的特殊对象:
np.nan == np.nan
False
所有和np.nan的比较都返回False,除了不等于:
5 > np.nan
False
5 != np.nan
True
无法通过直接比较比较,含有缺失值的df是否一致
movie_equal = movie == movie
movie_equal.all().all()
False
movie_equal.size - movie_equal.sum().sum()
2654
movie.isnull().sum().sum()
2654
比较两个DataFrame最直接的方法是使用equals()方法
from pandas.testing import assert_frame_equal
assert_frame_equal(movie, movie)
以上是关于《Pandas CookBook》---- DataFrame基础操作的主要内容,如果未能解决你的问题,请参考以下文章
《Pandas CookBook》---- DataFrame基础操作
[Python Cookbook] Pandas: Indexing of DataFrame
《Pandas Cookbook》第02章 DataFrame基本操作