10 Minutes to pandas (Annotated Official Documentation, Part 2)
Posted angelxp
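This article continues from annotated part 1. The previous part focused on how to create pandas objects; this part focuses on how to inspect the objects that have been created.
【Viewing Data】
- See the top & bottom rows of the frame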
>>> import numpy as np
>>> import pandas as pd
>>> long_series = pd.Series(np.random.randn(1000))
>>> long_series
0     0.526507
1    -0.085210
2     1.292113
3    -1.948114
4    -1.386582
5    -2.596821
6     0.268965
7    -0.635905
8    -1.839953
9    -1.240820
10    0.122215
.......
The above is the complete Series; as you can see, it holds 1000 values. To look at only the head and the tail, use the head() and tail() methods. Both return 5 entries by default, and you can pass the number of entries you want, as follows:
>>> long_series.head()
0    0.526507
1   -0.085210
2    1.292113
3   -1.948114
4   -1.386582
dtype: float64
>>> long_series.tail(6)   # lst: take the last 6 values
994   -1.300574
995    0.659815
996   -0.340045
997    0.685664
998   -0.972145
999    0.410191
dtype: float64
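As a small self-contained check of the defaults described above, the following sketch builds a 1000-element Series and confirms which index labels head() and tail(6) return. The seeded generator is an assumption added here only to make the run reproducible; it is not part of the original session.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator, used only so the example is reproducible.
rng = np.random.default_rng(0)
long_series = pd.Series(rng.standard_normal(1000))

first_five = long_series.head()    # n defaults to 5
last_six = long_series.tail(6)     # explicit n

print(first_five.index.tolist())   # [0, 1, 2, 3, 4]
print(last_six.index.tolist())     # [994, 995, 996, 997, 998, 999]
```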
- Display the index, the columns, and the underlying numpy data
pandas makes these easy to get: they are simply exposed as attributes, as follows:
>>> dates = pd.date_range('20170101', periods=6)   # created in part 1; repeated here so the example runs
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356 -1.798571
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-05 -0.709366  1.378030  1.874955 -1.017548
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
>>> df.index     # get the row index
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06'],
              dtype='datetime64[ns]', freq='D')
>>> df.columns   # get the column index
Index([u'A', u'B', u'C', u'D'], dtype='object')
>>> df.values    # get the values
array([[ 0.90624543,  1.81592368,  0.12335647, -1.79857091],
       [-0.45964616,  0.52009988,  0.51113763,  0.1839755 ],
       [ 0.46332631, -0.97048662, -1.12078016, -0.61448135],
       [ 1.50546445, -1.74331294,  1.02090281, -1.04904748],
       [-0.70936561,  1.37802983,  1.87495471, -1.01754786],
       [ 1.11355431, -0.95196258, -1.2668023 , -0.58657136]])
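The three attributes can be exercised in a standalone script such as the sketch below. It also uses df.to_numpy(), which newer pandas releases offer alongside df.values; that call is an addition here and does not appear in the original session.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

row_labels = df.index      # DatetimeIndex holding the six dates
col_labels = df.columns    # Index holding 'A'..'D'
data = df.to_numpy()       # plain ndarray; same contents as df.values here

print(type(row_labels).__name__, list(col_labels), data.shape)
```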
- Quick summary statistics of the data
>>> df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.469930  0.008049  0.190462 -0.813707
std    0.886775  1.439019  1.222903  0.656284
min   -0.709366 -1.743313 -1.266802 -1.798571
25%   -0.228903 -0.965856 -0.809746 -1.041173
50%    0.684786 -0.215931  0.317247 -0.816015
75%    1.061727  1.163547  0.893462 -0.593549
max    1.505464  1.815924  1.874955  0.183975
Note that the statistics above are computed separately for each column.
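To make the "per column" point concrete, this sketch compares the mean row of describe() with the column-wise df.mean(). The random data is illustrative and differs from the values printed above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

desc = df.describe()

# Each column of the summary describes the matching column of df.
print(desc.index.tolist())
print(np.allclose(desc.loc["mean"], df.mean()))   # column means agree
```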
- Transposing the data (swapping rows and columns)
>>> df.T
   2017-01-01  2017-01-02  2017-01-03  2017-01-04  2017-01-05  2017-01-06
A    0.906245   -0.459646    0.463326    1.505464   -0.709366    1.113554
B    1.815924    0.520100   -0.970487   -1.743313    1.378030   -0.951963
C    0.123356    0.511138   -1.120780    1.020903    1.874955   -1.266802
D   -1.798571    0.183975   -0.614481   -1.049047   -1.017548   -0.586571
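A minimal sketch of what the transpose does to the shape and labels, using illustrative random data rather than the values shown above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

transposed = df.T

# Rows become columns and columns become rows.
print(df.shape, transposed.shape)     # (6, 4) (4, 6)
print(transposed.index.tolist())      # ['A', 'B', 'C', 'D']
```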
- Sorting by an axis (here axis=1 sorts the column labels, in descending order)
>>> df.sort_index(axis=1,ascending=False)
D C B A
2017-01-01 -1.798571 0.123356 1.815924 0.906245
2017-01-02 0.183975 0.511138 0.520100 -0.459646
2017-01-03 -0.614481 -1.120780 -0.970487 0.463326
2017-01-04 -1.049047 1.020903 -1.743313 1.505464
2017-01-05 -1.017548 1.874955 1.378030 -0.709366
2017-01-06 -0.586571 -1.266802 -0.951963 1.113554
- Sorting by values (lst: older versions used sort(columns=xxx); that method was deprecated and has since been removed, and the official documentation now uses sort_values)
>>> df.sort_values(by='B')
                   A         B         C         D
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-05 -0.709366  1.378030  1.874955 -1.017548
2017-01-01  0.906245  1.815924  0.123356 -1.798571
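The two kinds of sorting can be compared side by side on a tiny frame. The data and the labels x, y, z below are made up for illustration. Both methods return a new object and leave the original untouched.

```python
import pandas as pd

# Hypothetical toy data, chosen so the sorted order is easy to verify by eye.
df = pd.DataFrame({"A": [3, 1, 2], "B": [0.5, -1.0, 2.0]}, index=list("xyz"))

by_cols = df.sort_index(axis=1, ascending=False)   # reorder column labels: B, A
by_b = df.sort_values(by="B")                      # reorder rows by the values in B

print(list(by_cols.columns))   # ['B', 'A']
print(list(by_b.index))        # ['y', 'x', 'z']
print(list(df.index))          # ['x', 'y', 'z'] (original unchanged)
```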
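【Selecting Data】
Note: while standard Python/NumPy expressions for selecting and setting are intuitive and work fine, we recommend the optimized pandas data access methods: .at, .iat, .loc and .iloc. (The original documentation also lists .ix, which was later deprecated and removed in pandas 1.0.) For details see: Indexing and Selecting Data and MultiIndex / Advanced Indexing
- Getting
Selecting a single column
>>> df['A']
2017-01-01    0.906245
2017-01-02   -0.459646
2017-01-03    0.463326
2017-01-04    1.505464
2017-01-05   -0.709366
2017-01-06    1.113554
Freq: D, Name: A, dtype: float64
Selecting rows by slicing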
>>> df[1:3]
A B C D
2017-01-02 -0.459646 0.520100 0.511138 0.183975
2017-01-03 0.463326 -0.970487 -1.120780 -0.614481
Selecting by label (get the data for 2017-01-01)
>>> df.loc[dates[0]]
A    0.906245
B    1.815924
C    0.123356
D   -1.798571
Name: 2017-01-01 00:00:00, dtype: float64
Selecting on multiple axes by label
>>> df.loc[:, ['A', 'C']]
                   A         C
2017-01-01  0.906245  0.123356
2017-01-02 -0.459646  0.511138
2017-01-03  0.463326 -1.120780
2017-01-04  1.505464  1.020903
2017-01-05 -0.709366  1.874955
2017-01-06  1.113554 -1.266802
Label slicing (both endpoints are included)
>>> df.loc['20170101':'20170103', ['A', 'B']]
                   A         B
2017-01-01  0.906245  1.815924
2017-01-02 -0.459646  0.520100
2017-01-03  0.463326 -0.970487
- Reduction in the dimensions of the returned object
>>> df.loc['20170103', ['A', 'B']]
A    0.463326
B   -0.970487
Name: 2017-01-03 00:00:00, dtype: float64
Getting a scalar value
>>> df.loc[dates[0], 'A']
0.90624542800545049
Fast access to a scalar (same result as above; .at accepts only a single row label and a single column label, so it skips the general-purpose handling in .loc and is faster for one value)
>>> df.at[dates[0], 'A']
0.90624542800545049
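Most of the selections above use loc; the ones below mainly use iloc. The difference is that loc selects by label (row labels such as dates[0] or '20170103', and column labels such as 'A'), whereas iloc selects by integer position (row and column numbers counted from 0).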
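The contrast between the two accessors shows up most clearly in slicing: a label slice with loc includes both endpoints, while a position slice with iloc excludes the end. The sketch below uses np.arange data (an assumption for illustration) so each cell value is predictable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)
# Illustrative data: cell (row i, column j) holds 4*i + j.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = df.loc["20170102":"20170104", ["A", "B"]]   # labels, end included
by_pos = df.iloc[1:3, 0:2]                             # positions, end excluded

print(by_label.shape)   # (3, 2)
print(by_pos.shape)     # (2, 2)
```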
- Selection by position
Selecting by an integer position
>>> df
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356 -1.798571
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-05 -0.709366  1.378030  1.874955 -1.017548
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
>>> df.iloc[3]
A    1.505464
B   -1.743313
C    1.020903
D   -1.049047
Name: 2017-01-04 00:00:00, dtype: float64
Slicing by integer position
>>> df.iloc[3:5, 0:2]   # note: slices are half-open (start included, end excluded)
A B
2017-01-04 1.505464 -1.743313
2017-01-05 -0.709366 1.378030
Selecting by lists of integer positions
>>> df.iloc[[1,2,4],[0,2]]
A C
2017-01-02 -0.459646 0.511138
2017-01-03 0.463326 -1.120780
2017-01-05 -0.709366 1.874955
Slicing rows or columns explicitly
>>> df.iloc[1:3, :]
                   A         B         C         D
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
>>> df.iloc[:, 1:3]
                   B         C
2017-01-01  1.815924  0.123356
2017-01-02  0.520100  0.511138
2017-01-03 -0.970487 -1.120780
2017-01-04 -1.743313  1.020903
2017-01-05  1.378030  1.874955
2017-01-06 -0.951963 -1.266802
Getting a specific value
>>> df.iloc[1, 1]
0.52009988180243594
>>> df.iat[1, 1]
0.52009988180243594
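The four scalar accessors seen so far all reach the same cell; .at and .iat are simply the single-value forms of .loc and .iloc. A minimal sketch, again assuming np.arange data so the expected value is known:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)
# Illustrative data: cell (row i, column j) holds 4*i + j.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

v_loc = df.loc[dates[1], "B"]    # label-based, general purpose
v_at = df.at[dates[1], "B"]      # label-based, scalar only
v_iloc = df.iloc[1, 1]           # position-based, general purpose
v_iat = df.iat[1, 1]             # position-based, scalar only

print(v_loc, v_at, v_iloc, v_iat)   # 5 5 5 5
```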
- Boolean indexing (select data using the result of a condition)
Using a single column's values to select data
>>> df
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356 -1.798571
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-05 -0.709366  1.378030  1.874955 -1.017548
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
>>> df[df.A > 0]
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356 -1.798571
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
Selecting values from a DataFrame where a boolean condition is met (cells that fail the condition become NaN).
>>> df[df > 0]
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356       NaN
2017-01-02       NaN  0.520100  0.511138  0.183975
2017-01-03  0.463326       NaN       NaN       NaN
2017-01-04  1.505464       NaN  1.020903       NaN
2017-01-05       NaN  1.378030  1.874955       NaN
2017-01-06  1.113554       NaN       NaN       NaN
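The two forms of boolean indexing behave differently: a boolean Series filters whole rows, while a boolean DataFrame keeps the shape and blanks out individual cells. The sketch below uses a hand-written toy frame (not the data above) and also compares the second form with DataFrame.where, which gives the same result here.

```python
import pandas as pd

# Hypothetical toy data with a known number of non-positive cells.
df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-1.0, 5.0, -6.0]})

rows = df[df["A"] > 0]    # boolean Series: keeps rows where A is positive
cells = df[df > 0]        # boolean DataFrame: same shape, NaN where False

print(rows.index.tolist())            # [0, 2]
print(int(cells.isna().sum().sum()))  # 3
print(cells.equals(df.where(df > 0)))
```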
Filtering data with isin()
>>> df2 = df.copy()
>>> df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
>>> df2
                   A         B         C         D      E
2017-01-01  0.906245  1.815924  0.123356 -1.798571    one
2017-01-02 -0.459646  0.520100  0.511138  0.183975    one
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481    two
2017-01-04  1.505464 -1.743313  1.020903 -1.049047  three
2017-01-05 -0.709366  1.378030  1.874955 -1.017548   four
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571  three
>>> df2[df2['E'].isin(['two', 'four'])]
                   A         B         C         D     E
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481   two
2017-01-05 -0.709366  1.378030  1.874955 -1.017548  four
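lst: the official example here is a little involved. On its own, Series.isin returns a boolean Series that says, element by element, whether the value is among the values passed in (DataFrame.isin behaves similarly and returns a boolean DataFrame). For example:
>>> df3 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
>>> df3
   A  B
0  1  a
1  2  b
2  3  c
>>> df3.isin([1, 3])
       A      B
0   True  False
1  False  False
2   True  False
In the official example, however, the result is a DataFrame of data rather than of booleans. The reason is that the boolean Series produced by isin (True where E is 'two' or 'four') is passed back into df2[...] as a mask, so only the rows where the mask is True are returned.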
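Splitting the official one-liner into two steps makes the mechanism visible: isin builds the boolean mask first, and indexing with that mask does the filtering. The frame below is a made-up stand-in for df2.

```python
import pandas as pd

# Hypothetical stand-in for df2, small enough to check by eye.
df2 = pd.DataFrame({"A": [1, 2, 3, 4], "E": ["one", "two", "three", "four"]})

mask = df2["E"].isin(["two", "four"])   # step 1: boolean Series
selected = df2[mask]                    # step 2: keep rows where mask is True

print(mask.tolist())            # [False, True, False, True]
print(selected["E"].tolist())   # ['two', 'four']
```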
- Setting
Adding a new column; the Series is aligned to the frame by index
>>> s3 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20170101', periods=6))
>>> s3
2017-01-01    1
2017-01-02    2
2017-01-03    3
2017-01-04    4
2017-01-05    5
2017-01-06    6
Freq: D, dtype: int64
>>> df['F'] = s3
>>> df
                   A         B         C         D   E  F
2017-01-01  0.906245  1.815924  0.123356 -1.798571 NaN  1
2017-01-02 -0.459646  0.520100  0.511138  0.183975 NaN  2
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481 NaN  3
2017-01-04  1.505464 -1.743313  1.020903 -1.049047 NaN  4
2017-01-05 -0.709366  1.378030  1.874955 -1.017548 NaN  5
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571 NaN  6
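Assigning a Series as a new column matches values by index label, not by order. The sketch below deliberately uses a Series that covers only some of the dates (an assumption for illustration, unlike s3 above) so the alignment is visible: rows without a matching label get NaN.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)
df = pd.DataFrame({"A": np.arange(6.0)}, index=dates)

# Hypothetical partial Series: values only for the 1st, 3rd and 5th dates.
s = pd.Series([10, 20, 30], index=dates[[0, 2, 4]])
df["F"] = s   # aligned on the index; missing labels become NaN

print(df["F"].tolist())
```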
Setting a value by label
>>> df.at[dates[0], 'A'] = 1.5
>>> df.at[dates[0], 'A']
1.5
Setting a value by position
>>> df.iat[0, 1] = 2.5
>>> df.iat[0, 1]
2.5
>>> df
                   A         B         C         D   E  F
2017-01-01  1.500000  2.500000  0.123356 -1.798571 NaN  1
2017-01-02 -0.459646  0.520100  0.511138  0.183975 NaN  2
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481 NaN  3
2017-01-04  1.505464 -1.743313  1.020903 -1.049047 NaN  4
2017-01-05 -0.709366  1.378030  1.874955 -1.017548 NaN  5
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571 NaN  6
Setting a column with a numpy array
>>> df.loc[:, 'E'] = np.array([5] * len(df))
>>> df
                   A         B         C         D  E  F
2017-01-01  1.500000  2.500000  0.123356 -1.798571  5  1
2017-01-02 -0.459646  0.520100  0.511138  0.183975  5  2
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481  5  3
2017-01-04  1.505464 -1.743313  1.020903 -1.049047  5  4
2017-01-05 -0.709366  1.378030  1.874955 -1.017548  5  5
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571  5  6
Setting values with a where condition
>>> df4 = df.copy()
>>> df4[df4 < 0] = 3.6
>>> df4
                   A        B         C         D  E  F
2017-01-01  1.500000  2.50000  0.123356  3.600000  5  1
2017-01-02  3.600000  0.52010  0.511138  0.183975  5  2
2017-01-03  0.463326  3.60000  3.600000  3.600000  5  3
2017-01-04  1.505464  3.60000  1.020903  3.600000  5  4
2017-01-05  3.600000  1.37803  1.874955  3.600000  5  5
2017-01-06  1.113554  3.60000  3.600000  3.600000  5  6
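The conditional assignment can be checked on a toy frame where the result is easy to predict. The sketch also shows DataFrame.mask, which is not used in the original text but produces the same values without modifying anything in place.

```python
import pandas as pd

# Hypothetical toy data with two negative cells.
df = pd.DataFrame({"A": [1.0, -2.0], "B": [-0.5, 4.0]})

df4 = df.copy()
df4[df4 < 0] = 3.6               # in-place: overwrite every negative cell

masked = df.mask(df < 0, 3.6)    # same result, returned as a new frame

print(df4.to_numpy().tolist())   # [[1.0, 3.6], [3.6, 4.0]]
print(df.at[1, "A"])             # -2.0 (the original frame is unchanged)
```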