reindex: reindexing
pandas objects have an important method, reindex, which creates a new object conformed to a new index.
Take a Series as an example:
>>> import numpy as np
>>> from pandas import Series, DataFrame
>>> series_obj = Series([4.5, 1.3, 5, -5.5], index=('a', 'b', 'c', 'd'))
>>> series_obj
a    4.5
b    1.3
c    5.0
d   -5.5
dtype: float64
>>> obj2 = series_obj.reindex(['a', 'b', 'c', 'e', 'f'])
>>> obj2
a    4.5
b    1.3
c    5.0
e    NaN
f    NaN
dtype: float64
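When reindexing, the fill_value parameter replaces the NaN that would otherwise appear for labels missing from the original index.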
>>> obj3 = series_obj.reindex(['a', 'b', 'c', 'e', 'f'], fill_value=0)
>>> obj3
a    4.5
b    1.3
c    5.0
e    0.0
f    0.0
dtype: float64
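The two calls above can be compared side by side. This is a minimal sketch; the variable names with_nan and with_zero are illustrative only, and it uses the pd alias rather than importing Series directly.

```python
import pandas as pd

s = pd.Series([4.5, 1.3, 5, -5.5], index=["a", "b", "c", "d"])

# Labels missing from the original index get NaN by default.
with_nan = s.reindex(["a", "b", "c", "e", "f"])

# fill_value replaces the NaN that reindexing would otherwise introduce.
# A numeric fill value keeps the result numeric (float64 here).
with_zero = s.reindex(["a", "b", "c", "e", "f"], fill_value=0)

print(with_nan)
print(with_zero)

# reindex returns a new object; the original Series is unchanged.
print(s.index.tolist())
```

Passing a string such as '0' as fill_value would instead turn the whole result into an object-dtype Series, which is rarely what is wanted for numeric data.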
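For ordered data such as time series, reindexing may need to interpolate or fill values; the method parameter of reindex provides this.
The options for method are:
ffill or pad: fill (carry) values forward
bfill or backfill: fill (carry) values backward
Rows that have no preceding value (for ffill) or no following value (for bfill) are filled with NaN.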
>>> obj4 = Series(['red', 'blue', 'green'], index=[0, 2, 4])
>>> obj4
0      red
2     blue
4    green
dtype: object
>>> obj4.reindex(range(6), method='ffill')
0      red
1      red
2     blue
3     blue
4    green
5    green
dtype: object
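For comparison, here is a small sketch of the backward option on the same data. It also shows the case where no following value exists, so the last row stays missing. The names forward and backward are illustrative.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "green"], index=[0, 2, 4])

# ffill: each new label takes the value of the nearest earlier label.
forward = colors.reindex(range(6), method="ffill")

# bfill: each new label takes the value of the nearest later label.
# Label 5 has no later label, so it is left as a missing value.
backward = colors.reindex(range(6), method="bfill")

print(forward)
print(backward)
```

Both methods assume the original index is sorted (monotonic); pandas raises an error otherwise.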
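Reindexing a DataFrame
When only a sequence is passed, the rows are reindexed by default; the keyword arguments index and columns select the row index and the column index explicitly.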
>>> frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'b', 'c'], columns=['Ohio', 'Texas', 'Cali'])
>>> frame2 = frame.reindex(['a', 'b', 'c', 'd'])
>>> frame2
   Ohio  Texas  Cali
a   0.0    1.0   2.0
b   3.0    4.0   5.0
c   6.0    7.0   8.0
d   NaN    NaN   NaN
>>> frame3 = frame.reindex(columns=['Ohio', 'Texas', 'Cali', 'Wile'], index=['a', 'b', 'c', 'd'], fill_value=4)
>>> frame3
   Ohio  Texas  Cali  Wile
a     0      1     2     4
b     3      4     5     4
c     6      7     8     4
d     4      4     4     4
When both the rows and the columns of a DataFrame are reindexed, interpolation is applied along the rows only (axis 0).
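A minimal sketch of row-wise filling, assuming a frame whose row index is sorted (the method option needs a monotonic index). The index labels 'a', 'c', 'e' are chosen here only so that there are gaps to fill; the columns are reindexed in a separate call, so the fill is only ever requested for the row axis.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "e"],
    columns=["Ohio", "Texas", "Cali"],
)

# New rows 'b' and 'd' are copied from the nearest earlier row.
rows_filled = frame.reindex(["a", "b", "c", "d", "e"], method="ffill")

# Reindex the columns in a separate call, without a fill method:
# the new column 'Wile' is entirely NaN.
both = rows_filled.reindex(columns=["Ohio", "Texas", "Cali", "Wile"])

print(rows_filled)
print(both)
```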
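Label indexing with ix makes reindexing more concise. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the example below runs only on older versions.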
>>> frame5 = frame.ix[['a', 'b', 'c', 'd'], ['Ohio', 'Texas', 'Cali', 'Wile']]
>>> frame5
   Ohio  Texas  Cali  Wile
a   0.0    1.0   2.0   NaN
b   3.0    4.0   5.0   NaN
c   6.0    7.0   8.0   NaN
d   NaN    NaN   NaN   NaN
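On pandas 1.0 and later, where ix no longer exists and loc raises a KeyError for labels that are missing, the same result can be obtained by passing both index and columns to reindex. A sketch under that assumption:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "b", "c"],
    columns=["Ohio", "Texas", "Cali"],
)

# Reindex rows and columns in one call; labels that do not exist
# in the original frame ('d' and 'Wile') are filled with NaN.
frame5 = frame.reindex(
    index=["a", "b", "c", "d"],
    columns=["Ohio", "Texas", "Cali", "Wile"],
)

print(frame5)
```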
drop: discarding entries along a given axis
>>> obj = Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
>>> obj
a    0
b    1
c    2
d    3
e    4
dtype: int32
>>> new_obj = obj.drop('b')
>>> new_obj
a    0
c    2
d    3
e    4
dtype: int32
>>> new_obj2 = obj.drop(['b', 'c'])
>>> new_obj2
a    0
d    3
e    4
dtype: int32
# DataFrame
>>> frame = DataFrame(np.arange(16).reshape((4, 4)), index=['a', 'b', 'c', 'd'], columns=['one', 'two', 'three', 'four'])
>>> frame
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> new_frame = frame.drop('a')
>>> new_frame
   one  two  three  four
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> new_frame2 = frame.drop(['two', 'four'], axis=1)
>>> new_frame2
   one  three
a    0      2
b    4      6
c    8     10
d   12     14
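A short sketch of the same drop calls, showing that drop returns a new object and leaves the original untouched. It also uses the columns keyword, which is an alternative spelling of axis=1 in pandas 0.21 and later; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

# Drop rows by label (axis 0 is the default).
no_rows = frame.drop(["a", "c"])

# Drop columns: axis=1 and the columns keyword are equivalent.
no_cols = frame.drop(["two", "four"], axis=1)
same_cols = frame.drop(columns=["two", "four"])

print(no_rows)
print(no_cols)

# The original frame still has all four rows and columns.
print(frame.shape)
```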
Indexing, selection, and filtering
A Series can be indexed by position, as with a NumPy array, or by its own index labels.
>>> obj
a    0
b    1
c    2
d    3
e    4
dtype: int32
>>> obj['a']
0
>>> obj[1]
1
Note: slicing with labels differs from ordinary Python slicing in that the right endpoint is inclusive.
>>> obj['a':'c']
a    0
b    1
c    2
dtype: int32
>>> obj[3:4]
d    3
dtype: int32
>>> obj[2:3]
c    2
dtype: int32
>>> obj[[3, 1]]
d    3
b    1
dtype: int32
>>> obj[['a', 'c']]
a    0
c    2
dtype: int32
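The difference between the two kinds of slice can be made explicit with loc (labels) and iloc (positions). This sketch uses those accessors instead of plain brackets because they state the intent unambiguously; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "b", "c", "d", "e"])

# Label slice: both endpoints are included.
by_label = obj.loc["a":"c"]

# Position slice: the end position is excluded, as in plain Python.
by_position = obj.iloc[0:2]

# A list of positions selects in the order given.
picked = obj.iloc[[3, 1]]

print(by_label.index.tolist())     # ['a', 'b', 'c']
print(by_position.index.tolist())  # ['a', 'b']
print(picked.index.tolist())       # ['d', 'b']
```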
Modifying values through indexing
>>> obj[['b', 'd']] *= 2
>>> obj
a    0
b    2
c    2
d    6
e    4
dtype: int32
Indexing a DataFrame:
Plain indexing with a label or a list of labels retrieves columns only.
>>> frame
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> frame['a']
KeyError: 'a'
>>> frame['one']
a     0
b     4
c     8
d    12
Name: one, dtype: int32
>>> frame[['one', 'four']]
   one  four
a    0     3
b    4     7
c    8    11
d   12    15
Indexing with a slice or a Boolean array selects rows.
>>> frame[1:3]  # half-open interval: the end position is excluded
   one  two  three  four
b    4    5      6     7
c    8    9     10    11
>>> frame[frame['three'] > 8]
   one  two  three  four
c    8    9     10    11
d   12   13     14    15
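As a small extension of the Boolean example, conditions can be combined with & and |. The sketch below rebuilds the same frame; the particular thresholds are arbitrary and only serve the illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

# A slice inside [] selects rows by position; the end is excluded.
middle = frame[1:3]

# A Boolean Series inside [] keeps the rows where it is True.
# Each condition needs its own parentheses when combined with & or |.
selected = frame[(frame["three"] > 2) & (frame["one"] < 12)]

print(middle.index.tolist())
print(selected.index.tolist())
```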
The ix indexing field of DataFrame (available only before pandas 1.0)
>>> frame.ix['a']  # select a row by label
one      0
two      1
three    2
four     3
Name: a, dtype: int32
>>> frame.ix[['b', 'd']]
   one  two  three  four
b    4    5      6     7
d   12   13     14    15
>>> frame.ix[1]  # an integer also selects a row
one      4
two      5
three    6
four     7
Name: b, dtype: int32
>>> frame.ix[1:3]
   one  two  three  four
b    4    5      6     7
c    8    9     10    11
>>> frame.ix[1:2, [2, 3, 1]]
   three  four  two
b      6     7    5
>>> frame.ix[1:3, [2, 3, 1]]
   three  four  two
b      6     7    5
c     10    11    9
>>> frame.ix[['b', 'd'], ['one', 'three']]
   one  three
b    4      6
d   12     14
>>> frame.ix[['b', 'd'], [3, 1, 2]]
   four  two  three
b     7    5      6
d    15   13     14
>>> frame.ix[:, [2, 3, 1]]  # select all rows
   three  four  two
a      2     3    1
b      6     7    5
c     10    11    9
d     14    15   13
>>> frame.ix[frame.three >5,:3]
one two three
b 4 5 6
c 8 9 10
d 12 13 14
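Because ix was removed in pandas 1.0, the selections above can be rewritten with loc (labels and Boolean masks) and iloc (integer positions). The following is one possible translation, not the only one; where the original mixed labels and positions in a single call, the positions are first converted to labels through frame.columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

row_a = frame.loc["a"]                 # row by label
row_1 = frame.iloc[1]                  # row by position
sub_pos = frame.iloc[1:3, [2, 3, 1]]   # rows and columns by position
sub_lab = frame.loc[["b", "d"], ["one", "three"]]  # both by label

# Mixed case: row labels with column positions turned into labels.
mixed = frame.loc[["b", "d"], frame.columns[[3, 1, 2]]]

# Boolean row mask with the first three columns.
masked = frame.loc[frame["three"] > 5, frame.columns[:3]]

print(sub_pos)
print(mixed)
print(masked)
```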
Arithmetic and data alignment
>>> s1 = Series([1.3, 4.5, 6.6, 3.4], index=['a', 'b', 'c', 'd'])
>>> s2 = Series([1, 2, 3, 4, 5, 6, 7], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
>>> s1 + s2
a    2.3
b    6.5
c    9.6
d    7.4
e    NaN
f    NaN
g    NaN
dtype: float64
# Missing values are introduced at the labels that do not overlap.
# The same holds for DataFrame.
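A compact sketch of the alignment rule with two partially overlapping Series. The values are chosen for the illustration; the second call previews the fill_value option of the add method, which treats a label missing from one side as 0.

```python
import pandas as pd

s1 = pd.Series([1.5, 4.5, 6.5], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3, 4], index=["b", "c", "d", "e"])

# The result index is the union of both indexes; labels present
# on only one side become NaN.
plain = s1 + s2

# With fill_value=0, a label missing from one Series is treated as 0,
# so the value from the other Series passes through unchanged.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```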
Filling missing values in arithmetic methods
>>> df1 = DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'))
>>> df2 = DataFrame(np.arange(20).reshape((4, 5)), columns=list('abcde'))
>>> df1 + df2  # ordinary arithmetic produces missing values
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
# The arithmetic methods accept a fill value for the missing entries.
>>> df1.add(df2, fill_value=0)
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
The arithmetic methods are:
add: addition
sub: subtraction
div: division
mul: multiplication
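The other methods in the list take fill_value in the same way as add. A brief sketch with sub on the same two frames, under the same setup as the example above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20).reshape((4, 5)), columns=list("abcde"))

# The operator form leaves NaN wherever only one frame has a value.
plain = df1 - df2

# The method form substitutes 0 for the entries missing from one frame
# before subtracting, so column 'e' and row 3 become 0 - df2.
filled = df1.sub(df2, fill_value=0)

print(plain)
print(filled)
```

A position stays NaN even with fill_value when it is missing from both frames; that does not occur here because df2 covers every row and column of the result.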
Operations between a DataFrame and a Series
>>> frame
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> series = frame.ix[0]
>>> series
one      0
two      1
three    2
four     3
Name: a, dtype: int32
>>> frame - series
   one  two  three  four
a    0    0      0     0
b    4    4      4     4
c    8    8      8     8
d   12   12     12    12
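By default, an operation between the two matches the index of the Series against the columns of the DataFrame and then broadcasts down the rows.
If an index value is found in neither the DataFrame's columns nor the Series' index, the two objects are reindexed to form the union.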
>>> series2 = Series(range(3), index=['two', 'four', 'five'])
>>> frame + series2
   five  four  one  three   two
a   NaN   4.0  NaN    NaN   1.0
b   NaN   8.0  NaN    NaN   5.0
c   NaN  12.0  NaN    NaN   9.0
d   NaN  16.0  NaN    NaN  13.0
To match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods and pass axis=0:
>>> series3 = frame['two']
>>> frame.sub(series3, axis=0)
   one  two  three  four
a   -1    0      1     2
b   -1    0      1     2
c   -1    0      1     2
d   -1    0      1     2
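The two broadcasting directions can be contrasted by centering the frame on its column means and on its row means. This is an illustrative use case, not one taken from the text above; the names col_means, row_means, centered_cols, and centered_rows are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

col_means = frame.mean()         # indexed by column labels
row_means = frame.mean(axis=1)   # indexed by row labels

# Default: the Series index is matched to the columns and the
# subtraction is broadcast down the rows.
centered_cols = frame - col_means

# axis=0: the Series index is matched to the row index and the
# subtraction is broadcast across the columns.
centered_rows = frame.sub(row_means, axis=0)

print(centered_cols)
print(centered_rows)
```

Writing frame - row_means without axis=0 would try to match the row labels against the column labels, find no overlap, and return a frame of NaN over the union of both label sets.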