Python Data Analysis
Posted by 雨宙
Python Data Analysis (Part 5)
Day 9 check-in!!!
The pandas Library (Part 5)
Data Wrangling
Hierarchical Indexing
- Creating a hierarchical index
import numpy as np
import pandas as pd
# Two-level index: outer labels a-d, inner labels 1-3
data = pd.Series(np.random.randn(9),
                 index=[['a','a','a','b','b','c','c','d','d'],
                        [1,2,3,1,3,1,2,2,3]])
- Selecting on the outer and inner levels of a hierarchical index
# Outer-level selection
data['a']
data['b':'c']
data.loc[['b','d']]
# Inner-level selection
data.loc[:,2]
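Because the Series above holds random numbers, it is hard to see which rows each selection returns. Here is a minimal sketch of the same selections, assuming the values are replaced with 0 to 8 (np.arange) so the result can be checked by eye; the names outer_a and inner_2 are only illustrative.

```python
import numpy as np
import pandas as pd

# Same two-level index as above, but with predictable values 0..8 (an assumption
# made for illustration; the original uses random numbers)
data = pd.Series(np.arange(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

outer_a = data['a']        # outer-level selection: inner labels 1, 2, 3 remain
inner_2 = data.loc[:, 2]   # inner-level selection: outer labels a, c, d remain

print(outer_a.tolist())        # [0, 1, 2]
print(inner_2.index.tolist())  # ['a', 'c', 'd']
print(inner_2.tolist())        # [1, 6, 7]
```

Note that 'b' is missing from the inner-level result because 'b' has no inner label 2.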
- df.set_index() builds a single or composite (multi-level) index from existing columns; df.reset_index() turns the index back into columns
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one','one','one','two','two','two','two'],
                      'd': [0,1,2,0,1,2,3]})
frame2 = frame.set_index(['c','d'])
frame2.reset_index()
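The round trip can be checked with a short, self-contained sketch. It rebuilds the same frame and inspects what set_index and reset_index do to the columns; the variable name restored is introduced here only for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

frame2 = frame.set_index(['c', 'd'])   # 'c' and 'd' become the two index levels
restored = frame2.reset_index()        # the index levels become columns again

print(list(frame2.columns))            # ['a', 'b']
print(list(restored.columns))          # ['c', 'd', 'a', 'b']
print(frame2.loc[('two', 3), 'a'])     # 6
```

After reset_index the former index levels come back as the leading columns, so the column order differs from the original frame even though the data is the same.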
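Joining Data
- pd.merge joins the rows of different DataFrames on one or more keys, much like a database join
- pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)
left: the DataFrame on the left side of the merge
right: the DataFrame on the right side of the merge
how: the join type, one of 'inner' (the default), 'outer', 'left', or 'right'
on: the column name(s) to join on, which must exist in both DataFrames; if omitted, the column names common to left and right are used as the join keys
left_on: the column(s) in the left DataFrame to use as join keys
right_on: the column(s) in the right DataFrame to use as join keys
- Inner join (inner): uses the intersection of the keys in both tables
Outer join (outer): uses the union of the keys in both tables
Left join (left): uses all keys from the left table
Right join (right): uses all keys from the right table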
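The left_on and right_on parameters are used when the key columns have different names in the two tables. The sketch below assumes two hypothetical tables, employees and salaries, invented purely for illustration.

```python
import pandas as pd

# Hypothetical tables whose key columns have different names
employees = pd.DataFrame({'emp_id': [1, 2, 3],
                          'name': ['Ann', 'Bob', 'Cid']})
salaries = pd.DataFrame({'staff_id': [2, 3, 4],
                         'salary': [500, 600, 700]})

# Inner join (the default): only ids present in both tables survive
merged = pd.merge(employees, salaries, left_on='emp_id', right_on='staff_id')

print(list(merged.columns))      # ['emp_id', 'name', 'staff_id', 'salary']
print(merged['name'].tolist())   # ['Bob', 'Cid']
```

Both key columns are kept in the result, since pandas cannot tell that they mean the same thing.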
import pandas as pd
import numpy as np
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left,right,on='key') # join on the key column
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left,right,on=['key1','key2']) # join on several keys (default inner join)
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
# Left join
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, how='left', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
# Right join
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, how='right', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
# Outer join
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left,right,how='outer',on=['key1','key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
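To compare the four join types side by side, the following sketch runs each of them on the same left and right tables and counts the resulting rows. The dictionary name sizes is just an illustrative choice.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# Row count of the merged result for each join type
sizes = {how: len(pd.merge(left, right, how=how, on=['key1', 'key2']))
         for how in ['inner', 'left', 'right', 'outer']}

print(sizes)   # {'inner': 3, 'left': 5, 'right': 4, 'outer': 6}
```

The key pair (K1, K0) appears once on the left and twice on the right, so it contributes two rows to every join type; the remaining rows are the unmatched keys that each join type chooses to keep.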
- Handling duplicate column names: the suffixes parameter, which defaults to _x and _y
# Handling duplicate column names
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data' : np.random.randint(0,10,3)})
print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))
data_left key data_right
0 9 b 1
1 5 b 1
2 1 b 1
3 2 a 8
4 2 a 8
5 5 a 8
# Default behavior when suffixes is not specified
key data_x data_y
0 b 4 8
1 b 1 8
2 b 3 8
3 a 0 0
4 a 2 0
5 a 0 0
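The two outputs above come from different random draws, so the numbers do not line up. A small sketch with fixed values (assumed here for illustration) makes the effect of suffixes easier to compare.

```python
import pandas as pd

# Fixed values instead of random ones, so both results are reproducible
df1 = pd.DataFrame({'key': ['b', 'a', 'c'], 'data': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data': [10, 20, 30]})

default = pd.merge(df1, df2, on='key')
custom = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))

print(list(default.columns))   # ['key', 'data_x', 'data_y']
print(list(custom.columns))    # ['key', 'data_left', 'data_right']
```

Only the non-key column that exists in both tables gets a suffix; the join key itself is kept once under its original name.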
- Joining on the index: pass left_index=True or right_index=True
# Joining on the index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame('data2' : np.random.randint(0,10,3), index=['a', 'b', 'd'])
print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))
data1 key data2
0 3 b 6
1 4 b 6
6 8 b 6
2 6 a 0
4 3 a 0
5 0 a 0
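The example above mixes a column key on the left with the index on the right. As a complementary sketch, both sides can also be joined on their indexes; the small frames below are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'data1': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'data2': [10, 20]}, index=['b', 'd'])

# Use the row index of both tables as the join key
inner = pd.merge(left, right, left_index=True, right_index=True)
outer = pd.merge(left, right, left_index=True, right_index=True, how='outer')

print(inner.index.tolist())            # ['b']
print(sorted(outer.index.tolist()))    # ['a', 'b', 'c', 'd']
print(int(outer['data2'].isna().sum()))  # 2
```

In the outer result, rows 'a' and 'c' have no match on the right, so their data2 values are NaN.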
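Combining Data
- The join method combines DataFrames on their indexes and can achieve the same result as merge; note that join requires the two DataFrames to have no overlapping column names (unless lsuffix/rsuffix are supplied)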
left2 = pd.DataFrame([[1.,2.],[3.,4.],[5.,6.]],
index=['a','c','e'],
columns=['语文','数学'])
right2 = pd.DataFrame([[7.,8.],[9.,10.],[11.,12.],[13,14]],
index=['b','c','d','e'],
columns=['英语','综合'])
# pd.merge(left2,right2,how='outer',left_index=True,right_index=True)
left2.join(right2,how='outer')
语文 数学 英语 综合
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
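The restriction on overlapping columns can be seen in a short sketch. The frames a and b below are hypothetical and share the column name score; the flag overlap_error is introduced only to record what happened.

```python
import pandas as pd

a = pd.DataFrame({'score': [1, 2]}, index=['x', 'y'])
b = pd.DataFrame({'score': [3, 4]}, index=['x', 'y'])

try:
    a.join(b)                 # both frames have a 'score' column, no suffixes given
    overlap_error = False
except ValueError:
    overlap_error = True

# Supplying suffixes resolves the name clash
joined = a.join(b, lsuffix='_a', rsuffix='_b')

print(overlap_error)          # True
print(list(joined.columns))   # ['score_a', 'score_b']
```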
- The concat function combines multiple objects along an axis
(1) Concatenation in NumPy: np.concatenate
import numpy as np
import pandas as pd
arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))
print(arr1)
print(arr2)
print(np.concatenate([arr1, arr2]))
print(np.concatenate([arr1, arr2], axis=1))
# print(arr1)
[[3 3 0 8]
[2 0 3 1]
[4 8 8 2]]
# print(arr2)
[[6 8 7 3]
[1 6 8 7]
[1 4 7 1]]
# print(np.concatenate([arr1, arr2]))
[[3 3 0 8]
[2 0 3 1]
[4 8 8 2]
[6 8 7 3]
[1 6 8 7]
[1 4 7 1]]
# print(np.concatenate([arr1, arr2], axis=1))
[[3 3 0 8 6 8 7 3]
[2 0 3 1 1 6 8 7]
[4 8 8 2 1 4 7 1]]
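The effect of the axis argument is easiest to see from the resulting shapes. This sketch assumes fixed values from np.arange in place of the random integers used above.

```python
import numpy as np

arr1 = np.arange(12).reshape(3, 4)        # values 0..11
arr2 = np.arange(12, 24).reshape(3, 4)    # values 12..23

rows = np.concatenate([arr1, arr2])           # default axis=0: more rows
cols = np.concatenate([arr1, arr2], axis=1)   # axis=1: more columns

print(rows.shape)          # (6, 4)
print(cols.shape)          # (3, 8)
print(cols[0].tolist())    # [0, 1, 2, 3, 12, 13, 14, 15]
```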
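(2) pd.concat: mind the axis, which defaults to axis=0; join sets how the other axis is combined and defaults to 'outer'; when concatenating Series, check whether the resulting row index contains duplicates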
df1 = pd.DataFrame(np.arange(6).reshape(3,2),index=list('abc'),columns=['one','two'])
df2 = pd.DataFrame(np.arange(4).reshape(2,2)+5,index=list('ac'),columns=['three','four'])
pd.concat([df1,df2]) # outer join by default, axis=0
four one three two
a NaN 0.0 NaN 1.0
b NaN 2.0 NaN 3.0
c NaN 4.0 NaN 5.0
a 6.0 NaN 5.0 NaN
c 8.0 NaN 7.0 NaN
pd.concat([df1,df2],axis='columns') # concatenate along axis=1
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
# The join type can likewise be set to inner
pd.concat([df1,df2],axis=1,join='inner')
one two three four
a 0 1 5 6
c 4 5 7 8
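The remark about checking Series for duplicate row labels can be sketched as follows. The two Series are made up for illustration; verify_integrity and ignore_index are existing pd.concat parameters, shown here as two possible ways to deal with the duplicates.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['b', 'c'])

combined = pd.concat([s1, s2])         # label 'b' now appears twice
print(combined.index.is_unique)        # False

try:
    pd.concat([s1, s2], verify_integrity=True)   # refuse duplicate labels
    duplicate_error = False
except ValueError:
    duplicate_error = True
print(duplicate_error)                 # True

renumbered = pd.concat([s1, s2], ignore_index=True)   # discard the old labels
print(renumbered.index.tolist())       # [0, 1, 2, 3]
```

Which option fits depends on whether the labels carry meaning: verify_integrity keeps them and fails loudly, while ignore_index throws them away.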
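Reshaping a Hierarchical Index
- The stack method moves the column labels into the row index, producing a hierarchical index; in other words, it turns a DataFrame into a Series. Note that stack drops missing values by default; pass dropna=False to keep them (newer pandas releases are changing this behavior, so check the documentation for your version)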
import numpy as np
import pandas as pd
df_obj = pd.DataFrame(np.random.randint(0,10, (5,2)), columns=['data1', 'data2'])
print(df_obj)
stacked = df_obj.stack()
print(stacked)
# print(df_obj)
data1 data2
0 7 9
1 7 8
2 8 9
3 4 1
4 1 2
# print(stacked)
0  data1    7
   data2    9
1  data1    7
   data2    8
2  data1    8
   data2    9
3  data1    4
   data2    1
4  data1    1
   data2    2
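A stacked Series can be turned back into a DataFrame with unstack. The sketch below uses a small frame with fixed values (the first three rows of the output above) and no missing data, so it does not depend on how a given pandas version treats NaN in stack.

```python
import pandas as pd

df = pd.DataFrame({'data1': [7, 7, 8], 'data2': [9, 8, 9]})

stacked = df.stack()           # Series indexed by (row label, column label)
restored = stacked.unstack()   # inner index level becomes the columns again

print(len(stacked))                  # 6
print(stacked.loc[(0, 'data2')])     # 9
print(restored.values.tolist())      # [[7, 9], [7, 8], [8, 9]]
```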