Pandas Basics, a Killer Tool for Python Data Analysis: A 20,000-Character Guide (All You Need to Learn the pandas Basics)
Posted by JoJo的数据分析历险记
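Python data analysis
- 🌸Homepage: JoJo的数据分析历险记
- 📝About me: a senior undergraduate in statistics, recommended for admission to graduate study in statistics at a top-3 university for the field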
This column covers applications of Python in data analysis.
Reference:
Python for Data Analysis
An earlier article covered NumPy for data processing; this one covers pandas. pandas is built on top of NumPy and makes data processing more convenient.
Import the required libraries
import numpy as np
import pandas as pd
💮1. Series objects
pandas has two main data objects: Series, which resembles a vector, and DataFrame, which is a tabular structure. Let's first see how to create a Series.
s = pd.Series([12,-4,7,9])
s
0 12
1 -4
2 7
3 9
dtype: int64
🏵️1.1 Basic Series operations
s[2]
7
s[2]=5
s
s['a'] = 4
s
0 12
1 -4
2 5
3 9
a 4
dtype: int64
arr = np.array([1,2,3,4])
s2 = pd.Series(arr)
s2
arr[1] = 9
s2
0 1
1 9
2 3
3 4
dtype: int32
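In the output above, changing arr[1] also changed s2, because a Series built from a NumPy array may share memory with that array (whether it does can depend on the pandas version and settings). A minimal sketch of decoupling the two, assuming an explicit copy is acceptable; the name independent is illustrative and not from the original:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])

# Copy the array explicitly so the Series owns its own data
independent = pd.Series(arr.copy())

arr[1] = 9  # mutate the original array afterwards

# The Series built from the copy still holds the original values
print(independent.tolist())
```

Copying costs extra memory, so it is mainly worthwhile when the source array will be modified later.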
s[s>8]
0 12
3 9
dtype: int64
serd = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green','green','yellow'])
serd
white 1
white 0
blue 2
green 1
green 2
yellow 3
dtype: int64
serd.unique()
array([1, 0, 2, 3], dtype=int64)
serd.value_counts()
2 2
1 2
3 1
0 1
dtype: int64
serd.isin([0,3])
white False
white True
blue False
green False
green False
yellow True
dtype: bool
serd[serd.isin([0,3])]
white 0
yellow 3
dtype: int64
s2 = pd.Series([-5,3,np.nan,14])
s2
0 -5.0
1 3.0
2 NaN
3 14.0
dtype: float64
s2.isnull()
s2.notnull()
0 True
1 True
2 False
3 True
dtype: bool
s2
0 -5.0
1 3.0
2 NaN
3 14.0
dtype: float64
mydict = {'red':2000,'blue':1000,'yellow':500,'orange':1000}
myseries = pd.Series(mydict)
myseries
red 2000
blue 1000
yellow 500
orange 1000
dtype: int64
When a label in the index has no matching key in the dict, its value is filled with NaN.
colors = ['red','blue','yellow','orange','green']
myseries = pd.Series(mydict, index = colors)
myseries
red 2000.0
blue 1000.0
yellow 500.0
orange 1000.0
green NaN
dtype: float64
In arithmetic, the result for a label is NaN whenever either operand is missing or NaN for that label.
mydict2 = {'red':400,'yellow':1000,'black':700}
myseries2 = pd.Series(mydict2)
myseries.fillna(0) + myseries2.fillna(0)
black NaN
blue NaN
green NaN
orange NaN
red 2400.0
yellow 1500.0
dtype: float64
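Filling NaN before the addition does not help in the example above, because the NaN values arise when pandas aligns labels that exist in only one of the two Series. A minimal sketch of one alternative, assuming a missing label should count as 0, uses the fill_value parameter of Series.add. The names stock_a and stock_b are illustrative and not from the original:

```python
import pandas as pd

# Illustrative data modelled on the dicts above
stock_a = pd.Series({'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000})
stock_b = pd.Series({'red': 400, 'yellow': 1000, 'black': 700})

# fill_value=0 substitutes 0 for a label that is missing from one operand
total = stock_a.add(stock_b, fill_value=0)
print(total)
```

A label that is missing (or NaN) in both operands would still come out as NaN.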
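🌹2. DataFrame objects
The DataFrame is the data format encountered most often in data analysis. It resembles a matrix made up of rows and columns, where each column usually represents a variable and each row an observation. Let's look at some basic uses of DataFrame.
Creating a DataFrame object
data = {'color':['blue','green','yellow','red','white'],
'object':['ball','pen','pencil','paper','mug'],
'price':[1.2,1.0,0.6,0.9,1.7]}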
frame = pd.DataFrame(data)
frame
| | color | object | price |
|---|---|---|---|
| 0 | blue | ball | 1.2 |
| 1 | green | pen | 1.0 |
| 2 | yellow | pencil | 0.6 |
| 3 | red | paper | 0.9 |
| 4 | white | mug | 1.7 |
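Since each column of a DataFrame holds one variable and each row one observation, it can help to check the shape and the per-column types right after construction. A small sketch using the same data, relying only on the standard shape and dtypes attributes:

```python
import pandas as pd

data = {'color': ['blue', 'green', 'yellow', 'red', 'white'],
        'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
        'price': [1.2, 1.0, 0.6, 0.9, 1.7]}
frame = pd.DataFrame(data)

print(frame.shape)   # (number of rows, number of columns)
print(frame.dtypes)  # one dtype per column
```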
frame2 = pd.DataFrame(data, columns=['object','price'])
frame2
| | object | price |
|---|---|---|
| 0 | ball | 1.2 |
| 1 | pen | 1.0 |
| 2 | pencil | 0.6 |
| 3 | paper | 0.9 |
| 4 | mug | 1.7 |
frame3 = pd.DataFrame(data,index=['one','two','three','four','five'])
frame3
| | color | object | price |
|---|---|---|---|
| one | blue | ball | 1.2 |
| two | green | pen | 1.0 |
| three | yellow | pencil | 0.6 |
| four | red | paper | 0.9 |
| five | white | mug | 1.7 |
frame.columns
Index(['color', 'object', 'price'], dtype='object')
frame.index
RangeIndex(start=0, stop=5, step=1)
frame.values
array([['blue', 'ball', 1.2],
['green', 'pen', 1.0],
['yellow', 'pencil', 0.6],
['red', 'paper', 0.9],
['white', 'mug', 1.7]], dtype=object)
frame['price']
0 1.2
1 1.0
2 0.6
3 0.9
4 1.7
Name: price, dtype: float64
frame.iloc[2]
color yellow
object pencil
price 0.6
Name: 2, dtype: object
frame.iloc[[2,4]]
| | color | object | price |
|---|---|---|---|
| 2 | yellow | pencil | 0.6 |
| 4 | white | mug | 1.7 |
Slicing a DataFrame with integer positions selects rows: frame[0:1] returns the first row and frame[1:2] returns the second.
frame[0:4]
| | color | object | price |
|---|---|---|---|
| 0 | blue | ball | 1.2 |
| 1 | green | pen | 1.0 |
| 2 | yellow | pencil | 0.6 |
| 3 | red | paper | 0.9 |
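Row slicing with plain brackets is positional and excludes the end position, which differs from label-based slicing. A short sketch comparing the three forms on a reduced version of the frame above (the variable names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'color': ['blue', 'green', 'yellow', 'red', 'white'],
                      'price': [1.2, 1.0, 0.6, 0.9, 1.7]})

first_row = frame[0:1]      # positions 0 up to, but not including, 1
same_row = frame.iloc[0:1]  # explicit positional slicing, same result
by_label = frame.loc[0:1]   # label slicing includes BOTH end labels

print(len(first_row), len(same_row), len(by_label))
```

Using iloc or loc explicitly makes it clear whether positions or labels are meant, which matters once the index is not the default 0, 1, 2, ... range.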
frame['object'][3]
'paper'
frame['new']=12
frame
| | color | object | price | new |
|---|---|---|---|---|
| 0 | blue | ball | 1.2 | 12 |
| 1 | green | pen | 1.0 | 12 |
| 2 | yellow | pencil | 0.6 | 12 |
| 3 | red | paper | 0.9 | 12 |
| 4 | white | mug | 1.7 | 12 |
frame['new']=[1,2,3,4,5]
frame
| | color | object | price | new |
|---|---|---|---|---|
| 0 | blue | ball | 1.2 | 1 |
| 1 | green | pen | 1.0 | 2 |
| 2 | yellow | pencil | 0.6 | 3 |
| 3 | red | paper | 0.9 | 4 |
| 4 | white | mug | 1.7 | 5 |
# use .loc for a single assignment instead of chained indexing
frame.loc[2, 'price'] = 3.3
frame
| | color | object | price | new |
|---|---|---|---|---|
| 0 | blue | ball | 1.2 | 1 |
| 1 | green | pen | 1.0 | 2 |
| 2 | yellow | pencil | 3.3 | 3 |
| 3 | red | paper | 0.9 | 4 |
| 4 | white | mug | 1.7 | 5 |
frame['new'] = 12
frame
del frame['new']
frame
| | color | object | price |
|---|---|---|---|
| 0 | blue | ball | 1.2 |
| 1 | green | pen | 1.0 |
| 2 | yellow | pencil | 3.3 |
| 3 | red | paper | 0.9 |
| 4 | white | mug | 1.7 |
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['red','white','blue','green'],
columns=['ball','pen','pencil','paper'])
frame3
frame3[frame3>12]
| | ball | pen | pencil | paper |
|---|---|---|---|---|
| red | NaN | NaN | NaN | NaN |
| white | NaN | NaN | NaN | NaN |
| blue | NaN | NaN | NaN | NaN |
| green | NaN | 13.0 | 14.0 | 15.0 |
nestdict = {'red':{2012:22, 2013:33},'white':{2011: 13,2012:22,2013:16},'blue':{2011:17,2012:27,2013:48}}
nestdict
{'red': {2012: 22, 2013: 33},
 'white': {2011: 13, 2012: 22, 2013: 16},
 'blue': {2011: 17, 2012: 27, 2013: 48}}
frame2 = pd.DataFrame(nestdict)
frame2
| | red | white | blue |
|---|---|---|---|
| 2011 | NaN | 13 | 17 |
| 2012 | 22.0 | 22 | 27 |
| 2013 | 33.0 | 16 | 48 |
Transposing the frame
frame2.T
| | 2011 | 2012 | 2013 |
|---|---|---|---|
| red | NaN | 22.0 | 33.0 |
| white | 13.0 | 22.0 | 16.0 |
| blue | 17.0 | 27.0 | 48.0 |
ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])
ser.index
Index(['red', 'blue', 'yellow', 'white', 'green'], dtype='object')
ser.idxmax()
'white'
ser.idxmin()
'blue'
serd = pd.Series(range(6), index=['white','white','blue','green','green','yellow'])
serd
white 0
white 1
blue 2
green 3
green 4
yellow 5
dtype: int64
serd['white']
white 0
white 1
dtype: int64
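As the output above shows, selecting a duplicated label returns a Series rather than a single value. A small sketch of checking for duplicates up front, using the Index.is_unique attribute:

```python
import pandas as pd

serd = pd.Series(range(6),
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

# False here, because 'white' and 'green' each appear twice
print(serd.index.is_unique)

# A duplicated label yields a Series; a unique label yields a scalar
print(type(serd['white']).__name__)
print(type(serd['blue']).__name__)
```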
ser = pd.Series([2,5,7,4],index = ['one','two','three','four'])
ser
one 2
two 5
three 7
four 4
dtype: int64
ser.reindex(['three','one','five','two'])
three 7.0
one 2.0
five NaN
two 5.0
dtype: float64
ser3 = pd.Series([1,5,6,3],index=[0,3,5,6])
ser3
0 1
3 5
5 6
6 3
dtype: int64
ser3.reindex(range(6),method='ffill')
0 1
1 1
2 1
3 5
4 5
5 6
dtype: int64
ser3.reindex(range(8),method='bfill')
0 1.0
1 5.0
2 5.0
3 5.0
4 6.0
5 6.0
6 3.0
7 NaN
dtype: float64
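The two reindex calls above fill new labels from neighbouring values (ffill carries the previous value forward, bfill takes the next one). When a constant is more appropriate than a neighbouring value, a sketch using the fill_value parameter of reindex might look like this (the name filled is illustrative):

```python
import pandas as pd

ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

# fill_value supplies a constant for labels absent from the original index
filled = ser3.reindex(range(8), fill_value=0)
print(filled.tolist())
```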
frame.reindex(range(5), method='ffill',columns=['colors','price','new','object'])
| | colors | price | new | object |
|---|---|---|---|---|
| 0 | blue | 1.2 | blue | ball |
| 1 | green | 1.0 | green | pen |
| 2 | yellow | 3.3 | yellow | pencil |
| 3 | red | 0.9 | red | paper |
| 4 | white | 1.7 | white | mug |
ser = pd.Series(np.arange(4.),index=['red','blue','yellow','white'])
ser
red 0.0
blue 1.0
yellow 2.0
white 3.0
dtype: float64
ser.drop('yellow')
red 0.0
blue 1.0
white 3.0
dtype: float64
ser.drop(['blue','white'])
red 0.0
yellow 2.0
dtype: float64
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame
| | ball | pen | pencil | paper |
|---|---|---|---|---|
| red | 0 | 1 | 2 | 3 |
| blue | 4 | 5 | 6 | 7 |
| yellow | 8 | 9 | 10 | 11 |
| white | 12 | 13 | 14 | 15 |
frame.drop(['pen'],axis=1)
| | ball | pencil | paper |
|---|---|---|---|
| red | 0 | 2 | 3 |
| blue | 4 | 6 | 7 |
| yellow | 8 | 10 | 11 |
| white | 12 | 14 | 15 |
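🥀3. Basic pandas data operations
🌺3.1 Arithmetic operations
- When two Series or DataFrame objects both contain a label, their values for that label are added together.
- When a label appears in only one of the objects, the result for that label is NaN.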
s1 = pd.Series([3,2,5,1],index=['white','yellow','green','blue'])
s1
white 3
yellow 2
green 5
blue 1
dtype: int64
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])
s1 + s2
black NaN
blue 3.0
brown NaN
green NaN
white 4.0
yellow 6.0
dtype: float64
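The alignment step behind s1 + s2 can be made visible with Series.align, which reindexes both operands to the union of their labels before any arithmetic happens. A minimal sketch with the same data (left and right are illustrative names):

```python
import pandas as pd

s1 = pd.Series([3, 2, 5, 1], index=['white', 'yellow', 'green', 'blue'])
s2 = pd.Series([1, 4, 7, 2, 1],
               index=['white', 'yellow', 'black', 'blue', 'brown'])

# align() gives both operands the same index (outer join by default);
# labels missing from an operand are filled with NaN
left, right = s1.align(s2)
print(left)
print(right)

# Adding the aligned operands reproduces s1 + s2
print((left + right).equals(s1 + s2))
```

Looking at left and right separately shows exactly which operand contributed each NaN in the sum.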
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
columns=['ball','pen','pencil','paper'],
index = ['red','blue','yellow','white'])
frame1
| | ball | pen | pencil | paper |
|---|---|---|---|---|
| red | 0 | 1 | 2 | 3 |
| blue | 4 | 5 | 6 | 7 |
| yellow | 8 | 9 | 10 | 11 |
| white | 12 | 13 | 14 | 15 |
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
index = ['blue','yellow','green','white']
,columns=['ball','pen','mug'])
frame2
| | ball | pen | mug |
|---|---|---|---|
| blue | 0 | 1 | 2 |
| yellow | 3 | 4 | 5 |
| green | 6 | 7 | 8 |
| white | 9 | 10 | 11 |
frame3 = frame1+frame2
frame3
| | ball | mug | paper | pen | pencil |
|---|---|---|---|---|---|
| blue | 4.0 | NaN | NaN | 6.0 | NaN |
| green | NaN | NaN | NaN | NaN | NaN |
| red | NaN | NaN | NaN | NaN | NaN |
| white | 21.0 | NaN | NaN | 23.0 | NaN |
| yellow | 11.0 | NaN | NaN | 13.0 | NaN |
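Most cells in the sum above are NaN because a row or column label exists in only one frame. A sketch of one way to keep those values, assuming a cell missing from one frame should count as 0, uses DataFrame.add with fill_value (the name total is illustrative):

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                      index=['blue', 'yellow', 'green', 'white'],
                      columns=['ball', 'pen', 'mug'])

# A cell missing from one frame is treated as 0;
# a cell missing from BOTH frames stays NaN
total = frame1.add(frame2, fill_value=0)
print(total)
```

For instance, the cell at row red, column mug exists in neither frame, so it remains NaN even with fill_value.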
🌻3.2 Basic arithmetic methods
The main arithmetic methods are:
- add(): frame1.add(frame2) is equivalent to frame1 + frame2
- sub()
- div()
- mul()
A few examples follow.
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
columns=['ball','pen','pencil','paper'],
index = ['red','blue','yellow','white'])
frame
| | ball | pen | pencil | paper |
|---|---|---|---|---|
| red | 0 | 1 | 2 | 3 |
| blue | 4 | 5 | 6 | 7 |
| yellow | 8 | 9 | 10 | 11 |
| white | 12 | 13 | 14 | 15 |
ser = pd.Series(np.arange(4),['ball','pen','pencil','paper'])
ser
ball 0
pen 1
pencil 2
paper 3
dtype: int32
frame-ser
| | ball | pen | pencil | paper |
|---|---|---|---|---|
| red | 0 | 0 | 0 | 0 |
| blue | 4 | 4 | 4 | 4 |
| yellow | 8 | 8 | 8 | 8 |
| white | 12 | 12 | 12 | 12 |
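The subtraction above matches the Series labels against the column labels and repeats the operation for every row. To match against the row labels instead, the method form accepts an axis argument. A minimal sketch, where row_offsets is a hypothetical Series indexed by row name:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# A Series labelled by ROW names (hypothetical example data)
row_offsets = pd.Series([0, 4, 8, 12],
                        index=['red', 'blue', 'yellow', 'white'])

# axis='index' matches the Series against row labels, not column labels
result = frame.sub(row_offsets, axis='index')
print(result)
```

This is one case where sub() can do something the - operator cannot, since the operator always aligns a Series with the columns.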
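When a label exists in only one of the two data structures, the result gains a new entry for that label, and its value is NaN.
For example, add an element labelled mug to ser: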
ser['mug'] = 9
ser
ball 0
pen 1
pencil 2
paper 3
mug 9
dtype: int64
frame - ser
| | ball | mug | paper | pen | pencil |
|---|---|---|---|---|---|
| red | 0 | NaN | 0 | 0 | 0 |
| blue | 4 | NaN | 4 | 4 | 4 |
| yellow | 8 | NaN | 8 | 8 | 8 |
| white | 12 | NaN | 12 | 12 | 12 |
🌼3.3 Function application and mapping
Functions can be applied to every element of a DataFrame or Series.
frame
| | ball | pen | pencil | paper |
|---|---|---|---|---|
| red | 0 | 1 | 2 | 3 |
| blue | 4 | 5 | 6 | 7 |
| yellow | 8 | 9 | 10 | 11 |
| white | 12 | 13 | 14 | 15 |
np.sqrt(frame)
| | ball | pen | pencil | paper |
|---|---|---|---|---|
| red | 0.000000 | 1.000000 | 1.414214 | 1.732051 |
| blue | 2.000000 | 2.236068 | 2.449490 | 2.645751 |
| yellow | 2.828427 | 3.000000 | 3.162278 | 3.316625 |
| white | 3.464102 | 3.605551 | 3.741657 | 3.872983 |
f = lambda x: x.max() - x.min()
# equivalent definition as a named function: the range (max minus min)
def f(x):
    return x.max() - x.min()
frame.apply(f)
ball 12
pen 12
pencil 12
paper 12
dtype: int64
# return several statistics at once as a labelled Series
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f,axis = 1)
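When the applied function returns a Series, apply assembles the results into a DataFrame. A minimal sketch of running such a function along each axis; min_max, per_row, and per_col are illustrative names rather than the original ones:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

def min_max(x):
    # Return two statistics at once as a labelled Series
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

# axis=1: the function receives each row; one result row per input row
per_row = frame.apply(min_max, axis=1)

# default axis=0: the function receives each column;
# one result column per input column
per_col = frame.apply(min_max)

print(per_row)
print(per_col)
```

With axis=1 the statistics become columns next to the original row labels, while the default axis puts them in the index.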