Python Data Analysis: A Real Introduction to pandas (The Basics)
Posted Geek_bao
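Python Data Analysis Basics
0. Preface
The previous post was a ten-minute crash course in pandas, and I trust everyone came out of it suitably fried (only joking, please don't hurt me). Now we move on to the pandas fundamentals.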
1. Series
import pandas as pd
import numpy as np
# Series
s = pd.Series([1, 3, 6, np.nan, 44, 1])
print(s)
# The default index starts at 0. To use your own labels, pass the index parameter, e.g. index=[3, 4, 3, 7, 8, 9]
0 1.0
1 3.0
2 6.0
3 NaN
4 44.0
5 1.0
dtype: float64
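To make the index parameter mentioned above concrete, here is a minimal sketch. The string labels are made up for illustration and are not part of the original example; the point is only that a labelled Series can be read both by label and by position.

```python
import numpy as np
import pandas as pd

# Hypothetical labels, chosen only for this illustration.
labels = ['a', 'b', 'c', 'd', 'e', 'f']
s_labeled = pd.Series([1, 3, 6, np.nan, 44, 1], index=labels)

print(s_labeled['c'])     # look up by label
print(s_labeled.iloc[2])  # look up by integer position
```

Because the data contains NaN, the values are still stored as float64, just as in the default-index example.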
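2. DataFrame
2.1 Basic DataFrame usage
# DataFrame
dates = pd.date_range('2018-08-19', periods=6)
# dates = pd.date_range('2018-08-19', '2018-08-24') # start and end dates; equivalent to the line above
'''
numpy.random.randn(d0, d1, ..., dn) returns one or more samples from the standard normal distribution.
numpy.random.rand(d0, d1, ..., dn) returns samples drawn uniformly from [0, 1).
(6, 4) means six rows and four columns of data.
'''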
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c','d'])
print(df)
# A DataFrame has both a row index and a column index; it can be viewed as a large dictionary of Series
a b c d
2018-08-19 -0.193563 0.774822 0.791951 -0.001489
2018-08-20 1.383536 0.013180 -1.013866 0.277929
2018-08-21 0.194067 -0.112442 0.537806 0.775922
2018-08-22 -1.257753 -1.241477 1.099022 0.487283
2018-08-23 -0.383184 -0.299835 -1.212893 0.884345
2018-08-24 0.691404 -1.207610 -0.168567 0.642692
print(df['b'])
2018-08-19 0.774822
2018-08-20 0.013180
2018-08-21 -0.112442
2018-08-22 -1.241477
2018-08-23 -0.299835
2018-08-24 -1.207610
Freq: D, Name: b, dtype: float64
# Data without explicit row or column labels
df1 = pd.DataFrame(np.arange(12).reshape(3,4))
print(df1)
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
# Another way: build from a dictionary, whose keys become the column labels
df2 = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': pd.Timestamp('20180819'),
'C': pd.Series([1, 6, 9, 10], dtype='float32'),
'D': np.array([3] * 4, dtype='int32'),
'E': pd.Categorical(['test', 'train', 'test', 'train']),
'F': 'foo'
})
print(df2)
A B C D E F
0 1 2018-08-19 1.0 3 test foo
1 2 2018-08-19 6.0 3 train foo
2 3 2018-08-19 9.0 3 test foo
3 4 2018-08-19 10.0 3 train foo
print(df2.index)
RangeIndex(start=0, stop=4, step=1)
print(df2.columns)
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
print(df2.values)
[[1 Timestamp('2018-08-19 00:00:00') 1.0 3 'test' 'foo']
[2 Timestamp('2018-08-19 00:00:00') 6.0 3 'train' 'foo']
[3 Timestamp('2018-08-19 00:00:00') 9.0 3 'test' 'foo']
[4 Timestamp('2018-08-19 00:00:00') 10.0 3 'train' 'foo']]
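# Summary statistics.
# count: number of non-null values in the column
# mean: mean of the column
# std: sample standard deviation (divides by n-1), i.e. √(Σ(x - mean)² / (n - 1))
# min: minimum value
# 25%: first quartile; 25% of the samples are at or below this value
# 50%: the median; 50% of the samples are at or below this value
# 75%: third quartile; 75% of the samples are at or below this value
'''
There are several ways to determine quartiles. One of them is based on n-1, namely
position of Q1 = 1 + (n - 1) x 0.25
position of Q2 = 1 + (n - 1) x 0.5
position of Q3 = 1 + (n - 1) x 0.75
'''
# max: maximum value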
print(df2.describe())
A C D
count 4.000000 4.000000 4.0
mean 2.500000 6.500000 3.0
std 1.290994 4.041452 0.0
min 1.000000 1.000000 3.0
25% 1.750000 4.750000 3.0
50% 2.500000 7.500000 3.0
75% 3.250000 9.250000 3.0
max 4.000000 10.000000 3.0
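The std and 25% rows above can be checked by hand. The sketch below recomputes them for columns A and C; it assumes the pandas defaults (std with ddof=1, quantile with linear interpolation), which is what the describe() output above is consistent with.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4])
c = pd.Series([1, 6, 9, 10], dtype='float32')

sample_std = a.std()            # default ddof=1: divide by n - 1
population_std = a.std(ddof=0)  # divide by n instead

# Position of Q1 on the sorted values: 1 + (4 - 1) * 0.25 = 1.75,
# i.e. three quarters of the way from the 1st value (1) to the 2nd value (6).
q1_c = c.quantile(0.25)

print(sample_std, population_std, q1_c)
```

The sample value (about 1.290994) matches the std row for column A, while the population value (about 1.118034) does not, which is why the std row should be read as the n-1 version.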
# Transpose the data
print(df2.T)
# print(np.transpose(df2)) is equivalent to the line above
0 1 2 \\
A 1 2 3
B 2018-08-19 00:00:00 2018-08-19 00:00:00 2018-08-19 00:00:00
C 1 6 9
D 3 3 3
E test train test
F foo foo foo
3
A 4
B 2018-08-19 00:00:00
C 10
D 3
E train
F foo
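'''
For sort_index, axis=0 sorts by the row labels (the index)
and axis=1 sorts by the column labels.
ascending defaults to True.
ascending=True sorts in ascending order; ascending=False sorts in descending order.
The next two calls sort the column labels in ascending and then descending order.
'''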
print(df2.sort_index(axis=1, ascending=True))
A B C D E F
0 1 2018-08-19 1.0 3 test foo
1 2 2018-08-19 6.0 3 train foo
2 3 2018-08-19 9.0 3 test foo
3 4 2018-08-19 10.0 3 train foo
print(df2.sort_index(axis=1, ascending=False))
F E D C B A
0 foo test 3 1.0 2018-08-19 1
1 foo train 3 6.0 2018-08-19 2
2 foo test 3 9.0 2018-08-19 3
3 foo train 3 10.0 2018-08-19 4
# Sort by the row index, first in descending and then in ascending order
print(df2.sort_index(axis=0, ascending=False))
A B C D E F
3 4 2018-08-19 10.0 3 train foo
2 3 2018-08-19 9.0 3 test foo
1 2 2018-08-19 6.0 3 train foo
0 1 2018-08-19 1.0 3 test foo
print(df2.sort_index(axis=0, ascending=True))
A B C D E F
0 1 2018-08-19 1.0 3 test foo
1 2 2018-08-19 6.0 3 train foo
2 3 2018-08-19 9.0 3 test foo
3 4 2018-08-19 10.0 3 train foo
# Sort by the values of a specific column
# Here: sort by column C in descending order
print(df2.sort_values(by='C', ascending=False))
A B C D E F
3 4 2018-08-19 10.0 3 train foo
2 3 2018-08-19 9.0 3 test foo
1 2 2018-08-19 6.0 3 train foo
0 1 2018-08-19 1.0 3 test foo
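Since df2 already has ordered labels, the effect of axis is easy to miss. The following sketch uses a small frame with deliberately shuffled labels (hypothetical data, not from the original post) so that each call visibly changes something.

```python
import pandas as pd

# Hypothetical frame with unordered row and column labels.
df_demo = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=[2, 0, 1],
    columns=['y', 'x'],
)

by_rows = df_demo.sort_index(axis=0)   # reorders rows by index label
by_cols = df_demo.sort_index(axis=1)   # reorders columns by column label
by_value = df_demo.sort_values(by='y', ascending=False)  # reorders rows by the values in y

print(by_rows)
print(by_cols)
print(by_value)
```

None of these calls modifies df_demo; each returns a new, reordered frame.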
3. Selecting data
3.1 Selection in practice
import pandas as pd
import numpy as np
dates = pd.date_range('20180819', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
print(df)
A B C D
2018-08-19 0 1 2 3
2018-08-20 4 5 6 7
2018-08-21 8 9 10 11
2018-08-22 12 13 14 15
2018-08-23 16 17 18 19
2018-08-24 20 21 22 23
# Select column A
print(df['A'])
2018-08-19 0
2018-08-20 4
2018-08-21 8
2018-08-22 12
2018-08-23 16
2018-08-24 20
Freq: D, Name: A, dtype: int32
print(df.A)
2018-08-19 0
2018-08-20 4
2018-08-21 8
2018-08-22 12
2018-08-23 16
2018-08-24 20
Freq: D, Name: A, dtype: int32
# Select across multiple rows or columns
# Select the first three rows
print(df[0:3])
A B C D
2018-08-19 0 1 2 3
2018-08-20 4 5 6 7
2018-08-21 8 9 10 11
print(df['2018-08-19': '2018-08-21'])
A B C D
2018-08-19 0 1 2 3
2018-08-20 4 5 6 7
2018-08-21 8 9 10 11
# Select data by label with loc
# Get specific rows or columns
# A specific row
print(df.loc['20180819'])
A 0
B 1
C 2
D 3
Name: 2018-08-19 00:00:00, dtype: int32
# Specific columns
# Two ways
print(df.loc[:, 'A':'B'])
A B
2018-08-19 0 1
2018-08-20 4 5
2018-08-21 8 9
2018-08-22 12 13
2018-08-23 16 17
2018-08-24 20 21
print(df.loc[:, ['A', 'B']])
A B
2018-08-19 0 1
2018-08-20 4 5
2018-08-21 8 9
2018-08-22 12 13
2018-08-23 16 17
2018-08-24 20 21
# Select rows and columns at the same time
print(df.loc['20180819', ['A', 'B']])
A 0
B 1
Name: 2018-08-19 00:00:00, dtype: int32
# Select by integer position with iloc
# Get the value at a specific position
print(df.iloc[3, 1])
13
print(df.iloc[3:5, 1:3]) # the end positions 5 and 3 are excluded, as in list slicing
B C
2018-08-22 13 14
2018-08-23 17 18
# Non-contiguous rows
print(df.iloc[[1, 3, 5], 1:3])
B C
2018-08-20 5 6
2018-08-22 13 14
2018-08-24 21 22
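One detail worth spelling out from the examples above: label slices with loc include the end label, while position slices with iloc exclude the end position. A minimal sketch using the same frame as this section:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180819', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

by_label = df.loc['2018-08-19':'2018-08-21', 'A':'B']  # both end labels are included
by_position = df.iloc[0:3, 0:2]                        # end positions 3 and 2 are excluded

print(by_label.equals(by_position))
```

Both selections cover the first three rows and columns A and B, so the comparison prints True.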
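# Mixed selection (.ix is deprecated and was removed in pandas 1.0, so this call only runs on older versions)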
print(df.ix[:3, ['A', 'C']])
A C
2018-08-19 0 2
2018-08-20 4 6
2018-08-21 8 10
D:\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:2: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
print(df.iloc[:3, [0, 2]]) # same result as above
A C
2018-08-19 0 2
2018-08-20 4 6
2018-08-21 8 10
# Filter with a boolean condition
print(df[df.A>8])
A B C D
2018-08-22 12 13 14 15
2018-08-23 16 17 18 19
2018-08-24 20 21 22 23
# The same filter through loc
print(df.loc[df.A>8])
A B C D
2018-08-22 12 13 14 15
2018-08-23 16 17 18 19
2018-08-24 20 21 22 23
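Boolean filters can also be combined. The sketch below is an extension of the single-condition filter above, not something from the original post; the thresholds are arbitrary.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180819', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

# Each condition needs its own parentheses; combine with & (and), | (or), ~ (not).
both = df[(df.A > 8) & (df.D < 23)]

# isin keeps the rows whose value appears in the given list.
picked = df[df.A.isin([0, 20])]

print(both)
print(picked)
```

Python's plain `and` / `or` do not work element-wise on a Series, which is why the bitwise operators are used here.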
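3.2 Selection summary
1. iloc vs. ix
Summary:
Similarities: both can retrieve values by integer position, and the calls look alike.
Differences: ix allows mixed selection, so column labels can be passed alongside integer row positions, whereas iloc accepts only integer positions. With many columns ix was more convenient, but it is deprecated and was removed in pandas 1.0.
2. loc vs. iloc
Summary:
Similarities: both can select blocks of data (rows, columns, or both).
Differences: loc selects by label and iloc selects by integer position, so the arguments differ. Label slices in loc include the end label; position slices in iloc exclude the end position.
3. ix vs. loc and iloc
Summary: ix mixes the behaviour of loc and iloc.
The three calls below produce the same output: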
print(df.loc['20180819', 'A':'B'])
print(df.iloc[0, 0:2])
print(df.ix[0, 'A':'B'])
A 0
B 1
Name: 2018-08-19 00:00:00, dtype: int32
A 0
B 1
Name: 2018-08-19 00:00:00, dtype: int32
A 0
B 1
Name: 2018-08-19 00:00:00, dtype: int32
D:\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:3: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
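Since .ix is no longer available in current pandas, mixed "rows by position, columns by label" selection has to be written with loc or iloc. The following is one possible sketch, not the only way to do it: either translate the positions into labels for loc, or translate the labels into positions for iloc.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180819', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

# Option 1: turn row positions into labels, then use loc.
mixed_loc = df.loc[df.index[:3], ['A', 'C']]

# Option 2: turn column labels into positions, then use iloc.
mixed_iloc = df.iloc[:3, df.columns.get_indexer(['A', 'C'])]

print(mixed_loc.equals(mixed_iloc))
```

Both forms return the first three rows of columns A and C, the same result the df.ix[:3, ['A', 'C']] call produced on older versions.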
4. Setting values in pandas
4.1 Creating the data
import pandas as pd
import numpy as np
# Create the data
dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])
print(df)
A B C D
2018-08-20 0 1 2 3
2018-08-21 4 5 6 7
2018-08-22 8 9 10 11
2018-08-23 12 13 14 15
2018-08-24 16 17 18 19
2018-08-25 20 21 22 23
4.2 Setting by position or label with iloc and loc
# Set by integer position (iloc) or by label (loc)
df.iloc[2, 2] = 111
df.loc['20180820', 'B'] = 2222
print(df)
A B C D
2018-08-20 0 2222 2 3
2018-08-21 4 5 6 7
2018-08-22 8 9 111 11
2018-08-23 12 13 14 15
2018-08-24 16 17 18 19
2018-08-25 20 21 22 23
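4.3 Setting by condition
# Set by condition
# Change values in column B: wherever column A is greater than 4, set B to 0
# Note: this is chained assignment; df.loc[df.A > 4, 'B'] = 0 is the more robust form
df.B[df.A>4] = 0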
print(df)
A B C D
2018-08-20 0 2222 2 3
2018-08-21 4 5 6 7
2018-08-22 8 0 111 11
2018-08-23 12 0 14 15
2018-08-24 16 0 18 19
2018-08-25 20 0 22 23
df.B.loc[df.A>4] = 0
print(df)
A B C D
2018-08-20 0 2222 2 3
2018-08-21 4 5 6 7
2018-08-22 8 0 111 11
2018-08-23 12 0 14 15
2018-08-24 16 0 18 19
2018-08-25 20 0 22 23
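The two conditional assignments above go through df.B first and then index into it. A single loc call with a row mask and a column label does the same job in one step and avoids chained assignment; a minimal sketch on a fresh copy of the data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])

# Row mask and column label in one indexing operation.
df.loc[df.A > 4, 'B'] = 0

print(df)
```

This starts from the unmodified frame, so the first value of B is still 1 here rather than the 2222 set earlier in the post.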
4.4 Setting a whole row or column
# Set by row or column
# Column-wise assignment: create column F and fill it with NaN
df['F'] = np.nan
print(df)
A B C D F
2018-08-20 0 2222 2 3 NaN
2018-08-21 4 5 6 7 NaN
2018-08-22 8 0 111 11 NaN
2018-08-23 12 0 14 15 NaN
2018-08-24 16 0 18 19 NaN
2018-08-25 20 0 22 23 NaN
4.5 Adding a Series (it is aligned on the index)
df['E'] = pd.Series([1, 2, 3, 4, 5, 6], index = pd.date_range('20180820', periods=6))
print(df)
A B C D F E
2018-08-20 0 2222 2 3 NaN 1
2018-08-21 4 5 6 7 NaN 2
2018-08-22 8 0 111 11 NaN 3
2018-08-23 12 0 14 15 NaN 4
2018-08-24 16 0 18 19 NaN 5
2018-08-25 20 0 22 23 NaN 6
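When a Series is assigned as a column, pandas matches it to the frame by index label rather than by order. The sketch below uses a deliberately shorter Series (a hypothetical variation on the example above) to show what happens to rows that have no matching label.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])

# Hypothetical Series covering only the first three dates.
partial = pd.Series([1, 2, 3], index=dates[:3])
df['E'] = partial

print(df['E'])
```

The three missing dates receive NaN, and the column becomes float64 as a result.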
4.6 Setting a specific cell
# Set the value at a given row and column
df.ix['20180820', 'A'] = 56
print(df)
# ix is being phased out (it was removed in pandas 1.0), so avoid it and use loc or iloc instead
A B C D F E
2018-08-20 56 2222 2 3 NaN 1
2018-08-21 4 5 6 7 NaN 2
2018-08-22 8 0 111 11 NaN 3
2018-08-23 12 0 14 15 NaN 4
2018-08-24 16 0 18 19 NaN 5
2018-08-25 20 0 22 23 NaN 6
D:\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:2: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
df.loc['20180820', 'A'] = 67
print(df)
A B C D F E
2018-08-20 67 2222 2 3 NaN 1
2018-08-21 4 5 6 7 NaN 2
2018-08-22 8 0 111 11 NaN 3
2018-08-23 12 0 14 15 NaN 4
2018-08-24 16 0 18 19 NaN 5
2018-08-25 20 0 22 23 NaN 6
df.iloc[0, 0] = 76
print(df)
A B C D F E
2018-08-20 76 2222 2 3 NaN 1
2018-08-21 4 5 6 7 NaN 2
2018-08-22 8 0 111 11 NaN 3
2018-08-23 12 0 14 15 NaN 4
2018-08-24 16 0 18 19 NaN 5
2018-08-25 20 0 22 23 NaN 6
4.7 Modifying an entire row
# Modify an entire row
df.iloc[1] = np.nan # df.iloc[1,:] = np.nan
print(df)
A B C D F E
2018-08-20 76.0 2222.0 2.0 3.0 NaN 1.0
2018-08-21 NaN NaN NaN NaN NaN NaN
2018-08-22 8.0 0.0 111.0 11.0 NaN 3.0
2018-08-23 12.0 0.0 14.0 15.0 NaN 4.0
2018-08-24 16.0 0.0 18.0 19.0 NaN 5.0
2018-08-25 20.0 0.0 22.0 23.0 NaN 6.0
df.loc['20180823'] = np.nan # df.loc['20180823', :] = np.nan
print(df)
A B C D F E
2018-08-20 76.0 2222.0 2.0 3.0 NaN 1.0
2018-08-21 NaN NaN NaN NaN NaN NaN
2018-08-22 8.0 0.0 111.0 11.0 NaN 3.0
2018-08-23 NaN NaN NaN NaN NaN NaN
2018-08-24 16.0 0.0 18.0 19.0 NaN 5.0
2018-08-25 20.0 0.0 22.0 23.0 NaN 6.0
df.ix[2] = np.nan # df.ix[2, :]
print(df)
A B C D F E
2018-08-20 76.0 2222.0 2.0 3.0 NaN 1.0
2018-08-21 NaN NaN NaN NaN NaN NaN
2018-08-22 NaN NaN NaN NaN NaN NaN
2018-08-23 NaN NaN NaN NaN NaN NaN
2018-08-24 16.0 0.0 18.0 19.0 NaN 5.0
2018-08-25 20.0 0.0 22.0 23.0 NaN 6.0
D:\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
df.ix['20180824'] = np.nan
print(df)
A B C D F E
2018-08-20 76.0 2222.0 2.0 3.0 NaN 1.0
2018-08-21 NaN NaN NaN NaN NaN NaN
2018-08-22 NaN NaN NaN NaN NaN NaN
2018-08-23 NaN NaN NaN NaN NaN NaN
2018-08-24 NaN NaN NaN NaN NaN NaN
2018-08-25 20.0 0.0 22.0 23.0 NaN 6.0
D:\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
5. Handling missing data in pandas
5.1 Creating a matrix containing NaN
# Handling missing data in pandas
import pandas as pd
import numpy as np
# Create a matrix containing NaN
# Then see how to fill and drop the NaN values
dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
print(df)
A B C D
2018-08-20 0 1 2 3
2018-08-21 4 5 6 7
2018-08-22 8 9 10 11
2018-08-23 12 13 14 15
2018-08-24 16 17 18 19
2018-08-25 20 21 22 23
# a.reshape(6, 4) is equivalent to a.reshape((6, 4))
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan
print(df)
A B C D
2018-08-20 0 NaN 2.0 3
2018-08-21 4 5.0 NaN 7
2018-08-22 8 9.0 10.0 11
2018-08-23 12 13.0 14.0 15
2018-08-24 16 17.0 18.0 19
2018-08-25 20 21.0 22.0 23
5.2 Dropping rows or columns that contain NaN
# Drop rows or columns containing NaN
print(df.dropna()) # by default, rows containing NaN are dropped
A B C D
2018-08-22 8 9.0 10.0 11
2018-08-23 12 13.0 14.0 15
2018-08-24 16 17.0 18.0 19
2018-08-25 20 21.0 22.0 23
print(df.dropna(
    axis = 0, # 0 drops rows; 1 drops columns
    how = 'any' # 'any': drop if any NaN is present; 'all': drop only if every value is NaN
))
A B C D
2018-08-22 8 9.0 10.0 11
2018-08-23 12 13.0 14.0 15
2018-08-24 16 17.0 18.0 19
2018-08-25 20 21.0 22.0 23
# Drop every column that contains NaN
print(df.dropna(
axis = 1,
how = 'any'
))
A D
2018-08-20 0 3
2018-08-21 4 7
2018-08-22 8 11
2018-08-23 12 15
2018-08-24 16 19
2018-08-25 20 23
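Besides axis and how, dropna also accepts subset and thresh, which are useful when only some columns matter. A small sketch on the same layout of data follows; the frame is created as float here (an assumption made for this example) so that writing NaN into it does not change any column's dtype.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

all_nan_rows = df.dropna(how='all')    # drops only rows that are entirely NaN
b_required = df.dropna(subset=['B'])   # only NaN in column B causes a drop
complete = df.dropna(thresh=4)         # keep rows with at least 4 non-NaN values

print(len(all_nan_rows), len(b_required), len(complete))
```

No row is entirely NaN, so how='all' keeps all six rows; subset=['B'] drops only the first row; thresh=4 drops both rows that have a gap.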
5.3 Replacing NaN with 0 or another value
# Replace NaN with 0 or another value
print(df.fillna(value=0))
A B C D
2018-08-20 0 0.0 2.0 3
2018-08-21 4 5.0 0.0 7
2018-08-22 8 9.0 10.0 11
2018-08-23 12 13.0 14.0 15
2018-08-24 16 17.0 18.0 19
2018-08-25 20 21.0 22.0 23
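As an illustration of the "or another value" part, here are two common alternatives to a constant: filling each column with its own mean, and carrying the previous row's value forward. This is a sketch on the same data layout, created as float (an assumption for this example) to keep dtypes stable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

# Fill each column's gaps with that column's mean (NaN is skipped when computing the mean).
filled_mean = df.fillna(df.mean())

# Forward fill: copy the last valid value from the rows above.
filled_forward = df.ffill()

print(filled_mean)
print(filled_forward)
```

Note that forward fill cannot repair the gap in the first row of column B, because there is no earlier value to copy.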
5.4 Checking for missing data (NaN)
# Check for missing data (NaN)
# isnull: True wherever a value is missing
print(df.isnull())
A B C D
2018-08-20 False True False False
2018-08-21 False False True False
2018-08-22 False False False False
2018-08-23 False False False False
2018-08-24 False False False False
2018-08-25 False False False False
# isna: an alias of isnull, so the result is identical
print(df.isna())
A B C D
2018-08-20 False True False False
2018-08-21 False False True False
2018-08-22 False False False False
2018-08-23 False False False False
2018-08-24 False False False False
2018-08-25 False False False False
# Check whether each column contains any NaN
print(df.isnull().any())
A False
B True
C True
D False
dtype: bool
# Check whether the data contains any NaN at all; returns True if it does
print(np.any(df.isnull()==True))
True
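The boolean frame returned by isnull can also be summed, which answers "how many values are missing" rather than just "is anything missing". A minimal sketch, again on a float version of the same data (an assumption for this example):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

missing_per_column = df.isnull().sum()     # True counts as 1, so this counts NaN per column
total_missing = int(missing_per_column.sum())
has_missing = bool(df.isnull().values.any())

print(missing_per_column)
print(total_missing, has_missing)
```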
6. Importing and exporting with pandas
6.1 Importing data
import pandas as pd
data = pd.read_csv('student.csv')
# Print data
print(data)
Student ID name age gender
0 1100 Kelly 22 Female
1 1101 Clo 21 Female
2 1102 Tilly 22 Female
3 1103 Tony 24 Male
4 1104 David 20 Male
5 1105 Catty 22 Female
6 1106 M 3 Female
7 1107 N 43 Male
8 1108 A 13 Male
9 1109 S 12 Male
10 1110 David 33 Male
11 1111 Dw 3 Female
12 1112 Q 23 Male
13 1113 W 21 Female
# First three rows
print(data.head(3))
Student ID name age gender
0 1100 Kelly 22 Female
1 1101 Clo 21 Female
2 1102 Tilly 22 Female
# Last three rows
print(data.tail(3))
Student ID name age gender
11 1111 Dw 3 Female
12 1112 Q 23 Male
13 1113 W 21 Female
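The student.csv file itself is not shown in the post, so the example above cannot be run as is. The sketch below reads equivalent text from memory instead; the comma-separated layout and the four sample rows are an assumption based on the printed output, not the actual file.

```python
import io
import pandas as pd

# Assumed layout of student.csv (first four rows only), reconstructed from the printed output.
csv_text = """Student ID,name,age,gender
1100,Kelly,22,Female
1101,Clo,21,Female
1102,Tilly,22,Female
1103,Tony,24,Male
"""

# read_csv accepts a file-like object as well as a path.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head(2))
print(data.tail(2))
```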
6.2 Exporting data
# Save the data as a pickle file
data.to_pickle('student.pickle')
# Read the pickle file back
print(pd.read_pickle('student.pickle'))
Student ID name age gender
0 1100 Kelly 22 Female
1 1101 Clo 21 Female
2 1102 Tilly 22 Female
3 1103 Tony 24 Male
4 1104 David 20 Male
5 1105 Catty 22 Female
6 1106 M 3 Female
7 1107 N 43 Male
8 1108 A 13 Male
9 1109 S 12 Male
10 1110 David 33 Male
11 1111 Dw 3 Female
12 1112 Q 23 Male
13 1113 W 21 Female
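Pickle is convenient but only readable from Python. For comparison, here is a sketch of the same round trip through CSV. The file name student_out.csv and the three sample rows are made up for this example, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample rows in the same shape as the student data.
data = pd.DataFrame({
    'Student ID': [1100, 1101, 1102],
    'name': ['Kelly', 'Clo', 'Tilly'],
    'age': [22, 21, 22],
    'gender': ['Female', 'Female', 'Female'],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'student_out.csv')
    data.to_csv(csv_path, index=False)  # index=False avoids writing the 0, 1, 2 row labels
    restored = pd.read_csv(csv_path)

print(restored)
```

Without index=False, the row labels would be written as an extra unnamed column and show up again when the file is read back.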
7. Merging in pandas
7.1 Concatenating with concat
import pandas as pd
import numpy as np
# Define the data sets
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b'