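Learning pandas: A Complete, Detailed Code Walkthrough (Data Viewing, Input/Output, Selection, Integration, Cleaning, Transformation, Reshaping, Math and Statistical Methods, Sorting)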

Posted by 报告,今天也有好好学习

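This post collects the pandas code snippets you are most likely to need in day-to-day work. Each snippet is followed by its output to make it easier to understand.

I have also added comments that explain what each piece of code does. If you would like the data files and code used in this post, you are welcome to like, bookmark, and share it, then leave a comment with your email address; I will send them to you as soon as I see it.

You can also treat this post as a small pandas handbook: when you need a particular feature, Ctrl+F will find it. If you do not have time to read it now, bookmark it for later.

A companion post is also available: Learning pandas: A Complete, Detailed Code Walkthrough (Binning, Grouping and Aggregation, Time Series, Data Visualization)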

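Part 1: pandas Data Structures

import numpy as np
import pandas as pd # pandas is built on top of NumPy and extends it

The main pandas data structures are Series (one-dimensional data) and DataFrame (two-dimensional data).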

1.1 Series

# Series
l = np.array([1,2,3,6,9]) # NumPy array

s1 = pd.Series(data = l)
display(l,s1) # A Series is a one-dimensional array; unlike a NumPy array, it also carries an index
array([1, 2, 3, 6, 9])



0    1
1    2
2    3
3    6
4    9
dtype: int64
s2 = pd.Series(data = l,index = list('ABCDE'))
s2
A    1
B    2
C    3
D    6
E    9
dtype: int64
s3 = pd.Series(data = {'A':149,'B':130,'C':118,'D':99,'E':66})
s3
A    149
B    130
C    118
D     99
E     66
dtype: int64
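
The index is what separates a Series from a plain NumPy array. The following minimal sketch reuses the values of s2 above to show label-based access next to position-based access, and how arithmetic aligns on labels. The names `s`, `other`, and `total` are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 6, 9])
s = pd.Series(data=values, index=list('ABCDE'))

# Label-based access uses the index; position-based access uses iloc
by_label = s['C']
by_position = s.iloc[2]
print(by_label, by_position)

# Arithmetic between two Series aligns on index labels, not on position
other = pd.Series({'E': 10, 'A': 100})
total = s + other  # labels missing from either side become NaN
print(total)
```

Because alignment happens by label, the order in which `other` lists its entries does not matter.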

1.2 DataFrame

# A Series is one-dimensional and offers fewer features
# A DataFrame is two-dimensional: several Series sharing one index make up a DataFrame
# It resembles an Excel sheet: all the data is structured
df1 = pd.DataFrame(data = np.random.randint(0,151,size = (10,3)),
                   index = list('ABCDEFHIJK'), # row index
                   columns=['Python','Math','En'],dtype=np.float16) # column index
df1
   Python   Math     En
A   113.0   37.0   70.0
B    92.0   22.0   11.0
C     0.0    9.0   66.0
D    40.0  145.0   23.0
E    25.0  133.0  108.0
F   124.0   16.0  130.0
H   121.0   85.0  133.0
I    84.0  125.0   39.0
J   111.0   36.0  137.0
K    55.0   26.0   85.0
df2 = pd.DataFrame(data = {'Python':[66,99,128],'Math':[88,65,137],'En':[100,121,45]})
df2 # dict keys become the column index; if index is not given, a default integer index starting at 0 is generated
   Python  Math   En
0      66    88  100
1      99    65  121
2     128   137   45
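
To illustrate the statement that a DataFrame is several Series sharing one index, here is a small sketch. It uses the same values as df2; the Series `a` and `b` and the frame `combined` are hypothetical names added for the example.

```python
import pandas as pd

df = pd.DataFrame({'Python': [66, 99, 128],
                   'Math': [88, 65, 137],
                   'En': [100, 121, 45]})

col = df['Python']                    # a single column is a Series
print(type(col).__name__)
print(col.index.equals(df.index))     # the column shares the DataFrame's row index

# Building a DataFrame from Series aligns them on their index labels
a = pd.Series([1, 2, 3], index=list('xyz'))
b = pd.Series([10, 20], index=list('xy'))
combined = pd.DataFrame({'a': a, 'b': b})
print(combined)                       # row 'z' has no value in column 'b', so it is NaN
```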

Part 2: Viewing Data

df = pd.DataFrame(data = np.random.randint(0,151,size = (100,3)),
                  columns=['Python','Math','En'])
df
    Python  Math   En
0      133   139  141
1       82    17  130
2       51    51  145
3      127    70   11
4       93    60   91
..     ...   ...  ...
95      57   133   96
96      91    21  134
97      76   109  113
98      99    82   29
99      28    54   88

100 rows × 3 columns

df.shape # shape of the DataFrame
(100, 3)
df.head(n = 3) # show the first n rows; n defaults to 5
   Python  Math   En
0     133   139  141
1      82    17  130
2      51    51  145
df.tail() # show the last n rows; n defaults to 5
    Python  Math   En
95      57   133   96
96      91    21  134
97      76   109  113
98      99    82   29
99      28    54   88
df.dtypes # data type of each column
Python    int64
Math      int64
En        int64
dtype: object
df.info() # more detailed information: index range, columns, non-null counts, dtypes, memory usage
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Python  100 non-null    int64
 1   Math    100 non-null    int64
 2   En      100 non-null    int64
dtypes: int64(3)
memory usage: 2.5 KB
df.describe() # summary statistics: count, mean, standard deviation, minimum, quartiles (50% is the median), maximum
           Python        Math          En
count  100.000000  100.000000  100.000000
mean    85.790000   77.410000   67.630000
std     41.375173   44.905309   43.883835
min      3.000000    0.000000    3.000000
25%     54.500000   40.250000   31.250000
50%     84.500000   81.000000   58.500000
75%    123.000000  113.250000  103.000000
max    149.000000  149.000000  147.000000
df.values # the values, returned as a NumPy array
array([[133, 139, 141],
       [ 82,  17, 130],
       [ 51,  51, 145],
       [127,  70,  11],
       [ 93,  60,  91],
       [103, 110, 103],
       [ 27, 133,  32],
       [148,  99, 128],
       [139,  97,  44],
       [ 64,  85,  71],
       [147,  94,  37],
       [114,  12,  16],
       [ 16,  54,  44],
       [123,   3,  76],
       [137,  97, 123],
       [149, 113,  74],
       [ 69,  38,   7],
       [ 68, 122,   4],
       [ 53,  13,  47],
       [113, 127, 124],
       [ 55, 139,  47],
       [140, 114,  14],
       [ 84, 111, 115],
       [ 65,   5, 136],
       [ 96,  50,  89],
       [145, 130,  15],
       [111,  30,  66],
       [132, 122, 144],
       [ 79,   5,  45],
       [115,  29,  49],
       [ 27,  55,  83],
       [ 29,  74,  38],
       [ 87, 100,  45],
       [132, 147, 119],
       [ 66,  90,  40],
       [ 67, 108,  48],
       [ 78,  28,  46],
       [105, 137, 110],
       [132, 119,  55],
       [117,  23,  79],
       [ 12,  29,  12],
       [114,  58, 119],
       [139,   0,  42],
       [ 61,  69, 142],
       [141,  73, 107],
       [ 49,  12,  19],
       [  8,   1,  75],
       [134,  60,  25],
       [138,  80,  79],
       [112, 115,  26],
       [ 77,   4, 120],
       [140, 100,  35],
       [ 82, 129,   4],
       [100,   8,  25],
       [ 77,  97,  78],
       [ 55, 113,  53],
       [ 45,  73,  37],
       [ 44,   0,  80],
       [ 26,  74,  52],
       [ 99,  75, 147],
       [111,   8, 144],
       [ 55, 146,  15],
       [140, 106,  74],
       [ 91,  78,  92],
       [130, 108,  41],
       [ 34,  41, 136],
       [  3, 139,   4],
       [123,  93,   4],
       [ 24, 103,   3],
       [ 44, 122,  92],
       [ 83,  45,  50],
       [ 46, 149, 103],
       [ 48, 127,  92],
       [  3,  51,  57],
       [136, 136,  82],
       [ 65, 102,  16],
       [ 23,  61, 118],
       [138,  15,   6],
       [ 83,  91,   4],
       [109,  24,  54],
       [ 40,  43, 125],
       [103, 123, 141],
       [116, 113,  38],
       [137,  71, 126],
       [ 69, 143,  83],
       [  8,  60,  60],
       [ 40,  22,  95],
       [ 73,  19,  17],
       [137, 129, 103],
       [109, 142,  94],
       [ 85, 105,  10],
       [ 97, 107,  19],
       [ 79,  12,  27],
       [143,  74,  18],
       [ 32, 114,  52],
       [ 57, 133,  96],
       [ 91,  21, 134],
       [ 76, 109, 113],
       [ 99,  82,  29],
       [ 28,  54,  88]])
df.columns # column index
Index(['Python', 'Math', 'En'], dtype='object')
df.index # row index, 0 to 99
RangeIndex(start=0, stop=100, step=1)
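
The inspection calls above can be cross-checked against each other. The sketch below assumes a fixed random seed (not used in the original, so the numbers differ from the output shown) and confirms that the 50% row of describe() is the column median and that head and tail return the expected rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption made for reproducibility
df = pd.DataFrame(rng.integers(0, 151, size=(100, 3)),
                  columns=['Python', 'Math', 'En'])

summary = df.describe()
# The '50%' row of describe() is the median of each column
median_matches = np.allclose(summary.loc['50%'], df.median())
print(median_matches)

# head(n) and tail(n) return the first and last n rows
first_rows = df.head(3).index.tolist()
last_rows = df.tail(2).index.tolist()
print(first_rows, last_rows)

# to_numpy() is the accessor the pandas documentation recommends over .values
arr = df.to_numpy()
print(arr.shape)
```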

Part 3: Data Input and Output

3.1 CSV

df = pd.DataFrame(data = np.random.randint(0,151,size = (100,3)),
                  columns=['Python','Math','En'])
df # row index and column index
PythonMathEn
01012854
1454774
210333133
3702481
490143121
............
9514525139
965351109
97357130
98865120
991496675

100 rows × 3 columns

df.to_csv('./data.csv',sep = ',',
          index = True, # write the row index
          header=True) # write the column index
df.to_csv('./data2.csv',sep = ',',
          index = False, # do not write the row index
          header=False) # do not write the column index
pd.read_csv('./data.csv',
            index_col=0) # use the first column as the row index
PythonMathEn
01012854
1454774
210333133
3702481
490143121
............
9514525139
965351109
97357130
98865120
991496675

100 rows × 3 columns

pd.read_csv('./data2.csv',header =None)
012
01012854
1454774
210333133
3702481
490143121
............
9514525139
965351109
97357130
98865120
991496675

100 rows × 3 columns
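
The effect of the index and header options is easiest to see on a tiny frame. This sketch uses made-up values and an in-memory buffer (io.StringIO) instead of files on disk, which is a substitution for illustration only; the options are the same ones used above.

```python
import io
import pandas as pd

df = pd.DataFrame({'Python': [90, 45], 'Math': [28, 47], 'En': [54, 74]})

# With the row index and the header written
text_full = df.to_csv(index=True, header=True)
# Without index and header: only the raw values remain
text_bare = df.to_csv(index=False, header=False)
print(text_full)
print(text_bare)

# Reading back: index_col=0 restores the row index, header=None generates column labels
restored = pd.read_csv(io.StringIO(text_full), index_col=0)
bare = pd.read_csv(io.StringIO(text_bare), header=None)
print(restored.equals(df))
print(bare.columns.tolist())
```

When a file is written without a header, the column names are lost, so read_csv with header=None labels the columns 0, 1, 2.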

3.2 Excel

df
PythonMathEn
01012854
1454774
210333133
3702481
490143121
............
9514525139
965351109
97357130
98865120
991496675

100 rows × 3 columns

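df.to_excel('./data.xlsx') # .xlsx requires the openpyxl package; pandas 2.0+ can no longer write the legacy .xls format
pd.read_excel('./data.xlsx',
              index_col=0) # use the first column as the row index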
PythonMathEn
01012854
1454774
210333133
3702481
490143121
............
9514525139
965351109
97357130
98865120
991496675

100 rows × 3 columns

3.3 HDF5

df.to_hdf('./data.h5',key = 'score')
df2 = pd.DataFrame(data = np.random.randint(6,100,size = (1000,5)),
                   columns=['计算机','化工','生物','工程','教师'])
df2
计算机化工生物工程教师
06422166860
19547727637
28848925037
3753886383
46214202145
..................
9958994889727
996526821850
9977699109256
9986631556594
999821388914

1000 rows × 5 columns

df2.to_hdf('./data.h5',key = 'salary')
pd.read_hdf('./data.h5',key = 'salary')
计算机化工生物工程教师
06422166860
19547727637
28848925037
3753886383
46214202145
..................
9958994889727
996526821850
9977699109256
9986631556594
999821388914

1000 rows × 5 columns

3.4 SQL

from sqlalchemy import create_engine # database engine: sets up the connection to the database
# Uses the PyMySQL driver
# The connection string looks much like a web address
engine = create_engine('mysql+pymysql://root:12345678@localhost/pandas?charset=utf8')
df2.to_sql('salary',engine,index=False) # save the DataFrame to MySQL
df3 = pd.read_sql('select * from salary limit 50',con = engine)
df3
计算机化工生物工程教师
06422166860
19547727637
28848925037
3753886383
46214202145
59541843716
63445119394
79910573063
85960123793
94158156770
10198639664
117561784989
128684316827
134298242085
149515978087
155222443574
166720652410
17946416266
188676807219
19618164266
20779284187
218716751434
222382924232
236189282140
242212388914
257712468912
266045527167
272976942691
281460828860
295636446037
306377434282
312571365121
32768687838
339359862578
347340128666
351030541371
36948587585
378141611255
388068669284
395336842666
401962634745
418939913186
425743534819
436616231910
444628788121
45385376498
46559470644
475633921784
486968238790
491247328015

Part 4: Selecting Data

4.1 Getting Data

df = pd.DataFrame(np.random.randint(0,151,size = (10,3)),
                  index=list('ABCDEFHIJK'),columns=['Python','Math','En'])
df
   Python  Math   En
A      88    52   48
B      78    62   94
C       9    14   71
D      86    15   21
E       1    71   71
F     123   138   55
H      59    17  140
I      68     8   58
J     100    70   63
K      79    37   72
df['Python'] # returns the column as a Series
A     88
B     78
C      9
D     86
E      1
F    123
H     59
I     68
J    100
K     79
Name: Python, dtype: int64
df.Python # attribute access: the column labels of a DataFrame are also available as attributes
A     88
B     78
C      9
D     86
E      1
F    123
H     59
I     68
J    100
K     79
Name: Python, dtype: int64
df[['Python','En']] # select several columns
   Python   En
A      88   48
B      78   94
C       9   71
D      86   21
E       1   71
F     123   55
H      59  140
I      68   58
J     100   63
K      79   72

4.2 Selection by Label

# A label is a row index value; loc is short for location
df.loc['A']
Python    88
Math      52
En        48
Name: A, dtype: int64
df.loc[['A','F','K']]
   Python  Math   En
A      88    52   48
F     123   138   55
K      79    37   72
df.loc['A','Python']
88
df.loc[['A','C','F'],'Python']
A     88
C      9
F    123
Name: Python, dtype: int64
df.loc['A'::2,['Math','En']]
   Math   En
A    52   48
C    14   71
E    71   71
H    17  140
J    70   63
df.loc['A':'D',:]
   Python  Math   En
A      88    52   48
B      78    62   94
C       9    14   71
D      86    15   21

4.3 Selection by Position

df.iloc[0]
Python    88
Math      52
En        48
Name: A, dtype: int64
df.iloc[[0,2,4]]
   Python  Math   En
A      88    52   48
C       9    14   71
E       1    71   71
df.iloc[0:4,[0,2]]
   Python   En
A      88   48
B      78   94
C       9   71
D      86   21
df.iloc[3:8:2]
   Python  Math   En
D      86    15   21
F     123   138   55
I      68     8   58
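
One detail that the outputs above show but do not spell out: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. A minimal sketch on a four-row frame, with values borrowed from the first rows of df:

```python
import pandas as pd

df = pd.DataFrame({'Python': [88, 78, 9, 86], 'Math': [52, 62, 14, 15]},
                  index=list('ABCD'))

by_label = df.loc['A':'C']      # label slice: the end label 'C' IS included
by_position = df.iloc[0:2]      # position slice: the end position 2 is NOT included
print(by_label.index.tolist())
print(by_position.index.tolist())

# Index.get_loc converts a label to its integer position
pos = df.index.get_loc('C')
value = df.iloc[pos, 0]
print(pos, value)
```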

4.4 Boolean Indexing

cond = df.Python > 80 # select the rows whose Python score is above 80
df[cond]
   Python  Math   En
A      88    52   48
D      86    15   21
F     123   138   55
J     100    70   63
cond = df.mean(axis = 1) > 75 # a mean score above 75 counts as excellent; select those rows
df[cond]
   Python  Math   En
B      78    62   94
F     123   138   55
J     100    70   63
cond = (df.Python > 70) & (df.Math > 70)
df[cond]
   Python  Math   En
F     123   138   55
cond = df.index.isin(['C','E','H','K']) # test whether each index label is in the given list
df[cond] # the rows that satisfy the condition are selected
   Python  Math   En
C       9    14   71
E       1    71   71
H      59    17  140
K      79    37   72
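
Conditions can also be combined with | (or) and negated with ~ (not). The sketch below, on a small made-up frame, shows these operators and the equivalent DataFrame.query call. Each comparison needs its own parentheses because & and | bind more tightly than > in Python.

```python
import pandas as pd

df = pd.DataFrame({'Python': [88, 78, 9, 123], 'Math': [52, 62, 14, 138]},
                  index=list('ABCF'))

# Element-wise or / not on boolean Series
either = df[(df.Python > 80) | (df.Math > 60)]
neither = df[~((df.Python > 80) | (df.Math > 60))]
print(either.index.tolist())
print(neither.index.tolist())

# DataFrame.query expresses the same filter as a string
same = df.query('Python > 80 or Math > 60')
print(same.equals(either))
```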

4.5 Assignment

df.loc['A','Python'] = 150 # change the value at one position (a single loc call avoids chained assignment)
df
   Python  Math   En
A     150    52   48
B      78    62   94
C       9    14   71
D      86    15   21
E       1    71   71
F     123   138   55
H      59    17  140
I      68     8   58
J     100    70   63
K      79    37   72
df['Java'] = np.random.randint(0,151,size = 10) # add a new column
df
   Python  Math   En  Java
A     150    52   48    65
B      78    62   94    25
C       9    14   71    82
D      86    15   21   139
E       1    71   71    67
F     123   138   55   145
H      59    17  140    53
I      68     8   58   141
J     100    70   63    11
K      79    37   72   127
df.loc[['C','D','E'],'Math'] = 147 # change several students' scores at once
df
   Python  Math   En  Java
A     150    52   48    65
B      78    62   94    25
C       9   147   71    82
D      86   147   21   139
E       1   147   71    67
F     123   138   55   145
H      59    17  140    53
I      68     8   58   141
J     100    70   63    11
K      79    37   72   127
cond = df < 60
df[cond] = 60 # conditional assignment: values that meet the condition are changed, the rest stay as they are
df
   Python  Math   En  Java
A     150    60   60    65
B      78    62   94    60
C      60   147   71    82
D      86   147   60   139
E      60   147   71    67
F     123   138   60   145
H      60    60  140    60
I      68    60   60   141
J     100    70   63    60
K      79    60   72   127
df.iloc[3::3,[0,2]] += 100
df
   Python  Math   En  Java
A     150    60   60    65
B      78    62   94    60
C      60   147   71    82
D     186   147  160   139
E      60   147   71    67
F     123   138   60   145
H     160    60  240    60
I      68    60   60   141
J     100    70   63    60
K     179    60  172   127
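
The conditional assignment above modifies df in place. When the original frame should stay untouched, DataFrame.mask returns a modified copy instead. The sketch below uses a small made-up frame and also compares the result with clip(lower=60), which happens to express the same rule; neither call appears in the original.

```python
import pandas as pd

df = pd.DataFrame({'Python': [88, 9, 1], 'Math': [52, 14, 71]}, index=list('ACE'))

# Single cell: one .loc call with a row label and a column label
df.loc['A', 'Python'] = 150

# mask replaces values where the condition is True and returns a new frame
floored = df.mask(df < 60, 60)
print(floored)

# For a simple lower bound, clip produces the same values
same_values = bool((floored == df.clip(lower=60)).all().all())
print(same_values)
print(df.loc['C', 'Python'])   # the original frame is unchanged by mask
```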

Part 5: Data Integration

5.1 concat: Concatenating Data

# np.concatenate is the NumPy counterpart for combining data
df1 = pd.DataFrame(np.random.randint(0,151,size = (10,3)),
                   columns=['Python','Math','En'],
                   index = list('ABCDEFHIJK'))
df2 = pd.DataFrame(np.random.randint(0,151,size = (10,3)),
                   columns = ['Python','Math','En'],
                   index = list('QWRTUYOPLM'))
df3 = pd.DataFrame(np.random.randint(0,151,size = (10,2)),
                  columns=['Java','Chinese'],index = list('ABCDEFHIJK'))
pd.concat([df1,df2],axis = 0) # axis = 0 concatenates along rows, so the number of rows grows
PythonMathEn
A1087453
B981647
C7177128
D9123131<
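
As a compact illustration of the two concatenation directions, here is a sketch on tiny made-up frames. The frames `top`, `bottom`, and `extra` are hypothetical stand-ins for df1, df2, and df3; the axis = 1 call is an assumption about how a frame like df3, which shares its index with df1, would be combined.

```python
import pandas as pd

top = pd.DataFrame({'Python': [1, 2], 'Math': [3, 4]}, index=list('AB'))
bottom = pd.DataFrame({'Python': [5, 6], 'Math': [7, 8]}, index=list('QW'))
extra = pd.DataFrame({'Java': [9, 10]}, index=list('AB'))

rows = pd.concat([top, bottom], axis=0)   # stack rows: 4 rows, the same 2 columns
cols = pd.concat([top, extra], axis=1)    # align on the index and add columns: 2 rows, 3 columns
print(rows.shape, cols.shape)
print(cols)
```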
