Python Data Analysis: A Real Introduction to pandas (Basics)

Posted 2022-12-05 Geek_bao


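Python Data Analysis Basics

0. Preface
1. Series
2. DataFrame
- 2.1 Basic use of DataFrame
3. Selection
4. Setting values in pandas
5. Handling missing data in pandas
6. Importing and exporting with pandas
- 6.1 Importing data
- 6.2 Exporting data
7. Merging with pandas
8. Plotting with pandas
9. Closing remarks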

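0. Preface

Earlier we went through a ten-minute crash course in pandas, and I am sure everyone came out of it suitably crashed (just joking, please don't hurt me). This post moves on to the pandas basics.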

1. Series

import pandas as pd
import numpy as np
# Series
s = pd.Series([1, 3, 6, np.nan, 44, 1])
print(s)
# The default index starts at 0. To use your own labels, pass the index parameter, e.g. index=[3, 4, 3, 7, 8, 9]

0     1.0
1     3.0
2     6.0
3     NaN
4    44.0
5     1.0
dtype: float64
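
The index parameter mentioned in the comment above can be sketched as follows. This is only an illustration: the names s_custom, repeated and single are assumptions, not part of the original post. It also shows that index labels may repeat, in which case a label lookup returns every matching entry.

```python
import numpy as np
import pandas as pd

# Same values as above, but with custom index labels (label 3 appears twice).
s_custom = pd.Series([1, 3, 6, np.nan, 44, 1], index=[3, 4, 3, 7, 8, 9])

# Label-based lookup: a repeated label returns a Series with all matches,
# a unique label returns a single value.
repeated = s_custom.loc[3]
single = s_custom.loc[7]

print(repeated)
print(single)
```

Because the labels here are integers, .loc is used to make it explicit that the lookup is by label rather than by position.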

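2. DataFrame

2.1 Basic use of DataFrame

# DataFrame
dates = pd.date_range('2018-08-19', periods=6)
# dates = pd.date_range('2018-08-19', '2018-08-24')  # start and end dates; equivalent to the line above
'''
numpy.random.randn(d0, d1, ..., dn) returns one or more samples from the standard normal distribution.
numpy.random.rand(d0, d1, ..., dn) returns uniform random samples in [0, 1).
(6, 4) means six rows and four columns.
'''
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])
print(df)
# A DataFrame has both a row index and a column index; it can be viewed as a dict of Series that share one index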

                   a         b         c         d
2018-08-19 -0.193563  0.774822  0.791951 -0.001489
2018-08-20  1.383536  0.013180 -1.013866  0.277929
2018-08-21  0.194067 -0.112442  0.537806  0.775922
2018-08-22 -1.257753 -1.241477  1.099022  0.487283
2018-08-23 -0.383184 -0.299835 -1.212893  0.884345
2018-08-24  0.691404 -1.207610 -0.168567  0.642692

print(df['b'])

2018-08-19    0.774822
2018-08-20    0.013180
2018-08-21   -0.112442
2018-08-22   -1.241477
2018-08-23   -0.299835
2018-08-24   -1.207610
Freq: D, Name: b, dtype: float64

# Data created without row or column labels
df1 = pd.DataFrame(np.arange(12).reshape(3,4))
print(df1)

   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

# Another way: build from a dict whose keys become the column labels
df2 = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.Timestamp('20180819'),
    'C': pd.Series([1, 6, 9, 10], dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(['test', 'train', 'test', 'train']),
    'F': 'foo'
})
print(df2)

   A          B     C  D      E    F
0  1 2018-08-19   1.0  3   test  foo
1  2 2018-08-19   6.0  3  train  foo
2  3 2018-08-19   9.0  3   test  foo
3  4 2018-08-19  10.0  3  train  foo

print(df2.index)

RangeIndex(start=0, stop=4, step=1)

print(df2.columns)

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

print(df2.values)

[[1 Timestamp('2018-08-19 00:00:00') 1.0 3 'test' 'foo']
 [2 Timestamp('2018-08-19 00:00:00') 6.0 3 'train' 'foo']
 [3 Timestamp('2018-08-19 00:00:00') 9.0 3 'test' 'foo']
 [4 Timestamp('2018-08-19 00:00:00') 10.0 3 'train' 'foo']]

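# Summary statistics.
# count: number of non-null values in the column
# mean: average of the column
# std: sample standard deviation of the column (the square root of the variance; it measures dispersion: the larger it is, the more the values differ from one another, and the smaller it is, the closer together they are)
# min: minimum value
# 25%: the value at the 25% position once the column is sorted in ascending order
# 50%: the value at the 50% position once the column is sorted in ascending order (the median)
# 75%: the value at the 75% position once the column is sorted in ascending order
'''
First find the position of each quartile:
position of Q1 = (n+1) × 0.25
position of Q2 = (n+1) × 0.5
position of Q3 = (n+1) × 0.75
There are several conventions for quartiles. Another one is based on n-1:
position of Q1 = 1 + (n-1) × 0.25
position of Q2 = 1 + (n-1) × 0.5
position of Q3 = 1 + (n-1) × 0.75
describe() follows the n-1 convention with linear interpolation, so its 25% and 75% rows can differ from the (n+1) results worked out in the examples here.
Example 1 ((n+1) convention):
Data: 6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36
Sorted in ascending order: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 11 values in total
position of Q1 = (11+1) × 0.25 = 3, position of Q2 = (11+1) × 0.5 = 6, position of Q3 = (11+1) × 0.75 = 9, so
Q1 = 15,
Q2 = 40,
Q3 = 43
Example 2 ((n+1) convention):
Data: 7, 15, 36, 39, 40, 41, 6 values in total
With an even number of values, Q2 is the median of the data.
(n+1)/4 = 7/4 = 1.75, so Q1 lies between the first and second values;
3(n+1)/4 = 21/4 = 5.25, so Q3 lies between the fifth and sixth values.
Q1 = 0.75*15 + 0.25*7 = 13,
Q2 = (36+39)/2 = 37.5,
Q3 = 0.25*41 + 0.75*40 = 40.25.
'''
# max: maximum value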
print(df2.describe())

              A          C    D
count  4.000000   4.000000  4.0
mean   2.500000   6.500000  3.0
std    1.290994   4.041452  0.0
min    1.000000   1.000000  3.0
25%    1.750000   4.750000  3.0
50%    2.500000   7.500000  3.0
75%    3.250000   9.250000  3.0
max    4.000000  10.000000  3.0
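
The two quartile conventions described in the comments give different numbers, and the 25% row above (1.75 for column A) matches the n-1 convention. The sketch below checks this by hand. The helper quantile_n_minus_1 is a hypothetical name written for this illustration; only Series.quantile is a pandas call.

```python
import numpy as np
import pandas as pd

def quantile_n_minus_1(values, q):
    """Quantile at position 1 + (n - 1) * q, using linear interpolation."""
    data = sorted(values)
    pos = (len(data) - 1) * q              # zero-based fractional position
    lower = int(np.floor(pos))
    upper = min(lower + 1, len(data) - 1)
    frac = pos - lower
    return data[lower] + frac * (data[upper] - data[lower])

col_a = pd.Series([1, 2, 3, 4])            # same values as column A of df2
manual = [quantile_n_minus_1(col_a, q) for q in (0.25, 0.5, 0.75)]
builtin = [col_a.quantile(q) for q in (0.25, 0.5, 0.75)]
print(manual)
print(builtin)

# The six values of Example 2: the n-1 convention puts Q1 between the
# second and third values, not between the first and second.
six = pd.Series([7, 15, 36, 39, 40, 41])
q1_six = six.quantile(0.25)
print(q1_six)
```

For the six values of Example 2, the (n+1) convention gives Q1 = 13, while the n-1 convention used here gives 20.25, so it matters which convention a tool follows.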

# Transpose the data
print(df2.T)
# print(np.transpose(df2)) is equivalent to the line above

                     0                    1                    2  \
A                    1                    2                    3   
B  2018-08-19 00:00:00  2018-08-19 00:00:00  2018-08-19 00:00:00   
C                    1                    6                    9   
D                    3                    3                    3   
E                 test                train                 test   
F                  foo                  foo                  foo   

                     3  
A                    4  
B  2018-08-19 00:00:00  
C                   10  
D                    3  
E                train  
F                  foo

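'''
For sort_index, axis=1 sorts by the column labels (the columns are reordered)
and axis=0 sorts by the row labels (the rows are reordered).
ascending defaults to True.
ascending=True sorts in ascending order; ascending=False sorts in descending order.
The next two calls sort the column labels in ascending and then descending order.
'''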
print(df2.sort_index(axis=1, ascending=True))

   A          B     C  D      E    F
0  1 2018-08-19   1.0  3   test  foo
1  2 2018-08-19   6.0  3  train  foo
2  3 2018-08-19   9.0  3   test  foo
3  4 2018-08-19  10.0  3  train  foo

print(df2.sort_index(axis=1, ascending=False))

     F      E  D     C          B  A
0  foo   test  3   1.0 2018-08-19  1
1  foo  train  3   6.0 2018-08-19  2
2  foo   test  3   9.0 2018-08-19  3
3  foo  train  3  10.0 2018-08-19  4

# Sort by the row labels (axis=0), first descending and then ascending
print(df2.sort_index(axis=0, ascending=False))

   A          B     C  D      E    F
3  4 2018-08-19  10.0  3  train  foo
2  3 2018-08-19   9.0  3   test  foo
1  2 2018-08-19   6.0  3  train  foo
0  1 2018-08-19   1.0  3   test  foo

print(df2.sort_index(axis=0, ascending=True))

   A          B     C  D      E    F
0  1 2018-08-19   1.0  3   test  foo
1  2 2018-08-19   6.0  3  train  foo
2  3 2018-08-19   9.0  3   test  foo
3  4 2018-08-19  10.0  3  train  foo

# Sort by the values of a specific column
# Here: sort by column C in descending order
print(df2.sort_values(by='C', ascending=False))

   A          B     C  D      E    F
3  4 2018-08-19  10.0  3  train  foo
2  3 2018-08-19   9.0  3   test  foo
1  2 2018-08-19   6.0  3  train  foo
0  1 2018-08-19   1.0  3   test  foo
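
sort_values also accepts a list of columns, with one sort direction per column. A minimal sketch follows, using a reduced stand-in for df2 (the frame small and its plain-string E column are assumptions made for this illustration):

```python
import pandas as pd

# A reduced, hypothetical version of df2 with plain string labels in E.
small = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'C': [1.0, 6.0, 9.0, 10.0],
    'E': ['test', 'train', 'test', 'train'],
})

# Sort by E ascending first; rows that share the same E are then
# ordered by C descending.
ordered = small.sort_values(by=['E', 'C'], ascending=[True, False])
print(ordered)
```

The original row labels travel with the rows, so the index of the result shows where each row came from.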

3. Selection

3.1 Selection in practice

import pandas as pd
import numpy as np
dates = pd.date_range('20180819', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
print(df)

             A   B   C   D
2018-08-19   0   1   2   3
2018-08-20   4   5   6   7
2018-08-21   8   9  10  11
2018-08-22  12  13  14  15
2018-08-23  16  17  18  19
2018-08-24  20  21  22  23

# Select column A
print(df['A'])

2018-08-19     0
2018-08-20     4
2018-08-21     8
2018-08-22    12
2018-08-23    16
2018-08-24    20
Freq: D, Name: A, dtype: int32

print(df.A)

2018-08-19     0
2018-08-20     4
2018-08-21     8
2018-08-22    12
2018-08-23    16
2018-08-24    20
Freq: D, Name: A, dtype: int32

# Select a range of rows or columns
# Take the first three rows
print(df[0:3])

            A  B   C   D
2018-08-19  0  1   2   3
2018-08-20  4  5   6   7
2018-08-21  8  9  10  11

print(df['2018-08-19': '2018-08-21'])

            A  B   C   D
2018-08-19  0  1   2   3
2018-08-20  4  5   6   7
2018-08-21  8  9  10  11

# Select data by label with loc
# Get a specific row or column
# A single row
print(df.loc['20180819'])

A    0
B    1
C    2
D    3
Name: 2018-08-19 00:00:00, dtype: int32

# Specific columns
# Two ways: a label slice (end label included) or a list of labels
print(df.loc[:, 'A':'B'])

             A   B
2018-08-19   0   1
2018-08-20   4   5
2018-08-21   8   9
2018-08-22  12  13
2018-08-23  16  17
2018-08-24  20  21

print(df.loc[:, ['A', 'B']])

             A   B
2018-08-19   0   1
2018-08-20   4   5
2018-08-21   8   9
2018-08-22  12  13
2018-08-23  16  17
2018-08-24  20  21

# Select by row and column at the same time
print(df.loc['20180819', ['A', 'B']])

A    0
B    1
Name: 2018-08-19 00:00:00, dtype: int32

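# Select by integer position with iloc
# Get the value at a specific position
print(df.iloc[3, 1])

13

print(df.iloc[3:5, 1:3])  # the end positions 5 and 3 are excluded, as in list slicing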

             B   C
2018-08-22  13  14
2018-08-23  17  18

# Select non-adjacent rows
print(df.iloc[[1, 3, 5], 1:3])

             B   C
2018-08-20   5   6
2018-08-22  13  14
2018-08-24  21  22

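# Mixed selection: rows by position, columns by label
# .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so this line runs only on older versions
print(df.ix[:3, ['A', 'C']])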

            A   C
2018-08-19  0   2
2018-08-20  4   6
2018-08-21  8  10


D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated

print(df.iloc[:3, [0, 2]])  # same result as above

            A   C
2018-08-19  0   2
2018-08-20  4   6
2018-08-21  8  10

# Select with a boolean condition
print(df[df.A>8])

             A   B   C   D
2018-08-22  12  13  14  15
2018-08-23  16  17  18  19
2018-08-24  20  21  22  23

# The same boolean condition through loc
print(df.loc[df.A>8])

             A   B   C   D
2018-08-22  12  13  14  15
2018-08-23  16  17  18  19
2018-08-24  20  21  22  23
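
Boolean conditions can also be combined. The sketch below rebuilds the same df and uses the element-wise operators & and |, plus isin(); the variable names both, either and listed are assumptions chosen for the illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180819', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

# Each condition needs its own parentheses; & means "and", | means "or".
both = df[(df.A > 8) & (df.D < 23)]
either = df[(df.A < 4) | (df.A > 16)]

# isin() keeps the rows whose value appears in the given list.
listed = df[df.B.isin([1, 9, 17])]

print(both)
print(either)
print(listed)
```

Python's own and / or do not work element-wise on a Series, which is why the bitwise operators are used here.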

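3.2 Selection summary

1. iloc vs. ix

Summary:

Similarities: both can retrieve a single value or a block of data, and they are used in much the same way.

Differences: ix allowed mixed selection, so columns could be picked by their labels, whereas iloc accepts integer positions only. With many columns ix was often more convenient, but it was deprecated in pandas 0.20 and removed in pandas 1.0, so new code should use loc or iloc.

2. loc vs. iloc

Summary:

Similarities: both can retrieve a single value or a block of data.

Differences: loc selects by label and its slices include the end label; iloc selects by integer position and its slices exclude the end position.

3. ix vs. loc and iloc

Summary: ix mixed the behavior of loc and iloc.

The three calls below produce the same result (the ix line runs only on pandas versions before 1.0):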

print(df.loc['20180819', 'A':'B'])
print(df.iloc[0, 0:2])
print(df.ix[0, 'A':'B'])

A    0
B    1
Name: 2018-08-19 00:00:00, dtype: int32
A    0
B    1
Name: 2018-08-19 00:00:00, dtype: int32
A    0
B    1
Name: 2018-08-19 00:00:00, dtype: int32
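
On pandas versions without ix, the same mixed selection (rows by position, columns by label) can be written with loc or iloc alone. The following is one possible sketch, not the only way to do it; by_loc and by_iloc are names assumed for the illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180819', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

# Option 1: turn the row positions into labels, then use loc.
by_loc = df.loc[df.index[:3], ['A', 'C']]

# Option 2: turn the column labels into positions, then use iloc.
by_iloc = df.iloc[:3, df.columns.get_indexer(['A', 'C'])]

print(by_loc)
print(by_iloc)
```

Both forms keep a single indexing call, which also makes them safe to use on the left-hand side of an assignment.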



4. Setting values in pandas

4.1 Creating the data

import pandas as pd
import numpy as np
# Create the data
dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])
print(df)

             A   B   C   D
2018-08-20   0   1   2   3
2018-08-21   4   5   6   7
2018-08-22   8   9  10  11
2018-08-23  12  13  14  15
2018-08-24  16  17  18  19
2018-08-25  20  21  22  23

4.2 Setting by position or label with iloc and loc

# Set one value by position (iloc) and one by label (loc)
df.iloc[2, 2] = 111
df.loc['20180820', 'B'] = 2222
print(df)

             A     B    C   D
2018-08-20   0  2222    2   3
2018-08-21   4     5    6   7
2018-08-22   8     9  111  11
2018-08-23  12    13   14  15
2018-08-24  16    17   18  19
2018-08-25  20    21   22  23

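4.3 Setting by condition

# Set by condition
# Change column B wherever column A is greater than 4, setting those entries to 0
# A single .loc call is used; the chained form df.B[df.A > 4] = 0 is discouraged and may not modify df
df.loc[df.A > 4, 'B'] = 0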
print(df)

             A     B    C   D
2018-08-20   0  2222    2   3
2018-08-21   4     5    6   7
2018-08-22   8     0  111  11
2018-08-23  12     0   14  15
2018-08-24  16     0   18  19
2018-08-25  20     0   22  23

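# The chained form df.B.loc[df.A > 4] = 0 is likewise discouraged; the single .loc call gives the same result
df.loc[df.A > 4, 'B'] = 0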
print(df)

             A     B    C   D
2018-08-20   0  2222    2   3
2018-08-21   4     5    6   7
2018-08-22   8     0  111  11
2018-08-23  12     0   14  15
2018-08-24  16     0   18  19
2018-08-25  20     0   22  23
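
Conditional updates can be done either in place or by building a new column. The sketch below rebuilds the frame from 4.1 and shows both; c_capped and the threshold 10 are assumptions picked for the illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])

# In place: rows chosen by the boolean mask, column chosen by label.
df.loc[df.A > 4, 'B'] = 0

# Not in place: mask() returns a new Series in which the entries that
# satisfy the condition are replaced; df['C'] itself is left unchanged.
c_capped = df['C'].mask(df['C'] > 10, 10)

print(df)
print(c_capped)
```

Keeping the row condition and the column label inside one .loc call is what guarantees that the assignment reaches df itself.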

4.4 Setting a whole row or column

# Set by row or by column
# Column-wide assignment: create column F and fill it with NaN
df['F'] = np.nan
print(df)

             A     B    C   D   F
2018-08-20   0  2222    2   3 NaN
2018-08-21   4     5    6   7 NaN
2018-08-22   8     0  111  11 NaN
2018-08-23  12     0   14  15 NaN
2018-08-24  16     0   18  19 NaN
2018-08-25  20     0   22  23 NaN

4.5 Adding a Series (its index must line up with the DataFrame's)

df['E'] = pd.Series([1, 2, 3, 4, 5, 6], index = pd.date_range('20180820', periods=6))
print(df)

             A     B    C   D   F  E
2018-08-20   0  2222    2   3 NaN  1
2018-08-21   4     5    6   7 NaN  2
2018-08-22   8     0  111  11 NaN  3
2018-08-23  12     0   14  15 NaN  4
2018-08-24  16     0   18  19 NaN  5
2018-08-25  20     0   22  23 NaN  6
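
When a Series is assigned as a column, pandas matches it to the rows by index label rather than by order, and rows without a matching label get NaN. A small sketch, assuming a shorter Series named partial that covers only the first three dates:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])

# A Series that covers only the first three dates.
partial = pd.Series([1, 2, 3], index=dates[:3])
df['E'] = partial            # the last three rows receive NaN

# A plain list has no index, so it is assigned by order and its
# length must equal the number of rows.
df['F'] = [10, 20, 30, 40, 50, 60]

print(df)
```

Column E ends up as float because NaN cannot be stored in an integer column of this kind.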

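4.6 Setting a specific row and column to a value

# Set the value at a given row and column
df.ix['20180820', 'A'] = 56
print(df)
# .ix is deprecated and was removed in pandas 1.0; use loc or iloc instead (see the next two examples)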

             A     B    C   D   F  E
2018-08-20  56  2222    2   3 NaN  1
2018-08-21   4     5    6   7 NaN  2
2018-08-22   8     0  111  11 NaN  3
2018-08-23  12     0   14  15 NaN  4
2018-08-24  16     0   18  19 NaN  5
2018-08-25  20     0   22  23 NaN  6



df.loc['20180820', 'A'] = 67
print(df)

             A     B    C   D   F  E
2018-08-20  67  2222    2   3 NaN  1
2018-08-21   4     5    6   7 NaN  2
2018-08-22   8     0  111  11 NaN  3
2018-08-23  12     0   14  15 NaN  4
2018-08-24  16     0   18  19 NaN  5
2018-08-25  20     0   22  23 NaN  6

df.iloc[0, 0] = 76
print(df)

             A     B    C   D   F  E
2018-08-20  76  2222    2   3 NaN  1
2018-08-21   4     5    6   7 NaN  2
2018-08-22   8     0  111  11 NaN  3
2018-08-23  12     0   14  15 NaN  4
2018-08-24  16     0   18  19 NaN  5
2018-08-25  20     0   22  23 NaN  6

4.7 Modifying an entire row

# Modify an entire row
df.iloc[1] = np.nan  # same as df.iloc[1, :] = np.nan
print(df)

               A       B      C     D   F    E
2018-08-20  76.0  2222.0    2.0   3.0 NaN  1.0
2018-08-21   NaN     NaN    NaN   NaN NaN  NaN
2018-08-22   8.0     0.0  111.0  11.0 NaN  3.0
2018-08-23  12.0     0.0   14.0  15.0 NaN  4.0
2018-08-24  16.0     0.0   18.0  19.0 NaN  5.0
2018-08-25  20.0     0.0   22.0  23.0 NaN  6.0

df.loc['20180823'] = np.nan # df.loc['20180823', :] = np.nan
print(df)

               A       B      C     D   F    E
2018-08-20  76.0  2222.0    2.0   3.0 NaN  1.0
2018-08-21   NaN     NaN    NaN   NaN NaN  NaN
2018-08-22   8.0     0.0  111.0  11.0 NaN  3.0
2018-08-23   NaN     NaN    NaN   NaN NaN  NaN
2018-08-24  16.0     0.0   18.0  19.0 NaN  5.0
2018-08-25  20.0     0.0   22.0  23.0 NaN  6.0

df.ix[2] = np.nan # df.ix[2, :]
print(df)

               A       B     C     D   F    E
2018-08-20  76.0  2222.0   2.0   3.0 NaN  1.0
2018-08-21   NaN     NaN   NaN   NaN NaN  NaN
2018-08-22   NaN     NaN   NaN   NaN NaN  NaN
2018-08-23   NaN     NaN   NaN   NaN NaN  NaN
2018-08-24  16.0     0.0  18.0  19.0 NaN  5.0
2018-08-25  20.0     0.0  22.0  23.0 NaN  6.0



df.ix['20180824'] = np.nan
print(df)

               A       B     C     D   F    E
2018-08-20  76.0  2222.0   2.0   3.0 NaN  1.0
2018-08-21   NaN     NaN   NaN   NaN NaN  NaN
2018-08-22   NaN     NaN   NaN   NaN NaN  NaN
2018-08-23   NaN     NaN   NaN   NaN NaN  NaN
2018-08-24   NaN     NaN   NaN   NaN NaN  NaN
2018-08-25  20.0     0.0  22.0  23.0 NaN  6.0



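5. Handling missing data in pandas

5.1 Creating a matrix that contains NaN

# Handling missing data in pandas
import pandas as pd
import numpy as np
# Create a matrix that contains NaN,
# then see how to fill and drop the NaN values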
dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
print(df)

             A   B   C   D
2018-08-20   0   1   2   3
2018-08-21   4   5   6   7
2018-08-22   8   9  10  11
2018-08-23  12  13  14  15
2018-08-24  16  17  18  19
2018-08-25  20  21  22  23

# a.reshape(6, 4) is equivalent to a.reshape((6, 4))
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan
print(df)

             A     B     C   D
2018-08-20   0   NaN   2.0   3
2018-08-21   4   5.0   NaN   7
2018-08-22   8   9.0  10.0  11
2018-08-23  12  13.0  14.0  15
2018-08-24  16  17.0  18.0  19
2018-08-25  20  21.0  22.0  23

5.2 Dropping rows or columns that contain NaN

# Drop rows or columns that contain NaN
print(df.dropna())  # by default, rows that contain NaN are dropped

             A     B     C   D
2018-08-22   8   9.0  10.0  11
2018-08-23  12  13.0  14.0  15
2018-08-24  16  17.0  18.0  19
2018-08-25  20  21.0  22.0  23

print(df.dropna(
    axis = 0,  # 0 drops rows; 1 drops columns
    how = 'any'  # 'any': drop if any NaN is present; 'all': drop only if every value is NaN
))

             A     B     C   D
2018-08-22   8   9.0  10.0  11
2018-08-23  12  13.0  14.0  15
2018-08-24  16  17.0  18.0  19
2018-08-25  20  21.0  22.0  23

# Drop every column that contains NaN
print(df.dropna(
    axis = 1,
    how = 'any'
))

             A   D
2018-08-20   0   3
2018-08-21   4   7
2018-08-22   8  11
2018-08-23  12  15
2018-08-24  16  19
2018-08-25  20  23

5.3 Replacing NaN with 0 or another value

# Replace NaN with 0 or another value
print(df.fillna(value=0))

             A     B     C   D
2018-08-20   0   0.0   2.0   3
2018-08-21   4   5.0   0.0   7
2018-08-22   8   9.0  10.0  11
2018-08-23  12  13.0  14.0  15
2018-08-24  16  17.0  18.0  19
2018-08-25  20  21.0  22.0  23
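
Besides one global fill value, fillna accepts a dict of per-column values, ffill() carries the previous row forward, and dropna can be limited to chosen columns. The sketch below is a minimal illustration under one assumption that differs from the data above: the frame is built with float values so that NaN can be stored without any dtype change.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180820', periods=6)
# Assumption for this sketch: float data, so NaN fits without upcasting.
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

# A different fill value per column: a constant for B, the column mean for C.
per_column = df.fillna(value={'B': -1, 'C': df['C'].mean()})

# Forward fill: each NaN takes the value from the row above, if there is one.
forward = df.ffill()

# Drop rows only when the NaN is in column B.
kept = df.dropna(subset=['B'])

print(per_column)
print(forward)
print(kept)
```

The NaN in the first row of B survives the forward fill because there is no earlier row to copy from.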

5.4 Checking for missing data (NaN)

# Check for missing data (NaN)
# Element-wise test for missing values
print(df.isnull())

                A      B      C      D
2018-08-20  False   True  False  False
2018-08-21  False  False   True  False
2018-08-22  False  False  False  False
2018-08-23  False  False  False  False
2018-08-24  False  False  False  False
2018-08-25  False  False  False  False

# isna() performs the same check; isnull() is an alias of isna()
print(df.isna())

                A      B      C      D
2018-08-20  False   True  False  False
2018-08-21  False  False   True  False
2018-08-22  False  False  False  False
2018-08-23  False  False  False  False
2018-08-24  False  False  False  False
2018-08-25  False  False  False  False

# Check, column by column, whether any value is NaN
print(df.isnull().any())

A    False
B     True
C     True
D    False
dtype: bool

# Check whether the data contains any NaN at all; prints True if it does
print(df.isnull().values.any())

True
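
The same boolean table can also be counted rather than only tested. A short sketch, again assuming float data so that NaN can be stored without a dtype change; per_col, total and rows_with_nan are names chosen for the illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

per_col = df.isna().sum()                   # number of NaN values in each column
total = int(df.isna().sum().sum())          # number of NaN values in the whole frame
rows_with_nan = df[df.isna().any(axis=1)]   # only the rows that contain a NaN

print(per_col)
print(total)
print(rows_with_nan)
```

Summing works because True counts as 1 and False as 0.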

6. Importing and exporting with pandas

6.1 Importing data

import pandas as pd
data = pd.read_csv('student.csv')
# Print data
print(data)

    Student ID  name   age  gender
0         1100  Kelly   22  Female
1         1101    Clo   21  Female
2         1102  Tilly   22  Female
3         1103   Tony   24    Male
4         1104  David   20    Male
5         1105  Catty   22  Female
6         1106      M    3  Female
7         1107      N   43    Male
8         1108      A   13    Male
9         1109      S   12    Male
10        1110  David   33    Male
11        1111     Dw    3  Female
12        1112      Q   23    Male
13        1113      W   21  Female

# First three rows
print(data.head(3))

   Student ID  name   age  gender
0        1100  Kelly   22  Female
1        1101    Clo   21  Female
2        1102  Tilly   22  Female

# Last three rows
print(data.tail(3))

    Student ID name   age  gender
11        1111    Dw    3  Female
12        1112     Q   23    Male
13        1113     W   21  Female

6.2 Exporting data

# Save the data as a pickle file
data.to_pickle('student.pickle')

# Read the pickle file back
print(pd.read_pickle('student.pickle'))

    Student ID  name   age  gender
0         1100  Kelly   22  Female
1         1101    Clo   21  Female
2         1102  Tilly   22  Female
3         1103   Tony   24    Male
4         1104  David   20    Male
5         1105  Catty   22  Female
6         1106      M    3  Female
7         1107      N   43    Male
8         1108      A   13    Male
9         1109      S   12    Male
10        1110  David   33    Male
11        1111     Dw    3  Female
12        1112      Q   23    Male
13        1113      W   21  Female
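
Exporting to CSV works along the same lines as pickle. The sketch below is self-contained: instead of student.csv it reads a few hypothetical rows from an in-memory string (csv_text is an assumption for this illustration), writes them back out with to_csv, and reads the result again.

```python
import io
import pandas as pd

# Hypothetical stand-in for student.csv: a few rows typed inline.
csv_text = """Student ID,name,age,gender
1100,Kelly,22,Female
1101,Clo,21,Female
1103,Tony,24,Male
"""
data = pd.read_csv(io.StringIO(csv_text))

# index=False leaves the 0..n-1 row index out of the written file.
buffer = io.StringIO()
data.to_csv(buffer, index=False)

# Reading the written text back gives an identical frame.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip)
```

With a real file, a path such as 'student_out.csv' (a hypothetical name) would take the place of the buffer. Unlike pickle, CSV stores plain text, so column types are inferred again when the file is read.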

7. Merging with pandas

7.1 Concatenating with concat

import pandas as pd
import numpy as np
# Define the data sets
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b'

   