学了pandas,就根本停不下来......
Posted 跟着哈哥学大智若愚
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了学了pandas,就根本停不下来......相关的知识,希望对你有一定的参考价值。
pandas?国宝大熊猫😯?不,是python数据分析核心包!!!
那么它有哪些迷人之处?咱们边学习边点赞😛
一,首先当然是配置环境
直接在windows的dos窗口或pycharm下面的Terminal中输入如下命令即可
pip install pandas
二,数据结构Series👇
==Series是一种类似于一维数组的对象,由一组数据和一组与之对应的标签(索引)组成 ==
1,引用方法
import pandas as pd
import numpy as np
2,创建Series
常见有以下几种创建方式
a = pd.Series([3, 4, 5, 6])
print(a)
# 0 3
# 1 4
# 2 5
# 3 6
# dtype: int64
b = pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])
print(b)
# a 4
# b 5
# c 6
# d 7
# dtype: int64
a = pd.Series(np.arange(5, 10))
print(a)
# 0 5
# 1 6
# 2 7
# 3 8
# 4 9
# dtype: int32
a = pd.Series({'a': 4, 'c': 7, 'g': 2, 'd': 9})
print(a)
# a 4
# c 7
# g 2
# d 9
# dtype: int64
3,Series的索引
b = pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])
print(b)
# 既可以用下标索引,也可以用标签索引
print(b[1]) # 5 下表索引
print(b['a']) # 4 标签索引
print(b[[0, 2]])
# a 4
# c 6
# dtype: int64
print(b[['c', 'd']])
# c 6
# d 7
# dtype: int64
print(b[b > 6])
# d 7
# dtype: int64
4,运算
b = pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])
print(b + 2)
# a 6
# b 7
# c 8
# d 9
# dtype: int64
print(b + b)
# a 8
# b 10
# c 12
# d 14
# dtype: int64
print(b * b)
# a 16
# b 25
# c 36
# d 49
# dtype: int64
5,切片
a = pd.Series(np.arange(3, 8), ['a', 'b', 'c', 'd', 'e'])
print(a[2: 4])
# 2 5
# 3 6
# dtype: int32
print(a['a': 'c'])
# a 3
# b 4
# c 5
# dtype: int32
6,用标签查看series对象中是否含有某值
a = pd.Series({'w': 10, 't': 34, 'p': 26})
print(a)
print('a' in a) # False
print('p' in a) # True
7,与字典的区别
a = pd.Series({'w': 10, 't': 34, 'p': 26})
for i in a:
print(i, end=' ') # 10 34 26
即字典默认取键,series默认取值
8,获取索引和值
a = pd.Series({'w': 10, 't': 34, 'p': 26})
print(a.index) # Index(['w', 't', 'p'], dtype='object')
print(a.values) # [10 34 26]
9,Series中有点迷惑行为的整数索引
s = pd.Series(np.arange(10))
print(s)
# 0 0
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# 6 6
# 7 7
# 8 8
# 9 9
# dtype: int32
s1 = s[6:]
print(s1)
# 6 6
# 7 7
# 8 8
# 9 9
# dtype: int32
# print(s1[1]) # 会报错, 也有歧义
# 推荐以后如下方式使用
print(s1.loc[6: 8]) # 标签索引
# 6 6
# 7 7
# 8 8
# dtype: int32
print(s1.iloc[0: 2]) # 下表索引
# 6 6
# 7 7
# dtype: int32
即在series中取索引以后建议使用 loc[]标签索引,iloc[]下标索引
10,数据对齐
s1 = pd.Series([13, 45, 36, 24], index=['c', 'd', 'b', 'a'])
s2 = pd.Series([26, 38, 19, 44], index=['b', 'a', 'd', 'c'])
print(s1 + s2)
# a 62
# b 62
# c 57
# d 64
# dtype: int64
s3 = pd.Series([13, 45, 36, 24, 56], index=['c', 'f', 'b', 'a', 'e'])
print(s1 + s3) # pandas中NAN表示缺失值
# a 48.0
# b 72.0
# c 26.0
# d NaN
# e NaN
# f NaN
# dtype: float64
# add, sub, div, mul 加减乘除
print(s1.add(s3))
# a 48.0
# b 72.0
# c 26.0
# d NaN
# e NaN
# f NaN
# dtype: float64
print(s1.add(s3, fill_value=0)) # 将缺失值变为0再相加
# a 48.0
# b 72.0
# c 26.0
# d 45.0
# e 56.0
# f 45.0
# dtype: float64
11,series的缺失值处理
s1 = pd.Series([13, 45, 36, 24], index=['c', 'd', 'b', 'a'])
s3 = pd.Series([13, 45, 36, 24, 56], index=['c', 'f', 'b', 'a', 'e'])
s = s1 + s3
# a 48.0
# b 72.0
# c 26.0
# d NaN
# e NaN
# f NaN
# dtype: float64
# 判断某一行是不是缺失值
print(s.isnull())
# a False
# b False
# c False
# d True
# e True
# f True
# dtype: bool
print(s.notnull())
# a True
# b True
# c True
# d False
# e False
# f False
# dtype: bool
# 过滤缺失值
# ①扔掉缺失值
print(s[s.notnull()])
# a 48.0
# b 72.0
# c 26.0
# dtype: float64
print(s.dropna())
# a 48.0
# b 72.0
# c 26.0
# dtype: float64
# ②保留缺失值
# 将缺失值重新赋值
print(s.fillna(0)) # 将缺失值赋为0
# a 48.0
# b 72.0
# c 26.0
# d 0.0
# e 0.0
# f 0.0
# dtype: float64
# 将将缺失值填充为平均值
print(s.fillna(s.mean()))
# a 48.000000
# b 72.000000
# c 26.000000
# d 48.666667
# e 48.666667
# f 48.666667
# dtype: float64
print(s.fillna(round(s.mean(), 2))) # 保留两位小数(四舍五入)
# a 48.00
# b 72.00
# c 26.00
# d 48.67
# e 48.67
# f 48.67
# dtype: float64
三,数据结构dataframe👇
dataframe是一个表格型的数据结构,含有一组有序的列
1,DataFrame创建
df = pd.DataFrame({'name': ['hpl', 'laowang', 'xpz'], 'age': [18, 21, 45]})
print(df)
# name age
# 0 hpl 18
# 1 laowang 21
# 2 xpz 45
df1 = pd.DataFrame({'name': ['hpl', 'laowang', 'xpz'], 'age': [18, 21, 45]}, index=['a', 'b', 'c'])
print(df1)
# name age
# a hpl 18
# b laowang 21
# c xpz 45
df2 = pd.DataFrame({'name': pd.Series(['hpl', 'lw', 'xpz'], index=['a', 'b', 'c']),
'age': pd.Series([18, 23, 45, 22], index=['a', 'b', 'c', 'd'])})
print(df2)
# name age
# a hpl 18
# b lw 23
# c xpz 45
# d NaN 22
2,DataFrame的常用属性
df = pd.DataFrame({'name': pd.Series(['hpl', 'lw', 'xpz'], index=['a', 'b', 'c']),
'age': pd.Series([18, 23, 45, 22], index=['a', 'b', 'c', 'd'])})
print(df)
# name age
# a hpl 18
# b lw 23
# c xpz 45
# d NaN 22
print(df.index) # Index(['a', 'b', 'c', 'd'], dtype='object') 返回行索引
print(df.values) # 返回值
# [['hpl' 18]
# ['lw' 23]
# ['xpz' 45]
# [nan 22]]
print(df.T) # 转置
# a b c d
# name hpl lw xpz NaN
# age 18 23 45 22
print(df.columns) # Index(['name', 'age'], dtype='object') 返回列索引
print(df.describe()) # 返回描述
# age
# count 4.000000 个数
# mean 27.000000 平均值
# std 12.192894 标准差
# min 18.000000 最小值
# 25% 21.000000 1/4位上的数
# 50% 22.500000 中位数
# 75% 28.500000 3/4位上的数
# max 45.000000 最大值
3,索引和切片
data = pd.read_csv('vegetables.csv', header=None) # 读取文件,后面会详细讲解
print(data.head(6)) # 读取前六行数据
print(data[0]) # 读取第0列
print(data[0][1]) # 读取第0列第1行的元素(不推荐)
print(data.loc[1, 2]) # 读取第1行第2列的元素(推荐)
print(data.loc[: 3, :4])
# 0 1 2 3 4
# 0 沙蜜托 20.0 22.5 25.0 樱桃类
# 1 羊上脑 42.0 42.5 43.0 羊肉类
# 2 羊骨头 11.0 11.5 12.0 羊肉类
# 3 羊腩 32.0 32.0 32.0 羊肉类
print(data[:3]) # 读取前三行
4,数据对齐和数据缺失处理
data = pd.read_csv('test1.csv')
# print(data)
# name age
# 0 hpl 18.0
# 1 lw NaN
# 2 xpz 45.0
# 3 NaN 22.0
print(data['age'] + data['age'])
# 0 36.0
# 1 NaN
# 2 90.0
# 3 44.0
# Name: age, dtype: float64
print(data.fillna(0)) # 将缺失值填充为0
# name age
# 0 hpl 18.0
# 1 lw 0.0
# 2 xpz 45.0
# 3 0 22.0
print(data.dropna()) # 将缺失值所在的行直接删除
# name age
# 0 hpl 18.0
# 2 xpz 45.0
data2 = pd.read_csv('test2.csv')
print(data2)
# 0 hpl 18.0
# 1 NaN NaN
# 2 xpz 45.0
# 3 NaN 22.0
print(data2.dropna(how='all')) # how='all' 只有一行都是缺失值时才删除
# name age
# 0 hpl 18.0
# 2 xpz 45.0
# 3 NaN 22.0
print(data2.dropna(how='any')) # how='any' 和不传默认是一样的
# name age
# 0 hpl 18.0
# 2 xpz 45.0
print(data2.dropna(axis=1)) # axis=0 表示行(删除有缺失值的行),axis=1 表示列(删除有缺失值的列)
5,pandas的常用函数及排序
data = pd.read_csv('vegetables.csv', header=None)
# 求平均值
print(data.mean()) # 对每一列求平均值(默认)
# 1 8.242353
# 2 8.863975
# 3 9.520366
# dtype: float64
print(data.mean(axis=1)) # axis=1 按每一行求平均值
# 求和
print(data.sum()) # 默认按列求和
# 1 837151.05
# 2 900287.39
# 3 966954.97
# dtype: float64
print(data.sum(axis=1)) # axis=1 按行进行求和
print(data.max())
# 1 300.0
# 2 315.0
# 3 330.0
# dtype: float64
print(data.min())
# 1 0.2
# 2 0.0
# 3 0.3
# dtype: float64
print(data.var()) # 方差
# dtype: float64
# 1 279.696835
# 2 299.989904
# 3 324.831664
# dtype: float64
print(data.std()) # 标准差
# 1 16.724139
# 2 17.320217
# 3 18.023087
# dtype: float64
# 排序(排序中有NAN空值的不参与排序,统一放到最后面)
print(data.sort_values(by=1)) # by=1 按第一列从小到大进行排序
print(data.sort_values(by=1, ascending=False)) # ascending意思为上升,ascending=False即倒序
print(data.sort_index()) # 按标签进行排序,默认是升序
print(data.sort_index(ascending=False)) # 按标签进行降序
print(data.sort_index(axis=1)) # 按标签进行排序和按列进行排序(顺序由小到大)
print(data.sort_index(ascending=False, axis=1)) # 按标签进行排序和按列进行排序(倒序由大到小)
6,批量 处理 / 生成 时间对象
print(datetime.datetime.strptime('2021-05-19', '%Y-%m-%d')) # 字符串转时间格式 2021-05-19 00:00:00
# pandas 批量处理时间对象
print(pd.to_datetime(['19/05/2021', '2021-05-20', '2021/May/21']))
# DatetimeIndex(['2021-05-19', '2021-05-20', '2021-05-21'], dtype='datetime64[ns]', freq=None)
# pandas 批量生成时间
print(pd.date_range('2021-05-10', '2021-05-19')) # 设置起始和结束,默认是天
# DatetimeIndex(['2021-05-10', '2021-05-11', '2021-05-12', '2021-05-13',
# '2021-05-14', '2021-05-15', '2021-05-16', '2021-05-17',
# '2021-05-18', '2021-05-19'],
# dtype='datetime64[ns]', freq='D')
print(pd.date_range('2021-05-01', periods=9)) # periods=9 指定长度,生成9个,默认是天
# DatetimeIndex(['2021-05-01', '2021-05-02', '2021-05-03', '2021-05-04',
# '2021-05-05', '2021-05-06', '2021-05-07', '2021-05-08',
# '2021-05-09'],
# dtype='datetime64[ns]', freq='D')
# freq='' S/s按秒,T/t按分, H/h按小时, W/w按周, B/b按工作日(周一至周五),W-MON/w-mon按每周一,M/m按月,SM/sm按半个月,Y/y按年,1h20min
print(pd.date_range('2021-05-01', periods=10, freq='1h30min'))
# DatetimeIndex(['2021-05-01 00:00:00', '2021-05-01 01:30:00',
# '2021-05-01 03:00:00', '2021-05-01 04:30:00',
# '2021-05-01 06:00:00', '2021-05-01 07:30:00',
# '2021-05-01 09:00:00', '2021-05-01 10:30:00',
# '2021-05-01 12:00:00', '2021-05-01 13:30:00'],
# dtype='datetime64[ns]', freq='90T')
7,pandas 的时间序列
data = pd.read_csv('vegetables.csv', header=None, index_col=6, parse_dates=学了pandas,就根本停不下来......