pandas一些常用函数以及操作的使用和理解(持续更新)

Posted Icy Hunter

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了pandas一些常用函数以及操作的使用和理解(持续更新)相关的知识,希望对你有一定的参考价值。

pandas库的一些用法

1.创建DataFrame(下面都缩写为df)

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
print(f)

结果

2.df.columns(更改列名)

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
print(f)

结果

3.df[]列索引、df.loc[]行索引、df.iloc[]、df[]布尔索引、df.at[行索引, 列索引]

列索引和行索引

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
print("f[0]:")
print(f[0])
f.columns = ["A", "B", "C", "D", "E"]
print('f["A"]')
print(f["A"])
print('f.loc["c"]')
print(f.loc["c"])  # 用列名行名取
print('f.iloc[0]')
print(f.iloc[0][0])  # 用index取

结果

布尔索引

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
print(f[f["A"] % 10 == 5])  # 选择A列值mod 10 = 5的数据

结果

取出除第一列的所有列

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
print(f)
f = f.loc[:, list(f.columns)[1:]]
print(f)

结果

df.at能够进行特定位置赋值

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
f.at["c", "A"] = 100
f.at["aaa", "bbb"] = 100
print(f)

结果

删除全为0的列

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
f.iloc[1][0] = 0
f.at["d", "A"] = 0
f.loc["b"]["A"] = 0
print(f)
f = f.loc[:, (f != 0).any(axis=0)]
print(f)

结果

4.pd.read_csv()、df.to_csv()

读存取csv文件,十分方便

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
f.to_csv("a.csv", index=0, encoding="utf-8")  # 行索引不保存
#  f.to_csv("a.csv", encoding="utf_8_sig")  # 这个保存用excel打开中文不会乱码
ff = pd.read_csv("a.csv", encoding="utf-8", sep=",") # 默认就是逗号相隔,sep="\\t"就可以读tsv了

print(ff)

csv文件

5.df.sort_value()

对列值进行排序,ascending=False表示降序排

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
f.iloc[0][1] = 20
f = f.sort_values(by=["A", "B"], ascending=False)  #  首先按A列降序,其次按B列降序
print(f)

结果

6.df.describe()

用于计算一些常用的统计数

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
f = f.describe()
print(f)

结果

7.df.applymap(), df.astype()数据类型转换

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
f = f.applymap(str)  # 全部转换为字符型
f.iloc[0][0] = "Yes"
f.iloc[1][0] = "yES"
f.iloc[2][0] = "No"
f.iloc[3][0] = "nO"
f["A"] = f["A"].str.lower()  # 全都转换成小写
print(f)
f["A"] = (f["A"] == "yes").astype("int")  # yes为1,其他为0
print(f)

结果

8.df.列名.str.split(分隔符,expand=True)、df.groupby().size()

拆列成多列、分组求和

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
f = f.applymap(str)  # 全部转换为字符型
f.iloc[0][0] = "浙江 嘉兴"
f.iloc[1][0] = "浙江 杭州"
f.iloc[2][0] = "福建 泉州"
f.iloc[3][0] = "浙江 温州"
print(f)
print("-------------")
shen_shi = f.A.str.split(" ", expand=True)  # 以空格切分
print(shen_shi)
print("------------")
group_by = shen_shi.groupby(shen_shi[0]).size()
print(group_by)

结果

9.df.drop()

删除某列数据

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
f = f.drop(columns=["A"])
print(f)

结果

10.df.dropna()

删除含有缺失值的列

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
f.at["e", "A"] = 0
print(f)
f = f.dropna()
print(f)

结果

10.df.apply() (无敌神器)

apply函数可以说是最灵活的函数了,我感觉是相干啥都行,只要你能写出对于操作的函数就可以

import pandas as pd
import numpy as np
import re
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
f = f.applymap(str)  # 全部转换为字符型
f.iloc[0][0] = "2021年12月4号"
f.iloc[1][0] = "2021年11月4号"
f.iloc[2][0] = "2021年10月14号"
f.iloc[3][0] = "2021年7月14号"
f["时间"] = f.apply(lambda x: "-".join([t for t in re.findall(r"(.*)年(.*)月(.*)号", x[0])[0]]), axis=1)
print(f)

这个示例的思想就是先通过re正则表达式将A列中的时间(年、月、日各自提取出来 即re.findall(r"(.)年(.)月(.*)号", x[0], x[0]就代表是这一行的第一个元素,因为后面axis=1表示对每行各自进行操作),时间提取出来之后在用-连接起来,形成一个新的时间,最后赋值给时间那列。
运行结果

这里有一些小细节值得注意,如果是对Dataframe进行apply操作那么,Dataframe.apply(lambda x:fun(x[col_name]), axis=1)也是可以的,即可以用x[0…]来索引对应数值也可以用对应列名来索引,但是如果Series使用时,例如df[col].apply(lambda x:fun(x[0]), axis=1)就是不正确的,因为这里已经确定传入是指定的列了,df[col].apply(lambda x:fun(x), axis=1)这样才能运行成功。

11.df.isnull()

查找缺失值
下面以tsv文件为例写了一个检查缺失情况的函数,只需要把tsv的文件名输入就可展示缺失字段以及数量。

def check(file):
    f = pd.read_csv(file, sep="\\t", encoding="utf-8")
    col = f.columns
    null_col = []  # 有缺失值的列
    for c in col:
        if len(f[f[c].isnull()][c]) != 0:
            null_col.append(c)
            print("字段:", c, "缺失数:", len(f[f[c].isnull()][c]))
 file = "文件名"
 check(file)

我自己随便一个文件的运行结果

# 查看数据每一行的缺失值个数的频次
f.isnull().sum(axis=1).value_counts()

12.df.value_counts()

用于查看数据的频次分布
df.value_counts(normalize=True)能够查看相对频率(概率密度)

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
f.columns = ["A", "B", "C", "D", "E"]
f = f.applymap(str)  # 全部转换为字符型
f.iloc[0][0] = "浙江 嘉兴"
f.iloc[1][0] = "浙江 杭州"
f.iloc[2][0] = "福建 泉州"
f.iloc[3][0] = "浙江 温州"
shen_shi = f.A.str.split(" ", expand=True)  # 以空格切分
print(shen_shi)
print(shen_shi[0].value_counts())

结果

13.df.append()用于df增加一行

import pandas as pd
f = pd.DataFrame()
t = pd.Series([5, 2, 5, 5])
f = f.append(t, ignore_index=True)
print(f)

结果

14.df.drop_duplicates()删除重复数据

import pandas as pd
import numpy as np
f = pd.DataFrame(np.arange(20).reshape(4, 5), index=Pandas 函数方法汇总一览查询(持续补充改进)

Pandas 函数方法汇总一览查询(持续补充改进)

Pandas里面常用的一些数据分析函数总结

总结--- Numpy和Pandas库常用函数

Pandas常用操作方法

NumPy和Pandas常用库