机器学习常用模块

Posted 2020-11-10 寂寞的小乞丐

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了机器学习常用模块相关的知识，希望对你有一定的参考价值。

注释：基础不牢固，特别不牢固，项目无从下手！
　　　这次花一个星期的时间把Python的基础库学习一下，一来总结过去的学习，二来为深度学习打基础。
　　　部分太简单，或者映象很深的就不记录了，避免浪费时间。

**博客园的makedown真是无语了,排版好久,上传就是这个鬼模样

1. python基础

(1).字符串

print("abc".upper())#转为大写
print("ABC".lower())#转为小写
print("abc"+"abc")#合并
print("abc   ".rstrip())#删除最右边(结尾)的空白(不改变"abc "的档案,只是结果删除)
print("   abc".lstrip())#删除最左边(开头)空白
print("abc".title())#首字母变成大写

ABC
abc
abcabc
abc
abc
Abc

(3).常数

#整数和浮点数不能直接相乘
3*0.1

0.3000000000004

(4).列表

A.读写列表

list[0] #第一个元素
list[-1]#最后一个元素
list[:] #打印全部元素
list[0:-1]#打印不包括最后一个元素
list[a:b:c]#打印a-b,间隔为c
list[0::-1]#颠倒数据

B.操作列表

list.append(\'data\')#正常添加一个元素
list.extend("data")#添加一个列表数据
list.insert(num,\'data\')#从num位置插入一个数据
del list[num]#删除num位置元素
list.remove(\'data\')#删除名字为"data"的数据
list.sort(bool)#排序列表(永久)
sorted(list)#排序列表(临时排序)
list.reverse()#颠倒列表,可用上面"读写列表的list[0::-1]进行颠倒"
len(list)#列表长度
min(list)
max(list)

C.复制列表

list1 = list2#浅复制
list1 = list2[1:2]#深度复制

D.列表推导式

# 循环列表
values = [10,20,4,50.31]
squres = []
for x in values:
     squres.append(x)
print(squres)

[10, 20, 4, 50.31]

#列表推导式
values = [10,20,4,50.31]
squres = [x for x in values]
print(squres)

[10, 20, 4, 50.31]

#推导式中加入条件进行筛选
values = [10,20,4,50.31]
squres = [x**2 for x in values if x<=12]
squres

[100, 16]

#推导式生成集合和字典
values = [10,20,4,50.31]
squres_set = {x**2 for x in values if x<=12}
squres_dict = {x:x**2 for x in values}
print(squres_set)
squres_dict

{16, 100}
{4: 16, 10: 100, 20: 400, 50.31: 2531.0961}

#求和表达式
values = [10,20,4,50.31]
sum(x**2 for x in values)

3047.0961

(5).判断

A if B else C #B==True-->>A  B==False-->>C

(6).字典

dict.items() #得到[(key,value),(key,value)...]
dict.keys()	 #得到key
dict.values()#得到values
del dict[key]#删除某个键-值对
dict[key]=value#更改某个键值对

(7).循环

break	#跳出大循环
continue#跳出小循环

(8).函数

#内部函数
def fuction(num):
    return (num)
fuction1 = fuction(\'hello\')
print(fuction1)    
    
#外部函数调用
import fuction
fuction1 = fuction.fuction(\'hello\')
print(fuction1)

#调用外部特定函数
from fuction import fuction
fuction1 = fuction(\'hello\')
print(fuction1)

#调用函数之后指定别名
import fuction as fuc
fuction1 = fuc.fuction(\'hello\')
print(fuction1)

from fuction import fuction as fuc
fuction1 = fuc(\'hello\')
print(fuction1)

(9).类

#内部创建类
#self相当于C++的this指针
class Dog():
    #num1和num2是输入的形参
    def __init__(self,num1,num2):
        #num3和num4是函数内部使用的参数
        self.num3 = num1
        self.num4 = num2
    def    printf(self):
        print(self.num3)
    def sum1(self):
        print(self.num3+self.num4)
my_dog = Dog(32,\'jake\')

#外部创建类
import dog 
my_dog = dog.Dog(23,\'rake\')

from dog import Dog
my_dog = Dog(23,\'rake\')

#类的继承
class Dog():
    def __init__(self,num1,num2):
        self.num3 = num1
        self.num4 = num2
    def    printf(self):
        return self.num3
    def sum1(self):
        print(self.num3+self.num4)
class Cdog(Dog):
    def __init__(self,num1,num2):
            super().__init__(num1,num2)
            self.num3 = super().printf()
my_dog = Cdog(23,\'rake\')
print(my_dog.num3)

(10).文件操作

#读取文件
\'\'\'with的作用是不再访问程序的时候关闭程序\'\'\'
\'\'\'close()的作用是直接关闭程序，用 with 比较好\'\'\'
with open(\'test.txt\') as file_object:
    contents = file_object.read()
    print(contents)
#读取文件内容成一行
with open(\'test.txt\') as file_object:
    contents = file_object.readlines()#行读取
print(contents[:42] + \'...\')

#写入文件
#\'w\':格式化以后输入
#\'a\':在原来基础上输入
with open(\'test.txt\',\'w\') as file_object:
    file_object = \'hello world\'
\'\'\'先输入int、float类型的时候得转化成str\'\'\'

#简单扩展

#zip(a,b)返回一个tuple
a = [1,2,3]
b = [4,5,6]
ab = zip(a,b)
ab = [(1, 4), (2, 5), (3, 6)]

#lambda
#x,y为自变量，x+y为具体运算
fun = lambda x,y:x+y
num = fun(x,y)

#map是把函数和参数绑定在一起
map(函数(迭代器),参数)#把函数和参数卸载一起了
>>> def fun(x,y):
    return (x+y)
>>> list(map(fun,[1],[2]))
"""
[3]
"""
>>> list(map(fun,[1,2],[3,4]))
"""
[4,6]
"""

2. OS模块

os模块主要是对系统_文件目录等进行操作

在学习TF的时候用到了os的相关操作如下:

os.listdir(path)----#返回指定目录下的所有文件和目录名。在读取文件的时候经常用到！
os.getcwd()-------#函数得到当前工作目录，即当前Python脚本工作的目录路径。读写文件常用！
os.path.join(path,name)---#连接目录与文件名或目录;使用“\\”连接，保存文件路径常用
注意一下：#name前面如果没有‘\\’，系统会自动添加！如：print(os.path.join("C:\\wjy",name.txt))----->>>>C\\:wjy\\name.txt

举两个例子:

#把某个文件夹的内部文件(二层目录)路径全部写到一个txt文档之中
import os 
import re

def createFileList(path,txt_path):
    fw = open(txt_path,\'w+\')
    image_files = os.listdir(path)
    for i in range(len(image_files)):
        dog_categories = os.listdir(path+\'/\'+image_files[i])
        for each_image in dog_categories:
            fw.write(path+\'/\'+image_files[i]+\'/\'+ each_image + \' %d\\n\'%i)
    print(\'生成txt文件成功\\n\')
    fw.close()

path = \'dog_10_images\'
txt_path = \'train.txt\'
createFileList(path,txt_path)

# TF模型保存的路径和文件名。
MODEL_SAVE_PATH = "saves_model_path"
MODEL_NAME = "model.ckpt"
saver.save(sess, os.path.join(MODEL_SAVE_PATH, MODEL_NAME),global_step=global_step)

os常用命令如下,就不一一列举了:

os.sep#可以取代操作系统特定的路径分隔符。windows下为 “\\\\”
os.name#字符串指示你正在使用的平台。比如对于Windows，它是\'nt\'，而对于Linux/Unix用户，它是\'posix\'。
os.getcwd()#函数得到当前工作目录，即当前Python脚本工作的目录路径。
os.getenv()#获取一个环境变量，如果没有返回none
os.putenv(key, value)#设置一个环境变量值
os.listdir(path)#返回指定目录下的所有文件和目录名。
os.remove(path)#函数用来删除一个文件。
os.system(command)#函数用来运行shell命令。
os.linesep#字符串给出当前平台使用的行终止符。例如，Windows使用\'\\r\\n\'，Linux使用\'\\n\'而Mac使用\'\\r\'。
os.curdir:#返回当前目录（\'.\')
os.chdir(dirname):#改变工作目录到dirname
========================================================================================
os.path#常用方法：
os.path.isfile()和os.path.isdir()#函数分别检验给出的路径是一个文件还是目录。
os.path.exists()#函数用来检验给出的路径是否真地存在
os.path.getsize(name):#获得文件大小，如果name是目录返回0L
os.path.abspath(name):#获得绝对路径
os.path.normpath(path):#规范path字符串形式
os.path.split(path) ：#将path分割成目录和文件名二元组返回。
os.path.splitext():#分离文件名与扩展名
os.path.join(path,name):#连接目录与文件名或目录;使用“\\”连接
os.path.basename(path):#返回文件名
os.path.dirname(path):#返回文件路径

3. argparse模块

用于命令行参数操作,常使用在远程演示操作!

以下是一个简单的操作:

创建ArgumentParser()对象
调用add_argument()方法添加参数
使用parse_args()解析添加的参数

定位参数:

#定位参数，必须指定
parser = argparse.ArgumentParser()
parser.add_argument("integer",type=int,help="display an integer",name="--h")
args = parser.parse_args()
print(args.integer)

选择参数:

#选择参数，带有"--"形式的，是可选的参数
    parser = argparse.ArgumentParser()
    parser.add_argument("--square",type = int ,help = "display a square of a given number")
    parser.add_argument("--cubic",type = int,help = "display a cubic of a given number")
    args = parser.parse_args()
    if args.square:
        print(args.square**2)
    if args.cubic:
        print(args.cubic**3)

混合使用:

parser = argparse.ArgumentParser(description = "Process some integer")
parser.add_argument("integers",metavar="N",type=int,nargs="+",help="an integer for the accumulator")
parser.add_argument("--sum",dest="accumulate",action="store_const",const=sum,default=max,help="sum the integers(default: find the max)")
args = parser.parse_args()
print(args.accumulate(args.integers))

以最后一个例子说明:

description:描述我们这程序是干什么的，Str输入

　　metavar：是在命令行解释器的时候用来说明参数的

　　nargs：说明有多个定位参数，也就是输入 1 3 5 7 9 可以把这五个数都识别

具体参数指定如下:

具体参数指定如下：

    name or flags - #选项字符串的名字或者列表，例如 foo 或者 -f, --foo。
    action - #命令行遇到参数时的动作，默认值是 store。nargs - 应该读取的命令行参数个数，可以是具体的数字，或者是?号，当不指定值时对于 Positional argument 使用 default，对于 Optional argument 使用 const；或者是 * 号，表示 0 或多个参数；或者是 + 号表示 1 或多个参数。
    store_const，#表示赋值为const；
    append，#将遇到的值存储成列表，也就是如果参数重复则会保存多个值;
    append_const，#将参数规范中定义的一个值保存到一个列表；
    count，#存储遇到的次数；此外，也可以继承 argparse.Action 自定义参数解析；
    nargs - #应该读取的命令行参数个数，可以是具体的数字，或者是?号，当不指定值时对于 Positional argument 使用 default，对于 Optional argument 使用 const；或者是 * 号，表示 0 或多个参数；或者是 + 号表示 1 或多个参数。
    const - action 和 nargs 所需要的常量值。
    default - #不指定参数时的默认值。
    type - #命令行参数应该被转换成的类型。
    choices - #参数可允许的值的一个容器。
    required - #可选参数是否可以省略 (仅针对可选参数)。
    help - #参数的帮助信息，当指定为 argparse.SUPPRESS 时表示不显示该参数的帮助信息.
    metavar - #在 usage 说明中的参数名称，对于必选参数默认就是参数名称，对于可选参数默认是全大写的参数名称.
    dest - #解析后的参数名称，默认情况下，对于可选参数选取最长的名称，中划线转换为下划线.

4. Re模块

re模块主要是操作正则表达式内容,它能帮助我们方便的检查一个字符串是否与某种模式匹配

不用可以去学习,大概了解就行了,到时候用的时候再去查询

语法说明

语法图片

函数说明

编译函数

将字符串形式的正则表达式编译为Pattern对象

compile(pattern，flags= 0)	使用任何可选的标记来编译正则表达式的模式，然后返回一个正则表达式对象

re 模块函数和正则表达式对象的方法*

match(pattern，string，flags=0)	尝试使用带有可选的标记的正则表达式的模式来匹配字符串。如果匹配成功，就返回匹配对象；如果失败，就返回 None
search(pattern，string，flags=0)	使用可选标记搜索字符串中第一次出现的正则表达式模式。如果匹配成功，则返回匹配对象；如果失败，则返回 None
findall(pattern，string[, flags] )①	查找字符串中所有（非重复）出现的正则表达式模式，并返回一个匹配列表
finditer(pattern，string[, flags] )②	与 findall()函数相同，但返回的不是一个列表，而是一个迭代器。对于每一次匹配，迭代器都返回一个匹配对象
split(pattern，string，max=0)③	根据正则表达式的模式分隔符， split函数将字符串分割为列表，然后返回成功匹配的列表，分隔最多操作 max 次（默认分割所有匹配成功的位置）

sub(pattern，repl，string，count=0)③	使用 repl 替换所有正则表达式的模式在字符串中出现的位置，除非定义 count，否则就将替换所有出现的位置（另见 subn()函数，该函数返回替换操作的数目）
purge()	清除隐式编译的正则表达式模式

常用的匹配对象方法（查看文档以获取更多信息)

group(num=0)	返回整个匹配对象，或者编号为 num的特定子组
groups(default=None)	返回一个包含所有匹配子组的元组（如果没有成功匹配，则返回一个空元组）
groupdict(default=None)	返回一个包含所有匹配的命名子组的字典，所有的子组名称作为字典的键（如果没有成功匹配，则返回一个空字典）

常用的模块属性（用于大多数正则表达式函数的标记）

re.I、 re.IGNORECASE	不区分大小写的匹配
re.L、 re.LOCALE	根据所使用的本地语言环境通过\\w、\\W、\\b、\\B、\\s、\\S实现匹配
re.M、 re.MULTILINE	^和$分别匹配目标字符串中行的起始和结尾，而不是严格匹配整个字符串本身的起始和结尾
re.S、 rer.DOTALL	“.” （点号）通常匹配除了\\n（换行符）之外的所有单个字符；该标记表示“.” （点号）能够匹配全部字符
re.X、 re.VERBOSE	通过反斜线转义，否则所有空格加上#（以及在该行中所有后续文字）都被忽略，除非在一个字符类中或者允许注释并且提高可读性

下面举例说明几个常用函数和主要事项:

寻找位置

#span代表匹配的位置
#<_sre.SRE_Match object; span=(0, 3), match=\'www\'>
print(re.search(\'www\', \'www.runoob.com\').span())  # 在起始位置匹配
print(re.search(\'com\', \'www.runoob.com\').span())  # 不在起始位置匹配

(0, 3)
(11, 14)

分割数据

#  *的作用是匹配前一个字符1次或者无数次
\'\'\' "e"="e" 只能匹配一次\'\'\'
print(re.split(r\'e\',\'one1two2three3four4\'))
#  out:[\'on\', \'1two2thr\', \'\', \'3four4\']
\'\'\' "e+"="e"+"ee"+"eee"+"eee..." 匹配很多个类型\'\'\'
print(re.split(r\'e+\',\'one1two2three3four4\'))
#  out:[\'on\', \'1two2thr\', \'3four4\']
\'\'\' \\d+ 代表匹配很多次数字 \'\'\'
print(re.split(r\'\\d+\',\'one1two2three3four4\'))
# [\'one\', \'two\', \'three\', \'four\', \'\']

搜索字符串(findall和group)

# \'\\d+\' 匹配多个数字
print (p.findall(r\'\\d+\',\'one1two2three3four4\'))
# [\'1\', \'2\', \'3\', \'4\']
p = re.finditer(r\'\\d+\',\'one1two2three3four4\')
for i in p:
    print(i.group())
# 1
# 2
# 3
# 4

代替

# \'\\w+\' 匹配很多个单词,也就是遇到空格结束(换句话说匹配单词)
# r\'(\\w+) (\\w+)\' 匹配中间带空格的两个单词
# \\<number> 分组匹配到的字符串
# r\'\\2 \\1\' 1和2的位置调换
p = re.compile(r\'(\\w+) (\\w+)\')
s = \'i say, hello world!\'
 
print (re.sub(r\'(\\w+) (\\w+)\',r\'\\2 \\1\', s))
 
def func(m):
    return m.group(1).title() + \' \' + m.group(2).title()
 
print (p.sub(func, s))
# say i, world hello!
# I Say, Hello World!

5. Time模块

time模块顾名思义与时间相关,我们平时主要使用日期和计时

(1).日期

time.time() :时间间隔是以秒为单位的浮点小数,后面的解析都是基于此函数.

time.localtime(time.time()): 获取当前时间元组

time.asctime( time.localtime(time.time())) :获取格式化的时间

time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) :格式化成2016-03-20 11:45:39形式

time.strftime("%a %b %d %H:%M:%S %Y", time.localtime()):格式化成Sat Mar 28 22:24:24 2016形式

time.mktime(time.strptime(time.strftime("%a %b %d %H:%M:%S %Y", time.localtime()),"%a %b %d %H:%M:%S %Y")):将格式字符串转换为time.time()

1528094659.1950338
time.struct_time(tm_year=2018, tm_mon=6, tm_mday=4, tm_hour=14, tm_min=46, tm_sec=7, tm_wday=0, tm_yday=155, tm_isdst=0)
\'Mon Jun  4 14:45:08 2018\'
\'2018-06-04 14:45:25\'
\'Mon Jun 04 14:46:24 2018\'
1528094841.0

(2). 计时

time.time()计时计算的是程序开始到程序结束的时间包括CPU+其它程序运行时间,time.clock()只是计算的是CPU运行的时间

time.clock()但是这个函数在windows下返回的是真实时间（wall time）

Linux中建议用time.time()计时,在windows中建议用time.clock()计时

time.time()计时

import time
start = time.time()
time.sleep(2)
end = time.time()
print(end-start)
# 2.002422332763672

time.clock()计时

import time
start = time.clock()
time.sleep(2)
end = time.clock()
print(str(end-start))
#0.003894999999999982

6. Numpy模块

numpy模块属于机器学习四剑客之一,Pandas+Numpy+Scipy+matplotlib,这个不用介绍了.

(1)数组属性

import numpy as np
a = np.ones((3,4),dtype=np.float32)
#元素个数
print(a.size)
#数组形状
print(a.shape)
#数据维度
print(a.ndim)
#数据类型
print(a.dtype)

12
(3, 4)
2
float32

(2)创建数组

#简单创建数组
np.array(object)    #指定大小和形状
#np.ndarray()
np.ones()           #指定形状
np.zeros()          #指定形状
np.linspace()       #指定大小和形状
np.range()          #指定大小和形状

#利用random创建
np.random.rand(1,2,3)          #大小[0-1],形状[1,2,3]
np.random.random((5,5))        #大小(0-1],形状[5,5]
np.random.sample((5,5))        #大小(0-1],形状[5,5]
np.random.randint(1,5,(5,5))   #大小[1-5],形状[5,5]
np.random.normal(10,1,(5,5))   #给定均值/标准差/维度的正态分布,以10为中心左右随机范围[-1,1],产生reshape=（5,5），符合高斯分布
np.random.uniform(0,10,(5,5))  #大小[0-10],形状[5,5]
N + S * np.random.random((5,5))#大小[N-S],形状[5,5]

(3)索引切片

假设a.dtype = [2,3,4]---->>>#三维数组　

　　　　　　　　　　[:,:,:]  = [0:2,0:3,0:4] --->>>#读取全部数据

　　　　　　　　　　[:,:,:-1] = [0:2,0:3,0:3]--->>>#少了最后一个元素

　　　　　　　　　　[1,2,3] = [1:2,2:3,3:4]--->>>#读取一个元素

　　　　　　　　　　[...,:] = [:,:,:] --->>>...#代表全部的意思

　　　　　　　　　　[a:b:c,d:e:f,i:j:k]--->>>#a,d,i 代表初始位置，b,e,j 代表终止位置，c,f,k 代表间隔值

(4)条件运算

#简单的条件运算
stus_score = np.array([[80, 88], [82, 81], [84, 75], [86, 83], [75, 81]])
stus_score > 80
#out
array([[False,  True],
       [ True,  True],
       [ True, False],
       [ True,  True],
       [False,  True]])

#三目运算
import numpy as np
\'\'\'np.where()的用法\'\'\'

#第一种用法
\'\'\'相当于if else操作
np.where(condition,x,y)
x if condition else y
\'\'\'
#例如
a = np.arange(10)
c = np.where(a>3,a*3,0)
c = array([ 0,  0,  0,  0, 12, 15, 18, 21, 24, 27])


#第二种用法
\'\'\'求位置坐标\'\'\'
#例如
a = np.array([1,2,3,4,5])
c = np.where(a>3)
c = (array([3, 4], dtype=int64),)#得到a>3的列坐标，行坐标为零省略不写

a = array([[1, 2, 3],
       [5, 6, 7]])
b = np.where(a>1)#五个元素大于1，他们的行坐标：[0, 0, 1, 1, 1]，列坐标：[1, 2, 0, 1, 2]
b = (array([0, 0, 1, 1, 1], dtype=int64), array([1, 2, 0, 1, 2], dtype=int64))

import numpy as np

alpha = np.array([[1,0,3,4],[2,0,4,5],[6,7,8,9]])#3X4
\'\'\'--------得到非零数的坐标-----------\'\'\'
#方法一：直接求解，这里不演示很简单！
#方法二：
np.nonzero(alpha)
#方法三：
np.nonzero(alpha>0)
\'\'\'--------得到大于0小于8的坐标---------\'\'\'
#alpha>0代表bool矩阵
#>>> (alpha>0)*(alpha<9)
#array([[ True, False,  True,  True],
#       [ True, False,  True,  True],
#       [ True,  True,  True, False]], dtype=bool)
#方法一：
np.nonzero((alpha>0)*(alpha<8))
#方法二：
np.nonzero(np.multiply(alpha>0,alpha<8))

#>>> np.nonzero(np.multiply((alpha>0),(alpha<9)))
#(array([0, 0, 0, 1, 1, 1, 2, 2, 2]), array([0, 2, 3, 0, 2, 3, 0, 1, 2]))
#第一行代表一维坐标，第二行代表二维坐标--->>（0,0）（0.2）（0,3）。。。

(5)统计运算

#返回指定轴的最值
np.amax()
np.max()
np.amin()
np.min()
#返回指定轴最值的坐标
np.argmax()
np.argmin()
#指定轴求平均值
np.mean()
#求方差
np.std()

(6)类型/格式转换

np.astype()#类型转换
np.reshape()#维度转换
np.ravel()#返回的是原来数据
np.flatten()#返回的是拷贝

(7)复制

#完全不复制　　　
import numpy as sp
a = np.arange(12)
b = a
b.shape = 3,4
print(a.shape)
#out
(3, 4)

#不完全复制，不同的数组对象分享同一个数据。视图方法创造一个新的数组对象指向同一数据。
#数据指向同一个地方，视图不是一个！
a = np.arange(12)
b = a.view()
print(b)
#a is b
a[:,] = 0
print(b)
#out
[ 0  1  2  3  4  5  6  7  8  9 10 11]
[0 0 0 0 0 0 0 0 0 0 0 0]
	
import numpy as sp
a = np.arange(12)
b = a.view()
b.shape = 3,4
print(a.shape)
#out
(12,)

#完全复制，视图和数据都不一样
import numpy as np
a = np.arange(12)
b = a.copy()
b.shape = 3,4
b[:,] =10
print(a)
#out
[ 0  1  2  3  4  5  6  7  8  9 10 11]

(8)特殊操作

学习机器学习的时候看到的一个小模块，当然实现的方法很多。

给定一个一个样本集[[1,2,3],[4,5,6],[7,8,9]....]，假设含有100个样本，现在要对样本进行训练，每个样本训练一次，且每个样本都是随机的，怎么取合适？

#!/usr/bin/python3
import numpy as np

#本函数作用是随机且不重复的执行完输入的数据
def choose_random(dataSet):
    #dataSet:输入为一维数据，如果是多维数据得变成一维数据之后传入
    numArr = list(range(len(dataSet)))#下标
    returnList = []
    for count in range(len(dataSet)):
        numRandom = int(np.random.uniform(0,len(numArr)))
        \'\'\'-------------处理数据---------------\'\'\'
        returnList.append(dataSet[numArr[numRandom]])
        del(numArr[numRandom])
    return returnList
if __name__ == \'__main__\':    
    print(choose_random(list(np.arange(0,5,0.5))))

Pandas模块

padndas表格处理模块,相对于numpy是对统一的数据进行操作,pandas是对表格(不通的数据类型)进行的操作

pandas的基本构成是数据帧,其中索引是X轴,数据是Y轴.

索引	data1	data2
1	name	pad1
2	classs	numpy
3	school	matplotlib

(1).创建数据

数据的创建有两种:Series(单行数据)和__DataFrame__(多行数据)

pd.Series([\'a\',\'b\',\'c\',\'d\',\'e\'])
#out:
0    a
1    b
2    c
3    d
4    e
dtype: object
#其中0-4是索引,a-d是数据

web_stats= {\'Day\':[1,2,3,4,5,6],
             \'Visitors\':[43,34,65,56,29,76],
             \'Bounce Rate\':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats,index=list(\'abcdef\'))
#out

	Bounce Rate 	Day 	Visitors
a 	65 	1 	43
b 	67 	2 	34
c 	78 	3 	65
d 	65 	4 	56
e 	45 	5 	29
f 	52 	6 	76
#其中0-5是索引,后面都是数据

(2).预览数据

预览数据包括:__直接预览__和__可修改__预览两类

#查看数据开始或者结尾n行数据
df.head(n)
df.tail(n)
#查看索引/列标签/数据
df.index()
df.columns()
df.values()
#查看df基本信息描述,包括(中值/均值/方差...)
df.describe()
df.T#数据转置
df.sort_index(axis=1,ascending=False)#按照某个轴进行排序
df.sort_value(by="B")#按照数据B排列

(3)数据选择

使用panda过程中推荐使用优化过的pandas数据索引方式：at,iat,loc,iloc,ix

#选择单独一列，得到一个Series数据结构,df[\'A\'] 等同于df.A
df[\'A\']
df[0:3]#通过行切片进行
df.loc[\'a\':\'b\']#通过标签进行选择,注意里面的数据和标签对的上
df.at[\'c\',\'A\']#通过行列返回一个标量
df.iloc[0:3]#通过整数位置访问
df[df.A>0.5]#通过布尔类型选择
df[df[\'E\'].isin([\'two\',\'four\'])]#通过isin方法过滤
df = df.reindex(index=[0:4]]#查看数据

(4)数据赋值&更新

pandas中使用Nan代表缺失的数据

通过loc/at
df.dropna(how=\'any\')#删除所有包含Nan的行
df.fillna(value=1)#填充所有Nan的值
pd.isnull(df)#获得所有Nan的掩码(Bool)

(5)Operation

#apply函数可以根据函数,操作DataFrame的行列等
df = pd.DataFrame(np.arange(0,16).reshape(4,4),columns=list(\'ABCD\'),index=list(\'abcd\'))
def f(x):
    return x.max()
print(df,\'\\n\')
print(df.loc[\'a\':\'c\',"A":\'B\'].apply(f))
#out:
     A   B   C   D
a   0   1   2   3
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15 

A    8
B    9
dtype: int64

#直方图
s = pd.Series(np.random.randint(0, 7, size=10)) 
s 0 1 1 1 2 6 3 4 4 6 5 5 6 4 7 2 8 1 9 3 dtype: int64 
s.value_counts() 1 3 6 2 4 2 5 1 3 1 2 1 dtype: int64

#字符串操作,Series在str属性中封装了一系列的字符串操作方法
s = pd.Series([\'A\', \'B\', \'C\', \'Aaba\', \'Baca\', np.nan, \'CABA\', \'dog\', \'cat\']) 
s.str.lower() 
0 a 1 b 2 c 3 aaba 4 baca 5 NaN 6 caba 7 dog 8 cat dtype: object

#通过concat实现pandas对象的连接
#join,SQL风格的数据合并

#append,增加

Grouping

Grouping使用步骤：

Splitting ：基于一些标准把数据分成组
Applying：分别对每个组应用操作函数
Combining：将结果构建成一个数据结构

(6)数据展示

In[95]: ts = pd.Series(np.random.randn(1000), index=pd.date_range(\'1/1/2000\', periods=1000))
ts = ts.cumsum()
ts.plot()
plt.show()

在DataFrame中，plot是一个便利的绘图工具

df = pd.DataFrame(np.random.randn(1000, 4), In[99]: In[100]: index=ts.index,columns=[\'A\', \'B\', \'C\', \'D\'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc=\'best\')
plt.show()

Matplotlib模块

参考资料

注释:这个博文是针对之前的博文进行的总结,有些参考资料已经丢失,如果有侵权的地方请告知,马上删除或者添加遗漏的参考链接!

http://wiki.jikexueyuan.com/project/explore-python/Standard-Modules/argparse.html%E3%80%80

莫凡python

Re模块1

Re模块2

time模块1

time模块2

Numpy模块1

Numpy模块2

Pandas模块1

Pandas数据手册

以上是关于机器学习常用模块的主要内容，如果未能解决你的问题，请参考以下文章