Data Analysis

Posted 2022-07-16 sikuaiqian


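Definition:

Data analysis extracts the information hidden in seemingly chaotic data and distills the underlying patterns of the subject under study.

The three core tools for data analysis: NumPy, Pandas, Matplotlib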

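NumPy

NumPy (Numerical Python) is an extension library for Python. It supports multi-dimensional arrays and matrix operations, and it provides a large collection of mathematical functions that operate on those arrays.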

Creating an ndarray

1. Create with np.array()

import numpy as np

# Create 1-D data
np.array([1,2,3,4,5])

--- array([1, 2, 3, 4, 5])

# Create 2-D data
np.array([[1,2,3],[4,5,6]])

--- array([[1, 2, 3],
       [4, 5, 6]])

np.array([[1,2.2,3],[4,5,6]])
---  array([[1. , 2.2, 3. ],
       [4. , 5. , 6. ]])
# By default, all elements of a NumPy ndarray share the same type
# If the input list mixes types, they are unified to a single type; priority: str > float > int
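The promotion rule above can be checked directly by inspecting dtype. This is a minimal sketch; the variable names are illustrative and not from the original post.

```python
import numpy as np

# Mixed int and float input is promoted to a float dtype
mixed_numeric = np.array([1, 2.2, 3])

# A single string promotes every element to a string dtype
mixed_text = np.array([1, 2.2, "3"])

# The dtype can also be requested explicitly instead of inferred
as_float = np.array([1, 2, 3], dtype=float)

print(mixed_numeric.dtype)     # a float dtype
print(mixed_text.dtype.kind)   # 'U' stands for a unicode string dtype
print(as_float)
```

Checking dtype right after construction is a cheap way to catch a stray string that has silently turned a numeric array into text.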

Loading image data

import matplotlib.pyplot as plt

# Read the photo and display it
img_arr = plt.imread('./meinv.jpg')
plt.imshow(img_arr)

Output:

[image: the photo rendered by plt.imshow]

2. Create with NumPy routine functions

# np.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None): evenly spaced values (an arithmetic sequence)

np.linspace(1,100,num=20)

---  array([  1.        ,   6.21052632,  11.42105263,  16.63157895,
        21.84210526,  27.05263158,  32.26315789,  37.47368421,
        42.68421053,  47.89473684,  53.10526316,  58.31578947,
        63.52631579,  68.73684211,  73.94736842,  79.15789474,
        84.36842105,  89.57894737,  94.78947368, 100.        ])


# np.arange generates values that increase by a fixed step; stop is excluded
#np.arange([start, ]stop, [step, ]dtype=None)

np.arange(0,100,step=2)

---  array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,
       34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66,
       68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98])


# np.random.randint(low, high=None, size=None, dtype='l')

np.random.seed(10)  # random seed: fixing it makes the random output reproducible
np.random.randint(0,100,size=(4,5))


---   array([[ 9, 15, 64, 28, 89],
       [93, 29,  8, 73,  0],
       [40, 36, 16, 11, 54],
       [88, 62, 33, 72, 78]])
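The three routines above differ mainly in what you fix in advance: the number of points (linspace), the step (arange), or the seed (randint). A small sketch with illustrative variable names:

```python
import numpy as np

# linspace: fix the number of points; the endpoint is included by default
points = np.linspace(0, 1, num=5)

# arange: fix the step; the stop value is excluded
steps = np.arange(0, 10, step=2)

# Re-seeding with the same value reproduces the same random draw
np.random.seed(10)
first_draw = np.random.randint(0, 100, size=(2, 3))
np.random.seed(10)
second_draw = np.random.randint(0, 100, size=(2, 3))

print(points)
print(steps)
print(np.array_equal(first_draw, second_draw))
```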

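ndarray attributes

Four attributes to remember: ndim (number of dimensions), shape (length of each dimension), size (total number of elements), dtype (element type)

# Show the number of dimensions
img_arr.ndim

--- 3

# Show the length of each dimension
img_arr.shape

---  (456, 730, 3)

# Show the total number of elements
img_arr.size

---  998640

# Show the element type: use img_arr.dtype; type() only reports the container type
type(img_arr)

---  numpy.ndarray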
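Since the photo file is not available here, the same four attributes can be explored on a stand-in array. This sketch assumes a tiny synthetic "image" (fake_img is a made-up name) with the same height × width × channel layout.

```python
import numpy as np

# Stand-in for an RGB photo: height 4, width 6, 3 colour channels
fake_img = np.zeros((4, 6, 3), dtype=np.uint8)

print(fake_img.ndim)    # number of dimensions
print(fake_img.shape)   # length of each dimension
print(fake_img.size)    # total number of elements = 4 * 6 * 3
print(fake_img.dtype)   # element type
print(type(fake_img))   # container type, always numpy.ndarray
```

Note the difference between the last two lines: dtype describes what is stored in the array, while type() describes the array object itself.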

Basic ndarray operations

Indexing

arr = np.random.randint(0,100,size=(5,7))
arr
---   array([[49, 51, 54, 77, 69, 13, 25],
       [13, 92, 86, 30, 30, 89, 12],
       [65, 31, 57, 36, 27, 18, 93],
       [77, 22, 23, 94, 11, 28, 74],
       [88,  9, 15, 18, 80, 71, 88]])

# Access data by index
# Select the second and third rows
arr[[1,2]]
---   array([[13, 92, 86, 30, 30, 89, 12],
       [65, 31, 57, 36, 27, 18, 93]])

# Select the second row
arr[1]
---   array([13, 92, 86, 30, 30, 89, 12])

# Select the element in the second row, third column
arr[1,2]
---  86

Slicing

# Get the first two rows of the 2-D array
arr[0:2]
---   array([[49, 51, 54, 77, 69, 13, 25],
       [13, 92, 86, 30, 30, 89, 12]])

# Get the first two columns of the 2-D array
arr[:,0:2] # arr[row, column]
---  array([[49, 51],
       [13, 92],
       [65, 31],
       [77, 22],
       [88,  9]])

# Get the first two rows and first two columns of the 2-D array
arr[0:2,0:2]
---   array([[49, 51],
       [13, 92]])

# Reverse the data, e.g. [1,2,3] ----> [3,2,1], by slicing with ::-1
# Reverse the order of the rows
arr[::-1]
---  array([[88,  9, 15, 18, 80, 71, 88],
       [77, 22, 23, 94, 11, 28, 74],
       [65, 31, 57, 36, 27, 18, 93],
       [13, 92, 86, 30, 30, 89, 12],
       [49, 51, 54, 77, 69, 13, 25]])

# Reverse the order of the columns
arr[:,::-1]
---   array([[25, 13, 69, 77, 54, 51, 49],
       [12, 89, 30, 30, 86, 92, 13],
       [93, 18, 27, 36, 57, 31, 65],
       [74, 28, 11, 94, 23, 22, 77],
       [88, 71, 80, 18, 15,  9, 88]])

# Reverse both rows and columns
arr[::-1,::-1]
---  array([[88, 71, 80, 18, 15,  9, 88],
       [74, 28, 11, 94, 23, 22, 77],
       [93, 18, 27, 36, 57, 31, 65],
       [12, 89, 30, 30, 86, 92, 13],
       [25, 13, 69, 77, 54, 51, 49]])


# Flip the image: the first call mirrors it left-right, the second turns it upside down
plt.imshow(img_arr[:,::-1,:])

plt.imshow(img_arr[::-1,:,:])

# Crop a region of the image
plt.imshow(img_arr[115:340,145:580,:])
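One detail worth knowing about the slices above: basic slicing returns a view onto the original data, not a copy. The sketch below uses a small made-up grid instead of the photo to show flipping, cropping, and when an explicit copy is needed.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)

flipped_rows = grid[::-1]        # rows in reverse order
flipped_cols = grid[:, ::-1]     # columns in reverse order
crop = grid[0:2, 1:3]            # rows 0-1, columns 1-2

# A basic slice shares memory with the original array
print(np.shares_memory(crop, grid))

# Take an explicit copy before editing if the original must stay intact
crop_copy = crop.copy()
crop_copy[0, 0] = -1
print(grid[0, 1])                # the original value is untouched
```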

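Reshaping

Use arr.reshape(). The new shape is usually passed as a tuple (separate integers also work), and it must keep the total number of elements unchanged.

Reshape a 1-D array into a multi-dimensional array
arr_1 = arr.reshape((35,))  # flatten the 5x7 arr first so that there is a 1-D array to reshape
arr_1.reshape((-1,5))  # -1 lets NumPy infer that dimension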
---   array([[49, 51, 54, 77, 69],
       [13, 25, 13, 92, 86],
       [30, 30, 89, 12, 65],
       [31, 57, 36, 27, 18],
       [93, 77, 22, 23, 94],
       [11, 28, 74, 88,  9],
       [15, 18, 80, 71, 88]])

arr.shape  # reshape returns a new array; arr itself is unchanged
--- (5, 7)

Reshape a multi-dimensional array into a 1-D array
all_l=arr.reshape((35,))
all_l
---   array([49, 51, 54, 77, 69, 13, 25, 13, 92, 86, 30, 30, 89, 12, 65, 31, 57,
       36, 27, 18, 93, 77, 22, 23, 94, 11, 28, 74, 88,  9, 15, 18, 80, 71,
       88])
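A short sketch of the reshape rules on a made-up 12-element array: -1 asks NumPy to infer one dimension, and a shape that does not divide the element count raises ValueError.

```python
import numpy as np

flat = np.arange(12)

table = flat.reshape(-1, 4)          # NumPy infers 3 rows
also_table = flat.reshape((3, 4))    # tuple form, same result
back = table.reshape(-1)             # back to one dimension

# 12 elements cannot be split into 5 equal rows
try:
    flat.reshape(5, -1)
except ValueError as err:
    message = str(err)
    print("reshape failed:", message)

print(table.shape, back.shape)
```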

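Concatenation

# Concatenation joins several NumPy arrays horizontally or vertically

# Works for 1-D, 2-D and higher-dimensional arrays; in practice it is mostly used on 2-D arrays
np.concatenate((arr,arr),axis=0) # axis=0 stacks vertically (more rows); axis=1 joins side by side (more columns)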
---  array([[49, 51, 54, 77, 69, 13, 25],
       [13, 92, 86, 30, 30, 89, 12],
       [65, 31, 57, 36, 27, 18, 93],
       [77, 22, 23, 94, 11, 28, 74],
       [88,  9, 15, 18, 80, 71, 88],
       [49, 51, 54, 77, 69, 13, 25],
       [13, 92, 86, 30, 30, 89, 12],
       [65, 31, 57, 36, 27, 18, 93],
       [77, 22, 23, 94, 11, 28, 74],
       [88,  9, 15, 18, 80, 71, 88]])

# Combine photos: tile the image into a 3x3 grid
arr_3 = np.concatenate((img_arr,img_arr,img_arr),axis=1)
arr_9 = np.concatenate((arr_3,arr_3,arr_3),axis=0)
plt.imshow(arr_9)
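The effect of axis is easiest to see from the resulting shapes. The sketch below uses small made-up arrays, including a tiny stand-in for the photo, rather than the real image.

```python
import numpy as np

a = np.ones((2, 3))
b = np.zeros((2, 3))

vertical = np.concatenate((a, b), axis=0)     # rows add up: 2 + 2
horizontal = np.concatenate((a, b), axis=1)   # columns add up: 3 + 3

# Same 3x3 tiling as the photo example, on a 2x3 "image" with 3 channels
tile = np.ones((2, 3, 3), dtype=np.uint8)
row_of_three = np.concatenate((tile, tile, tile), axis=1)
nine_grid = np.concatenate((row_of_three, row_of_three, row_of_three), axis=0)

print(vertical.shape, horizontal.shape, nine_grid.shape)
```

All dimensions other than the one being joined must match, which is why the tiling works: every tile has the same height, width, and channel count.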

ndarray aggregation

# Sum: np.sum
arr.sum(axis=1)  # axis=1 sums across each row
---   array([338, 352, 327, 329, 369])

# Maximum / minimum: np.max / np.min
Used the same way, e.g. arr.max(axis=1)

# Mean: np.mean()
Used the same way, e.g. arr.mean(axis=1)

# Other aggregation functions
Function Name    NaN-safe Version    Description
np.sum           np.nansum           Compute sum of elements
np.prod          np.nanprod          Compute product of elements
np.mean          np.nanmean          Compute mean of elements
np.std           np.nanstd           Compute standard deviation
np.var           np.nanvar           Compute variance
np.min           np.nanmin           Find minimum value
np.max           np.nanmax           Find maximum value
np.argmin        np.nanargmin        Find index of minimum value
np.argmax        np.nanargmax        Find index of maximum value
np.median        np.nanmedian        Compute median of elements
np.percentile    np.nanpercentile    Compute rank-based statistics of elements
np.any           N/A                 Evaluate whether any elements are true
np.all           N/A                 Evaluate whether all elements are true
np.power         N/A                 Element-wise power (not a reduction)
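The NaN-safe column in the table matters as soon as the data contains a missing value. A minimal sketch with a made-up 2x3 array:

```python
import numpy as np

data = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, 6.0]])

plain_sum = np.sum(data)                  # NaN propagates through the plain version
safe_sum = np.nansum(data)                # NaN is skipped
safe_col_mean = np.nanmean(data, axis=0)  # per-column mean, skipping NaN
best = np.nanargmax(data)                 # flat index of the largest non-NaN value

print(plain_sum, safe_sum, safe_col_mean, best)
```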

Sorting an ndarray

np.sort(arr,axis=0)  # axis=0 sorts the values within each column; arr itself is not modified
---  array([[13,  9, 15, 18, 11, 13, 12],
       [49, 22, 23, 30, 27, 18, 25],
       [65, 31, 54, 36, 30, 28, 74],
       [77, 51, 57, 77, 69, 71, 88],
       [88, 92, 86, 94, 80, 89, 93]])
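A small sketch of how axis changes the result of np.sort, plus np.argsort for when the ordering itself is needed. The scores array is made up for illustration.

```python
import numpy as np

scores = np.array([[9, 1, 8],
                   [3, 7, 2]])

by_column = np.sort(scores, axis=0)           # each column sorted top to bottom
by_row = np.sort(scores, axis=1)              # each row sorted left to right
descending = np.sort(scores, axis=1)[:, ::-1] # reverse the sorted rows
order = np.argsort(scores[0])                 # positions that would sort the first row

print(by_column)
print(by_row)
print(descending)
print(order)
```

np.sort returns a sorted copy, so scores keeps its original order; arr.sort() is the in-place variant.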

Pandas

Imports

import pandas as pd
from pandas import DataFrame,Series
import numpy as np

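Series

A Series is an object similar to a one-dimensional array. It consists of two parts:

values: a set of data (an ndarray)
index: the index labels associated with the data

Creating a Series

# Two ways to create a Series (only the first is shown here):
# 1. From a list or a NumPy array

#    The default index is the integers 0 to N-1

# Create a Series from a list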
Series(data=[1,2,3])
---   0    1
1    2
2    3
dtype: int64

# The index parameter sets custom index labels
s = Series(data=[1,2,3],index=['a','b','c'])
s
---   a    1
b    2
c    3
dtype: int64

Series indexing and slicing

s['c']
---   3

Series basics

# s.head(n) and s.tail(n) show the first n and last n values (n defaults to 5)
s.tail(2)
---  b    2
c    3
dtype: int64

# Remove duplicate values from a Series
s = Series(data=[1,1,2,2,3,4,5,6,6,6,7,6,6,7,8])
s.unique()
---   array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int64)

# When an index label has no matching value, the result is missing data, shown as NaN (not a number)
s1 = Series(data=[1,2,3,4],index=['a','b','c','d'])
s2 = Series(data=[1,2,3,4],index=['a','b','e','d'])
s = s1 + s2
s
---   a    2.0
b    4.0
c    NaN
d    8.0
e    NaN
dtype: float64

# Use pd.isnull() / pd.notnull(), or the methods s.isnull() / s.notnull(), to detect missing data
s.isnull()
---  a    False
b    False
c     True
d    False
e     True
dtype: bool

s.notnull()
---  a     True
b     True
c    False
d     True
e    False
dtype: bool

s[s.notnull()]
---  a    2.0
b    4.0
d    8.0
dtype: float64

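Operations between Series
Data with different index labels is aligned automatically during arithmetic
Where a label exists in only one operand, the result is NaN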
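If NaN is not the desired result for unmatched labels, the add method with fill_value is one alternative to the + operator. The sketch below reuses the same two Series as the example above; treating a missing label as 0 is an assumption that only makes sense for some data.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
s2 = pd.Series([1, 2, 3, 4], index=["a", "b", "e", "d"])

plain = s1 + s2                      # unmatched labels become NaN
filled = s1.add(s2, fill_value=0)    # a missing label counts as 0 instead
dropped = plain.dropna()             # or simply discard the NaN entries

print(plain)
print(filled)
print(dropped)
```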

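DataFrame

A DataFrame is a tabular data structure. It consists of multiple columns of data arranged in a fixed order, and it was designed to extend the Series use case from one dimension to multiple dimensions. A DataFrame has both a row index and a column index.

Row index: index
Column index: columns
Values: values

Creating a DataFrame

The most common approach is to pass a dict. The DataFrame uses the dict keys as the column names and the dict values (each an array) as the column data.

The DataFrame also adds a row index automatically.

When a DataFrame is created from a dict, the columns argument does not rename the columns; it only selects and orders the dict keys.

As with Series, if a name passed in columns does not match any dict key, that column is filled with NaN.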
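The dict-based creation described above can be sketched as follows. The subject names and scores are made up for illustration.

```python
import pandas as pd

scores = {"math": [90, 85, 70], "english": [88, 92, 79]}

# Dict keys become column names; a default 0..N-1 row index is added
df_default = pd.DataFrame(scores)

# index supplies custom row labels
df_labeled = pd.DataFrame(scores, index=["A", "B", "C"])

# columns selects and orders keys; "physics" is not a key, so it is all NaN
df_selected = pd.DataFrame(scores, columns=["english", "physics"])

print(df_default)
print(df_labeled)
print(df_selected)
```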

DataFrame(data=np.random.randint(60,100,size=(3,4)))

     0    1    2    3
0    69    81    60    79
1    64    60    70    64
2    87    94    98    98

df = DataFrame(data=np.random.randint(60,100,size=(3,4)),index=['A','B','C'],columns=['a','b','c','d'])
df
      a    b    c    d
A    92    67    79    68
B    84    66    61    66
C    84    79    66    82

# DataFrame attributes: values, columns, index, shape

df.values
array([[92, 67, 79, 68],
       [84, 66, 61, 66],
       [84, 79, 66, 82]])

df.index
Index(['A', 'B', 'C'], dtype='object')

df.columns
Index(['a', 'b', 'c', 'd'], dtype='object')

df.shape
(3, 4)

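DataFrame indexing

1. Indexing columns

    - Dictionary-style access: df['q'] (where 'q' is a column name)
    - Attribute-style access: df.q

Selecting a column returns a Series. The returned Series keeps the DataFrame's index, and its name attribute is already set to the column name.
df = DataFrame(data=np.random.randint(60,100,size=(3,4)),index=['A','B','C'],columns=['a','b','c','d'])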
df

     a    b    c    d
A    75    71    92    98
B    79    93    86    80
C    78    87    94    91

# Change the column labels (this relabels the columns; the data stays in place)
df.columns = ['a','c','b','d']
df

     a    c    b    d
A    75    71    92    98
B    79    93    86    80
C    78    87    94    91

# Get the first two columns
df[['a','c']]
     a    c
A    75    71
B    79    93
C    78    87

2. Indexing rows

- Use .loc[] with index labels to select rows
- Use .iloc[] with integer positions to select rows
Selecting a single row also returns a Series, whose index is the original columns
df
     a    c    b    d
A    75    71    92    98
B    79    93    86    80
C    78    87    94    91

df.iloc[0]
a    75
c    71
b    92
d    98
Name: A, dtype: int32

df.loc['A']
a    75
c    71
b    92
d    98
Name: A, dtype: int32

df.loc[['A','B']]
     a    c    b    d
A    75    71    92    98
B    79    93    86    80

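3. Indexing individual elements

- Use column indexing
- Use row indexing (iloc[3,1] or loc['C','q']): row index first, column index second

df.iloc[1,2]
86

df.loc[['B','C'],'b']
B    86
C    94
Name: b, dtype: int32

# With df[...], a label selects a column
# With df[...], a slice selects rows
df[0:2]
     a    c    b    d
A    75    71    92    98
B    79    93    86    80

# Slicing columns with loc and iloc, e.g. df.loc['B':'C','丙':'丁'] (where '丙' and '丁' stand for column labels)
df.iloc[:,0:2]
     a    c
A    75    71
B    79    93
C    78    87

df['a']
A    75
B    79
C    78
Name: a, dtype: int32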

- Summary:
    - Indexing:
        - Row: df.loc['A']
        - Column: df['a']
        - Element: df.iloc[1,2]
    - Slicing:
        - Rows: df[0:2]
        - Columns: df.iloc[:,0:2]
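The summary above can be run end to end on a fixed table. The sketch below rebuilds the same 3x4 frame with literal values so that the results are reproducible; it also shows that label slices with loc include both endpoints, unlike positional slices.

```python
import pandas as pd

df = pd.DataFrame(
    [[75, 71, 92, 98], [79, 93, 86, 80], [78, 87, 94, 91]],
    index=["A", "B", "C"],
    columns=["a", "c", "b", "d"],
)

row = df.loc["A"]                        # one row as a Series
col = df["a"]                            # one column as a Series
cell = df.iloc[1, 2]                     # by position: row 1, column 2
same_cell = df.loc["B", "b"]             # the same element by label

first_rows = df[0:2]                     # positional row slice, end excluded
first_cols = df.iloc[:, 0:2]             # positional column slice, end excluded
label_slice = df.loc["A":"B", "a":"c"]   # label slice, both ends included

print(cell, same_cell)
print(label_slice)
```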

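DataFrame operations

Operations between DataFrames

Same as with Series:

Data is aligned automatically on both row labels and column labels during arithmetic
Where labels do not match, the result is NaN

df + df
     a    c    b    d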
A    150    142    184    196
B    158    186    172    160
C    156    174    188    182
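Adding a frame to itself never shows the alignment behaviour, because every label matches. The sketch below uses two made-up frames that overlap in only one row and one column to make the NaN filling visible.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["A", "B"])
right = pd.DataFrame({"b": [10, 20], "c": [30, 40]}, index=["B", "C"])

# Only the cell (B, b) exists in both frames; everything else becomes NaN
total = left + right

# add() with fill_value treats a value missing from one frame as 0;
# cells missing from both frames are still NaN
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```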

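Worked problems

Use the tushare package to fetch the historical price data of a stock.

# Install first from a shell: pip install tushare
import tushare as ts
# code = stock code, start = start date; if end is omitted, data runs up to the most recent date
df = ts.get_k_data(code='600519',start='2000-01-01')

# Save to the local directory
df.to_csv('./600519.csv')

# Use the date column as the row index and parse it as a datetime type
df = pd.read_csv('./600519.csv',index_col='date',parse_dates=['date'])
# Drop the old integer index that to_csv wrote out as an unnamed column
df.drop(labels='Unnamed: 0',axis=1,inplace=True)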

List every date on which the stock closed at least 3% above its open.

# (close - open) / open >= 0.03
(df['close'] - df['open']) / df['open'] >= 0.03

# Use the boolean Series from the expression above as a row indexer: this selects every row that meets the condition
df.loc[(df['close'] - df['open']) / df['open'] >= 0.03]

df.loc[(df['close'] - df['open']) / df['open'] >= 0.03].index


DatetimeIndex(['2001-08-27', '2001-08-28', '2001-09-10', '2001-12-21',
               '2002-01-18', '2002-01-31', '2003-01-14', '2003-10-29',
               '2004-01-05', '2004-01-14',
               ...
               '2019-01-15', '2019-02-11', '2019-03-01', '2019-03-18',
               '2019-04-10', '2019-04-16', '2019-05-10', '2019-05-15',
               '2019-06-11', '2019-06-20'],
              dtype='datetime64[ns]', name='date', length=301, freq=None)

List every date on which the stock opened more than 2% below the previous day's close.

# (open - previous close) / previous close < -0.02
# shift(1) moves the close column down one row, so each row sees the previous day's close
(df['open'] - df['close'].shift(1)) / df['close'].shift(1) < -0.02

df.loc[(df['open'] - df['close'].shift(1)) / df['close'].shift(1) < -0.02]

df.loc[(df['open'] - df['close'].shift(1)) / df['close'].shift(1) < -0.02].index

DatetimeIndex(['2001-09-12', '2002-06-26', '2002-12-13', '2004-07-01',
               '2004-10-29', '2006-08-21', '2006-08-23', '2007-01-25',
               '2007-02-01', '2007-02-06', '2007-03-19', '2007-05-21',
               '2007-05-30', '2007-06-05', '2007-07-27', '2007-09-05',
               '2007-09-10', '2008-03-13', '2008-03-17', '2008-03-25',
               '2008-03-27', '2008-04-22', '2008-04-23', '2008-04-29',
               '2008-05-13', '2008-06-10', '2008-06-13', '2008-06-24',
               '2008-06-27', '2008-08-11', '2008-08-19', '2008-09-23',
               '2008-10-10', '2008-10-15', '2008-10-16', '2008-10-20',
               '2008-10-23', '2008-10-27', '2008-11-06', '2008-11-12',
               '2008-11-20', '2008-11-21', '2008-12-02', '2009-02-27',
               '2009-03-25', '2009-08-13', '2010-04-26', '2010-04-30',
               '2011-08-05', '2012-03-27', '2012-08-10', '2012-11-22',
               '2012-12-04', '2012-12-24', '2013-01-16', '2013-01-25',
               '2013-09-02', '2014-04-25', '2015-01-19', '2015-05-25',
               '2015-07-03', '2015-07-08', '2015-07-13', '2015-08-24',
               '2015-09-02', '2015-09-15', '2017-11-17', '2018-02-06',
               '2018-02-09', '2018-03-23', '2018-03-28', '2018-07-11',
               '2018-10-11', '2018-10-24', '2018-10-25', '2018-10-29',
               '2018-10-30', '2019-05-06', '2019-05-08'],
              dtype='datetime64[ns]', name='date', freq=None)
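Both filters above follow the same pattern: build a boolean Series, then pass it to loc. The sketch below reproduces that pattern on four made-up trading days so it runs without tushare or network access; the prices are invented and chosen so the results are easy to check by hand.

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"])
prices = pd.DataFrame(
    {"open": [100.0, 104.0, 100.0, 110.0],
     "close": [104.0, 103.0, 112.0, 111.0]},
    index=dates,
)

# Days that closed at least 3% above the open
intraday_change = (prices["close"] - prices["open"]) / prices["open"]
rise_days = prices.loc[intraday_change >= 0.03].index

# Days that opened more than 2% below the previous close.
# The first row has no previous close, so its gap is NaN and it is never selected.
prev_close = prices["close"].shift(1)
gap = (prices["open"] - prev_close) / prev_close
gap_down_days = prices.loc[gap < -0.02].index

print(rise_days)
print(gap_down_days)
```

Storing the mask in a named variable, instead of repeating the whole expression inside loc, keeps the condition readable and avoids computing it twice.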

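Suppose that starting on January 1, 2010, I buy 1 lot (100 shares) on the first trading day of every month and sell all my shares on the last trading day of every year. What is my return as of today?

# Restrict the data to 2010-2019
df = df.loc['2010':'2019']

# Resample the data
df_monthly = df.resample('M').first()  # first trading day of each month (newer pandas versions prefer 'ME')
df_yearly = df.resample('A').last()[:-1]  # last trading day of each year, dropping the unfinished final year (newer pandas versions prefer 'YE')

# Total money spent buying shares over these years
cost_money = df_monthly['open'].sum()*100

# Money received from the year-end sales, plus the value of the shares bought so far in the final year
# 1200 = 12 lots sold at the end of each full year; 800 = shares still held (hardcoded: 8 monthly purchases in this data set)
recv_money = df['open'].iloc[-1] * 800 + df_yearly['open'].sum()*1200

# Proceeds minus cost
recv_money - cost_money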
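The calculation above hardcodes 800 and 1200, which only holds for that particular download. The sketch below is one possible generalisation, not the author's method: it counts the purchases per year from the data and uses groupby instead of resample. The function name simulate_monthly_buy and the sample prices are made up, and like the original it trades at the open price and assumes rows are in chronological order.

```python
import pandas as pd

def simulate_monthly_buy(prices, lot=100):
    """Profit of buying one lot on each month's first trading day,
    selling all at each completed year's last trading day, and
    valuing the unfinished final year's shares at the latest open."""
    frame = pd.DataFrame({
        "open": prices["open"].to_numpy(),
        "year": prices.index.year.to_numpy(),
        "month": prices.index.month.to_numpy(),
    })
    # Open price on the first trading day of every (year, month)
    monthly_first = frame.groupby(["year", "month"])["open"].first()
    cost = monthly_first.sum() * lot

    # Number of monthly purchases made in each year
    buys_per_year = monthly_first.groupby(level="year").size()
    # Open price on the last trading day of every year
    yearly_last = frame.groupby("year")["open"].last()

    # Every year except the last one is treated as completed and sold off
    completed = yearly_last.index[:-1]
    proceeds = (yearly_last.loc[completed] * buys_per_year.loc[completed]).sum() * lot

    # Shares bought in the final year are still held
    final_year = yearly_last.index[-1]
    holdings = frame["open"].iloc[-1] * buys_per_year.loc[final_year] * lot
    return proceeds + holdings - cost

# Tiny made-up price history: two months in 2020, two months in 2021
dates = pd.to_datetime([
    "2020-11-02", "2020-11-30", "2020-12-01", "2020-12-31",
    "2021-01-04", "2021-01-29", "2021-02-01", "2021-02-26",
])
sample = pd.DataFrame(
    {"open": [10.0, 11.0, 12.0, 15.0, 14.0, 16.0, 18.0, 20.0]}, index=dates
)
profit = simulate_monthly_buy(sample)
print(profit)
```

On the sample data the cost is (10 + 12 + 14 + 18) × 100 = 5400, the 2020 sale brings in 15 × 200 = 3000, and the 200 shares still held are worth 20 × 200 = 4000, for a profit of 1600.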
