Pandas: Descriptive Statistics Examples

Posted by 致于数据科学家的小陈


Overview

Jupyter notebook: https://nbviewer.jupyter.org/github/chenjieyouge/jupyter_share/blob/master/share/pandas- 描述性统计.ipynb

import numpy as np
import pandas as pd

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data: missing values are simply excluded from the computation. Consider a small DataFrame:

df = pd.DataFrame([
    [1.4, np.nan],
    [7.6, -4.5],
    [np.nan, np.nan],
    [3, -1.5]
], index=list('abcd'), columns=['one', 'two'])

df

   one  two
a  1.4  NaN
b  7.6 -4.5
c  NaN  NaN
d  3.0 -1.5

Calling DataFrame's sum method returns a Series containing column sums:

"默认axis=0, 行方向, 下方, 展示每列, 忽略缺失值"
df.sum()

df.mean()
"在计算平均值时, NaN 不计入样本"
默认axis=0, 行方向, 下方, 展示每列, 忽略缺失值
one    12.0
two -6.0
dtype: float64
one    4.0
two -3.0
dtype: float64
在计算平均值时, NaN 不计入样本

Passing axis='columns' or axis=1 sums across the columns instead:

"按行统计, aixs=1, 列方向, 右边"
df.sum(axis=1)
按行统计, aixs=1, 列方向, 右边
a    1.4
b 3.1
c 0.0
d 1.5
dtype: float64

NA values are excluded unless the entire slice (row or column in this case) is NA. This can be disabled with the skipna option. In other words, statistical computations skip missing values automatically; they are not counted in the sample.

"默认是忽略缺失值的, 要缺失值, 则手动指定一下"
df.mean(skipna=False, axis=columns) # 列方向, 行哦
默认是忽略缺失值的, 要缺失值, 则手动指定一下
a     NaN
b 1.55
c NaN
d 0.75
dtype: float64

See Table 5-7 for a list of common options for each reduction method.

Method    Description
axis      Axis to reduce over; 0 for a DataFrame's rows and 1 for its columns
skipna    Exclude missing values; True by default
level     Reduce grouped by level if the axis is hierarchically indexed (MultiIndex)
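The level option only applies when the axis has a hierarchical index (MultiIndex); in current pandas versions the same result is obtained with groupby(level=...). A minimal sketch, with a made-up two-level index just for illustration:

midx = pd.MultiIndex.from_tuples(
    [('x', 1), ('x', 2), ('y', 1), ('y', 2)], names=['key', 'num'])
ser = pd.Series([1.0, 2.0, np.nan, 4.0], index=midx)

# Sum within each outer-level group; NaN is still skipped
ser.groupby(level='key').sum()

key
x    3.0
y    4.0
dtype: float64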

Some methods, like idxmax and idxmin, return indirect statistics, such as the index label at which the minimum or maximum value is attained.

"idxmax() 返回最大值的第一个索引标签"
df.idxmax()
idxmax() 返回最大值的第一个索引标签
one    b
two d
dtype: object

Other methods are accumulations, such as the cumulative sum (default axis=0, down the rows):

"累积求和, 默认axis=0, 忽略NA"
df.cumsum()

"也可指定axis=1列方向"
df.cumsum(axis=1)
累积求和, 默认axis=0, 忽略NA

one

two

a

1.4

NaN

b

9.0

-4.5

c

NaN

NaN

d

12.0

-6.0

也可指定axis=0列方向

one

two

a

1.4

NaN

b

7.6

3.1

c

NaN

NaN

d

3.0

1.5
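Other accumulations follow the same pattern; for instance, a cumulative product or a running maximum over the same df (a minimal sketch):

# Cumulative product down each column, skipping NA
df.cumprod()

# Running maximum down each column
df.cummax()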

Another type of method is neither a reduction nor an accumulation. describe is one such example, producing multiple summary statistics in one shot; it gives a descriptive-statistics summary of each column:

"describe() 返回列变量分位数, 均值, count, std等常用统计指标"
" roud(2)保留2位小数"

df.describe().round(2)
describe() 返回列变量分位数, 均值, count, std等常用统计指标
 roud(2)保留2位小数

one

two

count

3.00

2.00

mean

4.00

-3.00

std

3.22

2.12

min

1.40

-4.50

25%

2.20

-3.75

50%

3.00

-3.00

75%

5.30

-2.25

max

7.60

-1.50
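describe() also accepts a percentiles argument if you want quantiles other than the default 25%/50%/75% (the median is always included); a minimal sketch:

# Report the 10% and 90% quantiles in addition to the median
df.describe(percentiles=[0.1, 0.9]).round(2)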

On non-numeric data, describe produces alternative summary statistics: for categorical fields it detects the type automatically and returns a categorical summary.

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

# describe() summarizes a categorical Series with count / unique / top / freq
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object
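For a fuller breakdown than describe() gives on categorical data, value_counts() tallies every distinct value; a minimal sketch using the same obj:

# Frequency of each distinct value, sorted in descending order
obj.value_counts()

a    8
b    4
c    4
dtype: int64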

See Table 5-8 for a full list of summary statistics and related methods.

Method          Description
count           Number of non-NA values
describe        Summary statistics for a Series or each DataFrame column
min, max        Minimum and maximum values
argmin, argmax  Integer positions at which the minimum or maximum value is attained
idxmin, idxmax  Index labels at which the minimum or maximum value is attained
quantile        Sample quantile
sum             Sum of values
mean            Mean of values
median          Median (50% quantile) of values
var             Sample variance of values
std             Sample standard deviation of values
skew            Sample skewness of values
kurt            Sample kurtosis of values
cumsum          Cumulative sum of values
cumprod         Cumulative product of values
diff            Compute first arithmetic difference (useful for time series)
pct_change      Compute percent changes
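diff and pct_change from the table are not demonstrated above; a minimal sketch on a small made-up Series:

ts = pd.Series([100, 110, 121])

# First arithmetic difference: each value minus the previous one
ts.diff()

0     NaN
1    10.0
2    11.0
dtype: float64

# Percent change relative to the previous value
ts.pct_change()

0    NaN
1    0.1
2    0.1
dtype: float64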

df.idxmax()

one    b
two    d
dtype: object

df['one'].argmax()

FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
'b'
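Given the deprecation warning above, the index label is best obtained with idxmax, and the integer position with NumPy's nanargmax, which skips the NaN (a minimal sketch):

# Index label of the maximum value
df['one'].idxmax()                # 'b'

# Integer position of the maximum value, ignoring the NaN
np.nanargmax(df['one'].values)    # 1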

Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let's consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance using the add-on pandas-datareader package. If you don't have it installed already, it can be obtained via conda or pip:

conda install pandas-datareader  # or: pip install pandas-datareader

I use the pandas_datareader module to download some data for a few stock tickers:

import pandas_datareader.data as web

# Dictionary comprehension (downloads the data; requires network access):
# all_data = {ticker: web.get_data_yahoo(ticker)
#             for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

# Binary (pickled) data is read back with read_pickle() and saved with to_pickle()
returns = pd.read_pickle("../examples/yahoo_volume.pkl")

returns.tail()

                AAPL     GOOG       IBM      MSFT
Date
2016-10-17  23624900  1089500   5890400  23830000
2016-10-18  24553500  1995600  12770600  19149500
2016-10-19  20034600   116600   4632900  22878400
2016-10-20  24125800  1734200   4023100  49455600
2016-10-21  22384800  1260500   4401900  79974200

The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance: corr gives the correlation coefficient, cov the covariance.

returns.describe()

               AAPL          GOOG           IBM          MSFT
count  1.714000e+03  1.714000e+03  1.714000e+03  1.714000e+03
mean   9.595085e+07  4.111642e+06  4.815604e+06  4.630359e+07
std    6.010914e+07  2.948526e+06  2.345484e+06  2.437393e+07
min    1.304640e+07  7.900000e+03  1.415800e+06  9.009100e+06
25%    5.088832e+07  1.950025e+06  3.337950e+06  3.008798e+07
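The corr and cov calls themselves are cut off in the excerpt above; a minimal sketch of how they would typically be applied to this volume DataFrame:

# Correlation of the overlapping, non-NA, index-aligned values of two Series
returns['AAPL'].corr(returns['IBM'])

# Covariance of the same pair
returns['AAPL'].cov(returns['IBM'])

# Full correlation and covariance matrices of the DataFrame
returns.corr()
returns.cov()

# Pairwise correlation of every column with a single Series
returns.corrwith(returns['IBM'])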

