Pandas: Descriptive Statistics Examples
Posted by 致于数据科学家的小陈
Overview
Jupyter notebook: https://nbviewer.jupyter.org/github/chenjieyouge/jupyter_share/blob/master/share/pandas-描述性统计.ipynb
import numpy as np
import pandas as pd
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame -> (pandas provides the common statistical functions, operating on a Series or on the rows/columns of a DataFrame; notably, missing values are excluded from the computation)
df = pd.DataFrame([
    [1.4, np.nan],
    [7.6, -4.5],
    [np.nan, np.nan],
    [3, -1.5]
],
    index=list('abcd'), columns=['one', 'two'])
df
|   | one | two  |
|---|-----|------|
| a | 1.4 | NaN  |
| b | 7.6 | -4.5 |
| c | NaN | NaN  |
| d | 3.0 | -1.5 |
Calling DataFrame's sum method returns a Series containing column sums:
"默认axis=0, 行方向, 下方, 展示每列, 忽略缺失值"
df.sum()
df.mean()
"在计算平均值时, NaN 不计入样本"
默认axis=0, 行方向, 下方, 展示每列, 忽略缺失值
one 12.0
two -6.0
dtype: float64
one 4.0
two -3.0
dtype: float64
在计算平均值时, NaN 不计入样本
Passing axis='columns' or axis=1 sums across the columns instead. -> (the axis argument sets the direction of the reduction)
"按行统计, aixs=1, 列方向, 右边"
df.sum(axis=1)
按行统计, aixs=1, 列方向, 右边
a 1.4
b 3.1
c 0.0
d 1.5
dtype: float64
NA values are excluded unless the entire slice (row or column in this case) is NA. This can be disabled with the skipna option: -> (missing values are skipped automatically unless skipna=False is passed)
"默认是忽略缺失值的, 要缺失值, 则手动指定一下"
df.mean(skipna=False, axis=columns) # 列方向, 行哦
默认是忽略缺失值的, 要缺失值, 则手动指定一下
a NaN
b 1.55
c NaN
d 0.75
dtype: float64
See Table 5-7 for a list of common options for each reduction method.
| Method | Description |
|--------|-------------|
| axis   | Axis to reduce over; 0 for the DataFrame's rows and 1 for its columns |
| skipna | Exclude missing values; True by default |
| level  | Reduce grouped by level if the axis is hierarchically indexed (MultiIndex) |
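As a quick illustration of the level option, here is a minimal sketch on a hypothetical MultiIndex Series. Note that recent pandas versions removed the level argument from the reduction methods themselves; the equivalent spelling is groupby(level=...):

# A small Series with a two-level (hierarchical) index -- made-up data
s = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=pd.MultiIndex.from_tuples(
                  [('x', 1), ('x', 2), ('y', 1), ('y', 2)]))

# Sum within each outer-level group; equivalent to s.sum(level=0)
# in the pandas versions Table 5-7 describes
s.groupby(level=0).sum()
# x    3.0
# y    7.0
# dtype: float64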
Some methods, like idxmin and idxmax, return indirect statistics, such as the index where the minimum or maximum value is attained.
"idxmax() 返回最大值的第一个索引标签"
df.idxmax()
idxmax() 返回最大值的第一个索引标签
one b
two d
dtype: object
Other methods are accumulations: -> (cumulative sums; default axis=0, down the rows)
"累积求和, 默认axis=0, 忽略NA"
df.cumsum()
"也可指定axis=1列方向"
df.cumsum(axis=1)
累积求和, 默认axis=0, 忽略NA
one | two | |
a | 1.4 | NaN |
b | 9.0 | -4.5 |
c | NaN | NaN |
d | 12.0 | -6.0 |
也可指定axis=0列方向
one | two | |
a | 1.4 | NaN |
b | 7.6 | 3.1 |
c | NaN | NaN |
d | 3.0 | 1.5 |
Another type of method is neither a reduction nor an accumulation. describe is one such example, producing multiple summary statistics in one shot: -> (describe() computes descriptive statistics for each column)
"describe() 返回列变量分位数, 均值, count, std等常用统计指标"
" roud(2)保留2位小数"
df.describe().round(2)
describe() 返回列变量分位数, 均值, count, std等常用统计指标
roud(2)保留2位小数
|       | one  | two   |
|-------|------|-------|
| count | 3.00 | 2.00  |
| mean  | 4.00 | -3.00 |
| std   | 3.22 | 2.12  |
| min   | 1.40 | -4.50 |
| 25%   | 2.20 | -3.75 |
| 50%   | 3.00 | -3.00 |
| 75%   | 5.30 | -2.25 |
| max   | 7.60 | -1.50 |
On non-numeric data, describe produces alternative summary statistics: -> (for categorical fields it automatically switches to a frequency-style summary)
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

# describe() on non-numeric data reports count, unique, top and freq
obj.describe()
count 16
unique 3
top a
freq 8
dtype: object
See Table 5-8 for a full list of summary statistics and related methods.
| Method | Description |
|--------|-------------|
| count | Number of non-NA values |
| describe | Set of summary statistics for a Series or each DataFrame column |
| min, max | Minimum and maximum values |
| argmin, argmax | Integer index positions at which the minimum or maximum value is attained |
| idxmin, idxmax | Index labels at which the minimum or maximum value is attained |
| quantile | Sample quantile |
| sum | Sum of values |
| mean | Mean of values |
| median | Median (50% quantile) of values |
| var | Sample variance |
| std | Sample standard deviation |
| skew | Sample skewness |
| kurt | Sample kurtosis |
| cumsum | Cumulative sum |
| cumprod | Cumulative product |
| diff | Compute first arithmetic difference (useful for time series) |
| pct_change | Compute percent changes |
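A few of the Table 5-8 methods are not demonstrated above; here is a minimal sketch on a small made-up Series (the values are hypothetical, chosen only for illustration):

s = pd.Series([100.0, 102.0, 99.0, 105.0])

# Sample quantile: the 0.5 quantile is the median
s.quantile(0.5)   # 101.0

# First arithmetic difference, useful for time series
s.diff()          # NaN, 2.0, -3.0, 6.0

# Percent change between consecutive values
s.pct_change()    # NaN, 0.02, -0.029412..., 0.060606...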
df.idxmax()
one b
two d
dtype: object
df['one'].argmax()
c:\python\python36\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: argmax is deprecated, use idxmax instead. The behavior of argmax will be corrected to return the positional maximum in the future. Use series.values.argmax to get the position of the maximum now.
'b'
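To get the positional maximum without the deprecation warning, you can drop down to the underlying NumPy array. A minimal sketch; np.nanargmax is used here because plain argmax would return the position of the NaN in row 'c':

# Integer position of the maximum, skipping NaN
np.nanargmax(df['one'].values)   # 1 -> the second row, label 'b'

# The label-based equivalent, with no warning
df['one'].idxmax()               # 'b'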
Correlation and Covariance
Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let's consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance using the add-on pandas-datareader package. If you don't have it installed already, it can be obtained via conda or pip:
conda install pandas-datareader  # or: pip install pandas-datareader
I use the pandas_datareader module to download some data for a few stock tickers:
import pandas_datareader.data as web
"字典推导式"
# all_data = ticker: web.get_data_yahoo(ticker)
# for ticker in [AAPL, IBM, MSFT, GOOG]
字典推导式
"读取二进制数据 read_pickle(), 存为 to_pickle()"
returns = pd.read_pickle("../examples/yahoo_volume.pkl")
returns.tail()
读取二进制数据 read_pickle(), 存为 to_pickle()
| Date       | AAPL     | GOOG    | IBM      | MSFT     |
|------------|----------|---------|----------|----------|
| 2016-10-17 | 23624900 | 1089500 | 5890400  | 23830000 |
| 2016-10-18 | 24553500 | 1995600 | 12770600 | 19149500 |
| 2016-10-19 | 20034600 | 116600  | 4632900  | 22878400 |
| 2016-10-20 | 24125800 | 1734200 | 4023100  | 49455600 |
| 2016-10-21 | 22384800 | 1260500 | 4401900  | 79974200 |
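As a side note on the pickle round-trip mentioned in the comment above, a minimal sketch (the output path here is hypothetical):

# to_pickle() writes a DataFrame to pandas' binary pickle format
returns.to_pickle("volume_copy.pkl")                 # hypothetical path
# read_pickle() restores it losslessly, index and dtypes included
pd.read_pickle("volume_copy.pkl").equals(returns)    # True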
The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance: -> (corr computes the correlation coefficient, cov the covariance)
returns.describe()
|       | AAPL         | GOOG         | IBM          | MSFT         |
|-------|--------------|--------------|--------------|--------------|
| count | 1.714000e+03 | 1.714000e+03 | 1.714000e+03 | 1.714000e+03 |
| mean  | 9.595085e+07 | 4.111642e+06 | 4.815604e+06 | 4.630359e+07 |
| std   | 6.010914e+07 | 2.948526e+06 | 2.345484e+06 | 2.437393e+07 |
| min   | 1.304640e+07 | 7.900000e+03 | 1.415800e+06 | 9.009100e+06 |
| 25%   | 5.088832e+07 | 1.950025e+06 | 3.337950e+06 | 3.008798e+07 |
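The original post breaks off before showing corr and cov themselves; here is a minimal sketch of the calls on the volume DataFrame loaded above (Series.corr, Series.cov, DataFrame.corr, and DataFrame.cov are the standard pandas methods; the numeric results depend on the data):

# Correlation of the overlapping, non-NA values of two Series
returns['MSFT'].corr(returns['IBM'])

# Covariance of the same pair
returns['MSFT'].cov(returns['IBM'])

# Full pair-wise correlation and covariance matrices of the DataFrame
returns.corr()
returns.cov()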