pandas: Index Reshaping
Posted by 致于数据科学家的小陈
Index reshaping (reshape)
import numpy as np
import pandas as pd
There are a number of basic operations for rearranging tabular data. These are alternately referred to as reshape or pivot operations.
Reshaping with hierarchical indexing
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
stack - folds the columns into the index (wide to long)
This "rotates" or pivots from the columns in the data to the rows.
unstack
This pivots from the rows into the columns.
I'll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                                     name='number'))
data
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5
Using the stack method on this data pivots the columns into the rows, producing a Series.
# stack turns each row into a Series and piles the rows on top of one another
result = data.stack()
result
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32
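To make the structure of the stacked result concrete, here is a minimal sketch of how its two index levels can be used for selection. It rebuilds the same small DataFrame as above; the variable names index_names, ohio_two and colorado_row are only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
result = data.stack()

# The row labels and the column labels are now the two levels of a MultiIndex.
index_names = list(result.index.names)

# A (state, number) tuple selects a single value.
ohio_two = result.loc[("Ohio", "two")]

# Selecting on the outer level alone gives back that row as a Series.
colorado_row = result.loc["Colorado"]
```

Selecting by the outer level is a quick way to check that no values were lost when the columns were folded into the index.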
From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:
# unstack turns the stacked Series back into a DataFrame
result.unstack()
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5
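The round trip can also be checked programmatically. The sketch below assumes the same small DataFrame as above and aligns the labels before comparing, in case a given pandas version orders them differently; round_trip and matches are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)

# stack folds the columns into the index, unstack unfolds them again.
round_trip = data.stack().unstack()

# Align rows and columns with the original before comparing values.
matches = round_trip.loc[data.index, data.columns].equals(data)
```

The axis names (state and number) survive the round trip as well, which is what lets unstack know where each level belongs.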
By default, the innermost level is unstacked (the same is true of stack). You can unstack a different level by passing a level number or name:
result.unstack(level=0)
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5
result.unstack(level='state')
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5
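As a small check of the two calls above, the following sketch confirms that the level number and the level name address the same level here, and that unstacking the outer level of this particular Series amounts to transposing the original DataFrame. The names by_number, by_name, same and is_transpose are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
result = data.stack()

# Level 0 is the outer level, which is named "state".
by_number = result.unstack(level=0)
by_name = result.unstack(level="state")
same = by_number.equals(by_name)

# Moving "state" to the columns leaves "number" on the rows: the transpose.
is_transpose = by_number.loc[data.columns, data.index].equals(data.T)
```

Using the level name is usually the more readable choice, since it keeps working if the order of the levels changes.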
Unstacking might introduce missing data if not all of the values in the level are found in each of the subgroups:
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
data2.unstack()  # like an outer join: the columns are the union of the inner labels
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
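If NaN is not a useful placeholder for the missing combinations, unstack accepts a fill_value argument. The sketch below rebuilds data2 from above and fills the gaps with 0; choosing 0 is only an assumption for illustration, and whether it is a sensible default depends on the data.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])
data2 = pd.concat([s1, s2], keys=["one", "two"])

# Default behaviour: combinations absent from the Series become NaN.
with_nan = data2.unstack()
missing_count = int(with_nan.isna().sum().sum())

# fill_value replaces those gaps with a chosen value instead.
filled = data2.unstack(fill_value=0)
```

Note that a filled value is indistinguishable from a real observation afterwards, so it is worth keeping the NaN version when that difference matters.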
%time data2.unstack().stack()
Wall time: 5 ms
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64
In the pandas version used here, stacking filters out missing data by default, so the round trip above returns only the original entries. Passing dropna=False keeps the missing combinations:
%time data2.unstack().stack(dropna=False)
Wall time: 3 ms
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
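The outputs above also show a side effect: the original Series was int64, but after the round trip it is float64, because NaN forces a floating-point dtype. A minimal sketch of restoring the original values follows. It calls dropna() explicitly rather than relying on stack to drop missing values, on the assumption that this default may differ between pandas versions.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])
data2 = pd.concat([s1, s2], keys=["one", "two"])

# Unstacking introduces NaN, so every column is upcast to float64.
wide = data2.unstack()

# Stack back, drop the missing combinations explicitly, then restore the
# integer dtype once no NaN values remain.
restored = wide.stack().dropna().astype("int64")
```

The cast back to int64 is only safe after the NaN entries are gone, which is why dropna() comes first.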
When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
df
side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10
df.unstack("state")
side   left          right
state  Ohio Colorado  Ohio Colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10
When calling stack, we can indicate the name of the axis to stack:
%time df.unstack('state').stack('side')
Wall time: 118 ms
state         Colorado  Ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
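The way the levels move in these two steps can be traced through the axis names. The sketch below rebuilds df from above and records where each named level ends up; wide and restacked are illustrative names, and the column order of the result may differ between pandas versions, so values are looked up by label.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
result = data.stack()
df = pd.DataFrame(
    {"left": result, "right": result + 5},
    columns=pd.Index(["left", "right"], name="side"),
)

# Unstacking "state" appends it below the existing "side" column level.
wide = df.unstack("state")
column_levels = list(wide.columns.names)

# Stacking "side" by name moves it from the columns to the row index.
restacked = wide.stack("side")
row_levels = list(restacked.index.names)
```

Addressing levels by name, as here, avoids having to track their numeric positions as they move between the two axes.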
Pivoting long to wide format
A common way to store multiple time series in databases and CSV files is in so-called long or stacked format. Let's load some example data and do a small amount of time series wrangling and other data cleaning:
%%time
data = pd.read_csv("../examples/macrodata.csv")
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 14 columns):
year 203 non-null float64
quarter 203 non-null float64
realgdp 203 non-null float64
realcons 203 non-null float64
realinv 203 non-null float64
realgovt 203 non-null float64
realdpi 203 non-null float64
cpi 203 non-null float64
m1 203 non-null float64
tbilrate 203 non-null float64
unemp 203 non-null float64
pop 203 non-null float64
infl 203 non-null float64
realint 203 non-null float64
dtypes: float64(14)
memory usage: 22.3 KB
Wall time: 142 ms
data.head()
     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1  1959.0
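To illustrate what long (stacked) format looks like and how it maps to wide format, here is a minimal sketch on a tiny made-up table. The dates and values are hypothetical; only the column names realgdp and infl are borrowed from the output above. It uses DataFrame.pivot, which is one possible way to do the conversion and not necessarily the approach the original text goes on to use.

```python
import pandas as pd

# Long format: one row per (date, item) observation. Values are made up.
long_data = pd.DataFrame({
    "date": pd.to_datetime(
        ["2000-03-31", "2000-03-31", "2000-06-30", "2000-06-30"]
    ),
    "item": ["realgdp", "infl", "realgdp", "infl"],
    "value": [10.0, 1.0, 11.0, 2.0],
})

# Wide format: one row per date, one column per item.
wide = long_data.pivot(index="date", columns="item", values="value")

# The same reshape written with the tools from the previous section:
# build a two-level index, then unstack the inner level.
alt = long_data.set_index(["date", "item"])["value"].unstack("item")
```

Both spellings give one column per distinct item, which ties the long-to-wide conversion back to unstack on a hierarchical index.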