pandas: Reshaping with Indexes

Posted by 致于数据科学家的小陈


Index reshaping (reshape)

import numpy as np 
import pandas as pd

There are a number of basic operations for rearranging tabular data. These are alternately referred to as reshape or pivot operations.

Reshaping with hierarchical indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

stack
  This "rotates" or pivots from the columns in the data to the rows, so the columns lengthen the index.

unstack
  This pivots from the rows into the columns.

I'll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(["Ohio", "Colorado"], name="state"),
                    columns=pd.Index(["one", "two", "three"],
                                     name="number"))

data

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

Using the stack method on this data pivots the columns into the rows, producing a Series.

# stack turns each row into a Series and piles the rows on top of one another
result = data.stack()

result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:

# unstack turns the stacked Series back into a DataFrame
result.unstack()

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

By default, the innermost level is unstacked (the same is true of stack). You can unstack a different level by passing a level number or name.

result.unstack(level=0)

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

result.unstack(level="state")

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5
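As a quick check of the two calls above, the following sketch rebuilds the same small DataFrame and confirms that passing the level by position and by name gives the same result, and that the default corresponds to the innermost level (level=-1). The variable names by_position, by_name, same, and default_is_innermost are illustrative only.

import numpy as np
import pandas as pd

```python
import numpy as np
import pandas as pd

# Rebuild the example frame and its stacked Series
data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
result = data.stack()

# Level 0 of the row index is named "state", so these two calls are equivalent
by_position = result.unstack(level=0)
by_name = result.unstack(level="state")
same = by_position.equals(by_name)

# With no argument, unstack pivots the innermost level, i.e. level=-1
default_is_innermost = result.unstack().equals(result.unstack(level=-1))

print(same, default_is_innermost)
```

Using the level name is usually easier to read than the position, and it keeps working if the order of the index levels changes later.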

Unstacking might introduce missing data if all of the values in the level aren't found in each of the subgroups.

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])

s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])

data2 = pd.concat([s1, s2], keys=["one", "two"])

data2
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
data2.unstack()  # behaves like an outer join: the columns are the union of the inner labels

       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

%time data2.unstack().stack()
Wall time: 5 ms

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

%time data2.unstack().stack(dropna=False)
Wall time: 3 ms

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
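The NaN entries above come from labels that exist in one subgroup but not the other. As a small sketch that goes slightly beyond the original example, unstack also accepts a fill_value argument that substitutes a chosen value for those gaps. The sentinel -1 and the names wide_nan and wide_filled are arbitrary choices for illustration.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])
data2 = pd.concat([s1, s2], keys=["one", "two"])

# Default behaviour: missing combinations become NaN
wide_nan = data2.unstack()

# With fill_value, missing combinations get the given value instead
wide_filled = data2.unstack(fill_value=-1)

print(wide_nan.isna().sum().sum())   # number of gaps introduced by unstacking
print(wide_filled)
```

A sentinel is only appropriate when it cannot be confused with real data; otherwise leaving NaN in place is the safer choice.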

When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:

df = pd.DataFrame({"left": result, "right": result + 5},
                  columns=pd.Index(["left", "right"], name="side"))
df

side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

df.unstack("state")

side   left          right
state  Ohio Colorado  Ohio Colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10
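To make the "lowest level" statement concrete, this sketch rebuilds df and inspects the names of the column levels after unstacking. The existing "side" level stays on top and the unstacked "state" level is appended beneath it. The variable name wide is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
result = data.stack()

df = pd.DataFrame(
    {"left": result, "right": result + 5},
    columns=pd.Index(["left", "right"], name="side"),
)

# Move the "state" row level into the columns
wide = df.unstack("state")

# Column levels from outermost to innermost
print(list(wide.columns.names))
# Individual cells are addressed with a (side, state) tuple
print(wide.loc["two", ("right", "Colorado")])
```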

When calling stack, we can indicate the name of the axis to stack:

%time df.unstack("state").stack("side")
Wall time: 118 ms

state         Colorado  Ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
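The combined call can be checked the same way. In this sketch, stacking by the name "side" moves that column level into the rows as the innermost row level, while "state" remains in the columns. The name tall is illustrative, and the order of the state columns may differ between pandas versions, so the code does not rely on it.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
result = data.stack()
df = pd.DataFrame(
    {"left": result, "right": result + 5},
    columns=pd.Index(["left", "right"], name="side"),
)

# Unstack "state" into the columns, then stack "side" back into the rows
tall = df.unstack("state").stack("side")

print(list(tall.index.names))   # row levels, outermost first
print(tall.columns.name)        # the level left in the columns
print(tall.loc[("one", "right"), "Colorado"])
```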

Pivoting "long" to "wide" format

A common way to store multiple time series in databases and CSV files is in so-called long or stacked format. Let's load some example data and do a small amount of time series wrangling and other data cleaning:

%%time

data = pd.read_csv("../examples/macrodata.csv")

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 14 columns):
year        203 non-null float64
quarter     203 non-null float64
realgdp     203 non-null float64
realcons    203 non-null float64
realinv     203 non-null float64
realgovt    203 non-null float64
realdpi     203 non-null float64
cpi         203 non-null float64
m1          203 non-null float64
tbilrate    203 non-null float64
unemp       203 non-null float64
pop         203 non-null float64
infl        203 non-null float64
realint     203 non-null float64
dtypes: float64(14)
memory usage: 22.3 KB
Wall time: 142 ms

data.head()

     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1  1959.0
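Because macrodata.csv itself is not reproduced here, the long-to-wide idea can be sketched on a tiny hand-made frame. Everything in it is hypothetical: the column names date, item, and value and the numbers are invented for illustration and are not taken from the file. The sketch uses DataFrame.pivot, which turns one row per (date, item) pair into one row per date with one column per item.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, item) observation
long_df = pd.DataFrame({
    "date": pd.to_datetime(
        ["1959-03-31", "1959-03-31", "1959-06-30", "1959-06-30"]
    ),
    "item": ["realgdp", "infl", "realgdp", "infl"],
    "value": [100.0, 1.5, 102.0, 1.7],   # made-up numbers
})

# Long to wide: dates become the index, each item becomes a column
wide = long_df.pivot(index="date", columns="item", values="value")
print(wide)

# The same result via the hierarchical-index route described earlier
wide_alt = long_df.set_index(["date", "item"])["value"].unstack("item")
print(wide.equals(wide_alt))
```

pivot requires each (date, item) pair to appear at most once; duplicate pairs raise a ValueError and need aggregating first, for example with pivot_table.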
