pandas: Index Reshaping

Posted 2022-08-23 by 致于数据科学家的小陈


Index reshaping (reshape)

import numpy as np 
import pandas as pd

There are a number of basic operations for rearranging tabular data. These are alternately referred to as reshape or pivot operations.

Reshaping with hierarchical indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

stack - folds the columns into the index, making it longer
This "rotates" or pivots from the columns in the data to the rows.

unstack
This pivots from the rows into the columns.

I'll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"],
    name="number"))

data

number	one	two	three
state
Ohio	0	1	2
Colorado	3	4	5

Using the stack method on this data pivots the columns into the rows, producing a Series.

# stack turns each row into a Series and piles the rows on top of each other
result = data.stack()

result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:

# unstack turns the stacked Series back into a DataFrame

result.unstack()

number	one	two	three
state
Ohio	0	1	2
Colorado	3	4	5

By default, the innermost level is unstacked (the same is true of stack). You can unstack a different level by passing a level number or name:

result.unstack(level=0)

state	Ohio	Colorado
number
one	0	3
two	1	4
three	2	5

result.unstack(level="state")

state	Ohio	Colorado
number
one	0	3
two	1	4
three	2	5
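The equivalence between a level number and a level name can be checked directly. The following is a minimal sketch that rebuilds the same `data` and `result` objects used above; the variable names `by_position`, `by_name`, `default` and `innermost` are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the small example frame from this section
data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
result = data.stack()

# Level 0 is the outer level, which is named "state"
by_position = result.unstack(level=0)
by_name = result.unstack(level="state")

# With no argument, unstack works on the innermost level (level=-1)
default = result.unstack()
innermost = result.unstack(level=-1)

same_outer = by_position.equals(by_name)
same_inner = default.equals(innermost)
print(same_outer, same_inner)
```

Passing the name is usually easier to read than the position, since it keeps working if the order of the index levels changes.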

Unstacking might introduce missing data if not all of the values in the level are found in each of the subgroups:

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])

s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])

data2 = pd.concat([s1, s2], keys=["one", "two"])

data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

data2.unstack()  # behaves like an outer join: the columns are the union of the inner labels

	a	b	c	d	e
one	0.0	1.0	2.0	3.0	NaN
two	NaN	NaN	4.0	5.0	6.0
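If the NaN placeholders are not wanted, `unstack` accepts a `fill_value` argument. The sketch below assumes that 0 is a sensible stand-in for a missing combination, which depends entirely on the data; it is shown only to illustrate the parameter.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])
data2 = pd.concat([s1, s2], keys=["one", "two"])

# Default behaviour: label combinations that do not exist become NaN
with_nan = data2.unstack()

# Assumption for illustration: treat a missing combination as 0
filled = data2.unstack(fill_value=0)

print(with_nan.isna().sum().sum())  # number of holes created by unstack
print(filled)
```

Filling in this way hides the difference between "missing" and "zero", so it is worth doing only when the two mean the same thing for the analysis.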

%time data2.unstack().stack()

Wall time: 5 ms

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

%time data2.unstack().stack(dropna=False)

Wall time: 3 ms

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
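The two outputs above show that dropping the missing entries is what makes the unstack/stack round trip recover the original Series. Here is a small sketch of that round trip. Because the default handling of missing values in `stack` differs between pandas versions, the sketch calls `dropna()` explicitly instead of relying on the default; `wide` and `restored` are illustrative names.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])
data2 = pd.concat([s1, s2], keys=["one", "two"])

# unstack creates NaN holes, so the values are upcast to float
wide = data2.unstack()

# Stack back and drop the holes explicitly (version-independent)
restored = wide.stack().dropna()

# Same labels and values as the original, apart from the float dtype
print(restored.to_dict() == data2.astype("float64").to_dict())
```

The integer dtype is not restored automatically; an `astype` call would be needed if the original dtype matters.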

When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:

df = pd.DataFrame({"left": result, "right": result + 5},
                  columns=pd.Index(["left", "right"], name="side"))

df

	side	left	right
state	number
Ohio	one	0	5
	two	1	6
	three	2	7
Colorado	one	3	8
	two	4	9
	three	5	10

df.unstack("state")

side	left	right
state	Ohio	Colorado	Ohio	Colorado
number
one	0	3	5	8
two	1	4	6	9
three	2	5	7	10
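The position of the unstacked level can be confirmed by inspecting the column level names. This is a minimal sketch that rebuilds `df` as above; `unstacked` is an illustrative name.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
result = data.stack()

df = pd.DataFrame(
    {"left": result, "right": result + 5},
    columns=pd.Index(["left", "right"], name="side"),
)

unstacked = df.unstack("state")

# "state" is appended below the existing "side" column level
print(list(unstacked.columns.names))

# Individual cells are addressed with a (side, state) tuple
print(unstacked.loc["two", ("right", "Colorado")])
```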

When calling stack, we can indicate the name of the axis to stack:

%time df.unstack("state").stack("side")

Wall time: 118 ms

	state	Colorado	Ohio
number	side
one	left	3	0
	right	8	5
two	left	4	1
	right	9	6
three	left	5	2
	right	10	7
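To see what naming the level changes, the sketch below compares the default `stack()` with `stack("side")` on the same unstacked frame. The names `default_stack` and `by_side` are illustrative, and the checks use labels instead of positions because the column order of the result can differ between pandas versions.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
result = data.stack()
df = pd.DataFrame(
    {"left": result, "right": result + 5},
    columns=pd.Index(["left", "right"], name="side"),
)
wide = df.unstack("state")  # column levels: side, state

# Default: the innermost column level ("state") moves into the index
default_stack = wide.stack()

# Named: the "side" level moves into the index instead
by_side = wide.stack("side")

print(list(default_stack.index.names))
print(list(by_side.index.names))
```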

Pivoting from long to wide format

A common way to store multiple time series in databases and CSV files is the so-called long or stacked format. Let's load some example data and do a small amount of time series wrangling and other data cleaning:

%%time

data = pd.read_csv("../examples/macrodata.csv")

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 14 columns):
year        203 non-null float64
quarter     203 non-null float64
realgdp     203 non-null float64
realcons    203 non-null float64
realinv     203 non-null float64
realgovt    203 non-null float64
realdpi     203 non-null float64
cpi         203 non-null float64
m1          203 non-null float64
tbilrate    203 non-null float64
unemp       203 non-null float64
pop         203 non-null float64
infl        203 non-null float64
realint     203 non-null float64
dtypes: float64(14)
memory usage: 22.3 KB
Wall time: 142 ms

data.head()

	year	quarter	realgdp	realcons	realinv	realgovt	realdpi	cpi	m1	tbilrate	unemp	pop	infl	realint
0	1959.0	1.0	2710.349	1707.4	286.898	470.045	1886.9	28.98	139.7	2.82	5.8	177.146	0.00	0.00
1	1959.0
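The long (stacked) format described at the start of this section stores one observation per row, with a column saying which series the value belongs to. As a rough illustration, the sketch below builds a tiny long-format table and reshapes it to wide format with `DataFrame.pivot`. The column names `date`, `item` and `value` and all of the numbers are made up for this example and are not taken from macrodata.csv.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, item) observation
long_df = pd.DataFrame(
    {
        "date": ["1959Q1", "1959Q1", "1959Q2", "1959Q2"],
        "item": ["realgdp", "infl", "realgdp", "infl"],
        "value": [1.0, 2.0, 3.0, 4.0],  # placeholder values only
    }
)

# Wide format: one row per date, one column per item
wide_df = long_df.pivot(index="date", columns="item", values="value")
print(wide_df)
```

`pivot` requires each (index, columns) pair to appear at most once; duplicated pairs raise an error and need aggregating first.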
