How to unstack a df from excel table with multiple levels of duplicating columns? Set multi index?
【Posted】2020-12-10 11:16:01 【Question】A df read from an xlsx file: df = pd.read_excel('file.xlsx')
It looks like this:
Age Male Female Male.1 Female.1
0 NaN Big Small Small Big
1 1.0 2 3 2 3
2 2.0 3 4 3 4
3 3.0 4 5 4 5
df = pd.DataFrame({'Age': [np.nan, 1, 2, 3], 'Male': ['Big', 2, 3, 4], 'Female': ['Small', 3, 4, 5], 'Male.1': ['Small', 2, 3, 4], 'Female.1': ['Big', 3, 4, 5]})
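Note that pandas appended the suffix .1 to the duplicated column names, which is not wanted. I would like to unstack/melt the frame to get this, or something similar: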
Age Gender Size [measure]
1 1 Male Big 2
2 2 Male Big 3
3 3 Male Big 4
4 1 Female Big 3
5 2 Female Big 4
6 3 Female Big 5
7 1 Male Small 2
8 2 Male Small 3
9 3 Male Small 4
10 1 Female Small 3
11 2 Female Small 4
12 3 Female Small 5
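Renaming the columns and unstacking comes close, but no cigar:
df = df.rename(columns={'Male.1': 'Male', 'Female.1': 'Female'})
df = df.set_index(['Age']).unstack()
How can I set the first row as the second index level of the columns, as shown here? What am I missing?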
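【Answer 1】Besides .unstack(), another approach is .melt().
You can transpose the dataframe with .T, call .reset_index(), and use .iloc[1:] to keep everything after the first row (the Age row). Then .rename the columns, .replace the .1 suffix with a regular expression, .melt the dataframe, and finish with .sort_values.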
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, 1, 2, 3], 'Male': ['Big', 2, 3, 4], 'Female': ['Small', 3, 4, 5], 'Male.1': ['Small', 2, 3, 4], 'Female.1': ['Big', 3, 4, 5]})
# After .T the columns 1, 2, 3 are the original row labels, which equal the ages in this sample
df = (df.T.reset_index().iloc[1:]
        .rename({'index': 'Gender', 0: 'Size'}, axis=1)
        .replace(r'\.\d+$', '', regex=True)  # strip the ".1" suffix
        .melt(id_vars=['Gender', 'Size'], value_name='[measure]', var_name='Age')
        .sort_values(['Size', 'Gender', 'Age'], ascending=[True, False, True])
        .reset_index(drop=True))
df = df[['Age', 'Gender', 'Size', '[measure]']]
df
Out[41]:
Age Gender Size [measure]
0 1 Male Big 2
1 2 Male Big 3
2 3 Male Big 4
3 1 Female Big 3
4 2 Female Big 4
5 3 Female Big 5
6 1 Male Small 2
7 2 Male Small 3
8 3 Male Small 4
9 1 Female Small 3
10 2 Female Small 4
11 3 Female Small 5
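The chain above takes the Age values from the transposed column labels, which only works because the row labels 1, 2, 3 happen to match the ages in the sample. A minimal sketch of a variant that maps those labels to the real Age values first; the variable names ages, wide and tidy are illustrative, and the value column is named measure here rather than [measure]:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, 1, 2, 3],
                   'Male': ['Big', 2, 3, 4],
                   'Female': ['Small', 3, 4, 5],
                   'Male.1': ['Small', 2, 3, 4],
                   'Female.1': ['Big', 3, 4, 5]})

# Real ages, keyed by the original row labels (row 0 holds the Size header)
ages = df['Age'].iloc[1:].astype(int)

wide = (df.drop(columns='Age')
          .T
          .reset_index()
          .rename(columns={'index': 'Gender', 0: 'Size'})
          .rename(columns=ages.to_dict()))  # row label -> actual age

# Strip the ".1" suffix that pandas added to duplicated column names
wide['Gender'] = wide['Gender'].str.replace(r'\.\d+$', '', regex=True)

tidy = (wide.melt(id_vars=['Gender', 'Size'], var_name='Age', value_name='measure')
            .sort_values(['Size', 'Gender', 'Age'], ascending=[True, False, True])
            .reset_index(drop=True)
            [['Age', 'Gender', 'Size', 'measure']])
print(tidy)
```

With ages that differ from the row labels (say 10, 20, 30), this version would still report the correct Age for each measurement.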
【Answer 2】Create MultiIndex columns by combining row 0 with the column names:
df.columns = pd.MultiIndex.from_arrays((df.columns, df.iloc[0]))
df.columns.names = ['gender', 'size']
df.columns
MultiIndex([( 'Age', nan),
( 'Male', 'Big'),
( 'Female', 'Small'),
( 'Male.1', 'Small'),
('Female.1', 'Big')],
names=['gender', 'size'])
Now you can reshape and rename:
(df
 .dropna()  # drops row 0, whose Age is NaN
 .melt([('Age', np.nan)], value_name='measure')
 .replace(r'\.\d+$', '', regex=True)  # strip the ".1" suffix
 .rename(columns={("Age", np.nan): "Age"}))
Age gender size measure
0 1.0 Male Big 2
1 2.0 Male Big 3
2 3.0 Male Big 4
3 1.0 Female Small 3
4 2.0 Female Small 4
5 3.0 Female Small 5
6 1.0 Male Small 2
7 2.0 Male Small 3
8 3.0 Male Small 4
9 1.0 Female Big 3
10 2.0 Female Big 4
11 3.0 Female Big 5
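In the result above, Age prints as a float (1.0, 2.0, 3.0) because the column originally contained a NaN, and measure is presumably object dtype because its source columns mixed the strings 'Big'/'Small' with numbers. A small sketch of casting both back to integers after reshaping; the three-row tidy frame here is a hand-built stand-in for the reshaped output, not the answer's actual variable:

```python
import pandas as pd

# Stand-in for the first rows of the reshaped frame: float Age, object measure
tidy = pd.DataFrame({
    'Age': [1.0, 2.0, 3.0],
    'gender': ['Male'] * 3,
    'size': ['Big'] * 3,
    'measure': pd.Series([2, 3, 4], dtype=object),
})

tidy = tidy.astype({'Age': 'int64'})              # safe once the NaN row is gone
tidy['measure'] = pd.to_numeric(tidy['measure'])  # object -> numeric dtype
print(tidy.dtypes)
```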
【Answer 3】If possible, build the MultiIndex from the first two rows and use the first column as the index, via the header and index_col parameters of read_excel:
df = pd.read_excel('file.xlsx',header=[0,1], index_col=[0])
print (df)
Age Male Female Male Female
Big Small Small Big
1.0 2 3 2 3
2.0 3 4 3 4
3.0 4 5 4 5
print (df.columns)
MultiIndex([( 'Male', 'Big'),
('Female', 'Small'),
( 'Male', 'Small'),
('Female', 'Big')],
names=['Age', None])
print (df.index)
Float64Index([1.0, 2.0, 3.0], dtype='float64')
So you can use DataFrame.unstack:
df = (df.unstack()
.rename_axis(['Gender', 'Size','Age'])
.reset_index(name='measure'))
print (df)
Gender Size Age measure
0 Male Big 1.0 2
1 Male Big 2.0 3
2 Male Big 3.0 4
3 Female Small 1.0 3
4 Female Small 2.0 4
5 Female Small 3.0 5
6 Male Small 1.0 2
7 Male Small 2.0 3
8 Male Small 3.0 4
9 Female Big 1.0 3
10 Female Big 2.0 4
11 Female Big 3.0 5
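If that is not possible, use the following:
Create the MultiIndex with MultiIndex.from_arrays, using replace to strip the trailing . followed by digits from the column names. Then filter out the first row with DataFrame.iloc, reshape with DataFrame.melt using the first column as the identifier, and finally set the new column names: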
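# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', '', regex=True),
                                        df.iloc[0]])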
df = df.iloc[1:].melt(df.columns[:1].tolist())
df.columns=['Age','Gender','Size','measure']
print (df)
Age Gender Size measure
0 1.0 Male Big 2
1 2.0 Male Big 3
2 3.0 Male Big 4
3 1.0 Female Small 3
4 2.0 Female Small 4
5 3.0 Female Small 5
6 1.0 Male Small 2
7 2.0 Male Small 3
8 3.0 Male Small 4
9 1.0 Female Big 3
10 2.0 Female Big 4
11 3.0 Female Big 5
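Alternatively, a solution with DataFrame.unstack is also possible: set the first column as the index with DataFrame.set_index, then use Series.rename_axis to name the MultiIndex levels, which become the new column names: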
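# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', '', regex=True),
                                        df.iloc[0]])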
df = (df.iloc[1:].set_index(df.columns[:1].tolist())
.unstack()
.rename_axis(['Gender', 'Size','Age'])
.reset_index(name='measure'))
print (df)
Gender Size Age measure
0 Male Big 1.0 2
1 Male Big 2.0 3
2 Male Big 3.0 4
3 Female Small 1.0 3
4 Female Small 2.0 4
5 Female Small 3.0 5
6 Male Small 1.0 2
7 Male Small 2.0 3
8 Male Small 3.0 4
9 Female Big 1.0 3
10 Female Big 2.0 4
11 Female Big 3.0 5