How to unstack a df from excel table with multiple levels of duplicating columns? Set multi index?

【Posted】: 2020-12-10 11:16:01

【Question】:

A df read from an xlsx file with df = pd.read_excel('file.xlsx') looks like this:

   Age Male Female Male.1 Female.1
0  NaN  Big  Small  Small      Big
1  1.0    2      3      2        3
2  2.0    3      4      3        4
3  3.0    4      5      4        5
df = pd.DataFrame({'Age':[np.nan, 1,2,3],'Male':['Big',2,3,4],'Female':['Small',3,4,5],'Male.1':['Small',2,3,4],'Female.1':['Big',3,4,5]})

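Note that pandas added the suffix .1 to the duplicated column names, which is not wanted. I would like to unstack/melt the frame to get this, or something similar: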

    Age Gender  Size    [measure]
1   1   Male    Big     2
2   2   Male    Big     3
3   3   Male    Big     4
4   1   Female  Big     3
5   2   Female  Big     4
6   3   Female  Big     5
7   1   Male    Small   2
8   2   Male    Small   3
9   3   Male    Small   4
10  1   Female  Small   3
11  2   Female  Small   4
12  3   Female  Small   5

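Renaming the columns and unstacking gets close, but no cigar:

df= df.rename(columns={'Male.1': 'Male', 'Female.1':'Female'})
df= df.set_index(['Age']).unstack()

How can I set the first row as the second index level of the columns, as shown here? What am I missing?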

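【Answer 1】:

Besides .unstack(), another approach is .melt().

You can transpose the dataframe with .T and use .iloc[1:] to drop the first row of the transposed frame (the Age row). Then .rename the columns, .replace the .1 suffix with a regex, .melt the dataframe, and .sort_values. Note that the Age values in the result come from the positional row labels 1, 2, 3 of the original frame, which happen to equal the ages in this sample.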

import numpy as np
import pandas as pd

df = pd.DataFrame({'Age':[np.nan, 1,2,3],'Male':['Big',2,3,4],'Female':['Small',3,4,5],'Male.1':['Small',2,3,4],'Female.1':['Big',3,4,5]})
df = (df.T.reset_index().iloc[1:]
      .rename({'index' : 'Gender', 0 : 'Size'}, axis=1)
      .replace(r'\.\d+$', '', regex=True)
      .melt(id_vars=['Gender', 'Size'], value_name='[measure]', var_name='Age')
      .sort_values(['Size', 'Gender', 'Age'], ascending=[True,False,True])
      .reset_index(drop=True))
df = df[['Age', 'Gender', 'Size', '[measure]']]      
df
Out[41]: 
   Age  Gender   Size  [measure]
0    1    Male    Big          2
1    2    Male    Big          3
2    3    Male    Big          4
3    1  Female    Big          3
4    2  Female    Big          4
5    3  Female    Big          5
6    1    Male  Small          2
7    2    Male  Small          3
8    3    Male  Small          4
9    1  Female  Small          3
10   2  Female  Small          4
11   3  Female  Small          5
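If the ages do not coincide with the row positions, one option is to label the transposed columns from the Age column itself before melting. The following is a minimal sketch of that variant, not the answerer's code; the names wide, ages, and tidy are illustrative, and it assumes the ages are whole numbers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, 1, 2, 3],
                   'Male': ['Big', 2, 3, 4],
                   'Female': ['Small', 3, 4, 5],
                   'Male.1': ['Small', 2, 3, 4],
                   'Female.1': ['Big', 3, 4, 5]})

# Transpose only the data columns; each former column becomes one row.
wide = df.drop(columns='Age').T.reset_index()

# Label the value columns with the real ages instead of positions 1, 2, 3
# (assumption: ages are whole numbers, so casting to int is safe).
ages = df['Age'].iloc[1:].astype(int).tolist()
wide.columns = ['Gender', 'Size'] + ages

# Strip the ".1" suffix that pandas added to the duplicated header names.
wide['Gender'] = wide['Gender'].str.replace(r'\.\d+$', '', regex=True)

tidy = wide.melt(id_vars=['Gender', 'Size'], var_name='Age', value_name='measure')
print(tidy)
```

Sorting with .sort_values, as in the answer, can be appended if a particular row order is needed.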

【Answer 2】:

Create MultiIndex columns by combining row 0 with the column names:

df.columns = pd.MultiIndex.from_arrays((df.columns, df.iloc[0]))
df.columns.names = ['gender', 'size']

df.columns

MultiIndex([(     'Age',     nan),
            (    'Male',   'Big'),
            (  'Female', 'Small'),
            (  'Male.1', 'Small'),
            ('Female.1',   'Big')],
          names=['gender', 'size'])

Now you can reshape and rename:

 (df
  .dropna()
  .melt([('Age', np.nan)], value_name='measure')
  .replace(r'\.\d+$', '', regex=True)
  .rename(columns={("Age", np.nan) : "Age"}))

   Age  gender  size measure
0   1.0 Male    Big     2
1   2.0 Male    Big     3
2   3.0 Male    Big     4
3   1.0 Female  Small   3
4   2.0 Female  Small   4
5   3.0 Female  Small   5
6   1.0 Male    Small   2
7   2.0 Male    Small   3
8   3.0 Male    Small   4
9   1.0 Female  Big     3
10  2.0 Female  Big     4
11  3.0 Female  Big     5
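The approach above addresses the Age column through the tuple ('Age', nan). As a hedged alternative sketch, Age can be moved into the index before the MultiIndex is built, so that no NaN-containing key is needed. The names body, sizes, and long_df are illustrative, and melt's ignore_index parameter requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, 1, 2, 3],
                   'Male': ['Big', 2, 3, 4],
                   'Female': ['Small', 3, 4, 5],
                   'Male.1': ['Small', 2, 3, 4],
                   'Female.1': ['Big', 3, 4, 5]})

# Row 0 holds the second header level; the remaining rows hold the data.
sizes = df.iloc[0].drop('Age')
body = df.iloc[1:].set_index('Age')

# Build two-level columns from the cleaned names and the size labels.
body.columns = pd.MultiIndex.from_arrays(
    [body.columns.str.replace(r'\.\d+$', '', regex=True), sizes],
    names=['gender', 'size'])

# With no id_vars, melt turns each column level into its own column;
# ignore_index=False keeps Age in the index so reset_index can restore it.
long_df = body.melt(ignore_index=False, value_name='measure').reset_index()
print(long_df)
```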

【Answer 3】:

If possible, create the MultiIndex from the first 2 rows and the index from the first column by using the header and index_col parameters of read_excel:

df = pd.read_excel('file.xlsx',header=[0,1], index_col=[0])
    
print (df)
Age Male Female  Male Female
     Big  Small Small    Big
1.0    2      3     2      3
2.0    3      4     3      4
3.0    4      5     4      5

print (df.columns)
MultiIndex([(  'Male',   'Big'),
            ('Female', 'Small'),
            (  'Male', 'Small'),
            ('Female',   'Big')],
           names=['Age', None])

print (df.index)
Float64Index([1.0, 2.0, 3.0], dtype='float64')

So it is possible to use DataFrame.unstack:

df = (df.unstack()
        .rename_axis(['Gender', 'Size','Age'])
        .reset_index(name='measure'))
print (df)
    Gender   Size  Age  measure
0     Male    Big  1.0        2
1     Male    Big  2.0        3
2     Male    Big  3.0        4
3   Female  Small  1.0        3
4   Female  Small  2.0        4
5   Female  Small  3.0        5
6     Male  Small  1.0        2
7     Male  Small  2.0        3
8     Male  Small  3.0        4
9   Female    Big  1.0        3
10  Female    Big  2.0        4
11  Female    Big  3.0        5

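If that is not possible, use:

You can create the MultiIndex with MultiIndex.from_arrays, removing the trailing . followed by digits with replace (pass regex=True, because the default became a literal match in pandas 2.0). Then filter out the first row with DataFrame.iloc, reshape with DataFrame.melt using the first column as the identifier, and finally set the new column names:

df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', '', regex=True), 
                                        df.iloc[0]])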
df = df.iloc[1:].melt(df.columns[:1].tolist())
df.columns=['Age','Gender','Size','measure']
print (df)
    Age  Gender   Size measure
0   1.0    Male    Big       2
1   2.0    Male    Big       3
2   3.0    Male    Big       4
3   1.0  Female  Small       3
4   2.0  Female  Small       4
5   3.0  Female  Small       5
6   1.0    Male  Small       2
7   2.0    Male  Small       3
8   3.0    Male  Small       4
9   1.0  Female    Big       3
10  2.0  Female    Big       4
11  3.0  Female    Big       5

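Alternatively, a solution with DataFrame.unstack is possible: set the first column as the index with DataFrame.set_index, then use Series.rename_axis to name the MultiIndex levels, which become the new column names:

df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', '', regex=True), 
                                        df.iloc[0]])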
df = (df.iloc[1:].set_index(df.columns[:1].tolist())
        .unstack()
        .rename_axis(['Gender', 'Size','Age'])
        .reset_index(name='measure'))
print (df)
    Gender   Size  Age measure
0     Male    Big  1.0       2
1     Male    Big  2.0       3
2     Male    Big  3.0       4
3   Female  Small  1.0       3
4   Female  Small  2.0       4
5   Female  Small  3.0       5
6     Male  Small  1.0       2
7     Male  Small  2.0       3
8     Male  Small  3.0       4
9   Female    Big  1.0       3
10  Female    Big  2.0       4
11  Female    Big  3.0       5
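In the outputs above, the measure column keeps the object dtype because each source column mixed a text label with numbers. The sketch below is one hedged way to finish the unstack route with numeric columns. It sets Age as the index before building the MultiIndex, which is a variation on the answer rather than its exact code, and the names body and long_df are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, 1, 2, 3],
                   'Male': ['Big', 2, 3, 4],
                   'Female': ['Small', 3, 4, 5],
                   'Male.1': ['Small', 2, 3, 4],
                   'Female.1': ['Big', 3, 4, 5]})

# Keep the data rows only and move Age into the index.
body = df.iloc[1:].set_index('Age')
body.columns = pd.MultiIndex.from_arrays(
    [body.columns.str.replace(r'\.\d+$', '', regex=True),
     df.iloc[0].drop('Age')])

# Unstacking a frame with a flat index returns a Series indexed by
# (column level 0, column level 1, row index).
long_df = (body.unstack()
               .rename_axis(['Gender', 'Size', 'Age'])
               .reset_index(name='measure'))

# Convert the leftover object/float columns to proper numeric dtypes
# (assumption: ages are whole numbers).
long_df['measure'] = pd.to_numeric(long_df['measure'])
long_df['Age'] = long_df['Age'].astype(int)
print(long_df.dtypes)
```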
