如何使用 Pandas 或 Spark Dataframe 展平嵌套的 Excel 数据?

Posted

技术标签:

【中文标题】如何使用 Pandas 或 Spark Dataframe 展平嵌套的 Excel 数据?【英文标题】:How to flatten Nested Excel data using Panda or Spark Dataframe? 【发布时间】:2021-05-22 03:17:54 【问题描述】:

我有以下 excel 数据,我想将其转换为下面提到的预期平面格式。在预期的格式中,我不想包含差异和总计等聚合列。请帮助我使用 python Pandas Dataframe 或 Pyspark Dataframe 完成它。

Excel 中的输入数据为:

预期的最终格式为

df 是我在输入数据上的 Pandas 数据框。以下是

的输出
df = pd.read_excel("D:/***.xlsx",sheet_name = 'Input')
        Data    Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 4  Unnamed: 5  Unnamed: 6
    0   Type    Domestic    NaN NaN International   NaN NaN
    1   Unit/Subunit    Jan Feb Diff    Jan Feb Diff
    2   NaN 2020    2021    NaN 2020    2021    NaN
    3   Unit1   NaN NaN NaN NaN NaN NaN
    4   SubUnit11   100 130 30  5000    8000    3000
    5   SubUnit12   50  80  30  1000    4000    3000
    6   SubUnit13   1000    1100    100 100000  100010  10
    7   Total Unit1 1150    1310    160 106000  112010  6010
    8   Unit2   NaN NaN NaN NaN NaN NaN
    9   SubUnit21   1100    1130    30  15000   81100   66100
    10  SubUnit22   150 180 30  11000   42000   31000
    11  SubUnit23   11000   1110    -9890   1001000 1022010 21010
    12  Total Unit2 12250   2420    -9830   1027000 1145110 118110

print (df.index.tolist()[:3])
[0, 1, 2]

print (df.columns.tolist()[:3])
['Data', 'Unnamed: 1', 'Unnamed: 2']

print (df.head(10).to_dict())

'Data': 0: 'Type', 1: 'Unit/Subunit', 2: nan, 3: 'Unit1', 4: 'SubUnit11', 5: 'SubUnit12', 6: 'SubUnit13', 7: 'Total Unit1', 8: 'Unit2', 9: 'SubUnit21', 'Unnamed: 1': 0: 'Domestic ', 1: 'Jan', 2: 2020, 3: nan, 4: 100, 5: 50, 6: 1000, 7: 1150, 8: nan, 9: 1100, 'Unnamed: 2': 0: nan, 1: 'Feb', 2: 2021, 3: nan, 4: 130, 5: 80, 6: 1100, 7: 1310, 8: nan, 9: 1130, 'Unnamed: 3': 0: nan, 1: 'Diff', 2: nan, 3: nan, 4: 30, 5: 30, 6: 100, 7: 160, 8: nan, 9: 30, 'Unnamed: 4': 0: 'International', 1: 'Jan', 2: 2020, 3: nan, 4: 5000, 5: 1000, 6: 100000, 7: 106000, 8: nan, 9: 15000, 'Unnamed: 5': 0: nan, 1: 'Feb', 2: 2021, 3: nan, 4: 8000, 5: 4000, 6: 100010, 7: 112010, 8: nan, 9: 81100, 'Unnamed: 6': 0: nan, 1: 'Diff', 2: nan, 3: nan, 4: 3000, 5: 3000, 6: 10, 7: 6010, 8: nan, 9: 66100


df1 = pd.read_excel("D:/***.xlsx",sheet_name = 'Input',header = [1,2,3])

    Type    Domestic    International
Unit/Subunit    Jan Feb Diff    Jan Feb Diff
Unnamed: 0_level_2  2020    2021    Unnamed: 3_level_2  2020    2021    Unnamed: 6_level_2
0   Unit1   NaN NaN NaN NaN NaN NaN
1   SubUnit11   100.0   130.0   30.0    5000.0  8000.0  3000.0
2   SubUnit12   50.0    80.0    30.0    1000.0  4000.0  3000.0
3   SubUnit13   1000.0  1100.0  100.0   100000.0    100010.0    10.0
4   Total Unit1 1150.0  1310.0  160.0   106000.0    112010.0    6010.0
5   Unit2   NaN NaN NaN NaN NaN NaN
6   SubUnit21   1100.0  1130.0  30.0    15000.0 81100.0 66100.0
7   SubUnit22   150.0   180.0   30.0    11000.0 42000.0 31000.0
8   SubUnit23   11000.0 1110.0  -9890.0 1001000.0   1022010.0   21010.0
9   Total Unit2 12250.0 2420.0  -9830.0 1027000.0   1145110.0   118110.0

【问题讨论】:

print (df.index.tolist()[:3])print (df.columns.tolist()[:3]) 是什么? print (df.head(10).to_dict()) 是什么?如果问题创建 DataFrame 是可能的共享示例 excel 文件? @jezrael 抱歉,我无法附上这些数据。无论如何,它并没有为此提供我。我已经更新了 print (df.head(10).to_dict()) 的输出,如果有帮助,请告诉我。 @jezrael 我已经更新了所有输出。 谢谢,有一件事,可以使用df1 = pd.read_excel("D:/***.xlsx",sheet_name = 'Input',header = [1,2,3], index_col=[0])吗?然后print (df1.to_dict()) 看看 MultiIndex 看起来如何? 【参考方案1】:

您可以使用DataFrame.stack 进行整形,然后删除不必要的行:

d =('Domestic ', 'Jan', 2020): 'SubUnit11': 100.0, 'SubUnit12': 50.0, 'SubUnit13': 1000.0, 'Total Unit1': 1150.0, 'Unit2': np.nan, 'SubUnit21': 1100.0, 'SubUnit22': 150.0, 'SubUnit23': 11000.0, 'Total Unit2': 12250.0, ('Domestic ', 'Feb', 2021): 'SubUnit11': 130.0, 'SubUnit12': 80.0, 'SubUnit13': 1100.0, 'Total Unit1': 1310.0, 'Unit2': np.nan, 'SubUnit21': 1130.0, 'SubUnit22': 180.0, 'SubUnit23': 1110.0, 'Total Unit2': 2420.0, ('Domestic ', 'Diff', 'Unnamed: 3_level_2'):
'SubUnit11': 30.0, 'SubUnit12': 30.0, 'SubUnit13': 100.0, 'Total Unit1': 160.0, 'Unit2': np.nan, 'SubUnit21': 30.0, 'SubUnit22': 30.0, 'SubUnit23': -9890.0, 'Total Unit2': -9830.0, ('International', 'Jan', 2020): 'SubUnit11': 5000.0, 'SubUnit12': 1000.0, 'SubUnit13': 100000.0, 'Total Unit1': 106000.0, 'Unit2': np.nan, 'SubUnit21': 15000.0, 'SubUnit22': 11000.0, 'SubUnit23': 1001000.0, 'Total Unit2': 1027000.0,
('International', 'Feb', 2021): 'SubUnit11': 8000.0, 'SubUnit12': 4000.0, 'SubUnit13': 100010.0, 'Total Unit1': 112010.0, 'Unit2': np.nan, 'SubUnit21': 81100.0, 'SubUnit22': 42000.0, 'SubUnit23': 1022010.0, 'Total Unit2': 1145110.0, ('International', 'Diff', 'Unnamed: 6_level_2'): 'SubUnit11': 3000.0, 'SubUnit12': 3000.0, 'SubUnit13': 10.0, 'Total Unit1': 6010.0, 'Unit2': np.nan, 'SubUnit21': 66100.0, 'SubUnit22': 31000.0, 'SubUnit23': 21010.0, 'Total Unit2': 118110.0

df = pd.DataFrame(d)

#added missing Unit1 in sample data
df.loc['Unit1'] = np.nan
df = df.loc[df.index[-1:].tolist() + df.index[:-1].tolist()]
# print (df)

#create new column Unit by index with repalce non matched Unit to NaNs
df['Unit'] = df.index.where(df.index.str.startswith('Unit'))
#forward flling NaNs
df['Unit'] = df['Unit'].ffill()

#add column to index and change order of levels in MultiIndex
df = df.set_index('Unit', append=True).swaplevel(1,0)

#reshape
df = df.stack([1, 2]).rename_axis(['Unit','SubUnit','Month','Year']).reset_index()

#remove unnecessary rows
df = df[df['Month'].ne('Diff') &
        ~df['Year'].str.startswith('Unnamed', na=False) &
        ~df['SubUnit'].str.startswith(('Total', 'Unit'), na=False)]

print (df)
     Unit    SubUnit Month  Year  Domestic   International
2   Unit1  SubUnit11   Feb  2021      130.0         8000.0
3   Unit1  SubUnit11   Jan  2020      100.0         5000.0
6   Unit1  SubUnit12   Feb  2021       80.0         4000.0
7   Unit1  SubUnit12   Jan  2020       50.0         1000.0
10  Unit1  SubUnit13   Feb  2021     1100.0       100010.0
11  Unit1  SubUnit13   Jan  2020     1000.0       100000.0
18  Unit2  SubUnit21   Feb  2021     1130.0        81100.0
19  Unit2  SubUnit21   Jan  2020     1100.0        15000.0
22  Unit2  SubUnit22   Feb  2021      180.0        42000.0
23  Unit2  SubUnit22   Jan  2020      150.0        11000.0
26  Unit2  SubUnit23   Feb  2021     1110.0      1022010.0
27  Unit2  SubUnit23   Jan  2020    11000.0      1001000.0

最后是添加日期时间列进行排序:

df['Date'] = pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%b%Y')

df = df.sort_values(['Unit','SubUnit','Date'], ignore_index=True)
print (df)
     Unit    SubUnit Month  Year  Domestic   International       Date
0   Unit1  SubUnit11   Jan  2020      100.0         5000.0 2020-01-01
1   Unit1  SubUnit11   Feb  2021      130.0         8000.0 2021-02-01
2   Unit1  SubUnit12   Jan  2020       50.0         1000.0 2020-01-01
3   Unit1  SubUnit12   Feb  2021       80.0         4000.0 2021-02-01
4   Unit1  SubUnit13   Jan  2020     1000.0       100000.0 2020-01-01
5   Unit1  SubUnit13   Feb  2021     1100.0       100010.0 2021-02-01
6   Unit2  SubUnit21   Jan  2020     1100.0        15000.0 2020-01-01
7   Unit2  SubUnit21   Feb  2021     1130.0        81100.0 2021-02-01
8   Unit2  SubUnit22   Jan  2020      150.0        11000.0 2020-01-01
9   Unit2  SubUnit22   Feb  2021      180.0        42000.0 2021-02-01
10  Unit2  SubUnit23   Jan  2020    11000.0      1001000.0 2020-01-01
11  Unit2  SubUnit23   Feb  2021     1110.0      1022010.0 2021-02-01

【讨论】:

以上是关于如何使用 Pandas 或 Spark Dataframe 展平嵌套的 Excel 数据?的主要内容,如果未能解决你的问题,请参考以下文章

如何扩展输出显示以查看 Pandas DataFrame 的更多列?

如何将使用 Pandas 在 Spark 集群上编写的文件移动到 HDFS?

如何设置 Spark 以使用由 anaconda 管理的 pandas?

动态构建大型数据框(spark 或 pandas)以导出到 csv 的方法

如何将 Pandas 数据框中的负数替换为零

如何将 xlsx 或 xls 文件读取为 spark 数据框