如何使用 Pandas 或 Spark Dataframe 展平嵌套的 Excel 数据?
Posted
技术标签:
【中文标题】如何使用 Pandas 或 Spark Dataframe 展平嵌套的 Excel 数据?【英文标题】:How to flatten Nested Excel data using Panda or Spark Dataframe? 【发布时间】:2021-05-22 03:17:54 【问题描述】:我有以下 excel 数据,我想将其转换为下面提到的预期平面格式。在预期的格式中,我不想包含差异和总计等聚合列。请帮助我使用 python Pandas Dataframe 或 Pyspark Dataframe 完成它。
Excel 中的输入数据为:
预期的最终格式为
df 是我在输入数据上的 Pandas 数据框。以下是
的输出df = pd.read_excel("D:/***.xlsx",sheet_name = 'Input')
Data Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
0 Type Domestic NaN NaN International NaN NaN
1 Unit/Subunit Jan Feb Diff Jan Feb Diff
2 NaN 2020 2021 NaN 2020 2021 NaN
3 Unit1 NaN NaN NaN NaN NaN NaN
4 SubUnit11 100 130 30 5000 8000 3000
5 SubUnit12 50 80 30 1000 4000 3000
6 SubUnit13 1000 1100 100 100000 100010 10
7 Total Unit1 1150 1310 160 106000 112010 6010
8 Unit2 NaN NaN NaN NaN NaN NaN
9 SubUnit21 1100 1130 30 15000 81100 66100
10 SubUnit22 150 180 30 11000 42000 31000
11 SubUnit23 11000 1110 -9890 1001000 1022010 21010
12 Total Unit2 12250 2420 -9830 1027000 1145110 118110
print (df.index.tolist()[:3])
[0, 1, 2]
print (df.columns.tolist()[:3])
['Data', 'Unnamed: 1', 'Unnamed: 2']
print (df.head(10).to_dict())
'Data': 0: 'Type', 1: 'Unit/Subunit', 2: nan, 3: 'Unit1', 4: 'SubUnit11', 5: 'SubUnit12', 6: 'SubUnit13', 7: 'Total Unit1', 8: 'Unit2', 9: 'SubUnit21', 'Unnamed: 1': 0: 'Domestic ', 1: 'Jan', 2: 2020, 3: nan, 4: 100, 5: 50, 6: 1000, 7: 1150, 8: nan, 9: 1100, 'Unnamed: 2': 0: nan, 1: 'Feb', 2: 2021, 3: nan, 4: 130, 5: 80, 6: 1100, 7: 1310, 8: nan, 9: 1130, 'Unnamed: 3': 0: nan, 1: 'Diff', 2: nan, 3: nan, 4: 30, 5: 30, 6: 100, 7: 160, 8: nan, 9: 30, 'Unnamed: 4': 0: 'International', 1: 'Jan', 2: 2020, 3: nan, 4: 5000, 5: 1000, 6: 100000, 7: 106000, 8: nan, 9: 15000, 'Unnamed: 5': 0: nan, 1: 'Feb', 2: 2021, 3: nan, 4: 8000, 5: 4000, 6: 100010, 7: 112010, 8: nan, 9: 81100, 'Unnamed: 6': 0: nan, 1: 'Diff', 2: nan, 3: nan, 4: 3000, 5: 3000, 6: 10, 7: 6010, 8: nan, 9: 66100
df1 = pd.read_excel("D:/***.xlsx",sheet_name = 'Input',header = [1,2,3])
Type Domestic International
Unit/Subunit Jan Feb Diff Jan Feb Diff
Unnamed: 0_level_2 2020 2021 Unnamed: 3_level_2 2020 2021 Unnamed: 6_level_2
0 Unit1 NaN NaN NaN NaN NaN NaN
1 SubUnit11 100.0 130.0 30.0 5000.0 8000.0 3000.0
2 SubUnit12 50.0 80.0 30.0 1000.0 4000.0 3000.0
3 SubUnit13 1000.0 1100.0 100.0 100000.0 100010.0 10.0
4 Total Unit1 1150.0 1310.0 160.0 106000.0 112010.0 6010.0
5 Unit2 NaN NaN NaN NaN NaN NaN
6 SubUnit21 1100.0 1130.0 30.0 15000.0 81100.0 66100.0
7 SubUnit22 150.0 180.0 30.0 11000.0 42000.0 31000.0
8 SubUnit23 11000.0 1110.0 -9890.0 1001000.0 1022010.0 21010.0
9 Total Unit2 12250.0 2420.0 -9830.0 1027000.0 1145110.0 118110.0
【问题讨论】:
print (df.index.tolist()[:3])
和 print (df.columns.tolist()[:3])
是什么?
print (df.head(10).to_dict())
是什么?如果问题创建 DataFrame 是可能的共享示例 excel 文件?
@jezrael 抱歉,我无法附上这些数据。无论如何,它并没有为此提供我。我已经更新了 print (df.head(10).to_dict()) 的输出,如果有帮助,请告诉我。
@jezrael 我已经更新了所有输出。
谢谢,有一件事,可以使用df1 = pd.read_excel("D:/***.xlsx",sheet_name = 'Input',header = [1,2,3], index_col=[0])
吗?然后print (df1.to_dict())
看看 MultiIndex 看起来如何?
【参考方案1】:
您可以使用DataFrame.stack
进行整形,然后删除不必要的行:
d =('Domestic ', 'Jan', 2020): 'SubUnit11': 100.0, 'SubUnit12': 50.0, 'SubUnit13': 1000.0, 'Total Unit1': 1150.0, 'Unit2': np.nan, 'SubUnit21': 1100.0, 'SubUnit22': 150.0, 'SubUnit23': 11000.0, 'Total Unit2': 12250.0, ('Domestic ', 'Feb', 2021): 'SubUnit11': 130.0, 'SubUnit12': 80.0, 'SubUnit13': 1100.0, 'Total Unit1': 1310.0, 'Unit2': np.nan, 'SubUnit21': 1130.0, 'SubUnit22': 180.0, 'SubUnit23': 1110.0, 'Total Unit2': 2420.0, ('Domestic ', 'Diff', 'Unnamed: 3_level_2'):
'SubUnit11': 30.0, 'SubUnit12': 30.0, 'SubUnit13': 100.0, 'Total Unit1': 160.0, 'Unit2': np.nan, 'SubUnit21': 30.0, 'SubUnit22': 30.0, 'SubUnit23': -9890.0, 'Total Unit2': -9830.0, ('International', 'Jan', 2020): 'SubUnit11': 5000.0, 'SubUnit12': 1000.0, 'SubUnit13': 100000.0, 'Total Unit1': 106000.0, 'Unit2': np.nan, 'SubUnit21': 15000.0, 'SubUnit22': 11000.0, 'SubUnit23': 1001000.0, 'Total Unit2': 1027000.0,
('International', 'Feb', 2021): 'SubUnit11': 8000.0, 'SubUnit12': 4000.0, 'SubUnit13': 100010.0, 'Total Unit1': 112010.0, 'Unit2': np.nan, 'SubUnit21': 81100.0, 'SubUnit22': 42000.0, 'SubUnit23': 1022010.0, 'Total Unit2': 1145110.0, ('International', 'Diff', 'Unnamed: 6_level_2'): 'SubUnit11': 3000.0, 'SubUnit12': 3000.0, 'SubUnit13': 10.0, 'Total Unit1': 6010.0, 'Unit2': np.nan, 'SubUnit21': 66100.0, 'SubUnit22': 31000.0, 'SubUnit23': 21010.0, 'Total Unit2': 118110.0
df = pd.DataFrame(d)
#added missing Unit1 in sample data
df.loc['Unit1'] = np.nan
df = df.loc[df.index[-1:].tolist() + df.index[:-1].tolist()]
# print (df)
#create new column Unit by index with repalce non matched Unit to NaNs
df['Unit'] = df.index.where(df.index.str.startswith('Unit'))
#forward flling NaNs
df['Unit'] = df['Unit'].ffill()
#add column to index and change order of levels in MultiIndex
df = df.set_index('Unit', append=True).swaplevel(1,0)
#reshape
df = df.stack([1, 2]).rename_axis(['Unit','SubUnit','Month','Year']).reset_index()
#remove unnecessary rows
df = df[df['Month'].ne('Diff') &
~df['Year'].str.startswith('Unnamed', na=False) &
~df['SubUnit'].str.startswith(('Total', 'Unit'), na=False)]
print (df)
Unit SubUnit Month Year Domestic International
2 Unit1 SubUnit11 Feb 2021 130.0 8000.0
3 Unit1 SubUnit11 Jan 2020 100.0 5000.0
6 Unit1 SubUnit12 Feb 2021 80.0 4000.0
7 Unit1 SubUnit12 Jan 2020 50.0 1000.0
10 Unit1 SubUnit13 Feb 2021 1100.0 100010.0
11 Unit1 SubUnit13 Jan 2020 1000.0 100000.0
18 Unit2 SubUnit21 Feb 2021 1130.0 81100.0
19 Unit2 SubUnit21 Jan 2020 1100.0 15000.0
22 Unit2 SubUnit22 Feb 2021 180.0 42000.0
23 Unit2 SubUnit22 Jan 2020 150.0 11000.0
26 Unit2 SubUnit23 Feb 2021 1110.0 1022010.0
27 Unit2 SubUnit23 Jan 2020 11000.0 1001000.0
最后是添加日期时间列进行排序:
df['Date'] = pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%b%Y')
df = df.sort_values(['Unit','SubUnit','Date'], ignore_index=True)
print (df)
Unit SubUnit Month Year Domestic International Date
0 Unit1 SubUnit11 Jan 2020 100.0 5000.0 2020-01-01
1 Unit1 SubUnit11 Feb 2021 130.0 8000.0 2021-02-01
2 Unit1 SubUnit12 Jan 2020 50.0 1000.0 2020-01-01
3 Unit1 SubUnit12 Feb 2021 80.0 4000.0 2021-02-01
4 Unit1 SubUnit13 Jan 2020 1000.0 100000.0 2020-01-01
5 Unit1 SubUnit13 Feb 2021 1100.0 100010.0 2021-02-01
6 Unit2 SubUnit21 Jan 2020 1100.0 15000.0 2020-01-01
7 Unit2 SubUnit21 Feb 2021 1130.0 81100.0 2021-02-01
8 Unit2 SubUnit22 Jan 2020 150.0 11000.0 2020-01-01
9 Unit2 SubUnit22 Feb 2021 180.0 42000.0 2021-02-01
10 Unit2 SubUnit23 Jan 2020 11000.0 1001000.0 2020-01-01
11 Unit2 SubUnit23 Feb 2021 1110.0 1022010.0 2021-02-01
【讨论】:
以上是关于如何使用 Pandas 或 Spark Dataframe 展平嵌套的 Excel 数据?的主要内容,如果未能解决你的问题,请参考以下文章
如何扩展输出显示以查看 Pandas DataFrame 的更多列?
如何将使用 Pandas 在 Spark 集群上编写的文件移动到 HDFS?
如何设置 Spark 以使用由 anaconda 管理的 pandas?