如何组合和旋转具有不同结构的数据框
Posted
技术标签:
【中文标题】如何组合和旋转具有不同结构的数据框【英文标题】:How to combine and pivot dataframes with different structures 【发布时间】:2021-07-23 02:42:54 【问题描述】:我根据一些关于发布问题的建议进行了编辑和重新发布
我需要一些帮助以特定方式组合 3 个数据帧。
所以我从 3 个数据集开始
这是第一个:
df_WL = pd.read_csv('Regional_Scale_GWL_data.csv')
df_WL
Date SiteNo WL
0 8/20/1992 6203301 58
1 2/16/1993 6203301 57
2 2/23/1994 6203301 57
3 11/17/1994 6203301 58
4 11/16/1995 6203301 57
... ... ... ...
784760 12/6/2017 334000000000000 258.22
784761 12/6/2017 334000000000000 258.22
784762 3/15/2018 334000000000000 258.43
784763 5/30/2018 334000000000000 258.34
784764 NaN 334000000000000 NaN
[784765 rows x 3 columns]
由此我创建了这个数据框:
df_WL['Date'] = pd.to_datetime(df_WL['Date'], errors='coerce')
df_WL['WL'] = pd.to_numeric(df_WL['WL'], errors='coerce')
df_WL['SiteNo'] = df_WL['SiteNo'].astype(str)
df_WL = df_WL.dropna(subset=['Date'])
df_WL = df_WL.dropna(subset=['WL'])
df_WL = df_WL.pivot_table(index='Date', columns=["SiteNo"], values=['WL']) \
.reorder_levels([1, 0], axis=1) \
.sort_index(axis=1)
df_WL
SiteNo 1021201 1023902 ... SA-0174 SA-0231 SM-0049
WL WL ... WL WL WL
Date ...
1970-01-01 NaN NaN ... NaN NaN NaN
1970-01-03 NaN NaN ... NaN NaN NaN
1970-01-05 NaN NaN ... NaN NaN NaN
1970-01-06 NaN NaN ... NaN NaN NaN
1970-01-07 3692.0 NaN ... NaN NaN NaN
... ... ... ... ... ... ...
2021-02-18 NaN NaN ... NaN NaN NaN
2021-02-19 NaN NaN ... NaN NaN NaN
2021-02-22 NaN NaN ... NaN NaN NaN
2021-02-23 NaN NaN ... NaN NaN NaN
2021-02-24 NaN NaN ... NaN 7209.0 NaN
[17353 rows x 863 columns]
我拥有的另外两个数据集是:
df_precip = pd.read_csv('Regional_Scale_Precip.csv')
df_precip = df_precip.set_index('Date')
df_precip
294957000000000 294722000000000 ... 6129203 6414105
Date
1981-01-01 0.000 0.000 ... 0.000 0.000
1981-01-02 0.000 0.000 ... 0.000 0.000
1981-01-03 0.000 0.000 ... 0.000 0.000
1981-01-04 0.000 0.000 ... 0.000 0.000
2017-05-27 0.000 0.000 ... 0.000 0.000
... ... ... ... ... ...
2017-05-22 13.529 15.883 ... 19.788 45.493
2017-05-23 16.181 28.589 ... 36.448 8.722
2017-05-24 13.189 16.917 ... 15.643 14.794
2017-05-25 0.000 0.000 ... 0.000 0.000
2017-05-26 0.000 0.000 ... 0.000 0.000
[13295 rows x 1331 columns]
和
df_temp = pd.read_csv('Regional_Scale_Temp.csv')
df_temp = df_temp.set_index('Date')
df_temp
6131901 6129203 ... 6414105 6155707
Date
1981-01-01 8.965 8.733 ... 9.117 9.118
1981-01-02 6.654 6.614 ... 7.834 7.195
1981-01-03 4.794 4.796 ... 4.826 4.880
1981-01-04 7.582 7.752 ... 8.380 8.018
2009-08-25 22.438 22.129 ... 23.607 22.702
... ... ... ... ... ...
2009-08-20 27.354 27.177 ... 28.498 28.055
2009-08-21 26.706 26.397 ... 28.671 27.479
2009-08-22 25.126 24.778 ... 26.644 25.600
2009-08-23 22.001 21.835 ... 23.803 22.543
2009-08-24 21.626 21.422 ... 23.257 22.160
[10463 rows x 1331 columns]
我的目标是创建一个看起来像这样的数据框(我在 excel 中使用任意值来说明我的目标):
我已尝试将最后两个数据帧拆开,并将它们与第一个数据帧组合,以创建一个包含“日期”、“站点编号”、“WL”、“温度”和“沉淀”列的数据帧。然后调整它们以获得我的目标,但那是一团糟。
任何帮助将不胜感激。谢谢!
【问题讨论】:
【参考方案1】:-
从文件创建的所有 DataFrame 都应转换为标准长格式,包含
'Date'
、'Site'
、'Values'
和 'Type'
列。
使用pandas.concat
组合所有DataFrame
使用pandas.DataFrame.pivot
实现所需的形式
加载和清理
import pandas as pd
# load the data
wl = pd.read_csv('wl.csv', parse_dates=['Date'])
pre = pd.read_csv('precip.csv', index_col='Date', parse_dates=['Date'])
temp = pd.read_csv('temp.csv', index_col='Date', parse_dates=['Date'])
# wl is already in a long form so clean the column names
wl.rename('SiteNo': 'Site', 'WL': 'Values', axis=1, inplace=True)
# stack the other two dataframes into a long form
pre = pre.stack().reset_index(name='Values').rename('level_1': 'Site', axis=1)
temp = temp.stack().reset_index(name='Values').rename('level_1': 'Site', axis=1)
# add a Type column
wl['Type'] = 'WL'
pre['Type'] = 'Precip'
temp['Type'] = 'Temp'
# sample of wl
Date Site Values Type
0 1992-08-20 6203301 58.0 WL
1 1993-02-16 6203301 57.0 WL
2 1994-02-23 6203301 57.0 WL
3 1994-11-17 6203301 58.0 WL
4 1995-11-16 6203301 57.0 WL
结合和旋转
提供的示例数据对齐不正确,因此所有'Site'
列在透视数据帧中没有所有'Type'
列
# combine the DataFrames
df = pd.concat([wl, pre, temp])
# drop duplicate rows - there shouldn't be any, but the sample data did
df.drop_duplicates(inplace=True)
# sort the values - not strictly necessary
df = df.sort_values(['Date', 'Site', 'Type']).reset_index(drop=True)
# dropna
df.dropna(subset=['Date'], inplace=True)
# pivot
dfp = df.pivot(index='Date', columns=['Site', 'Type'], values='Values')
# display(dfp.head())
Site 294722000000000 294957000000000 6129203 6131901 6155707 6414105 6203301 334000000000000
Type Precip Precip Precip Temp Temp Temp Precip Temp WL WL
Date
1981-01-01 0.0 0.0 0.0 8.733 8.965 9.118 0.0 9.117 NaN NaN
1981-01-02 0.0 0.0 0.0 6.614 6.654 7.195 0.0 7.834 NaN NaN
1981-01-03 0.0 0.0 0.0 4.796 4.794 4.880 0.0 4.826 NaN NaN
1981-01-04 0.0 0.0 0.0 7.752 7.582 8.018 0.0 8.380 NaN NaN
1992-08-20 NaN NaN NaN NaN NaN NaN NaN NaN 58.0 NaN
【讨论】:
以上是关于如何组合和旋转具有不同结构的数据框的主要内容,如果未能解决你的问题,请参考以下文章