How to combine and pivot dataframes with different structures


【Posted】: 2021-07-23 02:42:54 【Question】:

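I have edited and reposted this question following some advice on how to post questions.

I need some help combining 3 dataframes in a specific way.

I start with 3 datasets.

This is the first one: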

df_WL = pd.read_csv('Regional_Scale_GWL_data.csv')
df_WL

              Date           SiteNo      WL
0        8/20/1992          6203301      58
1        2/16/1993          6203301      57
2        2/23/1994          6203301      57
3       11/17/1994          6203301      58
4       11/16/1995          6203301      57
...            ...              ...     ...
784760   12/6/2017  334000000000000  258.22
784761   12/6/2017  334000000000000  258.22
784762   3/15/2018  334000000000000  258.43
784763   5/30/2018  334000000000000  258.34
784764         NaN  334000000000000     NaN

[784765 rows x 3 columns]

From this I created the following dataframe:

df_WL['Date'] = pd.to_datetime(df_WL['Date'], errors='coerce')
df_WL['WL'] = pd.to_numeric(df_WL['WL'], errors='coerce')
df_WL['SiteNo'] = df_WL['SiteNo'].astype(str)
df_WL = df_WL.dropna(subset=['Date'])
df_WL = df_WL.dropna(subset=['WL'])

df_WL = df_WL.pivot_table(index='Date', columns=["SiteNo"], values=['WL']) \
    .reorder_levels([1, 0], axis=1) \
    .sort_index(axis=1)
df_WL

SiteNo     1021201 1023902  ... SA-0174 SA-0231 SM-0049 
                WL      WL  ...      WL      WL      WL 
Date                        ...                            
1970-01-01     NaN     NaN  ...     NaN     NaN     NaN    
1970-01-03     NaN     NaN  ...     NaN     NaN     NaN    
1970-01-05     NaN     NaN  ...     NaN     NaN     NaN    
1970-01-06     NaN     NaN  ...     NaN     NaN     NaN    
1970-01-07  3692.0     NaN  ...     NaN     NaN     NaN    
...            ...     ...  ...     ...     ...     ...    
2021-02-18     NaN     NaN  ...     NaN     NaN     NaN    
2021-02-19     NaN     NaN  ...     NaN     NaN     NaN    
2021-02-22     NaN     NaN  ...     NaN     NaN     NaN    
2021-02-23     NaN     NaN  ...     NaN     NaN     NaN    
2021-02-24     NaN     NaN  ...     NaN  7209.0     NaN  

[17353 rows x 863 columns]
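
To make the reshaping step above concrete, here is a minimal sketch that runs the same pivot_table call on a few made-up readings (the values and the two site IDs are placeholders, not rows from the real file). It shows why reorder_levels is needed: passing values as a list puts 'WL' on the outer column level.

```python
import pandas as pd

# Made-up water-level readings in the same long layout as the CSV (assumption).
df_WL = pd.DataFrame({
    'Date': pd.to_datetime(['1992-08-20', '1992-08-20', '1993-02-16']),
    'SiteNo': ['6203301', 'SA-0174', '6203301'],
    'WL': [58.0, 12.0, 57.0],
})

wide = df_WL.pivot_table(index='Date', columns=['SiteNo'], values=['WL'])
# Because values is a list, 'WL' sits on the outer column level here.
print(wide.columns.tolist())

# reorder_levels swaps the two levels so the site number ends up on top.
wide = wide.reorder_levels([1, 0], axis=1).sort_index(axis=1)
print(wide.columns.tolist())
```

Note that pivot_table aggregates repeated Date/SiteNo pairs (by mean, unless aggfunc is changed), so duplicate readings are averaged rather than rejected.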

The other two datasets I have are:

df_precip = pd.read_csv('Regional_Scale_Precip.csv')
df_precip = df_precip.set_index('Date')
df_precip

            294957000000000  294722000000000  ...  6129203  6414105  
Date                                                                      
1981-01-01            0.000            0.000  ...    0.000    0.000       
1981-01-02            0.000            0.000  ...    0.000    0.000     
1981-01-03            0.000            0.000  ...    0.000    0.000      
1981-01-04            0.000            0.000  ...    0.000    0.000      
2017-05-27            0.000            0.000  ...    0.000    0.000       
...                     ...              ...  ...      ...      ...     
2017-05-22           13.529           15.883  ...   19.788   45.493      
2017-05-23           16.181           28.589  ...   36.448    8.722      
2017-05-24           13.189           16.917  ...   15.643   14.794     
2017-05-25            0.000            0.000  ...    0.000    0.000     
2017-05-26            0.000            0.000  ...    0.000    0.000    

[13295 rows x 1331 columns]

df_temp = pd.read_csv('Regional_Scale_Temp.csv')

df_temp = df_temp.set_index('Date')
df_temp

            6131901  6129203  ...  6414105  6155707
Date                                                                                 
1981-01-01    8.965    8.733  ...    9.117    9.118   
1981-01-02    6.654    6.614  ...    7.834    7.195   
1981-01-03    4.794    4.796  ...    4.826    4.880   
1981-01-04    7.582    7.752  ...    8.380    8.018   
2009-08-25   22.438   22.129  ...   23.607   22.702   
...             ...      ...  ...      ...      ...   
2009-08-20   27.354   27.177  ...   28.498   28.055   
2009-08-21   26.706   26.397  ...   28.671   27.479   
2009-08-22   25.126   24.778  ...   26.644   25.600   
2009-08-23   22.001   21.835  ...   23.803   22.543   
2009-08-24   21.626   21.422  ...   23.257   22.160  

[10463 rows x 1331 columns]

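My goal is to create a dataframe that looks like this (I used arbitrary values in Excel to illustrate my goal):

I have tried unstacking the last two dataframes and combining them with the first one to create a dataframe with the columns 'Date', 'SiteNo', 'WL', 'Temp' and 'Precip', and then pivoting that to reach my goal, but it has been a mess.

Any help would be greatly appreciated. Thanks!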


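【Answer 1】:
    All the DataFrames created from the files should be converted to a standard long format with 'Date', 'Site', 'Values' and 'Type' columns.
    Combine all the DataFrames with pandas.concat.
    Use pandas.DataFrame.pivot to get the desired form.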
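
As a rough end-to-end illustration of those three steps, the sketch below uses small made-up frames in place of the CSV files. The site labels 'A' and 'B', the values, and the helper name to_long are all assumptions chosen for the example.

```python
import pandas as pd

# Made-up values standing in for the three CSV files (assumption, not the real data).
dates = pd.to_datetime(['2000-01-01', '2000-01-02']).rename('Date')
wl = pd.DataFrame({'Date': dates, 'Site': ['A', 'A'], 'Values': [58.0, 57.0]})
precip = pd.DataFrame({'A': [0.0, 1.5], 'B': [0.2, 0.0]}, index=dates)
temp = pd.DataFrame({'A': [8.9, 6.6], 'B': [9.1, 7.8]}, index=dates)


def to_long(wide, type_name):
    """Turn a Date-indexed, one-column-per-site frame into Date/Site/Values/Type rows."""
    long = wide.rename_axis(columns='Site').stack().reset_index(name='Values')
    long['Type'] = type_name
    return long


# Step 1: bring every frame into the same long layout.
frames = [wl.assign(Type='WL'), to_long(precip, 'Precip'), to_long(temp, 'Temp')]

# Step 2: combine them.
combined = pd.concat(frames, ignore_index=True)

# Step 3: pivot so each (Site, Type) pair becomes a column.
result = combined.pivot(index='Date', columns=['Site', 'Type'], values='Values')
result = result.sort_index(axis=1)
print(result)
```

Site 'B' has no water-level rows in this toy data, so it ends up with only 'Precip' and 'Temp' columns. Passing a list to the columns argument of pivot needs pandas 1.1 or newer.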

Load and clean

import pandas as pd

# load the data
wl = pd.read_csv('wl.csv', parse_dates=['Date'])
pre = pd.read_csv('precip.csv', index_col='Date', parse_dates=['Date'])
temp = pd.read_csv('temp.csv', index_col='Date', parse_dates=['Date'])

# wl is already in a long form so clean the column names
wl.rename({'SiteNo': 'Site', 'WL': 'Values'}, axis=1, inplace=True)

# stack the other two dataframes into a long form
pre = pre.stack().reset_index(name='Values').rename({'level_1': 'Site'}, axis=1)
temp = temp.stack().reset_index(name='Values').rename({'level_1': 'Site'}, axis=1)

# add a Type column
wl['Type'] = 'WL'
pre['Type'] = 'Precip'
temp['Type'] = 'Temp'

# sample of wl
        Date     Site  Values Type
0 1992-08-20  6203301    58.0   WL
1 1993-02-16  6203301    57.0   WL
2 1994-02-23  6203301    57.0   WL
3 1994-11-17  6203301    58.0   WL
4 1995-11-16  6203301    57.0   WL
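
To see what the stack step produces, here is a small sketch on a hypothetical two-site miniature of the precipitation file (the CSV text is made up). It also casts 'Site' to str in both frames. That cast is an addition, not part of the code above. The reasoning is that column labels read from a CSV header are strings, while a purely numeric SiteNo column is parsed as integers, so the same site could otherwise end up under two different labels.

```python
import io
import pandas as pd

# Hypothetical miniature of the precipitation file: one column per site.
csv_text = """Date,6129203,6414105
1981-01-01,0.0,0.1
1981-01-02,0.2,0.0
"""
pre = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=['Date'])

# stack() moves the column labels into an inner index level;
# reset_index() then turns both index levels into ordinary columns.
pre_long = pre.stack().reset_index(name='Values').rename({'level_1': 'Site'}, axis=1)
print(pre_long)

# A made-up water-level row whose site number was parsed as an integer.
wl = pd.DataFrame({'Date': pd.to_datetime(['1992-08-20']),
                   'Site': [6129203],
                   'Values': [58.0]})

# Cast both to str so the same site compares equal across frames.
wl['Site'] = wl['Site'].astype(str)
pre_long['Site'] = pre_long['Site'].astype(str)
```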

Combine and pivot

The sample data provided does not line up properly, so not every 'Site' has every 'Type' in the pivoted dataframe.
# combine the DataFrames
df = pd.concat([wl, pre, temp])

# drop duplicate rows - there shouldn't be any, but the sample data did
df.drop_duplicates(inplace=True)

# sort the values - not strictly necessary
df = df.sort_values(['Date', 'Site', 'Type']).reset_index(drop=True)

# dropna
df.dropna(subset=['Date'], inplace=True)

# pivot
dfp = df.pivot(index='Date', columns=['Site', 'Type'], values='Values')

# display(dfp.head())
Site       294722000000000 294957000000000 6129203        6131901 6155707 6414105        6203301 334000000000000
Type                Precip          Precip  Precip   Temp    Temp    Temp  Precip   Temp      WL              WL
Date                                                                                                            
1981-01-01             0.0             0.0     0.0  8.733   8.965   9.118     0.0  9.117     NaN             NaN
1981-01-02             0.0             0.0     0.0  6.614   6.654   7.195     0.0  7.834     NaN             NaN
1981-01-03             0.0             0.0     0.0  4.796   4.794   4.880     0.0  4.826     NaN             NaN
1981-01-04             0.0             0.0     0.0  7.752   7.582   8.018     0.0  8.380     NaN             NaN
1992-08-20             NaN             NaN     NaN    NaN     NaN     NaN     NaN    NaN    58.0             NaN
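
Since not every site carries all three types, one possible follow-up is to keep only the sites that do. The sketch below is one way to do that on made-up rows. The names types_per_site, complete_sites and dfp_complete are illustrative, and the threshold of 3 assumes exactly the three types 'WL', 'Precip' and 'Temp'.

```python
import pandas as pd

# Made-up long-format rows; site 'A' has all three types, site 'B' lacks 'WL'.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2000-01-01'] * 5),
    'Site': ['A', 'A', 'A', 'B', 'B'],
    'Type': ['WL', 'Precip', 'Temp', 'Precip', 'Temp'],
    'Values': [58.0, 0.0, 8.9, 0.2, 9.1],
})
dfp = df.pivot(index='Date', columns=['Site', 'Type'], values='Values')

# Count the distinct Type labels under each Site in the column MultiIndex.
types_per_site = dfp.columns.to_frame(index=False).groupby('Site')['Type'].nunique()
complete_sites = types_per_site[types_per_site == 3].index.tolist()

# Keep only the columns whose Site level is one of the complete sites.
mask = dfp.columns.get_level_values('Site').isin(complete_sites)
dfp_complete = dfp.loc[:, mask]
print(dfp_complete)
```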
