Looping through and merging DataFrames with same index, same columns (however a few columns unique to each DataFrame)

Posted: 2020-05-12 01:04:33

Question:

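Description of the required task

I use the code below to merge df and df1 (sample data shown), and it does what I need. However, I need to loop through a large number of DataFrames (such as df2, and later df3, df4, etc.) and do not know how to modify the code. My DataFrames share the same index and mostly the same columns, but each one has a few columns of its own. I would like to change the code so that it merges df and df1 to create requireddata, then repeats the step with requireddata merged with df2. The same logic would continue with requireddata merged with df3, and so on. Any help would be great!! :)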

df

       ID    AA  TA  TL
Date                      
2001  AAPL   1.0  44  50 
2002  AAPL   3.0  33  51 
2003  AAPL   2.0  22  53 
2004  AAPL   5.0  11  76 
2005  AAPL   2.0  33  44 
2006  AAPL   3.0  22  12 

df1

       ID    AA  TA  ML
Date                      
2001  MSFT   3.5  44  12
2002  MSFT   6.7  33  15
2003  MSFT   2.3  22  19
2004  MSFT   5.5  11  20
2005  MSFT   2.2  33  43
2006  MSFT   3.2  22  23

df2 (example)

       ID    AA  TA  PP
Date                      
2001  TSLA   3.3  48  18
2002  TSLA   6.3  38  18
2003  TSLA   2.6  28  18
2004  TSLA   5.3  18  28
2005  TSLA   2.3  38  48
2006  TSLA   3.3  28  28

Code used

dfdates['Date'] # this has dates required for index
df
df1

cols_to_use = df.columns.difference(df1.columns) #compare column difference df and df1
cols_to_use1 = df1.columns.difference(df.columns) #compare column difference df1 and df

dataframe = pd.DataFrame(columns = cols_to_use, index = df['Date']) #dataframe with columns in df1 but not in df
dataframe1 = pd.DataFrame(columns = cols_to_use1, index = df1['Date']) #dataframe with columns in df but not in df1

datatesting = pd.concat([dataframe, df], axis=1) #merge missing columns into df
datatesting1 = pd.concat([dataframe1, df1], axis=1) #merge missing columns into df1

diff = datatesting1.columns.difference(datatesting.columns) #check difference (is 0)
print (diff)
frames = [datatesting, datatesting1] #list of dataframes 
requireddata = pd.concat(frames) #merge dataframes

This creates:

       ID    AA   TA   TL  ML
Date                      
2001  AAPL   1.0  44  50  NaN
2002  AAPL   3.0  33  51  NaN
2003  AAPL   2.0  22  53  NaN
2004  AAPL   5.0  11  76  NaN
2005  AAPL   2.0  33  44  NaN
2006  AAPL   3.0  22  12  NaN                    
2001  MSFT   3.5  44  NaN  12
2002  MSFT   6.7  33  NaN  15
2003  MSFT   2.3  22  NaN  19
2004  MSFT   5.5  11  NaN  20
2005  MSFT   2.2  33  NaN  43
2006  MSFT   3.2  22  NaN  23

With the looping code, I would like something like this:

       ID    AA   TA   TL  ML  PP
Date                      
2001  AAPL   1.0  44  50  NaN  NaN
2002  AAPL   3.0  33  51  NaN  NaN
2003  AAPL   2.0  22  53  NaN  NaN
2004  AAPL   5.0  11  76  NaN  NaN
2005  AAPL   2.0  33  44  NaN  NaN
2006  AAPL   3.0  22  12  NaN  NaN                  
2001  MSFT   3.5  44  NaN  12  NaN
2002  MSFT   6.7  33  NaN  15  NaN
2003  MSFT   2.3  22  NaN  19  NaN
2004  MSFT   5.5  11  NaN  20  NaN
2005  MSFT   2.2  33  NaN  43  NaN
2006  MSFT   3.2  22  NaN  23  NaN
2001  TSLA   3.3  48  NaN  NaN  18
2002  TSLA   6.3  38  NaN  NaN  18
2003  TSLA   2.6  28  NaN  NaN  18
2004  TSLA   5.3  18  NaN  NaN  28
2005  TSLA   2.3  38  NaN  NaN  48
2006  TSLA   3.3  28  NaN  NaN  28

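Comments:

IIUC, you are combining the DataFrames into one. Why not just concatenate them? pd.concat([df,df1,df2]). Are there other rules you have not mentioned?

If I were you, I would use a MultiIndex with Date and ID.

If you want to concatenate: pd.concat([df,df1,df2],sort = False)

Answer 1: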

I don't think the column differences are needed here. Just use concat, and the columns are aligned correctly:

df = pd.concat([df,df1,df2], sort=False)
print (df)
        ID   AA  TA    TL    ML    PP
Date                                 
2001  AAPL  1.0  44  50.0   NaN   NaN
2002  AAPL  3.0  33  51.0   NaN   NaN
2003  AAPL  2.0  22  53.0   NaN   NaN
2004  AAPL  5.0  11  76.0   NaN   NaN
2005  AAPL  2.0  33  44.0   NaN   NaN
2006  AAPL  3.0  22  12.0   NaN   NaN
2001  MSFT  3.5  44   NaN  12.0   NaN
2002  MSFT  6.7  33   NaN  15.0   NaN
2003  MSFT  2.3  22   NaN  19.0   NaN
2004  MSFT  5.5  11   NaN  20.0   NaN
2005  MSFT  2.2  33   NaN  43.0   NaN
2006  MSFT  3.2  22   NaN  23.0   NaN
2001  TSLA  3.3  48   NaN   NaN  18.0
2002  TSLA  6.3  38   NaN   NaN  18.0
2003  TSLA  2.6  28   NaN   NaN  18.0
2004  TSLA  5.3  18   NaN   NaN  28.0
2005  TSLA  2.3  38   NaN   NaN  48.0
2006  TSLA  3.3  28   NaN   NaN  28.0
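
Since the question asks about an arbitrary number of DataFrames, the same idea can be sketched with a list. This is a minimal illustration and not part of the original answer: the `make_frame` helper and the shortened three-year sample are assumptions made so the example is self-contained. It shows the single `pd.concat` call over a list, the equivalent one-at-a-time loop described in the question, and the (Date, ID) MultiIndex suggested in the comments.

```python
import pandas as pd

years = [2001, 2002, 2003]  # shortened sample index, for illustration only


def make_frame(ticker, extra_col, extra_vals):
    """Hypothetical helper: build a small frame shaped like df/df1/df2."""
    return pd.DataFrame(
        {"ID": ticker, "AA": [1.0, 3.0, 2.0], "TA": [44, 33, 22], extra_col: extra_vals},
        index=pd.Index(years, name="Date"),
    )


# Collect every DataFrame in a list; the list can hold any number of frames.
frames = [
    make_frame("AAPL", "TL", [50, 51, 53]),
    make_frame("MSFT", "ML", [12, 15, 19]),
    make_frame("TSLA", "PP", [18, 18, 18]),
]

# One call stacks all frames; columns missing from a frame are filled with NaN.
requireddata = pd.concat(frames, sort=False)

# Equivalent explicit loop, merging one frame at a time as in the question.
looped = frames[0]
for frame in frames[1:]:
    looped = pd.concat([looped, frame], sort=False)

# Optional: move ID into the index to get a (Date, ID) MultiIndex.
indexed = requireddata.set_index("ID", append=True)

print(requireddata)
```

The single call over the list is usually preferable to the loop, because each `pd.concat` inside the loop copies all the data accumulated so far.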
