Looping through and merging DataFrames with same index, same columns (however a few columns unique to each DataFrame)
【Posted】: 2020-05-12 01:04:33
【Question】: Description of the required task
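I use the following code to merge df and df1 (sample data shown below), and it does what I need. However, I need to loop over a large number of DataFrames (for example df2, and then df3, df4, and so on) and I don't know how to modify the code. The DataFrames have the same index and the same columns, except that each DataFrame has a few columns of its own. I would like to modify the code so that it loops: merge df and df1 to create requireddata, then repeat the step with requireddata merged with df2. The same logic would continue with requireddata merged with df3, and so on. Any help would be great! :)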
df
ID AA TA TL
Date
2001 AAPL 1.0 44 50
2002 AAPL 3.0 33 51
2003 AAPL 2.0 22 53
2004 AAPL 5.0 11 76
2005 AAPL 2.0 33 44
2006 AAPL 3.0 22 12
df1
ID AA TA ML
Date
2001 MSFT 3.5 44 12
2002 MSFT 6.7 33 15
2003 MSFT 2.3 22 19
2004 MSFT 5.5 11 20
2005 MSFT 2.2 33 43
2006 MSFT 3.2 22 23
df2 (example)
ID AA TA PP
Date
2001 TSLA 3.3 48 18
2002 TSLA 6.3 38 18
2003 TSLA 2.6 28 18
2004 TSLA 5.3 18 28
2005 TSLA 2.3 38 48
2006 TSLA 3.3 28 28
Code used:
import pandas as pd

# dfdates['Date'] holds the dates required for the index
# df and df1 are the DataFrames shown above, each indexed by Date
cols_to_use = df1.columns.difference(df.columns)   # columns in df1 but not in df
cols_to_use1 = df.columns.difference(df1.columns)  # columns in df but not in df1
dataframe = pd.DataFrame(columns=cols_to_use, index=df.index)     # empty frame with the columns df lacks
dataframe1 = pd.DataFrame(columns=cols_to_use1, index=df1.index)  # empty frame with the columns df1 lacks
datatesting = pd.concat([dataframe, df], axis=1)     # add the missing columns to df
datatesting1 = pd.concat([dataframe1, df1], axis=1)  # add the missing columns to df1
diff = datatesting1.columns.difference(datatesting.columns)  # check the difference (should be empty)
print(diff)
frames = [datatesting, datatesting1]  # list of DataFrames
requireddata = pd.concat(frames)      # stack the DataFrames
This produces:
ID AA TA TL ML
Date
2001 AAPL 1.0 44 50 NaN
2002 AAPL 3.0 33 51 NaN
2003 AAPL 2.0 22 53 NaN
2004 AAPL 5.0 11 76 NaN
2005 AAPL 2.0 33 44 NaN
2006 AAPL 3.0 22 12 NaN
2001 MSFT 3.5 44 NaN 12
2002 MSFT 6.7 33 NaN 15
2003 MSFT 2.3 22 NaN 19
2004 MSFT 5.5 11 NaN 20
2005 MSFT 2.2 33 NaN 43
2006 MSFT 3.2 22 NaN 23
With the looping code, I would like something like this:
ID AA TA TL ML PP
Date
2001 AAPL 1.0 44 50 NaN NaN
2002 AAPL 3.0 33 51 NaN NaN
2003 AAPL 2.0 22 53 NaN NaN
2004 AAPL 5.0 11 76 NaN NaN
2005 AAPL 2.0 33 44 NaN NaN
2006 AAPL 3.0 22 12 NaN NaN
2001 MSFT 3.5 44 NaN 12 NaN
2002 MSFT 6.7 33 NaN 15 NaN
2003 MSFT 2.3 22 NaN 19 NaN
2004 MSFT 5.5 11 NaN 20 NaN
2005 MSFT 2.2 33 NaN 43 NaN
2006 MSFT 3.2 22 NaN 23 NaN
2001 TSLA 3.3 48 NaN NaN 18
2002 TSLA 6.3 38 NaN NaN 18
2003 TSLA 2.6 28 NaN NaN 18
2004 TSLA 5.3 18 NaN NaN 28
2005 TSLA 2.3 38 NaN NaN 48
2006 TSLA 3.3 28 NaN NaN 28
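The looping logic described in the question can be sketched as follows. This is only a minimal sketch: it assumes the DataFrames are already collected in a list (the name `frames` is used here for illustration), and it rebuilds just the first two rows of each sample frame. It merges the running result with the next frame on each pass, and then shows the single `pd.concat` call that should give the same result.

```python
import pandas as pd

# First two rows of each sample frame from the question, indexed by Date.
dates = pd.Index([2001, 2002], name="Date")
df = pd.DataFrame({"ID": "AAPL", "AA": [1.0, 3.0], "TA": [44, 33], "TL": [50, 51]}, index=dates)
df1 = pd.DataFrame({"ID": "MSFT", "AA": [3.5, 6.7], "TA": [44, 33], "ML": [12, 15]}, index=dates)
df2 = pd.DataFrame({"ID": "TSLA", "AA": [3.3, 6.3], "TA": [48, 38], "PP": [18, 18]}, index=dates)

# Assumption: the frames to combine are held in a list.
frames = [df, df1, df2]

# Step-by-step version: merge the running result with the next frame, then repeat.
requireddata = frames[0]
for nxt in frames[1:]:
    requireddata = pd.concat([requireddata, nxt], sort=False)

# Single call over the whole list; it avoids re-copying the growing result
# on every iteration.
all_at_once = pd.concat(frames, sort=False)

print(requireddata)
```

For a large number of frames, the single call over the list is usually the simpler choice; the explicit loop is shown only because it mirrors the step-by-step description.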
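【Comments】:
IIUC, you are combining the DataFrames into one. Why not simply concatenate them? pd.concat([df, df1, df2]). Are there any other rules you haven't mentioned?
If I were you, I would use a MultiIndex with Date and ID.
If you want to concatenate: pd.concat([df, df1, df2], sort=False)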
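The MultiIndex suggestion from the comments could look roughly like this. It is a sketch under assumptions: only the first two rows of df and df1 are rebuilt, and `combined` is a name chosen for illustration. After concatenation the Date index contains duplicates, so moving ID into the index gives every row a unique (Date, ID) key.

```python
import pandas as pd

# First two rows of df and df1 from the question, indexed by Date.
dates = pd.Index([2001, 2002], name="Date")
df = pd.DataFrame({"ID": "AAPL", "AA": [1.0, 3.0], "TA": [44, 33], "TL": [50, 51]}, index=dates)
df1 = pd.DataFrame({"ID": "MSFT", "AA": [3.5, 6.7], "TA": [44, 33], "ML": [12, 15]}, index=dates)

combined = pd.concat([df, df1], sort=False)

# Append ID to the existing Date index, producing a (Date, ID) MultiIndex,
# then sort so that lookups by key are efficient.
combined = combined.set_index("ID", append=True).sort_index()

# Select one row by its (Date, ID) pair.
print(combined.loc[(2001, "MSFT")])
```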
【Answer 1】:
I don't think the column differences are needed here; just use concat, which aligns the columns correctly:
df = pd.concat([df,df1,df2], sort=False)
print (df)
ID AA TA TL ML PP
Date
2001 AAPL 1.0 44 50.0 NaN NaN
2002 AAPL 3.0 33 51.0 NaN NaN
2003 AAPL 2.0 22 53.0 NaN NaN
2004 AAPL 5.0 11 76.0 NaN NaN
2005 AAPL 2.0 33 44.0 NaN NaN
2006 AAPL 3.0 22 12.0 NaN NaN
2001 MSFT 3.5 44 NaN 12.0 NaN
2002 MSFT 6.7 33 NaN 15.0 NaN
2003 MSFT 2.3 22 NaN 19.0 NaN
2004 MSFT 5.5 11 NaN 20.0 NaN
2005 MSFT 2.2 33 NaN 43.0 NaN
2006 MSFT 3.2 22 NaN 23.0 NaN
2001 TSLA 3.3 48 NaN NaN 18.0
2002 TSLA 6.3 38 NaN NaN 18.0
2003 TSLA 2.6 28 NaN NaN 18.0
2004 TSLA 5.3 18 NaN NaN 28.0
2005 TSLA 2.3 38 NaN NaN 48.0
2006 TSLA 3.3 28 NaN NaN 28.0