Pandas merge two dataframes with different columns

Posted 2023-03-11

【Question】 (asked 2015-03-21 17:50:40):

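I must be missing something simple here. I'm trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left one doesn't have, and vice versa.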

>df_may

  id  quantity  attr_1  attr_2
0  1        20       0       1
1  2        23       1       1
2  3        19       1       1
3  4        19       0       0

>df_jun

  id  quantity  attr_1  attr_3
0  5         8       1       0
1  6        13       0       1
2  7        20       1       1
3  8        25       1       1

I've tried joining with an outer join:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")

But that yields:

Left data columns not unique: Index([....

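I've also specified a single column to join on (e.g. on = "id"), but this duplicates every column except id, giving attr_1_x, attr_1_y and so on, which is not ideal. I've also passed the entire list of columns (there are many) to on: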

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))

Which yields:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

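What am I missing? I'd like to get a df with all rows appended, with attr_1, attr_2 and attr_3 populated where possible and NaN where they don't appear. This seems like a pretty typical data-munging workflow, but I'm stuck.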

Thanks in advance.

【Answer 1】:

The accepted answer will break if there are duplicate headers:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects.

For example, here A has three trial columns, which prevents concat:

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
#    id  trial  trial  trial
# 0   3      1      4      1

B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
#    id  trial
# 0   5      9
# 1   2      6

pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects

To resolve this, deduplicate the column names before the concat:

parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})

for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns) 

pd.concat([A, B], ignore_index=True)
#    id  trial  trial.1  trial.2
# 0   3      1        4        1
# 1   5      9      NaN      NaN
# 2   2      6      NaN      NaN

Or as a one-liner, though less readable:

pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)

Note that for older pandas versions, where the base_parser module is not available, use: parser = pd.io.parsers.ParserBase({})
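
Since `_maybe_dedup_names` is a private method, it may change or disappear between pandas releases. As a rough alternative, the same renaming can be sketched with public API only. The helper name `dedup_columns` is hypothetical (it is not part of pandas), and this sketch assumes no column is already named like `trial.1`:

```python
import pandas as pd


def dedup_columns(columns):
    """Hypothetical helper: rename repeated labels as name.1, name.2, ..."""
    seen = {}
    result = []
    for name in columns:
        count = seen.get(name, 0)
        # The first occurrence keeps its name; later ones get a numeric suffix.
        result.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return result


A = pd.DataFrame([[3, 1, 4, 1]], columns=["id", "trial", "trial", "trial"])
B = pd.DataFrame([[5, 9], [2, 6]], columns=["id", "trial"])

# set_axis returns a new frame, so A and B themselves are left unchanged.
frames = [df.set_axis(dedup_columns(df.columns), axis=1) for df in [A, B]]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Once every frame has unique column labels, concat can align the columns by name again.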

【Answer 2】:

I ran into this problem today with each of concat, append and merge, and I got around it by adding a sequentially numbered helper column and then doing an outer join:

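# Give every row across both frames a unique, sequential helper key.
helper = 1
for i in df1.index:
    df1.loc[i, 'helper'] = helper
    helper = helper + 1
for i in df2.index:
    df2.loc[i, 'helper'] = helper
    helper = helper + 1
# No helper value is shared, so the outer join keeps every row of both frames.
df1.merge(df2, on='helper', how='outer')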
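
The row-by-row loops above can probably be replaced by assigning a whole range at once. This is a minimal sketch of the same helper-column idea; the two small frames are invented for illustration only:

```python
import pandas as pd

# Invented sample frames; only the shape of the technique matters here.
df1 = pd.DataFrame({"id": [1, 2], "attr_2": [1, 0]})
df2 = pd.DataFrame({"id": [5, 6], "attr_3": [0, 1]})

# Number the rows of df1 from 1, then continue the count in df2,
# so no helper value occurs in both frames.
df1 = df1.assign(helper=range(1, len(df1) + 1))
df2 = df2.assign(helper=range(len(df1) + 1, len(df1) + len(df2) + 1))

merged = df1.merge(df2, on="helper", how="outer")
print(merged)
```

Columns that exist in both frames (here `id`) still come back with `_x` and `_y` suffixes, so this approach keeps every row but does not line up the shared columns.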

【Comments】:

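What's wrong with the accepted answer: pd.concat([df,df1], axis=0, ignore_index=True)?

I arrived at this with non-unique columns. Consider a = pd.DataFrame({'d':[1], 'b':[2]}).rename(columns={'b':'d'}) and b = pd.DataFrame({'d':[4, 6]}); then pd.concat([a, b], axis=0, ignore_index=True) fails. Although some workarounds can be applied, I think it is better to fix the root of the problem and have unique column names (as in my case). Also, I get some warnings when trying to rename a column to a name that already exists.

【Answer 3】: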

I think in this case concat is what you want:

In [12]:

pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

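By passing axis=0 here you are stacking the dfs on top of each other, which I believe is what you want, and NaN values are produced wherever a column is absent from the respective df.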
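
For reference, here is a self-contained sketch of the same call using the df_may and df_jun frames from the question, rebuilt by hand from the printed tables. The column order of the result may differ from the output above depending on the pandas version:

```python
import pandas as pd

# Frames rebuilt from the tables shown in the question.
df_may = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df_jun = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# Stack the rows; columns are aligned by name and the gaps become NaN.
mayjundf = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
print(mayjundf)
```

Because NaN is a floating-point value, attr_2 and attr_3 come back as float columns rather than integers.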

【Comments】:

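For some reason this doesn't work for me. I get pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects. I'm trying to combine three DFs with different columns this way. Some columns are added and some are missing.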