How to merge two DataFrames on a column with duplicates and output rows without duplicates

【Title】How do I merge two DataFrames on a column with duplicates, and output rows without duplicates? 【Posted】2021-09-27 06:50:21 【Question】:

I have two DataFrames, df1 and df2, shown below:

df1:

  company     occupation
0       A  Administrator
1       B       Engineer
2       C       Engineer
3       D        Account
4       E  Administrator
5       F       Engineer

df2:

      occupation description
0        Account     balance
1       Engineer    database
2  Administrator      chores
3  Administrator     calling
4       Engineer    frontend
5       Engineer  backendend

What I want:

  company     occupation description
0       A  Administrator      chores
1       B       Engineer    database
2       C       Engineer    frontend
3       D        Account     balance
4       E  Administrator     calling
5       F       Engineer  backendend

I tried pd.merge(df1, df2, how="inner"), but I always get duplicate rows:

   company     occupation description
0        A  Administrator      chores
1        A  Administrator     calling
2        E  Administrator      chores
3        E  Administrator     calling
4        B       Engineer    database
5        B       Engineer    frontend
6        B       Engineer  backendend
7        C       Engineer    database
8        C       Engineer    frontend
9        C       Engineer  backendend
10       F       Engineer    database
11       F       Engineer    frontend
12       F       Engineer  backendend
13       D        Account     balance

Code:

import pandas as pd

# display() is provided by Jupyter/IPython; use print() in a plain script.
df1 = pd.DataFrame({"company": ["A", "B", "C", "D", "E", "F"], "occupation": ["Administrator", "Engineer", "Engineer", "Account", "Administrator", "Engineer"]})
df2 = pd.DataFrame({"occupation": ["Account", "Engineer", "Administrator", "Administrator", "Engineer", "Engineer"], "description": ["balance", "database", "chores", "calling", "frontend", "backendend"]})
# df3 is the desired result (description order matches the corrected table above)
df3 = pd.DataFrame({"company": ["A", "B", "C", "D", "E", "F"], "occupation": ["Administrator", "Engineer", "Engineer", "Account", "Administrator", "Engineer"], "description": ["chores", "database", "frontend", "balance", "calling", "backendend"]})
# df4 is the attempted merge; it joins on the shared "occupation" column
df4 = pd.merge(df1, df2, how="inner")
display(df1)
display(df2)
display(df3)
display(df4)
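
As a side note on why the attempt above returns 14 rows: when a join key is repeated on both sides, an inner merge pairs every left row with every right row that shares the key. The sketch below, using the same data as the question, predicts the row count from the per-occupation counts. The names left_counts, right_counts, and expected_rows are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})
df2 = pd.DataFrame({
    "occupation": ["Account", "Engineer", "Administrator",
                   "Administrator", "Engineer", "Engineer"],
    "description": ["balance", "database", "chores",
                    "calling", "frontend", "backendend"],
})

# How often each occupation appears on each side
left_counts = df1["occupation"].value_counts()
right_counts = df2["occupation"].value_counts()

# Each key contributes (left count * right count) rows to the inner merge:
# Administrator 2*2, Engineer 3*3, Account 1*1
expected_rows = int((left_counts * right_counts).sum())

merged = pd.merge(df1, df2, how="inner")
print(expected_rows, len(merged))  # both are 14
```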

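【Comments】:

Is your desired output accurate? I assume you want to match each occurrence of an occupation in df1 with the corresponding occurrence in df2, e.g. the first Engineer is assigned "database". If so, your desired output may be inaccurate?

I think frontend and balance may have been swapped.

Yes, that was a typo. I have fixed it.

【Answer 1】:

Let's try using groupby cumcount to create a key column that tracks position within each occupation, then merge on occupation and key: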

df1['key'] = df1.groupby('occupation').cumcount()
df2['key'] = df2.groupby('occupation').cumcount()
df4 = df1.merge(df2, on=['occupation', 'key']).drop('key', axis=1)

df4:

  company     occupation description
0       A  Administrator      chores
1       B       Engineer    database
2       C       Engineer    frontend
3       D        Account     balance
4       E  Administrator     calling
5       F       Engineer  backendend

df4 without dropping key:

  company     occupation  key description
0       A  Administrator    0      chores
1       B       Engineer    0    database
2       C       Engineer    1    frontend
3       D        Account    0     balance
4       E  Administrator    1     calling
5       F       Engineer    2  backendend
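
To make the role of the key column more concrete, here is a small sketch of what cumcount produces for each frame, using the question's data. It numbers the rows within each occupation in order of appearance, so every (occupation, key) pair is unique on both sides and the merge becomes one-to-one. The variable names key1 and key2 are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})
df2 = pd.DataFrame({
    "occupation": ["Account", "Engineer", "Administrator",
                   "Administrator", "Engineer", "Engineer"],
    "description": ["balance", "database", "chores",
                    "calling", "frontend", "backendend"],
})

# Running count of each occupation, starting at 0, in row order
key1 = df1.groupby("occupation").cumcount()
key2 = df2.groupby("occupation").cumcount()

print(key1.tolist())  # [0, 0, 1, 0, 1, 2]
print(key2.tolist())  # [0, 0, 0, 1, 1, 2]

# With the counter added, no (occupation, key) pair repeats on either side
left_unique = not df1.assign(key=key1).duplicated(["occupation", "key"]).any()
right_unique = not df2.assign(key=key2).duplicated(["occupation", "key"]).any()
```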

This can also be done by passing the Series directly to merge, without modifying df1 or df2:

df4 = df1.merge(
    df2,
    left_on=['occupation', df1.groupby('occupation').cumcount()],
    right_on=['occupation', df2.groupby('occupation').cumcount()]
).drop('key_1', axis=1)

df4:

  company     occupation description
0       A  Administrator      chores
1       B       Engineer    database
2       C       Engineer    frontend
3       D        Account     balance
4       E  Administrator     calling
5       F       Engineer  backendend
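
If this pattern is needed in more than one place, it could be wrapped in a small function. The following is only a sketch: merge_by_occurrence and the temporary column name _occurrence are hypothetical names, not part of pandas, and the function assumes a single join column passed as a string. It uses merge's validate="one_to_one" option as a safety check and leaves both inputs unmodified.

```python
import pandas as pd

def merge_by_occurrence(left, right, on, how="inner"):
    """Pair the n-th occurrence of each `on` value in `left` with the n-th in `right`.

    Hypothetical helper (not a pandas API). `on` is assumed to be one column name.
    """
    occ = "_occurrence"  # temporary column; assumed not to clash with real columns
    left_keyed = left.assign(**{occ: left.groupby(on).cumcount()})
    right_keyed = right.assign(**{occ: right.groupby(on).cumcount()})
    merged = left_keyed.merge(
        right_keyed, on=[on, occ], how=how, validate="one_to_one"
    )
    return merged.drop(columns=occ)

df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})
df2 = pd.DataFrame({
    "occupation": ["Account", "Engineer", "Administrator",
                   "Administrator", "Engineer", "Engineer"],
    "description": ["balance", "database", "chores",
                    "calling", "frontend", "backendend"],
})

result = merge_by_occurrence(df1, df2, "occupation")

# A fourth Engineer in the left frame has no partner in df2;
# with how="left" the row is kept and its description is missing.
extra = pd.concat(
    [df1, pd.DataFrame({"company": ["G"], "occupation": ["Engineer"]})],
    ignore_index=True,
)
left_result = merge_by_occurrence(extra, df2, "occupation", how="left")
```

With the default how="inner", an occurrence that has no counterpart on the other side is silently dropped, so the choice of how matters when the two frames do not contain the same number of rows per occupation.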

【Comments】:

【Answer 2】:

You can synthesize the missing part of the merge condition: the position of each occupation within its DataFrame.

import pandas as pd

df1 = pd.DataFrame({'company': ['A', 'B', 'C', 'D', 'E', 'F'],
 'occupation': ['Administrator','Engineer','Engineer','Account','Administrator','Engineer']})

df2 = pd.DataFrame({'occupation': ['Account','Engineer','Administrator','Administrator','Engineer','Engineer'],
 'description': ['balance','database','chores','calling','frontend','backendend']})

# oid = occurrence number of each occupation within its own frame
df1.assign(oid=df1.groupby("occupation", as_index=False).cumcount()).merge(
    df2.assign(oid=df2.groupby("occupation", as_index=False).cumcount()),
    on=["occupation", "oid"],
)

  company     occupation  oid description
0       A  Administrator    0      chores
1       B       Engineer    0    database
2       C       Engineer    1    frontend
3       D        Account    0     balance
4       E  Administrator    1     calling
5       F       Engineer    2  backendend
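
One way to confirm that this approach reproduces the desired output from the question is to compare it against the expected frame with pandas' testing helper. This is a sketch under the assumption that the helper column oid is dropped after the merge; the name expected is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})
df2 = pd.DataFrame({
    "occupation": ["Account", "Engineer", "Administrator",
                   "Administrator", "Engineer", "Engineer"],
    "description": ["balance", "database", "chores",
                    "calling", "frontend", "backendend"],
})

# Merge on occupation plus its occurrence number, then drop the helper column
result = (
    df1.assign(oid=df1.groupby("occupation").cumcount())
    .merge(
        df2.assign(oid=df2.groupby("occupation").cumcount()),
        on=["occupation", "oid"],
    )
    .drop(columns="oid")
    .reset_index(drop=True)
)

# Desired output as stated in the question
expected = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
    "description": ["chores", "database", "frontend",
                    "balance", "calling", "backendend"],
})

# Raises AssertionError if the two frames differ
pd.testing.assert_frame_equal(result, expected)
```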

【Comments】:

Isn't this the same as my answer?

@HenryEcker It is semantically equivalent: it uses cumcount() to track position.
