How to merge two DataFrames on a column with duplicates and output rows without duplicates
【Title】: How do I merge two DataFrames on a column with duplicates and get output without duplicate rows? 【Posted】: 2021-09-27 06:50:21 【Question】: I have two DataFrames, df1 and df2, as shown below:
df1:
| | company | occupation |
|---|---|---|
| 0 | A | Administrator |
| 1 | B | Engineer |
| 2 | C | Engineer |
| 3 | D | Account |
| 4 | E | Administrator |
| 5 | F | Engineer |
df2:
| | occupation | description |
|---|---|---|
| 0 | Account | balance |
| 1 | Engineer | database |
| 2 | Administrator | chores |
| 3 | Administrator | calling |
| 4 | Engineer | frontend |
| 5 | Engineer | backendend |
What I want:
| | company | occupation | description |
|---|---|---|---|
| 0 | A | Administrator | chores |
| 1 | B | Engineer | database |
| 2 | C | Engineer | frontend |
| 3 | D | Account | balance |
| 4 | E | Administrator | calling |
| 5 | F | Engineer | backendend |
I tried pd.merge(df1, df2, how="inner"), but I always get duplicated rows:
| | company | occupation | description |
|---|---|---|---|
| 0 | A | Administrator | chores |
| 1 | A | Administrator | calling |
| 2 | E | Administrator | chores |
| 3 | E | Administrator | calling |
| 4 | B | Engineer | database |
| 5 | B | Engineer | frontend |
| 6 | B | Engineer | backendend |
| 7 | C | Engineer | database |
| 8 | C | Engineer | frontend |
| 9 | C | Engineer | backendend |
| 10 | F | Engineer | database |
| 11 | F | Engineer | frontend |
| 12 | F | Engineer | backendend |
| 13 | D | Account | balance |
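Code:
import pandas as pd
from IPython.display import display  # display() is only predefined inside Jupyter/IPython

# Input frames: "occupation" repeats in both of them.
df1 = pd.DataFrame({"company": ["A", "B", "C", "D", "E", "F"], "occupation": ["Administrator", "Engineer", "Engineer", "Account", "Administrator", "Engineer"]})
df2 = pd.DataFrame({"occupation": ["Account", "Engineer", "Administrator", "Administrator", "Engineer", "Engineer"], "description": ["balance", "database", "chores", "calling", "frontend", "backendend"]})
# Desired result, written out by hand to match the "What I want" table.
df3 = pd.DataFrame({"company": ["A", "B", "C", "D", "E", "F"], "occupation": ["Administrator", "Engineer", "Engineer", "Account", "Administrator", "Engineer"], "description": ["chores", "database", "frontend", "balance", "calling", "backendend"]})
# Plain inner merge on the shared "occupation" column; this produces the duplicated rows.
df4 = pd.merge(df1, df2, how="inner")
display(df1)
display(df2)
display(df3)
display(df4)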
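Why the plain merge returns 14 rows: "occupation" is not unique on either side, so pandas performs a many-to-many join and emits every left/right pairing for each key. The following is a small sketch that checks this arithmetic on the question's data; the variable names (left_counts, right_counts, rows_per_key) are illustrative and not part of the original post.

```python
import pandas as pd

df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})
df2 = pd.DataFrame({
    "occupation": ["Account", "Engineer", "Administrator",
                   "Administrator", "Engineer", "Engineer"],
    "description": ["balance", "database", "chores",
                    "calling", "frontend", "backendend"],
})

# How often each occupation appears on each side.
left_counts = df1["occupation"].value_counts()
right_counts = df2["occupation"].value_counts()

# A many-to-many merge emits left_count * right_count rows per key.
rows_per_key = left_counts * right_counts
expected_rows = int(rows_per_key.sum())

merged = pd.merge(df1, df2, how="inner")
print(rows_per_key.to_dict())
print(expected_rows, len(merged))
```

Administrator contributes 2 x 2 = 4 rows, Engineer 3 x 3 = 9 and Account 1 x 1 = 1, which adds up to the 14 rows shown in the question.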
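【Comments】:
Is your desired output accurate? I assume you want to match each occurrence of an occupation in df1 with the corresponding occurrence in df2, e.g. the first Engineer is assigned "database". If so, your desired output may be inaccurate?
I think frontend and balance may be swapped.
Yes, that was a typo. I have fixed it.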
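【Answer 1】:
Let's try using groupby cumcount to create a key column that tracks the position of each row within its occupation group, then merge on both occupation and key: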
df1['key'] = df1.groupby('occupation').cumcount()
df2['key'] = df2.groupby('occupation').cumcount()
df4 = df1.merge(df2, on=['occupation', 'key']).drop('key', axis=1)
df4:
   company     occupation  description
0        A  Administrator       chores
1        B       Engineer     database
2        C       Engineer     frontend
3        D        Account      balance
4        E  Administrator      calling
5        F       Engineer   backendend
df4 without dropping key:
   company     occupation  key  description
0        A  Administrator    0       chores
1        B       Engineer    0     database
2        C       Engineer    1     frontend
3        D        Account    0      balance
4        E  Administrator    1      calling
5        F       Engineer    2   backendend
This can also be done by merging on the Series directly, without modifying df1 or df2:
df4 = df1.merge(
    df2,
    left_on=['occupation', df1.groupby('occupation').cumcount()],
    right_on=['occupation', df2.groupby('occupation').cumcount()]
).drop('key_1', axis=1)
df4:
   company     occupation  description
0        A  Administrator       chores
1        B       Engineer     database
2        C       Engineer     frontend
3        D        Account      balance
4        E  Administrator      calling
5        F       Engineer   backendend
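If this pattern is needed more than once, it could be wrapped in a small function. The sketch below is only one possible packaging: the name merge_by_occurrence and the temporary column _occ are made up for illustration, and it assumes _occ does not clash with an existing column. It uses assign so the inputs are not modified, and passes validate="one_to_one" so pandas raises if the (occupation, occurrence) pair is somehow not unique.

```python
import pandas as pd

def merge_by_occurrence(left, right, on):
    """Hypothetical helper: pair the n-th occurrence of each `on` value."""
    # assign() returns new frames, so the caller's DataFrames stay unchanged.
    left_keyed = left.assign(_occ=left.groupby(on).cumcount())
    right_keyed = right.assign(_occ=right.groupby(on).cumcount())
    # (on, _occ) is unique on both sides, so a one-to-one check should pass.
    merged = left_keyed.merge(
        right_keyed, on=[on, "_occ"], how="inner", validate="one_to_one"
    )
    return merged.drop(columns="_occ")

df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})
df2 = pd.DataFrame({
    "occupation": ["Account", "Engineer", "Administrator",
                   "Administrator", "Engineer", "Engineer"],
    "description": ["balance", "database", "chores",
                    "calling", "frontend", "backendend"],
})

result = merge_by_occurrence(df1, df2, on="occupation")
print(result)
```

Note that with how="inner", an occurrence that exists on only one side (say, a fourth Engineer in df1 but only three in df2) is dropped; how="left" would keep it with a missing description instead.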
【Comments】:
【Answer 2】: You can synthesize the part of the merge condition that is missing: the position of each occupation within its DataFrame.
import pandas as pd

df1 = pd.DataFrame({'company': ['A', 'B', 'C', 'D', 'E', 'F'],
                    'occupation': ['Administrator', 'Engineer', 'Engineer', 'Account', 'Administrator', 'Engineer']})
df2 = pd.DataFrame({'occupation': ['Account', 'Engineer', 'Administrator', 'Administrator', 'Engineer', 'Engineer'],
                    'description': ['balance', 'database', 'chores', 'calling', 'frontend', 'backendend']})
# oid numbers each row within its occupation group (0, 1, 2, ...) on both sides.
df1.assign(oid=df1.groupby("occupation", as_index=False).cumcount()).merge(
    df2.assign(oid=df2.groupby("occupation", as_index=False).cumcount()),
    on=["occupation", "oid"],
)
| | company | occupation | oid | description |
|---|---|---|---|---|
| 0 | A | Administrator | 0 | chores |
| 1 | B | Engineer | 0 | database |
| 2 | C | Engineer | 1 | frontend |
| 3 | D | Account | 0 | balance |
| 4 | E | Administrator | 1 | calling |
| 5 | F | Engineer | 2 | backendend |
【Comments】:
Isn't this the same as my answer?
@HenryEcker It is semantically equivalent. It uses cumcount() to track position.
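The equivalence of the two answers can be checked directly. This is a minimal sketch on the question's data; the names answer1 and answer2 are illustrative, and the helper columns (key, oid) are dropped before comparing so that only the merged content is compared.

```python
import pandas as pd

df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})
df2 = pd.DataFrame({
    "occupation": ["Account", "Engineer", "Administrator",
                   "Administrator", "Engineer", "Engineer"],
    "description": ["balance", "database", "chores",
                    "calling", "frontend", "backendend"],
})

# Answer 1 style: add a "key" column (on copies here, to keep df1/df2 intact).
left, right = df1.copy(), df2.copy()
left["key"] = left.groupby("occupation").cumcount()
right["key"] = right.groupby("occupation").cumcount()
answer1 = left.merge(right, on=["occupation", "key"]).drop("key", axis=1)

# Answer 2 style: assign an "oid" column inline, then drop it for comparison.
answer2 = df1.assign(
    oid=df1.groupby("occupation", as_index=False).cumcount()
).merge(
    df2.assign(oid=df2.groupby("occupation", as_index=False).cumcount()),
    on=["occupation", "oid"],
).drop(columns="oid")

# Raises AssertionError if the two results differ in values, dtypes or labels.
pd.testing.assert_frame_equal(answer1, answer2)
print(answer1.equals(answer2))
```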