Removing rows in df1 if a pair of column values isn't paired in another df2

Posted 2023-03-11

【Title】: Removing row in df1 if pair of column values aren't paired in another df2 【Posted】: 2019-08-29 21:25:42 【Question】:

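Given df1 and df2, I want to get df3. The only columns I want to match on are Pop and Homes. I have included the Other column so that the solution works with any number of additional columns.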

df1
City        Pop  Homes Other
City_1      100      1     0
City_1      100      2     6
City_1      100      2     2
City_1      100      3     9
City_1      200      1     6
City_1      200      2     6
City_1      200      3     7
City_1      300      1     0

df2
City        Pop  Homes Other
City_1      100      1     0
City_1      100      2     6
City_1      100      2     2
City_1      100      8     9
City_1      200      1     6
City_1      200      2     6
City_1      800      3     7
City_1      800      8     0

df3
City        Pop  Homes Other
City_1      100      1     0
City_1      100      2     6
City_1      100      2     2
City_1      200      1     6
City_1      200      2     6

I thought about grouping by City, Pop, and Homes, e.g. df1.groupby(['City', 'Pop', 'Homes']), but then I did not know how to filter on Pop and Homes.

Edit

Here is my code, so that it is easier for you to help me.

import pandas as pd

df1_string = """City_1      100      1     0
City_1      100      2     6
City_1      100      2     2
City_1      100      3     9
City_1      200      1     6
City_1      200      2     6
City_1      200      3     7
City_1      300      1     0"""

df2_string = """City_1      100      1     0
City_1      100      2     6
City_1      100      2     2
City_1      100      8     9
City_1      200      1     6
City_1      200      2     6
City_1      800      3     7
City_1      800      8     0"""

df1 = pd.DataFrame([x.split() for x in df1_string.split('\n')], columns=['City', 'Pop', 'Homes', 'Other'])
df2 = pd.DataFrame([x.split() for x in df2_string.split('\n')], columns=['City', 'Pop', 'Homes', 'Other'])

df1_keys = [x for x in df1.groupby(['Pop', 'Homes']).groups.keys()]
df2_keys = [x for x in df2.groupby(['Pop', 'Homes']).groups.keys()]

print(df1_keys)
[('100', '1'), ('100', '2'), ('100', '3'), ('200', '1'), ('200', '2'), ('200', '3'), ('300', '1')]
print(df2_keys)
[('100', '1'), ('100', '2'), ('100', '8'), ('200', '1'), ('200', '2'), ('800', '3'), ('800', '8')]

From here, filtering out the group pairs that don't match seems simple, but I can't work it out. I tried:

df1 = df1[df1.groupby(['Pop', 'Homes']).groups.keys().isin(df2.groupby(['Pop', 'Homes']).groups.keys())]

along with other variations when that didn't work, but I feel I'm close to getting it working.

Solution

df1.set_index(['Pop', 'Homes'], inplace=True)
df2.set_index(['Pop', 'Homes'], inplace=True)

# build the mask from df1's own index so that it lines up with df1's rows
df1 = df1[df1.index.isin(df2.index)]

df1.reset_index(inplace=True)
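
The same filter can also be written without changing the index of either frame. The following is a minimal sketch, assuming the key columns are Pop and Homes as in the question; the names `keys`, `df1_pairs`, and `df2_pairs` are illustrative and not from the original post, and the values are entered as integers rather than parsed from strings.

```python
import pandas as pd

cols = ['City', 'Pop', 'Homes', 'Other']
df1 = pd.DataFrame(
    [['City_1', 100, 1, 0], ['City_1', 100, 2, 6], ['City_1', 100, 2, 2],
     ['City_1', 100, 3, 9], ['City_1', 200, 1, 6], ['City_1', 200, 2, 6],
     ['City_1', 200, 3, 7], ['City_1', 300, 1, 0]], columns=cols)
df2 = pd.DataFrame(
    [['City_1', 100, 1, 0], ['City_1', 100, 2, 6], ['City_1', 100, 2, 2],
     ['City_1', 100, 8, 9], ['City_1', 200, 1, 6], ['City_1', 200, 2, 6],
     ['City_1', 800, 3, 7], ['City_1', 800, 8, 0]], columns=cols)

keys = ['Pop', 'Homes']  # assumed key columns; value pairs must also occur in df2

# One (Pop, Homes) tuple per row of each frame; df1 and df2 stay unmodified.
df1_pairs = pd.MultiIndex.from_frame(df1[keys])
df2_pairs = pd.MultiIndex.from_frame(df2[keys])

# The mask has exactly one entry per row of df1, so it can index df1 directly.
df3 = df1[df1_pairs.isin(df2_pairs)].reset_index(drop=True)
print(df3)
```

Because the mask is computed from df1's rows, it keeps working when df1 and df2 have different lengths.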

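【Comments】:

Adding a minimal reproducible example would greatly help those trying to give a working answer. But you have already got good answers below.

@RichAndrews I made an edit. It should be clearer now.

Looks good. Did you know that people "copy" data to their clipboard and then use pandas.read_clipboard()? Super handy. I think your code attempt should be in your question too; there is no reason to leave it out. But you have good answers to review!

【Answer 1】: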

IIUC, if City, Pop, and Homes are in the index, you can use isin:

df2[df2.index.isin(df1.index)]

Output:

                 Count
City  Pop Homes       
City1 100 20       152
          24       184
      200 41       163
          42       163

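【Comments】:

Do these dfs refer to groupby objects, e.g. df1.groupby(['City', 'Pop', 'Homes']) and df2.groupby(['City', 'Pop', 'Homes'])?

Yes. That groupby should create the index structure.

This doesn't work. I get an AttributeError: Cannot access attribute 'index' of 'DataFrameGroupBy' objects, try using the 'apply' method

【Answer 2】:

Create a MultiIndex on each DataFrame and do an inner join to get the intersection.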

import pandas as pd


df1_string = """City_1      100      1     0
City_1      100      2     6
City_1      100      2     2
City_1      100      3     9
City_1      200      1     6
City_1      200      2     6
City_1      200      3     7
City_1      300      1     0"""

df2_string = """City_1      100      1     0
City_1      100      2     6
City_1      100      2     2
City_1      100      8     9
City_1      200      1     6
City_1      200      2     6
City_1      800      3     7
City_1      800      8     0"""

df1 = pd.DataFrame([x.split() for x in df1_string.split('\n')], columns=['City', 'Pop', 'Homes', 'Other'])
df2 = pd.DataFrame([x.split() for x in df2_string.split('\n')], columns=['City', 'Pop', 'Homes', 'Other'])

# DataFrames benefit from an index that reflects the structure of the tabular data
df1.set_index(['City', 'Pop', 'Homes'], inplace=True)
df2.set_index(['City', 'Pop', 'Homes'], inplace=True)

# an inner join on the MultiIndex gives the intersection of the two;
# duplicated keys in df2 are dropped first so that df1 rows are not repeated
result = df1.join(df2[~df2.index.duplicated()], how='inner', lsuffix='_l', rsuffix='_r')

# a join provides all of the joined columns
result.reset_index(inplace=True)
result.drop(['Other_r'], axis=1, inplace=True)
result.columns = ['City', 'Pop', 'Homes', 'Other']

print(result)
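
An inner join returns a df1 row once for every df2 row with the same key, so a key pair that is repeated in df2 (such as Pop 100, Homes 2 here) would repeat df1 rows unless it is deduplicated first. A minimal sketch of the same idea with merge, assuming only Pop and Homes need to match as the question states; `keys` and `df2_keys` are illustrative names, and the values are entered as integers rather than parsed from strings.

```python
import pandas as pd

cols = ['City', 'Pop', 'Homes', 'Other']
df1 = pd.DataFrame(
    [['City_1', 100, 1, 0], ['City_1', 100, 2, 6], ['City_1', 100, 2, 2],
     ['City_1', 100, 3, 9], ['City_1', 200, 1, 6], ['City_1', 200, 2, 6],
     ['City_1', 200, 3, 7], ['City_1', 300, 1, 0]], columns=cols)
df2 = pd.DataFrame(
    [['City_1', 100, 1, 0], ['City_1', 100, 2, 6], ['City_1', 100, 2, 2],
     ['City_1', 100, 8, 9], ['City_1', 200, 1, 6], ['City_1', 200, 2, 6],
     ['City_1', 800, 3, 7], ['City_1', 800, 8, 0]], columns=cols)

keys = ['Pop', 'Homes']  # assumed key columns from the question

# Keep a single copy of each key pair from df2, and only the key columns,
# so the merge can neither repeat df1 rows nor add extra columns.
df2_keys = df2[keys].drop_duplicates()

# An inner merge keeps only the df1 rows whose key pair appears in df2_keys.
df3 = df1.merge(df2_keys, on=keys, how='inner')
print(df3)
```

Since only the key columns of df2 take part in the merge, no suffixes or column cleanup are needed afterwards.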
