Keep rows in data frame that, for all combinations of the values of certain columns, contain the same elements in another column

Posted 2023-02-27


【Title】: Keep rows in data frame that, for all combinations of the values of certain columns, contain the same elements in another column
【Posted】: 2021-04-18 19:05:29
【Question】:

import pandas as pd
df = pd.DataFrame({'a':['x','x','x','x','x','y','y','y','y','y'],'b':['z','z','z','w','w','z','z','w','w','w'],'c':['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],'d':range(1,11)})

   a  b   c   d
0  x  z  c1   1
1  x  z  c2   2
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
8  y  w  c2   9
9  y  w  c3  10

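How can I keep only the rows whose value in c is present in every combination of a and b? In other words, how can I exclude the rows whose c value exists in only some combinations of a and b?

For example, only c1 and c3 are present in all combinations of a and b ([x,z], [x,w], [y,z], [y,w]), so the output would be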

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

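【Question comments】:

【Answer 1】:

Here is one approach. Get the unique values of c for each group, then use reduce with np.intersect1d to find the elements common to all of the returned arrays. Then filter the data frame using Series.isin and boolean indexing.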

from functools import reduce
import numpy as np
# keep the rows whose c value occurs in every (a, b) group
out = df[df['c'].isin(reduce(np.intersect1d,df.groupby(['a','b'])['c'].unique()))]

Breakdown:

s = df.groupby(['a','b'])['c'].unique()
common_elements = reduce(np.intersect1d,s)
#Returns :-> array(['c1', 'c3'], dtype=object)

out = df[df['c'].isin(common_elements )]#.copy()

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
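
The same intersection can also be written with plain Python sets instead of NumPy arrays. The following is only a sketch of that variant, not part of the answer above; the names groups and common are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# One set of c values per (a, b) group.
groups = df.groupby(['a', 'b'])['c'].apply(set)

# Values of c shared by every group.
common = set.intersection(*groups)

# Keep only the rows whose c value is in the shared set.
out = df[df['c'].isin(common)]
print(out)
```

Sets make the intent explicit, but the result carries no ordering, so sort common if a stable order matters.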

【Comments】:

【Answer 2】:

Let's try groupby and nunique to count, for each group of column c, the number of unique (a, b) combinations it appears in:

s = df['a'] + ',' + df['b'] # combination of a, b
m = s.groupby(df['c']).transform('nunique').eq(s.nunique())

df[m]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
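
Building the key by joining strings can, in principle, map two different (a, b) pairs to the same key. One way to sidestep that, sketched here as an assumption rather than as the answer's method, is to label each pair with groupby(...).ngroup(), which assigns one integer per distinct pair. The name key is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Why string keys can collide: two different pairs, one concatenated key.
assert 'aa' + 'b' == 'a' + 'ab'

# Integer label for each distinct (a, b) pair; no string key is built.
key = df.groupby(['a', 'b']).ngroup()

# A c value is kept when it appears with every distinct pair.
m = key.groupby(df['c']).transform('nunique').eq(key.nunique())

out = df[m]
print(out)
```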

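【Comments】:

add is not safe: it would confuse ('aa', 'b') with ('a', 'ab').

@QuangHoang Yes. How about using a delimiter, as in df['a'] + ',' + df['b']?

That would be sensitive to a , inside the fields. Better to use a tuple.

@QuangHoang That is a very extreme case; I think using a more complex delimiter should also work fine there.

【Answer 3】:

Try something different with crosstab: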

s = pd.crosstab([df['a'],df['b']],df.c).all()
out = df.loc[df.c.isin(s.index[s])]
Out[34]: 
   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
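
To see why .all() selects the right values, it can help to look at the intermediate table. The sketch below splits the one-liner into steps; the names ct and in_every_combo are illustrative, and the explicit gt(0) is an addition that spells out the "count is non-zero" test that .all() performs implicitly on the counts.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Rows: (a, b) combinations. Columns: c values. Cells: occurrence counts.
ct = pd.crosstab([df['a'], df['b']], df['c'])
print(ct)

# True for a c value only if its count is non-zero in every (a, b) row.
in_every_combo = ct.gt(0).all()

out = df.loc[df['c'].isin(in_every_combo.index[in_every_combo])]
print(out)
```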

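【Comments】:

【Answer 4】:

Let's try pivoting the table and then dropping the columns that contain NA, since an NA means the value is missing from some combination: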

all_data =(df.pivot(index=['a','b'], columns='c', values='c')
             .loc[:, lambda x: x.notna().all()]
             .columns)
df[df['c'].isin(all_data)]

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
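
If pivot raises an error when c is used for both columns and values, one possible workaround is to use pivot_table on another column. The sketch below is a variant under stated assumptions, not the answer's code: it uses values='d' with aggfunc='count', which relies on d having no missing values. The names counts and keep are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Count rows per (a, b) and c; combinations that never occur become NaN.
counts = df.pivot_table(index=['a', 'b'], columns='c', values='d', aggfunc='count')

# Keep the c columns that have a count in every (a, b) row.
keep = counts.loc[:, counts.notna().all()].columns

out = df[df['c'].isin(keep)]
print(out)
```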

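【Comments】:

@anky Perhaps pivot does not allow the same column to be used for both columns and values. Use values='d' and aggfunc='size' with pivot_table.

【Answer 5】:

We can use groupby + size followed by unstack, which fills in NaN for the ['a', 'b'] groups that lack a given 'c' group. We then dropna and subset the original DataFrame to the c values that survive the dropna.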

df[df.c.isin(df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1).columns)]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

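The result of the groupby operation contains columns only for the c groups that are present in every unique combination of ['a', 'b'], so we just take the columns attribute.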

df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1)

#c     c1   c3
#a b          
#x w  1.0  1.0
#  z  1.0  1.0
#y w  1.0  1.0
#  z  1.0  1.0

【Comments】:

【Answer 6】:

You can use a list comprehension with str.contains:

unq = [[x, len(df[(df[['a','b','c']].agg(','.join, axis=1)).str.contains(',' + x)]
                   .drop_duplicates())] for x in df['c'].unique()]
keep = [lst[0] for lst in unq if lst[1] == max([lst[1] for lst in unq])]
df = df[df['c'].isin(keep)]
df

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

【Comments】:

【Answer 7】:

If you make the assumptions listed below, this tells you which elements of column c to keep:

df.groupby("c")["a"].count() == df.groupby("c")["a"].count().max()

Output:

c
c1     True
c2    False
c3     True
Name: a, dtype: bool

Assumptions:

There are no duplicate rows.
At least one value of column c occurs in every combination of a and b.
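
Both assumptions can be relaxed by counting distinct (a, b) pairs instead of rows. The following is a sketch of that idea, not the answer's code. The frame df_dup, which repeats the first row to simulate a duplicate, and the names pairs, n_combos, per_c and keep are all illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Hypothetical input with a duplicated (x, z, c1) row.
df_dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# Distinct (a, b, c) triples, so repeated rows are counted once.
pairs = df_dup[['a', 'b', 'c']].drop_duplicates()

# Number of distinct (a, b) combinations in the data.
n_combos = len(df_dup[['a', 'b']].drop_duplicates())

# A c value is kept only if it occurs with every combination.
per_c = pairs.groupby('c').size()
keep = per_c[per_c == n_combos].index

out = df_dup[df_dup['c'].isin(keep)]
print(out)
```

If no c value occurs in every combination, keep is empty and so is the result, rather than falling back to the most frequent value.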

【Comments】:

【Answer 8】:

You can use value_counts to get all combinations of a and b:

vc = df[['a', 'b']].drop_duplicates().value_counts()

Result:

You can then compare each group's counts against vc and filter out the groups that are missing combinations:

df.groupby('c').filter(lambda x: x[['a', 'b']].value_counts().ge(vc).all())

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

【Comments】:

【Answer 9】:

Assuming, as in the example, that there are 4 distinct combinations of a and b:

A simple solution could be:

df[df['a'].groupby(df['c']).transform('count').eq(4)]
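
The literal 4 does not have to be hard-coded. As a sketch, the number of distinct (a, b) combinations can be read from the ngroups attribute of a groupby object. The name n_combos is illustrative, and this variant still assumes that no (a, b, c) row is duplicated.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Number of distinct (a, b) combinations present in the data.
n_combos = df.groupby(['a', 'b']).ngroups

# Keep rows whose c value appears exactly once per combination.
out = df[df['a'].groupby(df['c']).transform('count').eq(n_combos)]
print(out)
```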

【Comments】:
