Python group by; keep only when condition is met

Posted 2023-02-16


【Title】: Python group by; keep only when condition is met
【Published】: 2021-10-30 14:19:21
【Question】:

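Suppose you have a dataset containing Part, Project, Quote, Price, and isSelected.

For each combination of Part, Project, and Quote, if any row has isSelected set to Yes, keep only that row. If no row is selected, keep all rows for that Part, Project, and Quote combination.

See the example below.

Dataset: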

Part	project	Quote	Price	isSelected
1	A	1	5.0	No
1	A	1	2.2	Yes
5	C	2	6.6	No
5	C	2	1.2	Yes
3	B	3	5.5	No
3	B	3	4.6	No

Desired result:

Part	project	Quote	Price	isSelected
1	A	1	2.2	Yes
5	C	2	1.2	Yes
3	B	3	5.5	No
3	B	3	4.6	No

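【Comments】:

Thanks very much for sharing the data in a table. Unfortunately, this is not the best format for sharing data here! It is usually better to share "raw" CSV data or something similar, because that makes it easier for people to copy and paste it, experiment with your data, and develop a solution.

Note to readers: it turns out you can copy the whole table. At least on my machine (Mac), when I paste it into my code editor (Neovim), it comes out as plain tab-separated data.

In the first row of your output, it looks like you meant to write 1 rather than 2 in the Part column. Is that right?

【Answer 1】: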

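This general class of task can be solved by looping over the GroupBy object produced by a .groupby operation on a Series or DataFrame.

In this particular case, you can also use the GroupBy.apply method, which performs a computation on each group and concatenates the results.

The GroupBy class is documented in the pandas API reference.

I will present the loop version first, because it may be easier to use for programmers who are not yet familiar with the "DataFrame style" of computation. However, I recommend using the .apply version where possible. It can be faster on large datasets and may consume less memory. It is also considered the more "idiomatic" style, and it forces you to learn how to break your code into separate functions.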

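Using a loop

Many people do not realize that the result of DataFrame.groupby (a GroupBy object) can be iterated over. This feature is documented in the pandas user guide under "Iterating through groups".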
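As a small illustration of what iterating over a GroupBy object yields, here is a minimal sketch. The tiny DataFrame and the `group_sizes` dictionary are made up for this illustration and are not part of the original answer. Each iteration produces a (key, sub-DataFrame) pair:

```python
import pandas as pd

# Hypothetical toy data, only for demonstrating iteration.
data = pd.DataFrame({'Part': [1, 1, 3], 'Price': [5.0, 2.2, 5.5]})

group_sizes = {}
for key, group in data.groupby('Part'):
    # `key` is the value of 'Part' for this group;
    # `group` is a DataFrame holding only that group's rows.
    group_sizes[key] = len(group)
    print(key, group.shape)
```

When grouping by a list of columns, as in the full example, the key is a tuple with one entry per column.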

Beyond that, the logic consists of a simple if statement, some Pandas subsetting, and the concat function.

Full example:

import io
import pandas as pd

data = pd.read_csv(io.StringIO('''
Part,Project,Quote,Price,isSelected
1,A,1,5.0,No
1,A,1,2.2,Yes
5,C,2,6.6,No
5,C,2,1.2,Yes
3,B,3,5.5,No
3,B,3,4.6,No
'''))

group_results = []
for _, group in data.groupby(['Part', 'Project', 'Quote']):
    is_selected = group['isSelected'] == 'Yes'

    if is_selected.any():
        # Select the rows where 'isSelected' is True, and
        # then select the first row from that output.
        # Using [0] instead of 0 ensures that the result
        # is still a DataFrame, and that it does not get
        # "squeezed" down to a Series.
        group_result = group.loc[is_selected].iloc[[0]]

    else:
        group_result = group

    group_results.append(group_result)

results = pd.concat(group_results)
print(results)

Output:

   Part Project  Quote  Price isSelected
1     1       A      1    2.2        Yes
4     3       B      3    5.5         No
5     3       B      3    4.6         No
3     5       C      2    1.2        Yes
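For comparison, the same filtering can probably also be expressed without an explicit loop, using a boolean mask built with a grouped transform. This is a different technique from the one in this answer, offered only as a sketch. The names `keys`, `is_selected`, and `group_has_selected` are my own, and unlike the loop above, this version keeps every "Yes" row in a group rather than only the first one:

```python
import io
import pandas as pd

data = pd.read_csv(io.StringIO('''
Part,Project,Quote,Price,isSelected
1,A,1,5.0,No
1,A,1,2.2,Yes
5,C,2,6.6,No
5,C,2,1.2,Yes
3,B,3,5.5,No
3,B,3,4.6,No
'''))

keys = ['Part', 'Project', 'Quote']

# True for rows that are marked as selected.
is_selected = data['isSelected'].eq('Yes')

# True for every row whose group contains at least one selected row.
group_has_selected = is_selected.groupby([data[k] for k in keys]).transform('any')

# Keep selected rows, plus all rows of groups with no selection at all.
results = data[is_selected | ~group_has_selected]
print(results)
```

Because this is plain boolean indexing, the rows stay in their original order instead of being sorted by group key.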

Using `.apply`

The GroupBy.apply method essentially does the pd.concat and list-appending parts for you. Instead of writing a loop, we write a function and pass it to .apply:

import io
import pandas as pd

data = pd.read_csv(io.StringIO('''
Part,Project,Quote,Price,isSelected
1,A,1,5.0,No
1,A,1,2.2,Yes
5,C,2,6.6,No
5,C,2,1.2,Yes
3,B,3,5.5,No
3,B,3,4.6,No
'''))


groups = data.groupby(['Part', 'Project', 'Quote'], as_index=False)


def process_group(group):
    is_selected = group['isSelected'] == 'Yes'

    if is_selected.any():
        # Select the rows where 'isSelected' is True, and
        # then select the first row from that output.
        # Using [0] instead of 0 ensures that the result
        # is still a DataFrame, and that it does not get
        # "squeezed" down to a Series.
        group_result = group.loc[is_selected].iloc[[0]]

    else:
        group_result = group

    return group_result


# Use .reset_index to remove the extra index layer created by Pandas,
# which is not necessary in this situation.
results = groups.apply(process_group).reset_index(level=0, drop=True)
print(results)

Output:

   Part Project  Quote  Price isSelected
1     1       A      1    2.2        Yes
4     3       B      3    5.5         No
5     3       B      3    4.6         No
3     5       C      2    1.2        Yes
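The extra index layer removed by .reset_index in the example above comes from the group keys that .apply prepends to the result. One possible way to avoid it is to pass group_keys=False to groupby. The sketch below is my own arrangement, not part of the original answer: it applies the function to the non-key columns only (some pandas versions warn when .apply operates on the grouping columns) and then looks the kept rows up in the original frame by label. The names `keys` and `kept` are assumptions made for this illustration.

```python
import io
import pandas as pd

data = pd.read_csv(io.StringIO('''
Part,Project,Quote,Price,isSelected
1,A,1,5.0,No
1,A,1,2.2,Yes
5,C,2,6.6,No
5,C,2,1.2,Yes
3,B,3,5.5,No
3,B,3,4.6,No
'''))

keys = ['Part', 'Project', 'Quote']


def process_group(group):
    is_selected = group['isSelected'] == 'Yes'
    if is_selected.any():
        # Keep only the first selected row, as a one-row DataFrame.
        return group.loc[is_selected].iloc[[0]]
    return group


# group_keys=False asks pandas not to prepend the group keys to the index,
# so the result keeps the original row labels.
kept = (
    data.groupby(keys, group_keys=False)[['Price', 'isSelected']]
    .apply(process_group)
)

# Recover the full rows, including the key columns, by label.
results = data.loc[kept.index]
print(results)
```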

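【Comments】:

I used your .apply method and it worked great, thanks!

Glad it helped, @BobbyPlourde! You can mark this answer as "accepted" by clicking the check mark next to it. This adds a visible mark to the answer so that future readers can see that it worked. It also awards some "reputation points" to the answer's author, which I personally do not need, but which can be valuable for users with fewer points than I have.

【Answer 2】:

See if this helps: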

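# Assumes `df` is the question's DataFrame
# (columns: Part, Project, Quote, Price, isSelected).
# Note: this compares on 'Part' only, which is enough for the sample data.

# Collect the parts that have a 'Yes' row, and the index of each 'Yes' row.
yes = []
yesIndex = []
for index, row in df.iterrows():
    if row['isSelected'] == 'Yes':
        yes.append(row['Part'])
        yesIndex.append(index)

# Parts that never appear with 'Yes'.
no = list(set(df.Part.unique().tolist()) - set(yes))

# Collect the index of every row that belongs to a 'No'-only part.
noIndex = []
for index, row in df.iterrows():
    if row['Part'] in no:
        noIndex.append(index)

# Keep the 'Yes' rows plus all rows of the 'No'-only parts.
listofindex = yesIndex + noIndex
df.loc[df.index.isin(listofindex)]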

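Here I try to get the parts that have a "Yes", then compare them against the list of unique parts to get the list of parts that have only "No". Then I get the indexes of those rows.

【Comments】: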
