python pandas.Series.str.contains Whole WORD

Posted 2023-02-23


【Title】: python pandas.Series.str.contains WHOLE WORD 【Posted】: 2017-01-14 12:52:53 【Question】:

df (a pandas DataFrame) has three rows.

col_name
"This is Donald."
"His hands are so small"
"Why are his fingers so short?"

I want to extract the rows that contain "is" and "small".

If I do

df.col_name.str.contains("is|small", case=False)

then it also catches "His", which I do not want.

Is the query below the right way to match whole words in a pandas Series?

df.col_name.str.contains("\bis\b|\bsmall\b", case=False)

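【Comments】:

【Answer 1】:

No, the regex /bis/b|/bsmall/b will fail because you are using /b rather than \b, which is what means "word boundary".

Change that and you will get a match. I suggest using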

\b(is|small)\b

At least for me, this regex is faster and more readable. Remember to put it in a raw string (r"\b(is|small)\b") so that you do not have to escape the backslashes.
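As a rough sketch of the suggestion above, here it is applied to the three rows from the question. The only change is a non-capturing group, (?:...), in place of (...). That is my assumption about what is convenient, not part of the original answer: pandas warns about match groups when str.contains receives a pattern with a capturing group.

```python
import pandas as pd

# The three example rows from the question.
df = pd.DataFrame({
    "col_name": [
        "This is Donald.",
        "His hands are so small",
        "Why are his fingers so short?",
    ]
})

# Raw string, so \b reaches the regex engine as a word boundary.
# (?:...) groups the alternatives without creating a capture group.
pattern = r"\b(?:is|small)\b"

mask = df["col_name"].str.contains(pattern, case=False)
print(df[mask])
```

Rows 0 and 1 are selected. Row 2 is not, because "his" contains "is" only as part of a longer word.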

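【Comments】:

Thanks. I have updated the question to reflect your point (/b -> \b). I would still like to wait a few more days to see whether there are other ways to match whole words.

Tangentially, I had to add an r before the string to make it work: does anyone know why? I have not found anything that mentions it.. well, apparently the | character implicitly turns it into a regex, while \b does not..

@mccc It turns it into a raw string (this is a Python thing, not a pandas or regex thing).

@Laurel I think your answer would be more complete if you added the point about using a raw string, since that is also missing from the OP's query.

【Answer 2】:

First, you may want to convert everything to lowercase, remove the punctuation and surrounding whitespace, and then turn the result into a set of words.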

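import re
import string

# Character class matching any single punctuation character.
punctuation = '[{0}]'.format(re.escape(string.punctuation))

# Lowercase, strip punctuation, then split each row into a set of words.
df['words'] = [set(words) for words in
    df['col_name']
    .str.lower()
    .str.replace(punctuation, '', regex=True)
    .str.strip()
    .str.split()
]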

>>> df
                        col_name                                words
0                This is Donald.                   {this, is, donald}
1         His hands are so small         {small, his, so, are, hands}
2  Why are his fingers so short?  {short, fingers, his, so, are, why}

You can now use boolean indexing to check whether all of your target words are in these new word sets.

target_words = ['is', 'small']
# Convert target words to lower case just to be safe.
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))


print(df)
# Output:
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}  False
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False

target_words = ['so', 'small']
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))

print(df)
# Output:
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}   True
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False

Extracting the matching rows:

>>> df.loc[df.match, 'col_name']
# Output:
# 1    His hands are so small
# Name: col_name, dtype: object

Doing all of this in a single statement with boolean indexing:

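import re  # needed for re.escape below

# Keep only the rows whose word set contains every target word.
df.loc[[all(target_word in word_set for target_word in target_words)
        for word_set in (set(words) for words in
                         df['col_name']
                         .str.lower()
                         .str.replace('[{0}]'.format(re.escape(string.punctuation)), '', regex=True)
                         .str.strip()
                         .str.split())], :]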

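【Comments】:

Thanks for your answer.. I was trying to use pandas' built-in indexing (my table has about 500k rows), but I guess you are indexing it yourself...?

Not sure what you mean. This does use pandas indexing.

This returns a match, but not a whole-string match!

@Nico, please elaborate. Just use boolean indexing on the relevant column to extract the matching rows, as in the example above.

@Alexander Is there any way to find out which words in the sentence matched?

【Answer 3】:

Your way (using /b) did not work for me. I am not sure why you cannot use the logical operator and (&), since I think that is what you really want.

This is a silly way to do it, but it works: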

mask = lambda x: ("is" in x) & ("small" in x)
series_name.apply(mask)
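Note that "is" in x is a substring test, so it is also True for "His". As a hedged sketch (not the answerer's code), the same "and" idea can be combined with the word-boundary regexes discussed elsewhere in this thread by building one mask per word and joining them with &. The fourth row below is a hypothetical example added only so that at least one row contains both words.

```python
import pandas as pd

series_name = pd.Series([
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
    "He is small",  # hypothetical row, not from the question
])

# One whole-word mask per target word, combined element-wise with &.
has_is = series_name.str.contains(r"\bis\b", case=False)
has_small = series_name.str.contains(r"\bsmall\b", case=False)
both = has_is & has_small

print(series_name[both])
```

Only the hypothetical fourth row contains both "is" and "small" as whole words, so none of the three original rows is selected.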

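【Comments】:

The example you gave is confusing in this respect, although I see you have since reworded it to make it a bit clearer. This solves what you originally stated as the problem: "I want to extract the rows that contain "is" and "small"."

【Answer 4】:

As an extension of this discussion, I want to use a variable inside the regex, like this: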

df = df_w[df_w['Country/Region'].str.match("\b(location.loc[i]['country'])\b",case=False)]

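If I leave out the \b ... \b, the code returns all the columns containing Sudan and South Sudan. And when I use "\b(location.loc[i]['country'])\b", it returns an empty DataFrame. Please tell me the correct usage.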
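One likely cause, offered here as an assumption rather than a confirmed diagnosis: the expression location.loc[i]['country'] sits inside the string literal, so the regex searches for that literal text instead of the country name, and without an r prefix "\b" is a backspace character. Below is a minimal sketch that interpolates the value into a raw f-string. The small df_w frame and the country variable are hypothetical stand-ins for the poster's data.

```python
import re
import pandas as pd

# Hypothetical stand-in for the poster's df_w.
df_w = pd.DataFrame({"Country/Region": ["Sudan", "South Sudan", "Chad"]})

# Hypothetical stand-in for location.loc[i]['country'].
country = "Sudan"

# rf"..." is a raw f-string: \b stays a word boundary and {…} is interpolated.
# re.escape guards against regex metacharacters in the country name.
pattern = rf"\b{re.escape(country)}\b"

# str.match anchors the pattern at the start of each value.
matched = df_w["Country/Region"].str.match(pattern, case=False)
df = df_w[matched]
print(df)
```

Because str.match only matches at the start of the string, "South Sudan" is excluded here, whereas str.contains with the same pattern would still select it.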

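【Comments】:

【Answer 5】:

In "\bis\b|\bsmall\b", the sequence \b is parsed as an ASCII Backspace before the string is passed to the regex method for matching/searching. For more information, see this document about escape characters. That document says:

When an "r" or "R" prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.

So there are two options: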

Use the r prefix (a raw string):

df.col_name.str.contains(r"\bis\b|\bsmall\b", case=False)

Escape each backslash with another \:

df.col_name.str.contains("\\bis\\b|\\bsmall\\b", case=False)
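The difference between the plain and raw literals can be checked with plain Python, no pandas needed. This is a small illustrative sketch. The variable names are mine, and the sample text is the first row from the question.

```python
import re

plain = "\bis\b"       # \b is converted to a backspace character (0x08)
raw = r"\bis\b"        # backslashes are kept, so the regex sees \b
escaped = "\\bis\\b"   # same content as the raw string

print(len(plain), len(raw))   # 4 and 6 characters
print(plain[0] == "\x08")     # True: the first character is a backspace
print(raw == escaped)         # True: both spellings give the same string

text = "This is Donald."
no_match = re.search(plain, text)   # looks for literal backspaces, finds none
match = re.search(raw, text)        # finds the whole word "is"
print(no_match, match.span())
```

The plain literal never matches ordinary text because it looks for two literal backspace characters around "is".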

If you would like to see an example, here is the Fiddle

【Comments】:
