Pandas str.contains - Search for multiple values in a string and print the values in a new column [duplicate]


Posted: 2018-07-15 20:27:10

Question:

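I have just started coding in Python and want to build a solution that searches a string to see whether it contains any of a given set of values.

I found a similar solution in R that uses the stringr library: Search for a value in a string and if the value exists, print it all by itself in a new column

The following code seems to work, but I also want to output which of the three values I am looking for was found; this solution only outputs a single fixed value: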

#Inserting new column
df.insert(5, "New_Column", np.nan)

#Searching old column
df['New_Column'] = np.where(df['Column_with_text'].str.contains('value1|value2|value3', case=False, na=False), 'value', 'NaN')

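----- Edit ------

I realize I did not explain this well, sorry about that.

Below is an example where I match fruit names in a string. Depending on whether it finds any match in the string, it prints True or False in a new column. Here is my problem: instead of printing True or False, I want to print the name it found in the string, e.g. apples, oranges, etc.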

import pandas as pd
import numpy as np

text = [('I want to buy some apples.', 0),
         ('Oranges are good for the health.', 0),
         ('John is eating some grapes.', 0),
         ('This line does not contain any fruit names.', 0),
         ('I bought 2 blueberries yesterday.', 0)]
labels = ['Text','Random Column']

df = pd.DataFrame.from_records(text, columns=labels)

df.insert(2, "MatchedValues", np.nan)

foods =['apples', 'oranges', 'grapes', 'blueberries']

pattern = '|'.join(foods)

df['MatchedValues'] = df['Text'].str.contains(pattern, case=False)

print(df)

Result

                                          Text  Random Column  MatchedValues
0                   I want to buy some apples.              0           True
1             Oranges are good for the health.              0           True
2                  John is eating some grapes.              0           True
3  This line does not contain any fruit names.              0          False
4            I bought 2 blueberries yesterday.              0           True

Desired result

                                          Text  Random Column  MatchedValues
0                   I want to buy some apples.              0         apples
1             Oranges are good for the health.              0        oranges
2                  John is eating some grapes.              0         grapes
3  This line does not contain any fruit names.              0            NaN
4            I bought 2 blueberries yesterday.              0    blueberries


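Answer 1:

Set the regex flag so that your search pattern is interpreted as a regular expression (regex=True is already the default for str.contains, so passing it only makes the intent explicit):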

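whatIwant = df['Column_with_text'].str.contains('value1|value2|value3',
                                                 case=False, regex=True, na=False)

# np.where needs both a "true" and a "false" value; rows without a match become NaN
df['New_Column'] = np.where(whatIwant, df['Column_with_text'], np.nan)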
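To make the effect of the regex flag concrete, here is a minimal sketch comparing regex=True with regex=False on a small made-up Series. The sample strings and the variable names (s, as_regex, as_literal) are assumptions for illustration only and do not come from the original post.

```python
import pandas as pd

# Hypothetical sample data, including a missing value
s = pd.Series(["has value1", "has VALUE3", "nothing here", None])

# regex=True (the default): '|' is read as alternation, so either value matches
as_regex = s.str.contains("value1|value3", case=False, regex=True, na=False)

# regex=False: the pattern is treated as the literal text 'value1|value3'
as_literal = s.str.contains("value1|value3", case=False, regex=False, na=False)

print(as_regex.tolist())    # [True, True, False, False]
print(as_literal.tolist())  # [False, False, False, False]
```

Passing na=False makes rows with missing text count as non-matches instead of propagating a missing value into the mask.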

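----- Edit ------

Based on the updated problem statement, here is an updated answer:

You need to define a capture group in the regular expression using parentheses, and use the extract() function to return the value found by the capture group. The lower() function lowercases the text first, so capital letters in the text do not prevent a match.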

df['MatchedValues'] = df['Text'].str.lower().str.extract('(' + pattern + ')', expand=False)
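The capture-group approach can be sketched end to end as follows. This is only an illustration under a few assumptions: the three-row DataFrame is a shortened version of the question's data, the AsWritten column is a hypothetical name, and the use of re.escape and flags=re.IGNORECASE are additions that the original answer does not use.

```python
import re
import pandas as pd

# Shortened, hypothetical version of the question's data
df = pd.DataFrame({"Text": [
    "I want to buy some apples.",
    "Oranges are good for the health.",
    "This line does not contain any fruit names.",
]})
foods = ["apples", "oranges", "grapes", "blueberries"]

# re.escape protects against search terms that contain regex metacharacters;
# the surrounding parentheses form the capture group that extract() returns
pattern = "(" + "|".join(re.escape(f) for f in foods) + ")"

# As in the answer: lowercase the text first, so the match comes back lowercase
df["MatchedValues"] = df["Text"].str.lower().str.extract(pattern, expand=False)

# Alternative: match case-insensitively and keep the text as it was written
df["AsWritten"] = df["Text"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

print(df[["MatchedValues", "AsWritten"]])
```

With the lowercase variant the second row yields oranges, while the case-insensitive variant yields Oranges as it appears in the text; rows without a match are NaN in both columns.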

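Comments:

Did this solve your problem?

No, but I have now edited my post to make my goal clearer. Thanks a lot for your help!

Just wondering, did you try my updated (one-line) solution?

Answer 2:

Here is one approach: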

foods =['apples', 'oranges', 'grapes', 'blueberries']

def matcher(x):
    # Return the first entry of foods found in x (case-insensitive), else NaN
    for i in foods:
        if i.lower() in x.lower():
            return i
    return np.nan

df['Match'] = df['Text'].apply(matcher)

#                                           Text        Match
# 0                   I want to buy some apples.       apples
# 1             Oranges are good for the health.      oranges
# 2                  John is eating some grapes.       grapes
# 3  This line does not contain any fruit names.          NaN
# 4            I bought 2 blueberries yesterday.  blueberries
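The matcher above returns only the first food that occurs in each string. If a row can mention several of the values and all of them should be reported, one possible variant is sketched below. It swaps in Series.str.findall, which is not part of either original answer, and the sample rows and the AllMatches column name are hypothetical.

```python
import re
import numpy as np
import pandas as pd

# Hypothetical rows, one of which mentions two fruits
df = pd.DataFrame({"Text": [
    "I bought apples and Grapes.",
    "Oranges are good for the health.",
    "No fruit here.",
]})
foods = ["apples", "oranges", "grapes", "blueberries"]
pattern = "|".join(re.escape(f) for f in foods)

# findall returns a list of every match per row (an empty list if none)
matches = df["Text"].str.lower().str.findall(pattern)

# Join the matches into one string; rows without any match become NaN
df["AllMatches"] = matches.map(lambda found: ", ".join(found) if found else np.nan)

print(df)
```

Matches are listed in the order they appear in the text, not in the order of the foods list.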
