pandas DataFrame str.contains() AND operation
【Title】pandas dataframe str.contains() AND operation
【Posted】2016-08-28 22:44:11
【Question】I have a df (pandas DataFrame) with three rows:
some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"
The call df.col_name.str.contains("apple|banana")
matches all three rows:
"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".
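How can I apply an AND operator to the str.contains() method so that it returns only the strings that contain both "apple" and "banana"?
"apple and banana both are delicious"
I want to match strings that contain 10-20 different words (grape, watermelon, berry, orange, ... etc.).
【Question comments】:
This example is a toy, because you only have K=2 substrings and they appear in order: apple, banana. What you really need is a way to match K=10-20 substrings in arbitrary order. A regular expression with multiple lookahead assertions is the way to go (@Anzel's solution).
【Answer 1】: You can do it like this: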
df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
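As a self-contained sketch of the chained-mask approach above, using the three rows from the question: the names has_apple, has_banana, and both are only illustrative, and regex=False is an optional addition (not part of the original answer) that treats each search term as a literal substring.

```python
import pandas as pd

df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

# Each str.contains() call returns a boolean Series;
# the & operator combines them element-wise (logical AND per row).
has_apple = df['col_name'].str.contains('apple', regex=False)
has_banana = df['col_name'].str.contains('banana', regex=False)

both = df[has_apple & has_banana]
print(both)
```

The parentheses in the one-line form matter because & binds more tightly than comparison operators; naming the masks first sidesteps that.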
【Discussion】:
【Answer 2】: You can also do it with a regular expression:
df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]
You can then build your list of words into a regex string, like this:
base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat']  # example
base.format(''.join(expr.format(w) for w in words))
which renders:
'^(?=.*apple)(?=.*banana)(?=.*cat)'
From there you can build the filter dynamically for any list of words.
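If the words may contain regex metacharacters (for example "c++" or "1.5"), the pattern can be built with re.escape(). The following is a minimal sketch under that assumption; the helper name build_and_pattern is hypothetical and not part of the original answer.

```python
import re
import pandas as pd

def build_and_pattern(words):
    """Return a regex matching strings that contain every word, in any order."""
    # One lookahead per word; re.escape() keeps characters such as '.' or '+'
    # from being interpreted as regex syntax.
    return '^' + ''.join('(?=.*{})'.format(re.escape(w)) for w in words)

df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

pattern = build_and_pattern(['apple', 'banana'])
matches = df[df['col_name'].str.contains(pattern)]
print(pattern)
print(matches)
```

Because every lookahead starts from the beginning of the string, the order of the words in the list does not affect which rows match.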
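【Discussion】:
This is great. I tried doing it with f-strings. Here is how it turned out; do you have any improvements? filter_string = '^' + ''.join(fr'(?=.*{w})' for w in words)
@spen.smith I think your implementation is clear and simple; unless you run into problems, I don't think it needs further improvement.
【Answer 3】:
Try this regex: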
apple.*banana|banana.*apple
The code is:
import pandas as pd
df = pd.DataFrame([[1,"apple is delicious"],[2,"banana is delicious"],[3,"apple and banana both are delicious"]],columns=('ID','String_Col'))
print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])
Output:
ID String_Col
2 3 apple and banana both are delicious
【Discussion】:
【Answer 4】:
df = pd.DataFrame({'col': ["apple is delicious",
"banana is delicious",
"apple and banana both are delicious"]})
targets = ['apple', 'banana']
# Any word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0 True
1 True
2 True
Name: col, dtype: bool
# All words from `targets` are present in the sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0 False
1 False
2 True
Name: col, dtype: bool
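The plain `word in sentence` test is case-sensitive and raises a TypeError on missing values. A hedged variant of the all() approach that lower-cases both sides and treats non-string cells as non-matches could look like this; the helper contains_all and the sample rows are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col': ["Apple is delicious",
                           None,
                           "apple and BANANA both are delicious"]})
targets = ['apple', 'banana']

def contains_all(sentence, words):
    """True if every word occurs in sentence, ignoring case."""
    # Missing values (None/NaN) are not strings and cannot contain any word.
    if not isinstance(sentence, str):
        return False
    lowered = sentence.lower()
    return all(word.lower() in lowered for word in words)

mask = df['col'].apply(lambda s: contains_all(s, targets))
print(df[mask])
```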
【Discussion】:
【Answer 5】: If you want to catch sentences that contain at least two of the words, maybe this will work (taking a hint from @Alexander):
target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]
Output:
col
2 apple and banana both are delicious
If you have more than two words to capture and they are separated by a comma ',', add it to connector_list and change the second condition from all to any:
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]
Output:
col
2 apple and banana both are delicious
3 orange,banana and apple all are delicious
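【Discussion】:
【Answer 6】: Enumerating every possibility for a large list is cumbersome. A better approach is to use reduce() and the bitwise AND operator (&).
For example, consider the following DataFrame:
df = pd.DataFrame({'col': ["apple is delicious",
"banana is delicious",
"apple and banana both are delicious",
"i love apple, banana, and strawberry"]})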
# col
#0 apple is delicious
#1 banana is delicious
#2 apple and banana both are delicious
#3 i love apple, banana, and strawberry
Suppose we want to search for all of the following:
targets = ['apple', 'banana', 'strawberry']
We can do:
from functools import reduce  # needed for Python 3
print(df[reduce(lambda a, b: a & b, (df['col'].str.contains(s) for s in targets))])
# col
#3 i love apple, banana, and strawberry
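The same reduction can be written with operator.and_ in place of the lambda. This is a self-contained sketch rather than the answer's exact code; regex=False is an added assumption that treats each target as a literal substring.

```python
from functools import reduce
import operator

import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "i love apple, banana, and strawberry"]})
targets = ['apple', 'banana', 'strawberry']

# One boolean mask per target, produced lazily by a generator expression.
masks = (df['col'].str.contains(t, regex=False) for t in targets)

# reduce() folds the masks together pairwise: ((m0 & m1) & m2) ...
combined = reduce(operator.and_, masks)
result = df[combined]
print(result)
```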
【Discussion】:
【Answer 7】: This works:
df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)
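str.contains() also accepts case and na arguments, which can be combined with the lookahead pattern. The sketch below uses sample rows invented for illustration; na=False makes missing values count as non-matches, so the resulting mask is purely boolean and safe to use for indexing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ["Apple and Banana", np.nan, "banana only"]})

# case=False ignores letter case; na=False maps missing values to False.
mask = df['col'].str.contains(r'(?=.*apple)(?=.*banana)',
                              case=False, na=False, regex=True)
print(df[mask])
```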
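【Discussion】:
【Answer 8】: If you only want to use native methods and avoid writing regular expressions, here is a vectorized version that does not involve a lambda:
import numpy as np
targets = ['apple', 'banana', 'strawberry']
# Build a list (not a generator), since np.vstack expects a sequence of arrays.
fruit_masks = [df['col'].str.contains(string) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]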
【Discussion】:
【Answer 9】: Building on @Anzel's answer, I wrote a function, since I am going to apply this a lot:
def regify(words, base=r'^{}', expr='(?=.*{})'):
    return base.format(''.join(expr.format(w) for w in words))
So if you have defined words:
words = ['apple', 'banana']
then call it with something like this:
dg = df.loc[
df['col_name'].str.contains(regify(words), case=False, regex=True)
]
Then you should get what you want.
【Discussion】: