Search Pandas Column for Substring in other Column

Posted 2023-02-22

Asked: 2016-11-02 20:36:48

Question:

I have an example .csv file, imported as df, as follows:

    Ethnicity, Description
  0 French, Irish Dance Company
  1 Italian, Moroccan/Algerian
  2 Danish, Company in Netherlands
  3 Dutch, French
  4 English, EnglishFrench
  5 Irish, Irish-American
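
For anyone who wants to follow along without the CSV file, the table above can be rebuilt by hand. This is only a sketch: it assumes both columns are plain strings, the index is the default 0..5, and there is no stray whitespace around the values.

```python
import pandas as pd

# Rebuild the question's table directly (assumed: string columns,
# default integer index, no leading/trailing spaces in the values).
df = pd.DataFrame({
    "Ethnicity": ["French", "Italian", "Danish", "Dutch", "English", "Irish"],
    "Description": [
        "Irish Dance Company",
        "Moroccan/Algerian",
        "Company in Netherlands",
        "French",
        "EnglishFrench",
        "Irish-American",
    ],
})
print(df)
```

If the data is loaded from the file instead, the space after each comma may end up in the column names and values; `pd.read_csv` has a `skipinitialspace` parameter that may help with that.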

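I want to check whether the strings in df['Ethnicity'] occur anywhere in df['Description']. This should return rows 0, 3, 4 and 5, because those description strings contain a string from the Ethnicity column (not necessarily from the same row).

So far I have tried: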

df[df['Ethnicity'].str.contains('French')]['Description']

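This returns matches for one specific string, but I want to iterate over the column without searching for each ethnicity value by hand. I also tried converting the columns to lists and iterating over them, but I couldn't find a way to return the rows, since the result is no longer a DataFrame.

Thank you in advance!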

Answer 1:

You can use str.contains with the values of the Ethnicity column, converted to a list with tolist and joined with |, which means "or" in a regex:

print ('|'.join(df.Ethnicity.tolist()))
French|Italian|Danish|Dutch|English|Irish

mask = df.Description.str.contains('|'.join(df.Ethnicity.tolist()))
print (mask)
0     True
1    False
2    False
3     True
4     True
5     True
Name: Description, dtype: bool

#boolean-indexing
print (df[mask])
  Ethnicity          Description
0    French  Irish Dance Company
3     Dutch               French
4   English        EnglishFrench
5     Irish       Irish-American

It seems tolist() can be omitted:

print (df[df.Description.str.contains('|'.join(df.Ethnicity))])
  Ethnicity          Description
0    French  Irish Dance Company
3     Dutch               French
4   English        EnglishFrench
5     Irish       Irish-American
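
One caveat with the joined pattern: str.contains treats it as a regular expression by default, so a value containing characters such as ( or + would be interpreted as regex syntax rather than matched literally, and missing descriptions produce NaN in the mask. A minimal sketch of one way to guard against both, using re.escape and the na parameter. The value "Other (mixed)" and the None entry are hypothetical and are not part of the question's data.

```python
import re
import pandas as pd

# Hypothetical data: one value with regex metacharacters, one missing description.
df = pd.DataFrame({
    "Ethnicity": ["French", "Other (mixed)", "Irish"],
    "Description": ["Irish Dance Company", None, "Other (mixed) heritage"],
})

# Escape each value so it is matched literally, then join with the regex "or".
pattern = "|".join(re.escape(value) for value in df["Ethnicity"].dropna().unique())

# na=False makes missing descriptions count as "no match" instead of NaN.
mask = df["Description"].str.contains(pattern, na=False)
print(df[mask])
```

Without the escaping, the parentheses in "Other (mixed)" would be read as a regex group, so the literal text "Other (mixed) heritage" would not be matched by that alternative.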

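Comments:

Thank you very much, this is much appreciated! It worked when I implemented it. I don't have much experience with regular expressions (regex), so I will definitely read up on them.

Answer 2:

The ever-popular double apply: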

df[df.Description.apply(lambda x: df.Ethnicity.apply(lambda y: y in x)).any(axis=1)]

  Ethnicity          Description
0    French  Irish Dance Company
3     Dutch               French
4   English        EnglishFrench
5     Irish       Irish-American

Timing

jezrael's answer is much better.
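
The timing figures themselves did not survive in this copy of the thread. As a rough sketch of how the two approaches could be compared, the snippet below uses timeit on the six-row frame from the question. The helper names regex_mask and double_apply_mask are made up for illustration, and the measured times will depend on the machine, the pandas version and the size of the data.

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    "Ethnicity": ["French", "Italian", "Danish", "Dutch", "English", "Irish"],
    "Description": ["Irish Dance Company", "Moroccan/Algerian",
                    "Company in Netherlands", "French",
                    "EnglishFrench", "Irish-American"],
})

def regex_mask(frame):
    # One vectorised regex search over the Description column.
    return frame["Description"].str.contains("|".join(frame["Ethnicity"]))

def double_apply_mask(frame):
    # For every description, test every ethnicity with Python's `in`.
    return frame["Description"].apply(
        lambda desc: frame["Ethnicity"].apply(lambda eth: eth in desc)
    ).any(axis=1)

# Both approaches should select the same rows before timing them.
assert regex_mask(df).tolist() == double_apply_mask(df).tolist()

t_regex = timeit.timeit(lambda: regex_mask(df), number=100)
t_apply = timeit.timeit(lambda: double_apply_mask(df), number=100)
print(f"str.contains: {t_regex:.4f}s, double apply: {t_apply:.4f}s (100 runs each)")
```

The double apply performs one Python-level check per (description, ethnicity) pair, so its cost grows with the product of the two column lengths.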

Comments:

Thanks for your answer! This worked when I implemented it.