pandas: filter a string from a column by a regex condition and replace it

Posted 2023-03-12


【Title】pandas filter string from column by regex condition and replace it 【Posted】2021-01-26 16:18:35 【Question】:

Here is a string from one cell of a pandas DataFrame.

https://www.gofundme.com/3hgsuu0,https://twitter.com/dog_rates/status/840632337062862849/photo/1

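What I want to do is go through all rows, find the Twitter URL, and remove every non-Twitter URL from the column. The goal is for each cell to contain only the Twitter URL, rather than two or more URLs.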

What I tried:

arch_drop_new1.expanded_urls.apply(lambda x: str(x).split(",")[0])

For rows that contain more than one URL, this gives me everything before the first comma, which is not necessarily the Twitter URL.
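To see why this approach falls short, here is a minimal sketch using a small, hypothetical DataFrame (named `df` here; only the first URL string comes from the question, the second row is made up for illustration). Splitting on the comma and taking index 0 returns whichever URL happens to come first, which need not be the Twitter one.

```python
import pandas as pd

# Hypothetical sample data: the first row is the string from the question,
# the second row is an assumed single-URL case.
df = pd.DataFrame({
    "expanded_urls": [
        "https://www.gofundme.com/3hgsuu0,https://twitter.com/dog_rates/status/840632337062862849/photo/1",
        "https://twitter.com/dog_rates/status/840632337062862849/photo/1",
    ]
})

# Same idea as the attempt above: keep the text before the first comma.
first_part = df["expanded_urls"].apply(lambda x: str(x).split(",")[0])

print(first_part.tolist())
```

In the first row the result is the gofundme link, so the Twitter URL is lost.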


【Comments】:

I think you can use

arch_drop_new1['twitter_urls'] = arch_drop_new1['expanded_urls'].str.extract(r'(https://twitter\.com/\S*?)(?:,http|$)', expand=False)

【Answer 1】:

You can use .str.extract():

rx = r'(https?://twitter\.com/\S*?)(?:,\s*http|$)'
arch_drop_new1['twitter_urls'] = arch_drop_new1['expanded_urls'].str.extract(rx, expand=False)

See the regex demo.

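Note that extract() returns only the first occurrence of the pattern in each row (here, just the Group 1 value, because the pattern contains a single capturing group).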
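As a rough illustration of that behaviour, the sketch below applies the same pattern to a small made-up DataFrame (`df` and every row except the first are assumptions, not data from the question). A row with two Twitter URLs yields only the first one, and rows without a Twitter URL, or with a missing value, yield NaN.

```python
import pandas as pd

# Hypothetical sample data; only the first string comes from the question.
df = pd.DataFrame({
    "expanded_urls": [
        "https://www.gofundme.com/3hgsuu0,https://twitter.com/dog_rates/status/840632337062862849/photo/1",
        "https://twitter.com/a/status/1,https://twitter.com/a/status/2",
        "https://www.gofundme.com/3hgsuu0",
        None,
    ]
})

rx = r'(https?://twitter\.com/\S*?)(?:,\s*http|$)'

# expand=False with a single capturing group returns a Series.
df["twitter_urls"] = df["expanded_urls"].str.extract(rx, expand=False)

print(df["twitter_urls"].tolist())
```

If rows without a Twitter URL should be dropped afterwards, `df.dropna(subset=["twitter_urls"])` is one option.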

Details

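(https?://twitter\.com/\S*?) - Group 1: https://twitter.com/ or http://twitter.com/, followed by zero or more non-whitespace characters, as few as possible
(?:,\s*http|$) - a non-capturing group that matches either a comma, zero or more whitespace characters, and then http, or the end of the string.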
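The effect of the lazy quantifier can be checked with Python's re module. This is a small sketch on a made-up input string containing two Twitter URLs; the greedy variant is included only for contrast and is not part of the answer.

```python
import re

# Hypothetical input: two Twitter URLs separated by a comma.
text = "https://twitter.com/a/status/1,https://twitter.com/a/status/2"

# Lazy \S*? stops as soon as ",http" (or the end of the string) can follow.
lazy = re.search(r'(https?://twitter\.com/\S*?)(?:,\s*http|$)', text)

# Greedy \S* first consumes every non-whitespace character, including the
# comma and the second URL, and then succeeds at the end of the string.
greedy = re.search(r'(https?://twitter\.com/\S*)(?:,\s*http|$)', text)

print(lazy.group(1))
print(greedy.group(1))
```

With the lazy form Group 1 holds only the first URL, while the greedy form captures the whole comma-separated string.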

