How can I edit my regex so that it captures only the substring between (and not including) quotation marks?

Posted 2023-02-14


【Asked】: 2022-01-17 14:50:02 【Question】:

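I'm new to regular expressions and am having a hard time getting my head around them. Right now I have a column filled with strings, but the only text relevant to my analysis is the part between quotation marks. I tried this: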

response$text <-  stri_extract_all_regex(response$text, '"\\S+"')

But when I look at response$text, the output looks like this:

"\"caring\""

How can I change my regex so that the output is instead:

caring


【Answer 1】:

You can use

library(stringi)
response$text <- stri_extract_all_regex(response$text, '(?<=")[^\\s"]+(?=")')

Or, with stringr:

library(stringr)
response$text <- str_extract_all(response$text, '(?<=")[^\\s"]+(?=")')
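As a quick check of the lookaround pattern, here is a minimal sketch on a made-up vector. The name `sample_text` and its contents are assumptions for illustration, standing in for `response$text`.

```r
library(stringr)

# Hypothetical sample data standing in for response$text
sample_text <- c('She said "caring" twice', 'no quotes here', '"kind" and "warm"')

# (?<=") requires a quote immediately before the match and (?=") a quote
# immediately after it; neither quote is consumed, so neither shows up
# in the extracted text
extracted <- str_extract_all(sample_text, '(?<=")[^\\s"]+(?=")')

print(extracted[[1]])  # the quoted word without its quotes
print(extracted[[2]])  # empty: no quoted word in this string
print(extracted[[3]])  # both quoted words
```

Note that the result is a list with one character vector per input string, because a string may contain zero, one, or several matches.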

However, when a string contains several quoted words, I would rather use stringr::str_match_all:

library(stringr)
matches <- str_match_all(response$text, '"([^\\s"]+)"')
response$text <- lapply(matches, function(x) x[,2])

See this regex demo.
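The `*_all` functions return a list, so assigning their result to `response$text` produces a list column. If every row is expected to hold at most one quoted word (an assumption about the data, not something stated in the question), a plain character vector may be easier to work with. A minimal sketch using `stringr::str_match`, which returns only the first match per string; `sample_text` is a hypothetical stand-in for the real column:

```r
library(stringr)

# Hypothetical sample data standing in for response$text
sample_text <- c('She said "caring" twice', 'no quotes here')

# str_match() returns a character matrix with one row per input string:
# column 1 is the whole match (quotes included), column 2 is capture group 1
first_quoted <- str_match(sample_text, '"([^\\s"]+)"')[, 2]

print(first_quoted)  # NA marks strings that contain no quoted word
```

Rows without a quoted word come back as `NA` rather than being dropped, so the result keeps the same length as the input.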

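The capturing group approach used in "([^\\s"]+)" avoids overlapping matches between quoted substrings, because each match consumes both of its quotation marks. str_match_all is handy here since the matches it returns also contain the captured substrings (unlike the *extract* functions).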
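To make the overlap issue concrete, here is a small sketch comparing the two patterns. The input string `x` is a contrived assumption chosen so that an unquoted character sits directly between two quoted words.

```r
library(stringr)

# Hypothetical input: b sits between two quoted words with no spaces around it
x <- '"a"b"c"'

# Lookarounds do not consume the quotes, so the closing quote of "a" can
# also act as the opening quote for b, and b is reported as a match
with_lookarounds <- str_extract_all(x, '(?<=")[^\\s"]+(?=")')[[1]]

# The capture-group pattern consumes both quotes of every match,
# so the unquoted b is skipped
with_capture_group <- str_match_all(x, '"([^\\s"]+)"')[[1]][, 2]

print(with_lookarounds)    # [1] "a" "b" "c"
print(with_capture_group)  # [1] "a" "c"
```

With text where quoted words are always separated by whitespace, both patterns give the same result; the difference only shows up in inputs like the one above.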
