Regex group capture in R with multiple capture-groups

Posted 2023-02-14

【Title】: Regex group capture in R with multiple capture-groups 【Posted】: 2010-10-31 10:30:10 【Question】:

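In R, is it possible to extract group captures from a regular expression match? As far as I can tell, none of grep, grepl, regexpr, gregexpr, sub, or gsub return the group captures.

I need to extract key-value pairs from strings that are encoded like this: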

\((.*?) :: (0\.[0-9]+)\)

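I can always do multiple full-match greps, or do some outside (non-R) processing, but I was hoping I could do it all within R. Is there a function or a package that provides such a function?

【Question comments】:

【Answer 1】:

str_match(), from the stringr package, will do this. It returns a character matrix with one column for each group in the match, plus one column for the whole match: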

> s = c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
> str_match(s, "\\((.*?) :: (0\\.[0-9]+)\\)")
     [,1]                         [,2]       [,3]          
[1,] "(sometext :: 0.1231313213)" "sometext" "0.1231313213"
[2,] "(moretext :: 0.111222)"     "moretext" "0.111222"

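【Comments】:

and str_match_all() to match all groups in a regex

How can I print only the capture groups for [,1]?

Not sure what you're looking for. The captured groups are columns 2 and 3. [,1] is the full match. [,2:3] are the captured groups.

【Answer 2】:

gsub does this, from your example: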

gsub("\\((.*?) :: (0\\.[0-9]+)\\)","\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"

You need to double-escape the backslashes (\) inside the quoted string; only then do they work in the regex.

Hope this helps.
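As a side note on the double escaping: R 4.0.0 and later accept raw string literals of the form r"(...)", which let you write the pattern with single backslashes. The sketch below assumes such an R version; the variable names are illustrative only.

```r
# Requires R >= 4.0.0, which introduced raw string literals r"(...)".
x <- "(sometext :: 0.1231313213)"

escaped <- "\\((.*?) :: (0\\.[0-9]+)\\)"   # classic double-escaped form
raw     <- r"(\((.*?) :: (0\.[0-9]+)\))"   # same pattern written as a raw string

# Both literals describe the same pattern.
identical(escaped, raw)

# The replacement can be a raw string too, so \1 and \2 need no doubling.
gsub(raw, r"(\1 \2)", x)
```

On older R versions the double-escaped form remains the only option.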

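【Comments】:

Actually, I need to pull out the captured substrings and put them in a data.frame. But, looking at your answer, I guess I could chain gsub and a couple of strsplit calls to get what I want, maybe: strsplit(strsplit(gsub(regex, "\\1::\\2::::", str), "::::")[[1]], "::")

Great. The R gsub manual page very much needs an example showing that you need '\\1' to escape a capture group reference.

【Answer 3】:

Try regmatches() and regexec():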

regmatches("(sometext :: 0.1231313213)",regexec("\\((.*?) :: (0\\.[0-9]+)\\)","(sometext :: 0.1231313213)"))
[[1]]
[1] "(sometext :: 0.1231313213)" "sometext"                   "0.1231313213"
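Since the asker wanted the captures in a data.frame, here is a minimal base R sketch of one way to get there from the regmatches() result. The column names key and value are made up for illustration, and the sketch assumes every input string matches the pattern (a non-matching string yields character(0), which rbind() silently drops).

```r
s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# regexec() records the positions of the full match and of each group;
# regmatches() turns them into one character vector per input string.
m <- regmatches(s, regexec(pattern, s))

# Stack the vectors into a matrix: column 1 is the full match,
# columns 2 and 3 are the two capture groups.
mat <- do.call(rbind, m)

# `key` and `value` are illustrative column names, not part of any API.
result <- data.frame(key = mat[, 2],
                     value = as.numeric(mat[, 3]),
                     stringsAsFactors = FALSE)
result
```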

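【Comments】:

Thanks for the vanilla R solution and for pointing out regmatches, which I had never seen before.

Why would you have to write the string twice?

@StefanoBorini regexec returns a list holding only the location of the matches, hence regmatches requires the user to provide the string the match list belongs to.

【Answer 4】:

gsub() can do this and return only the capture groups.

However, for this to work, your pattern must explicitly match the text outside the capture groups as well, because of the following behaviour mentioned in the gsub() help:

(...) Elements of character vectors 'x' which are not substituted will be returned unchanged.

So if the text you want to select lies in the middle of some string, adding .* before and after the capture groups lets the pattern consume the whole string, so that only the captures are returned.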

gsub(".*\\((.*?) :: (0\\.[0-9]+)\\).*","\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"
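The "returned unchanged" behaviour quoted above also means that an element with no match at all comes back verbatim, which is easy to mistake for a capture. A small sketch of one way to guard against that, using made-up input strings and grepl() as the check:

```r
x <- c("prefix (sometext :: 0.1231313213) suffix", "no pair here")
pattern <- ".*\\((.*?) :: (0\\.[0-9]+)\\).*"

# Without a guard, the second element comes back unchanged.
unguarded <- gsub(pattern, "\\1 \\2", x)

# Guard with grepl(): keep the substitution only where the pattern matched,
# and mark everything else as missing.
guarded <- ifelse(grepl(pattern, x), unguarded, NA_character_)
guarded
```

Whether NA is the right placeholder depends on what the downstream code expects; filtering with x[grepl(pattern, x)] before substituting is an equally reasonable choice.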

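【Comments】:

【Answer 5】:

I like Perl-compatible regular expressions. Probably someone else does too...

Here is a function that runs a Perl-compatible regular expression and mirrors the behaviour of the match functions I am used to from other languages: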

regexpr_perl <- function(expr, str) {
  # Match with PCRE so the capture.start/capture.length attributes are set
  match <- regexpr(expr, str, perl = TRUE)
  matches <- character(0)
  if (attr(match, 'match.length') >= 0) {
    capture_start <- attr(match, 'capture.start')
    capture_length <- attr(match, 'capture.length')
    total_matches <- 1 + length(capture_start)
    matches <- character(total_matches)
    # Element 1 is the full match; the remaining elements are the capture groups
    matches[1] <- substr(str, match, match + attr(match, 'match.length') - 1)
    # seq_along() also copes with patterns that have zero or one capture group
    for (i in seq_along(capture_start)) {
      matches[i + 1] <- substr(str, capture_start[[i]], capture_start[[i]] + capture_length[[i]] - 1)
    }
  }
  matches
}
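The same match attributes also carry group names when the pattern uses PCRE named groups. The following is a small standalone sketch, not part of the answer above; the group names key and value are chosen only for illustration.

```r
x <- "(sometext :: 0.1231313213)"

# Named groups such as (?<key>...) require perl = TRUE.
m <- regexpr("\\((?<key>.*?) :: (?<value>0\\.[0-9]+)\\)", x, perl = TRUE)

# Start positions and lengths of each capture group in the match.
starts <- attr(m, "capture.start")
lens   <- attr(m, "capture.length")

# Cut the groups out of the string and label them with their names.
groups <- substring(x, starts, starts + lens - 1)
names(groups) <- attr(m, "capture.names")
groups
```

Named groups make the result easier to read than positional indices when a pattern has more than two or three captures.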

【Comments】:

【Answer 6】:

A solution with strcapture() from the utils package:

x <- c("key1 :: 0.01",
       "key2 :: 0.02")
strcapture(pattern = "(.*) :: (0\\.[0-9]+)",
           x = x,
           proto = list(key = character(), value = double()))
#>    key value
#> 1 key1  0.01
#> 2 key2  0.02

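【Comments】:

【Answer 7】:

This is how I ended up working around the problem. I used two separate regular expressions to match the first and second capture groups, ran two gregexpr calls, and then pulled out the matched substrings: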

regex.string <- "(?<=\\().*?(?= :: )"
regex.number <- "(?<= :: )\\d\\.\\d+"

match.string <- gregexpr(regex.string, str, perl=T)[[1]]
match.number <- gregexpr(regex.number, str, perl=T)[[1]]

strings <- mapply(function (start, len) substr(str, start, start+len-1),
                  match.string,
                  attr(match.string, "match.length"))
numbers <- mapply(function (start, len) as.numeric(substr(str, start, start+len-1)),
                  match.number,
                  attr(match.number, "match.length"))
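The mapply()/substr() step can probably be shortened with regmatches(), which extracts the matched substrings from a gregexpr() result directly. This is a sketch under the assumption that str is a single string holding several encoded pairs; the sample value below is made up.

```r
# `str` is a made-up input holding two encoded pairs.
str <- "(sometext :: 0.1231313213)(moretext :: 0.111222)"

regex.string <- "(?<=\\().*?(?= :: )"
regex.number <- "(?<= :: )\\d\\.\\d+"

# regmatches() returns a list with one element per input string;
# [[1]] picks the matches for our single string.
strings <- regmatches(str, gregexpr(regex.string, str, perl = TRUE))[[1]]
numbers <- as.numeric(regmatches(str, gregexpr(regex.number, str, perl = TRUE))[[1]])
```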

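【Comments】:

+1 for working code. However, I'd rather run a quick shell command from R and use a Bash one-liner like this: expr "xyx0.0023xyxy" : '[^0-9]*\([.0-9]\+\)'

【Answer 8】:

As suggested above, this can be achieved with the stringr package, using either str_match() or str_extract().

Adapted from the manual: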

library(stringr)

strings <- c(" 219 733 8965", "329-293-8753 ", "banana", 
             "239 923 8115 and 842 566 4692",
             "Work: 579-499-7527", "$1000",
             "Home: 543.355.3679")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

Extracting and combining our groups:

str_extract_all(strings, phone, simplify=T)
#      [,1]           [,2]          
# [1,] "219 733 8965" ""            
# [2,] "329-293-8753" ""            
# [3,] ""             ""            
# [4,] "239 923 8115" "842 566 4692"
# [5,] "579-499-7527" ""            
# [6,] ""             ""            
# [7,] "543.355.3679" ""

Indicating groups with an output matrix (we're interested in columns 2 and up):

str_match_all(strings, phone)
# [[1]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "219 733 8965" "219" "733" "8965"
# 
# [[2]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "329-293-8753" "329" "293" "8753"
# 
# [[3]]
#      [,1] [,2] [,3] [,4]
# 
# [[4]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "239 923 8115" "239" "923" "8115"
# [2,] "842 566 4692" "842" "566" "4692"
# 
# [[5]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "579-499-7527" "579" "499" "7527"
# 
# [[6]]
#      [,1] [,2] [,3] [,4]
# 
# [[7]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "543.355.3679" "543" "355" "3679"

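【Comments】:

What about 842 566 4692?

Thanks for catching the omission. Corrected by using the _all suffix of the relevant stringr functions.

【Answer 9】:

This can be done with the unglue package, taking the example from the selected answer: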

# install.packages("unglue")
library(unglue)

s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
unglue_data(s, "({x} :: {y})")
#>          x            y
#> 1 sometext 0.1231313213
#> 2 moretext     0.111222

Or, starting from a data frame:

df <- data.frame(col = s)
unglue_unnest(df, col, "({x} :: {y})", remove = FALSE)
#>                          col        x            y
#> 1 (sometext :: 0.1231313213) sometext 0.1231313213
#> 2     (moretext :: 0.111222) moretext     0.111222

You can get the raw regex from the unglue pattern, optionally with named captures:

unglue_regex("({x} :: {y})")
#>             ({x} :: {y}) 
#> "^\\((.*?) :: (.*?)\\)$"

unglue_regex("({x} :: {y})", named_capture = TRUE)
#>                     ({x} :: {y}) 
#> "^\\((?<x>.*?) :: (?<y>.*?)\\)$"

More info: https://github.com/moodymudskipper/unglue/blob/master/README.md

【Comments】:
