Formulate capture groups for inconsistently present substrings

【Posted】: 2022-01-05 02:54:56

【Question】:

I have interview transcripts that are formatted partly irregularly:

tst <- c("In: ja COOL;  #00:04:24-6#  ",           
         "  in den vier, FÜNF wochen, #00:04:57-8# ",
         "In: jah,  #00:02:07-8# ",
         "In:     [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
         "    also jA:h; #00:03:16-6# (1.1)",
         "Bz:        [E::hm;    ]  #00:03:51-4#  (3.0)  ",
         "Bz:    [mhmh,      ]",
         "  in den bilLIE da war;")

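What I need to do is structure this data by extracting its key elements into the columns of a data frame. There are four such key elements:

Role in the interview: interviewee or interviewer

Utterance: what the interview partners say

Timestamp: delimited by # at both ends

Gap: a decimal number in parentheses

The problem is that Timestamp and Gap are not consistently present. I can make the last capture group, for Gap, optional, but strings that have neither a Timestamp nor a Gap are not matched at all and come out as NA:

I'm using extract from tidyr for the extraction: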

library(tidyr)
data.frame(tst) %>%
  extract(col = tst,
          into = c("Role", "Utterance", "Timestamp", "Gap"),
          regex = "^(\\w{2}:\\s|\\s+)([\\S\\s]+?)\\s*#([^#]+)?#\\s*(\\([0-9.]+\\))?\\s*")
  Role                 Utterance  Timestamp   Gap
1 In:                   ja COOL; 00:04:24-6      
2      in den vier, FÜNF wochen, 00:04:57-8      
3 In:                       jah, 00:02:07-8      
4 In:                     [ja; ] 00:03:25-5      
5                     also jA:h; 00:03:16-6 (1.1)
6 Bz:               [E::hm;    ] 00:03:51-4 (3.0)
7 <NA>                      <NA>       <NA>  <NA>
8 <NA>                      <NA>       <NA>  <NA>

How can the regex be improved so that I get the desired output:

  Role                 Utterance  Timestamp   Gap
1 In:                   ja COOL; 00:04:24-6      
2      in den vier, FÜNF wochen, 00:04:57-8      
3 In:                       jah, 00:02:07-8      
4 In:                     [ja; ] 00:03:25-5      
5                     also jA:h; 00:03:16-6 (1.1)
6 Bz:               [E::hm;    ] 00:03:51-4 (3.0)
7 Bz:              [mhmh,      ]
8          in den bilLIE da war;

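【Answer 1】:

You can keep your 4 capture groups and make the trailing part optional: wrap the Timestamp and Gap parts (groups 3 and 4) in an optional non-capturing group, and assert the end of the string so that the lazy Utterance group still has to consume everything up to that point: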

library(tidyr)

tst <- c("In: ja COOL;  #00:04:24-6#  ",           
         "  in den vier, FÜNF wochen, #00:04:57-8# ",
         "In: jah,  #00:02:07-8# ",
         "In:     [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
         "    also jA:h; #00:03:16-6# (1.1)",
         "Bz:        [E::hm;    ]  #00:03:51-4#  (3.0)  ",
         "Bz:    [mhmh,      ]",
         "  in den bilLIE da war;")     

data.frame(tst) %>%
  extract(col = tst,
          into = c("Role", "Utterance", "Timestamp", "Gap"),
          regex = "^(\\w{2}:\\s|\\s+)([\\s\\S]*?)(?:\\s*#([^#]+)(?:#\\s*(\\([0-9.]+\\))?\\s*)?)?$")

Output

  Role                      Utterance  Timestamp   Gap
1 In:                        ja COOL; 00:04:24-6      
2           in den vier, FÜNF wochen, 00:04:57-8      
3 In:                            jah, 00:02:07-8      
4 In:      [ja; ] #00:03:25-5# [ja; ] 00:03:26-1      
5                          also jA:h; 00:03:16-6 (1.1)
6 Bz:                    [E::hm;    ] 00:03:51-4 (3.0)
7 Bz:                   [mhmh,      ]                 
8               in den bilLIE da war; 
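
Note that in row 4 of the output above, the first timestamp ends up inside Utterance and the second one is reported, because the lazy Utterance group has to stretch until the rest of the pattern can reach the end of the string. If the first timestamp is wanted instead, as in the desired output of the question, one possible variant is sketched below. It is an assumption of this note, not part of the answer: the Utterance group is restricted to non-# characters and a trailing `.*` absorbs whatever follows the first timestamp and gap. The sketch uses base R (`regexec` with `perl = TRUE`) so that it runs without packages; the name `first_ts` is made up for illustration, and the same pattern string should also be usable as the `regex` argument of `extract()`.

```r
tst <- c("In:     [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
         "    also jA:h; #00:03:16-6# (1.1)",
         "Bz:    [mhmh,      ]")

# Hypothetical variant: the utterance may not contain '#', so matching stops
# at the first timestamp; '.*' then swallows anything after timestamp and gap.
first_ts <- "^(\\w{2}:\\s|\\s+)([^#]*?)\\s*(?:#([^#]+)#\\s*(\\([0-9.]+\\))?.*)?$"

# One character vector per input string: full match followed by the 4 groups.
m <- regmatches(tst, regexec(first_ts, tst, perl = TRUE))

# Drop the full match (column 1) and keep the four capture groups.
res <- as.data.frame(do.call(rbind, m)[, -1, drop = FALSE],
                     stringsAsFactors = FALSE)
names(res) <- c("Role", "Utterance", "Timestamp", "Gap")

# The utterance group keeps its leading padding, so trim it afterwards.
res$Utterance <- trimws(res$Utterance)
print(res)
```

Groups that do not take part in the match (Timestamp and Gap in the third string) come back as empty strings here.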

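【Answer 2】:

An alternative to one complex regular expression is to run several extractions, each with a simpler regular expression, then convert any NA to "" and strip unwanted whitespace.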

library(dplyr)
library(tidyr)

data.frame(tst) %>%
  extract(tst, "Gap", "(\\(.*?\\))", remove = FALSE) %>%
  extract(tst, "Timestamp", "(#.*?#)", remove = FALSE) %>%
  extract(tst, c("Role", "Utterance"), "^(\\S+:|)([^#]*)") %>%
  mutate(across(everything(), ~ coalesce(.x, "")), Utterance = trimws(Utterance))

giving:

  Role                 Utterance    Timestamp   Gap
1  In:                  ja COOL; #00:04:24-6#      
2      in den vier, FÜNF wochen, #00:04:57-8#      
3  In:                      jah, #00:02:07-8#      
4  In:                    [ja; ] #00:03:25-5#      
5                     also jA:h; #00:03:16-6# (1.1)
6  Bz:              [E::hm;    ] #00:03:51-4# (3.0)
7  Bz:             [mhmh,      ]                   
8          in den bilLIE da war;                   
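
The same step-by-step idea can also be sketched in base R, without tidyr or dplyr. This is only an illustration under stated assumptions: `first_match()` is a hypothetical helper written for this sketch (it is not a function from any package), and the sample is a three-string subset of `tst`. Each field gets its own simple pattern, and a missing field becomes "" directly, so no separate NA replacement is needed.

```r
tst <- c("In: jah,  #00:02:07-8# ",
         "    also jA:h; #00:03:16-6# (1.1)",
         "Bz:    [mhmh,      ]")

# first_match() is a hypothetical helper: the first match of `pattern`
# in each element of `x`, or "" where the pattern does not match.
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep("", length(x))
  out[m != -1] <- regmatches(x, m)  # regmatches() returns matched elements only
  out
}

res <- data.frame(
  Role      = first_match("^\\S+:", tst),
  # Remove the role prefix first, then take everything before the first '#'.
  Utterance = trimws(first_match("[^#]*", sub("^\\S+:", "", tst))),
  Timestamp = first_match("#.*?#", tst),
  Gap       = first_match("\\(.*?\\)", tst),
  stringsAsFactors = FALSE
)
print(res)
```

As in the tidyr version above, Timestamp keeps its surrounding # characters; they can be removed afterwards with `gsub("#", "", res$Timestamp)` if they are not wanted.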
