Extract disallowed characters

Posted 2023-02-14

【Title】: Extract disallowed characters 【Posted】: 2022-01-08 20:06:27 【Question】:

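I have transcriptions with encoding errors, that is, characters that occur but should not occur.

In this toy data, the only allowed characters are those in this class: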

"[)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"

df <- data.frame(
  Utterance = c("~°maybe you (.) >should ¥just¥<",
                "SOME text |<-- pipe¿ and€",            # <--: | and €
                "blah%",                                # <--: %
                "text ^more text",                      # <--: ^
                "£norm(hh)a::l£mal, (1.22)"))

What I need to do is:

detect the Utterances that contain any wrong encoding

extract the wrong characters

Detection works fine, but extraction fails:

library(stringr)
library(dplyr)
df %>%
  filter(!str_detect(Utterance, "[)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")) %>%
  mutate(WrongChar = str_extract_all(Utterance, "[^)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))
                  Utterance                                  WrongChar
1 SOME text |<-- pipe¿ and€ SO, ME,  t, ex, |<, --,  p, ip, e¿,  a, nd
2                     blah%                                     bl, ah
3           text ^more text                     te, xt, ^m, or,  t, ex

How can I improve the extraction to get this expected result:

                  Utterance WrongChar
1 SOME text |<-- pipe¿ and€      |, €
2                     blah%         %
3           text ^more text         ^

【Answer 1】:

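You need to

make sure [ and ] are escaped inside the character class

add a whitespace pattern to both regex checks, since its absence affects your results.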
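To illustrate both points, here is a minimal sketch using stringr on two strings from the question. The variable names (broken, no_space, fixed) are only for illustration. With the unescaped pattern, the first ] closes the class early, so the regex is really two classes in a row and every match consumes two characters. Without \\s in the negated class, spaces are reported as wrong characters too.

```r
library(stringr)

# Unescaped: "[^)(/]" is already a complete class, followed by a second class,
# so each match is a pair of characters.
broken <- "[^)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"
str_extract_all("blah%", broken)[[1]]
# "bl" "ah"

# Brackets escaped, but no whitespace pattern: spaces count as disallowed.
no_space <- "[^)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"
str_extract_all("text ^more text", no_space)[[1]]
# " " "^" " "

# Brackets escaped and whitespace allowed: only the wrong characters remain.
fixed <- "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"
str_extract_all("blah%", fixed)[[1]]
# "%"
str_extract_all("text ^more text", fixed)[[1]]
# "^"
```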

So you need to use

df %>%
   filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")) %>%
   mutate(WrongChar = str_extract_all(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))

Output:

                  Utterance WrongChar
1 SOME text |<-- pipe¿ and€      |, €
2                     blah%         %
3           text ^more text         ^

Note that I used positive logic in filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")), so we get all items that contain at least one character other than the allowed ones.
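Since the same pattern is used for both detection and extraction, it can be stored once and reused. The following is a sketch of that idea without dplyr. The names disallowed, has_wrong, and result are hypothetical, and collapsing the matches with toString is an assumption about how you want the column displayed, not part of the answer above.

```r
library(stringr)

utterances <- c("~°maybe you (.) >should ¥just¥<",
                "SOME text |<-- pipe¿ and€",
                "blah%",
                "text ^more text",
                "£norm(hh)a::l£mal, (1.22)")

# Negated class from the answer: anything that is not whitespace
# and not one of the allowed characters.
disallowed <- "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"

# Positive logic: TRUE where at least one disallowed character occurs.
has_wrong <- str_detect(utterances, disallowed)

# Extract every disallowed character from the flagged utterances only.
wrong_chars <- str_extract_all(utterances[has_wrong], disallowed)

# Collapse each match vector into one comma-separated string.
result <- data.frame(
  Utterance = utterances[has_wrong],
  WrongChar = vapply(wrong_chars, toString, character(1))
)
print(result)
```

With the toy data this keeps the same three rows as the output above, with WrongChar as a plain character column rather than a list column.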
