R regex to replace all punctuation except sentence markers, apostrophes and hyphens


【Posted】: 2015-08-06 17:44:00
【Question】:

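I'm looking for a way to mark the beginning and end of sentences in R. To do that, I would like to eliminate all punctuation except end-of-sentence markers (periods, exclamation marks, question marks and hyphens), which I would like to replace with the marker ***. At the same time, I want to preserve words that contain an apostrophe. To give a concrete example, given this string: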

txt <- "We have examined all the possibilities, however we have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"

the desired result is

txt <- "We have examined all the possibilities however we have not reached a solid conclusion *** however we keep and open mind*** Have you considered any other approach*** Haven't you***"

I haven't been able to come up with a regex that does this. Any hints are much appreciated.

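【Comments】:

【Answer 1】:

You can do this with two nested gsub calls: the inner call deletes the unwanted punctuation, and the outer call replaces the remaining sentence markers.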

> txt <- "We have examined all the possibilities, however we have not reached a solid conclusion - however we keep and open mind! Have you considered any other approach? Haven't you?"
> gsub("[-.?!]", "<S>", gsub("(?![-.?!'])[[:punct:]]", "", txt, perl=TRUE))
[1] "We have examined all the possibilities however we have not reached a solid conclusion <S> however we keep and open mind<S> Have you considered any other approach<S> Haven't you<S>"
> gsub("[-.?!]", "***", gsub("(?![-.?!'])[[:punct:]]", "", txt, perl=TRUE))
[1] "We have examined all the possibilities however we have not reached a solid conclusion *** however we keep and open mind*** Have you considered any other approach*** Haven't you***"

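"I would like to eliminate all punctuation except end-of-sentence markers such as periods, exclamation marks, question marks and hyphens." The negative lookahead (?![-.?!']) stops [[:punct:]] from matching the sentence markers and the apostrophe, so every other punctuation character is deleted:

gsub("(?![-.?!'])[[:punct:]]", "", txt, perl=TRUE)

"I would like to replace them with the marker ***. At the same time, I want to preserve words that contain an apostrophe." The apostrophe has already survived the first pass, so the outer call only has to replace the sentence markers:

gsub("[-.?!]", "***", gsub("(?![-.?!'])[[:punct:]]", "", txt, perl=TRUE))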
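For reuse, the two passes can be wrapped in a small function. This is only a sketch: the name mark_sentences and its arguments marker and strict_apostrophes are hypothetical and do not come from the answer. The stricter pattern is the one suggested in the comments to this answer; it keeps an apostrophe only when it sits between word characters.

```r
# Hypothetical helper (not part of the original answer): wraps the two gsub() passes.
mark_sentences <- function(x, marker = "***", strict_apostrophes = FALSE) {
  # Pattern for the punctuation to delete. The negative lookahead protects
  # the sentence markers and (depending on the option) the apostrophe.
  drop_pattern <- if (strict_apostrophes) {
    "(?![-.?!]|\\b'\\b)[[:punct:]]"  # keep ' only between word characters
  } else {
    "(?![-.?!'])[[:punct:]]"         # keep every apostrophe
  }
  # Pass 1: delete the unwanted punctuation (lookahead requires perl = TRUE).
  cleaned <- gsub(drop_pattern, "", x, perl = TRUE)
  # Pass 2: replace each sentence marker. Note that gsub() interprets
  # backslashes in the replacement, so marker should not contain any.
  gsub("[-.?!]", marker, cleaned)
}

mark_sentences("Wait, what? It's fine.")
# [1] "Wait what*** It's fine***"
mark_sentences("She said 'stop', didn't she?", strict_apostrophes = TRUE)
# [1] "She said stop didn't she***"
```

Because the hyphen is treated as a sentence marker, a hyphen inside a word is replaced as well: "well-known" becomes "well***known".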

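【Comments】:

A variant that keeps an apostrophe only when it sits between word characters, so stray quote marks are removed:

gsub("[-.?!]", "***", gsub("(?![-.?!]|\\b'\\b)[[:punct:]]", "", txt, perl=TRUE))

【Answer 2】: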

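You can do this with two regular expressions. First, use a character class to remove the characters you don't want:

[,]
 ^--- Put whatever you want to remove here

with an empty replacement string. The period does not belong in this class, because it is a sentence marker and has to survive for the second step.

Then use a second regex like this:

[.?!-]
  ^--- Add the characters you want to replace here

with the replacement string:

<S>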
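In R, the two passes described above could be sketched with gsub as follows. The contents of both character classes and the sample string are assumptions chosen for illustration; adjust them to the punctuation that actually occurs in your text.

```r
# Illustrative input (not from the original thread).
x <- "Well, yes (mostly): done! Really?"

# Pass 1: remove unwanted characters. The class below is an assumed example;
# list whatever you want to delete inside the brackets.
step1 <- gsub("[,;:\"()]", "", x)

# Pass 2: replace the sentence markers with the marker string.
# The hyphen is placed last in the class so it is taken literally.
step2 <- gsub("[.?!-]", "<S>", step1)

step2
# [1] "Well yes mostly done<S> Really<S>"
```

Unlike the lookahead approach, this one only removes the characters you list explicitly, so any punctuation missing from the first class stays in the text.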


【Comments】:
