Delete duplicate word, comma and whitespace
Posted: 2022-01-13 07:02:10
Question:
How can I use a regex in R to remove all duplicate words, together with the comma and whitespace that follow them?
So far I have come up with the following regex, which matches the duplicates but not the comma and whitespace:
(\b\w+\b)(?=[\S\s]*\b\1\b)
An example list is:
blue, red, blue, yellow, green, blue
The output should look like this:
blue, red, yellow, green
So in this case it has to match two of the "blue" entries, as well as the comma and whitespace that follow each of them (if any).
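To illustrate what the pattern above does once a trailing `,?\s*` is appended, here is a minimal sketch. It assumes `perl = TRUE`, since R's default TRE engine does not support lookahead; the variable names are only illustrative. Because the lookahead removes every word that occurs again later, the last occurrence is the one that survives, not the first.

```r
s <- "blue, red, blue, yellow, green, blue"

# The asker's pattern, extended with an optional comma and trailing whitespace.
# A word is matched only if the same word appears again later in the string.
pattern <- "(\\b\\w+\\b)(?=[\\S\\s]*\\b\\1\\b),?\\s*"

result <- gsub(pattern, "", s, perl = TRUE)
print(result)
```

This gives "red, yellow, green, blue", so the word order differs from the requested output, in which the first occurrence of each word is kept.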
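Question comments:
This cannot be done with PCRE, TRE, or ICU regex, because none of them supports infinite-width lookbehind patterns.
Answer 1:
It depends on whether your list is really a list or a string with commas.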
# your data is actually already a list/vector
v <- c("blue", "red", "blue", "yellow", "green", "blue")
unique(v)
[1] "blue" "red" "yellow" "green"
# if your data is actually a comma-separated string
s <- "blue, red, blue, yellow, green, blue"
# if output needs to be a vector
unique(strsplit(s, ", ")[[1]])
[1] "blue" "red" "yellow" "green"
# if output needs to be a string again
paste(unique(strsplit(s, ", ")[[1]]), collapse = ", ")
[1] "blue, red, yellow, green"
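If the spacing around the commas is not perfectly regular, splitting on ", " will leave stray whitespace in the pieces. A hedged sketch of a small wrapper follows; `dedupe_csv` is a hypothetical helper name, not something from the original answer, and it uses only base R functions.

```r
# Hypothetical helper: split on commas, trim whitespace, drop empty pieces,
# keep the first occurrence of each word, and join the result again.
dedupe_csv <- function(s, sep = ", ") {
  parts <- trimws(strsplit(s, ",", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]
  paste(unique(parts), collapse = sep)
}

messy <- "blue,red ,  blue, yellow,green, blue"
cleaned <- dedupe_csv(messy)
print(cleaned)
```

For the irregularly spaced input above this yields "blue, red, yellow, green".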
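Example based on a list column in a data.table or data.frame:
# data.table must be loaded before data.table() can be called
library(data.table)
dt <- data.table(
id = 1:5,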
colors = list(
c("blue", "red", "blue", "yellow", "green", "blue"),
c("blue", "blue", "yellow", "green", "blue"),
c("blue", "red", "blue", "yellow"),
c("red", "red", "yellow", "yellow", "green", "blue"),
c("black")
)
)
## using data.table
library(data.table)
setDT(dt)
# assign to colors instead of clean_list to overwrite the existing column
dt[, clean_list := lapply(colors, function(x) unique(x))]
## using dplyr
library(dplyr)
# assign to colors instead of clean_list to overwrite the existing column
# note: mutate() returns the modified table; assign the result to keep it
dt %>% mutate(clean_list = lapply(colors, function(x) unique(x)))
dt
# id colors clean_list
# 1: 1 blue,red,blue,yellow,green,blue blue,red,yellow,green
# 2: 2 blue,blue,yellow,green,blue blue,yellow,green
# 3: 3 blue,red,blue,yellow blue,red,yellow
# 4: 4 red,red,yellow,yellow,green,blue red,yellow,green,blue
# 5: 5 black black
# or just simply in base
dt$colors <- lapply(dt$colors, function(x) unique(x))
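When the column holds comma-separated strings rather than list elements, the same idea can be applied row by row in base R. The sketch below assumes a hypothetical data frame `df` with a character column `colors`; the names and data are illustrative only.

```r
# Hypothetical data: one comma-separated string per row
df <- data.frame(
  id = 1:3,
  colors = c("blue, red, blue", "red, red, yellow", "black"),
  stringsAsFactors = FALSE
)

# Split each string, drop repeated words, and paste the pieces back together.
# vapply() guarantees one character value per row.
df$colors_clean <- vapply(
  strsplit(df$colors, ", ", fixed = TRUE),
  function(x) paste(unique(x), collapse = ", "),
  character(1)
)
print(df)
```

The new column then contains "blue, red", "red, yellow" and "black".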
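Comments:
Thanks! It is already a list. How would I do this for a dataset where I want to remove all duplicates in a specific column? Each row has its own list.
I'll update my answer.
Thanks! Do you know of a way to do this in base R?
dt$colors <- lapply(dt$colors, function(x) unique(x))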
Answer 2:
We can use paste together with unique and collapse:
paste(unique(string), collapse = ", ")
[1] "blue, red, yellow, green"
Data:
string <- c("blue", "red", "blue", "yellow", "green", "blue")
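As a variation on the same idea, base R's `duplicated()` makes it explicit which occurrence is kept. This is a sketch, not part of the original answer; `fromLast = TRUE` keeps the last occurrence of each word instead of the first.

```r
string <- c("blue", "red", "blue", "yellow", "green", "blue")

# Keep the first occurrence of each word (same result as unique())
keep_first <- paste(string[!duplicated(string)], collapse = ", ")

# Keep the last occurrence of each word instead
keep_last <- paste(string[!duplicated(string, fromLast = TRUE)], collapse = ", ")

print(keep_first)
print(keep_last)
```

Here `keep_first` is "blue, red, yellow, green" and `keep_last` is "red, yellow, green, blue".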
Comments:
How would I do this for a dataset where I want to remove all duplicates in a specific column?