Split comma-separated strings in a column into separate rows
Posted: 2022-01-22 20:15:56
[Question]: I have a data frame, as follows:
data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa",
"Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu",
"Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu",
"Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock",
"Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik",
"Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson",
"Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))
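As you can see, some entries in the `director` column are multiple names separated by commas. I would like to split these entries into separate rows while keeping the values of the other column. For example, the first row in the data frame above should be split into two rows, each with a single name in the `director` column and "A" in the `AB` column.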
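[Question comments]:
Just to ask the obvious: should you be posting this data on the internet?
They "weren't all B-movies". Seems innocuous enough.
All of these people are Academy Award nominees, which I hardly think is a secret =)
[Answer 1]:
Several alternatives:
1) Two ways with data.table: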
library(data.table)
# method 1 (preferred)
setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
][!is.na(director)]
# method 2
setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)
][,.(director = V1, AB)]
2) A dplyr / tidyr combination:
library(dplyr)
library(tidyr)
v %>%
mutate(director = strsplit(as.character(director), ",")) %>%
unnest(director)
3) With tidyr only: with tidyr 0.5.0 (and later), you can also just use `separate_rows`:
separate_rows(v, director, sep = ",")
You can use the `convert = TRUE` argument to automatically convert split values that look like numbers into numeric columns.
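To illustrate the `convert = TRUE` remark, here is a minimal sketch on a made-up data frame. The `id` and `score` columns are hypothetical and not part of the question's data, and the sketch assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical example data: 'score' holds comma-separated numbers stored as text
df <- data.frame(id = c("a", "b"),
                 score = c("1,2", "3"),
                 stringsAsFactors = FALSE)

# Without convert, the split values stay character
as_text <- separate_rows(df, score, sep = ",")

# With convert = TRUE, the split column is type-converted after splitting
as_number <- separate_rows(df, score, sep = ",", convert = TRUE)

# Inspect the column types of both results
str(as_text$score)
str(as_number$score)
```

For the `director` column in the question, `convert` makes no difference because the names are not numeric.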
4) With base R:
# the data frame is called 'v' here, as in the examples above
# if 'director' is a character-column:
stack(setNames(strsplit(v$director,','), v$AB))
# if 'director' is a factor-column:
stack(setNames(strsplit(as.character(v$director),','), v$AB))
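One detail of the base R variant is that `stack` names its output columns `values` and `ind`, with `ind` being a factor. The following sketch, using a two-row subset of the question's data and the hypothetical result name `long`, shows one way to get back to the original column names; it is an illustration, not part of the original answer.

```r
# Two-row subset of the question's data
v <- data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa"),
                AB = c("A", "B"),
                stringsAsFactors = FALSE)

# Split each entry, label each piece with its AB value, then stack into long form
long <- stack(setNames(strsplit(v$director, ",", fixed = TRUE), v$AB))

# stack() returns columns 'values' and 'ind'; restore the original names
names(long) <- c("director", "AB")

# 'ind' is a factor; convert it back to character if that is preferred
long$AB <- as.character(long$AB)

print(long)
```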
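[Comments]:
Is there a way to do this for multiple columns at once? For example, 3 columns that each contain strings separated by ";", where every column has the same number of strings, i.e. `data.table(id= "X21", a = "chr1;chr1;chr1", b="123;133;134",c="234;254;268")` becomes `data.table(id = c("X21","X21","X21"), a=c("chr1","chr1","chr1"), b=c("123","133","134"), c=c("234","254","268"))`?
Wow, just realized it already works for multiple columns at once - this is amazing!
@Reilstein Could you share how you applied this to multiple columns? I have the same use case, but am not sure how to go about it.
@Moon_Watcher Method 1 in the answer above already works for multiple columns, which is what I thought was amazing. `setDT(dt)[,lapply(.SD, function(x) unlist(tstrsplit(x, ";",fixed=TRUE))), by = ID]` is what worked for me.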
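The multi-column case from these comments can be sketched as follows, using the example table given in the comment above. The result name `long` is hypothetical, and the sketch assumes the data.table package is installed and that every column holds the same number of ";"-separated pieces per row.

```r
library(data.table)

# Example table from the comment: three columns with ";"-separated strings
dt <- data.table(id = "X21",
                 a = "chr1;chr1;chr1",
                 b = "123;133;134",
                 c = "234;254;268")

# .SD holds all columns except the grouping column 'id';
# each of them is split on ";" and flattened into one value per row
long <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ";", fixed = TRUE))),
           by = id]

print(long)
```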
Is there a way to use the := assignment operator in the DT solutions, as opposed to using the usual ...
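[Answer 2]:
This old question is frequently used as a dupe target (tagged with `r-faq`). As of today, it has been answered three times, offering 6 different approaches, but it lacks a benchmark as guidance on which of the approaches is the fastest [1].
The benchmarked solutions include
Matthew Lundberg's base R approach, but modified according to Rich Scriven's comment,
Jaap's two `data.table` methods and two `dplyr` / `tidyr` approaches,
Ananda's `splitstackshape` solution,
and two additional variants of Jaap's `data.table` methods.
Overall, 8 different methods were benchmarked on 6 different sizes of data frames using the `microbenchmark` package (see code below).
The sample data given by the OP consists of only 20 rows. To create larger data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10000, and 100000 times, which gives problem sizes of up to 2 million rows.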
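The arithmetic behind the problem sizes and the number of timing repetitions can be sketched in base R. This is only an illustration: `base_rows`, `problem_sizes`, `small` and `bigger` are hypothetical names, and `pmin`/`pmax` stand in here for the clamping to the range 3 to 100 that the benchmark code does with `scales::squish`.

```r
base_rows <- 20L        # number of rows in the OP's sample data
n_rep <- 10L^(0:5)      # repetition factors 1, 10, ..., 100000

# resulting problem sizes, from 20 rows up to 2 million rows
problem_sizes <- base_rows * n_rep

# number of benchmark runs per size: 10000 / n, clamped to the range [3, 100]
mb_times <- pmin(100, pmax(3, 10000 / n_rep))

# repeating the rows of a data frame n times (here n = 3 on a two-row subset)
small <- data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa"),
                    AB = c("A", "B"))
bigger <- data.frame(director = rep(small$director, 3), AB = rep(small$AB, 3))

print(data.frame(n_rep, problem_sizes, mb_times))
```

Larger problems are therefore timed with fewer repetitions, which keeps the total run time of the benchmark manageable.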
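Benchmark results
The benchmark results show that, for sufficiently large data frames, all `data.table` methods are faster than any other method. For data frames with more than about 5000 rows, Jaap's `data.table` method 2 and the variant `DT3` are the fastest, far faster than the slowest methods.
Remarkably, the timings of the two `tidyverse` methods and the `splitstackshape` solution are so similar that it is difficult to distinguish the curves in the chart. They are the slowest of the benchmarked methods across all data frame sizes.
For smaller data frames, Matt's base R solution and `data.table` method 4 seem to have less overhead than the other methods.
Code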
director <-
c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula",
"Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu",
"Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu",
"Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock",
"Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik",
"Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson",
"Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A",
"A", "B", "B", "B", "B", "B", "B", "A")
library(data.table)
library(magrittr)
Define function for benchmark runs of problem size `n`
run_mb <- function(n) {
  # compute number of benchmark runs depending on problem size `n`
  mb_times <- scales::squish(10000L / n , c(3L, 100L))
  cat(n, " ", mb_times, "\n")
  # create data
  DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
  DT <- as.data.table(DF)
  # start benchmarks
  microbenchmark::microbenchmark(
    matt_mod = {
      s <- strsplit(as.character(DF$director), ',')
      data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
    jaap_DT1 = {
      DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]},
    jaap_DT2 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][,.(director = V1, AB)]},
    jaap_dplyr = {
      DF %>%
        dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
        tidyr::unnest(director)},
    jaap_tidyr = {
      tidyr::separate_rows(DF, director, sep = ",")},
    cSplit = {
      splitstackshape::cSplit(DF, "director", ",", direction = "long")},
    DT3 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][, director := NULL][
           , setnames(.SD, "V1", "director")]},
    DT4 = {
      DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))),
         by = .(AB)]},
    times = mb_times
  )
}
Run benchmark for different problem sizes
# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)
Prepare data for plotting
mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]
Create chart
library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) +
geom_point() + geom_smooth(se = FALSE) +
scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() +
xlab("number of rows") + ylab("median of execution time [ms]") +
ggtitle("microbenchmark results") + theme_bw()
Session info and package versions (excerpt)
devtools::session_info()
#Session info
# version R version 3.3.2 (2016-10-31)
# system x86_64, mingw32
#Packages
# data.table * 1.10.4 2017-02-01 CRAN (R 3.3.2)
# dplyr 0.5.0 2016-06-24 CRAN (R 3.3.1)
# forcats 0.2.0 2017-01-23 CRAN (R 3.3.2)
# ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2)
# magrittr * 1.5 2014-11-22 CRAN (R 3.3.0)
# microbenchmark 1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales 0.4.1 2016-11-09 CRAN (R 3.3.2)
# splitstackshape 1.4.2 2014-10-23 CRAN (R 3.3.3)
# tidyr 0.6.1 2017-01-10 CRAN (R 3.3.2)
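[1] My curiosity was piqued by this exuberant comment, "Brilliant! Orders of magnitude faster!", on a `tidyverse` answer to a question that was closed as a duplicate of this one.
[Comments]:
Nice! Looks like there is room for improvement in cSplit and separate_rows (which are specifically designed to do this). By the way, cSplit also takes a fixed= arg and is a data.table-based package, so you might as well give it DT instead of DF. Also, I don't think the conversion from factor to character belongs in the benchmark (since it should be character to begin with). I checked, and none of these changes affects the results qualitatively.
@Frank Thank you for your suggestions to improve the benchmarks and for checking the effect on the results. I will pick this up when doing an update after the next versions of `data.table`, `dplyr`, etc. are released.
I think the approaches are not comparable, at least not in all cases, because the data.table approaches only produce tables with the "selected" columns, while dplyr produces a result with all columns (including those not involved in the analysis, and without having to write their names in the function).
@Ferroao That's wrong: the data.table approaches modify the "table" in place and all columns are kept. Of course, if you don't modify in place, all you get is a filtered copy of what you asked for. In short, the data.table approach is not to produce a resulting dataset but to update the dataset; that is the real difference between data.table and dplyr.
Really nice comparison! Maybe you could add `fixed=TRUE` to the `strsplit` calls in matt_mod and jaap_dplyr, as the others have it and it will affect the timings. Since R 4.0.0, the default when creating a `data.frame` is `stringsAsFactors = FALSE`, so `as.character` could be removed.
[Answer 3]:
Naming your original data.frame `v`, we have this: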
> s <- strsplit(as.character(v$director), ',')
> data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))
director AB
1 Aaron Blaise A
2 Bob Walker A
3 Akira Kurosawa B
4 Alan J. Pakula A
5 Alan Parker A
6 Alejandro Amenabar B
7 Alejandro Gonzalez Inarritu B
8 Alejandro Gonzalez Inarritu B
9 Benicio Del Toro B
10 Alejandro González Iñárritu A
11 Alex Proyas B
12 Alexander Hall A
13 Alfonso Cuaron B
14 Alfred Hitchcock A
15 Anatole Litvak A
16 Andrew Adamson B
17 Marilyn Fox B
18 Andrew Dominik B
19 Andrew Stanton B
20 Andrew Stanton B
21 Lee Unkrich B
22 Angelina Jolie B
23 John Stevenson B
24 Anne Fontaine B
25 Anthony Harvey A
Note the use of `rep` to build the new `AB` column. Here, `sapply` returns the number of names in each of the original rows.
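The role of `rep` can be seen in isolation with a small sketch. The list `s` below is written out by hand to mimic what `strsplit` returns for the first two rows of the data; `counts` and `ab_long` are hypothetical names used only for illustration.

```r
# What strsplit() would return for the first two rows of the data
s <- list(c("Aaron Blaise", "Bob Walker"), "Akira Kurosawa")
ab <- c("A", "B")

# number of names in each original row
counts <- sapply(s, FUN = length)

# when 'times' is a vector, rep() repeats each element that many times,
# so every split-out name gets the AB value of the row it came from
ab_long <- rep(ab, counts)

result <- data.frame(director = unlist(s), AB = ab_long)
print(result)
```

In current versions of R, `lengths(s)` gives the same counts as `sapply(s, FUN = length)`.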
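[Comments]:
I'm wondering whether `AB=rep(v$AB, unlist(sapply(s, FUN=length )))` might be easier to grasp than the more obscure `vapply`? Is there anything that makes `vapply` more appropriate here?
Nowadays `sapply(s, length)` can be replaced with `lengths(s)`.
[Answer 4]:
Late to the party, but another generalized alternative is to use `cSplit` from my "splitstackshape" package, which has a `direction` argument. Set this to `"long"` to get the result you specify: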
library(splitstackshape)
head(cSplit(mydf, "director", ",", direction = "long"))
# director AB
# 1: Aaron Blaise A
# 2: Bob Walker A
# 3: Akira Kurosawa B
# 4: Alan J. Pakula A
# 5: Alan Parker A
# 6: Alejandro Amenabar B
[Answer 5]:
devtools::install_github("yikeshu0611/onetree")
library(onetree)
dd=spread_byonecolumn(data=mydata,bycolumn="director",joint=",")
head(dd)
director AB
1 Aaron Blaise A
2 Bob Walker A
3 Akira Kurosawa B
4 Alan J. Pakula A
5 Alan Parker A
6 Alejandro Amenabar B
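[Answer 6]:
Another benchmark suggests that `strsplit` from base R can currently be recommended for splitting comma-separated strings in a column into separate rows, as it was the fastest over a wide range of sizes: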
s <- strsplit(v$director, ",", fixed=TRUE)
s <- data.frame(director=unlist(s), AB=rep(v$AB, lengths(s)))
Note that using `fixed=TRUE` has a significant impact on the timings.
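Apart from speed, `fixed = TRUE` also changes how the separator is interpreted: without it, `strsplit` treats the separator as a regular expression. For a plain comma both calls return the same pieces, but for a separator such as "." they differ. A small sketch with made-up strings (all variable names are hypothetical):

```r
# With a comma, regex matching and fixed matching give the same pieces
x <- "Aaron Blaise,Bob Walker"
regex_split <- strsplit(x, ",")[[1]]
fixed_split <- strsplit(x, ",", fixed = TRUE)[[1]]
same_for_comma <- identical(regex_split, fixed_split)

# With a dot, the regex "." matches any character, so the results differ
y <- "a.b"
dot_fixed <- strsplit(y, ".", fixed = TRUE)[[1]]   # splits on the literal dot
dot_regex <- strsplit(y, ".")[[1]]                 # "." used as a regular expression
same_for_dot <- identical(dot_regex, dot_fixed)

print(c(comma = same_for_comma, dot = same_for_dot))
```

So for literal separators, `fixed = TRUE` is both the faster and the safer choice.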
Compared methods:
met <- alist(base = {s <- strsplit(v$director, ",") #Matthew Lundberg
   s <- data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))}
 , baseLength = {s <- strsplit(v$director, ",") #Rich Scriven
   s <- data.frame(director=unlist(s), AB=rep(v$AB, lengths(s)))}
 , baseLeFix = {s <- strsplit(v$director, ",", fixed=TRUE)
   s <- data.frame(director=unlist(s), AB=rep(v$AB, lengths(s)))}
 , cSplit = s <- cSplit(v, "director", ",", direction = "long") #A5C1D2H2I1M1N2O1R2T1
 , dt = s <- setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, "," #Jaap
   , fixed=TRUE))), by = AB][!is.na(director)]
#, dt2 = s <- setDT(v)[, strsplit(director, "," #Jaap #Only Unique
#  , fixed=TRUE), by = .(AB, director)][,.(director = V1, AB)]
 , dplyr = {s <- v %>% #Jaap
    mutate(director = strsplit(director, ",", fixed=TRUE)) %>%
    unnest(director)}
 , tidyr = s <- separate_rows(v, director, sep = ",") #Jaap
 , stack = s <- stack(setNames(strsplit(v$director, ",", fixed=TRUE), v$AB)) #Jaap
#, dt3 = {s <- setDT(v)[, strsplit(director, ",", fixed=TRUE), #Uwe #Only Unique
#  by = .(AB, director)][, director := NULL][, setnames(.SD, "V1", "director")]}
 , dt4 = {s <- setDT(v)[, .(director = unlist(strsplit(director, "," #Uwe
   , fixed = TRUE))), by = .(AB)]}
 , dt5 = {s <- vT[, .(director = unlist(strsplit(director, "," #Uwe
   , fixed = TRUE))), by = .(AB)]}
)
Libraries:
library(microbenchmark)
library(splitstackshape) #cSplit
library(data.table) #dt, dt2, dt3, dt4
#setDTthreads(1) #Looks like it has here minor effect
library(dplyr) #dplyr
library(tidyr) #dplyr, tidyr
Data:
v0 <- data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa",
"Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu",
"Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu",
"Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock",
"Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik",
"Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson",
"Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))
Computation and timing results:
n <- 10^(0:5)
x <- lapply(n, function(n) {v <- v0[rep(seq_len(nrow(v0)), n),]
  vT <- setDT(v)
  ti <- min(100, max(3, 1e4/n))
  microbenchmark(list = met, times = ti, control=list(order="block"))})
y <- do.call(cbind, lapply(x, function(y) aggregate(time ~ expr, y, median)))
y <- cbind(y[1], y[-1][c(TRUE, FALSE)])
y[-1] <- y[-1] / 1e6 #ms
names(y)[-1] <- paste("n:", n * nrow(v0))
y #Time in ms
# expr n: 20 n: 200 n: 2000 n: 20000 n: 2e+05 n: 2e+06
#1 base 0.2989945 0.6002820 4.8751170 46.270246 455.89578 4508.1646
#2 baseLength 0.2754675 0.5278900 3.8066300 37.131410 442.96475 3066.8275
#3 baseLeFix 0.2160340 0.2424550 0.6674545 4.745179 52.11997 555.8610
#4 cSplit 1.7350820 2.5329525 11.6978975 99.060448 1053.53698 11338.9942
#5 dt 0.7777790 0.8420540 1.6112620 8.724586 114.22840 1037.9405
#6 dplyr 6.2425970 7.9942780 35.1920280 334.924354 4589.99796 38187.5967
#7 tidyr 4.0323765 4.5933730 14.7568235 119.790239 1294.26959 11764.1592
#8 stack 0.2931135 0.4672095 2.2264155 22.426373 289.44488 2145.8174
#9 dt4 0.5822910 0.6414900 1.2214470 6.816942 70.20041 787.9639
#10 dt5 0.5015235 0.5621240 1.1329110 6.625901 82.80803 636.1899
Note that methods like
(v <- rbind(v0[1:2,], v0[1,]))
# director AB
#1 Aaron Blaise,Bob Walker A
#2 Akira Kurosawa B
#3 Aaron Blaise,Bob Walker A
setDT(v)[, strsplit(director, "," #Jaap #Only Unique
, fixed=TRUE), by = .(AB, director)][,.(director = V1, AB)]
# director AB
#1: Aaron Blaise A
#2: Bob Walker A
#3: Akira Kurosawa B
return the `strsplit` result only for unique `director` values (the duplicated row is lost) and might be comparable with
tmp <- unique(v)
s <- strsplit(tmp$director, ",", fixed=TRUE)
s <- data.frame(director=unlist(s), AB=rep(tmp$AB, lengths(s)))
but, as far as I understand, that is not what was asked.
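If duplicated rows should be kept, one possible variant (not taken from the answers above, and not benchmarked here) is to group by a row number instead of by the `director` value itself. The sketch below assumes the data.table package is installed; `row_id` and `long` are hypothetical names introduced for illustration.

```r
library(data.table)

# Three rows, the first and the third being identical
v <- data.table(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa",
                             "Aaron Blaise,Bob Walker"),
                AB = c("A", "B", "A"))

# Add a helper column with the row number so that every row is its own group
v[, row_id := .I]

# Split within each row; identical rows are no longer merged into one group
long <- v[, .(director = unlist(strsplit(director, ",", fixed = TRUE))),
          by = .(row_id, AB)]

# Drop the helper column from the result
long[, row_id := NULL]

print(long)
```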