How to add new rows with a different value in one column in R
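Basically, I have a vector names containing all names, and a data frame df with a BIN (0/1) field and a NAME field. For each row where BIN == 0, I want to create a duplicate row with BIN set to 1 and a different name, and append it to the bottom of df. This is what I have for choosing a new name, given the current name: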
sample(names[names!=name], 1)
But I don't know how to vectorize this and append the result to df with the same data apart from BIN.
Edit: sample data:
df = data.frame(BIN=c(1,0,1), NAME=c("alice","bob","cate"))
names = c("alice","bob","cate","dan")
I'm close with something like this:
rbind(df, df %>% filter(BIN == 0) %>%
mutate(NAME = sample(names[names!=NAME],1)))
But I get an error: In binattr(e1, e2): length(e1) is not a multiple of length(e2).
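To see why that attempt breaks down once more than one row has BIN == 0, here is a minimal base R sketch. Inside mutate(), NAME is the whole column, so names != NAME compares two vectors of different lengths, and sample(..., 1) returns a single value that is recycled to every row. The vector current is made up for illustration.

```r
names <- c("alice", "bob", "cate", "dan")
# Hypothetical NAME column of three BIN == 0 rows
current <- c("bob", "cate", "alice")

# Element-wise comparison recycles the shorter vector and warns,
# because 4 is not a multiple of 3
mask <- suppressWarnings(names != current)

# Only one name is drawn, so every row would receive the same value,
# and nothing guarantees it differs from each row's own name
set.seed(1)
one_draw <- sample(names[mask], 1)
length(one_draw)
```

The fix in each answer below is the same in spirit: draw one new name per row instead of one name for the whole column.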
Here is a simple method. I think it's pretty straightforward, but let me know if you have questions:
rename = subset(df, BIN == 0)
rename$NEW_NAME = sample(names, size = nrow(rename), replace = TRUE)
while(any(rename$NAME == rename$NEW_NAME)) {
matches = rename$NAME == rename$NEW_NAME
rename$NEW_NAME[matches] = sample(names, size = sum(matches), replace = TRUE)
}
rename$BIN = 1
rename$NAME = rename$NEW_NAME
rename$NEW_NAME = NULL
result = rbind(df, rename)
result
# BIN NAME
# 1 1 alice
# 2 0 bob
# 3 1 cate
# 21 1 alice
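The loop above can be wrapped in a function and checked on a case with several BIN == 0 rows. This is only a sketch: add_renamed_rows is a hypothetical helper name, and the test data is made up.

```r
# Hypothetical helper wrapping the resample-until-different loop
add_renamed_rows <- function(df, names) {
  rename <- subset(df, BIN == 0)
  new_name <- sample(names, size = nrow(rename), replace = TRUE)
  # Redraw only the positions that collided with the original name
  while (any(rename$NAME == new_name)) {
    matches <- rename$NAME == new_name
    new_name[matches] <- sample(names, size = sum(matches), replace = TRUE)
  }
  rename$BIN <- 1
  rename$NAME <- new_name
  rbind(df, rename)
}

names <- c("alice", "bob", "cate", "dan")
df <- data.frame(BIN = c(1, 0, 1, 0, 0),
                 NAME = c("alice", "bob", "cate", "bob", "alice"),
                 stringsAsFactors = FALSE)

set.seed(42)
result <- add_renamed_rows(df, names)

# The appended rows line up with the BIN == 0 rows of df, in order
added <- tail(result, sum(df$BIN == 0))
all(added$NAME != df$NAME[df$BIN == 0])
```

The number of loop iterations is random, but each redraw only touches the rows that still collide, so it tends to finish quickly when there are many names to choose from.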
Here is another approach, less clear but more efficient. This is the "right" way to do it, but it takes a bit more thought and explanation.
df$NAME = factor(df$NAME, levels = names)
rename = subset(df, BIN == 0)
n = length(names)
# we will increment each level number with a random integer from
# 1 to n - 1 (with a mod to make it cyclical)
offset = sample(1:(n - 1), size = nrow(rename), replace = TRUE)
adjusted = (as.integer(rename$NAME) + offset) %% n
# reconcile 1-indexed factor levels with 0-indexed mod operator
adjusted[adjusted == 0] = n
rename$NAME = names[adjusted]
rename$BIN = 1
result = rbind(df, rename)
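Why the offset trick can never return the original name: adding an offset between 1 and n - 1 and wrapping modulo n always lands on a different index. Below is a small exhaustive check for n = 4, written as a sketch (the helper name shift_index is an assumption for illustration):

```r
# Hypothetical helper: shift a 1-based index by `offset`, wrapping around n
shift_index <- function(i, offset, n) {
  adjusted <- (i + offset) %% n
  # %% yields 0..n-1, but R indices run 1..n, so map 0 back to n
  adjusted[adjusted == 0] <- n
  adjusted
}

n <- 4
# Every combination of starting index and allowed offset
grid <- expand.grid(i = 1:n, offset = 1:(n - 1))
grid$new <- shift_index(grid$i, grid$offset, n)

any(grid$new == grid$i)   # FALSE: no index maps to itself
all(grid$new %in% 1:n)    # TRUE: every result is a valid index
```

Because each starting index reaches each of the other n - 1 indices through exactly one offset, a uniform draw of the offset gives a uniform choice among the other names.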
(Or, rewritten for dplyr:)
library(dplyr)
df = mutate(df, NAME = factor(NAME, levels = names))
n = length(names)
df %>% filter(BIN == 0) %>%
mutate(
offset = sample(1:(n - 1), size = n(), replace = TRUE),
adjusted = (as.integer(NAME) + offset) %% n,
adjusted = if_else(adjusted == 0, n, adjusted),
NAME = names[adjusted],
BIN = 1
) %>%
select(-offset, -adjusted) %>%
rbind(df, .)
Since your question is about the vectorization part, I suggest testing answers on a sample case with more than one BIN == 0 row. I used this:
df = data.frame(BIN=c(1,0,1,0,0,0,0,0,0), NAME=rep(c("alice","bob","cate"), 3))
And, because I was curious, here is a benchmark on 10,000 rows with 26 names. Results first, code below:
# Unit: milliseconds
# expr min lq mean median uq max neval
# while_loop 34.070438 34.327020 37.53357 35.548047 39.922918 46.206454 10
# increment 1.397617 1.458592 1.88796 1.526512 2.123894 3.196104 10
# increment_dplyr 24.002169 24.681960 25.50568 25.374429 25.750548 28.054954 10
# map_char 346.531498 347.732905 361.82468 359.736403 374.648635 383.575265 10
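The "clever" way is by far the fastest. My guess is that the dplyr version is slower because we can't directly replace just the relevant elements of adjusted and instead pay the overhead of if_else. We are also adding offset and adjusted as columns of the data frame rather than working with plain vectors. That is enough to make it almost as slow as the while loop approach, which is still about 10 times faster than the map_chr approach.
library(tidyverse)  # for %>%, filter, mutate, if_else and map_chr
nn = 10000
names = letters  # assumption: the 26 names are the lowercase letters, matching the factor levels below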
df = data.frame(
BIN = sample(0:1, size = nn, replace = TRUE, prob = c(0.7, 0.3)),
NAME = factor(sample(letters, size = nn, replace = TRUE), levels = letters)
)
get.new.name <- function(c){
return(sample(names[names!=c],1))
}
microbenchmark::microbenchmark(
while_loop = {
rename = subset(df, BIN == 0)
rename$NEW_NAME = sample(names, size = nrow(rename), replace = TRUE)
while (any(rename$NAME == rename$NEW_NAME)) {
matches = rename$NAME == rename$NEW_NAME
rename$NEW_NAME[matches] = sample(names, size = sum(matches), replace = TRUE)
}
rename$BIN = 1
rename$NAME = rename$NEW_NAME
rename$NEW_NAME = NULL
result = rbind(df, rename)
},
increment = {
rename = subset(df, BIN == 0)
n = length(names)
# we will increment each level number with a random integer from
# 1 to n - 1 (with a mod to make it cyclical)
offset = sample(1:(n - 1), size = nrow(rename), replace = TRUE)
adjusted = (as.integer(rename$NAME) + offset) %% n
# reconcile 1-indexed factor levels with 0-indexed mod operator
adjusted[adjusted == 0] = n
rename$NAME = names[adjusted]
rename$BIN = 1
},
increment_dplyr = {
n = length(names)
df %>% filter(BIN == 0) %>%
mutate(
offset = sample(1:(n - 1), size = n(), replace = TRUE),
adjusted = (as.integer(NAME) + offset) %% n,
adjusted = if_else(adjusted == 0, n, adjusted),
NAME = names[adjusted],
BIN = 1
) %>%
select(-offset,-adjusted)
},
map_char = {
new.df <- df %>% filter(BIN == 0) %>%
mutate(NAME = map_chr(NAME, get.new.name)) %>%
mutate(BIN = 1)
},
times = 10
)
The map_chr version has to go one row at a time.
A bit odd, but I think this should be what you want:
library(tidyverse)
df <- data.frame(BIN=c(1,0,1), NAME=c("alice","bob","cate"), stringsAsFactors = FALSE)
names <- c("alice","bob","cate","dan")
df %>%
# note: this draw does not exclude the row's current name
mutate(NAME_new = ifelse(BIN == 0, sample(names, n(), replace = TRUE), NA)) %>%
gather(name_type, NAME, NAME:NAME_new, na.rm = TRUE) %>%
mutate(BIN = ifelse(name_type == "NAME_new", 1, BIN)) %>%
select(-name_type)
Output:
BIN NAME
1 1 alice
2 0 bob
3 1 cate
4 1 alice
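Well, I didn't intend to answer my own question, but I did find a simpler solution. I think it's better than using rowwise(), but I don't know whether it's necessarily the most efficient way.
library(tidyverse)
get.new.name <- function(c){
return(sample(names[names!=c],1))
}
new.df <- rbind(df, df %>% filter(BIN == 0) %>%
mutate(NAME = map_chr(NAME, get.new.name)) %>%
mutate(BIN = 1))
Using map_chr rather than just map turned out to be important, since the latter would return a weird list of lists.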
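The same distinction exists in base R, which may help illustrate the point without purrr: lapply(), like map(), returns a list, while vapply(), like map_chr(), returns a character vector that can be assigned to a column. A sketch using the question's sample names (the vector current is made up):

```r
names <- c("alice", "bob", "cate", "dan")
get.new.name <- function(c) {
  sample(names[names != c], 1)
}

# Hypothetical NAME values of the BIN == 0 rows
current <- c("bob", "bob", "alice")
set.seed(1)
as_list <- lapply(current, get.new.name)                # a list of length 3
as_chr  <- vapply(current, get.new.name, character(1),  # a character vector
                  USE.NAMES = FALSE)

is.list(as_list)
is.character(as_chr)
all(as_chr != current)
```

Like map_chr, vapply calls the function once per element, so it is convenient rather than fast.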