How to add new rows with a different value in one column in R


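Basically, I have a vector names containing all names, and a data frame df with a BIN (0/1) field and a NAME field. For every row where BIN == 0, I want to create a duplicate row with BIN set to 1 and a different name, and append it to the bottom of df. This is what I have for choosing a new name, given the current name: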

sample(names[names!=name], 1)

But I don't know how to vectorize this and add the result to df together with the matching BIN data.

EDIT: sample data:

df = data.frame(BIN=c(1,0,1), NAME=c("alice","bob","cate"))
names = c("alice","bob","cate","dan")

I got close with something like this:

rbind(df, df %>% filter(BIN == 0) %>%
    mutate(NAME = sample(names[names!=NAME],1)))

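But I get an error: In binattr(e1, e2): length(e1) is not a multiple of length(e2).

Answer

Here is a simple approach. I think it is fairly straightforward; let me know if you have questions: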

rename = subset(df, BIN == 0)
rename$NEW_NAME = sample(names, size = nrow(rename), replace = TRUE)
while(any(rename$NAME == rename$NEW_NAME)) {
  matches = rename$NAME == rename$NEW_NAME
  rename$NEW_NAME[matches] = sample(names, size = sum(matches), replace = TRUE)
}
rename$BIN = 1
rename$NAME = rename$NEW_NAME
rename$NEW_NAME = NULL

result = rbind(df, rename)
result
#    BIN  NAME
# 1    1 alice
# 2    0   bob
# 3    1  cate
# 21   1 alice

Here is another approach, less clear but more efficient. This is the "right" way, but it takes more thought and explanation.

df$NAME = factor(df$NAME, levels = names)
rename = subset(df, BIN == 0)
n = length(names)
# we will increment each level number with a random integer from
# 1 to n - 1 (with a mod to make it cyclical)
offset = sample(1:(n - 1), size = nrow(rename), replace = TRUE)
adjusted = (as.integer(rename$NAME) + offset) %% n
# reconcile 1-indexed factor levels with 0-indexed mod operator
adjusted[adjusted == 0] = n
rename$NAME = names[adjusted]
rename$BIN = 1
result = rbind(df, rename)
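
To see why the offset trick can never return the original name, here is a small base R check. It is a sketch rather than part of the original answer: the helper shift_level is a hypothetical name that simply wraps the arithmetic from the code above, so that every level/offset pair can be enumerated.

```r
# Hypothetical helper wrapping the arithmetic from the answer above:
# shift a 1-based level index i by offset k, wrapping around n levels.
shift_level <- function(i, k, n) {
  adjusted <- (i + k) %% n
  # %% yields 0..n-1, while factor levels are 1..n, so map 0 back to n
  adjusted[adjusted == 0] <- n
  adjusted
}

n <- 4L  # number of names in the question's example
# Enumerate every level index with every allowed offset (1 to n - 1)
grid <- expand.grid(i = seq_len(n), k = seq_len(n - 1L))
grid$new <- shift_level(grid$i, grid$k, n)

# No combination maps a level onto itself
all(grid$new != grid$i)
# Each level reaches each of the other n - 1 levels exactly once
tab <- table(grid$i, grid$new)
tab
```

Because every level reaches each other level exactly once across the offsets, a uniformly sampled offset gives a uniformly sampled replacement name.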

(Or, rewritten for dplyr)

df = mutate(df, NAME = factor(NAME, levels = names))
n = length(names)
df %>% filter(BIN == 0) %>%
  mutate(
    offset = sample(1:(n - 1), size = n(), replace = TRUE),
    adjusted = (as.integer(NAME) + offset) %% n,
    adjusted = if_else(adjusted == 0, n, adjusted),
    NAME = names[adjusted],
    BIN = 1
  ) %>%
  select(-offset, -adjusted) %>%
  rbind(df, .)

Since your question is about the vectorization part, I would suggest testing answers on an example with more than one BIN 0 row. I used this:

df = data.frame(BIN=c(1,0,1,0,0,0,0,0,0), NAME=rep(c("alice","bob","cate"), 3))
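
A quick way to run that test is sketched below, using the while-loop approach in base R. The checks at the end are an addition for illustration, set.seed(42) is an arbitrary seed used only to make the run repeatable, and stringsAsFactors = FALSE is set explicitly so the comparison works on character vectors regardless of R version.

```r
set.seed(42)  # arbitrary seed, only for repeatability

names <- c("alice", "bob", "cate", "dan")
df <- data.frame(BIN = c(1, 0, 1, 0, 0, 0, 0, 0, 0),
                 NAME = rep(c("alice", "bob", "cate"), 3),
                 stringsAsFactors = FALSE)

# While-loop approach from the answer above: resample until no clashes remain
rename <- subset(df, BIN == 0)
new_name <- sample(names, size = nrow(rename), replace = TRUE)
while (any(new_name == rename$NAME)) {
  matches <- new_name == rename$NAME
  new_name[matches] <- sample(names, size = sum(matches), replace = TRUE)
}

added <- data.frame(BIN = 1, NAME = new_name, stringsAsFactors = FALSE)
result <- rbind(df, added)

# One new row per BIN == 0 row, each with a different name than its source
nrow(result) == nrow(df) + sum(df$BIN == 0)
all(added$NAME != rename$NAME)
```

Both final expressions should evaluate to TRUE for any seed, since the loop only exits once no new name equals its source name.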

And, because I was curious, here is a benchmark on 10,000 rows with 26 names. Results first, code below:

# Unit: milliseconds
#             expr        min         lq      mean     median         uq        max neval
#       while_loop  34.070438  34.327020  37.53357  35.548047  39.922918  46.206454    10
#        increment   1.397617   1.458592   1.88796   1.526512   2.123894   3.196104    10
#  increment_dplyr  24.002169  24.681960  25.50568  25.374429  25.750548  28.054954    10
#         map_char 346.531498 347.732905 361.82468 359.736403 374.648635 383.575265    10

The "clever" way (increment) is by far the fastest. My guess is that the dplyr slowdown comes from not being able to replace just the relevant elements of adjusted directly, so we have to add the overhead of if_else. On top of that, we are actually adding the adjusted and offset columns to the data frame rather than working with vectors. That is enough to make it almost as slow as the while-loop approach, which is still about 10 times faster than the map_chr approach, which has to go one row at a time.

library(tidyverse)  # for %>%, the dplyr verbs, and map_chr

nn = 10000
# Assumption: names is not defined in the benchmark snippet itself;
# letters is used here so that it matches the factor levels below.
names = letters
df = data.frame(
  BIN = sample(0:1, size = nn, replace = TRUE, prob = c(0.7, 0.3)),
  NAME = factor(sample(letters, size = nn, replace = TRUE), levels = letters)
)
get.new.name <- function(c){
  return(sample(names[names!=c],1))
}

microbenchmark::microbenchmark(
  while_loop = {
    rename = subset(df, BIN == 0)
    rename$NEW_NAME = sample(names, size = nrow(rename), replace = TRUE)
    while (any(rename$NAME == rename$NEW_NAME)) {
      matches = rename$NAME == rename$NEW_NAME
      rename$NEW_NAME[matches] = sample(names, size = sum(matches), replace = TRUE)
    }
    rename$BIN = 1
    rename$NAME = rename$NEW_NAME
    rename$NEW_NAME = NULL
    result = rbind(df, rename)
  },
  increment = {
    rename = subset(df, BIN == 0)
    n = length(names)
    # we will increment each level number with a random integer from
    # 1 to n - 1 (with a mod to make it cyclical)
    offset = sample(1:(n - 1), size = nrow(rename), replace = TRUE)
    adjusted = (as.integer(rename$NAME) + offset) %% n
    # reconcile 1-indexed factor levels with 0-indexed mod operator
    adjusted[adjusted == 0] = n
    rename$NAME = names[adjusted]
    rename$BIN = 1
  },
  increment_dplyr = {
    n = length(names)
    df %>% filter(BIN == 0) %>%
      mutate(
        offset = sample(1:(n - 1), size = n(), replace = TRUE),
        adjusted = (as.integer(NAME) + offset) %% n,
        adjusted = if_else(adjusted == 0, n, adjusted),
        NAME = names[adjusted],
        BIN = 1
      ) %>%
      select(-offset,-adjusted)
  },
  map_char = {
    new.df <- df %>% filter(BIN == 0) %>%
      mutate(NAME = map_chr(NAME, get.new.name)) %>%
      mutate(BIN = 1)
  },
  times = 10
)

Another answer

A bit odd, but I think this should do what you want:

library(tidyverse)

df <- data.frame(BIN=c(1,0,1), NAME=c("alice","bob","cate"), stringsAsFactors = FALSE)
names <- c("alice","bob","cate","dan")

df %>% 
  mutate(NAME_new = ifelse(BIN == 0, sample(names, n(), replace = TRUE), NA)) %>% 
  gather(name_type, NAME, NAME:NAME_new, na.rm = TRUE) %>% 
  mutate(BIN = ifelse(name_type == "NAME_new", 1, BIN)) %>% 
  select(-name_type)

Output:

  BIN  NAME
1   1 alice
2   0   bob
3   1  cate
4   1 alice

Another answer

Well, I wasn't planning to answer my own question, but I did find a simpler solution. I think it is better than using rowwise(), but I don't know whether it is necessarily the most efficient way.

library(tidyverse)

get.new.name <- function(c){
  return(sample(names[names!=c],1))
}

new.df <- rbind(df, df %>% filter(BIN == 0) %>%
    mutate(NAME = map_chr(NAME, get.new.name)) %>%
    mutate(BIN = 1))

Using map_chr rather than plain map turned out to be important, because the latter would return an odd list of lists.
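
The list-versus-vector difference can be seen without purrr. The sketch below uses base R's lapply and vapply as stand-ins, on the assumption that map and map_chr differ in the same way (a list versus a character vector). The vector current is a made-up input for illustration, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only for repeatability
names <- c("alice", "bob", "cate", "dan")
current <- c("bob", "alice", "cate")  # hypothetical input names

get.new.name <- function(c) {
  sample(names[names != c], 1)
}

# lapply returns a list (one element per input), like map()
as_list <- lapply(current, get.new.name)
# vapply with character(1) returns a character vector, like map_chr()
as_chr <- vapply(current, get.new.name, character(1))

is.list(as_list)
is.character(as_chr)
all(as_chr != current)
```

A character vector can be assigned straight to the NAME column, whereas a list would produce a list column.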
