Create a variable to count the number of unique values in each row for a subset of columns


【Title】: Create a variable to count the number of unique values in each row for a subset of columns
【Posted】: 2022-01-18 22:52:44
【Question】:

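I want to create a variable that counts the number of unique values in each row for a subset of columns (i.e. baseline, wave1, wave2, wave3). So far I have the following. I have included an example dataset with a variable "example" to show what I am after. I have also included the variable "change", which shows the variable created by the code below.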

# Create example data
data <- structure(list(age = c("18", "19", NA, "40", "21", "33", "32", 
"34", "43", "22"), baseline = c("1", "1", NA, "4", "1", "3", 
"2", "4", "3", "2"), wave1 = c("1", "1", "2", "4", "4", "3", 
"2", "4", "3", "2"), wave2 = c("1", "1", "4", "4", NA, "3", 
"2", "4", "3", "2"), wave3 = c("1", "2", NA, "4", "4", "3", 
"2", "4", "3", "4"), example = c("1", "2", "2", "1", "2", "1", 
"1", "1", "1", "2"), change = c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L)), row.names = c(NA, -10L), groups = structure(list(.rows = structure(list(
    1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame")), class = c("rowwise_df", "tbl_df", "tbl", 
"data.frame"))

library(dplyr)
# Create a var for change at any point (ignoring NAs)
data <- data %>% 
  rowwise() %>% #perform operation by row
  mutate(change = length(unique(na.omit(baseline,wave1,wave2,wave3))))

【Comments】:

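I don't think there is a more efficient procedure. @akrun just suggested using the n_distinct function to replace your length(unique(.)), together with c_across; while they improve readability (and are dplyr-canonical), I don't think you'll find anything much better. data[,"change"] <- apply(data[,2:5],1,function(x) length(na.omit(unique(x))))

【Answer 1】: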

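We can use n_distinct, with its na.rm argument to remove NA elements (although in the OP's data it is the string "NA", which type.convert turns into a real NA).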

library(dplyr)
data %>%
   type.convert(as.is = TRUE) %>%
   rowwise %>% 
   mutate(change = n_distinct(c_across(baseline:wave3), na.rm = TRUE)) %>%
   ungroup

Output:

# A tibble: 10 × 7
     age baseline wave1 wave2 wave3 example change
   <int>    <int> <int> <int> <int>   <int>  <int>
 1    18        1     1     1     1       1      1
 2    19        1     1     1     2       2      2
 3    NA       NA     2     4    NA       2      2
 4    40        4     4     4     4       1      1
 5    21        1     4    NA     4       2      2
 6    33        3     3     3     3       1      1
 7    32        2     2     2     2       1      1
 8    34        4     4     4     4       1      1
 9    43        3     3     3     3       1      1
10    22        2     2     2     4       2      2
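
For context on why the attempt in the question does not combine the four columns: na.omit() processes only its first argument, and anything else passed positionally falls into `...` and is ignored. The short base R sketch below uses made-up values (not the OP's data) to illustrate the difference; c_across in the code above does the combining step inside rowwise().

```r
# na.omit() is defined as na.omit(object, ...): only `object` is processed,
# so additional vectors passed positionally are silently ignored.
separate <- na.omit(c(1, NA), c(2, 3), c(4, NA))
length(unique(separate))   # 1: only the first vector was considered

# Combining the values into one vector first gives the intended count.
combined <- na.omit(c(c(1, NA), c(2, 3), c(4, NA)))
length(unique(combined))   # 4: the distinct non-NA values are 1, 2, 3, 4
```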

Or, as a faster option, use dapply from collapse:

library(collapse)
data$change <- dapply(slt(ungroup(data), baseline:wave3), 
      MARGIN = 1, FUN = fndistinct)
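
A dependency-free way to cross-check the result is to recompute the counts in base R and compare them with the expected "example" column. The sketch below copies the four wave columns and the expected values from the question's data as numeric vectors; the helper name count_distinct and the vector wave_cols are illustrative and not part of the original post.

```r
# The four wave columns and the expected counts, copied from the question's data
df <- data.frame(
  baseline = c(1, 1, NA, 4, 1, 3, 2, 4, 3, 2),
  wave1    = c(1, 1, 2, 4, 4, 3, 2, 4, 3, 2),
  wave2    = c(1, 1, 4, 4, NA, 3, 2, 4, 3, 2),
  wave3    = c(1, 2, NA, 4, 4, 3, 2, 4, 3, 4),
  example  = c(1, 2, 2, 1, 2, 1, 1, 1, 1, 2)
)

wave_cols <- c("baseline", "wave1", "wave2", "wave3")

# Hypothetical helper: number of distinct non-NA values in one vector
count_distinct <- function(x) length(unique(na.omit(x)))

# apply() with MARGIN = 1 hands each row to the function as a single vector
df$change <- unname(apply(df[, wave_cols], 1, count_distinct))

# TRUE when every computed count matches the expected value
all(df$change == df$example)
```

Note that a row in which all four columns are NA would get a count of 0 with this approach.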
