Tidying dataset by gathering multiple columns? [duplicate]

Asked: 2018-04-26 14:38:20

Question:

I would like to tidy a dataset by reshaping data like this:

age gender  education       previous_comp_exp   tutorial_time   qID.1    time_taken.1   qID.2    time_taken.2   
18  Male    Undergraduate   casual gamer        62.17926        sor9     39.61206       sor8     19.4892
24  Male    Undergraduate   casual gamer        85.01288        sor9     50.92343       sor8     16.15616

into this:

age gender  education       previous_comp_exp   tutorial_time   qID      time_taken 
18  Male    Undergraduate   casual gamer        62.17926        sor9     39.61206       
18  Male    Undergraduate   casual gamer        62.17926        sor8     19.4892
24  Male    Undergraduate   casual gamer        85.01288        sor9     50.92343       
24  Male    Undergraduate   casual gamer        85.01288        sor8     16.15616

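I have tried gather(), but I can only get it to work with one column at a time, and I keep getting this warning:

Warning message: attributes are not identical across measure variables; they will be dropped

Any ideas?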

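Comments:

That's not an error, it's a warning letting you know that the two columns you are stacking have different attributes (perhaps they are both factors but with different levels), so those attributes are dropped in the output. This and this may help with wide-to-long reshaping, since you have pairs of columns, each of which needs to be stacked.

Answer 1: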

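With melt from data.table (see ?patterns):

library(data.table)

# Each pattern selects one group of columns; each group is stacked
# into its own value column (qID and time_taken).
melt(setDT(df), measure.vars = patterns("^qID", "^time_taken"),
     value.name = c("qID", "time_taken"))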

Result:

   age gender     education previous_comp_exp tutorial_time variable  qID time_taken
1:  18   Male Undergraduate      casual_gamer      62.17926        1 sor9   39.61206
2:  24   Male Undergraduate      casual_gamer      85.01288        1 sor9   50.92343
3:  18   Male Undergraduate      casual_gamer      62.17926        2 sor8   19.48920
4:  24   Male Undergraduate      casual_gamer      85.01288        2 sor8   16.15616

tidyr:

library(dplyr)
library(tidyr)

df %>%
  gather(variable, value, qID.1:time_taken.2) %>%
  mutate(variable = sub("\\.\\d$", "", variable)) %>%
  group_by(variable) %>%
  mutate(ID = row_number()) %>%
  spread(variable, value, convert = TRUE) %>%
  select(-ID)

Result:

# A tibble: 4 x 7
    age gender     education previous_comp_exp tutorial_time   qID time_taken
  <int> <fctr>        <fctr>            <fctr>         <dbl> <chr>      <dbl>
1    18   Male Undergraduate      casual_gamer      62.17926  sor9   39.61206
2    18   Male Undergraduate      casual_gamer      62.17926  sor8   19.48920
3    24   Male Undergraduate      casual_gamer      85.01288  sor9   50.92343
4    24   Male Undergraduate      casual_gamer      85.01288  sor8   16.15616

Note:

For the tidyr approach, convert = TRUE is used to convert time_taken back to numeric, because it is coerced to character when it is gathered together with the qID columns.
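The coercion to character can be avoided altogether in newer versions of tidyr (1.0.0 and later), where pivot_longer() supersedes gather()/spread(). The following is a minimal sketch, not part of the original answer: it assumes a cut-down version of the data with character qID columns, and the helper column name `set` is a made-up name for the numeric suffix.

```r
library(tidyr)

# Cut-down version of the example data (assumption: qID stored as character)
df <- data.frame(
  age = c(18L, 24L),
  tutorial_time = c(62.17926, 85.01288),
  qID.1 = c("sor9", "sor9"),
  time_taken.1 = c(39.61206, 50.92343),
  qID.2 = c("sor8", "sor8"),
  time_taken.2 = c(19.4892, 16.15616),
  stringsAsFactors = FALSE
)

# ".value" tells pivot_longer to use the part of the name before the dot
# (qID / time_taken) as the output column name; the part after the dot
# goes into the hypothetical helper column "set".
long <- pivot_longer(
  df,
  cols = matches("^(qID|time_taken)\\.\\d+$"),
  names_to = c(".value", "set"),
  names_sep = "\\."
)

# Drop the helper column once it is no longer needed
long$set <- NULL
long
```

Because each output column is built only from columns of the same kind, time_taken stays numeric and no convert = TRUE step is needed.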

Data:

df = structure(list(age = c(18L, 24L), gender = structure(c(1L, 1L
), .Label = "Male", class = "factor"), education = structure(c(1L, 
1L), .Label = "Undergraduate", class = "factor"), previous_comp_exp = structure(c(1L, 
1L), .Label = "casual_gamer", class = "factor"), tutorial_time = c(62.17926, 
85.01288), qID.1 = structure(c(1L, 1L), .Label = "sor9", class = "factor"), 
    time_taken.1 = c(39.61206, 50.92343), qID.2 = structure(c(1L, 
    1L), .Label = "sor8", class = "factor"), time_taken.2 = c(19.4892, 
    16.15616)), .Names = c("age", "gender", "education", "previous_comp_exp", 
"tutorial_time", "qID.1", "time_taken.1", "qID.2", "time_taken.2"
), class = "data.frame", row.names = c(NA, -2L))

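Discussion:

Nicely done!

@user, in the tidyr approach the age variable disappears. Any idea why?

@stenfeio Thanks for noticing! I had actually read the data in incorrectly. Because I was using read.table, "casual gamer" was treated as separate columns and the first column was treated as row names. It didn't throw an error because the number of columns happened to match. See my edit.

Didn't know about the patterns function, nice one.

@user I had a go at the question before seeing your solution: time

You can use convert = TRUE in spread instead of setting the type manually later in mutate.

Answer 2: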

In base R, you can use the powerful reshape function to convert the data from wide to long format in a single statement (dx is the data frame):

   reshape(dx,direction="long",
        varying=list(grep("qID",colnames(dx)),
                     grep("time_taken",colnames(dx))),
        v.names=c("qID","time_taken"))

     age gender     education previous_comp_exp tutorial_time time  qID time_taken id
1.1  18   Male Undergraduate      casual_gamer      62.17926    1 sor9   39.61206  1
2.1  24   Male Undergraduate      casual_gamer      85.01288    1 sor9   50.92343  2
1.2  18   Male Undergraduate      casual_gamer      62.17926    2 sor8   19.48920  1
2.2  24   Male Undergraduate      casual_gamer      85.01288    2 sor8   16.15616  2
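The reshape output above carries extra `time` and `id` columns and compound row names, and its rows are grouped by time rather than by person. As a rough sketch (not part of the original answer, and using a cut-down `dx` with character qID columns as an assumption), the result can be brought into the layout asked for in the question like this:

```r
# Cut-down version of the example data (assumption: qID stored as character)
dx <- data.frame(
  age = c(18L, 24L),
  qID.1 = c("sor9", "sor9"),
  time_taken.1 = c(39.61206, 50.92343),
  qID.2 = c("sor8", "sor8"),
  time_taken.2 = c(19.4892, 16.15616),
  stringsAsFactors = FALSE
)

long <- reshape(dx, direction = "long",
                varying = list(grep("qID", colnames(dx)),
                               grep("time_taken", colnames(dx))),
                v.names = c("qID", "time_taken"))

# Put each person's rows together, then drop the bookkeeping columns
long <- long[order(long$id, long$time), ]
long <- long[, setdiff(names(long), c("time", "id"))]
rownames(long) <- NULL
long
```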

Discussion:

I think you also read the data in incorrectly. You can use the new dput in my answer.

@user Good catch. Fixed now.
