Tidying dataset by gathering multiple columns? [duplicate]
Posted: 2018-04-26 14:38:20
Question:
I would like to tidy a dataset by reshaping data like this:
age gender education previous_comp_exp tutorial_time qID.1 time_taken.1 qID.2 time_taken.2
18 Male Undergraduate casual gamer 62.17926 sor9 39.61206 sor8 19.4892
24 Male Undergraduate casual gamer 85.01288 sor9 50.92343 sor8 16.15616
into this:
age gender education previous_comp_exp tutorial_time qID time_taken
18 Male Undergraduate casual gamer 62.17926 sor9 39.61206
18 Male Undergraduate casual gamer 62.17926 sor8 19.4892
24 Male Undergraduate casual gamer 85.01288 sor9 50.92343
24 Male Undergraduate casual gamer 85.01288 sor8 16.15616
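I have tried gather(), but I can only get it to work with one column, and I keep getting this warning:
Warning message: attributes are not identical across measure variables; they will be dropped
Any ideas?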
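Comments:
That is not an error but a warning. It tells you that the two columns you are stacking have different attributes (perhaps both are factors, but with different levels), so those attributes are dropped in the output. Existing questions on reshaping from wide to long may help, since you have paired columns that each need to be stacked.
Answer 1:
With melt from data.table (see ?patterns):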
library(data.table)
melt(setDT(df), measure = patterns("^qID", "^time_taken"),
value.name = c("qID", "time_taken"))
Result:
age gender education previous_comp_exp tutorial_time variable qID time_taken
1: 18 Male Undergraduate casual_gamer 62.17926 1 sor9 39.61206
2: 24 Male Undergraduate casual_gamer 85.01288 1 sor9 50.92343
3: 18 Male Undergraduate casual_gamer 62.17926 2 sor8 19.48920
4: 24 Male Undergraduate casual_gamer 85.01288 2 sor8 16.15616
Or with tidyr:
library(dplyr)
library(tidyr)
df %>%
  gather(variable, value, qID.1:time_taken.2) %>%
  mutate(variable = sub("\\.\\d$", "", variable)) %>%
  group_by(variable) %>%
  mutate(ID = row_number()) %>%
  spread(variable, value, convert = TRUE) %>%
  select(-ID)
Result:
# A tibble: 4 x 7
age gender education previous_comp_exp tutorial_time qID time_taken
<int> <fctr> <fctr> <fctr> <dbl> <chr> <dbl>
1 18 Male Undergraduate casual_gamer 62.17926 sor9 39.61206
2 18 Male Undergraduate casual_gamer 62.17926 sor8 19.48920
3 24 Male Undergraduate casual_gamer 85.01288 sor9 50.92343
4 24 Male Undergraduate casual_gamer 85.01288 sor8 16.15616
Note:
For the tidyr approach, convert = TRUE is used to convert time_taken back to numeric, because it was coerced to character when it was gathered together with the qID columns.
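The coercion described in the note can be reproduced in base R without tidyr. This is a minimal sketch with made-up vector names (`qID`, `time_taken`, `value`, `restored`); it only illustrates why a single stacked value column ends up as character and how `type.convert()`, which the tidyr documentation says `convert = TRUE` relies on, recovers the numeric type.

```r
# Two columns of different types, as in the question's data
qID <- factor(c("sor9", "sor9"))
time_taken <- c(39.61206, 50.92343)

# Stacking them into one vector forces a common type: character
value <- c(as.character(qID), time_taken)
class(value)

# Re-parsing only the time_taken part restores a numeric vector
restored <- type.convert(value[3:4], as.is = TRUE)
class(restored)
```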
Data:
df = structure(list(age = c(18L, 24L), gender = structure(c(1L, 1L
), .Label = "Male", class = "factor"), education = structure(c(1L,
1L), .Label = "Undergraduate", class = "factor"), previous_comp_exp = structure(c(1L,
1L), .Label = "casual_gamer", class = "factor"), tutorial_time = c(62.17926,
85.01288), qID.1 = structure(c(1L, 1L), .Label = "sor9", class = "factor"),
time_taken.1 = c(39.61206, 50.92343), qID.2 = structure(c(1L,
1L), .Label = "sor8", class = "factor"), time_taken.2 = c(19.4892,
16.15616)), .Names = c("age", "gender", "education", "previous_comp_exp",
"tutorial_time", "qID.1", "time_taken.1", "qID.2", "time_taken.2"
), class = "data.frame", row.names = c(NA, -2L))
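As an aside that is not part of the original answer: tidyr 1.0.0 and later provide `pivot_longer()`, which can stack paired columns in one step through the special `".value"` name. The sketch below assumes such a tidyr version is installed and uses a reduced, hypothetical data frame (`df_small`) with character rather than factor `qID` columns; the `set` column name is also just an illustration.

```r
library(tidyr)

# Hypothetical, reduced stand-in for the answer's data
df_small <- data.frame(
  age = c(18L, 24L),
  qID.1 = c("sor9", "sor9"),
  time_taken.1 = c(39.61206, 50.92343),
  qID.2 = c("sor8", "sor8"),
  time_taken.2 = c(19.4892, 16.15616),
  stringsAsFactors = FALSE
)

# Split each column name at the dot: the part before it becomes the
# output column (".value"), the part after it goes into "set"
long <- pivot_longer(
  df_small,
  cols = -age,
  names_to = c(".value", "set"),
  names_sep = "\\."
)
long
```

Because each value column keeps its own type, no `convert = TRUE` step should be needed with this approach.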
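Comments:
Nicely done! @user, in the tidyr approach the age variable disappeared. Any idea why?
@stenfeio Thanks for noticing! I had actually read the data in incorrectly. Because I was using read.table, casual and gamer were treated as separate columns and the first column was treated as row names. No error was thrown because the number of columns happened to match. See my edit.
I didn't know about the patterns function; nice one-liner.
@user I tried this before seeing your solution and ran into one problem: time
You can use convert = TRUE in spread instead of setting the type manually in mutate afterwards.
Answer 2: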
In base R, you can use the powerful reshape function to convert the data from wide to long format in a single statement:
reshape(dx,direction="long",
varying=list(grep("qID",colnames(dx)),
grep("time_taken",colnames(dx))),
v.names=c("qID","time_taken"))
age gender education previous_comp_exp tutorial_time time qID time_taken id
1.1 18 Male Undergraduate casual_gamer 62.17926 1 sor9 39.61206 1
2.1 24 Male Undergraduate casual_gamer 85.01288 1 sor9 50.92343 2
1.2 18 Male Undergraduate casual_gamer 62.17926 2 sor8 19.48920 1
2.2 24 Male Undergraduate casual_gamer 85.01288 2 sor8 16.15616 2
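The output above still carries the helper columns time and id and is ordered by time rather than by person. The following is a minimal sketch of one way to get closer to the layout requested in the question; it uses a reduced, hypothetical data frame (`dx_small`) and passes column names instead of `grep()` indices to `varying`, which is a choice made here for readability and not something the answer itself does.

```r
# Hypothetical, reduced stand-in for the answer's data
dx_small <- data.frame(
  age = c(18L, 24L),
  qID.1 = c("sor9", "sor9"),
  time_taken.1 = c(39.61206, 50.92343),
  qID.2 = c("sor8", "sor8"),
  time_taken.2 = c(19.4892, 16.15616),
  stringsAsFactors = FALSE
)

# Each list element names one set of columns to be stacked together
long <- reshape(
  dx_small,
  direction = "long",
  varying = list(c("qID.1", "qID.2"),
                 c("time_taken.1", "time_taken.2")),
  v.names = c("qID", "time_taken")
)

# Order by person, drop the helper columns, and reset the row names
long <- long[order(long$id), c("age", "qID", "time_taken")]
rownames(long) <- NULL
long
```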
Comments:
I think you also read the data in incorrectly. You can use the new dput from my answer.
@user Good catch. Fixed now.