具有重复列名的整洁 data.frame

Posted

技术标签:

【中文标题】具有重复列名的整洁 data.frame【英文标题】:Tidy data.frame with repeated column names 【发布时间】:2017-06-03 08:44:07 【问题描述】:

我有一个程序可以提供这种格式的数据

toy
                file_path Condition Trial.Num A B  C  ID A B  C   ID  A B  C    ID
1     root/some.extension  Baseline         1 2 3  5 car 2 1  7 bike  4 9  0 plane
2    root/thing.extension  Baseline         2 3 6 45 car 5 4  4 bike  9 5  4 plane
3     root/else.extension  Baseline         3 4 4  6 car 7 5  4 bike 68 7 56 plane
4 root/uniquely.extension Treatment         1 5 3  7 car 1 7 37 bike  9 8  7 plane
5  root/defined.extension Treatment         2 6 7  3 car 4 6  8 bike  9 0  8 plane

我的目标是将格式整理成至少可以更容易通过具有唯一列名的 reshape 进行整理

tidy_toy
                 file_path Condition Trial.Num  A B  C    ID
1      root/some.extension  Baseline         1  2 3  5   car
2     root/thing.extension  Baseline         2  3 6 45   car
3      root/else.extension  Baseline         3  4 4  6   car
4  root/uniquely.extension Treatment         1  5 3  7   car
5   root/defined.extension Treatment         2  6 7  3   car
6      root/some.extension  Baseline         1  2 1  7  bike
7     root/thing.extension  Baseline         2  5 4  4  bike
8      root/else.extension  Baseline         3  7 5  4  bike
9  root/uniquely.extension Treatment         1  1 7 37  bike
10  root/defined.extension Treatment         2  4 6  8  bike
11     root/some.extension  Baseline         1  4 9  0 plane
12    root/thing.extension  Baseline         2  9 5  4 plane
13     root/else.extension  Baseline         3 68 7 56 plane
14 root/uniquely.extension Treatment         1  9 8  7 plane
15  root/defined.extension Treatment         2  9 0  8 plane

如果我尝试从toymelt,它不起作用,因为只有第一个 ID 列将用于id.vars(因此所有内容都会被标记为汽车)。相同的变量将被删除。

这是两个表的输出

   structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension", 
    "root/else.extension", "root/some.extension", "root/thing.extension", 
    "root/uniquely.extension"), class = "factor"), Condition = structure(c(1L, 
    1L, 1L, 2L, 2L), .Label = c("Baseline", "Treatment"), class = "factor"), 
        Trial.Num = c(1L, 2L, 3L, 1L, 2L), A = 2:6, B = c(3L, 6L, 
        4L, 3L, 7L), C = c(5L, 45L, 6L, 7L, 3L), ID = structure(c(1L, 
        1L, 1L, 1L, 1L), .Label = "car", class = "factor"), A = c(2L, 
        5L, 7L, 1L, 4L), B = c(1L, 4L, 5L, 7L, 6L), C = c(7L, 4L, 
        4L, 37L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "bike", class = "factor"), 
        A = c(4L, 9L, 68L, 9L, 9L), B = c(9L, 5L, 7L, 8L, 0L), C = c(0L, 
        4L, 56L, 7L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "plane", class = "factor")), .Names = c("file_path", 
    "Condition", "Trial.Num", "A", "B", "C", "ID", "A", "B", "C", 
    "ID", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA, 
    -5L))


structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L, 3L, 
4L, 2L, 5L, 1L, 3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension", 
"root/else.extension", "root/some.extension", "root/thing.extension", 
"root/uniquely.extension"), class = "factor"), Condition = structure(c(1L, 
1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L), .Label = c("Baseline", 
"Treatment"), class = "factor"), Trial.Num = c(1L, 2L, 3L, 1L, 
2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 1L, 2L), A = c(2L, 3L, 4L, 
5L, 6L, 2L, 5L, 7L, 1L, 4L, 4L, 9L, 68L, 9L, 9L), B = c(3L, 6L, 
4L, 3L, 7L, 1L, 4L, 5L, 7L, 6L, 9L, 5L, 7L, 8L, 0L), C = c(5L, 
45L, 6L, 7L, 3L, 7L, 4L, 4L, 37L, 8L, 0L, 4L, 56L, 7L, 8L), ID = structure(c(2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), .Label = c("bike", 
"car", "plane"), class = "factor")), .Names = c("file_path", 
"Condition", "Trial.Num", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA, 
-15L))

【问题讨论】:

【参考方案1】:

您可以使用make.unique-函数来创建唯一的列名。之后,您可以使用data.table-package 中的melt,它能够基于列名中的patterns 创建多个值列:

# make the column names unique
names(toy) <- make.unique(names(toy))
# let the 'Condition' column start with a small letter 'c'
# so it won't be detected by the patterns argument from melt
names(toy)[2] <- tolower(names(toy)[2])

# load the 'data.table' package
library(data.table)
# tidy the data into long format
tidy_toy <- melt(setDT(toy), 
                 measure.vars = patterns('^A','^B','^C','^ID'), 
                 value.name = c('A','B','C','ID'))

给出:

 > tidy_toy
                  file_path condition Trial.Num variable  A B  C    ID
 1:     root/some.extension  Baseline         1        1  2 3  5   car
 2:    root/thing.extension  Baseline         2        1  3 6 45   car
 3:     root/else.extension  Baseline         3        1  4 4  6   car
 4: root/uniquely.extension Treatment         1        1  5 3  7   car
 5:  root/defined.extension Treatment         2        1  6 7  3   car
 6:     root/some.extension  Baseline         1        2  2 1  7  bike
 7:    root/thing.extension  Baseline         2        2  5 4  4  bike
 8:     root/else.extension  Baseline         3        2  7 5  4  bike
 9: root/uniquely.extension Treatment         1        2  1 7 37  bike
10:  root/defined.extension Treatment         2        2  4 6  8  bike
11:     root/some.extension  Baseline         1        3  4 9  0 plane
12:    root/thing.extension  Baseline         2        3  9 5  4 plane
13:     root/else.extension  Baseline         3        3 68 7 56 plane
14: root/uniquely.extension Treatment         1        3  9 8  7 plane
15:  root/defined.extension Treatment         2        3  9 0  8 plane

另一种选择是使用measure.vars 的列索引列表:

tidy_toy <- melt(setDT(toy), 
                 measure.vars = list(c(4,8,12), c(5,9,13), c(6,10,14), c(7,11,15)), 
                 value.name = c('A','B','C','ID'))

那么就没有必要让列名唯一了。


一种更复杂的方法,可以创建更容易被patterns 参数区分的名称:

# select the names that are not unique
tt <- table(names(toy))
idx <- which(names(toy) %in% names(tt)[tt > 1])
nms <- names(toy)[idx]

# make them unique
names(toy)[idx] <- paste(nms, 
                         rep(seq(length(nms) / length(names(tt)[tt > 1])), 
                             each = length(names(tt)[tt > 1])), 
                         sep = '.')

# your columnnames are now unique:
> names(toy)
 [1] "file_path" "Condition" "Trial.Num" "A.1"       "B.1"       "C.1"       "ID.1"      "A.2"      
 [9] "B.2"       "C.2"       "ID.2"      "A.3"       "B.3"       "C.3"       "ID.3"     

# tidy the data into long format
tidy_toy <- melt(setDT(toy), 
                 measure.vars = patterns('^A.\\d','^B.\\d','^C.\\d','^ID.\\d'), 
                 value.name = c('A','B','C','ID'))

这将给出相同的最终结果。


正如 cmets 中提到的,janitor-package 也可以帮助解决这个问题。 clean_names() 的工作原理与 make.unique 函数类似。 See here 解释一下。

【讨论】:

这个很有创意!我将尝试使用变量的模式(我的原始数据中的列比 A、B、C 多) 我是 janitor 用于清理数据帧的软件包的忠实粉丝。对这种情况有帮助。【参考方案2】:

使用 tidyverse 我们可以做到:

library(tidyverse)
toy %>%
  repair_names(sep="_") %>%
  pivot_longer(-(1:3),names_to = c(".value","id"), names_sep="_") %>%
  select(-id)

#> # A tibble: 15 x 7
#>    file_path               Condition Trial.Num     A     B     C ID   
#>    <fct>                   <fct>         <int> <int> <int> <int> <fct>
#>  1 root/some.extension     Baseline          1     2     3     5 car  
#>  2 root/some.extension     Baseline          1     2     1     7 bike 
#>  3 root/some.extension     Baseline          1     4     9     0 plane
#>  4 root/thing.extension    Baseline          2     3     6    45 car  
#>  5 root/thing.extension    Baseline          2     5     4     4 bike 
#>  6 root/thing.extension    Baseline          2     9     5     4 plane
#>  7 root/else.extension     Baseline          3     4     4     6 car  
#>  8 root/else.extension     Baseline          3     7     5     4 bike 
#>  9 root/else.extension     Baseline          3    68     7    56 plane
#> 10 root/uniquely.extension Treatment         1     5     3     7 car  
#> 11 root/uniquely.extension Treatment         1     1     7    37 bike 
#> 12 root/uniquely.extension Treatment         1     9     8     7 plane
#> 13 root/defined.extension  Treatment         2     6     7     3 car  
#> 14 root/defined.extension  Treatment         2     4     6     8 bike 
#> 15 root/defined.extension  Treatment         2     9     0     8 plane
#> Warning message:
#> Expected 2 pieces. Missing pieces filled with `NA` in 4 rows [1, 2, 3, 4].

【讨论】:

以上是关于具有重复列名的整洁 data.frame的主要内容,如果未能解决你的问题,请参考以下文章

如何使用 R 中指定的列名创建空数据框? [复制]

如何在data.table中使用某些列名的字符向量选择列?[重复]

R使用标题向量创建空数据框[重复]

重复data.frame的行[重复]

保留重复删除的唯一值

在 data.frame 中显示重复记录并省略单个记录