Convert to long and make frequency table when column names are strings, R

Posted: 2022-01-20 13:54:29

[Question]:
ID    What color is this item?   What color is this item?_2    What is the shape of this item? What is the shape of this item?_2          size
55    red                         blue                          circle                           triangle                                 small                                             
83    blue                        yellow                        circle                           NA                                       large
78    red                         yellow                        square                           circle                                   large
43    green                       NA                            square                           circle                                   small
29    yellow                      green                         circle                           triangle                                 medium             

I would like a frequency table like this:

Variable      Level        Freq        Percent 
 
color         blue          2           22.22
              red           2           22.22
              yellow        3           33.33
              green         2           22.22
              total         9           100.00

shape         circle        5           50.0       
              triangle      3           30.0
              square        2           20.0
              total         10          100.0

size          small         2           33.3
              medium        2           33.3
              large         2           33.3
              total         6           100.0

But when I try to convert to long format, I can't match the column names because they are long strings. From a previous question, I know I can do this:


options(digits = 3)
df1 <- df2 %>% 
  pivot_longer(
    -ID,
    names_to = "Question",
    values_to = "Response"
  ) %>% 
  mutate(Question = str_extract(Question, '')) %>% 
  group_by(Question, Response) %>% 
  count(Response, name = "Freq") %>% 
  na.omit() %>% 
  group_by(Question) %>% 
  mutate(Percent = Freq/sum(Freq)*100) %>% 
  group_split() %>% 
  adorn_totals() %>% 
  bind_rows() %>% 
  mutate(Response = ifelse(Response == last(Response), last(Question), Response)) %>% 
  mutate(Question = ifelse(duplicated(Question) |
                             Question == "Total", NA, Question))

But I'm having trouble finding the right regular expression to put in this line:

 mutate(Question = str_extract(Question, '')) %>% 

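If anyone knows another way to do this, that would be great!

[Comments]:

It's unclear what you want to extract. "But I'm having trouble finding the right regular expression to put in the line:". Do you want mutate(Question = str_extract(Question, "color|shape|size"))?

Would you mind sharing your data with dput? Or at least putting quotes around the column names? The spaces make importing annoying.

[Answer 1]: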

If the intent is to extract a custom list of words, we can paste the elements together into a single string and use it as the regex in str_extract.

library(dplyr)
library(tidyr)
library(janitor)
library(stringr)
library(flextable)

pat_words <- c("color", "shape", "size")
out <-  df %>% 
  pivot_longer(
    -ID,
    names_to = "Question",
    values_to = "Response"
  ) %>% 
  mutate(Question = str_extract(Question, str_c(pat_words, collapse="|"))) %>% 
  group_by(Question, Response) %>% 
  count(Response, name = "Freq") %>% 
  na.omit() %>% 
  group_by(Question) %>% 
  mutate(Percent = round(Freq/sum(Freq)*100, 2)) %>% 
  group_split() %>% 
  adorn_totals() %>% 
  bind_rows() %>% 
  mutate(Response = ifelse(Response == last(Response), last(Question), Response)) %>% 
  mutate(Question = ifelse(duplicated(Question) |
                             Question == "Total", NA, Question)) %>% 
  as.data.frame
flextable(out)

-output (rendered flextable; the image is not preserved here)

data

df <- structure(list(ID = c(55L, 83L, 78L, 43L, 29L), `What color is this item?` = c("red", 
"blue", "red", "green", "yellow"), `What color is this item?_2` = c("blue", 
"yellow", "yellow", NA, "green"), `What is the shape of this item?` = c("circle", 
"circle", "square", "square", "circle"), `What is the shape of this item?_2` = c("triangle", 
NA, "circle", "circle", "triangle"), size = c("small", "large", 
"large", "small", "medium")), class = "data.frame", row.names = c(NA, 
-5L))
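
To check what the alternation pattern extracts before running the whole pipeline, it can be applied to the column names alone. The following is a minimal sketch in base R; regexpr() and regmatches() stand in for stringr::str_extract() (a substitution made here so that no package is needed), and the names cols, pat and key are illustrative only.

```r
# Column names from the example data (ID excluded)
cols <- c("What color is this item?", "What color is this item?_2",
          "What is the shape of this item?", "What is the shape of this item?_2",
          "size")

# Build one alternation pattern from the keywords, as in the answer above
pat_words <- c("color", "shape", "size")
pat <- paste(pat_words, collapse = "|")

# Extract the first match per name; names without a match stay NA,
# which mirrors what str_extract() returns for a non-match
m <- regexpr(pat, cols)
key <- rep(NA_character_, length(cols))
key[m > 0] <- regmatches(cols, m)
key
```

Each question column and its "_2" variant map to the same keyword, which is what allows the later group_by(Question, Response) to pool them.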

[Discussion]:

Do you know how I can control the order of the table? Like if I made levels

@alex. The arrange can be done before changing the duplicated elements to NA, i.e. %>% arrange(factor(Question, levels = levels)) %>% mutate(Question = ifelse(duplicated(Question) | Question == "Total", NA, Question)) %>%

[Answer 2]:

First, you should use names that are more suitable for coding.

names(dat)[2:5] <- paste0(rep(c('color.', 'shape.'), each=2), 1:2)

Now we can easily reshape the data into long format.

dat_l <- reshape(dat, 2:5, direction='long', idvar='ID')
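
As a rough illustration of what this reshape() call does with names such as color.1 and color.2, here is a small sketch on a two-row toy data frame (toy and toy_l are illustrative names; the values are taken from the example data). It assumes the default sep = ".", which is how reshape() guesses the variable name and the time index from each column name.

```r
# Toy data with the renamed columns (values taken from the example)
toy <- data.frame(ID = c(55L, 83L),
                  color.1 = c("red", "blue"),
                  color.2 = c("blue", "yellow"),
                  shape.1 = c("circle", "circle"),
                  shape.2 = c("triangle", NA),
                  size = c("small", "large"))

# varying = 2:5 marks the repeated columns; with the default sep = "."
# "color.1" is split into the variable "color" and the time 1
toy_l <- reshape(toy, varying = 2:5, direction = "long", idvar = "ID")

names(toy_l)   # ID, size, time, color, shape
nrow(toy_l)    # one row per combination of ID and time
toy_l$size     # the non-varying size column is repeated for each time
```

The last line matters for the frequency table: columns that are not part of varying are copied into every time slice, so counting them in the long data counts each ID more than once.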

After that, we can use the table() function and its relatives in base R,

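# names(dat_l) is an unnamed character vector, so indexing it by name would give NA;
# the column names are used directly instead
vars <- c("size", "color", "shape")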
tbl <- lapply(vars, \(x) table(dat_l[, x]) |> 
                (\(Freq) cbind(Freq=addmargins(Freq), 
                               Percent=addmargins(proportions(Freq))*100))() |>
                round(2)) |> 
  setNames(vars)

to get a nice table for the console.

tbl
# $size
#        Freq Percent
# large     4      40
# medium    2      20
# small     4      40
# Sum      10     100
# 
# $color
#        Freq Percent
# blue      2   22.22
# green     2   22.22
# red       2   22.22
# yellow    3   33.33
# Sum       9  100.00
# 
# $shape
#          Freq Percent
# circle      5   55.56
# square      2   22.22
# triangle    2   22.22
# Sum         9  100.00

# [1] "R version 4.1.2 (2021-11-01)"
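
In the output above the size counts sum to 10 although there are only five IDs, because reshape() repeats the non-varying size column once per time point. A minimal sketch, assuming that one size answer per ID is what should be counted, tabulates size from the wide data instead (size_freq and size_tbl are illustrative names):

```r
# size column from the example data, one value per ID
size <- c("small", "large", "large", "small", "medium")

# Frequency and percentage with a margin row, built the same way as above
size_freq <- table(size)
size_tbl <- round(cbind(Freq = addmargins(size_freq),
                        Percent = addmargins(proportions(size_freq)) * 100), 2)
size_tbl
```

The percentages are the same as in the long-format version (40, 20, 40), since every count is doubled there; only the Freq column and its total change.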

Data

dat <- structure(list(ID = c(55L, 83L, 78L, 43L, 29L), What.color.is.this.item. = c("red", 
"blue", "red", "green", "yellow"), What.color.is.this.item._2 = c("blue", 
"yellow", "yellow", NA, "green"), What.is.the.shape.of.this.item. = c("circle", 
"circle", "square", "square", "circle"), What.is.the.shape.of.this.item._2 = c("triangle", 
NA, "circle", "circle", "triangle"), size = c("small", "large", 
"large", "small", "medium")), class = "data.frame", row.names = c(NA, 
-5L))

[Discussion]:

[Answer 3]:

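This requires an assumption about the content of the column names: the relevant keywords have to be given. Their matching column positions are stored in appl.

Then build the data frame from those columns: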

appl <- sapply( c("color","shape","size"), function(x) grep(x, colnames(dat)) )

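# The function body needs braces because it holds two statements
data.frame( do.call( rbind, sapply( seq_along(appl), function(x) {
  # pool all columns that belong to one keyword and count the values (NA is dropped)
  tbl <- table(unlist( dat[,appl[[x]]] ))
  # per-level rows first, then a total row
  rbind( cbind( Variable=names(appl[x]), Freq=tbl, Percent=round( tbl/sum(tbl)*100, digits=2 ) ), 
  cbind( Variable=names(appl[x]), sum(tbl), sum(tbl/sum(tbl)*100) ) ) } ) ) )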

         Variable Freq Percent
blue        color    2   22.22
green       color    2   22.22
red         color    2   22.22
yellow      color    3   33.33
X           color    9     100
circle      shape    5   55.56
square      shape    2   22.22
triangle    shape    2   22.22
X.1         shape    9     100
large        size    2      40
medium       size    1      20
small        size    2      40
X.2          size    5     100
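
The keyword-to-column mapping that this answer relies on can be inspected on its own. The sketch below rebuilds appl from the column names only (cols and keywords are illustrative names); it uses simplify = FALSE so that the result is always a named list, and adds a check for keywords that match nothing, which is an extra safeguard and not part of the answer above.

```r
# Column names of dat as given in the answer's data
cols <- c("ID", "What.color.is.this.item.", "What.color.is.this.item._2",
          "What.is.the.shape.of.this.item.", "What.is.the.shape.of.this.item._2",
          "size")
keywords <- c("color", "shape", "size")

# Named list: keyword -> positions of the columns whose name contains it
appl <- sapply(keywords, function(k) grep(k, cols, fixed = TRUE), simplify = FALSE)
appl

# A keyword without any matching column would yield an empty table later,
# so it can be worth stopping early in that case
stopifnot(lengths(appl) > 0)
```

Here color and shape each map to two columns and size to one, which is why unlist() is applied before table() to pool the answers of one keyword.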

Data

dat <- structure(list(ID = c(55L, 83L, 78L, 43L, 29L), What.color.is.this.item. = c("red", 
"blue", "red", "green", "yellow"), What.color.is.this.item._2 = c("blue", 
"yellow", "yellow", NA, "green"), What.is.the.shape.of.this.item. = c("circle", 
"circle", "square", "square", "circle"), What.is.the.shape.of.this.item._2 = c("triangle", 
NA, "circle", "circle", "triangle"), size = c("small", "large", 
"large", "small", "medium")), class = "data.frame", row.names = c(NA, 
-5L))

[Discussion]:
