Convert to long and make frequency table when column names are strings, R
Posted: 2022-01-20 13:54:29
Question:
ID  What color is this item?  What color is this item?_2  What is the shape of this item?  What is the shape of this item?_2  size
55  red                       blue                        circle                           triangle                           small
83  blue                      yellow                      circle                           NA                                 large
78  red                       yellow                      square                           circle                             large
43  green                     NA                          square                           circle                             small
29  yellow                    green                       circle                           triangle                           medium
I would like a frequency table like this:
Variable  Level     Freq  Percent
color     blue      2     22.22
          red       2     22.22
          yellow    3     33.33
          green     2     22.22
          total     9     100.00
shape     circle    5     50.0
          triangle  3     30.0
          square    2     20.0
          total     10    100.0
size      small     2     33.3
          medium    2     33.3
          large     2     33.3
          total     6     100.0
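But when I try to convert to long, I can't match on the column names because they are long strings. From a previous question, I know I can do something like this: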
options(digits = 3)
df1 <- df2 %>%
  pivot_longer(
    -ID,
    names_to = "Question",
    values_to = "Response"
  ) %>%
  mutate(Question = str_extract(Question, '')) %>%
  group_by(Question, Response) %>%
  count(Response, name = "Freq") %>%
  na.omit() %>%
  group_by(Question) %>%
  mutate(Percent = Freq/sum(Freq)*100) %>%
  group_split() %>%
  adorn_totals() %>%
  bind_rows() %>%
  mutate(Response = ifelse(Response == last(Response), last(Question), Response)) %>%
  mutate(Question = ifelse(duplicated(Question) |
                             Question == "Total", NA, Question))
But I'm having trouble finding the right regular expression to put in the line:
mutate(Question = str_extract(Question, '')) %>%
If anyone knows another way to do this, that would be great too!
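Comments:
It is not clear what you want to extract in "But I'm having trouble finding the right regular expression to put in the line:". Do you want mutate(Question = str_extract(Question, "color|shape|size"))?
Would you mind sharing your data with dput? Or at least putting quotes around the column names? The spaces make importing annoying.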
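The alternation pattern suggested in the comment above can be checked directly against the column names from the question. The following is a minimal sketch, assuming the stringr package is installed; `cols` and `keywords` are illustrative names that do not appear in the post.

```r
library(stringr)

# Column names from the question (ID left out, as in pivot_longer(-ID, ...))
cols <- c("What color is this item?", "What color is this item?_2",
          "What is the shape of this item?", "What is the shape of this item?_2",
          "size")

# str_extract() returns the first match of the pattern in each string
keywords <- str_extract(cols, "color|shape|size")
print(keywords)
```

A name that contains none of the keywords yields NA, so such columns would be dropped by a later na.omit().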
Answer 1:
If the intention is to extract a custom list of words, we can paste the elements together into a single string and use that as the regex in str_extract.
library(dplyr)
library(tidyr)
library(janitor)
library(stringr)
library(flextable)
pat_words <- c("color", "shape", "size")
out <- df %>%
  pivot_longer(
    -ID,
    names_to = "Question",
    values_to = "Response"
  ) %>%
  mutate(Question = str_extract(Question, str_c(pat_words, collapse = "|"))) %>%
  group_by(Question, Response) %>%
  count(Response, name = "Freq") %>%
  na.omit() %>%
  group_by(Question) %>%
  mutate(Percent = round(Freq/sum(Freq)*100, 2)) %>%
  group_split() %>%
  adorn_totals() %>%
  bind_rows() %>%
  mutate(Response = ifelse(Response == last(Response), last(Question), Response)) %>%
  mutate(Question = ifelse(duplicated(Question) |
                             Question == "Total", NA, Question)) %>%
  as.data.frame
flextable(out)
Output (rendered flextable image not included)
Data
df <- structure(list(ID = c(55L, 83L, 78L, 43L, 29L), `What color is this item?` = c("red",
"blue", "red", "green", "yellow"), `What color is this item?_2` = c("blue",
"yellow", "yellow", NA, "green"), `What is the shape of this item?` = c("circle",
"circle", "square", "square", "circle"), `What is the shape of this item?_2` = c("triangle",
NA, "circle", "circle", "triangle"), size = c("small", "large",
"large", "small", "medium")), class = "data.frame", row.names = c(NA,
-5L))
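Comments:
Do you know how I can control the order of the table? For example, if I defined a levels vector?
@alex. The arrange can be done before changing the duplicated elements to NA, i.e. %>% arrange(factor(Question, levels = levels)) %>% mutate(Question = ifelse(duplicated(Question) | Question == "Total", NA, Question)) %>%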
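The ordering idea from the reply above can be tried on a small table. This is only a sketch under stated assumptions: the data frame `freq`, the vector `lv`, and the result `freq_sorted` are hypothetical, and the table has no total rows.

```r
library(dplyr)

# Hypothetical frequency table without total rows
freq <- data.frame(
  Question = c("color", "color", "shape", "size"),
  Response = c("blue", "red", "circle", "small"),
  Freq     = c(2L, 2L, 5L, 2L)
)

# Desired display order of the variables (an assumption for illustration)
lv <- c("shape", "size", "color")

freq_sorted <- freq %>%
  # Sort by the position of Question in lv rather than alphabetically
  arrange(factor(Question, levels = lv)) %>%
  # Only afterwards blank out the repeated labels
  mutate(Question = ifelse(duplicated(Question), NA, Question))
print(freq_sorted)
```

Values of Question that are not in the levels vector (for instance a "Total" label) become NA in the factor, and arrange() places NA last, so total rows may need their own handling.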
Answer 2:
First, you should use names that are easier to code with.
names(dat)[2:5] <- paste0(rep(c('color.', 'shape.'), each=2), 1:2)
Now we can easily reshape the data to long format.
dat_l <- reshape(dat, 2:5, direction='long', idvar='ID')
After that, we can use the table() function from base R and its relatives,
# columns of dat_l to tabulate
vars <- c("size", "color", "shape")
tbl <- lapply(vars, \(x) table(dat_l[, x]) |>
                (\(Freq) cbind(Freq=addmargins(Freq),
                               Percent=addmargins(proportions(Freq))*100))() |>
                round(2)) |>
  setNames(vars)
to get a nice table for the console.
tbl
# $size
#        Freq Percent
# large     4      40
# medium    2      20
# small     4      40
# Sum      10     100
#
# $color
#        Freq Percent
# blue      2   22.22
# green     2   22.22
# red       2   22.22
# yellow    3   33.33
# Sum       9  100.00
#
# $shape
#          Freq Percent
# circle      5   55.56
# square      2   22.22
# triangle    2   22.22
# Sum         9  100.00

# [1] "R version 4.1.2 (2021-11-01)"
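The building blocks used above (table(), proportions(), addmargins()) can be seen on a single vector. A minimal sketch, assuming R 4.0.1 or later for proportions() (prop.table() is the older equivalent); the vector `color` simply stacks the two color columns from the question, and `freq` and `res` are illustrative names.

```r
# Both color columns from the question, stacked into one vector
color <- c("red", "blue", "red", "green", "yellow",
           "blue", "yellow", "yellow", NA, "green")

# table() drops NA by default
freq <- table(color)

# addmargins() appends a "Sum" entry; round after summing so the total stays 100
res <- cbind(Freq    = addmargins(freq),
             Percent = round(addmargins(proportions(freq)) * 100, 2))
print(res)
```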
Data
dat <- structure(list(ID = c(55L, 83L, 78L, 43L, 29L), What.color.is.this.item. = c("red",
"blue", "red", "green", "yellow"), What.color.is.this.item._2 = c("blue",
"yellow", "yellow", NA, "green"), What.is.the.shape.of.this.item. = c("circle",
"circle", "square", "square", "circle"), What.is.the.shape.of.this.item._2 = c("triangle",
NA, "circle", "circle", "triangle"), size = c("small", "large",
"large", "small", "medium")), class = "data.frame", row.names = c(NA,
-5L))
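Answer 3:
This requires an assumption about the content of the columns (appl), i.e. the important keywords have to be given.
Then create the data frame based on the matching columns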
appl <- sapply( c("color","shape","size"), function(x) grep(x, colnames(dat)) )
data.frame( do.call( rbind, sapply( seq_along(appl), function(x) {
  # frequency table over all columns that match the x-th keyword
  tbl <- table(unlist( dat[,appl[[x]]] ))
  rbind( cbind( Variable=names(appl[x]), Freq=tbl, Percent=round( tbl/sum(tbl)*100, digits=2 ) ),
         cbind( Variable=names(appl[x]), sum(tbl), sum(tbl/sum(tbl)*100) ) )
} ) ) )
         Variable Freq Percent
blue        color    2   22.22
green       color    2   22.22
red         color    2   22.22
yellow      color    3   33.33
X           color    9     100
circle      shape    5   55.56
square      shape    2   22.22
triangle    shape    2   22.22
X.1         shape    9     100
large        size    2      40
medium       size    1      20
small        size    2      40
X.2          size    5     100
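The keyword-to-column mapping that this answer relies on can be inspected on its own. The sketch below uses only the column names of dat; `cols` is an illustrative name, and passing simplify = FALSE and fixed = TRUE is a suggested variation, not part of the original answer.

```r
# Column names of dat as given in the data block of this answer
cols <- c("ID", "What.color.is.this.item.", "What.color.is.this.item._2",
          "What.is.the.shape.of.this.item.", "What.is.the.shape.of.this.item._2",
          "size")

# For each keyword, the indices of the columns whose name contains it.
# simplify = FALSE always yields a named list, whatever the match counts are.
appl <- sapply(c("color", "shape", "size"),
               function(k) grep(k, cols, fixed = TRUE),
               simplify = FALSE)
str(appl)
```

Without simplify = FALSE, sapply() returns a list here only because the keywords match different numbers of columns. If every keyword matched the same number of columns, it would return a vector or matrix, and appl[[x]] would then pick a single element instead of all matching columns.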
Data
dat <- structure(list(ID = c(55L, 83L, 78L, 43L, 29L), What.color.is.this.item. = c("red",
"blue", "red", "green", "yellow"), What.color.is.this.item._2 = c("blue",
"yellow", "yellow", NA, "green"), What.is.the.shape.of.this.item. = c("circle",
"circle", "square", "square", "circle"), What.is.the.shape.of.this.item._2 = c("triangle",
NA, "circle", "circle", "triangle"), size = c("small", "large",
"large", "small", "medium")), class = "data.frame", row.names = c(NA,
-5L))