Pivoting data with varying width from wide to long with flexible call (to be used in loop)
【Posted】: 2020-09-21 00:44:49
【Question】:
I need to pivot some wide time-series data that varies in width, using tidyr's pivot_longer().
The data is quarterly, but I receive it in yearly blocks (four quarters) and in six-month blocks (only two quarters), i.e. the data varies in width.
I would like to find a simple and flexible solution that I can use in a loop, as I need to import several yearly and six-month blocks (and, as I need to convince my research group to use R, I am asking here for a simple, smart and clean solution).
In the yearly blocks the data looks somewhat like this,
dta_wide1 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2019", "label1", "", "7", "5", "6", "1", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.5", "0.4", "0.8", "0.3", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "21", "21", "87", "21", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2019", "label1", "", "5", "2", "3", "7", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.46", "0.72", "0.7", "0.8", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "19", "22", "85", "25", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", ""), V10 = c("", "", "", "", "As of 30/09/2019", "label1", "", "4", "1", "4", "8", "", ""), V11 = c("", "", "", "", "", "", "label2", "0.1", "0.3", "0.6", "0.22", "", ""), V12 = c("", "", "", "", "", "label4", "label4a", "21", "23", "71", "27", "", ""), V13 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V14 = c("", "", "", "", "As of 31/12/2019", "label1", "", "8", "6", "9", "9", "", ""), V15 = c("", "", "", "", "", "", "label2", "0.7", "0.87", "0.55", "0.33", "", ""), V16 = c("", "", "", "", "", "label4", "label4a", "24", "25", "99", "35", "", ""), V17 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))
and like this in the six-month blocks,
dta_wide2 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2020", "label1", "", "2", "3", "4", "8", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.1", "0.2", "0.3", "0.8", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "10", "11", "12", "9", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2020", "label1", "", "4", "6", "8", "16", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.22", "0.33", "0.44", "0.55", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "11", "12", "13", "10", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))
i.e. (for the six-month block)
# install.packages(c("tidyverse"), dependencies = TRUE)
library(tidyverse)
dta_wide2 %>% as_tibble
# A tibble: 13 x 9
V1 V2 V3 V4 V5 V6 V7 V8 V9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "" "ABC" "" "" "" "" "" "" ""
2 "" "some info" "" "" "" "" "" "" ""
3 "" "Store A" "" "" "" "" "" "" ""
4 "" "" "" "" "" "" "" "" ""
5 "" "As of 31/03/~ "" "" "" "As of ~ "" "" ""
6 "" "label1" "" "label~ "" "label1" "" "labe~ ""
7 "" "" "labe~ "label~ "labe~ "" "lab~ "labe~ "labe~
8 "peach" "2" "0.1" "10" "0.3" "4" "0.2~ "11" "0.4"
9 "dragon~ "3" "0.2" "11" "0.1" "6" "0.3~ "12" "0.1"
10 "honeyd~ "4" "0.3" "12" "0.4" "8" "0.4~ "13" "0.3"
11 "huckle~ "8" "0.8" "9" "0.2" "16" "0.5~ "10" "0.2"
12 "" "(a) some use~ "" "" "" "" "" "" ""
13 "" "(b) more not~ "" "" "" "" "" "" ""
In dta_wide2, the date keys float around like this
> dta_wide2[5,] %>% str_sub(start= -10) %>% lubridate::dmy()
[1] NA "2020-03-31" NA NA NA
[6] "2020-06-30" NA NA NA
So I tried to tidy it up like this
dta_wide2 %>%
add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2) %>%
add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>%
add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble
# A tibble: 13 x 12
V1 store date1 V2 V3 V4 date2 V5 V6 V7
<chr> <chr> <date> <chr> <chr> <chr> <date> <chr> <chr> <chr>
1 "" Stor~ 2020-03-31 "ABC" "" "" 2020-06-30 "" "" ""
2 "" Stor~ 2020-03-31 "som~ "" "" 2020-06-30 "" "" ""
3 "" Stor~ 2020-03-31 "Sto~ "" "" 2020-06-30 "" "" ""
4 "" Stor~ 2020-03-31 "" "" "" 2020-06-30 "" "" ""
5 "" Stor~ 2020-03-31 "As ~ "" "" 2020-06-30 "" "As ~ ""
6 "" Stor~ 2020-03-31 "lab~ "" "lab~ 2020-06-30 "" "lab~ ""
7 "" Stor~ 2020-03-31 "" "lab~ "lab~ 2020-06-30 "lab~ "" "lab~
8 "pea~ Stor~ 2020-03-31 "2" "0.1" "10" 2020-06-30 "0.3" "4" "0.2~
9 "dra~ Stor~ 2020-03-31 "3" "0.2" "11" 2020-06-30 "0.1" "6" "0.3~
10 "hon~ Stor~ 2020-03-31 "4" "0.3" "12" 2020-06-30 "0.4" "8" "0.4~
11 "huc~ Stor~ 2020-03-31 "8" "0.8" "9" 2020-06-30 "0.2" "16" "0.5~
12 "" Stor~ 2020-03-31 "(a)~ "" "" 2020-06-30 "" "" ""
13 "" Stor~ 2020-03-31 "(b)~ "" "" 2020-06-30 "" "" ""
# ... with 2 more variables: V8 <chr>, V9 <chr>
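Now I need to pivot this longer using pivot_longer, but my challenge is how to do that flexibly for both dta_wide1 and dta_wide2, given that I also receive data that looks like dta_wide1, i.e. with four quarters.
I have been working on this for a while, and I would appreciate any help with getting it to work, simplifying it, or cleaning it up.
This is where I am at the moment, but it is not correct, not flexible, and not simple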
dta_wide2_foo <- dta_wide2
names(dta_wide2_foo) <- c('goods', paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_1'), paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_2'))
dta_wide2_foo %>%
add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2) %>%
add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>%
add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble %>% .[8:11,] %>%
pivot_longer(-c(goods, store, date1, date2), values_to = "Value", names_to = "variable") %>% print(n = 100)
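Or, a generic snippet; it is not simple, smart or clean either, but it can be used in a loop to get the positions of the dates in the two sample data sets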
dta <- dta_wide2
# Braces keep magrittr from also passing the dates as the first argument of which()
dta[5,] %>% str_sub(start= -10) %>% lubridate::dmy() %>% { which(!is.na(.)) }
[1] 2 6
or, cleaner,
dta <- dta_wide1
dta[5,] %>% grep("As ",.)
[1] 2 6 10 14
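Building on the `grep()` call above, the positions of the date cells are enough to work out how many quarters a block holds and how wide each quarter is. The following is a minimal base-R sketch, assuming the dates always sit in a single header row and that every quarter spans the same number of columns; `toy` and `describe_block()` are hypothetical names introduced only for illustration.

```r
# Toy header row shaped like row 5 of the sample data: one leading
# product column, then two quarters of four columns each.
toy <- data.frame(
  V1 = "", V2 = "As of 31/03/2020", V3 = "", V4 = "", V5 = "",
  V6 = "As of 30/06/2020", V7 = "", V8 = "", V9 = "",
  stringsAsFactors = FALSE
)

# describe_block() is a hypothetical helper: it locates the date cells in
# the given row and derives the block layout from their positions.
describe_block <- function(dta, date_row = 1, prefix = "As of ") {
  header    <- unlist(dta[date_row, ], use.names = FALSE)
  date_cols <- grep(prefix, header, fixed = TRUE)
  # Width of one quarter: distance from a date column to the next one
  # (or to the end of the sheet for the last quarter).
  widths <- diff(c(date_cols, ncol(dta) + 1))
  list(
    date_cols  = date_cols,
    n_quarters = length(date_cols),
    width      = unique(widths),
    dates      = as.Date(sub(prefix, "", header[date_cols], fixed = TRUE),
                         format = "%d/%m/%Y")
  )
}

info <- describe_block(toy)
info$n_quarters   # number of quarters found in the block
info$date_cols    # column positions of the date cells
```

With the sample data the call would presumably be `describe_block(dta_wide2, date_row = 5)`; if `width` comes back with more than one value, the quarters are not equally wide and the same-width assumption does not hold for that sheet.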
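Update 2020-06-08 07:45:18Z
My goal is to combine the long data sets in order to plot the data (as Wimpel suggests below, I combine my various wide data sets, i.e. dta_wide1, dta_wide2, ..., dta_widen, using an lapply() call). I would like data that looks like this,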
> dta_long
# A tibble: 96 x 5
product label value date store
<chr> <chr> <dbl> <date> <chr>
1 peach label1 7 2019-03-31 Store A
2 peach label2 0.5 2019-03-31 Store A
3 peach label4a 21 2019-03-31 Store A
4 peach label4b 0.3 2019-03-31 Store A
5 peach label1 5 2019-06-30 Store A
6 peach label2 0.46 2019-06-30 Store A
7 peach label4a 19 2019-06-30 Store A
8 peach label4b 0.4 2019-06-30 Store A
9 peach label1 4 2019-09-30 Store A
10 peach label2 0.1 2019-09-30 Store A
# ... with 86 more rows
and then plot the data over time with ggplot2, using something like,
dta_long %>% filter(label == 'label1') %>% ggplot(aes(date, value, colour = product)) +
geom_line() + scale_x_date(date_breaks = "3 months",
date_labels = "%b-%y", limits = c((min(dta_long$date)-34), max = max(dta_long$date)))
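【Comments】:
This looks like the kind of problem where, unfortunately, you would save time by doing it manually rather than finding a programmatic way to transform it. How are you reading this data into R? Importing it differently might be easier than transforming it.
I am importing it from Excel binary (.xlsb) workbooks using read_xlsb from the readxlsb package. I imagine I could grab the indices/positions of the dates, use them to add the dates, and from there pivot the data using the dates as keys. Something like that.
Maybe you could look into tidyxl and unpivotr to find an easier way to extract the data.
I did not know about unpivotr. Thanks, I will look into it. Unfortunately, tidyxl does not support the binary .xlsb file format, according to its CRAN page (cran.r-project.org/web/packages/tidyxl/vignettes/tidyxl.html).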
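【Answer 1】:
I think the following solves your problem. I made the following assumptions:
1. The columns in each sheet are grouped by date, and each group contains four variables
2. The values you want to represent always sit in rows 8-11 of the spreadsheet
3. One sheet represents one store, and the store name is in the third row, second column of the spreadsheet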
dta_wide1 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2019", "label1", "", "7", "5", "6", "1", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.5", "0.4", "0.8", "0.3", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "21", "21", "87", "21", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2019", "label1", "", "5", "2", "3", "7", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.46", "0.72", "0.7", "0.8", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "19", "22", "85", "25", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", ""), V10 = c("", "", "", "", "As of 30/09/2019", "label1", "", "4", "1", "4", "8", "", ""), V11 = c("", "", "", "", "", "", "label2", "0.1", "0.3", "0.6", "0.22", "", ""), V12 = c("", "", "", "", "", "label4", "label4a", "21", "23", "71", "27", "", ""), V13 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V14 = c("", "", "", "", "As of 31/12/2019", "label1", "", "8", "6", "9", "9", "", ""), V15 = c("", "", "", "", "", "", "label2", "0.7", "0.87", "0.55", "0.33", "", ""), V16 = c("", "", "", "", "", "label4", "label4a", "24", "25", "99", "35", "", ""), V17 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))
## Calculate the number of groups of data in the spreadsheet
d1grps <- (ncol(dta_wide1) - 1) / 4 # Divide by 4 due to assumption #1 above
dnew1 <- as_tibble(dta_wide1) %>%
## Take rows that contain data (see assumption #2 above)
slice(8:11) %>%
mutate(
## Get dates from the original data frame and remove empty cells
## (need `unname()` or else this will overwrite variables)
!!!as.data.frame(slice(dta_wide1, 5) %>% select(seq(2, ncol(.), 4)) %>% unname()),
## Get store from second column (see assumption #3 above)
store = dta_wide1[3, 2]
) %>%
## Create variable names for each variable by group
setNames(
c("product", paste0(
c("label1_", "label2_", "label4a_", "label4b_"),
rep(1:d1grps, each = 4)
),
paste0("date_", 1:d1grps), "store"
)) %>%
pivot_longer(
cols = !any_of(c("product", "store")),
names_pattern = "(.+)_(.+)",
names_to = c(".value", "group")
) %>%
mutate(date = lubridate::dmy(sub("As of ", "", date)))
And the final tibble in long format:
# A tibble: 16 x 8
product store group label1 label2 label4a label4b date
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <date>
1 peach Store A 1 7 0.5 21 0.3 2019-03-31
2 peach Store A 2 5 0.46 19 0.4 2019-06-30
3 peach Store A 3 4 0.1 21 0.3 2019-09-30
4 peach Store A 4 8 0.7 24 0.3 2019-12-31
5 dragonfruit Store A 1 5 0.4 21 0.1 2019-03-31
6 dragonfruit Store A 2 2 0.72 22 0.1 2019-06-30
7 dragonfruit Store A 3 1 0.3 23 0.1 2019-09-30
8 dragonfruit Store A 4 6 0.87 25 0.1 2019-12-31
9 honeydew Store A 1 6 0.8 87 0.4 2019-03-31
10 honeydew Store A 2 3 0.7 85 0.3 2019-06-30
11 honeydew Store A 3 4 0.6 71 0.4 2019-09-30
12 honeydew Store A 4 9 0.55 99 0.4 2019-12-31
13 huckleberry Store A 1 1 0.3 21 0.2 2019-03-31
14 huckleberry Store A 2 7 0.8 25 0.2 2019-06-30
15 huckleberry Store A 3 8 0.22 27 0.2 2019-09-30
16 huckleberry Store A 4 9 0.33 35 0.2 2019-12-31
And the plot you want:
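## label2 is still a character column after pivoting, so convert it for a continuous y axis
ggplot(dnew1, aes(x = date, y = as.numeric(label2), color = product, group = product)) +
geom_line()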
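The code should accommodate a growing number of dates per spreadsheet, and you could write a function that finds the number of groups programmatically instead of relying on changing the variable d1grps to set it.
Although this should work, writing the contents out to a text file and manipulating the text might be easier than using dplyr this way.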
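One way such a function could look is sketched below. It is not the answerer's code: `pivot_block()`, its arguments and the compact toy sheets are hypothetical, and the sketch assumes that every date group has the same labels in the same order and that labels contain no underscore. Unlike the code above, it returns one row per product, label and date, which is closer to the `dta_long` layout asked for in the question.

```r
library(tidyr)

# Compact toy sheets (hypothetical layout): row 1 holds the store name,
# row 2 the dates, rows 3-4 the values; two labels per date group.
two_q <- data.frame(
  V1 = c("", "", "peach", "honeydew"),
  V2 = c("Store A", "As of 31/03/2020", "2", "4"),
  V3 = c("", "", "0.1", "0.3"),
  V4 = c("", "As of 30/06/2020", "4", "8"),
  V5 = c("", "", "0.22", "0.44"),
  stringsAsFactors = FALSE
)
one_q <- two_q[, 1:3]  # a narrower sheet with a single date group

# pivot_block() is a hypothetical wrapper: the number of groups is derived
# from the sheet width, so the same call serves sheets of any width.
pivot_block <- function(dta, labels, data_rows, date_row, store_cell) {
  width <- length(labels)
  n_grp <- (ncol(dta) - 1) / width
  stopifnot(n_grp == round(n_grp))  # width must be a whole number of groups
  date_cols <- seq(2, ncol(dta), by = width)
  dates <- as.Date(
    sub("As of ", "", unlist(dta[date_row, date_cols], use.names = FALSE)),
    format = "%d/%m/%Y"
  )
  body <- dta[data_rows, , drop = FALSE]
  names(body) <- c("product",
                   paste0(rep(labels, times = n_grp), "_",
                          rep(seq_len(n_grp), each = width)))
  long <- pivot_longer(body, cols = -product,
                       names_to = c("label", "group"), names_sep = "_",
                       values_to = "value")
  long$value <- as.numeric(long$value)
  long$date  <- dates[as.integer(long$group)]  # map group number to its date
  long$store <- dta[store_cell[1], store_cell[2]]
  long[, c("product", "label", "value", "date", "store")]
}

long2 <- pivot_block(two_q, labels = c("label1", "label2"),
                     data_rows = 3:4, date_row = 2, store_cell = c(1, 2))
long1 <- pivot_block(one_q, labels = c("label1", "label2"),
                     data_rows = 3:4, date_row = 2, store_cell = c(1, 2))
```

For the sheets in the question, the arguments would presumably be `labels = c("label1", "label2", "label4a", "label4b")`, `data_rows = 8:11`, `date_row = 5` and `store_cell = c(3, 2)`, following the three assumptions listed above.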
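【Comments】:
【Answer 2】: I saved your two sample data sets in separate .xlsb files.
Maybe this helps... The solution works on both sample sets provided, so give it a try. The code assumes that all data has the same format, so that all information is always in the same rows and the store name is always in the same column.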
library( readxlsb )
library( cellranger )
library( tidyverse )
library( data.table )
#get filesnames to read
read.these.files <- list.files( path = "./temp/",
pattern = ".*\\.xlsb",
full.names = TRUE,
recursive = FALSE )
#now read the data to a list, using lapply()
# assuming the data needed is in the first sheet of the .xlsb-file
L <- lapply( read.these.files, readxlsb::read_xlsb, sheet = 1, range = cellranger::cell_limits() )
#now we can loop over the read in data in list 'L', and perform operations
L.dt <- lapply( L, function(x) {
  #get store_name
  store_name <- x[2,2]
  #get the data
  df1 <- x[7:10,]
  #set the colnames (=labels) right
  colnames <- x[5:6,]
  colnames[ colnames == "" ] <- NA
  names(df1) <- colnames %>% tidyr::fill( names(colnames) ) %>% slice(2)
  names(df1)[1] <- "product"
  #melt df1 to long format
  df1 <- df1 %>% tidyr::pivot_longer( cols = tidyselect::starts_with("label"), names_to = "label" )
  #set the dates right
  dates <- x[4, ]
  dates <- dates %>% tidyr::pivot_longer( cols = tidyselect::everything())
  dates[ dates == "" ] <- NA
  dates <- tidyr::fill( dates, value ) %>% dplyr::slice(2:n() )
  #add the dates and storename and tidy the .copy column
  df1 <- df1 %>%
    dplyr::mutate( date = rep( dates$value, nrow(df1) / length( dates$value) ),
                   store = store_name ) %>%
    dplyr::select( -.copy )
})
#name the list elements after the source file names
names(L.dt) <- basename( read.these.files )
#now, bind the list of altered data together into one _long_ data set
L.dt_tbl <- bind_rows(L.dt, .id = 'id')
L.dt_tbl %>% dplyr::mutate(date = str_sub(date, start= -10) %>%
lubridate::dmy() ) -> L.dt_tbl
Convert value to double,
dta_long <- type_convert(L.dt_tbl, cols(
`Type of NPE` = col_character(),
`What NPE` = col_character(),
value = col_double(),
institut = col_character()
))
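As a small, self-contained illustration of this conversion step, the sketch below applies `readr::type_convert()` to a toy data frame whose `value` column was read in as text. The data frame `raw` is made up for the example, and the column specification names only columns that exist in it.

```r
library(readr)

# Toy long data in which every column is still character, as it would be
# after pivoting cells that were read in as text.
raw <- data.frame(
  product = c("peach", "peach", "honeydew"),
  label   = c("label1", "label2", "label1"),
  value   = c("7", "0.5", "6"),
  stringsAsFactors = FALSE
)

# type_convert() re-parses character columns; the explicit specification
# makes the intended type of each column visible to the reader.
converted <- type_convert(raw, col_types = cols(
  product = col_character(),
  label   = col_character(),
  value   = col_double()
))

str(converted$value)
```

Spelling out the column types keeps the conversion explicit, which may help when the result is handed to colleagues who are new to R.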
The final data,
dta_long
# A tibble: 96 x 6
id product label value date store
<chr> <chr> <chr> <dbl> <date> <chr>
1 dta_wide1.xlsb peach label1 7 2019-03-31 Store A
2 dta_wide1.xlsb peach label2 0.5 2019-03-31 Store A
3 dta_wide1.xlsb peach label4a 21 2019-03-31 Store A
4 dta_wide1.xlsb peach label4b 0.3 2019-03-31 Store A
5 dta_wide1.xlsb peach label1 5 2019-06-30 Store A
6 dta_wide1.xlsb peach label2 0.46 2019-06-30 Store A
7 dta_wide1.xlsb peach label4a 19 2019-06-30 Store A
8 dta_wide1.xlsb peach label4b 0.4 2019-06-30 Store A
9 dta_wide1.xlsb peach label1 4 2019-09-30 Store A
10 dta_wide1.xlsb peach label2 0.1 2019-09-30 Store A
# ... with 86 more rows
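【Comments】:
Thanks for answering my question. I ran your code and got an error saying I can't subset columns that don't exist, because the column .copy does not exist. However, I realize I should have made it clearer that I am looking for a flexible solution that can be used in a loop. Hence the two data sets in the reproducible example.
I think I need to get the positions of the dates in row 5 of the data provided, dta_wide1[5, ] and dta_wide2[5, ], and then build from there: count the number of quarters in the given block, insert the dates into a column, and so on. Also, as I mentioned, I am looking for a solution that can help me convince my research group to use R, hence my bounty and the point about a simple/smart/clean solution.
Import in a loop? Why not add the file names to a vector/list (using list.files()) and use lapply() to "loop" over it?
It does not have to be a loop per se. The main thing is that the call is flexible and can handle data of varying width that needs to be pivoted from wide to long. I guess an lapply() call would do as well.
The code above handles the varying widths in the sample data (i.e. 2 and 4 quarters); it can handle up to n quarters, as long as the "blocks" of quarters are formatted the same.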
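The plan discussed in these comments (find the date positions, cut the sheet into one chunk per quarter, stack the chunks, then apply that to every block with `lapply()`) can be sketched as follows. This is an illustration under stated assumptions, not code from either answer: the compact layout (dates in row 1, labels in row 2, values from row 3 on) and the names `block_a`, `block_b` and `to_long()` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical compact layout: row 1 holds the dates, row 2 the labels,
# rows 3+ the values; column 1 holds the product names.
block_a <- data.frame(
  V1 = c("", "", "peach"),
  V2 = c("As of 31/03/2019", "label1", "7"),
  V3 = c("", "label2", "0.5"),
  V4 = c("As of 30/06/2019", "label1", "5"),
  V5 = c("", "label2", "0.46"),
  stringsAsFactors = FALSE
)
block_b <- block_a[, 1:3]            # a narrower block with one quarter
block_b[1, 2] <- "As of 31/03/2020"

# to_long() is a hypothetical helper: it cuts a sheet into one chunk per
# date column and stacks the chunks, so the sheet width does not matter.
to_long <- function(dta) {
  header    <- unlist(dta[1, ], use.names = FALSE)
  date_cols <- grep("As of ", header, fixed = TRUE)
  ends      <- c(date_cols[-1] - 1, ncol(dta))  # last column of each chunk
  chunks <- Map(function(from, to) {
    chunk <- dta[-(1:2), c(1, from:to), drop = FALSE]
    names(chunk) <- c("product", unlist(dta[2, from:to], use.names = FALSE))
    chunk$date <- as.Date(sub("As of ", "", header[from], fixed = TRUE),
                          format = "%d/%m/%Y")
    chunk
  }, date_cols, ends)
  bind_rows(chunks) %>%
    pivot_longer(cols = -c(product, date),
                 names_to = "label", values_to = "value") %>%
    mutate(value = as.numeric(value))
}

# "Loop" over blocks of different width and keep track of their origin.
blocks   <- list(year_2019 = block_a, half_2020 = block_b)
dta_long <- bind_rows(lapply(blocks, to_long), .id = "source")
```

Because each chunk is delimited by the next date column rather than by a fixed width, the helper does not need to know in advance whether a block holds two or four quarters.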