Tidyverse/faster solution to conditional formatting with openxlsx in R?
Posted: 2018-12-02 05:28:09
Question: I am working with genetic data similar to this table, but larger:
ID allele.a allele.b
A 115 90
A 115 90
A 116 90
B 120 82
B 120 82
B 120 82M
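My goal is to highlight, for each ID, the alleles that do not match the alleles listed in the first row of that ID group. I need to export the data to a well-formatted Excel file.
This is the result I want: in the table above, the 116 in allele.a for ID A and the 82M in allele.b for ID B would be highlighted.
I can get there with the script below, but the real script involves about 67 IDs, 1000 rows of data, and 37 columns. It takes about 5 minutes to run, so I am hoping to find a solution that significantly reduces the processing time. Perhaps a "do" solution from the tidyverse, though I am not sure what that would look like.
Here is my script, including a test data.frame. A larger test data frame is also included for speed testing.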
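# Only openxlsx is used for the workbook; loading xlsx as well is unnecessary
# and its createWorkbook()/saveWorkbook() names clash with openxlsx
library(openxlsx)
library(tidyverse)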
# Small data.frame
dframe <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
                     allele.a = c("115", "115", "116", "120", "120", "120"),
                     allele.b = c("90", "90", "90", "82", "82", "82M"),
                     stringsAsFactors = FALSE)
# Bigger data.frame for speed test
# dframe <- data.frame(ID = rep(letters, each = 30),
# allele.a = rep(as.character(round(rnorm(n = 30, mean = 100, sd = 0.3), 0)), 26),
# allele.b = rep(as.character(round(rnorm(n = 30, mean = 90, sd = 0.3), 0)), 26),
# allele.c = rep(as.character(round(rnorm(n = 30, mean = 80, sd = 0.3), 0)), 26),
# allele.d = rep(as.character(round(rnorm(n = 30, mean = 70, sd = 0.3), 0)), 26),
# allele.e = rep(as.character(round(rnorm(n = 30, mean = 60, sd = 0.3), 0)), 26),
# allele.f = rep(as.character(round(rnorm(n = 30, mean = 50, sd = 0.3), 0)), 26),
# allele.g = rep(as.character(round(rnorm(n = 30, mean = 40, sd = 0.3), 0)), 26),
# allele.h = rep(as.character(round(rnorm(n = 30, mean = 30, sd = 0.3), 0)), 26),
# allele.i = rep(as.character(round(rnorm(n = 30, mean = 20, sd = 0.3), 0)), 26),
# allele.j = rep(as.character(round(rnorm(n = 30, mean = 10, sd = 0.3), 0)), 26),
# stringsAsFactors = F)
# Create a new excel workbook ----
wb <- createWorkbook()
# Add a worksheet
addWorksheet(wb, sheet = 1, gridLines = TRUE)
# add the data to the worksheet
writeData(wb, sheet = 1, dframe, rowNames = FALSE)
# Create a style to show alleles that do not match the first row.
style_Red_NoMatch <- createStyle(fontColour = "#FFFFFF",        # white text
                                 bgFill = "#CC0000",            # dark red background
                                 textDecoration = c("BOLD"))    # bold text
Groups <- unique(dframe$ID)
start_time <- Sys.time()
# For each unique group,
for (i in seq_along(Groups)) {
  # Print a message telling us where the script is processing in the file.
  print(paste("Formatting unique group ", i, "/", length(Groups), sep = ""))
  # What are the allele values of the *first* individual in the group?
  Allele.values <- dframe %>%
    filter(ID == Groups[i]) %>%
    slice(1) %>%
    select(2:ncol(dframe)) %>%
    as.character()
  # for each column that has allele values in it,
  for (j in seq_along(Allele.values)) {
    # format the rest of the rows so that a value that does not match the first value gets red style
    conditionalFormatting(wb, sheet = 1,
                          cols = 1 + j,
                          rows = which(dframe$ID == Groups[i]) + 1,  # +1 skips the header row
                          rule = paste("<>\"", Allele.values[j], "\"", sep = ""),
                          style = style_Red_NoMatch)
  }
}
end_time <- Sys.time()
end_time - start_time
saveWorkbook(wb, "Example.xlsx", overwrite = TRUE)
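Answer 1: One thing I would improve is to apply conditionalFormatting to the whole range of columns at once, instead of looping over every cell.
Here is one way to do it. A drawback of this approach is that it creates logical TRUE/FALSE columns for conditionalFormatting to refer to. These columns can be hidden with the setColWidths function.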
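To make the idea of the TRUE/FALSE helper columns concrete, here is a minimal sketch on the small example from the question. It uses dplyr (an assumption: dplyr 1.0.0 or later for `across()`), whereas the answer's own script uses base R `merge()`. The `_chk` suffix follows the answer's naming; `starts_with("allele")` assumes every data column is named like the example.

```r
library(dplyr)

# Small example data from the question
dframe <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
                     allele.a = c("115", "115", "116", "120", "120", "120"),
                     allele.b = c("90", "90", "90", "82", "82", "82M"),
                     stringsAsFactors = FALSE)

# For every allele column, add a <name>_chk column that is TRUE when the
# value equals the value in the first row of the same ID group
dframe_chk <- dframe %>%
  group_by(ID) %>%
  mutate(across(starts_with("allele"), ~ .x == first(.x),
                .names = "{.col}_chk")) %>%
  ungroup()

print(dframe_chk)
```

Only the third row of allele.a (116) and the sixth row of allele.b (82M) come out FALSE, which are the cells the question wants highlighted. Unlike `merge()`, a grouped `mutate()` keeps the original row order.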
Data
library(openxlsx)
dframe <- data.frame(ID = rep(letters, each = 30),
allele.a = rep(as.character(round(rnorm(n = 30, mean = 100, sd = 0.3), 0)), 26),
allele.b = rep(as.character(round(rnorm(n = 30, mean = 90, sd = 0.3), 0)), 26),
allele.c = rep(as.character(round(rnorm(n = 30, mean = 80, sd = 0.3), 0)), 26),
allele.d = rep(as.character(round(rnorm(n = 30, mean = 70, sd = 0.3), 0)), 26),
allele.e = rep(as.character(round(rnorm(n = 30, mean = 60, sd = 0.3), 0)), 26),
allele.f = rep(as.character(round(rnorm(n = 30, mean = 50, sd = 0.3), 0)), 26),
allele.g = rep(as.character(round(rnorm(n = 30, mean = 40, sd = 0.3), 0)), 26),
allele.h = rep(as.character(round(rnorm(n = 30, mean = 30, sd = 0.3), 0)), 26),
allele.i = rep(as.character(round(rnorm(n = 30, mean = 20, sd = 0.3), 0)), 26),
allele.j = rep(as.character(round(rnorm(n = 30, mean = 10, sd = 0.3), 0)), 26),
stringsAsFactors = F)
The first part of the script is unchanged.
# Create a new excel workbook ----
wb <- createWorkbook()
# Add a worksheet
addWorksheet(wb, sheet = 1, gridLines = TRUE)
# Create a style to show alleles that do not match the first row.
style_Red_NoMatch <- createStyle(fontColour = "#FFFFFF",        # white text
                                 bgFill = "#CC0000",            # dark red background
                                 textDecoration = c("BOLD"))    # bold text
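Next, determine the first row for each ID and merge it back into the original dataset. Then loop over the columns and check whether each cell differs from the first row of its group.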
# selects first row for each ID which will be used as benchmark
first_row <- dframe[!duplicated(dframe$ID), ]
# Creating new df with the first_row columns added
dframe_chk <- merge(dframe, first_row, by = "ID", all.x = TRUE, suffixes = c("", "_first"))
# Add a logical TRUE/FALSE column for each allele column showing whether it matches the first row ([-1] excludes the ID column)
for (j in names(dframe)[-1]) {
  dframe_chk[, paste0(j, "_chk")] <- dframe_chk[, j] == dframe_chk[, paste0(j, "_first")]
}
# Remove _first columns when exporting into Excel
cols <- names(dframe_chk)[!grepl("_first", names(dframe_chk))]
# add the data to the worksheet
writeData(wb, sheet = 1, dframe_chk[, cols], rowNames = FALSE)
# This is for conditional Formatting
# Row 1 of the sheet is the header, so the data starts on row 2
row_start <- 2
# Need to add 1 to cover full range (as first row is header)
row_end <- nrow(dframe) + 1
# first column is ID
col_start <- 2
# last column as per the original dataset
col_end <- ncol(dframe)
# this is to point to the _chk column.
# Note: LETTERS only covers columns A-Z, so this needs adjusting if the first _chk column lies past column Z
rule_col <- LETTERS[col_end + 1]
# Using the _chk column to apply conditional formula
conditionalFormatting(wb, sheet = 1,
style_Red_NoMatch,
rows = row_start:row_end,
cols = col_start:col_end,
rule = paste0(rule_col, "2 = FALSE"))
# Exported file includes _chk columns. Hide these columns.
setColWidths(wb, sheet = 1, cols = (col_end + 1):length(cols), hidden = TRUE)
saveWorkbook(wb, "Example2.xlsx", overwrite = TRUE)
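The script above notes that `LETTERS` only reaches column Z. One possible adjustment, sketched here under the assumption that the real data has 37 columns in total (ID plus 36 allele columns, as the question suggests), is to build the column reference with `int2col()` from openxlsx, which converts a column index into its Excel letters. The value of `col_end` below is hypothetical.

```r
library(openxlsx)

# Hypothetical width: ID plus 36 allele columns, as in the question's real data
col_end <- 37

# int2col() turns a 1-based column index into Excel column letters,
# so it also works past column Z (where LETTERS[...] would return NA)
rule_col <- int2col(col_end + 1)

# Same rule text as in the script above, now pointing at the first _chk column
rule <- paste0(rule_col, "2 = FALSE")
print(rule)
```

Because the rule uses a relative reference, each allele column is still compared against the `_chk` column at the same offset to its right, exactly as in the original script.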