如何加速R中的循环计算循环
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了如何加速R中的循环计算循环相关的知识,希望对你有一定的参考价值。
简单地说我有378742个观测值(每个观测值都有一个发射和截止日期),我想检查每次观测的持续时间与所有其他(378741)观测值的重叠并总结。
我运行以下代码,由于嵌套循环,它需要永远(我估计它是205天)。有没有办法加快计算速度? (我使用DescTools
package作为Overlap
命令。)
a <- c(1:378742)
for (i in 1:378742) {
mydata$competition[i] <- sum(a, na.rm = T)
for (j in 1:378742) {
a[j] <- Overlap(c(mydata$Launched[i], mydata$Deadline[i]), c(mydata$Launched[j], mydata$Deadline[j]))
}
}
答案
你可以通过矢量化内部循环来节省大量时间(然后我使用apply()
作为外部循环):
# We'll need both DescTools and microbenchmark
library(DescTools)
library(microbenchmark)
# Make example data
set.seed(123) # setting seed for reproducibility
n <- 10
x <- sample(seq(as.Date("2008/10/20"), as.Date("2018/10/20"), "day"), n)
y <- sample(seq(as.Date("2008/10/20"), as.Date("2018/10/20"), "day"), n)
(mat <- cbind(x, y))
#> x y
#> [1,] 15222 17667
#> [2,] 17050 15827
#> [3,] 15665 16645
#> [4,] 17395 16262
#> [5,] 17603 14547
#> [6,] 14338 17454
#> [7,] 16098 15069
#> [8,] 17425 14325
#> [9,] 16181 15367
#> [10,] 15835 17650
# First get the answer using nested loops
a <- z <- 1:n
for (i in 1:n) {
for (j in 1:n) {
a[j] <- Overlap(mat[i, ],mat[j, ])
}
# Noticed I've moved this sum to the bottom,
# so that our first element isn't just a sum from one to n
z[i] <- sum(a, na.rm = T)
}
z
#> [1] 16102 9561 7860 7969 18169 18140 6690 18037 6017 12374
apply(mat, 1, function(r) sum(Overlap(r, mat)))
#> [1] 16102 9561 7860 7969 18169 18140 6690 18037 6017 12374
microbenchmark(apply = apply(mat, 1, function(r) sum(Overlap(r, mat))),
loop = for (i in 1:n) {
for (j in 1:n) {
a[j] <- Overlap(mat[i, ],mat[j, ])
}
# Noticed I've moved this sum to the bottom,
# so that our first element isn't just a sum from one to n
z[i] <- sum(a, na.rm = T)
})
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> apply 7.538967 7.688929 7.894379 7.767989 7.891177 13.57523 100
#> loop 76.051011 77.203810 80.045325 78.158369 79.206538 114.68139 100
#> cld
#> a
#> b
由reprex package创建于2018-10-20(v0.2.1)
现在让我们试着了解它如何与(稍微)更大的示例数据进行扩展(如果数据太大,基准测试需要永久):
#
n <- 100
x <- sample(seq(as.Date("2008/10/20"), as.Date("2018/10/20"), "day"), n, r = T)
y <- sample(seq(as.Date("2008/10/20"), as.Date("2018/10/20"), "day"), n, r = T)
mat <- cbind(x, y)
a <- z <- 1:n
for (i in 1:n) {
for (j in 1:n) {
a[j] <- Overlap(mat[i, ],mat[j, ])
}
z[i] <- sum(a, na.rm = T)
}
# In case you're concerned it still works:
all.equal(z, apply(mat, 1, function(r) sum(Overlap(r, mat))))
#> [1] TRUE
microbenchmark(apply = apply(mat, 1, function(r) sum(Overlap(r, mat))),
loop = for (i in 1:n) {
for (j in 1:n) {
a[j] <- Overlap(mat[i, ],mat[j, ])
}
# Noticed I've moved this sum to the bottom,
# so that our first element isn't just a sum from one to n
z[i] <- sum(a, na.rm = T)
})
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> apply 258.1151 262.8007 269.8172 265.9643 276.8799 296.2167 100
#> loop 5806.9834 5841.3362 5890.4988 5863.7317 5884.2308 6222.1670 100
#> cld
#> a
#> b
由reprex package创建于2018-10-20(v0.2.1)
另一答案
在Bioinformatics中,我们用于查找GenomicRanges
包的重叠范围。
我曾经使用我常用的for
-loops和lapply
函数进行计算,我的计算机将计算5天。但后来我找到了GenomicRanges
包 - 它在几秒钟内完成了!
(对我来说很遗憾,我仍然不知道它是如何工作的...将与有序的树结构和有效交叉...而且部分也可能涉及C++
代码?..)
无论如何,结果是闪电般快速。你会惊讶的!
GenomicRanges包用于减轻快速范围计算
############################
# Install GenomicRanges package
############################
# since this year introduced: `BiocManager`
# Bioconductor is main code repository for Bionformaticians.
# It is kind of `CRAN` for Bioinformaticians programming with R
install.packages("BiocManager")
require(BiocManager)
BiocManager::install("GenomicRanges")
# In older systems, you have to do:
install.packages("BiocInstaller")
require(BiocInstaller)
biocLite("GenomicRanges")
############################
# Load the GenomicRanges package
############################
require(GenomicRanges)
############################
# create dates as positive intervals
############################
set.seed(123) # for reproducibility of random stuff
n <- 1000 # later: 378742
x <- sample(seq(as.Date("2008/10/20"), as.Date("2038/10/20"), "day"), replace=TRUE, n)
# y <- sapply(x, function(date) date + sample(1:1000, 1)) # too slow!
deltas <- sample(1:10000, replace=TRUE, n) # immediate response `sapply` needs very long
y <- x + deltas
df <- data.frame(seqnames="1", start=x, end=y)
gr <- GRanges(df)
gr <- sort(gr)
############################
# Be careful, GRanges obj is 1-based system and not 0-based!
############################
# each row is one index - gr behaves when indexing like a vector
gr[5] # selects fifth row
gr[4:7] # selects 4th to 7th row
############################
# which range overlaps with which range?
############################
system.time({hits <- findOverlaps(gr, gr)})
# system.time({ your-R-expression }) - very convenient speed measuring!
# the numbers in the table are the index (i-th row) in each of the tables
# query and subject table - which are in this case identical tables - gr
############################
# what is the amount of overlap?
############################
overlaps <- pintersect(gr[queryHits(hits)], gr[subjectHits(hits)])
amount.overlaps <- width(overlaps) - 1 # - 1 because 1-based systems do +1 when ranges
# 1-base versus 0-based coordinate systems: https://www.biostars.org/p/84686/
以上是关于如何加速R中的循环计算循环的主要内容,如果未能解决你的问题,请参考以下文章