创建巨大矩阵的相关矩阵时出现错误的解决方法？ [复制]

Posted 2023-02-14

技术标签:

【中文标题】创建巨大矩阵的相关矩阵时出现错误的解决方法？ [复制]【英文标题】：Workaround for error when creating a correlation matrix of a huge matrix? [duplicate] 【发布时间】：2022-01-04 01:18:49 【问题描述】：

我有一个大小为 55422 x 888 的矩阵，因此 rcorr 函数会产生此错误：

M1<-matrix(rnorm(36),nrow=55422, ncol=888)

cor <- rcorr(t(M1), type = "pearson")

double(p * p) 中的错误：向量大小不能为 NA 另外：警告信息：在 p * p : 整数溢出产生的 NAs

有什么办法可以解决这个问题吗？

【问题讨论】：

【参考方案1】：

这不是一个完整/有效的解决方案，但可以让您了解一些问题。

您的相关矩阵将包含 n*(n-1)/2 = 1535771331 个唯一元素。如果计算每个相关性需要一毫秒，则计算相关性矩阵将花费 (n^2-n)/2/(1e6*3600) = 0.42 小时并需要 (n^2-n)/2*8/(2^30) = 11.4 Gb 的存储空间。如果您有大量 RAM 和时间，这些要求并非不可能......

事实上它比这更糟一点，因为rcorr 将其结果作为对称矩阵返回（即，不利用对称性），并返回n 和P 矩阵，所以存储需求大约是上述的 5 倍（完整矩阵的两倍，x 2.5，因为我们有两个双精度矩阵和一个整数矩阵）。

针对您的具体问题，section on long vectors in the R internals manual 讨论了 R 中对象的最大大小。“标准”限制是矩阵的元素总数应小于 2^31 ((n^2-n)/2/(2^31-1) = 0.72 )，但矩阵中的冗余会给您带来麻烦（相关性、p 值、和样本大小的存储也会遇到麻烦）。

如果你还想继续，这里是 A.N. 的一个实现。 Spiess，从here 复制而来，将问题分解为块并将结果存储在磁盘支持的数组中（即不在 RAM 中）。这不会为您提供 p 值（并且仍然不清楚您将如何处理所有这些值......），但它至少可以处理多达 40,000 列（大约需要一分钟）。

但是，长度过大似乎会影响您的实际问题大小 (888 x 55242)。我必须更仔细地看一下，看看这里是否有我们可以绕过的限制......看来我们实际上仍然受到矩阵尺寸的限制......（最大矩阵尺寸是 sqrt(2^31-1) 大约 46341 。 .. 随着更多的工作，我们仍然可以做块对角线的事情并将其分成几个组件...

set.seed(101)
nc <- 55422
nr <- 888
d <- matrix(rnorm(nr*nc), ncol = nc)
t1 <- system.time(b1 <- bigcor(d))

bigcor <- function(
x,
y = NULL,
fun = c("cor", "cov"),
size = 2000,
verbose = TRUE,
...)

  if (!require("ff")) stop("please install the ff package")
  fun <- match.arg(fun)
  if (fun == "cor") FUN <- cor else FUN <- cov
  if (fun == "cor") STR <- "Correlation" else STR <- "Covariance"
  if (!is.null(y) & NROW(x) != NROW(y)) stop("'x' and 'y' must have compatible dimensions!")

  NCOL <- ncol(x)
  if (!is.null(y)) YCOL <- NCOL(y)

  ## calculate remainder, largest 'size'-divisible integer and block size
  REST <- NCOL %% size
  LARGE <- NCOL - REST
  NBLOCKS <- NCOL %/% size

  ## preallocate square matrix of dimension
  ## ncol(x) in 'ff' single format
  if (is.null(y)) resMAT <- ff(vmode = "double", dim = c(NCOL, NCOL))
  else resMAT <- ff(vmode = "double", dim = c(NCOL, YCOL))

  ## split column numbers into 'nblocks' groups + remaining block
  GROUP <- rep(1:NBLOCKS, each = size)
  if (REST > 0) GROUP <- c(GROUP, rep(NBLOCKS + 1, REST))
  SPLIT <- split(1:NCOL, GROUP)

  ## create all unique combinations of blocks
  COMBS <- expand.grid(1:length(SPLIT), 1:length(SPLIT))
  COMBS <- t(apply(COMBS, 1, sort))
  COMBS <- unique(COMBS)
  if (!is.null(y)) COMBS <- cbind(1:length(SPLIT), rep(1, length(SPLIT)))

  ## initiate time counter
  timeINIT <- proc.time()

  ## iterate through each block combination, calculate correlation matrix
  ## between blocks and store them in the preallocated matrix on both
  ## symmetric sides of the diagonal
  for (i in 1:nrow(COMBS)) 
    COMB <- COMBS[i, ]
    G1 <- SPLIT[[COMB[1]]]
    G2 <- SPLIT[[COMB[2]]]

    ## if y = NULL
    if (is.null(y)) 
      if (verbose) cat(sprintf("#%d: %s of Block %s and Block %s (%s x %s) ... ", i, STR,  COMB[1],
                               COMB[2], length(G1),  length(G2)))
      RES <- FUN(x[, G1], x[, G2], ...)
      resMAT[G1, G2] <- RES
      resMAT[G2, G1] <- t(RES)
     else ## if y = smaller matrix or vector
    
      if (verbose) cat(sprintf("#%d: %s of Block %s and 'y' (%s x %s) ... ", i, STR,  COMB[1],
                               length(G1),  YCOL))
      RES <- FUN(x[, G1], y, ...)
      resMAT[G1, ] <- RES
    

    if (verbose) 
      timeNOW <- proc.time() - timeINIT
      cat(timeNOW[3], "s\n")
    

    gc()
  

  return(resMAT)

【讨论】：

以上是关于创建巨大矩阵的相关矩阵时出现错误的解决方法？ [复制]的主要内容，如果未能解决你的问题，请参考以下文章

将 OpenMP 与 Fortran 一起使用时出现内存错误，运行 FFTW

为啥在遍历列表矩阵时出现列表索引超出范围错误？

编译动态矩阵时出现file.exe错误

从稀疏矩阵导入时出现 Modin AttributeError

使用 Numba 进行矩阵乘法时出现 CUDA 内存不足错误

为啥我在尝试将元素添加到 vector<vector<int>>（矩阵实现）时出现段错误？