Rcpp: List <-> Matrix conversions by reference?? + Optimizing memory allocation when programming with matrices

Posted 2023-02-16

[Asked]: 2019-07-30 22:46:02

[Question]:

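As I understand it, in Rcpp a matrix is implemented as a vector with a dimension attribute, while a list is a kind of vector of different objects. Is there therefore a trick to convert a list of equal-length vectors (i.e. a DataFrame) to a NumericMatrix (or arma::mat) and vice versa by reference, i.e. without copying the data column by column into a new data structure?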
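To make the layout difference concrete, here is a minimal sketch in plain C++ (deliberately not Rcpp). `FlatMatrix`, `ColumnList` and `to_matrix` are hypothetical names used only for illustration; they mimic how a matrix (one contiguous column-major buffer plus dimensions) differs from a list of separately allocated columns, and they assume all columns have equal length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A matrix in the R sense: one contiguous buffer plus dimensions.
// Element (i, j) lives at data[i + j * nrow] (column-major order).
struct FlatMatrix {
    std::size_t nrow, ncol;
    std::vector<double> data;
    double& at(std::size_t i, std::size_t j) { return data[i + j * nrow]; }
};

// A list of columns: every column owns its own, separately allocated buffer.
using ColumnList = std::vector<std::vector<double>>;

// The columns are not adjacent in memory, so building the matrix means
// copying each column into its slot of the contiguous buffer.
FlatMatrix to_matrix(const ColumnList& cols) {
    std::size_t nrow = cols.empty() ? 0 : cols[0].size();
    FlatMatrix m{nrow, cols.size(), std::vector<double>(nrow * cols.size())};
    for (std::size_t j = 0; j < cols.size(); ++j) {
        assert(cols[j].size() == nrow);  // assumption: equal-length columns
        std::copy(cols[j].begin(), cols[j].end(), m.data.begin() + j * nrow);
    }
    return m;
}
```

In this model the per-column copy cannot be avoided, because nothing guarantees that the list's columns sit next to each other in memory.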

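I presume this is not possible: it would be extremely useful for R users, so I am sure I would have come across it by now. Under this assumption, which means List and Matrix methods need to be implemented separately, my question becomes how to write Rcpp functions for Lists and Matrices that are equally efficient. My experience is that lists are more memory efficient because they allocate memory automatically as they are filled, whereas matrices need to be defined with all of their memory allocated up front. Consider the example below: I wrote two versions of a grouped sum, one for matrices and one for Lists / data.frames: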

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix gsumm(NumericMatrix x, int ng = 0, IntegerVector g = 0, bool fill = false) {
  int n = x.nrow();
  int c = x.ncol();
  NumericMatrix sum(ng, c); // here memory needs to be allocated
  // accumulate the per-group sums column by column (g holds 1-based group ids)
  for(int j = c; j--; ) {
    NumericMatrix::Column column = x( _ , j);
    NumericMatrix::Column sumj = sum( _ , j);
    for(int i = n; i--; ) sumj[g[i]-1] += column[i];
  }
  if(fill) {
    NumericMatrix out(n, c); // allocating space for this matrix is the main speed-killer
    // replace every element by the sum of its group
    for(int j = c; j--; ) {
      NumericMatrix::Column colo = out( _ , j);
      NumericMatrix::Column sumj = sum( _ , j);
      for(int i = n; i--; ) colo[i] = sumj[g[i]-1];
    }
    return out;
  } else return sum;
}


// [[Rcpp::export]]
List gsuml(List x, int ng = 0, IntegerVector g = 0, bool fill = false) {
  int l = x.size(), n;
  List sum(l);
  // accumulate the per-group sums, one freshly allocated vector per column
  for(int j = l; j--; ) {
    NumericVector column = x[j];
    n = column.size();
    NumericVector sumj(ng);
    for(int i = n; i--; ) sumj[g[i]-1] += column[i];
    sum[j] = sumj;
  }
  // replace every element by the sum of its group
  if(fill) for(int j = l; j--; ) {
    NumericVector sgj(n);
    NumericVector sumj = sum[j];
    for(int i = n; i--; ) sgj[i] = sumj[g[i]-1];
    sum[j] = sgj;
  }
  return sum;
}

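If fill = false, the group-aggregated data is returned, whereas if fill = true, data of the same dimensions as the input is returned, with every element replaced by its within-group sum. In both cases the list method is faster, especially with fill = true, where a full empty n x c matrix has to be created before it can be filled: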

library(microbenchmark)
testm = matrix(rnorm(10000000), ncol = 1000)
testl = as.data.frame(testm)
ng = 1000
g = sample.int(ng, 10000, replace = TRUE)

> microbenchmark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE))
Unit: milliseconds
                              expr      min       lq     mean   median       uq      max neval cld
 gsumm(testm, ng, g, fill = FALSE) 15.45847 16.28559 17.82400 16.67717 17.41415 63.40689   100   a
 gsuml(testl, ng, g, fill = FALSE) 13.61055 14.12062 16.06388 14.59342 15.45356 96.93972   100   a

> microbenchmark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE))
Unit: milliseconds
                             expr      min       lq     mean   median       uq      max neval cld
 gsumm(testm, ng, g, fill = TRUE) 34.45835 36.28886 51.42828 39.87513 60.51453 242.2054   100   b
 gsuml(testl, ng, g, fill = TRUE) 29.92314 30.69269 34.83283 31.33239 32.67136 115.8745   100  a

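It would be great if the two approaches could be brought to the same speed, or possibly made even more efficient. I guess that would require a way of programming with matrices that avoids large up-front memory allocations. I would be very grateful for any comments and suggestions!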
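As a concrete illustration of the two modes of the `fill` flag described in the question, here is a minimal single-column sketch in plain C++ rather than Rcpp. `grouped_sum` is a hypothetical name; like the question's code, it assumes 1-based group ids in the range 1..ng.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Grouped sum of one column. `g` holds 1-based group ids in 1..ng.
// fill = false: returns ng values, one sum per group.
// fill = true:  returns x.size() values, each replaced by its group's sum.
std::vector<double> grouped_sum(const std::vector<double>& x,
                                const std::vector<int>& g,
                                int ng, bool fill) {
    assert(x.size() == g.size());
    std::vector<double> sums(ng, 0.0);  // accumulators start at zero
    for (std::size_t i = 0; i < x.size(); ++i) {
        assert(g[i] >= 1 && g[i] <= ng);
        sums[g[i] - 1] += x[i];
    }
    if (!fill) return sums;
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = sums[g[i] - 1];
    return out;
}
```

For x = {1, 2, 3, 4} and g = {1, 2, 1, 2} with ng = 2, the aggregated result is {4, 6} and the filled result is {4, 6, 4, 6}.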

[Comments on the question]:

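I am about to head out and have only glanced at your question, but I fear you are making assumptions about what R does with its objects that are not entirely correct, and that will send you down the wrong path. Don't be afraid of matrices: those get into C++ "for free" via Rcpp and RcppArmadillo. Read up on memory profiling to convince yourself.

Thanks @DirkEddelbuettel! I guess you mean adv-r.had.co.nz/memory.html. I will try to read it tomorrow. But I would also appreciate it if you could elaborate on what the issue is with respect to my question and the example provided.

No, I mean the R documentation that ships with your copy of R: the Writing R Extensions manual. Everything else is derivative.

[Answer 1]: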

The amount of memory allocated is the same in both approaches. You can see this in the mem_alloc column when benchmarking with bench::mark():

> bench::mark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE), check = FALSE)
# A tibble: 2 x 13
  expression                           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                        <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = FALSE) 14.1ms 15.1ms      64.7    7.63MB     0       33     0      510ms <dbl …
2 gsuml(testl, ng, g, fill = FALSE) 12.5ms 15.1ms      67.0    7.68MB     4.19    32     2      478ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

> bench::mark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE), check = FALSE)
# A tibble: 2 x 13
  expression                          min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                       <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = TRUE) 39.2ms 45.6ms      20.0    83.9MB     20.0     5     5      250ms <dbl …
2 gsuml(testl, ng, g, fill = TRUE) 30.3ms   32ms      26.7      84MB     20.0     8     6      299ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>
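The reported mem_alloc figures can be cross-checked by hand. The sketch below assumes 8-byte doubles and that mem_alloc is reported in units of 1024 * 1024 bytes; `mib_of_doubles` and the constant names are made up for this illustration, while the dimensions come from the benchmark setup in the question.

```cpp
#include <cstddef>

// Size in MiB (1024 * 1024 bytes) of a rows x cols block of doubles.
constexpr double mib_of_doubles(std::size_t rows, std::size_t cols) {
    return static_cast<double>(rows * cols * sizeof(double)) / (1024.0 * 1024.0);
}

// Dimensions from the question: 10000 x 1000 input, 1000 groups.
constexpr std::size_t kRows = 10000, kCols = 1000, kGroups = 1000;

// fill = FALSE: only the ng x c matrix of group sums is allocated.
constexpr double sums_only_mib = mib_of_doubles(kGroups, kCols);

// fill = TRUE: the group sums plus the full n x c result.
constexpr double with_fill_mib = sums_only_mib + mib_of_doubles(kRows, kCols);
```

Under these assumptions the two values come out at about 7.63 and 83.9, matching the matrix rows of both tables. That suggests the allocations are dominated by the result buffers themselves; the slightly larger figures for the list version would be consistent with per-vector overhead, though that is not verified here.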

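However, the memory is not only allocated, which is fast anyway, but also initialized with zeros everywhere. That is unnecessary in your case and can be avoided by replacing Rcpp::NumericMatrix mat(rows, cols) with Rcpp::NumericMatrix mat = Rcpp::no_init(rows, cols), and Rcpp::NumericVector vec(length) with Rcpp::NumericVector vec = Rcpp::no_init(length). If I do this with your code, both functions benefit: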

> bench::mark(gsumm(testm,ng,g, fill = FALSE),gsuml(testl,ng,g, fill = FALSE), check = FALSE)
# A tibble: 2 x 13
  expression                           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                        <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = FALSE)   13ms 14.7ms      67.1    7.63MB     0       34     0      507ms <dbl …
2 gsuml(testl, ng, g, fill = FALSE) 12.8ms 14.6ms      67.4    7.68MB     2.04    33     1      489ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

> bench::mark(gsumm(testm,ng,g, fill = TRUE),gsuml(testl,ng,g, fill = TRUE), check = FALSE)
# A tibble: 2 x 13
  expression                          min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>                       <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 gsumm(testm, ng, g, fill = TRUE) 27.5ms   31ms      26.6    83.9MB     10.7    10     4      375ms <dbl …
2 gsuml(testl, ng, g, fill = TRUE) 24.7ms 26.4ms      36.9      84MB     36.9     9     9      244ms <list…
# … with 3 more variables: memory <list>, time <list>, gc <list>

However, I am not sure why the matrix version gains more from not initializing the memory.
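One caveat worth spelling out: skipping initialization is only safe for buffers that are completely overwritten before being read, such as `out` and `sgj` in the question's code. The `sum` accumulators are updated with `+=` and therefore need to start at zero. The plain C++ sketch below (an analogy, not Rcpp; `broadcast` is a hypothetical name) shows the fill step writing into an uninitialized buffer, with `new double[n]` standing in for `no_init`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Fill step: every element of `out` is written exactly once before anything
// reads it, so zeroing the buffer first would be wasted work.
// `new double[n]` leaves the elements uninitialized, whereas
// `new double[n]()` or std::vector<double>(n) would zero them.
std::unique_ptr<double[]> broadcast(const std::vector<double>& sums,
                                    const std::vector<int>& g) {
    std::unique_ptr<double[]> out(new double[g.size()]);
    for (std::size_t i = 0; i < g.size(); ++i) {
        // g holds 1-based group ids, as in the question
        assert(g[i] >= 1 && static_cast<std::size_t>(g[i]) <= sums.size());
        out[i] = sums[g[i] - 1];
    }
    return out;
}
```

An accumulator allocated the same way would start from indeterminate values, so its `+=` updates would produce garbage; that buffer has to stay zero-initialized.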

[Discussion]:

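Thanks a lot (again), Ralf Stubner! This did indeed speed up the matrix method and bring it closer to the list method. Overall the list method is still a bit faster, which may have to do with the vector subsetting that happens behind the scenes in the matrix method. I think this is already pretty good.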
