Armadillo C++: Sorting a vector in terms of two other vectors

Posted 2023-02-17

Asked: 2018-03-29 11:25:03

Question:

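My question relates to a sorting exercise that I can do easily (but perhaps slowly) in R and would like to do in C++ to speed up my code.

Consider three vectors a, b and c of the same size. In R, the following command sorts a first by b and then, in case of ties, further by c.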

a <- a[order(b, c)]

Example:

a<-c(1,2,3,4,5)
b<-c(1,2,1,2,1)
c<-c(5,4,3,2,1)

> a[order(b,c)]
[1] 5 3 1 4 2

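Is there an efficient way to do this in C++ using Armadillo vectors?

Comments:

It should be fast enough in R. This question may help: ***.com/questions/48118248/…

Answer 1:

We can write the following C++ solution, which I have in a file SO_answer.cpp: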

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace arma;

// [[Rcpp::export]]
vec arma_sort(vec x, vec y, vec z) {
    // Order the elements of x by sorting y and z;
    // we order by y unless there's a tie, then order by z.
    // First create a vector of indices
    uvec idx = regspace<uvec>(0, x.size() - 1);
    // Then sort that vector by the values of y and z
    std::sort(idx.begin(), idx.end(), [&](int i, int j) {
        if ( y[i] == y[j] ) {
            return z[i] < z[j];
        }
        return y[i] < y[j];
    });
    // And return x in that order
    return x(idx);
}

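What we do is take advantage of the fact that std::sort() lets you sort according to a custom comparator. We use a comparator that compares elements of z only if the corresponding elements of y are equal; otherwise it compares the values of y.¹ We can then compile the file and test the function in R: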

library(Rcpp)
sourceCpp("SO_answer.cpp")

set.seed(1234)
x <- sample(1:10)
y <- sample(1:10)
z <- sample(1:10)

y[sample(1:10, 1)] <- 1 # create a tie

all.equal(x[order(y, z)], c(arma_sort(x, y, z))) # check against R
# [1] TRUE # Good
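
As a side note, the "compare y, fall back to z on a tie" comparator can also be written as a lexicographic tuple comparison. The following is only a minimal sketch under some assumptions: it uses std::vector<double> instead of arma::vec so that it compiles without Armadillo, the helper name sort_by_two is hypothetical, and all three inputs are assumed to have the same length.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <tuple>
#include <vector>

// Hypothetical helper (not part of Armadillo or Rcpp): returns x reordered
// by y, breaking ties in y by z. Assumes all three vectors have equal length.
std::vector<double> sort_by_two(const std::vector<double>& x,
                                const std::vector<double>& y,
                                const std::vector<double>& z) {
    // Build the index vector 0, 1, ..., n - 1.
    std::vector<std::size_t> idx(x.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    // std::tie builds tuples of references; tuple operator< compares
    // lexicographically: y first, then z only when the y values are equal.
    std::sort(idx.begin(), idx.end(), [&](std::size_t i, std::size_t j) {
        return std::tie(y[i], z[i]) < std::tie(y[j], z[j]);
    });
    // Gather x in the sorted order.
    std::vector<double> out;
    out.reserve(x.size());
    for (std::size_t i : idx) out.push_back(x[i]);
    return out;
}
```

With the vectors from the question, sort_by_two({1,2,3,4,5}, {1,2,1,2,1}, {5,4,3,2,1}) yields 5 3 1 4 2, matching a[order(b, c)] in R. The tuple form scales to more than two keys by adding elements to each std::tie call.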

Of course, we must also consider whether this actually gives you any performance gain, which is the whole reason you are doing this. Let's benchmark:

library(microbenchmark)
microbenchmark(r = x[order(y, z)],
               arma = arma_sort(x, y, z),
               times = 1e4)

Unit: microseconds
 expr    min    lq      mean median    uq      max neval cld
    r 36.040 37.23 39.386160  37.64 38.32 3316.286 10000   b
 arma  5.055  6.07  7.155676   7.00  7.53  107.230 10000  a

On my machine, you seem to get a roughly 5-6x speedup with small vectors, but the advantage shrinks as you scale up:

x <- sample(1:100)
y <- sample(1:100)
z <- sample(1:100)

y[sample(1:100, 10)] <- 1 # create some ties

all.equal(x[order(y, z)], c(arma_sort(x, y, z))) # check against R
# [1] TRUE # Good

microbenchmark(r = x[order(y, z)],
               arma = arma_sort(x, y, z),
               times = 1e4)

Unit: microseconds
 expr   min     lq     mean median     uq      max neval cld
    r 44.50 46.360 48.01275 46.930 47.755  294.051 10000   b
 arma 10.76 12.045 16.30033 13.015 13.715 5262.132 10000  a 

x <- sample(1:1000)
y <- sample(1:1000)
z <- sample(1:1000)

y[sample(1:100, 10)] <- 1 # create some ties

all.equal(x[order(y, z)], c(arma_sort(x, y, z))) # check against R
# [1] TRUE # Good

microbenchmark(r = x[order(y, z)],
               arma = arma_sort(x, y, z),
               times = 1e4)

Unit: microseconds
 expr     min       lq     mean   median       uq      max neval cld
    r 113.765 118.7950 125.7387 120.5075 122.4475 3373.696 10000   b
 arma  82.690  91.3925 104.0755  95.2350  99.4325 6040.162 10000  a

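It is still faster, but by less than 2x for vectors of length 1000. This is probably why F. Privé said this operation should be fast enough in R. While moving to C++ with Rcpp can give you great performance gains, how much you gain depends heavily on context, as Dirk Eddelbuettel has mentioned many times when answering various questions here.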

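¹ Note that in general I would recommend sorting Armadillo vectors with sort() or sort_index() (see the Armadillo documentation). If you are trying to sort a vec by the values of a second vec, you can use x(arma::sort_index(y)), as I pointed out in my answer to a related question. You can even use stable_sort_index() to preserve ties. However, I could not figure out how to use these functions to solve the specific problem you pose here.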
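
For what it's worth, one general way to get a two-key order out of single-key sorts is to sort by the secondary key first and then stable-sort by the primary key. The sketch below illustrates that idea with std::stable_sort on plain std::vector inputs; the helper name order_two_pass is hypothetical, and the inputs are assumed to have equal length. In Armadillo terms this would presumably correspond to a sort_index() on z followed by a stable_sort_index() on the reordered y, but that mapping is an assumption and is not tested here.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: 0-based index order equivalent to R's order(y, z),
// built from two single-key sorts. Assumes y and z have the same length.
std::vector<std::size_t> order_two_pass(const std::vector<double>& y,
                                        const std::vector<double>& z) {
    std::vector<std::size_t> idx(y.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    // Pass 1: sort by the secondary key z (stable, so elements tied on
    // both keys keep their input order).
    std::stable_sort(idx.begin(), idx.end(),
                     [&](std::size_t i, std::size_t j) { return z[i] < z[j]; });
    // Pass 2: stable sort by the primary key y; elements with equal y keep
    // the z order established in pass 1.
    std::stable_sort(idx.begin(), idx.end(),
                     [&](std::size_t i, std::size_t j) { return y[i] < y[j]; });
    return idx;
}
```

For y = (1,2,1,2,1) and z = (5,4,3,2,1) this returns the indices 4 2 0 3 1, which is order(y, z) in R shifted to 0-based indexing. It does two sorts instead of one, so the single custom comparator is likely the cheaper option.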

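Comments:

Nice answer. I also don't see how to make the existing Armadillo routines respect more than one pair of vectors.

Thanks @DirkEddelbuettel! Yes, if you look at the source code, you will see there is no way to pass a custom comparator to the Armadillo sorting functions, but you can also see that they just call std::sort(), so I think this approach is good enough.

Thanks @duckmayr! After some thought, another solution is that you can always define a new vector d that gives the same ranking as b, except when there are ties, in which case you let the ranking given by c come into play. Your solution is more concise.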
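
The composite-key idea from the last comment could be sketched roughly as follows. This is one possible reading of it, not the commenter's actual code: the helper names dense_rank and composite_key are hypothetical, the inputs are plain std::vector<double> of equal length n without NaN values, and n * n is assumed to fit in std::size_t.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: dense 0-based rank of each element
// (equal values share the same rank).
std::vector<std::size_t> dense_rank(const std::vector<double>& v) {
    std::vector<double> uniq(v);
    std::sort(uniq.begin(), uniq.end());
    uniq.erase(std::unique(uniq.begin(), uniq.end()), uniq.end());
    std::vector<std::size_t> r(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        // Position of v[i] among the sorted distinct values.
        r[i] = static_cast<std::size_t>(
            std::lower_bound(uniq.begin(), uniq.end(), v[i]) - uniq.begin());
    }
    return r;
}

// Hypothetical helper: a single key d that orders like b, with ties in b
// broken by c. Assumes b and c have the same length.
std::vector<std::size_t> composite_key(const std::vector<double>& b,
                                       const std::vector<double>& c) {
    const std::size_t n = b.size();
    const std::vector<std::size_t> rb = dense_rank(b);
    const std::vector<std::size_t> rc = dense_rank(c);
    std::vector<std::size_t> d(n);
    // rc[i] < n, so the rank of c can never outweigh a difference in rb.
    for (std::size_t i = 0; i < n; ++i) d[i] = rb[i] * n + rc[i];
    return d;
}
```

For b = (1,2,1,2,1) and c = (5,4,3,2,1) this gives d = (4,8,2,6,0), and sorting a by d alone reproduces 5 3 1 4 2. Computing the ranks already requires sorting both b and c, which is one reason the single comparator is the more concise route.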
