Fast calculation of cosine similarity of one vector with many [closed]

Posted 2023-02-21

【Posted】: 2017-03-30 04:55:17 【Question】:

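I'd be interested to hear suggestions for optimizing code that computes the cosine similarity of a vector x (of length l) with n other vectors (stored in any structure, such as a matrix m with n rows and l columns).

The value of n is typically much larger than the value of l.

I'm currently using this custom Rcpp function to compute the similarity of a vector x to each row of a matrix m: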

library(Rcpp)
cppFunction('NumericVector cosine_x_to_m(NumericVector x, NumericMatrix m) {
  int nrows = m.nrow();
  NumericVector out(nrows);
  for (int i = 0; i < nrows; i++) {
    NumericVector y = m(i, _);  // i-th row of m
    out[i] = sum(x * y) / sqrt(sum(pow(x, 2.0)) * sum(pow(y, 2.0)));
  }
  return out;
}')
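One thing worth noting about the loop above is that sum(pow(x, 2.0)) is evaluated again for every row, although it does not depend on the row. The following is a minimal standalone C++ sketch of the same calculation with that term hoisted out of the loop and each row read in a single pass. It is only an illustration under stated assumptions: it uses std::vector rather than Rcpp types, and the name cosine_to_rows is hypothetical, not part of the original code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical standalone counterpart of cosine_x_to_m (illustration only).
// Assumption: every row is a std::vector<double> with the same length as x.
std::vector<double> cosine_to_rows(const std::vector<double>& x,
                                   const std::vector<std::vector<double>>& rows) {
    // ||x||^2 is identical for every row, so compute it once up front.
    double x_sq = 0.0;
    for (double v : x) x_sq += v * v;

    std::vector<double> out;
    out.reserve(rows.size());
    for (const std::vector<double>& y : rows) {
        // Single pass over the row: accumulate x . y and ||y||^2 together.
        double dot = 0.0, y_sq = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j) {
            dot += x[j] * y[j];
            y_sq += y[j] * y[j];
        }
        out.push_back(dot / std::sqrt(x_sq * y_sq));
    }
    return out;
}
```

Whether this helps in practice would have to be measured with the benchmark from the question; the sketch only shows where the repeated work sits.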

With varying n and l, I'm getting the following sorts of timings:

Reproducible code below.

# Function to simulate data
sim_data <- function(l, n) {
  # Feature vector to be used for computing similarity
  x <- runif(l)

  # Matrix of feature vectors (1 per row) to compare against x
  m <- matrix(runif(n * l), nrow = n)

  list(x = x, m = m)
}

# Rcpp function to compute similarity of x to each row of m
library(Rcpp)
cppFunction('NumericVector cosine_x_to_m(NumericVector x, NumericMatrix m) {
  int nrows = m.nrow();
  NumericVector out(nrows);
  for (int i = 0; i < nrows; i++) {
    NumericVector y = m(i, _);  // i-th row of m
    out[i] = sum(x * y) / sqrt(sum(pow(x, 2.0)) * sum(pow(y, 2.0)));
  }
  return out;
}')

# Timer function
library(microbenchmark)
timer <- function(l, n) {
  dat <- sim_data(l, n)
  microbenchmark(cosine_x_to_m(dat$x, dat$m))
}

# Results for grid of l and n
library(tidyverse)
# expand_grid() from tidyr is used here in place of the deprecated purrr::cross_d()
results <- expand_grid(l = seq(200, 1000, by = 200), n = seq(500, 4000, by = 500)) %>% 
  mutate(timings = map2(l, n, timer))

# Plot results
results_plot <- results %>%
  unnest(timings) %>% 
  mutate(time = time / 1000000) %>%  # microbenchmark reports nanoseconds; convert to milliseconds
  group_by(l, n) %>% 
  summarise(mean = mean(time), ci = 1.96 * sd(time) / sqrt(n()))

pd <- position_dodge(width = 20)

results_plot %>% 
  ggplot(aes(n, mean, group = l)) +
  geom_line(aes(color = factor(l)), position = pd, size = 2) +
  geom_errorbar(aes(ymin = mean - ci, ymax = mean + ci), position = pd, width = 100) +
  geom_point(position = pd, size = 2) +
  scale_color_brewer(palette = "Blues") +
  theme_minimal() +
  labs(x = "n", y = "Milliseconds", color = "l") +
  ggtitle("Algorithm Runtime",
          subtitle = "Error bars represent 95% confidence intervals")

【Comments】:

This is better suited to CodeReview than SO.

Thanks for the suggestion. I've opened the question on CodeReview: codereview.stackexchange.com/questions/159396/…

【Answer 1】:

I'm using Microsoft R (with Intel MKL), which makes matrix multiplication faster, but to keep the comparison fair I set it to single-threaded mode.

setMKLthreads(1)

In my tests, this pure R version of cosine_x_to_m is twice as fast as yours.

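cosine_x_to_m2 = function(x, m) {
  # Normalize x once (sum(x^2) is a plain scalar, unlike the 1x1 matrix from crossprod)
  x = x / sqrt(sum(x^2))
  # One matrix-vector product yields all dot products; divide by the row norms
  return(as.vector((m %*% x) / sqrt(rowSums(m^2))))
}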
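The reason this gives the same result as the row-by-row loop can be written out in one line. Here m_i denotes the i-th row of m and \hat{x} the normalized x; both symbols are notation introduced only for this note.

```latex
\cos(x, m_i)
  = \frac{m_i \cdot x}{\lVert m_i \rVert \, \lVert x \rVert}
  = \frac{m_i \cdot \hat{x}}{\lVert m_i \rVert},
\qquad
\hat{x} = \frac{x}{\lVert x \rVert}
```

Stacking the numerators over all rows gives the matrix-vector product m %*% x (with x already normalized), and the denominators are sqrt(rowSums(m^2)), so the whole calculation reduces to one product and one element-wise division.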

Rewriting rowSums(m^2) in C/C++ makes it faster still, four times faster than the original.

library(ramwas)
cosine_x_to_m2 = function(x, m) {
  x = x / sqrt(sum(x^2))
  # rowSumsSq() from ramwas computes the row sums of squares in compiled code
  return(as.vector((m %*% x) / sqrt(rowSumsSq(m))))
}
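A likely reason the compiled version helps is that rowSums(m^2) first builds m^2 as a full temporary matrix before summing. The sketch below shows, in standalone C++, how row sums of squares can be accumulated directly. It is an illustration only and not the ramwas implementation: the function name row_sums_sq is hypothetical, and the matrix is assumed to be a flat column-major array, which is how R stores matrices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the ramwas code): row sums of squares of a matrix.
// Assumption: m is flat and column-major, i.e. element (i, j) is m[i + j * n_rows].
std::vector<double> row_sums_sq(const std::vector<double>& m,
                                std::size_t n_rows, std::size_t n_cols) {
    assert(m.size() == n_rows * n_cols);
    std::vector<double> out(n_rows, 0.0);
    // Walk column by column so memory is read sequentially,
    // adding each squared element to the accumulator of its row.
    for (std::size_t j = 0; j < n_cols; ++j) {
        const double* col = m.data() + j * n_rows;
        for (std::size_t i = 0; i < n_rows; ++i) {
            out[i] += col[i] * col[i];
        }
    }
    return out;
}
```

No temporary copy of the matrix is created; the only extra memory is the output vector of length n_rows.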

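Initial performance:

Final version performance:

【Comments】:

Thanks for the great answer! Following a moderator's suggestion, I've moved the question to CodeReview. Would you mind moving your answer there too? codereview.stackexchange.com/questions/159396/…

Is it acceptable to have it in both places?

Yes, because I just copied and pasted the question (it was not an official migration). But I think it's best not to vote on or accept answers to this question here anymore, and to do that on CodeReview instead. Maybe I should delete the question from SO now?

I disagree with the moderator. I'm fine with you not accepting the answer here, but I don't support deleting it.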
