R：基于自定义距离函数和多个条件快速匹配记录

Posted 2023-02-19

技术标签:

【中文标题】R：基于自定义距离函数和多个条件快速匹配记录【英文标题】：R: quickly match records based on custom distance function and multiple criteria 【发布时间】：2014-08-27 18:12:17 【问题描述】：

我在 R 中创建了一些函数，以根据自定义光谱相似性函数和所谓的保留指数的匹配，将化学质谱（具有两列具有整数质量和强度的矩阵）与此类光谱库进行匹配。化合物（即洗脱时间）（参见此处的示例，http://webbook.nist.gov/cgi/cbook.cgi?ID=C630035&Mask=200）。为此，必须将每个记录的列表元素“RI”与库中的元素进行比较，当绝对偏差小于给定容差时，它应该将最佳光谱库匹配添加到我的记录中。下面是我为此编写的一些代码，但问题是它对于我的目的来说太慢了（我通常有大约 1000 个样本光谱和 200 000 个库光谱）。我尝试并行化它，但这似乎也没有多大帮助。关于如何使下面的代码更高效的任何想法，例如使用更多矢量化，或使用内联 C 代码，或其他一些 R 技巧？我知道这方面的一般建议，但不太清楚在这种情况下如何轻松实现它（不幸的是，我对 C 语言还不是很精通）......有什么想法或建议吗？哦，是的，我如何在使用sfLapply 时添加进度条？首先将我的光谱放在一个大（稀疏，因为有很多零）矩阵中是否有帮助，以避免光谱相似性函数中的merge 步骤，或者使用其他标准，例如仅在查询光谱中最大/最强的峰与库光谱具有相同的质量（或者包含在库光谱中的 5 个最大峰的集合中）？无论如何，任何关于如何加快这项任务的想法都将不胜感激！

编辑：我仍然有一个剩余查询是如何避免在函数 addbestlibmatches1 中制作样本记录 recs 的完整副本，而是仅更改存在库匹配的记录？此外，传递保留索引匹配的图书馆记录的选择可能效率不高（在函数 addbestlibmatch 中）。有什么想法可以避免这种情况吗？

# EXAMPLE DATA

rec1=list(RI=1100,spectrum=as.matrix(cbind(mz=c(57,43,71,41,85,56,55,70,42,84,98,99,39,69,58,113,156),int=c(999,684,396,281,249,173,122,107,94,73,51,48,47,47,37,33,32))))
randrec=function() list(RI=runif(1,1000,1200),spectrum=as.matrix(cbind(mz=seq(30,600,1),int=round(runif(600-30+1,0,999)))))

# spectral library
libsize=2000 # my real lib has 200 000 recs
lib=lapply(1:libsize,function(i) randrec())
lib=append(list(rec1),lib) 

# sample spectra
ssize=100 # I usually have around 1000
s=lapply(1:ssize,function(i) randrec())
s=append(s,list(rec1)) # we add the first library record as the last sample spectrum, so this should match



# SPECTRAL SIMILARITY FUNCTION

SpecSim=function (ms1,ms2,log=F)  
  alignment = merge(ms1,ms2,by=1,all=T)
  alignment[is.na(alignment)]=0
  if (nrow(alignment)!=0) 
    alignment[,2]=100*alignment[,2]/max(alignment[,2]) # normalize base peak intensities to 100
    alignment[,3]=100*alignment[,3]/max(alignment[,3])
    if (log==T) alignment[,2]=log2(alignment[,2]+1);alignment[,3]=log2(alignment[,3]+1) # work on log2 intensity scale if requested
    u = alignment[,2]; v = alignment[,3]
    similarity_score = as.vector((u %*% v) / (sqrt(sum(u^2)) * sqrt(sum(v^2))))
    similarity_score[is.na(similarity_score)]=-1
    return(similarity_score) else return(-1) 



# FUNCTION TO CALCULATE SIMILARITY VECTOR OF SPECTRUM WITH LIBRARY SPECTRA

SpecSimLib=function(spec,lib,log=F) 
  sapply(1:length(lib), function(i) SpecSim(spec,lib[[i]]$spectrum,log=log)) 



# FUNCTION TO ADD BEST MATCH OF SAMPLE RECORD rec IN SPECTRAL LIBRARY lib TO ORIGINAL RECORD
# we only compare spectra when list element RI in the sample record is within tol of that in the library
# when there is a spectral match > specsimcut within a RI devation less than tol,
# we add the record nrs in the library with the best spectral matches, the spectral similarity and the RI deviation to recs

addbestlibmatch=function(rec,lib,xvar="RI",tol=10,log=F,specsimcut=0.8) 
    nohit=list(record=-1,specmatch=NA,RIdev=NA)
    selected=abs(sapply(lib, "[[", xvar)-rec[[xvar]])<tol
    if (sum(selected)!=0) 
    specsims=SpecSimLib(rec$spectrum,lib[selected],log) # HOW CAN I AVOID PASSING THE RIGHT LIBRARY SUBSET EACH TIME?
    maxspecsim=max(specsims)
    if (maxspecsim>specsimcut) besthsel=which(specsims==maxspecsim)[[1]] # nr of best hit among selected elements, in case of ties we just take the 1st hit
                                idbesth=which(selected)[[besthsel]] # record nr of best hit in library lib
                                return(modifyList(rec,list(record=idbesth,specsim=specsims[[besthsel]],RIdev=rec[[xvar]]-lib[[idbesth]][[xvar]])))
                                else return(rec)  else return(rec)




# FUNCTION TO ADD BEST LIBRARY MATCHES TO RECORDS RECS

library(pbapply)
addbestlibmatches1=function(recs,lib,xvar="RI",tol=10,log=F,specsimcut=0.8) 
  pblapply(1:length(recs), function(i) addbestlibmatch(recs[[i]],lib,xvar,tol,log,specsimcut))


# PARALLELIZED VERSION
library(snowfall)
addbestlibmatches2=function(recs,lib,xvar="RI",tol=10,log=F,specsimcut=0.8,cores=4) 
  sfInit(parallel=TRUE,cpus=cores,type="SOCK")
  sfExportAll()
  sfLapply(1:length(recs), function(i) addbestlibmatch(recs[[i]],lib,xvar,tol,log,specsimcut))
  sfStop() 




# TEST TIMINGS

system.time(addbestlibmatches1(s,lib))
#|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
#user  system elapsed 
#83.60    0.06   83.82 

system.time(addbestlibmatches2(s,lib))
#user  system elapsed  - a bit better, but not much
#2.59    0.74   42.37

【问题讨论】：

简单进度条：

pb &lt;- txtProgressBar(0, 100, style = 3); sapply(1:100, function(i) setTxtProgressBar(pb, i); Sys.sleep(0.05); i); close(pb)

关于性能，您可能需要更深入的分析来确定瓶颈。您可以从 Advanced R 的this chapter 开始。 Thx - 刚刚尝试使用 devtools::install_github("hadley/lineprof") library(lineprof) l=lineprof(addbestlibmatches1(s,lib)) Shine(l) 查看它，但只是得到一个空的屏幕和一个警告警告在 file(con, "r") : file("") 只支持 open = "w+" 和 open = "w+b": 使用前者有什么想法吗？我认为加速我的代码的主要关键是矢量化并避免复制事物，但不知道在这种情况下如何做到这一点...... 我也没能成功。尝试使用 base Rprof 进行分析，也许？例如。回答here。 【参考方案1】：

如果不详细查看您的所有代码，我认为 SpecSim 功能还有改进的空间，而无需全部使用 C++。您正在使用合并，它将您的矩阵强制转换为 data.frames。这对性能总是不利的。您的大部分代码时间可能都在 merge() 和子集化中。 data.frame 子集慢，矩阵或向量快。

SpecSim2 <- function (ms1,ms2,log=F) 
  i <- unique(c(ms1[,1], ms2[,1]))
  y <- x <- numeric(length(i))
  x[match(ms1[,1], i)] <- ms1[, 2]
  y[match(ms2[,1], i)] <- ms2[, 2]
  x <- 100 * x / max(x)
  y <- 100 * y / max(y)
  if (log)
    x <- log2(x + 1)
    y <- log2(y + 1)
  
  similarity.score <- x %*% y / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
  if (is.na(similarity.score))
    return(-1)
   else 
    return(similarity.score)

以下分别是重写和原版的一些时间安排：

> system.time(addbestlibmatches1(s,lib))
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
   user  system elapsed 
   4.16    0.00    4.17

> system.time(addbestlibmatches1(s,lib))
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
   user  system elapsed 
  34.25    0.02   34.34

可能无法为您提供所需的速度，但比 8 倍的改进要好...

至于如何处理addbestlibmatches()，我认为你需要重新考虑这个问题作为矩阵问题。与其使用列表来保存您的库，不如使用向量作为库和样本的 RI 值。然后你可以像这样做你的初始屏幕：

selected <- outer(sRI, libRI, FUN = '-') < 10

如果您的库是一个大矩阵（质量 x 光谱），那么您可以为所选光谱对库进行子集化，并计算样品与在一次通过中选择的整个库部分之间的距离，如下所示：

SpecSimMat <- function(x, lib, log = F)
  stopifnot(require(matrixStats))
  x <- 100 * x / max(x)
  lib <- sweep(lib, 2, colMaxs(lib))
  x %*% lib / (sqrt(sum(x^2)) * sqrt(colSums(lib^2)))

其中 x 是样本，lib 是所选光谱的矩阵（质量 x 光谱）。这样，您对矩阵进行子集化（快速），然后执行一系列矩阵运算（快速）。

【讨论】：

哈哈，太好了——非常感谢！很高兴知道大部分时间都花在了相似度分数的计算上，所以我可以专注于加快这部分的速度...... 哈，我仍然有一个剩余查询是如何避免在函数 addbestlibmatches1 中制作样本记录 recs 的完整副本，而是仅更改存在库匹配的记录？此外，传递保留索引匹配的图书馆记录选择的副本可能效率不高（在函数 addbestlibmatch 中）。有什么想法可以避免这种情况吗？

以上是关于R：基于自定义距离函数和多个条件快速匹配记录的主要内容，如果未能解决你的问题，请参考以下文章

R语言ggplot2可视化自定义多个图例（legend）标签之间的距离实战（例如，改变数据点颜色和数据点大小图例之间的距离）

R语言使用fs包的path函数基于自定义文件路径规则，批量生成多个文件或者文件夹对应的相对(relative)文件路径（constructs a relative path）

R语言使用fs包的path_wd函数基于自定义文件路径规则，批量生成多个文件或者文件夹对应的绝对（absolute）文件路径（constructs absolute path）

R语言的自定义函数—字符组合

自定义损失函数 Tensorflow / Keras 惩罚相对距离

用条件在keras中实现自定义丢失函数