Union of intersecting vectors in a list in R

Posted 2023-02-22


【Title】Union of intersecting vectors in a list in R 【Posted】2015-02-15 16:33:09 【Question】:

I have a list of vectors as follows.

data <- list(v1=c("a", "b", "c"), v2=c("g", "h", "k"), 
             v3=c("c", "d"), v4=c("n", "a"), v5=c("h", "i"))

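I am trying to achieve the following:

Check whether any of the vectors intersect with each other.
If intersecting vectors are found, get their union.

So the desired output is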

out <- list(v1=c("a", "b", "c", "d", "n"), v2=c("g", "h", "k", "i"))

I can get the union of a group of intersecting sets as follows.

 Reduce(union, list(data[[1]], data[[3]], data[[4]]))
 Reduce(union, list(data[[2]], data[[5]]))

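How do I identify the intersecting vectors in the first place? Is there a way to split the list into a list of groups of intersecting vectors?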
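As a starting point for the identification step, the direct overlaps can be tabulated in base R. This is only a sketch: the name `overlaps` is made up for illustration, and it finds direct intersections only, so chains such as v3 and v4 (which are linked only through v1) still need a grouping step afterwards.

```r
data <- list(v1 = c("a", "b", "c"), v2 = c("g", "h", "k"),
             v3 = c("c", "d"), v4 = c("n", "a"), v5 = c("h", "i"))

# overlaps[i, j] is TRUE when vectors i and j share at least one element
overlaps <- sapply(data, function(a) sapply(data, function(b) any(a %in% b)))
diag(overlaps) <- FALSE  # ignore each vector's overlap with itself

# names of the vectors that directly intersect v1
names(which(overlaps["v1", ]))
```

For the sample data, the last line lists v3 and v4 as the direct neighbours of v1.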

#Update

Here is an attempt using data.table. It gives the desired result, but it is still slow for large lists such as the example datasets.
library(data.table)

# collapse each vector into a single comma-separated string
data <- sapply(data, function(x) paste(x, collapse=", "))
data <- as.data.frame(data, stringsAsFactors = F)

# keep merging groups that share an element until the row count stops changing
repeat {
  M <- nrow(data)
  data <- data.table( data , key = "data" )
  data <- data[ , list(dataelement = unique(unlist(strsplit(data , ", " )))), by = list(data)]
  data <- data.table(data , key = "dataelement" )
  data <- data[, list(data = paste0(sort(unique(unlist(strsplit(data, split=", ")))), collapse=", ")), by = "dataelement"]
  data$dataelement <- NULL
  data <- unique(data)
  N <- nrow(data)
  if (M == N)
    break
}

# split on ", " (the separator used above) so elements carry no leading space
data <- strsplit(as.character(data$data) , ", " )

【Comments】:

【Answer 1】:

This is kind of a graph problem, so I like to use the igraph library for it. Using your sample data, you can do

library(igraph)
#build edgelist
el <- do.call("rbind",lapply(data, embed, 2))
#make a graph
gg <- graph.edgelist(el, directed=F)
#partition the graph into disjoint sets
split(V(gg)$name, clusters(gg)$membership)

# $`1`
# [1] "b" "a" "c" "d" "n"
# 
# $`2`
# [1] "h" "g" "k" "i"

We can view the result

V(gg)$color=c("green","purple")[clusters(gg)$membership]
plot(gg)
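The same connected-components idea can be sketched in base R when igraph is not available. This is not the answer's method but a simple label-propagation stand-in; `group_intersecting` is a hypothetical helper name, and the sketch assumes every vector in the list is non-empty.

```r
group_intersecting <- function(x) {
  # long format: one entry per (list index, value) pair
  idx <- rep(seq_along(x), lengths(x))
  val <- unlist(x, use.names = FALSE)
  label <- seq_along(x)  # start with every vector in its own group
  repeat {
    # smallest label among all vectors that contain each value
    v <- ave(label[idx], val, FUN = min)
    # each vector adopts the smallest label found among its values
    new_label <- as.vector(tapply(v, idx, min))
    if (identical(new_label, label)) break
    label <- new_label
  }
  # union of the vectors that ended up with the same label
  lapply(split(x, label), function(g) unique(unlist(g, use.names = FALSE)))
}

data <- list(v1 = c("a", "b", "c"), v2 = c("g", "h", "k"),
             v3 = c("c", "d"), v4 = c("n", "a"), v5 = c("h", "i"))
res <- group_intersecting(data)
```

Labels only ever decrease, so the loop stops once no vector can borrow a smaller label from a neighbour. On long chains of overlaps it may need many passes, so it is not a replacement for a proper graph library on large inputs.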

【Comments】:

Nice solution, but memory may become a bottleneck.

【Answer 2】:

Here is another approach that uses only base R

Update

Next update, following akrun's comment and his sample data:

data <- list(v1=c('g', 'k'), v2= letters[1:4], v3= c('b', 'c', 'd', 'a'))

Modified function:

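x <- lapply(seq_along(data), function(i) {
  if(!any(data[[i]] %in% unlist(data[-i]))) {
    # no overlap with any other vector: keep as is
    data[[i]]
  } else if (any(data[[i]] %in% unlist(data[seq_len(i-1)]))) {
    # overlaps with an earlier vector: already handled there
    NULL 
  } else {
    # overlaps only with later vectors: take the union with them
    z <- lapply(data[-seq_len(i)], intersect,  data[[i]]) 
    z <- names(z[sapply(z, length) >= 1L])
    if (is.null(z)) NULL else union(data[[i]], unlist(data[z]))
  }
})
x[!sapply(x, is.null)]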
#[[1]]
#[1] "g" "k"
#
#[[2]]
#[1] "a" "b" "c" "d"

This works for the original sample data, MrFlick's sample data, and akrun's sample data.
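Because an approach can pass on one sample and fail on another, a small checker may help when comparing candidate results. The sketch below is an illustration only: `is_valid_grouping` is a hypothetical name, and it assumes each returned group contains no repeated values. The input `x` is the chained sample data raised in the comments of this thread.

```r
# TRUE when `groups` is an acceptable answer for the input list `x`
is_valid_grouping <- function(groups, x) {
  all_vals <- unlist(groups, use.names = FALSE)
  disjoint <- !anyDuplicated(all_vals)       # no value appears in two groups
  complete <- setequal(all_vals, unlist(x))  # every input value is covered
  # every input vector must sit entirely inside a single group
  contained <- all(sapply(x, function(v)
    any(sapply(groups, function(g) all(v %in% g)))))
  disjoint && complete && contained
}

x <- list(v1 = c("a", "b"), v2 = c("b", "c"), v3 = c("a", "d"),
          v4 = c("g", "k"), v5 = c("c", "d"))
good <- list(c("a", "b", "c", "d"), c("g", "k"))
bad  <- list(c("a", "b", "c"), c("a", "d"), c("g", "k"))  # incomplete merge

is_valid_grouping(good, x)
is_valid_grouping(bad, x)
```

The first call accepts the fully merged result. The second rejects the partial one, because "a" appears in two groups and v5 fits inside neither.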

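【Comments】:

@Crops, good point! I have updated my answer with a modified function.

This does not seem to work well with the sample data: data <- list(v1=c("a", "b"), v2=c("b", "c"), v3=c("a", "d"), v4=c("g", "k"), v5=c("c", "d")). It also returns incomplete subsets of the data as proper groups.

Good catch, @MrFlick! I have updated my answer again.

@docendodiscimus I just saw your update. However, it still does not seem to work for data <- list(v1=c('g', 'k'), v2= letters[1:4], v3= c('b', 'c', 'd', 'a')). By the way, I have not changed my code, so it does not work with my code either.

@akrun, I have posted another update to my answer. You may want to check yours for these cases as well.

【Answer 3】: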

Efficiency be damned, do you guys even sleep? Base R only, and much slower than the fastest answer. Since I wrote it, I might as well post it.

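f.union = function(x) {
  repeat{
    n = length(x)
    # m[i,j] is TRUE when elements i and j share at least one value
    m = matrix(F, nrow = n, ncol = n)
    for (i in 1:n){
      for (j in 1:n) {
        m[i,j] = any(x[[i]] %in% x[[j]])
      }
    }
    # merge every element with all elements it overlaps
    o = apply(m, 2, function(v) Reduce(union, x[v]))
    # stop once each element overlaps only itself
    if (all(apply(m, 1, sum)==1)) return(o) else x=unique(o)
  }
}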

f.union(data)

[[1]]
[1] "a" "b" "c" "d" "n"

[[2]]
[1] "g" "h" "k" "i"

Because I like slow. (Libraries were loaded outside the benchmark.)

Unit: microseconds
    expr      min        lq      mean    median        uq       max neval
   vlo()  896.435 1070.6540 1315.8194 1129.4710 1328.6630  7859.999  1000
 akrun()  596.263  658.6590  789.9889  694.1360  804.9035  3470.158  1000
 flick()  805.854  928.8160 1160.9509 1001.8345 1172.0965  5780.824  1000
  josh() 2427.752 2693.0065 3344.8671 2943.7860 3524.1550 16505.909  1000 <- deleted :-(
   doc()  254.462  288.9875  354.6084  302.6415  338.9565  2734.795  1000

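【Comments】:

I am too tired. Going to sleep. I thought there was a fifth answer that needed RBGL, but I may just be imagining things. Your solution is the fastest.

OK, RBGL is a cool Bioconductor package. I just ran the fifth answer, which its author deleted.

"Coffee! Because you can sleep when you're dead!" (Not mine; easy to find on CafePress, for example.)

【Answer 4】: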

One option is to use combn and then find the intersections. There may be simpler options.

indx <- combn(names(data),2)
lst <- lapply(split(indx, col(indx)), 
        function(i) Reduce(`intersect`,data[i]))
indx1 <- names(lst[sapply(lst, length)>0])
indx2 <- indx[,as.numeric(indx1)]
indx3 <- apply(indx2,2, sort)
lapply(split(1:ncol(indx3), indx3[1,]),
   function(i) unique(unlist(data[c(indx3[,i])], use.names=FALSE)))
#$v1
#[1] "a" "b" "c" "d" "n"

#$v2
#[1] "g" "h" "k" "i"

Update

You can use combnPrim from library(gRbase) to speed this up. Using a slightly larger dataset

library(gRbase)
set.seed(25)
data <- setNames(lapply(1:1e3,function(i)sample(letters,
         sample(1:20), replace=FALSE)), paste0("v", 1:1000))

and comparing with the fastest answer. These are modified functions based on the OP's comments to @docendo discimus.

akrun2M <- function(){
     ind <- sapply(seq_along(data), function(i){#copied from @docendo discimus
            !any(data[[i]] %in% unlist(data[-i]))
              })
     data1 <- data[!ind] 
     indx <- combnPrim(names(data1),2)
     lst <- lapply(split(indx, col(indx)), 
              function(i) Reduce(`intersect`,data1[i]))
     indx1 <- names(lst[sapply(lst, length)>0])
     indx2 <- indx[,as.numeric(indx1)]
     indx3 <- apply(indx2,2, sort)
     c(data[ind],lapply(split(1:ncol(indx3), indx3[1,]),
        function(i) unique(unlist(data[c(indx3[,i])], use.names=FALSE))))
    }

doc2 <- function(){
      x <- lapply(seq_along(data), function(i) {
          if(!any(data[[i]] %in% unlist(data[-i]))) {
               data[[i]]
            }
          else {
            z <- unlist(data[names(unlist(lapply(data[-c(1:i)],
                                     intersect, data[[i]])))]) 
          if (is.null(z)){ 
               z
               }
          else union(data[[i]], z)
        }
   })
x[!sapply(x, is.null)]
}

Benchmarks

 microbenchmark(doc2(), akrun2M(), times=10L)
 # Unit: seconds
 #    expr      min       lq     mean   median       uq      max neval  cld
 #   doc2() 35.43687 53.76418 54.77813 54.34668 62.86665 67.76754    10   b
 #akrun2M() 26.64997 28.74721 38.02259 35.35081 47.56781 49.82158    10   a

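【Comments】:

I wrote this a few days ago and just wanted to let you know that I took the last two lines from your solution, so I owe you thanks. It is stable, but slow. I am thinking of replacing all the for loops with lapply or something similar.

@AnoushiravanR I can't imagine I wrote this. It looks foreign to me (a few years later) :-)

Hahaha, I just wonder what I was doing a few years ago when you wrote this. I was in complete darkness. I had been looking for an alternative to igraph for my problem and came up with this, but I have to learn igraph; it is very useful. It all started here: ***.com/questions/27520310/…

I don't want to take up any more of your time, dear Arun. You have been very generous today, as always. I don't know how to thank you.

【Answer 5】: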

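I ran into a similar problem, which sent me looking everywhere for a solution. Thanks to the many excellent contributors here, I eventually found a very good function, but when I saw this post I thought I would write my own custom function for it. It is not really elegant and it is far too slow, but I think it works quite well and will do the job for now until I make some improvements: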

anoush <- function(x) {
# First we check whether x is a list

  stopifnot(is.list(x)) 

# Then we take every element of the input and calculate the intersect between
# that element & others. In case there were some we would store the indices 
# in `vec` vector. So in the end we have a list called `ind` whose elements 
# are all the indices connected with the corresponding elements of the original 
# list for example first element of `ind` is `1`, `2`, `3` which means in 
# the original list these elements have common values.
  
  ind <- lapply(1:length(x), function(a) {
    vec <- c()
    for(i in 1:length(x)) {
      if(length(unique(base::intersect(x[[a]], x[[i]]))) > 0) {
        vec <- c(vec, i)
      }
    }
    vec 
    })

# Then we go on to again compare each element of `ind` with other elements
# in case there were any intersect, we will calculate the `union` of them.
# for each element we will end up with a list of accumulated values
# but in the end we use `Reduce` to capture only the last one. So for each
# element of `ind` we end up having a collection of indices that also 
# result in duplicated values. For example elements `1` through `5` of 
# `dup_ind` contains the same value cause in the original list these 
# elements have common values.

  dup_ind <- lapply(1:length(ind), function(a) {
    out <- c()
    for(i in 1:length(ind)) {
      if(length(unique(base::intersect(ind[[a]], ind[[i]]))) > 0) {
        out[[i]] <- union(ind[[a]], ind[[i]])
      }
      vec2 <- Reduce("union", out)
    }
    vec2
  }) 

# Here we get rid of the duplicated elements of the list by means of 
# `relist` function and since in this process all the duplicated elements
# will turn to `integer(0)` I have filtered those out.
  
  un <- unlist(dup_ind)
  res <- Map(`[`, dup_ind, relist(!duplicated(un), skeleton = dup_ind))
  res2 <- Filter(length, res)
  
  sapply(res2, function(a) unique(unlist(lapply(a, function(b) `[[`(x, b)))))
}

The OP's sample data

> anoush(data)

[[1]]
[1] "a" "b" "c" "d" "n"

[[2]]
[1] "g" "h" "k" "i"

Dear @akrun's sample data

data <- list(v1=c('g', 'k'), v2= letters[1:4], v3= c('b', 'c', 'd', 'a'))

> anoush(data)
[[1]]
[1] "g" "k"

[[2]]
[1] "a" "b" "c" "d"

【Comments】:

【Answer 6】:

In general, you cannot do better or faster than the Floyd-Warshall algorithm, as follows:

library(Rcpp)

cppFunction(
  "LogicalMatrix floyd(LogicalMatrix w){
    int n = w.nrow();
    for( int k = 0; k < n; k++ )
     for( int i = 0; i < (n-1); i++ )
      for( int j = i+1; j < n; j++ ) 
       if( w(i,k) && w(k,j) ) {
        w(i,j) = true;
        w(j,i) = true;
       }
   return w;
}")

fw.union<-function(x) {
  n<-length(x)
  w<-matrix(F,nrow=n,ncol=n)
  for( i in 1:n ) {
   w[i,i]<-T
  }
  for( i in 1:(n-1) ) {
   for( j in (i+1):n ) {
     w[i,j]<-w[j,i]<- any(x[[i]] %in% x[[j]])
   }
  }
 apply( unique( floyd(w) ), 1, function(y)  Reduce(union,x[y])  )
}

It would be interesting to run benchmarks, though. Preliminary tests suggest that my implementation is 2-3 times faster than Vlo's.
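For readers without a C++ toolchain, the closure step can be approximated in plain R. The sketch below is an assumption-laden stand-in rather than the answer's code: `closure` is a hypothetical name, it implements the Warshall reachability update with `outer`, and no claim is made about how its speed compares with the compiled version.

```r
# Transitive closure of a logical adjacency matrix (Warshall's update)
closure <- function(w) {
  for (k in seq_len(nrow(w))) {
    # anything that reaches k also reaches everything k reaches
    w <- w | outer(w[, k], w[k, ])
  }
  w
}

data <- list(v1 = c("a", "b", "c"), v2 = c("g", "h", "k"),
             v3 = c("c", "d"), v4 = c("n", "a"), v5 = c("h", "i"))

# direct overlaps, including each vector with itself
w <- sapply(data, function(a) sapply(data, function(b) any(a %in% b)))
reach <- closure(w)

# one group per distinct row of the closure
groups <- lapply(which(!duplicated(reach)),
                 function(i) Reduce(union, data[reach[i, ]]))
```

Each distinct row of `reach` marks one connected group, so taking the union over the vectors flagged in that row gives the merged result for the sample data.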

【Comments】:
