Fast count() intersection of two string arrays

Posted 2023-03-22


【Title】: Fast count() intersection of two string arrays 【Posted】: 2012-12-16 10:46:52 【Question】:

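I need to count the number of elements in the intersection of two large string arrays, and I need to do it very fast.

I am using the following code: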

arr1[i].Intersect(arr2[j]).Count()

For CPU time, the VS Profiler reports:

85.1% in System.Linq.Enumerable.Count()
0.3% in System.Linq.Enumerable.Intersect()

Unfortunately, it can take hours to do all the work.

How can I make it faster?

【Comments】:

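The numbers you get from the profiler may not be "right", because Intersect is not executed when you call .Intersect(); the whole query is executed when you call .Count(). That is the nature of LINQ. I suspect there is more work in intersecting than in counting.

If you really need performance here, try doing it without LINQ. If the data is big enough, put it in a database, or set up a cluster of machines/threads and maybe do some MapReduce.

Are you intersecting the strings in arr1 and arr2, or each character of each string in arr1 with each character of each string in arr2?

I hardly think this is your bottleneck. In my tests, intersecting + counting 2 arrays of 5 million strings (avg. length = 60 chars) takes about 3.5 seconds... How big are your arrays?

【Answer 1】: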

You can use a HashSet built from arr2:

HashSet<string> arr2Set = new HashSet<string>(arr2);
arr1.Where(x=>arr2Set.Contains(x)).Count();
              ------------------
                      |
                      |->HashSet's contains method executes quickly using hash-based lookup..

Not counting the conversion from arr2 to arr2Set, this should be O(n).
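A minimal sketch of this approach, assuming plain string arrays. The class and method names (`IntersectionCount`, `CountWithContains`, `CountDistinctCommon`) are made up for illustration and are not part of the answer. One caveat worth showing: the `Contains` version counts every occurrence in `arr1`, so repeated strings are counted repeatedly, whereas LINQ's `Intersect` yields each common string once. A `Remove`-based variant matches that behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class IntersectionCount
{
    // Counts every element of arr1 that also occurs in arr2.
    // Duplicates in arr1 are counted once per occurrence.
    public static int CountWithContains(string[] arr1, string[] arr2)
    {
        var arr2Set = new HashSet<string>(arr2);
        return arr1.Count(x => arr2Set.Contains(x));
    }

    // Counts each distinct common string once,
    // like arr1.Intersect(arr2).Count().
    public static int CountDistinctCommon(string[] arr1, string[] arr2)
    {
        var arr2Set = new HashSet<string>(arr2);
        int count = 0;
        foreach (string x in arr1)
        {
            // Remove returns true only the first time x is found.
            if (arr2Set.Remove(x)) count++;
        }
        return count;
    }
}
```

For `arr1 = { "a", "b", "b", "c" }` and `arr2 = { "b", "c", "d" }`, the first method returns 3 and the second returns 2. Which one is right depends on whether duplicates should count.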

【Comments】:

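This is the best way! The most efficient.

This method works for Intersect. Is it possible to apply the same principle to Union?

This gives an incorrect answer if arr1 contains duplicate strings that are also in arr2. You need Where(x => arr2Set.Remove(x)). Also, the Contains version ends up slower as the arrays get bigger, while the version with Remove does seem to stay faster.

Just a syntax improvement: arr1.Where(x=>arr2Set.Contains(x)).Count(); can be replaced with arr1.Count(arr2Set.Contains);

【Answer 2】: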

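I suspect the profiler shows the time being spent in Count because that is where the collection is actually enumerated (Intersect is lazily evaluated and does not run until you need the result).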
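The deferred execution described above can be observed with a small sketch. `DeferredDemo` and its counter are hypothetical names used only for this illustration; the point is that no element is read from the source until `Count()` enumerates the query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class, for illustration only.
public static class DeferredDemo
{
    // Returns how many source elements had been read right after
    // calling Intersect, how many after calling Count, and the count.
    public static (int afterIntersect, int afterCount, int common) Run()
    {
        int pulled = 0;

        // Iterator that records every element it hands out.
        IEnumerable<string> Source()
        {
            foreach (var s in new[] { "a", "b", "c" })
            {
                pulled++;
                yield return s;
            }
        }

        // Building the query does not enumerate anything yet.
        var query = Source().Intersect(new[] { "b", "c", "d" });
        int afterIntersect = pulled;

        // Count() drives the enumeration, so the work shows up here.
        int common = query.Count();
        int afterCount = pulled;

        return (afterIntersect, afterCount, common);
    }
}
```

This is why a profiler attributes nearly all of the time to `Count()`: the intersection work runs inside that call.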

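I believe Intersect should have some internal optimizations that make it reasonably fast, but you could try using a HashSet<string>, so you can be sure the intersection is done without searching the inner array for each element: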

HashSet<string> set = new HashSet<string>(arr1);
set.IntersectWith(arr2);
int count = set.Count;

【Comments】:

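Oddly, in my tests this ended up slower than both the original version of Some1's answer and my corrected version, although I had expected it to do great.

【Answer 3】: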

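Well, the intersection is probably N^2.

To make it faster, quicksort both arrays, then walk through both arrays and count the intersections.

Too lazy to test how fast that would be, but it should be O(n log n + n).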
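The sort-then-walk idea can be sketched as follows. This is an illustration under assumptions, not code from the answer: `SortedIntersection` is a hypothetical name, `Array.Sort` is used in place of a hand-written quicksort, ordinal string comparison is assumed, and each common string is counted once even if it is repeated.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class SortedIntersection
{
    public static int Count(string[] arr1, string[] arr2)
    {
        // Sort copies so the caller's arrays stay untouched.
        var a = (string[])arr1.Clone();
        var b = (string[])arr2.Clone();
        Array.Sort(a, StringComparer.Ordinal);
        Array.Sort(b, StringComparer.Ordinal);

        int i = 0, j = 0, count = 0;
        // Two-pointer walk: always advance the side with the smaller value.
        while (i < a.Length && j < b.Length)
        {
            int cmp = string.CompareOrdinal(a[i], b[j]);
            if (cmp < 0) i++;
            else if (cmp > 0) j++;
            else
            {
                count++;
                string match = a[i];
                // Skip repeats so each common string is counted once.
                while (i < a.Length && a[i] == match) i++;
                while (j < b.Length && b[j] == match) j++;
            }
        }
        return count;
    }
}
```

The two sorts dominate the cost and the walk itself is linear, which is where the O(n log n + n) estimate comes from.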

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            const int arrsize = 1000000;
            Random rnd = new Random(42);
            string[] arr1 = new string[arrsize];
            string[] arr2 = new string[arrsize];
            for (int i = 0; i < arrsize; i++)
            {
                arr1[i] = rnd.Next().ToString();
                arr2[i] = rnd.Next().ToString();
            }

            // Each benchmark sits in its own block so the local names can be reused.
            {
                // LINQ Intersect directly on the arrays
                var stamp = Stopwatch.GetTimestamp();
                arr1.Intersect(arr2).Count();
                Console.WriteLine("array" + (Stopwatch.GetTimestamp() - stamp));
            }

            {
                // HashSet built before the timer starts
                HashSet<string> set = new HashSet<string>(arr1);
                var stamp = Stopwatch.GetTimestamp();
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("HashSet" + (Stopwatch.GetTimestamp() - stamp));
            }

            {
                // HashSet construction included in the timing
                var stamp = Stopwatch.GetTimestamp();
                HashSet<string> set = new HashSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("HashSet + new" + (Stopwatch.GetTimestamp() - stamp));
            }

            {
                // SortedSet construction included in the timing
                var stamp = Stopwatch.GetTimestamp();
                SortedSet<string> set = new SortedSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet +new " + (Stopwatch.GetTimestamp() - stamp));
            }

            {
                // SortedSet built before the timer starts
                SortedSet<string> set = new SortedSet<string>(arr1);
                var stamp = Stopwatch.GetTimestamp();
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet without new " + (Stopwatch.GetTimestamp() - stamp));
            }
        }
    }
}

Results (elapsed Stopwatch ticks, lower is faster)

array 914,637

HashSet 816,119

HashSet + new 1,150,978

SortedSet + new 16,173,836

SortedSet without new 7,946,709

It seems the best approach is to keep a ready-made hash set around.

【Comments】:

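The Intersect in the LINQ extensions is implemented as O(n log n). Intersect in HashSet is also implemented as O(n log n); I just looked at the disassembly.

@user287107: That is not true. Both Intersect and HashSet use hash tables, so the hash set is asymptotically O(1) and Intersect is O(n). Of course, something with O(log n) complexity can sometimes be faster (it depends on the GetHashCode implementation, the buckets and other factors), but they are not implemented with binary trees or anything else that has O(log n) complexity.

@user287107: c-sharp-snippets.blogspot.it/2010/03/…

I know it may be faster, but take a look at the implementation of the IntersectIterator class. The iterator does the following: 1) creates a new set, 2) iterates over source1, adding the items to the set, 3) iterates over source2 and returns the element if the remove succeeds. Add and remove are O(log n), so the loop is O(n log n).

【Answer 4】: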

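When you use sets, your complexity will be around O((n log n)*(m log m)).

I think the following should be faster, but I am not sure whether it is then O((n log n)+(m log m)).

One possibility would be:
var Set1 = arr1[i].Distinct().ToArray(); // only needed if arr1 or arr2 may contain duplicates
var Set2 = arr2[j].Distinct().ToArray();

// |A ∩ B| = |A| + |B| - |A ∪ B|; Concat joins the two sequences (Append would only add a single element)
nCount = Set1.Count() + Set2.Count() - Set1.Concat(Set2).Distinct().Count();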

【Comments】:

【Answer 5】:

Build a HashSet from the smaller array, then iterate over the larger array and increment a counter whenever the item exists in the hash set.
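One way to write that down, as a sketch only: `SmallerSetCounter` is a hypothetical name, and this version counts every matching occurrence in the larger array, so duplicates there are counted more than once.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class SmallerSetCounter
{
    public static int Count(string[] arr1, string[] arr2)
    {
        // Hash the smaller array to keep the set's memory footprint low.
        string[] small = arr1.Length <= arr2.Length ? arr1 : arr2;
        string[] large = ReferenceEquals(small, arr1) ? arr2 : arr1;

        var set = new HashSet<string>(small);
        int counter = 0;
        foreach (string s in large)
        {
            // Hash-based lookup, O(1) on average per element.
            if (set.Contains(s)) counter++;
        }
        return counter;
    }
}
```

Hashing the smaller side mainly saves memory and set-construction time; the total number of hash operations is still proportional to the combined length of both arrays.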

【Comments】:
