Efficiently merge string arrays in .NET, keeping distinct values

Posted 2023-02-23

【Title】: Efficiently merge string arrays in .NET, keeping distinct values 【Posted】: 2010-09-13 20:47:14 【Question】:

I'm using .NET 3.5. I have two string arrays, which may share one or more values:

string[] list1 = new string[] { "apple", "orange", "banana" };
string[] list2 = new string[] { "banana", "pear", "grape" };

I'd like a way to merge them into one array with no duplicate values:

{ "apple", "orange", "banana", "pear", "grape" }

I can do this with LINQ:

string[] result = list1.Concat(list2).Distinct().ToArray();

But I imagine that's not very efficient for large arrays.

Is there a better way?

【Question comments】:

【Answer 1】:

string[] result = list1.Union(list2).ToArray();

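From MSDN: "This method excludes duplicates from the return set. This is different behavior to the Concat<TSource> method, which returns all the elements in the input sequences including duplicates."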
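To make the difference concrete, here is a minimal sketch that runs both calls on the arrays from the question. The class name `UnionDemo` and the helper `MergeDistinct` are made up for illustration; only `Concat`, `Union`, `Count` and `ToArray` come from LINQ.

```csharp
using System;
using System.Linq;

public static class UnionDemo
{
    // Hypothetical helper: wraps the one-liner from the answer above.
    public static string[] MergeDistinct(string[] first, string[] second)
    {
        return first.Union(second).ToArray();
    }

    public static void Main()
    {
        string[] list1 = { "apple", "orange", "banana" };
        string[] list2 = { "banana", "pear", "grape" };

        // Concat keeps the second "banana", so this prints 6.
        Console.WriteLine(list1.Concat(list2).Count());

        // Union yields each value once, in first-seen order:
        // apple, orange, banana, pear, grape
        Console.WriteLine(string.Join(", ", MergeDistinct(list1, list2)));
    }
}
```

Both calls compare strings with the default equality comparer, which is case-sensitive. If "Apple" and "apple" should count as the same value, `Union` also has an overload that takes an `IEqualityComparer<string>` such as `StringComparer.OrdinalIgnoreCase`.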

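【Comments】:

I came back to this thread to post exactly this solution. I believe it's ideal in every way!

A small nitpick, but the return type of Union is IEnumerable<string>, so you need to add a ToArray() to get a string[].

Still useful 10 years later :D

【Answer 2】: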

.NET 3.5 introduced the HashSet class, which can do this:

IEnumerable<string> mergedDistinctList = new HashSet<string>(list1).Union(list2);

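I'm not sure about performance, but it should beat the LINQ example you gave.

EDIT: I stand corrected. The lazy implementation of Concat and Distinct has a key memory and speed advantage. Concat/Distinct is about 10% faster, and it avoids making multiple copies of the data.

I confirmed this by measurement. The output was: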

Setting up arrays of 3000000 strings overlapping by 300000
Starting Hashset...
HashSet: 00:00:02.8237616
Starting Concat/Distinct...
Concat/Distinct: 00:00:02.5629681

That output was produced by this code:

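        // Requires: using System; using System.Diagnostics;
        //           using System.Collections.Generic; using System.Linq;
        int num = 3000000;
        int num10Pct = (int)(num / 10);

        Console.WriteLine(String.Format("Setting up arrays of {0} strings overlapping by {1}", num, num10Pct));
        string[] list1 = Enumerable.Range(1, num).Select((a) => a.ToString()).ToArray();
        // Note: the second argument of Enumerable.Range is a count,
        // so list2 holds num + num10Pct strings.
        string[] list2 = Enumerable.Range(num - num10Pct, num + num10Pct).Select((a) => a.ToString()).ToArray();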

        Console.WriteLine("Starting Hashset...");
        Stopwatch sw = new Stopwatch();
        sw.Start();
        string[] merged = new HashSet<string>(list1).Union(list2).ToArray();
        sw.Stop();
        Console.WriteLine("HashSet: " + sw.Elapsed);

        Console.WriteLine("Starting Concat/Distinct...");
        sw.Reset();
        sw.Start();
        string[] merged2 = list1.Concat(list2).Distinct().ToArray();
        sw.Stop();
        Console.WriteLine("Concat/Distinct: " + sw.Elapsed);

【Comments】:

Actually, I'd expect it to be less efficient than the Concat/Distinct way, because Union needs to build a second HashSet.

【Answer 3】:

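Disclaimer: this is premature optimization. For your example arrays, use the 3.5 extension methods. Until you know you have a performance problem in this area, you should use library code.

If you can sort the arrays, or they are already sorted when you reach that point in the code, you can use the following methods.

These pull one item from each source and yield the "lowest" one, then fetch a new item from the source it came from, until both sources are exhausted. When the current items from the two sources are equal, the one from the first source is yielded and both sources advance past it.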

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
    IEnumerable<T> source2)
{
    return Merge(source1, source2, Comparer<T>.Default);
}

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
    IEnumerable<T> source2, IComparer<T> comparer)
{
    #region Parameter Validation

    if (Object.ReferenceEquals(null, source1))
        throw new ArgumentNullException("source1");
    if (Object.ReferenceEquals(null, source2))
        throw new ArgumentNullException("source2");
    if (Object.ReferenceEquals(null, comparer))
        throw new ArgumentNullException("comparer");

    #endregion

    using (IEnumerator<T>
        enumerator1 = source1.GetEnumerator(),
        enumerator2 = source2.GetEnumerator())
    {
        Boolean more1 = enumerator1.MoveNext();
        Boolean more2 = enumerator2.MoveNext();

        while (more1 && more2)
        {
            Int32 comparisonResult = comparer.Compare(
                enumerator1.Current,
                enumerator2.Current);
            if (comparisonResult < 0)
            {
                // enumerator 1 has the "lowest" item
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
            }
            else if (comparisonResult > 0)
            {
                // enumerator 2 has the "lowest" item
                yield return enumerator2.Current;
                more2 = enumerator2.MoveNext();
            }
            else
            {
                // they're considered equivalent, only yield it once
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
                more2 = enumerator2.MoveNext();
            }
        }

        // Yield rest of values from non-exhausted source
        while (more1)
        {
            yield return enumerator1.Current;
            more1 = enumerator1.MoveNext();
        }
        while (more2)
        {
            yield return enumerator2.Current;
            more2 = enumerator2.MoveNext();
        }
    }
}
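Note that if one of the sources contains duplicates, you may see duplicates in the output. To remove such duplicates from an already sorted list, use the following method: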

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source)
{
    return CheapDistinct<T>(source, Comparer<T>.Default);
}

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source,
    IComparer<T> comparer)
{
    #region Parameter Validation

    if (Object.ReferenceEquals(null, source))
        throw new ArgumentNullException("source");
    if (Object.ReferenceEquals(null, comparer))
        throw new ArgumentNullException("comparer");

    #endregion

    using (IEnumerator<T> enumerator = source.GetEnumerator())
    {
        if (enumerator.MoveNext())
        {
            T item = enumerator.Current;

            // scan until different item found, then produce
            // the previous distinct item
            while (enumerator.MoveNext())
            {
                if (comparer.Compare(item, enumerator.Current) != 0)
                {
                    yield return item;
                    item = enumerator.Current;
                }
            }

            // produce last item that is left over from above loop
            yield return item;
        }
    }
}

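Note that none of these methods uses an internal data structure to keep a copy of the data, so they are cheap if the input is sorted. If you can't or won't guarantee that, you should use the 3.5 extension methods you've already found.

Here is example code that calls the methods above: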

String[] list_1 = { "apple", "orange", "apple", "banana" };
String[] list_2 = { "banana", "pear", "grape" };

Array.Sort(list_1);
Array.Sort(list_2);

IEnumerable<String> items = Merge(
    CheapDistinct(list_1),
    CheapDistinct(list_2));
foreach (String item in items)
    Console.Out.WriteLine(item);
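Since the question is specifically about arrays, the same sorted-merge idea can also be sketched with two indices instead of enumerators. This is an illustrative variant, not the answer's code: `SortedArrayMerge` and `MergeSortedDistinct` are hypothetical names, and the sketch assumes both inputs were sorted with `StringComparer.Ordinal`, so the sort order and the merge comparison agree.

```csharp
using System;
using System.Collections.Generic;

public static class SortedArrayMerge
{
    // Hypothetical helper: merges two arrays that are ALREADY sorted with
    // ordinal comparison, dropping duplicates within and across the inputs.
    public static string[] MergeSortedDistinct(string[] a, string[] b)
    {
        List<string> result = new List<string>(a.Length + b.Length);
        int i = 0, j = 0;
        while (i < a.Length || j < b.Length)
        {
            string next;
            if (j >= b.Length) next = a[i++];        // b is exhausted
            else if (i >= a.Length) next = b[j++];   // a is exhausted
            else
            {
                int c = string.CompareOrdinal(a[i], b[j]);
                if (c < 0) next = a[i++];
                else if (c > 0) next = b[j++];
                else { next = a[i++]; j++; }         // equal: take one, skip both
            }

            // With sorted input, a duplicate can only equal the last item added.
            if (result.Count == 0 ||
                !string.Equals(result[result.Count - 1], next, StringComparison.Ordinal))
            {
                result.Add(next);
            }
        }
        return result.ToArray();
    }

    public static void Main()
    {
        string[] list1 = { "apple", "orange", "apple", "banana" };
        string[] list2 = { "banana", "pear", "grape" };
        Array.Sort(list1, StringComparer.Ordinal);
        Array.Sort(list2, StringComparer.Ordinal);

        // Prints: apple, banana, grape, orange, pear
        Console.WriteLine(string.Join(", ", MergeSortedDistinct(list1, list2)));
    }
}
```

Unlike the iterator version, this sketch buffers the result in a list, so it gives up the streaming behaviour in exchange for returning a string[] directly.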

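【Comments】:

+1 for thinking outside the box: what if they're sorted? That's a lot of code, though. Then again, the time spent sorting them may well defeat the whole purpose. Hence the disclaimer :)

【Answer 4】: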

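Why do you think it would be inefficient? As far as I'm aware, both Concat and Distinct are evaluated lazily, with Distinct using a HashSet behind the scenes to keep track of the elements that have already been returned.

I'm not sure how you would make it more efficient than that in a general way :)

EDIT: Distinct actually uses Set (an internal class) rather than HashSet, but the gist is still correct. This is a really good example of how neat LINQ is. The simplest answer is pretty much as efficient as you can get without more domain knowledge.

The effect is equivalent to: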

public static IEnumerable<T> DistinctConcat<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    HashSet<T> returned = new HashSet<T>();
    foreach (T element in first)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
    foreach (T element in second)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
}
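The lazy evaluation mentioned in this answer can be observed directly. The following sketch wraps each array in a hypothetical counting iterator (`LazyDemo.Counting`, invented for this illustration) and shows that building the Concat/Distinct query reads nothing, and that taking only the first few results never reaches the end of the second array.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyDemo
{
    // Number of elements read from the wrapped sources so far.
    public static int Pulled;

    // Hypothetical wrapper: passes elements through and counts each read.
    public static IEnumerable<string> Counting(IEnumerable<string> source)
    {
        foreach (string s in source)
        {
            Pulled++;
            yield return s;
        }
    }

    public static void Main()
    {
        string[] list1 = { "apple", "orange", "banana" };
        string[] list2 = { "banana", "pear", "grape" };

        IEnumerable<string> query =
            Counting(list1).Concat(Counting(list2)).Distinct();

        // Prints 0: building the query does not enumerate anything.
        Console.WriteLine(Pulled);

        // Ask for four distinct values only.
        List<string> firstFour = query.Take(4).ToList();

        // Prints: apple, orange, banana, pear
        Console.WriteLine(string.Join(", ", firstFour.ToArray()));

        // Stays below 6, because "grape" was never read.
        Console.WriteLine(Pulled);
    }
}
```

Calling ToArray() on the full query, as in the question, does read every element once; the laziness matters because no intermediate concatenated array is built along the way.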

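【Comments】:

【Answer 5】:

You don't know which approach is faster until you measure it. The LINQ way is elegant and easy to understand.

Another way is to implement a set as a hash array (a Dictionary) and add all the elements of both arrays to it. Then use set.Keys.ToArray() to create the resulting array.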
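A minimal sketch of that Dictionary idea might look like the following. The names `DictionaryMerge` and `MergeWithDictionary` are invented for illustration, and the bool values are throwaway placeholders, since only the keys matter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DictionaryMerge
{
    // Hypothetical helper: uses Dictionary keys as a set of distinct strings.
    public static string[] MergeWithDictionary(string[] first, string[] second)
    {
        Dictionary<string, bool> set =
            new Dictionary<string, bool>(first.Length + second.Length);

        // The indexer overwrites an existing key instead of throwing,
        // so duplicates collapse silently.
        foreach (string s in first) set[s] = true;
        foreach (string s in second) set[s] = true;

        // Keys.ToArray() is the LINQ extension method.
        return set.Keys.ToArray();
    }

    public static void Main()
    {
        string[] list1 = { "apple", "orange", "banana" };
        string[] list2 = { "banana", "pear", "grape" };

        string[] merged = MergeWithDictionary(list1, list2);
        Console.WriteLine(merged.Length); // prints 5
    }
}
```

Two caveats for this sketch: the documentation does not guarantee the order of Dictionary keys, so sort the result if order matters, and a null element would throw ArgumentNullException because Dictionary does not accept null keys.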

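【Comments】:

【Answer 6】:

Creating a hashtable with your values as keys (adding only those keys that are not already present) and then converting the keys to an array could be a viable solution.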

【Comments】:
