Most efficient way to find all modes in a List of randomly generated integers and how often they occurred

Posted 2023-03-24

【Asked】: 2019-01-18 03:29:12

【Question】:

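If a method written in C# will be passed either null or between 0 and 6,000,000 randomly generated, unsorted integers, what is the most efficient way to determine all modes and how many times each occurred? In particular, can anyone help me with the LINQ-based solution I'm struggling with?

Here is what I have so far:

The closest LINQ solution I have so far only grabs the first mode it finds and does not report the number of occurrences. It is also about 7 times slower on my computer than my ugly, bulky implementation, which is horrifying.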

    int mode = numbers.GroupBy(number => number).OrderByDescending(group => group.Count()).Select(k => k.Key).FirstOrDefault();

My hand-coded method:

    public class NumberCount
    {
        public int Value;
        public int Occurrences;

        public NumberCount(int value, int occurrences)
        {
            Value = value;
            Occurrences = occurrences;
        }
    }

    private static List<NumberCount> findMostCommon(List<int> integers)
    {
        if (integers == null)
            return null;
        else if (integers.Count < 1)
            return new List<NumberCount>();

        List<NumberCount> mostCommon = new List<NumberCount>();

        // Note: this sorts the caller's list in place.
        integers.Sort();

        // Collapse the sorted list into (value, occurrences) runs.
        mostCommon.Add(new NumberCount(integers[0], 1));
        for (int i=1; i<integers.Count; i++)
        {
            if (mostCommon[mostCommon.Count - 1].Value != integers[i])
                mostCommon.Add(new NumberCount(integers[i], 1));
            else
                mostCommon[mostCommon.Count - 1].Occurrences++;
        }

        // Keep only the runs that share the highest occurrence count.
        List<NumberCount> answer = new List<NumberCount>();
        answer.Add(mostCommon[0]);
        for (int i=1; i<mostCommon.Count; i++) 
        {
            if (mostCommon[i].Occurrences > answer[0].Occurrences)
            {
                if (answer.Count == 1)
                {
                    answer[0] = mostCommon[i];
                }
                else
                {
                    answer = new List<NumberCount>();
                    answer.Add(mostCommon[i]);
                }
            }
            else if (mostCommon[i].Occurrences == answer[0].Occurrences)
            {
                answer.Add(mostCommon[i]);
            }
        }

        return answer;
    }

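Basically, I'm trying to get an elegant, compact LINQ solution that is at least as fast as my ugly method. Thanks in advance for any suggestions.

【Comments】:

Why do you think LINQ would do better, or be less ugly? Your code looks straightforward. Besides that, I still don't get what you actually want to achieve. What is a mode in your case? How is it created?

LINQ does most of the job in one line. However, I'm stuck on the last bit. The mode is the most frequently occurring number.

@HimBromBeere: I believe in this context "mode" = "the number that occurs the greatest number of times". So in the array 1, 3, 2, 4, 1, 4, 5, the modes are 4 and 1, because both occur twice and nothing occurs more than twice.

Are your random numbers within a given range? Or do you know that there will be at most a small number of distinct values? For example, if there are only ten values, you can create a simple collection of counts and then walk that collection to see which count is largest (which is nice and fast, since you only have ten things to sort/compare). If you could have 5,000,000 distinct integers, this approach becomes much less efficient...

The only restriction is that they are valid non-negative 32-bit integers.

【Answer 1】: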

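Different code is more efficient for different lengths, but as the length approaches 6 million, this approach seems to be the fastest. In general, LINQ is not for improving the speed of code but its understandability and maintainability, depending on how you feel about functional programming styles.

Your code is fairly fast, and beats the simple LINQ approaches using GroupBy. It gains a good advantage from the fact that List.Sort is highly optimized, and my code uses it too, but on a local copy of the list to avoid changing the source. My code is similar in approach to yours, but it is designed around doing all the needed computation in a single pass. It uses an extension method I re-optimized for this problem, called GroupByRuns, which returns an IEnumerable<IGrouping<T,T>>. It is also hand-expanded rather than relying on the generic GroupByRuns, which takes extra arguments for key and result selection. Since .NET doesn't have an end-user-accessible IGrouping<,> implementation (!), I rolled my own that implements ICollection to optimize Count().

This code runs about 1.3 times as fast as yours (after I slightly optimized yours by about 5%).

First, the RunGrouping class, which represents one run of equal values: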

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public class RunGrouping<T> : IGrouping<T, T>, ICollection<T> {
    public T Key { get; }
    int Count;

    // Exposing Count through ICollection<T> lets Enumerable.Count() return it without enumerating.
    int ICollection<T>.Count => Count;
    public bool IsReadOnly => true;

    public RunGrouping(T key, int count) {
        Key = key;
        Count = count;
    }

    public IEnumerator<T> GetEnumerator() {
        for (int j1 = 0; j1 < Count; ++j1)
            yield return Key;
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    public void Add(T item) => throw new NotImplementedException();
    public void Clear() => throw new NotImplementedException();
    public bool Contains(T item) => Count > 0 && EqualityComparer<T>.Default.Equals(Key, item);
    public void CopyTo(T[] array, int arrayIndex) => throw new NotImplementedException();
    public bool Remove(T item) => throw new NotImplementedException();
}

Second, the extension method on IEnumerable that groups the runs:

public static class IEnumerableExt {
    public static IEnumerable<IGrouping<T, T>> GroupByRuns<T>(this IEnumerable<T> src) {
        var cmp = EqualityComparer<T>.Default;
        bool notAtEnd = true;
        using (var e = src.GetEnumerator()) {
            bool moveNext() {
                return notAtEnd;
            }
            // Consumes consecutive equal elements and returns them as one run.
            IGrouping<T, T> NextRun() {
                var prev = e.Current;
                var ct = 0;
                while (notAtEnd && cmp.Equals(e.Current, prev)) {
                    ++ct;
                    notAtEnd = e.MoveNext();
                }
                return new RunGrouping<T>(prev, ct);
            }

            notAtEnd = e.MoveNext();
            while (notAtEnd)
                yield return NextRun();
        }
    }
}

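Finally, the extension method that finds the modes with the maximum count. Basically, it goes through the runs and keeps a record of those ints whose run count equals the current longest run count.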

public static class IEnumerableIntExt {
    public static IEnumerable<KeyValuePair<int, int>> MostCommon(this IEnumerable<int> src) {
        // Sort a local copy so the caller's sequence is left unchanged.
        var mysrc = new List<int>(src);
        mysrc.Sort();
        var maxc = 0;
        var maxmodes = new List<int>();
        foreach (var g in mysrc.GroupByRuns()) {
            var gc = g.Count();

            if (gc > maxc) {
                maxmodes.Clear();
                maxmodes.Add(g.Key);
                maxc = gc;
            }
            else if (gc == maxc)
                maxmodes.Add(g.Key);
        }

        return maxmodes.Select(m => new KeyValuePair<int, int>(m, maxc));
    }
}

Given an existing list of random integers rl, you can get the answer with:

var ans = rl.MostCommon();
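
The same sort-then-scan idea can be written without the IGrouping machinery. The following is only a minimal sketch of that idea, not the answer's actual code: the class name ModeBySorting and the method FindModes are hypothetical names introduced here for illustration. It sorts a local copy, walks each run of equal values once, and keeps the values whose run length ties the longest run seen so far.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the original answer.
public static class ModeBySorting
{
    // Returns every mode paired with the shared occurrence count.
    // A null or empty input yields an empty list.
    public static List<KeyValuePair<int, int>> FindModes(IEnumerable<int> source)
    {
        var result = new List<KeyValuePair<int, int>>();
        if (source == null)
            return result;

        var sorted = new List<int>(source); // local copy, so the caller's data stays unsorted
        sorted.Sort();

        int best = 0;
        int i = 0;
        while (i < sorted.Count)
        {
            int value = sorted[i];
            int start = i;
            while (i < sorted.Count && sorted[i] == value)
                i++;                        // consume one run of equal values
            int run = i - start;

            if (run > best)
            {
                best = run;                 // a longer run replaces all earlier candidates
                result.Clear();
            }
            if (run == best)
                result.Add(new KeyValuePair<int, int>(value, run));
        }
        return result;
    }
}
```

Calling ModeBySorting.FindModes(rl) on the list rl would give each mode as the Key and its count as the Value. Whether this is faster or slower than the GroupByRuns version has not been measured here.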

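【Comments】:

I tried your code, and it's great! LINQ is 3 times slower than your code. But I wonder whether there is a way to avoid the full sort (I mean sorting the complete list) in the LINQ approach?

@Dongdong There is, but it would be slower, because List.Sort is optimized C++ code that is faster than even single-pass LINQ code.

【Answer 2】: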

Personally, I would use a ConcurrentDictionary to update the counters, since a dictionary is faster to access. I use this approach a lot, and it is more readable.

  // create a dictionary
  var dictionary = new ConcurrentDictionary<int, int>();

  // list of your integers
  var numbers = new List<int>();

  // parallelize the iteration (we can, because ConcurrentDictionary is thread-safe-ish)
  numbers.AsParallel().ForAll((number) =>
  {
      // add the key with a value of 1 if it's not there; if it is there, use the lambda to increment it by 1
      dictionary.AddOrUpdate(number, 1, (key, old) => old + 1);
  });

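Then it is just a matter of getting the entry that occurs most, and there are many ways to do that. I don't fully understand your version, but getting the topmost one is just an aggregate like this:

var topMostOccurence = dictionary.Aggregate((x, y) => { return x.Value > y.Value ? x : y; });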
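
Note that the Aggregate call yields a single entry, so ties are lost, and Aggregate without a seed throws InvalidOperationException on an empty sequence. As a hedged sketch of one way to get every mode from the same counts, the helper below finds the highest count and then keeps all entries that reach it. The names DictionaryModes and AllModes are hypothetical and not from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the original answer.
public static class DictionaryModes
{
    // counts maps each number to how often it occurred, for example the
    // ConcurrentDictionary<int, int> built above (it enumerates as key/value pairs).
    public static List<KeyValuePair<int, int>> AllModes(IEnumerable<KeyValuePair<int, int>> counts)
    {
        var snapshot = counts.ToList();     // enumerate the dictionary once
        if (snapshot.Count == 0)
            return snapshot;                // no numbers, so no modes

        int max = snapshot.Max(pair => pair.Value);
        return snapshot.Where(pair => pair.Value == max).ToList();
    }
}
```

With the dictionary from the snippet above, var modes = DictionaryModes.AllModes(dictionary); would return each mode as Key with the shared count as Value.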

【Comments】:

【Answer 3】:

I tested with the code below on an Intel i7-8700K and got the following results:

Lambda: found 78 in 134 ms.

Manual: found 78 in 368 ms.

Dictionary: found 78 in 195 ms.

    static IEnumerable<int> GenerateNumbers(int amount)
    {
        Random r = new Random();
        for (int i = 0; i < amount; i++)
            yield return r.Next(100);
    }

    static void Main(string[] args)
    {
        var numbers = GenerateNumbers(6_000_000).ToList();

        Stopwatch sw = Stopwatch.StartNew();
        int mode = numbers.GroupBy(number => number).OrderByDescending(group => group.Count()).Select(k =>
        {
            int count = k.Count();
            return new { Mode = k.Key, Count = count };
        }).FirstOrDefault().Mode;
        sw.Stop();
        Console.WriteLine($"Lambda: found {mode} in {sw.ElapsedMilliseconds} ms.");


        sw = Stopwatch.StartNew();
        mode = findMostCommon(numbers)[0].Value;
        sw.Stop();
        Console.WriteLine($"Manual: found {mode} in {sw.ElapsedMilliseconds} ms.");

        // create a dictionary
        var dictionary = new ConcurrentDictionary<int, int>();

        sw = Stopwatch.StartNew();
        // parallelize the iteration (we can, because ConcurrentDictionary is thread-safe-ish)
        numbers.AsParallel().ForAll((number) =>
        {
            // add the key with a value of 1 if it's not there; if it is there, use the lambda to increment it by 1
            dictionary.AddOrUpdate(number, 1, (key, old) => old + 1);
        });
        mode = dictionary.Aggregate((x, y) => { return x.Value > y.Value ? x : y; }).Key;
        sw.Stop();
        Console.WriteLine($"Dictionary: found {mode} in {sw.ElapsedMilliseconds} ms.");


        Console.ReadLine();
    }

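【Comments】:

Lambda and Dictionary return only one mode, and do not report how many modes there are or how many times they occur. Also, Dictionary is running on a sorted List, because the same List is used throughout. In addition, you limit the values to 100. My results were as follows. With a maximum of 100 and the numbers reused, even though they get sorted in between: Lambda=173 Manual=477 Dictionary=210. With the maximum set to int.MaxValue and a fresh array for every run: Lambda=7276 Manual=1431 Dictionary=6381

【Answer 4】:

What you want: two or more numbers can occur the same number of times in an array, for example: 1,1,1,2,2,2,3,3,3

Your current code comes from here: Find the most occurring number in a List<int>. But it returns only one number, which is the wrong result.

The problem with LINQ is that the loop cannot end early when you do not want it to continue.

However, here is a LINQ query that produces the list you need: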

List<NumberCount> MaxOccurrences(List<int> integers)
{
    return integers?.AsParallel()
        .GroupBy(x => x)//group numbers, key is number, count is count
        .Select(k => new NumberCount(k.Key, k.Count()))
        .GroupBy(x => x.Occurrences)//group by Occurrences, key is Occurrences, value is result
        .OrderByDescending(x => x.Key) //sort
        .FirstOrDefault()? //the first one is result
        .ToList();
}
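
In the query above, the second GroupBy and the OrderByDescending exist only to locate the highest count. A possible variation, offered as a sketch rather than a measured improvement, replaces them with a single Max pass followed by a filter. It is sequential (no AsParallel) and uses value tuples in place of NumberCount, which assumes C# 7 or later; LinqModes and FindModes are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the original answer.
public static class LinqModes
{
    // Returns null for null input, an empty list for empty input,
    // otherwise every value that shares the highest occurrence count.
    public static List<(int Value, int Occurrences)> FindModes(IEnumerable<int> integers)
    {
        if (integers == null)
            return null;

        var counts = integers
            .GroupBy(x => x)                                      // key is the number
            .Select(g => (Value: g.Key, Occurrences: g.Count()))  // count each group once
            .ToList();

        if (counts.Count == 0)
            return counts;

        int max = counts.Max(c => c.Occurrences);                 // one pass instead of a sort
        return counts.Where(c => c.Occurrences == max).ToList();
    }
}
```

For the input 1,1,2,2,2,3,3,3 this returns the values 2 and 3, each with 3 occurrences.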

Test details:

Array size: 30000

30000
MaxOccurrences only
MaxOccurrences1: 207
MaxOccurrences2: 38 
=============
Full List
Original1: 28
Original2: 23
ConcurrentDictionary1: 32
ConcurrentDictionary2: 34
AsParallel1: 27
AsParallel2: 19
AsParallel3: 36

Array size: 3000000

3000000
MaxOccurrences only
MaxOccurrences1: 3009
MaxOccurrences2: 1962 //<==this is the best one in big loop.
=============
Full List
Original1: 3200
Original2: 3234
ConcurrentDictionary1: 3391
ConcurrentDictionary2: 2681
AsParallel1: 3776
AsParallel2: 2389
AsParallel3: 2155

The code:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        const int listSize = 3000000;
        var rnd = new Random();
        var randomList = Enumerable.Range(1, listSize).OrderBy(e => rnd.Next()).ToList();

        // the code that you want to measure comes here

        Console.WriteLine(randomList.Count);
        Console.WriteLine("MaxOccurrences only");

        Test(randomList, MaxOccurrences1);
        Test(randomList, MaxOccurrences2);


        Console.WriteLine("=============");
        Console.WriteLine("Full List");
        Test(randomList, Original1);
        Test(randomList, Original2);
        // These two calls match the ConcurrentDictionary lines in the results listed above.
        Test(randomList, ConcurrentDictionary1);
        Test(randomList, ConcurrentDictionary2);
        Test(randomList, AsParallel1);
        Test(randomList, AsParallel2);
        Test(randomList, AsParallel3);

        Console.ReadLine();
    }

    private static void Test(List<int> data, Action<List<int>> method)
    {
        var watch = System.Diagnostics.Stopwatch.StartNew();
        method(data);
        watch.Stop();
        Console.WriteLine($"{method.Method.Name}: {watch.ElapsedMilliseconds}");
    }

    private static void Original1(List<int> integers)
    {
        integers?.GroupBy(number => number)
            .OrderByDescending(group => group.Count())
            .Select(k => new NumberCount(k.Key, k.Count()))
            .ToList();
    }

    private static void Original2(List<int> integers)
    {
        integers?.GroupBy(number => number)
            .Select(k => new NumberCount(k.Key, k.Count()))
            .OrderByDescending(x => x.Occurrences)
            .ToList();
    }

    private static void AsParallel1(List<int> integers)
    {
        integers?.GroupBy(number => number)
            .AsParallel() //each group will be counted by a CPU unit
            .Select(k => new NumberCount(k.Key, k.Count())) //grab result, before sort
            .OrderByDescending(x => x.Occurrences) //sort after result
            .ToList();
    }

    private static void AsParallel2(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(number => number)
            .Select(k => new
            {
                Key = k.Key,
                Occurrences = k.Count()
            }) //grab result, before sort
            .OrderByDescending(x => x.Occurrences) //sort after result
            .ToList();
    }

    private static void AsParallel3(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(number => number)
            .Select(k => new NumberCount(k.Key, k.Count())) //grab result, before sort
            .OrderByDescending(x => x.Occurrences) //sort after result
            .ToList();
    }


    private static void MaxOccurrences1(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(number => number)
            .GroupBy(x => x.Count())
            .OrderByDescending(x => x.Key)
            .FirstOrDefault()?
            .ToList()
            .Select(k => new NumberCount(k.Key, k.Count()))
            .ToList();
    }

    private static void MaxOccurrences2(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(x => x)//group numbers, key is number, count is count
            .Select(k => new NumberCount(k.Key, k.Count()))
            .GroupBy(x => x.Occurrences)//group by Occurrences, key is Occurrences, value is result
            .OrderByDescending(x => x.Key) //sort
            .FirstOrDefault()? //the first one is result
            .ToList();
    }

    private static void ConcurrentDictionary1(List<int> integers)
    {
        ConcurrentDictionary<int, int> result = new ConcurrentDictionary<int, int>();

        integers?.ForEach(x => { result.AddOrUpdate(x, 1, (key, old) => old + 1); });

        result.OrderByDescending(x => x.Value).ToList();
    }

    private static void ConcurrentDictionary2(List<int> integers)
    {
        ConcurrentDictionary<int, int> result = new ConcurrentDictionary<int, int>();

        integers?.AsParallel().ForAll(x => { result.AddOrUpdate(x, 1, (key, old) => old + 1); });

        result.OrderByDescending(x => x.Value).ToList();
    }
}

public class NumberCount
{
    public int Value;
    public int Occurrences;

    public NumberCount(int value, int occurrences)
    {
        Value = value;
        Occurrences = occurrences;
    }
}

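【Comments】:

For example, 1,1,2,2,2,3,3,3 should return 2 and 3 and specify that each occurs 3 times.

【Answer 5】:

So far, NetMage's is the fastest I have found. The only thing I have been able to write that beats it only works on my computer for arrays whose values range from 1 to 500,000,000 or less, because I only have 8 GB of RAM. That keeps me from testing it over the full range of 1 to int.MaxValue, and I suspect it would fall behind at that size, since it seems to struggle more and more as the range grows. It uses the values as indexes, and the values stored at those indexes are the occurrence counts. With 6 million randomly generated positive 16-bit integers, it is about 20 times as fast as my original method in Release mode. With 32-bit integers from 1 to 500,000,000, it is only about 1.6 times as fast.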

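    private static List<NumberCount> findMostCommon(List<int> integers)
    {
        List<NumberCount> answers = new List<NumberCount>();

        // _Max is a field defined elsewhere in the class (not shown here).
        // It must be larger than the largest value that can occur, since values are used as indexes.
        int[] mostCommon = new int[_Max];

        int max = 0;
        // Start at 0 so the first element is counted too.
        for (int i = 0; i < integers.Count; i++)
        {
            int iValue = integers[i];
            mostCommon[iValue]++;
            int intVal = mostCommon[iValue];
            if (intVal > 1)
            {
                if (intVal > max)
                {
                    // Record the actual count so the reported Occurrences is correct.
                    max = intVal;
                    answers.Clear();
                    answers.Add(new NumberCount(iValue, max));
                }
                else if (intVal == max)
                {
                    answers.Add(new NumberCount(iValue, max));
                }
            }
        }

        if (answers.Count < 1)
            answers.Add(new NumberCount(0, -100)); // This -100 Occurrences value signals that no value occurred more than once.

        return answers;
    }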

Perhaps a branch like this would be the best option:

if (list.Count < sizeLimit) 
    answers = getFromSmallRangeMethod(list);
else 
    answers = getFromStandardMethod(list);

【Comments】:

Replace int[] mostCommon = new int[_Max]; with Dictionary<int, int> mostCommon = new Dictionary<int, int>(); and change mostCommon[iValue]++; to if(!mostCommon.ContainsKey(iValue)) mostCommon[iValue] = 1; else mostCommon[iValue]++;
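
A sketch of what that Dictionary-based variant could look like follows. It keeps the single-pass bookkeeping of the array version but stores counts in a Dictionary<int, int>, so memory grows with the number of distinct values rather than with the value range. This is an illustration under assumptions, not the answerer's code: DictionaryCounting and FindModes are hypothetical names, TryGetValue is used in place of the ContainsKey check, and, unlike the array version, inputs where every value occurs once return all values with a count of 1 instead of the -100 sentinel.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the original answer.
public static class DictionaryCounting
{
    // Returns every mode paired with the shared occurrence count.
    public static List<KeyValuePair<int, int>> FindModes(IEnumerable<int> integers)
    {
        var counts = new Dictionary<int, int>();
        var modes = new List<int>();
        int max = 0;

        foreach (int value in integers)
        {
            counts.TryGetValue(value, out int seen); // seen stays 0 when the key is absent
            int now = seen + 1;
            counts[value] = now;

            if (now > max)
            {
                max = now;          // this value alone now holds the highest count
                modes.Clear();
                modes.Add(value);
            }
            else if (now == max)
            {
                modes.Add(value);   // this value ties the current highest count
            }
        }

        return modes.Select(m => new KeyValuePair<int, int>(m, max)).ToList();
    }
}
```

No sorting is involved, so the caller's list is left untouched. How it compares in speed with the sort-based and array-based versions would need to be measured on the actual data.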
