


【Title】: Most efficient way to find all modes in a List of randomly generated integers and how often they occurred 【Posted】: 2019-01-18 03:29:12 【Question】:

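If a method written in C# is passed either null or a list of 0 to 6,000,000 randomly generated, unsorted integers, what is the most efficient way to determine all the modes and how many times each of them occurs? In particular, can anyone help me with the LINQ-based solution I am struggling with?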


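The closest LINQ solution I have so far grabs only the first mode it finds and does not report the number of occurrences. It is also about 7 times slower on my computer than my ugly, bulky implementation, which is terrible.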

    int mode = numbers.GroupBy(number => number).OrderByDescending(group => group.Count()).Select(k => k.Key).FirstOrDefault();


    public class NumberCount
    {
        public int Value;
        public int Occurrences;

        public NumberCount(int value, int occurrences)
        {
            Value = value;
            Occurrences = occurrences;
        }
    }

    private static List<NumberCount> findMostCommon(List<int> integers)
    {
        if (integers == null)
            return null;
        else if (integers.Count < 1)
            return new List<NumberCount>();

        List<NumberCount> mostCommon = new List<NumberCount>();

        // Sort in place so that equal values become adjacent runs.
        integers.Sort();

        // Collapse each run of equal values into one NumberCount.
        mostCommon.Add(new NumberCount(integers[0], 1));
        for (int i=1; i<integers.Count; i++)
        {
            if (mostCommon[mostCommon.Count - 1].Value != integers[i])
                mostCommon.Add(new NumberCount(integers[i], 1));
            else
                mostCommon[mostCommon.Count - 1].Occurrences++;
        }

        // Keep every entry that ties for the highest count.
        List<NumberCount> answer = new List<NumberCount>();
        answer.Add(mostCommon[0]);
        for (int i=1; i<mostCommon.Count; i++) 
        {
            if (mostCommon[i].Occurrences > answer[0].Occurrences)
            {
                if (answer.Count == 1)
                    answer[0] = mostCommon[i];
                else
                {
                    answer = new List<NumberCount>();
                    answer.Add(mostCommon[i]);
                }
            }
            else if (mostCommon[i].Occurrences == answer[0].Occurrences)
                answer.Add(mostCommon[i]);
        }

        return answer;        
    }

Basically, I am trying to get an elegant, compact LINQ solution that is at least as fast as my ugly method. Thanks in advance for any suggestions.


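【Comments】:

Why do you think linq would do better or be less ugly? Your code looks straightforward. Apart from that, I still don't get what you actually want to achieve. What is a mode in your case? How is it created?

LINQ gets most of the work done in one line. However, I am stuck on the last bit. The mode is the number that occurs most often.

@HimBromBeere: I believe in this context "mode" = "the number that appears the greatest number of times". So in the array 1, 3, 2, 4, 1, 4, 5 the modes are 4 and 1, because each of them appears twice and nothing appears more than twice.

Are your random numbers within a given range? Or do you know that there will be at most a small number of distinct values? For example, if there are only ten values, you could create a simple collection of counts and then walk that collection to see which count is the largest (which is nice and fast, because you only have ten things to sort/compare). If you might have 5,000,000 distinct integers, this approach becomes much less efficient...

The only restriction is that they are valid non-negative 32-bit integers.

【Answer 1】: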

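Different code is more efficient for different lengths, but as the length approaches 6 million, this approach seems to be the fastest. In general, LINQ is not about making code faster; it is about understandability and maintainability, depending on how you feel about the functional programming style.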

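Your code is fairly fast and beats the simple LINQ approaches that use GroupBy. It gains a good advantage from the fact that List.Sort is highly optimized. My code uses it as well, but on a local copy of the list to avoid changing the source. My code is similar in approach to yours, but it is designed around doing all the needed computation in a single pass. It uses an extension method that I re-optimized for this problem, called GroupByRuns, which returns an IEnumerable<IGrouping<T,T>>. It is also hand expanded rather than relying on the generic GroupByRuns, which takes extra arguments for key and result selection. Since .Net has no end-user-accessible IGrouping<,> implementation (!), I rolled my own, which implements ICollection to optimize Count().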

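This code runs about 1.3 times as fast as yours (after I slightly optimized yours, by about 5%).

First, the RunGrouping class, which represents one run of equal values: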

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public class RunGrouping<T> : IGrouping<T, T>, ICollection<T> {
    public T Key { get; }
    int Count;

    // Exposing Count through ICollection<T> lets LINQ's Count() return it without enumerating.
    int ICollection<T>.Count => Count;
    public bool IsReadOnly => true;

    public RunGrouping(T key, int count) {
        Key = key;
        Count = count;
    }

    // A run is just the key repeated Count times.
    public IEnumerator<T> GetEnumerator() {
        for (int j1 = 0; j1 < Count; ++j1)
            yield return Key;
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    public void Add(T item) => throw new NotImplementedException();
    public void Clear() => throw new NotImplementedException();
    public bool Contains(T item) => Count > 0 && EqualityComparer<T>.Default.Equals(Key, item);
    public void CopyTo(T[] array, int arrayIndex) => throw new NotImplementedException();
    public bool Remove(T item) => throw new NotImplementedException();
}

Second, the extension method on IEnumerable that groups the runs:

public static class IEnumerableExt {
    public static IEnumerable<IGrouping<T, T>> GroupByRuns<T>(this IEnumerable<T> src) {
        var cmp = EqualityComparer<T>.Default;
        bool notAtEnd = true;
        using (var e = src.GetEnumerator()) {
            // Consume one run of equal adjacent items and return it as a grouping.
            IGrouping<T, T> NextRun() {
                var prev = e.Current;
                var ct = 0;
                while (notAtEnd && cmp.Equals(e.Current, prev)) {
                    ++ct;
                    notAtEnd = e.MoveNext();
                }
                return new RunGrouping<T>(prev, ct);
            }

            notAtEnd = e.MoveNext();
            while (notAtEnd)
                yield return NextRun();
        }
    }
}

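Finally, the extension method that finds the modes with the maximum count. Basically, it walks the runs and keeps a record of those ints whose run length equals the current longest run.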

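public static class IEnumerableIntExt {
    public static IEnumerable<KeyValuePair<int, int>> MostCommon(this IEnumerable<int> src) {
        // Sort a local copy so the caller's list is left unchanged.
        var mysrc = new List<int>(src);
        mysrc.Sort();
        var maxc = 0;
        var maxmodes = new List<int>();
        foreach (var g in mysrc.GroupByRuns()) {
            var gc = g.Count();

            if (gc > maxc) {
                // A longer run was found: discard the previous candidates.
                maxmodes.Clear();
                maxmodes.Add(g.Key);
                maxc = gc;
            }
            else if (gc == maxc)
                maxmodes.Add(g.Key);
        }

        return maxmodes.Select(m => new KeyValuePair<int, int>(m, maxc));
    }
}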


Given an existing list of random integers rl, you can get the answer with:

var ans = rl.MostCommon();


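【Comments】:

I tried your code, awesome! Linq is 3 times slower than your code. But I wonder whether there is a way to avoid the full sort (I mean sorting the complete list) in the Linq approach?

@Dongdong There is, but it would be slower, because List.Sort is optimized C++ code and is faster than even single-pass LINQ code.

【Answer 2】: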

Personally, I would use a ConcurrentDictionary to update the counters, since dictionary access is fast. I use this approach quite a lot, and it is more readable.

  // create a dictionary
  var dictionary = new ConcurrentDictionary<int, int>();

  // list of your integers
  var numbers = new List<int>();

  // parallelize the iteration (we can because ConcurrentDictionary is thread safe-ish)
  numbers.AsParallel().ForAll((number) =>
  {
      // add the key with a value of 1 if it is not there; if it is there, use the lambda function to increment it by 1
      dictionary.AddOrUpdate(number, 1, (key, old) => old + 1);
  });

  // pick the entry with the highest count (this throws if the dictionary is empty)
  var topMostOccurence = dictionary.Aggregate((x, y) => { return x.Value > y.Value ? x : y; });
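
The Aggregate call above keeps a single entry, so ties are lost. As a minimal sketch of how the same counting step could report every mode together with its count, assuming a hypothetical helper named `AllModes` (the class and method names are illustrative and not part of the answer):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public static class DictionaryModes
{
    // Hypothetical helper: returns every mode as (Key = value, Value = occurrences).
    public static List<KeyValuePair<int, int>> AllModes(IEnumerable<int> numbers)
    {
        if (numbers == null)
            return null;

        var counts = new ConcurrentDictionary<int, int>();

        // Same counting step as in the answer: AddOrUpdate may be called from several threads.
        numbers.AsParallel().ForAll(n => counts.AddOrUpdate(n, 1, (key, old) => old + 1));

        // Max() throws on an empty sequence, so return early for empty input.
        if (counts.IsEmpty)
            return new List<KeyValuePair<int, int>>();

        int max = counts.Values.Max();

        // Keep every entry that ties for the highest count, ordered by value for stable output.
        return counts.Where(kv => kv.Value == max).OrderBy(kv => kv.Key).ToList();
    }
}
```

For the input 1, 1, 2, 2, 2, 3, 3, 3 this should yield the pairs (2, 3) and (3, 3). Whether the parallel counting pays off likely depends on how many distinct values there are, so it is worth measuring on your own data.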



I tested with the following code on an Intel i7-8700K and got the following results:

Lambda: found 78 in 134 ms.

Manual: found 78 in 368 ms.

Dictionary: found 78 in 195 ms.

    static IEnumerable<int> GenerateNumbers(int amount)
    {
        Random r = new Random();
        for (int i = 0; i < amount; i++)
            yield return r.Next(100);
    }

    static void Main(string[] args)
    {
        var numbers = GenerateNumbers(6_000_000).ToList();

        Stopwatch sw = Stopwatch.StartNew();
        int mode = numbers.GroupBy(number => number).OrderByDescending(group => group.Count()).Select(k =>
        {
            int count = k.Count();
            return new { Mode = k.Key, Count = count };
        }).FirstOrDefault().Mode;
        Console.WriteLine($"Lambda: found {mode} in {sw.ElapsedMilliseconds} ms.");

        sw = Stopwatch.StartNew();
        mode = findMostCommon(numbers)[0].Value;
        Console.WriteLine($"Manual: found {mode} in {sw.ElapsedMilliseconds} ms.");

        // create a dictionary
        var dictionary = new ConcurrentDictionary<int, int>();

        sw = Stopwatch.StartNew();
        // parallelize the iteration (we can because ConcurrentDictionary is thread safe-ish)
        numbers.AsParallel().ForAll((number) =>
        {
            // add the key with a value of 1 if it is not there; if it is there, use the lambda function to increment it by 1
            dictionary.AddOrUpdate(number, 1, (key, old) => old + 1);
        });
        mode = dictionary.Aggregate((x, y) => { return x.Value > y.Value ? x : y; }).Key;
        Console.WriteLine($"Dictionary: found {mode} in {sw.ElapsedMilliseconds} ms.");
    }



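【Comments】:

Lambda and Dictionary return only a single mode and do not report how many modes there are or how often they occur. Also, Dictionary is running on the sorted List, because the same List is used throughout.

Also, you restrict the values to 100. When using int.MaxValue, my results change. Max 100, reusing the numbers even though they get sorted in between: Lambda=173 Manual=477 Dictionary=210. Max int.MaxValue, with each process getting a new array: Lambda=7276 Manual=1431 Dictionary=6381

【Answer 4】: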


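Your current code comes from here: Find the most occurring number in a List<int>. However, it returns only one number, which is the wrong result.

The problem with Linq is that the loop cannot stop early, even when there is no need to continue.

However, here I produce the list you need with LINQ: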

List<NumberCount> MaxOccurrences(List<int> integers)
{
    return integers?.AsParallel()
        .GroupBy(x => x)//group numbers, key is number, count is count
        .Select(k => new NumberCount(k.Key, k.Count()))
        .GroupBy(x => x.Occurrences)//group by Occurrences, key is Occurrences, value is result
        .OrderByDescending(x => x.Key) //sort
        .FirstOrDefault()? //the first one is result
        .ToList();
}
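
The MaxOccurrences query groups twice and orders the count groups before taking the first one. One possible alternative, shown here only as an unbenchmarked sketch, is to count once and then keep every value whose count equals the maximum. The class name `LinqModes` and the tuple return type are assumptions made for illustration (value tuples need C# 7 or later); the null and empty handling mirrors the convention used in the question.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LinqModes
{
    // Hypothetical helper: returns (Value, Occurrences) for every mode.
    public static List<(int Value, int Occurrences)> AllModes(List<int> integers)
    {
        if (integers == null)
            return null;
        if (integers.Count == 0)
            return new List<(int Value, int Occurrences)>();

        // Count each distinct value once and materialize the result,
        // so the groups are not enumerated again in the steps below.
        var counts = integers
            .GroupBy(n => n)
            .Select(g => (Value: g.Key, Occurrences: g.Count()))
            .ToList();

        // A single Max over the distinct values replaces the sort.
        int max = counts.Max(c => c.Occurrences);

        // Every value that ties for the maximum is a mode.
        return counts.Where(c => c.Occurrences == max).ToList();
    }
}
```

Calling `LinqModes.AllModes(new List<int> { 1, 1, 2, 2, 2, 3, 3, 3 })` should return two entries, 2 and 3, each with 3 occurrences. This version is sequential; adding `AsParallel()` as in the answer is a separate choice that would need its own measurement.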



MaxOccurrences only
MaxOccurrences1: 207
MaxOccurrences2: 38 
Full List
Original1: 28
Original2: 23
ConcurrentDictionary1: 32
ConcurrentDictionary2: 34
AsParallel1: 27
AsParallel2: 19
AsParallel3: 36


MaxOccurrences only
MaxOccurrences1: 3009
MaxOccurrences2: 1962 //<==this is the best one in big loop.
Full List
Original1: 3200
Original2: 3234
ConcurrentDictionary1: 3391
ConcurrentDictionary2: 2681
AsParallel1: 3776
AsParallel2: 2389
AsParallel3: 2155


using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        const int listSize = 3000000;
        var rnd = new Random();
        var randomList = Enumerable.Range(1, listSize).OrderBy(e => rnd.Next()).ToList();

        // the code that you want to measure comes here

        Console.WriteLine("MaxOccurrences only");

        Test(randomList, MaxOccurrences1);
        Test(randomList, MaxOccurrences2);

        Console.WriteLine("Full List");
        Test(randomList, Original1);
        Test(randomList, Original2);
        Test(randomList, ConcurrentDictionary1);
        Test(randomList, ConcurrentDictionary2);
        Test(randomList, AsParallel1);
        Test(randomList, AsParallel2);
        Test(randomList, AsParallel3);
    }

    // Runs one candidate and prints its name and elapsed milliseconds.
    private static void Test(List<int> data, Action<List<int>> method)
    {
        var watch = System.Diagnostics.Stopwatch.StartNew();
        method(data);
        watch.Stop();
        Console.WriteLine($"{method.Method.Name}: {watch.ElapsedMilliseconds}");
    }

    private static void Original1(List<int> integers)
    {
        integers?.GroupBy(number => number)
            .OrderByDescending(group => group.Count())
            .Select(k => new NumberCount(k.Key, k.Count()))
            .ToList();
    }

    private static void Original2(List<int> integers)
    {
        integers?.GroupBy(number => number)
            .Select(k => new NumberCount(k.Key, k.Count()))
            .OrderByDescending(x => x.Occurrences)
            .ToList();
    }

    private static void AsParallel1(List<int> integers)
    {
        integers?.GroupBy(number => number)
            .AsParallel() //each group will be counted by a CPU unit
            .Select(k => new NumberCount(k.Key, k.Count())) //grab result, before sort
            .OrderByDescending(x => x.Occurrences) //sort after result
            .ToList();
    }

    private static void AsParallel2(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(number => number)
            .Select(k => new
            {
                Key = k.Key,
                Occurrences = k.Count()
            }) //grab result, before sort
            .OrderByDescending(x => x.Occurrences) //sort after result
            .ToList();
    }

    private static void AsParallel3(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(number => number)
            .Select(k => new NumberCount(k.Key, k.Count())) //grab result, before sort
            .OrderByDescending(x => x.Occurrences) //sort after result
            .ToList();
    }

    private static void MaxOccurrences1(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(number => number)
            .GroupBy(x => x.Count())
            .OrderByDescending(x => x.Key)
            .FirstOrDefault()? //the first group holds the most frequent numbers
            .Select(k => new NumberCount(k.Key, k.Count()))
            .ToList();
    }

    private static void MaxOccurrences2(List<int> integers)
    {
        integers?.AsParallel()
            .GroupBy(x => x)//group numbers, key is number, count is count
            .Select(k => new NumberCount(k.Key, k.Count()))
            .GroupBy(x => x.Occurrences)//group by Occurrences, key is Occurrences, value is result
            .OrderByDescending(x => x.Key) //sort
            .FirstOrDefault()? //the first one is result
            .ToList();
    }

    private static void ConcurrentDictionary1(List<int> integers)
    {
        ConcurrentDictionary<int, int> result = new ConcurrentDictionary<int, int>();

        integers?.ForEach(x => { result.AddOrUpdate(x, 1, (key, old) => old + 1); });

        result.OrderByDescending(x => x.Value).ToList();
    }

    private static void ConcurrentDictionary2(List<int> integers)
    {
        ConcurrentDictionary<int, int> result = new ConcurrentDictionary<int, int>();

        integers?.AsParallel().ForAll(x => { result.AddOrUpdate(x, 1, (key, old) => old + 1); });

        result.OrderByDescending(x => x.Value).ToList();
    }
}

public class NumberCount
{
    public int Value;
    public int Occurrences;

    public NumberCount(int value, int occurrences)
    {
        Value = value;
        Occurrences = occurrences;
    }
}


【Comments】:

For example, 1,1,2,2,2,3,3,3 should return 2 and 3 and report that each of them occurs 3 times.

【Answer 5】:

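So far, Netmage's answer is the fastest I have found. The only thing I have been able to write that beats it works on my computer only for arrays whose values range from 1 to 500,000,000 or less, because I only have 8 GB of RAM. This prevents me from testing it over the full range of 1 to int.MaxValue, and I suspect it would fall behind at that size, since it seems to struggle more and more as the range grows. It uses the values as indexes and stores the number of occurrences at those indexes. With 6 million randomly generated positive 16-bit integers, it is about 20 times faster than my original method in Release mode. With 32-bit integers from 1 to 500,000,000, it is only about 1.6 times faster.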

    private static List<NumberCount> findMostCommon(List<int> integers)
    {
        List<NumberCount> answers = new List<NumberCount>();

        // _Max must be larger than the largest value in the list.
        int[] mostCommon = new int[_Max];

        int max = 0;
        for (int i = 0; i < integers.Count; i++)
        {
            int iValue = integers[i];

            // The value is the index; the slot holds its occurrence count.
            mostCommon[iValue]++;
            int intVal = mostCommon[iValue];
            if (intVal > 1)
            {
                if (intVal > max)
                {
                    max = intVal;
                    answers.Clear();
                    answers.Add(new NumberCount(iValue, max));
                }
                else if (intVal == max)
                    answers.Add(new NumberCount(iValue, max));
            }
        }

        if (answers.Count < 1)
            answers.Add(new NumberCount(0, -100)); // This -100 Occurrences value signifies that every value occurs equally often (exactly once).

        return answers;
    }


if (list.Count < sizeLimit) 
    answers = getFromSmallRangeMethod(list);
else
    answers = getFromStandardMethod(list);


Replace int[] mostCommon = new int[_Max]; with Dictionary<int, int> mostCommon = new Dictionary<int, int>();, and change mostCommon[iValue]++; to if(!mostCommon.ContainsKey(iValue)) mostCommon[iValue] = 1; else mostCommon[iValue]++;
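
Put together, the dictionary variant described above might look like the following sketch. The class name `DictionaryCounter`, the `KeyValuePair` return type, and the use of `TryGetValue` in place of `ContainsKey` are choices made here for illustration, not the answerer's code. Unlike the array version, this sketch also reports modes that occur only once instead of returning a -100 sentinel.

```csharp
using System.Collections.Generic;

public static class DictionaryCounter
{
    // Hypothetical single-pass variant: Key = value, Value = occurrences.
    public static List<KeyValuePair<int, int>> FindMostCommon(List<int> integers)
    {
        if (integers == null)
            return null;

        var answers = new List<KeyValuePair<int, int>>();
        var counts = new Dictionary<int, int>();
        int max = 0;

        foreach (int value in integers)
        {
            // seen is 0 when the key is not in the dictionary yet.
            counts.TryGetValue(value, out int seen);
            int now = seen + 1;
            counts[value] = now;

            if (now > max)
            {
                // A new highest count: earlier candidates are no longer modes.
                max = now;
                answers.Clear();
                answers.Add(new KeyValuePair<int, int>(value, max));
            }
            else if (now == max)
            {
                // This value has just tied the current highest count.
                answers.Add(new KeyValuePair<int, int>(value, max));
            }
        }

        return answers;
    }
}
```

Memory use here grows with the number of distinct values rather than with the size of the value range, which is the trade-off against the array version; the relative speed of the two is not measured here.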






