Select k random elements from a list whose elements have weights

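【Title】: Select k random elements from a list whose elements have weights 【Posted】: 2011-01-09 14:13:42 【Question】:

Selecting without any weights (equal probabilities) is beautifully described here.

I would like to know whether there is a way to convert that approach into a weighted one.

I am also interested in other approaches.

Update: the sampling is without replacement.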

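【Comments】:

Is the sampling with or without replacement?

Either way, this is a duplicate of ***.com/questions/352670/…

@Jason I am asking for a way to convert that elegant approach into a weighted one; it is not quite a duplicate.

nimcap: the question I linked to is about weighted random selection.

Sampling without replacement in such a way that each element's inclusion probability is proportional to its weight is far from a trivial task, and there is good recent research on it. For example: books.google.com.br/books/about/…

【Answer 1】: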

I used an associative map of (weight, object) pairs. For example:


(10,"low"),
(100,"mid"),
(10000,"large")


total=10110

Draw a random number between 0 and 'total', then iterate over the keys until the number falls within the range of a given entry.
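The lookup described above can be sketched as follows. This is only a minimal illustration, not the answerer's own code: the name `pick_weighted` and the optional `rng` parameter are assumptions introduced here, and the pairs are stored as a plain list of (weight, object) tuples.

```python
import random

def pick_weighted(pairs, rng=random):
    """Pick one object from (weight, object) pairs, proportionally to weight."""
    total = sum(weight for weight, _ in pairs)
    # Random point in [0, total); each pair owns a slice as wide as its weight.
    point = rng.random() * total
    for weight, obj in pairs:
        if point < weight:
            return obj
        point -= weight
    # Only reachable through floating-point rounding: fall back to the last pair.
    return pairs[-1][1]

pairs = [(10, "low"), (100, "mid"), (10000, "large")]
print(pick_weighted(pairs))
```

Each call costs one pass over the list, so drawing k items this way costs k passes.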

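【Comments】:

It seems to me the question is about selecting several items at once (see the linked question). Your approach needs one pass per selection.

【Answer 2】: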

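In the question you linked to, Kyle's solution can be generalized in a simple way. Scan the list and sum the total weight. Then the probability of choosing an element should be:

1 - (1 - (#needed/(weight left)))/(weight at n). After visiting a node, subtract its weight from the total. Also, if you need n and have n left, you have to stop explicitly.

You can check that with every weight equal to 1, this simplifies to Kyle's solution.

Edited: (had to rethink what "twice as likely" meant)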

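【Comments】:

Say the list has 4 elements with weights 2, 1, 1, 1, and I want to select 3 of them. According to your formula, for the first element 3 * 2/5 = 1.2, which is > 1. What am I missing here?

Now with 2, 1, 1, 1 and picking 3, the probability of picking the first element is 1 - (1 - (3/5))/2 = 1 - (2/5)/2 = 1 - 1/5 = 4/5, as expected.

I believe there is something wrong with your formula. I don't have time to write it up now; I will when I do. If you apply the formula to the first two elements in different orders, you will see that it does not yield the same result. It should give the same result regardless of order.

Say there are 4 elements with weights 7, 1, 1, 1 and we pick 2. Let's compute the chance of picking the first 2 elements: P(1st)*P(2nd) = (1-(1-2/10)/7)*(1-(1-1/3)) = 31/105. Now change the list to 1, 7, 1, 1; the probability of picking the first 2 elements should stay the same, but P(1st)*P(2nd) = (1-(1-2/10))*(1-(1-1/9)/7) = 11/63. They are not the same.

Even simpler: say there are 2 elements with weights 1/2, 1/2 and we pick 2. The probability should be 1, since we have to take both. But the formula gives 1 - (1 - (2/1))/(1/2) = 3.

【Answer 3】: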

If the sampling is with replacement, you can use this algorithm (implemented here in Python):

import random

items = [(10, "low"),
         (100, "mid"),
         (890, "large")]

def weighted_sample(items, n):
    total = float(sum(w for w, v in items))
    i = 0
    w, v = items[0]
    while n:
        x = total * (1 - random.random() ** (1.0 / n))
        total -= x
        while x > w:
            x -= w
            i += 1
            w, v = items[i]
        w -= x
        yield v
        n -= 1

This is O(n + m), where m is the number of items.

Why does this work? It is based on the following algorithm:

def n_random_numbers_decreasing(v, n):
    """Like reversed(sorted(v * random() for i in range(n))),
    but faster because we avoid sorting."""
    while n:
        v *= random.random() ** (1.0 / n)
        yield v
        n -= 1

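The function weighted_sample is just this algorithm fused with a walk of the items list, picking out the items selected by those random numbers.

This in turn works because the probability that n random numbers in 0..v all happen to be less than z is $P = (z/v)^n$. Solving for z gives $z = v P^{1/n}$. Substituting a random number for P picks the largest number with the correct distribution, and we can repeat the process to pick all the other numbers.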
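As a rough sanity check of that argument, the following sketch compares the "take the maximum of n uniform draws" approach with the one-call shortcut. The helper names `largest_of_n_uniforms` and `empirical_mean`, the seed, and the trial count are assumptions made for this illustration only.

```python
import random

def largest_of_n_uniforms(v, n, rng=random):
    """Draw the maximum of n uniform numbers on [0, v) with a single random call."""
    return v * rng.random() ** (1.0 / n)

def empirical_mean(draw, trials=20000):
    """Average the result of calling draw() `trials` times."""
    return sum(draw() for _ in range(trials)) / trials

rng = random.Random(42)
v, n = 10.0, 5
direct = empirical_mean(lambda: max(v * rng.random() for _ in range(n)))
shortcut = empirical_mean(lambda: largest_of_n_uniforms(v, n, rng))
# Both means should land close to the theoretical value v * n / (n + 1).
print(direct, shortcut, v * n / (n + 1))
```

Agreement of the means is of course only weak evidence; it is the derivation above that shows the two distributions are the same.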

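If the sampling is without replacement, you can put all the items into a binary heap in which each node caches the total weight of all items in its sub-heap. Building the heap is O(m). Selecting a random item from the heap, respecting the weights, is O(log m). Removing that item and updating the cached totals is also O(log m). So you can pick n items in O(m + n log m) time.

(Note: "weight" here means that every time an element is selected, each remaining candidate is chosen with probability proportional to its weight. It does not mean that elements appear in the output with likelihood proportional to their weights.)

Here is an implementation, heavily commented: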

import random

class Node:
    # Each node in the heap has a weight, value, and total weight.
    # The total weight, self.tw, is self.w plus the weight of any children.
    __slots__ = ['w', 'v', 'tw']
    def __init__(self, w, v, tw):
        self.w, self.v, self.tw = w, v, tw

def rws_heap(items):
    # h is the heap. It's like a binary tree that lives in an array.
    # It has a Node for each pair in `items`. h[1] is the root. Each
    # other Node h[i] has a parent at h[i>>1]. Each node has up to 2
    # children, h[i<<1] and h[(i<<1)+1].  To get this nice simple
    # arithmetic, we have to leave h[0] vacant.
    h = [None]                          # leave h[0] vacant
    for w, v in items:
        h.append(Node(w, v, w))
    for i in range(len(h) - 1, 1, -1):  # total up the tws
        h[i>>1].tw += h[i].tw           # add h[i]'s total to its parent
    return h

def rws_heap_pop(h):
    gas = h[1].tw * random.random()     # start with a random amount of gas

    i = 1                     # start driving at the root
    while gas >= h[i].w:      # while we have enough gas to get past node i:
        gas -= h[i].w         #   drive past node i
        i <<= 1               #   move to first child
        if gas >= h[i].tw:    #   if we have enough gas:
            gas -= h[i].tw    #     drive past first child and descendants
            i += 1            #     move to second child
    w = h[i].w                # out of gas! h[i] is the selected node.
    v = h[i].v

    h[i].w = 0                # make sure this node isn't chosen again
    while i:                  # fix up total weights
        h[i].tw -= w
        i >>= 1
    return v

def random_weighted_sample_no_replacement(items, n):
    heap = rws_heap(items)              # just make a heap...
    for i in range(n):
        yield rws_heap_pop(heap)        # and pop n items off it.

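【Comments】:

+1 for awesome use of Python control-structure variations I had not seen before.

See my answer to another question for a Python implementation of the binary-tree method: ***.com/questions/526255/…

The "fix-up" strategy is probably the fastest. You could speed it up by storing the original values in each Node rather than in a separate dictionary.

JavaScript port (in case people need it): gist.github.com/seejohnrun/5291246

Hey! You are entirely right. 0.5 is in the second group. OK, I am happy with >=

【Answer 4】: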

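If the sampling is with replacement, use the roulette-wheel selection technique (often used in genetic algorithms):

    1. Sort the weights.
    2. Compute the cumulative weights.
    3. Pick a random number in [0, 1] * totalWeight.
    4. Find the interval into which this number falls.
    5. Select the element with the corresponding interval.
    6. Repeat k times.

If the sampling is without replacement, you can adapt the technique above by removing the selected element from the list after each iteration and then re-normalizing the weights so that they sum to 1 (a valid probability distribution function).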
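Both variants can be sketched in Python as below. This is an illustrative sketch under a few assumptions rather than a reference implementation: the function names and the `rng` parameter are made up here, the sorting step is left out, the interval lookup uses `bisect`, and "re-normalizing" is done implicitly by recomputing the total of the remaining weights on every draw.

```python
import bisect
import itertools
import random

def roulette_with_replacement(items, weights, k, rng=random):
    """Draw k items with replacement, each with probability weight / total."""
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    picks = []
    for _ in range(k):
        point = rng.random() * total
        # bisect_right finds the first interval whose upper bound exceeds point.
        index = bisect.bisect_right(cumulative, point)
        picks.append(items[min(index, len(items) - 1)])
    return picks

def roulette_without_replacement(items, weights, k, rng=random):
    """Draw up to k distinct items; the chosen item is removed after each draw."""
    pool = list(zip(items, weights))  # work on a copy, leave the input untouched
    picks = []
    for _ in range(min(k, len(pool))):
        total = sum(w for _, w in pool)  # implicit re-normalization
        point = rng.random() * total
        for index, (item, w) in enumerate(pool):
            if point < w:
                break
            point -= w
        picks.append(item)
        del pool[index]
    return picks

print(roulette_with_replacement(["a", "b", "c"], [1, 2, 3], 5))
print(roulette_without_replacement(["a", "b", "c"], [1, 2, 3], 2))
```

The without-replacement variant rescans the pool for every draw, so it costs O(k·m) for m items.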

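【Comments】:

+1, this wins big on clarity. But note that the roulette-wheel algorithm takes O(n log m + m) time, where n is the number of samples and m is the number of items. That is if you omit the sorting, which is unnecessary, and do a binary search in step 4. It also needs O(m) space for the cumulative weights. In my answer there is a 14-line function that does the same thing in O(n + m) time and O(1) space.

If I have to remove the selected element, I need to copy the whole list (I am assuming we are not allowed to modify the input list), which is expensive.

Do you need to sort the weights? Is it necessary?

Do you think using a Fenwick tree would help here?

The suggested without-replacement method is not good. Google "systematic sampling with unequal probabilities". There is an O(n) algorithm that needs no recomputation of the weights.

【Answer 5】: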

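If you want to generate large arrays of random integers with replacement, you can use piecewise-linear interpolation. For example, using NumPy/SciPy: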

import numpy
import scipy.interpolate

def weighted_randint(weights, size=None):
    """Given an n-element vector of weights, randomly sample
    integers up to n with probabilities proportional to weights"""
    n = weights.size
    # normalize so that the weights sum to unity
    weights = weights / numpy.linalg.norm(weights, 1)
    # cumulative sum of weights
    cumulative_weights = weights.cumsum()
    # piecewise-linear interpolating function whose domain is
    # the unit interval and whose range is the integers up to n
    f = scipy.interpolate.interp1d(
            numpy.hstack((0.0, cumulative_weights)),
            numpy.arange(n + 1), kind='linear')
    return f(numpy.random.random(size=size)).astype(int)

This approach does not work if you want to sample without replacement.

【Comments】:

【Answer 6】:

I have done this in Ruby:

https://github.com/fl00r/pickup

require 'pickup'
pond = {
  "selmon"  => 1,
  "carp" => 4,
  "crucian"  => 3,
  "herring" => 6,
  "sturgeon" => 8,
  "gudgeon" => 10,
  "minnow" => 20
}
pickup = Pickup.new(pond, uniq: true)
pickup.pick(3)
#=> [ "gudgeon", "herring", "minnow" ]
pickup.pick
#=> "herring"
pickup.pick
#=> "gudgeon"
pickup.pick
#=> "sturgeon"

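【Comments】:

Compared with Jason Orendorff's post, this version returns wrong answers. Specifically, for weights such as [1,1,1,1,9996] with pick(4, unique), the results for the low-weight items are not uniform.

【Answer 7】:

Here is a Go implementation from geodns: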

package foo

import (
    "log"
    "math/rand"
)

type server struct {
    Weight int
    data   interface{}
}

func foo(servers []server) []server {
    // servers list is already sorted by the Weight attribute

    // number of items to pick
    max := 4

    result := make([]server, max)

    sum := 0
    for _, r := range servers {
        sum += r.Weight
    }

    for si := 0; si < max; si++ {
        n := rand.Intn(sum + 1)
        s := 0

        for i := range servers {
            s += int(servers[i].Weight)
            if s >= n {
                log.Println("Picked record", i, servers[i])
                sum -= servers[i].Weight
                result[si] = servers[i]

                // remove the server from the list
                servers = append(servers[:i], servers[i+1:]...)
                break
            }
        }
    }

    return result
}

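【Comments】:

【Answer 8】:

This one does exactly that in O(n) with no excessive memory usage. I believe it is a clever and efficient solution that ports easily to any language. The first two lines only populate sample data in Drupal.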

function getNrandomGuysWithWeight($numitems) {
  $q = db_query('SELECT id, weight FROM theTableWithTheData');
  $q = $q->fetchAll();

  // turn each weight into a running (cumulative) total
  $accum = 0;
  foreach($q as $r) {
    $accum += $r->weight;
    $r->weight = $accum;
  }

  $out = array();

  while(count($out) < $numitems && count($q)) {
    $n = rand(0,$accum);
    $lessaccum = NULL;
    $prevaccum = 0;
    $idxrm = 0;
    foreach($q as $i=>$r) {
      if(($lessaccum == NULL) && ($n <= $r->weight)) {
        $out[] = $r->id;
        $lessaccum = $r->weight- $prevaccum;
        $accum -= $lessaccum;
        $idxrm = $i;
      } else if($lessaccum) {
        $r->weight -= $lessaccum;
      }
      $prevaccum = $r->weight;
    }
    unset($q[$idxrm]);
  }
  return $out;
}

【Comments】:

【Answer 9】:

Here is a simple solution for picking 1 item; you can easily extend it to k items (Java style):

double random = Math.random();
double sum = 0;
for (int i = 0; i < items.length; i++) {
    val = items[i];
    sum += val.getValue();
    if (sum > random) {
        selected = val;
        break;
    }
}

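【Comments】:

【Answer 10】:

If you want to pick x elements from a weighted set without replacement, such that elements are chosen with probability proportional to their weights: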

import random

def weighted_choose_subset(weighted_set, count):
    """Return a random sample of count elements from a weighted set.

    weighted_set should be a sequence of tuples of the form 
    (item, weight), for example:  [('a', 1), ('b', 2), ('c', 3)]

    Each element from weighted_set shows up at most once in the
    result, and the relative likelihood of two particular elements
    showing up is equal to the ratio of their weights.

    This works as follows:

    1.) Line up the items along the number line from [0, the sum
    of all weights) such that each item occupies a segment of
    length equal to its weight.

    2.) Randomly pick a number "start" in the range [0, total
    weight / count).

    3.) Find all the points "start + n * (total weight / count)"
    (for all integers n such that the point is within our
    segments) and yield the set containing the items marked by
    those points.

    Note that this implementation may not return each possible
    subset.  For example, with the input ([('a', 1), ('b', 1),
    ('c', 1), ('d', 1)], 2), it may only produce the sets ['a',
    'c'] and ['b', 'd'], but it will do so such that the weights
    are respected.

    This implementation only works for nonnegative integral
    weights.  The highest weight in the input set must be less
    than the total weight divided by the count; otherwise it would
    be impossible to respect the weights while never returning
    that element more than once per invocation.
    """
    if count == 0:
        return []

    total_weight = 0
    max_weight = 0
    borders = []
    for item, weight in weighted_set:
        if weight < 0:
            raise RuntimeError("All weights must be positive integers")
        # Scale up weights so dividing total_weight / count doesn't truncate:
        weight *= count
        total_weight += weight
        borders.append(total_weight)
        max_weight = max(max_weight, weight)

    step = total_weight // count

    if max_weight > step:
        raise RuntimeError(
            "Each weight must be less than total weight / count")

    next_stop = random.randint(0, step - 1)

    results = []
    current = 0
    for i in range(count):
        while borders[current] <= next_stop:
            current += 1
        results.append(weighted_set[current][0])
        next_stop += step

    return results

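【Comments】:

I think you can get rid of the correlation between the selected elements by shuffling a copy of weighted_set at the beginning.

On reflection, I am not sure how I would prove that.

Ditto on both :)

【Answer 11】: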

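I know this is a very old question, but I think there is a way to do this in O(n) time if you apply a little math!

The exponential distribution has two very useful properties.

    1. Given n samples from different exponential distributions with different rate parameters, the probability that a given sample is the minimum equals its rate parameter divided by the sum of all the rate parameters.

    2. It is "memoryless". So if you already know the minimum, the probability that any of the remaining elements is the second smallest is the same as the probability that this element would have been the new minimum if the true minimum had been removed (and never generated). This seems obvious, but I think that, because of some conditional-probability issues, it might not hold for other distributions.

Using fact 1, we know that choosing a single element can be done by generating exponential samples whose rate parameters equal the weights, and then choosing the element with the smallest sample.

Using fact 2, we know that we do not have to regenerate the exponential samples. Instead, generate just one per element and take the k elements with the smallest samples.

Finding the smallest k can be done in O(n). Use the Quickselect algorithm to find the k-th element, then simply make another pass over all the elements and output every one that is lower than the k-th.

A useful note: if you do not have immediate access to a library for generating exponential samples, it can easily be done with: -ln(rand())/weight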
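The recipe above might look like this in Python. Treat it as a sketch under stated assumptions: `sample_by_exponential_keys` and its `rng` parameter are names invented for this example, items with non-positive weight are simply skipped, and `heapq.nsmallest` stands in for Quickselect, which makes this version O(n log k) rather than O(n).

```python
import heapq
import math
import random

def sample_by_exponential_keys(items, weights, k, rng=random):
    """Pick k distinct items by keeping the k smallest exponential keys."""
    keyed = []
    for item, weight in zip(items, weights):
        if weight <= 0:
            continue  # assumption: non-positive weights are never selected
        # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
        key = -math.log(1.0 - rng.random()) / weight
        keyed.append((key, item))
    # nsmallest is O(n log k); Quickselect would reach the O(n) bound instead.
    smallest = heapq.nsmallest(k, keyed, key=lambda pair: pair[0])
    return [item for _, item in smallest]

print(sample_by_exponential_keys(["a", "b", "c", "d"], [1, 2, 3, 4], 2))
```

The result comes back ordered by key, so the first element is the one that a single weighted draw would have picked.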

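【Comments】:

I think this is the only correct answer here after so many years. I once found this correct approach myself but never remembered to post it here. I will read your answer thoroughly and accept it.

Is there a reference for this method? I found Weighted Random Sampling, which has a similar idea. I can see the logic, but what I am missing is a reference showing that this yields the correct distribution.

No, sorry, I do not have one; I came up with this on my own. So you are right to be skeptical. But I would be surprised if someone smarter than me had not discovered it before and given it a more rigorous treatment than a Stack Overflow post.

I implemented this in JavaScript, but it does not give the expected results: jsfiddle.net/gasparl/kamub5rq/18 - any ideas?

【Answer 12】: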

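I have implemented an algorithm similar to Jason Orendorff's idea in Rust here. My version also supports bulk operations: insertion into and deletion from the data structure (for when you want to remove a set of items given by their ids, rather than through the weighted selection path) in O(m + log n) time, where m is the number of items to remove and n is the number of items stored.

【Comments】:

【Answer 13】:

Sampling without replacement using recursion: an elegant and very short solution in C#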

// In how many ways can we choose 4 out of 60 students, so that we choose a different 4 each time

using System;

class Program
{
    static void Main(string[] args)
    {
        int group = 60;
        int studentsToChoose = 4;

        Console.WriteLine(FindNumberOfStudents(studentsToChoose, group));
    }

    private static int FindNumberOfStudents(int studentsToChoose, int group)
    {
        if (studentsToChoose == group || studentsToChoose == 0)
            return 1;

        return FindNumberOfStudents(studentsToChoose, group - 1) + FindNumberOfStudents(studentsToChoose - 1, group - 1);
    }
}

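【Comments】:

【Answer 14】:

I just spent a few hours trying to understand the algorithms underlying sampling without replacement, and the topic is more complex than I initially thought. That is exciting! For the benefit of future readers (have a nice day!) I document my insights here, including a ready-to-use function that respects given inclusion probabilities. A quick mathematical overview of the various methods can be found here: Tillé: Algorithms of sampling with equal or unequal probabilities. For example, Jason's method can be found on page 46. The caveat with his method is that the weights are not proportional to the inclusion probabilities, as is also noted in the document. In fact, the i-th inclusion probability can be computed recursively as follows: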

def inclusion_probability(i, weights, k):
    """
        Computes the inclusion probability of the i-th element
        in a randomly sampled k-tuple using Jason's algorithm
        (see https://***.com/a/2149533/7729124)
    """
    if k <= 0: return 0
    cum_p = 0
    for j, weight in enumerate(weights):
        # compute the probability of j being selected considering the weights
        p = weight / sum(weights)

        if i == j:
            # if this is the target element, we don't have to go deeper,
            # since we know that i is included
            cum_p += p
        else:
            # if this is not the target element, then we compute the conditional
            # inclusion probability of i under the constraint that j is included
            cond_i = i if i < j else i-1
            cond_weights = weights[:j] + weights[j+1:]
            cond_p = inclusion_probability(cond_i, cond_weights, k-1)
            cum_p += p * cond_p
    return cum_p

We can check the validity of the function above by comparing its output with an empirical run:

In : for i in range(3): print(i, inclusion_probability(i, [1,2,3], 2))
0 0.41666666666666663
1 0.7333333333333333
2 0.85

In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: random_weighted_sample_no_replacement([(1,'a'),(2,'b'),(3,'c')],2))
Out: Counter({'a': 4198, 'b': 7268, 'c': 8534})

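One way to specify the inclusion probabilities (also suggested in the document above) is to compute the weights from them. The whole complexity of the problem at hand stems from the fact that one cannot do this directly, since one basically has to invert the recursion formula, which I claim is impossible symbolically. Numerically it can be done with all kinds of methods, e.g. Newton's method. However, the complexity of inverting the Jacobian in plain Python quickly becomes unbearable; in that case I really recommend looking into numpy.random.choice.

Luckily there is a method using plain Python that may or may not be sufficiently performant for your purposes; it works great if there are not that many different weights. You can find the algorithm on pages 75 and 76. It works by splitting the sampling process into parts with the same inclusion probabilities, i.e. we can use random.sample again! I am not going to explain the principle here, since the basics are nicely presented on page 69. Here is the code, with hopefully a sufficient amount of comments: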

def sample_no_replacement_exact(items, k, best_effort=False, random_=None, ε=1e-9):
    """
        Returns a random sample of k elements from items, where items is a list of
        tuples (weight, element). The inclusion probability of an element in the
        final sample is given by
           k * weight / sum(weights).

        Note that the function raises if an inclusion probability cannot be
        satisfied, e.g the following call is obviously illegal:
           sample_no_replacement_exact([(1,'a'),(2,'b')],2)
        Since selecting two elements means selecting both all the time,
        'b' cannot be selected twice as often as 'a'. In general it can be hard to
        spot if the weights are illegal and the function does *not* always raise
        an exception in that case. To remedy the situation you can pass
        best_effort=True which redistributes the inclusion probability mass
        if necessary. Note that the inclusion probabilities will change
        if deemed necessary.

        The algorithm is based on the splitting procedure on page 75/76 in:
        http://www.eustat.eus/productosServicios/52.1_Unequal_prob_sampling.pdf
        Additional information can be found here:
        https://***.com/questions/2140787/

        :param items: list of tuples of type weight,element
        :param k: length of resulting sample
        :param best_effort: fix inclusion probabilities if necessary,
                            (optional, defaults to False)
        :param random_: random module to use (optional, defaults to the
                        standard random module)
        :param ε: fuzziness parameter when testing for zero in the context
                  of floating point arithmetic (optional, defaults to 1e-9)
        :return: random sample set of size k
        :exception: throws ValueError in case of bad parameters,
                    throws AssertionError in case of algorithmic impossibilities
    """
    # random_ defaults to the random submodule
    if not random_:
        random_ = random

    # special case empty return set
    if k <= 0:
        return set()

    if k > len(items):
        raise ValueError("resulting tuple length exceeds number of elements (k > n)")

    # sort items by weight
    items = sorted(items, key=lambda item: item[0])

    # extract the weights and elements
    weights, elements = list(zip(*items))

    # compute the inclusion probabilities (short: π) of the elements
    scaling_factor = k / sum(weights)
    π = [scaling_factor * weight for weight in weights]

    # in case of best_effort: if an inclusion probability exceeds 1,
    # try to rebalance the probabilities such that:
    # a) no probability exceeds 1,
    # b) the probabilities still sum to k, and
    # c) the probability masses flow from top to bottom:
    #    [0.2, 0.3, 1.5] -> [0.2, 0.8, 1]
    # (remember that π is sorted)
    if best_effort and π[-1] > 1 + ε:
        # probability mass we still have to distribute
        debt = 0.
        for i in reversed(range(len(π))):
            if π[i] > 1.:
                # an 'offender', take away excess
                debt += π[i] - 1.
                π[i] = 1.
            else:
                # case π[i] <= 1, i.e. 'safe' element
                # maximum we can transfer from debt to π[i] and still not
                # exceed 1 is computed by the minimum of:
                # a) 1 - π[i], and
                # b) debt
                max_transfer = min(debt, 1. - π[i])
                debt -= max_transfer
                π[i] += max_transfer
        assert debt < ε, "best effort rebalancing failed (impossible)"

    # make sure we are talking about probabilities
    if any(not (0 - ε <= π_i <= 1 + ε) for π_i in π):
        raise ValueError("inclusion probabilities not satisfiable: {}" \
                         .format(list(zip(π, elements))))

    # special case equal probabilities
    # (up to fuzziness parameter, remember that π is sorted)
    if π[-1] < π[0] + ε:
        return set(random_.sample(elements, k))

    # compute the two possible lambda values, see formula 7 on page 75
    # (remember that π is sorted)
    λ1 = π[0] * len(π) / k
    λ2 = (1 - π[-1]) * len(π) / (len(π) - k)
    λ = min(λ1, λ2)

    # there are two cases now, see also page 69
    # CASE 1
    # with probability λ we are in the equal probability case
    # where all elements have the same inclusion probability
    if random_.random() < λ:
        return set(random_.sample(elements, k))

    # CASE 2:
    # with probability 1-λ we are in the case of a new sample without
    # replacement problem which is strictly simpler,
    # it has the following new probabilities (see page 75, π^(2)):
    new_π = [
        (π_i - λ * k / len(π))
        /
        (1 - λ)
        for π_i in π
    ]
    new_items = list(zip(new_π, elements))

    # the first few probabilities might be 0, remove them
    # NOTE: we make sure that floating point issues do not arise
    #       by using the fuzziness parameter
    while new_items and new_items[0][0] < ε:
        new_items = new_items[1:]

    # the last few probabilities might be 1, remove them and mark them as selected
    # NOTE: we make sure that floating point issues do not arise
    #       by using the fuzziness parameter
    selected_elements = set()
    while new_items and new_items[-1][0] > 1 - ε:
        selected_elements.add(new_items[-1][1])
        new_items = new_items[:-1]

    # the algorithm reduces the length of the sample problem,
    # it is guaranteed that:
    # if λ = λ1: the first item has probability 0
    # if λ = λ2: the last item has probability 1
    assert len(new_items) < len(items), "problem was not simplified (impossible)"

    # recursive call with the simpler sample problem
    # NOTE: we have to make sure that the selected elements are included
    return sample_no_replacement_exact(
        new_items,
        k - len(selected_elements),
        best_effort=best_effort,
        random_=random_,
        ε=ε
    ) | selected_elements

Examples:

In : sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c')],2)
Out: {'b', 'c'}

In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c'),(4,'d')],2))
Out: Counter({'a': 2048, 'b': 4051, 'c': 5979, 'd': 7922})

The weights sum to 10, so the inclusion probabilities work out to: a → 20%, b → 40%, c → 60%, d → 80%. (Sum: 200% = k.) It works!
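The arithmetic behind those percentages can be reproduced exactly with a few lines. This is only an illustrative sketch: the helper names `target_inclusion_probabilities` and `is_satisfiable` are assumptions introduced here, and integer weights are assumed so that `Fraction` can represent the values exactly.

```python
from fractions import Fraction

def target_inclusion_probabilities(weights, k):
    """Return k * w / sum(weights) for each integer weight, as exact fractions."""
    total = sum(weights)
    return [Fraction(k * w, total) for w in weights]

def is_satisfiable(weights, k):
    """True when every target inclusion probability lies in [0, 1]."""
    return all(0 <= p <= 1 for p in target_inclusion_probabilities(weights, k))

print(target_inclusion_probabilities([1, 2, 3, 4], 2))  # 1/5, 2/5, 3/5, 4/5
print(is_satisfiable([1, 2, 3, 4], 2))                  # True
print(is_satisfiable([1, 2], 2))                        # False: 4/3 exceeds 1
```

This check only compares each target with 1; it says nothing about how the sampler itself behaves.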

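One word of caution for production use of this function: illegal inputs for the weights can be very hard to spot. An obviously illegal example is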

In: sample_no_replacement_exact([(1,'a'),(2,'b')],2)
ValueError: inclusion probabilities not satisfiable: [(0.6666666666666666, 'a'), (1.3333333333333333, 'b')]

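b cannot appear twice as often as a, since both always have to be selected. There are more subtle examples. To avoid an exception in production, just use best_effort=True, which rebalances the inclusion probability mass so that we always have a valid distribution. Obviously this may change the inclusion probabilities.

【Comments】: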
