Generate random numbers with a given (numerical) distribution

Posted 2023-02-19


【Title】: Generate random numbers with a given (numerical) distribution 【Posted】: 2011-05-15 00:55:47 【Question】:

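I have a file with some probabilities for different values, e.g.:

I would like to generate random numbers using this distribution. Does an existing module that handles this exist? It's fairly simple to code this on your own (build the cumulative distribution function, generate a random value in [0,1] and pick the corresponding value), but it seems like this should be a common problem and probably someone has already created a function/module for it.

I need this because I want to generate a list of birthdays (which do not follow any distribution in the standard random module).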

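【Comments】:

Isn't this random.choice()? You build the master list with the proper number of occurrences and choose one. This is a duplicate question, of course.

Possible duplicate of Random weighted choice.

@S.Lott Isn't that very memory intensive for big differences in the distribution?

@S.Lott: Your choice method is probably fine for a small number of occurrences, but I'd rather avoid creating huge lists when it's not necessary.

@S.Lott: OK, about 10000*365 = 3650000, so roughly 3.6 million elements. I'm not sure about memory usage in Python, but it's at least 3.6M*4B = 14.4MB. Not a huge amount, but not something you should ignore either when there is an equally simple method that doesn't require the extra memory.

【Answer 1】: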

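scipy.stats.rv_discrete might be what you want. You can supply your probabilities via the values parameter. You can then use the rvs() method of the distribution object to generate random numbers.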
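A minimal sketch of that approach, assuming SciPy and NumPy are installed. The values and probabilities are the six-value example used throughout this thread; the names custom_dist and samples are only illustrative, and random_state is passed purely to make the run repeatable.

```python
import numpy as np
from scipy import stats

# Example distribution: the values 1..6 with their probabilities.
values = np.arange(1, 7)
probabilities = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]

# rv_discrete takes the distribution as a (values, probabilities) pair;
# the values have to be integers and the probabilities must sum to 1.
custom_dist = stats.rv_discrete(name="custom", values=(values, probabilities))

# Draw ten samples from the distribution.
samples = custom_dist.rvs(size=10, random_state=0)
print(samples)
```

The same object also exposes the usual distribution methods, for instance custom_dist.pmf(5) for the probability of a single value.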

As pointed out by Eugene Pakhomov in the comments, you can also pass a p keyword parameter to numpy.random.choice(), e.g.

numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])

If you are using Python 3.6 or above, you can use random.choices() from the standard library; see the answer by Mark Dickinson.

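【Comments】:

On my machine numpy.random.choice() is almost 20 times faster, and it does exactly the same thing with respect to the original question. E.g.: numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])

@EugenePakhomov That's nice, I didn't know that. I can see there's an answer mentioning this further down, but it doesn't contain any example code and doesn't have a lot of upvotes. I'll add a comment to this answer for better visibility.

Surprisingly, rv_discrete.rvs() works in O(len(p) * size) time and memory, while choice() seems to run in the optimal O(len(p) + log(len(p)) * size) time.

If you are using Python 3.6 or newer, there is another answer that doesn't require any add-on packages.

【Answer 2】: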

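Since Python 3.6, there's a solution for this in Python's standard library, namely random.choices.

Example usage: let's set up a population and weights matching those in the OP's question: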

>>> from random import choices
>>> population = [1, 2, 3, 4, 5, 6]
>>> weights = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]

Now choices(population, weights) generates a single sample, contained in a list of length 1:

>>> choices(population, weights)
[4]

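The optional keyword-only argument k allows one to request more than one sample at once. This is valuable because there's some preparatory work that random.choices has to do every time it's called, prior to generating any samples; by generating many samples at once, we only have to do that preparatory work once. Here we generate a million samples, and use collections.Counter to check that the distribution we get roughly matches the weights we gave.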

>>> million_samples = choices(population, weights, k=10**6)
>>> from collections import Counter
>>> Counter(million_samples)
Counter({5: 399616, 6: 200387, 4: 200117, 1: 99636, 3: 50219, 2: 50025})

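【Comments】:

Is there a Python 2.7 version of this?

@abbas786: Not built in, but the other answers to this question should all work on Python 2.7. You could also look up the Python 3 source for random.choices and copy that, if so inclined.

【Answer 3】: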

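An advantage of generating the list from the CDF is that you can use binary search. While you need O(n) time and space for preprocessing, you can then get k numbers in O(k log n). Since normal Python lists are inefficient, you can use the array module.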
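The preprocessing and lookup steps described here can be sketched with the standard library's itertools.accumulate and bisect. The helper names build_cdf and sample are hypothetical, plain lists are used instead of the array module for brevity, and the optional rng parameter exists only so the random source can be swapped out.

```python
import bisect
import itertools
import random

def build_cdf(probabilities):
    """Return the running totals of the probabilities (O(n) preprocessing)."""
    return list(itertools.accumulate(probabilities))

def sample(values, cdf, k, rng=random):
    """Draw k values; each draw is one binary search, so O(k log n) overall."""
    total = cdf[-1]
    last = len(values) - 1
    result = []
    for _ in range(k):
        r = rng.random() * total
        # bisect_right returns the first index whose running total exceeds r
        idx = bisect.bisect_right(cdf, r)
        # min() guards against float rounding at the very top of the range
        result.append(values[min(idx, last)])
    return result

values = [1, 2, 3, 4, 5, 6]
cdf = build_cdf([0.1, 0.05, 0.05, 0.2, 0.4, 0.2])
print(sample(values, cdf, k=10))
```

Scaling the random number by the last running total means the probabilities do not have to sum to exactly 1.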

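If you insist on constant space, you can do the following; O(n) time, O(1) space.

import random

def random_distr(l):
    # l is a sequence of (item, probability) pairs whose probabilities sum to 1
    r = random.uniform(0, 1)
    s = 0
    for item, prob in l:
        s += prob  # running total of the probabilities seen so far
        if s >= r:
            return item
    return item  # Might occur because of floating point inaccuracies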

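【Comments】:

The order of the (item, prob) pairs in the list matters in your implementation, right?

@***user2010: It shouldn't matter (modulo errors in floating point).

Nice. I found this to be 30% faster than scipy.stats.rv_discrete.

Quite often this function will throw a KeyError because of the last line.

@DrunkenMaster: I don't understand. Are you aware that l[-1] returns the last element of the list?

【Answer 4】:

(OK, I know you are asking for shrink-wrap, but maybe those home-grown solutions just weren't succinct enough for your liking. :-)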

import random

pdf = [(1, 0.1), (2, 0.05), (3, 0.05), (4, 0.2), (5, 0.4), (6, 0.2)]
cdf = [(i, sum(p for j,p in pdf if j < i)) for i,_ in pdf]
R = max(i for r in [random.random()] for i,c in cdf if c <= r)

I pseudo-confirmed that this works by eyeballing the output of this expression:

sorted(max(i for r in [random.random()] for i,c in cdf if c <= r)
       for _ in range(1000))

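【Comments】:

This looks impressive. Just to put things in context, here are the results from 3 consecutive executions of the above code:

['Count of 1 with prob: 0.1 is: 113', 'Count of 2 with prob: 0.05 is: 55', 'Count of 3 with prob: 0.05 is: 50', 'Count of 4 with prob: 0.2 is: 201', 'Count of 5 with prob: 0.4 is: 388', 'Count of 6 with prob: 0.2 is: 193']

['Count of 1 with prob: 0.1 is: 77', 'Count of 2 with prob: 0.05 is: 60', 'Count of 3 with prob: 0.05 is: 51', 'Count of 4 with prob: 0.2 is: 193', 'Count of 5 with prob: 0.4 is: 438', 'Count of 6 with prob: 0.2 is: 181']

['Count of 1 with prob: 0.1 is: 84', 'Count of 2 with prob: 0.05 is: 52', 'Count of 3 with prob: 0.05 is: 53', 'Count of 4 with prob: 0.2 is: 210', 'Count of 5 with prob: 0.4 is: 405', 'Count of 6 with prob: 0.2 is: 196']

A question: how do I return max(i... if 'i' is an object?

@Vaibhav i is not an object.

【Answer 5】:

Maybe it is a bit late, but you can use numpy.random.choice(), passing the p parameter: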

val = numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])

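【Comments】:

The OP doesn't want to use random.choice(); see the comments.

numpy.random.choice() is completely different from random.choice() and supports a probability distribution.

Can't I use a function to define p? Why would I want to define it with numbers?

【Answer 6】:

I wrote a solution for drawing random samples from a custom continuous distribution.

I needed this for a use case similar to yours (i.e. generating random dates with a given probability distribution).

You just need the function random_custDist and the line samples=random_custDist(x0,x1,custDist=custDist,size=1000). The rest is decoration ^^.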

import numpy as np

#function
def random_custDist(x0,x1,custDist,size=None, nControl=10**6):
    #generate a list of size random samples, obeying the distribution custDist
    #suggests random samples between x0 and x1 and accepts the suggestion with probability custDist(x)
    #custDist does not need to be normalized. Add this condition to increase performance.
    #Best performance for max_x in [x0,x1] custDist(x) = 1
    samples=[]
    nLoop=0
    while len(samples)<size and nLoop<nControl:
        x=np.random.uniform(low=x0,high=x1)
        prop=custDist(x)
        assert prop>=0 and prop<=1
        if np.random.uniform(low=0,high=1) <=prop:
            samples += [x]
        nLoop+=1
    return samples

#call
x0=2007
x1=2019
def custDist(x):
    if x<2010:
        return .3
    else:
        return (np.exp(x-2008)-1)/(np.exp(2019-2007)-1)
samples=random_custDist(x0,x1,custDist=custDist,size=1000)
print(samples)

#plot
import matplotlib.pyplot as plt
#hist
bins=np.linspace(x0,x1,int(x1-x0+1))
hist=np.histogram(samples, bins )[0]
hist=hist/np.sum(hist)
plt.bar( (bins[:-1]+bins[1:])/2, hist, width=.96, label='sample distribution')
#dist
grid=np.linspace(x0,x1,100)
discCustDist=np.array([custDist(x) for x in grid]) #discrete version
discCustDist*=1/(grid[1]-grid[0])/np.sum(discCustDist)
plt.plot(grid,discCustDist,label='custom distribution (custDist)', color='C1', linewidth=4)
#decoration
plt.legend(loc=3,bbox_to_anchor=(1,0))
plt.show()

The performance of this solution can certainly be improved, but I prefer readability.
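The function above is a form of rejection sampling, and its assert requires custDist(x) to stay within [0, 1]. A possible variant, sketched here under the assumption that an upper bound of the function on [x0, x1] is known, scales the acceptance test by that bound so that unnormalized functions with values above 1 also work. The names rejection_sample, growth and upper_bound are hypothetical and not part of the answer.

```python
import math
import random

def rejection_sample(density, x0, x1, upper_bound, size, max_tries=10**6):
    """Draw `size` samples on [x0, x1] from an unnormalized density.

    `upper_bound` must be >= density(x) everywhere on [x0, x1].
    """
    samples = []
    tries = 0
    while len(samples) < size and tries < max_tries:
        x = random.uniform(x0, x1)
        # accept x with probability density(x) / upper_bound
        if random.uniform(0, upper_bound) < density(x):
            samples.append(x)
        tries += 1
    return samples

# Hypothetical density for illustration: grows exponentially over the interval.
def growth(x):
    return math.exp(x - 2007)

samples = rejection_sample(growth, 2007, 2019, upper_bound=math.exp(12), size=1000)
print(len(samples), min(samples), max(samples))
```

The tighter the bound, the fewer proposals are rejected; with this density roughly one proposal in twelve is accepted.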

【Comments】:

assert prop>=0 and prop<=1: why would the density of a continuous distribution be under 1?

【Answer 7】:

Make a list of items, based on their weights:

import random

items = [1, 2, 3, 4, 5, 6]
probabilities = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]
# if the list of probs is normalized (sum(probs) == 1), omit this part
prob = sum(probabilities)  # find sum of probs, to normalize them
c = 1.0 / prob  # a multiplier to make a list of normalized probs
# round away float noise so the digit count below stays small
probabilities = [round(c * x, 10) for x in probabilities]
print(probabilities)

# number of decimal places needed by the most precise probability
ml = max(probabilities, key=lambda x: len(str(x)) - str(x).find('.'))
ml = len(str(ml)) - str(ml).find('.') - 1
amounts = [round(x * (10 ** ml)) for x in probabilities]
itemsList = list()
for i in range(0, len(items)):  # iterate through original items
    itemsList += items[i:i+1] * amounts[i]

# choose from itemsList randomly
print(itemsList)
print(random.choice(itemsList))

An optimization may be to normalize the amounts by their greatest common divisor, to make the target list smaller.
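A minimal sketch of that optimization, assuming the amounts are already positive integers (here the example probabilities scaled by 100). The helper name expand_by_weight is hypothetical.

```python
from functools import reduce
from math import gcd
import random

def expand_by_weight(items, amounts):
    """Repeat each item amounts[i] // g times, where g is the GCD of all amounts."""
    g = reduce(gcd, amounts)
    expanded = []
    for item, amount in zip(items, amounts):
        expanded.extend([item] * (amount // g))
    return expanded

items = [1, 2, 3, 4, 5, 6]
amounts = [10, 5, 5, 20, 40, 20]  # the example probabilities scaled by 100
pool = expand_by_weight(items, amounts)
print(len(pool))            # 20 entries instead of 100
print(random.choice(pool))  # one weighted pick
```

Dividing by the GCD keeps the relative frequencies unchanged, so picking uniformly from the smaller list samples the same distribution.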

Also, this might be interesting.

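【Comments】:

This might use a lot of extra memory if the list of items is large.

@pafcu Agreed. It's just a solution, the second one that came to my mind (the first one was to search for something like "weight probability python" :) ).

【Answer 8】:

One more answer, probably faster :)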

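import bisect
import random

distribution = [(1, 0.2), (2, 0.3), (3, 0.5)]
# init distribution
dlist = []
sumchance = 0
for value, chance in distribution:
    sumchance += chance
    dlist.append((value, sumchance))
assert abs(sumchance - 1.0) < 1e-9  # tolerance instead of float equality
totals = [cumulative for _, cumulative in dlist]  # cumulative totals only, for bisect

def get_random_value():
    r = random.random()
    # for small distributions use linear search
    if len(dlist) < 64:  # don't know exact speed limit
        for value, cumulative in dlist:
            if r < cumulative:
                return value
        return dlist[-1][0]  # r fell past the last total due to float rounding
    # else binary search over the cumulative totals
    index = bisect.bisect_right(totals, r)
    return dlist[min(index, len(dlist) - 1)][0]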

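【Comments】:

Does the distribution list need to be sorted by probability?

It doesn't need to be, but it will perform fastest if it is sorted with the largest probability first.

【Answer 9】: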

from __future__ import division
import random
from collections import Counter


def num_gen(num_probs):
    # calculate minimum probability to normalize
    min_prob = min(prob for num, prob in num_probs)
    lst = []
    for num, prob in num_probs:
        # keep appending num to lst, proportional to its probability in the distribution
        for _ in range(int(prob/min_prob)):
            lst.append(num)
    # all elems in lst occur proportional to their distribution probablities
    while True:
        # pick a random index from lst
        ind = random.randint(0, len(lst)-1)
        yield lst[ind]

Verification:

gen = num_gen([(1, 0.1),
               (2, 0.05),
               (3, 0.05),
               (4, 0.2),
               (5, 0.4),
               (6, 0.2)])
lst = []
times = 10000
for _ in range(times):
    lst.append(next(gen))
# Verify the created distribution:
for item, count in Counter(lst).items():
    print('%d has %f probability' % (item, count/times))

1 has 0.099737 probability
2 has 0.050022 probability
3 has 0.049996 probability 
4 has 0.200154 probability
5 has 0.399791 probability
6 has 0.200300 probability

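【Comments】:

【Answer 10】:

Building on the other solutions: you generate the cumulative distribution (as integers or floats, whichever you like), and then you can use bisect to make it fast.

This is a simple example (I used integers here):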

import bisect
import random

l=[(20, 'foo'), (60, 'banana'), (10, 'monkey'), (10, 'monkey2')]
def get_cdf(l):
    ret=[]
    c=0
    for i in l:
        c+=i[0]
        ret.append((c, i[1]))
    return ret

def get_random_item(cdf):
    # draw from 1..total so each item gets exactly as many outcomes as its weight
    r = random.randint(1, cdf[-1][0])
    return cdf[bisect.bisect_left(cdf, (r,))][1]

cdf=get_cdf(l)
for i in range(100):
    print(get_random_item(cdf), end=' ')

The get_cdf function converts the weights from 20, 60, 10, 10 into 20, 20+60, 20+60+10, 20+60+10+10.

Now we pick a random number up to 20+60+10+10 using random.randint, then we use bisect to get the actual value quickly.

【Comments】:

【Answer 11】:

You might want to have a look at NumPy Random sampling distributions.

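【Comments】:

The numpy functions also seem to support only a limited number of distributions, with no support for specifying your own.

Updated link: docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html instead of docs.scipy.org/doc/numpy/reference/routines.random.html

【Answer 12】:

None of these answers is particularly clear or simple.

Here is a clear, simple method that is guaranteed to work.

accumulate_normalize_values takes a dictionary p that maps symbols to probabilities OR frequencies. It outputs a usable list of tuples from which to do the selection.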

def accumulate_normalize_values(p):
        pi = p.items() if isinstance(p,dict) else p
        accum_pi = []
        accum = 0
        for i in pi:
                accum_pi.append((i[0],i[1]+accum))
                accum += i[1]
        if accum == 0:
                raise Exception( "You are about to explode the universe. Continue ? Y/N " )
        normed_a = []
        for a in accum_pi:
                normed_a.append((a[0],a[1]*1.0/accum))
        return normed_a

Yields:

>>> accumulate_normalize_values({'a': 100, 'b': 300, 'c': 400, 'd': 200})
[('a', 0.1), ('c', 0.5), ('b', 0.8), ('d', 1.0)]

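Why it works

The accumulation step turns each symbol into an interval between itself and the previous symbol's probability or frequency (or 0 in the case of the first symbol). These intervals can be used to select from the list (and thus sample the provided distribution) by simply stepping through the list until the random number in the interval 0.0 -> 1.0 (prepared earlier) is less than or equal to the current symbol's interval end-point.

The normalization releases us from the need to make sure everything sums to some value. After normalization, the "vector" of probabilities sums to 1.0.

The rest of the code, for selecting and for generating an arbitrarily long sample from the distribution, is below: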

def select(symbol_intervals,random):
        # debug output, enable if needed: print(symbol_intervals, random)
        i = 0
        while random > symbol_intervals[i][1]:
                i += 1
                if i >= len(symbol_intervals):
                        raise Exception( "What did you DO to that poor list?" )
        return symbol_intervals[i][0]


def gen_random(alphabet,length,probabilities=None):
        from random import random
        from itertools import repeat
        if probabilities is None:
                probabilities = dict(zip(alphabet,repeat(1.0)))
        elif len(probabilities) > 0 and isinstance(probabilities[0],(int,float)):
                probabilities = dict(zip(alphabet,probabilities)) #ordered
        usable_probabilities = accumulate_normalize_values(probabilities)
        gen = []
        while len(gen) < length:
                gen.append(select(usable_probabilities,random()))
        return gen

Usage:

>>> gen_random (['a','b','c','d'],10,[100,300,400,200])
['d', 'b', 'b', 'a', 'c', 'c', 'b', 'c', 'c', 'c']   #<--- some of the time

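【Comments】:

【Answer 13】:

Here is a more efficient way of doing this:

Just call the following function with your 'weights' array (taking the indices as the corresponding items) and the number of samples needed. This function can easily be modified to handle ordered pairs.

It returns the indexes (or items) sampled/picked (with replacement) according to their respective probabilities: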

import random

def resample(weights, n):
    # weights: one non-negative weight per item; n: number of samples to draw
    num_items = len(weights)
    beta = 0

    # Caveat: Assign max weight to max*2 for best results
    max_w = max(weights)*2

    # Pick an item uniformly at random, to start with
    current_item = random.randint(0, num_items-1)
    result = []

    for i in range(n):
        beta += random.uniform(0,max_w)

        while weights[current_item] < beta:
            beta -= weights[current_item]
            current_item = (current_item + 1) % num_items   # cyclic
        result.append(current_item)
    return result

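A short note on the concept used in the while loop: we subtract the current item's weight from the cumulative beta, which is a cumulative value built up uniformly at random, and advance the current index until we find the item whose weight matches the value of beta.

【Comments】: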
