Random Number from Histogram

Posted: 2013-07-23 04:30:44

【Question】:

Suppose I create a histogram using scipy/numpy, so I have two arrays: one for the bin counts and one for the bin edges. If I use the histogram to represent a probability distribution function, how can I efficiently generate random numbers from that distribution?

【Comments】:

Can you clarify? Do you want a certain number of random values per histogram bin, or do you want random values based on a weight function derived from a polynomial interpolation of the histogram values?
Returning the bin centers is fine. No interpolation or fitting is needed.

【Answer 1】:

This is probably what np.random.choice does in @Ophion's answer, but you can construct a normalized cumulative distribution function and then pick values from it with uniform random numbers:

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(size=1000)
hist, bins = np.histogram(data, bins=50)

bin_midpoints = bins[:-1] + np.diff(bins)/2
cdf = np.cumsum(hist)
cdf = cdf / cdf[-1]
values = np.random.rand(10000)
value_bins = np.searchsorted(cdf, values)
random_from_cdf = bin_midpoints[value_bins]

plt.subplot(121)
plt.hist(data, 50)
plt.subplot(122)
plt.hist(random_from_cdf, 50)
plt.show()
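
For reference, SciPy 0.19 and later also ship scipy.stats.rv_histogram, which packages this same CDF-inversion idea and draws samples anywhere inside a bin rather than only at bin centers. A minimal sketch, assuming that API is available:

import numpy as np
from scipy import stats

data = np.random.normal(size=1000)
hist, bins = np.histogram(data, bins=50)

# rv_histogram takes the (counts, bin_edges) pair returned by np.histogram
dist = stats.rv_histogram((hist, bins))
samples = dist.rvs(size=10000)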


The 2D case can be done as follows:

data = np.column_stack((np.random.normal(scale=10, size=1000),
                        np.random.normal(scale=20, size=1000)))
x, y = data.T                        
hist, x_bins, y_bins = np.histogram2d(x, y, bins=(50, 50))
x_bin_midpoints = x_bins[:-1] + np.diff(x_bins)/2
y_bin_midpoints = y_bins[:-1] + np.diff(y_bins)/2
cdf = np.cumsum(hist.ravel())
cdf = cdf / cdf[-1]

values = np.random.rand(10000)
value_bins = np.searchsorted(cdf, values)
x_idx, y_idx = np.unravel_index(value_bins,
                                (len(x_bin_midpoints),
                                 len(y_bin_midpoints)))
random_from_cdf = np.column_stack((x_bin_midpoints[x_idx],
                                   y_bin_midpoints[y_idx]))
new_x, new_y = random_from_cdf.T

plt.subplot(121, aspect='equal')
plt.hist2d(x, y, bins=(50, 50))
plt.subplot(122, aspect='equal')
plt.hist2d(new_x, new_y, bins=(50, 50))
plt.show()

【Discussion】:

Yes, this definitely works! Can it be generalized to higher-dimensional histograms?
@xvtk I've edited my answer with a 2D histogram. You should be able to apply the same scheme to higher-dimensional distributions.
If you are using Python 2, you need to add the from __future__ import division import, or change the cdf normalization line to cdf = cdf / float(cdf[-1]).
You're totally right, Noam. It has become so second-nature to me to put that at the top of every Python file I write that I keep forgetting it isn't the standard behavior. I've edited my answer.
I also added an example (as a new answer) showing how to generate random numbers from a kde (kernel density estimate) of the histogram, which captures the histogram better.
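
A sketch of how the same flatten, cumulate, and unravel scheme could extend to any number of dimensions, assuming np.histogramdd is used to build the histogram; the variable names here are illustrative:

import numpy as np

data = np.random.normal(size=(1000, 3))  # 3-dimensional example data
hist, edges = np.histogramdd(data, bins=10)
midpoints = [e[:-1] + np.diff(e) / 2 for e in edges]

# Flatten the counts into a normalized CDF
cdf = np.cumsum(hist.ravel())
cdf = cdf / cdf[-1]

# Invert the CDF with uniform draws, then map flat indices back to N-D bins
values = np.random.rand(10000)
flat_idx = np.searchsorted(cdf, values)
idx = np.unravel_index(flat_idx, hist.shape)

# One column of sampled bin midpoints per dimension
samples = np.column_stack([m[i] for m, i in zip(midpoints, idx)])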

【Answer 2】:

The @Jaime solution is great, but you should consider using a kde (kernel density estimate) of the histogram. A good explanation of why doing statistics over the histogram is problematic, and why you should use a kde instead, can be found here.

I edited @Jaime's code to show how to use a kde from scipy. It looks almost the same, but captures the histogram generator better.

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def run():
    data = np.random.normal(size=1000)
    hist, bins = np.histogram(data, bins=50)

    x_grid = np.linspace(min(data), max(data), 1000)
    kdepdf = kde(data, x_grid, bandwidth=0.1)
    random_from_kde = generate_rand_from_pdf(kdepdf, x_grid)

    bin_midpoints = bins[:-1] + np.diff(bins) / 2
    random_from_cdf = generate_rand_from_pdf(hist, bin_midpoints)

    plt.subplot(121)
    plt.hist(data, 50, density=True, alpha=0.5, label='hist')
    plt.plot(x_grid, kdepdf, color='r', alpha=0.5, lw=3, label='kde')
    plt.legend()
    plt.subplot(122)
    plt.hist(random_from_cdf, 50, alpha=0.5, label='from hist')
    plt.hist(random_from_kde, 50, alpha=0.5, label='from kde')
    plt.legend()
    plt.show()


def kde(x, x_grid, bandwidth=0.2, **kwargs):
    """Kernel Density Estimation with Scipy"""
    kde = gaussian_kde(x, bw_method=bandwidth / x.std(ddof=1), **kwargs)
    return kde.evaluate(x_grid)


def generate_rand_from_pdf(pdf, x_grid):
    cdf = np.cumsum(pdf)
    cdf = cdf / cdf[-1]
    values = np.random.rand(1000)
    value_bins = np.searchsorted(cdf, values)
    random_from_cdf = x_grid[value_bins]
    return random_from_cdf

【Discussion】:

Why do you use bw_method=bandwidth / x.std(ddof=1)? I would have thought bw_method=bandwidth * x.std(ddof=1) instead?
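
A small check of that convention, assuming SciPy's gaussian_kde semantics: a scalar bw_method is used as a factor that is multiplied by the data's standard deviation internally, so dividing an absolute bandwidth by the standard deviation recovers that bandwidth:

import numpy as np
from scipy.stats import gaussian_kde

x = np.random.normal(size=1000)
bandwidth = 0.1

kde = gaussian_kde(x, bw_method=bandwidth / x.std(ddof=1))
# The kernel covariance is the data variance times factor**2, so its
# square root should come out close to the requested bandwidth of 0.1
print(np.sqrt(kde.covariance[0, 0]))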

【Answer 3】:

Maybe something like this. Use the counts of the histogram as weights, and choose values from the bin edges according to those weights:

import numpy as np

initial = np.random.rand(1000)
values, indices = np.histogram(initial, bins=20)
values = values.astype(np.float32)
weights = values / np.sum(values)

# Below, 5 is the size of the returned sample.
new_random = np.random.choice(indices[1:], 5, p=weights)
print(new_random)

#[ 0.55141614  0.30226256  0.25243184  0.90023117  0.55141614]
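
A variant of the same idea that samples bin midpoints instead of the right-hand bin edges, and draws a larger sample in one call; a sketch with illustrative variable names:

import numpy as np

initial = np.random.rand(1000)
counts, edges = np.histogram(initial, bins=20)
midpoints = edges[:-1] + np.diff(edges) / 2
weights = counts / counts.sum()

# Draw 10000 samples located at the bin midpoints
new_random = np.random.choice(midpoints, size=10000, p=weights)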

【Discussion】:

【Answer 4】:

I ran into the same problem as the OP and want to share my approach to solving it.

Following the Jaime answer and the Noam Peled answer, I built a solution for the 2D problem using Kernel Density Estimation (KDE).

First, let's generate some random data and then calculate its Probability Density Function (PDF) from the KDE. I will use the example available in SciPy for that.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def measure(n):
    "Measurement model, return two coupled measurements."
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1+m2, m1-m2

m1, m2 = measure(2000)
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()

X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)

fig, ax = plt.subplots()
ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
          extent=[xmin, xmax, ymin, ymax])
ax.plot(m1, m2, 'k.', markersize=2)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])

The plot looks like this:

Now, we draw random data from the PDF obtained from the KDE, i.e. from the variable Z:

# Generate the bins for each axis
x_bins = np.linspace(xmin, xmax, Z.shape[0]+1)
y_bins = np.linspace(ymin, ymax, Z.shape[1]+1)

# Find the middle point for each bin
x_bin_midpoints = x_bins[:-1] + np.diff(x_bins)/2
y_bin_midpoints = y_bins[:-1] + np.diff(y_bins)/2

# Calculate the Cumulative Distribution Function (CDF) from the PDF
cdf = np.cumsum(Z.ravel())
cdf = cdf / cdf[-1]  # Normalization

# Create random data
values = np.random.rand(10000)

# Find the data position
value_bins = np.searchsorted(cdf, values)
x_idx, y_idx = np.unravel_index(value_bins,
                                (len(x_bin_midpoints),
                                 len(y_bin_midpoints)))

# Create the new data
new_data = np.column_stack((x_bin_midpoints[x_idx],
                            y_bin_midpoints[y_idx]))
new_x, new_y = new_data.T

We can calculate a KDE from this new data and plot it:

kernel = stats.gaussian_kde(new_data.T)
new_Z = np.reshape(kernel(positions).T, X.shape)

fig, ax = plt.subplots()
ax.imshow(np.rot90(new_Z), cmap=plt.cm.gist_earth_r,
          extent=[xmin, xmax, ymin, ymax])
ax.plot(new_x, new_y, 'k.', markersize=2)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
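
As an alternative, gaussian_kde can also draw samples directly: its resample method returns points from the fitted KDE without going through a binned CDF at all. A minimal self-contained sketch using the same coupled-measurement setup as above:

import numpy as np
from scipy import stats

m1 = np.random.normal(size=2000)
m2 = np.random.normal(scale=0.5, size=2000)
kernel = stats.gaussian_kde(np.vstack([m1 + m2, m1 - m2]))

# resample draws directly from the fitted KDE; the result has shape (2, 10000)
new_x, new_y = kernel.resample(10000)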

【Discussion】:

【Answer 5】:

Here is a solution that returns data points uniformly distributed within each bin rather than placed at the bin centers:

import numpy as np

def draw_from_hist(hist, bins, nsamples=100000):
    # Cumulative counts with a leading 0, aligned with the bin edges
    cumsum = [0] + list(np.cumsum(hist))
    rand = np.random.rand(nsamples) * max(cumsum)
    return [np.interp(x, cumsum, bins) for x in rand]
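
A quick usage sketch, assuming the draw_from_hist function above and a histogram built with np.histogram:

import numpy as np

data = np.random.normal(size=1000)
hist, bins = np.histogram(data, bins=50)

# Each sample falls somewhere inside a bin, not only at a bin center
samples = draw_from_hist(hist, bins, nsamples=10000)
print(min(samples), max(samples))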

【Discussion】:

【Answer 6】:

Something didn't quite work for me in the solutions suggested by @daniel, @arco-bast, et al.

Take the last example:

def draw_from_hist(hist, bins, nsamples=100000):
    cumsum = [0] + list(np.cumsum(hist))
    rand = np.random.rand(nsamples) * max(cumsum)
    return [np.interp(x, cumsum, bins) for x in rand]

This assumes that at least the first bin has zero content, which may or may not be true. Second, it assumes that the value of the PDF sits at the bin boundaries, which it does not; it mostly sits in the center of the bin.

Here is another solution, done in two parts:

def init_cdf(hist,bins):
    """Initialize CDF from histogram

    Parameters
    ----------
        hist : array-like, float of size N
            Histogram height 
        bins : array-like, float of size N+1
            Histogram bin boundaries 

    Returns:
    --------
        cdf : array-like, float of size N+1
    """
    from numpy import concatenate, diff, cumsum

    # Calculate half bin sizes
    steps  = diff(bins) / 2  # Half bin size

    # Calculate slope between bin centres 
    slopes = diff(hist) / (steps[:-1]+steps[1:]) 

    # Find height of end points by linear interpolation
    # - First part is linear interpolation from second over first
    #   point to lowest bin edge
    # - Second part is linear interpolation left neighbor to 
    #   right neighbor up to but not including last point
    # - Third part is linear interpolation from second to last point 
    #   over last point to highest bin edge
    # Can probably be done more elegant
    ends = concatenate(([hist[0] - steps[0] * slopes[0]], 
                        hist[:-1] + steps[:-1] * slopes,
                        [hist[-1] + steps[-1] * slopes[-1]]))

    # Calculate cumulative sum 
    sum = cumsum(ends)
    # Subtract off lower bound and scale by upper bound 
    sum -= sum[0]
    sum /= sum[-1]

    # Return the CDF 
    return sum

def sample_cdf(cdf,bins,size):
    """Sample a CDF defined at specific points.

    Linear interpolation between defined points 

    Parameters
    ----------
       cdf : array-like, float, size N
           CDF evaluated at all points of bins. First and 
           last point of bins are assumed to define the domain
           over which the CDF is normalized. 
       bins : array-like, float, size N
           Points where the CDF is evaluated.  First and last points 
           are assumed to define the end-points of the CDF's domain
       size : integer, non-zero
           Number of samples to draw 
    Returns
    -------
        sample : array-like, float, of size ``size``
             Random sample
    """
    from numpy import interp
    from numpy.random import random 

    return interp(random(size), cdf, bins)

# Begin example code
import numpy as np
import matplotlib.pyplot as plt

# initial histogram, coarse binning
hist,bins = np.histogram(np.random.normal(size=1000),np.linspace(-2,2,21))

# Calculate CDF, make sample, and new histogram w/finer binning
cdf = init_cdf(hist,bins)
sample = sample_cdf(cdf,bins,1000)
hist2,bins2 = np.histogram(sample,np.linspace(-3,3,61))

# Calculate bin centres and widths 
mx = (bins[1:]+bins[:-1])/2
dx = np.diff(bins)
mx2 = (bins2[1:]+bins2[:-1])/2
dx2 = np.diff(bins2)

# Plot, taking care to show uncertainties and so on
plt.errorbar(mx,hist/dx,np.sqrt(hist)/dx,dx/2,'.',label='original')
plt.errorbar(mx2,hist2/dx2,np.sqrt(hist2)/dx2,dx2/2,'.',label='new')
plt.legend()

Sorry, I don't know how to get the plot to display inline here, so copy, paste, and run the code to see the point.

【Discussion】:

My solution does not assume the first bin is empty. Try draw_from_hist([1], [0, 1]). As expected, this draws uniformly from the interval [0, 1].
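
A small self-contained check of that claim, using the cleaned-up version of the function from above:

import numpy as np

def draw_from_hist(hist, bins, nsamples=100000):
    cumsum = [0] + list(np.cumsum(hist))
    rand = np.random.rand(nsamples) * max(cumsum)
    return [np.interp(x, cumsum, bins) for x in rand]

# With a single bin of unit weight, the samples should be uniform on [0, 1]:
samples = draw_from_hist([1], [0, 1], nsamples=10000)
print(np.mean(samples), np.std(samples))  # roughly 0.5 and 0.29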

【Answer 7】:

I stumbled upon this question while looking for a way to generate a random array based on the distribution of another array. If this were in numpy, I would call it a random_like() function.

Then I realized I had already written a package, Redistributor, which might do this for me, even though it was created with a somewhat different motivation (an Sklearn transformer capable of transforming data from an arbitrary distribution to an arbitrary known distribution, for machine learning purposes). Of course I understand that unnecessary dependencies are undesirable, but at least knowing about this package might be useful to you some day. The thing the OP asked about is basically done here.

Warning: under the hood, everything is done in 1D. The package also implements a multidimensional wrapper, but I didn't write this example using it because I find it too niche.

Installation:

pip install git+https://gitlab.com/paloha/redistributor

Implementation:

import numpy as np
import matplotlib.pyplot as plt

def random_like(source, bins=0, seed=None):
    from redistributor import Redistributor
    np.random.seed(seed)
    noise = np.random.uniform(source.min(), source.max(), size=source.shape)
    s = Redistributor(bins=bins, bbox=[source.min(), source.max()]).fit(source.ravel())
    s.cdf, s.ppf = s.source_cdf, s.source_ppf
    r = Redistributor(target=s, bbox=[noise.min(), noise.max()]).fit(noise.ravel())
    return r.transform(noise.ravel()).reshape(noise.shape)

source = np.random.normal(loc=0, scale=1, size=(100,100))
t = random_like(source, bins=80) # More bins more precision (0 = automatic)

# Plotting
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title(f'Distribution of source data, shape: {source.shape}')
plt.hist(source.ravel(), bins=100)
plt.subplot(122); plt.title(f'Distribution of generated data, shape: {t.shape}')
plt.hist(t.ravel(), bins=100); plt.show()

Explanation:

import numpy as np
import matplotlib.pyplot as plt
from redistributor import Redistributor
from sklearn.metrics import mean_squared_error

# We have some source array with "some unknown" distribution (e.g. an image)
# For the sake of example we just generate a random gaussian matrix
source = np.random.normal(loc=0, scale=1, size=(100,100))
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title('Source data'); plt.imshow(source, origin='lower') 
plt.subplot(122); plt.title('Source data hist'); plt.hist(source.ravel(), bins=100); plt.show()

# We want to generate a random matrix from the distribution of the source
# So we create a random uniformly distributed array called noise
noise = np.random.uniform(source.min(), source.max(), size=(100,100))
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title('Uniform noise'); plt.imshow(noise, origin='lower')
plt.subplot(122); plt.title('Uniform noise hist'); plt.hist(noise.ravel(), bins=100); plt.show()

# Then we fit (approximate) the source distribution using Redistributor
# This step internally approximates the cdf and ppf functions.
s = Redistributor(bins=200, bbox=[source.min(), source.max()]).fit(source.ravel())

# A little naming workaround to make obj s work as a target distribution
s.cdf = s.source_cdf
s.ppf = s.source_ppf

# Here we create another Redistributor but now we use the fitted Redistributor s as a target
r = Redistributor(target=s, bbox=[noise.min(), noise.max()])

# Here we fit the Redistributor r to the noise array's distribution
r.fit(noise.ravel())

# And finally, we transform the noise into the source's distribution
t = r.transform(noise.ravel()).reshape(noise.shape)
plt.figure(figsize=(12,4))
plt.subplot(121); plt.title('Transformed noise'); plt.imshow(t, origin='lower')
plt.subplot(122); plt.title('Transformed noise hist'); plt.hist(t.ravel(), bins=100); plt.show()

# Computing the difference between the two arrays
print('Mean Squared Error between source and transformed: ', mean_squared_error(source, t))

Mean Squared Error between source and transformed: 2.0574123162302143

【Discussion】:
