如何仅从直方图值创建 KDE？

Posted 2023-02-14

技术标签:

【中文标题】如何仅从直方图值创建 KDE？【英文标题】：How can you create a KDE from histogram values only? 【发布时间】：2019-05-18 07:11:28 【问题描述】：

我有一组值，我想绘制高斯核密度估计，但是我遇到了两个问题：

我只有条形的值而不是值本身我正在绘制一个分类轴

这是我到目前为止生成的情节： y轴的顺序实际上是相关的，因为它代表了每个细菌物种的系统发育。

我想为每种颜色添加一个高斯 kde 叠加层，但到目前为止，我还不能利用 seaborn 或 scipy 来做到这一点。

这是使用 python 和 matplotlib 进行上述分组条形图的代码：

enterN = len(color1_plotting_values)
fig, ax = plt.subplots(figsize=(20,30))
ind = np.arange(N)    # the x locations for the groups
width = .5         # the width of the bars
p1 = ax.barh(Species_Ordering.Species.values, color1_plotting_values, width, label='Color1', log=True)
p2 = ax.barh(Species_Ordering.Species.values, color2_plotting_values, width, label='Color2', log=True)
for b in p2:
    b.xy = (b.xy[0], b.xy[1]+width)

谢谢！

【问题讨论】：

看起来你是从一个数据框中拉出来的，你试过内置的kde plotting functionality吗？是的，我试过了，但我不知道如何让它正确解释分类轴。生成的 kde 是数据直方图的 kde。但是，数据已经表示直方图条的高度。将每个细菌物种视为一个 bin，将每个数字视为该 bin 中值的计数。希望这有助于展示数据的格式！ KDE 通常涉及对相邻数据点的集成。对于诸如您的不同物种之类的分类数据，没有客观的距离标准（更不用说尊重三角不等式的标准了）。因此，在这里使用 KDE 既不可能也不可取。 @PaulBrodersen 抱歉打扰，假设我们忘记了数据是分类的，我们将其视为具有相等 bin 的直方图，或者可能只是均匀采样域上的函数。是否可以在这样的环境下运行 KDE？我的意思是无法访问样本本身，仅访问分箱直方图 @filippo 有点像。在某种意义上，从直方图中确定 KDE 类似于使用加权样本的 KDE（对于大多数 KDE 方法来说，这是一个简单的扩展）。问题是您不知道 bin 边缘内某个点的真实位置。因此，如果内核宽度与 bin 宽度相似或小于 bin 宽度，您就会遇到问题（很容易看出您是否在均匀区间上模拟了一堆点，应用您选择的 KDE 算法，然后将结果与您将点坐标四舍五入到 1 个有效数字）。不过，宽内核应该没问题。 【参考方案1】：

如何从直方图开始绘制“KDE”

核密度估计协议需要基础数据。您可以想出一种新方法，改用经验 pdf（即直方图），但它不会是 KDE 分布。

不过，并不是所有的希望都落空了。您可以通过首先从直方图中获取样本，然后在这些样本上使用 KDE 来获得 KDE 分布的良好近似值。这是一个完整的工作示例：

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sts

n = 100000

# generate some random multimodal histogram data
samples = np.concatenate([np.random.normal(np.random.randint(-8, 8), size=n)*np.random.uniform(.4, 2) for i in range(4)])
h,e = np.histogram(samples, bins=100, density=True)
x = np.linspace(e.min(), e.max())

# plot the histogram
plt.figure(figsize=(8,6))
plt.bar(e[:-1], h, width=np.diff(e), ec='k', align='edge', label='histogram')

# plot the real KDE
kde = sts.gaussian_kde(samples)
plt.plot(x, kde.pdf(x), c='C1', lw=8, label='KDE')

# resample the histogram and find the KDE.
resamples = np.random.choice((e[:-1] + e[1:])/2, size=n*5, p=h/h.sum())
rkde = sts.gaussian_kde(resamples)

# plot the KDE
plt.plot(x, rkde.pdf(x), '--', c='C3', lw=4, label='resampled KDE')
plt.title('n = %d' % n)
plt.legend()
plt.show()

输出：

图中红色虚线和橙色线几乎完全重叠，表明真实的 KDE 和通过重新采样直方图计算的 KDE 非常吻合。

如果您的直方图非常嘈杂（就像您在上面的代码中设置 n = 10 所得到的那样），那么在将重新采样的 KDE 用于除绘图之外的任何其他用途时，您应该谨慎一点：

总体而言，真实 KDE 和重新采样的 KDE 之间的一致性仍然很好，但偏差很明显。

将您的分类数据转换成适当的形式

由于您尚未发布实际数据，因此我无法为您提供详细建议。我认为最好的办法是按顺序对类别进行编号，然后将该数字用作直方图中每个条形的“x”值。

【讨论】：

定义“x”的目的是什么？你能用“e”代替吗？【参考方案2】：

我已经声明了我对在上面的 cmets 中将 KDE 应用于 OP 的分类数据的保留意见。基本上，由于物种之间的系统发育距离不服从三角不等式，因此不可能有可用于核密度估计的有效核。但是，还有其他密度估计方法不需要构建内核。一种这样的方法是k-最近邻inverse distance weighting，它只需要不需要满足三角不等式的非负距离（我认为甚至不需要对称）。下面概述了这种方法：

import numpy as np

#--------------------------------------------------------------------------------
# simulate data

total_classes = 10
sample_values = np.random.rand(total_classes)
distance_matrix = 100 * np.random.rand(total_classes, total_classes)

# Distances to the values itself are zero; hence remove diagonal.
distance_matrix -= np.diag(np.diag(distance_matrix))

# --------------------------------------------------------------------------------
# For each sample, compute an average based on the values of the k-nearest neighbors.
# Weigh each sample value by the inverse of the corresponding distance.

# Apply a regularizer to the distance matrix.
# This limits the influence of values with very small distances.
# In particular, this affects how the value of the sample itself (which has distance 0)
# is weighted w.r.t. other values.
regularizer = 1.
distance_matrix += regularizer

# Set number of neighbours to "interpolate" over.
k = 3

# Compute average based on sample value itself and k neighbouring values weighted by the inverse distance.
# The following assumes that the value of distance_matrix[ii, jj] corresponds to the distance from ii to jj.
for ii in range(total_classes):

    # determine neighbours
    indices = np.argsort(distance_matrix[ii, :])[:k+1] # +1 to include the value of the sample itself

    # compute weights
    distances = distance_matrix[ii, indices]
    weights = 1. / distances
    weights /= np.sum(weights) # weights need to sum to 1

    # compute weighted average
    values = sample_values[indices]
    new_sample_values[ii] = np.sum(values * weights)

print(new_sample_values)

【讨论】：

【参考方案3】：

简单的方法

目前，我将跳过关于在此类设置中使用核密度的有效性的任何哲学争论。稍后会解决。

一个简单的方法是使用 scikit-learn KernelDensity:

import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity
from sklearn import preprocessing

ds=pd.read_csv('data-by-State.csv')

Y=ds.loc[:,'State'].values # State is AL, AK, AZ, etc...

# With categorical data we need some label encoding here...
le = preprocessing.LabelEncoder()
le.fit(Y)                            # le.classes_ would be ['AL', 'AK', 'AZ',...
y=le.transform(Y)                    # y would be [0, 2, 3, ..., 6, 7, 9]
y=y[:, np.newaxis]                   # preparing for kde

kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(y)

# You can control the bandwidth so the KDE function performs better
# To find the optimum bandwidth for your data you can try Crossvalidation

x=np.linspace(0,5,100)[:, np.newaxis] # let's get some x values to plot on
log_dens=kde.score_samples(x)
dens=np.exp(log_dens)            # these are the density function values

array([0.06625658, 0.06661817, 0.06676005, 0.06669403, 0.06643584,
       0.06600488, 0.0654239 , 0.06471854, 0.06391682, 0.06304861,
       0.06214499, 0.06123764, 0.06035818, 0.05953754, 0.05880534,
       0.05818931, 0.05771472, 0.05740393, 0.057276  , 0.05734634,
       0.05762648, 0.05812393, 0.05884214, 0.05978051, 0.06093455,
       ..............
       0.11885574, 0.11883695, 0.11881434, 0.11878766, 0.11875657,
       0.11872066, 0.11867943, 0.11863229, 0.11857859, 0.1185176 ,
       0.11844852, 0.11837051, 0.11828267, 0.11818407, 0.11807377])

这些值是您在直方图上绘制核密度所需的全部内容。卡皮托？

现在，在理论上，如果 X 是一个分类 (*)、无序变量，具有 c 个可能值，那么对于 0 ≤ h

是一个有效的内核。对于有序 X，

其中|x1-x2|应该理解为x1和x2相隔多少层。当 h 趋于零时，这两者都成为指标并返回相对频率计数。 h 通常称为带宽。

(*) 变量空间不需要定义距离。不需要是度量空间。

Devroye, Luc and Gábor Lugosi (2001). Combinatorial Methods in Density Estimation. Berlin: Springer-Verlag.

【讨论】：

以上是关于如何仅从直方图值创建 KDE？的主要内容，如果未能解决你的问题，请参考以下文章