如何在 Python 中实现 KS-Test

Posted 2023-03-12

技术标签:

【中文标题】如何在 Python 中实现 KS-Test【英文标题】：How to implement a KS-Test in Python 【发布时间】：2019-09-30 19:55:48 【问题描述】：

scipy.stats.kstest(rvs, cdf, N) 可以对数据集rvs 执行 KS-Test。它测试数据集是否遵循概率分布，其cdf 在此方法的参数中指定。

现在考虑一个包含N=4800 样本的数据集。我已经对这些数据执行了 KDE，因此有一个估计的 PDF。这个 PDF 看起来非常像双峰分布。在绘制估计的 PDF 和曲线拟合双峰分布时，这两个图几乎相同。拟合双峰分布的参数为（scale1, mean1, stdv1, scale2, mean2, stdv2）： [0.6 0.036 0.52, 0.23 1.25 0.4]

我如何申请scipy.stats.kstest 来测试我估计的 PDF 是否是双峰分布的？作为我的零假设，我声明估计的 PDF 等于以下 PDF：

hypoDist = 0.6*norm(loc=0, scale=0.2).pdf(x_grid) + 0.3*norm(loc=1, scale=0.2).pdf(x_grid)
hypoCdf = np.cumsum(hypoDist)/len(x_grid)

x_grid 只是一个向量，其中包含我评估估计 PDF 的 x 值。所以pdf的每个条目都有一个对应的x_grid的值。可能是我对hypoCdf 的计算不正确。也许我应该除以np.sum(hypoDist)而不是除以len(x_grid)？

挑战：kstest 的cdf 参数不能指定为双峰。我也不能将其指定为hypoDist。

如果我想测试我的数据集是否是高斯分布的，我会这样写：

KS_result = kstest(measurementError, norm(loc=mean(pdf), scale=np.std(pdf)).cdf)
print(KS_result)

measurementError 是我执行 KDE 的数据集。这将返回： statistic=0.459, pvalue=0.0 对我来说，pvalue 是 0.0 有点烦人

【问题讨论】：

您在文本中有scale1 = 0.6 和scale2 = 0.23，但它们不加到1。在创建hypoDist 的代码中，比例显然是0.6 和0.3 ，这也不会加到 1。您的双峰分布是两个正态分布的 mixture。对于混合分布，权重（或称其为尺度）的总和应为 1。您能否修复这些值，或解释为什么您使用的值总和不为 1？ @WarrenWeckesser：你说得对，这很烦人。我在代码中使用的参数是我根据估计的 PDF 的图初步猜测的。参数 0.6 和 0.23 已由我的曲线拟合返回：我已将双峰分布拟合到估计的 PDF，这些是拟合的参数这是个问题。如果这些值的总和不等于 1，则您的组合 PDF 不会创建适当的概率密度函数。您的曲线拟合过程应包含这些值总和为 1 的约束。或者仅使用参数scale1，并将scale2 替换为1 - scale1。好的。我只是在使用popt_bimodal, pcov_bimodal = curve_fit(bimodal, x_grid, pdf, p0=[0.6, 0, 0.2, 0.3, 1, 0.2])。在这里，bimodal 是我定义的一个函数。它只是添加了两个高斯。 ... 啊：你说 curve_fits 适合我的数据的曲线，但该曲线不是双峰分布！是的--curve_fit 不知道您的函数是一个 PDF，其对实线的积分必须为 1。请参阅我的答案，了解一种仅使用一个权重的方法。 【参考方案1】：

kstest 的 cdf 参数可以是一个可调用，它实现了您想要测试数据的分布的累积分布函数。要使用它，您必须实现双峰分布的 CDF。您希望分布是两个正态分布的混合。您可以通过计算构成混合的两个正态分布的 CDF 的加权和来实现此分布的 CDF。

这里有一个脚本，展示了如何做到这一点。为了演示如何使用kstest，脚本运行kstest 两次。首先，它使用分布中非的样本。正如预期的那样，kstest 为第一个样本计算了一个非常小的 p 值。然后它会生成一个从混合物中提取的样本。对于这个样本，p值不小。

import numpy as np
from scipy import stats


def bimodal_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    """
    CDF of a mixture of two normal distributions.
    """
    return (weight1*stats.norm.cdf(x, mean1, stdv1) +
            (1 - weight1)*stats.norm.cdf(x, mean2, stdv2))


# We only need weight1, since weight2 = 1 - weight1.
weight1 = 0.6
mean1 = 0.036
stdv1 = 0.52
mean2 = 1.25
stdv2 = 0.4

n = 200

# Create a sample from a regular normal distribution that has parameters
# similar to the bimodal distribution.
sample1 = stats.norm.rvs(0.5*(mean1 + mean2), 0.5, size=n)

# The result of kstest should show that sample1 is not from the bimodal
# distribution (i.e. the p-value should be very small).
stat1, pvalue1 = stats.kstest(sample1, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv2, mean2, stdv2))
print("sample1 p-value =", pvalue1)

# Create a sample from the bimodal distribution.  This sample is the
# concatenation of samples from the two normal distributions that make
# up the bimodal distribution.  The number of samples to take from the
# first distributions is determined by a binomial distribution of n
# samples with probability weight1.
n1 = np.random.binomial(n, p=weight1)
sample2 = np.concatenate((stats.norm.rvs(mean1, stdv1, size=n1),
                         (stats.norm.rvs(mean2, stdv2, size=n - n1))))

# Most of time, the p-value returned by kstest with sample2 will not
# be small.  We expect the value to be uniformly distributed in the interval
# [0, 1], so in general it will not be very small.
stat2, pvalue2 = stats.kstest(sample2, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample2 p-value =", pvalue2)

典型输出（每次运行脚本时数字都会不同）：

sample1 p-value = 2.8395166853884146e-11
sample2 p-value = 0.3289374831186403

您可能会发现，对于您的问题，此测试效果不佳。您有 4800 个样本，但在您的代码中，您的参数的数值只有一位或两位有效数字。除非您有充分的理由相信您的样本是从具有完全这些参数的分布中抽取的，否则kstest 很可能会返回一个非常小的 p 值。

【讨论】：

“这是对大样本使用这样的测试的讽刺之一。”为什么会有讽刺意味？这么敏感不是好事吗？有什么替代方案？ Mabe “讽刺”这个词不合适。我会把它拿出来。 @WarrenWeckesser：p 值可以正好是 0.000 吗？我在测试高斯分布时得到了这个结果，我的数据很可能不是从中提取的 @Trilarion 这可能并不具有讽刺意味，但无论如何先验地知道在这种情况下零假设是错误的，因为任何真实分布都不能完全是高斯的混合，因此对于足够大的样本，必须拒绝原假设。这是显着性测试中众所周知的功能/错误。我对 OP 或任何人的建议是抛弃重要性测试的东西，并采用贝叶斯决策理论方法，它允许表达实际意义。 @Trilarion 看看罗伯特克莱门的“做出艰难的决定”。

以上是关于如何在 Python 中实现 KS-Test的主要内容，如果未能解决你的问题，请参考以下文章