Test 80,000+ simulated normal distribution observation sets against a null hypothesis

Posted 2023-03-12


【Title】: Test 80,000+ simulated normal distribution observation sets against a null hypothesis 【Posted】: 2022-01-21 05:59:45 【Question】:

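I need to generate random samples of size 200 (n=200) from a normal distribution with variance 1 and a true mu (mean) that I specify; then I test the draws against a hypothesis: mu …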

I have already done this for n=1, but I realized that my approach is not replicable. For each of the 400 thetas, I ran the following:

from numpy.random import normal

sample_r200n1_t2 = normal(loc=-0.99, scale=1, size=200)
sample_r200n1_t3 = normal(loc=-0.98, scale=1, size=200)
sample_r200n1_t4 = normal(loc=-0.97, scale=1, size=200)
sample_r200n1_t5 = normal(loc=-0.96, scale=1, size=200)
# ... on and on to loc = 3

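Then I tested each element of the generated arrays separately. However, this approach requires me to generate tens of thousands of samples, compute the mean of each sample, and then test that mean against my criterion. This has to be done 80,000 times (and on top of that, I need to do it for several different sizes of n). Clearly, this is not the approach to take.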

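How can I achieve the result I want? For example, is there a way to generate a set of sample means and put those means into an array, one per theta? Then I could test as before. Or is there some other way?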


【Answer 1】:

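You can generate all 200*200*401 ≈ 16 million random values in a single NumPy array (this takes roughly 122 MiB of memory; check with draws.nbytes/1024/1024), and use SciPy to run a one-sided, one-sample t-test on each of the 200 samples of 200 observations for every theta value: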

import numpy as np
from numpy.random import normal
from scipy.stats import ttest_1samp
import matplotlib.pyplot as plt

# Array of loc values; for each loc, we draw 200 
# samples of 200 normally distributed observations
locs = np.linspace(-1, 3, 401)

# Array of shape (401, 200, 200) = (locs, samples, observations)
# Note that 200 draws of 200 i.i.d. observations is the same as
# 1 draw of 200*200 i.i.d. observations, reshaped to (200, 200)
draws = np.array([normal(loc=x, scale=1, size=200*200)
                  for x in locs]).reshape(401, 200, 200)

# axis=2 runs one t-test per sample, across its 200 observations,
# giving result arrays of shape (401, 200).
# alternative='less' (SciPy >= 1.6) tests the alternative hypothesis
# that the population mean is less than 1, against the null
# hypothesis that the population mean is greater than or equal to 1
tstats, pvals = ttest_1samp(draws, 1, alternative='less', axis=2)

# Count how many out of 200 t-tests reject the null hypothesis
# at the alpha=0.05 level
rejects = (pvals < 0.05).sum(axis=1)

# Visual check: the rejection count should be high for true means
# far below 1, as these tests should reject the null
# hypothesis that the population mean >= 1
plt.plot(locs, rejects)
plt.axvline(1, c='r')
plt.title(r'Number of t-tests rejecting $H_0 : \mu \geq 1$ with $p < 0.05$')
plt.xlabel(r'True population mean $\theta$')
plt.show()
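
The question also asks for an array of sample means, one set per theta. As a minimal sketch (not part of the original answer), the block below builds that array and checks that the t statistic computed by hand from each sample mean matches what ttest_1samp returns. The seed, the use of np.random.default_rng, and the names n_samples, n_obs, means, sds and t_manual are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Assumption: a seeded Generator, chosen only to make the run reproducible
rng = np.random.default_rng(0)

locs = np.linspace(-1, 3, 401)   # true means (theta)
n_samples, n_obs = 200, 200      # samples per theta, observations per sample

# Broadcasting loc over the last two axes gives shape (401, 200, 200)
# = (locs, samples, observations)
draws = rng.normal(loc=locs[:, None, None], scale=1.0,
                   size=(locs.size, n_samples, n_obs))

# One sample mean per (theta, sample) pair: shape (401, 200)
means = draws.mean(axis=-1)

# One-sample t statistic against the hypothesized mean of 1,
# using the sample standard deviation (ddof=1)
sds = draws.std(axis=-1, ddof=1)
t_manual = (means - 1.0) / (sds / np.sqrt(n_obs))

# SciPy computes the same statistic along the observation axis
t_scipy, pvals = ttest_1samp(draws, 1.0, axis=-1, alternative='less')
print(np.allclose(t_manual, t_scipy))
```

The means array is the "one set of sample means per theta" structure: row i holds the 200 sample means drawn with true mean locs[i].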

【Comments】:

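Thanks, this is very helpful! Just one question: one task, for example, is to draw 200 random samples, each of size 200, from a normal distribution with one of the thetas. With this approach, I assume I would run the code 200 times? Just want to make sure I'm interpreting this correctly.

This code draws 200 random samples for each theta value, where theta takes 401 values: -1, -0.99, -0.98, ..., 2.98, 2.99, 3. So there is no need to rerun the code multiple times. The result is a 2-D numpy array that I named draws. The array has 401 rows and 200 columns. Each row contains 200 random normal values for the theta value corresponding to that row.

Apologies - per theta I am trying to draw 200 random samples, where each sample contains 200 observations. It looks like this draws 200 i.i.d. observations for each theta value, i.e. one random sample of 200 observations. Or am I confused?

It looks like each row is one theta and each column is one i.i.d. observation? Ideally, the value in each column would be the mean of a random i.i.d. draw of 200 observations.

Oh, I see - your understanding of my current answer is correct. After drawing 200 samples of 200 observations each for every theta value, how do you need to run the t-test?

I should run a t-test on each sample. So, find the mean of each sample and test it against the null hypothesis. Then compare the results across different levels of n (the percentage of tests that accept versus reject the null, at different true theta levels and different levels of n).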
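
The comparison across different n described in the last comment could be sketched roughly as follows. This is not the answer's method: because the question states that the variance is known to be 1, the sketch swaps the t-test for a one-sided z-test on each sample mean, which also works for n=1 where a t-test is undefined. The function name rejection_rates, its parameters, the seed, and the sample sizes (1, 10, 200) are all hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import norm

def rejection_rates(locs, n_obs, n_samples=200, mu0=1.0, alpha=0.05, seed=0):
    """Fraction of samples rejecting H0: mu >= mu0, per true mean in locs.

    Assumes a known standard deviation of 1, so a z-test is used.
    """
    rng = np.random.default_rng(seed)
    # Shape (len(locs), n_samples, n_obs)
    draws = rng.normal(loc=locs[:, None, None], scale=1.0,
                       size=(locs.size, n_samples, n_obs))
    means = draws.mean(axis=-1)           # one mean per (theta, sample)
    z = (means - mu0) * np.sqrt(n_obs)    # z statistic with sigma = 1
    pvals = norm.cdf(z)                   # lower-tail p-value (alternative: mu < mu0)
    return (pvals < alpha).mean(axis=1)   # rejection rate per theta

locs = np.linspace(-1, 3, 401)
for n_obs in (1, 10, 200):                # illustrative sample sizes
    rates = rejection_rates(locs, n_obs)
    # Rejection rate at the lowest theta, at theta = 1, and at the highest theta
    print(n_obs, rates[0], rates[200], rates[-1])
```

Each call returns one rejection rate per theta, so the curves for different n can be plotted against locs on the same axes to compare them.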
