python3 - pandas determine if events occurrence are statistically significant

Posted: 2020-03-05 07:57:14

【Question】

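I have a large dataset that looks like the sample below. I want to know whether there is a statistically significant difference between the cases where the event occurs and the cases where it does not. The assumption here is that the higher the percent change, the more meaningful/better.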

In another dataset, the "event occurs" column takes the values True, False, or Neutral. (Please ignore the index; it is just the default pandas index.)

   index    event occurs            percent change
    148       False                  11.27
    149        True                  14.56
    150       False                  10.35
    151       False                   6.07
    152       False                  21.14
    153       False                   7.26
    154       False                   7.07
    155       False                   5.37
    156        True                   2.75
    157       False                   7.12
    158       False                   7.24

What is the best way to determine significance when the values are True/False, or True/False/Neutral?

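【Comments】

What have you tried? :)

Obviously, nothing that has worked (yet)! :)

Let's split event_occurs into False and True. Find the mean percent_change for each, then run a Shapiro-Francia test to see whether the data is normally distributed. If it is, test whether the difference in means is statistically significant. If it is not normal, get back to me.

How can an event occurrence be Neutral?

If the data in each group is not normal, just use a distribution-free test. It is not as powerful, but it will do.

【Answer 1】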

Thanks @DarkDrassher34 and @ChrisDanger. I put together a code sample from various sources following Dark's answer, then reviewed it after Chris's post. Thoughts?
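
The code in this answer expects a DataFrame named df with the columns 'event occurs' and 'percent change'. As a minimal sketch, assuming only the eleven rows shown in the question, that frame and the per-group means suggested in the comments could be built like this (the name summary is introduced here for illustration):

```python
import pandas as pd

# Rebuild the eleven sample rows from the question (index 148..158).
df = pd.DataFrame(
    {
        "event occurs": [False, True, False, False, False, False,
                         False, False, True, False, False],
        "percent change": [11.27, 14.56, 10.35, 6.07, 21.14, 7.26,
                           7.07, 5.37, 2.75, 7.12, 7.24],
    },
    index=range(148, 159),
)

# Group size and mean percent change for each value of "event occurs".
summary = df.groupby("event occurs")["percent change"].agg(["count", "mean"])
print(summary)
```

In this excerpt the True group has only two rows, and scipy.stats.shapiro needs at least three observations, so the tests that follow are meant for the full dataset rather than for this sample.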

corr_data = df[['event occurs','percent change']]
cat1 = corr_data[corr_data['event occurs']==True]
cat2 = corr_data[corr_data['event occurs']==False]


#----------------------
# is the sample normal / gaussian
#----------------------
from scipy.stats import shapiro # test for normality in small samples
from scipy.stats import normaltest

if (len(cat1['percent change'].index) <= 20 ):
    stat1, p1 = shapiro(cat1['percent change'])
else:
    stat1, p1 = normaltest(cat1['percent change'])

if (len(cat2['percent change'].index) <= 20 ):
    stat2, p2 = shapiro(cat2['percent change'])
else:
    stat2, p2 = normaltest(cat2['percent change'])


alpha = 0.05 # stat threshold
# both groups are normal
if ((p1 > alpha) and (p2 > alpha)):
    print('Samples look Gaussian (fail to reject H0)')

    #----------------------
    # if normal / gaussian run these tests
    #----------------------
    from scipy.stats import ttest_ind
    stat, p = ttest_ind(cat1['percent change'], cat2['percent change'])
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    if p > alpha:
        print('Same mean (fail to reject H0)')
    else:
        print('Different means (reject H0)')


else:
    print('Samples do not look Gaussian (reject H0)')
    #----------------------
    # if not normal / gaussian run these tests
    #----------------------
    from scipy.stats import mannwhitneyu
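    # request a two-sided test explicitly; older SciPy versions returned a one-sided p-value by default
    stat, p = mannwhitneyu(cat1['percent change'], cat2['percent change'], alternative='two-sided')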
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    if p > alpha:
        print('Same distribution (fail to reject H0)')
    else:
        print('Different distribution (reject H0)')
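
The question also mentions a dataset where 'event occurs' is True, False, or Neutral. The same idea might extend to three groups by swapping the t-test for a one-way ANOVA (scipy.stats.f_oneway) and Mann-Whitney for Kruskal-Wallis (scipy.stats.kruskal). The sketch below is an assumption about how that could look, not part of the code above: the data is synthetic, and the names df3, groups, and test_name are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway, kruskal, shapiro

rng = np.random.default_rng(42)  # fixed seed, only so the sketch is repeatable

# Synthetic stand-in for the three-category dataset (values are not real data).
df3 = pd.DataFrame({
    "event occurs": ["True"] * 15 + ["False"] * 15 + ["Neutral"] * 15,
    "percent change": rng.normal(loc=10.0, scale=4.0, size=45),
})

# One array of percent changes per category.
groups = [g["percent change"].to_numpy() for _, g in df3.groupby("event occurs")]

alpha = 0.05  # stat threshold
# Shapiro-Wilk on each group; every group must pass to use the parametric test.
all_normal = all(shapiro(g)[1] > alpha for g in groups)

if all_normal:
    test_name = "one-way ANOVA"
    stat, p = f_oneway(*groups)
else:
    test_name = "Kruskal-Wallis"
    stat, p = kruskal(*groups)

print('%s: Statistics=%.3f, p=%.3f' % (test_name, stat, p))
```

Both tests only say whether at least one group differs from the others; identifying which one would need a follow-up pairwise comparison.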

【Comments】

AnonPyDev, nice job. I like how you handled n

Thank you. Much appreciated, @chris.

AnonPyDev, if you like my code sample, would you mind giving it an upvote?

Done, @ChrisDanger

【Answer 2】

Load packages, set global variables, generate data.

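import scipy.stats as stats
import numpy as np
import pandas as pd  # needed for pd.DataFrame below

n = 60
stat_sig_thresh = 0.05

# random True/False labels and random percent changes between 0.1 and 99.9
event_perc = pd.DataFrame({"event occurs": np.random.choice([True,False],n),
                           "percent change": [i*.1 for i in np.random.randint(1,1000,n)]})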

Determine whether the distributions are normal

# run the normality test on the "percent change" values of each group
stat_sig = (event_perc.groupby("event occurs")["percent change"]
            .agg(statistic=lambda x: stats.normaltest(x).statistic,
                 pvalue=lambda x: stats.normaltest(x).pvalue)
            .reset_index())
# a group counts as normal when the test fails to reject normality
stat_sig["Normal"] = stat_sig["pvalue"] > stat_sig_thresh

>>>stat_sig

    event occurs  statistic           pvalue                Normal
0   False         2.9171920993203915  0.23256255191146755   True
1   True          2.938332679486047   0.23011724484588764   True

Determine statistical significance

normal = [bool(i) for i in stat_sig.Normal.unique().tolist()]

rvs1 = event_perc["percent change"][event_perc["event occurs"] == True]
rvs2 = event_perc["percent change"][event_perc["event occurs"] == False]

if len(normal) == 1 and normal[0]:
    print("the distributions are normal")
    if stats.ttest_ind(rvs1,rvs2).pvalue >= stat_sig_thresh:
        # we cannot reject the null hypothesis of identical average scores
        print("we can't say whether there is statistically significant difference")
    else:
        # we reject the null hypothesis of equal averages
        print("there is a statistically significant difference")

elif len(normal) == 1 and not normal[0]:
    print("the distributions are not normal")
    # the groups are independent and may differ in size, so use Mann-Whitney U
    # (stats.wilcoxon is a paired test that needs equal-length samples)
    if stats.mannwhitneyu(rvs1,rvs2,alternative="two-sided").pvalue >= stat_sig_thresh:
        # we cannot reject the null hypothesis of identical distributions
        print("we can't say whether there is statistically significant difference")
    else:
        # we reject the null hypothesis of identical distributions
        print("there is a statistically significant difference")
else:
    # one group passed the normality test and the other did not
    print("one group looks normal and the other does not")

the distributions are normal
we can't say whether there is statistically significant difference
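
One caveat about the t-test branch: stats.ttest_ind assumes equal variances in both groups by default. When the groups differ in size or spread, Welch's variant (equal_var=False) may be the safer choice. The sketch below uses synthetic data and the made-up names group_a and group_b to show the two calls side by side; it illustrates the option and is not a claim about what the dataset in the question needs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable

# Two synthetic groups with the same mean but different sizes and spreads.
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=10.0, scale=6.0, size=15)

# Student's t-test pools the variances of both groups.
student = stats.ttest_ind(group_a, group_b)
# Welch's t-test estimates each group's variance separately.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print("Student: statistic=%.3f, p=%.3f" % (student.statistic, student.pvalue))
print("Welch:   statistic=%.3f, p=%.3f" % (welch.statistic, welch.pvalue))
```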

【Comments】

Thanks for the reply and the code sample; I reviewed it. I posted one as well. Any thoughts?
