python3 - pandas: determine if events occurrence are statistically significant
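【Posted】: 2020-03-05 07:57:14
【Question】: I have a large dataset that looks like the one below. I want to know whether there is a statistically significant difference between the cases where the event occurs and the cases where it does not. The assumption here is that a higher percent change is more meaningful/better.
In another dataset, the "event occurs" column takes the values True, False and Neutral. (Please ignore the index; it is just the default pandas index.)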
index  event occurs  percent change
148    False         11.27
149    True          14.56
150    False         10.35
151    False          6.07
152    False         21.14
153    False          7.26
154    False          7.07
155    False          5.37
156    True           2.75
157    False          7.12
158    False          7.24
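What is the best way to determine significance when the column is True/False, or True/False/Neutral?
【Comments】:
What have you tried? :)
Obv, nothing that works (yet)! :)
Let us split event_occurs into False and True. Find the mean percent_change for each group, then run the Shapiro-Francia test to see whether the data is normal. If it is, test whether the difference of the means is statistically significant. If it is not normal, get back to me.
How can an event occurrence be Neutral?
If the data in each group is not normal, just use a distribution-free test. It is not as powerful, but it will do.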
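The distribution-free route suggested in the comment above also covers the True/False/Neutral case. Below is a minimal sketch using the Kruskal-Wallis H-test from scipy, which compares two or more independent groups by rank. The data is synthetic and generated only for illustration, and storing the three categories as the strings "True", "False" and "Neutral" is an assumption about how the second dataset looks.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

# Synthetic stand-in for the second dataset (assumption: three string labels).
rng = np.random.default_rng(0)
df3 = pd.DataFrame({
    "event occurs": np.repeat(["True", "False", "Neutral"], 30),
    "percent change": rng.uniform(0.1, 100.0, size=90),
})

# One array of percent changes per category.
groups = [g.to_numpy() for _, g in df3.groupby("event occurs")["percent change"]]

# Kruskal-Wallis: rank-based, no normality assumption, works for 2+ groups.
h_stat, p_value = kruskal(*groups)
print(f"H={h_stat:.3f}, p={p_value:.3f}")
```

A small p-value only says that at least one group differs from the others; finding out which one would need pairwise follow-up tests.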
【Answer 1】:
Thanks @DarkDrassher34 and @ChrisDanger. I put together code samples from various sources based on Dark's comments, then reviewed them after Chris's post. Thoughts?
from scipy.stats import shapiro       # normality test suited to small samples
from scipy.stats import normaltest    # D'Agostino-Pearson normality test
from scipy.stats import ttest_ind, mannwhitneyu

# df is the DataFrame shown in the question
corr_data = df[['event occurs', 'percent change']]
cat1 = corr_data[corr_data['event occurs'] == True]
cat2 = corr_data[corr_data['event occurs'] == False]

#----------------------
# is the sample normal / gaussian
#----------------------
if len(cat1['percent change'].index) <= 20:
    stat1, p1 = shapiro(cat1['percent change'])
else:
    stat1, p1 = normaltest(cat1['percent change'])

if len(cat2['percent change'].index) <= 20:
    stat2, p2 = shapiro(cat2['percent change'])
else:
    stat2, p2 = normaltest(cat2['percent change'])

alpha = 0.05  # significance threshold

# both groups are normal
if (p1 > alpha) and (p2 > alpha):
    print('Samples look Gaussian (fail to reject H0)')
    #----------------------
    # if normal / gaussian run this test (compares the group means)
    #----------------------
    stat, p = ttest_ind(cat1['percent change'], cat2['percent change'])
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    if p > alpha:
        print('No significant difference in means (fail to reject H0)')
    else:
        print('Significant difference in means (reject H0)')
else:
    print('Samples do not look Gaussian (reject H0)')
    #----------------------
    # if not normal / gaussian run this test (rank-based, distribution-free)
    #----------------------
    stat, p = mannwhitneyu(cat1['percent change'], cat2['percent change'],
                           alternative='two-sided')
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    if p > alpha:
        print('Same distribution (fail to reject H0)')
    else:
        print('Different distribution (reject H0)')
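One possible refinement, not part of the answer above: `ttest_ind` assumes equal variances by default, and the True and False groups here can differ a lot in size. Passing `equal_var=False` selects Welch's t-test, which drops that assumption. The sketch below runs it on the eleven sample rows from the question; the variable names are made up for the example.

```python
import pandas as pd
from scipy.stats import ttest_ind

# The eleven sample rows shown in the question.
sample = pd.DataFrame({
    "event occurs": [False, True, False, False, False, False,
                     False, False, True, False, False],
    "percent change": [11.27, 14.56, 10.35, 6.07, 21.14, 7.26,
                       7.07, 5.37, 2.75, 7.12, 7.24],
})

is_true = sample["event occurs"]
true_vals = sample.loc[is_true, "percent change"]
false_vals = sample.loc[~is_true, "percent change"]

# equal_var=False selects Welch's t-test (no equal-variance assumption).
stat, p = ttest_ind(true_vals, false_vals, equal_var=False)
print(f"n_true={len(true_vals)}, n_false={len(false_vals)}, t={stat:.3f}, p={p:.3f}")
```

With only two True rows in this excerpt the test has very little power, so the snippet is meant to show the call rather than a meaningful result; the full dataset would be needed for that.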
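【Comments】:
AnonPyDev, nice work. I like how you handled n
Thanks. Much appreciated @chris.
AnonPyDev, if you like my code example, would you mind giving it an upvote?
Done @ChrisDanger
【Answer 2】:
Load packages, set global variables, generate data.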
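import numpy as np
import pandas as pd
import scipy.stats as stats

n = 60
stat_sig_thresh = 0.05
# random example data: a boolean flag and a percent change between 0.1 and 99.9
event_perc = pd.DataFrame({
    "event occurs": np.random.choice([True, False], n),
    "percent change": [i * .1 for i in np.random.randint(1, 1000, n)],
})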
Determine whether the distributions are normal
# run the normality test on the "percent change" values of each group only
rows = []
for key, values in event_perc.groupby("event occurs")["percent change"]:
    statistic, pvalue = stats.normaltest(values)
    rows.append({"event occurs": key, "statistic": statistic, "pvalue": pvalue})
stat_sig = pd.DataFrame(rows)
# a group counts as normal when we fail to reject the null hypothesis of normality
stat_sig["Normal"] = stat_sig["pvalue"] > stat_sig_thresh
>>>stat_sig
event occurs statistic pvalue Normal
0 False [2.9171920993203915] [0.23256255191146755] True
1 True [2.938332679486047] [0.23011724484588764] True
Determine statistical significance
normal = [bool(i) for i in stat_sig.Normal.unique().tolist()]
rvs1 = event_perc["percent change"][event_perc["event occurs"] == True]
rvs2 = event_perc["percent change"][event_perc["event occurs"] == False]

if len(normal) == 1 and normal[0]:
    print("the distributions are normal")
    if stats.ttest_ind(rvs1, rvs2).pvalue >= stat_sig_thresh:
        # we cannot reject the null hypothesis of identical average scores
        print("we can't say whether there is statistically significant difference")
    else:
        # we reject the null hypothesis of equal averages
        print("there is a statistically significant difference")
elif len(normal) == 1 and not normal[0]:
    print("the distributions are not normal")
    # Mann-Whitney U replaces the original stats.wilcoxon call here: wilcoxon is
    # a paired test and needs samples of equal length, which these groups are not
    if stats.mannwhitneyu(rvs1, rvs2, alternative="two-sided").pvalue >= stat_sig_thresh:
        # we cannot reject the null hypothesis of identical distributions
        print("we can't say whether there is statistically significant difference")
    else:
        # we reject the null hypothesis of identical distributions
        print("there is a statistically significant difference")
else:
    # one group passed the normality test and the other did not
    print("only one of the two groups looks normal")
the distributions are normal
we can't say whether there is statistically significant difference
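Both answers follow the same decision flow: check each group for normality, then pick a t-test or a rank-based test. As a rough sketch only, that flow can be wrapped in one helper. The function name `compare_groups`, its parameters and the seeded demo data are all hypothetical, and the Mann-Whitney U test is used here as the fallback for independent groups.

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_groups(df, flag_col="event occurs", value_col="percent change", alpha=0.05):
    """Return (test_name, p_value) for the True vs False comparison."""
    a = df.loc[df[flag_col] == True, value_col]
    b = df.loc[df[flag_col] == False, value_col]

    def looks_normal(x):
        # Shapiro-Wilk for small samples, D'Agostino-Pearson otherwise.
        test = stats.shapiro if len(x) <= 20 else stats.normaltest
        return test(x).pvalue > alpha

    if looks_normal(a) and looks_normal(b):
        return "t-test", stats.ttest_ind(a, b).pvalue
    # Fallback when at least one group fails the normality check.
    return "mann-whitney", stats.mannwhitneyu(a, b, alternative="two-sided").pvalue


# Seeded synthetic data so the run is repeatable (illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "event occurs": np.tile([True, False], 30),
    "percent change": rng.normal(10.0, 3.0, size=60),
})
name, p = compare_groups(demo)
print(name, round(float(p), 3))
```

Returning the test name together with the p-value keeps it visible which branch was taken, which matters because the two tests answer slightly different questions (difference in means versus difference in distributions).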
【Comments】:
Thanks for the reply and the code example; I have reviewed it. I posted one as well. Any thoughts?