Python implementation of the Wilson Score Interval?


【Title】: Python implementation of the Wilson Score Interval? 【Posted】: 2012-04-19 06:01:47 【Question】:

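After reading How Not To Sort By Average Rating, I was curious whether anyone has a Python implementation of the lower bound of the Wilson score confidence interval for a Bernoulli parameter?

【Comments】:

If n * p-hat * (1 - p-hat) is below some threshold, say 30-35, I would use a t distribution with df = (pos + neg) - 2 instead of the normal distribution, to be more precise. Anyway, just my two cents.

【Answer 1】: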

Reddit uses the Wilson score interval for comment ranking; an explanation and a Python implementation can be found here

#Rewritten code from /r2/r2/lib/db/_sorts.pyx

from math import sqrt

def confidence(ups, downs):
    n = ups + downs

    if n == 0:
        return 0

    z = 1.0  # two-sided z-score: 1.0 ~ 68%, 1.44 ~ 85%, 1.96 ~ 95%
    phat = float(ups) / n
    return ((phat + z*z/(2*n) - z * sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n))
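
As a minimal sketch of how this lower bound might be used for ranking, the snippet below restates the same formula and sorts a few items by it. The vote tallies and the names `comments` and `ranked` are made up for illustration, and `z` is exposed as a parameter here only for convenience.

```python
from math import sqrt

def confidence(ups, downs, z=1.0):
    """Lower bound of the Wilson score interval (same formula as above)."""
    n = ups + downs
    if n == 0:
        return 0.0
    phat = ups / n
    return (phat + z*z/(2*n) - z*sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

# Hypothetical (ups, downs) tallies, invented for this example.
comments = {"one_vote": (1, 0), "popular": (90, 10), "mixed": (60, 40)}

# Sort by the lower bound, highest first.
ranked = sorted(comments, key=lambda k: confidence(*comments[k]), reverse=True)
print(ranked)
```

With these tallies the single unanimous vote ranks below the 90/10 item even though its raw average is higher, which is the behaviour the lower bound is meant to provide.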

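【Comments】:

If you are only going to post a link, do it in a comment. If you post it as an answer, please provide more information and/or extract the code from the linked content, so that not everyone has to follow the link and the answer keeps its value even if the link dies.

This answer should be corrected to include the modification below!

@Vladtn I have just updated it with Gullevek's answer. Let me know if there are any other problems.

I would like to add that for a 95% confidence interval the z-score should be 1.96, not 1.6.

@Wesley Yes, and I believe 1.0 = 85% was also wrong; I have updated the answer... here is a table of values: dummies.com/how-to/content/…

【Answer 2】: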

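I think this is an incorrect Wilson calculation, because with 1 up and 0 down you get NaN: you cannot take the sqrt of a negative value.

The correct version can be found by looking at the Ruby example in the article How Not To Sort By Average Rating: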

return ((phat + z*z/(2*n) - z * sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n))
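
A quick way to see why the expression above is safe: the value under the square root is `(phat*(1-phat) + z*z/(4*n))/n`, which is at least `z*z/(4*n*n)` and so never negative for `n > 0`. The sketch below checks this over a small grid of vote counts; the function name `wilson_lower` and the grid size are arbitrary choices for illustration.

```python
from math import sqrt

def wilson_lower(ups, downs, z=1.96):
    """Wilson lower bound using the corrected expression."""
    n = ups + downs
    if n == 0:
        return 0.0
    phat = ups / n
    # Radicand is >= z*z/(4*n*n), so sqrt() never sees a negative value.
    radicand = (phat*(1-phat) + z*z/(4*n)) / n
    return (phat + z*z/(2*n) - z*sqrt(radicand)) / (1 + z*z/n)

# The case that reportedly produced NaN: 1 up, 0 down.
print(wilson_lower(1, 0, z=1.0))

# Evaluate a grid of small vote counts; no call raises a math domain error.
results = [wilson_lower(u, d) for u in range(20) for d in range(20)]
print(min(results), max(results))
```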

【Comments】:

【Answer 3】:

To get the Wilson CI without continuity correction, you can use proportion_confint from statsmodels.stats.proportion. To get the Wilson CI with continuity correction, you can use the code below.

# cf. 
# [1] R. G. Newcombe. Two-sided confidence intervals for the single proportion, 1998
# [2] R. G. Newcombe. Interval Estimation for the difference between independent proportions: comparison of eleven methods, 1998

import numpy as np
from statsmodels.stats.proportion import proportion_confint

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
def propci_wilson_cc(count, nobs, alpha=0.05):
    # get confidence limits for proportion
    # using wilson score method w/ cont correction
    # i.e. Method 4 in Newcombe [1]; 
    # verified via Table 1
    from scipy import stats
    n = nobs
    p = count/n
    q = 1.-p
    z = stats.norm.isf(alpha / 2.)
    z2 = z**2   
    denom = 2*(n+z2)
    num = 2.*n*p+z2-1.-z*np.sqrt(z2-2-1./n+4*p*(n*q+1))    
    ci_l = num/denom
    num = 2.*n*p+z2+1.+z*np.sqrt(z2+2-1./n+4*p*(n*q-1))
    ci_u = num/denom
    if p == 0:
        ci_l = 0.
    elif p == 1:
        ci_u = 1.
    return ci_l, ci_u


def dpropci_wilson_nocc(a,m,b,n,alpha=0.05):
    # get confidence limits for difference in proportions
    #   a/m - b/n
    # using wilson score method WITHOUT cont correction
    # i.e. Method 10 in Newcombe [2]
    # verified via Table II    
    theta = a/m - b/n        
    # pass the caller's alpha through instead of hard-coding 0.05
    l1, u1 = proportion_confint(count=a, nobs=m, alpha=alpha, method='wilson')
    l2, u2 = proportion_confint(count=b, nobs=n, alpha=alpha, method='wilson')
    ci_u = theta + np.sqrt((a/m-u1)**2+(b/n-l2)**2)
    ci_l = theta - np.sqrt((a/m-l1)**2+(b/n-u2)**2)     
    return ci_l, ci_u


def dpropci_wilson_cc(a,m,b,n,alpha=0.05):
    # get confidence limits for difference in proportions
    #   a/m - b/n
    # using wilson score method w/ cont correction
    # i.e. Method 11 in Newcombe [2]    
    # verified via Table II  
    theta = a/m - b/n    
    l1, u1 = propci_wilson_cc(count=a, nobs=m, alpha=alpha)
    l2, u2 = propci_wilson_cc(count=b, nobs=n, alpha=alpha)    
    ci_u = theta + np.sqrt((a/m-u1)**2+(b/n-l2)**2)
    ci_l = theta - np.sqrt((a/m-l1)**2+(b/n-u2)**2)     
    return ci_l, ci_u


# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
# single proportion testing 
# these come from Newcombe [1] (Table 1)
a_vec = np.array([81, 15, 0, 1])
m_vec = np.array([263, 148, 20, 29])
for (a,m) in zip(a_vec,m_vec):
    l1, u1 = proportion_confint(count=a, nobs=m, alpha=0.05, method='wilson')
    l2, u2 = propci_wilson_cc(count=a, nobs=m, alpha=0.05)
    print(a,m,l1,u1,'   ',l2,u2)

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
# difference in proportions testing 
# these come from Newcombe [2] (Table II)
a_vec = np.array([56,9,6,5,0,0,10,10],dtype=float)
m_vec = np.array([70,10,7,56,10,10,10,10],dtype=float)
b_vec = np.array([48,3,2,0,0,0,0,0],dtype=float)
n_vec = np.array([80,10,7,29,20,10,20,10],dtype=float)

print('\nWilson without CC')
for (a,m,b,n) in zip(a_vec,m_vec,b_vec,n_vec):
    l, u = dpropci_wilson_nocc(a,m,b,n,alpha=0.05)
    print('{:2.0f}/{:2.0f}-{:2.0f}/{:2.0f} ; {:6.4f} ; {:8.4f}, {:8.4f}'.format(a,m,b,n,a/m-b/n,l,u))

print('\nWilson with CC')
for (a,m,b,n) in zip(a_vec,m_vec,b_vec,n_vec):
    l, u = dpropci_wilson_cc(a,m,b,n,alpha=0.05)
    print('{:2.0f}/{:2.0f}-{:2.0f}/{:2.0f} ; {:6.4f} ; {:8.4f}, {:8.4f}'.format(a,m,b,n,a/m-b/n,l,u))

HTH

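【Comments】:

【Answer 4】:

The accepted solution appears to use a hard-coded z-value (which is best for performance).

If you want a direct Python equivalent of the Ruby formula from the blog post, with a dynamic z-value (based on the confidence level):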

import math

import scipy.stats as st


def ci_lower_bound(pos, n, confidence):
    if n == 0:
        return 0
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = 1.0 * pos / n
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)

【Comments】:

【Answer 5】:

If you want to actually compute z directly from the confidence level and want to avoid installing numpy/scipy, you can use the following snippet:

import math

def binconf(p, n, c=0.95):
  '''
  Calculate binomial confidence interval based on the number of positive and
  negative events observed.  Uses Wilson score and an approximation to the
  inverse of the normal cumulative distribution function.

  Parameters
  ----------
  p: int
      number of positive events observed
  n: int
      number of negative events observed
  c : optional, [0,1]
      confidence percentage. e.g. 0.95 means 95% confident the probability of
      success lies between the 2 returned values

  Returns
  -------
  theta_low  : float
      lower bound on confidence interval
  theta_high : float
      upper bound on confidence interval
  '''
  p, n = float(p), float(n)
  N    = p + n

  if N == 0.0: return (0.0, 1.0)

  p = p / N
  z = normcdfi(1 - 0.5 * (1-c))

  a1 = 1.0 / (1.0 + z * z / N)
  a2 = p + z * z / (2 * N)
  a3 = z * math.sqrt(p * (1-p) / N + z * z / (4 * N * N))

  return (a1 * (a2 - a3), a1 * (a2 + a3))


def erfi(x):
  """Approximation to inverse error function"""
  a  = 0.147  # MAGIC!!!
  a1 = math.log(1 - x * x)
  a2 = (
    2.0 / (math.pi * a)
    + a1 / 2.0
  )

  return (
    sign(x) *
    math.sqrt( math.sqrt(a2 * a2 - a1 / a) - a2 )
  )


def sign(x):
  if x  < 0: return -1
  if x == 0: return  0
  if x  > 0: return  1


def normcdfi(p, mu=0.0, sigma2=1.0):
  """Inverse CDF of normal distribution"""
  if mu == 0.0 and sigma2 == 1.0:
    return math.sqrt(2) * erfi(2 * p - 1)
  else:
    return mu + math.sqrt(sigma2) * normcdfi(p)
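
On Python 3.8 or later, the standard library's `statistics.NormalDist` offers an inverse CDF, so the hand-rolled approximation may not be needed. The following is a minimal sketch of that alternative, not part of the answer above; the helper name `z_for_confidence` is made up for illustration.

```python
from statistics import NormalDist  # standard library, Python 3.8+

def z_for_confidence(c):
    """Two-sided z-score for a confidence level c, with 0 < c < 1."""
    return NormalDist().inv_cdf(1 - 0.5 * (1 - c))

# Print the z-scores for a few common confidence levels.
for c in (0.85, 0.95, 0.99):
    print(c, round(z_for_confidence(c), 3))
```

The printed values should be close to the familiar 1.44, 1.96 and 2.58 for 85%, 95% and 99% confidence.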

【Comments】:

print(binconf(50, 100)) => (0.26291792852889806, 0.41206457669597374) ... 50 positive events out of 100 total events gives a range whose upper bound is below 0.5?
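
One likely explanation, going by the docstring: the second argument of `binconf` is the number of negative events, not the total. So `binconf(50, 100)` describes 50 positives and 100 negatives, an observed rate of 1/3. The sketch below restates the same Wilson formula with the z-score taken from `statistics.NormalDist` instead of the approximation; the name `wilson_interval` is an assumption made for this example.

```python
from math import sqrt
from statistics import NormalDist  # standard library, Python 3.8+

def wilson_interval(pos, neg, c=0.95):
    """Wilson score interval from positive and negative counts (not the total)."""
    total = pos + neg
    if total == 0:
        return (0.0, 1.0)
    z = NormalDist().inv_cdf(1 - 0.5 * (1 - c))
    phat = pos / total
    scale = 1.0 / (1.0 + z*z/total)
    centre = phat + z*z/(2*total)
    spread = z * sqrt(phat*(1-phat)/total + z*z/(4*total*total))
    return (scale*(centre - spread), scale*(centre + spread))

print(wilson_interval(50, 100))  # 50 positive, 100 negative: 150 events in total
print(wilson_interval(50, 50))   # 50 positive out of 100 events in total
```

The first call should give roughly the interval quoted in the comment, while the second is symmetric around 0.5.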
