在 Python 中计算 Pearson 相关性和显着性

Posted 2023-03-12

技术标签:

【中文标题】在 Python 中计算 Pearson 相关性和显着性【英文标题】：Calculating Pearson correlation and significance in Python 【发布时间】：2011-04-26 08:11:00 【问题描述】：

我正在寻找一个函数，它以两个列表作为输入，并返回 Pearson correlation，以及相关性的重要性。

【问题讨论】：

【参考方案1】：

你可以看看scipy.stats：

from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation

【讨论】：

两个字典的相关系数如何？！ @user702846 Pearson 相关是在 2xN 矩阵上定义的。没有普遍适用的方法可以将两个字典转换为 2xN 矩阵，但您可以使用字典值对的数组，这些字典值对对应于字典键的交集键。【参考方案2】：

皮尔逊相关系数可以用numpy的corrcoef计算。

import numpy
numpy.corrcoef(list1, list2)[0, 1]

【讨论】：

输出令人困惑，但实际上非常简单。检查这个解释***.com/a/3425548/1245622 这不会产生所要求的相关意义吧？【参考方案3】：

替代方法可以是来自linregress 的本机 scipy 函数，它计算：

slope : 回归线的斜率

intercept : 回归线的截距

r 值：相关系数

p 值：假设检验的双边 p 值，其原假设是斜率为零

stderr : 估计的标准误差

这是一个例子：

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)

会还给你：

LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)

【讨论】：

很好的答案 - 迄今为止最丰富的答案。也适用于两行 pandas.DataFrame：lineregress(two_row_df) 出色的答案。也很直观，如果你想的话【参考方案4】：

如果您不想安装 scipy，我使用了这个快速 hack，对 Programming Collective Intelligence 稍作修改：

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(xi*xi for xi in x)
  sum_y_sq = sum(yi*yi for yi in y)
  psum = sum(xi*yi for xi, yi in zip(x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den

【讨论】：

我惊讶地发现这与 Excel、NumPy 和 R 不一致。请参阅 ***.com/questions/3949226/…。正如另一位评论者指出的那样，这有一个浮点/整数错误。我认为 sum_y/n 是整数的整数除法。如果你使用 sum_x = float(sum(x)) 和 sum_y = float(sum(y))，它可以工作。 @dfrankow 我认为这是因为 imap 无法处理浮动。 python 在num = psum - (sum_x * sum_y/n) 处给出TypeError: unsupported operand type(s) for -: 'itertools.imap' and 'float' 作为一种风格说明，Python 不赞成这种不必要的 map 用法（支持列表推导式）作为评论，请考虑像 scipy 等人的库是由了解大量数值分析的人开发的。这可以避免很多常见的陷阱（例如，在 X 或 Y 中有非常大和非常小的数字可能会导致灾难性的取消）【参考方案5】：

以下代码是the definition的直接解释：

import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    return diffprod / math.sqrt(xdiff2 * ydiff2)

测试：

print pearson_def([1,2,3], [1,5,7])

0.981980506062

这与 Excel、this calculator、SciPy（也称为 NumPy）一致，它们分别返回 0.981980506 和 0.9819805060619657 和 0.98198050606196574。

> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805

编辑：修复了评论者指出的错误。

【讨论】：

注意变量的类型！您遇到了 int/float 问题。在sum(x) / len(x) 中，您划分整数，而不是浮点数。所以sum([1,5,7]) / len([1,5,7]) = 13 / 3 = 4，根据整数除法（而你想要13. / 3. = 4.33...）。要修复它，将此行重写为 float(sum(x)) / float(len(x))（一个浮点数就足够了，因为 Python 会自动转换它）。您的代码不适用于以下情况：[10,10,10],[0,0,0] 或 [10,10],[10,0]。甚至 [10,10],[10,10] 没有为任何这些情况定义相关系数。将它们放入 R 中将为所有三个返回“NA”。【参考方案6】：

您也可以使用pandas.DataFrame.corr 做到这一点：

import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()

这给了

          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000

【讨论】：

这只是相关性，没有意义【参考方案7】：

在python中使用pandas计算皮尔逊系数：我建议尝试这种方法，因为您的数据包含列表。由于您可以可视化数据结构并根据需要更新它，因此可以轻松地与您的数据交互并从控制台对其进行操作。您还可以导出数据集并将其保存并从 python 控制台添加新数据以供以后分析。此代码更简单，包含的代码行更少。我假设您需要几行快速的代码来筛选数据以进行进一步分析

例子：

data = 'list 1':[2,4,6,8],'list 2':[4,16,36,64]

import pandas as pd #To Convert your lists to pandas data frames convert your lists into pandas dataframes

df = pd.DataFrame(data, columns = ['list 1','list 2'])

from scipy import stats # For in-built method to get PCC

pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results

但是，您没有发布您的数据让我查看数据集的大小或分析之前可能需要的转换。

【讨论】：

您好，欢迎来到 ***！尝试在答案开头添加简短说明，说明您选择此代码的原因以及它在这种情况下的应用方式！【参考方案8】：

与其依赖 numpy/scipy，我认为我的答案应该是最容易编写代码并理解计算 Pearson 相关系数 (PCC) 的步骤。

import math

# calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
         sum += i
    return sum / len(x) 

# calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
         sumv += (i - mean(x))**2
    return math.sqrt(sumv/(len(x)-1))

# calculates the PCC using both the 2 functions above
def pearson(x,y):
    scorex = []
    scorey = []

    for i in x: 
        scorex.append((i - mean(x))/sampleStandardDeviation(x)) 

    for j in y:
        scorey.append((j - mean(y))/sampleStandardDeviation(y))

# multiplies both lists together into 1 list (hence zip) and sums the whole list   
    return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)

PCC 的意义基本上是向您展示两个变量/列表的强相关。需要注意的是，PCC 值的范围从 -1 到 1。 0 到 1 之间的值表示正相关。 0 值 = 最高变化（没有任何相关性）。 -1 到 0 之间的值表示负相关。

【讨论】：

请注意，Python 有一个内置的sum 函数。它在具有 500 多个值的 2 个列表上具有惊人的复杂性和缓慢的性能。【参考方案9】：

嗯，其中许多响应的代码很长且难以阅读...

我建议在处理数组时使用 numpy 及其漂亮的功能：

import numpy as np
def pcc(X, Y):
   ''' Compute Pearson Correlation Coefficient. '''
   # Normalise X and Y
   X -= X.mean(0)
   Y -= Y.mean(0)
   # Standardise X and Y
   X /= X.std(0)
   Y /= Y.std(0)
   # Compute mean product
   return np.mean(X*Y)

# Using it on a random example
from random import random
X = np.array([random() for x in xrange(100)])
Y = np.array([random() for x in xrange(100)])
pcc(X, Y)

【讨论】：

虽然我非常喜欢这个答案，但我建议在函数内复制/克隆 X 和 Y。否则两者都会被改变，这可能不是我们想要的行为。【参考方案10】：

这是使用 numpy 实现 Pearson Correlation 函数：


def corr(data1, data2):
    "data1 & data2 should be numpy arrays."
    mean1 = data1.mean() 
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()

#     corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
    corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
    return corr

【讨论】：

【参考方案11】：

这是 mkh 答案的一个变体，运行速度比它快得多，还有 scipy.stats.pearsonr，使用 numba。

import numba

@numba.jit
def corr(data1, data2):
    M = data1.size

    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M

    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])

    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)

【讨论】：

【参考方案12】：

这是一个基于稀疏向量的皮尔逊相关的实现。这里的向量表示为表示为 (index, value) 的元组列表。这两个稀疏向量可以有不同的长度，但所有向量的大小必须相同。这对于向量大小非常大的文本挖掘应用程序很有用，因为大多数特征是词袋，因此通常使用稀疏向量执行计算。

def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
    indexed_feature_dict = 
    if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
        raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")

    sum_a = sum(value for index, value in first_feature_vector)
    sum_b = sum(value for index, value in second_feature_vector)

    avg_a = float(sum_a) / length_of_featureset
    avg_b = float(sum_b) / length_of_featureset

    mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
        length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
    mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
        length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))

    covariance_a_b = 0

    #calculate covariance for the sparse vectors
    for tuple in first_feature_vector:
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
        indexed_feature_dict[tuple[0]] = tuple[1]
    count_of_features = 0
    for tuple in second_feature_vector:
        count_of_features += 1
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
        if tuple[0] in indexed_feature_dict:
            covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b))
            del (indexed_feature_dict[tuple[0]])
        else:
            covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b)

    for index in indexed_feature_dict:
        count_of_features += 1
        covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)

    #adjust covariance with rest of vector with 0 value
    covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b

    if mean_sq_error_a == 0 or mean_sq_error_b == 0:
        return -1
    else:
        return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)

单元测试：

def test_get_get_pearson_corelation(self):
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)

    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)

【讨论】：

【参考方案13】：

我有一个非常简单易懂的解决方案。对于两个长度相等的数组，皮尔逊系数可以很容易地计算如下：

def manual_pearson(a,b):
"""
Accepts two arrays of equal length, and computes correlation coefficient. 
Numerator is the sum of product of (a - a_avg) and (b - b_avg), 
while denominator is the product of a_std and b_std multiplied by 
length of array. 
"""
  a_avg, b_avg = np.average(a), np.average(b)
  a_stdev, b_stdev = np.std(a), np.std(b)
  n = len(a)
  denominator = a_stdev * b_stdev * n
  numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
  p_coef = numerator/denominator
  return p_coef

【讨论】：

【参考方案14】：

您可能想知道如何在寻找特定方向的相关性（负相关或正相关）的上下文中解释您的概率。这是我编写的一个函数来帮助解决这个问题。甚至可能是对的！

它基于我从 http://www.vassarstats.net/rsig.html 和 http://en.wikipedia.org/wiki/Student%27s_t_distribution 收集的信息，感谢此处发布的其他答案。

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#  (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# that there is no correlation in the given direction.
#
# direction:
#  if positive, p is the probability that there is no positive correlation in
#    the population sampled by X and Y
#  if negative, p is the probability that there is no negative correlation
#  if 0, p is the probability that there is no correlation in either direction
def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " + \
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)

    if not direction:
        return (corr, prb_2_tail)

    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)

    return (corr, 1 - prb_1_tail)

【讨论】：

【参考方案15】：

你可以看看这篇文章。这是一个有据可查的示例，用于使用 pandas 库（用于 Python）根据来自多个文件的历史外汇货币对数据计算相关性，然后使用 seaborn 库生成热图。

http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/

【讨论】：

【参考方案16】：

从Python 3.10 开始，Pearson 相关系数 (statistics.correlation) 在标准库中直接可用：

from statistics import correlation

# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852

【讨论】：

【参考方案17】：

def pearson(x,y):
  n=len(x)
  vals=range(n)

  sumx=sum([float(x[i]) for i in vals])
  sumy=sum([float(y[i]) for i in vals])

  sumxSq=sum([x[i]**2.0 for i in vals])
  sumySq=sum([y[i]**2.0 for i in vals])

  pSum=sum([x[i]*y[i] for i in vals])
  # Calculating Pearson correlation
  num=pSum-(sumx*sumy/n)
  den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5
  if den==0: return 0
  r=num/den
  return r

【讨论】：

纯代码答案不被认为是好的做法。请考虑添加几句话来解释您的代码如何解决该问题。（阅读有关如何回答关于 SO 的问题的帮助页面）

以上是关于在 Python 中计算 Pearson 相关性和显着性的主要内容，如果未能解决你的问题，请参考以下文章