如何在 python 中计算大型 spark 数据框的 kendall tau？

Posted 2023-04-15

技术标签:

【中文标题】如何在 python 中计算大型 spark 数据框的 kendall tau？【英文标题】：How to calculate kendall's tau for a large spark dataframe in python? 【发布时间】：2019-07-19 20:32:52 【问题描述】：

我想计算大型 spark 数据帧的成对 kendall 的 tau 等级相关性。它很大（比如 10m 行和 10k 列）无法转换为 pandas 数据帧，然后使用 pandas.DataFrame.corr 进行计算。

另外，每一列可能有空值，因此在计算成对的kendall's tau时，需要排除两列中任何一个具有空值的行。

我检查了 pyspark.mllib.stat.Statistics.corr。它支持“pearson”和“spearman”。

    df_rdd = df.rdd.map(lambda row: row[0:])
    corr_mat = Statistics.corr(df_rdd, method='spearman')

Spearman 可能会替代我的 Kendall。但是，它不排除空值，因此返回的相关矩阵会受到空值的影响（如果一列包含空，则与该列的相关变为全空）。

有人遇到过同样的问题吗？将列分成块只会得到一个块相关矩阵。相反，循环遍历所有对非常慢...

谢谢！！

【问题讨论】：

【参考方案1】：

Spark 尚不支持 Kendalls 的等级。但是，如果这对您来说还不算太晚，我找到了以下code ，您可以使用它来计算它。

这里是一个例子：

from operator import add 

#sample data in lists
variable_1 = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
variable_2 = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]

#zip sample data and convert to rdd
example_data = zip(variable_1, variable_2)
example_rdd = sc.parallelize(example_data)

#filer out all your null values. Row containing nulls will be removed
example_rdd = example_rdd.filter(lambda x: x is not None).filter(lambda x: x != "")

#take the cartesian product of example data (generate all possible combinations)
all_pairs = example_rdd.cartesian(example_rdd)

#function calculating concorant and disconordant pairs
def calc(pair):
    p1, p2 = pair
    x1, y1 = p1
    x2, y2 = p2
    if (x1 == x2) and (y1 == y2):
        return ("t", 1) #tie
    elif ((x1 > x2) and (y1 > y2)) or ((x1 < x2) and (y1 < y2)):
        return ("c", 1) #concordant pair
    else:
        return ("d", 1) #discordant pair

#rank all pairs and calculate concordant / disconrdant pairs with calc() then return results
results  = all_pairs.map(calc)

#aggregate the results
results = results.aggregateByKey(0, add, add)

#count and collect
n  = example_rdd.count()
d = k: v for (k, v) in results.collect()

# http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient
tau = (d["c"] - d["d"]) / (0.5 * n * (n-1))

也许这会有所帮助，或者至少可供将来参考。

【讨论】：

以上是关于如何在 python 中计算大型 spark 数据框的 kendall tau？的主要内容，如果未能解决你的问题，请参考以下文章