Need to find eigenvectors in pyspark for a non-symmetric square matrix through eigenvalue decomposition similar to scipy.linalg.eig

Posted 2023-04-15

【Posted】: 2017-09-21 14:31:07

【Question】:

I am a beginner, so please correct me if I am wrong.

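I have a square matrix of size 1 million x 1 million, and I want to find its eigenvectors in pyspark. I know computeSVD gives me eigenvectors, but those come from the SVD, and the result is a dense matrix, which is a local data structure. I want the result that scipy.linalg.eig gives.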
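
To make the difference between the two decompositions concrete, here is a minimal NumPy sketch on a tiny non-symmetric matrix (the matrix values and the helper `is_eigenvector` are illustrative assumptions, not part of the question). It shows that the right singular vectors are eigenvectors of A.T @ A, but in general not of A itself.

```python
import numpy as np

# A small non-symmetric matrix (illustrative values only).
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns v satisfy A v = lambda v
U, s, Vt = np.linalg.svd(A)           # A = U @ diag(s) @ Vt
V = Vt.T

# Residual of the eigen equation for the eig result (should be ~0).
eig_residual = np.linalg.norm(A @ eigvecs - eigvecs * eigvals)

# Right singular vectors are eigenvectors of A.T @ A with eigenvalues s**2.
gram_residual = np.linalg.norm(A.T @ A @ V - V * s**2)

def is_eigenvector(M, v, tol=1e-8):
    """Return True if M @ v is parallel to v (hypothetical helper)."""
    w = M @ v
    projection = (v @ w) / (v @ v) * v
    return np.linalg.norm(w - projection) < tol

# For this matrix the singular vectors do not solve A v = lambda v.
svd_vectors_are_eigvecs = [is_eigenvector(A, V[:, j]) for j in range(2)]
eig_vectors_are_eigvecs = [is_eigenvector(A, eigvecs[:, j]) for j in range(2)]
```

For a symmetric positive semidefinite matrix the two sets of vectors coincide up to sign, which is why an SVD can stand in for an eigendecomposition only in that special case.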

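I see there is a function EigenValueDecomposition that uses ARPACK in the Java and Scala APIs for Spark. Will it give the same eigenvectors as eig in scipy? If so, is there any way to use it in pyspark? Or is there an alternative solution? Can I somehow use ARPACK directly in my code, or do I have to write the Arnoldi iteration myself (for example)? Thanks for your help.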
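
On the question of using ARPACK directly: SciPy exposes it through scipy.sparse.linalg.eigs, which handles non-symmetric matrices and only needs a matrix-vector product. The sketch below is a local stand-in, assuming a small random test matrix and a plain `matvec` function; in a Spark setting the body of `matvec` would have to be replaced by a distributed multiplication, which is not shown here. Note that ARPACK returns only a few eigenpairs (here k=3), not the full spectrum that scipy.linalg.eig computes.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, eigs

n = 200
rng = np.random.default_rng(0)

# Illustrative non-symmetric test matrix: weak sparse noise plus three
# dominant diagonal entries, so the leading eigenvalues are well separated.
noise = 0.1 * rng.standard_normal((n, n)) * (rng.random((n, n)) < 0.05)
dense = noise.copy()
dense[0, 0] += 10.0
dense[1, 1] += 8.0
dense[2, 2] += 6.0
A = csr_matrix(dense)

def matvec(x):
    # Stand-in for a distributed product: with Spark this would multiply
    # the distributed matrix by the vector x, which is small enough to
    # live on the driver.
    return A @ x

# ARPACK never sees the matrix itself, only the matvec callback.
op = LinearOperator((n, n), matvec=matvec, dtype=np.float64)

# k eigenvalues of largest magnitude and their right eigenvectors.
vals, vecs = eigs(op, k=3, which="LM")

# Column-wise residual ||A v - lambda v|| as a sanity check.
residual = np.linalg.norm(A @ vecs - vecs * vals, axis=0)
print(vals)
print(residual)
```

The design point is that each Arnoldi step costs one matrix-vector product, so only that product needs to be distributed; the vectors themselves have length n and stay local.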

【Comments】:

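Have you tried scipy.linalg.eig in pyspark? I think it will work.

@RishikeshTeke The problem is that the whole matrix cannot reside on the driver node; it runs out of memory. So I need something distributed.

I found this in the Pyspark documentation, it may help: for dense vectors, MLlib uses the NumPy array type, so you can simply pass NumPy arrays around. For sparse vectors, users can construct a SparseVector object from MLlib or pass SciPy scipy.sparse column vectors if SciPy is available in their environment. On a dense matrix there is a toArray method that converts it to an ndarray, which you can feed to scipy.linalg.eig.

@RishikeshTeke Thanks for your help. The problem is that I cannot use a numpy array because it does not fit in driver memory. I need to perform the whole operation on distributed matrices. I need a function in spark similar to scipy.linalg.eig that runs in a distributed fashion, not on the driver with local data structures such as numpy arrays.

In that case I would suggest using SciPy's ARPACK wrapper, scipy.sparse.linalg.eigs.

【Answer 1】: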

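I wrote some Python code that takes a scipy sparse matrix and builds a RowMatrix as input to computeSVD. This is the part where you need to convert the csr_matrix into a list of SparseVectors. I use a parallel version because the sequential one is much slower and it is easy to parallelize.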

import numpy as np
from functools import reduce
from multiprocessing.dummy import Pool as ThreadPool

from pyspark.mllib.linalg import SparseVector  # RowMatrix expects mllib vectors
from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.sql import DataFrame

# fullMatrix is assumed to be an existing scipy.sparse.csr_matrix.
num_row, num_col = fullMatrix.shape
lst_total = [None] * num_row
selected_indices = list(range(num_row))

def addMllibSparseVector(idx):
    # Convert one CSR row into an (id, SparseVector) tuple.
    curr = fullMatrix.getrow(idx)
    # SparseVector requires its indices in increasing order.
    arr_ind = np.argsort(curr.indices)
    lst_total[idx] = (idx, SparseVector(num_col,
                                        curr.indices[arr_ind],
                                        curr.data[arr_ind]))

# Thread pool (multiprocessing.dummy), so all workers share lst_total.
pool = ThreadPool()
pool.map(addMllibSparseVector, selected_indices)
pool.close()
pool.join()
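
As a side note on the conversion step, a possible alternative to calling getrow for every row is to slice the CSR arrays directly through indptr after sorting the indices once. The sketch below runs without Spark; the generator name `csr_rows_as_triples` and the demo matrix are assumptions for illustration. Each yielded triple (size, indices, values) has the shape of the arguments that SparseVector takes.

```python
import numpy as np
from scipy.sparse import csr_matrix

def csr_rows_as_triples(mat):
    """Yield (row_index, size, indices, values) for each row of a CSR matrix."""
    mat = mat.tocsr(copy=True)   # work on a copy so the input is untouched
    mat.sort_indices()           # sort column indices within every row, once
    num_col = mat.shape[1]
    for i in range(mat.shape[0]):
        lo, hi = mat.indptr[i], mat.indptr[i + 1]
        yield i, num_col, mat.indices[lo:hi], mat.data[lo:hi]

# Tiny demo matrix with one empty row (illustrative values only).
demo = csr_matrix(np.array([[0.0, 2.0, 0.0],
                            [1.0, 0.0, 3.0],
                            [0.0, 0.0, 0.0]]))
triples = list(csr_rows_as_triples(demo))

# Rebuild a dense copy from the triples to check the round trip.
rebuilt = np.zeros(demo.shape)
for row, size, indices, values in triples:
    rebuilt[row, indices] = values
```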

Then I create the DataFrames with the code below.

import math

batch_size = 5000
num_range = math.ceil(num_row / batch_size)

lst_dfs = [None] * num_range
selected_dataframes = list(range(num_range))

def makeDataframes(idx):
    # Build one DataFrame per batch of rows; sqlContext is assumed to exist.
    start = idx * batch_size
    end = min(start + batch_size, num_row)
    lst_dfs[idx] = sqlContext.createDataFrame(lst_total[start:end],
                                              ["id", "features"])

pool = ThreadPool()
pool.map(makeDataframes, selected_dataframes)
pool.close()
pool.join()

Then I reduce them to a single DataFrame and create the RowMatrix.

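# reduce takes the list itself, not the unpacked *lst_dfs.
raw_df = reduce(DataFrame.unionAll, lst_dfs)
# RowMatrix needs an RDD of vectors, so unwrap the vector from each Row.
raw_rdd = raw_df.select('features').rdd.map(lambda row: row.features)
raw_rdd.cache()
mat = RowMatrix(raw_rdd)
# Top 100 singular triplets; U is a distributed RowMatrix, s and V are local.
svd = mat.computeSVD(100, computeU=True)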

I simplified the code and have not fully tested it. If there is any problem, feel free to comment.
