Efficiently finding the indices of a sparse matrix's smallest columns

Posted 2023-03-12


【Title】Efficiently finding the indices of a sparse matrix's smallest columns 【Posted】2022-01-15 10:01:39 【Question】:

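By "smallest columns" I mean the columns whose elements have the smallest (i.e. most negative) sum. Here is my attempt, but it is inefficient because it builds a full list of column sums in a Python loop. h is a scipy.sparse matrix and k is the number of indices requested. It does not matter whether the result is sorted.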

import numpy as np

def indices_of_smallest_columns(h, k):
    n_cols = h.shape[1]  # number of columns; shape[0] is the row count
    arr = [h.tocsc().getcol(i).sum() for i in range(n_cols)]
    return np.argpartition(arr, k)[:k]

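【Comments】:

How can you get the smallest ones without finding all the sums?
I should have been more precise.
In that case, a list comprehension / for loop is definitely not the way to go.

【Answer 1】: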

In [1]: from scipy import sparse
In [2]: M = sparse.random(10,10,.2)
In [3]: M
Out[3]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in COOrdinate format>

Your list of sums:

In [5]: [M.tocsc().getcol(i).sum() for i in range(10)]
Out[5]: 
[1.5659425833256746,
 1.7665038140319338,
 0.0,
 0.6422706809316442,
 0.24922121199061487,
 1.439977730279475,
 0.17827454933565012,
 1.7955436609690185,
 0.4275656628694753,
 1.4029484081520989]

Getting the column sums directly from the matrix:

In [6]: M.sum(axis=0)
Out[6]: 
matrix([[1.56594258, 1.76650381, 0.        , 0.64227068, 0.24922121,
         1.43997773, 0.17827455, 1.79554366, 0.42756566, 1.40294841]])

scipy.sparse computes a sum like this with a matrix multiplication.
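To illustrate the idea (not scipy's actual internal code), the column sums can be reproduced by multiplying the transposed matrix by a vector of ones. The small matrix `dense` and the names `via_product` and `via_sum` are made up for this sketch.

```python
import numpy as np
from scipy import sparse

# Small hand-written example matrix (hypothetical data for illustration).
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0],
                  [4.0, 0.0, 0.0]])
M = sparse.csc_matrix(dense)

# Column sums as a matrix-vector product: M^T times a vector of ones.
via_product = M.T @ np.ones(M.shape[0])

# The built-in reduction; flatten the 1 x n result to a 1-D array.
via_sum = np.asarray(M.sum(axis=0)).ravel()

print(via_product)
print(via_sum)
```

Both approaches give the same values. `M.sum(axis=0)` is the one to use in practice, since it avoids the Python-level loop over columns entirely.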

Timings:

In [7]: timeit [M.tocsc().getcol(i).sum() for i in range(10)]
2.87 ms ± 90.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: timeit M.sum(axis=0)
161 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

If the matrix is already in CSC format, the timing is even better:

In [12]: %%timeit h=M.tocsc()
    ...: h.sum(axis=0)
    ...: 
    ...: 
54.5 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
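Putting this together, a minimal sketch of the function from the question might look like the following. It assumes `h` is any scipy.sparse matrix and `1 <= k <= number of columns`; the example matrix is made up. Note that it passes `k - 1` to `np.argpartition` rather than `k` as in the question, so that `k` equal to the number of columns also works.

```python
import numpy as np
from scipy import sparse

def indices_of_smallest_columns(h, k):
    """Return the indices of the k columns of h with the smallest sums (unsorted)."""
    # h.sum(axis=0) gives a 1 x n_cols result; flatten it to a 1-D ndarray.
    col_sums = np.asarray(h.sum(axis=0)).ravel()
    # Partition around position k - 1: the first k entries are the k smallest.
    return np.argpartition(col_sums, k - 1)[:k]

# Hypothetical example: the column sums are 3, -3, 0, -1.
dense = np.array([[1.0, -2.0, 0.0,  3.0],
                  [0.0, -1.0, 0.0, -5.0],
                  [2.0,  0.0, 0.0,  1.0]])
h = sparse.csc_matrix(dense)
result = indices_of_smallest_columns(h, 2)
print(sorted(result.tolist()))
```

All column sums are still computed, as the comment on the question points out is unavoidable, but the work happens in compiled code instead of a Python loop.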

