如何点积 (1,10^13) (10^13,1) scipy 稀疏矩阵

Posted 2023-03-12

技术标签:

【中文标题】如何点积 (1,10^13) (10^13,1) scipy 稀疏矩阵【英文标题】：How to dot product (1,10^13) (10^13,1) scipy sparse matrix如何点积 (1,10^13) (10^13,1) scipy 稀疏矩阵 【发布时间】：2021-12-24 09:51:32 【问题描述】：

基本上是标题的含义。这两个矩阵大多为零。第一个是 1 x 9999999999999，第二个是 9999999999999 x 1 当我尝试做一个点积时，我得到了这个。

Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64

Full traceback </br>

    MemoryError: Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64

In [31]: imputed.dot(s)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-31-670cfc69d4cf> in <module>
----> 1 imputed.dot(s)

~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in dot(self, other)
    357 
    358         """
--> 359         return self * other
    360 
    361     def power(self, n, dtype=None):

~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in __mul__(self, other)
    478             if self.shape[1] != other.shape[0]:
    479                 raise ValueError('dimension mismatch')
--> 480             return self._mul_sparse_matrix(other)
    481 
    482         # If it's a list or whatever, treat it like a matrix

~/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py in _mul_sparse_matrix(self, other)
    499 
    500         major_axis = self._swap((M, N))[0]
--> 501         other = self.__class__(other)  # convert to this format
    502 
    503         idx_dtype = get_index_dtype((self.indptr, self.indices,

~/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
     32                 arg1 = arg1.copy()
     33             else:
---> 34                 arg1 = arg1.asformat(self.format)
     35             self._set_self(arg1)
     36 

~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in asformat(self, format, copy)
    320             # Forward the copy kwarg, if it's accepted.
    321             try:
--> 322                 return convert_method(copy=copy)
    323             except TypeError:
    324                 return convert_method()

~/.local/lib/python3.8/site-packages/scipy/sparse/csc.py in tocsr(self, copy)
    135         idx_dtype = get_index_dtype((self.indptr, self.indices),
    136                                     maxval=max(self.nnz, N))
--> 137         indptr = np.empty(M + 1, dtype=idx_dtype)
    138         indices = np.empty(self.nnz, dtype=idx_dtype)
    139         data = np.empty(self.nnz, dtype=upcast(self.dtype))

MemoryError: Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64

似乎 scipy 正在尝试创建一个临时数组。我正在使用 scipy 提供的 .dot 方法。我也对非 scipy 解决方案持开放态度。谢谢！

【问题讨论】：

我的意思是一般乘法。这里我使用稀疏矩阵点。这与您提供的类似，但适用于稀疏矩阵。所以基本上答案是肯定的你用了什么代码？ np.dot(A,B)? A.dot(B)? A*B? A@B?您阅读了多少scipy.sparse 文档？ A.dot(B) 和 A*B 和 A.multiply(B) 都产生相同的错误确实没有阅读太多文档。 【参考方案1】：

In [105]: from scipy import sparse

如果我制作一个 (100,1) csr 矩阵：

In [106]: A = sparse.random(100,1,format='csr')
In [107]: A
Out[107]: 
<100x1 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in Compressed Sparse Row format>

数据和索引是：

In [109]: A.data
Out[109]: array([0.19060481])
In [110]: A.indices
Out[110]: array([0], dtype=int32)
In [112]: A.indptr
Out[112]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

因此，即使只有 1 个非零项，一个数组也很大 (101)。

另一方面，同一数组的csc 格式的存储空间要小得多。但是具有 (1,100) 形状的 csc 看起来像 csr。

In [113]: Ac = A.tocsc()
In [114]: Ac.indptr
Out[114]: array([0, 1], dtype=int32)
In [115]: Ac.indices
Out[115]: array([88], dtype=int32)

数学，尤其是矩阵乘积是使用csr/csc 格式完成的。因此，可能很难避免使用 80 TB 内存。

查看回溯，我发现它正在尝试将 other 转换为与 self 匹配的格式。

所以 A.dot(B) 和 A 是 (1,N) csr，小形状。 B 是 (N,1) csc，也是小形状。但是B.tocsr() 需要大的(N+1,) 形状的indptr。

让我们试试dot的替代方法

前 2 个矩阵：

In [122]: A = sparse.random(1,100, .2,format='csr')
In [123]: B = sparse.random(100,1, .2,format='csc')
In [124]: A
Out[124]: 
<1x100 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>
In [125]: B
Out[125]: 
<100x1 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Column format>
In [126]: A@B
Out[126]: 
<1x1 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in Compressed Sparse Row format>
In [127]: _.A
Out[127]: array([[1.33661021]])

它们的非零元素索引。只有匹配的才重要。

In [128]: A.indices, B.indices
Out[128]: 
(array([16, 20, 23, 28, 30, 37, 39, 40, 43, 49, 54, 59, 61, 63, 67, 70, 74,
        91, 94, 99], dtype=int32),
 array([ 5,  8, 15, 25, 34, 35, 40, 46, 47, 51, 53, 60, 68, 70, 75, 81, 87,
        90, 91, 94], dtype=int32))

等式矩阵：

In [129]: mask = A.indices[:,None]==B.indices

In [132]: np.nonzero(mask.any(axis=0))
Out[132]: (array([ 6, 13, 18, 19]),)
In [133]: np.nonzero(mask.any(axis=1))
Out[133]: (array([ 7, 15, 17, 18]),)

匹配的索引：

In [139]: A.indices[Out[133]]
Out[139]: array([40, 70, 91, 94], dtype=int32)
In [140]: B.indices[Out[132]]
Out[140]: array([40, 70, 91, 94], dtype=int32)

对应的data值的总和匹配[127]

In [141]: (A.data[Out[133]]*B.data[Out[132]]).sum()
Out[141]: 1.3366102138511582

【讨论】：

以上是关于如何点积 (1,10^13) (10^13,1) scipy 稀疏矩阵的主要内容，如果未能解决你的问题，请参考以下文章