CUDA out of memory error when doing matrix multiplication using Numba

Posted: 2021-07-08 15:29:44

【Problem description】:

I need to multiply a matrix with its transpose, but I am running out of memory on my GPU with the error message numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

I expect my matrix to have roughly 10k rows and 100k columns, so multiplying it by its transpose gives a square result with 10k rows and 10k columns. The matrix contains only 0s and 1s.
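For orientation, here is a quick back-of-the-envelope estimate of what those arrays cost on the device. The dtypes assumed here match the script below (np.random.randint's default 64-bit integers for A and its transposed copy B, and a float64 result array C); as the comments below point out, Numba copies both A and B to the device at kernel launch.

# Rough device-memory estimate for the sizes described above. Assumed dtypes:
# int64 for A (and its transposed copy B), float64 for the result array C,
# matching the script below.
rows, cols = 10_000, 100_000
gib = 2**30

print(f"A  (int64):    {rows * cols * 8 / gib:5.2f} GiB")   # ~7.5 GiB
print(f"B = A.T copy:  {rows * cols * 8 / gib:5.2f} GiB")   # another ~7.5 GiB on the device
print(f"C  (float64):  {rows * rows * 8 / gib:5.2f} GiB")   # ~0.75 GiB
print(f"A as int8:     {rows * cols * 1 / gib:5.2f} GiB")   # ~0.93 GiB
print(f"A bit-packed:  {rows * cols / 8 / gib:5.2f} GiB")   # ~0.12 GiB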

This is the script I am running:

from numba import cuda, uint16
import numba
import numpy
import math
import time

TPB = 16

@cuda.jit()
def matmul_shared_mem(A, B, C):
    sA = cuda.shared.array((TPB, TPB), dtype=uint16)
    sB = cuda.shared.array((TPB, TPB), dtype=uint16)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    if x >= C.shape[0] and y >= C.shape[1]:
        return
    tmp = 0.
    for i in range(int(A.shape[1] / TPB)):
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]
        cuda.syncthreads()
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]
        cuda.syncthreads()
    C[x, y] = tmp

A = numpy.random.randint(2, size=(TPB * 625, 50000))

B = A.transpose()

C_shared_mem = cuda.device_array((A.shape[0], B.shape[1]))

threads_per_block = (TPB, TPB)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(B.shape[1] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)

start_gpu_shared_memory = time.time()
matmul_shared_mem[blocks_per_grid, threads_per_block](A, B, C_shared_mem)
cuda.synchronize()
end_gpu_shared_memory = time.time()

time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))

Update 1:

Based on your suggestions I made some changes, but I am still running out of memory.

import numpy as np
import numba as nb
colm = int(200000/8)
rows = 100000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
A = np.empty((rows,colm), dtype=np.uint8)

@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def compute(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = AU[i,offset] << 7
            res |= AU[i,offset+1] << 6
            res |= AU[i,offset+2] << 5
            res |= AU[i,offset+3] << 4
            res |= AU[i,offset+4] << 3
            res |= AU[i,offset+5] << 2
            res |= AU[i,offset+6] << 1
            res |= AU[i,offset+7]
            A[i,j] = res

compute(A, AU)

from numba import cuda, uint8, int32
import numba
import numpy as np
import math
import time

TPB = 8
TPB1 = 9

@cuda.jit()
def bit_A_AT(A, C):
    sA = cuda.shared.array((TPB, TPB), dtype=uint8)
    sB = cuda.shared.array((TPB, TPB1), dtype=uint8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    if bx >= by:
        tmp = int32(0)
        for i in range((A.shape[1]+TPB-1) // TPB):
            if y < A.shape[0] and (i*TPB+tx) < A.shape[1]:
                sA[ty, tx] = A[y, i*TPB+tx]
            else:
                sA[ty, tx] = 0
            if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
                sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
            else:
                sB[ty, tx] = 0
            cuda.syncthreads()
            for j in range(TPB):
                tmp1 = sA[ty,j] & sB[tx, j]
                test = uint8(1)
                for k in range(8):
                    if (tmp1 & test) > 0:
                        tmp += 1
                    test <<= 1
            cuda.syncthreads()
        if y < C.shape[0] and x < C.shape[1]:
            C[y, x] = tmp

C = np.empty((A.shape[0], A.shape[0]), dtype=np.int32)
threads_per_block = (TPB, TPB)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(A.shape[0] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)

start_gpu_shared_memory = time.time()
bit_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
end_gpu_shared_memory = time.time()

time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))

Any idea how to solve this problem?

【Question comments】:

1. You are using unnecessarily large types. Some of your types are 64-bit, and you are mixing types, which is bad. Use a consistent 32-bit dtype throughout; that alone will cut your memory usage in half. int32 / float32 should be fine. 2. To cut memory usage in half again, use the method here. Since the matrix and its transpose can be extracted from the same data, there is no need to transpose on the host and copy both to the device. 3. You have other problems as well; I suggest looking at this.

I changed the data type to uint16 to save space, but that still does not solve my memory problem. I am new to this, so I am not sure, but I do not send my two matrices A and B to GPU memory anywhere in my code, so they should still be in host memory. Is that right? Also, I do not really know C.

When you do this: matmul_shared_mem[blocks_per_grid, threads_per_block](A, B, C_shared_mem) you are sending A and B to the device; numba does that automatically. For a 10k x 100k matrix the result is 10k x 10k, and if the matrix elements are only 0 and 1, the theoretical maximum value is 100,000, which will overflow a 16-bit integer. It might help if you stated which GPU you are running on and how much memory that GPU has. As for the kernel code, the translation from C to Python is more or less mechanical; you do not need to know much C, only what the numba equivalent is.
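A minimal sketch of the explicit-transfer version the comment hints at, using Numba's cuda.to_device / copy_to_host API; the shape and int32 dtype here are illustrative assumptions, not the questioner's actual values.

import numpy as np
from numba import cuda

# Sketch: explicit transfers instead of letting the kernel launch copy A and B
# implicitly. Only A is copied; the kernel can index it "transposed" itself.
A = np.random.randint(2, size=(1024, 4096)).astype(np.int32)   # illustrative shape
d_A = cuda.to_device(A)                                        # one device copy of A
d_C = cuda.device_array((A.shape[0], A.shape[0]), dtype=np.int32)

# ... launch a kernel here that reads d_A for both operands (see the answer below) ...

C = d_C.copy_to_host()                                         # copy only the result back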

【Answer 1】:

The following approach should reduce the amount of device memory needed to compute A x AT. We will use the following ideas:

- Since the input array (A) only takes the values 0 and 1, we will reduce the storage for that array to the smallest convenient size, int8, i.e. one byte per element.
- Since the B array is just the transpose of the A array, there is no need to handle it explicitly. We can derive it from the A array, somewhat similar to here, although that example performs AT x A.
- The matrix multiplication A x AT involves taking dot products of the rows of matrix A (C[i,j] is the dot product of row i and row j of A), as indicated here.
- We will supply the transposed version of A in the sB array by using adjusted indexing.
- A range of other changes were made to your code to fix various errors and improve load/store efficiency, for example reversing your use of x,y indexing throughout.
- I also fixed your use of syncthreads and modified the code to allow arbitrary values for the row and column dimensions.

Here is a worked example:

$ cat t62.py
from numba import cuda, int32, int8
import numba
import numpy as np
import math
import time

TPB = 32
TPB1 = TPB+1
@cuda.jit()
def byte_A_AT(A, C):
    sA = cuda.shared.array((TPB, TPB),  dtype=int8)
    sB = cuda.shared.array((TPB, TPB1), dtype=int8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
# uncomment and indent remainder of kernel to only do the "symmetric half" of calculation
#    if bx >= by:
    tmp = int32(0)
    for i in range((A.shape[1]+TPB-1)// TPB):
        if y < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sA[ty, tx] = A[y, i*TPB+tx]
        else:
            sA[ty, tx] = 0
        if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
        else:
            sB[ty, tx] = 0
        cuda.syncthreads()
        for j in range(TPB):
            tmp += int32(sA[ty,j]) * int32(sB[tx, j])
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp
rows = 1041
cols = 1043
print('host mem: ', (rows*cols*2+rows*rows*4*2)//1048576, 'MB device mem: ', (rows*cols+rows*rows*4)//1048576, 'MB')
A = np.random.randint(2,size=(rows, cols),dtype=np.int8)
AT = A.transpose()
CU = np.matmul(A,AT, dtype = np.int32)
C = np.empty((A.shape[0], A.shape[0]), dtype=np.int32)
threads_per_block = (TPB, TPB)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(A.shape[0] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
byte_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
start_gpu_shared_memory = time.time()
byte_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
end_gpu_shared_memory = time.time()

time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
test = np.array_equal(C, CU)
print(test)
if test == False:
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            if C[i,j] != CU[i,j]:
                print(i, ' ' , j ,' ' , C[i,j] , ' ' , CU[i,j])
$ python t62.py
host mem:  10 MB device mem:  5 MB
GPU time(shared memory):0.019593000411987305
True
$

Notes:

- Most of the runtime of the code above will be spent in Python (in the np.matmul() operation, which is really only used to verify the result and should not be needed in an actual implementation), not in the GPU portion. As the matrices are made larger, the code will run more slowly.
- As noted in the comments, the result of A x AT is a symmetric matrix. My code does not take advantage of this, but we could roughly take advantage of it by uncommenting the if test at the start of the kernel and then indenting the remainder of the kernel. However, that will of course cause the host-side np.array_equal test to fail (a sketch for reconstructing the full matrix from the computed half follows this list).
- The device memory consumption is computed in the code. For your largest values from the comments (rows = 30k, cols = 200k) this comes to about 10GB, so it still will not run on your 8GB GPU.
- I created a version of this code that packs 8 elements of the A matrix into each byte, which reduces memory requirements further, but writing that code to handle arbitrary column dimensions (as opposed to multiples of 8) turns out to be rather messy. However, for the 30k rows by 200k columns case, that code gets total device memory consumption down to about 5GB.
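If the symmetric-half variant is used (the if bx >= by: test uncommented), only entries with column index >= row index are guaranteed to be written. A small host-side sketch, added here for illustration and not part of the original answer, that rebuilds the full matrix before the np.array_equal check:

import numpy as np

def mirror_upper(C):
    # Blocks with bx < by are skipped, so only the upper triangle (column >= row)
    # is guaranteed valid; mirror it across the diagonal to get the full result.
    return np.triu(C) + np.triu(C, 1).T

# C_full = mirror_upper(C)
# print(np.array_equal(C_full, CU))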

Here is the one-bit-per-element version; it requires that the number of columns be divisible by 8 (see the padding note after this listing):

$ cat t61.py
from numba import cuda, uint8, int32
import numba
import numpy as np
import math
import time

TPB = 32
TPB1 = 33

@cuda.jit()
def bit_A_AT(A, C):
    sA = cuda.shared.array((TPB, TPB), dtype=uint8)
    sB = cuda.shared.array((TPB, TPB1), dtype=uint8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    tmp = int32(0)
    for i in range((A.shape[1]+TPB-1) // TPB):
        if y < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sA[ty, tx] = A[y, i*TPB+tx]
        else:
            sA[ty, tx] = 0
        if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
        else:
            sB[ty, tx] = 0
        cuda.syncthreads()
        for j in range(TPB):
            tmp1 = sA[ty,j] & sB[tx, j]
            test = uint8(1)
            for k in range(8):
                if (tmp1 & test) > 0:
                    tmp += 1
                test <<= 1
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp
colm = 131
rows = 1041
cols = int(colm*8)
print('host mem: ', (rows*cols*2+rows*rows*4*2)//1048576, 'MB device mem: ', (((rows*cols)//8)+rows*rows*4)//1048576, 'MB')
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
AUT = AU.transpose()
CU = np.matmul(AU,AUT,dtype=np.int32)
A = np.empty((rows,colm), dtype=np.uint8)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        A[i,j] = 0
        for k in range(8):
            if AU[i,(j*8)+k] == 1:
                A[i,j] = A[i,j] | (1<<(7-k))
C = np.empty((A.shape[0], A.shape[0]), dtype=np.int32)
threads_per_block = (TPB, TPB)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(A.shape[0] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
bit_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()

start_gpu_shared_memory = time.time()
bit_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
end_gpu_shared_memory = time.time()

time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
test = np.array_equal(C, CU)
print(test)
if test == False:
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            if C[i,j] != CU[i,j]:
                print(i, ' ' , j ,' ' , C[i,j] , ' ' , CU[i,j])
                break
$ python t61.py
host mem:  10 MB device mem:  4 MB
GPU time(shared memory):0.009343624114990234
True
$
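If the column count is not already a multiple of 8, one simple workaround (my own suggestion, not part of the answer above) is to zero-pad AU on the right before bit-packing; the appended zero columns contribute nothing to any row dot product, so A x AT is unchanged.

import numpy as np

def pad_cols_to_multiple_of_8(AU):
    # Append zero columns so the width is divisible by 8. Zero columns add
    # nothing to any row-row dot product, so A @ A.T is unaffected.
    pad = (-AU.shape[1]) % 8
    if pad == 0:
        return AU
    return np.concatenate([AU, np.zeros((AU.shape[0], pad), dtype=AU.dtype)], axis=1)

# AU = pad_cols_to_multiple_of_8(AU)   # then bit-pack as in the listing above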

EDIT: Responding to some questions in the comments, here is an update that now accounts for the fact that the A matrix may have significantly more than 30k rows, which makes the C matrix grow as well. If the A matrix fits in GPU memory, we can reduce the memory requirement of the C matrix by computing it in pieces. Each piece is a set of rows computed together, which I will call a row_slice of the C matrix. The following code demonstrates that this can be achieved with relatively minor changes to the code above:

$ cat t63.py
from numba import cuda, uint8, int32
import numba as nb
import numpy as np
import math
import time

TPB = 32
TPB1 = 33

@cuda.jit()
def bit_slice_A_AT(A, C, row_offset):
    sA = cuda.shared.array((TPB, TPB), dtype=uint8)
    sB = cuda.shared.array((TPB, TPB1), dtype=uint8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    tmp = int32(0)
    for i in range((A.shape[1]+TPB-1) // TPB):
        if y+row_offset < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sA[ty, tx] = A[y+row_offset, i*TPB+tx]
        else:
            sA[ty, tx] = 0
        if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
        else:
            sB[ty, tx] = 0
        cuda.syncthreads()
        for j in range(TPB):
            tmp1 = sA[ty,j] & sB[tx, j]
            test = uint8(1)
            for k in range(8):
                if (tmp1 & test) > 0:
                    tmp += 1
                test <<= 1
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def bitpack(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = AU[i,offset] << 7
            res |= AU[i,offset+1] << 6
            res |= AU[i,offset+2] << 5
            res |= AU[i,offset+3] << 4
            res |= AU[i,offset+4] << 3
            res |= AU[i,offset+5] << 2
            res |= AU[i,offset+6] << 1
            res |= AU[i,offset+7]
            A[i,j] = res

colm = 131
rows = 1535
cols = int(colm*8)
row_slice = 512
print('host mem: ', (rows*cols*2+rows*rows*4*2)//1048576, 'MB device mem: ', (((rows*cols)//8)+row_slice*rows*4)//1048576, 'MB')
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
CU = np.matmul(AU,AU.T,dtype=np.int32)
A = np.empty((rows,colm), dtype=np.uint8)
bitpack(A, AU)
A_dev = cuda.to_device(A)
threads_per_block = (TPB, TPB)
C = np.empty((row_slice, A.shape[0]), dtype=np.int32)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(row_slice / threads_per_block[1])
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
for i in range((A.shape[0]+row_slice-1)//row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C, i*row_slice)
    lower = i*row_slice
    upper = min(lower+row_slice, CU.shape[0])
    width = upper-lower
    test = np.array_equal(C[:width,:], CU[i*row_slice:i*row_slice+width,:])
    print(test)
cuda.synchronize()
C_dev = cuda.device_array_like(C)
start_gpu_shared_memory = time.time()
for i in range((A.shape[0]+row_slice-1)//row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C_dev, i*row_slice)
cuda.synchronize()
end_gpu_shared_memory = time.time()

time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
$ python t63.py
host mem:  21 MB device mem:  3 MB
True
True
True
GPU time(shared memory):0.010116815567016602
$

This means that, as suggested, for the rows = 100k and cols = 200k case given in the latest update to the question, we should be able to break the C matrix into chunks of 5k rows. The memory used by the A matrix would be 2.5GB, but for the C matrix, since we only compute a 5k-row slice at a time, the device memory storage required is 100k*5k*4 bytes, i.e. 2GB for this example.
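Note that the timing loop in the listing reuses C_dev for every slice, so in a real run each slice would be copied back (or written out) before the next launch overwrites it. A sketch of that outer loop, reusing the names from the listing above (the results list is my own addition):

# Sketch: compute C one row_slice at a time and keep only host copies,
# so device memory never holds more than one slice of C.
results = []
for i in range((A.shape[0] + row_slice - 1) // row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C_dev, i * row_slice)
    n_rows = min(row_slice, A.shape[0] - i * row_slice)   # last slice may be short
    results.append(C_dev.copy_to_host()[:n_rows, :])      # or stream to disk here
full_C = np.concatenate(results, axis=0)                   # only if it fits in host RAM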

On further study, we can speed up the host-code matmul operation by switching from the int8 data type to float32. That makes the operation quite a bit faster, but the GPU code still appears to be about 4x faster than that:

$ cat t64.py
from numba import cuda, uint8, int32
import numba as nb
import numpy as np
import math
import time

TPB = 32
TPB1 = 33

@cuda.jit()
def bit_slice_A_AT(A, C, row_offset):
    sA = cuda.shared.array((TPB, TPB), dtype=uint8)
    sB = cuda.shared.array((TPB, TPB1), dtype=uint8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    tmp = int32(0)
    for i in range((A.shape[1]+TPB-1) // TPB):
        if y+row_offset < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sA[ty, tx] = A[y+row_offset, i*TPB+tx]
        else:
            sA[ty, tx] = 0
        if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
        else:
            sB[ty, tx] = 0
        cuda.syncthreads()
        for j in range(TPB):
            tmp1 = sA[ty,j] & sB[tx, j]
            test = uint8(1)
            for k in range(8):
                if (tmp1 & test) > 0:
                    tmp += 1
                test <<= 1
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

@nb.njit('void(uint8[:,:],float32[:,:])', parallel=True)
def bitpack(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = int(AU[i,offset]) << 7
            res |= int(AU[i,offset+1]) << 6
            res |= int(AU[i,offset+2]) << 5
            res |= int(AU[i,offset+3]) << 4
            res |= int(AU[i,offset+4]) << 3
            res |= int(AU[i,offset+5]) << 2
            res |= int(AU[i,offset+6]) << 1
            res |= int(AU[i,offset+7])
            A[i,j] = res

colm = 1000
rows = 6000
cols = int(colm*8)
row_slice = 512
print('host mem: ', (rows*cols*4+rows*colm+rows*rows*4+rows*row_slice*4)//1048576, 'MB device mem: ', (((rows*cols)//8)+row_slice*rows*4)//1048576, 'MB')
t1 = time.time()
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
AU = AU.astype(np.float32)
print("randint:" + str(time.time()-t1))
t1 = time.time()
#CU = np.empty((rows, rows), dtype=np.int32)
CU = np.matmul(AU,AU.T,dtype=np.float32)
print("matmul:" + str(time.time()-t1))
t1 = time.time()
A = np.empty((rows,colm), dtype=np.uint8)
print("np.empty:" + str(time.time()-t1))
t1 = time.time()
bitpack(A, AU)
print("bitpack:" + str(time.time()-t1))
t1 = time.time()
A_dev = cuda.to_device(A)
threads_per_block = (TPB, TPB)
C = np.empty((row_slice, A.shape[0]), dtype=np.int32)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(row_slice / threads_per_block[1])
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
for i in range((A.shape[0]+row_slice-1)//row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C, i*row_slice)
    lower = i*row_slice
    upper = min(lower+row_slice, CU.shape[0])
    width = upper-lower
    test = np.array_equal(C[:width,:], CU[i*row_slice:i*row_slice+width,:])
    print(test)
cuda.synchronize()
C_dev = cuda.device_array_like(C)
start_gpu_shared_memory = time.time()
for i in range((A.shape[0]+row_slice-1)//row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C_dev, i*row_slice)
cuda.synchronize()
end_gpu_shared_memory = time.time()

time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
$ python t64.py
host mem:  337 MB device mem:  17 MB
randint:0.1817936897277832
matmul:3.498671531677246
np.empty:7.62939453125e-05
bitpack:0.03707313537597656
True
True
True
True
True
True
True
True
True
True
True
True
GPU time(shared memory):0.8318064212799072
$

I have not tested this code thoroughly. There may be bugs. Use at your own risk.

For attribution, the numba bit-packing code appears to come from here.

【Answer comments】:

Good ideas. I really appreciate the reply. The second solution is genius; although it takes a long time on the CPU, since it has to walk the matrix and assign the bits, it is a great idea. The first solution works very well indeed. Since most of the values in my matrix are 0, I am currently looking into doing a sparse matrix multiplication instead, as that would save a lot of space.

If I change my instance to one with 4 GPUs of 8GB each, will Numba by default use all of those GPUs for GPU memory, i.e. 32GB of total GPU memory?

No, it will not. That requires explicit multi-GPU programming.

Please, could you take a look at the update? Thanks.

A should use about 2.5GB. C should use 100,000 * 100,000 * 4 bytes = 40GB. So that is your problem. Your original problem statement had a much smaller dimension for the rows of A (A.shape[0]). Going from 10k or 30k rows to 100k rows blows up the output array. Please keep the line that computes and prints the required memory in your code, so that you do not have to ask me through the comments to do the calculation. I cannot fix that for you. If you change the rows back to 10k or 30k, I do not think your code should run into an out-of-memory error.
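For the sparse direction mentioned in the first comment, here is a minimal host-side sketch using scipy.sparse (an editorial example, not part of the answer); it assumes the 0/1 matrix is sparse enough that the CSR representation and the product fit in host memory, and uses small demo sizes in place of the real 10k x 100k case.

import numpy as np
from scipy import sparse

# Sketch: sparse A @ A.T on the host. Only worthwhile if A is mostly zeros;
# note that A @ A.T can still come out fairly dense even when A is sparse.
rows, cols, density = 2_000, 20_000, 0.01          # assumed demo sizes / density
A = sparse.random(rows, cols, density=density, format="csr")
A.data[:] = 1                                      # make it a 0/1 matrix
A = A.astype(np.int32)

C = A @ A.T                                        # sparse 2k x 2k result
print(C.shape, C.nnz, "nonzeros")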
