CUDA GPU processing: TypeError: compile_kernel() got an unexpected keyword argument 'boundscheck'
Today I started looking into CUDA and GPU processing. I found this tutorial: https://www.geeksforgeeks.org/running-python-script-on-gpu/
Unfortunately, my first attempt to run the GPU code failed:
from numba import jit, cuda
import numpy as np
# to measure exec time
from timeit import default_timer as timer

# normal function to run on cpu
def func(a):
    for i in range(10000000):
        a[i] += 1

# function optimized to run on gpu
@jit(target="cuda")
def func2(a):
    for i in range(10000000):
        a[i] += 1

if __name__ == "__main__":
    n = 10000000
    a = np.ones(n, dtype=np.float64)
    b = np.ones(n, dtype=np.float32)

    start = timer()
    func(a)
    print("without GPU:", timer() - start)

    start = timer()
    func2(a)
    print("with GPU:", timer() - start)
Output:
/home/amu/anaconda3/bin/python /home/amu/PycharmProjects/gpu_processing_base/gpu_base_1.py
without GPU: 4.89985659904778
Traceback (most recent call last):
  File "/home/amu/PycharmProjects/gpu_processing_base/gpu_base_1.py", line 30, in <module>
    func2(a)
  File "/home/amu/anaconda3/lib/python3.7/site-packages/numba/cuda/dispatcher.py", line 40, in __call__
    return self.compiled(*args, **kws)
  File "/home/amu/anaconda3/lib/python3.7/site-packages/numba/cuda/compiler.py", line 758, in __call__
    kernel = self.specialize(*args)
  File "/home/amu/anaconda3/lib/python3.7/site-packages/numba/cuda/compiler.py", line 769, in specialize
    kernel = self.compile(argtypes)
  File "/home/amu/anaconda3/lib/python3.7/site-packages/numba/cuda/compiler.py", line 785, in compile
    **self.targetoptions)
  File "/home/amu/anaconda3/lib/python3.7/site-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
TypeError: compile_kernel() got an unexpected keyword argument 'boundscheck'

Process finished with exit code 1
I have installed numba and cudatoolkit, as mentioned in the tutorial, in an Anaconda environment in PyCharm.
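Answer
Adding an answer to get this off the unanswered queue.
The code in that example is broken. There is nothing wrong with your numba or CUDA installation. The code in your question (or in the blog you copied it from) cannot possibly produce the result the blog post claims.
There are many ways to modify it so that it works. One of them is this: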
from numba import vectorize, jit, cuda
import numpy as np
# to measure exec time
from timeit import default_timer as timer

# normal function to run on cpu
def func(a):
    for i in range(10000000):
        a[i] += 1

# function optimized to run on gpu
@vectorize(['float64(float64)'], target="cuda")
def func2(x):
    return x + 1

if __name__ == "__main__":
    n = 10000000
    a = np.ones(n, dtype=np.float64)

    start = timer()
    func(a)
    print("without GPU:", timer() - start)

    start = timer()
    func2(a)
    print("with GPU:", timer() - start)
Here func2 becomes a ufunc that is compiled for the device. It is then run over the whole input array on the GPU. Running the script gives:
$ python bogoexample.py
without GPU: 4.314514834433794
with GPU: 0.21419800259172916
So it is faster, but keep in mind that the GPU time includes the time taken to compile the GPU ufunc.
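One practical difference from the loop version is worth spelling out: a ufunc applies the scalar function to every element and returns a new array, whereas func modifies its argument in place. The sketch below illustrates that calling convention on the CPU only. It uses np.vectorize as a stand-in for the @vectorize(..., target="cuda") decorator so that it runs without a GPU; the names add_one_scalar and add_one are made up for this illustration.

```python
import numpy as np


def add_one_scalar(x):
    # Same scalar body as func2 in the answer: one element in, one element out.
    return x + 1


# CPU stand-in (assumption): np.vectorize mimics the elementwise calling
# convention of the GPU ufunc, but it does not compile or run on a device.
add_one = np.vectorize(add_one_scalar, otypes=[np.float64])

a = np.ones(8, dtype=np.float64)
result = add_one(a)

print("input after call:", a)   # the input array is left untouched
print("returned array:  ", result)
```

If the incremented values are needed, the return value has to be kept (for example a = func2(a)); calling func2(a) and discarding the result, as the benchmark does, only measures the time.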
Another alternative is to actually write a GPU kernel, like this:
from numba import vectorize, jit, cuda
import numpy as np
# to measure exec time
from timeit import default_timer as timer

# normal function to run on cpu
def func(a):
    for i in range(10000000):
        a[i] += 1

# function optimized to run on gpu
@vectorize(['float64(float64)'], target="cuda")
def func2(x):
    return x + 1

# kernel to run on gpu
@cuda.jit
def func3(a, N):
    tid = cuda.grid(1)
    if tid < N:
        a[tid] += 1

if __name__ == "__main__":
    n = 10000000
    a = np.ones(n, dtype=np.float64)

    for i in range(0, 5):
        start = timer()
        func(a)
        print(i, " without GPU:", timer() - start)

    for i in range(0, 5):
        start = timer()
        func2(a)
        print(i, " with GPU ufunc:", timer() - start)

    threadsperblock = 1024
    blockspergrid = (a.size + (threadsperblock - 1)) // threadsperblock
    for i in range(0, 5):
        start = timer()
        func3[blockspergrid, threadsperblock](a, n)
        print(i, " with GPU kernel:", timer() - start)
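The launch configuration in the kernel version deserves a short illustration. blockspergrid is a ceiling division, so the grid usually contains more threads than array elements, and the if tid < N check in func3 keeps the surplus threads from writing out of bounds. The following is a minimal CPU-only sketch of that arithmetic; emulate_launch and blocks_per_grid are hypothetical helpers written for this example, not numba functions, and the loop runs the "threads" serially rather than in parallel.

```python
def blocks_per_grid(n, threads_per_block):
    # Ceiling division: the smallest block count whose threads cover n elements.
    return (n + threads_per_block - 1) // threads_per_block


def emulate_launch(a, n, blocks, threads_per_block):
    """Run the body of func3 once per simulated thread, serially on the CPU."""
    skipped = 0
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Same value cuda.grid(1) yields for a 1D launch:
            # blockIdx.x * blockDim.x + threadIdx.x
            tid = block_idx * threads_per_block + thread_idx
            if tid < n:
                a[tid] += 1
            else:
                skipped += 1  # surplus thread: the guard makes it do nothing
    return skipped


n = 10
data = [1.0] * n
blocks = blocks_per_grid(n, 4)            # 3 blocks of 4 threads = 12 threads
skipped = emulate_launch(data, n, blocks, 4)

print("blocks:", blocks, "surplus threads:", skipped)
print("blocks for the answer's launch:", blocks_per_grid(10_000_000, 1024))
```

For the sizes used in the answer (10,000,000 elements, 1024 threads per block) this gives 9766 blocks, that is 10,000,384 threads, so 384 threads fall through the guard.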
The script runs like this:
$ python bogoexample.py
0 without GPU: 4.885275377891958
1 without GPU: 4.748716968111694
2 without GPU: 4.902181145735085
3 without GPU: 4.889955999329686
4 without GPU: 4.881594380363822
0 with GPU ufunc: 0.16726416163146496
1 with GPU ufunc: 0.03758022002875805
2 with GPU ufunc: 0.03580896370112896
3 with GPU ufunc: 0.03530424740165472
4 with GPU ufunc: 0.03579768259078264
0 with GPU kernel: 0.1421878095716238
1 with GPU kernel: 0.04386183246970177
2 with GPU kernel: 0.029975440353155136
3 with GPU kernel: 0.029602501541376114
4 with GPU kernel: 0.029780613258481026
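Here you can see that the kernel runs slightly faster than the ufunc, and that caching (this is caching of the JIT-compiled functions, not memoization of the calls) significantly speeds up the GPU calls after the first one.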
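To make the distinction between the first call and the later calls explicit in a benchmark, the first timing can be reported separately from the rest. The sketch below is one way to do that, under stated assumptions: LazyCompiled is a hypothetical stand-in that simulates a one-off compile cost with time.sleep so the example runs without numba or a GPU, and time_calls is a helper name invented here. It also shows that the work is executed on every call, which is what "not memoization" means.

```python
import time
from timeit import default_timer as timer


def time_calls(fn, *args, repeats=5):
    """Call fn(*args) `repeats` times and return the wall time of each call."""
    times = []
    for _ in range(repeats):
        start = timer()
        fn(*args)
        times.append(timer() - start)
    return times


class LazyCompiled:
    """Hypothetical stand-in for a JIT-compiled function (not a numba API)."""

    def __init__(self, fn, compile_seconds=0.05):
        self.fn = fn
        self.compile_seconds = compile_seconds
        self.compiled = False
        self.calls = 0

    def __call__(self, *args):
        if not self.compiled:
            time.sleep(self.compile_seconds)  # simulated one-off compile cost
            self.compiled = True
        self.calls += 1                       # the body still runs every time
        return self.fn(*args)


def add_one_inplace(values):
    for i in range(len(values)):
        values[i] += 1


data = [1.0] * 1000
fake_jit = LazyCompiled(add_one_inplace)
times = time_calls(fake_jit, data, repeats=5)

first_call, steady_state = times[0], min(times[1:])
print("first call (includes simulated compile):", first_call)
print("best later call:", steady_state)
print("calls actually executed:", fake_jit.calls)
```

With the real functions, time_calls(func2, a) or a small wrapper around func3[blockspergrid, threadsperblock](a, n) could be passed in the same way; the first entry then corresponds to iteration 0 in the output above.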