TensorFlow runtime error with cuBLAS

【Title】: tensorflow running error with cublas 【Posted】: 2016-11-13 05:24:35 【Question】:

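Right after I successfully installed TensorFlow on a cluster, I ran the MNIST demo to check that it works, but I ran into a problem. I don't know what is going on, but the error appears to come from CUDA: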

python3 -m tensorflow.models.image.mnist.convolutional
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K20m
major: 3 minor: 5 memoryClockRate (GHz) 0.7055
pciBusID 0000:03:00.0
Total memory: 5.00GiB
Free memory: 4.92GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K20m, pci bus id: 0000:03:00.0)
Initialized!
E tensorflow/stream_executor/cuda/cuda_blas.cc:461] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 715, in _do_call
return fn(*args)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 697, in _run_fn
status, run_metadata)
  File "/home/gpuusr/local/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/errors.py", line 450, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors.InternalError: Blas SGEMM launch failed : a.shape=(64, 3136), b.shape=(3136, 512), m=64, n=512, k=3136
 [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, Variable_4/read)]]
 [[Node: add_5/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_299_add_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gpuusr/local/lib/python3.5/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
  File "/home/gpuusr/local/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 316, in <module>
tf.app.run()
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 294, in main
feed_dict=feed_dict)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 372, in run
run_metadata_ptr)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 636, in _run
feed_dict_string, options, run_metadata)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 708, in _do_run
target_list, options, run_metadata)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 728, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InternalError: Blas SGEMM launch failed : a.shape=(64, 3136), b.shape=(3136, 512), m=64, n=512, k=3136
 [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, Variable_4/read)]]
 [[Node: add_5/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_299_add_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op 'MatMul', defined at:
  File "/home/gpuusr/local/lib/python3.5/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
  File "/home/gpuusr/local/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 316, in <module>
tf.app.run()
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 221, in main
logits = model(train_data_node, True)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 213, in model
hidden = tf.nn.relu(tf.matmul(reshape, fc1_weights) + fc1_biases)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 1209, in matmul
name=name)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1178, in _mat_mul
transpose_b=transpose_b, name=name)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op
op_def=op_def)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
original_op=self._default_original_op, op_def=op_def)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
self._traceback = _extract_stack()

Segmentation fault (core dumped)

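【Comments】:

To build or run TensorFlow with GPU support, NVIDIA's CUDA Toolkit (>= 7.0) and cuDNN (>= v2) need to be installed. TensorFlow GPU support requires a GPU card with NVIDIA Compute Capability >= 3.0. Did you follow the official setup? tensorflow.org/versions/r0.9/get_started/os_setup.html

Absolutely yes. My CUDA version is 7.5 and my cuDNN version is v4.

OK, and is your card's compute capability >= 3.0?

My card is an Nvidia Tesla K20m. I just checked the Nvidia website and its CUDA capability is 3.5 (is that the compute capability?).

@clemej Did you find a solution? I am hitting this right now.

【Answer 1】: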

Finding the fix was a nightmare, but the fix itself is fairly simple:

https://www.tensorflow.org/guide/using_gpu

# add to the top of your code under import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config....)

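【Comments】:

Worked for me on Keras/TF.

Worked with a pip install of tensorflow-gpu, but I did not need to set gpu_options; just passing a freshly initialized ConfigProto to the session was enough.

For TensorFlow 2, use tf.compat.v1.ConfigProto and tf.compat.v1.Session instead of the ones mentioned in the answer.

What does the ellipsis in your answer mean? As far as I know, tf.Session(config=config....) is not valid Python.

Can anyone solve this: ***.com/questions/60766376/…

【Answer 2】: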

I hit this problem again with a recent stack (TensorFlow 2.5, CUDA 11.1, Nvidia 3080). The fix above, adapted for TensorFlow 2, worked like a charm:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all up front
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)

【Comments】:

【Answer 3】:

I had exactly the same error because CUDA 5.5 came before CUDA 7.5 in my LD_LIBRARY_PATH. After I moved 7.5 ahead of 5.5, everything worked.
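
The dynamic loader searches LD_LIBRARY_PATH from left to right, so the first CUDA directory listed is the one whose libraries get loaded. A minimal sketch for listing the CUDA entries in search order follows. The helper name `cuda_dirs_in_order` is hypothetical, and the pattern assumes install directories are named like `cuda-7.5`, which may not hold on every system.

```python
import os
import re

# Assumption: CUDA install paths contain a component such as "cuda-7.5" or "cuda7.5".
CUDA_DIR = re.compile(r"cuda-?(\d+(?:\.\d+)*)")


def cuda_dirs_in_order(ld_library_path):
    """Return (version, directory) pairs in the order the loader searches them."""
    found = []
    # LD_LIBRARY_PATH is a colon-separated list of directories.
    for entry in ld_library_path.split(":"):
        match = CUDA_DIR.search(entry)
        if match:
            found.append((match.group(1), entry))
    return found


if __name__ == "__main__":
    for version, entry in cuda_dirs_in_order(os.environ.get("LD_LIBRARY_PATH", "")):
        print(version, entry)
```

If an older version is printed before the one TensorFlow was built for, reordering the variable as described in this answer is worth trying.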

【Comments】:

【Answer 4】:

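In addition to the solutions above, this error is also raised when the cuBLAS version is incompatible with the CUDA version. In my case, libcublas10 version 10.2.2.89-1 was incompatible with CUDA 10.1, so I had to downgrade: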

sudo apt-get install libcublas10=10.2.1.243-1 libcublas-dev=10.2.1.243-1 cuda-libraries-10-1 cuda-libraries-dev-10-1

【Comments】:

Similarly, I got the same error because my cuDNN version did not match my CUDA version. I had installed libcudnn8=8.1.1.33-1, which does not match cuda-11-0.

【Answer 5】:

The following two lines worked for me. I copied them from GitHub, but I don't know what they mean.

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

Another, simpler way to do the same thing is:

import os

os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

My environment: TF version 2.6.0, CUDA version 11.2, GPU driver version 460.32.03. I don't know the cuDNN version because I could not find it.
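
As for what those lines mean: by default TensorFlow reserves nearly all GPU memory when it first initializes the device, and memory growth tells it to allocate only as much as it needs, as it needs it. The sketch below applies the same setting to every visible GPU rather than only the first. The function name `enable_memory_growth` is hypothetical, and the import is guarded only so the snippet still loads where TensorFlow is not installed.

```python
def enable_memory_growth():
    """Enable on-demand GPU memory allocation; return the number of GPUs configured."""
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # TensorFlow is not installed in this environment

    configured = 0
    for gpu in tf.config.list_physical_devices("GPU"):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError:
            # Raised when the GPU was already initialized; the setting
            # has to be applied before any op touches the device.
            pass
    return configured


if __name__ == "__main__":
    print("GPUs with memory growth enabled:", enable_memory_growth())
```

Call it once at program start, before building any model or tensor.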

【Comments】:

This worked for me with Python 3.8.6 on Ubuntu 20.04.

【Answer 6】:

Make sure you call sess.close() between sessions to release resources; otherwise you will have to kill the process in Task Manager.
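
One way to make sure close() is never forgotten is the context-manager form of the session. This is a minimal sketch using the TF1-style API through `tf.compat.v1`; the function name `run_in_closed_session` and the tiny graph are illustrative only, and the import is guarded so the snippet loads where TensorFlow is not installed.

```python
def run_in_closed_session():
    """Run a tiny graph in a session that is closed automatically on exit."""
    try:
        import tensorflow as tf
    except ImportError:
        return None  # TensorFlow is not installed in this environment

    # Build the ops in an explicit graph so the TF1-style session can run them.
    with tf.Graph().as_default():
        total = tf.constant(2.0) + tf.constant(3.0)
        # Leaving the with block calls sess.close(), even if run() raises.
        with tf.compat.v1.Session() as sess:
            return float(sess.run(total))


if __name__ == "__main__":
    print(run_in_closed_session())
```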

【Comments】:

【Answer 7】:

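This is a compatibility problem between the CUDA version and the TensorFlow version. In my case the problem appeared with CUDA 10.0 and TensorFlow 2.1.0. After I switched from TensorFlow 2.1.0 to TensorFlow 2.0.0, the problem went away.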
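
To see which CUDA and cuDNN versions a given TensorFlow build expects, recent TensorFlow 2 releases expose `tf.sysconfig.get_build_info()` (available from about 2.3, as far as I know; the sketch falls back gracefully when it is missing). The function name `cuda_build_info` is hypothetical. The values describe what the build was compiled against, not what is installed on the machine, so they still have to be compared with the local CUDA toolkit.

```python
def cuda_build_info():
    """Return the CUDA/cuDNN versions this TensorFlow build was compiled against."""
    try:
        import tensorflow as tf
    except ImportError:
        return {}  # TensorFlow is not installed in this environment

    result = {"tensorflow": tf.__version__}
    # Older releases do not provide get_build_info, so look it up defensively.
    get_info = getattr(tf.sysconfig, "get_build_info", None)
    if get_info is not None:
        info = get_info()
        # CPU-only builds omit these keys, in which case the values are None.
        result["cuda"] = info.get("cuda_version")
        result["cudnn"] = info.get("cudnn_version")
    return result


if __name__ == "__main__":
    print(cuda_build_info())
```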

【Comments】:
