Free GPU memory in Google Colab
Posted: 2021-12-12 17:15:09
Question: I would like to know whether there is a way to free GPU memory in Google Colab.
I am training several CNNs in a loop on the eurosat/rgb dataset from tf.datasets. Neither the models nor the dataset are particularly large.
The error is as follows:
Epoch 1/8
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
<ipython-input-15-c4badfe8da7d> in <module>()
27 nclasses=NCLASSES,
28 metadic = METADIC,
---> 29 val_split = 0.20)
30 plot_results(record=current_exp,run='avg',batch=False,save=True)
31 plot_results(record=current_exp,run='avg',batch=True,save=True)
7 frames
<ipython-input-6-f1fac48c4ac9> in run_experiment(bloques, input_shape, init_conv_filters, batch_size, epochs, init_lr, end_lr, nruns, optimizer, sma_periods, nclasses, metadic, val_split)
75 epochs = epochs,
76 workers = 1,
---> 77 callbacks = [LRFinder]
78 )
79
/usr/local/lib/python3.7/dist-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1182 _r=1):
1183 callbacks.on_train_batch_begin(step)
-> 1184 tmp_logs = self.train_function(iterator)
1185 if data_handler.should_sync:
1186 context.async_wait()
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
883
884 with OptionalXlaContext(self._jit_compile):
--> 885 result = self._call(*args, **kwds)
886
887 new_tracing_count = self.experimental_get_tracing_count()
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
948 # Lifting succeeded, so variables are initialized and we can run the
949 # stateless function.
--> 950 return self._stateless_fn(*args, **kwds)
951 else:
952 _, _, _, filtered_flat_args = \
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py in __call__(self, *args, **kwargs)
3038 filtered_flat_args) = self._maybe_define_function(args, kwargs)
3039 return graph_function._call_flat(
-> 3040 filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
3041
3042 @property
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
1962 # No tape is watching; skip to running the function.
1963 return self._build_call_outputs(self._inference_function.call(
-> 1964 ctx, args, cancellation_manager=cancellation_manager))
1965 forward_backward = self._select_forward_and_backward_functions(
1966 args,
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py in call(self, ctx, args, cancellation_manager)
594 inputs=args,
595 attrs=attrs,
--> 596 ctx=ctx)
597 else:
598 outputs = execute.execute_with_cancellation(
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
58 ctx.ensure_initialized()
59 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60 inputs, attrs, num_outputs)
61 except core._NotOkStatusException as e:
62 if name is not None:
ResourceExhaustedError: failed to allocate memory
[[node dense1/kernel/Regularizer/Square (defined at <ipython-input-6-f1fac48c4ac9>:77) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_train_function_309982]
Function call stack:
train_function
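What I have tried so far:
After some research, I call the following function once each model has finished training
import gc
import tensorflow as tf

def reset_tensorflow_keras_backend():
    # To be investigated further, but this seems to be enough.
    # Reset the global state Keras keeps, so the next model starts fresh.
    tf.keras.backend.clear_session()
    # Clear the TF1-style default graph stack.
    tf.compat.v1.reset_default_graph()
    # Collect unreachable Python objects that may still hold tensors.
    gc.collect()
as a way of getting a fresh session for each model. I have been able to run a loop over several models without the error, but today it came back on one of the simplest models, which is strange.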
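A minimal sketch of how such a loop could be organized so that the cleanup has a chance to work. It assumes that resetting the backend may not free anything while a Python reference to the previous model is still alive. The names run_models, builders, train and cleanup are hypothetical and do not come from the original post.

```python
import gc


def run_models(builders, train, cleanup=None):
    """Build and train each model in turn, releasing it before the next one.

    builders: iterable of zero-argument callables returning a model (hypothetical).
    train:    callable taking a model and returning a small result (hypothetical).
    cleanup:  optional zero-argument callable, e.g. reset_tensorflow_keras_backend.
    """
    results = []
    for build in builders:
        model = build()
        try:
            # Keep only lightweight results (numbers, dicts), not the model itself.
            results.append(train(model))
        finally:
            # Drop the reference held by this loop, then run the cleanup hook.
            del model
            if cleanup is not None:
                cleanup()
            gc.collect()
    return results
```

In a notebook this might be called as run_models(builders, train, cleanup=reset_tensorflow_keras_backend), with train returning only the metrics needed for plotting rather than the model or its full training state.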
GPU usage at the moment of failure was:
!nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 48C P0 57W / 149W | 11077MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
This is clearly close to 100%.
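To read the same figure from code rather than from the table, one option is the CSV query mode of nvidia-smi. This is a minimal sketch, assuming nvidia-smi is on the PATH; parse_gpu_memory and gpu_memory_usage are hypothetical helper names. Keep in mind that TensorFlow by default reserves nearly all GPU memory for its own allocator, so a reading close to 100% does not by itself show that live tensors occupy all of it.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_line):
    """Parse one 'used, total' line (values in MiB) from nvidia-smi's CSV output."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total, used / total


def gpu_memory_usage():
    """Return (used_MiB, total_MiB, fraction) for the first GPU, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this machine
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out.splitlines()[0])


# Values taken from the nvidia-smi table in the question.
used, total, fraction = parse_gpu_memory("11077, 11441")
print(f"{used} MiB / {total} MiB = {fraction:.1%}")
```

Calling gpu_memory_usage() before and after each model in the loop would show whether the reading actually changes between runs.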
Answer 1: This may happen because you do not always get the same GPU each time you open a session in Colab. You can check which GPU has been assigned like this:
!nvidia-smi -L
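What I did was reset the session until Google blessed me with a Tesla T4.
I have searched in the past for ways to free the memory, but the only one I found was restarting the session.
I believe that by picking your GPU this way you will no longer run into the problem.
As you can see, Google assigned you a Tesla K80, which is the worst one.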
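The check can also be done from Python at the top of the notebook. This is a minimal sketch, assuming the usual "GPU 0: <name> (UUID: ...)" layout of nvidia-smi -L output; parse_gpu_names and assigned_gpus are hypothetical helper names, and the UUID in the sample line is a placeholder.

```python
import re
import shutil
import subprocess


def parse_gpu_names(listing):
    """Extract GPU model names from `nvidia-smi -L` style output."""
    return re.findall(r"^GPU \d+: (.+?) \(UUID:", listing, flags=re.MULTILINE)


def assigned_gpus():
    """Return the GPU names reported by `nvidia-smi -L`, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return parse_gpu_names(result.stdout) if result.returncode == 0 else []


# Illustrative line in the usual `nvidia-smi -L` layout; the UUID is a placeholder.
sample = "GPU 0: Tesla K80 (UUID: GPU-00000000-0000-0000-0000-000000000000)"
print(parse_gpu_names(sample))
```

This only saves reading the output by eye. If the name is not the one you want, reconnecting the session as described in this answer is still a manual step.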
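Comments:
Do you have to do this manually? That could take a long time, right?
You mean the resetting? No, in most cases you only need to close the session a few times. It takes 1-2 minutes, though that depends on how congested the servers are. If you want premium service you can buy the paid tier of Colab, but it is not available everywhere.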