Reset Cuda Context after exception

Posted 2023-03-23


【Title】: Reset Cuda Context after exception 【Posted】: 2019-10-13 05:23:51 【Question】:

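I have a working application that uses Cuda / C++, but it sometimes throws an exception because of memory leaks. I need to be able to reset the GPU at runtime. My application is a server, so it has to stay available.

I tried something like this, but it does not seem to work: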

try
{
    // do process using GPU
}
catch (std::exception &e)
{
    // catching exception from cuda only

    cudaSetDevice(0);
    CUDA_RETURN_(cudaDeviceReset());
}

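My idea is to reset the device every time I get an exception from the GPU, but I cannot get it to work. :( By the way, for various reasons I cannot fix every problem in my Cuda code, so I need a temporary solution. Thanks!

【Comments】:

Please clarify whether the memory leak is on the GPU, and whether the exception is raised by the CUDA runtime API. Also, why would a memory leak cause an exception? At most you should see failures when trying to allocate more memory.

【Answer 1】: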

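The only way to restore proper device functionality after a non-recoverable ("sticky") CUDA error is to terminate the host process that initiated it (i.e. the process that issued the CUDA runtime API calls that led to the error).

Therefore, for a single-process application, the only option is to terminate the application.

It should be possible to design a multi-process application in which the initial ("parent") process makes no use of CUDA and spawns a child process that uses the GPU. When the child process encounters an unrecoverable CUDA error, it must terminate.

The parent process can optionally monitor the child process. If it determines that the child process has terminated, it can respawn the process and restore CUDA functional behavior.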
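The parent/child arrangement described above can be sketched with plain POSIX calls. This is only a minimal illustration under stated assumptions: `run_gpu_worker` and `supervise` are hypothetical names, and the worker here is a stand-in that fails on its first attempt instead of running real CUDA code. The point it shows is that the parent only calls `fork()` and `waitpid()`, so it never holds a CUDA context that could be corrupted.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Hypothetical stand-in for the GPU work. A real worker would run the CUDA
// code here and return nonzero after an unrecoverable CUDA error.
// This stub fails on the first attempt only, to exercise the respawn path.
static int run_gpu_worker(int attempt) {
    return attempt == 0 ? 1 : 0;
}

// Forks a worker, waits for it, and respawns it until one exits cleanly or
// max_attempts workers have been tried.
// Returns the number of workers spawned on success, or -1 on failure.
static int supervise(int max_attempts) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return -1;
        }
        if (pid == 0) {
            // Child: the only process that would ever create a CUDA context.
            _exit(run_gpu_worker(attempt));
        }
        int status = 0;
        if (waitpid(pid, &status, 0) < 0) {
            perror("waitpid");
            return -1;
        }
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            return attempt + 1;  // worker finished without error
        }
        // Nonzero exit code or death by signal: treat as a failed worker.
        fprintf(stderr, "worker %d failed, respawning\n", (int)pid);
    }
    return -1;  // every attempt failed
}
```

With this stub, `supervise(3)` spawns two workers: the first exits with code 1, the second exits cleanly. A long-running server would probably loop without an attempt limit and add a back-off between respawns, which this sketch leaves out.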

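Sticky vs. non-sticky errors are covered elsewhere, such as here.

A proper example of a multi-process application that uses e.g. fork() to spawn a child process that uses CUDA is available in the CUDA sample code simpleIPC. Here is a rough example assembled from the simpleIPC sample (for linux):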

$ cat t477.cu
/*
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

// Includes
#include <stdio.h>
#include <assert.h>

// CUDA runtime includes
#include <cuda_runtime_api.h>

// CUDA utilities and system includes
#include <helper_cuda.h>

#define MAX_DEVICES          1
#define PROCESSES_PER_DEVICE 1
#define DATA_BUF_SIZE        4096

#ifdef __linux
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <linux/version.h>

typedef struct ipcDevices_st
{
    int count;
    int results[MAX_DEVICES];
} ipcDevices_t;


// CUDA Kernel
__global__ void simpleKernel(int *dst, int *src, int num)
{
    // Dummy kernel
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    dst[idx] = src[idx] / num;
}


// results[0] protocol: 0 = not yet run, 1 = fault was injected, 2 = success
void runTest(int index, ipcDevices_t* s_devices)
{
    if (s_devices->results[0] == 0) {
        simpleKernel<<<1,1>>>(NULL, NULL, 1);  // make a fault
        cudaDeviceSynchronize();
        s_devices->results[0] = 1;
    }
    else {
        int *d, *s;
        int n = 1;
        cudaMalloc(&d, n*sizeof(int));
        cudaMalloc(&s, n*sizeof(int));
        simpleKernel<<<1,1>>>(d, s, n);
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            s_devices->results[0] = 0;
        else
            s_devices->results[0] = 2;
    }
    cudaDeviceReset();
}
#endif

int main(int argc, char **argv)
{
    // Shared anonymous mapping so the child can report its result to the parent.
    // fd is -1 because some implementations require that with MAP_ANONYMOUS.
    ipcDevices_t *s_devices = (ipcDevices_t *) mmap(NULL, sizeof(*s_devices),
                                                    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    assert(MAP_FAILED != s_devices);

    // We can't initialize CUDA before fork() so we need to spawn a new process
    s_devices->count = 1;
    s_devices->results[0] = 0;

    printf("\nSpawning child process\n");
    int index = 0;

    pid_t pid = fork();

    printf("> Process %3d\n", pid);
    if (pid == 0) {  // child process
        // launch our test
        runTest(index, s_devices);
    }
    else {  // parent process
        int status;
        waitpid(pid, &status, 0);
        if (s_devices->results[0] < 2) {
            printf("first process launch reported error: %d\n", s_devices->results[0]);
            printf("respawn\n");
            pid_t newpid = fork();
            if (newpid == 0) {  // child process
                // launch our test
                runTest(index, s_devices);
            }
            else {  // parent process
                int status;
                waitpid(newpid, &status, 0);
                if (s_devices->results[0] < 2)
                    printf("second process launch reported error: %d\n", s_devices->results[0]);
                else
                    printf("second process launch successful\n");
            }
        }
    }

    // Cleanup and shutdown
    printf("\nShutting down...\n");

    exit(EXIT_SUCCESS);
}

$ nvcc -I/usr/local/cuda/samples/common/inc t477.cu -o t477
$ ./t477

Spawning child process
> Process 10841
> Process   0

Shutting down...
first process launch reported error: 1
respawn

Shutting down...
second process launch successful

Shutting down...
$

For windows, the only change needed should be to use a windows IPC mechanism for host interprocess communication.
