N-body OpenCL code: error CL_OUT_OF_HOST_MEMORY with GPU card NVIDIA A6000 [closed]

Posted: 2021-10-03 11:18:37

Problem description:

I would like to run an old N-body code that uses OpenCL.

I have 2 NVIDIA A6000 cards connected with NVLink, a component which bridges the 2 GPU cards from a hardware (and maybe software?) point of view.

But at execution, I get the following results:

Here is the kernel code used (I have added the pragma that I believe is useful for NVIDIA cards):

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel
void
nbody_sim(
    __global double4* pos ,
    __global double4* vel,
    int numBodies,
    double deltaTime,
    double epsSqr,
    __local double4* localPos,
    __global double4* newPosition,
    __global double4* newVelocity)
{
    unsigned int tid = get_local_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);

    // Gravitational constant
    double G_constant = 227.17085e-74;

    // Number of tiles we need to iterate
    unsigned int numTiles = numBodies / localSize;

    // position of this work-item
    double4 myPos = pos[gid];
    double4 acc = (double4) (0.0f, 0.0f, 0.0f, 0.0f);

    for(int i = 0; i < numTiles; ++i)
    {
        // load one tile into local memory
        int idx = i * localSize + tid;
        localPos[tid] = pos[idx];

        // Synchronize to make sure data is available for processing
        barrier(CLK_LOCAL_MEM_FENCE);

        // Calculate acceleration effect due to each body
        // a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
        for(int j = 0; j < localSize; ++j)
        {
            // Calculate acceleration caused by particle j on particle i
            double4 r = localPos[j] - myPos;
            double distSqr = r.x * r.x  +  r.y * r.y  +  r.z * r.z;
            double invDist = 1.0f / sqrt(distSqr + epsSqr);
            double invDistCube = invDist * invDist * invDist;
            double s = G_constant * localPos[j].w * invDistCube;

            // accumulate effect of all particles
            acc += s * r;
        }

        // Synchronize so that next tile can be loaded
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    double4 oldVel = vel[gid];

    // updated position and velocity
    double4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
    newPos.w = myPos.w;
    double4 newVel = oldVel + acc * deltaTime;

    // write to global memory
    newPosition[gid] = newPos;
    newVelocity[gid] = newVel;
}

The part of the code that sets up the kernel is the following:

int NBody::setupCL()
{
  cl_int status = CL_SUCCESS;
  cl_event writeEvt1, writeEvt2;

  // The block is to move the declaration of prop closer to its use
  cl_command_queue_properties prop = 0;
  commandQueue = clCreateCommandQueue(
      context,
      devices[current_device],
      prop,
      &status);
  CHECK_OPENCL_ERROR( status, "clCreateCommandQueue failed.");

    ...

  // create a CL program using the kernel source
  const char *kernelName = "NBody_Kernels.cl";
  FILE *fp = fopen(kernelName, "r");
  if (!fp)
  {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
  }
  char *source = (char*)malloc(10000);
  int sourceSize = fread( source, 1, 10000, fp);
  fclose(fp);

  // Create a program from the kernel source
  program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);

  // Build the program
  status = clBuildProgram(program, 1, devices, NULL, NULL, NULL);

  // get a kernel object handle for a kernel with the given name
  kernel = clCreateKernel(
      program,
      "nbody_sim",
      &status);
  CHECK_OPENCL_ERROR(status, "clCreateKernel failed.");

  status = waitForEventAndRelease(&writeEvt1);
  CHECK_ERROR(status, NBODY_SUCCESS, "WaitForEventAndRelease(writeEvt1) Failed");

  status = waitForEventAndRelease(&writeEvt2);
  CHECK_ERROR(status, NBODY_SUCCESS, "WaitForEventAndRelease(writeEvt2) Failed");

  return NBODY_SUCCESS;
}

So the error occurs when the kernel code is built. Is there a way to treat the 2 GPUs as a single GPU thanks to the NVLINK component? I mean from a software point of view?

How can I fix this error at the creation of the kernel code?

UPDATE 1

I) I voluntarily limited the number of GPU devices to a single GPU by modifying the loop below (it actually performs only one iteration):

  // Print device index and device names
  //for(cl_uint i = 0; i < deviceCount; ++i)
  for(cl_uint i = 0; i < 1; ++i)
  {
    char deviceName[1024];
    status = clGetDeviceInfo(deviceIds[i], CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
    CHECK_OPENCL_ERROR(status, "clGetDeviceInfo failed");

    std::cout << "Device " << i << " : " << deviceName <<" Device ID is "<<deviceIds[i]<< std::endl;
  }

  // Set id = 0 for currentDevice with deviceType
  *currentDevice = 0;

  free(deviceIds);

  return NBODY_SUCCESS;

followed by the classical call:

 status = clBuildProgram(program, 1, devices, NULL, NULL, NULL);

But the error remains; the message is shown below:

II) If I do not modify this loop and apply the suggested solution, i.e. pass devices[current_device] instead of devices, I get the following compilation error:

In file included from NBody.hpp:8,
                 from NBody.cpp:1:
/opt/AMDAPPSDK-3.0/include/CL/cl.h:863:16: note:   initializing argument 3 of ‘cl_int clBuildProgram(cl_program, cl_uint, _cl_device_id* const*, const char*, void (*)(cl_program, void*), void*)’
                const cl_device_id * /* device_list */,

How can I work around this compilation issue?

UPDATE 2

I printed the value of the status variable in this part of the code:

I got the value status = -44. According to CL/cl.h, it corresponds to a CL_INVALID_PROGRAM error:

Then, when I execute the application, I get:

Since I am using OpenCL on NVIDIA cards, I wonder whether I am missing a special pragma to add in the kernel code, am I not?

By the way, what is the type of the variable devices? I cannot manage to print it correctly.

UPDATE 3

I have added the suggested lines, but at execution I still get the -44 error. Rather than pasting all the relevant code, I provide the following links to download the source file: http://31.207.36.11/NBody.cpp and the Makefile used for compilation: http://31.207.36.11/Makefile. Maybe someone will spot some errors, but I would mostly like to understand why I get this error -44.

UPDATE 4

I am taking this project up again.

Here is the result of the clinfo command:

$ clinfo
Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 3.0 CUDA 11.4.94
  Platform Name:                 NVIDIA CUDA
  Platform Vendor:               NVIDIA Corporation
  Platform Extensions:               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info


  Platform Name:                 NVIDIA CUDA
Number of devices:               2
  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     10deh
  Max compute units:                 84
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               64
  Max work group size:               1024
  Preferred vector width char:           1
  Preferred vector width short:          1
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          1
  Native vector width short:             1
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1800Mhz
  Address bits:                  64
  Max memory allocation:             12762480640
  Image support:                 Yes
  Max number of images read arguments:       256
  Max number of images write arguments:      32
  Max image 2D width:                32768
  Max image 2D height:               32768
  Max image 3D width:                16384
  Max image 3D height:               16384
  Max image 3D depth:                16384
  Max samplers within kernel:            32
  Max size of kernel argument:           4352
  Alignment (bits) of base address:      4096
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               128
  Cache size:                    2408448
  Global memory size:                51049922560
  Constant buffer size:              65536
  Max number of constant args:           9
  Local memory type:                 Scratchpad
  Local memory size:                 49152
  Max pipe arguments:                0
  Max pipe active reservations:          0
  Max pipe packet size:              0
  Max global variable size:          0
  Max global variable preferred total size:  0
  Max read/write image args:             0
  Max on device events:              0
  Queue on device max size:          0
  Max on device queues:              0
  Queue on device preferred size:        0
  SVM capabilities:
    Coarse grain buffer:             Yes
    Fine grain buffer:               No
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     32
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1000
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:
    Out-of-Order:                Yes
    Profiling :                  Yes
  Queue on Device properties:
    Out-of-Order:                No
    Profiling :                  No
  Platform ID:                   0x1e97440
  Name:                      NVIDIA RTX A6000
  Vendor:                    NVIDIA Corporation
  Device OpenCL C version:           OpenCL C 1.2
  Driver version:                470.57.02
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 3.0 CUDA
  Extensions:                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info


  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     10deh
  Max compute units:                 84
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               64
  Max work group size:               1024
  Preferred vector width char:           1
  Preferred vector width short:          1
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          1
  Native vector width short:             1
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1800Mhz
  Address bits:                  64
  Max memory allocation:             12762578944
  Image support:                 Yes
  Max number of images read arguments:       256
  Max number of images write arguments:      32
  Max image 2D width:                32768
  Max image 2D height:               32768
  Max image 3D width:                16384
  Max image 3D height:               16384
  Max image 3D depth:                16384
  Max samplers within kernel:            32
  Max size of kernel argument:           4352
  Alignment (bits) of base address:      4096
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               128
  Cache size:                    2408448
  Global memory size:                51050315776
  Constant buffer size:              65536
  Max number of constant args:           9
  Local memory type:                 Scratchpad
  Local memory size:                 49152
  Max pipe arguments:                0
  Max pipe active reservations:          0
  Max pipe packet size:              0
  Max global variable size:          0
  Max global variable preferred total size:  0
  Max read/write image args:             0
  Max on device events:              0
  Queue on device max size:          0
  Max on device queues:              0
  Queue on device preferred size:        0
  SVM capabilities:
    Coarse grain buffer:             Yes
    Fine grain buffer:               No
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     32
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1000
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:
    Out-of-Order:                Yes
    Profiling :                  Yes
  Queue on Device properties:
    Out-of-Order:                No
    Profiling :                  No
  Platform ID:                   0x1e97440
  Name:                      NVIDIA RTX A6000
  Vendor:                    NVIDIA Corporation
  Device OpenCL C version:           OpenCL C 1.2
  Driver version:                470.57.02
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 3.0 CUDA
  Extensions:                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info

So I have one platform with 2 GPU A6000 cards.

Given that I want, for the moment, to run the original version of my code (i.e. using a single GPU card), I have to select only one ID in the source NBody.cpp (I will see in a second step how to manage both GPU cards, but that is for later). So I just modified this source.

Instead of:

  // Print device index and device names
  for(cl_uint i = 0; i < deviceCount; ++i)
  {
    char deviceName[1024];
    status = clGetDeviceInfo(deviceIds[i], CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
    CHECK_OPENCL_ERROR(status, "clGetDeviceInfo failed");

    std::cout << "Device " << i << " : " << deviceName <<" Device ID is "<<deviceIds[i]<< std::endl;
  }

I did:

  // Print device index and device names
  //for(cl_uint i = 0; i < deviceCount; ++i)
  for(cl_uint i = 0; i < 1; ++i)
  {
    char deviceName[1024];
    status = clGetDeviceInfo(deviceIds[i], CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
    CHECK_OPENCL_ERROR(status, "clGetDeviceInfo failed");

    std::cout << "Device " << i << " : " << deviceName <<" Device ID is "<<deviceIds[i]<< std::endl;
  }

As you can see, I have forced the code to take into account deviceIds[0], i.e. a single GPU card.

The key part is also the building of the program.

  // create a CL program using the kernel source
  const char *kernelName = "NBody_Kernels.cl";
  FILE *fp = fopen(kernelName, "r");
  if (!fp)
  {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
  }
  char *source = (char*)malloc(10000);
  int sourceSize = fread( source, 1, 10000, fp);
  fclose(fp);

  // Create a program from the kernel source
  program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);

  // Build the program
  //status = clBuildProgram(program, 1, devices, NULL, NULL, NULL);
  status = clBuildProgram(program, 1, &devices[current_device], NULL, NULL, NULL);
  printf("status1 = %d\n", status);
  //printf("devices = %d\n", devices[current_device]);

  // get a kernel object handle for a kernel with the given name
  kernel = clCreateKernel(
      program,
      "nbody_sim",
      &status);
  printf("status2 = %d\n", status);
  CHECK_OPENCL_ERROR(status, "clCreateKernel failed.");

At execution, I get the following values for status1 and status2:

Selected Platform Vendor : NVIDIA Corporation
deviceCount = 2/nDevice 0 : NVIDIA RTX A6000 Device ID is 0x55c38207cdb0
status1 = -44
devices = -2113661720
status2 = -44
clCreateKernel failed.
clSetKernelArg failed. (updatedPos)
clEnqueueNDRangeKernel failed.
clEnqueueNDRangeKernel failed.
clEnqueueNDRangeKernel failed.
clEnqueueNDRangeKernel failed.

The first error is the failure of the kernel creation. Here is my NBody_Kernels.cl source:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel
void
nbody_sim(
    __global double4* pos ,
    __global double4* vel,
    int numBodies,
    double deltaTime,
    double epsSqr,
    __local double4* localPos,
    __global double4* newPosition,
    __global double4* newVelocity)
{
    unsigned int tid = get_local_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);

    // Gravitational constant
    double G_constant = 227.17085e-74;

    // Number of tiles we need to iterate
    unsigned int numTiles = numBodies / localSize;

    // position of this work-item
    double4 myPos = pos[gid];
    double4 acc = (double4) (0.0f, 0.0f, 0.0f, 0.0f);

    for(int i = 0; i < numTiles; ++i)
    {
        // load one tile into local memory
        int idx = i * localSize + tid;
        localPos[tid] = pos[idx];

        // Synchronize to make sure data is available for processing
        barrier(CLK_LOCAL_MEM_FENCE);

        // Calculate acceleration effect due to each body
        // a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
        for(int j = 0; j < localSize; ++j)
        {
            // Calculate acceleration caused by particle j on particle i
            double4 r = localPos[j] - myPos;
            double distSqr = r.x * r.x  +  r.y * r.y  +  r.z * r.z;
            double invDist = 1.0f / sqrt(distSqr + epsSqr);
            double invDistCube = invDist * invDist * invDist;
            double s = G_constant * localPos[j].w * invDistCube;

            // accumulate effect of all particles
            acc += s * r;
        }

        // Synchronize so that next tile can be loaded
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    double4 oldVel = vel[gid];

    // updated position and velocity
    double4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
    newPos.w = myPos.w;
    double4 newVel = oldVel + acc * deltaTime;

    // write to global memory
    newPosition[gid] = newPos;
    newVelocity[gid] = newVel;
}

The modified source can be found here:

last modified code

I do not know how to fix this failure at the creation of the kernel code, nor the values status1 = -44 and status2 = -44 shown above.

UPDATE 5

I have added a call to clGetProgramBuildInfo to the snippet below in order to see what is wrong with the clCreateKernel failed error:

  // Create a program from the kernel source
  program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);

  if (clBuildProgram(program, 1, devices, NULL, NULL, NULL) != CL_SUCCESS)
  {
    // Determine the size of the log
    size_t log_size;
    clGetProgramBuildInfo(program, devices[current_device], CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    // Allocate memory for the log
    char *log = (char *) malloc(log_size);

    cout << "size log =" << log_size << endl;
    // Get the log
    clGetProgramBuildInfo(program, devices[current_device], CL_PROGRAM_BUILD_LOG, log_size, log, NULL);

    // Print the log
    printf("%s\n", log);
  }

  // get a kernel object handle for a kernel with the given name
  kernel = clCreateKernel(
      program,
      "nbody_sim",
      &status);
  CHECK_OPENCL_ERROR(status, "clCreateKernel failed.");

Unfortunately, this clGetProgramBuildInfo call only gives the following output:

Selected Platform Vendor : NVIDIA Corporation
Device 0 : NVIDIA RTX A6000 Device ID is 0x562857930980
size log =16
log =
clCreateKernel failed.

How can I print the content of this "value"?

UPDATE 6

If I do a printf on the status right after:

  // Create a program from the kernel source
  program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);
printf("status clCreateProgramWithSourceContext = %d\n", status);

I get status = -6, which corresponds to CL_OUT_OF_HOST_MEMORY.

What leads could I follow to solve this problem?

PARTIAL SOLUTION

By compiling with the Intel compilers (icc and icpc), the compilation works and the code runs fine. I do not understand why it does not work with the GNU gcc/g++-8 compilers. If anyone has an idea...

Question comments:

To help you with the error: I think I found the issue and updated my answer. Also: what is the value of status after the call to clCreateKernel?

I get the error status = -44 after clCreateKernel; from the cl.h file, it corresponds to CL_INVALID_PROGRAM, as shown in my UPDATE 2. Do you have an idea about this error? A missing pragma in the kernel code?

Note: the IP-address links in this post are obviously highly temporary, so this post is arguably missing a minimal reproducible example. Please replace them with inline copies of each file. If that is not possible, the question will need to be put on hold.

I can assure you that it is currently off-topic and will be voted closed on that basis once the bounty expires. I suggest you change your approach.

If the complete source code does not fit in the question, the question should not be posted, especially when the code links are obviously temporary. Usually, the best solution to that is to break the problem down into smaller questions that do fit. Please read the "minimal example" link given earlier.

Answer 1:

Your kernel code looks fine and the cache-tiling implementation is correct. Just make sure that the number of bodies is a multiple of the local size, or alternatively limit the inner for loop to the global size.
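For illustration, a minimal sketch of how that guard could look when numBodies is not a multiple of the local size (the padded tile and the tileSize bound are my additions, not part of the original kernel):

// Hypothetical variant of the tiling loop: iterate ceil(numBodies/localSize) tiles
// and keep both the tile load and the inner loop inside the valid range.
unsigned int numTiles = (numBodies + localSize - 1) / localSize;

for(unsigned int i = 0; i < numTiles; ++i)
{
    int idx = i * localSize + tid;
    // pad the last tile with zero-mass bodies so every work-item still reaches barrier()
    localPos[tid] = (idx < numBodies) ? pos[idx] : (double4)(0.0, 0.0, 0.0, 0.0);
    barrier(CLK_LOCAL_MEM_FENCE);

    unsigned int tileSize = min(localSize, (unsigned int)numBodies - i * localSize);
    for(unsigned int j = 0; j < tileSize; ++j)
    {
        // ... same force accumulation as in the original kernel ...
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}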

OpenCL allows using multiple devices in parallel. You need to create a separate thread with its own queue for each device. You also need to handle device-to-device communication and synchronization manually. Data transfer goes over PCIe (you can also do remote direct memory access); you cannot, however, use NVLink with OpenCL. This should not be an issue in your case, since you need only very little data transfer compared to the amount of arithmetic.

A few additional remarks:

In many cases, N-body requires FP64 to sum up the forces and to resolve positions at very different length scales. On the A6000, however, FP64 performance is very poor, just like on GeForce Ampere. FP32 would be significantly faster (~64x), but is probably not accurate enough here. For efficient FP64 you would need an A100 or an MI100.

Use rsqrt instead of 1.0/sqrt. It is hardware-supported and almost as fast as a multiplication.

Make sure to consistently use either FP32 float (1.0f) or FP64 double (1.0) literals. Using double literals together with floats triggers double arithmetic plus a cast of the result back to float, which is much slower.
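As a hedged illustration of the last two points, the inner loop of the kernel above could be written like this (same math as the original, only rsqrt and consistent FP64 literals swapped in):

// Sketch: hardware rsqrt and pure FP64 literals (assumes cl_khr_fp64 is enabled)
double4 r = localPos[j] - myPos;
double distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
double invDist = rsqrt(distSqr + epsSqr);        // instead of 1.0f / sqrt(...)
double invDistCube = invDist * invDist * invDist;
double s = G_constant * localPos[j].w * invDistCube;
acc += s * r;

// and elsewhere, double literals instead of float ones:
double4 acc = (double4)(0.0, 0.0, 0.0, 0.0);
double4 newPos = myPos + oldVel * deltaTime + acc * 0.5 * deltaTime * deltaTime;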

EDIT: To help you with the error message: most likely, the error at clCreateKernel (what is the value of status after the clCreateKernel call?) hints that program is invalid. This might be because you give clBuildProgram a vector of 2 devices, but set the number of devices to only 1, and the context is created for only 1 device. Try

status = clBuildProgram(program, 1, &devices[current_device], NULL, NULL, NULL);

with only one device.

To use multiple GPUs, create two threads on the CPU that run NBody::setupCL() for GPU 0 and GPU 1 respectively, and then do the synchronization manually.
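A minimal host-side sketch of that layout could look as follows; the runOnDevice function, the per-device split of the bodies and the omitted error checks are all assumptions of mine, not code from your project:

// Sketch: one context + queue + host thread per GPU. Boundary data still has to be
// exchanged and synchronized manually between the two threads.
#include <CL/cl.h>
#include <thread>
#include <vector>

static void runOnDevice(cl_device_id dev /*, this device's share of the bodies */)
{
    cl_int status = CL_SUCCESS;
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &status);
    cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, &status);

    // build the program, create the kernel and the buffers for this device,
    // then loop: enqueue the kernel, read back the updated positions and
    // exchange them with the other thread before the next time step.

    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
}

void runMultiGpu(cl_device_id* deviceIds, cl_uint deviceCount)
{
    std::vector<std::thread> workers;
    for(cl_uint d = 0; d < deviceCount; ++d)
        workers.emplace_back(runOnDevice, deviceIds[d]);
    for(auto& t : workers)
        t.join();
}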

EDIT 2: I do not see where you create the context. Without a valid context, program will be invalid, so clBuildProgram throws error -44. Call

context = clCreateContext(0, 1, &devices[current_device], NULL, NULL, NULL);

before you do anything with context.
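A slightly more defensive variant of that call (my own sketch, reusing the CHECK_OPENCL_ERROR macro from your code) would also verify the returned status, which is what the comments below suggest checking as well:

// Sketch: create the context for the selected device and fail early if it is invalid
cl_int ctxStatus = CL_SUCCESS;
context = clCreateContext(NULL, 1, &devices[current_device], NULL, NULL, &ctxStatus);
CHECK_OPENCL_ERROR(ctxStatus, "clCreateContext failed.");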

Comments:

Thanks for your reaction. I added the following lines but still get the -44 error at execution. Rather than pasting all the relevant code, I provide you with these links to download the source file: 31.207.36.11/NBody.cpp and the Makefile used for compilation: 31.207.36.11/Makefile. Maybe you will find some errors, but I would mostly like to know why I get this error -44. Regards

I went through your code in detail and it looks fine to me. The only thing: could you check the status of clCreateContext and see whether that one is OK? Other than that, error -44 could come from a threading issue, which would be very difficult to debug. Your best bet is to write an isolated, minimal OpenCL example, like a vector addition, and see how far you get.

I have taken up the debugging of my code again. As you can see, the trouble begins when I try to create the kernel. I have provided the content of my kernel code; could you take a look at UPDATE 4? Best regards

Error -44 is hard to debug, see ***.com/q/63263231/9178992. I would recommend starting with the simplest OpenCL "Hello World" program and checking whether you still get the error.
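For reference, here is one way such an isolated vector-addition test could look (all names and the structure below are my own sketch, not code from the project; if even this minimal program fails with -6 or -44, the problem lies in the driver/toolchain setup rather than in the N-body code):

// vecadd_test.cpp — minimal OpenCL sanity check (compile with: g++ vecadd_test.cpp -lOpenCL)
#include <CL/cl.h>
#include <stdio.h>

static const char* kSrc =
    "__kernel void vecadd(__global const float* a, __global const float* b,\n"
    "                     __global float* c) {\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // print the status of every step, so the first failing call is obvious
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    printf("clCreateContext:           %d\n", err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);
    printf("clCreateCommandQueue:      %d\n", err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, NULL, &err);
    printf("clCreateProgramWithSource: %d\n", err);
    err = clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    printf("clBuildProgram:            %d\n", err);
    cl_kernel k = clCreateKernel(prog, "vecadd", &err);
    printf("clCreateKernel:            %d\n", err);

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, &err);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);

    printf("c[10] = %f (expected 30.0)\n", c[10]);
    return 0;
}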
