Change cuda::GpuMat values through custom kernel

Posted 2023-03-23

[Title]: Change cuda::GpuMat values through custom kernel [Posted]: 2021-03-15 12:33:54 [Question]:

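I am using a kernel to "loop" over a live camera stream and highlight specific color regions. These regions cannot always be reconstructed with a few cv::threshold calls, which is why I am using a kernel.

The current kernel is as follows: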

__global__ void custom_kernel(unsigned char* input, unsigned char* output, int width, int height, int colorWidthStep, int outputWidthStep) {
    // Pixel coordinates handled by this thread
    const int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    const int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    if ((xIndex < width) && (yIndex < height)) {
        // Byte offsets of the pixel, assuming 3 bytes per pixel
        const int color_tid = yIndex * colorWidthStep + (3*xIndex);
        const int output_tid = yIndex * outputWidthStep + (3*xIndex);
        const unsigned char red   = input[color_tid+0];
        const unsigned char green = input[color_tid+1];
        const unsigned char blue  = input[color_tid+2];
        if (!(red > 100 && blue < 50 && red > 1.0*green)) {
            output[output_tid] = 255;
            output[output_tid+1] = 255;
            output[output_tid+2] = 255;
        } else {
            output[output_tid] = 0;
            output[output_tid+1] = 0;
            output[output_tid+2] = 0;
        }
    }
}

The kernel is called here:

extern "C" void myFunction(cv::cuda::GpuMat& input, cv::cuda::GpuMat& output) {
    // Calculate total number of bytes of input and output image
    const int colorBytes = input.step * input.rows;
    const int outputBytes = output.step * output.rows;

    unsigned char *d_input, *d_output;

    // Allocate device memory
    SAFE_CALL(cudaMalloc<unsigned char>(&d_input,colorBytes),"CUDA Malloc Failed");
    SAFE_CALL(cudaMalloc<unsigned char>(&d_output,outputBytes),"CUDA Malloc Failed");

    // Copy data from OpenCV input image to device memory
    SAFE_CALL(cudaMemcpy(d_input,input.ptr(),colorBytes,cudaMemcpyHostToDevice),"CUDA Memcpy Host To Device Failed");

    // Specify a reasonable block size
    const dim3 block(16,16);

    // Calculate grid size to cover the whole image
    const dim3 grid((input.cols + block.x - 1)/block.x, (input.rows + block.y - 1)/block.y);

    // Launch the color conversion kernel
    custom_kernel<<<grid,block>>>(d_input,d_output,input.cols,input.rows,input.step,output.step);

    // Synchronize to check for any kernel launch errors
    SAFE_CALL(cudaDeviceSynchronize(),"Kernel Launch Failed");

    // Copy back data from destination device memory to OpenCV output image
    SAFE_CALL(cudaMemcpy(output.ptr(),d_output,outputBytes,cudaMemcpyDeviceToHost),"CUDA Memcpy Host To Device Failed");

    // Free the device memory
    SAFE_CALL(cudaFree(d_input),"CUDA Free Failed");
    SAFE_CALL(cudaFree(d_output),"CUDA Free Failed");
}

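I have included an example image that shows the kernel result on a red car. As you can see, there are still vertical red lines, even though I try to access the RGB/BGR values and set them to either zero or 255.

I used the following as a starting point, but I have the feeling that cv::Mat and cv::cuda::GpuMat do not store their values in the same way. I read that GpuMat only has a single ptr to its data and assumed it would work together with the blockIdx and blockDim parameters. https://github.com/sshniro/opencv-samples/blob/master/cuda-bgr-grey.cpp

Specific questions:

What is the cause of the red lines?

How do I change the RGB values correctly?

I am using CUDA 10.2 on Ubuntu 18.04 on an NVIDIA Xavier NX.

Update: as mentioned in the comments, I changed the parameters of the cudaMemcpy calls and removed the cudaMalloc and cudaFree parts. I also reminded myself that OpenCV stores colors in BGR order, so I changed the (+0,+1,+2) offsets inside the kernel. In addition, I loaded the red car directly via cv::imread to rule out any earlier formatting errors. Success: the kernel works.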
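The channel-order part of this update can be sketched on the host side. The helper names isTargetRed and maskValue are hypothetical and do not appear in the original code; the thresholds are the ones from the kernel above, and the pixel is assumed to be 8-bit BGR, so blue sits at offset 0 and red at offset 2.

```cpp
#include <cassert>

// Hypothetical helper: true if an 8-bit BGR pixel matches the "red" rule
// used in the kernel (red > 100, blue < 50, red > green).
// Assumes OpenCV's BGR order: px[0] = blue, px[1] = green, px[2] = red.
constexpr bool isTargetRed(const unsigned char* px) {
    return px[2] > 100 && px[0] < 50 && px[2] > px[1];
}

// Hypothetical helper: the value the kernel writes to each color channel.
// Matching pixels become 0 (black), all other pixels become 255 (white).
constexpr unsigned char maskValue(const unsigned char* px) {
    return static_cast<unsigned char>(isTargetRed(px) ? 0 : 255);
}
```

With this ordering a strongly red BGR pixel such as {10, 40, 200} matches the rule. Reading the same bytes with red at offset 0, as the first version of the kernel did, would compare the blue channel against the red threshold instead.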

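[Comments]:

This might be of interest.

cudaMemcpy should fail or have undefined behavior, because the data of a GpuMat already resides on the GPU. You have to use cudaMemcpyDeviceToDevice for both cudaMemcpy calls. Also, there is no need to allocate new memory in this case. You can just use d_input = input.ptr(); and d_output = output.ptr();.

I updated the question with my first success, thank you very much. Now I am wondering whether I should open a new question, because the problem now seems to be the correct conversion from sl::Mat to Gpu::Mat. I used this github.com/stereolabs/zed-opencv/issues/…, but it does not seem to work properly, since the colors still do not come out the way they do with 'imread->GpuMat'.

Okay, if you put an RGBA image in, it gets converted to BGRA in OpenCV, so you do not want to use (3*xIndex) but rather (4*xIndex).

It is working now. If you like, you can post your answer as an actual post so that I can accept it. If not, I will write one myself tomorrow.

[Answer 1]: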

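As @sgarizvi mentioned in the comments, a cv::cuda::GpuMat already resides on the GPU, so I had to use cudaMemcpyDeviceToDevice instead of cudaMemcpyHostToDevice.

There is also no need to allocate new memory, so I removed the cudaMalloc and cudaFree parts of the code above.

Finally (this applies to my case only, others may differ), my image input comes from a StereoLabs ZED 2, which publishes its images as RGBA. The order in memory is therefore R -> G -> B -> A, which becomes B -> G -> R -> A in OpenCV, with 4 bytes per pixel: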

const int color_tid = yIndex * colorWidthStep + (4*xIndex);
const int output_tid = yIndex * outputWidthStep + (4*xIndex);

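So, to address each pixel correctly, you have to advance the pointer by four times xIndex for a four-channel image, by three times xIndex for a plain BGR/RGB image, and by one times xIndex for a grayscale image.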
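The offset rule can be written as a small host-side helper. This is only a sketch: pixelOffset is a hypothetical name, the image is assumed to hold 8-bit channels, and the pitch of 512 bytes is an arbitrary value chosen for illustration (in real code it would come from GpuMat::step).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte offset of the first channel of pixel (x, y)
// in a pitched image with 8-bit channels.
// step is the row pitch in bytes, channels is 1 (gray), 3 (BGR) or 4 (BGRA).
constexpr std::size_t pixelOffset(int x, int y, std::size_t step, int channels) {
    return static_cast<std::size_t>(y) * step
         + static_cast<std::size_t>(x) * static_cast<std::size_t>(channels);
}

// Pixel (10, 2) with an assumed pitch of 512 bytes:
static_assert(pixelOffset(10, 2, 512, 1) == 1034, "grayscale: 1 byte per pixel");
static_assert(pixelOffset(10, 2, 512, 3) == 1054, "BGR: 3 bytes per pixel");
static_assert(pixelOffset(10, 2, 512, 4) == 1064, "BGRA: 4 bytes per pixel");
```

On a four-channel image, pixel (10, 2) starts at byte 1064. A kernel that multiplies xIndex by 3 would instead start at byte 1054, which lies in the middle of pixel 7 of the same row, so every thread would read and write misaligned channels.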
