Counting coloured pixels on the GPU - Theory

Posted 2023-03-16


【Title】: Counting coloured pixels on the GPU - Theory 【Posted】: 2017-11-04 21:13:38 【Question】:

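I have a 128 x 128 pixel image.

It is broken down into an 8 x 8 grid.

Each grid block contains 16 x 16 pixels.

Requirement

I want to count how many black pixels my image contains.

The straightforward way:

I could do this by going row by row, column by column, over the whole image and checking whether each pixel is black.

The GPU way

...but I would like to know whether, using the GPU, I could break the image down into chunks/blocks, count the black pixels in each block, and then sum the results.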

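For example:

If you look at the top left of the image:

The first block, 'A1' (row A, column 1), contains a 16 x 16 pixel grid; counting by hand, I know it has 16 black pixels.

The second block, 'A2' (row A, column 2), contains a 16 x 16 pixel grid; counting by hand, I know it has 62 black pixels.

All other blocks in this example are blank/empty.

If I run my image through my program, I should get the answer: 16 + 62 = 78 black pixels.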
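As a point of reference, the block-by-block idea can be sketched on the CPU first. The snippet below is only an illustration under stated assumptions: the image is taken to be 8-bit grayscale in row-major order with 0 meaning black, and the names `count_black_in_tile` and `count_black` are made up for this example. On a GPU, each tile would map naturally to one work group.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 8-bit grayscale pixels, row-major layout, 0 == black.
constexpr int kImageSize = 128;                        // image is 128 x 128 pixels
constexpr int kTileSize = 16;                          // each tile is 16 x 16 pixels
constexpr int kTilesPerSide = kImageSize / kTileSize;  // 8 tiles per row and per column

// Count the black pixels inside one tile (tile_row and tile_col are 0-based).
int count_black_in_tile(const std::vector<std::uint8_t>& image, int tile_row, int tile_col) {
    assert(image.size() == static_cast<std::size_t>(kImageSize) * kImageSize);
    int count = 0;
    for (int y = tile_row * kTileSize; y < (tile_row + 1) * kTileSize; ++y) {
        for (int x = tile_col * kTileSize; x < (tile_col + 1) * kTileSize; ++x) {
            if (image[static_cast<std::size_t>(y) * kImageSize + x] == 0) {
                ++count;
            }
        }
    }
    return count;
}

// Sum the per-tile counts over the whole 8 x 8 grid.
int count_black(const std::vector<std::uint8_t>& image) {
    int total = 0;
    for (int r = 0; r < kTilesPerSide; ++r) {
        for (int c = 0; c < kTilesPerSide; ++c) {
            total += count_black_in_tile(image, r, c);
        }
    }
    return total;
}
```

With 16 black pixels placed in tile A1 and 62 in tile A2, `count_black` returns 78, matching the hand count above. The per-tile calls do not depend on each other, which is what makes the problem easy to parallelize.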

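Reasoning

As I understand it, the GPU can operate on a lot of data in parallel, effectively running a small program over a large chunk of data spread across many GPU threads. I'm not worried about speed/performance; I'd just like to know whether this is something the GPU can/could do?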

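【Comments】:

This looks like a special case of a histogram, which GPUs are very good at. Try searching for "GPU" + "histogram" with your favourite internet search engine.

Related: ***.com/questions/15782325/cuda-programming-histogram

More related: devblogs.nvidia.com/parallelforall/…

You know you can do the same thing on the CPU, right? Nothing stops you from spawning one thread per rectangle and then adding up the results.

@FernandoMazzon - Sure, but I'm specifically looking for an answer that uses the GPU.

【Answer 1】: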

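Indeed, general-purpose GPUs (such as those in Apple devices from the A8 onwards) are not only capable of solving this kind of data-parallel problem, they are designed for it.

Apple introduced data-parallel processing with Metal on its platforms, and with some simple code you can solve problems like yours on the GPU. Even though this can also be done with other frameworks, I'm including some Metal + Swift code as a proof of concept.

The following runs on macOS Sierra as a Swift command-line tool and was built with Xcode 9 (yes, I know it's a beta). You can get the full project from my github repo.

As main.swift: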

import Foundation
import Metal
import CoreGraphics
import AppKit

guard FileManager.default.fileExists(atPath: "./testImage.png") else {
    print("./testImage.png does not exist")
    exit(1)
}

let url = URL(fileURLWithPath: "./testImage.png")
let imageData = try Data(contentsOf: url)

guard let image = NSImage(data: imageData),
    let imageRef = image.cgImage(forProposedRect: nil, context: nil, hints: nil) else {
    print("Failed to load image data")
    exit(1)
}

let bytesPerPixel = 4
let bytesPerRow = bytesPerPixel * imageRef.width

var rawData = [UInt8](repeating: 0, count: Int(bytesPerRow * imageRef.height))

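// premultipliedLast + byteOrder32Big stores bytes as R, G, B, A, matching the .rgba8Unorm texture created below
let bitmapInfo = CGBitmapInfo(rawValue: CGImageAlphaInfo.premultipliedLast.rawValue).union(.byteOrder32Big)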
let colorSpace = CGColorSpaceCreateDeviceRGB()

let context = CGContext(data: &rawData,
                        width: imageRef.width,
                        height: imageRef.height,
                        bitsPerComponent: 8,
                        bytesPerRow: bytesPerRow,
                        space: colorSpace,
                        bitmapInfo: bitmapInfo.rawValue)

let fullRect = CGRect(x: 0, y: 0, width: CGFloat(imageRef.width), height: CGFloat(imageRef.height))
context?.draw(imageRef, in: fullRect, byTiling: false)

// Get access to the system's default GPU
guard let device = MTLCreateSystemDefaultDevice() else {
    print("Metal is not supported on this machine")
    exit(1)
}

let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .rgba8Unorm,
    width: Int(imageRef.width),
    height: Int(imageRef.height),
    mipmapped: true)

guard let texture = device.makeTexture(descriptor: textureDescriptor) else {
    print("Failed to create texture")
    exit(1)
}

let region = MTLRegionMake2D(0, 0, Int(imageRef.width), Int(imageRef.height))
texture.replace(region: region, mipmapLevel: 0, withBytes: &rawData, bytesPerRow: Int(bytesPerRow))

// Queue to handle an ordered list of command buffers, and a buffer for
// storing the encoded commands that are sent to the GPU
guard let commandQueue = device.makeCommandQueue(),
    let commandBuffer = commandQueue.makeCommandBuffer() else {
    print("Failed to create command queue or command buffer")
    exit(1)
}

// Access to Metal functions that are stored in the Shaders.metal file, e.g. countBlack()
guard let defaultLibrary = device.makeDefaultLibrary() else {
    print("Failed to create default metal shader library")
    exit(1)
}

// Encoder for GPU commands
guard let computeCommandEncoder = commandBuffer.makeComputeCommandEncoder() else {
    print("Failed to create compute command encoder")
    exit(1)
}

// hardcoded to 16 for now (recommendation: read about threadExecutionWidth)
// Note: the integer division assumes the image size is a multiple of 16
let threadsPerGroup = MTLSize(width: 16, height: 16, depth: 1)
let numThreadgroups = MTLSizeMake(texture.width / threadsPerGroup.width,
                                  texture.height / threadsPerGroup.height,
                                  1)

// Set up a compute pipeline with the countBlack() function and add it to the encoder
let countBlackProgram = defaultLibrary.makeFunction(name: "countBlack")
let computePipelineState = try device.makeComputePipelineState(function: countBlackProgram!)
computeCommandEncoder.setComputePipelineState(computePipelineState)


// Set the input texture for the countBlack() function, i.e. inArray
// index: 0 here corresponds to texture(0) in the countBlack() function
computeCommandEncoder.setTexture(texture, index: 0)

// Create the output buffer for the countBlack() function, i.e. counter
// index: 0 here corresponds to buffer(0) in the countBlack() function
guard let counterBuffer = device.makeBuffer(length: MemoryLayout<UInt32>.size,
                                            options: .storageModeShared) else {
    print("Failed to create counter buffer")
    exit(1)
}
computeCommandEncoder.setBuffer(counterBuffer, offset: 0, index: 0)

computeCommandEncoder.dispatchThreadgroups(numThreadgroups, threadsPerThreadgroup: threadsPerGroup)

computeCommandEncoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// Read the result back from the shared buffer.
// counterBuffer.contents() returns an UnsafeMutableRawPointer, roughly equivalent to void* in C.
// Load exactly one UInt32 (4 bytes), which is the size the buffer was allocated with.
let nonWhiteCount = counterBuffer.contents().load(as: UInt32.self)

print("Found \(nonWhiteCount) non-white pixels")

Also, in Shaders.metal:

#include <metal_stdlib>
using namespace metal;

kernel void
countBlack(texture2d<float, access::read> inArray [[texture(0)]],
           volatile device uint *counter [[buffer(0)]],
           uint2 gid [[thread_position_in_grid]]) {

    // Atomic as we need to sync between threadgroups
    device atomic_uint *atomicBuffer = (device atomic_uint *)counter;
    float3 inColor = inArray.read(gid).rgb;
    // Counts every pixel that is not pure white, not only the black ones
    if (inColor.r != 1.0 || inColor.g != 1.0 || inColor.b != 1.0) {
        atomic_fetch_add_explicit(atomicBuffer, 1, memory_order_relaxed);
    }
}

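I used this question to learn a bit about Metal and data-parallel computing, so most of the code was adapted from boilerplate in online articles. Please take the time to visit the sources listed below for more examples. Also, the code is pretty much hard-coded for this particular problem, but you shouldn't have much trouble adapting it.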
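The atomic increment used by the kernel has a direct CPU analogue, which may help if the Metal syntax is unfamiliar. The following is a rough sketch, not part of the project: it assumes an 8-bit grayscale image where 255 is white (the kernel reads RGB instead), and `count_non_white` is a name invented for this example. Several threads share one counter and bump it with a relaxed atomic add, just as the GPU threads do.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Count pixels that are not white (value != 255) using a shared atomic counter.
// Assumption: grayscale input; the Metal kernel compares r, g and b instead.
std::uint32_t count_non_white(const std::vector<std::uint8_t>& pixels, unsigned num_threads) {
    assert(num_threads > 0);
    std::atomic<std::uint32_t> counter{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&pixels, &counter, t, num_threads] {
            // Each worker visits every num_threads-th pixel, starting at its own index.
            for (std::size_t i = t; i < pixels.size(); i += num_threads) {
                if (pixels[i] != 255) {
                    // Relaxed ordering suffices: only the final total matters.
                    counter.fetch_add(1, std::memory_order_relaxed);
                }
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter.load();
}
```

A plain `++counter` on a non-atomic integer would lose updates when two threads increment at the same moment, which is why both this sketch and the kernel go through an atomic operation.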

Sources:

http://flexmonkey.blogspot.com.ar/2016/05/histogram-equalisation-with-metal.html

http://metalbyexample.com/introduction-to-compute/

http://memkite.com/blog/2014/12/15/data-parallel-programming-with-metal-and-swift-for-iphoneipad-gpu/

【Comments】:

Isn't the default behaviour of a Metal shader parallel?

【Answer 2】:

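Your question: I'd like to know whether this is something the GPU can/could do?

Answer: Yes, a GPU can handle your computation. All the numbers look very GPU friendly:

Warp size: 32 (16x2)

Maximum threads per block: 1024 (8x128) (8x8x16)

Maximum threads per multiprocessor: 2048

...etc.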

You can try many block/thread configurations to get the best performance.
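To make one such configuration concrete, here is a small sketch of the arithmetic. It assumes one thread per pixel and square blocks; the `LaunchConfig` struct and `make_config` function are hypothetical helpers written for this example, not part of any GPU API.

```cpp
#include <cassert>

// Hypothetical helper type: describes how a 2D image is split into thread blocks.
struct LaunchConfig {
    int threads_per_block;
    int blocks_x;
    int blocks_y;
};

// Assumption: one thread per pixel, square blocks of block_dim x block_dim threads.
LaunchConfig make_config(int width, int height, int block_dim) {
    assert(width > 0 && height > 0 && block_dim > 0);
    LaunchConfig cfg;
    cfg.threads_per_block = block_dim * block_dim;
    // Round up so images whose size is not a multiple of block_dim are still covered.
    cfg.blocks_x = (width + block_dim - 1) / block_dim;
    cfg.blocks_y = (height + block_dim - 1) / block_dim;
    return cfg;
}
```

For the 128 x 128 image with 16 x 16 blocks, `make_config(128, 128, 16)` gives 256 threads per block (8 warps of 32, well under the 1024 limit) and an 8 x 8 grid of 64 blocks. When the rounding up produces extra threads, the kernel needs a bounds check so those threads skip pixels outside the image.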

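Procedure: In general, using the GPU means copying data from CPU memory to GPU memory, running the computation on the GPU, and finally copying the result back to the CPU for further processing. An important point to keep in mind is that all of these transfers go over the PCI-e link between the CPU and the GPU, which is very slow compared with the memory bandwidth of either.

My opinion: In this case, in the time it takes to copy the image to GPU memory, you would already have the result even with a single CPU thread. That is because your process is not math/computation intensive. You are just reading the data, comparing it against black, and then incrementing an accumulator or counter to get the total (which in itself raises a race condition you will have to resolve).

My suggestion: If, after profiling your whole program, you conclude that the routine that counts black pixels is a real bottleneck, try:

a divide-and-conquer recursive algorithm, or

parallelizing the computation across multiple CPU cores.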
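Both suggestions can be combined in a few lines. The sketch below is one possible shape, assuming 8-bit grayscale pixels with 0 as black; the function name `count_black_dc`, the 1024-pixel cutoff, and the `depth` parameter are choices made for this illustration. The range is split in half recursively, and the left half is handed to `std::async` so it can run on another core.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Recursively count black (zero-valued) pixels in the range [begin, end).
// depth limits how many levels of recursion are allowed to spawn asynchronous tasks.
std::size_t count_black_dc(const std::uint8_t* begin, const std::uint8_t* end, int depth) {
    assert(begin <= end);
    const std::size_t n = static_cast<std::size_t>(end - begin);
    if (depth <= 0 || n < 1024) {
        // Base case: a small or final range is counted serially.
        std::size_t count = 0;
        for (const std::uint8_t* p = begin; p != end; ++p) {
            count += (*p == 0);
        }
        return count;
    }
    const std::uint8_t* mid = begin + n / 2;
    // The left half runs as a separate task; the right half runs on this thread.
    auto left = std::async(std::launch::async, count_black_dc, begin, mid, depth - 1);
    const std::size_t right = count_black_dc(mid, end, depth - 1);
    return left.get() + right;
}
```

Calling `count_black_dc(img.data(), img.data() + img.size(), 2)` uses up to four concurrent tasks. Each task keeps its own local count and the partial results are only added after `get()`, so no shared counter is involved and the race condition mentioned above does not arise.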

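【Comments】:

【Answer 3】:

The GPU can do a lot here.

I'm not sure whether you are looking for an algorithm here, but I can point you to a widely used GPU library that implements an efficient counting procedure. Have a look at the count function in the thrust library: https://thrust.github.io/doc/group__counting.html

Its count_if variant takes a predicate function as input and counts the occurrences of the input data that satisfy the predicate.

The following counts the number of elements in data that are equal to 0.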

template <typename T>
struct zero_pixel {
  __host__ __device__ bool operator()(const T &x) const { return x == 0; }
};
thrust::count_if(data.begin(), data.end(), zero_pixel<T>());

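There is a working example here: https://github.com/thrust/thrust/blob/master/testing/count.cu

You should write a predicate that tests whether a pixel is black (depending on what a pixel is in your case; it could be an RGB triplet, in which case the predicate needs to be a bit more elaborate).

I would also linearize the pixels into a linear, iterable data structure (but that depends on what your data actually is).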
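For the RGB case, the predicate and the linearization might look roughly like the following. This is a CPU-only sketch using `std::count_if` so it can be compiled without CUDA; the `Rgb` struct, the `is_black` functor, and the `linearize` helper are assumptions made for this example, and the input is assumed to be interleaved r, g, b bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed pixel layout for this example: three 8-bit channels.
struct Rgb {
    std::uint8_t r, g, b;
};

// Predicate: a pixel is black only when all three channels are zero.
struct is_black {
    bool operator()(const Rgb& p) const { return p.r == 0 && p.g == 0 && p.b == 0; }
};

// Flatten interleaved bytes (r, g, b, r, g, b, ...) into a linear vector of pixels.
std::vector<Rgb> linearize(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % 3 == 0);
    std::vector<Rgb> pixels;
    pixels.reserve(bytes.size() / 3);
    for (std::size_t i = 0; i < bytes.size(); i += 3) {
        pixels.push_back(Rgb{bytes[i], bytes[i + 1], bytes[i + 2]});
    }
    return pixels;
}

// Count the pixels that satisfy the predicate.
std::ptrdiff_t count_black(const std::vector<Rgb>& pixels) {
    return std::count_if(pixels.begin(), pixels.end(), is_black());
}
```

To move this to the GPU with thrust, the `operator()` would additionally be marked `__host__ __device__`, the pixels would live in a `thrust::device_vector`, and `std::count_if` would become `thrust::count_if`; the structure of the code stays the same.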

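If you are interested in the histogram approach, you can sort the image's pixels (using any GPU-efficient algorithm or, why not, thrust's own sort, thrust::sort(...)) so that equal elements are grouped together, and then perform a reduction by key with thrust::reduce_by_key.

Have a look at this example: https://github.com/thrust/thrust/blob/master/examples/histogram.cu

Note that the histogram approach is a bit more costly because it solves a bigger problem (counting the occurrences of all unique elements).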
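The sort-then-reduce idea is easy to see in a serial form. The sketch below is a CPU stand-in, not the thrust example linked above: it assumes 8-bit grayscale pixels, and `histogram_by_sort` is a name chosen for this illustration. The loop over the sorted data plays the role that reduce_by_key plays on the GPU, collapsing each run of equal values into one (value, count) pair.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Build a histogram as (value, count) pairs, ordered by value.
// The input is taken by value because sorting reorders it.
std::vector<std::pair<std::uint8_t, std::size_t>> histogram_by_sort(std::vector<std::uint8_t> pixels) {
    // Step 1: sort so that equal values become adjacent.
    std::sort(pixels.begin(), pixels.end());

    // Step 2: collapse each run of equal values into a single bin.
    std::vector<std::pair<std::uint8_t, std::size_t>> bins;
    std::size_t total = 0;
    for (std::uint8_t v : pixels) {
        if (bins.empty() || bins.back().first != v) {
            bins.emplace_back(v, std::size_t{0});
        }
        ++bins.back().second;
        ++total;
    }
    assert(total == pixels.size());
    return bins;
}
```

The black-pixel count is then the count stored in the bin whose value is 0, if that bin exists. This also shows why the approach costs more than a plain count: the sort does work for every distinct value, even though only one bin is of interest.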

【Comments】:
