Compute sum of array values in parallel with Metal and Swift

【Posted】: 2016-11-05 00:00:41

【Question】:

I am trying to compute the sum of a large array in parallel with Metal and Swift.

Is there any good way to do this?

My plan is to divide my array into sub-arrays, compute the sum of each sub-array in parallel, and then, once the parallel computation is done, compute the sum of the sub-sums.

For example, if I have

array = [a0,....an] 

I would divide the array into sub-arrays:

array_1 = [a_0,...a_i],
array_2 = [a_i+1,...a_2i],
....
array_n/i = [a_n-1, ... a_n]

The sums of these sub-arrays would be computed in parallel, and I would get

sum_1, sum_2, sum_3, ... sum_n/i

and at the end I would just compute the sum of the sub-sums.
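For reference, this is a CPU-only sketch of that plan, just to make the intent clear (it uses DispatchQueue.concurrentPerform, and the chunk size of 10 000 is an arbitrary choice); what I actually want is to do the chunk sums on the GPU with Metal:

import Foundation

// CPU-only sketch of the plan: split the array into chunks, sum each chunk
// concurrently, then sum the partial sums. The chunk size is an assumption.
func chunkedParallelSum(_ values: [Float], chunkSize: Int = 10_000) -> Float {
    let chunkCount = (values.count + chunkSize - 1) / chunkSize
    var partialSums = [Float](repeating: 0, count: chunkCount)
    partialSums.withUnsafeMutableBufferPointer { partials in
        values.withUnsafeBufferPointer { buffer in
            // One iteration per chunk, run concurrently:
            DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
                let start = chunk * chunkSize
                let end = min(start + chunkSize, buffer.count)
                var sum: Float = 0
                for i in start..<end {
                    sum += buffer[i]
                }
                partials[chunk] = sum // each chunk writes only its own slot
            }
        }
    }
    // Final pass: sum of the sub-sums.
    return partialSums.reduce(0, +)
}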

I have created an application that runs a Metal shader, but there are some things I do not quite understand.

        var array:[[Float]] = [[1,2,3], [4,5,6], [7,8,9]]

        // get device
        let device: MTLDevice! = MTLCreateSystemDefaultDevice()

        // get library
        let defaultLibrary:MTLLibrary! = device.newDefaultLibrary()

        // queue
        let commandQueue:MTLCommandQueue! = device.newCommandQueue()

        // function
        let kernerFunction: MTLFunction! = defaultLibrary.newFunctionWithName("calculateSum")

        // pipeline with function
        let pipelineState: MTLComputePipelineState! = try device.newComputePipelineStateWithFunction(kernerFunction)

        // buffer for function
        let commandBuffer:MTLCommandBuffer! = commandQueue.commandBuffer()

        // encode function
        let commandEncoder:MTLComputeCommandEncoder = commandBuffer.computeCommandEncoder()

        // add function to encode
        commandEncoder.setComputePipelineState(pipelineState)

        // options
        let resourceOption = MTLResourceOptions()

        let arrayBiteLength = array.count * array[0].count * sizeofValue(array[0][0])

        let arrayBuffer = device.newBufferWithBytes(&array, length: arrayBiteLength, options: resourceOption)

        commandEncoder.setBuffer(arrayBuffer, offset: 0, atIndex: 0)

        var result:[Float] = [0,0,0]

        let resultBiteLenght = sizeofValue(result[0])

        let resultBuffer = device.newBufferWithBytes(&result, length: resultBiteLenght, options: resourceOption)

        commandEncoder.setBuffer(resultBuffer, offset: 0, atIndex: 1)

        let threadGroupSize = MTLSize(width: 1, height: 1, depth: 1)

        let threadGroups = MTLSize(width: (array.count), height: 1, depth: 1)

        commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupSize)

        commandEncoder.endEncoding()

        commandBuffer.commit()

        commandBuffer.waitUntilCompleted()

        let data = NSData(bytesNoCopy: resultBuffer.contents(), length: sizeof(Float), freeWhenDone: false)

        data.getBytes(&result, length: result.count * sizeof(Float))

        print(result)

That is my Swift code,

and my shader is:

kernel void calculateSum(const device float *inFloat [[buffer(0)]],
                     device float *result [[buffer(1)]],
                     uint id [[ thread_position_in_grid ]]) {

    float * f = inFloat[id];
    float sum = 0;
    for (int i = 0 ; i < 3 ; ++i) {
        sum = sum + f[i];
    }

    result = sum;
}

I don't know how to define inFloat as an array of arrays. I don't know exactly what threadGroupSize and threadGroups are. I don't know what device and uint in the shader arguments are.

Is this the right approach?

【Comments】:

Why did you choose Metal?

First, to see how it is done; second, because a GPU can run a lot of computations in parallel, while a CPU only has 2 or 4 cores.

How big is your "large array"? Do you plan to run this on macOS or iOS?

The array will be 200 000 000 or 2 mil elements. I will run it on iOS.

Sorry, typo: 200 000 or 2 000 000.

【Answer 1】:

I took the time to create a complete example that solves this problem with Metal. The explanation is in the comments:

let count = 10_000_000
let elementsPerSum = 10_000

// Data type, has to be the same as in the shader
typealias DataType = CInt

let device = MTLCreateSystemDefaultDevice()!
let library = self.library(device: device)
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)

// Our data, randomly generated:
var data = (0..<count).map { _ in DataType(arc4random_uniform(100)) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of individual results = count / elementsPerSum (rounded up):
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum

// Our data in a buffer (copied):
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)

let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!

encoder.setComputePipelineState(pipeline)

encoder.setBuffer(dataBuffer, offset: 0, index: 0)

encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)

// We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)

// Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)

encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()

var start, end : UInt64
var result : DataType = 0

start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
    result += elem
}

end = mach_absolute_time()

print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
result = 0

start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
    for elem in buffer {
        result += elem
    }
}
end = mach_absolute_time()

print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")

I tested it using my Mac, but it should work fine on iOS as well.

Output:

Metal result: 494936505, time: 0.024611456
CPU result: 494936505, time: 0.163341018

The Metal version is about 7 times faster. I'm sure you could get even more speed out of it if you implemented something like divide and conquer with a cutoff, or similar.
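One possible reading of that suggestion, as a rough sketch rather than part of the tested code above: feed the partial results back through the same parsum pipeline until few enough values remain, and only then finish on the CPU. It assumes the device, queue, pipeline, elementsPerSum and DataType definitions from the code above, and the cutoff of 1 000 is an arbitrary choice:

// Hedged sketch of a divide-and-conquer reduction with a cutoff: repeatedly run
// the parsum kernel on its own output, then sum the remaining values on the CPU.
func reduceWithCutoff(_ inputBuffer: MTLBuffer, count: Int, cutoff: Int = 1_000) -> DataType {
    var currentBuffer = inputBuffer
    var currentCount = count

    while currentCount > cutoff {
        let outputCount = (currentCount + elementsPerSum - 1) / elementsPerSum
        let outputBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * outputCount, options: [])!
        var n = CUnsignedInt(currentCount)
        var per = CUnsignedInt(elementsPerSum)

        let cmds = queue.makeCommandBuffer()!
        let encoder = cmds.makeComputeCommandEncoder()!
        encoder.setComputePipelineState(pipeline)
        encoder.setBuffer(currentBuffer, offset: 0, index: 0)
        encoder.setBytes(&n, length: MemoryLayout<CUnsignedInt>.size, index: 1)
        encoder.setBuffer(outputBuffer, offset: 0, index: 2)
        encoder.setBytes(&per, length: MemoryLayout<CUnsignedInt>.size, index: 3)

        let width = pipeline.threadExecutionWidth
        let groups = MTLSize(width: (outputCount + width - 1) / width, height: 1, depth: 1)
        encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: MTLSize(width: width, height: 1, depth: 1))
        encoder.endEncoding()
        cmds.commit()
        cmds.waitUntilCompleted()

        currentBuffer = outputBuffer
        currentCount = outputCount
    }

    // Below the cutoff the serial CPU loop is cheap enough.
    let pointer = currentBuffer.contents().bindMemory(to: DataType.self, capacity: currentCount)
    return UnsafeBufferPointer(start: pointer, count: currentCount).reduce(0, +)
}

Whether this actually beats the single pass depends on the data size and on the cost of the extra command buffers, so it would need measuring.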

【Comments】:

Thanks, this is really great. Just one question. If I understand this shader correctly, the sub-sum for each resultIndex will be computed 1000 times (with the same result)? Or do I misunderstand how the shader gets called sequentially?

@MarkoZadravec 10,000 is the number of elements each individual task sums. That means if the data size is 1,000,000, there will be 100 tasks. The results array contains the results of all 100 task sums. The shader only knows its task number (resultIndex) and how many items it has to sum up (elementsPerSum); the task number tells it where in the results buffer to store the sum, and multiplying those two numbers gives the first of the 10,000 elements it has to sum. I know this isn't the best explanation, maybe you can read some articles to learn more.

No, I understand the arithmetic for getting the start and end positions. (I mistook 100 for 10,000.) My question is: each block (thread) has a size of 10,000. Inside the block I compute the start and end positions and compute the sum. But is the shader called only once per block, or once for each element in the block? Because in that case we would compute a sub-sum 10,000 times (the result would always be the same, since we compute the same start and end positions). If that is true, wouldn't it be better to use an array of integer arrays instead of a flat integer array?

@MarkoZadravec The shader is called once per sub-sum/block, with the corresponding index. This happens in the dispatch method, which dispatches a total of threadgroups * threadsPerThreadgroup threads and assigns each thread a threadgroup index and a thread_in_threadgroup index. An array of arrays is the same as a single array, because memory is simply linear, not two-dimensional. Also, in Swift an array is just a reference, so that wouldn't work at all.

@MarkoZadravec DataType has to be the same type the shader uses => if you change it to e.g. CInt, you also need to change it to int in the shader. Remember to always use the Swift equivalents with the C prefix, because Metal uses C data types. I updated my answer with CInt and fixed the types, see the changes.

【Answer 2】:

The accepted answer is annoyingly missing the kernel that was written for it. The source is here, but below is the complete program and shader that can be run as a Swift command-line application.

/*
 * Command line Metal Compute Shader for data processing
 */

import Metal
import Foundation
//------------------------------------------------------------------------------
let count = 10_000_000
let elementsPerSum = 10_000
//------------------------------------------------------------------------------
typealias DataType = CInt // Data type, has to be the same as in the shader
//------------------------------------------------------------------------------
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)
//------------------------------------------------------------------------------
// Our data, randomly generated:
var data = (0..<count).map { _ in DataType(arc4random_uniform(100)) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of individual results = count / elementsPerSum (rounded up):
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum
//------------------------------------------------------------------------------
// Our data in a buffer (copied):
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)
//------------------------------------------------------------------------------
let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!
//------------------------------------------------------------------------------
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
//------------------------------------------------------------------------------
// We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)

// Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
//------------------------------------------------------------------------------
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
//------------------------------------------------------------------------------
var start, end : UInt64
var result : DataType = 0
//------------------------------------------------------------------------------
start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
    result += elem
}

end = mach_absolute_time()
//------------------------------------------------------------------------------
print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
result = 0

start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
    for elem in buffer {
        result += elem
    }
}
end = mach_absolute_time()

print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
#include <metal_stdlib>
using namespace metal;

typedef unsigned int uint;
typedef int DataType;

kernel void parsum(const device DataType* data [[ buffer(0) ]],
                   const device uint& dataLength [[ buffer(1) ]],
                   device DataType* sums [[ buffer(2) ]],
                   const device uint& elementsPerSum [[ buffer(3) ]],

                   const uint tgPos [[ threadgroup_position_in_grid ]],
                   const uint tPerTg [[ threads_per_threadgroup ]],
                   const uint tPos [[ thread_position_in_threadgroup ]]) {

    uint resultIndex = tgPos * tPerTg + tPos;

    uint dataIndex = resultIndex * elementsPerSum; // Where the summation should begin
    uint endIndex = dataIndex + elementsPerSum < dataLength ? dataIndex + elementsPerSum : dataLength; // The index where summation should end

    for (; dataIndex < endIndex; dataIndex++)
        sums[resultIndex] += data[dataIndex];
}

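A side note that is an assumption rather than something stated in the answer: a bare command-line tool has no app bundle, so makeDefaultLibrary() may return nil there. One workaround is to compile the shader source at run time (the file name below is hypothetical):

// Hedged sketch: compile the Metal source at run time instead of loading a
// precompiled default library from an app bundle.
let shaderSource = try! String(contentsOfFile: "parsum.metal", encoding: .utf8) // hypothetical path
let runtimeLibrary = try! device.makeLibrary(source: shaderSource, options: nil)
let parsumFromSource = runtimeLibrary.makeFunction(name: "parsum")!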
【Comments】:

【Answer 3】:

I have been running this app. On a GT 740 (384 cores) versus an i7-4790 with a multithreaded vector-sum implementation, these are my numbers:

Metal lap time: 19.959092
cpu MT lap time: 4.353881

That is a ratio of 5/1 in favour of the CPU, so unless you have a powerful GPU, using shaders is not worth it.

I have also been testing the same code on an i7-3610QM with its integrated Intel HD 4000 GPU, and surprisingly the results are much better for Metal: 2/1

Edited: after tuning the thread parameters I finally improved the GPU performance; it is now up to 16x the CPU.
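The answer does not say which thread parameters were tuned; as a rough sketch, one common way is to size the dispatch from the pipeline's own limits instead of hard-coding it (this reuses pipeline, resultsCount and encoder from the answers above, and the factor of 8 is an arbitrary guess):

// Hedged sketch: derive threadgroup sizes from the pipeline's reported limits.
let simdWidth = pipeline.threadExecutionWidth                 // e.g. 32
let maxPerGroup = pipeline.maxTotalThreadsPerThreadgroup      // e.g. 512 or 1024
let groupSize = MTLSize(width: min(maxPerGroup, simdWidth * 8), height: 1, depth: 1) // keep it a multiple of simdWidth
let groupCount = MTLSize(width: (resultsCount + groupSize.width - 1) / groupSize.width, height: 1, depth: 1)
encoder.dispatchThreadgroups(groupCount, threadsPerThreadgroup: groupSize)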

【Comments】:

Could you post an updated solution that uses height and depth for matrix computations?
