Metal 命令缓冲区内部错误：啥是内部错误（IOAF 代码 2067）？

Posted 2023-03-16

技术标签:

【中文标题】Metal 命令缓冲区内部错误：啥是内部错误（IOAF 代码 2067）？【英文标题】：Metal Command Buffer Internal Error: What is Internal Error (IOAF code 2067)?Metal 命令缓冲区内部错误：什么是内部错误（IOAF 代码 2067）？ 【发布时间】：2021-08-24 22:59:37 【问题描述】：

尝试运行计算内核会导致以下消息：

Execution of the command buffer was aborted due to an error during execution. Internal Error (IOAF code 2067)

为了获得更具体的信息，我查询命令编码器的用户信息并设法提取更多详细信息。我按照this video 的说明生成了以下消息：

[Metal Diagnostics] __message__: MTLCommandBuffer execution failed: The commands 
associated with the encoder were affected by an error, which may or may not have been 
caused by the commands themselves, and failed to execute in full __:::__ 
__delegate_identifier__: GPUToolsDiagnostics

API Validation 和 Shader Validation 触发的断点导致记录堆栈帧 - 而不是 GPU 回溯。断点除了上面的消息外没有任何新的信息。

我在文档中找不到提到的 IOAF 代码的任何参考。打印的附加信息没有显示任何帮助。内核非常不同，我推测这可能会导致 GPU 花费太多时间来完成。这可能是罪魁祸首，但除了直觉之外，我没有任何支持。

这是组的线程设置：

let threadExecutionWidth = pipeline.threadExecutionWidth

let threadgroupsPerGrid = MTLSize(width: (Int(pixelCount) + threadExecutionWidth - 1) / threadExecutionWidth, height: 1, depth: 1)

let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: 1, depth: 1)

commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)

GPU 命令正在提交并等待完成：

commandEncoder.endEncoding()

commandBuffer.commit()
commandBuffer.waitUntilCompleted()

这是我的完整应用程序端代码：

import Metal
import Foundation
import simd

typealias Float4 = SIMD4<Float>

struct SimpleFileWriter 
  var fileHandle: FileHandle
  
  init(filePath: String, append: Bool = false) 
    if !FileManager.default.fileExists(atPath: filePath) 
      FileManager.default.createFile(atPath: filePath, contents: nil, attributes: nil)
    
    
    fileHandle = FileHandle(forWritingAtPath: filePath)!
    if !append 
      fileHandle.truncateFile(atOffset: 0)
    
  
  
  func write(content: String) 
    fileHandle.seekToEndOfFile()
    guard let data = content.data(using: String.Encoding.ascii) else 
      fatalError("Could not convert \(content) to ascii data!")
    
    fileHandle.write(data)
  


var imageWidth = 480
var imageHeight = 270
var sampleCount = 16
var bounceCount = 3 

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeDefaultLibrary(bundle: Bundle.module)
let primaryRayFunc = library.makeFunction(name: "ray_trace")!
let pipeline = try! device.makeComputePipelineState(function: primaryRayFunc)

var pixelData: [Float4] = (0..<(imageWidth * imageHeight)).map _ in Float4(0, 0, 0, 0)
var pixelCount = UInt(pixelData.count)

let pixelDataBuffer = device.makeBuffer(bytes: &pixelData, length: Int(pixelCount) * MemoryLayout<Float4>.stride, options: [])!
let pixelDataMirrorPointer = pixelDataBuffer.contents().bindMemory(to: Float4.self, capacity: Int(pixelCount))
let pixelDataMirrorBuffer = UnsafeBufferPointer(start: pixelDataMirrorPointer, count: Int(pixelCount))

let commandQueue = device.makeCommandQueue()!
let commandBufferDescriptor = MTLCommandBufferDescriptor()
commandBufferDescriptor.errorOptions = MTLCommandBufferErrorOption.encoderExecutionStatus
let commandBuffer = commandQueue.makeCommandBuffer(descriptor: commandBufferDescriptor)!
let commandEncoder = commandBuffer.makeComputeCommandEncoder()!

commandEncoder.setComputePipelineState(pipeline)
commandEncoder.setBuffer(pixelDataBuffer, offset: 0, index: 0)
commandEncoder.setBytes(&pixelCount, length: MemoryLayout<Int>.stride, index: 1)
commandEncoder.setBytes(&imageWidth, length: MemoryLayout<Int>.stride, index: 2)
commandEncoder.setBytes(&imageHeight, length: MemoryLayout<Int>.stride, index: 3)
commandEncoder.setBytes(&sampleCount, length: MemoryLayout<Int>.stride, index: 4)
commandEncoder.setBytes(&bounceCount, length: MemoryLayout<Int>.stride, index: 5)


// We have to calculate the sum `pixelCount` times
// => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up)
// because each threadgroup will process `threadExecutionWidth` threads
let threadExecutionWidth = pipeline.threadExecutionWidth;
let threadgroupsPerGrid = MTLSize(width: (Int(pixelCount) + threadExecutionWidth - 1) / threadExecutionWidth, height: 1, depth: 1)
// Here we set that each threadgroup should process `threadExecutionWidth` threads
// the only important thing for performance is that this number is a multiple of
// `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: 1, depth: 1)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()

commandBuffer.commit()
commandBuffer.waitUntilCompleted()

if let error = commandBuffer.error as NSError? 
  if let encoderInfo = error.userInfo[MTLCommandBufferEncoderInfoErrorKey] as? [MTLCommandBufferEncoderInfo] 
    for info in encoderInfo 
      print(info.label + info.debugSignposts.joined())
    
  


let sfw = SimpleFileWriter(filePath: "/Users/pprovins/Desktop/render.ppm")
sfw.write(content: "P3\n")
sfw.write(content: "\(imageWidth) \(imageHeight)\n")
sfw.write(content: "255\n")

for pixel in pixelDataMirrorBuffer 
  sfw.write(content: "\(UInt8(pixel.x * 255)) \(UInt8(pixel.y * 255)) \(UInt8(pixel.z * 255)) ")


sfw.write(content: "\n")

此外，这里是正在运行的着色器。为简洁起见，我没有包含所有函数定义：

kernel void ray_trace(device float4 *result [[ buffer(0) ]],
                        const device uint& dataLength [[ buffer(1) ]],
                        const device int& imageWidth [[ buffer(2) ]],
                        const device int& imageHeight [[ buffer(3) ]],
                        const device int& samplesPerPixel [[ buffer(4) ]],
                        const device int& rayBounces [[ buffer (5)]],
                        const uint index [[thread_position_in_grid]]) 
  
  if (index >= dataLength) 
    return;
  
  
  const float3 origin = float3(0.0);
  const float aspect = float(imageWidth) / float(imageHeight);
  const float3 vph = float3(0.0, 2.0, 0.0);
  const float3 vpw = float3(2.0 * aspect, 0.0, 0.0);
  const float3 llc = float3(-(vph / 2.0) - (vpw / 2.0) - float3(0.0, 0.0, 1.0));
  
  float3 accumulatedColor = float3(0.0);
  thread float seed = getSeed(index, index % imageWidth, index / imageWidth);
  
  float row = float(index / imageWidth);
  float col = float(index % imageWidth);
  
  for (int aai = 0; aai < samplesPerPixel; ++aai) 
    float ranX = fract(rand(seed));
    float ranY = fract(rand(seed));
    float u = (col + ranX) / float(imageWidth - 1);
    float v = 1.0 - (row + ranY) / float(imageHeight - 1);
    Ray r(origin, llc + u * vpw + v * vph - origin);
    
    float3 color = float3(0.0);
    HitRecord hr = 0.0, 0.0, false;
    
    float attenuation = 1.0;
    for (int bounceIndex = 0; bounceIndex < rayBounces; ++bounceIndex) 
      testForHit(sceneDistance, r, hr);

      if (hr.h) 
        float3 target = hr.p + hr.n + random_f3_in_unit_sphere(seed);
        attenuation *= 0.5;
        r = Ray(hr.p, target - hr.p);
       else 
        color = default_atmosphere_color(r) * attenuation;
        break;
      
    

    accumulatedColor += color / samplesPerPixel;
  

  result[index] = float4(sqrt(accumulatedColor), 1.0);

奇怪的是，它偶尔会运行。将样本数更改为 16 或以上将始终导致提及 IOAF 代码。少于 16 个样本，代码将运行约 25% 的时间。样本越多，产生错误码的可能性就越大。

有没有关于 IOAF 代码 2067 的附加信息？

【问题讨论】：

您是否尝试过在描述符中使用encoderExecutionStatus 选项创建命令缓冲区？或者也许使用着色器验证？另外，我认为如果您的命令缓冲区是另一个命令缓冲区故障的受害者，则可能会发生此错误，因此请检查您提交的其他命令缓冲区上的完成处理程序。 commandBufferDescriptor.errorOptions = MTLCommandBufferErrorOption.encoderExecutionStatus 让 commandBuffer = commandQueue.makeCommandBuffer(descriptor: commandBufferDescriptor)！打印的消息是相同的...当时没有其他命令缓冲区在使用中。所以它是一个单一的调度运行，没有别的，它崩溃了？应该有一种方法可以检索编码器的面包屑，这里有一个关于错误处理的讨论链接：developer.apple.com/videos/play/wwdc2020/10616 此外，您可能需要提供更多上下文，以便更容易理解发生了什么问题单次派送，仅此而已。我已按照发布的视频中的说明得出帖子中的“更具体的错误”。启用断点不会显示任何新内容，Metal Diagnostics Backtraces 不会指向着色器中的特定位置，并且触发的断点不会提供 GPU 回溯，而是提供记录的堆栈帧。我将更新我的帖子，以便未来的访问者知道我已经采取的调试步骤。我将添加代码来完整演示我的过程。 【参考方案1】：

无法使用 Metal API + Shader Validation 确定错误代码。

通过测试内核的各个部分，将特定错误缩小到导致 GPU 挂起的 while 循环。

问题基本上可以归结为如下代码：

while(true) 
 // ad infinitum

或者，在上述代码中调用random_f3_in_unit_sphere(seed)：

while(randNum(seed) < threshold) 
  // the while loop is not "bounded"
  // in any sense. Whoops.
  ++seed;

【讨论】：

以上是关于Metal 命令缓冲区内部错误：啥是内部错误（IOAF 代码 2067）？的主要内容，如果未能解决你的问题，请参考以下文章