Metal much slower compared to OpenGL while rendering small textures on a large texture

Posted 2023-03-08


[Asked]: 2018-07-14 03:12:49

[Question]:

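I am trying to move my project from OpenGL to Metal on iOS, but I seem to have hit a performance wall. The task is simple...

I have a large texture (more than 3000x3000 pixels). On every touchesMoved event I need to draw a few hundred small textures (say 124x124) onto it, with a specific blending function enabled. It is basically a paint brush. The large texture is then displayed. That is roughly the whole task.

On OpenGL it runs very fast; I get about 60fps. When I port the same code to Metal, I can only manage 15fps.

I have created two minimal sample projects to demonstrate the problem. Here are the projects (both OpenGL and Metal)...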

https://drive.google.com/file/d/12MPt1nMzE2UL_s4oXEUoTCXYiTz42r4b/view?usp=sharing

This is roughly what I do in OpenGL...

    - (void) renderBrush:(GLuint)brush on:(GLuint)fbo ofSize:(CGSize)size at:(CGPoint)point {
        // Texture coordinates covering the whole brush texture
        GLfloat brushCoordinates[] = {
            0.0f, 0.0f,
            1.0f, 0.0f,
            0.0f, 1.0f,
            1.0f, 1.0f,
        };

        // Full-target quad; narrowed to the brush rectangle below
        GLfloat imageVertices[] = {
            -1.0f, -1.0f,
             1.0f, -1.0f,
            -1.0f,  1.0f,
             1.0f,  1.0f,
        };

        int brushSize = 124;

        CGRect rect = CGRectMake(point.x - brushSize/2, point.y - brushSize/2, brushSize, brushSize);

        // Normalize the rectangle to the size of the target
        rect.origin.x /= size.width;
        rect.origin.y /= size.height;
        rect.size.width /= size.width;
        rect.size.height /= size.height;

        [self convertImageVertices:imageVertices toProjectionRect:rect onImageOfSize:size];

        // Remember the current framebuffer so it can be restored afterwards
        GLint currentFBO;
        glGetIntegerv(GL_FRAMEBUFFER_BINDING, &currentFBO);

        [_Program use];

        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glViewport(0, 0, (int)size.width, (int)size.height);

        glActiveTexture(GL_TEXTURE2);
        glBindTexture(GL_TEXTURE_2D, brush);
        glUniform1i(brushTextureLocation, 2);

        glVertexAttribPointer(positionLocation, 2, GL_FLOAT, 0, 0, imageVertices);
        glVertexAttribPointer(brushCoordinateLocation, 2, GL_FLOAT, 0, 0, brushCoordinates);

        glEnable(GL_BLEND);
        glBlendEquation(GL_FUNC_ADD);
        glBlendFuncSeparate(GL_ONE, GL_ZERO, GL_ONE, GL_ONE);

        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);

        glDisable(GL_BLEND);

        glActiveTexture(GL_TEXTURE2);
        glBindTexture(GL_TEXTURE_2D, 0);

        glBindFramebuffer(GL_FRAMEBUFFER, currentFBO);
    }

I run this code in a loop (around 200-500 iterations) on every touch event. It runs very fast.

This is how I ported the code to Metal...

    - (void) renderBrush:(id<MTLTexture>)brush onTarget:(id<MTLTexture>)target at:(CGPoint)point withCommandBuffer:(id<MTLCommandBuffer>)commandBuffer {

        int brushSize = 124;

        CGRect rect = CGRectMake(point.x - brushSize/2, point.y - brushSize/2, brushSize, brushSize);

        rect.origin.x /= target.width;
        rect.origin.y /= target.height;
        rect.size.width /= target.width;
        rect.size.height /= target.height;

        Float32 imageVertices[8];
        // Calculate the vertices (basically the rectangle that we need to draw) on the target texture.
        // We are not drawing on the entire target texture, only on a square around the point.
        [self composeImageVertices:imageVertices toProjectionRect:rect onImageOfSize:CGSizeMake(target.width, target.height)];

        // We use a different vertexBuffer per pass, because this runs in a loop and subsequent calls would overwrite
        // the values. Other buffers also get overwritten, but that is ok for now; we only need to demonstrate the performance.
        id<MTLBuffer> vertexBuffer = [_vertexArray lastObject];

        memcpy([vertexBuffer contents], imageVertices, 8 * sizeof(Float32));

        // A new render command encoder (a new render pass) is created for every quad
        id<MTLRenderCommandEncoder> commandEncoder = [commandBuffer renderCommandEncoderWithDescriptor:mRenderPassDescriptor];
        commandEncoder.label = @"DrawCE";

        [commandEncoder setRenderPipelineState:mPipelineState];

        [commandEncoder setVertexBuffer:vertexBuffer offset:0 atIndex:0];
        [commandEncoder setVertexBuffer:mBrushTextureBuffer offset:0 atIndex:1];

        [commandEncoder setFragmentTexture:brush atIndex:0];
        [commandEncoder setFragmentSamplerState:mSampleState atIndex:0];

        [commandEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];
        [commandEncoder endEncoding];
    }

I then run this code in a loop, with one MTLCommandBuffer per touch event, like this...

    id<MTLCommandBuffer> commandBuffer = [MetalContext.defaultContext.commandQueue commandBuffer];
    commandBuffer.label = @"DrawCB";

    // Limit the number of command buffers in flight
    dispatch_semaphore_wait(_inFlightSemaphore, DISPATCH_TIME_FOREVER);

    mRenderPassDescriptor.colorAttachments[0].texture = target;

    __block dispatch_semaphore_t block_sema = _inFlightSemaphore;
    [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
        dispatch_semaphore_signal(block_sema);
    }];

    _vertexArray = [[NSMutableArray alloc] init];
    for (int i = 0; i < strokes; i++) {
        // A new vertex buffer is allocated for every stroke
        id<MTLBuffer> vertexBuffer = [MetalContext.defaultContext.device newBufferWithLength:8 * sizeof(Float32) options:0];
        [_vertexArray addObject:vertexBuffer];

        id<MTLTexture> brush = [_brushes objectAtIndex:rand()%_brushes.count];
        [self renderBrush:brush onTarget:target at:CGPointMake(x, y) withCommandBuffer:commandBuffer];
        x += deltaX;
        y += deltaY;
    }

    [commandBuffer commit];

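In the sample code that I have attached, I replaced the touch events with a timer loop to keep things simple.

On an iPhone 7 Plus, I get 60fps with OpenGL and 15fps with Metal. Am I possibly doing something wrong here?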


[Answer 1]:

Remove all redundancy:

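- Don't create buffers at render time. Allocate sufficient buffers during initialization.
- Don't create a command encoder for each quad.
- Use one large vertex buffer with a different (properly aligned) offset for each quad. Use -setVertexBufferOffset:atIndex: to set just the offset as needed, without changing the buffer.
- composeImageVertices:... can write directly into the vertex buffer with an appropriate cast, avoiding the memcpy.
- Depending on what composeImageVertices:... actually does, and if deltaX and deltaY are constants, you may be able to set up the vertex buffer just once. The vertex shader can transform the vertices as necessary. You would pass in the appropriate data as uniforms (either the destination point and render target size, or even a transformation matrix).
- Assuming they are the same each time, don't set mPipelineState, mBrushTextureBuffer, and mSampleState each time.
- If any of the quads share the same brush texture, group them together and issue a single draw command to draw them all. This may require switching to triangle primitives rather than triangle strip primitives. However, if you do an indexed draw, you can use the primitive restart sentinel to draw multiple triangle strips in a single draw command.
- You could even do multiple brushes in a single draw command, provided the count doesn't exceed the allowed number of textures (31). Pass all the brush textures to the fragment shader, which can receive them as a texture array. The vertex data would include the brush index, the vertex shader would pass it forward, and the fragment shader would use it to look up the texture to sample from in the array.
- You can use instanced drawing to draw everything in a single command. Draw strokes instances of a single quad. In the vertex shader, transform the position based on the instance ID. You would have to pass deltaX and deltaY in as uniform data. The brush indexes can also live in a single buffer that is passed in, and the shader can look up the brush index in it by instance ID.
- Have you considered using point primitives instead of quads? That would reduce the number of vertices and give Metal information it can use to optimize rasterization.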
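Several of the points above (buffers allocated once, a single encoder, state set once, instanced drawing) can be combined roughly as follows. This is only a sketch under assumptions, not the asker's actual code: `StrokeInstance`, `kMaxStrokes`, `MakeInstanceBuffer` and `EncodeStrokes` are hypothetical names, and it presumes a vertex shader that derives the four quad corners from the vertex ID and reads the stroke record by instance ID.

```objc
#import <Metal/Metal.h>
#import <simd/simd.h>

// Hypothetical per-stroke record. Its layout must match the struct that the
// (assumed) vertex shader reads from buffer index 0.
typedef struct {
    vector_float2 center;     // quad center in normalized device coordinates
    uint32_t      brushIndex; // which brush texture this stroke samples
} StrokeInstance;

// Hypothetical upper bound on strokes per touch event.
static const NSUInteger kMaxStrokes = 512;

// Call once during initialization, not while rendering.
static id<MTLBuffer> MakeInstanceBuffer(id<MTLDevice> device) {
    return [device newBufferWithLength:kMaxStrokes * sizeof(StrokeInstance)
                               options:MTLResourceStorageModeShared];
}

// Encodes all strokes of one touch event with one encoder and one draw call.
// Assumes the caller ensures the GPU is no longer reading instanceBuffer
// (for example with the in-flight semaphore from the question).
static void EncodeStrokes(id<MTLCommandBuffer> commandBuffer,
                          MTLRenderPassDescriptor *pass,
                          id<MTLRenderPipelineState> pipeline,
                          id<MTLSamplerState> sampler,
                          id<MTLBuffer> instanceBuffer,
                          NSArray<id<MTLTexture>> *brushes,
                          const StrokeInstance *strokes,
                          NSUInteger strokeCount) {
    if (strokeCount == 0) return;
    strokeCount = MIN(strokeCount, kMaxStrokes);

    // One copy for the whole batch instead of one buffer per quad.
    memcpy(instanceBuffer.contents, strokes, strokeCount * sizeof(StrokeInstance));

    id<MTLRenderCommandEncoder> encoder =
        [commandBuffer renderCommandEncoderWithDescriptor:pass];

    // Pipeline, buffer, sampler and textures are bound once per batch.
    [encoder setRenderPipelineState:pipeline];
    [encoder setVertexBuffer:instanceBuffer offset:0 atIndex:0];
    [encoder setFragmentSamplerState:sampler atIndex:0];
    for (NSUInteger i = 0; i < brushes.count; i++) {
        [encoder setFragmentTexture:brushes[i] atIndex:i];
    }

    // Four vertices per quad, one instance per stroke.
    [encoder drawPrimitives:MTLPrimitiveTypeTriangleStrip
                vertexStart:0
                vertexCount:4
              instanceCount:strokeCount];
    [encoder endEncoding];
}
```

Each render command encoder is a separate render pass, and on tile-based GPUs a pass loads and stores its attachments, so creating one encoder per quad over a 3000x3000 target is likely where much of the time went in the original port. Batching the strokes into one pass avoids that repeated load and store.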

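[Comments]:

Thanks Ken. I changed the sample code to use point primitives (all points in one pass) together with an array of brush textures (so far my brush texture count is well below 31), with the brush texture index passed through the vertex shader as you suggested. I now get 60 fps! The only thing I need to check is whether the blending result is the same as with the OpenGL code. In OpenGL I drew the quads one after another with a certain blend function. Here I am drawing all the quads at once with the same blend mode, but in no particular order.

I was testing with hard-coded coordinates for the brush texture in the fragment shader. But now, when I try to access the texture coordinates in the fragment shader while using point primitives, I can't find the Metal equivalent of "gl_PointCoord".

A field in the fragment shader's [[stage_in]] parameter that is annotated with [[point_coord]] gets the point coordinate. Or, just a standalone parameter so annotated.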
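As an illustration of the point-sprite variant discussed in these comments, here is a minimal sketch that compiles a pair of shaders from source at run time. It uses the standalone-parameter form of [[point_coord]]. The shader names, the `BrushPoint` layout, the brush count of 8 and the buffer indexes are all assumptions made for the example, not details taken from the asker's project.

```objc
#import <Metal/Metal.h>

// Metal Shading Language source for a hypothetical point-sprite brush pipeline.
static NSString *const kBrushShaderSource =
    @"#include <metal_stdlib>\n"
    @"using namespace metal;\n"
    @"\n"
    @"// Hypothetical per-point input: clip-space center plus brush index.\n"
    @"struct BrushPoint {\n"
    @"    float2 position;\n"
    @"    uint   brushIndex;\n"
    @"};\n"
    @"\n"
    @"struct VertexOut {\n"
    @"    float4 position   [[position]];\n"
    @"    float  pointSize  [[point_size]];\n"
    @"    uint   brushIndex [[flat]];\n"
    @"};\n"
    @"\n"
    @"vertex VertexOut brushVertex(const device BrushPoint *points [[buffer(0)]],\n"
    @"                             constant float &pointSize      [[buffer(1)]],\n"
    @"                             uint vid                       [[vertex_id]]) {\n"
    @"    VertexOut out;\n"
    @"    out.position   = float4(points[vid].position, 0.0, 1.0);\n"
    @"    out.pointSize  = pointSize;   // e.g. 124 for the brush in the question\n"
    @"    out.brushIndex = points[vid].brushIndex;\n"
    @"    return out;\n"
    @"}\n"
    @"\n"
    @"// pointCoord plays the role of gl_PointCoord: (0,0)..(1,1) across the point.\n"
    @"fragment float4 brushFragment(VertexOut in                        [[stage_in]],\n"
    @"                              float2 pointCoord                   [[point_coord]],\n"
    @"                              array<texture2d<float>, 8> brushes  [[texture(0)]],\n"
    @"                              sampler brushSampler                [[sampler(0)]]) {\n"
    @"    return brushes[in.brushIndex].sample(brushSampler, pointCoord);\n"
    @"}\n";

// Compiles the shaders above; returns nil and fills *error on failure.
static id<MTLLibrary> MakeBrushLibrary(id<MTLDevice> device, NSError **error) {
    return [device newLibraryWithSource:kBrushShaderSource options:nil error:error];
}
```

With a pipeline built from these two functions, all strokes of a touch event could be drawn with a single `drawPrimitives:MTLPrimitiveTypePoint vertexStart:0 vertexCount:strokeCount` call. In a real project the shaders would more commonly live in a .metal file compiled at build time; the run-time string is used here only to keep the example in one piece.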
