How to convert a model with grid_sample to TensorRT with INT8 quantization?

Posted 2023-02-16

Asked: 2021-09-13 11:52:49

Question:

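I am trying to convert a model that uses torch.nn.functional.grid_sample from PyTorch (1.9) to TensorRT (7) with INT8 quantization, going through ONNX (opset 11). Opset 11 does not support exporting grid_sample to ONNX, so I use ONNX GraphSurgeon together with an external GridSamplePlugin, as proposed here. With this setup the conversion to TensorRT succeeds both with and without INT8 quantization. The PyTorch model and the TRT model without INT8 quantization give nearly identical results (MSE on the order of e-10). With INT8 quantization, however, the MSE of the TensorRT model is much higher (185).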

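The grid_sample operator has two inputs: the input signal and the sampling grid. Both must be of the same type. GridSamplePlugin only implements handling for kFLOAT and kHALF. In my case, the X coordinates of the absolute sampling grid (before conversion to the relative coordinates that grid_sample requires) lie in [-d; W+d], and the Y coordinates lie in [-d; H+d]. The maximum value of W is 640 and the maximum value of H is 360, and the coordinates may take non-integer values within these ranges. For testing purposes I created a test model that contains only the grid_sample layer. In this case the TensorRT results with and without INT8 quantization are the same.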
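The conversion from absolute pixel coordinates to the relative coordinates mentioned above can be sketched in plain Python. It uses the same formula as modify_grid in the test code, 2 * x / (size - 1) - 1. The helper names to_relative and to_absolute and the margin value d = 8 are illustrative assumptions; they are not part of PyTorch, the plugin, or the original post.

```python
def to_relative(coord, size):
    """Map an absolute pixel coordinate to the relative range used by grid_sample.

    Same formula as modify_grid: 2 * x / max(size - 1, 1) - 1.
    Coordinates inside [0, size - 1] land in [-1, 1]; others fall outside it.
    """
    return 2.0 * coord / max(size - 1, 1) - 1.0


def to_absolute(rel, size):
    """Inverse of to_relative: map a relative coordinate back to pixels."""
    return (rel + 1.0) * max(size - 1, 1) / 2.0


if __name__ == "__main__":
    W = 640
    d = 8  # hypothetical margin; the post does not give a value for d
    for x in (-d, 0, 100.25, W - 1, W + d):
        print(f"x = {x:>7} -> relative {to_relative(x, W):+.4f}")
```

Coordinates in the margins (below 0 or above size - 1) map outside [-1, 1], so the range of the grid tensor is slightly wider than [-1, 1] when d > 0.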

Here is the code of the test model:

import torch
import numpy as np
import cv2

BATCH_SIZE = 1
WIDTH = 640
HEIGHT = 360

def calculate_grid(B, H, W, dtype, device='cuda'):
    # Absolute pixel coordinates, each of shape (H, W).
    xx = torch.arange(0, W, device=device).view(1, -1).repeat(H, 1).type(dtype)
    yy = torch.arange(0, H, device=device).view(-1, 1).repeat(1, W).type(dtype)
    # Shift X by a row-dependent, non-integer offset.
    xx = xx + yy * 0.25
    # Expand to (B, 1, H, W) and stack X and Y into a (B, 2, H, W) grid.
    xx = xx.view(1, 1, H, W).repeat(B, 1, 1, 1)
    yy = yy.view(1, 1, H, W).repeat(B, 1, 1, 1)
    return torch.cat((xx, yy), 1)

def modify_grid(vgrid, H, W):
    vgrid = torch.cat([
        torch.sub(2.0 * vgrid[:, :1, :, :].clone() / max(W - 1, 1), 1.0),
        torch.sub(2.0 * vgrid[:, 1:2, :, :].clone() / max(H - 1, 1), 1.0),
        vgrid[:, 2:, :, :]], dim=1)
    vgrid = vgrid.permute(0, 2, 3, 1)
    return vgrid

class GridSamplingBlock(torch.nn.Module):

    def __init__(self):
        super(GridSamplingBlock, self).__init__()

    def forward(self, input, vgrid):
        output = torch.nn.functional.grid_sample(input, vgrid)
        return output

if __name__ == '__main__':
    model = torch.nn.DataParallel(GridSamplingBlock())
    model.cuda()
    print("Reading inputs")
    img = cv2.imread("result/left_frame_rect_0373.png")
    img = cv2.resize(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), (WIDTH, HEIGHT))
    img_in = torch.from_numpy(img.astype(float)).view(1, 1, HEIGHT, WIDTH).cuda()
    vgrid = calculate_grid(BATCH_SIZE, HEIGHT, WIDTH, img_in.dtype)
    vgrid = modify_grid(vgrid, HEIGHT, WIDTH)
    np.save("result/grid", vgrid.cpu().detach().numpy())
    print("Getting output")
    with torch.no_grad():
        model.module.eval()
        img_out = model.module(img_in, vgrid)
        img = img_out.cpu().detach().numpy().squeeze()
        cv2.imwrite("result/grid_sample_test_output.png", img.astype(np.uint8))

The saved grid is used for both calibration and inference of the TensorRT model.

So the questions are:

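1. Is it valid to apply INT8 quantization to a function that has at least one indexing input (such as grid_sample)?
2. Wouldn't such quantization change the results significantly (for example, if INT8 quantization is applied to an input with the range [0..640))?
3. How does INT8 quantization work with a custom plugin if the plugin code only implements FP32 and FP16?
4. Are the results of the test network the same with and without INT8 quantization in TensorRT because the grid_sample inputs are actually the network inputs?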
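As a rough illustration of the second question, the sketch below simulates what happens to a coordinate if it is stored as INT8. It assumes symmetric per-tensor quantization with scale = amax / 127 and amax = 640; the dynamic range that a real calibrator selects may differ. The function name fake_quant_int8 is hypothetical and is not a TensorRT API.

```python
def fake_quant_int8(x, amax):
    """Quantize x to a symmetric INT8 grid and dequantize it again.

    Assumption: symmetric per-tensor quantization with scale = amax / 127.
    """
    scale = amax / 127.0
    q = round(x / scale)
    q = max(-127, min(127, q))  # clamp to the symmetric INT8 range
    return q * scale


if __name__ == "__main__":
    amax = 640.0  # assumed dynamic range for coordinates in [0, 640)
    print(f"step between representable coordinates: {amax / 127.0:.3f} px")
    for x in (0.0, 100.25, 319.5, 639.0):
        xq = fake_quant_int8(x, amax)
        print(f"x = {x:7.2f} -> {xq:7.2f} (error {xq - x:+.2f} px)")
```

Under these assumptions the representable coordinates are about 5 pixels apart, so sub-pixel offsets in the sampling grid cannot be preserved.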

My environment:

TensorRT version: 7
GPU type: NVIDIA GeForce GTX 1050 Ti
NVIDIA driver version: 470.63.01
CUDA version: 10.2.89
cuDNN version: 8.1.1
OS + version: Ubuntu 18.04
Python version (if applicable): 3.7
PyTorch version (if applicable): 1.9

Steps to reproduce:

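1. Run the test code to save the grid and get the Torch result. Any input image can be used for the test.
2. Build TensorRT OSS with the custom plugin according to this sample. The latest TRT OSS version requires some adaptation of GridSamplePlugin, so it is best to use the recommended TensorRT OSS version.
3. Create the ONNX model according to the code example.
4. Create the TensorRT engine with or without INT8 quantization and run inference. In my C++ code I use https://github.com/llohse/libnpy to read the grid.npy file.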

Answer 1:

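You can split the model into two parts, one before grid_sample and one after it, and apply INT8 quantization to each part separately. Running grid_sample itself in INT8 will greatly affect your model's performance. Note that splitting changes the structure of your network, so it may also change how the graph is optimized.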
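The effect described in this answer can be approximated without TensorRT. The sketch below is a simplified 1D stand-in for grid_sample (linear interpolation along one image row, coordinates in pixel units) and compares sampling with float coordinates against sampling with coordinates rounded to an INT8 grid. The quantization scheme (symmetric, scale = amax / 127), the synthetic sine row, and all function names are assumptions made for illustration; this is not the plugin's implementation.

```python
import math


def sample_linear(signal, x):
    """Linearly interpolate a 1D signal at fractional position x (clamped to the borders)."""
    x = max(0.0, min(len(signal) - 1.0, x))
    i0 = int(math.floor(x))
    i1 = min(i0 + 1, len(signal) - 1)
    t = x - i0
    return (1.0 - t) * signal[i0] + t * signal[i1]


def quantize_coord(x, amax):
    """Round a coordinate to the nearest symmetric INT8 level (assumed scheme)."""
    scale = amax / 127.0
    return max(-127, min(127, round(x / scale))) * scale


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


if __name__ == "__main__":
    width = 640
    # Synthetic image row with pixel values in [0, 255].
    row = [127.5 + 127.5 * math.sin(i / 10.0) for i in range(width)]
    # Sampling positions with a fractional shift of 0.25 px.
    coords = [i + 0.25 for i in range(width - 1)]

    ref = [sample_linear(row, x) for x in coords]
    quant = [sample_linear(row, quantize_coord(x, float(width))) for x in coords]
    print("MSE, float vs INT8-rounded coordinates:", mse(ref, quant))
```

Keeping the grid and the sampling step in FP32 or FP16, as the split suggested above does, avoids this coordinate rounding, while the layers before and after can still be quantized.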
