TensorRT FB

Posted by WhateverYoung


TensorRT

An inference (forward-pass) framework provided by NVIDIA as an API, primarily a C++ interface. It fully exploits the GPU's compute capability to improve throughput and latency.

Model conversion:
- Define the model by hand through the API: you must know the weight format of the training framework and the layout each TRT layer expects, then convert the trained weights and assign them to the TRT layers
- CaffeParser, imports Caffe models (see the sketch after this list)
- UffParser, imports UFF-format models
- OnnxParser, imports ONNX-format models
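
To make the parser route concrete, here is a minimal sketch of the CaffeParser path. This is a sketch only, assuming a TensorRT 4-era API; the file names deploy.prototxt / model.caffemodel and the output blob name "prob" are placeholders.

#include "NvInfer.h"
#include "NvCaffeParser.h"
using namespace nvinfer1;
using namespace nvcaffeparser1;

// Fill an already-created INetworkDefinition from Caffe files.
// The file names and the blob name "prob" are placeholders.
void importCaffeModel(INetworkDefinition& network)
{
    ICaffeParser* parser = createCaffeParser();
    const IBlobNameToTensor* blobToTensor =
        parser->parse("deploy.prototxt", "model.caffemodel", network, DataType::kFLOAT);
    network.markOutput(*blobToTensor->find("prob")); // mark the blob we want returned
    parser->destroy();
}

The other parsers follow the same shape: create the parser, let it populate the network definition, then mark the outputs.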

download and install

Just follow the install documentation; you may run into the following issues:
https://codeyarns.com/2015/07/31/pip-install-error-with-pycuda/
https://blog.csdn.net/xll_bit/article/details/78376320

developer guide

https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html

  • kinds of supported layers
  • key concepts: Network Definition, Tensor, Layer, DataType, Builder, Runtime,
    PlanFile, Engine, ExecutionContext
  • INT8 low-precision, high-performance computation
  • FP16 low-precision, high-performance computation
  • the main optimizations are graph redundancy elimination, layer fusion, lower precision, and exhaustive selection of the fastest kernel for the concrete parameters (a sketch of the precision switches follows this list)
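
For illustration, a rough sketch of the FP16/INT8 switches on a TensorRT 4-era IBuilder. setHalf2Mode is the TRT 4 name (later releases expose setFp16Mode), and `calibrator` stands in for a user-supplied IInt8Calibrator; both are assumptions, not prescriptions.

#include "NvInfer.h"
using namespace nvinfer1;

// Enable lower-precision paths; call before buildCudaEngine().
void enableLowPrecision(IBuilder& builder, IInt8Calibrator& calibrator)
{
    // FP16 only pays off on GPUs with fast fp16 hardware
    if (builder.platformHasFastFp16())
        builder.setHalf2Mode(true);   // TRT 4 API; later releases: setFp16Mode(true)

    // INT8 needs a calibrator feeding representative input batches
    if (builder.platformHasFastInt8())
    {
        builder.setInt8Mode(true);
        builder.setInt8Calibrator(&calibrator);
    }
}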

Workflow

Train -> fix batch size and precision, define the network -> optimize with TensorRT -> plan file (can be serialized to disk) -> validate and run inference

build phase and execute phase

IBuilder->createNetwork
INetworkDefinition->addInput, addLayer(s), markOutput
IBuilder->buildCudaEngine(INetworkDefinition)
ICudaEngine->createExecutionContext() / serialize()
IExecutionContext->bind buffers to the input and output tensors
IExecutionContext->enqueue or execute
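
A minimal sketch of the build phase behind those calls, assuming a TensorRT 4-era API; the input name "data", the 1x28x28 shape, and the batch/workspace sizes are placeholders.

#include <iostream>
#include "NvInfer.h"
using namespace nvinfer1;

// TensorRT requires an ILogger; this minimal one drops INFO-level messages.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;

// Build an engine for a hand-defined network; weights come from step one below.
ICudaEngine* buildEngine()
{
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    // Tensors carry only shape/type information at this point
    ITensor* data = network->addInput("data", DataType::kFLOAT, DimsCHW{1, 28, 28});
    // ... addConvolution / addFullyConnected / ... consuming `data`, then:
    // network->markOutput(*outputTensor);

    builder->setMaxBatchSize(1);           // batch size is fixed at build time
    builder->setMaxWorkspaceSize(1 << 20); // scratch memory TensorRT may use

    ICudaEngine* engine = builder->buildCudaEngine(*network);
    IHostMemory* plan = engine->serialize(); // plan->data()/plan->size() can go to disk
    (void)plan;
    return engine;
}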

IRuntime->deserializeCudaEngine()
ICudaEngine->createExecutionContext()
IExecutionContext->bind buffers to the input and output tensors
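
The corresponding runtime path, again as a sketch (passing nullptr for the plugin factory assumes the network has no custom layers):

#include "NvInfer.h"
using namespace nvinfer1;

// Recreate an engine from a plan blob previously written by engine->serialize().
ICudaEngine* loadEngine(ILogger& logger, const void* planData, size_t planSize)
{
    IRuntime* runtime = createInferRuntime(logger);
    return runtime->deserializeCudaEngine(planData, planSize, nullptr);
}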

Step one: prepare all the weights the network needs. Step two: define the network with the builder; at this stage tensors carry only shape information, much like a TensorFlow computation graph: you define the computation rules, and every tensor is mapped by name to the real memory buffer it will need at runtime. Up until execute/enqueue the network is only a definition, and at this point it can be persisted to disk. Step three: allocate memory and bind all input and output buffers to tensors, using names and binding indices, then run synchronously or asynchronously to get the computed outputs.
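
Putting step three into code, a hedged sketch of buffer binding and execution; the binding names "data" and "prob" and the element counts are placeholders for whatever the network actually defines.

#include <cuda_runtime_api.h>
#include "NvInfer.h"
using namespace nvinfer1;

// One inference pass: bind device buffers by name/index, then run on a stream.
void infer(ICudaEngine& engine, const float* hostInput, float* hostOutput,
           int batchSize, size_t inCount, size_t outCount)
{
    IExecutionContext* ctx = engine.createExecutionContext();

    // Binding indices are looked up by tensor name
    void* buffers[2];
    const int inIdx  = engine.getBindingIndex("data");
    const int outIdx = engine.getBindingIndex("prob");
    cudaMalloc(&buffers[inIdx],  batchSize * inCount  * sizeof(float));
    cudaMalloc(&buffers[outIdx], batchSize * outCount * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(buffers[inIdx], hostInput,
                    batchSize * inCount * sizeof(float), cudaMemcpyHostToDevice, stream);
    ctx->enqueue(batchSize, buffers, stream, nullptr); // async; execute() is the sync twin
    cudaMemcpyAsync(hostOutput, buffers[outIdx],
                    batchSize * outCount * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFree(buffers[inIdx]);
    cudaFree(buffers[outIdx]);
    ctx->destroy();
}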

samples

  • sampleMNIST, CaffeParser
  • sampleUFFMNIST, UffParser
  • sampleMNISTAPI, API-built network
  • sampleGoogleNet, CaffeParser and FP16 and profiling
  • sampleCharRNN, API-built network, weights loaded from files
  • sampleINT8, weight conversion, how to use INT8
  • samplePlugin, CaffeParser + plugin
  • sampleFasterRCNN, TODO
  • sampleONNX, ONNXParser
  • sampleMLP, TODO

sampleNMT

  • ReadMe.txt helps a lot.
  • Supports a unidirectional encoder and a unidirectional decoder with Luong attention.
  • Bi-directional encoders and other attention variants such as scaled_luong are not supported.

First, train with TensorFlow-nmt and use chpt_to_bin.py to convert the weights from TensorFlow into TensorRT's bin files, which the sample's C++ code parses.
Then, run inference with TensorRT.
You can modify sampleNMT.cpp's global variables to target other vocabularies.

  • params: beam size, batch size…

TODO

  • How to use plugins …
  • CaffeParser/UffParser/OnnxParser …
  • RaggedSoftmax, Scale, RNNv2 layers …
  • sampleNMT beam-search algorithms …
  • run and study each sample …

reference

https://docs.nvidia.com/deeplearning/sdk/index.html#deep-learning-sdk
https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/topics/index.html
https://developer.nvidia.com/tensorrt
https://developer.nvidia.com/nvidia-tensorrt-4x-download
