Deploying a BERT Model with Triton Server

This post briefly describes how to deploy a BERT model with Triton. It is based mainly on NVIDIA/DeepLearningExamples.




Download the SQuAD dataset:

bash ./squad_download.sh


Download the BERT-large SQuAD checkpoint from NGC:

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/bert_large_pyt_amp_ckpt_squad_qa1_1/versions/1/zip -O bert_large_pyt_amp_ckpt_squad_qa1_1_1.zip



Build the Docker image:

bash ./scripts/docker/build.sh

The build finishes with output like the following:

Processing triggers for libc-bin (2.27-3ubuntu1) ...
Removing intermediate container 89010b0a75b2
 ---> 562bcc14dbfa
Step 15/15 : COPY . .
 ---> 23bac3585a43
Successfully built 23bac3585a43
Successfully tagged bert:latest


Exporting the checkpoint to TorchScript

On the host (not inside the container), change into DeepLearningExamples-master/PyTorch/LanguageModeling/BERT and run the following script to convert the checkpoint to TorchScript:

bash ./triton/export_model.sh


== PyTorch ==

NVIDIA Release 20.06 (build 13419386)
PyTorch Version 1.6.0a0+9907a3e

NOTE: Legacy NVIDIA Driver detected.  Compatibility mode ENABLED.

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

deploying model bertQA-ts-script in format pytorch_libtorch
/opt/conda/lib/python3.6/site-packages/torch/jit/_recursive.py:160: UserWarning: 'bias' was found in ScriptModule constants,  but it is a non-constant parameter. Consider removing it.
  " but it is a non-constant . Consider removing it.".format(name, hint))

conversion correctness test results
maximal absolute error over dataset (L_inf):  0.0322265625

average L_inf error over output tensors:  0.02264404296875
variance of L_inf error over output tensors:  5.4970383644104004e-05
stddev of L_inf error over output tensors:  0.00741420148391612

time of error check of native model:  0.8040032386779785 seconds
time of error check of ts model:  1.7353665828704834 seconds
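
The figures in this report are L_inf (maximum absolute) differences between the outputs of the native PyTorch model and the converted model. The following is a minimal sketch of how such a summary could be computed, not the export script's actual code. It assumes each output tensor has been flattened to a list of floats and that the variance is the population variance; the function names linf_error and summarize_errors are hypothetical.

```python
import math
import statistics


def linf_error(native, converted):
    """Largest absolute element-wise difference between two flat outputs."""
    if len(native) != len(converted):
        raise ValueError("outputs must have the same length")
    return max(abs(a - b) for a, b in zip(native, converted))


def summarize_errors(native_outputs, converted_outputs):
    """Summarize per-tensor L_inf errors over a set of output tensors."""
    errors = [linf_error(n, c) for n, c in zip(native_outputs, converted_outputs)]
    variance = statistics.pvariance(errors)  # assumption: population variance
    return {
        "max": max(errors),
        "mean": statistics.fmean(errors),
        "variance": variance,
        "stddev": math.sqrt(variance),
    }


# Toy values standing in for model outputs (not real BERT logits).
native = [[0.10, 0.20, 0.30], [1.00, 2.00, 3.00]]
converted = [[0.10, 0.25, 0.30], [1.00, 2.00, 2.90]]
summary = summarize_errors(native, converted)
print(summary)
```

In the log above, the reported stddev (about 0.00741) is the square root of the reported variance (about 5.497e-05), which is the relationship the sketch reproduces.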




Starting the Triton server

The Triton server can be started with the following command:

docker run --rm --gpus device=0 --ipc=host --network=host -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD/results/triton_models:/models nvcr.io/nvidia/tritonserver:20.06-v1-py3 trtserver --model-store=/models --log-verbose=1
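
The server needs some time to load the models before it can answer requests. The following is a rough sketch of a readiness poll. It assumes the v1 HTTP API served by this image exposes a readiness endpoint at /api/health/ready on port 8000; the helper names http_probe and wait_until_ready are hypothetical. The demonstration at the end uses a fake probe so that it runs without a server.

```python
import time
import urllib.error
import urllib.request

# Assumption: readiness endpoint of the v1 HTTP API on the default HTTP port.
READY_URL = "http://localhost:8000/api/health/ready"


def http_probe(url=READY_URL, timeout=2.0):
    """Return True if the server answers the readiness URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Demonstration with a fake probe that succeeds on the third call.
answers = iter([False, False, True])
ready = wait_until_ready(lambda: next(answers), attempts=5, sleep=lambda _: None)
print(ready)
```

Against a real server, the call would be wait_until_ready(http_probe).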




Starting a custom Triton client


Step 1: Start a client container

docker run -it --rm --ipc=host --network=host -v $PWD/vocab:/workspace/bert/vocab bert:latest



Step 2: Run the client
Change into the client code directory (cd /workspace/bert/triton/), then run the following command to send a request to the bertQA-ts-script model:

python client.py --do_lower_case --version_2_with_negative --vocab_file=../vocab/vocab --triton-model-name=bertQA-ts-script

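The client sends a request to the running Triton server, which processes it and returns the result. To supply your own passage and question, pass the --question and --context arguments when running client.py. You can also choose a specific model with --triton-model-name. The server here has loaded two models, so the client can also query the ONNX model: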

python client.py --do_lower_case --version_2_with_negative --vocab_file=../vocab/vocab --triton-model-name=bertQA-onnx
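
To pass a custom question and passage, the same command can be assembled programmatically. This is a small sketch only: it reuses the flags shown in this post and assumes client.py accepts the --flag=value form (as argparse-based scripts usually do). The helper name build_client_command and the sample question and context are made up for illustration.

```python
import shlex


def build_client_command(model_name, question=None, context=None):
    """Build the argv list for client.py using the flags shown in this post."""
    argv = [
        "python", "client.py",
        "--do_lower_case",
        "--version_2_with_negative",
        "--vocab_file=../vocab/vocab",
        f"--triton-model-name={model_name}",
    ]
    # Only add the optional flags when a value was supplied.
    if question is not None:
        argv.append(f"--question={question}")
    if context is not None:
        argv.append(f"--context={context}")
    return argv


argv = build_client_command(
    "bertQA-onnx",
    question="Who wrote Hamlet?",
    context="Hamlet is a tragedy written by William Shakespeare.",
)
# shlex.join quotes arguments that contain spaces so the line is shell-safe.
print(shlex.join(argv))
```

Keeping the arguments as a list avoids quoting mistakes when the question or passage contains spaces; the list can be handed directly to subprocess.run inside the client container.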



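Run the deployment and evaluation script:

bash ./triton/evaluate.sh

Before deploying and evaluating, shut down the Triton server started earlier; otherwise the ports will conflict.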
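
One way to confirm that the earlier server is really gone is to check whether anything still accepts connections on ports 8000 to 8002, the ports used in the docker run command. This is a minimal sketch under that assumption; the helper names port_in_use and busy_ports are hypothetical.

```python
import socket

TRITON_PORTS = (8000, 8001, 8002)  # ports from the docker run command in this post


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports=TRITON_PORTS, host="127.0.0.1"):
    """List the ports from `ports` that currently accept connections."""
    return [port for port in ports if port_in_use(port, host)]


print("ports still in use:", busy_ports())
```

An empty list suggests the ports are free and the evaluation script can start its own server.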


deploying model bert_large_fp32 in format pytorch_libtorch
/opt/conda/lib/python3.6/site-packages/torch/jit/_recursive.py:160: UserWarning: 'bias' was found in ScriptModule constants,  but it is a non-constant parameter. Consider removing it.
  " but it is a non-constant . Consider removing it.".format(name, hint))

conversion correctness test results
maximal absolute error over dataset (L_inf):  1.4185905456542969e-05

average L_inf error over output tensors:  1.0482966899871826e-05
variance of L_inf error over output tensors:  8.773056355456296e-12
stddev of L_inf error over output tensors:  2.961934562993635e-06

time of error check of native model:  1.596167802810669 seconds
time of error check of ts model:  2.414717435836792 seconds

Starting server...
Waiting for TRITON Server to be ready at http://localhost:8000...
.......TRITON Server is ready!

Individual Time Runs
Total Time: 869886.3623142242 ms
Total Inference Time = 432310.23 forSentences processed = 10833
Throughput Average (sentences/sec) = 12.45
Throughput Average (batches/sec) = 1.56
Summary Statistics
Batch size = 8
Sequence Length = 384
Latency Confidence Level 95 (ms) = 594040.61627388
Latency Confidence Level 99 (ms)  = 615392.275094986
Latency Confidence Level 100 (ms)  = 619993.6480522156
Latency Average (ms)  = 319048.1366518239
Sending Requests: 100%|███████████████████████████████████████████████████████████████████████████| 10833/10833 [15:16<00:00, 11.82sentences/s]
Processed Requests: 100%|█████████████████████████████████████████████████████████████████████████| 10833/10833 [15:16<00:00, 11.82sentences/s]

deploying model bert_large_fp32 in format onnxruntime_onnx
/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:955: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input input__0
  'Automatically generated names will be applied to each dynamic axes of input '.format(key))
/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:955: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input input__1
  'Automatically generated names will be applied to each dynamic axes of input '.format(key))
/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:955: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input input__2
  'Automatically generated names will be applied to each dynamic axes of input '.format(key))
/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:955: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input output__0
  'Automatically generated names will be applied to each dynamic axes of input '.format(key))
/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:955: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input output__1
  'Automatically generated names will be applied to each dynamic axes of input '.format(key))
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1336539136

conversion correctness test results
maximal absolute error over dataset (L_inf):  0.00022530555725097656

average L_inf error over output tensors:  0.0001377016305923462
variance of L_inf error over output tensors:  6.448256743378049e-09
stddev of L_inf error over output tensors:  8.030103824595327e-05

time of error check of native model:  1.2507586479187012 seconds
time of error check of onnx model:  76.80649089813232 seconds

Starting server...
Waiting for TRITON Server to be ready at http://localhost:8000...
.......TRITON Server is ready!

Individual Time Runs
Total Time: 863938.3265972137 ms
Total Inference Time = 418017.89 forSentences processed = 10833
Throughput Average (sentences/sec) = 12.54
Throughput Average (batches/sec) = 1.57
Summary Statistics
Batch size = 8
Sequence Length = 384
Latency Confidence Level 95 (ms) = 568533.2419872284
Latency Confidence Level 99 (ms)  = 591532.5634479523
Latency Confidence Level 100 (ms)  = 595446.0487365723
Latency Average (ms)  = 308500.2912194087
Sending Requests: 100%|███████████████████████████████████████████████████████████████████████████| 10833/10833 [15:10<00:00, 11.90sentences/s]
Processed Requests: 100%|█████████████████████████████████████████████████████████████████████████| 10833/10833 [15:10<00:00, 11.90sentences/s]
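
The throughput figures in both logs are consistent with dividing the number of sentences by the total wall-clock time, and dividing that rate by the batch size. This is an inference from the printed numbers, not taken from the evaluation script's source. A short sketch of the arithmetic, with a hypothetical helper named throughput:

```python
def throughput(sentences, total_time_ms, batch_size):
    """Return (sentences/sec, batches/sec) from a total wall-clock time in ms."""
    seconds = total_time_ms / 1000.0
    sentences_per_sec = sentences / seconds
    return sentences_per_sec, sentences_per_sec / batch_size


# "Total Time" values copied from the two evaluation logs in this post.
runs = {
    "pytorch_libtorch": 869886.3623142242,
    "onnxruntime_onnx": 863938.3265972137,
}
results = {
    name: throughput(sentences=10833, total_time_ms=total_ms, batch_size=8)
    for name, total_ms in runs.items()
}
for name, (sps, bps) in results.items():
    print(f"{name}: {sps:.2f} sentences/sec, {bps:.2f} batches/sec")
```

This gives 12.45 sentences/sec and 1.56 batches/sec for the TorchScript model, and 12.54 and 1.57 for the ONNX model, matching the logged values. In this run the two backends perform almost identically.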


