Real-time detection with TensorRT-accelerated YOLOv5 on the Jetson AGX Xavier

Posted by luoganttcc


I. Preface

Because YOLOv5 detects live video too slowly on the Xavier, TensorRT is used to accelerate its inference. This post records my implementation process.

II. Environment preparation

If you have not yet set up a Python environment for YOLOv5, follow the steps below; otherwise, skip step 1 and go straight to step 2.

1. Build the YOLOv5 Python environment by following the article 《Jetson AGX Xavier配置yolov5虚拟环境》, and import OpenCV into the environment as described in 《Jetson AGX Xavier安装Archiconda虚拟环境管理器与在虚拟环境中调用opencv》. This post uses OpenCV 3.4.3.
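A quick way to confirm the environment sees OpenCV (assuming the environment is named yolov5env, as in the articles above):

conda activate yolov5env
python -c "import cv2; print(cv2.__version__)"    # should print 3.4.3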

2. Import the TensorRT libraries into the environment, the same way OpenCV was imported: copy the TensorRT-related folders under /usr/lib/python3.6/dist-packages/ into the site-packages folder of the environment you created, for example /home/jetson/archiconda3/envs/yolov5env/lib/python3.6/site-packages/.
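For example, a minimal sketch (the exact TensorRT folder names under dist-packages vary with the JetPack release, so list the directory first and copy whatever TensorRT entries you find):

ls /usr/lib/python3.6/dist-packages/ | grep -i tensorrt
cp -r /usr/lib/python3.6/dist-packages/tensorrt* /home/jetson/archiconda3/envs/yolov5env/lib/python3.6/site-packages/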

3. Install pycuda in the environment. If the pip install fails, many workarounds can be found online; a quick import check follows the commands below.


    
conda activate yolov5env
pip install pycuda
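To verify that pycuda and the copied TensorRT bindings both import cleanly inside the environment:

python -c "import pycuda.driver; import tensorrt; print(tensorrt.__version__)"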

III. Acceleration steps

Taking acceleration of the YOLOv5s model as the example, two versions are given below, v4.0 and v5.0; pick either one. The yolov5 tag and the tensorrtx branch must match, as paired in the clone commands below.

1. Clone the projects

①v4.0


    
git clone -b v4.0 https://github.com/ultralytics/yolov5.git
git clone -b yolov5-v4.0 https://github.com/wang-xinyu/tensorrtx.git

②v5.0


    
git clone -b v5.0 https://github.com/ultralytics/yolov5.git
git clone -b yolov5-v5.0 https://github.com/wang-xinyu/tensorrtx.git

2. Generate the engine file

① Download yolov5s.pt into the weights folder of the yolov5 project.

② Copy the gen_wts.py file from the tensorrtx/yolov5 folder into the yolov5 project (a sample copy command follows step ③).

③ Generate the yolov5s.wts file.


    
conda activate yolov5env
cd /xxx/yolov5
# choose the command matching the version you cloned
# v4.0
python gen_wts.py
# v5.0
python gen_wts.py -w yolov5s.pt -o yolov5s.wts
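Step ② is a single copy; for example (the /xxx/ placeholders stand for wherever you cloned the two repositories):

cp /xxx/tensorrtx/yolov5/gen_wts.py /xxx/yolov5/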

④ Generate the engine file

Enter the tensorrtx/yolov5 folder and create a build directory:

mkdir build

Copy the yolov5s.wts file generated in the yolov5 project into the tensorrtx/yolov5/build folder, then open a terminal in build and run:


    
cmake ..
make
# v4.0: sudo ./yolov5 -s [.wts] [.engine] [s/m/l/x/]
# v5.0: sudo ./yolov5 -s [.wts] [.engine] [s/m/l/x/s6/m6/l6/x6 or c/c6 gd gw]
sudo ./yolov5 -s yolov5s.wts yolov5s.engine s

This produces the yolov5s.engine file.
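The make step also builds libmyplugins.so, the custom-plugin library that the Python scripts below load; a quick check from the build folder that both artifacts exist:

ls -lh yolov5s.engine libmyplugins.so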

IV. Running accelerated detection

1. Accelerated image detection


    
sudo ./yolov5 -d yolov5s.engine ../samples
# or, from Python:
conda activate yolov5env
python yolov5_trt.py

2. Accelerated real-time camera detection

Since I have never learned C++, I could only grit my teeth and modify the yolov5_trt.py script. The code style is rough, but it does achieve the speed-up; use it as a reference if you need one.

Create a new file named yolo_trt_test.py in the tensorrtx project, and copy the v4.0 or v5.0 code below into it. Mind the path to yolov5s.engine and change it to match your setup.

① v4.0 code


    
  1. """
  2. An example that uses TensorRT's Python api to make inferences.
  3. """
  4. import ctypes
  5. import os
  6. import random
  7. import sys
  8. import threading
  9. import time
  10. import cv2
  11. import numpy as np
  12. import pycuda.autoinit
  13. import pycuda.driver as cuda
  14. import tensorrt as trt
  15. import torch
  16. import torchvision
  17. INPUT_W = 608
  18. INPUT_H = 608
  19. CONF_THRESH = 0.15
  20. IOU_THRESHOLD = 0.45
  21. int_box=[ 0, 0, 0, 0]
  22. int_box1=[ 0, 0, 0, 0]
  23. fps1= 0.0
  24. def plot_one_box( x, img, color=None, label=None, line_thickness=None):
  25. """
  26. description: Plots one bounding box on image img,
  27. this function comes from YoLov5 project.
  28. param:
  29. x: a box likes [x1,y1,x2,y2]
  30. img: a opencv image object
  31. color: color to draw rectangle, such as (0,255,0)
  32. label: str
  33. line_thickness: int
  34. return:
  35. no return
  36. """
  37. tl = (
  38. line_thickness or round( 0.002 * (img.shape[ 0] + img.shape[ 1]) / 2) + 1
  39. ) # line/font thickness
  40. color = color or [random.randint( 0, 255) for _ in range( 3)]
  41. c1, c2 = ( int(x[ 0]), int(x[ 1])), ( int(x[ 2]), int(x[ 3]))
  42. C2 = c2
  43. cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)
  44. if label:
  45. tf = max(tl - 1, 1) # font thickness
  46. t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[ 0]
  47. c2 = c1[ 0] + t_size[ 0], c1[ 1] + t_size[ 1] + 8
  48. cv2.rectangle(img, c1, c2, color, - 1, cv2.LINE_AA) # filled
  49. cv2.putText(
  50. img,
  51. label,
  52. (c1[ 0], c1[ 1]+t_size[ 1] + 5),
  53. 0,
  54. tl / 3,
  55. [ 255, 255, 255],
  56. thickness=tf,
  57. lineType=cv2.LINE_AA,
  58. )
  59. class YoLov5TRT( object):
  60. """
  61. description: A YOLOv5 class that warps TensorRT ops, preprocess and postprocess ops.
  62. """
  63. def __init__( self, engine_file_path):
  64. # Create a Context on this device,
  65. self.cfx = cuda.Device( 0).make_context()
  66. stream = cuda.Stream()
  67. TRT_LOGGER = trt.Logger(trt.Logger.INFO)
  68. runtime = trt.Runtime(TRT_LOGGER)
  69. # Deserialize the engine from file
  70. with open(engine_file_path, "rb") as f:
  71. engine = runtime.deserialize_cuda_engine(f.read())
  72. context = engine.create_execution_context()
  73. host_inputs = []
  74. cuda_inputs = []
  75. host_outputs = []
  76. cuda_outputs = []
  77. bindings = []
  78. for binding in engine:
  79. size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
  80. dtype = trt.nptype(engine.get_binding_dtype(binding))
  81. # Allocate host and device buffers
  82. host_mem = cuda.pagelocked_empty(size, dtype)
  83. cuda_mem = cuda.mem_alloc(host_mem.nbytes)
  84. # Append the device buffer to device bindings.
  85. bindings.append( int(cuda_mem))
  86. # Append to the appropriate list.
  87. if engine.binding_is_input(binding):
  88. host_inputs.append(host_mem)
  89. cuda_inputs.append(cuda_mem)
  90. else:
  91. host_outputs.append(host_mem)
  92. cuda_outputs.append(cuda_mem)
  93. # Store
  94. self.stream = stream
  95. self.context = context
  96. self.engine = engine
  97. self.host_inputs = host_inputs
  98. self.cuda_inputs = cuda_inputs
  99. self.host_outputs = host_outputs
  100. self.cuda_outputs = cuda_outputs
  101. self.bindings = bindings
  102. def infer( self, input_image_path):
  103. global int_box,int_box1,fps1
  104. # threading.Thread.__init__(self)
  105. # Make self the active context, pushing it on top of the context stack.
  106. self.cfx.push()
  107. # Restore
  108. stream = self.stream
  109. context = self.context
  110. engine = self.engine
  111. host_inputs = self.host_inputs
  112. cuda_inputs = self.cuda_inputs
  113. host_outputs = self.host_outputs
  114. cuda_outputs = self.cuda_outputs
  115. bindings = self.bindings
  116. # Do image preprocess
  117. input_image, image_raw, origin_h, origin_w = self.preprocess_image(
  118. input_image_path
  119. )
  120. # Copy input image to host buffer
  121. np.copyto(host_inputs[ 0], input_image.ravel())
  122. # Transfer input data to the GPU.
  123. cuda.memcpy_htod_async(cuda_inputs[ 0], host_inputs[ 0], stream)
  124. # Run inference.
  125. context.execute_async(bindings=bindings, stream_handle=stream.handle)
  126. # Transfer predictions back from the GPU.
  127. cuda.memcpy_dtoh_async(host_outputs[ 0], cuda_outputs[ 0], stream)
  128. # Synchronize the stream
  129. stream.synchronize()
  130. # Remove any context from the top of the context stack, deactivating it.
  131. self.cfx.pop()
  132. # Here we use the first row of output in that batch_size = 1
  133. output = host_outputs[ 0]
  134. # Do postprocess
  135. result_boxes, result_scores, result_classid = self.post_process(
  136. output, origin_h, origin_w
  137. )
  138. # Draw rectangles and labels on the original image
  139. for i in range( len(result_boxes)):
  140. box1 = result_boxes[i]
  141. plot_one_box(
  142. box1,
  143. image_raw,
  144. label= "::.2f". format(
  145. categories[ int(result_classid[i])], result_scores[i]
  146. ),
  147. )
  148. return image_raw
  149. # parent, filename = os.path.split(input_image_path)
  150. # save_name = os.path.join(parent, "output_" + filename)
  151. # #  Save image
  152. # cv2.imwrite(save_name, image_raw)
  153. def destroy( self):
  154. # Remove any context from the top of the context stack, deactivating it.
  155. self.cfx.pop()
  156. def preprocess_image( self, input_image_path):
  157. """
  158. description: Read an image from image path, convert it to RGB,
  159. resize and pad it to target size, normalize to [0,1],
  160. transform to NCHW format.
  161. param:
  162. input_image_path: str, image path
  163. return:
  164. image: the processed image
  165. image_raw: the original image
  166. h: original height
  167. w: original width
  168. """
  169. image_raw = input_image_path
  170. h, w, c = image_raw.shape
  171. image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)
  172. # Calculate widht and height and paddings
  173. r_w = INPUT_W / w
  174. r_h = INPUT_H / h
  175. if r_h > r_w:
  176. tw = INPUT_W
  177. th = int(r_w * h)
  178. tx1 = tx2 = 0
  179. ty1 = int((INPUT_H - th) / 2)
  180. ty2 = INPUT_H - th - ty1
  181. else:
  182. tw = int(r_h * w)
  183. th = INPUT_H
  184. tx1 = int((INPUT_W - tw) / 2)
  185. tx2 = INPUT_W - tw - tx1
  186. ty1 = ty2 = 0
  187. # Resize the image with long side while maintaining ratio
  188. image = cv2.resize(image, (tw, th))
  189. # Pad the short side with (128,128,128)
  190. image = cv2.copyMakeBorder(
  191. image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, ( 128, 128, 128)
  192. )
  193. image = image.astype(np.float32)
  194. # Normalize to [0,1]
  195. image /= 255.0
  196. # HWC to CHW format:
  197. image = np.transpose(image, [ 2, 0, 1])
  198. # CHW to NCHW format
  199. image = np.expand_dims(image, axis= 0)
  200. # Convert the image to row-major order, also known as "C order":
  201. image = np.ascontiguousarray(image)
  202. return image, image_raw, h, w
  203. def xywh2xyxy( self, origin_h, origin_w, x):
  204. """
  205. description: Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right
  206. param:
  207. origin_h: height of original image
  208. origin_w: width of original image
  209. x: A boxes tensor, each row is a box [center_x, center_y, w, h]
  210. return:
  211. y: A boxes tensor, each row is a box [x1, y1, x2, y2]
  212. """
  213. y = torch.zeros_like(x) if isinstance(x, torch.Tensor) else np.zeros_like(x)
  214. r_w = INPUT_W / origin_w
  215. r_h = INPUT_H / origin_h
  216. if r_h > r_w:
  217. y[:, 0] = x[:, 0] - x[:, 2] / 2
  218. y[:, 2] = x[:, 0] + x[:, 2] / 2
  219. y[:, 1] = x[:, 1] - x[:, 3] / 2 - (INPUT_H - r_w * origin_h) / 2
  220. y[:, 3] = x[:, 1] + x[:, 3] / 2 - (INPUT_H - r_w * origin_h) / 2
  221. y /= r_w
  222. else:
  223. y[:, 0] = x[:, 0] - x[:, 2] / 2 - (INPUT_W - r_h * origin_w) / 2
  224. y[:, 2] = x[:, 0] + x[:, 2] / 2 - (INPUT_W - r_h * origin_w) / 2
  225. y[:, 1] = x[:, 1] - x[:, 3] / 2
  226. y[:, 3] = x[:, 1] + x[:, 3] / 2
  227. y /= r_h
  228. return y
  229. def post_process( self, output, origin_h, origin_w):
  230. """
  231. description: postprocess the prediction
  232. param:
  233. output: A tensor likes [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...]
  234. origin_h: height of original image
  235. origin_w: width of original image
  236. return:
  237. result_boxes: finally boxes, a boxes tensor, each row is a box [x1, y1, x2, y2]
  238. result_scores: finally scores, a tensor, each element is the score correspoing to box
  239. result_classid: finally classid, a tensor, each element is the classid correspoing to box
  240. """
  241. # Get the num of boxes detected
  242. num = int(output[ 0])
  243. # Reshape to a two dimentional ndarray
  244. pred = np.reshape(output[ 1:], (- 1, 6))[:num, :]
  245. # to a torch Tensor
  246. pred = torch.Tensor(pred).cuda()
  247. # Get the boxes
  248. boxes = pred[:, : 4]
  249. # Get the scores
  250. scores = pred[:, 4]
  251. # Get the classid
  252. classid = pred[:, 5]
  253. # Choose those boxes that score > CONF_THRESH
  254. si = scores > CONF_THRESH
  255. boxes = boxes[si, :]
  256. scores = scores[si]
  257. classid = classid[si]
  258. # Trandform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]
  259. boxes = self.xywh2xyxy(origin_h, origin_w, boxes)
  260. # Do nms
  261. indices = torchvision.ops.nms(boxes, scores, iou_threshold=IOU_THRESHOLD).cpu()
  262. result_boxes = boxes[indices, :].cpu()
  263. result_scores = scores[indices].cpu()
  264. result_classid = classid[indices].cpu()
  265. return result_boxes, result_scores, result_classid
  266. class myThread(threading.Thread):
  267. def __init__( self, func, args):
  268. threading.Thread.__init__(self)
  269. self.func = func
  270. self.args = args
  271. def run( self):
  272. self.func(*self.args)
  273. if __name__ == "__main__":
  274. # load custom plugins
  275. PLUGIN_LIBRARY = "build/libmyplugins.so"
  276. ctypes.CDLL(PLUGIN_LIBRARY)
  277. engine_file_path = "yolov5s.engine"
  278. # load coco labels
  279. categories = [ "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light",
  280. "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
  281. "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
  282. "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard",
  283. "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
  284. "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch",
  285. "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone",
  286. "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
  287. "hair drier", "toothbrush"]
  288. # a YoLov5TRT instance
  289. yolov5_wrapper = YoLov5TRT(engine_file_path)
  290. cap = cv2.VideoCapture( 0)
  291. while 1:
  292. _,image =cap.read()
  293. img=yolov5_wrapper.infer(image)
  294. cv2.imshow( "result", img)
  295. if cv2.waitKey( 1) & 0XFF == ord( 'q'): # 1 millisecond
  296. break
  297. cap.release()
  298. cv2.destroyAllWindows()
  299. yolov5_wrapper.destroy()
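To run the camera demo (assuming yolov5s.engine and build/libmyplugins.so sit in the tensorrtx/yolov5 folder, as produced above):

conda activate yolov5env
cd /xxx/tensorrtx/yolov5
python yolo_trt_test.py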

② v5.0 code


    
  1. """
  2. An example that uses TensorRT's Python api to make inferences.
  3. """
  4. import ctypes
  5. import os
  6. import shutil
  7. import random
  8. import sys
  9. import threading
  10. import time
  11. import cv2
  12. import numpy as np
  13. import pycuda.autoinit
  14. import pycuda.driver as cuda
  15. import tensorrt as trt
  16. import torch
  17. import torchvision
  18. import argparse
  19. CONF_THRESH = 0.5
  20. IOU_THRESHOLD = 0.4
  21. def get_img_path_batches( batch_size, img_dir):
  22. ret = []
  23. batch = []
  24. for root, dirs, files in os.walk(img_dir):
  25. for name in files:
  26. if len(batch) == batch_size:
  27. ret.append(batch)
  28. batch = []
  29. batch.append(os.path.join(root, name))
  30. if len(batch) > 0:
  31. ret.append(batch)
  32. return ret
  33. def plot_one_box( x, img, color=None, label=None, line_thickness=None):
  34. """
  35. description: Plots one bounding box on image img,
V. Checking versions on the Jetson AGX Xavier

After flashing the Xavier, Python, CUDA, cuDNN, OpenCV, and TensorRT are installed; here is how to check their versions.

1. Python

Both python2 and python3 are present on the Xavier.

# check the python2 version
python -V
# check the python3 version
python3 -V


2. CUDA

Either of the following two commands works:

cat /usr/local/cuda/version.txt
nvcc -V


3. cuDNN

cuDNN lives in a different location on the Xavier than on a typical Ubuntu install.

cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
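On images that ship cuDNN 8 (newer JetPack releases), the version macros live in cudnn_version.h rather than cudnn.h, so if the command above prints nothing, try:

cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2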


4. OpenCV

pkg-config --modversion opencv
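If OpenCV 4 is installed instead (as on JetPack 4.3 and later), pkg-config registers it as opencv4:

pkg-config --modversion opencv4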


5. TensorRT

dpkg -l | grep TensorRT
