极智AI | Explaining the TensorRT Fully Connected Operator

Posted 2022-07-10 极智视界

Follow my WeChat official account [极智视界] for more of my notes.

Hello everyone, I'm 极智视界. This article explains the TensorRT Fully Connected operator.

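Fully Connected, i.e. the fully connected layer, is typically used as a classification head or a feature head. It is a classic layer and not a complicated one: without a bias it is a single matrix multiplication, and with a bias it is a matrix multiplication followed by a matrix addition. Here we look at several ways to implement Fully Connected in TensorRT.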
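As a rough illustration of that description, here is a minimal NumPy sketch (plain NumPy, not TensorRT code) of a fully connected layer as a matrix multiplication with an optional addition. The function name `fully_connected` and the toy values are assumptions made up for this example.

```python
import numpy as np

def fully_connected(x, weight, bias=None):
    """Hypothetical helper: flatten x, multiply by weight, optionally add bias.

    x:      input of any shape holding K elements in total
    weight: array of shape (c_out, K)
    bias:   optional array of shape (c_out,)
    """
    y = weight @ x.reshape(-1)  # the matrix multiplication
    if bias is not None:
        y = y + bias            # the optional addition
    return y

# Toy values chosen only for illustration
x = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
weight = np.array([[1.0, 1.0, 1.0, 1.0],
                   [-1.0, -1.0, -1.0, -1.0]], dtype=np.float32)
bias = np.array([0.5, 0.5], dtype=np.float32)

y_no_bias = fully_connected(x, weight)      # matrix multiplication only
y_bias = fully_connected(x, weight, bias)   # multiplication, then addition
print(y_no_bias)
print(y_bias)
```

Each row of `weight` produces one output channel, so the number of rows plays the role of the output channel count.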

Contents

- 1 Implementation with the native TensorRT operator
- 2 Implementation with TensorRT matrix multiply-add

1 Implementation with the native TensorRT operator

Using TensorRT's native Fully Connected operator is certainly the most convenient option. The key steps are:

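# Placeholder weights, replaced below
placeHolder = np.zeros(1, dtype=np.float32)
# Add the fully connected layer
fullyConnectedLayer = network.add_fully_connected(inputT0, 1, placeHolder, placeHolder)
# Reset the number of output channels
fullyConnectedLayer.num_output_channels = cOut
# Reset the fully connected weights
fullyConnectedLayer.kernel = weight
# Reset the fully connected bias; bias is optional and defaults to None
fullyConnectedLayer.bias = bias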

Here is a complete example:

import numpy as np
from cuda import cudart
import tensorrt as trt

# Input tensor dimensions, NCHW
nIn, cIn, hIn, wIn = 1, 3, 4, 5
# Number of output channels, C
cOut = 2
# Input data
data = np.arange(cIn * hIn * wIn, dtype=np.float32).reshape(cIn, hIn, wIn)
# Fully connected weights: all +1 for channel 0, all -1 for channel 1
weight = np.ones(cIn * hIn * wIn, dtype=np.float32)
weight = np.concatenate([weight, -weight], 0).reshape(cOut, cIn, hIn, wIn)
# Fully connected bias
bias = np.zeros(cOut, dtype=np.float32)

np.set_printoptions(precision=8, linewidth=200, suppress=True)
cudart.cudaDeviceSynchronize()

logger = trt.Logger(trt.Logger.ERROR)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()
inputT0 = network.add_input('inputT0', trt.DataType.FLOAT, (nIn, cIn, hIn, wIn))
#-----------------------------------------------------------------------# replaceable part
# Add the fully connected layer
fullyConnectedLayer = network.add_fully_connected(inputT0, cOut, weight, bias)
#-----------------------------------------------------------------------# replaceable part
network.mark_output(fullyConnectedLayer.get_output(0))
engineString = builder.build_serialized_network(network, config)
engine = trt.Runtime(logger).deserialize_cuda_engine(engineString)
context = engine.create_execution_context()
_, stream = cudart.cudaStreamCreate()

inputH0 = np.ascontiguousarray(data.reshape(-1))
outputH0 = np.empty(context.get_binding_shape(1), dtype=trt.nptype(engine.get_binding_dtype(1)))
_, inputD0 = cudart.cudaMallocAsync(inputH0.nbytes, stream)
_, outputD0 = cudart.cudaMallocAsync(outputH0.nbytes, stream)

cudart.cudaMemcpyAsync(inputD0, inputH0.ctypes.data, inputH0.nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)
context.execute_async_v2([int(inputD0), int(outputD0)], stream)
cudart.cudaMemcpyAsync(outputH0.ctypes.data, outputD0, outputH0.nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, stream)
cudart.cudaStreamSynchronize(stream)

print("inputH0 :", data.shape)
print(data)
print("outputH0:", outputH0.shape)
print(outputH0)

cudart.cudaStreamDestroy(stream)
cudart.cudaFree(inputD0)
cudart.cudaFree(outputD0)

Input tensor, shape (1,3,4,5):
$\left[\begin{matrix} \left[\begin{matrix} \left[\begin{matrix} 0. & 1. & 2. & 3. & 4. \\ 5. & 6. & 7. & 8. & 9. \\ 10. & 11. & 12. & 13. & 14. \\ 15. & 16. & 17. & 18. & 19. \end{matrix}\right] \left[\begin{matrix} 20. & 21. & 22. & 23. & 24. \\ 25. & 26. & 27. & 28. & 29. \\ 30. & 31. & 32. & 33. & 34. \\ 35. & 36. & 37. & 38. & 39. \end{matrix}\right] \left[\begin{matrix} 40. & 41. & 42. & 43. & 44. \\ 45. & 46. & 47. & 48. & 49. \\ 50. & 51. & 52. & 53. & 54. \\ 55. & 56. & 57. & 58. & 59. \end{matrix}\right] \end{matrix}\right] \end{matrix}\right]$
Output tensor, shape (1,2,1,1):
$\left[\begin{matrix} \left[\begin{matrix} \left[\begin{matrix} \left[\begin{matrix} 1770. \end{matrix}\right] \end{matrix}\right] \\ \left[\begin{matrix} \left[\begin{matrix} -1770. \end{matrix}\right] \end{matrix}\right] \end{matrix}\right] \end{matrix}\right]$
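The engine output can be cross-checked on the CPU. The sketch below rebuilds the same `data`, `weight` and `bias` as the example and computes a NumPy reference; the name `reference` and the reshape to (1,2,1,1) are choices made here for comparison, not part of the original script.

```python
import numpy as np

# Same dimensions and tensors as in the TensorRT example
nIn, cIn, hIn, wIn = 1, 3, 4, 5
cOut = 2
data = np.arange(cIn * hIn * wIn, dtype=np.float32).reshape(cIn, hIn, wIn)
weight = np.ones(cIn * hIn * wIn, dtype=np.float32)
weight = np.concatenate([weight, -weight], 0).reshape(cOut, cIn, hIn, wIn)
bias = np.zeros(cOut, dtype=np.float32)

# Collapse C*H*W into one axis, multiply by the weights, then add the bias
reference = weight.reshape(cOut, -1) @ data.reshape(-1) + bias
# Reshape to the layout reported for the output tensor
reference = reference.reshape(nIn, cOut, 1, 1)

print(reference.shape)
print(reference.reshape(-1))
```

With the TensorRT result at hand, `np.allclose(outputH0, reference)` is one way to compare the two.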
Calculation process:
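One way to write out the arithmetic, using notation introduced here for illustration only ($x_i$ for the flattened input, $w_{j,i}$ for the weights, $b_j$ for the bias, $y_j$ for output channel $j$). In the example $x_i = i$ for $i = 0, \dots, 59$, every weight of channel 0 is $1$, every weight of channel 1 is $-1$, and the bias is zero, so:

$\begin{aligned} y_0 &= \sum_{i=0}^{59} w_{0,i}\,x_i + b_0 = \sum_{i=0}^{59} i + 0 = \frac{59 \times 60}{2} = 1770 \\ y_1 &= \sum_{i=0}^{59} w_{1,i}\,x_i + b_1 = -\sum_{i=0}^{59} i + 0 = -1770 \end{aligned}$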