Using TensorRT for faster inference and lower latency in deep learning models


What is TensorRT?

TensorRT is a library developed by NVIDIA to speed up inference and reduce latency of deep learning models on NVIDIA GPUs. It can improve inference speed by up to 2-4x for real-time services and by up to roughly 30x compared with CPU-only inference.

In this article, we focus on the following issues:

  • Why does TensorRT improve inference speed?
  • Does the speed-up come with a trade-off?
  • How do you use TensorRT with a deep learning model?

How does TensorRT optimize inference?

TensorRT performs five types of optimization to increase inference performance; we will go through each of them below.


1. Precision Calibration

The weights and activations that were trained in FP32 (32-bit floating point) precision are converted to FP16 or INT8 precision. This reduces latency and increases inference speed at the cost of a small drop in model accuracy. In real-time applications, that trade-off between accuracy and inference speed is often worth making. For INT8 in particular, TensorRT calibrates the quantization ranges on a small set of representative inputs, as sketched below.
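As a rough illustration (not taken from the original article), this is how the precision mode is requested when converting with TF-TRT; the INT8 path also needs a calibration input function. The paths, batch shape, and number of calibration batches are placeholder assumptions, and the exact conversion-parameter API varies slightly between TensorFlow releases.

```python
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Ask for INT8 precision; calibration data is needed to pick quantization ranges.
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.INT8,
    use_calibration=True)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='resnet50_saved_model',   # placeholder path
    conversion_params=params)

def calibration_input_fn():
    # A few batches that look like real inputs are enough for calibration.
    for _ in range(10):
        yield (np.random.random((8, 224, 224, 3)).astype(np.float32),)

converter.convert(calibration_input_fn=calibration_input_fn)
converter.save('resnet50_saved_model_TFTRT_INT8')
```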

2. Layer & Tensor Fusion

TensorRT fuses layers and tensors to optimize GPU memory and bandwidth usage by merging nodes in the graph vertically, horizontally, or both:

  • Improves GPU utilization: less kernel-launch overhead, better memory usage and bandwidth
  • Vertical fusion: combines sequential kernel calls into a single kernel
  • Horizontal fusion: combines identical kernels that share the same input but have different weights

3. Kernel auto-tuning

While building the optimized engine, TensorRT benchmarks multiple candidate kernels for each operation and keeps the fastest one for your model and hardware:

  • There are multiple low-level algorithms/implementations for common operations
  • TensorRT selects the optimal kernels based on your parameters, e.g. batch size, filter size, and input data size
  • TensorRT also selects the optimal kernels for your target platform (the specific GPU architecture)

4. Dynamic Tensor Memory

  • Allocates just the memory required for each tensor and only for the duration of its usage
  • Reduces memory footprint and improves memory re-use

5. Multiple Stream Execution

  • Allows processing multiple input streams in parallel

Workflow

To apply TensorRT to a deep learning model, we convert the trained model into a TF-TRT model and then run inference with the converted model. The code below walks through this workflow.

Code

1. Installing the TensorRT environment

To install and use TensorRT with TensorFlow (TF-TRT), your system needs the following (a quick sanity check is sketched after the list):

  • NVIDIA-GPU
  • Tensorflow-GPU >=2.0
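Assuming TensorFlow with GPU support is already installed, the check below simply confirms that a GPU is visible and that the TF-TRT converter module can be imported; if the TensorRT libraries are missing, TensorFlow usually logs a warning around this point.

```python
import tensorflow as tf
# TF-TRT bridge used later for the conversion step.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

print("TensorFlow version:", tf.__version__)
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices('GPU'))
```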

2. Converting ResNet-50 to TF-TRT

You can convert other models to TensorRT in the same way; here I take ResNet-50 as an example, as sketched below.
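A minimal conversion sketch, assuming a pretrained Keras ResNet-50 and FP16 precision; the directory names are placeholders I chose for illustration:

```python
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Export a pretrained ResNet-50 as a TensorFlow SavedModel.
model = tf.keras.applications.ResNet50(weights='imagenet')
model.save('resnet50_saved_model')

# Convert the SavedModel with TF-TRT, requesting FP16 precision.
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='resnet50_saved_model',
    conversion_params=conversion_params)
converter.convert()
converter.save('resnet50_saved_model_TFTRT_FP16')
```

The result is written out as a regular TensorFlow SavedModel containing the TRT-optimized segments.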

3. Reloading the converted model
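A matching sketch for loading and running the converted model, using the same placeholder paths and input shape as above; depending on the model, the serving signature may instead expect keyword arguments named after its input layer.

```python
import numpy as np
import tensorflow as tf

# Load the TF-TRT SavedModel and grab its default serving signature.
loaded = tf.saved_model.load('resnet50_saved_model_TFTRT_FP16')
infer = loaded.signatures['serving_default']

# Run a dummy batch shaped like ResNet-50 input: (batch, 224, 224, 3).
batch = tf.constant(np.random.random((8, 224, 224, 3)).astype(np.float32))
outputs = infer(batch)
print({name: tensor.shape for name, tensor in outputs.items()})
```

Note that the first call is typically slow because the TensorRT engines are built on the fly; TrtGraphConverterV2 also offers a build() step that can pre-build the engines before saving.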

Results

To compare TF-TRT with native inference, I ran ResNet-50 inference with TF-TRT at FP32, FP16, and INT8 precision as well as with native TensorFlow. The FP16, FP32, and native results are summarized below, followed by a rough sketch of the timing loop.

  • TF-TRT FP16: average step time 2.1 ms, average throughput 244,248 samples/sec
  • TF-TRT FP32: average step time 2.5 ms, average throughput 240,145 samples/sec
  • Native: average step time 4.1 ms, average throughput 126,328 samples/sec
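The exact numbers depend on the GPU, batch size, and input pipeline. The measurement can be reproduced roughly with a warm-up-then-time loop like the one below; the batch size, iteration counts, and model directories are arbitrary choices of mine, not necessarily the settings used for the table above.

```python
import time
import numpy as np
import tensorflow as tf

def benchmark(saved_model_dir, batch_size=8, num_warmup=50, num_runs=1000):
    """Rough timing loop: average step time and throughput for one SavedModel."""
    infer = tf.saved_model.load(saved_model_dir).signatures['serving_default']
    batch = tf.constant(np.random.random((batch_size, 224, 224, 3)).astype(np.float32))

    for _ in range(num_warmup):      # warm-up: builds TRT engines, fills caches
        preds = infer(batch)

    start = time.time()
    for _ in range(num_runs):
        preds = infer(batch)
    # Pull the last result back to the host so pending GPU work is finished
    # before the clock stops.
    _ = list(preds.values())[0].numpy()
    elapsed = time.time() - start

    print(f"{saved_model_dir}: average step time {1000 * elapsed / num_runs:.1f} ms, "
          f"throughput {num_runs * batch_size / elapsed:.0f} samples/sec")

benchmark('resnet50_saved_model')             # native TensorFlow
benchmark('resnet50_saved_model_TFTRT_FP16')  # TF-TRT FP16
```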

From these numbers we can see that converting the model with TF-TRT significantly increases inference speed and reduces latency compared with native inference.

References

Full code: https://colab.research.google.com/drive/15m95GzznIoCRn1XnMQXd9L-onpJiCWM3?usp=sharing
