
Use TensorRT for faster inference and lower latency with deep learning models

What is TensorRT?

TensorRT is a library developed by NVIDIA to speed up inference and reduce latency on NVIDIA GPUs. It can improve inference speed by roughly 2-4x compared to real-time services and up to 30x compared to CPU-only inference.

In this article, we focus on the following: how TensorRT optimizes inference, the workflow for converting a model, example code for converting ResNet-50 to TF-TRT, and a comparison of the resulting inference performance.

How does TensorRT optimize inference?

TensorRT performs five types of optimization to increase inference performance. We discuss each of them below.

 

1. Precision Calibration

The parameters and activations of a model trained in FP32 (32-bit floating point) precision are converted to FP16 or INT8 precision for inference. This optimization reduces latency and memory usage and increases inference speed, at the cost of a small reduction in model accuracy. In real-time applications, this trade-off between accuracy and inference speed is often worthwhile.
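As a rough numeric illustration (not TensorRT's actual calibration algorithm, which derives scales from a calibration dataset), the sketch below casts FP32 values to FP16 and quantizes them to INT8 with a simple max-abs scale, then measures the reconstruction error:

```python
# Minimal sketch of reduced precision: FP32 -> FP16 cast and symmetric INT8
# quantization with a max-abs scale. TensorRT chooses INT8 scales via a
# calibration step; the max-abs rule here is only an illustration.
import numpy as np

activations = np.random.randn(1000).astype(np.float32)  # stand-in FP32 activations

# FP16: a straight cast introduces only a small rounding error
fp16 = activations.astype(np.float16)
print("FP16 max error:", np.max(np.abs(activations - fp16.astype(np.float32))))

# INT8: quantize with a scale derived from the data's dynamic range
scale = np.max(np.abs(activations)) / 127.0
int8 = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)
dequant = int8.astype(np.float32) * scale
print("INT8 max error:", np.max(np.abs(activations - dequant)))
```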

2. Layer & Tensor Fusion

TensorRT combines layers and tensors to optimize GPU memory usage and bandwidth by fusing nodes vertically (consecutive layers executed as a single kernel), horizontally (layers that share the same input and operation), or both.
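With TF-TRT, the effect of fusion is visible in the converted graph: groups of TensorFlow ops are replaced by fused TRTEngineOp nodes. A small sketch for counting them, assuming the converted SavedModel produced in the Code section below (the directory name is illustrative):

```python
# Count the TRTEngineOp nodes that TF-TRT substitutes for fused subgraphs.
import tensorflow as tf

loaded = tf.saved_model.load("resnet50_saved_model_TFTRT_FP16")
graph_def = loaded.signatures["serving_default"].graph.as_graph_def()

def count_ops(gdef, op_name):
    n = sum(1 for node in gdef.node if node.op == op_name)
    for func in gdef.library.function:  # fused engines may sit inside sub-functions
        n += sum(1 for node in func.node_def if node.op == op_name)
    return n

print("TRTEngineOp nodes:", count_ops(graph_def, "TRTEngineOp"))
```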

3. Kernel Auto-Tuning

When building the optimized engine, TensorRT benchmarks multiple kernel implementations for each layer and picks the fastest one for the target GPU architecture, batch size, and input shape.

4. Dynamic Tensor Memory

TensorRT allocates memory for each tensor only for the duration of its use, which reduces the memory footprint and improves memory reuse.

5. Multiple Stream Execution

TensorRT can process several input streams in parallel, so multiple inference requests can be served at the same time on one GPU.

Workflow

To apply TensorRT to a deep learning model, we convert the trained model into a TensorRT-optimized model (model-TRT); the code below walks through this conversion workflow step by step.

Code

1. Installing the TensorRT environment

To install and use TensorRT, your system needs an NVIDIA GPU with a recent driver, compatible CUDA and cuDNN installations, and a TensorFlow build with TensorRT (TF-TRT) support.
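A quick sanity check, assuming TensorFlow 2.x on a machine (or Colab GPU runtime) where these components are present:

```python
# Check that a GPU is visible and that TensorFlow exposes the TF-TRT converter.
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt  # TF-TRT converter API

print("TensorFlow:", tf.__version__)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
print("Build info:", tf.sysconfig.get_build_info())  # CUDA / cuDNN versions TF was built against
```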

2. Convert model ResNet-50 to TF-TRT

You can convert other models to TensorRT in the same way; here I use ResNet-50 as an example.
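A minimal conversion sketch, assuming TensorFlow 2.x with TF-TRT support; the SavedModel directory names and the FP16 precision mode are illustrative (the full code is in the Colab notebook linked under References):

```python
# Export pretrained ResNet-50 as a SavedModel, then convert it with TF-TRT.
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# 1) Save the pretrained Keras ResNet-50
model = tf.keras.applications.ResNet50(weights="imagenet")
tf.saved_model.save(model, "resnet50_saved_model")

# 2) Convert with TF-TRT (FP16 here; FP32 and INT8 work the same way,
#    INT8 additionally needs a calibration input function)
params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="resnet50_saved_model",
    conversion_params=params,
)
converter.convert()

# 3) Save the TensorRT-optimized SavedModel
converter.save("resnet50_saved_model_TFTRT_FP16")
```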

3. Reload the converted model
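A minimal sketch of reloading the optimized SavedModel and running one batch through it, assuming the directory name used in the conversion step above:

```python
# Load the TF-TRT SavedModel and run inference through its serving signature.
import tensorflow as tf

loaded = tf.saved_model.load("resnet50_saved_model_TFTRT_FP16")
infer = loaded.signatures["serving_default"]

# Dummy batch with ResNet-50's expected input shape (batch, 224, 224, 3)
images = tf.random.uniform((8, 224, 224, 3), dtype=tf.float32)
outputs = infer(images)
print({name: tensor.shape for name, tensor in outputs.items()})  # e.g. (8, 1000) class scores
```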

Result

To compare TensorRT with native inference, I ran ResNet-50 inference with TF-TRT in FP32, FP16, and INT8 precision as well as natively; a rough timing sketch follows the table below.

FP16:   average step time 2.1 msec, average throughput 244248 samples/sec
FP32:   average step time 2.5 msec, average throughput 240145 samples/sec
Native: average step time 4.1 msec, average throughput 126328 samples/sec
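The numbers above come from the linked notebook; a rough timing sketch along these lines (the batch size, warm-up, and iteration counts here are assumptions) looks like this:

```python
# Time the serving signature of a SavedModel and report ms/step and samples/sec.
import time
import tensorflow as tf

def benchmark(saved_model_dir, batch_size=32, warmup=50, runs=200):
    infer = tf.saved_model.load(saved_model_dir).signatures["serving_default"]
    images = tf.random.uniform((batch_size, 224, 224, 3), dtype=tf.float32)
    for _ in range(warmup):                      # warm-up lets TF-TRT build its engines
        infer(images)
    start = time.time()
    for _ in range(runs):
        out = infer(images)
        _ = list(out.values())[0].numpy()        # copy the result back so the GPU work is finished
    elapsed = time.time() - start
    print(f"{saved_model_dir}: {1000 * elapsed / runs:.2f} ms/step, "
          f"{runs * batch_size / elapsed:.0f} samples/sec")

benchmark("resnet50_saved_model")             # native
benchmark("resnet50_saved_model_TFTRT_FP16")  # TF-TRT FP16
```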

These results show that converting a model to TensorRT increases inference speed and reduces latency quite significantly compared to native inference.

References

Entire code: https://colab.research.google.com/drive/15m95GzznIoCRn1XnMQXd9L-onpJiCWM3?usp=sharing
