- Tram Ho
What is TensorRT?
TensorRT is a library developed by NVIDIA to improve inference speed and reduce latency on NVIDIA GPUs. It can improve inference speed by roughly 2-4x for real-time services, and up to 30x over CPU-only performance.
In this article, we focus on the following issues:
- Why does TensorRT improve inference speed?
- Does the speed-up come with a trade-off?
- How do we use TensorRT in deep learning?
How does TensorRT optimize inference?
TensorRT performs 5 types of optimization to increase inference performance. We will discuss each of them below.
1. Precision Calibration
During training, parameters and activations are stored in FP32 (32-bit floating point) precision; TensorRT can convert them to FP16 or INT8 precision. This optimization reduces latency and increases inference speed, at the expense of some model accuracy, although usually not significantly. In real-time recognition, a trade-off between accuracy and inference speed is sometimes necessary.
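To see why lower precision costs a little accuracy, here is a minimal numpy sketch (not TensorRT itself) of what casting a weight to FP16, or quantizing it to INT8 with an assumed per-tensor scale, does to its value:

```python
import numpy as np

# A "weight" stored in full FP32 precision
w_fp32 = np.float32(0.1234567)

# Casting to FP16 keeps only ~3 decimal digits of precision
w_fp16 = w_fp32.astype(np.float16)

# INT8 quantization: map the float range onto 256 integer levels
scale = np.float32(0.01)                    # per-tensor scale, chosen during calibration
w_int8 = np.int8(np.round(w_fp32 / scale))  # quantize
w_deq = np.float32(w_int8 * scale)          # dequantize for comparison

print("FP16 error:", abs(float(w_fp32) - float(w_fp16)))
print("INT8 error:", abs(float(w_fp32) - float(w_deq)))
```

The INT8 error is larger than the FP16 error, which is why INT8 mode needs a calibration step to pick good scales, while FP16 usually works out of the box.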
2. Layer & Tensor Fusion
TensorRT combines layers and tensors to optimize GPU memory and bandwidth by merging nodes vertically, horizontally, or both.
- Improve GPU utilization – less kernel launch overhead, better memory usage and bandwidth
- Vertical fusion = Combine sequential kernel calls
- Horizontal fusion = Combine same kernels that have common input but different weights
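The idea behind vertical fusion can be sketched in plain numpy (this only illustrates the concept; numpy still materializes intermediates, whereas a fused TensorRT kernel really computes everything in one pass):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((4, 8), dtype=np.float32)
w = rng.random((8, 8), dtype=np.float32)
b = rng.random(8, dtype=np.float32)

# Unfused: three sequential "kernels", each reading and writing
# a full intermediate tensor (extra launches and memory traffic)
def unfused(x):
    y = x @ w                 # kernel 1: matmul
    y = y + b                 # kernel 2: bias add
    return np.maximum(y, 0)   # kernel 3: ReLU

# "Fused": one combined computation, analogous to TensorRT merging
# sequential MatMul/BiasAdd/ReLU nodes into a single kernel
def fused(x):
    return np.maximum(x @ w + b, 0)

# Fusion must not change the result, only how it is computed
assert np.allclose(unfused(x), fused(x))
```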
3. Kernel auto-tuning
During model optimization, TensorRT benchmarks several candidate kernels for each operation and keeps the fastest one.
- There are multiple low-level algorithms/implementations for common operations
- TensorRT selects the optimal kernels based on your parameters, e.g. batch size, filter size, input data size
- TensorRT selects the optimal kernel based on your target platform.
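Auto-tuning boils down to "time every candidate implementation on the real input shape, keep the winner". A toy Python sketch of that selection loop (the candidate "kernels" here are just two ways of summing a matrix, not real CUDA kernels):

```python
import time
import numpy as np

a = np.random.rand(256, 256).astype(np.float32)

# Two candidate implementations of the same operation
candidates = {
    "python_loop": lambda m: sum(float(v) for row in m for v in row),
    "numpy_sum":   lambda m: float(m.sum()),
}

# Time each candidate on the actual input, the way TensorRT's
# auto-tuner benchmarks kernels per layer, shape and batch size
timings = {}
for name, fn in candidates.items():
    start = time.perf_counter()
    fn(a)
    timings[name] = time.perf_counter() - start

best = min(timings, key=timings.get)
print("selected kernel:", best)
```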
4. Dynamic Tensor Memory
- Allocates just the memory required for each tensor and only for the duration of its usage
- Reduces memory footprint and improves memory re-use
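A hypothetical pool sketch of this idea: once a tensor's lifetime ends, its buffer goes back to a free list and the next tensor of the same shape reuses it instead of triggering a new allocation.

```python
import numpy as np

class TensorPool:
    """Toy allocator: reuse buffers whose tensors are no longer live,
    instead of allocating fresh memory for every intermediate."""

    def __init__(self):
        self.free = {}  # shape -> list of reusable buffers

    def acquire(self, shape):
        bufs = self.free.get(shape, [])
        # Reuse a released buffer if one matches, else allocate
        return bufs.pop() if bufs else np.empty(shape, dtype=np.float32)

    def release(self, buf):
        self.free.setdefault(buf.shape, []).append(buf)

pool = TensorPool()
a = pool.acquire((1024,))
pool.release(a)            # a's lifetime is over
b = pool.acquire((1024,))  # reuses a's memory, no new allocation
assert a is b
```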
5. Multiple Stream Execution
- Allows processing multiple input streams in parallel
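The effect is analogous to fanning inference requests out over worker threads; a stand-in sketch using Python threads in place of CUDA streams (the `infer` function here is a dummy, not a real model call):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def infer(batch):
    # Stand-in for one inference call on one input stream
    return float(batch.sum())

streams = [np.ones(1000) * i for i in range(4)]

# Process the four input streams in parallel, analogous to
# TensorRT enqueueing work on multiple CUDA streams
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(infer, streams))

print(results)  # one result per stream
```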
To apply TensorRT to a deep learning model, we need to convert the model to a TF-TRT model following the steps below.
1. Installing the TensorRT environment
To install TensorRT on your system, the following requirements must be met:
- Tensorflow-GPU >=2.0
pip install tensorflow-gpu==2.0.0
sudo dpkg -i nvidia-machine-learning-repo-*.deb
sudo apt-get install libnvinfer5
pip install 'h5py==2.10.0' --force-reinstall
2. Convert model ResNet-50 to TF-TRT
You can convert other models to TensorRT as well; here I take ResNet-50 as an example.
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.keras.applications.resnet50 import ResNet50

# Load model => save as SavedModel
model = ResNet50(weights='imagenet')
model.save('resnet50_saved_model')

# Convert to TF-TRT => SavedModel
print('Converting to TF-TRT FP32...')
# To convert to TF-TRT FP32 : trt.TrtPrecisionMode.FP32
# To convert to TF-TRT FP16 : trt.TrtPrecisionMode.FP16
# To convert to TF-TRT INT8 : trt.TrtPrecisionMode.INT8
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP32)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='resnet50_saved_model',
    conversion_params=conversion_params)

# Converter method used to partition and optimize TensorRT compatible segments
converter.convert()

# Save the converted model to disk
converter.save(output_saved_model_dir='resnet50_saved_model_TFTRT_FP32')
print('Done Converting to TF-TRT FP32')
3. Reload converted model
# Load converted model and infer
import tensorflow as tf
import numpy as np

root = tf.saved_model.load('/content/resnet50_saved_model_TFTRT_FP32')
infer = root.signatures['serving_default']

# Run inference on a dummy input batch (1 image, 224x224, 3 channels)
input_tensor = tf.convert_to_tensor(np.zeros((1, 224, 224, 3), dtype=np.float32))
output = infer(input_tensor)
To compare TensorRT with native inference, I ran ResNet-50 inference with TF-TRT FP32, FP16, INT8 and with native TensorFlow.
python tf2_inference.py --use_tftrt_model --precision fp16
python tf2_inference.py --use_tftrt_model --precision fp32
python tf2_inference.py --use_tftrt_model --precision int8
python tf2_inference.py --use_native_tensorflow
| Metric | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| Average step time | 2.1 msec | 2.5 msec | 4.1 msec |
| Average throughput | 244248 samples/sec | 240145 samples/sec | 126328 samples/sec |
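Reading the fastest and slowest step times off the results above (the source does not label which run is which precision, so this only compares best against worst):

```python
# Average step times from the benchmark table, in milliseconds
step_times_ms = [2.1, 2.5, 4.1]

# Speed-up of the fastest run over the slowest run
speedup = max(step_times_ms) / min(step_times_ms)
print(round(speedup, 2))  # → 1.95
```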
Thereby, we see that converting the model to TensorRT increases inference speed and reduces latency quite significantly compared to native inference.
Full code: https://colab.research.google.com/drive/15m95GzznIoCRn1XnMQXd9L-onpJiCWM3?usp=sharing