What is TensorRT?
TensorRT is a library developed by NVIDIA to speed up inference and reduce latency on NVIDIA GPUs. It can improve inference speed by roughly 2-4x for real-time services, and by up to about 30x compared with CPU-only performance.
In this article, we focus on the following issues:
- Why does TensorRT improve inference speed?
- Does the improved speed come with a trade-off?
- How to use TensorRT with a deep learning model?
How does TensorRT optimize inference?
TensorRT performs five types of optimization to increase inference performance. We discuss each of them below.
1. Precision Calibration
The parameters and activations trained in FP32 (32-bit floating point) precision are converted to FP16 or INT8. This reduces latency and increases inference throughput, at the cost of a small reduction in model accuracy. In real-time recognition, such a trade-off between accuracy and inference speed is often acceptable.
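FP32 and FP16 conversion only need the precision mode to be chosen, but INT8 additionally requires representative input data so TF-TRT can calibrate the dynamic ranges of activations. Below is a minimal sketch, assuming a ResNet-50 SavedModel already exported to `resnet50_saved_model` (as in the code section later) and using a dummy batch in place of real calibration images:

```python
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.INT8,
    use_calibration=True)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='resnet50_saved_model',  # assumes the SavedModel exported later
    conversion_params=params)

def calibration_input_fn():
    # A real pipeline should yield batches of representative, preprocessed images
    yield (np.random.uniform(0, 255, (8, 224, 224, 3)).astype(np.float32),)

converter.convert(calibration_input_fn=calibration_input_fn)
converter.save(output_saved_model_dir='resnet50_saved_model_TFTRT_INT8')
```

In practice the calibration batches should come from the same distribution as the data you will serve; otherwise the INT8 ranges can clip important activations and hurt accuracy more than expected.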
2. Layer & Tensor Fusion
TensorRT combines layers and tensors to optimize GPU memory and bandwidth by fusing nodes vertically, horizontally, or both (a small illustration follows this list).
- Improve GPU utilization – less kernel launch overhead, better memory usage and bandwidth
- Vertical fusion = Combine sequential kernel calls
- Horizontal fusion = Combine same kernels that have common input but different weights
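As an illustration only (this is plain Keras, not TensorRT code), the snippet below shows a typical Conv2D -> BatchNormalization -> ReLU chain; during conversion TensorRT can collapse such a sequence into a single fused kernel (vertical fusion), and it can merge parallel branches that read the same input into one launch (horizontal fusion).

```python
import tensorflow as tf

# A typical fusable pattern: convolution, normalization/bias, activation.
# TensorRT can merge these sequential ops into one fused "CBR"-style kernel.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(64, 3, padding='same')(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
model = tf.keras.Model(inputs, x)
model.summary()
```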
3. Kernel auto-tuning
During model optimization, TensorRT times several candidate kernels for each operation and keeps the fastest ones for the target hardware (see the sketch after this list).
- There are multiple low-level algorithms/implementations for common operations
- TensorRT selects the optimal kernels based on your parameters, e.g. batch size, filter size, input data size.
- TensorRT selects the optimal kernel based on your target platform.
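In TF-TRT, this kernel selection happens when the TensorRT engines are built for concrete input shapes. A minimal sketch, assuming a SavedModel exported to `resnet50_saved_model` as in the code section below, that forces engine building (and therefore auto-tuning) ahead of the first inference:

```python
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Assumes a SavedModel exported to 'resnet50_saved_model' (see the code section below)
converter = trt.TrtGraphConverterV2(input_saved_model_dir='resnet50_saved_model')
converter.convert()

def input_fn():
    # The shapes yielded here decide which TensorRT engines get built and auto-tuned
    yield (np.zeros((1, 224, 224, 3), dtype=np.float32),)

# build() runs each engine once per shape, which is when TensorRT times candidate
# kernels on the target GPU and keeps the fastest ones
converter.build(input_fn=input_fn)
converter.save(output_saved_model_dir='resnet50_saved_model_TFTRT_prebuilt')
```

Without `build()`, the engines are built lazily, so the first inference request pays the auto-tuning cost.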
4. Dynamic Tensor Memory
- Allocates just the memory required for each tensor and only for the duration of its usage
- Reduces memory footprint and improves memory re-use
5. Multiple Stream Execution
- Allows processing multiple input streams in parallel
Workflow
To apply TensorRT to a deep learning model, we convert the trained model to a TRT model following the workflow below.
Code
1. Installing the TensorRT environment
To install TensorRT on your system, you need:
- An NVIDIA GPU
- TensorFlow-GPU >= 2.0
```bash
pip install tensorflow-gpu==2.0.0
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
dpkg -i nvidia-machine-learning-repo-*.deb
apt-get update
sudo apt-get install libnvinfer5
pip install 'h5py==2.10.0' --force-reinstall
```
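After installation, a quick sanity check is worthwhile before converting anything. This is only a minimal sketch: it confirms that TensorFlow can see the GPU and that the TF-TRT converter module imports cleanly.

```python
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Should list at least one GPU; an empty list means the CUDA/driver setup is incomplete
print(tf.config.experimental.list_physical_devices('GPU'))
# Prints the default conversion parameters if TF-TRT imports correctly
print(trt.DEFAULT_TRT_CONVERSION_PARAMS)
```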
2. Convert model ResNet-50 to TF-TRT
You can convert other models to TensorRT as well; here I take ResNet-50 as an example.
```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.keras.applications.resnet50 import ResNet50

# Load the pretrained model and export it as a SavedModel
model = ResNet50(weights='imagenet')
model.save('/content/resnet50_saved_model')

# Convert to TF-TRT => SavedModel
print('Converting to TF-TRT FP32 or FP16 or INT8...')
# For TF-TRT FP32: trt.TrtPrecisionMode.FP32
# For TF-TRT FP16: trt.TrtPrecisionMode.FP16
# For TF-TRT INT8: trt.TrtPrecisionMode.INT8
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP32,
    max_workspace_size_bytes=8000000000)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='resnet50_saved_model',
    conversion_params=conversion_params)

# Partition the graph and optimize the TensorRT-compatible segments
converter.convert()

# Save the converted model to disk
converter.save(output_saved_model_dir='resnet50_saved_model_TFTRT_FP32')
print('Done Converting to TF-TRT FP32')
```
3. Reload converted model
```python
# Load the converted model and run inference
import tensorflow as tf

root = tf.saved_model.load('/content/resnet50_saved_model_TFTRT_FP32')
infer = root.signatures['serving_default']
# output = infer(input_tensor)
print(infer)
```
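The loaded signature can then be called like a normal function. The sketch below is an assumption-heavy example: it uses a random array in place of a real 224x224 image, and it looks up the output tensor by position because the exact output key depends on the serving signature printed above.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions

# Hypothetical input: random values standing in for a real, preprocessed photo
image = np.random.uniform(0, 255, (1, 224, 224, 3)).astype(np.float32)
batch = tf.constant(preprocess_input(image))

output = infer(batch)                      # `infer` comes from the snippet above
preds = list(output.values())[0].numpy()   # output key depends on the serving signature
print(decode_predictions(preds, top=3))
```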
Result
To compare TensorRT with native inference, I ran ResNet-50 inference with TF-TRT FP32, FP16, and INT8, and with native TensorFlow.
```bash
wget https://raw.githubusercontent.com/tensorflow/tensorrt/master/tftrt/blog_posts/Leveraging%20TensorFlow-TensorRT%20integration%20for%20Low%20latency%20Inference/tf2_inference.py
python tf2_inference.py --use_tftrt_model --precision fp16
python tf2_inference.py --use_tftrt_model --precision fp32
python tf2_inference.py --use_tftrt_model --precision int8
python tf2_inference.py --use_native_tensorflow
```
| Metric | FP16 | FP32 | Native |
|---|---|---|---|
| Average step time | 2.1 msec | 2.5 msec | 4.1 msec |
| Average throughput | 244248 samples/sec | 240145 samples/sec | 126328 samples/sec |
The results show that converting the model to TensorRT significantly increases inference speed and reduces latency compared with native inference.
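If you want to reproduce a rough version of these numbers without the benchmark script, a minimal sketch along the following lines works, assuming the `infer` signature loaded in step 3; real measurements should use representative data and more iterations.

```python
import time
import numpy as np
import tensorflow as tf

# Hypothetical batch of random data; use representative inputs for real measurements
batch = tf.constant(np.random.uniform(0, 255, (32, 224, 224, 3)).astype(np.float32))

# Warm-up so one-time engine building and first-call overhead are excluded
for _ in range(10):
    infer(batch)

steps = 100
start = time.perf_counter()
for _ in range(steps):
    infer(batch)
elapsed = time.perf_counter() - start

print('Average step time: %.2f msec' % (elapsed / steps * 1000))
print('Average throughput: %.0f samples/sec' % (steps * 32 / elapsed))
```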
References
Entire code: https://colab.research.google.com/drive/15m95GzznIoCRn1XnMQXd9L-onpJiCWM3?usp=sharing