What is TensorRT?
TensorRT is a library developed by NVIDIA to speed up inference and reduce latency on NVIDIA GPUs. It can improve inference speed by roughly 2-4x for real-time services, and by up to about 30x compared with CPU-only performance.
In this article, we focus on the following issues:
- Why does TensorRT improve inference speed?
- Does the improved speed come with a trade-off?
- How to use TensorRT with a deep learning model?
How does TensorRT optimize inference?
TensorRT performs five types of optimization to increase inference performance. We discuss each of them below.
1. Precision Calibration
The parameters and activations trained in FP32 (32-bit floating point) precision are converted to FP16 or INT8. This reduces latency and increases inference throughput, at the cost of a small reduction in model accuracy. In real-time recognition, such a trade-off between accuracy and inference speed is often acceptable.
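FP32 and FP16 conversion only need the precision mode to be chosen, but INT8 additionally requires representative input data so TF-TRT can calibrate the dynamic ranges of activations. Below is a minimal sketch, assuming a ResNet-50 SavedModel already exported to `resnet50_saved_model` (as in the code section later) and using a dummy batch in place of real calibration images:

```python
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.INT8,
    use_calibration=True)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='resnet50_saved_model',  # assumes the SavedModel exported later
    conversion_params=params)

def calibration_input_fn():
    # A real pipeline should yield batches of representative, preprocessed images
    yield (np.random.uniform(0, 255, (8, 224, 224, 3)).astype(np.float32),)

converter.convert(calibration_input_fn=calibration_input_fn)
converter.save(output_saved_model_dir='resnet50_saved_model_TFTRT_INT8')
```

In practice the calibration batches should come from the same distribution as the data you will serve; otherwise the INT8 ranges can clip important activations and hurt accuracy more than expected.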
2. Layer & Tensor Fusion
TensorRT combines layers and tensors to optimize GPU memory and bandwidth by fusing nodes vertically, horizontally, or both (a small illustration follows this list).
- Improve GPU utilization – less kernel launch overhead, better memory usage and bandwidth
- Vertical fusion = Combine sequential kernel calls
- Horizontal fusion = Combine same kernels that have common input but different weights
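As an illustration only (this is plain Keras, not TensorRT code), the snippet below shows a typical Conv2D -> BatchNormalization -> ReLU chain; during conversion TensorRT can collapse such a sequence into a single fused kernel (vertical fusion), and it can merge parallel branches that read the same input into one launch (horizontal fusion).

```python
import tensorflow as tf

# A typical fusable pattern: convolution, normalization/bias, activation.
# TensorRT can merge these sequential ops into one fused "CBR"-style kernel.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(64, 3, padding='same')(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
model = tf.keras.Model(inputs, x)
model.summary()
```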
3. Kernel auto-tuning
During model optimization, TensorRT times several candidate kernels for each operation and keeps the fastest ones for the target hardware (see the sketch after this list).
- There are multiple low-level algorithms/implementations for common operations
- TensorRT selects the optimal kernels based on your parameters, e.g. batch size, filter size, input data size.
- TensorRT selects the optimal kernel based on your target platform.
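In TF-TRT, this kernel selection happens when the TensorRT engines are built for concrete input shapes. A minimal sketch, assuming a SavedModel exported to `resnet50_saved_model` as in the code section below, that forces engine building (and therefore auto-tuning) ahead of the first inference:

```python
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Assumes a SavedModel exported to 'resnet50_saved_model' (see the code section below)
converter = trt.TrtGraphConverterV2(input_saved_model_dir='resnet50_saved_model')
converter.convert()

def input_fn():
    # The shapes yielded here decide which TensorRT engines get built and auto-tuned
    yield (np.zeros((1, 224, 224, 3), dtype=np.float32),)

# build() runs each engine once per shape, which is when TensorRT times candidate
# kernels on the target GPU and keeps the fastest ones
converter.build(input_fn=input_fn)
converter.save(output_saved_model_dir='resnet50_saved_model_TFTRT_prebuilt')
```

Without `build()`, the engines are built lazily, so the first inference request pays the auto-tuning cost.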
4. Dynamic Tensor Memory
- Allocates just the memory required for each tensor and only for the duration of its usage
- Reduces memory footprint and improves memory re-use
5. Multiple Stream Execution
- Allows processing multiple input streams in parallel
Workflow
To apply TensorRT to a deep learning model, we convert the trained model to a TRT model following the workflow below.
Code
1. Installing the TensorRT environment
To install TensorRT on your system, you need:
- An NVIDIA GPU
- TensorFlow-GPU >= 2.0
```bash
pip install tensorflow-gpu==2.0.0
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
dpkg -i nvidia-machine-learning-repo-*.deb
apt-get update
sudo apt-get install libnvinfer5
pip install 'h5py==2.10.0' --force-reinstall
```
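After installation, a quick sanity check is worthwhile before converting anything. This is only a minimal sketch: it confirms that TensorFlow can see the GPU and that the TF-TRT converter module imports cleanly.

```python
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Should list at least one GPU; an empty list means the CUDA/driver setup is incomplete
print(tf.config.experimental.list_physical_devices('GPU'))
# Prints the default conversion parameters if TF-TRT imports correctly
print(trt.DEFAULT_TRT_CONVERSION_PARAMS)
```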
2. Convert model ResNet-50 to TF-TRT
You can convert other models to TensorRT as well; here I take ResNet-50 as an example.
```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.keras.applications.resnet50 import ResNet50

# Load the pretrained model and export it as a SavedModel
model = ResNet50(weights='imagenet')
model.save('/content/resnet50_saved_model')

# Convert to TF-TRT => SavedModel
print('Converting to TF-TRT FP32 or FP16 or INT8...')
# For TF-TRT FP32: trt.TrtPrecisionMode.FP32
# For TF-TRT FP16: trt.TrtPrecisionMode.FP16
# For TF-TRT INT8: trt.TrtPrecisionMode.INT8
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP32,
    max_workspace_size_bytes=8000000000)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='resnet50_saved_model',
    conversion_params=conversion_params)

# Partition the graph and optimize the TensorRT-compatible segments
converter.convert()

# Save the converted model to disk
converter.save(output_saved_model_dir='resnet50_saved_model_TFTRT_FP32')
print('Done Converting to TF-TRT FP32')
```
3. Reload converted model
```python
# Load the converted model and run inference
import tensorflow as tf

root = tf.saved_model.load('/content/resnet50_saved_model_TFTRT_FP32')
infer = root.signatures['serving_default']
# output = infer(input_tensor)
print(infer)
```
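The loaded signature can then be called like a normal function. The sketch below is an assumption-heavy example: it uses a random array in place of a real 224x224 image, and it looks up the output tensor by position because the exact output key depends on the serving signature printed above.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions

# Hypothetical input: random values standing in for a real, preprocessed photo
image = np.random.uniform(0, 255, (1, 224, 224, 3)).astype(np.float32)
batch = tf.constant(preprocess_input(image))

output = infer(batch)                      # `infer` comes from the snippet above
preds = list(output.values())[0].numpy()   # output key depends on the serving signature
print(decode_predictions(preds, top=3))
```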
Result
To compare TensorRT with native inference, I ran ResNet-50 inference with TF-TRT FP32, FP16, and INT8, and with native TensorFlow.
```bash
wget https://raw.githubusercontent.com/tensorflow/tensorrt/master/tftrt/blog_posts/Leveraging%20TensorFlow-TensorRT%20integration%20for%20Low%20latency%20Inference/tf2_inference.py
python tf2_inference.py --use_tftrt_model --precision fp16
python tf2_inference.py --use_tftrt_model --precision fp32
python tf2_inference.py --use_tftrt_model --precision int8
python tf2_inference.py --use_native_tensorflow
```
| Metric | FP16 | FP32 | Native |
|---|---|---|---|
| Average step time | 2.1 msec | 2.5 msec | 4.1 msec |
| Average throughput | 244248 samples/sec | 240145 samples/sec | 126328 samples/sec |
The results show that converting the model to TensorRT significantly increases inference speed and reduces latency compared with native inference.
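If you want to reproduce a rough version of these numbers without the benchmark script, a minimal sketch along the following lines works, assuming the `infer` signature loaded in step 3; real measurements should use representative data and more iterations.

```python
import time
import numpy as np
import tensorflow as tf

# Hypothetical batch of random data; use representative inputs for real measurements
batch = tf.constant(np.random.uniform(0, 255, (32, 224, 224, 3)).astype(np.float32))

# Warm-up so one-time engine building and first-call overhead are excluded
for _ in range(10):
    infer(batch)

steps = 100
start = time.perf_counter()
for _ in range(steps):
    infer(batch)
elapsed = time.perf_counter() - start

print('Average step time: %.2f msec' % (elapsed / steps * 1000))
print('Average throughput: %.0f samples/sec' % (steps * 32 / elapsed))
```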
References
Entire code: https://colab.research.google.com/drive/15m95GzznIoCRn1XnMQXd9L-onpJiCWM3?usp=sharing