TorchServe, PyTorch model deployment support tool

Wednesday, 16/12/2020

Tram Ho

Introduction

Today I will briefly introduce a deployment tool built specifically for PyTorch models. The tool is called TorchServe. It was developed only recently, so its repo has fewer stars than TensorFlow Serving and it has more bugs. Git repo: https://github.com/pytorch/serve

TorchServe system diagram

As the diagram shows, the TorchServe system is divided into 3 parts: API, Core (Frontend & Backend), and Model Storage.

The API is divided into two parts, the Management API and the Inference API. The first manages the state of queries, the state of the models, and the number of workers; the second is where user requests are received.

The core of TorchServe has 2 parts: Frontend and Backend. The frontend receives user requests, groups them into batches when several arrive together, and returns request status logs. These batched requests pass through the inference endpoint to the backend, which distributes them to the worker processes; each worker manages one instance of the trained model.
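To make the batching step more concrete, here is a toy sketch in Python of how a frontend could group incoming requests. It is not TorchServe’s actual implementation: the function name and the millisecond timestamps are made up for illustration, and the two knobs simply mirror the batch_size and max_batch_delay parameters of the Management API.

```python
def group_into_batches(arrivals_ms, batch_size, max_batch_delay_ms):
    """Group request arrival times into batches (simplified model).

    A batch is closed when it already holds batch_size requests, or when
    the next request arrives more than max_batch_delay_ms after the first
    request of the batch.
    """
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        batch_is_full = len(current) >= batch_size
        waited_too_long = bool(current) and t - current[0] > max_batch_delay_ms
        if batch_is_full or waited_too_long:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Five requests, batches of at most 2, waiting at most 100 ms
batches = group_into_batches([0, 10, 20, 150, 160], batch_size=2, max_batch_delay_ms=100)
print(batches)
```

Each inner list would then be handed to one worker as a single batch.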

So where do the models come from? From the model storage, of course, which holds many models used for different tasks: classification, detection, segmentation, … Each model can have many different versions. TorchServe automatically loads models based on the user’s config.

Install TorchServe and torch-model-archiver

First, clone the repo:

git clone https://github.com/pytorch/serve
cd serve

Based on the environment you need, the following installation options are available:

With CPU for Torch 1.7.1

python ./ts_scripts/install_dependencies.py

With GPU and Cuda 10.2

python ./ts_scripts/install_dependencies.py --cuda=cu102

With GPU and Cuda 10.1

python ./ts_scripts/install_dependencies.py --cuda=cu101

With GPU and Cuda 9.2

python ./ts_scripts/install_dependencies.py --cuda=cu92

This installs the necessary dependencies.

Next, install the two main packages, torchserve and torch-model-archiver, using either conda or pip.

With Conda

conda install torchserve torch-model-archiver -c pytorch

With Pip

pip install torchserve torch-model-archiver

Save models using TorchServe

Create a folder anywhere, named model_store

mkdir model_store

Download a sample model to deploy and run predictions with. Here I use densenet161:

wget https://download.pytorch.org/models/densenet161-8d451a50.pth

Use the torch-model-archiver tool to package the model in a format that TorchServe supports (a .mar file):

torch-model-archiver \
--model-name densenet161 \
--version 1.0 \
--model-file ./serve/examples/image_classifier/densenet_161/model.py \
--serialized-file densenet161-8d451a50.pth \
--export-path model_store \
--extra-files ./serve/examples/image_classifier/index_to_name.json \
--handler image_classifier

The parameters in the command above:

model-name : name of the model
version : version of the model
model-file : Python file that defines the model architecture; not needed for TorchScript models, which already include the architecture (Optional)
serialized-file : required, path to the trained model file (.pt or .pth) to be packaged
export-path : folder where the resulting .mar file is written
extra-files : additional files to bundle, here the JSON file containing the labels (Optional)
handler : required, the code that handles preprocessing and post-processing; you can use one of the handlers available in the repo, inherit from them, or write your own.
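If you prefer to drive the archiver from a script, the same call can be assembled in Python. This is only a sketch: build_archiver_command is a helper name I made up, it only covers the flags described above, and the paths are the ones used in this article.

```python
import shlex


def build_archiver_command(model_name, version, serialized_file, handler,
                           export_path="model_store", model_file=None,
                           extra_files=None):
    """Return the torch-model-archiver call as an argument list.

    The list can be passed to subprocess.run(cmd, check=True).
    """
    cmd = [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", serialized_file,
        "--export-path", export_path,
        "--handler", handler,
    ]
    # Optional flags are only added when a value is given
    if model_file:
        cmd += ["--model-file", model_file]
    if extra_files:
        cmd += ["--extra-files", extra_files]
    return cmd


cmd = build_archiver_command(
    model_name="densenet161",
    version="1.0",
    serialized_file="densenet161-8d451a50.pth",
    handler="image_classifier",
    model_file="./serve/examples/image_classifier/densenet_161/model.py",
    extra_files="./serve/examples/image_classifier/index_to_name.json",
)
# Print the command in a shell-safe form for inspection
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list avoids shell quoting problems when a path contains spaces.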

Result: a densenet161.mar file in the model_store folder.

Run the TorchServe server

Once you have the .mar file, run the command below. It opens the endpoints for user requests and starts the background processes that serve the model.

torchserve \
--start --ncs \
--model-store model_store \
--models densenet161.mar

start : start a TorchServe session
stop : stop the TorchServe session
model-store : folder that contains the models, namely the folder holding the .mar file created earlier
models : models to load, e.g. densenet161.mar
log-config : config file for logging
ts-config : config file for TorchServe itself, for example to change the ports
foreground : show logs in the terminal while running; if not set, TorchServe runs in the background
ncs : disable snapshots
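As an illustration of what a ts-config file can contain, the sketch below renders a minimal config.properties that sets the three API addresses. The helper name render_ts_config is my own, and the default ports are simply the ones this article uses; check the TorchServe configuration docs for your version before relying on it.

```python
def render_ts_config(inference_port=8080, management_port=8081,
                     metrics_port=8082, host="127.0.0.1"):
    """Render the address settings of a config.properties file as text."""
    lines = [
        f"inference_address=http://{host}:{inference_port}",
        f"management_address=http://{host}:{management_port}",
        f"metrics_address=http://{host}:{metrics_port}",
    ]
    return "\n".join(lines) + "\n"


# Example: move the Inference API to port 9080 and keep the other defaults
config_text = render_ts_config(inference_port=9080)
print(config_text)
```

Save the text as config.properties and pass it with --ts-config config.properties when starting the server.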

Link Inference API: http://127.0.0.1:8080

Link Management API: http://127.0.0.1:8081

Link Metric API: http://127.0.0.1:8082

Let’s set the last two links aside and start with the first one. It is used to get predictions through the REST API.

First, download a sample image:

curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/kitten_small.jpg

Use the command line to send the image to the TorchServe endpoint. Note that curl -T uploads the file with an HTTP PUT request rather than POST:

curl http://127.0.0.1:8080/predictions/densenet161 -T kitten_small.jpg

Output:

{
  "tabby": 0.5237820744514465,
  "tiger_cat": 0.18530139327049255,
  "lynx": 0.15431317687034607,
  "tiger": 0.056817926466464996,
  "Egyptian_cat": 0.04702862352132797
}
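The same request can be made from Python with only the standard library. This is a minimal sketch under a few assumptions: predict and top_prediction are names I chose, the Content-Type header is my own choice so that the bytes are sent as a raw body, and the response is assumed to be a flat label-to-probability JSON object like the output above (other TorchServe versions may shape it differently).

```python
import json
import urllib.request


def predict(image_path, model_name="densenet161",
            base_url="http://127.0.0.1:8080"):
    """POST the raw image bytes to the predictions endpoint, return parsed JSON."""
    with open(image_path, "rb") as f:
        payload = f.read()
    request = urllib.request.Request(
        f"{base_url}/predictions/{model_name}",
        data=payload,
        method="POST",
        # Assumption: send the image as an opaque binary body
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def top_prediction(scores):
    """Return the (label, probability) pair with the highest probability."""
    label = max(scores, key=scores.get)
    return label, scores[label]


# Sample response copied from the output above
sample = {
    "tabby": 0.5237820744514465,
    "tiger_cat": 0.18530139327049255,
    "lynx": 0.15431317687034607,
    "tiger": 0.056817926466464996,
    "Egyptian_cat": 0.04702862352132797,
}
print(top_prediction(sample))
```

With the server running, top_prediction(predict("kitten_small.jpg")) would give the best label for the downloaded image.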

Predicting via gRPC

First, install the gRPC libraries:

pip install -U grpcio protobuf grpcio-tools

In the serve folder, use the proto files to generate the gRPC client stubs:

python -m grpc_tools.protoc \
--proto_path=frontend/server/src/main/resources/proto/ \
--python_out=ts_scripts \
--grpc_python_out=ts_scripts \
frontend/server/src/main/resources/proto/inference.proto \
frontend/server/src/main/resources/proto/management.proto

Register the model:

python ts_scripts/torchserve_grpc_client.py register densenet161

Predict a sample using gRPC python client

python ts_scripts/torchserve_grpc_client.py infer densenet161 examples/image_classifier/kitten.jpg

Unregister the model:

python ts_scripts/torchserve_grpc_client.py unregister densenet161

By default, TorchServe uses 2 ports for gRPC: 7070 for the gRPC Inference API and 7071 for the gRPC Management API.

I don’t have results to show for gRPC: I did try it but ran into a bug. This prediction method had only just been added to the TorchServe repo, so errors are to be expected.

Management API

Once you have multiple models, you need an efficient management tool, and TorchServe provides one through the Management API endpoint. Supported functions:

Register a model
Increase / decrease the number of workers for a specified model
Describe a model’s state
Unregister a model
List registered models
Set a model version as the default

Model registration

Use the POST method: POST /models

List of parameters:

url: path to a .mar file, or a link to download the model from the Internet. Example: https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar
model_name: name of the model
handler: inference handler; make sure it is in PYTHONPATH. Format: module_name:method_name
runtime: defaults to PYTHON
batch_size: defaults to 1
max_batch_delay: maximum time to wait for a batch to fill, defaults to 100 ms
initial_workers: number of workers to create at registration, defaults to 0; TorchServe will not serve the model until it has at least one worker
synchronous: whether workers are created synchronously or asynchronously, defaults to false
response_timeout: time allowed for a worker to respond, defaults to 120 s

curl -X POST "http://localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"

{
  "status": "Model \"squeezenet_v1.1\" Version: 1.0 registered with 0 initial workers. Use scale workers API to add workers for the model."
}

curl -v -X POST "http://localhost:8081/models?initial_workers=1&synchronous=false&url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"

*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8081 (#0)
> POST /models?initial_workers=1&synchronous=false&url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar HTTP/1.1
> Host: localhost:8081
> User-Agent: curl/7.58.0
> Accept: */*
>
< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 61d2b2b4-2a3a-49d4-84c9-e6f2f92cd36d
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 47
< connection: keep-alive
<
{
  "status": "Processing worker updates..."
}
* Connection #0 to host localhost left intact
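When registering models from a script, the query string is easier to get right if it is built programmatically. The sketch below assembles the registration URL from the parameters listed above; build_register_url is a hypothetical helper of mine, not part of TorchServe.

```python
from urllib.parse import urlencode


def build_register_url(mar_url, management_url="http://localhost:8081", **params):
    """Build the POST /models URL for registering a model.

    Extra keyword arguments (initial_workers, synchronous, batch_size, ...)
    are appended as query parameters.
    """
    query = {"url": mar_url, **params}
    # Booleans are sent as lowercase strings, matching the curl examples
    query = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in query.items()
    }
    # Keep ':' and '/' readable inside the model URL
    return f"{management_url}/models?{urlencode(query, safe=':/')}"


register_url = build_register_url(
    "https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar",
    initial_workers=1,
    synchronous=False,
)
print(register_url)
```

The resulting URL can be sent with curl -X POST, or with urllib.request.Request(register_url, method="POST") once the server is running.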

Scale workers

Use the PUT method: PUT /models/{model_name}

List of parameters:

min_worker: (Optional) minimum number of workers, default 1
max_worker: (Optional) maximum number of workers, default 1; TorchServe will not create more workers than this
number_gpu: (Optional) number of GPU workers to create, default 0; if the number of workers exceeds the number of GPUs on the machine, the remaining workers run on the CPU
synchronous: whether the call is synchronous, default false
timeout: how long to wait for a worker to finish its pending requests; once exceeded, the worker process is terminated. 0 terminates the worker immediately, -1 waits indefinitely. Default -1

curl -v -X PUT "http://localhost:8081/models/squeezenet1_1/?min_worker=3"

*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8081 (#0)
> PUT /models/squeezenet1_1/?min_worker=3 HTTP/1.1
> Host: localhost:8081
> User-Agent: curl/7.58.0
> Accept: */*
>
< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: b508190b-ef7d-4e7a-a361-6dac1036d2bd
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 47
< connection: keep-alive
<
{
  "status": "Processing worker updates..."
}
* Connection #0 to host localhost left intact

If the model has multiple versions: PUT /models/{model_name}/{version}

curl -v -X PUT "http://localhost:8081/models/squeezenet1_1/1.0?min_worker=3"
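For scripted scaling, the PUT request can be prepared with the standard library. A minimal sketch, assuming the endpoint layout shown above; build_scale_request is a helper name of my own and it only covers min_worker and max_worker.

```python
import urllib.request
from urllib.parse import urlencode


def build_scale_request(model_name, min_worker, max_worker=None, version=None,
                        management_url="http://localhost:8081"):
    """Prepare (but do not send) a PUT request that scales a model's workers."""
    path = f"/models/{model_name}"
    if version:
        # Target one specific version of the model
        path += f"/{version}"
    query = {"min_worker": min_worker}
    if max_worker is not None:
        query["max_worker"] = max_worker
    url = f"{management_url}{path}?{urlencode(query)}"
    return urllib.request.Request(url, method="PUT")


request = build_scale_request("squeezenet1_1", min_worker=3, version="1.0")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen(request) sends it to a running server.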

Model description

Use the GET method: GET /models/{model_name}

curl http://localhost:8081/models/squeezenet1_1

[
  {
    "modelName": "squeezenet1_1",
    "modelVersion": "1.0",
    "modelUrl": "https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar",
    "runtime": "python",
    "minWorkers": 3,
    "maxWorkers": 3,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "loadedAtStartup": false,
    "workers": [
      {
        "id": "9001",
        "startTime": "2020-12-16T15:13:43.722Z",
        "status": "READY",
        "gpu": true,
        "memoryUsage": 2044100608
      },
      {
        "id": "9002",
        "startTime": "2020-12-16T15:52:52.561Z",
        "status": "READY",
        "gpu": true,
        "memoryUsage": 2045640704
      },
      {
        "id": "9003",
        "startTime": "2020-12-16T15:52:52.561Z",
        "status": "READY",
        "gpu": true,
        "memoryUsage": 2060914688
      }
    ]
  }
]

To describe every version of a model: GET /models/{model_name}/all
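The describe response is handy for quick health checks. The sketch below parses a response shaped like the output above and reports, per model version, how many workers are READY and how much memory they report in total. summarize_workers is my own helper, and the field names are taken from the sample output, so treat them as an assumption for other TorchServe versions.

```python
import json


def summarize_workers(describe_json):
    """Summarize worker status for each model version in a describe response."""
    summaries = []
    for model in json.loads(describe_json):
        workers = model.get("workers", [])
        summaries.append({
            "model": model["modelName"],
            "version": model["modelVersion"],
            "ready": sum(1 for w in workers if w["status"] == "READY"),
            "memory_bytes": sum(w.get("memoryUsage", 0) for w in workers),
        })
    return summaries


# Trimmed copy of the response shown above
sample_response = """
[{"modelName": "squeezenet1_1", "modelVersion": "1.0", "workers": [
  {"id": "9001", "status": "READY", "memoryUsage": 2044100608},
  {"id": "9002", "status": "READY", "memoryUsage": 2045640704},
  {"id": "9003", "status": "READY", "memoryUsage": 2060914688}
]}]
"""
summary = summarize_workers(sample_response)
print(summary)
```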

Unregister the model

Use the DELETE method: DELETE /models/{model_name}/{version}

curl -X DELETE http://localhost:8081/models/squeezenet1_1/1.0

{
  "status": "Model \"squeezenet1_1\" unregistered"
}

List registered models

Use the GET method: GET /models

Parameters:

limit: (Optional) maximum number of items to return, default 100
next_page_token: (Optional) token indicating which page to fetch

curl "http://localhost:8081/models"

{
  "models": [
    {
      "modelName": "densenet161",
      "modelUrl": "densenet161.mar"
    },
    {
      "modelName": "squeezenet1_1",
      "modelUrl": "https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"
    }
  ]
}
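With many models, the list has to be read page by page using limit and next_page_token. Here is a rough sketch of such a loop. It assumes the response carries the token for the following page in a nextPageToken field, which is not shown in the output above, so verify it against your TorchServe version. The function names are mine, and fake_fetch only exists so that the loop can be tried without a server.

```python
import json
import urllib.request
from urllib.parse import urlencode


def fetch_page(limit, next_page_token=None,
               management_url="http://localhost:8081"):
    """Fetch one page of GET /models from a running server."""
    query = {"limit": limit}
    if next_page_token is not None:
        query["next_page_token"] = next_page_token
    url = f"{management_url}/models?{urlencode(query)}"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def list_all_models(fetch=fetch_page, limit=100):
    """Collect the models from every page until no further token is returned."""
    models, token = [], None
    while True:
        page = fetch(limit, token)
        models.extend(page.get("models", []))
        token = page.get("nextPageToken")  # assumed field name
        if not token:
            return models


def fake_fetch(limit, next_page_token=None):
    """Stand-in for the server: two pages with one model each."""
    pages = {
        None: {"nextPageToken": "1",
               "models": [{"modelName": "densenet161",
                           "modelUrl": "densenet161.mar"}]},
        "1": {"models": [{"modelName": "squeezenet1_1",
                          "modelUrl": "https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"}]},
    }
    return pages[next_page_token]


names = [m["modelName"] for m in list_all_models(fetch=fake_fetch, limit=1)]
print(names)
```

Against a real server, call list_all_models() with the default fetch function.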

Set default model

Use the PUT method: PUT /models/{model_name}/{version}/set-default

curl -v -X PUT http://localhost:8081/models/squeezenet1_1/1.0/set-default

*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8081 (#0)
> PUT /models/squeezenet1_1/1.0/set-default HTTP/1.1
> Host: localhost:8081
> User-Agent: curl/7.58.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json
< x-request-id: 6db1cff1-7517-4826-b146-0e8605ecfd36
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 93
< connection: keep-alive
<
{
  "status": "Default vesion succsesfully updated for model \"squeezenet1_1\" to \"1.0\""
}
* Connection #0 to host localhost left intact
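To wrap up the Management API, the calls covered in this section can be collected in one small lookup table. This is just a convenience sketch: the operation names and the endpoint helper are my own, while the methods and paths are the ones listed in the sections above.

```python
# HTTP method and path template for each Management API call in this article
MANAGEMENT_ENDPOINTS = {
    "register": ("POST", "/models"),
    "scale": ("PUT", "/models/{model_name}"),
    "describe": ("GET", "/models/{model_name}"),
    "unregister": ("DELETE", "/models/{model_name}/{version}"),
    "list": ("GET", "/models"),
    "set_default": ("PUT", "/models/{model_name}/{version}/set-default"),
}


def endpoint(operation, base_url="http://localhost:8081", **path_params):
    """Return the (method, url) pair for one Management API operation."""
    method, template = MANAGEMENT_ENDPOINTS[operation]
    return method, base_url + template.format(**path_params)


print(endpoint("set_default", model_name="squeezenet1_1", version="1.0"))
print(endpoint("list"))
```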

Conclusion

That’s all I have for today. Anyone interested should head to the TorchServe repo and try it out.

References

https://github.com/pytorch/serve

Source : Viblo
