Deployment Process

Opening

As we all know, deployment is an indispensable step in bringing AI models to users. In the previous post, we learned about the deployment step through an example of a speech recognition system; in this article we will go into it in more detail. OK, let’s start.

Main Issues in Deployment

During deployment we usually face two big challenges: ML and statistical issues (concept and data drift) and software engineering issues.

Data Drift / Concept Drift problem

  • Training data: fairly clean, with a fixed distribution. For example, each audio clip contains only one voice, only northern accents, and a noise-free recording environment
  • Real data: many people in one audio clip, central and southern accents as well, and environments with a lot of noise

There are two common types of drift:

  • Concept drift: the relationship between X and Y changes during inference (the same input should now map to a different output)
  • Data drift: only the distribution of the input X changes (see the sketch after this list)
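
To make the drift idea concrete, here is a minimal sketch (not from the original article) that checks for data drift by comparing the distribution of one logged input feature, such as average volume, between training and live data using a two-sample Kolmogorov–Smirnov test; the values below are simulated purely for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_data_drift(train_values, live_values, alpha=0.05):
    """Flag drift when the live input distribution differs from training."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha, statistic

# Simulated example: live audio is louder and noisier than the clean training set.
rng = np.random.default_rng(0)
train_volume = rng.normal(loc=0.5, scale=0.1, size=1000)  # clean, fixed distribution
live_volume = rng.normal(loc=0.7, scale=0.2, size=1000)   # noisy real-world audio
drifted, stat = detect_data_drift(train_volume, live_volume)
print(f"drift detected: {drifted} (KS statistic = {stat:.3f})")
```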

Software engineering problems

  • Real-time processing is different from batch processing
  • Cloud processing is different from edge-device processing
  • Limited computing resources
  • Latency and throughput requirements (QPS, queries per second)
  • Logging
  • Security and privacy

NOTE

  • The software engineering challenges appear while deploying the model to production
  • The drift challenges appear while operating and maintaining the system after deployment

Deployment patterns

Common deployment scenarios

For an AI problem, we will usually have one of the following cases:

  • Our AI system is a brand-new product
  • The AI system is a product that assists humans in decision making
  • The AI system replaces a previous ML system

We have several strategies for deploying such a system:

  • Gradual ramp-up with monitoring: gradually improve accuracy and redeploy continuously. In the third case, instead of replacing the entire old model with the new one, we can send a small fraction of requests to the new model and measure its accuracy on them, then shift traffic to the new model gradually
  • Rollback: if the current model does not work well, we can roll back to its previous version

Visual inspection example

To check such a system manually, one usually has human inspectors examine whether each phone is damaged or broken. One of the most popular ways to deploy an AI model for this case is shadow mode.

Shadow mode

  • The AI model is hidden from the human, and its output is not used for decisions
  • The human and the AI model run in parallel
  • The human’s decisions are compared with the AI model’s predictions

The purpose of shadow mode is to collect data on how the model is performing and to compare the AI model’s decisions against a human’s. For example, in the image below there are cases where the human and the AI agree on a decision, and other cases where they do not.

[Image: examples of cases where the human inspector and the AI model agree or disagree]

Even if our system has no human inspectors but only another strong predictor (such as a commercial third-party API), shadow mode deployment is still very effective for assessing the accuracy of the new model and ensuring it is good enough for practical use. A minimal sketch is shown below.
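
As a minimal sketch of shadow mode, assuming a hypothetical `ai_model` object with a `predict()` method: the AI runs on every item, but its output is only logged for later comparison, never used for the actual decision.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow_mode")

def inspect(phone_image, human_decision, ai_model):
    """Return the human decision; run the AI silently and log the comparison."""
    ai_decision = ai_model.predict(phone_image)  # AI runs on the same item, output stays hidden
    log.info("human=%s ai=%s agree=%s",
             human_decision, ai_decision, human_decision == ai_decision)
    return human_decision  # the AI output is never used for the actual decision
```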

Canary deployment

Once we have a model that starts to reach acceptable accuracy on real data, one of the most commonly used deployment patterns is canary deployment. It works as follows:

  • Forward a small fraction of requests (about 5%) to the AI model when it begins to take part in decision making on real data
  • Gradually increase the fraction of requests sent to the model as confidence in its accuracy grows (see the sketch after this list)
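
A minimal sketch of this traffic split, assuming both models expose the same hypothetical `predict()` interface:

```python
import random

CANARY_FRACTION = 0.05  # start with ~5% of traffic; raise it as accuracy holds up

def handle_request(request, old_model, new_model):
    """Route a small random fraction of requests to the new model."""
    if random.random() < CANARY_FRACTION:
        return new_model.predict(request)  # canary path, monitored closely
    return old_model.predict(request)      # stable path
```

Raising `CANARY_FRACTION` step by step implements the gradual ramp-up described above.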

Blue-green deployment

[Image: blue-green deployment, with a router switching requests between the old (blue) and new (green) versions]

During prediction, incoming requests can be switched between model versions by the router instantaneously. Here:

  • BLUE: old version
  • GREEN: new version

The advantage of this method is that rollback is very easy: we just reconfigure the router and requests are redirected immediately (provided we keep the old version running and in a working state).

Normally, we will not transfer all requests to the new (green) version at once, but will shift traffic to it gradually, as in the sketch below.
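
A minimal sketch of such a router, with stub classes standing in for the real blue and green model versions; changing `ACTIVE` is the whole deployment, and rollback is just changing it back:

```python
class StubModel:
    """Stand-in for a real model version, just for this sketch."""
    def __init__(self, name):
        self.name = name
    def predict(self, request):
        return f"{self.name} handled {request!r}"

MODELS = {
    "blue": StubModel("speech_v1"),   # old version, kept warm for rollback
    "green": StubModel("speech_v2"),  # new version
}
ACTIVE = "green"  # the router's single piece of configuration

def route(request):
    return MODELS[ACTIVE].predict(request)

def rollback():
    global ACTIVE
    ACTIVE = "blue"  # instant rollback: blue is still running, so just flip back
```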

Automation levels

[Image: levels of automation, from human only on the left to full automation on the right]

From left to right, we have increasing levels of automation:

  • Human only: only humans make decisions
  • Shadow mode: the system runs both human and AI in parallel, using human decisions to evaluate the AI model
  • AI assistance: the AI gives suggestions to help humans make decisions
  • Partial automation: part of the requests are handled by humans, the rest by the AI (see the sketch after this list)
  • Full automation: everything is done by the AI system
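
A rough sketch of partial automation: the AI decides only the cases it is confident about and defers the rest to a human. The `predict_with_confidence()` interface and the 0.9 threshold are illustrative assumptions, not from the original article:

```python
def decide(item, ai_model, human_review_queue, threshold=0.9):
    """AI decides confident cases; uncertain ones are escalated to a human."""
    label, confidence = ai_model.predict_with_confidence(item)  # assumed interface
    if confidence >= threshold:
        return label                   # fully automated path
    human_review_queue.append(item)    # low confidence: defer to a human
    return None                        # decision pending human review
```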

Monitoring

Using dashboards

Depending on the application, we use different metrics on the dashboard to monitor the system. For example, we may have dashboards that monitor server load, the percentage of requests whose output is null (e.g. a speech-to-text system that returns no text), or the percentage of missing inputs.

Based on the metrics measured on the dashboard, we can take the following specific actions:

  • Brainstorm with the development team to discuss what is going on with the system
  • Build metrics that automatically detect problems

Some evaluation metrics

  • Software metrics: memory, compute, latency, throughput, server load…
  • Input metrics: average input length (for example, in a speech recognition problem), average volume, number of missing values…
  • Output metrics: how often the system returns null, how often users repeat a search, number of clicks on each displayed product… (a sketch computing some of these follows this list)
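
A minimal sketch of computing a few of these metrics over a window of logged requests for a speech-to-text system; the log field names are illustrative assumptions:

```python
def compute_metrics(logged_requests):
    """Aggregate input/output metrics over a window of logged requests."""
    n = len(logged_requests)
    return {
        "avg_input_length_s": sum(r["audio_seconds"] for r in logged_requests) / n,
        "avg_volume": sum(r["volume"] for r in logged_requests) / n,
        "null_output_rate": sum(1 for r in logged_requests if not r["transcript"]) / n,
    }

window = [
    {"audio_seconds": 3.2, "volume": 0.61, "transcript": "turn on the lights"},
    {"audio_seconds": 5.0, "volume": 0.12, "transcript": ""},  # null output
]
print(compute_metrics(window))
```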

Iterative procedure

ML Modeling and Deployment are both iterative processes

Model maintenance

Like other software, AI models need to be maintained regularly while in use. When the model needs to be updated, this can be done as follows:

  • Manual retraining: a human retrains the model, measures its error, and deploys a new version (see the sketch after this list)
  • Automatic retraining: usually less common than manual retraining
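
A minimal sketch of one manual maintenance cycle, where `train_fn`, `evaluate_fn`, and `deploy_fn` stand in for the real training, error-measurement, and deployment steps:

```python
def maintain(train_fn, evaluate_fn, deploy_fn, current_error):
    """One manual maintenance cycle: retrain, measure error, deploy if better."""
    new_model = train_fn()              # retraining triggered by a human
    new_error = evaluate_fn(new_model)  # error measurement on a holdout set
    if new_error <= current_error:
        deploy_fn(new_model)            # ship the new version
        return new_model, new_error
    return None, current_error          # keep the old model in production
```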

Monitoring pipeline

Most AI systems consist of many different steps. For example, for the speech recognition app we discussed earlier, the basic pipeline would look like this:

[Image: basic pipeline in which the audio stream is sent directly to the speech recognition server]

However, there is a problem: implemented as above, the speech recognition server always receives streamed audio, whether or not anyone is speaking. This wastes resources and is inefficient in practice. We can instead implement a multi-step pipeline as follows:

[Image: multi-step pipeline: audio stream → VAD → speech recognition server]

  • Step 1: Data is fed into a VAD (Voice Activity Detection) model, which:
    • Checks whether someone is talking
    • Takes a long audio stream, picks out the parts that contain speech, and feeds only those parts into the speech recognition model
  • Step 2: Speech recognition is performed on the server (see the sketch after this list)
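
A minimal sketch of this gating idea, using a crude energy threshold in place of a trained VAD model and a placeholder `recognize` function for the call to the speech recognition server:

```python
import numpy as np

def contains_speech(chunk, energy_threshold=0.01):
    """Crude VAD: treat a chunk as speech if its mean energy is high enough."""
    return float(np.mean(np.square(chunk))) > energy_threshold

def process_stream(audio_chunks, recognize):
    """Step 1 filters out silence; step 2 (recognize) only sees speech chunks."""
    for chunk in audio_chunks:
        if contains_speech(chunk):
            yield recognize(chunk)
```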

However, there is a catch: changes in step 1 can affect the results of the entire pipeline. Therefore, we need ways to monitor and evaluate the effectiveness of each model in more detail.

Summary

In this article, we have gone through some points to keep in mind during deployment. Evaluation methods are essential during model monitoring; they help us detect problems early and find the right solutions.

Source: Viblo