Concept Drift and Data Drift Phenomenon


Preamble

Come to think of it… no one can stay in love forever – those are the lyrics of a recently popular song. That may be true in love, but in Machine Learning there is a similar phenomenon: we cannot call a model good forever. Even when the data quality is good, the model loses quality over time, and this happens quite often. What this means and how it affects the deployment of AI models in practice is what we will learn in today’s article.

Where does the error come from?

Machine learning models often face data that is corrupt, out of date, or incomplete. Data quality is one of the main causes of errors in production. But there is another problem: even if the data is carefully prepared, there is no guarantee that our model will work well forever. Model quality keeps degrading despite our efforts to improve the data.

There are a few terms that we will discuss in this section:

Model Decay

We can call it drift, decay, or instability of the model over time: as the surrounding world changes, the performance of the model degrades. The final measure is model quality. It could be accuracy, average error rate, or some KPI of your application or business, like ad click-through rate, for example. We need one small note:

No model stays correct forever; only the rate of deterioration varies.

Some models only need annual updates (for example, some computer vision or natural language processing applications, or models operating in a stable environment), while others need to be updated daily or even hourly (for example, models predicting financial outcomes or working with time series data).

After the model is updated, its accuracy should improve again and adapt to the new data patterns.


There are usually two causes of model deterioration: data drift and concept drift, and sometimes both.

Data Drift

Also known as feature drift, population drift, or covariate shift, data drift is the phenomenon in which the input data changes: the distribution of important features shifts, so the old model no longer achieves its expected accuracy on the new data distribution.

The model still works well on data similar to the “old” data – this is the standard condition under which the model is expected to operate. In practice, however, the model becomes much less useful because we are now dealing with a new feature space.

Example of a trend prediction model

E-commerce systems have a sizable revenue stream from advertising. As soon as a user enters the system, the advertising recommendation model has the task of predicting which types of products the user is likely to buy and sending them appropriate offers.

Previously, recommendations were made mostly from the data on purchases users had already made on the system. However, after applying online marketing strategies, many users now arrive from other sources such as Facebook, Google, and so on, and the AI system has never been trained on these kinds of data before.

The overall quality of the model will not be affected much as long as the number of these new users is small. The problem becomes really serious when this number of users grows.


When debugging, we will see that there is a difference in the distribution of attributes such as source_channel, which represents where the user came to the system from (for example Facebook, Google, or the current system).

Monitoring these attributes will help us warn about the drift phenomenon as soon as possible.
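Below is a minimal monitoring sketch for a categorical feature such as source_channel, using the Population Stability Index (PSI) to compare the training-time distribution against recent production traffic. The sample values, counts, and the 0.2 alert threshold are illustrative assumptions, not something prescribed by the article.

```python
import numpy as np
import pandas as pd

def psi_categorical(expected: pd.Series, actual: pd.Series, eps: float = 1e-4) -> float:
    """Population Stability Index between a reference sample and a recent sample."""
    categories = sorted(set(expected) | set(actual))
    e = expected.value_counts(normalize=True).reindex(categories, fill_value=0) + eps
    a = actual.value_counts(normalize=True).reindex(categories, fill_value=0) + eps
    return float(np.sum((a - e) * np.log(a / e)))

# Hypothetical data: mostly organic users at training time, far more paid traffic now.
train_channels = pd.Series(["current_system"] * 900 + ["facebook"] * 60 + ["google"] * 40)
live_channels = pd.Series(["current_system"] * 500 + ["facebook"] * 300 + ["google"] * 200)

score = psi_categorical(train_channels, live_channels)
if score > 0.2:  # common rule of thumb: PSI above 0.2 suggests significant drift
    print(f"source_channel drift warning, PSI = {score:.3f}")
```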

The same thing happens when a model is applied in a new geographic area or when the age distribution of the system’s users changes.


To deal with this phenomenon, we need to retrain the model on new data samples or have early detection strategies so that we can apply appropriate processing logic.

Training-serving skew

Differences between the data used to train a model and the data it sees in the real environment are very common. Usually this occurs when the training data is quite clean while the real data contains many different cases. Sometimes we cannot anticipate this during training, because some data samples only appear once the model is used in practice.

A classic example of this phenomenon is a simple handwriting recognition problem in which the training dataset and the data actually seen at inference time are very different.


Usually, when training-serving skew appears, we need to keep updating the model so that it adapts to the new types of data.
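One way to catch this skew early, sketched below under assumed feature names and an assumed 0.05 significance level, is to log the feature values seen at serving time and compare them with the training sample using a two-sample Kolmogorov–Smirnov test.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical numeric features captured at training time vs. logged at serving time.
training_features = {"basket_value": rng.normal(40, 10, 5000),
                     "session_length": rng.normal(5, 2, 5000)}
serving_features = {"basket_value": rng.normal(55, 12, 2000),   # shifted in production
                    "session_length": rng.normal(5, 2, 2000)}

for name, train_sample in training_features.items():
    stat, p_value = ks_2samp(train_sample, serving_features[name])
    if p_value < 0.05:
        print(f"{name}: serving distribution differs from training (p = {p_value:.4f})")
```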

Concept Drift

Concept drift appears when the hidden relationship that the model has learned from the data changes. In contrast to data drift, with concept drift the distribution of the input data may not change at all, but the relationship between the input features and the target does. In other words, the meaning of what we are trying to predict has changed, and that makes the model worse or obsolete over time.
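Because the inputs may look unchanged, concept drift is often caught by watching model quality itself. The sketch below, with an assumed window size and an assumed 10% acceptable drop, tracks accuracy on a rolling window of labelled feedback and signals when it falls too far below the baseline.

```python
from collections import deque

class QualityMonitor:
    """Track accuracy on recent labelled feedback and flag a suspicious drop."""

    def __init__(self, baseline_accuracy: float, window: int = 500, max_drop: float = 0.10):
        self.baseline = baseline_accuracy
        self.max_drop = max_drop
        self.recent = deque(maxlen=window)  # 1 if the prediction was correct, else 0

    def record(self, prediction, actual) -> bool:
        """Record one outcome; return True when concept drift is suspected."""
        self.recent.append(1 if prediction == actual else 0)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough feedback collected yet
        accuracy = sum(self.recent) / len(self.recent)
        return accuracy < self.baseline - self.max_drop

# Usage sketch:
# monitor = QualityMonitor(baseline_accuracy=0.92)
# if monitor.record(y_pred, y_true):
#     trigger_retraining()
```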

Concept drift can appear in a few different forms:

Gradual concept drift


Here the drift happens slowly and gradually. The world around us changes and our model gradually becomes obsolete; that is when we need to add new data and update the model.

Some examples of this type of drift:

  • A rival company launches a new product: customers have more choices and change their behavior, so the revenue prediction model needs to be updated.
  • New credit regulations come into force, so some of the factors that influence risk scores no longer hold true. We need to update the credit scoring model.

Sudden concept drift


Typically this involves seasonal fluctuations in seasonal data, or simply an external factor that suddenly appears and changes the entire surrounding conditions (for example, the COVID-19 pandemic appearing, closing all stores and making the model’s predictions useless in this situation).

Any solutions for drift?


Retrain the model

The simplest method we can immediately think of to deal with drift is to retrain the model. Based on the input data, we can retrain in the following ways:

  • Train the model on all collected data (both old and new)
  • Train the model on both old and new data, but with higher weights on new data (see the sketch after this list)
  • Train the model only on new data.
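Below is a minimal sketch of the second option under assumed names: old and new samples are stacked together, newer samples receive a larger sample weight (3x here as an arbitrary choice), and any scikit-learn estimator that accepts sample_weight would work the same way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def retrain_with_recency_weights(X_old, y_old, X_new, y_new, new_weight: float = 3.0):
    """Retrain on old + new data, letting each new sample count new_weight times as much."""
    X = np.vstack([X_old, X_new])
    y = np.concatenate([y_old, y_new])
    weights = np.concatenate([np.ones(len(y_old)), np.full(len(y_new), new_weight)])
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y, sample_weight=weights)
    return model
```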

Some other solutions

We may also have heard of some other methods, such as:

  • Domain Adaptation
  • Out-of-distribution detection
  • Outlier detection (a minimal sketch follows below)
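As one illustration of the last item, the sketch below fits an Isolation Forest on synthetic training features and flags live samples that look abnormal; the contamination rate and the generated data are assumptions for demonstration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_train = rng.normal(0, 1, size=(2000, 4))  # hypothetical training-time features

detector = IsolationForest(contamination=0.01, random_state=42).fit(X_train)

# Live traffic: mostly familiar rows plus a few that come from a very different region.
X_live = np.vstack([rng.normal(0, 1, size=(95, 4)),
                    rng.normal(6, 1, size=(5, 4))])
flags = detector.predict(X_live)  # -1 marks suspected outliers, 1 marks inliers
print(f"{np.sum(flags == -1)} of {len(X_live)} live samples flagged as outliers")
```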

Conclusion

Drift is a very common phenomenon in practice, and we need solutions to overcome it if we do not want our model to get worse and worse. There are two basic types of drift, data drift and concept drift, each of which requires different remedies. In the next article, we will learn how to monitor model problems in the production environment. See you in the next post!


Source: Viblo