Applying a Machine Learning Model to the Customer Segmentation Problem

Tram Ho

Hello everyone, good to see you again =))). In this Viblo article, I will share a problem that most e-commerce websites need to solve: customer segmentation. Here, I will use an ML model to solve it.

Customer segmentation is the process of finding and selecting the groups of customers whose needs a business or organization can satisfy better than its competitors. You can find the reference here.


  • Choose the right customers and serve them in the best way
  • Create a competitive advantage over competitors in the market
  • Understand customers and affirm the brand

Ways businesses currently segment customers:

  • Geography
  • Sex
  • Age
  • Income.

Customer segmentation with ML


Here I use customer transaction data from an e-commerce site; you can download it here.

Read the data to see what it contains.
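A minimal sketch of this step, assuming the data is a CSV file loaded with pandas. The rows below are made up for illustration; only the column names come from the article.

```python
import io
import pandas as pd

# Made-up rows standing in for the real transaction file (illustration only).
SAMPLE_CSV = """PRODUCT_CATE,PROVINCE,ORDER_COST,ORDER_DATE,ORDER_ID,CUST_ID
Fashion,Ha Noi,250000,2019-01-05,O001,C01
Electronics,Da Nang,4500000,2019-01-07,O002,C02
Books,Ha Noi,120000,2019-02-11,O003,C01
"""

# With the real file this would be pd.read_csv(<path to the downloaded file>).
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few transactions
```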

Our data includes the fields:

  • PRODUCT_CATE: The type of transaction product
  • PROVINCE: transaction provinces
  • ORDER_COST: Product price
  • ORDER_DATE: Order time
  • ORDER_ID: order code
  • CUST_ID: Customer ID

The data format of the fields:
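One way to inspect the format of each field, assuming the data sits in a pandas DataFrame, is to print its dtypes. The rows here are made up; the point is that a date stored as text is not yet a datetime column.

```python
import pandas as pd

# Made-up rows; only the column names come from the article.
df = pd.DataFrame({
    "PRODUCT_CATE": ["Fashion", "Books"],
    "PROVINCE": ["Ha Noi", "Da Nang"],
    "ORDER_COST": [250000, 120000],
    "ORDER_DATE": ["2019-01-05", "2019-02-11"],  # still plain text at this point
    "ORDER_ID": ["O001", "O002"],
    "CUST_ID": ["C01", "C02"],
})

# One dtype per column: numeric for ORDER_COST, text for the rest.
print(df.dtypes)
```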

Here the ORDER_ID field is the most important.

Preprocessing Data

Processing and converting data

First, we convert the datetime field (ORDER_DATE) from object to datetime64 format.
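A small sketch of the conversion using `pd.to_datetime`, assuming the dates are stored as ISO-style strings (the sample values are made up):

```python
import pandas as pd

# Made-up orders with the date stored as text.
df = pd.DataFrame({
    "ORDER_ID": ["O001", "O002", "O003"],
    "ORDER_DATE": ["2019-01-05", "2019-01-07", "2019-02-11"],
})

# Parse the text column into a proper datetime column.
df["ORDER_DATE"] = pd.to_datetime(df["ORDER_DATE"])

print(df["ORDER_DATE"].dtype)
# Datetime columns expose the .dt accessor, e.g. to extract the month.
print(df["ORDER_DATE"].dt.month.tolist())
```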

Next, we plot the distribution of the variables as histograms (bins = 10).
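A possible sketch of the histogram, assuming matplotlib as the plotting library (the article does not say which one was used). The ORDER_COST values are randomly generated stand-ins.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Made-up, right-skewed order values standing in for ORDER_COST.
order_cost = rng.lognormal(mean=12, sigma=1, size=500)

fig, ax = plt.subplots(figsize=(6, 4))
# bins=10 splits the value range into 10 equal-width intervals.
counts, edges, _ = ax.hist(order_cost, bins=10)
ax.set_xlabel("ORDER_COST")
ax.set_ylabel("Number of orders")
ax.set_title("Distribution of ORDER_COST (bins = 10)")
```

In an interactive session, `plt.show()` displays the figure.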

Identify the outliers of the order value variable “ORDER_COST” using the 3-sigma rule. According to this rule, if the order values are roughly normally distributed, about 99.73% of them fall within:

$$[\mu - 3\sigma,\ \mu + 3\sigma]$$

Outliers are points that fall outside the range above.
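The rule can be sketched in a few lines of NumPy. The helper name `three_sigma_bounds` and the sample values are made up for illustration.

```python
import numpy as np

def three_sigma_bounds(values):
    """Return (mu - 3*sigma, mu + 3*sigma) for a 1-D array of values."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()  # population standard deviation
    return mu - 3 * sigma, mu + 3 * sigma

# Made-up data: 19 typical orders and 1 extreme order.
order_cost = np.array([100.0] * 19 + [10_000.0])

lower, upper = three_sigma_bounds(order_cost)
is_outlier = (order_cost < lower) | (order_cost > upper)

print(order_cost[is_outlier])  # only the extreme order is flagged
```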

Compute the total order value per “PRODUCT_CATE” for each “CUST_ID”.

Once the aggregation is done, we fill in the NA values; here we use fillna with 0.
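One way to sketch these two steps is a pandas pivot table followed by `fillna(0)`. Using `pivot_table` is an assumption (a `groupby` plus `unstack` would give the same table), and the rows are made up.

```python
import pandas as pd

# Made-up transactions; only the column names come from the article.
df = pd.DataFrame({
    "CUST_ID": ["C01", "C01", "C01", "C02"],
    "PRODUCT_CATE": ["Books", "Books", "Fashion", "Electronics"],
    "ORDER_COST": [100, 50, 200, 900],
})

# One row per customer, one column per product category, summed order value.
features = df.pivot_table(
    index="CUST_ID",
    columns="PRODUCT_CATE",
    values="ORDER_COST",
    aggfunc="sum",
)

# A customer who never bought from a category gets NaN; treat that as 0 spent.
features = features.fillna(0)
print(features)
```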

Training Model

Split the data into a training set and a test set.
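A minimal sketch using scikit-learn's `train_test_split`. The 80/20 ratio and the random feature matrix are assumptions; the article does not state the ratio used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Made-up feature matrix: 100 customers x 9 features.
X = rng.random((100, 9))

# Hold out 20% of the customers; random_state makes the split repeatable.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
```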

Build the KMeans model; you can read more about KMeans here.

WCSS (within-cluster sum of squares) measures how far the points are from their cluster centers. When increasing the number of clusters no longer reduces WCSS significantly, we can choose that number of clusters.
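This idea can be sketched with scikit-learn, where a fitted `KMeans` exposes WCSS as `inertia_`. The synthetic data (three blobs in 9 dimensions) and the range of k values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Made-up data: three well-separated blobs in 9 dimensions.
centers = rng.normal(scale=10, size=(3, 9))
X = np.vstack([c + rng.normal(size=(50, 9)) for c in centers])

ks = range(1, 8)
wcss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)  # sum of squared distances to the nearest center

# WCSS drops sharply up to the true number of blobs, then flattens out.
for k, value in zip(ks, wcss):
    print(k, round(value, 1))
```

Plotting `wcss` against `k` gives the usual elbow curve; the bend marks a reasonable number of clusters.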

Visualize the clusters: first, we use t-SNE to reduce the data dimension from 9 to 2:
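A short sketch of the reduction with scikit-learn's `TSNE`. The random data and the perplexity value are assumptions; perplexity must be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Made-up feature matrix: 60 customers x 9 features.
X = rng.normal(size=(60, 9))

# Project the 9-dimensional points onto 2 dimensions for plotting.
tsne = TSNE(n_components=2, perplexity=15, random_state=0)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)
```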

Next, we visualize the clusters:
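A possible sketch of the plot, assuming matplotlib. The 2-D points and cluster labels are random stand-ins for the t-SNE output and the KMeans labels.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Made-up stand-ins: 2-D coordinates and one cluster label per customer.
points_2d = rng.normal(size=(90, 2))
labels = np.repeat([0, 1, 2], 30)

fig, ax = plt.subplots(figsize=(6, 5))
# Color each point by its cluster label.
scatter = ax.scatter(points_2d[:, 0], points_2d[:, 1], c=labels, cmap="viridis", s=15)
ax.set_xlabel("t-SNE dimension 1")
ax.set_ylabel("t-SNE dimension 2")
ax.set_title("Customer clusters in 2-D")
ax.legend(*scatter.legend_elements(), title="Cluster")
```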

Let’s see what the result looks like:

Above, I used KMeans to segment customers. Alternatively, you can refer to Anh Khanh’s article on RFM here, which uses the RFM (Recency, Frequency, Monetary) model to segment customers by rank:

  • VIP customers: rank from 8-10.
  • Mass customers: rank from 5-7.
  • Secondary customers: rank <5.
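The rank thresholds above can be sketched as a small lookup function. The function name is hypothetical, and it assumes the rank is already computed as a number no greater than 10; how the rank itself is derived is covered in the referenced RFM article.

```python
def segment_from_rank(rank: int) -> str:
    """Map an RFM rank (assumed to be at most 10) to a customer segment."""
    if rank >= 8:
        return "VIP"        # rank 8-10
    if rank >= 5:
        return "Mass"       # rank 5-7
    return "Secondary"      # rank < 5

for rank in (9, 6, 3):
    print(rank, segment_from_rank(rank))
```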

You can find the RFM code here.


Customer segmentation is a common problem in e-commerce, and solving it helps a business respond correctly to customer needs. My treatment here is quite simple, so I hope everyone can give me suggestions on this article.



Source : Viblo