Applying a Machine Learning model to the customer segmentation problem

Saturday, 21/12/2019

Tram Ho

Hello everyone, good to see you again =))). In this Viblo article, I will share a problem that most e-commerce websites need to solve: customer segmentation. Here, I will use an ML model to solve it.

Customer segmentation is the process of finding and selecting groups of customers whose needs a business or organization can satisfy better than its competitors can. You can find a reference here.

Purpose:

To choose the right customers and serve them in the best way
To create a competitive advantage over other businesses in the market
To understand customers and strengthen the brand

Ways businesses currently segment customers:

Geography
Sex
Age
Income

Customer segmentation with ML

Data

The data I use here comes from customer transactions on an e-commerce site; you can download it here.

Let's read the data to see what it contains.

import pandas as pd

dataset = pd.read_csv('customerSpending.csv', header = 0, index_col = 0)
print(dataset.shape)
dataset.head()

Our data includes the following fields:

PRODUCT_CATE: the product category of the transaction
PROVINCE: the province of the transaction
ORDER_COST: the product price
ORDER_DATE: the order time
ORDER_ID: the order code
CUST_ID: the customer ID

The data format of the fields:

Here the ORDER_ID field is the most important.

Preprocessing Data

Processing and converting data

First, we convert the ORDER_DATE field from object to datetime64 format.

from datetime import datetime

def strToDatetime(x):
    # Parse a timestamp string in the form 'YYYY-MM-DD HH:MM:SS'
    return datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

# Check the dtype of each column of the pandas DataFrame
print(dataset.dtypes)
# Convert the column to the correct dtype
dataset['ORDER_DATE'] = dataset['ORDER_DATE'].apply(strToDatetime)
dataset.dtypes
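
As a side note, the same conversion can usually be done on the whole column at once with pd.to_datetime. The sketch below is only an illustration: the sample rows are hypothetical, and it assumes the timestamps follow the same '%Y-%m-%d %H:%M:%S' format used above.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from customerSpending.csv.
sample = pd.DataFrame({
    "ORDER_ID": [101, 102, 103],
    "ORDER_DATE": [
        "2019-01-05 10:30:00",
        "2019-02-11 08:00:15",
        "2019-03-20 23:59:59",
    ],
})

# Parse the whole column in one call instead of calling strptime row by row.
sample["ORDER_DATE"] = pd.to_datetime(
    sample["ORDER_DATE"], format="%Y-%m-%d %H:%M:%S"
)

print(sample.dtypes)
print(sample["ORDER_DATE"].dt.month.tolist())
```

Once the column is datetime64, the .dt accessor gives direct access to parts such as month or weekday.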

Next, we plot the distribution of the variables (bins = 10).

# Descriptive statistics
print(dataset.describe())

# Plot the distribution of the variables
import seaborn as sns
import matplotlib.pyplot as plt

def plotNumeric(colname, n_bins = 10, hist = True, kde = True):
    # Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot is its newer counterpart
    sns.distplot(dataset[colname], hist = hist, kde = kde, bins = n_bins)
    plt.title('Distribution of {}'.format(colname))
    plt.show()

plotNumeric('ORDER_COST', hist = True, kde = True, n_bins = 10)

Next, we determine the outliers of the order value variable “ORDER_COST” based on the 3-sigma rule. According to this rule, if the values are approximately normally distributed, about 99.73% of the order values fall within:

$[\mu - 3\sigma, \mu + 3\sigma]$

Outliers are the points that lie outside this range.

dataset.loc[[1, 2]]['ORDER_COST']

import numpy as np

def fillOutlier(colname):
    # Replace values outside [max(mu - 3*sigma, 0), mu + 3*sigma] with the nearest bound
    mu = np.mean(dataset[colname])
    sigma = np.std(dataset[colname])
    x_min = max(mu - 3*sigma, 0)
    x_max = mu + 3*sigma
    print('x_min: ', x_min)
    print('x_max: ', x_max)
    out_lower_id = dataset[dataset[colname] < x_min].index
    out_upper_id = dataset[dataset[colname] > x_max].index
    # Assign through a single .loc call to avoid chained assignment
    dataset.loc[out_lower_id, colname] = x_min
    dataset.loc[out_upper_id, colname] = x_max

fillOutlier('ORDER_COST')
plotNumeric('ORDER_COST', hist = True, kde = True, n_bins = 10)
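
To make the effect of the 3-sigma bounds concrete, here is a minimal sketch on made-up numbers. The helper name clip_three_sigma and the sample values are hypothetical; it uses Series.clip as a compact alternative to the index-based assignment above, with the same bounds.

```python
import pandas as pd

def clip_three_sigma(values: pd.Series) -> pd.Series:
    """Clip values to [max(mu - 3*sigma, 0), mu + 3*sigma]."""
    mu = values.mean()
    sigma = values.std(ddof=0)  # population std, the same default as np.std
    lower = max(mu - 3 * sigma, 0)
    upper = mu + 3 * sigma
    return values.clip(lower=lower, upper=upper)

# Hypothetical order values: 20 ordinary orders and one extreme order.
costs = pd.Series([100.0] * 20 + [10_000.0])
clipped = clip_three_sigma(costs)

print("max before:", costs.max())
print("max after: ", round(clipped.max(), 2))
```

Only the extreme order is pulled down to the upper bound; the ordinary orders are unchanged.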

Next, we compute the total order value of each product category (“PRODUCT_CATE”) for each customer (“CUST_ID”).

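# One row per customer, one column per product category, total ORDER_COST in each cell.
# Only ORDER_COST is aggregated, so it is the only column passed in values.
dfSummary = pd.pivot_table(data = dataset, 
                            values = ['ORDER_COST'],
                            index = ['CUST_ID'],
                            columns = ['PRODUCT_CATE'],
                            aggfunc= {'ORDER_COST': 'sum'}
                          )

print(dfSummary.shape)
dfSummary.head()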
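
To see what this pivot produces, here is a small sketch on hypothetical transactions (the category names and amounts are made up; only the column names follow the fields described earlier).

```python
import pandas as pd

# Hypothetical transactions; column names follow the fields listed earlier.
toy = pd.DataFrame({
    "CUST_ID": [1, 1, 1, 2, 2],
    "PRODUCT_CATE": ["Books", "Books", "Toys", "Toys", "Food"],
    "ORDER_COST": [10.0, 15.0, 7.0, 20.0, 5.0],
})

# One row per customer, one column per category, summed order value per cell.
summary = pd.pivot_table(
    data=toy,
    values="ORDER_COST",
    index="CUST_ID",
    columns="PRODUCT_CATE",
    aggfunc="sum",
)
print(summary)
```

Customer 1 never bought Food, so that cell is NaN, which is why missing values have to be handled before clustering.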

After the statistics are complete, we fill the NaN values with 0 (a customer with no orders in a category spent 0 there), then standardize the features.

from sklearn.preprocessing import StandardScaler

dfSummary.fillna(0, inplace = True)

scaler = StandardScaler()
scaler.fit(dfSummary)
X = scaler.transform(dfSummary)
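
As a quick sanity check of what StandardScaler does, the sketch below uses a hypothetical spending matrix: after scaling, every column has mean 0 and standard deviation 1, so a category with large amounts does not dominate the distance computation in KMeans.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical spending matrix: rows are customers, columns are categories.
spend = np.array([
    [0.0, 100.0],
    [10.0, 300.0],
    [20.0, 500.0],
])

scaled = StandardScaler().fit_transform(spend)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```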

Training Model

Split the data into a training set and a test set.

from sklearn.model_selection import train_test_split

X_train, X_test, id_train, id_test = train_test_split(X, np.arange(X.shape[0]), test_size = 0.2)
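
The second array passed to train_test_split holds row positions, which makes it possible to trace each row of X_train back to a customer. A minimal sketch, assuming a hypothetical per-customer table in place of dfSummary (the customer IDs and random_state are made up for the example):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical per-customer table standing in for dfSummary.
summary = pd.DataFrame(
    np.arange(20, dtype=float).reshape(10, 2),
    index=pd.Index(range(1000, 1010), name="CUST_ID"),
)
X = summary.to_numpy()

X_train, X_test, id_train, id_test = train_test_split(
    X, np.arange(X.shape[0]), test_size=0.2, random_state=0
)

# id_train holds row positions, so positional indexing recovers the customers.
train_customers = summary.index[id_train]
print(len(train_customers), len(id_test))
```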

Build the KMeans model; you can read more about KMeans here.

from sklearn.cluster import KMeans

# Fit KMeans models with the number of clusters from 2 to 16
kmeans = []
wcss = []

for i in np.arange(2, 17, 1):
    km_i = KMeans(n_clusters=i,init='k-means++', max_iter=300, n_init=10, random_state=0)
    km_i.fit(X_train)
    wcss.append(km_i.inertia_)
    kmeans.append(km_i)

wcss (within-cluster sum of squares) measures the squared deviation of the points from their cluster centers. When increasing the number of clusters no longer reduces wcss significantly, we can choose that number of clusters.

# Plot wcss vs n_clusters
plt.plot(np.arange(2, 17),wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
# plt.ylim(0,  800000)
plt.show()
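
Reading the elbow from a plot can be subjective. One common complement, not used in this article, is the silhouette score, where a higher value means better separated clusters. The sketch below is an illustration on synthetic data: the three blobs, their centers, and the range of k are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated blobs instead of the customer matrix.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit one model per candidate k and record its silhouette score.
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

On real customer data the score is rarely this clear-cut, so it is best read together with the wcss curve rather than instead of it.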

Visualize clusters: first we use t-SNE to reduce the data dimension from 9 to 2:

from sklearn.manifold import TSNE
import time
time_start = time.time()
# Note: scikit-learn 1.5 and later rename n_iter to max_iter
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(X_train)

Next, visualize the clusters:

X_ = tsne_results
# kmeans[0] has 2 clusters, so kmeans[7] is the model with 9 clusters
model = kmeans[7]
y_label = model.predict(X_train)
colors = ['purple', 'blue', 'green', 'cyan', 'yellow', 'brown', 'gray', 'pink', 'orange']
plt.figure(figsize = (12, 8))
# Draw one scatter layer per cluster, each with its own color
for i in range(model.n_clusters):
    plt.scatter(X_[y_label==i,0], X_[y_label==i,1], s=50, c=colors[i], label='Cluster{}'.format(i + 1))
# The centroids are not drawn: cluster_centers_ lives in the 9-dimensional standardized
# feature space, and t-SNE cannot project new points onto the fitted 2D map.
plt.title('Customer segments')
plt.xlabel('tsne-2d-one')
plt.ylabel('tsne-2d-two')
plt.legend()
plt.show()

Let's see what the result looks like:
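
Beyond the plot, a segment is easier to interpret when its centroid is expressed in the original spending units. The sketch below is an illustration on hypothetical numbers (two categories, six customers), not the article's data: it maps the KMeans centroids back through the scaler with inverse_transform and counts the customers in each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical spending per customer in two product categories.
spend = np.array([
    [100.0, 0.0], [110.0, 5.0], [90.0, 0.0],
    [0.0, 200.0], [5.0, 210.0], [0.0, 190.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(spend)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Map centroids from standardized units back to the original spending units.
centers = scaler.inverse_transform(km.cluster_centers_)
sizes = np.bincount(km.labels_)

for label, (center, size) in enumerate(zip(centers, sizes)):
    print("cluster", label, "size", size, "mean spend", np.round(center, 1))
```

Each row then reads as "customers in this cluster spend about this much per category", which is the kind of description a marketing team can act on.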

Above, I used KMeans to segment customers. Alternatively, you can refer to Anh Khanh's article on RFM here, which uses the RFM (Recency, Frequency, Monetary) model to segment customers by rank:

VIP customers: rank from 8-10.
Mass customers: rank from 5-7.
Secondary customers: rank <5.

Please refer to the RFM code here
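
For orientation, the three RFM quantities can be computed per customer with a single groupby. This is only a sketch under stated assumptions: the transactions and the snapshot date are hypothetical, and the helper segment simply encodes the rank thresholds listed above. How a 1-10 rank is derived from the RFM values is covered in the referenced article and is not reproduced here.

```python
import pandas as pd

# Hypothetical transactions using the column names from this article.
orders = pd.DataFrame({
    "CUST_ID": [1, 1, 2, 3, 3, 3],
    "ORDER_ID": [10, 11, 12, 13, 14, 15],
    "ORDER_COST": [50.0, 70.0, 20.0, 10.0, 30.0, 60.0],
    "ORDER_DATE": pd.to_datetime([
        "2019-01-01", "2019-03-01", "2019-02-01",
        "2019-01-15", "2019-02-15", "2019-03-10",
    ]),
})

# Assumed reference date: the day after the last order in the data.
snapshot = orders["ORDER_DATE"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("CUST_ID").agg(
    recency=("ORDER_DATE", lambda d: (snapshot - d.max()).days),  # days since last order
    frequency=("ORDER_ID", "nunique"),                            # number of orders
    monetary=("ORDER_COST", "sum"),                               # total spend
)

def segment(rank: int) -> str:
    """Map a 1-10 rank to the three segments listed above."""
    if rank >= 8:
        return "VIP"
    if rank >= 5:
        return "Mass"
    return "Secondary"

print(rfm)
print([segment(r) for r in (9, 6, 2)])
```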

Conclusion

The customer segmentation problem is quite common in e-commerce, where it helps businesses respond correctly to customer needs. However, my treatment of the problem here is quite simple, so I hope everyone can give me suggestions on this article.

Reference

https://machinelearningcoban.com/2017/01/01/kmeans/
https://phamdinhkhanh.github.io/2019/11/08/RFMModel.html

Source : Viblo
