Hello everyone! In this Machine Learning From Scratch series, we will implement basic machine learning algorithms ourselves to better understand how these algorithms work.
1. Why is it necessary to reduce the data dimension?
As you know, the data in machine learning problems is often very large. Computers can understand this data and execute algorithms on it, but for humans, "seeing" multidimensional data is really difficult. The problem of dimensionality reduction was therefore born, to give people a new perspective on multidimensional data. Besides visualization, dimensionality reduction methods also map the data into a new space, which can uncover hidden properties that were not clearly visible in the original dimensions, or simply shrink the data to speed up computation.
2. PCA (Principal component analysis) algorithm
Conceptually, the PCA algorithm finds a new coordinate system that maximizes the variance of the data, then keeps the n dimensions with the largest variance (the assumption being that the more spread out the data is along a direction, the larger the variance and the more information that direction carries).
The figure above illustrates this with the variance: in the original space ($O_1xy$), the overlap of the two classes when each is projected onto an axis is quite large, while in the new space the two classes separate much more clearly.
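To make the "maximize the variance" idea concrete, here is a small sketch of my own (not from the article): for correlated 2D data, the variance of the projection depends on the direction we project onto, and a direction aligned with the data's spread captures more variance than a raw axis.

```python
import numpy as np

# Toy illustration (my own example): generate 2D data whose two
# features are correlated, then compare the variance of projections
# onto two different directions.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=500)])
X = X - X.mean(axis=0)  # center the data

def projected_variance(X, direction):
    d = direction / np.linalg.norm(direction)  # unit vector
    return np.var(X.dot(d))                    # variance along that direction

v_axis = projected_variance(X, np.array([1.0, 0.0]))  # project onto the x-axis
v_diag = projected_variance(X, np.array([1.0, 0.8]))  # roughly the spread direction
print(v_axis, v_diag)  # the diagonal direction captures more variance
```

PCA automates exactly this search: it finds the directions of largest projected variance without us having to guess them.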
To do this, the PCA algorithm needs to go through the following steps:
- Step 1: Prepare the data matrix X to be reduced, with shape (n_sample, n_feature); each row is a data sample with n_feature attributes
- Step 2: Subtract the mean (expectation) vector from every data point: $X_k = X_k - \bar{X}$
- Step 3: Compute the covariance matrix: $S = \frac{1}{n_{sample}} X^T X$
- Step 4: Find the eigenvalues and eigenvectors of the matrix S
- Step 5: Take the k largest eigenvalues and build a matrix U whose columns are the eigenvectors corresponding to those k eigenvalues
- Step 6: Map the original data into the k-dimensional space: $X_{new} = X U$
- Note: If the multiplication in Step 6 is unclear, consider a single data sample (one row of X): it is multiplied by each of the k eigenvectors in turn, producing one coordinate per eigenvector, so the result has k dimensions.
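The note above is easy to verify with a quick shape check (illustrative values of my own, not from the article): multiplying X of shape (n_sample, n_feature) by U of shape (n_feature, k) yields one coordinate per selected eigenvector.

```python
import numpy as np

# Shape check for Step 6: project an (n_sample, n_feature) matrix
# onto k eigenvector columns and confirm the result is k-dimensional.
n_sample, n_feature, k = 150, 4, 2
X = np.ones((n_sample, n_feature))  # placeholder data
U = np.ones((n_feature, k))         # placeholder eigenvector matrix
X_new = X.dot(U)
print(X_new.shape)  # (150, 2)
```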
3. Python implementation:
Assuming we have a data matrix X, I will walk through Steps 2 to 6 so everyone can follow along:
- Step 2: Calculate the mean vector, then subtract the data points for that vector
```python
mean = np.mean(X, axis=0)
X = X - mean
```
- Step 3: Find the Covariance Matrix
```python
cov = X.T.dot(X) / X.shape[0]
```
- Step 4: Calculate eigenvalues, eigenvectors
```python
eigen_values, eigen_vectors = np.linalg.eig(cov)
```
- Step 5: In this step, I will take the index of the eigenvalues from large to small, then choose k eigenvectors to create a matrix U corresponding to the k indexes found.
```python
select_index = np.argsort(eigen_values)[::-1][:k]
U = eigen_vectors[:, select_index]
```
- Step 6: Map data to new space
```python
X_new = X.dot(U)
```
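Before assembling the full program, here is a quick sanity check of Steps 2 through 6 on synthetic data (my own test, not part of the article): after PCA, the transformed features should be uncorrelated, so the off-diagonal entries of their covariance matrix should be (near) zero, and the first component should carry the largest variance.

```python
import numpy as np

# Run Steps 2-6 on synthetic correlated data and verify that the
# covariance matrix of the projected data is (near) diagonal.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
X[:, 1] += 2 * X[:, 0]                             # introduce correlation
k = 2
X = X - np.mean(X, axis=0)                         # Step 2: center
cov = X.T.dot(X) / X.shape[0]                      # Step 3: covariance matrix
eigen_values, eigen_vectors = np.linalg.eig(cov)   # Step 4: eigendecomposition
select_index = np.argsort(eigen_values)[::-1][:k]  # Step 5: top-k eigenvalues
U = eigen_vectors[:, select_index]
X_new = X.dot(U)                                   # Step 6: project
new_cov = X_new.T.dot(X_new) / X_new.shape[0]
print(np.round(new_cov, 6))  # off-diagonal entries are approximately 0
```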
And here is the entire code, which I ran on the Iris dataset; you can easily find this dataset on Google.
```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt


class PCA:
    def __init__(self, n_dimention: int):
        self.n_dimention = n_dimention

    def fit_transform(self, X):
        mean = np.mean(X, axis=0)
        X = X - mean
        cov = X.T.dot(X) / X.shape[0]
        eigen_values, eigen_vectors = np.linalg.eig(cov)
        select_index = np.argsort(eigen_values)[::-1][:self.n_dimention]
        U = eigen_vectors[:, select_index]
        X_new = X.dot(U)
        return X_new


if __name__ == "__main__":
    df = pd.read_csv(r"/content/iris.csv")
    X = df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].to_numpy()
    Y = df["species"].to_numpy()
    pca = PCA(n_dimention=2)
    new_X = pca.fit_transform(X)
    for label in set(Y):
        X_class = new_X[Y == label]
        plt.scatter(X_class[:, 0], X_class[:, 1], label=label)
    plt.legend()
    plt.show()
```
And this is the result of reducing the Iris dataset from 4 dimensions down to 2:
4. Conclusion
In this article, we have learned how the PCA algorithm works for dimensionality reduction, as well as how to implement it in Python. Thank you for reading, and remember to Upvote if you found the article useful!