ML From Scratch: The PCA Dimensionality Reduction Algorithm

Tram Ho

Hello everyone! In this Machine Learning From Scratch series, we will implement basic machine learning algorithms from scratch to better understand how these algorithms really work.


1. Why is dimensionality reduction necessary?

As you probably know, the data in machine learning problems is often very large. Computers can understand and run algorithms on this data, but for humans, "seeing" multi-dimensional data is genuinely difficult. The dimensionality reduction problem was therefore born to give people a new perspective on multi-dimensional data. Beyond visualization, dimensionality reduction methods also map the data into a new space, which can uncover hidden properties that are not clearly visible in the original dimensions, or simply shrink the data to speed up computation.

Take, for example, the legendary Iris dataset, which consists of 4 attributes and 3 labels corresponding to 3 flower species. It is hard to tell whether these 4 attributes separate the species from one another, because that would require visualizing the data in 4-dimensional space. A dimensionality reduction algorithm brings the data down to 2D space, where it can easily be plotted on the Oxy coordinate system; in return, we have to accept losing a certain amount of information. And this is the result:

Looking at this plot, we can analyze the data much more easily: we can see which classes are easily confused with one another, the degree of separation between classes, and so on.

2. The PCA (Principal Component Analysis) algorithm

Conceptually, the PCA algorithm finds a new coordinate system for the data and maximizes the variance of the data along the axes of that new space, then selects the n dimensions with the largest variance (the assumption being that the more spread out the data is along a direction, the larger its variance and the more information that direction carries).

The figure above illustrates this idea of variance: in the original space $O_1xy$, the two classes overlap considerably when projected onto each axis, whereas in the new space the projections of the two classes onto each new axis are much better separated.

To do this, the PCA algorithm needs to go through the following steps:

  • Step 1: Prepare the data to be reduced as a matrix X of shape (n_sample, n_feature), where each row is one data sample with n_feature attributes
  • Step 2: Subtract the mean (expectation) vector from each data point:
    $X_k = X_k - \bar{X}$
  • Step 3: Compute the covariance matrix:
    $S = \frac{1}{n_{sample}} X^T X$
  • Step 4: Find the eigenvalues and eigenvectors of the matrix S
  • Step 5: Take the k largest eigenvalues and build a matrix U whose columns are the eigenvectors corresponding to the k selected eigenvalues
  • Step 6: Map the original space to the k-dimensional space:
    $X_{new} = X U$
  • Note: If the multiplication in Step 6 is unclear, consider a single data sample (one row of X): it is dot-multiplied with each of the k eigenvectors (the columns of U), producing k numbers, so each original sample ends up with exactly k dimensions. A small shape-check sketch follows this list.
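To make the note concrete, here is the shape bookkeeping in NumPy (a minimal sketch with made-up sizes; the variable names are just for illustration):

```python
import numpy as np

# hypothetical sizes: 150 samples, 4 features, reduced to k = 2 dimensions
X = np.random.rand(150, 4)   # (n_sample, n_feature)
U = np.random.rand(4, 2)     # columns play the role of the k chosen eigenvectors
X_new = X @ U                # each row of X is dotted with each column of U
print(X_new.shape)           # (150, 2): every sample now has k = 2 coordinates
```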

3. Python implementation

Assuming we have a data matrix X, I will briefly walk through Steps 2 to 6 so everyone can follow along:

  • Step 2: Calculate the mean vector, then subtract it from every data point
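A minimal sketch of this step, assuming X is a NumPy array of shape (n_sample, n_feature):

```python
import numpy as np

mean_vector = X.mean(axis=0)    # per-feature mean, shape (n_feature,)
X_centered = X - mean_vector    # broadcasting subtracts the mean from every row
```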

  • Step 3: Find the Covariance Matrix
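Continuing the sketch from Step 2:

```python
n_sample = X_centered.shape[0]
# covariance matrix S = (1 / n_sample) * X^T * X, shape (n_feature, n_feature)
S = (X_centered.T @ X_centered) / n_sample
```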

  • Step 4: Calculate the eigenvalues and eigenvectors
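Since the covariance matrix is symmetric, np.linalg.eigh is a natural choice here (np.linalg.eig would work as well):

```python
# eigh returns real eigenvalues in ascending order; eigenvectors are the columns
eig_values, eig_vectors = np.linalg.eigh(S)
```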

  • Step 5: In this step, I take the indices of the eigenvalues sorted from largest to smallest, then choose the k eigenvectors corresponding to the first k indices to create the matrix U.
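One way to do this, with k = 2 chosen here because we will project Iris down to 2 dimensions:

```python
k = 2                               # target number of dimensions
idx = np.argsort(eig_values)[::-1]  # eigenvalue indices, largest first
U = eig_vectors[:, idx[:k]]         # shape (n_feature, k): top-k eigenvectors as columns
```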

  • Step 6: Map the data to the new space
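The projection is then a single matrix multiplication:

```python
X_new = X_centered @ U   # projected data, shape (n_sample, k)
```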

And here is the entire code, which I ran on the Iris dataset; you can easily find this dataset on Google.
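Since the original listing does not appear in this version of the article, below is a minimal end-to-end sketch of Steps 2 through 6. It loads Iris through scikit-learn's load_iris purely for convenience (an assumption on my part; any copy of the dataset works) and plots the 2D result with matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

def pca(X, k):
    """Reduce X of shape (n_sample, n_feature) down to k dimensions."""
    X_centered = X - X.mean(axis=0)                # Step 2: center the data
    S = (X_centered.T @ X_centered) / X.shape[0]   # Step 3: covariance matrix
    eig_values, eig_vectors = np.linalg.eigh(S)    # Step 4: eigendecomposition
    idx = np.argsort(eig_values)[::-1]             # Step 5: sort large -> small
    U = eig_vectors[:, idx[:k]]                    #         top-k eigenvectors
    return X_centered @ U                          # Step 6: project

iris = load_iris()
X_new = pca(iris.data, k=2)

# scatter plot of the 2D projection, one color per flower species
for label in np.unique(iris.target):
    mask = iris.target == label
    plt.scatter(X_new[mask, 0], X_new[mask, 1], label=iris.target_names[label])
plt.legend()
plt.show()
```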

And this is the result when converting the Iris dataset from 4 dimensions down to 2 dimensions.

4. Conclusion

In this article, we have learned how the PCA algorithm works for dimensionality reduction, as well as how to implement it in Python. Thank you for reading, and remember to Upvote if you found the article useful!

Source: Viblo