Hello everyone! In this Machine Learning From Scratch series, we will implement basic machine learning algorithms ourselves to better understand how these algorithms work.
1. Why is it necessary to reduce the data dimension?
As you know, the data in machine learning problems is often very large. Computers can understand this data and execute algorithms on it, but for humans, "seeing" multidimensional data is really difficult. The problem of dimensionality reduction was therefore born, to give people a new perspective on multidimensional data. Besides visualization, dimensionality reduction methods also map the data into a new space, which can uncover hidden properties that were not clearly visible in the original dimensions, or simply shrink the data to speed up computation.
2. PCA (Principal component analysis) algorithm
Conceptually, the PCA algorithm finds a new coordinate system that maximizes the variance of the data, then keeps the n dimensions with the largest variance (the assumption being that the more spread out the data is along a direction, the larger the variance and the more information that direction carries).
The figure above illustrates this with the variance: in the original space ($O_1xy$), the overlap of the two classes when each is projected onto an axis is quite large, while in the new space the two classes separate much more clearly.
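To make the "maximize the variance" idea concrete, here is a small sketch of my own (not from the article): for correlated 2D data, the variance of the projection depends on the direction we project onto, and a direction aligned with the data's spread captures more variance than a raw axis.

```python
import numpy as np

# Toy illustration (my own example): generate 2D data whose two
# features are correlated, then compare the variance of projections
# onto two different directions.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=500)])
X = X - X.mean(axis=0)  # center the data

def projected_variance(X, direction):
    d = direction / np.linalg.norm(direction)  # unit vector
    return np.var(X.dot(d))                    # variance along that direction

v_axis = projected_variance(X, np.array([1.0, 0.0]))  # project onto the x-axis
v_diag = projected_variance(X, np.array([1.0, 0.8]))  # roughly the spread direction
print(v_axis, v_diag)  # the diagonal direction captures more variance
```

PCA automates exactly this search: it finds the directions of largest projected variance without us having to guess them.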
To do this, the PCA algorithm needs to go through the following steps:
- Step 1: Prepare the data matrix X to be reduced, with shape (n_sample, n_feature); each row is a data sample with n_feature attributes
- Step 2: Subtract the mean (expectation) vector from every data point: $X_k = X_k - \bar{X}$
- Step 3: Compute the covariance matrix: $S = \frac{1}{n_{sample}} X^T X$
- Step 4: Find the eigenvalues and eigenvectors of the matrix S
- Step 5: Take the k largest eigenvalues and build a matrix U whose columns are the eigenvectors corresponding to those k eigenvalues
- Step 6: Map the original data into the k-dimensional space: $X_{new} = X U$
- Note: If the multiplication in Step 6 is unclear, consider a single data sample (one row of X): it is multiplied by each of the k eigenvectors in turn, producing one coordinate per eigenvector, so the result has k dimensions.
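The note above is easy to verify with a quick shape check (illustrative values of my own, not from the article): multiplying X of shape (n_sample, n_feature) by U of shape (n_feature, k) yields one coordinate per selected eigenvector.

```python
import numpy as np

# Shape check for Step 6: project an (n_sample, n_feature) matrix
# onto k eigenvector columns and confirm the result is k-dimensional.
n_sample, n_feature, k = 150, 4, 2
X = np.ones((n_sample, n_feature))  # placeholder data
U = np.ones((n_feature, k))         # placeholder eigenvector matrix
X_new = X.dot(U)
print(X_new.shape)  # (150, 2)
```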
3. Python implementation:
Assuming we have a data matrix X, I will walk through Steps 2 to 6 so everyone can follow along:
- Step 2: Calculate the mean vector, then subtract the data points for that vector
```python
mean = np.mean(X, axis=0)
X = X - mean
```
- Step 3: Find the Covariance Matrix
```python
cov = X.T.dot(X) / X.shape[0]
```
- Step 4: Calculate eigenvalues, eigenvectors
```python
eigen_values, eigen_vectors = np.linalg.eig(cov)
```
- Step 5: In this step, I will take the index of the eigenvalues from large to small, then choose k eigenvectors to create a matrix U corresponding to the k indexes found.
```python
select_index = np.argsort(eigen_values)[::-1][:k]
U = eigen_vectors[:, select_index]
```
- Step 6: Map data to new space
```python
X_new = X.dot(U)
```
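Before assembling the full program, here is a quick sanity check of Steps 2 through 6 on synthetic data (my own test, not part of the article): after PCA, the transformed features should be uncorrelated, so the off-diagonal entries of their covariance matrix should be (near) zero, and the first component should carry the largest variance.

```python
import numpy as np

# Run Steps 2-6 on synthetic correlated data and verify that the
# covariance matrix of the projected data is (near) diagonal.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
X[:, 1] += 2 * X[:, 0]                             # introduce correlation
k = 2
X = X - np.mean(X, axis=0)                         # Step 2: center
cov = X.T.dot(X) / X.shape[0]                      # Step 3: covariance matrix
eigen_values, eigen_vectors = np.linalg.eig(cov)   # Step 4: eigendecomposition
select_index = np.argsort(eigen_values)[::-1][:k]  # Step 5: top-k eigenvalues
U = eigen_vectors[:, select_index]
X_new = X.dot(U)                                   # Step 6: project
new_cov = X_new.T.dot(X_new) / X_new.shape[0]
print(np.round(new_cov, 6))  # off-diagonal entries are approximately 0
```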
And here is the entire code, which I ran on the Iris dataset; you can easily find this dataset on Google.
```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt


class PCA:
    def __init__(self, n_dimention: int):
        self.n_dimention = n_dimention

    def fit_transform(self, X):
        mean = np.mean(X, axis=0)
        X = X - mean
        cov = X.T.dot(X) / X.shape[0]
        eigen_values, eigen_vectors = np.linalg.eig(cov)
        select_index = np.argsort(eigen_values)[::-1][:self.n_dimention]
        U = eigen_vectors[:, select_index]
        X_new = X.dot(U)
        return X_new


if __name__ == "__main__":
    df = pd.read_csv(r"/content/iris.csv")
    X = df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].to_numpy()
    Y = df["species"].to_numpy()
    pca = PCA(n_dimention=2)
    new_X = pca.fit_transform(X)
    for label in set(Y):
        X_class = new_X[Y == label]
        plt.scatter(X_class[:, 0], X_class[:, 1], label=label)
    plt.legend()
    plt.show()
```
And this is the result of reducing the Iris dataset from 4 dimensions down to 2:
4. Conclusion
In this article, we have learned how the PCA algorithm works for dimensionality reduction, as well as how to implement it in Python. Thank you for reading, and remember to Upvote if you found the article useful!