A Step-by-Step Guide to Principal Component Analysis (PCA)
What is PCA?
Principal Component Analysis (PCA) is a dimensionality reduction technique.
It helps you simplify your dataset by transforming it into fewer variables (called principal components) that still retain most of the important information.
Why use PCA?
To reduce complexity, speed up training, and remove noise — all without losing much accuracy.
When to Use PCA
✅ Your dataset has many features (columns)
✅ Features are correlated
✅ You want to visualize high-dimensional data
✅ You want to compress data while preserving patterns
Step-by-Step Guide to PCA
Let’s go through PCA step by step, both conceptually and with Python.
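The snippets below assume you already have a numeric feature matrix X (and labels y for the plots later on). As a concrete setup for illustration only, one option is to load the Iris dataset from scikit-learn; any numeric feature matrix works the same way.
# Example setup (illustrative assumption: the Iris dataset)
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)  # X: (150, 4) feature matrix, y: class labels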
✅ Step 1: Standardize the Data
PCA is sensitive to the scale of the data, so we standardize each feature to have mean 0 and variance 1.
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
✅ Step 2: Calculate the Covariance Matrix
This tells us how variables relate to one another.
import numpy as np
cov_matrix = np.cov(X_scaled.T)
A covariance matrix shows how much variables change together.
✅ Step 3: Compute the Eigenvalues and Eigenvectors
Eigenvectors represent the direction of new axes (principal components).
Eigenvalues show how much variance is captured by each component.
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)  # np.linalg.eigh also works for symmetric matrices and guarantees real output
✅ Step 4: Sort and Select Top Principal Components
Sort eigenvalues in descending order and choose the top k eigenvectors that capture the most variance.
# Sort eigenvalues and select top components
sorted_idx = np.argsort(eig_vals)[::-1]
eig_vecs_sorted = eig_vecs[:, sorted_idx]
eig_vals_sorted = eig_vals[sorted_idx]
# Choose the top k components (e.g., k=2)
k = 2
eig_vecs_subset = eig_vecs_sorted[:, :k]
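If you want a preview of how much variance each component captures at this stage, you can compute it directly from the sorted eigenvalues. This is a small sketch using the arrays defined above (the same idea scikit-learn exposes later as explained_variance_ratio_):
# Fraction of total variance captured by each component
explained_variance_ratio = eig_vals_sorted / eig_vals_sorted.sum()
print(explained_variance_ratio[:k])  # variance share of the top k components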
✅ Step 5: Transform the Data
Project the original data onto the new k-dimensional space.
X_reduced = X_scaled.dot(eig_vecs_subset)
Now X_reduced has fewer dimensions (e.g., 2 instead of 10), but still holds the most important information.
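A quick way to confirm the reduction is to compare shapes before and after the projection:
print("Original shape:", X_scaled.shape)   # (n_samples, n_features)
print("Reduced shape:", X_reduced.shape)   # (n_samples, k)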
Let’s Do It Using Scikit-learn (Much Easier!)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Step 1: Standardize
X_scaled = StandardScaler().fit_transform(X)
# Step 2–5: Apply PCA
pca = PCA(n_components=2) # choose 2 components
X_pca = pca.fit_transform(X_scaled)
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
Step 6: Visualize the Results
import matplotlib.pyplot as plt
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y) # y = labels, if available
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('PCA: First Two Principal Components')
plt.show()
What is "Explained Variance Ratio"?
This tells you how much information (variance) each principal component holds.
print(pca.explained_variance_ratio_)
Example output:
[0.72, 0.18]
➡️ This means PC1 captures 72% of the variance and PC2 captures 18%, so together they retain 90% of the data’s variance.
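In practice, a common way to choose the number of components is to look at the cumulative explained variance and keep just enough components to reach a target such as 95%. A minimal sketch (the 0.95 threshold is a rule of thumb, not a fixed requirement):
import numpy as np
pca_full = PCA().fit(X_scaled)                   # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
k = int(np.argmax(cumulative >= 0.95)) + 1       # smallest k reaching 95% variance
print(f"{k} components explain {cumulative[k - 1]:.1%} of the variance")
Scikit-learn can also do this directly: passing a float such as PCA(n_components=0.95) keeps enough components to explain that fraction of the variance.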
Summary of Steps
Step 1: Standardize the data
Step 2: Compute the covariance matrix
Step 3: Find eigenvalues and eigenvectors
Step 4: Choose the top k components
Step 5: Transform the data onto the new axes
Step 6: (Optional) Visualize or use for modeling
⚠️ Important Notes
PCA assumes linear relationships.
PCA works best when features are correlated.
PCA doesn't care about labels — it's unsupervised.
PCA can be used before supervised learning to reduce dimensions (see the pipeline sketch below).
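For the last point, here is one way PCA might be chained with a supervised model using a scikit-learn Pipeline. This is a hedged sketch: the classifier (logistic regression), n_components=2, and 5-fold cross-validation are illustrative choices, assuming labeled data X, y as in the setup above.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Standardize -> reduce to 2 components -> classify
model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())
Putting the scaler and PCA inside the pipeline keeps them fitted only on the training folds, which avoids leaking information from the validation data.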
Use Cases of PCA
Data visualization (e.g., compress 50D to 2D)
Noise reduction
Speeding up ML models
Removing multicollinearity
✅ Final Thoughts
PCA is like compressing a high-resolution image — you keep the essential details while dropping the noise.
"Lose the clutter, keep the meaning."
Learning PCA is a solid step toward mastering machine learning, data preprocessing, and dimensionality reduction.