A Step-by-Step Guide to Principal Component Analysis (PCA)

🧠 What is PCA?

Principal Component Analysis (PCA) is a dimensionality reduction technique.

It helps you simplify your dataset by transforming it into fewer variables (called principal components) that still retain most of the important information.

🎯 Why use PCA?

To reduce complexity, speed up training, and remove noise, all without losing much accuracy.

📦 When to Use PCA

Your dataset has many features (columns)

Features are correlated

You want to visualize high-dimensional data

You want to compress data while preserving patterns

🪜 Step-by-Step Guide to PCA

Let’s go through PCA step by step, both conceptually and with Python.
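
The snippets below assume X is a NumPy feature matrix and y an optional vector of labels. As a minimal setup (an assumption for illustration, not part of the original walkthrough), you could use scikit-learn's built-in Iris dataset:

from sklearn.datasets import load_iris

# 150 samples, 4 features, plus class labels
X, y = load_iris(return_X_y=True)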

Step 1: Standardize the Data

PCA is sensitive to the scale of the data, so we must standardize features to have mean = 0 and variance = 1.

from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

Step 2: Calculate the Covariance Matrix

This tells us how variables relate to one another.

import numpy as np

cov_matrix = np.cov(X_scaled.T)

📌 A covariance matrix shows how much variables change together.
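
As a quick sanity check (a sketch using the variables above), the matrix is square with one row and column per feature; after standardization the diagonal entries are each feature's variance (roughly 1) and the off-diagonal entries are the pairwise covariances.

print(cov_matrix.shape)        # (n_features, n_features)
print(np.diag(cov_matrix))     # variances, roughly 1 after standardization
print(cov_matrix[0, 1])        # covariance between the first two features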

Step 3: Compute the Eigenvalues and Eigenvectors

Eigenvectors represent the direction of new axes (principal components).

Eigenvalues show how much variance is captured by each component.

eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
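
A small aside (not in the original walkthrough): because the covariance matrix is symmetric, np.linalg.eigh is also an option; it is designed for symmetric matrices and always returns real eigenvalues, in ascending order.

# Alternative for symmetric matrices: real eigenvalues, ascending order
eig_vals, eig_vecs = np.linalg.eigh(cov_matrix)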

Step 4: Sort and Select Top Principal Components

Sort eigenvalues in descending order and choose the top k eigenvectors that capture the most variance.

# Sort eigenvalues and select top components

sorted_idx = np.argsort(eig_vals)[::-1]

eig_vecs_sorted = eig_vecs[:, sorted_idx]

eig_vals_sorted = eig_vals[sorted_idx]

# Choose the top k components (e.g., k=2)

k = 2

eig_vecs_subset = eig_vecs_sorted[:, :k]
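
To judge whether k components are enough, a common check (a sketch using the variables above) is the fraction of total variance each sorted eigenvalue explains:

# Fraction of total variance explained by each component
explained = eig_vals_sorted / eig_vals_sorted.sum()
print(explained)
print(explained[:k].sum())  # variance retained by the top k components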

Step 5: Transform the Data

Project the original data onto the new k-dimensional space.

X_reduced = X_scaled.dot(eig_vecs_subset)

Now X_reduced has fewer dimensions (e.g., 2 instead of 10), but still holds the most important information.
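
Because the projection is linear, you can also map the reduced data back into the original feature space to see how much structure the k components preserve (a rough sketch using the variables above):

# Approximate reconstruction of the standardized data from k components
X_approx = X_reduced.dot(eig_vecs_subset.T)
print(X_scaled.shape, X_reduced.shape, X_approx.shape)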

🧪 Let’s Do It Using Scikit-learn (Much Easier!)

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler

# Step 1: Standardize

X_scaled = StandardScaler().fit_transform(X)

# Step 2: Apply PCA

pca = PCA(n_components=2) # choose 2 components

X_pca = pca.fit_transform(X_scaled)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)

📈 Step 6: Visualize the Results

import matplotlib.pyplot as plt

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y) # y = labels, if available

plt.xlabel('PC 1')

plt.ylabel('PC 2')

plt.title('PCA: First Two Principal Components')

plt.show()

📊 What is "Explained Variance Ratio"?

This tells you how much information (variance) each principal component holds.

print(pca.explained_variance_ratio_)

Example output:

[0.72, 0.18]

➡️ This means PC1 captures 72% of the variance and PC2 captures 18%; together they retain 90% of the data’s structure.
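
In practice, a common way to pick n_components is to look at the cumulative ratio and keep enough components to reach a target such as 95% of the variance (a sketch; the 0.95 threshold is just an illustrative choice):

import numpy as np

pca_full = PCA().fit(X_scaled)                      # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
k = int(np.argmax(cumulative >= 0.95)) + 1          # smallest k reaching 95%
print(k, cumulative[k - 1])

Recent versions of scikit-learn also accept a float directly, e.g. PCA(n_components=0.95), which keeps just enough components to reach that fraction of variance.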

🔁 Summary of Steps

Step 1: Standardize the data

Step 2: Compute the covariance matrix

Step 3: Find eigenvalues and eigenvectors

Step 4: Choose the top k components

Step 5: Transform the data onto the new axes

Step 6: (Optional) Visualize or use for modeling

⚠️ Important Notes

PCA assumes linear relationships.

PCA works best when features are correlated.

PCA doesn't use labels; it's unsupervised.

PCA can be used before supervised learning to reduce dimensions.
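
For that last note, here is a minimal sketch of PCA as a preprocessing step in a supervised pipeline (LogisticRegression is only an example classifier; X and y are the data loaded earlier):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Scale, reduce to 2 components, then classify
model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))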

🎯 Use Cases of PCA

Data visualization (e.g., compress 50D to 2D)

Noise reduction

Speeding up ML models

Removing multicollinearity
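
As an illustration of the compression and noise-reduction ideas above, you can project data down and map it back with inverse_transform; the round trip keeps only the structure captured by the retained components (a sketch using the pca object fitted earlier):

X_compressed = pca.transform(X_scaled)            # down to 2 components
X_restored = pca.inverse_transform(X_compressed)  # approximate return to the original feature space
print(X_scaled.shape, X_compressed.shape, X_restored.shape)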

Final Thoughts

PCA is like compressing a high-resolution image: you keep the essential details while dropping the noise.

"Lose the clutter, keep the meaning."

Learning PCA is a solid step toward mastering machine learning, data preprocessing, and dimensionality reduction.
