How to Use Principal Component Analysis (PCA) for Dimensionality Reduction

🧠 What is PCA?

Principal Component Analysis (PCA) is a statistical technique used to reduce the number of features (dimensions) in a dataset while preserving as much of the variation (information) as possible.


Instead of working with dozens or hundreds of variables, PCA finds a smaller number of "principal components": new, uncorrelated variables that are weighted combinations of the originals, ordered by how much of the variance they capture.
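
Concretely, the components are the eigenvectors of the data's covariance matrix, sorted by eigenvalue. Here is a minimal NumPy sketch of that idea, using toy random data (X_demo is an assumption purely for illustration; in practice you would use scikit-learn as shown below):

import numpy as np

# Toy random data: 100 samples, 3 features (assumed for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))

X_centered = X_demo - X_demo.mean(axis=0)   # center each feature
cov = np.cov(X_centered, rowvar=False)      # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition (symmetric matrix)

order = np.argsort(eigvals)[::-1]           # sort by descending variance explained
components = eigvecs[:, order]

X_reduced = X_centered @ components[:, :2]  # project onto the top 2 components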


✅ When to Use PCA

You have high-dimensional data (many features)


You want to speed up training or visualize data


You want to reduce multicollinearity (highly correlated features)


You’re okay with losing some interpretability (components are combinations of original features)


🛠️ How to Use PCA (Step-by-Step)

Step 1: Standardize the Data

PCA is sensitive to the scale of the variables, so standardize them first: transform each feature so it has mean 0 and standard deviation 1, i.e. replace each value x with z = (x - mean) / std.



from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)  # X is your original dataset
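
After scaling, every column of X_scaled should have mean close to 0 and standard deviation close to 1; a quick sanity check:

import numpy as np

print(np.round(X_scaled.mean(axis=0), 6))  # should be ~0 for every feature
print(np.round(X_scaled.std(axis=0), 6))   # should be ~1 for every feature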

Step 2: Apply PCA

You decide how many components you want (e.g., keep 95% of the variance or reduce to 2 for visualization).



from sklearn.decomposition import PCA


# Option 1: Keep 95% of variance

pca = PCA(n_components=0.95)


# Option 2: Reduce to 2 components

# pca = PCA(n_components=2)


X_pca = pca.fit_transform(X_scaled)
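
One common pitfall: if you have a train/test split, fit the scaler and PCA on the training data only, then apply the same transformation to the test data. A sketch, where X_train and X_test are assumed to come from your own split:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn scaling from training data only
X_test_scaled = scaler.transform(X_test)         # reuse the training statistics

pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)  # learn components on training data
X_test_pca = pca.transform(X_test_scaled)        # project test data with the same components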

Step 3: Analyze Results

You can look at the explained variance:



print(pca.explained_variance_ratio_)  # fraction of total variance captured by each component

print(pca.n_components_)  # number of components actually kept

This tells you how much information (variance) each principal component captures.
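
To choose a cutoff, it often helps to plot the cumulative explained variance against the number of components; a minimal sketch that refits PCA with all components kept:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca_full = PCA().fit(X_scaled)  # keep all components to see the full curve
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
plt.axhline(0.95, linestyle='--')  # the 95% threshold used above
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()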


Step 4: Use Transformed Data

You can now use X_pca (your reduced-dimension dataset) for:


Visualization


Feeding into a machine learning model (see the pipeline sketch after this list)


Clustering (e.g., K-Means)


Noise reduction
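
For example, here is a minimal sketch of feeding the reduced features into a classifier via a scikit-learn Pipeline (LogisticRegression is an arbitrary choice, and y, your target labels, is assumed to exist):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Chaining the steps keeps scaling and PCA fitted consistently with the model
model = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('clf', LogisticRegression(max_iter=1000)),
])

model.fit(X, y)           # y: class labels, assumed available
print(model.score(X, y))  # accuracy on the training data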


📊 Optional: Visualize PCA Results


import matplotlib.pyplot as plt


plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)  # 'labels' are class labels if available

plt.xlabel('PC1')

plt.ylabel('PC2')

plt.title('PCA Result')

plt.show()
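
To get a rough sense of what each component means, you can inspect the loadings, i.e. the weight of each original feature in each component. A small sketch, assuming feature_names is a list of your column names:

import numpy as np

# Each row of pca.components_ is one principal component,
# expressed as weights over the original features
for i, component in enumerate(pca.components_[:2]):
    top = np.argsort(np.abs(component))[::-1][:3]  # indices of the 3 largest weights
    print(f"PC{i + 1}:", [(feature_names[j], round(component[j], 2)) for j in top])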

📌 Notes & Tips

PCA is unsupervised: it ignores target labels.


It works best when features are linearly correlated.


It can reduce overfitting and speed up training.


However, the principal components are not always easy to interpret; inspecting the loadings (as sketched above) can help.
