How to Use Principal Component Analysis (PCA) for Dimensionality Reduction
What is PCA?
Principal Component Analysis (PCA) is a statistical technique used to reduce the number of features (dimensions) in a dataset while preserving as much of the variation (information) as possible.
Instead of working with dozens or hundreds of variables, PCA finds a smaller number of "principal components" — new variables that summarize the original ones.
✅ When to Use PCA
You have high-dimensional data (many features)
You want to speed up training or visualize data
You want to reduce multicollinearity (highly correlated features)
You’re okay with losing some interpretability (components are combinations of original features)
How to Use PCA (Step-by-Step)
Step 1: Standardize the Data
PCA is sensitive to the scale of the variables. Standardize them first so they all have mean = 0 and standard deviation = 1.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # X is your original dataset
Step 2: Apply PCA
You decide how many components you want (e.g., keep 95% of the variance or reduce to 2 for visualization).
from sklearn.decomposition import PCA
# Option 1: Keep 95% of variance
pca = PCA(n_components=0.95)
# Option 2: Reduce to 2 components
# pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
Step 3: Analyze Results
You can look at the explained variance:
print(pca.explained_variance_ratio_)
print(pca.n_components_)
This tells you how much information (variance) each principal component captures.
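To dig a little deeper, you can also check how the variance accumulates across components and which original features carry the most weight in each one. A minimal sketch; feature_names is assumed to be your own list of column names (it is not defined above):

import numpy as np

# Cumulative share of variance captured by the first k components
print(np.cumsum(pca.explained_variance_ratio_))

# Each row of pca.components_ holds the weights (loadings) that combine
# the original features into one principal component
for i, component in enumerate(pca.components_[:2]):
    print(f'PC{i + 1} loadings:', dict(zip(feature_names, component.round(2))))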
Step 4: Use Transformed Data
You can now use X_pca (your reduced-dimension dataset) for:
Visualization
Feeding into a machine learning model
Clustering (e.g., K-Means)
Noise reduction (the last two are sketched below)
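As a quick illustration of the last two items, here is a minimal sketch; the choice of 3 clusters is an arbitrary placeholder, not something derived from your data:

from sklearn.cluster import KMeans

# Cluster in the reduced space (3 clusters is a placeholder value)
kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(X_pca)

# Project back to the original feature space; the discarded low-variance
# components stay dropped, which acts as a simple form of noise reduction.
# Note the result is still on the standardized scale; apply
# scaler.inverse_transform to return to the original units.
X_denoised = pca.inverse_transform(X_pca)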
Optional: Visualize PCA Results
If you kept at least two components, you can plot the data along the first two:
import matplotlib.pyplot as plt
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)  # 'labels' are class labels; drop c=labels if you have none
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA Result')
plt.show()
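A small optional refinement: include each component's share of variance in the axis labels, so a reader can judge how faithful the 2-D picture is. This reuses pca and X_pca from the steps above:

pc1_var, pc2_var = pca.explained_variance_ratio_[:2] * 100  # percent of total variance

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
plt.xlabel(f'PC1 ({pc1_var:.1f}% of variance)')
plt.ylabel(f'PC2 ({pc2_var:.1f}% of variance)')
plt.show()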
Notes & Tips
PCA is unsupervised: it ignores target labels.
It works best when features are linearly correlated.
It can reduce overfitting and speed up training (see the pipeline sketch after these notes).
However, the principal components are not always easy to interpret.
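One practical caution: scaling and PCA should be fit on the training data only, otherwise information from the test set leaks into the transformation. A common way to enforce this in scikit-learn is a Pipeline. A minimal sketch, assuming X and y are your feature matrix and target labels:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The scaler and PCA are re-fit on each training fold, so no test data
# leaks into the transformation
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('clf', LogisticRegression(max_iter=1000)),  # any estimator works here
])
scores = cross_val_score(pipe, X, y, cv=5)  # X, y assumed: features and labels
print(scores.mean())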