How to Handle Categorical Data in Machine Learning Models
🧠 What is Categorical Data?
Categorical data represents variables with a limited number of possible values. These can be:
Nominal: No order (e.g., Color: Red, Green, Blue)
Ordinal: Ordered categories (e.g., Size: Small, Medium, Large)
Machine learning models work best with numerical data, so we need to convert categorical variables into a format they can understand.
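The snippets below assume a pandas DataFrame called df. Here is a small, made-up example with a nominal 'Color' column and an ordinal 'Size' column that you could use to try them out:

import pandas as pd

# Tiny illustrative dataset (hypothetical values)
df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green'],
    'Size': ['Small', 'Large', 'Medium', 'Small']
})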
🛠️ Common Techniques to Handle Categorical Data
1. Label Encoding
Assign each category a unique number.
from sklearn.preprocessing import LabelEncoder

# Replace each category in 'Color' with an integer code (assigned alphabetically)
le = LabelEncoder()
df['Color'] = le.fit_transform(df['Color'])
✅ When to use:
For ordinal data
With tree-based models (e.g., Decision Trees, Random Forests)
❌ Avoid with:
Nominal data + linear models (can mislead the model)
2. One-Hot Encoding
Creates a new binary column for each category.
import pandas as pd

# Replace 'Color' with one binary indicator column per category
df = pd.get_dummies(df, columns=['Color'])
✅ When to use:
For nominal (unordered) categories
With linear models, SVM, neural networks
❌ Avoid when:
Feature has many unique categories (can create too many columns; see the sketch below for one way to limit this)
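As an aside, scikit-learn's OneHotEncoder is an alternative to get_dummies that fits into pipelines. The sketch below assumes a recent scikit-learn version (1.1 or newer), where min_frequency groups rare categories together and helps keep the column count down:

from sklearn.preprocessing import OneHotEncoder

# Ignore categories unseen at fit time; group categories covering <5% of rows
# into a single "infrequent" column (returns a sparse matrix by default)
ohe = OneHotEncoder(handle_unknown='ignore', min_frequency=0.05)
color_encoded = ohe.fit_transform(df[['Color']])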
3. Ordinal Encoding
Manually assign numbers to categories based on their order.
# Map each size to an integer that preserves the natural order
df['Size'] = df['Size'].map({'Small': 1, 'Medium': 2, 'Large': 3})
✅ When to use:
For ordinal data
When the order matters
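Alternatively (instead of the manual map above), scikit-learn's OrdinalEncoder can apply the same idea if you pass the category order explicitly; note that it assigns 0-based codes rather than the 1-based ones used above:

from sklearn.preprocessing import OrdinalEncoder

# Explicit order: Small=0, Medium=1, Large=2
enc = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df[['Size']] = enc.fit_transform(df[['Size']])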
4. Binary Encoding / Target Encoding / Frequency Encoding
More advanced techniques to deal with high-cardinality features (many unique categories):
Binary Encoding: Converts categories to binary numbers
Target Encoding: Replace each category with the average target value
Frequency Encoding: Replace each category with how often it appears
These are helpful when One-Hot Encoding creates too many columns.
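As a minimal sketch, Frequency and Target Encoding can be done with plain pandas. This assumes df has a high-cardinality column named 'City' and a numeric target column named 'Target' (both names are illustrative, not part of the earlier examples):

# Frequency Encoding: share of rows in which each city appears
freq = df['City'].value_counts(normalize=True)
df['City_freq'] = df['City'].map(freq)

# Target Encoding: mean of the target within each city
city_means = df.groupby('City')['Target'].mean()
df['City_target'] = df['City'].map(city_means)

Binary Encoding is usually done with a dedicated package (for example, the category_encoders library) rather than by hand in pandas.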
✅ Best Practices
Situation → Recommended Method
Few categories (nominal) → One-Hot Encoding
Ordinal categories → Ordinal or Label Encoding
Many categories (high cardinality) → Target or Frequency Encoding
Tree-based models → Label, Ordinal, or Target Encoding
Linear models → One-Hot or Target Encoding
❗ Watch Out For:
Data leakage with Target Encoding (use cross-validation or fit the encoding on the training split only; see the sketch after this list)
Too many features from One-Hot Encoding
Incorrect ordering in Label or Ordinal Encoding
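Here is a minimal sketch of leakage-safe Target Encoding, reusing the illustrative 'City' and 'Target' columns from above: the per-category means are computed on the training split only and then applied to the test split.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Fit the encoding on the training data only
city_means = train.groupby('City')['Target'].mean()
global_mean = train['Target'].mean()

# Apply it to both splits; cities unseen in training fall back to the global mean
train['City_target'] = train['City'].map(city_means)
test['City_target'] = test['City'].map(city_means).fillna(global_mean)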
🧪 Example: Comparing Techniques
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# 'Color' is nominal, so it is one-hot encoded; 'Size' is assumed to be
# numerically encoded already (e.g., via the mapping above) and is passed through
preprocessor = ColumnTransformer(
    transformers=[
        ('color', OneHotEncoder(), ['Color']),
        ('size', 'passthrough', ['Size'])
    ])

X_transformed = preprocessor.fit_transform(df)
✅ Summary
Always analyze your data: Is it nominal or ordinal?
Choose the encoding based on your model type and the number of categories
Be careful with high-cardinality features and data leakage