🧠 What is Categorical Data?
Categorical data represents variables with a limited number of possible values. These can be:
Nominal: No order (e.g., Color: Red, Green, Blue)
Ordinal: Ordered categories (e.g., Size: Small, Medium, Large)
Most machine learning algorithms expect numeric input, so categorical variables must be encoded as numbers before training.
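To make the nominal/ordinal distinction concrete, here is a minimal sketch using pandas categorical dtypes. The toy values and variable names are made up for illustration; they are not from a real dataset.

```python
import pandas as pd

# Nominal: the categories carry no order
color = pd.Series(["Red", "Green", "Blue", "Green"], dtype="category")

# Ordinal: the order is declared explicitly, smallest first
size_dtype = pd.CategoricalDtype(["Small", "Medium", "Large"], ordered=True)
size = pd.Series(["Medium", "Small", "Large"], dtype=size_dtype)

is_nominal_ordered = color.cat.ordered    # False: no order defined
size_codes = size.cat.codes.tolist()      # integer positions in the declared order
largest = size.max()                      # comparisons work only for ordered categories
```

Declaring the order up front means later encoding steps can reuse it instead of guessing.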
🛠️ Common Techniques to Handle Categorical Data
1. Label Encoding
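Assign each category a unique integer. Note that scikit-learn's LabelEncoder numbers the categories in sorted (alphabetical) order, not in any meaningful order.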
python
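from sklearn.preprocessing import LabelEncoder

# df is assumed to be an existing pandas DataFrame with a 'Color' column
le = LabelEncoder()
df['Color'] = le.fit_transform(df['Color'])  # integer codes follow the sorted order of the labels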
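✅ When to use:
For the target variable (LabelEncoder is designed for labels rather than input features)
With tree-based models (e.g., Decision Trees, Random Forests), which are less sensitive to an arbitrary numbering
❌ Avoid with:
Nominal data + linear models (the integers imply an order and spacing that do not exist)
Ordinal data whose real order differs from alphabetical order (use Ordinal Encoding instead)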
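A quick check of how LabelEncoder numbers its categories shows why it can mislead on ordered data. This is a small sketch with a made-up list of sizes:

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["Small", "Medium", "Large", "Medium"]

le = LabelEncoder()
codes = le.fit_transform(sizes)

# classes_ holds the categories in the order used for numbering (sorted)
classes = [str(c) for c in le.classes_]
codes_list = codes.tolist()

print(classes)      # ['Large', 'Medium', 'Small']
print(codes_list)   # [2, 1, 0, 1]
```

Here 'Large' gets 0 and 'Small' gets 2, the reverse of the real order, simply because of alphabetical sorting.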
2. One-Hot Encoding
Creates a new binary column for each category.
python
import pandas as pd
df = pd.get_dummies(df, columns=['Color'])
✅ When to use:
For nominal (unordered) categories
With linear models, SVM, neural networks
❌ Avoid when:
Feature has many unique categories (can create too many columns)
3. Ordinal Encoding
Manually assign numbers to categories based on their order.
python
df['Size'] = df['Size'].map({'Small': 1, 'Medium': 2, 'Large': 3})
✅ When to use:
For ordinal data
When the order matters
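An alternative to a manual mapping is scikit-learn's OrdinalEncoder with an explicit category order. This is a minimal sketch on toy data; note that the codes start at 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Size": ["Medium", "Small", "Large"]})

# The list gives the order from lowest to highest
enc = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
df["Size_encoded"] = enc.fit_transform(df[["Size"]]).ravel()

print(df["Size_encoded"].tolist())   # [1.0, 0.0, 2.0]
```

Passing the order explicitly avoids the alphabetical default and lets the same fitted encoder be reused on new data.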
4. Binary Encoding / Target Encoding / Frequency Encoding
More advanced techniques for high-cardinality features (many unique categories):
Binary Encoding: Assigns each category an integer and stores that integer's binary digits in separate columns
Target Encoding: Replaces each category with the average target value for that category
Frequency Encoding: Replaces each category with how often it appears
These are helpful when One-Hot Encoding creates too many columns.
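Frequency and binary encoding can both be sketched with plain pandas. The 'City' column and the generated column names below are hypothetical, and the binary scheme is a simplified illustration; dedicated encoding libraries may number categories differently.

```python
import pandas as pd

df = pd.DataFrame({"City": ["Paris", "Rome", "Paris", "Oslo", "Paris", "Rome"]})

# Frequency encoding: share of rows in which each category appears
freq = df["City"].value_counts(normalize=True)
df["City_freq"] = df["City"].map(freq)

# Binary encoding: one integer code per category (sorted order here),
# then one column per binary digit of that code
codes = df["City"].astype("category").cat.codes.to_numpy()
n_bits = max(int(codes.max()).bit_length(), 1)
for bit in range(n_bits):
    df[f"City_bin{bit}"] = (codes >> bit) & 1
```

Three cities fit in two binary columns here; in general, n categories need about log2(n) columns instead of the n columns One-Hot Encoding would create.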
✅ Best Practices
Situation                             Recommended Method
Few categories (nominal)              One-Hot Encoding
Ordinal categories                    Ordinal Encoding (with an explicit order)
Many categories (high cardinality)    Target or Frequency Encoding
Tree-based models                     Label, Ordinal, or Target
Linear models                         One-Hot or Target Encoding
❗ Watch Out For:
Data leakage with Target Encoding (compute the category averages from training data only, e.g. after the train/test split or out-of-fold with cross-validation)
Too many features from One-Hot Encoding
Incorrect ordering in Label or Ordinal Encoding
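To illustrate the leakage point, here is one possible sketch of out-of-fold target encoding: each row is encoded with category averages computed from the other folds, so its own target never contributes to its encoded value. The data, column names, and number of folds are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "City":   ["A", "A", "B", "B", "A", "B", "A", "B"],
    "target": [1,   0,   1,   1,   0,   1,   1,   0],
})

global_mean = df["target"].mean()   # fallback for categories missing from a fold
df["City_te"] = np.nan

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    # Category means computed WITHOUT the rows being encoded
    means = df.iloc[fit_idx].groupby("City")["target"].mean()
    encoded = df["City"].iloc[enc_idx].map(means).fillna(global_mean)
    df.loc[df.index[enc_idx], "City_te"] = encoded.to_numpy()
```

For a held-out test set, the averages would be computed once from the full training set and then applied to the test rows.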
🧪 Example: Comparing Techniques
python
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Assume 'Color' is nominal and 'Size' is ordinal and already mapped to integers
preprocessor = ColumnTransformer(
    transformers=[
        ('color', OneHotEncoder(), ['Color']),   # one binary column per color
        ('size', 'passthrough', ['Size']),       # keep the numeric Size column as-is
    ])
X_transformed = preprocessor.fit_transform(df)
✅ Summary
Always analyze your data: Is it nominal or ordinal?
Choose encoding based on your model type and data size
Be careful with high-cardinality features and data leakage