How to Handle Categorical Data in Machine Learning Models
🧠 What is Categorical Data?
Categorical data represents variables with a limited number of possible values. These can be:
Nominal: No order (e.g., Color: Red, Green, Blue)
Ordinal: Ordered categories (e.g., Size: Small, Medium, Large)
Machine learning models work best with numerical data, so we need to convert categorical variables into a format they can understand.
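The snippets below assume a pandas DataFrame called df. Here is a small, made-up example with a nominal 'Color' column and an ordinal 'Size' column that you could use to try them out:

import pandas as pd

# Tiny illustrative dataset (hypothetical values)
df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green'],
    'Size': ['Small', 'Large', 'Medium', 'Small']
})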
🛠️ Common Techniques to Handle Categorical Data
1. Label Encoding
Assign each category a unique number.
from sklearn.preprocessing import LabelEncoder

# Replace each category in 'Color' with an integer code (assigned alphabetically)
le = LabelEncoder()
df['Color'] = le.fit_transform(df['Color'])
✅ When to use:
For ordinal data
With tree-based models (e.g., Decision Trees, Random Forests)
❌ Avoid with:
Nominal data + linear models (can mislead the model)
2. One-Hot Encoding
Creates a new binary column for each category.
import pandas as pd

# Replace 'Color' with one binary indicator column per category
df = pd.get_dummies(df, columns=['Color'])
✅ When to use:
For nominal (unordered) categories
With linear models, SVM, neural networks
❌ Avoid when:
Feature has many unique categories (can create too many columns; see the sketch below for one way to limit this)
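As an aside, scikit-learn's OneHotEncoder is an alternative to get_dummies that fits into pipelines. The sketch below assumes a recent scikit-learn version (1.1 or newer), where min_frequency groups rare categories together and helps keep the column count down:

from sklearn.preprocessing import OneHotEncoder

# Ignore categories unseen at fit time; group categories covering <5% of rows
# into a single "infrequent" column (returns a sparse matrix by default)
ohe = OneHotEncoder(handle_unknown='ignore', min_frequency=0.05)
color_encoded = ohe.fit_transform(df[['Color']])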
3. Ordinal Encoding
Manually assign numbers to categories based on their order.
# Map each size to an integer that preserves the natural order
df['Size'] = df['Size'].map({'Small': 1, 'Medium': 2, 'Large': 3})
✅ When to use:
For ordinal data
When the order matters
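Alternatively (instead of the manual map above), scikit-learn's OrdinalEncoder can apply the same idea if you pass the category order explicitly; note that it assigns 0-based codes rather than the 1-based ones used above:

from sklearn.preprocessing import OrdinalEncoder

# Explicit order: Small=0, Medium=1, Large=2
enc = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df[['Size']] = enc.fit_transform(df[['Size']])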
4. Binary Encoding / Target Encoding / Frequency Encoding
More advanced techniques to deal with high-cardinality features (many unique categories):
Binary Encoding: Converts categories to binary numbers
Target Encoding: Replace each category with the average target value
Frequency Encoding: Replace each category with how often it appears
These are helpful when One-Hot Encoding creates too many columns.
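As a minimal sketch, Frequency and Target Encoding can be done with plain pandas. This assumes df has a high-cardinality column named 'City' and a numeric target column named 'Target' (both names are illustrative, not part of the earlier examples):

# Frequency Encoding: share of rows in which each city appears
freq = df['City'].value_counts(normalize=True)
df['City_freq'] = df['City'].map(freq)

# Target Encoding: mean of the target within each city
city_means = df.groupby('City')['Target'].mean()
df['City_target'] = df['City'].map(city_means)

Binary Encoding is usually done with a dedicated package (for example, the category_encoders library) rather than by hand in pandas.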
✅ Best Practices
Situation → Recommended Method
Few categories (nominal) → One-Hot Encoding
Ordinal categories → Ordinal or Label Encoding
Many categories (high cardinality) → Target or Frequency Encoding
Tree-based models → Label, Ordinal, or Target Encoding
Linear models → One-Hot or Target Encoding
❗ Watch Out For:
Data leakage with Target Encoding (use cross-validation or fit the encoding on the training split only; see the sketch after this list)
Too many features from One-Hot Encoding
Incorrect ordering in Label or Ordinal Encoding
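Here is a minimal sketch of leakage-safe Target Encoding, reusing the illustrative 'City' and 'Target' columns from above: the per-category means are computed on the training split only and then applied to the test split.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Fit the encoding on the training data only
city_means = train.groupby('City')['Target'].mean()
global_mean = train['Target'].mean()

# Apply it to both splits; cities unseen in training fall back to the global mean
train['City_target'] = train['City'].map(city_means)
test['City_target'] = test['City'].map(city_means).fillna(global_mean)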
🧪 Example: Comparing Techniques
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# 'Color' is nominal, so it is one-hot encoded; 'Size' is assumed to be
# numerically encoded already (e.g., via the mapping above) and is passed through
preprocessor = ColumnTransformer(
    transformers=[
        ('color', OneHotEncoder(), ['Color']),
        ('size', 'passthrough', ['Size'])
    ])

X_transformed = preprocessor.fit_transform(df)
✅ Summary
Always analyze your data: Is it nominal or ordinal?
Choose the encoding based on your model type and the number of categories
Be careful with high-cardinality features and data leakage