How to Handle Categorical Data in Machine Learning Models

🧠 What is Categorical Data?

Categorical data represents variables with a limited number of possible values. These can be:


Nominal: No order (e.g., Color: Red, Green, Blue)


Ordinal: Ordered categories (e.g., Size: Small, Medium, Large)


Machine learning models work best with numerical data, so we need to convert categorical variables into a format they can understand.


🛠️ Common Techniques to Handle Categorical Data

1. Label Encoding

Assign each category a unique number.


from sklearn.preprocessing import LabelEncoder

# df is an existing DataFrame with a 'Color' column
le = LabelEncoder()
df['Color'] = le.fit_transform(df['Color'])

✅ When to use:

For ordinal data (but check the assigned codes: LabelEncoder numbers categories alphabetically, which may not match their real order)

With tree-based models (e.g., Decision Trees, Random Forests)

❌ Avoid with:

Nominal data + linear models (the numbers imply an order that does not exist and can mislead the model)

2. One-Hot Encoding

Creates a new binary column for each category.


import pandas as pd

# Each color becomes its own 0/1 column (e.g., Color_Red, Color_Green, Color_Blue)
df = pd.get_dummies(df, columns=['Color'])

✅ When to use:

For nominal (unordered) categories


With linear models, SVMs, and neural networks


❌ Avoid when:

Feature has many unique categories (can create too many columns)


3. Ordinal Encoding

Manually assign numbers to categories based on their order.


# Map each size to a number that preserves the natural order
df['Size'] = df['Size'].map({'Small': 1, 'Medium': 2, 'Large': 3})

✅ When to use:

For ordinal data


When the order matters


4. Binary Encoding / Target Encoding / Frequency Encoding

More advanced techniques to deal with high-cardinality features (many unique categories):


Binary Encoding: Converts categories to binary numbers


Target Encoding: Replace each category with the average target value


Frequency Encoding: Replace each category with how often it appears


These are helpful when One-Hot Encoding creates too many columns.
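As a rough illustration, here is a minimal pandas-only sketch of Frequency and Target Encoding, assuming a hypothetical DataFrame with a high-cardinality 'City' column and a numeric 'Price' target (Binary Encoding usually relies on a third-party package such as category_encoders, so it is not shown here):

import pandas as pd

# Hypothetical example data: 'City' is a high-cardinality feature, 'Price' is the target
df = pd.DataFrame({
    'City': ['Hyderabad', 'Delhi', 'Hyderabad', 'Mumbai', 'Delhi', 'Hyderabad'],
    'Price': [100, 150, 120, 200, 160, 110]
})

# Frequency Encoding: each city is replaced by how often it appears
freq = df['City'].value_counts(normalize=True)
df['City_freq'] = df['City'].map(freq)

# Target Encoding: each city is replaced by the mean target value for that city
target_means = df.groupby('City')['Price'].mean()
df['City_target'] = df['City'].map(target_means)

print(df)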


✅ Best Practices

Situation and recommended method:

Few categories (nominal): One-Hot Encoding
Ordinal categories: Ordinal or Label Encoding
Many categories (high cardinality): Target or Frequency Encoding
Tree-based models: Label, Ordinal, or Target Encoding
Linear models: One-Hot or Target Encoding


❗ Watch Out For:

Data leakage with Target Encoding (use cross-validation or encode after splitting; see the sketch after this list)


Too many features from One-Hot Encoding


Incorrect ordering in Label or Ordinal Encoding
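
To make the Target Encoding leakage point concrete, here is a minimal sketch (with hypothetical column names) that learns the encoding from the training split only, so the test target never leaks into the features:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 'City' is categorical, 'Price' is the target
df = pd.DataFrame({
    'City': ['Hyderabad', 'Delhi', 'Hyderabad', 'Mumbai', 'Delhi', 'Chennai'],
    'Price': [100, 150, 120, 200, 160, 140]
})

# Split first, then encode
train, test = train_test_split(df, test_size=0.33, random_state=42)
train, test = train.copy(), test.copy()

# Learn the category-to-mean-target mapping from the training rows only
target_means = train.groupby('City')['Price'].mean()
global_mean = train['Price'].mean()

train['City_enc'] = train['City'].map(target_means)
# Categories unseen in training fall back to the global training mean
test['City_enc'] = test['City'].map(target_means).fillna(global_mean)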


🧪 Example: Comparing Techniques

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# 'Color' is nominal, so it is one-hot encoded;
# 'Size' was already mapped to ordered numbers above, so it passes through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('color', OneHotEncoder(), ['Color']),
        ('size', 'passthrough', ['Size'])
    ])

X_transformed = preprocessor.fit_transform(df)
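
If you want to see the whole flow end to end, here is a self-contained sketch (with a hypothetical toy DataFrame and target) that wraps the same kind of preprocessor in a scikit-learn Pipeline, so the encoding is applied consistently at both fit and predict time:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 'Color' is nominal, 'Size' is already mapped to 1/2/3
df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Red', 'Green', 'Blue'],
    'Size': [1, 2, 3, 2, 1, 3]
})
y = [0, 1, 0, 1, 0, 1]  # hypothetical binary target

preprocessor = ColumnTransformer(
    transformers=[
        ('color', OneHotEncoder(), ['Color']),
        ('size', 'passthrough', ['Size'])
    ])

# The Pipeline keeps encoding and modelling together in one object
model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('classifier', LogisticRegression())
])

model.fit(df, y)
print(model.predict(df))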

✅ Summary

Always analyze your data: Is it nominal or ordinal?


Choose encoding based on your model type and data size


Be careful with high-cardinality features and data leakage
