🌳 Introduction to Decision Trees and Random Forests
🔹 What is a Decision Tree?
A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks. It models decisions and their possible consequences as a tree-like structure, where:
Each internal node represents a decision based on a feature
Each branch represents the outcome of the decision
Each leaf node represents a predicted class (for classification) or value (for regression)
🧠 How It Works:
The algorithm repeatedly splits the dataset into subsets, choosing at each node the feature (and threshold) that best separates the data according to a metric such as:
Gini Impurity
Entropy (Information Gain)
Mean Squared Error (for regression)
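To make the classification metrics concrete, here is a minimal sketch that computes Gini impurity and entropy by hand for a toy set of class labels (the helper functions and the toy labels are illustrative, not part of scikit-learn):

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_k^2) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p_k * log2(p_k)); 0 for a pure node, 1 for a 50/50 binary split
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # toy node: 3 "No", 5 "Yes"
print(gini(labels))     # ~0.469
print(entropy(labels))  # ~0.954

A split is chosen to reduce these values as much as possible: the lower the impurity of the resulting subsets, the better the split.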
✅ Example:
If you want to classify whether someone will buy a product, a decision tree might split by:
Age < 30 → Yes
Age ≥ 30 and Income > $50K → Yes
Else → No
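Written as code, that toy tree is nothing more than nested if/else checks. The predict_buy function below is a hypothetical illustration of the same rules (the feature names and thresholds come from the example above, not from any trained model):

def predict_buy(age, income):
    # Hand-written toy tree mirroring the rules above
    if age < 30:
        return "Yes"
    elif income > 50_000:
        return "Yes"
    else:
        return "No"

print(predict_buy(25, 20_000))  # Yes (first rule)
print(predict_buy(40, 60_000))  # Yes (second rule)
print(predict_buy(40, 30_000))  # No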
⚖️ Advantages of Decision Trees:
Easy to understand and interpret
Handles both numerical and categorical data
Requires little data preprocessing
Non-linear relationships can be captured
❌ Disadvantages:
Prone to overfitting, especially with deep trees
Unstable: a small change in the training data can produce a very different tree
🌲 What is a Random Forest?
A Random Forest is an ensemble of many decision trees. It builds multiple trees and combines their outputs to improve accuracy and control overfitting.
🧠 How It Works:
Trains multiple decision trees on different random subsets of the data (bagging)
At each split in a tree, it considers a random subset of features
Prediction:
Classification: Uses majority vote
Regression: Takes the average of outputs
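The sketch below imitates this recipe by hand on the Iris dataset: each tree is trained on a bootstrap sample, max_features="sqrt" gives the per-split feature subsampling, and the final prediction is a majority vote. The 25-tree ensemble size and the seed are arbitrary choices for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bagging: draw a bootstrap sample (n rows, with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features
    t = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    t.fit(X[idx], y[idx])
    trees.append(t)

# Classification: majority vote across all trees
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("Toy ensemble training accuracy:", (majority == y).mean())

In practice you would simply use RandomForestClassifier, which does exactly this (plus more careful bookkeeping) internally.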
⚖️ Advantages of Random Forest:
Typically more accurate than a single tree and far less prone to overfitting
Handles large datasets and high-dimensional spaces well
Works for both classification and regression
Less sensitive to outliers and noise
❌ Disadvantages:
Slower than individual decision trees
Less interpretable (black-box model)
Can be memory-intensive with many trees
📊 Decision Tree vs Random Forest
Feature             Decision Tree               Random Forest
Simplicity          Simple and interpretable    More complex
Overfitting Risk    High                        Low
Accuracy            Moderate                    High
Speed               Fast                        Slower (more computations)
Interpretability    Easy to visualize           Harder to interpret as a whole
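The interpretability gap is easy to see in practice: a single tree's learned rules can be dumped as readable text with scikit-learn's export_text, while a forest of 100 such trees has no comparably compact summary. A small sketch on the Iris dataset (max_depth=2 is an arbitrary choice to keep the printout short):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Print the learned splits as nested if/else rules
print(export_text(tree, feature_names=list(data.feature_names)))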
🧠 Basic Code Example (Python using scikit-learn)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
X, y = load_iris(return_X_y=True)
# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # fixed seed for reproducible results
# Decision Tree
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)
tree_preds = tree_model.predict(X_test)
# Random Forest
forest_model = RandomForestClassifier(n_estimators=100, random_state=42)
forest_model.fit(X_train, y_train)
forest_preds = forest_model.predict(X_test)
# Accuracy
print("Decision Tree Accuracy:", accuracy_score(y_test, tree_preds))
print("Random Forest Accuracy:", accuracy_score(y_test, forest_preds))
📌 When to Use:
Use Decision Trees when:
You need a simple, interpretable model
Fast training/prediction is needed
Use Random Forests when:
You want higher accuracy
You can afford more computation
You want to reduce overfitting