How to Build a Machine Learning Pipeline with Apache Spark
✅ What is a Machine Learning Pipeline?
An ML pipeline is a sequence of stages, where each stage is either a Transformer (which maps one DataFrame to another) or an Estimator (which is fit on a DataFrame to produce a Transformer). In Apache Spark, these stages are implemented with the DataFrame-based spark.ml API.
Steps to Build an ML Pipeline with Apache Spark
Here’s a typical step-by-step process:
1. Set Up Your Spark Environment
Start by importing SparkSession and initializing a session:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("ML Pipeline Example") \
    .getOrCreate()
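If you're experimenting on a single machine, you can also set the master URL explicitly (a minimal sketch; on a real cluster the master is usually supplied by spark-submit):

# Local-mode sketch: run Spark on all available cores of this machine
spark = (
    SparkSession.builder
    .appName("ML Pipeline Example")
    .master("local[*]")  # omit this when submitting to a cluster
    .getOrCreate()
)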
2. Load and Inspect the Dataset
You can load data into a Spark DataFrame.
data = spark.read.csv("data.csv", header=True, inferSchema=True)
data.printSchema()
data.show(5)
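Note that inferSchema makes an extra pass over the file. For large datasets you may prefer to declare the schema up front; a sketch, assuming this article's example columns feature1, feature2, category, and label:

from pyspark.sql.types import StructType, StructField, DoubleType, StringType

schema = StructType([
    StructField("feature1", DoubleType(), True),
    StructField("feature2", DoubleType(), True),
    StructField("category", StringType(), True),
    StructField("label", DoubleType(), True),
])
data = spark.read.csv("data.csv", header=True, schema=schema)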
3. Data Preprocessing
Use Spark MLlib transformers for:
Handling missing values
Encoding categorical variables
Assembling features
Example: Handling Missing Values
data = data.na.drop()
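Dropping rows throws data away. As an alternative for numeric columns, you can fill missing values with Spark's Imputer; a sketch (the *_imp output names are illustrative):

from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=["feature1", "feature2"],
    outputCols=["feature1_imp", "feature2_imp"],
    strategy="median"  # replace nulls with each column's median
)
data = imputer.fit(data).transform(data)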
Example: String Indexing (Categorical to Numeric)
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
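The resulting index is an arbitrary ordinal. If your model shouldn't treat categories as ordered, you can one-hot encode the index afterwards (Spark 3.x API; categoryVec is an illustrative name you would then pass to the assembler instead of categoryIndex):

from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCols=["categoryIndex"], outputCols=["categoryVec"])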
Example: Feature Vector Assembler
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "categoryIndex"],
    outputCol="features"
)
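If the input features sit on very different scales, you can standardize the assembled vector before modeling; a sketch with StandardScaler (scaledFeatures is an illustrative name, and the model's featuresCol would need to point at it):

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(
    inputCol="features",
    outputCol="scaledFeatures",
    withStd=True,   # scale each feature to unit standard deviation
    withMean=False  # keep False so sparse vectors stay sparse
)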
4. Split the Data
(trainingData, testData) = data.randomSplit([0.8, 0.2])
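Pass a seed if you need the same split across runs:

(trainingData, testData) = data.randomSplit([0.8, 0.2], seed=42)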
5. Choose and Configure a Model
Example: Logistic Regression
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol="features", labelCol="label")
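LogisticRegression exposes the usual hyperparameters as constructor arguments. The values below are illustrative, not tuned:

lr = LogisticRegression(
    featuresCol="features",
    labelCol="label",
    maxIter=100,          # maximum optimization iterations
    regParam=0.01,        # regularization strength
    elasticNetParam=0.0   # 0.0 = pure L2 penalty, 1.0 = pure L1
)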
6. Build the Pipeline
You can combine all steps into a single pipeline.
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[indexer, assembler, lr])
7. Train the Model
model = pipeline.fit(trainingData)
8. Make Predictions
predictions = model.transform(testData)
predictions.select("features", "label", "prediction").show(5)
9. Evaluate the Model
Use built-in evaluators.
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Note: BinaryClassificationEvaluator reports areaUnderROC by default, not accuracy
evaluator = BinaryClassificationEvaluator(labelCol="label")
auc = evaluator.evaluate(predictions)
print("Test AUC:", auc)
Example: Full Pipeline Code Snippet
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
spark = SparkSession.builder.appName("Spark ML Pipeline").getOrCreate()
# Load Data
data = spark.read.csv("data.csv", header=True, inferSchema=True)
data = data.na.drop()
# Indexing and Assembling
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "categoryIndex"],
    outputCol="features"
)
# Model
lr = LogisticRegression(featuresCol="features", labelCol="label")
# Pipeline
pipeline = Pipeline(stages=[indexer, assembler, lr])
# Train/Test Split
(trainingData, testData) = data.randomSplit([0.8, 0.2], seed=42)  # fixed seed for a reproducible split
# Fit Model
model = pipeline.fit(trainingData)
# Predict and Evaluate
predictions = model.transform(testData)
evaluator = BinaryClassificationEvaluator(labelCol="label")
auc = evaluator.evaluate(predictions)
print("Test AUC:", auc)
Why Use Spark ML Pipelines?
Scalability: Works efficiently with large datasets using distributed computing.
Modularity: Easy to chain preprocessing and modeling steps.
Reproducibility: Pipelines can be saved, loaded, and reused (see the sketch below).
Integration: Works well with Spark SQL and DataFrames.
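As a sketch of the reproducibility point above, a fitted pipeline can be saved and reloaded later, preprocessing and trained model together (the path lr_pipeline_model is illustrative):

from pyspark.ml import PipelineModel

# Persist the fitted pipeline: all stages, including the trained model
model.write().overwrite().save("lr_pipeline_model")

# Reload it later and score new data with the exact same preprocessing
loaded_model = PipelineModel.load("lr_pipeline_model")
new_predictions = loaded_model.transform(testData)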