Friday, November 14, 2025

Building a Scalable Recommendation Engine Using Dataproc

Dataproc is a fully managed Hadoop and Spark service on Google Cloud that lets you run large-scale data processing and machine learning workloads with minimal operational overhead. It is a strong fit for recommendation systems because it supports distributed computation, scalable storage, and easy integration with downstream services.


1. High-Level Architecture


A scalable recommendation engine running on Dataproc typically follows this design:


User/Event Data (GCS / BigQuery / Pub/Sub)
        ↓
Data Preprocessing (Dataproc / Spark)
        ↓
Feature Engineering (Spark ML / custom code)
        ↓
Model Training (Spark MLlib / TensorFlow on Dataproc)
        ↓
Model Storage (GCS / Vertex AI Model Registry)
        ↓
Batch or Near-Real-Time Predictions
        ↓
Recommendation Delivery (API / BigQuery / Redis / Firestore)


2. Data Sources and Storage

Common sources:

- User interaction logs (views, clicks, likes)
- Catalog metadata (products, videos, items)
- User profile data (demographics, preferences)
- Contextual signals (time, location, device)

Best storage options:

- Google Cloud Storage (GCS) → large-scale historical data
- BigQuery → analytical queries and aggregations
- Pub/Sub → real-time events

Dataproc integrates directly with all of these.


3. Preprocessing and Feature Engineering on Dataproc


Use Spark jobs to perform:

- Data cleaning and deduplication
- Sessionization of logs
- Building user-item interaction matrices
- Generating features such as:
  - User embeddings
  - Item embeddings
  - Content features (TF-IDF, word2vec, BERT embeddings)
  - Temporal features (recency, frequency)

Feature engineering often determines most of a recommendation engine's quality, so it deserves the bulk of the development effort.


4. Recommendation Algorithms on Dataproc


Dataproc supports a wide range of scalable algorithms.


A. Collaborative Filtering (CF) with Spark MLlib

The most popular approach for large datasets. The workhorse algorithm is ALS (Alternating Least Squares).
Example Spark code (simplified):

from pyspark.ml.recommendation import ALS

# training_data: DataFrame with integer user/item id columns and a numeric
# rating (explicit feedback) or interaction count (implicit feedback).
als = ALS(
    maxIter=15,
    regParam=0.1,
    rank=100,
    userCol="user_id",
    itemCol="item_id",
    ratingCol="rating",
    coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
)

model = als.fit(training_data)


Pros:

- Scales to billions of interactions
- Distributed training
- Runs natively on Dataproc


B. Content-Based Recommendations


Use item metadata:

- Sparse text features (TF-IDF)
- Text embeddings (word2vec, Doc2Vec, BERT)
- Image embeddings (CNN models)

Compute these features on Dataproc using Spark MLlib or custom libraries.


C. Hybrid Recommendation Models


Combine CF + content features.


Workflow:

1. Train ALS for user/item latent factors.
2. Train content embedding models.
3. Combine the embeddings (concatenation, weighted sum, etc.).
4. Train a ranking model (e.g., XGBoost, LightGBM) on top.

Dataproc is ideal for computing embeddings in parallel.


D. Deep Learning Models


If you need neural recommenders:

- Two-tower (dual-encoder) models
- Sequential recommenders (GRU4Rec, SASRec)
- Contextual bandit models

Train using:

- Spark + TensorFlow on Dataproc
- Vertex AI Training for heavier deep learning workloads


5. Batch vs. Real-Time Recommendations

Batch Recommendations (most common)

- Run daily/hourly on Dataproc
- Compute top-N recommendations for each user
- Store results in BigQuery, Redis, or Firestore
- Serve via an API or directly in the UI

Real-Time Recommendations

- Use Pub/Sub for live event ingestion
- Use Dataproc with Spark Structured Streaming, or Dataflow
- Serving layer: Vertex AI Matching Engine / Redis / a custom API

Most production systems combine both:

- Batch for the heavy lifting
- Real-time for freshness


6. Production Scaling with Dataproc

A. Use Ephemeral Clusters


- Dataproc can create a cluster per workflow and delete it when the workflow finishes
- Reduces cost
- Ensures environment consistency

B. Autoscaling

Use autoscaling policies to:

- Scale workers up/down based on YARN pending-memory metrics
- Use Spot VM secondary workers for non-critical jobs


C. Workflow Templates


Use Dataproc Workflow Templates for:

- Training pipelines
- Feature generation pipelines
- Batch scoring pipelines
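For illustration, a minimal Workflow Template for a training pipeline might look like the sketch below. The bucket paths, cluster name, and machine types are placeholders; check the Dataproc WorkflowTemplate reference for the full schema:

```yaml
# template.yaml — instantiate with:
#   gcloud dataproc workflow-templates instantiate-from-file \
#       --file=template.yaml --region=us-central1
jobs:
  - stepId: preprocess
    pysparkJob:
      mainPythonFileUri: gs://my-bucket/jobs/preprocess.py
  - stepId: train-als
    prerequisiteStepIds: [preprocess]
    pysparkJob:
      mainPythonFileUri: gs://my-bucket/jobs/train_als.py
placement:
  managedCluster:
    clusterName: rec-ephemeral
    config:
      masterConfig:
        machineTypeUri: n1-standard-4
      workerConfig:
        numInstances: 4
        machineTypeUri: n1-standard-4
```

With a managed cluster in the template, the cluster is created when the workflow starts and deleted when the last step finishes, giving you the ephemeral pattern from section A for free.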


D. CI/CD


Store job scripts in Git and deploy via:

- Cloud Build
- GitHub Actions
- Terraform for infrastructure


7. Model Storage and Serving

Storage options:

- GCS → best for Spark jobs and training outputs
- Vertex AI Model Registry → versioned model management

Serving options:

- Vertex AI endpoint (real-time predictions)
- Redis/Memcached for low-latency retrieval
- BigQuery for batch scoring
- A custom microservice for top-N retrieval


Example top-N computation:

# Generate the top 20 items for every user and persist to GCS.
userRecs = model.recommendForAllUsers(20)
userRecs.write.mode("overwrite").parquet("gs://bucket/recommendations/")


8. Monitoring and Logging


Use:

- Cloud Logging for job-level logs
- Cloud Monitoring for cluster metrics
- The Spark history server (Dataproc Persistent History Server) for job details

Alerting policies to consider:

- Job failure rate
- Cluster creation errors
- Autoscaling anomalies


9. Security & Governance

Best practices:

- Use private clusters (internal IPs only)
- Use service accounts with least privilege
- Restrict GCS buckets with IAM and VPC Service Controls (VPC-SC)
- Enable Cloud Audit Logs


10. Example Production Pipeline Architecture

Daily Batch Recommendation Pipeline (Dataproc-based)


1. Extract interaction data from BigQuery to GCS (Parquet)
2. Launch an ephemeral Dataproc cluster via a Workflow Template
3. Run the Spark preprocessing job
4. Train the ALS or hybrid model
5. Generate top-N recommendations
6. Store results in BigQuery and Redis
7. Notify downstream systems (Pub/Sub)
8. Shut down the cluster automatically


This pattern is widely used in e-commerce, media platforms, and finance.


Summary


Using Dataproc for recommendation engines provides:


Key Benefits

- Scalability for massive datasets
- Low operational overhead
- Full Spark ecosystem support
- Easy orchestration & integration
- Cost efficiency with ephemeral clusters

Core Components

- Distributed preprocessing
- Feature engineering via Spark
- Large-scale ML model training
- Batch/real-time serving strategies
- Production monitoring and CI/CD
