Building a Scalable Recommendation Engine Using Dataproc
Dataproc is a fully managed Hadoop/Spark service on Google Cloud for running large-scale data processing and machine learning workloads with minimal operational overhead. It is a strong fit for recommendation systems because it supports distributed computation, scalable storage, and easy integration with downstream services.
1. High-Level Architecture
A scalable recommendation engine running on Dataproc typically follows this design:
User/Event Data (GCS / BigQuery / Pub/Sub)
↓
Data Preprocessing (Dataproc / Spark)
↓
Feature Engineering (Spark ML / Custom code)
↓
Model Training (Spark MLlib / TensorFlow on Dataproc)
↓
Model Storage (GCS / Vertex AI Model Registry)
↓
Batch or Near-Real-Time Predictions
↓
Recommendation Delivery (API / BigQuery / Redis / Firestore)
2. Data Sources and Storage
Common sources:
User interaction logs (views, clicks, likes)
Catalog metadata (products, videos, items)
User profile data (demographics, preferences)
Contextual signals (time, location, device)
Best storage solutions:
Google Cloud Storage (GCS) → large-scale historical data
BigQuery → analytical queries and aggregations
Pub/Sub → real-time events
Dataproc integrates directly with all of these.
3. Preprocessing and Feature Engineering on Dataproc
Use Spark jobs to perform:
Data cleaning and deduplication
Sessionization of logs
Building user-item interaction matrices
Generating features such as:
User embeddings
Item embeddings
Content features (TF-IDF, word2vec, BERT embeddings)
Temporal features (recency, frequency)
In practice, feature quality is frequently the largest driver of recommendation quality, often mattering more than the choice of model.
4. Recommendation Algorithms on Dataproc
Dataproc supports a wide range of scalable algorithms.
A. Collaborative Filtering (CF) with Spark MLlib
The most widely used approach for large interaction datasets. Spark MLlib implements collaborative filtering via ALS (Alternating Least Squares).
Example Spark code (simplified):
from pyspark.ml.recommendation import ALS
als = ALS(
    maxIter=15,
    regParam=0.1,
    rank=100,
    userCol="user_id",
    itemCol="item_id",
    ratingCol="rating",
    coldStartStrategy="drop",  # drop NaN predictions for users/items unseen in training
)
model = als.fit(training_data)
Pros:
Scales to billions of interactions
Distributed training
Native to Dataproc
B. Content-Based Recommendations
Use item metadata:
Sparse (TF-IDF)
Embeddings (word2vec, Doc2Vec, BERT)
Image embeddings (CNN models)
Compute features on Dataproc using Spark MLlib or custom libraries.
C. Hybrid Recommendation Models
Combine CF + content features.
Workflow:
Train ALS for user/item latent factors.
Train content embedding models.
Combine embeddings (concat, weighted sum, etc.).
Build a ranking model (e.g., XGBoost, LightGBM).
Dataproc is ideal for computing embeddings in parallel.
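One simple way to combine CF and content signals (step 3 above) is weighted concatenation. A toy sketch with made-up vectors; in practice the CF vectors come from ALS (`model.userFactors` / `model.itemFactors`) and the content vectors from an embedding model:

```python
import math

# Hypothetical toy embeddings; dimensions are illustrative only.
cf_item = {"i1": [0.9, 0.1], "i2": [0.8, 0.2]}
content_item = {"i1": [0.0, 1.0, 0.0], "i2": [0.1, 0.9, 0.0]}

def combine(item_id, w_cf=0.7, w_content=0.3):
    # Weighted concatenation: scale each source, then join into one vector.
    return ([w_cf * x for x in cf_item[item_id]]
            + [w_content * x for x in content_item[item_id]])

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

sim = cosine(combine("i1"), combine("i2"))
```

The weights `w_cf` / `w_content` become tuning knobs; many teams instead let the ranking model (step 4) learn the combination from raw concatenated features.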
D. Deep Learning Models
If you need neural recommenders:
Two-tower (dual-encoder) models
Sequential recommenders (GRU4Rec, SASRec)
Contextual bandit models
Train using:
Spark + TensorFlow on Dataproc
Vertex AI Training for heavier DL workloads
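To make the two-tower idea concrete, here is a toy scoring sketch: each tower maps its side into a shared embedding space, and a dot product scores every user-item pair. Real towers are trained neural networks; plain random linear projections stand in here purely to show the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only.
n_users, n_items, feat_dim, emb_dim = 5, 10, 16, 8
user_feats = rng.normal(size=(n_users, feat_dim))
item_feats = rng.normal(size=(n_items, feat_dim))
W_user = rng.normal(size=(feat_dim, emb_dim))   # stand-in for the user tower
W_item = rng.normal(size=(feat_dim, emb_dim))   # stand-in for the item tower

user_emb = user_feats @ W_user                  # (n_users, emb_dim)
item_emb = item_feats @ W_item                  # (n_items, emb_dim)

# Score every (user, item) pair, then take top-3 items per user.
scores = user_emb @ item_emb.T                  # (n_users, n_items)
top3 = np.argsort(-scores, axis=1)[:, :3]
```

Because serving reduces to a dot product, item embeddings can be precomputed on Dataproc and indexed for approximate nearest-neighbor retrieval.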
5. Batch vs. Real-Time Recommendations
Batch Recommendations (common)
Run daily/hourly on Dataproc
Compute top-N recommendations for each user
Store in BigQuery, Redis, or Firestore
Served via API or directly in UI
Real-Time Recommendations
Use Pub/Sub for live stream ingestion
Process with Spark Structured Streaming on Dataproc, or with Dataflow
Serving layer: Vertex AI Matching Engine / Redis / custom API
Most production systems combine both:
Batch for heavy lifting
Real-time for freshness
6. Production Scaling with Dataproc
A. Use Ephemeral Clusters
A cluster is created per workflow and deleted automatically when the workflow completes
Reduces cost
Ensures environment consistency
B. Autoscaling
Use autoscaling policies:
Scale workers up/down based on YARN metrics
Support Spot VMs for non-critical jobs
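An autoscaling policy is defined as YAML and attached to a cluster. A minimal sketch; the instance counts and scaling factors below are assumptions to tune per workload:

```yaml
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:        # Spot/preemptible workers for non-critical jobs
  minInstances: 0
  maxInstances: 50
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.5        # fraction of pending YARN memory to add
    scaleDownFactor: 1.0      # fraction of available YARN memory to remove
    gracefulDecommissionTimeout: 1h
```

Import it with `gcloud dataproc autoscaling-policies import` and attach it to a cluster via the `--autoscaling-policy` flag at creation time.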
C. Workflow Templates
Use Dataproc Workflow Templates for:
Training pipelines
Feature generation pipelines
Batch scoring pipelines
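A hedged sketch of defining and running a template from the CLI; the template name, cluster size, region, and GCS path are placeholders:

```shell
# Create a template with a managed (ephemeral) cluster.
gcloud dataproc workflow-templates create daily-recs --region=us-central1
gcloud dataproc workflow-templates set-managed-cluster daily-recs \
    --region=us-central1 --cluster-name=recs-ephemeral --num-workers=4

# Register the training job as a step, then run the whole workflow.
gcloud dataproc workflow-templates add-job pyspark gs://bucket/jobs/train_als.py \
    --step-id=train-als --workflow-template=daily-recs --region=us-central1
gcloud dataproc workflow-templates instantiate daily-recs --region=us-central1
```

Instantiation creates the cluster, runs the steps in dependency order, and tears the cluster down, which is exactly the ephemeral pattern described above.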
D. CI/CD
Store job scripts in Git.
Deploy via:
Cloud Build
GitHub Actions
Terraform for infrastructure
7. Model Storage and Serving
Storage options:
GCS → best for Spark and training outputs
Vertex AI Model Registry → versioned model management
Serving options:
Vertex AI Endpoint (real-time predictions)
Redis/Memcached for quick retrieval
BigQuery for batch scoring
Custom microservice for top-N retrieval
Example top-N computation with the trained ALS model:
userRecs = model.recommendForAllUsers(20)  # top 20 items per user
userRecs.write.mode("overwrite").parquet("gs://bucket/recommendations/")
8. Monitoring and Logging
Use:
Cloud Logging for job-level logs
Cloud Monitoring for cluster metrics
Dataproc job history server for Spark details
Alerting policies:
Job failure rate
Cluster creation errors
Autoscaling anomalies
9. Security & Governance
Best practices:
Use private clusters (private IPs)
Use service accounts with least privilege
Restrict GCS buckets with IAM + VPC-SC
Enable Cloud Audit Logs
10. Example Production Pipeline Architecture
Daily Batch Recommendation Pipeline (Dataproc-based)
Extract interaction data from BigQuery to GCS (Parquet)
Launch ephemeral Dataproc cluster via Workflow Template
Run Spark preprocessing job
Train ALS or hybrid model
Generate top-N recommendations
Store results in BigQuery and Redis
Notify downstream systems (Pub/Sub)
Shut down cluster automatically
This pattern is widely used in e-commerce, media platforms, and finance.
Summary
Using Dataproc for recommendation engines provides:
Key Benefits
Scalability for massive datasets
Low operational overhead
Full Spark ecosystem support
Easy orchestration & integration
Cost efficiency with ephemeral clusters
Core Components
Distributed preprocessing
Feature engineering via Spark
Large-scale ML model training
Batch/real-time serving strategies
Production monitoring and CI/CD