☁️ 1. What Is Google Cloud Dataproc?
Google Cloud Dataproc is a fully managed Apache Spark and Hadoop service that lets you:
Run batch and streaming data processing jobs,
Use open-source frameworks like Spark, Hive, Flink, and Presto,
Scale compute clusters dynamically,
Integrate natively with BigQuery, Cloud Storage, Pub/Sub, and Vertex AI.
It’s often used for:
ETL pipelines
Data lake processing
Machine learning model training
Ad-hoc analytics
Streaming (real-time) event processing
⚙️ 2. Batch vs. Real-Time Processing on Dataproc
| Type | Description | Example Frameworks | Example Use Case |
| --- | --- | --- | --- |
| Batch Processing | Periodic or on-demand processing of large data volumes. | Spark, Hive, MapReduce | Nightly ETL to BigQuery |
| Real-Time Processing | Continuous ingestion and transformation of streaming data. | Spark Streaming, Flink, Beam | Streaming analytics from IoT or logs |
Dataproc supports both — you can run Spark batch jobs and streaming pipelines on the same cluster, or split them for cost and performance.
🧮 3. Architecture Patterns
A. Batch Processing Pattern
Example: Data Lake ETL → BigQuery
Cloud Storage (Raw Data)
↓
Dataproc Cluster
(Spark/Hive job)
↓
Cloud Storage (Processed)
↓
BigQuery (Analytics)
Trigger: Scheduled (e.g., via Cloud Composer/Airflow or Dataproc Workflow Templates)
Common workloads: ETL, machine learning feature prep, data aggregation
Integration points:
Read/write data in Cloud Storage
Output tables to BigQuery
🔹 Example Batch Job Command:
gcloud dataproc jobs submit spark \
--cluster=my-batch-cluster \
--region=us-central1 \
--class=org.example.BatchETLJob \
--jars=gs://my-bucket/jars/batch-etl.jar \
-- gs://data/raw/ gs://data/processed/
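The command above submits a compiled Scala/Java job. As a rough PySpark sketch of the same read-transform-write pattern (the paths match the command above, but the column names and BigQuery table are hypothetical, and the BigQuery write assumes the spark-bigquery-connector is available on the cluster):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BatchETL").getOrCreate()

# Read raw data from the data lake
raw_df = spark.read.json("gs://data/raw/")

# Example transform: drop incomplete records and stamp the load date
processed_df = raw_df.filter(F.col("event_type").isNotNull()) \
    .withColumn("load_date", F.current_date())

# Write processed data back to Cloud Storage ...
processed_df.write.mode("overwrite").parquet("gs://data/processed/")

# ... and load it into BigQuery for analytics
processed_df.write.format("bigquery") \
    .option("table", "myproject.mydataset.events") \
    .option("temporaryGcsBucket", "my-temp-bucket") \
    .mode("append") \
    .save()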
B. Real-Time (Streaming) Processing Pattern
Example: Stream Analytics Using Spark Streaming or Flink
Pub/Sub (Event Stream)
↓
Dataproc Cluster (Streaming Job)
(Spark Structured Streaming or Flink)
↓
BigQuery / Cloud Storage / Pub/Sub
Trigger: Always-on or auto-scaling streaming cluster
Common workloads: Real-time ETL, log analysis, fraud detection, IoT analytics
Integration points:
Input from Pub/Sub or Kafka
Output to BigQuery, GCS, or Pub/Sub
🔹 Example Spark Structured Streaming (Python):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamProcessing").getOrCreate()

# Read from Pub/Sub Lite (the "pubsublite" source comes from the
# Pub/Sub Lite Spark connector; the subscription path includes a location)
stream_df = spark.readStream \
    .format("pubsublite") \
    .option("pubsublite.subscription",
            "projects/myproj/locations/us-central1-a/subscriptions/mysub") \
    .load()

# Simple transform: decode the message payload
processed_df = stream_df.selectExpr("CAST(data AS STRING) AS message")

# Write to BigQuery (requires the spark-bigquery-connector; streaming
# writes need a checkpoint location and a temporary GCS bucket)
query = processed_df.writeStream \
    .format("bigquery") \
    .option("table", "myproject.mydataset.streaming_output") \
    .option("checkpointLocation", "gs://my-bucket/checkpoints/") \
    .option("temporaryGcsBucket", "my-bucket") \
    .outputMode("append") \
    .start()

# Keep the streaming query running
query.awaitTermination()
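To run this on a Dataproc cluster, submit it like any other PySpark job; depending on your Dataproc image version, you may need to supply the Pub/Sub Lite Spark connector and spark-bigquery-connector jars yourself, for example via the --jars flag of gcloud dataproc jobs submit pyspark.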
🚀 4. Deployment Options
A. Ephemeral (Job) Clusters
Spin up a Dataproc cluster per job.
Automatically delete it after completion.
Cost-effective for batch jobs.
gcloud dataproc workflow-templates instantiate-from-file \
--file=workflow.yaml \
--region=us-central1
B. Long-Running Clusters
Used for streaming or interactive workloads.
Can use auto-scaling and auto-repair.
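As a sketch of creating such a cluster programmatically, the Dataproc Python client can attach an autoscaling policy at creation time (the project, region, machine types, and policy name are hypothetical, and the policy is assumed to already exist):
from google.cloud import dataproc_v1

# The cluster controller uses a regional endpoint
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "myproject",
    "cluster_name": "streaming-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Attach a pre-created autoscaling policy
        "autoscaling_config": {
            "policy_uri": "projects/myproject/regions/us-central1/autoscalingPolicies/my-policy"
        },
    },
}

# create_cluster returns a long-running operation; result() blocks until ready
operation = client.create_cluster(
    project_id="myproject", region="us-central1", cluster=cluster
)
operation.result()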
C. Dataproc Serverless
Run Spark or PySpark jobs without managing clusters.
Ideal for cost efficiency and simplicity.
gcloud dataproc batches submit pyspark my_job.py --region=us-central1
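Here my_job.py is an ordinary PySpark script; a trivial sketch might look like this (the input path is hypothetical):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ServerlessJob").getOrCreate()

# Read processed data from the data lake and print a simple aggregate
df = spark.read.parquet("gs://data/processed/")
df.groupBy("event_type").count().show()

spark.stop()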
🔗 5. Integrating Dataproc with Other GCP Services
| Service | Purpose |
| --- | --- |
| Cloud Storage (GCS) | Data lake for raw and processed data |
| BigQuery | Analytics and downstream consumption |
| Pub/Sub | Real-time streaming input/output |
| Composer (Airflow) | Orchestration and workflow scheduling (example DAG below) |
| Vertex AI | ML model training and batch inference |
| Cloud Logging & Monitoring | Observability for jobs and clusters |
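For example, the batch job from Section 3A could be scheduled from Cloud Composer with the Google provider's DataprocSubmitJobOperator. A minimal sketch (the DAG name, project, cluster, and schedule are hypothetical):
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Dataproc job definition, mirroring the gcloud command in Section 3A
SPARK_JOB = {
    "reference": {"project_id": "myproject"},
    "placement": {"cluster_name": "my-batch-cluster"},
    "spark_job": {
        "main_class": "org.example.BatchETLJob",
        "jar_file_uris": ["gs://my-bucket/jars/batch-etl.jar"],
    },
}

with DAG(
    "dataproc_batch_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_job = DataprocSubmitJobOperator(
        task_id="run_batch_etl",
        job=SPARK_JOB,
        region="us-central1",
        project_id="myproject",
    )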
🧰 6. Cost & Optimization Tips
✅ Use preemptible VMs for batch jobs — 60–80% cheaper (config sketch after this list).
✅ Use auto-scaling clusters for variable workloads.
✅ Turn on auto-deletion for ephemeral clusters.
✅ Store large intermediate data in Cloud Storage, not HDFS.
✅ Use Dataproc Serverless for short-lived jobs.
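As a sketch of the first tip, preemptible capacity is configured through a cluster's secondary workers; for example, extending the cluster dict from Section 4B (field names follow the Dataproc v1 API, and the instance counts are illustrative):
# Fragment of a Dataproc cluster config: 8 preemptible secondary workers
# working alongside 2 standard workers
config = {
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    "secondary_worker_config": {
        "num_instances": 8,
        "preemptibility": "PREEMPTIBLE",  # cheaper, but instances may be reclaimed
    },
}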
📊 7. Example End-to-End Architecture
Scenario: Real-Time + Batch Unified Pipeline
[Pub/Sub] <---- Event producers (IoT, logs)
│
┌──────────────────────┐
│ Dataproc Streaming   │
│ (Spark/Flink)        │
└──────────────────────┘
│
[BigQuery - Realtime Table]
│
[Dataproc Batch Job]
(Aggregates daily data)
│
[BigQuery - Curated Dataset]
│
[Looker / Data Studio Dashboards]
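A rough sketch of the "Dataproc Batch Job" step above, which aggregates the day's streaming output into a curated table (the table and column names are hypothetical, and the reads and writes assume the spark-bigquery-connector):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DailyAggregation").getOrCreate()

# Read today's events from the real-time table
events = spark.read.format("bigquery") \
    .option("table", "myproject.mydataset.streaming_output") \
    .load() \
    .filter(F.col("event_date") == F.current_date())

# Aggregate and append to the curated dataset
daily = events.groupBy("event_type").agg(F.count("*").alias("event_count"))

daily.write.format("bigquery") \
    .option("table", "myproject.mydataset.daily_summary") \
    .option("temporaryGcsBucket", "my-temp-bucket") \
    .mode("append") \
    .save()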
Benefits:
Real-time and batch coexist seamlessly.
Unified data lake + warehouse architecture.
Reusable infrastructure and cost-efficient processing.
🧭 8. When to Use Dataproc vs. Alternatives
| Use Case | Recommended Tool |
| --- | --- |
| Large-scale batch ETL (Spark, Hive) | Dataproc Batch / Serverless |
| Real-time stream processing | Dataproc (Spark Streaming / Flink) |
| Event-driven processing | Dataflow (Apache Beam) |
| SQL-only analytics | BigQuery |
| ML feature engineering | Dataproc + Vertex AI |