Tuesday, November 11, 2025


Cloud Dataproc - Real-Time & Batch Processing Solutions

 ☁️ 1. What Is Google Cloud Dataproc?


Google Cloud Dataproc is a fully managed Apache Spark and Hadoop service that lets you:


Run batch and streaming data processing jobs,


Use open-source frameworks like Spark, Hive, Flink, and Presto,


Scale compute clusters dynamically,


Integrate natively with BigQuery, Cloud Storage, Pub/Sub, and Vertex AI.


It’s often used for:


ETL pipelines


Data lake processing


Machine learning model training


Ad-hoc analytics


Streaming (real-time) event processing


⚙️ 2. Batch vs. Real-Time Processing on Dataproc

Type | Description | Example Frameworks | Example Use Case
Batch Processing | Periodic or on-demand processing of large data volumes. | Spark, Hive, MapReduce | Nightly ETL to BigQuery
Real-Time Processing | Continuous ingestion and transformation of streaming data. | Spark Streaming, Flink, Beam | Streaming analytics from IoT or logs


Dataproc supports both — you can run Spark batch jobs and streaming pipelines on the same cluster, or split them for cost and performance.


🧮 3. Architecture Patterns

A. Batch Processing Pattern

Example: Data Lake ETL → BigQuery

Cloud Storage (Raw Data)
        ↓
Dataproc Cluster
(Spark/Hive job)
        ↓
Cloud Storage (Processed)
        ↓
BigQuery (Analytics)



Trigger: Scheduled (e.g., via Cloud Composer/Airflow or Dataproc Workflow Templates)


Common workloads: ETL, machine learning feature prep, data aggregation


Integration points:


Read/write data in Cloud Storage


Output tables to BigQuery


🔹 Example Batch Job Command:

gcloud dataproc jobs submit spark \
  --cluster=my-batch-cluster \
  --region=us-central1 \
  --class=org.example.BatchETLJob \
  --jars=gs://my-bucket/jars/batch-etl.jar \
  -- gs://data/raw/ gs://data/processed/
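
The same pattern can also be written as a PySpark job instead of a compiled Spark class. The following is a minimal sketch, not the actual BatchETLJob above: the bucket paths, column names, and BigQuery table are placeholders, and the BigQuery write assumes the Spark BigQuery connector is available on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BatchETLJob").getOrCreate()

# Read raw CSV data from Cloud Storage (path is a placeholder)
raw_df = spark.read.option("header", True).csv("gs://data/raw/")

# Example transformation: drop rows missing a (placeholder) key column and stamp the run date
processed_df = (
    raw_df
    .filter(F.col("event_id").isNotNull())
    .withColumn("processing_date", F.current_date())
)

# Write the processed copy back to Cloud Storage as Parquet
processed_df.write.mode("overwrite").parquet("gs://data/processed/")

# Load the result into BigQuery (requires the Spark BigQuery connector;
# temporaryGcsBucket is needed for the default indirect write method)
(processed_df.write
    .format("bigquery")
    .option("table", "myproject.mydataset.daily_events")
    .option("temporaryGcsBucket", "my-bucket")
    .mode("overwrite")
    .save())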


B. Real-Time (Streaming) Processing Pattern

Example: Stream Analytics Using Spark Streaming or Flink

Pub/Sub (Event Stream)
        ↓
Dataproc Cluster (Streaming Job)
(Spark Structured Streaming or Flink)
        ↓
BigQuery / Cloud Storage / Pub/Sub



Trigger: Always-on or auto-scaling streaming cluster


Common workloads: Real-time ETL, log analysis, fraud detection, IoT analytics


Integration points:


Input from Pub/Sub or Kafka


Output to BigQuery, GCS, or Pub/Sub


🔹 Example Spark Structured Streaming (Python):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamProcessing").getOrCreate()

# Read from Pub/Sub Lite (requires the Pub/Sub Lite Spark connector jar on the cluster;
# the subscription path must include the location)
stream_df = spark.readStream \
    .format("pubsublite") \
    .option("pubsublite.subscription",
            "projects/myproj/locations/us-central1-a/subscriptions/mysub") \
    .load()

# Simple transform: decode the message payload
processed_df = stream_df.selectExpr("CAST(data AS STRING) AS message")

# Write to BigQuery (requires the Spark BigQuery connector; streaming writes
# need a checkpoint location and a temporary GCS bucket for the indirect write method)
query = processed_df.writeStream \
    .format("bigquery") \
    .option("table", "myproject.mydataset.streaming_output") \
    .option("checkpointLocation", "gs://my-bucket/checkpoints/streaming_output") \
    .option("temporaryGcsBucket", "my-bucket") \
    .outputMode("append") \
    .start()

# Keep the streaming job running
query.awaitTermination()


🚀 4. Deployment Options

A. Ephemeral (Job) Clusters


Spin up a Dataproc cluster per job.


Automatically delete it after completion.


Cost-effective for batch jobs.


gcloud dataproc workflow-templates instantiate-from-file \
    --file=workflow.yaml \
    --region=us-central1
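
If the ephemeral lifecycle is orchestrated from Cloud Composer rather than a Workflow Template, the Airflow Dataproc operators can express the same create, run, delete sequence. The sketch below is a minimal, hypothetical DAG: project, region, machine types, and the jar path are placeholders reusing the batch example above, and it assumes the Google provider package is installed in the Composer environment.

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocSubmitJobOperator,
    DataprocDeleteClusterOperator,
)

PROJECT_ID = "myproject"            # placeholder
REGION = "us-central1"
CLUSTER_NAME = "ephemeral-etl-cluster"

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
}

SPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "spark_job": {
        "main_class": "org.example.BatchETLJob",
        "jar_file_uris": ["gs://my-bucket/jars/batch-etl.jar"],
        "args": ["gs://data/raw/", "gs://data/processed/"],
    },
}

with DAG("dataproc_ephemeral_etl", start_date=datetime(2025, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
    )

    run_etl = DataprocSubmitJobOperator(
        task_id="run_etl",
        project_id=PROJECT_ID,
        region=REGION,
        job=SPARK_JOB,
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",  # delete the cluster even if the job fails
    )

    create_cluster >> run_etl >> delete_cluster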


B. Long-Running Clusters


Used for streaming or interactive workloads.


Can use auto-scaling and auto-repair.


C. Dataproc Serverless


Run Spark or PySpark jobs without managing clusters.


Ideal for cost efficiency and simplicity.


gcloud dataproc batches submit pyspark my_job.py --region=us-central1
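
The same Serverless batch can also be submitted programmatically, for example from an orchestration script, using the Dataproc Python client library. This is a minimal sketch; the project, region, and script path are placeholders.

from google.cloud import dataproc_v1

PROJECT_ID = "myproject"   # placeholder
REGION = "us-central1"

# The Batch controller endpoint is regional
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/my_job.py",  # placeholder
    )
)

# create_batch returns a long-running operation; result() blocks until the batch finishes
operation = client.create_batch(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}", batch=batch
)
result = operation.result()
print(f"Batch finished with state: {result.state.name}")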


🔄 5. Integrating Dataproc with Other GCP Services

Service | Purpose
Cloud Storage (GCS) | Data lake for raw and processed data
BigQuery | Analytics and downstream consumption
Pub/Sub | Real-time streaming input/output
Composer (Airflow) | Orchestration and workflow scheduling
Vertex AI | ML model training and batch inference
Cloud Logging & Monitoring | Observability for jobs and clusters

🧰 6. Cost & Optimization Tips


✅ Use preemptible (Spot) VMs as secondary workers for batch jobs — typically 60–80% cheaper (see the sketch after this list).

✅ Use auto-scaling clusters for variable workloads.

✅ Turn on auto-deletion for ephemeral clusters.

✅ Store large intermediate data in Cloud Storage, not HDFS.

✅ Use Dataproc Serverless for short-lived jobs.
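
To illustrate the first and third tips, here is a minimal sketch using the Dataproc Python client that creates a cluster with preemptible secondary workers and an idle auto-delete TTL. The project, cluster name, machine types, worker counts, and TTL are placeholder values.

from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

PROJECT_ID = "myproject"   # placeholder
REGION = "us-central1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = dataproc_v1.Cluster(
    project_id=PROJECT_ID,
    cluster_name="cost-optimized-batch",
    config=dataproc_v1.ClusterConfig(
        master_config=dataproc_v1.InstanceGroupConfig(
            num_instances=1, machine_type_uri="n2-standard-4"
        ),
        worker_config=dataproc_v1.InstanceGroupConfig(
            num_instances=2, machine_type_uri="n2-standard-4"
        ),
        # Cheaper, preemptible secondary workers for batch throughput
        secondary_worker_config=dataproc_v1.InstanceGroupConfig(
            num_instances=4,
            preemptibility=dataproc_v1.InstanceGroupConfig.Preemptibility.PREEMPTIBLE,
        ),
        # Auto-delete the cluster after 30 minutes of inactivity
        lifecycle_config=dataproc_v1.LifecycleConfig(
            idle_delete_ttl=duration_pb2.Duration(seconds=1800)
        ),
    ),
)

operation = client.create_cluster(
    project_id=PROJECT_ID, region=REGION, cluster=cluster
)
operation.result()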


📊 7. Example End-to-End Architecture

Scenario: Real-Time + Batch Unified Pipeline

[Pub/Sub]  <----  Event producers (IoT, logs)
      │
┌────────────────────┐
│ Dataproc Streaming │
│ (Spark/Flink)      │
└────────────────────┘
      │
[BigQuery - Realtime Table]
      │
[Dataproc Batch Job]
(Aggregates daily data)
      │
[BigQuery - Curated Dataset]
      │
[Looker / Data Studio Dashboards]



Benefits:


Real-time and batch coexist seamlessly.


Unified data lake + warehouse architecture.


Reusable infrastructure and cost-efficient processing.
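
For the "aggregates daily data" step shown in the diagram above, a minimal PySpark sketch might look like the following. The table names reuse the earlier streaming example, the event_time column is an assumed field, and both the read and the write rely on the Spark BigQuery connector being on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DailyAggregation").getOrCreate()

# Read the real-time table written by the streaming job (placeholder names)
events = (spark.read.format("bigquery")
          .option("table", "myproject.mydataset.streaming_output")
          .load())

# Aggregate the previous day's events per message value
# (assumes an event_time column exists in the table)
daily = (events
         .filter(F.to_date("event_time") == F.date_sub(F.current_date(), 1))
         .groupBy("message")
         .agg(F.count("*").alias("event_count")))

# Write the curated daily table back to BigQuery
(daily.write.format("bigquery")
     .option("table", "myproject.mydataset.daily_curated")
     .option("temporaryGcsBucket", "my-bucket")
     .mode("append")
     .save())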


🧭 8. When to Use Dataproc vs. Alternatives

Use Case | Recommended Tool
Large-scale batch ETL (Spark, Hive) | Dataproc Batch / Serverless
Real-time stream processing | Dataproc (Spark Streaming / Flink)
Event-driven processing | Dataflow (Apache Beam)
SQL-only analytics | BigQuery
ML feature engineering | Dataproc + Vertex AI
