Running TensorFlow Distributed Training on Dataproc
Google Cloud Dataproc is not only for Spark and Hadoop; it also provides an elastic and low-overhead way to run distributed TensorFlow (TF) training jobs using custom clusters. Dataproc’s managed infrastructure, autoscaling, and integration with GCS make it a strong option for cost-effective large-scale training.
1. Why Train TensorFlow on Dataproc?
Key advantages
Elastic clusters: Spin up GPU/CPU training clusters on demand.
Autoscaling: Add/remove workers as needed.
Custom machine types: High-CPU, high-memory, or GPU nodes.
Lower cost than dedicated GPU services for intermittent workloads.
Native integration with GCS, BigQuery, and Spark pipelines.
Workflow Templates support for repeatable ML pipelines.
2. Distributed Training Strategies Supported
TensorFlow supports several distributed training strategies that work seamlessly on Dataproc:
A. MultiWorkerMirroredStrategy
Synchronous training across multiple workers.
Most common for training on GPUs or multi-VM clusters.
B. ParameterServerStrategy
Asynchronous training using:
Workers
Parameter servers (PS)
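A minimal sketch of what this might look like, assuming TF_CONFIG on each VM already lists chief, worker, and ps tasks (that cluster layout is an assumption you set up yourself, not something Dataproc configures automatically):

import tensorflow as tf

# Workers and parameter servers only run a gRPC server; the chief drives training.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

if cluster_resolver.task_type in ("worker", "ps"):
    server = tf.distribute.Server(
        cluster_resolver.cluster_spec(),
        job_name=cluster_resolver.task_type,
        task_index=cluster_resolver.task_id,
        protocol="grpc")
    server.join()
else:
    strategy = tf.distribute.ParameterServerStrategy(cluster_resolver)
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.compile(optimizer="adam", loss="mse")
    # model.fit(...) then trains asynchronously, coordinated from the chief.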
C. TFJob via Kubeflow (alternative)
If you use Dataproc for a hybrid Spark + TF pipeline, you can export the prepared training data and run the distributed training step separately as a TFJob on Kubeflow.
This is not required; Dataproc can run TF natively.
3. Cluster Setup for TensorFlow Training
Option 1: CPU Training Clusters
gcloud dataproc clusters create tf-cpu-cluster \
    --region=us-central1 \
    --num-workers=4 \
    --worker-machine-type=n1-standard-16 \
    --image-version=2.2-debian12
(Dataproc 2.x images ship with Miniconda and Python 3 preinstalled, so the 1.x-era ANACONDA optional component is no longer needed.)
Option 2: GPU Training Clusters
Use initialization actions for GPU drivers:
gcloud dataproc clusters create tf-gpu-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-16 \
    --worker-machine-type=n1-standard-16 \
    --worker-accelerator=type=nvidia-tesla-t4,count=2 \
    --metadata=gpu-driver-provider=NVIDIA \
    --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/gpu/install_gpu_driver.sh \
    --image-version=2.2-debian12
GPU training works best with:
T4 or A100 GPUs
High-performance disk or local SSD
Proper CUDA and cuDNN installation (handled by init actions)
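Once the cluster is up, a quick sanity check on a worker confirms that the init action installed the driver and TensorFlow can see the GPUs (a minimal sketch, nothing Dataproc-specific):

import tensorflow as tf

# List the GPUs TensorFlow can see; an empty list means the driver install failed.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

# Optional: let TF allocate GPU memory on demand instead of grabbing it all up front.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)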
4. Preparing the Training Code
Your Python script must:
Detect cluster configuration
Define TF_CONFIG environment variable
Use a distributed strategy
A. TF_CONFIG Environment Variable
Dataproc provides cluster node names and IPs. Each node acts as:
chief (master)
worker
optional ps
Example TF_CONFIG (JSON):
{
  "cluster": {
    "chief": ["master:2222"],
    "worker": ["worker1:2222", "worker2:2222"]
  },
  "task": {
    "type": "worker",
    "index": 0
  }
}
On Dataproc, TF_CONFIG is usually generated dynamically in the training script or passed in via job arguments.
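For example, a minimal sketch that builds TF_CONFIG from job arguments; the tf-gpu-cluster-w-N hostnames follow Dataproc's default worker naming and port 2222 is just a convention, so treat both as assumptions to verify on your cluster:

import json
import os
import sys

# e.g. argv[1] = "tf-gpu-cluster-w-0:2222,tf-gpu-cluster-w-1:2222", argv[2] = "0"
workers = sys.argv[1].split(",")
task_index = int(sys.argv[2])   # this node's position in the worker list

# Must be set before MultiWorkerMirroredStrategy is created.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": workers},
    "task": {"type": "worker", "index": task_index},
})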
B. Example Python Code Using MultiWorkerMirroredStrategy
import json
import os

import tensorflow as tf

# json/os are needed when TF_CONFIG is built dynamically (see section A above).

# Synchronous data-parallel training across every worker listed in TF_CONFIG.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Model creation and compilation must happen inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Toy data for illustration; in practice, read TFRecords from GCS (see section 6).
(train_x, train_y), _ = tf.keras.datasets.mnist.load_data()
train_x = train_x / 255.0

dataset = tf.data.Dataset.from_tensor_slices((train_x, train_y)).batch(64)
model.fit(dataset, epochs=10)

model.save("gs://my-bucket/models/model1")
With TF_CONFIG set on every VM, this code distributes training across the Dataproc worker VMs.
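One detail worth adding (a minimal sketch, reusing the gs://my-bucket path above): TensorFlow's multi-worker guide has every worker call model.save(), but only the chief should write to the real destination, with the other workers writing to throwaway directories.

import json
import os

# Worker 0 acts as chief when no explicit "chief" task is defined in TF_CONFIG.
task = json.loads(os.environ.get("TF_CONFIG", "{}")).get("task", {})
task_type, task_id = task.get("type"), task.get("index", 0)
is_chief = task_type is None or task_type == "chief" or (
    task_type == "worker" and task_id == 0)

model_dir = "gs://my-bucket/models/model1" if is_chief else f"/tmp/workertemp_{task_id}"
# model.save(model_dir)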
5. Submitting TensorFlow Jobs to Dataproc
Upload your training script to GCS:
gs://mybucket/jobs/train.py
Then submit the job:
gcloud dataproc jobs submit pyspark gs://mybucket/jobs/train.py \
    --cluster=tf-gpu-cluster \
    --region=us-central1 \
    -- --epochs=10 --batch_size=64
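The flags after the bare "--" are handed to the script itself; a minimal argparse sketch that consumes them might look like this (the flag names are whatever your script defines):

import argparse

# Parse the job arguments forwarded by gcloud after "--".
parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--batch_size", type=int, default=64)
args = parser.parse_args()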
For pure Python without Spark, the same command works: the PySpark job type simply launches the script on the cluster, and the script is free to never call Spark APIs. Executor environment settings can still be passed with --properties (for example spark.executorEnv.PYTHONHASHSEED=0), and if you need a specific Python environment, configure it through a custom image or an initialization action rather than a wrapper script.
6. Integrating Spark + TensorFlow (Typical Workflow)
A common Dataproc ML pipeline:
Spark for data preparation
Ingest + preprocess data
Write dataset to GCS (TFRecord/Parquet)
TensorFlow for distributed training
Load preprocessed data from GCS
Train across GPU workers
Post-processing
Export model to GCS / Vertex AI Model Registry
Trigger batch/online inference jobs
This setup takes advantage of Dataproc’s strengths in both big data and distributed ML.
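To connect the two stages, the training script reads whatever the Spark step wrote to GCS. A minimal sketch, where the path, feature names, and shapes are assumptions to adapt to your own schema:

import tensorflow as tf

# Read the TFRecords the Spark preprocessing step wrote to GCS.
files = tf.data.Dataset.list_files("gs://bucket/data/train/*.tfrecord")

feature_spec = {
    "features": tf.io.FixedLenFeature([32], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    example = tf.io.parse_single_example(record, feature_spec)
    return example["features"], example["label"]

dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)
)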
7. Autoscaling for Distributed ML
Dataproc supports:
Cluster scaling based on YARN utilization
Preemptible (Spot) workers (cheap but can interrupt training)
Best practice:
Use synchronous training with stable workers
Use asynchronous training (ParameterServerStrategy) if using Spot VMs
8. Best Practices for Running TF Training on Dataproc
Data
✔ Store datasets in GCS (TFRecords for best performance)
✔ Use caching or sharding for large datasets
Cluster
✔ Use GPU clusters for deep learning
✔ Use local SSDs when reading large datasets locally
✔ Use Instance Groups with consistent hardware
Training
✔ Use distribution strategies properly
✔ Use checkpointing to GCS (see the sketch after the Costs list)
✔ Use mixed precision to speed up GPU training
Costs
✔ Run ephemeral clusters via Workflow Templates
✔ Prefer T4/L4 GPUs unless you are training very large models
✔ Shut down clusters automatically
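A minimal sketch of the checkpointing and mixed-precision practices above (the GCS paths are placeholders; BackupAndRestore lets a restarted worker resume from the last synced epoch):

import tensorflow as tf

# Compute in float16 where safe, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

callbacks = [
    # Fault tolerance: resume training after a worker restart.
    tf.keras.callbacks.BackupAndRestore(backup_dir="gs://my-bucket/backup/"),
    # Regular checkpoints written straight to GCS.
    tf.keras.callbacks.ModelCheckpoint(filepath="gs://my-bucket/checkpoints/ckpt-{epoch}"),
]

# model.fit(dataset, epochs=10, callbacks=callbacks)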
9. Using Dataproc Workflow Templates
You can define a repeatable training pipeline:
Step 1: Preprocess data (Spark)
Step 2: Run distributed TensorFlow training job
Step 3: Export model to GCS
Step 4: Delete cluster automatically
Workflow templates ensure:
Reproducibility
Versioned ML pipelines
Separation of environments (dev/stage/prod)
10. Example End-to-End Training Pipeline
A typical production training pipeline:
Spark job prepares large dataset → gs://bucket/data/train/*.tfrecord
Dataproc ephemeral GPU cluster starts
Distributed TF training runs across workers
Checkpoints saved to gs://bucket/checkpoints/
Final model exported to gs://bucket/models/
Cluster auto-shutdown
Vertex AI online/batch serving uses exported model
Summary
Running TensorFlow distributed training on Dataproc provides:
Strengths
Elastic, scalable training clusters
Full GPU support
Low-cost, on-demand training
Smooth integration with Spark and GCS
Reproducible ML workflows using Dataproc templates
Best Use Cases
Large-scale deep learning
Embedding generation
Recommender systems
NLP workloads
Image classification and detection