Wednesday, November 26, 2025


Handling Data Skew in Large Dataproc Jobs

 ✅ What Is Data Skew?


Data skew happens when some partitions contain a lot more data than others during a Spark job.


This causes:


Slow stages


Long-running tasks (stragglers)


Out-of-memory errors


Inefficient cluster usage


Increased job cost


Example:

A join on user_id where one user appears 10 million times while others appear only 100 times.


🛑 Symptoms of Data Skew in Dataproc


You’ll see:


Some tasks completing in seconds while a few run for minutes or hours


Executors dying due to memory pressure


The Spark UI showing “task 0 of 200” running for a long time


Stages stuck at 99%
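
A quick way to confirm a hot key is to count rows per key. A minimal sketch, assuming a DataFrame df with the join key user_id:

from pyspark.sql.functions import desc

# Show the ten heaviest keys; one huge count confirms skew
df.groupBy("user_id").count().orderBy(desc("count")).show(10)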


⭐ Techniques to Handle Data Skew in Dataproc (Spark)


Below are the best and most common solutions:


1️⃣ Salting the Key (Most Important Method)


If the skew happens during a join, add randomness to distribute values.


Example:

Before (skewed):

df1.join(df2, "user_id")


After (salt the keys):

from pyspark.sql.functions import array, col, concat_ws, explode, lit, rand

# Salt the skewed side with a random bucket (0-9)
df1_salted = df1.withColumn("user_id_salted", concat_ws("_", col("user_id"), (rand() * 10).cast("int")))

# Replicate the other side once per salt value so every salted key can match
df2_salted = df2.withColumn("salt", explode(array(*[lit(i) for i in range(10)]))) \
    .withColumn("user_id_salted", concat_ws("_", col("user_id"), col("salt")))

result = df1_salted.join(df2_salted, "user_id_salted")



You create multiple buckets to spread the load.
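
A note on the salt factor: 10 buckets (as above) spreads each hot key across up to 10 partitions, so pick a factor roughly matching how many ways you need the heaviest key split.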


2️⃣ Using broadcast() for the Smaller Dataset


If one side of the join is small:


from pyspark.sql.functions import broadcast


df1.join(broadcast(df_small), "id")



Advantages:


No shuffle needed


Eliminates skew for that join
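
Spark can also broadcast automatically when a table falls under spark.sql.autoBroadcastJoinThreshold (10 MB by default). A minimal sketch raising it, assuming an existing spark session; 50 MB is illustrative:

# Raise the auto-broadcast cutoff to 50 MB (value is in bytes)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))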


3️⃣ Use AQE Skew-Join Handling (Spark 3.x)


Spark 3.x can automatically detect and split skewed partitions during a join through Adaptive Query Execution:


--properties=spark.sql.adaptive.enabled=true,spark.sql.adaptive.skewJoin.enabled=true


Note: a dedicated skew join hint such as df2.hint("skew") exists only in some Spark distributions (for example, Databricks Runtime); open-source Spark on Dataproc ignores unknown hints, so use the AQE settings above instead (see technique 9).


4️⃣ Use repartition() on Skewed Columns


Explicitly redistribute data:


df = df.repartition(200, "user_id")



Useful when:


Key cardinality is high


Natural distribution is uneven


Note: all rows with the same key still hash to the same partition, so repartitioning alone cannot split a single hot key; combine it with salting for extreme skew.


5️⃣ Pre-Aggregate Before Join


Instead of joining raw massive tables, aggregate first, then join.


Before:

df1.join(df2, "user_id")


After:

from pyspark.sql.functions import sum as spark_sum

df2_agg = df2.groupBy("user_id").agg(spark_sum("amount").alias("total_amount"))

df1.join(df2_agg, "user_id")



This reduces data shuffled and eliminates skew.


6️⃣ Optimize Partition Sizes


Use:


Smaller partitions for uneven distributions


Larger partitions when many small tasks slow you down


Example:


df = df.repartition(400)
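
The default shuffle parallelism can be tuned the same way. A minimal sketch, assuming an existing spark session; 400 is illustrative:

# Controls how many partitions shuffles (joins, aggregations) produce
spark.conf.set("spark.sql.shuffle.partitions", "400")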


7️⃣ Avoid Wide Transformations When Possible


Operations that cause shuffles:


join


groupBy


distinct


orderBy


Try alternatives like:


mapPartitions


reduceByKey instead of groupByKey


Caching mid-stage results
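
To illustrate the reduceByKey point, a minimal RDD sketch, assuming an existing spark session:

# groupByKey ships every value across the network before aggregating
rdd = spark.sparkContext.parallelize([("a", 1), ("a", 2), ("b", 3)])
totals_slow = rdd.groupByKey().mapValues(sum)

# reduceByKey combines values within each partition first, shuffling far less
totals_fast = rdd.reduceByKey(lambda x, y: x + y)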


8️⃣ Increase Executor Resources (If Needed)


On Dataproc, increase:


executor memory (--properties)


executor cores


number of workers


Example:


--properties=spark.executor.memory=8g,spark.executor.cores=4



This helps if skewed tasks need more memory to finish.
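
The same properties can also be set in code when building the session. A minimal sketch (the app name is a placeholder; properties that size executors must be set before the session starts):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("skew-demo")
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "4")
         .getOrCreate())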


9️⃣ Use Adaptive Query Execution (AQE)


Enable AQE in Spark 3:


--properties=spark.sql.adaptive.enabled=true



AQE automatically:


Detects skew


Merges small partitions


Splits large partitions


This is very effective and recommended.
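
The individual AQE behaviors listed above map to their own settings. A minimal sketch, assuming an existing spark session (all three default to true in recent Spark 3.x):

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small partitions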


10️⃣ Push Filtering Earlier (Predicate Pushdown)


Always filter before joins:


df1 = df1.filter(df1.status == "active")

df2 = df2.filter(df2.amount > 0)



Less data = less skew.


🎯 Which Techniques Should You Use? (Priority Order)


For Dataproc, the best-performing order is:


Broadcast small table


AQE (Adaptive Query Execution)


Salting hot keys


Pre-aggregation


Repartition on join keys


Increase cluster size only if needed


📌 Example: Full Anti-Skew Join Solution

from pyspark.sql.functions import broadcast, col, sum as spark_sum


# Step 1: filter early

df1 = df1.filter(col("active") == True)


# Step 2: aggregate the smaller table

df2_agg = df2.groupBy("user_id").agg(spark_sum("amount").alias("total"))


# Step 3: if df2_agg is small, broadcast it

df_join = df1.join(broadcast(df2_agg), "user_id")



Fast, scalable, and no skew.


🎉 Final Summary


To handle data skew in Dataproc:


Use broadcast joins for small tables


Enable AQE for automatic skew handling


Salt keys for heavy skewed joins


Pre-aggregate before joining


Repartition skewed columns


Filter early


Avoid expensive wide transformations


Use the right method depending on your data size and skew severity.
