✅ What Is Data Skew?
Data skew happens when some partitions contain a lot more data than others during a Spark job.
This causes:
Slow stages
Long-running tasks (stragglers)
Out-of-memory errors
Inefficient cluster usage
Increased job cost
Example:
A join on user_id where one user appears 10 million times while others appear only 100 times.
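A quick way to confirm this kind of skew is to count rows per join key and inspect the heaviest ones (a diagnostic sketch; df stands in for your skewed DataFrame):
from pyspark.sql.functions import col
df.groupBy("user_id").count().orderBy(col("count").desc()).show(10)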
🔍 Symptoms of Data Skew in Dataproc
You’ll see:
Some tasks completing in seconds while a few run for minutes or hours
Executors dying due to memory pressure
The Spark UI showing one task out of 200 still running long after the rest finish
Stages stuck at 99%
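You can also measure the imbalance directly by counting rows per partition (a diagnostic sketch, assuming df is the DataFrame feeding the slow stage):
from pyspark.sql.functions import col, spark_partition_id
df.groupBy(spark_partition_id().alias("partition")).count().orderBy(col("count").desc()).show()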
⭐ Techniques to Handle Data Skew in Dataproc (Spark)
Below are the most common and effective solutions:
1️⃣ Salting the Key (Most Important Method)
If the skew happens during a join, add randomness to distribute values.
Example:
Before (skewed):
df1.join(df2, "user_id")
After (salt the keys):
from pyspark.sql.functions import col, concat_ws, lit, rand, explode, array
# Salt the large, skewed side with a random suffix in [0, 10)
df1_salted = df1.withColumn("user_id_salted", concat_ws("_", col("user_id"), (rand() * 10).cast("int")))
# Replicate the other side once per salt value so every salted key finds its match
df2_salted = df2.withColumn("salt", explode(array(*[lit(i) for i in range(10)])))
df2_salted = df2_salted.withColumn("user_id_salted", concat_ws("_", col("user_id"), col("salt")))
df1_salted.join(df2_salted, "user_id_salted")
Each hot key is split into 10 buckets, spreading its rows across many tasks instead of one.
2️⃣ Using broadcast() for Smaller Dataset
If one side of the join is small:
from pyspark.sql.functions import broadcast
df1.join(broadcast(df_small), "id")
Advantages:
No shuffle needed
Eliminates skew for that join
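Spark also broadcasts the smaller side automatically when it is below spark.sql.autoBroadcastJoinThreshold (10 MB by default). If your lookup table is somewhat larger, you can raise the threshold, for example to 50 MB:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))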
3️⃣ Enable Skew-Join Optimization (Spark 3.x)
Spark 3.x can automatically detect and split skewed partitions at join time through Adaptive Query Execution:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
Dataproc ships Spark 3.x, so this works out of the box. Note that the explicit df2.hint("skew") syntax is a Databricks-only extension; open-source Spark ignores unrecognized hints.
4️⃣ Use repartition() on Skewed Columns
Explicitly redistribute data:
df = df.repartition(200, "user_id")
Useful when:
Key cardinality is high
Natural distribution is uneven
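If a single column still concentrates the hot keys, repartitioning on a composite key can spread rows further for general processing (event_date is a hypothetical second column):
df = df.repartition(200, "user_id", "event_date")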
5️⃣ Pre-Aggregate Before Join
Instead of joining raw massive tables, aggregate first, then join.
Before:
df1.join(df2, "user_id")
After:
from pyspark.sql.functions import sum as spark_sum
df2_agg = df2.groupBy("user_id").agg(spark_sum("amount").alias("total_amount"))
df1.join(df2_agg, "user_id")
This reduces the data shuffled and collapses each hot key to a single row on the aggregated side.
6️⃣ Optimize Partition Sizes
Use:
More, smaller partitions when the distribution is uneven
Fewer, larger partitions when scheduling overhead from many tiny tasks slows you down
Example:
df = df.repartition(400)
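The number of partitions produced by shuffles is controlled separately by spark.sql.shuffle.partitions (200 by default); a minimal sketch of raising it for a large, uneven dataset:
spark.conf.set("spark.sql.shuffle.partitions", "400")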
7️⃣ Avoid Wide Transformations When Possible
Operations that cause shuffles:
join
groupBy
distinct
orderBy
Try alternatives like:
mapPartitions for per-partition processing
reduceByKey instead of groupByKey (RDD API; see the sketch below)
Caching intermediate results so an expensive shuffle isn't recomputed
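For example, reduceByKey combines values map-side before the shuffle, while groupByKey ships every raw value across the network, so a hot key overloads a single task. A minimal sketch with hypothetical data:
rdd = spark.sparkContext.parallelize([("a", 1), ("a", 2), ("b", 3)])
# Preferred: partial sums are computed locally before the shuffle
sums = rdd.reduceByKey(lambda x, y: x + y)
# Avoid: rdd.groupByKey().mapValues(sum) shuffles every value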
8️⃣ Increase Executor Resources (If Needed)
On Dataproc, increase:
executor memory (--properties)
executor cores
number of workers
Example:
--properties=spark.executor.memory=8g,spark.executor.cores=4
This helps if skewed tasks need more memory to finish.
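On Dataproc these are typically passed at job submission; a sketch, where the cluster name, region, and script are placeholders:
gcloud dataproc jobs submit pyspark my_job.py \
  --cluster=my-cluster \
  --region=us-central1 \
  --properties=spark.executor.memory=8g,spark.executor.cores=4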
9️⃣ Use Adaptive Query Execution (AQE)
Enable AQE in Spark 3:
--properties=spark.sql.adaptive.enabled=true
AQE automatically:
Detects skew
Merges small partitions
Splits large partitions
This is very effective and recommended.
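A fuller set of AQE flags (all standard Spark 3.x settings; coalescePartitions merges small shuffle partitions, skewJoin splits oversized ones):
--properties=spark.sql.adaptive.enabled=true,spark.sql.adaptive.skewJoin.enabled=true,spark.sql.adaptive.coalescePartitions.enabled=true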
10️⃣ Push Filtering Earlier (Predicate Pushdown)
Always filter before joins so less data is shuffled; for columnar sources like Parquet, Spark also pushes simple predicates down to the scan automatically:
df1 = df1.filter(df1.status == "active")
df2 = df2.filter(df2.amount > 0)
Less data = Less skew.
🎯 Which Techniques Should You Use? (Priority Order)
For Dataproc, the best-performing order is:
Broadcast small table
AQE (Adaptive Query Execution)
Salting hot keys
Pre-aggregation
Repartition on join keys
Increase cluster size only if needed
🚀 Example: Full Anti-Skew Join Solution
from pyspark.sql.functions import col, broadcast, sum as spark_sum
# Step 1: filter early to shrink both sides
df1 = df1.filter(col("active") == True)
# Step 2: pre-aggregate the lookup table so each user_id becomes one row
df2_agg = df2.groupBy("user_id").agg(spark_sum("amount").alias("total"))
# Step 3: if df2_agg is small enough, broadcast it to skip the shuffle
df_join = df1.join(broadcast(df2_agg), "user_id")
Fast, scalable, and no skew.
🏁 Final Summary
To handle data skew in Dataproc:
Use broadcast joins for small tables
Enable AQE for automatic skew handling
Salt keys for heavy skewed joins
Pre-aggregate before joining
Repartition skewed columns
Filter early
Avoid expensive wide transformations
Use the right method depending on your data size and skew severity.