Using GraphFrames on Spark for Network Analysis
GraphFrames is a Spark package that combines graph computation with DataFrame APIs. It enables scalable graph processing using familiar Spark SQL/DataFrame operations, while also providing graph algorithms like PageRank, connected components, motif finding, and BFS.
It is widely used for:
Social network analysis
Recommendation systems
Fraud detection
Knowledge graphs
Network topology and routing
1. What Are GraphFrames?
A GraphFrame represents a graph using two Spark DataFrames:
Vertices DataFrame: each row is a node with id and additional attributes.
Edges DataFrame: each row is a directed edge with src, dst, and attributes.
Example:
vertices = spark.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30)
], ["id", "name", "age"])

edges = spark.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "a", "follow")
], ["src", "dst", "relationship"])
2. Installing GraphFrames on Dataproc or Spark
Dataproc cluster setup
You can install GraphFrames with an initialization action:
gcloud dataproc clusters create my-cluster \
--initialization-actions gs://dataproc-initialization-actions/graphframes/graphframes.sh
Local PySpark
pyspark --packages graphframes:graphframes:0.8.2-spark3.3-s_2.12
Use the package version that matches your Spark and Scala versions (the artifact name encodes both, e.g. spark3.3 and s_2.12).
3. Creating and Inspecting a GraphFrame
from graphframes import GraphFrame
g = GraphFrame(vertices, edges)
print("Vertices:")
g.vertices.show()
print("Edges:")
g.edges.show()
4. Core Graph Operations
A. Degrees
g.inDegrees.show()
g.outDegrees.show()
g.degrees.show()
B. Breadth-First Search (BFS)
Find paths from node a to nodes named “Charlie”:
paths = g.bfs("id = 'a'", "name = 'Charlie'")
paths.show()
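To build intuition for what g.bfs computes, here is a minimal single-machine BFS sketch in plain Python over the toy graph from section 1 (the adjacency list and name lookup are illustrations of the vertices/edges DataFrames; GraphFrames performs this search distributively):

```python
from collections import deque

# Toy graph from the earlier example: a -> b, b -> c, c -> a
edges = [("a", "b"), ("b", "c"), ("c", "a")]
names = {"a": "Alice", "b": "Bob", "c": "Charlie"}

def bfs_path(start, goal_name):
    """Shortest directed path from `start` to any vertex whose
    name matches `goal_name`, or None if unreachable."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if names[path[-1]] == goal_name:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(bfs_path("a", "Charlie"))  # ['a', 'b', 'c']
```

This mirrors the g.bfs call above: the from-expression picks the start vertex, the to-expression filters on a vertex attribute, and the result is the shortest matching path.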
C. Motif Finding (Pattern Matching)
Find triangles or other repeated structural patterns using a simple pattern language.
Example: two-hop paths, such as user → item → user in a bipartite graph:
motifs = g.find("(u1)-[e1]->(u2); (u2)-[e2]->(u3)")
motifs.show()
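A directed triangle corresponds to the pattern "(a)-[e1]->(b); (b)-[e2]->(c); (c)-[e3]->(a)". As a single-machine illustration of what that pattern matches, using the toy edge list from section 1 (which forms exactly one directed triangle, counted once per rotation):

```python
# Toy edge list: a -> b, b -> c, c -> a forms one directed triangle
edges = [("a", "b"), ("b", "c"), ("c", "a")]
edge_set = set(edges)

def directed_triangles():
    """All (x, y, z) with edges x->y, y->z, z->x."""
    found = []
    for x, y in edges:
        for y2, z in edges:
            if y2 == y and (z, x) in edge_set:
                found.append((x, y, z))
    return found

print(directed_triangles())  # one triangle, matched in 3 rotations
```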
5. Graph Algorithms Included in GraphFrames
GraphFrames provides distributed graph algorithms optimized for Spark.
A. PageRank (Google-style importance ranking)
results = g.pageRank(resetProbability=0.15, maxIter=10)
results.vertices.show()
Useful for:
Ranking influencers
Finding important documents
Prioritizing fraud investigation targets
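To see what g.pageRank is computing, here is a minimal power-iteration sketch in plain Python on the toy 3-cycle, with reset_prob playing the same role as resetProbability (note GraphFrames may scale the final ranks differently):

```python
# Minimal PageRank power iteration on the toy cycle a -> b -> c -> a.
edges = [("a", "b"), ("b", "c"), ("c", "a")]
nodes = ["a", "b", "c"]
reset_prob = 0.15  # same role as resetProbability in g.pageRank

out_links = {n: [d for s, d in edges if s == n] for n in nodes}
rank = {n: 1.0 / len(nodes) for n in nodes}

for _ in range(10):  # maxIter
    new_rank = {n: reset_prob / len(nodes) for n in nodes}
    for n in nodes:
        share = rank[n] / len(out_links[n])
        for dst in out_links[n]:
            new_rank[dst] += (1 - reset_prob) * share
    rank = new_rank

print(rank)  # symmetric cycle, so all ranks stay equal (1/3 each)
```

On this symmetric cycle every vertex ends up equally important; asymmetric link structure is what produces the ranking differences PageRank is used for.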
B. Connected Components
The default implementation requires a Spark checkpoint directory to be set first:
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")
components = g.connectedComponents()
components.show()
Useful for:
Community detection
Finding isolated sub-networks
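Connected components treats edges as undirected and groups together every vertex reachable from any other. A single-machine union-find sketch of the same grouping over the toy edges:

```python
# Union-find over the toy edges (treated as undirected).
edges = [("a", "b"), ("b", "c"), ("c", "a")]
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for s, d in edges:
    union(s, d)

# Group vertices by their component root
components = {}
for v in list(parent):
    components.setdefault(find(v), []).append(v)
print(components)  # one component containing a, b, c
```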
C. Strongly Connected Components
scc = g.stronglyConnectedComponents(maxIter=10)
scc.show()
Useful in:
Transaction cycle detection
Graph consistency checks
D. Label Propagation Algorithm (LPA)
For community detection:
communities = g.labelPropagation(maxIter=20)
communities.show()
E. Shortest Paths
shortest = g.shortestPaths(landmarks=["a", "c"])
shortest.show()
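g.shortestPaths returns, for every vertex, a distances map of hop counts to each reachable landmark. A plain-Python BFS sketch of the same computation on the toy graph, following edge direction:

```python
from collections import deque

edges = [("a", "b"), ("b", "c"), ("c", "a")]
nodes = ["a", "b", "c"]
landmarks = ["a", "c"]

adj = {}
for s, d in edges:
    adj.setdefault(s, []).append(d)

def distances_from(start):
    """BFS hop counts from `start` to every reachable vertex."""
    dist = {start: 0}
    q = deque([start])
    while q:
        v = q.popleft()
        for nxt in adj.get(v, []):
            if nxt not in dist:
                dist[nxt] = dist[v] + 1
                q.append(nxt)
    return dist

# For each vertex, hop counts to each reachable landmark
result = {
    v: {lm: d for lm, d in distances_from(v).items() if lm in landmarks}
    for v in nodes
}
print(result)
```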
6. Typical Use Cases in Production
Fraud Detection (Banking & E-commerce)
Build a user-device-email graph
Compute connected components to identify fraud rings
Use PageRank to find suspicious central nodes
Recommendation Engines
User–item bipartite graphs
Motif finding for co-purchases
Use GraphFrames to generate candidate sets
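For co-purchase candidates, the motif is two users pointing at the same item, e.g. g.find("(u1)-[e1]->(i); (u2)-[e2]->(i)"). A single-machine sketch of the same idea over (user, item) purchase pairs (the sample data here is hypothetical):

```python
from itertools import combinations

# Hypothetical (user, item) purchase pairs
purchases = [("u1", "book"), ("u2", "book"), ("u2", "lamp"), ("u3", "lamp")]

# Group buyers per item, then pair them: each pair shares an item,
# so each user's other purchases become candidates for the other.
buyers = {}
for user, item in purchases:
    buyers.setdefault(item, set()).add(user)

candidates = set()
for item, users in buyers.items():
    for a, b in combinations(sorted(users), 2):
        candidates.add((a, b, item))

print(sorted(candidates))  # [('u1', 'u2', 'book'), ('u2', 'u3', 'lamp')]
```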
Social Network Analysis
Detect influencers using PageRank
Community detection via LPA
Relationship path queries using BFS
Telecommunications
Network connectivity and outages
Router-path shortest paths
Detect weak points in topology graphs
Knowledge Graphs
Discover patterns using motif finding
Identify clusters and entity relationships
7. Optimization and Performance Tips
✔ Use Parquet for vertex/edge DataFrames
✔ Filter early to reduce graph size before running algorithms
✔ Avoid extremely large BFS queries (materialization can be expensive)
✔ Use caching for repeated graph operations
✔ Use ephemeral Dataproc clusters optimized for CPU/memory demands
✔ Consider GraphX (Scala-only) for extremely large-scale iterative algorithms
8. Working with GraphFrames on Dataproc
You can embed GraphFrames processing into:
Dataproc jobs
Spark SQL pipelines
Dataproc Workflow Templates
Cloud Composer DAGs
Example Dataproc job submission (gcloud has no --packages flag; Spark packages are passed via the spark.jars.packages property):
gcloud dataproc jobs submit pyspark graph_analysis.py \
--cluster=my-dataproc \
--properties=spark.jars.packages=graphframes:graphframes:0.8.2-spark3.3-s_2.12
9. Example End-to-End Workflow
Use case: Detecting fraud rings
Ingest logs and user/device data into GCS/BigQuery
Run Dataproc job:
Build graph (users, devices, emails)
Run connected components
Score highly connected clusters with statistical rules
Store results in BigQuery for analysts
Alert risk systems via Pub/Sub
GraphFrames makes step 2 straightforward and scalable.
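The scoring step in the workflow above can be sketched as a simple size threshold over component assignments. The sample rows and threshold below are hypothetical, standing in for the output of g.connectedComponents() and whatever statistical rules your risk team defines:

```python
# Hypothetical (user, component_id) rows, as produced by connected components
assignments = [
    ("u1", 0), ("u2", 0), ("u3", 0), ("u4", 0),  # dense cluster
    ("u5", 1), ("u6", 1),                        # small, likely benign
    ("u7", 2),                                   # singleton
]
RING_SIZE_THRESHOLD = 3  # assumed rule: flag unusually large clusters

sizes = {}
for _, comp in assignments:
    sizes[comp] = sizes.get(comp, 0) + 1

flagged = sorted(c for c, n in sizes.items() if n >= RING_SIZE_THRESHOLD)
print(flagged)  # [0]
```

In production this aggregation would run as a groupBy on the components DataFrame before writing results to BigQuery.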
10. Summary
GraphFrames is ideal for:
Scalable graph analytics
Fast prototyping with DataFrames
Integrating graph algorithms into Spark workflows
Running large graph workloads on Dataproc
Key Strengths
Easy integration with Spark SQL
Built-in graph algorithms
Ability to scale to millions/billions of nodes
Flexible combination of graph and tabular features