Friday, November 14, 2025


Using GraphFrames on Spark for Network Analysis



GraphFrames is a Spark package that combines graph computation with DataFrame APIs. It enables scalable graph processing using familiar Spark SQL/DataFrame operations, while also providing graph algorithms like PageRank, connected components, motif finding, and BFS.


It is widely used for:


Social network analysis


Recommendation systems


Fraud detection


Knowledge graphs


Network topology and routing


1. What Are GraphFrames?


A GraphFrame represents a graph as two Spark DataFrames:


Vertices DataFrame: each row is a node with id and additional attributes.


Edges DataFrame: each row is a directed edge with src, dst, and attributes.


Example:


vertices = spark.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30)
], ["id", "name", "age"])

edges = spark.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "a", "follow")
], ["src", "dst", "relationship"])


2. Installing GraphFrames on Dataproc or Spark

Dataproc cluster setup


You can install GraphFrames with an initialization action:


gcloud dataproc clusters create my-cluster \
  --initialization-actions gs://dataproc-initialization-actions/graphframes/graphframes.sh


Local PySpark

pyspark --packages graphframes:graphframes:0.8.2-spark3.3-s_2.12



Use the package version that matches your Spark and Scala versions.


3. Creating and Inspecting a GraphFrame

from graphframes import GraphFrame

g = GraphFrame(vertices, edges)

print("Vertices:")
g.vertices.show()

print("Edges:")
g.edges.show()


4. Core Graph Operations

A. Degrees

g.inDegrees.show()
g.outDegrees.show()
g.degrees.show()
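For the three-vertex example from Section 1, these counts can be checked by hand. A plain-Python sketch of what the degree queries compute (an illustration of the concept, not the distributed implementation):

```python
# Count in-, out-, and total degree for the example edge list by hand.
example_edges = [("a", "b"), ("b", "c"), ("c", "a")]

in_deg, out_deg, total = {}, {}, {}
for src, dst in example_edges:
    out_deg[src] = out_deg.get(src, 0) + 1
    in_deg[dst] = in_deg.get(dst, 0) + 1
    for v in (src, dst):
        total[v] = total.get(v, 0) + 1

print(total)  # {'a': 2, 'b': 2, 'c': 2} — each vertex has one in- and one out-edge
```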


B. Breadth-First Search (BFS)


Find paths from node a to nodes named “Charlie”:


paths = g.bfs("id = 'a'", "name = 'Charlie'")

paths.show()
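What `bfs` returns can be illustrated with a plain-Python breadth-first search over the example edges (a sketch of the idea, not the GraphFrames implementation):

```python
from collections import deque

edges = [("a", "b"), ("b", "c"), ("c", "a")]
adj = {}
for s, d in edges:
    adj.setdefault(s, []).append(d)

def bfs_path(start, goal):
    """Return the first shortest directed path from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(bfs_path("a", "c"))  # ['a', 'b', 'c'] — the two-hop path to Charlie
```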


C. Motif Finding (Pattern Matching)


Motif finding matches structural patterns in the graph, such as chains or triangles.


Example: find two-hop chains u1 → u2 → u3:


motifs = g.find("(u1)-[e]->(u2); (u2)-[e2]->(u3)")

motifs.show()
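The pattern above matches every pair of consecutive edges. In plain Python, the same two-hop enumeration looks like this (a sketch of what the pattern matches, not of how Spark executes it):

```python
edges = [("a", "b"), ("b", "c"), ("c", "a")]

# Every (u1, u2, u3) such that u1 -> u2 and u2 -> u3 are both edges.
chains = [
    (s1, d1, d2)
    for s1, d1 in edges
    for s2, d2 in edges
    if d1 == s2
]
print(sorted(chains))  # [('a', 'b', 'c'), ('b', 'c', 'a'), ('c', 'a', 'b')]
```

Because the result is an ordinary DataFrame, motif matches can be filtered further with standard Spark SQL expressions.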


5. Graph Algorithms Included in GraphFrames


GraphFrames provides distributed graph algorithms optimized for Spark.


A. PageRank (Google-style importance ranking)

results = g.pageRank(resetProbability=0.15, maxIter=10)

results.vertices.show()



Useful for:


Ranking influencers


Finding important documents


Prioritizing fraud investigation targets
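The idea behind `pageRank` can be sketched with a few power iterations in plain Python (one common formulation, normalized so ranks sum to 1; GraphFrames' scaling differs):

```python
edges = [("a", "b"), ("b", "c"), ("c", "a")]
nodes = {"a", "b", "c"}
out = {}
for s, d in edges:
    out.setdefault(s, []).append(d)

reset = 0.15  # same role as resetProbability above
rank = {v: 1.0 / len(nodes) for v in nodes}
for _ in range(10):
    new = {v: reset / len(nodes) for v in nodes}
    for s, targets in out.items():
        share = (1 - reset) * rank[s] / len(targets)
        for d in targets:
            new[d] += share
    rank = new

print(rank)  # the 3-cycle is symmetric, so every rank stays at 1/3
```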


B. Connected Components

components = g.connectedComponents()

components.show()



Useful for:


Community detection


Finding isolated sub-networks
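Note that recent GraphFrames releases require a checkpoint directory before running this algorithm (for example `spark.sparkContext.setCheckpointDir("/tmp/checkpoints")`). The result itself — one component label per vertex, ignoring edge direction — can be sketched with a simple union-find in plain Python:

```python
# Union-find: label each vertex with a representative of its component.
edges = [("a", "b"), ("b", "c"), ("c", "a")]
parent = {}

def find(v):
    parent.setdefault(v, v)
    while parent[v] != v:
        parent[v] = parent[parent[v]]  # path halving
        v = parent[v]
    return v

def union(u, v):
    parent[find(u)] = find(v)

for s, d in edges:
    union(s, d)  # connected components treat edges as undirected

components = {v: find(v) for v in parent}
print(components)  # all three vertices share one component label
```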


C. Strongly Connected Components

scc = g.stronglyConnectedComponents(maxIter=10)

scc.show()



Useful in:


Transaction cycle detection


Graph consistency checks


D. Label Propagation Algorithm (LPA)


For community detection:


communities = g.labelPropagation(maxIter=20)

communities.show()


E. Shortest Paths

shortest = g.shortestPaths(landmarks=["a", "c"])

shortest.show()
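`shortestPaths` returns, for each vertex, a map of hop distances to the chosen landmarks. A plain-Python BFS sketch of the same computation on the example graph:

```python
from collections import deque

edges = [("a", "b"), ("b", "c"), ("c", "a")]
adj = {}
for s, d in edges:
    adj.setdefault(s, []).append(d)

def distances_from(start):
    """Hop counts from start to every reachable vertex (directed BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for nxt in adj.get(v, []):
            if nxt not in dist:
                dist[nxt] = dist[v] + 1
                queue.append(nxt)
    return dist

landmarks = ["a", "c"]
result = {
    v: {lm: distances_from(v).get(lm) for lm in landmarks}
    for v in ("a", "b", "c")
}
print(result)  # e.g. b is 2 hops from a and 1 hop from c
```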


6. Typical Use Cases in Production

Fraud Detection (Banking & E-commerce)


Build a user-device-email graph


Compute connected components to identify fraud rings


Use PageRank to find suspicious central nodes


Recommendation Engines


User–item bipartite graphs


Motif finding for co-purchases


Use GraphFrames to generate candidate sets


Social Network Analysis


Detect influencers using PageRank


Community detection via LPA


Relationship path queries using BFS


Telecommunications


Network connectivity and outages


Router-path shortest paths


Detect weak points in topology graphs


Knowledge Graphs


Discover patterns using motif finding


Identify clusters and entity relationships


7. Optimization and Performance Tips


✔ Use Parquet for vertex/edge DataFrames

✔ Filter early to reduce graph size before running algorithms

✔ Avoid extremely large BFS queries (materialization can be expensive)

✔ Use caching for repeated graph operations

✔ Use ephemeral Dataproc clusters optimized for CPU/memory demands

✔ Consider GraphX (Scala/Java API only) for extremely large-scale iterative algorithms
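"Filter early" is often the single biggest win: dropping edge types an algorithm does not need shrinks every iteration. The idea in plain Python terms, before the data is handed to a GraphFrame (in Spark this would be a `filter` on the edges DataFrame):

```python
# Keep only the edge type the algorithm cares about before building the graph.
edge_rows = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "a", "follow"),
]
friend_edges = [(s, d) for s, d, rel in edge_rows if rel == "friend"]
print(friend_edges)  # [('a', 'b')] — a third of the original edges
```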


8. Working with GraphFrames on Dataproc


You can embed GraphFrames processing into:


Dataproc jobs


Spark SQL pipelines


Dataproc Workflow Templates


Cloud Composer DAGs


Example Dataproc job submission:

gcloud dataproc jobs submit pyspark graph_analysis.py \
    --cluster=my-dataproc \
    --properties=spark.jars.packages=graphframes:graphframes:0.8.2-spark3.3-s_2.12


9. Example End-to-End Workflow

Use case: Detecting fraud rings


Ingest logs and user/device data into GCS/BigQuery


Run Dataproc job:


Build graph (users, devices, emails)


Run connected components


Score highly connected clusters with statistical rules


Store results in BigQuery for analysts


Alert risk systems via Pub/Sub


GraphFrames makes step 2 straightforward and scalable.
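The cluster-scoring step usually reduces to simple rules over per-component aggregates. A hypothetical sketch — the feature names and thresholds here are illustrative, not from any production system:

```python
def score_component(n_users, n_shared_devices, n_shared_emails):
    """Toy risk score for one connected component (illustrative thresholds)."""
    score = 0
    if n_users >= 10:
        score += 2          # unusually large cluster of linked accounts
    if n_shared_devices >= 3:
        score += 2          # many accounts on the same devices
    if n_shared_emails >= 2:
        score += 1          # recycled contact details
    return score

# A large cluster sharing devices and emails scores highest.
print(score_component(12, 4, 3))  # 5
print(score_component(2, 0, 0))   # 0
```

In practice these aggregates come from grouping the connected-components output by component ID and joining back to the raw user/device tables.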


10. Summary

GraphFrames is ideal for:


Scalable graph analytics


Fast prototyping with DataFrames


Integrating graph algorithms into Spark workflows


Running large graph workloads on Dataproc


Key Strengths


Easy integration with Spark SQL


Built-in graph algorithms


Ability to scale to millions/billions of nodes


Flexible combination of graph and tabular features
