Using GraphFrames on Spark for Network Analysis
GraphFrames is a Spark package that combines graph computation with DataFrame APIs. It enables scalable graph processing using familiar Spark SQL/DataFrame operations, while also providing graph algorithms like PageRank, connected components, motif finding, and BFS.
It is widely used for:
Social network analysis
Recommendation systems
Fraud detection
Knowledge graphs
Network topology and routing
1. What Are GraphFrames?
A GraphFrame represents a graph using two Spark DataFrames:
Vertices DataFrame: each row is a node with id and additional attributes.
Edges DataFrame: each row is a directed edge with src, dst, and attributes.
Example:
vertices = spark.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30)
], ["id", "name", "age"])

edges = spark.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "a", "follow")
], ["src", "dst", "relationship"])
2. Installing GraphFrames on Dataproc or Spark
Dataproc cluster setup
You can install GraphFrames with an initialization action:
gcloud dataproc clusters create my-cluster \
--initialization-actions gs://dataproc-initialization-actions/graphframes/graphframes.sh
Local PySpark
pyspark --packages graphframes:graphframes:0.8.2-spark3.3-s_2.12
Use the package version that matches your Spark and Scala versions (the artifact name encodes both, e.g. spark3.3 and s_2.12).
3. Creating and Inspecting a GraphFrame
from graphframes import GraphFrame
g = GraphFrame(vertices, edges)
print("Vertices:")
g.vertices.show()
print("Edges:")
g.edges.show()
4. Core Graph Operations
A. Degrees
g.inDegrees.show()
g.outDegrees.show()
g.degrees.show()
B. Breadth-First Search (BFS)
Find paths from node a to nodes named “Charlie”:
paths = g.bfs("id = 'a'", "name = 'Charlie'")
paths.show()
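To build intuition for what g.bfs computes, here is a minimal single-machine BFS sketch in plain Python over the toy graph from section 1 (the adjacency list and name lookup are illustrations of the vertices/edges DataFrames; GraphFrames performs this search distributively):

```python
from collections import deque

# Toy graph from the earlier example: a -> b, b -> c, c -> a
edges = [("a", "b"), ("b", "c"), ("c", "a")]
names = {"a": "Alice", "b": "Bob", "c": "Charlie"}

def bfs_path(start, goal_name):
    """Shortest directed path from `start` to any vertex whose
    name matches `goal_name`, or None if unreachable."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if names[path[-1]] == goal_name:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(bfs_path("a", "Charlie"))  # ['a', 'b', 'c']
```

This mirrors the g.bfs call above: the from-expression picks the start vertex, the to-expression filters on a vertex attribute, and the result is the shortest matching path.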
C. Motif Finding (Pattern Matching)
Find triangles or other repeated structural patterns using a simple pattern language.
Example: two-hop paths, such as user → item → user in a bipartite graph:
motifs = g.find("(u1)-[e1]->(u2); (u2)-[e2]->(u3)")
motifs.show()
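A directed triangle corresponds to the pattern "(a)-[e1]->(b); (b)-[e2]->(c); (c)-[e3]->(a)". As a single-machine illustration of what that pattern matches, using the toy edge list from section 1 (which forms exactly one directed triangle, counted once per rotation):

```python
# Toy edge list: a -> b, b -> c, c -> a forms one directed triangle
edges = [("a", "b"), ("b", "c"), ("c", "a")]
edge_set = set(edges)

def directed_triangles():
    """All (x, y, z) with edges x->y, y->z, z->x."""
    found = []
    for x, y in edges:
        for y2, z in edges:
            if y2 == y and (z, x) in edge_set:
                found.append((x, y, z))
    return found

print(directed_triangles())  # one triangle, matched in 3 rotations
```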
5. Graph Algorithms Included in GraphFrames
GraphFrames provides distributed graph algorithms optimized for Spark.
A. PageRank (Google-style importance ranking)
results = g.pageRank(resetProbability=0.15, maxIter=10)
results.vertices.show()
Useful for:
Ranking influencers
Finding important documents
Prioritizing fraud investigation targets
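To see what g.pageRank is computing, here is a minimal power-iteration sketch in plain Python on the toy 3-cycle, with reset_prob playing the same role as resetProbability (note GraphFrames may scale the final ranks differently):

```python
# Minimal PageRank power iteration on the toy cycle a -> b -> c -> a.
edges = [("a", "b"), ("b", "c"), ("c", "a")]
nodes = ["a", "b", "c"]
reset_prob = 0.15  # same role as resetProbability in g.pageRank

out_links = {n: [d for s, d in edges if s == n] for n in nodes}
rank = {n: 1.0 / len(nodes) for n in nodes}

for _ in range(10):  # maxIter
    new_rank = {n: reset_prob / len(nodes) for n in nodes}
    for n in nodes:
        share = rank[n] / len(out_links[n])
        for dst in out_links[n]:
            new_rank[dst] += (1 - reset_prob) * share
    rank = new_rank

print(rank)  # symmetric cycle, so all ranks stay equal (1/3 each)
```

On this symmetric cycle every vertex ends up equally important; asymmetric link structure is what produces the ranking differences PageRank is used for.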
B. Connected Components
The default implementation requires a Spark checkpoint directory to be set first:
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")
components = g.connectedComponents()
components.show()
Useful for:
Community detection
Finding isolated sub-networks
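Connected components treats edges as undirected and groups together every vertex reachable from any other. A single-machine union-find sketch of the same grouping over the toy edges:

```python
# Union-find over the toy edges (treated as undirected).
edges = [("a", "b"), ("b", "c"), ("c", "a")]
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for s, d in edges:
    union(s, d)

# Group vertices by their component root
components = {}
for v in list(parent):
    components.setdefault(find(v), []).append(v)
print(components)  # one component containing a, b, c
```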
C. Strongly Connected Components
scc = g.stronglyConnectedComponents(maxIter=10)
scc.show()
Useful in:
Transaction cycle detection
Graph consistency checks
D. Label Propagation Algorithm (LPA)
For community detection:
communities = g.labelPropagation(maxIter=20)
communities.show()
E. Shortest Paths
shortest = g.shortestPaths(landmarks=["a", "c"])
shortest.show()
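g.shortestPaths returns, for every vertex, a distances map of hop counts to each reachable landmark. A plain-Python BFS sketch of the same computation on the toy graph, following edge direction:

```python
from collections import deque

edges = [("a", "b"), ("b", "c"), ("c", "a")]
nodes = ["a", "b", "c"]
landmarks = ["a", "c"]

adj = {}
for s, d in edges:
    adj.setdefault(s, []).append(d)

def distances_from(start):
    """BFS hop counts from `start` to every reachable vertex."""
    dist = {start: 0}
    q = deque([start])
    while q:
        v = q.popleft()
        for nxt in adj.get(v, []):
            if nxt not in dist:
                dist[nxt] = dist[v] + 1
                q.append(nxt)
    return dist

# For each vertex, hop counts to each reachable landmark
result = {
    v: {lm: d for lm, d in distances_from(v).items() if lm in landmarks}
    for v in nodes
}
print(result)
```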
6. Typical Use Cases in Production
Fraud Detection (Banking & E-commerce)
Build a user-device-email graph
Compute connected components to identify fraud rings
Use PageRank to find suspicious central nodes
Recommendation Engines
User–item bipartite graphs
Motif finding for co-purchases
Use GraphFrames to generate candidate sets
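For co-purchase candidates, the motif is two users pointing at the same item, e.g. g.find("(u1)-[e1]->(i); (u2)-[e2]->(i)"). A single-machine sketch of the same idea over (user, item) purchase pairs (the sample data here is hypothetical):

```python
from itertools import combinations

# Hypothetical (user, item) purchase pairs
purchases = [("u1", "book"), ("u2", "book"), ("u2", "lamp"), ("u3", "lamp")]

# Group buyers per item, then pair them: each pair shares an item,
# so each user's other purchases become candidates for the other.
buyers = {}
for user, item in purchases:
    buyers.setdefault(item, set()).add(user)

candidates = set()
for item, users in buyers.items():
    for a, b in combinations(sorted(users), 2):
        candidates.add((a, b, item))

print(sorted(candidates))  # [('u1', 'u2', 'book'), ('u2', 'u3', 'lamp')]
```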
Social Network Analysis
Detect influencers using PageRank
Community detection via LPA
Relationship path queries using BFS
Telecommunications
Network connectivity and outages
Router-path shortest paths
Detect weak points in topology graphs
Knowledge Graphs
Discover patterns using motif finding
Identify clusters and entity relationships
7. Optimization and Performance Tips
✔ Use Parquet for vertex/edge DataFrames
✔ Filter early to reduce graph size before running algorithms
✔ Avoid extremely large BFS queries (materialization can be expensive)
✔ Use caching for repeated graph operations
✔ Use ephemeral Dataproc clusters optimized for CPU/memory demands
✔ Consider GraphX (Scala-only) for extremely large-scale iterative algorithms
8. Working with GraphFrames on Dataproc
You can embed GraphFrames processing into:
Dataproc jobs
Spark SQL pipelines
Dataproc Workflow Templates
Cloud Composer DAGs
Example Dataproc job submission (gcloud has no --packages flag; Spark packages are passed via the spark.jars.packages property):
gcloud dataproc jobs submit pyspark graph_analysis.py \
--cluster=my-dataproc \
--properties=spark.jars.packages=graphframes:graphframes:0.8.2-spark3.3-s_2.12
9. Example End-to-End Workflow
Use case: Detecting fraud rings
Ingest logs and user/device data into GCS/BigQuery
Run Dataproc job:
Build graph (users, devices, emails)
Run connected components
Score highly connected clusters with statistical rules
Store results in BigQuery for analysts
Alert risk systems via Pub/Sub
GraphFrames makes step 2 straightforward and scalable.
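The scoring step in the workflow above can be sketched as a simple size threshold over component assignments. The sample rows and threshold below are hypothetical, standing in for the output of g.connectedComponents() and whatever statistical rules your risk team defines:

```python
# Hypothetical (user, component_id) rows, as produced by connected components
assignments = [
    ("u1", 0), ("u2", 0), ("u3", 0), ("u4", 0),  # dense cluster
    ("u5", 1), ("u6", 1),                        # small, likely benign
    ("u7", 2),                                   # singleton
]
RING_SIZE_THRESHOLD = 3  # assumed rule: flag unusually large clusters

sizes = {}
for _, comp in assignments:
    sizes[comp] = sizes.get(comp, 0) + 1

flagged = sorted(c for c, n in sizes.items() if n >= RING_SIZE_THRESHOLD)
print(flagged)  # [0]
```

In production this aggregation would run as a groupBy on the components DataFrame before writing results to BigQuery.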
10. Summary
GraphFrames is ideal for:
Scalable graph analytics
Fast prototyping with DataFrames
Integrating graph algorithms into Spark workflows
Running large graph workloads on Dataproc
Key Strengths
Easy integration with Spark SQL
Built-in graph algorithms
Ability to scale to millions/billions of nodes
Flexible combination of graph and tabular features