GraphX
Graph Analytics in Spark
Ankur Dave
Graduate Student, UC Berkeley AMPLab
Joint work with Joseph Gonzalez, Reynold Xin, Daniel Crankshaw, Michael Franklin, and Ion Stoica
Machine Learning Landscape
Model & Dependencies: Small & Dense | Sparse | Large & Dense
Architecture: MapReduce | Graph-Parallel | Parameter Server
Machine Learning Landscape
Model & Dependencies: Small & Dense | Sparse | Large & Dense
Architecture: Spark Dataflow Framework (with GraphX) | Parameter Server
Graphs
Social Networks
Web Graphs
User-Item Graphs
Graph Algorithms
PageRank
Triangle Counting
Collaborative Filtering
[Figure: the Ratings matrix (Users x Products) approximated as the product of user factors f(i) and product factors f(j).]
Collaborative Filtering
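[Figure: bipartite graph with User Factors f(1), f(2) and Product Factors f(3), f(4), f(5), connected by rating edges r13, r14, r24, r25.]
f[i] = \arg\min_{w \in \mathbb{R}^d} \sum_{j \in \mathrm{Nbrs}(i)} \left( r_{ij} - w^T f[j] \right)^2 + \lambda \|w\|_2^2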
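The per-vertex minimization above is a ridge regression, so it has a closed form. The derivation below is a sketch under the assumption that the norm term carries a regularization weight \lambda > 0 and that I_d denotes the d x d identity matrix; both symbols are notation introduced here for illustration. All sums run over j in Nbrs(i).

```latex
\begin{aligned}
0 &= \nabla_w \Big[ \sum_{j} \big( r_{ij} - w^T f[j] \big)^2 + \lambda \|w\|_2^2 \Big] \\
  &= -2 \sum_{j} \big( r_{ij} - w^T f[j] \big)\, f[j] + 2 \lambda w \\
\Rightarrow\quad
f[i] &= \Big( \sum_{j} f[j]\, f[j]^T + \lambda I_d \Big)^{-1} \sum_{j} r_{ij}\, f[j]
\end{aligned}
```

Each update reads only the factors of the neighbors of vertex i, which is what makes the computation fit the graph-parallel pattern.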
The Graph-Parallel Pattern
Many Graph-Parallel Algorithms
Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
Semi-supervised ML: Graph SSL, CoEM
Community Detection: Triangle-Counting, K-core Decomposition, K-Truss
Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
Classification: Neural Networks
Modern Analytics
[Figure: analytics pipeline over raw Wikipedia XML. Link Table (Title, Link) -> Hyperlinks -> PageRank -> Top 20 Pages (Title, PR). Editor Table (Editor, Title) -> Editor Graph -> Community Detection -> User Community (User, Com.) -> Top Communities (Com., PR).]
Tables
Graphs
The GraphX API
Property Graphs
Vertex Property:
User Profile
Current PageRank Value
Edge Property:
Weights
Relationships
Timestamps
Creating a Graph (Scala)
type VertexId = Long

val vertices: RDD[(VertexId, String)] =
  sc.parallelize(List(
    (1L, "Alice"),
    (2L, "Bob"),
    (3L, "Charlie")))

class Edge[ED](
  val srcId: VertexId,
  val dstId: VertexId,
  val attr: ED)

val edges: RDD[Edge[String]] =
  sc.parallelize(List(
    Edge(1L, 2L, "coworker"),
    Edge(2L, 3L, "friend")))

val graph = Graph(vertices, edges)

[Figure: Graph with vertices 1 Alice, 2 Bob, 3 Charlie; the edge from Alice to Bob is labeled coworker and the edge from Bob to Charlie is labeled friend.]
Graph Operations (Scala)
class Graph[VD, ED] {
  // Table Views -----------------------
  def vertices: RDD[(VertexId, VD)]
  def edges: RDD[Edge[ED]]
  def triplets: RDD[EdgeTriplet[VD, ED]]
  // Transformations -------------------
  def mapVertices[VD2](f: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](f: Edge[ED] => ED2): Graph[VD, ED2]
  def reverse: Graph[VD, ED]
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
  // Joins -----------------------------
  def outerJoinVertices[U, VD2]
      (tbl: RDD[(VertexId, U)])
      (f: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]
  // Computation -----------------------
  def mapReduceTriplets[A](
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): RDD[(VertexId, A)]
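As an illustration of the join and transformation operators listed above, the sketch below attaches a side table to the three-person example graph and then rewrites the vertex properties. The `ages` table, the object name `GraphOpsSketch`, and the output label format are hypothetical; the sketch assumes the Spark core and GraphX libraries are on the classpath and that the caller supplies a `SparkContext`.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object GraphOpsSketch {
  // Returns a label per vertex after joining a hypothetical age table onto the graph.
  def namesWithAges(sc: SparkContext): Map[VertexId, String] = {
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(List((1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
    val edges: RDD[Edge[String]] =
      sc.parallelize(List(Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend")))
    val graph = Graph(vertices, edges)

    // Hypothetical side table; Charlie is deliberately missing from it.
    val ages: RDD[(VertexId, Int)] = sc.parallelize(List((1L, 28), (2L, 35)))

    // outerJoinVertices keeps every vertex; vertices without a match see None.
    val joined: Graph[(String, Option[Int]), String] =
      graph.outerJoinVertices(ages) { (id, name, ageOpt) => (name, ageOpt) }

    // mapVertices changes the vertex property type and leaves the edges untouched.
    val labelled: Graph[String, String] = joined.mapVertices { (id, attr) =>
      attr match {
        case (name, Some(age)) => s"$name ($age)"
        case (name, None)      => s"$name (age unknown)"
      }
    }
    labelled.vertices.collect().toMap
  }
}
```

Because both operators return a new `Graph`, they can be chained without touching the edge RDD.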
Built-in Algorithms (Scala)
  // Continued from previous slide
  def pageRank(tol: Double): Graph[Double, Double]
  def triangleCount(): Graph[Int, ED]
  def connectedComponents(): Graph[VertexId, ED]
  // ...and more: org.apache.spark.graphx.lib
}
[Figures: PageRank, Triangle Count, Connected Components]
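A minimal sketch of calling the three built-in algorithms named above. The toy graph (a triangle plus a disconnected pair), the tolerance 0.001, and the object name `BuiltInAlgorithmsSketch` are illustrative assumptions; a `SparkContext` is expected from the caller.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}

object BuiltInAlgorithmsSketch {
  // Hypothetical toy graph: a triangle (1, 2, 3) plus a separate pair (4, 5).
  def toyGraph(sc: SparkContext): Graph[Int, Int] = {
    val edges = sc.parallelize(List(
      Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1), Edge(4L, 5L, 1)))
    // fromEdges creates every vertex mentioned by an edge, with default attribute 0.
    Graph.fromEdges(edges, 0)
  }

  // PageRank run until the ranks change by less than the given tolerance.
  def ranks(sc: SparkContext): Map[VertexId, Double] =
    toyGraph(sc).pageRank(0.001).vertices.collect().toMap

  // Number of triangles through each vertex. Early releases expect the graph to be
  // partitioned first and edges stored with srcId < dstId, so both hold here.
  def triangles(sc: SparkContext): Map[VertexId, Int] =
    toyGraph(sc)
      .partitionBy(PartitionStrategy.RandomVertexCut)
      .triangleCount()
      .vertices.collect().toMap

  // Each vertex is labelled with the smallest vertex ID in its component.
  def components(sc: SparkContext): Map[VertexId, VertexId] =
    toyGraph(sc).connectedComponents().vertices.collect().toMap
}
```

Each call returns a new graph whose vertex property holds the result, so the `vertices` view is the natural way to read it back.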
The triplets view
class Graph[VD, ED] {
  def triplets: RDD[EdgeTriplet[VD, ED]]
}

class EdgeTriplet[VD, ED](
  val srcId: VertexId,
  val dstId: VertexId,
  val attr: ED,
  val srcAttr: VD,
  val dstAttr: VD)

[Figure: the triplets view turns the graph (1 Alice, 2 Bob, 3 Charlie) into an RDD with one row per edge.]
srcAttr | attr     | dstAttr
Alice   | coworker | Bob
Bob     | friend   | Charlie
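To make the triplets view concrete, the sketch below renders each triplet of the three-person example graph as a string. The object name `TripletsSketch` and the arrow format are hypothetical; a `SparkContext` is expected from the caller.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object TripletsSketch {
  // One description per edge, built from the source, edge, and destination properties.
  def describeEdges(sc: SparkContext): Set[String] = {
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(List((1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
    val edges: RDD[Edge[String]] =
      sc.parallelize(List(Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend")))
    val graph = Graph(vertices, edges)

    // Each EdgeTriplet carries the edge attribute plus both endpoint attributes.
    graph.triplets
      .map(t => s"${t.srcAttr} -[${t.attr}]-> ${t.dstAttr}")
      .collect()
      .toSet
  }
}
```

Mapping each triplet to a plain value before collecting, as done here, avoids holding on to the triplet objects themselves.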
The subgraph transformation
class Graph[VD, ED] {
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
}

graph.subgraph(epred = (edge) => edge.attr != "relative")

[Figure: before, a graph on Alice, Bob, Charlie, and David with one coworker edge (Alice, Bob), one friend edge, and two relative edges (one of them Charlie, David). After subgraph, both relative edges are gone and all four vertices remain.]
The subgraph transformation
class Graph[VD, ED] {
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
}

graph.subgraph(vpred = (id, name) => name != "Bob")

[Figure: the same four-vertex graph before and after. subgraph removes Bob together with the coworker and friend edges, leaving Alice, Charlie, and David connected by the two relative edges.]
Computation with mapReduceTriplets
class Graph[VD, ED] {
  // Note: upgrade to aggregateMessages in Spark 1.2.0
  def mapReduceTriplets[A](
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): RDD[(VertexId, A)]
}

graph.mapReduceTriplets(
  edge => Iterator(
    (edge.srcId, 1),
    (edge.dstId, 1)),
  _ + _)

[Figure: mapReduceTriplets turns the graph on Alice, Bob, Charlie, and David into an RDD of degrees.]
vertex id | degree
Alice     | 2
Bob       | 2
Charlie   | 3
David     | 1
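Since the slide points to `aggregateMessages` as the replacement from Spark 1.2.0 on, here is a sketch of the same degree computation written against that operator. The vertex IDs (1 Alice, 2 Bob, 3 Charlie, 4 David), the edge directions, the endpoints of the diagonal edges, and the object name `DegreeSketch` are assumptions made for illustration; a `SparkContext` is expected from the caller.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, TripletFields, VertexId}

object DegreeSketch {
  def degrees(sc: SparkContext): Map[VertexId, Int] = {
    // Assumed IDs: 1 Alice, 2 Bob, 3 Charlie, 4 David. Edge directions are assumed too.
    val edges = sc.parallelize(List(
      Edge(1L, 2L, "coworker"),
      Edge(1L, 3L, "relative"),
      Edge(2L, 3L, "friend"),
      Edge(3L, 4L, "relative")))
    // Vertex names are not needed for counting, so every vertex gets an empty property.
    val graph = Graph.fromEdges(edges, "")

    graph.aggregateMessages[Int](
      // sendMsg: every edge sends a count of 1 to both of its endpoints.
      ctx => { ctx.sendToSrc(1); ctx.sendToDst(1) },
      // mergeMsg: counts arriving at the same vertex are summed.
      _ + _,
      // No vertex or edge attributes are read, so none need to be shipped.
      TripletFields.None
    ).collect().toMap
  }
}
```

Compared with `mapReduceTriplets`, the send function writes messages through the context instead of returning an iterator, and the optional `TripletFields` argument states which attributes the function reads.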
How GraphX Works
Encoding Property Graphs as RDDs
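[Figure: a property graph on vertices A to F is split by a vertex cut across two machines and stored as three RDDs.]
Vertex Table (RDD): Part. 1 holds A, B, C; Part. 2 holds D, E, F
Routing Table (RDD): A: 1, 2; B: 1; C: 1; D: 1, 2; E: 2; F: 2
Edge Table (RDD): Part. 1 holds (A, B), (A, C), (B, C), (C, D); Part. 2 holds (A, E), (A, F), (E, D), (E, F)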
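The relationship between the edge table and the routing table can be sketched in plain Scala collections, without Spark. The edge lists below are one reading of the figure and should be treated as an assumption, as should the names `RoutingTableSketch`, `edgeTable`, and `routingTable`; this is not GraphX's internal implementation.

```scala
object RoutingTableSketch {
  type VertexName = String
  type PartitionId = Int

  // Edge table: under a vertex cut, each edge lives in exactly one edge partition.
  val edgeTable: Map[PartitionId, Seq[(VertexName, VertexName)]] = Map(
    1 -> Seq(("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")),
    2 -> Seq(("A", "E"), ("A", "F"), ("E", "D"), ("E", "F")))

  // Routing table: for each vertex, the set of edge partitions that reference it.
  def routingTable(
      edges: Map[PartitionId, Seq[(VertexName, VertexName)]]
  ): Map[VertexName, Set[PartitionId]] =
    edges.toSeq
      // Emit (vertex, partition) for both endpoints of every edge.
      .flatMap { case (pid, es) =>
        es.flatMap { case (src, dst) => Seq(src -> pid, dst -> pid) }
      }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }
}
```

Vertices that appear in more than one partition, A and D in this reading, are the ones the cut passes through, so their properties have to be sent to both edge partitions.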
Graph System Optimizations
Specialized Data-Structures
Vertex-Cuts Partitioning
Remote Caching / Mirroring
Message Combiners
Active Set Tracking
PageRank Benchmark
EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE
Twitter Graph (42M Vertices, 1.5B Edges) | UK-Graph (106M Vertices, 3.7B Edges)
[Figure: one bar chart per graph, y-axis Runtime (Seconds), 0 to 3500 for the Twitter Graph and 0 to 9000 for the UK-Graph; the charts are annotated 7x and 18x respectively.]
GraphX performs comparably to
state-of-the-art graph processing systems.
Future of GraphX
1. Language support
a) Java API: PR #3234
b) Python API: collaborating with Intel, SPARK-3789
2. More algorithms
a) LDA (topic modeling): PR #2388
b) Correlation clustering
c) Your algorithm here?
3. Speculative
a) Streaming/time-varying graphs
b) Graph database-like queries
Thanks!
http://spark.apache.org/graphx