Unit-3
Property Graph in Spark GraphX and its components
A Property Graph in Spark GraphX is a directed multigraph where each edge and vertex can
have user-defined properties. It consists of vertices (representing entities like users or objects)
and edges (representing relationships between those entities). Each vertex has an identifier (ID)
and associated properties, while each edge connects two vertices and can also have properties.
Vertex and Edge RDDs in Spark’s GraphX framework: VertexRDD is a specialized RDD of
vertices, where each vertex is represented by a unique ID and its associated
properties. Similarly, EdgeRDD is an RDD of edges, where each edge connects two vertices
(with source and destination vertex IDs) and may carry attributes (properties). Together, they
represent the structure and data of a property graph.
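As a concrete illustration of this data model, the sketch below builds the two RDDs by hand. Note that GraphX’s Graph, VertexRDD, and EdgeRDD classes are exposed only in the Scala/Java API, so this PySpark sketch simply models the same structure with ordinary RDDs of tuples; all names and property values are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("property-graph-sketch").getOrCreate()
sc = spark.sparkContext

# Vertices: (vertex ID, properties) -- entities such as users or objects.
vertices = sc.parallelize([
    (1, {"name": "Alice", "age": 30}),
    (2, {"name": "Bob", "age": 27}),
    (3, {"name": "Carol", "age": 35}),
])

# Edges: (source vertex ID, destination vertex ID, edge property).
edges = sc.parallelize([
    (1, 2, "follows"),
    (2, 3, "follows"),
    (3, 1, "likes"),
])

# Together, the two RDDs hold the structure and data of the property graph.
print(vertices.collect())
print(edges.collect())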
PageRank Algorithm: PageRank is a link analysis algorithm originally used by Google to rank
web pages in search engine results. It works by determining the importance of each page based
on the number and quality of links to it. The core idea is that if a page is linked to by many other
important pages, it is considered more important.
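The scores are computed iteratively. With damping factor d (commonly 0.85) and N pages in total, each page p is updated as

PR(p) = (1 − d)/N + d × Σ PR(q)/L(q),

where the sum runs over every page q that links to p, and L(q) is the number of outbound links on page q. The example below applies this formula with d = 0.85 and N = 4.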
Example: Consider a small network of four web pages (A, B, C, and D) with the following link
structure:
Page A links to B and C.
Page B links to A and C.
Page C links to A.
Page D links to C.
Initial PageRank Values:
We start with all pages having an equal initial PageRank (PR) of 1.0.
Iteration 1:
1. Page A:
o Inbound links from B and C.
o Contribution from B: PR(B)/L(B) = 1.0/2 = 0.5.
o Contribution from C: PR(C)/L(C) = 1.0/1 = 1.0.
o New PR(A): (1 − 0.85)/4 + 0.85 × (0.5 + 1.0) = 0.0375 + 1.275 = 1.3125.
2. Page B:
o Inbound link from A.
o Contribution from A: PR(A)/L(A) = 1.0/2 = 0.5.
o New PR(B): (1 − 0.85)/4 + 0.85 × 0.5 = 0.0375 + 0.425 = 0.4625.
3. Page C:
o Inbound links from A, B, and D.
o Contribution from A: PR(A)/L(A) = 1.0/2 = 0.5.
o Contribution from B: PR(B)/L(B) = 1.0/2 = 0.5.
o Contribution from D: PR(D)/L(D) = 1.0/1 = 1.0.
o New PR(C): (1 − 0.85)/4 + 0.85 × (0.5 + 0.5 + 1.0) = 0.0375 + 1.7 = 1.7375.
4. Page D:
o No inbound links (no page links to D), so there are no contributions.
o New PR(D): (1 − 0.85)/4 = 0.0375.
After Iteration 1, the PageRank values are:
PR(A) = 1.3125
PR(B) = 0.4625
PR(C) = 1.7375
PR(D) = 0.0375
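These numbers can be reproduced with a short pure-Python sketch of one iteration of the formula above (plain Python for checking the arithmetic, not GraphX’s built-in pageRank operator):

d = 0.85  # damping factor
N = 4     # number of pages
links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"], "D": ["C"]}
pr = {p: 1.0 for p in links}  # initial PageRank of 1.0 for every page

new_pr = {}
for page in links:
    # Sum the contributions PR(q)/L(q) from every page q linking to this page.
    incoming = sum(pr[q] / len(out) for q, out in links.items() if page in out)
    new_pr[page] = (1 - d) / N + d * incoming

print(new_pr)  # {'A': 1.3125, 'B': 0.4625, 'C': 1.7375, 'D': 0.0375}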
Caching in Spark
Caching in Spark is a performance optimization technique used to store intermediate results or
frequently accessed data in memory (or on disk, if memory is limited). This avoids
re-executing the same chain of transformations every time an action is run on that data.
Method to Cache an RDD or DataFrame in Spark:
Spark provides two main methods for caching:
1. cache():
o This method stores data at the default storage level: MEMORY_ONLY for RDDs and
MEMORY_AND_DISK for DataFrames. If memory runs out, Spark may evict cached
partitions and recompute them when needed.
o Syntax for RDD:
rdd.cache()
o Syntax for DataFrame:
dataframe.cache()
o When to use: Use cache() when you expect data to fit mostly in memory, and you
want faster access without reloading from disk.
2. persist():
o persist() is more flexible than cache(). While cache() always uses the default
storage level, persist() lets you specify a different one, such as memory only, disk
only, or a combination of both.
o Some common storage levels:
MEMORY_ONLY: Stores the data only in memory.
MEMORY_AND_DISK: Stores the data in memory but writes to disk if
memory is insufficient.
DISK_ONLY: Stores the data only on disk.
MEMORY_ONLY_SER: Stores the data in serialized format in memory,
reducing memory consumption.
o Syntax for RDD:
rdd.persist(StorageLevel.MEMORY_AND_DISK)
o Syntax for DataFrame:
dataframe.persist(StorageLevel.MEMORY_AND_DISK)
o Example of using persist() with a specific storage level:
from pyspark import StorageLevel
rdd.persist(StorageLevel.MEMORY_AND_DISK)
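Putting it together, a typical caching workflow looks like the sketch below (the DataFrame here is generated with spark.range purely for illustration): mark the data for caching, materialize it with a first action, reuse it in later actions, and release it with unpersist() when finished.

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("caching-example").getOrCreate()

# Illustrative DataFrame; in practice this would come from a real data source.
df = spark.range(1000000).withColumnRenamed("id", "value")

# Mark the DataFrame for caching; nothing is stored yet (caching is lazy).
df.persist(StorageLevel.MEMORY_AND_DISK)

# The first action materializes the cache...
print(df.count())

# ...and later actions reuse the cached data instead of recomputing it.
print(df.filter("value % 2 = 0").count())

# Release the cached data when it is no longer needed.
df.unpersist()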