COMPUTER SCIENCE DEPARTMENT, UNIVERSITY OF BONN
Lab Distributed Big Data Analytics
Worksheet-3: Spark GraphX and Spark SQL operations
Dr. Hajira Jabeen, Gezim Sejdiu, Prof. Dr. Jens Lehmann
May 9, 2017
In this lab we are going to perform basic Spark SQL and Spark GraphX operations
(described in “Spark Fundamentals II”). Spark SQL provides the ability to write SQL-like
queries that run on Spark. Its main abstraction is the SchemaRDD (called DataFrame in
recent Spark versions), an RDD with a schema against which you can run SQL, HiveQL,
and Scala operations.
GraphX is the new Spark API for graphs and graph-parallel computation. At a high level,
GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property
Graph: a directed multigraph with properties attached to each vertex and edge.
In this lab, you will use Spark SQL and GraphX to find the subject distribution over an .nt file.
The purpose is to demonstrate how to use the Spark SQL and GraphX libraries on Spark.
IN CLASS
1. Spark SQL operations
a. After the file (page_links_simple.nt.bz2) has been downloaded, unzipped, and
uploaded to HDFS under your /yourname folder, create an RDD out of it.
b. First create a Scala case class Triple holding the information of a triple read
from the file; it will be used as the schema. Since the data is an .nt file whose
lines are triples in the format <subject> <predicate> <object>, the raw lines need
to be transformed into this representation (see the sketch after this list).
Hint: use the map function.
c. Create an RDD of Triple objects.
d. Use the filter transformation to return a new RDD with a subset of the triples
in the file, by removing lines that start with “#”, which in an .nt file marks a
comment.
e. Run SQL statements using the sql method provided by the SQLContext; queries
not covered by the solution below are sketched right after it:
i. Take all triples related to ‘Category:Events’.
ii. Take all triples with the predicate ‘author’.
iii. Take all triples authored by ‘Andre_Engels’.
iv. Count how many times each subject has been used in our dataset.
f. Since the result is a SchemaRDD, all RDD operations work out of the box. Using
the map function, collect and print out the subjects and their frequencies.
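Both solutions on this worksheet call a TripleUtils.parsTriples helper that is not listed (in the original code it lives in the tech.sda.arcana.spark.worksheet3 package). A minimal sketch of the Triple class and the parser it assumes could look as follows; the field names match the solution code, while the parsing rules are a simplification that ignores literals containing spaces:

case class Triple(subject: String, predicate: String, `object`: String)

object TripleUtils {
  // Strip the angle brackets from an IRI term.
  private def stripBrackets(term: String): String =
    term.trim.stripPrefix("<").stripSuffix(">")

  // Parse one N-Triples line of the form "<subject> <predicate> <object> ."
  // (comment lines starting with "#" are assumed to be filtered out beforehand).
  def parsTriples(line: String): Triple = {
    val cleaned = line.trim.stripSuffix(".").trim
    val parts = cleaned.split("\\s+", 3)
    Triple(stripBrackets(parts(0)), stripBrackets(parts(1)), stripBrackets(parts(2)))
  }
}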
------------------------------------------------------Solution----------------------------------------------------------
import org.apache.spark.sql.SparkSession

object sqlab {

  def main(args: Array[String]): Unit = {
    val input = "src/main/resources/rdf.nt" // args(0)

    val spark = SparkSession.builder
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .appName("SparkSQL example")
      .getOrCreate()

    import spark.implicits._

    // Read the .nt file, parse every line into a Triple and convert to a DataFrame.
    val tripleDF = spark.sparkContext.textFile(input)
      .map(TripleUtils.parsTriples)
      .toDF()

    tripleDF.show()
    // tripleDF.collect().foreach(println(_))

    // Register the DataFrame as a temporary view so it can be queried with SQL.
    tripleDF.createOrReplaceTempView("triple")

    // (e.i) All triples related to 'Category:Events'.
    val sqlText =
      "SELECT * FROM triple WHERE subject = 'http://commons.dbpedia.org/resource/Category:Events'"
    val triplerelatedtoEvents = spark.sql(sqlText)
    triplerelatedtoEvents.collect().foreach(println(_))

    // (e.iv / f) Count how many times each subject is used in the dataset.
    val subjectdistribution =
      spark.sql("SELECT subject, count(*) FROM triple GROUP BY subject")

    println("subjectdistribution:")
    subjectdistribution.collect().foreach(println(_))
  }
}
---------------------------------------------------------------------------------------------------
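The solution above covers queries (i) and (iv). Hedged sketches for the remaining queries (ii) and (iii), and for printing the subject frequencies of step (f), are shown below; they assume that the spark session, the triple view, and the subjectdistribution DataFrame from the solution are still in scope, and that the predicate and object IRIs merely contain the strings 'author' and 'Andre_Engels' (both assumptions about the data):

// (e.ii) All triples whose predicate is 'author'.
val authorTriples = spark.sql("SELECT * FROM triple WHERE predicate LIKE '%author%'")
authorTriples.show()

// (e.iii) All triples authored by 'Andre_Engels' (the author is assumed to appear as the object).
val byAndreEngels = spark.sql(
  "SELECT * FROM triple WHERE predicate LIKE '%author%' AND `object` LIKE '%Andre_Engels%'")
byAndreEngels.show()

// (f) The query result is itself a DataFrame (SchemaRDD), so ordinary RDD operations apply:
// map every row to (subject, count) and print it.
subjectdistribution.rdd
  .map(row => (row.getString(0), row.getLong(1)))
  .collect()
  .foreach { case (s, c) => println(s"$s -> $c") }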
2. Spark GraphX operations
a. After the file (page_links_simple.nt.bz2) has been downloaded, unzipped, and
uploaded to HDFS under your /yourname folder, create an RDD out of it.
b. First create a Scala case class Triple holding the information of a triple read
from the file. Since the data is an .nt file whose lines are triples in the format
<subject> <predicate> <object>, the raw lines need to be transformed into this
representation. Hint: use the map function.
c. Use the filter transformation to return a new RDD with a subset of the triples
in the file, by removing lines that start with “#”, which in an .nt file marks a
comment.
d. Perform these operations to transform your data into a GraphX graph:
i. Generate vertices by taking the distinct subjects and objects as vertex values
and assigning each one a VertexId.
ii. Create edges by using the subject as a key to join with the vertices and
generating an Edge in the format (s_index, obj_index, predicate).
e. Compute connected components for triples containing “author” as a
predicate.
f. Compute the triangle count (see the sketch after the solution).
g. List the top 5 connected components by applying PageRank over them.
------------------------------------------------------Solution----------------------------------------------------------
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object graphxlab {

  def main(args: Array[String]): Unit = {
    val input = "src/main/resources/rdf.nt" // args(0)

    val spark = SparkSession.builder
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .appName("GraphX example")
      .getOrCreate()

    // Read the .nt file and parse every line into a Triple.
    val tripleRDD = spark.sparkContext.textFile(input)
      .map(TripleUtils.parsTriples)

    // (Subject, Object) pairs; not used further, but shows the basic mapping.
    val tupleSubjectObject = tripleRDD.map { x => (x.subject, x.`object`) }

    type VertexId = Long

    // (d.i) Index every distinct subject and object so it can serve as a vertex.
    val indexVertexID = (tripleRDD.map(_.subject) union tripleRDD.map(_.`object`))
      .distinct()
      .zipWithIndex()

    val vertices: RDD[(VertexId, String)] = indexVertexID.map(f => (f._2, f._1))

    // (d.ii) Join the triples with the vertex index, first on the subject ...
    val tuples = tripleRDD.keyBy(_.subject).join(indexVertexID).map({
      case (k, (Triple(s, p, o), si)) => (o, (si, p))
    })

    // ... then on the object, to build edges in the format (s_index, obj_index, predicate).
    val edges: RDD[Edge[String]] = tuples.join(indexVertexID).map({
      case (k, ((si, p), oi)) => Edge(si, oi, p)
    })

    val graph = Graph(vertices, edges)

    graph.vertices.collect().foreach(println(_))
    println("edges")
    graph.edges.collect().foreach(println(_))

    // (e) Subgraph restricted to a single predicate (the worksheet asks for 'author';
    // this solution filters on the 'source' property).
    val sourceSubgraph = graph.subgraph(t => t.attr == "http://commons.dbpedia.org/property/source")
    println("sourceSubgraph")
    sourceSubgraph.vertices.collect().foreach(println(_))

    // Connected components of the predicate-restricted subgraph.
    val conncompo = sourceSubgraph.connectedComponents()

    // (g) PageRank over the whole graph; join the ranks back to the vertex labels
    // and sort by rank, descending.
    val pagerank = graph.pageRank(0.0001)
    val printoutrankedtriples = pagerank.vertices.join(graph.vertices)
      .map({ case (k, (r, v)) => (k, r, v) })
      .sortBy(_._2, ascending = false)

    println("printoutrankedtriples")
    printoutrankedtriples.take(5).foreach(println(_))
  }
}
---------------------------------------------------------------------------------------------------
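The solution computes the connected components but never prints them, and it omits step (f) entirely. A minimal sketch of both, assuming the graph and conncompo values from the solution above are still in scope:

// (e) Print every vertex label together with the id of its connected component.
conncompo.vertices
  .join(graph.vertices)
  .map { case (id, (component, label)) => (component, label) }
  .collect()
  .foreach(println(_))

// (f) Triangle count per vertex. The GraphX guide recommends a canonical edge
// orientation and partitioning the graph with partitionBy before counting.
import org.apache.spark.graphx.PartitionStrategy
val triangles = graph.partitionBy(PartitionStrategy.RandomVertexCut).triangleCount()
println("triangles per vertex:")
triangles.vertices.collect().foreach(println(_))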
AT HOME
1. Read and explore
a. Spark SQL, DataFrames and Datasets Guide
b. GraphX Programming Guide
2. RDF Class Distribution - using Spark SQL, count the usage of the respective classes
of an RDF dataset.
Hint: a class fulfils the rule (?predicate = rdf:type && ?object.isIRI()).
a. Read the .nt file into an RDD of triples.
b. Apply a map function to separate triples into (Subject, Predicate, Object).
c. Apply a filter transformation to select the class-defining triples.
d. Count the frequencies of the objects by using an SQL statement.
e. Return the top 100 classes used in the dataset (a sketch follows this list).
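A minimal sketch of this exercise, reusing the Triple class and parsTriples parser sketched earlier and treating any object that starts with http:// as an IRI (a simplification):

import org.apache.spark.sql.SparkSession

object classDistribution {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("RDF class distribution")
      .getOrCreate()
    import spark.implicits._

    // (a, b) Read the file, drop comment lines and parse each line into a Triple.
    val triples = spark.sparkContext.textFile("src/main/resources/rdf.nt")
      .filter(line => !line.startsWith("#"))
      .map(TripleUtils.parsTriples)

    // (c) Keep only class-defining triples: predicate = rdf:type and the object is an IRI.
    val rdfType = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    val classTriples = triples
      .filter(t => t.predicate == rdfType && t.`object`.startsWith("http://"))
      .toDF()
    classTriples.createOrReplaceTempView("classes")

    // (d, e) Count the frequency of each class and return the 100 most frequent ones.
    val top100 = spark.sql(
      "SELECT `object` AS class, count(*) AS freq FROM classes GROUP BY `object` ORDER BY freq DESC LIMIT 100")
    top100.show(100, truncate = false)
  }
}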
3. Using GraphX To Analyze a Real Graph (a sketch of some of these steps follows this list)
a. Count the number of vertices and edges in the graph
b. How many resources are in your graph?
c. What is the max in-degree of this graph?
d. Which triples are related to ‘Category:Events’?
e. Run PageRank for 50 iterations.
f. Compute the similarity between two nodes using Spark GraphX
i. Apply different similarity measures
1. Jaccard similarity
2. Edit distance
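A minimal sketch of steps a, c, e, and a Jaccard similarity over neighbour sets, assuming the graph built in the GraphX solution above is in scope; defining the similarity over common neighbours is one possible choice, not the only one:

// (a) Number of vertices and edges in the graph.
println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")

// (c) Maximum in-degree: inDegrees is an RDD of (VertexId, in-degree) pairs
// (vertices with no incoming edges are not listed).
val maxInDegree = graph.inDegrees.reduce((a, b) => if (a._2 > b._2) a else b)
println(s"max in-degree: $maxInDegree")

// (e) PageRank with a fixed number of iterations instead of a convergence tolerance.
val ranks50 = graph.staticPageRank(50)
ranks50.vertices.take(5).foreach(println(_))

// (f.i.1) Jaccard similarity between two vertices, defined here over their neighbour
// sets: |N(a) ∩ N(b)| / |N(a) ∪ N(b)|. The neighbour ids are recomputed on every
// call; cache them when comparing many pairs.
import org.apache.spark.graphx.{EdgeDirection, VertexId}
def jaccard(a: VertexId, b: VertexId): Double = {
  val neighbours = graph.collectNeighborIds(EdgeDirection.Either)
  val na = neighbours.lookup(a).headOption.map(_.toSet).getOrElse(Set.empty[VertexId])
  val nb = neighbours.lookup(b).headOption.map(_.toSet).getOrElse(Set.empty[VertexId])
  if (na.isEmpty && nb.isEmpty) 0.0
  else na.intersect(nb).size.toDouble / na.union(nb).size
}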
4. Further readings
a. Spark SQL: Relational Data Processing in Spark
b. Shark: SQL and Rich Analytics at Scale
c. GraphX: Graph Processing in a Distributed Dataflow Framework