B.TECH ARTIFICIAL INTELLIGENCE AND DATA SCIENCE 21PD21 STREAMING ANALYTICS
Ex no : 6
File Handling in Spark
Date :
AIM:
To demonstrate file handling in Spark.
PROCEDURE:
1. Analyse the input data.
2. Open the Spark terminal.
3. Configure Spark.
4. Import the necessary Spark libraries.
5. Develop the code.
6. Execute the program.
PROGRAM:
To create a List RDD with partitions
> val intList = List(1, 2, 3, 4, 5) //Sample input list
> val intListRDD = sc.parallelize(intList)
> intListRDD.partitions.size //Default number of partitions created for the RDD
//Create an RDD from a file (e.g. a .tsv)
> val fileRDD = sc.textFile("/user/location/file.tsv")
> fileRDD.first()
> fileRDD.take(10) //Prints the output, but not in a readable form
> fileRDD.take(10).foreach(println) //Prints one record per line
> fileRDD.partitions.size //Check the number of partitions created
> val data = sc.textFile("/user/location/file.tsv", 10) //Create the RDD with 10 partitions
RDD Transformation – flatMap
> val data = Array("Hello There", "Welcome to Spark 2.0", "This is Prakash, your instructor for this course", "Enjoy learning Spark", "Happy Coding")
> val dataRDD = sc.parallelize(data)
> val filterRDD = dataRDD.filter(line => line.length > 15)
> filterRDD.collect()
> filterRDD.collect.foreach(println)
> val mapRDD = dataRDD.map(line => line.split(" "))
> mapRDD.collect()
> val flatMapRDD = dataRDD.flatMap(line => line.split(" "))
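The difference between map and flatMap shows up in the shape of the collected results: map keeps one output element per input line, while flatMap flattens the per-line word arrays into a single collection. A quick check of the shapes (a minimal sketch using the RDDs defined above):
> flatMapRDD.collect() //Array[String] – every word becomes a separate element
> mapRDD.count() //5 – one element per input line
> flatMapRDD.count() //Total number of words across all the lines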
RDD Transformation – mapPartitions
With map() or foreach(), the number of times we would need to initialise (for example, open a database connection) is equal to the number of elements in the RDD, whereas with mapPartitions() the number of initialisations is equal to the number of partitions.
//Initialise with 3 partitions
scala> val rdd1 = sc.parallelize(List("yellow", "red", "blue", "cyan", "black"), 3)
scala> val mapped = rdd1.mapPartitionsWithIndex{
     | // 'index' is the partition number
     | // 'iterator' iterates through all the elements
     | // in that partition
     | (index, iterator) => {
     |   println("Called in Partition -> " + index)
     |   val myList = iterator.toList
     |   // In a typical use case, the initialisation (e.g. opening
     |   // a database connection) would be done here, before
     |   // iterating through each element of the partition
     |   myList.map(x => x + " -> " + index).iterator
     | }
     | }
scala> mapped.collect() //Action that triggers the computation; the println runs once per partition
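To make the initialisation counts described above visible, the following is a minimal sketch (the accumulator names are illustrative) that counts how many times a per-record versus a per-partition set-up step runs on the same rdd1 with 3 partitions:
scala> val perRecord = sc.longAccumulator("perRecord")
scala> val perPartition = sc.longAccumulator("perPartition")
scala> rdd1.map { x => perRecord.add(1); x }.collect //set-up step runs once per element
scala> rdd1.mapPartitions { it => perPartition.add(1); it }.collect //set-up step runs once per partition
scala> perRecord.value //5 – one per element
scala> perPartition.value //3 – one per partition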
SET Operations (Union – Narrow Transformation)
> val setRDD1 = sc.parallelize(Array(2,4,6,8,10))
> val setRDD2 = sc.parallelize(Array(1,2,3,4,5))
> setRDD1.union(setRDD2).collect //Union – Narrow
> setRDD1.intersection(setRDD2).collect //Intersection – wide
> setRDD1.subtract(setRDD2).collect //Subtract – wide
> setRDD1.cartesian(setRDD2).collect //Cartesian – wide – too costly for large data
//SET Operation – distinct element extraction
> val numArray = Array(1,2,3,5,2,3,5,6,7,4,7,4)
> val numRDD = sc.parallelize(numArray)
> val distinctElementRDD = numRDD.distinct()
> distinctElementRDD.collect()
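The narrow/wide distinction noted in the comments above can be checked from partition counts and lineage: union simply concatenates its parents' partitions (no shuffle), while intersection and subtract repartition the data through a shuffle. A minimal sketch (exact partition counts depend on the local default parallelism):
> setRDD1.partitions.size //Default partitions of one parent
> setRDD1.union(setRDD2).partitions.size //Sum of both parents' partition counts – narrow, no shuffle
> setRDD1.intersection(setRDD2).partitions.size //Determined by the shuffle's partitioner – wide
> setRDD1.intersection(setRDD2).toDebugString //Lineage should show a shuffle (cogroup) stage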
RESULT:
Thus, file handling using Spark has been executed successfully.
Ex no : 7
MANIPULATING RDD FUNCTIONS IN SCALA
Date :
AIM :
To manipulate RDDs using Scala functions in Spark.
PROCEDURE:
1. Open the Spark terminal.
2. Restart all the daemons.
3. Configure Spark.
4. Import the necessary Spark libraries.
5. Develop the code.
6. Execute the program.
PROGRAM:
cd $SPARK_HOME
./sbin/stop-all.sh
./sbin/start-all.sh
spark-shell
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val data = Array("Hello there", "Welcome to Spark", "Let's learn together", "Do well")
val dataRDD = sc.parallelize(data)
dataRDD.filter(line => line.length > 5).collect
dataRDD.filter(line => line.length > 5).collect.foreach(println)
dataRDD.map(line => line.split(" ")).collect
dataRDD.flatMap(line => line.split(" ")).collect
OUTPUT :
RESULT:
Thus, the manipulations of RDD functions using Scala have been executed successfully.
Ex no : 8
SPARK STREAMING DATA
Date :
AIM:
To analyse Spark streaming data using Spark GraphX.
PROCEDURE:
1. Analyse the graph.
2. Open the Spark terminal.
3. Configure Spark.
4. Import the necessary Spark libraries.
5. Develop the code.
6. Execute the program.
PROGRAM:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
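// Vertices are (VertexId, property) pairs; here the property is a (name, age) tuple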
val vertexArray = Array(
(1L, ("Alice", 28)),
(2L, ("Bob", 27)),
(3L, ("Charlie", 65)),
(4L, ("David", 42)),
(5L, ("Ed", 55)),
(6L, ("Fran", 50))
)
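// Edges are Edge(srcVertexId, dstVertexId, attribute); here the Int attribute is a weight used by the triplet filter below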
val edgeArray = Array(
Edge(2L, 1L, 7),
Edge(2L, 4L, 2),
Edge(3L, 2L, 4),
Edge(3L, 6L, 3),
Edge(4L, 1L, 1),
Edge(5L, 2L, 2),
Edge(5L, 3L, 8),
Edge(5L, 6L, 3)
)
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect.foreach {
case (id, (name, age)) => println(s"$name is $age")
}
717822I131 MOHAMED MARAKFHAN S
B.TECH ARTIFICIAL INTELLIGENCE AND DATA SCIENCE 21PD21 STREAMING ANALYTICS
for (triplet <- graph.triplets.filter(t => t.attr > 5).collect) {
println(s"${triplet.srcAttr._1} loves ${triplet.dstAttr._1}")
}
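Sanity check (print order may vary because the RDDs are distributed): the vertex filter should report Charlie (65), David (42), Ed (55) and Fran (50), and the triplet filter (attr > 5) matches Edge(2L, 1L, 7) and Edge(5L, 3L, 8), so it should print "Bob loves Alice" and "Ed loves Charlie".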
OUTPUT:
RESULT:
Thus, the Spark streaming data using Spark GraphX has been analysed and executed successfully.