SA Lab Manual

The document outlines various exercises related to file handling and RDD manipulation using Spark in Scala. It includes procedures for creating RDDs, performing transformations, and analyzing data with Spark GraphX. Each exercise concludes with a successful execution result, demonstrating the capabilities of Spark for data processing and analysis.

B.TECH ARTIFICIAL INTELLIGENCE AND DATA SCIENCE – 21PD21 STREAMING ANALYTICS
717822I131 MOHAMED MARAKFHAN S

Ex no : 6
File Handling in Spark
Date :

AIM:
To demonstrate file handling in Spark.

PROCEDURE:
1. Analyse the input data.
2. Open the Spark terminal.
3. Configure Spark.
4. Import the necessary Spark libraries.
5. Develop the code.
6. Execute the program.

PROGRAM:
To create a list RDD with partitions:
> val intList = List(1, 2, 3, 4, 5)             //sample input list (assumed; not given in the original)
> val intListRDD = sc.parallelize(intList, 4)   //partition count of 4 assumed for illustration
> intListRDD.partitions.size                    //partitions created explicitly
> val listRDD = sc.parallelize(intList)
> listRDD.partitions.size                       //default number of partitions created by listRDD
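To see which elements ended up in each partition, glom() from the standard RDD API can be used (this call is not in the original listing):
> intListRDD.glom().collect()                   //returns one array of elements per partition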

//Create an RDD from a file (e.g. a .tsv)
> val fileRDD = sc.textFile("/user/location/file.tsv")
> fileRDD.first()
> fileRDD.take(10)                     //Prints the output, but not user friendly to read the data
> fileRDD.take(10).foreach(println)
> fileRDD.partitions.size              //Check the number of partitions created
> val data = sc.textFile("/user/location/file.tsv", 10)   //Create an RDD with 10 partitions

RDD Transformation – flatMap
> val data = Array("Hello There", "Welcome to Spark 2.0", "This is Prakash, your instructor for this course", "Enjoy learning spark", "Happy Coding")
> val dataRDD = sc.parallelize(data)


> val filterRDD = dataRDD.filter(line => line.length > 15)
> filterRDD.collect()
> filterRDD.collect.foreach(println)
> val mapRDD = dataRDD.map(line => line.split(" "))
> mapRDD.collect()
> val flatMapRDD = dataRDD.flatMap(line => line.split(" "))
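The difference shows up when the results are collected: map keeps one array of words per input line, whereas flatMap flattens everything into a single collection of words. A minimal check (these extra calls are not in the original listing):
> mapRDD.collect()                     //Array[Array[String]] – one word array per line
> flatMapRDD.collect()                 //Array[String] – all words in one flat collection
> flatMapRDD.collect.foreach(println)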

RDD Transformation – mapPartitions


With map() or foreach(), any per-record setup (for example, initialising a database connection) has to run once for every element in the RDD. With mapPartitions() (or mapPartitionsWithIndex(), used below), that initialisation runs only once per partition.

//Initialize an RDD with 3 partitions
scala> val rdd1 = sc.parallelize(List("yellow", "red", "blue", "cyan", "black"), 3)

scala> val mapped = rdd1.mapPartitionsWithIndex{
     |   // 'index' is the partition number
     |   // 'iterator' iterates through all the elements in the partition
     |   (index, iterator) => {
     |     println("Called in Partition -> " + index)
     |     val myList = iterator.toList
     |     // In a normal use case, the initialization (e.g. opening a database connection)
     |     // would be done here, before iterating through each element
     |     myList.map(x => x + " -> " + index).iterator
     |   }
     | }
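Defining mapped alone does not run anything; an action is needed to trigger the computation and show the per-partition message. A minimal follow-up (not in the original listing):

scala> mapped.collect().foreach(println)   //prints "Called in Partition -> N" once per partition, then each mapped element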

SET Operations (Union – Narrow Transformation)


> val setRDD1 = sc.parallelize(Array(2, 4, 6, 8, 10))
> val setRDD2 = sc.parallelize(Array(1, 2, 3, 4, 5))
> setRDD1.union(setRDD2).collect          //Union – narrow transformation
> setRDD1.intersection(setRDD2).collect   //Intersection – wide transformation
> setRDD1.subtract(setRDD2).collect       //Subtract – wide transformation
> setRDD1.cartesian(setRDD2).collect      //Cartesian – wide transformation; too costly for large data

//SET operation: distinct element extraction
> val numArray = Array(1, 2, 3, 5, 2, 3, 5, 6, 7, 4, 7, 4)
> val numRDD = sc.parallelize(numArray)
> val distinctElementRDD = numRDD.distinct()
> distinctElementRDD.collect()

RESULT:
Thus, file handling using Spark has been executed successfully.


Ex no : 7
MANIPULATING RDD FUNCTIONS IN SCALA
Date :

AIM :
To manipulate RDDs and perform a word count using Spark.

PROCEDURE :

1. Open the Spark terminal.
2. Restart all the daemons.
3. Configure Spark.
4. Import the necessary Spark libraries.
5. Develop the code.
6. Execute the program.

PROGRAM:
cd $SPARK_HOME
./sbin/stop-all.sh
./sbin/start-all.sh
spark-shell

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val data = Array("Hello there", "Welcome to Spark", "Let's learn together", "Do well")
val dataRDD = sc.parallelize(data)
dataRDD.filter(line => line.length > 5).collect
dataRDD.filter(line => line.length > 5).collect.foreach(println)
dataRDD.map(line => line.split(" ")).collect
dataRDD.flatMap(line => line.split(" ")).collect
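The AIM mentions word count; the listing above stops at individual transformations, so here is a minimal word-count sketch built from the same operations (reduceByKey added; the variable name wordCounts is illustrative):

val wordCounts = dataRDD
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey(_ + _)                 // sum the counts for each word
wordCounts.collect.foreach(println)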


OUTPUT :

RESULT:
Thus, the manipulation of RDD functions using Scala has been executed successfully.


Ex no : 8
SPARK STREAMING DATA
Date :

AIM:
To analyse Spark streaming data using Spark GraphX.

PROCEDURE:
1. Analyse the graph.
2. Open the Spark terminal.
3. Configure Spark.
4. Import the necessary Spark libraries.
5. Develop the code.
6. Execute the program.

PROGRAM:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertexArray = Array(
(1L, ("Alice", 28)),
(2L, ("Bob", 27)),
(3L, ("Charlie", 65)),
(4L, ("David", 42)),
(5L, ("Ed", 55)),
(6L, ("Fran", 50))
)
val edgeArray = Array(
Edge(2L, 1L, 7),
Edge(2L, 4L, 2),
Edge(3L, 2L, 4),
Edge(3L, 6L, 3),
Edge(4L, 1L, 1),
Edge(5L, 2L, 2),
Edge(5L, 3L, 8),
Edge(5L, 6L, 3)
)
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
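// Optional sanity check (not in the original listing): size of the constructed graph
graph.numVertices   // 6 vertices
graph.numEdges      // 8 edges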
graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect.foreach {
case (id, (name, age)) => println(s"$name is $age")
}

for (triplet <- graph.triplets.filter(t => t.attr > 5).collect) {
  println(s"${triplet.srcAttr._1} loves ${triplet.dstAttr._1}")
}
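As an optional extension (not part of the original program), the same graph can be queried for structural properties with the standard GraphX API, for example the in-degree of each vertex joined back to the person's name:

graph.inDegrees.join(graph.vertices).collect.foreach {
  case (id, (inDeg, (name, age))) => println(s"$name has in-degree $inDeg")
}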

OUTPUT:

RESULT:
Thus, the Spark streaming data has been analysed using Spark GraphX and executed successfully.
