B.TECH ARTIFICIAL INTELLIGENCE AND DATA SCIENCE 21PD21 STREAMING ANALYTICS
Ex no : 6
File Handling in Spark
Date :
AIM:
To demonstrate file handling in Spark.
PROCEDURE:
1. Analyse the input data.
2. Open the Spark terminal.
3. Configure Spark.
4. Import the necessary Spark libraries.
5. Develop the code.
6. Execute the program.
PROGRAM:
To create a List RDD with partitions
> val intList = List(1, 2, 3, 4, 5) //Sample input list
> val intListRDD = sc.parallelize(intList)
> intListRDD.partitions.size //Default number of partitions created for the RDD
//Create an RDD from a file (e.g. a .tsv)
> val fileRDD = sc.textFile("/user/location/file.tsv")
> fileRDD.first()
> fileRDD.take(10) //Prints the output, but not in a readable form
> fileRDD.take(10).foreach(println) //Prints one record per line
> fileRDD.partitions.size //Check the number of partitions created
> val data = sc.textFile("/user/location/file.tsv", 10) //Create the RDD with 10 partitions
RDD Transformation – flatMap
> val data = Array("Hello There", "Welcome to Spark 2.0", "This is Prakash, your instructor for this course", "Enjoy learning Spark", "Happy Coding")
> val dataRDD = sc.parallelize(data)
> val filterRDD = dataRDD.filter(line => line.length > 15)
> filterRDD.collect()
> filterRDD.collect.foreach(println)
> val mapRDD = dataRDD.map(line => line.split(" "))
> mapRDD.collect()
> val flatMapRDD = dataRDD.flatMap(line => line.split(" "))
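The difference between map and flatMap shows up in the shape of the collected results: map keeps one output element per input line, while flatMap flattens the per-line word arrays into a single collection. A quick check of the shapes (a minimal sketch using the RDDs defined above):
> flatMapRDD.collect() //Array[String] – every word becomes a separate element
> mapRDD.count() //5 – one element per input line
> flatMapRDD.count() //Total number of words across all the lines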
RDD Transformation – mapPartitions
With map() or foreach(), the number of times we would need to initialise (for example, open a database connection) is equal to the number of elements in the RDD, whereas with mapPartitions() the number of initialisations is equal to the number of partitions.
//Initialise with 3 partitions
scala> val rdd1 = sc.parallelize(List("yellow", "red", "blue", "cyan", "black"), 3)
scala> val mapped = rdd1.mapPartitionsWithIndex{
     | // 'index' is the partition number
     | // 'iterator' iterates through all the elements
     | // in that partition
     | (index, iterator) => {
     |   println("Called in Partition -> " + index)
     |   val myList = iterator.toList
     |   // In a typical use case, the initialisation (e.g. opening
     |   // a database connection) would be done here, before
     |   // iterating through each element of the partition
     |   myList.map(x => x + " -> " + index).iterator
     | }
     | }
scala> mapped.collect() //Action that triggers the computation; the println runs once per partition
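To make the initialisation counts described above visible, the following is a minimal sketch (the accumulator names are illustrative) that counts how many times a per-record versus a per-partition set-up step runs on the same rdd1 with 3 partitions:
scala> val perRecord = sc.longAccumulator("perRecord")
scala> val perPartition = sc.longAccumulator("perPartition")
scala> rdd1.map { x => perRecord.add(1); x }.collect //set-up step runs once per element
scala> rdd1.mapPartitions { it => perPartition.add(1); it }.collect //set-up step runs once per partition
scala> perRecord.value //5 – one per element
scala> perPartition.value //3 – one per partition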
SET Operations (Union – Narrow Transformation)
> val setRDD1 = sc.parallelize(Array(2,4,6,8,10))
> val setRDD2 = sc.parallelize(Array(1,2,3,4,5))
> setRDD1.union(setRDD2).collect //Union – Narrow
> setRDD1.intersection(setRDD2).collect //Intersection – wide
> setRDD1.subtract(setRDD2).collect //Subtract – wide
> setRDD1.cartesian(setRDD2).collect //Cartesian – wide – too costly for large data
//SET Operation – distinct element extraction
> val numArray = Array(1,2,3,5,2,3,5,6,7,4,7,4)
> val numRDD = sc.parallelize(numArray)
> val distinctElementRDD = numRDD.distinct()
> distinctElementRDD.collect()
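The narrow/wide distinction noted in the comments above can be checked from partition counts and lineage: union simply concatenates its parents' partitions (no shuffle), while intersection and subtract repartition the data through a shuffle. A minimal sketch (exact partition counts depend on the local default parallelism):
> setRDD1.partitions.size //Default partitions of one parent
> setRDD1.union(setRDD2).partitions.size //Sum of both parents' partition counts – narrow, no shuffle
> setRDD1.intersection(setRDD2).partitions.size //Determined by the shuffle's partitioner – wide
> setRDD1.intersection(setRDD2).toDebugString //Lineage should show a shuffle (cogroup) stage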
RESULT:
Thus, file handling using Spark has been executed successfully.
Ex no : 7
MANIPULATING RDD FUNCTIONS IN SCALA
Date :
AIM :
To manipulate RDDs using Scala functions in Spark.
PROCEDURE:
1. Open the Spark terminal.
2. Restart all the daemons.
3. Configure Spark.
4. Import the necessary Spark libraries.
5. Develop the code.
6. Execute the program.
PROGRAM:
cd $SPARK_HOME
./sbin/stop-all.sh
./sbin/start-all.sh
spark-shell
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val data = Array("Hello there", "Welcome to Spark", "Let's learn together", "Do well")
val dataRDD = sc.parallelize(data)
dataRDD.filter(line => line.length > 5).collect
dataRDD.filter(line => line.length > 5).collect.foreach(println)
dataRDD.map(line => line.split(" ")).collect
dataRDD.flatMap(line => line.split(" ")).collect
OUTPUT :
RESULT:
Thus, the manipulations of RDD functions using Scala have been executed successfully.
Ex no : 8
SPARK STREAMING DATA
Date :
AIM:
To analyse Spark streaming data using Spark GraphX.
PROCEDURE:
1. Analyse the graph.
2. Open the Spark terminal.
3. Configure Spark.
4. Import the necessary Spark libraries.
5. Develop the code.
6. Execute the program.
PROGRAM:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
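// Vertices are (VertexId, property) pairs; here the property is a (name, age) tuple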
val vertexArray = Array(
(1L, ("Alice", 28)),
(2L, ("Bob", 27)),
(3L, ("Charlie", 65)),
(4L, ("David", 42)),
(5L, ("Ed", 55)),
(6L, ("Fran", 50))
)
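// Edges are Edge(srcVertexId, dstVertexId, attribute); here the Int attribute is a weight used by the triplet filter below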
val edgeArray = Array(
Edge(2L, 1L, 7),
Edge(2L, 4L, 2),
Edge(3L, 2L, 4),
Edge(3L, 6L, 3),
Edge(4L, 1L, 1),
Edge(5L, 2L, 2),
Edge(5L, 3L, 8),
Edge(5L, 6L, 3)
)
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect.foreach {
case (id, (name, age)) => println(s"$name is $age")
}
717822I131 MOHAMED MARAKFHAN S
B.TECH ARTIFICIAL INTELLIGENCE AND DATA SCIENCE 21PD21 STREAMING ANALYTICS
for (triplet <- graph.triplets.filter(t => t.attr > 5).collect) {
println(s"${triplet.srcAttr._1} loves ${triplet.dstAttr._1}")
}
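Sanity check (print order may vary because the RDDs are distributed): the vertex filter should report Charlie (65), David (42), Ed (55) and Fran (50), and the triplet filter (attr > 5) matches Edge(2L, 1L, 7) and Edge(5L, 3L, 8), so it should print "Bob loves Alice" and "Ed loves Charlie".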
OUTPUT:
RESULT:
Thus, the Spark streaming data using Spark GraphX has been analysed and executed successfully.