Unit 5: Data Analytics for IoT - Hadoop and Spark (Part 4)

Apache Spark is an open-source cluster computing framework designed for data analysis, offering high-level tools such as Spark Streaming, SparkSQL, and MLlib. It supports real-time, batch, and interactive queries with APIs for Scala, Java, and Python, utilizing a resilient distributed dataset (RDD) for parallel processing. Spark can be easily set up on Amazon EC2, and includes capabilities for machine learning, graph processing, and data transformations.


Apache Spark

❑ Apache Spark is an open source cluster computing framework for data analysis

❑ Spark supports in-memory cluster computing and promises to be faster than Hadoop MapReduce
❑ Supports high-level tools for data analysis such as
▪ Spark Streaming for streaming jobs
▪ SparkSQL for analysis of structured data
▪ MLlib machine learning library for Spark
▪ GraphX for graph processing
▪ Shark (Hive on Spark)
❑ Spark allows real-time, batch and interactive queries and provides APIs for the Scala, Java and Python languages

Figure: Spark tools - Spark Streaming, Spark SQL, MLlib (Machine Learning), GraphX (Graph Computation) and Bagel (Pregel on Spark), all built on top of Spark Core


Figure: Spark cluster architecture - a Driver Program (with SparkContext) connects through a cluster manager (Standalone, Apache Mesos or Hadoop YARN) to Worker Nodes, each running an Executor with Tasks and a Cache

▪ Each Spark application consists of a Driver Program and is coordinated by the "SparkContext"
▪ The Cluster Manager allocates resources for each application on the worker nodes
▪ Spark provides various cluster managers: Standalone, Apache Mesos, Hadoop YARN
▪ The Executors are allocated on the worker nodes
▪ Executors run the application code as multiple tasks
▪ Applications are isolated from each other and run within their own executor processes on the worker nodes
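
As a rough sketch (not from the original slides), a driver program might create its SparkContext against one of these cluster managers as follows; the application name and master URLs are illustrative placeholders:

from pyspark import SparkConf, SparkContext

# Illustrative master URLs; host names and ports are placeholders
conf = (SparkConf()
        .setAppName("MyApp")
        .setMaster("spark://master-host:7077"))  # Standalone cluster manager
# Alternatives (syntax of older Spark versions):
#   .setMaster("mesos://master-host:5050")       # Apache Mesos
#   .setMaster("yarn-client")                    # Hadoop YARN

sc = SparkContext(conf=conf)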
▪ Spark provides a data abstraction called the resilient distributed dataset (RDD)

▪ An RDD is a collection of elements partitioned across the nodes of the cluster

▪ The RDD elements can be operated on in parallel in the cluster

▪ RDD supports two types of operations:

➢ Transformations: used to create a new dataset from an existing one
➢ Actions: return a value to the driver program after running a computation on the dataset

▪ The Spark API allows chaining together Transformations and Actions, as in the sketch below.
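
A minimal sketch (not from the original slides) of a transformation chained with an action in PySpark; the input path "data.txt" is a hypothetical placeholder:

from pyspark import SparkContext

sc = SparkContext("local", "RDD Example")

lines = sc.textFile("data.txt")                    # RDD created from a text file
long_lines = lines.filter(lambda l: len(l) > 80)   # transformation: returns a new RDD
print(long_lines.count())                          # action: returns a value to the driver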


▪ Spark comes with a spark-ec2 script (available in the spark/ec2 directory), so it is easy to set up on Amazon EC2.
With the spark-ec2 script, one can easily launch, manage and shut down a Spark cluster on Amazon EC2.

▪Commands to start a Spark Cluster:
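
The original slide shows the command as an image; a typical spark-ec2 launch invocation looks roughly like the following, where the key pair, identity file, slave count and cluster name are placeholders (exact flags may vary with the Spark version):

./spark-ec2 -k <key-pair-name> -i <path-to-key-file> -s <number-of-slaves> launch <cluster-name>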

▪ A Spark cluster set up on EC2 is configured to use HDFS as its default file system. To analyze the contents of a
file, the file should first be copied to HDFS using the following command:
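
The slide shows the command as an image; copying a local file into HDFS is typically done with hadoop fs -put, where the file names below are placeholders:

hadoop fs -put data.txt /data.txt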
▪Spark supports a shell mode with which one can interactively run commands for analyzing the data.
To launch the Spark Python shell, run the following command:
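
The slide shows the command as an image; the Python shell is launched with the pyspark script from the Spark installation directory:

./bin/pyspark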

▪ When PySpark is launched, a SparkContext is created in a variable called sc. The following commands
show how to load a text file and count the number of lines from the PySpark shell:
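
A minimal sketch of this step (the original slide shows it as an image); "data.txt" is the hypothetical file copied to HDFS earlier:

lines = sc.textFile("data.txt")   # sc is the SparkContext created by the shell
lines.count()                     # action: count the number of lines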
Let’s look at a standalone Spark application that computes the word counts in a file (a sketch follows the list below).

▪ The program uses Map and Reduce functions.
▪ The flatMap and map transformations take as input a function which is applied to each element of the dataset.
▪ The flatMap function can map each input item to zero or more output items.
▪ The map function maps each input item to another item.
▪ The input functions can be in the form of Python lambda expressions or local functions.
▪ Here, flatMap takes as input a lambda expression that splits each line of the file into words.
▪ The map function outputs key-value pairs where the key is a word and the value is 1.
▪ The reduceByKey transformation aggregates the values of each key using the specified function (e.g. add).
▪ Finally, the collect action is used to return all the elements of the result as an array.
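
The original slide shows the program as an image; a minimal sketch along the lines described above might look like this, with "file.txt" as a hypothetical input path:

from operator import add
from pyspark import SparkContext

sc = SparkContext("local", "WordCount")

counts = (sc.textFile("file.txt")
            .flatMap(lambda line: line.split(" "))   # split each line into words
            .map(lambda word: (word, 1))             # key-value pairs (word, 1)
            .reduceByKey(add))                       # aggregate the counts per word

print(counts.collect())                              # return the results to the driver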
Let’s take an example of forest fire detection sensor data and look at a Spark application that aggregates the time-stamped sensor data
and finds hourly maximum values for temperature, humidity, light and CO2 (a sketch follows the steps below).

1. The sensor data is loaded as a text file. Each line of the text file contains time-stamped sensor data.
2. The lines are split by applying a map transformation to access the individual sensor readings.
3. A map transformation is applied which outputs key-value pairs where the key is a timestamp and the value is a sensor reading.
4. Finally, the reduceByKey transformation is applied to find the maximum sensor reading.
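
The original slide shows the program as an image; the following sketch illustrates the four steps for one reading (temperature), where the comma-separated line format and the file name "sensordata.txt" are assumptions:

from pyspark import SparkContext

sc = SparkContext("local", "HourlyMax")

def parse(line):
    # assumed format: "YYYY-MM-DD HH:MM:SS,temperature,humidity,light,co2"
    fields = line.split(",")
    hour = fields[0][:13]                  # truncate the timestamp to the hour
    return (hour, float(fields[1]))        # key-value pair: (hour, temperature)

hourly_max = (sc.textFile("sensordata.txt")
                .map(parse)
                .reduceByKey(max))         # maximum reading per hour

print(hourly_max.collect())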
Let’s take an example of forest fire detection sensor data and look at a Spark application that filters the sensor data (a sketch follows below).
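
The filtering code itself is shown as an image in the original slide; a minimal sketch, assuming the same comma-separated format as above and a hypothetical 40-degree temperature threshold:

from pyspark import SparkContext

sc = SparkContext("local", "FilterSensorData")

high_temp = (sc.textFile("sensordata.txt")
               .filter(lambda line: float(line.split(",")[1]) > 40.0))  # keep hot readings
print(high_temp.collect())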
Let’s take an example of forest fire detection sensor data and look at a Spark application that clusters the data.
Note: Spark includes a machine learning library, MLlib, which includes implementations of machine learning algorithms for
classification, regression, clustering, collaborative filtering and dimensionality reduction.
The following example shows clustering data with the K-means clustering algorithm (a sketch follows below):

▪ Here, the data is loaded from a text file
▪ The data is parsed using parseVector
▪ Next, the KMeans object is used to cluster the data into two clusters
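
The original slide shows the program as an image; a minimal sketch of the described steps, where the space-separated input file "kmeans_data.txt" is an assumption:

from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local", "KMeansExample")

def parseVector(line):
    # each line is assumed to hold space-separated numeric features
    return array([float(x) for x in line.split(" ")])

data = sc.textFile("kmeans_data.txt").map(parseVector)
model = KMeans.train(data, k=2, maxIterations=10)   # cluster the data into two clusters
print(model.clusterCenters)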
Let’s take an example of forest fire detection sensor data and look at a Spark application that classifies the data.
The following example shows classifying data with the Naïve Bayes classification algorithm (a sketch follows below):

▪ In this example, the training data consists of labeled points, where the value in the first column is the label.
▪ The parsePoint function parses the data and creates Spark LabeledPoint objects.
▪ The LabeledPoint data is passed to the NaiveBayes object for training a model.
▪ Finally, the classification is done by passing the test data to the trained model.
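
The original slide shows the program as an image; a minimal sketch of the described steps, where the comma-separated training and test file names are assumptions:

from pyspark import SparkContext
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext("local", "NaiveBayesExample")

def parsePoint(line):
    # the first value is the label, the remaining values are the features
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

training = sc.textFile("train_data.txt").map(parsePoint)
model = NaiveBayes.train(training)                    # train the classifier

test = sc.textFile("test_data.txt").map(parsePoint)
predictions = test.map(lambda p: (p.label, model.predict(p.features)))
print(predictions.collect())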
https://spark.apache.org/
