Unit 5: Data Analytics for IoT - Hadoop and Spark (Part 4)

Apache Spark is an open-source cluster computing framework designed for data analysis, offering high-level tools such as Spark Streaming, SparkSQL, and MLlib. It supports real-time, batch, and interactive queries with APIs for Scala, Java, and Python, utilizing a resilient distributed dataset (RDD) for parallel processing. Spark can be easily set up on Amazon EC2, and includes capabilities for machine learning, graph processing, and data transformations.


Apache Spark

❑ Apache Spark is an open source cluster computing framework for data analysis

❑ Spark supports in-memory cluster computing and promises to be faster than Hadoop MapReduce
❑ Supports high-level tools for data analysis such as
▪ Spark Streaming for streaming jobs
▪ SparkSQL for analysis of structured data
▪ MLlib machine learning library for Spark
▪ GraphX for graph processing
▪ Shark (Hive on Spark)
❑ Spark allows real-time, batch and interactive queries and provides APIs for the Scala, Java and Python languages

Figure: Spark tools - Spark Streaming, Spark SQL, MLlib (Machine Learning), GraphX (Graph Computation) and Bagel (Pregel on Spark), all built on top of Spark Core


Figure: Spark cluster architecture - a Driver Program (with SparkContext) connects through a cluster manager (Standalone, Apache Mesos or Hadoop YARN) to Worker Nodes, each running an Executor with Tasks and a Cache

▪ Each Spark application consists of a Driver Program and is coordinated by the "SparkContext"
▪ The Cluster Manager allocates resources for each application on the worker nodes
▪ Spark provides various cluster managers: Standalone, Apache Mesos, Hadoop YARN
▪ The Executors are allocated on the worker nodes
▪ Executors run the application code as multiple tasks
▪ Applications are isolated from each other and run within their own executor processes on the worker nodes
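
As a rough sketch (not from the original slides), a driver program might create its SparkContext against one of these cluster managers as follows; the application name and master URLs are illustrative placeholders:

from pyspark import SparkConf, SparkContext

# Illustrative master URLs; host names and ports are placeholders
conf = (SparkConf()
        .setAppName("MyApp")
        .setMaster("spark://master-host:7077"))  # Standalone cluster manager
# Alternatives (syntax of older Spark versions):
#   .setMaster("mesos://master-host:5050")       # Apache Mesos
#   .setMaster("yarn-client")                    # Hadoop YARN

sc = SparkContext(conf=conf)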
▪ Spark provides a data abstraction called the resilient distributed dataset (RDD)

▪ An RDD is a collection of elements partitioned across the nodes of the cluster

▪ The RDD elements can be operated on in parallel in the cluster

▪ RDD supports two types of operations:

➢ Transformations: used to create a new dataset from an existing one
➢ Actions: return a value to the driver program after running a computation on the dataset

▪ The Spark API allows chaining together Transformations and Actions, as in the sketch below.
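
A minimal sketch (not from the original slides) of a transformation chained with an action in PySpark; the input path "data.txt" is a hypothetical placeholder:

from pyspark import SparkContext

sc = SparkContext("local", "RDD Example")

lines = sc.textFile("data.txt")                    # RDD created from a text file
long_lines = lines.filter(lambda l: len(l) > 80)   # transformation: returns a new RDD
print(long_lines.count())                          # action: returns a value to the driver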


▪ Spark comes with a spark-ec2 script (available in the spark/ec2 directory), so it is easy to set up on Amazon EC2.
With the spark-ec2 script, one can easily launch, manage and shut down a Spark cluster on Amazon EC2.

▪Commands to start a Spark Cluster:
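
The original slide shows the command as an image; a typical spark-ec2 launch invocation looks roughly like the following, where the key pair, identity file, slave count and cluster name are placeholders (exact flags may vary with the Spark version):

./spark-ec2 -k <key-pair-name> -i <path-to-key-file> -s <number-of-slaves> launch <cluster-name>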

▪ A Spark cluster set up on EC2 is configured to use HDFS as its default file system. To analyze the contents of a
file, the file should first be copied to HDFS using the following command:
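
The slide shows the command as an image; copying a local file into HDFS is typically done with hadoop fs -put, where the file names below are placeholders:

hadoop fs -put data.txt /data.txt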
▪Spark supports a shell mode with which one can interactively run commands for analyzing the data.
To launch the Spark Python shell, run the following command:
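
The slide shows the command as an image; the Python shell is launched with the pyspark script from the Spark installation directory:

./bin/pyspark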

▪ When PySpark is launched, a SparkContext is created in a variable called sc. The following commands
show how to load a text file and count the number of lines from the PySpark shell:
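
A minimal sketch of this step (the original slide shows it as an image); "data.txt" is the hypothetical file copied to HDFS earlier:

lines = sc.textFile("data.txt")   # sc is the SparkContext created by the shell
lines.count()                     # action: count the number of lines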
Let’s look at a standalone Spark application that computes the word counts in a file (a sketch follows the list below).

▪ The program uses Map and Reduce functions.
▪ The flatMap and map transformations take as input a function which is applied to each element of the dataset.
▪ The flatMap function can map each input item to zero or more output items.
▪ The map function maps each input item to another item.
▪ The input functions can be in the form of Python lambda expressions or local functions.
▪ Here, flatMap takes as input a lambda expression that splits each line of the file into words.
▪ The map function outputs key-value pairs where the key is a word and the value is 1.
▪ The reduceByKey transformation aggregates the values of each key using the specified function (e.g. add).
▪ Finally, the collect action is used to return all the elements of the result as an array.
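
The original slide shows the program as an image; a minimal sketch along the lines described above might look like this, with "file.txt" as a hypothetical input path:

from operator import add
from pyspark import SparkContext

sc = SparkContext("local", "WordCount")

counts = (sc.textFile("file.txt")
            .flatMap(lambda line: line.split(" "))   # split each line into words
            .map(lambda word: (word, 1))             # key-value pairs (word, 1)
            .reduceByKey(add))                       # aggregate the counts per word

print(counts.collect())                              # return the results to the driver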
Let’s take an example of forest fire detection sensor data and look at a Spark application that aggregates the time-stamped sensor data
and finds hourly maximum values for temperature, humidity, light and CO2 (a sketch follows the steps below).

1. The sensor data is loaded as a text file. Each line of the text file contains time-stamped sensor data.
2. The lines are split by applying a map transformation to access the individual sensor readings.
3. A map transformation is applied which outputs key-value pairs where the key is a timestamp and the value is a sensor reading.
4. Finally, the reduceByKey transformation is applied to find the maximum sensor reading.
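
The original slide shows the program as an image; the following sketch illustrates the four steps for one reading (temperature), where the comma-separated line format and the file name "sensordata.txt" are assumptions:

from pyspark import SparkContext

sc = SparkContext("local", "HourlyMax")

def parse(line):
    # assumed format: "YYYY-MM-DD HH:MM:SS,temperature,humidity,light,co2"
    fields = line.split(",")
    hour = fields[0][:13]                  # truncate the timestamp to the hour
    return (hour, float(fields[1]))        # key-value pair: (hour, temperature)

hourly_max = (sc.textFile("sensordata.txt")
                .map(parse)
                .reduceByKey(max))         # maximum reading per hour

print(hourly_max.collect())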
Let’s take an example of forest fire detection sensor data and look at a Spark application that filters the sensor data (a sketch follows below).
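
The filtering code itself is shown as an image in the original slide; a minimal sketch, assuming the same comma-separated format as above and a hypothetical 40-degree temperature threshold:

from pyspark import SparkContext

sc = SparkContext("local", "FilterSensorData")

high_temp = (sc.textFile("sensordata.txt")
               .filter(lambda line: float(line.split(",")[1]) > 40.0))  # keep hot readings
print(high_temp.collect())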
Let’s take an example of forest fire detection sensor data and look at a Spark application that clusters the data.
Note: Spark includes a machine learning library, MLlib, which includes implementations of machine learning algorithms for
classification, regression, clustering, collaborative filtering and dimensionality reduction.
The following example shows clustering data with the K-means clustering algorithm (a sketch follows below):

▪ Here, the data is loaded from a text file
▪ The data is parsed using parseVector
▪ Next, the KMeans object is used to cluster the data into two clusters
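
The original slide shows the program as an image; a minimal sketch of the described steps, where the space-separated input file "kmeans_data.txt" is an assumption:

from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local", "KMeansExample")

def parseVector(line):
    # each line is assumed to hold space-separated numeric features
    return array([float(x) for x in line.split(" ")])

data = sc.textFile("kmeans_data.txt").map(parseVector)
model = KMeans.train(data, k=2, maxIterations=10)   # cluster the data into two clusters
print(model.clusterCenters)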
Let’s take an example of forest fire detection sensor data and look at a Spark application that classifies the data.
The following example shows classifying data with the Naïve Bayes classification algorithm (a sketch follows below):

▪ In this example, the training data consists of labeled points, where the value in the first column is the label.
▪ The parsePoint function parses the data and creates Spark LabeledPoint objects.
▪ The LabeledPoint data is passed to the NaiveBayes object for training a model.
▪ Finally, the classification is done by passing the test data to the trained model.
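
The original slide shows the program as an image; a minimal sketch of the described steps, where the comma-separated training and test file names are assumptions:

from pyspark import SparkContext
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext("local", "NaiveBayesExample")

def parsePoint(line):
    # the first value is the label, the remaining values are the features
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

training = sc.textFile("train_data.txt").map(parsePoint)
model = NaiveBayes.train(training)                    # train the classifier

test = sc.textFile("test_data.txt").map(parsePoint)
predictions = test.map(lambda p: (p.label, model.predict(p.features)))
print(predictions.collect())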
https://spark.apache.org/
