ScaleByte
APACHE SPARK AND SCALA
www.scalebyte.com
Introduction
Spark is a framework for distributed processing that also provides in-memory computation.
It is an open source project used for fast data analytics.
It is one of Apache's top-level projects.
It provides high-level APIs in Java, Python, and Scala, with a rich set of built-in libraries.
Introduction (Contd.)
Spark runs on a cluster and can access data sources such as HDFS as well as Cassandra.
Fast processing is needed because waiting a long time, online or offline, for results is no longer acceptable; Spark was introduced to address this.
It provides high-level tools such as Spark SQL, and MLlib for machine learning.
Batch Vs Real-time (Stream) Scenario
Analytics Type Based on Input Data
Spark Real-Time Streaming
Chop up the live stream into batches of X seconds.
Spark treats each batch of data as an RDD and processes it using RDD operations.
Finally, the processed results of the RDD operations are returned in batches (a sketch follows).
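A minimal sketch of this micro-batch model, assuming Spark Streaming's DStream API; the host, port, and 2-second batch interval are placeholder values:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Chop the live socket stream into 2-second micro-batches.
val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))

// Each batch is treated as an RDD and processed with RDD-style operations.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// The processed results come back batch by batch.
counts.print()

ssc.start()
ssc.awaitTermination()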
Why Spark?
It is a powerful and scalable open source processing engine.
Features
Provides in-memory computation
Uses RDDs
Works on HDFS, S3, Cassandra, etc.
Schedules jobs faster than MapReduce
Set a new world record in large-scale sorting (the 2014 Daytona GraySort benchmark)
In-Memory Computations
Spark unified stack
What is an RDD?
Resilient Distributed Dataset [the primary core abstraction]
A collection of (potentially huge) data with the following properties (a short sketch follows the list):
Immutable (read-only)
Fault tolerant, thanks to the RDD lineage DAG
Distributed and partitioned across the cluster
Lazily evaluated
Type inferred
Cacheable
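A small sketch, assuming an active SparkContext `sc`, that ties a few of these properties together:

val rdd = sc.parallelize(1 to 1000, numSlices = 8)   // distributed and partitioned (8 partitions)
println(rdd.getNumPartitions)                        // 8
val doubled = rdd.map(_ * 2)                         // immutable: map returns a new RDD
doubled.cache()                                      // cacheable; still lazy until an action runs
println(doubled.count())                             // action: triggers the job and fills the cache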
RDD Lineage — Logical Execution Plan
It is built as transformations are applied to an RDD and forms the RDD's logical execution plan; the sketch below shows one way to inspect it.
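A quick way to see this plan, assuming an active SparkContext `sc`, is the toDebugString method on an RDD:

val nums    = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)
println(evens.toDebugString)   // prints the lineage (DAG) of transformations behind this RDD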
RDD Basics
In Spark, all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations (actions) on RDDs to compute a result.
There are two ways of creating RDDs:
By loading an external dataset:
val lines = sc.textFile("Readme.md")
By distributing a collection of elements/objects:
val data = 1 to 100
val dataRDD = sc.parallelize(data)
RDD Basics
Once created, there are two kinds of operations that can be performed on RDDs (a short sketch follows):
Transformations
Apply a function to an existing RDD to create a new RDD
Build up the DAG
Lazily evaluated
Do not return a value to the driver
Actions
Compute a result by applying a function to an RDD
Return a value to the driver (or write it to storage)
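A minimal sketch of the difference, assuming an active SparkContext `sc`:

val nums    = sc.parallelize(1 to 5)
val squared = nums.map(n => n * n)    // transformation: builds the DAG, nothing runs yet
val total   = squared.reduce(_ + _)   // action: runs the job and returns 55 to the driver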
RDD Basics
In some cases you want to use the contents of an RDD repeatedly; in such cases you can persist the data by using the persist() method, e.g. in Python (a Scala equivalent follows below):
lines = sc.textFile("Readme.txt")
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.persist()
pythonLines.count()
pythonLines.first()
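The same idea as a hedged Scala sketch, using an explicit storage level (MEMORY_ONLY is what cache() uses by default):

import org.apache.spark.storage.StorageLevel

val lines       = sc.textFile("Readme.txt")
val pythonLines = lines.filter(line => line.contains("Python"))
pythonLines.persist(StorageLevel.MEMORY_ONLY)   // mark the RDD for caching
println(pythonLines.count())                    // first action computes and caches the RDD
println(pythonLines.first())                    // second action reuses the cached data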
RDD Operations (Transformations)
Transformations
Transformations are operations that return new RDDs, e.g. map(), filter(), etc.
filter() transformation in Scala:
val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))
filter() transformation in Python:
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
RDD Operations (Transformations)
Let's use inputRDD again to search for lines with the word "warning".
We'll use another transformation, union(), to combine the lines that contain either "error" or "warning":
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
RDD Operations (Transformations)
union() is a bit different from filter(), in that it operates on two RDDs instead of one (a tiny example follows).
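A tiny sketch of union() on two small RDDs, assuming an active SparkContext `sc`:

val errors   = sc.parallelize(Seq("error: disk full"))
val warnings = sc.parallelize(Seq("warning: low memory", "warning: high load"))
val bad      = errors.union(warnings)   // one RDD holding all three lines (duplicates are kept)
println(bad.count())                    // 3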
RDD Operations (Actions)
Actions
Actions are operations performed on an RDD that return results to the driver program or store them in external storage, e.g. count(), first(), etc.
Suppose we want to print out some information about badLinesRDD; see the examples on the next slide.
RDD Operations (Actions)
Scala error count using actions:
println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)
Python error count using actions:
print("Input had " + str(badLinesRDD.count()) + " concerning lines")
print("Here are 10 examples:")
for line in badLinesRDD.take(10):
    print(line)
RDD Operations (Actions)
Note: It is important to know that each time we call a new action, the RDD is recomputed from scratch, which is time consuming.
That is why the persist() / cache() methods are available: they let us store intermediate results / RDDs in memory for reuse.
Lazy Evaluation
Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute until it sees an action.
For the statements below, Spark does no work yet, because none of the three statements is an action:
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
Only when an action such as badLinesRDD.count() is called does Spark actually read log.txt and run the filters.
Passing Functions to Spark
word = rdd.filter(lambda s: "error" in s)   # pass an anonymous function (lambda)

def containsError(s):
    return "error" in s

word = rdd.filter(containsError)            # or pass a named function
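A hedged Scala sketch of the same idea; `rdd` is assumed to be an existing RDD[String]:

// Pass an anonymous function ...
val words1 = rdd.filter(s => s.contains("error"))

// ... or pass a named function.
def containsError(s: String): Boolean = s.contains("error")
val words2 = rdd.filter(containsError)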
RDD Transformations
RDD Actions
Quick architectural overview
Your program acts as the driver; so does your Spark shell.
The driver program is just one part of a Spark application (a minimal sketch follows).
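A minimal sketch of what makes your program a driver, assuming you run it as a standalone Scala application (spark-shell already creates a SparkContext for you as `sc`):

import org.apache.spark.{SparkConf, SparkContext}

object MyDriverApp {
  def main(args: Array[String]): Unit = {
    // The driver program owns the SparkContext and coordinates the job.
    val conf = new SparkConf().setAppName("MyDriverApp").setMaster("local[*]")
    val sc   = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())   // work is scheduled from the driver
    sc.stop()
  }
}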
SPARK Architecture
A Spark application starts on a two-node cluster.
The driver program contacts the master for resources.
Next, the master contacts the worker nodes.
The worker nodes create executors.
The executors then connect directly to the driver, and all further communication happens between the driver and the executors.
Major Industries leveraging Analytics
Before We Go Ahead
Our area of interest is real-time analytics.
Let's explore tools that can give low latency and high throughput for real-time analytics.
Most Popular Real-Time Analytics Tools
Ideal Tool for Real-Time Analytics
Apache Flink: Ideal Tool for Real-Time Analytics
Apache Flink
Apache Flink is an open source platform: a streaming dataflow engine that provides communication, fault tolerance, and data distribution for distributed computations over data streams.
Flink is a top-level project of Apache.
Flink is a scalable data analytics framework that is fully compatible with Hadoop.
Flink can execute both stream processing and batch processing easily (see the sketch below).
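A hedged sketch of a Flink streaming job in Scala (the socket source, host, and port are placeholders; the API shown assumes the Flink 1.3-era Scala DataStream API):

import org.apache.flink.streaming.api.scala._

object FlinkWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env  = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)

    val counts = text
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)        // key by the word
      .sum(1)          // running count per word

    counts.print()
    env.execute("Flink Word Count Sketch")
  }
}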
Apache Flink
Flink is still at an early stage, i.e. it has not yet been explored by much of the analytics community.
Many companies are migrating to it.
Version available for download: Apache Flink 1.3.0
Features of Apache Flink
i. Low Latency and High Performance
Apache Flink provides high performance and low latency without any heavy configuration. Its pipelined architecture provides a high throughput rate. It processes data at lightning-fast speed and is also called the 4G of Big Data.
ii. Fault Tolerance
The fault-tolerance mechanism provided by Apache Flink is based on Chandy-Lamport distributed snapshots; this mechanism provides strong consistency guarantees.
iii. Memory Management
Memory management in Apache Flink provides control over how much memory is used by certain runtime operations.
iv. Iterations
Apache Flink provides dedicated support for iterative algorithms (machine learning, graph processing).
v. Integration
Apache Flink can be easily integrated with other open source data processing ecosystems: it can be integrated with Hadoop, can stream data from Kafka, and can run on YARN.
The Strength of Flink comes from its Architecture
Lambda Architecture
Other Users of Apache Flink
Conclusion: Now is the Time for Apache Flink
Big Data Processing Tool