Welcome to
Big Data
with
Spark & Hadoop
Please introduce yourself while others are
joining
Session 1 - Big Data with Spark & Hadoop
Duration: 3 hours
Agenda:
• Introduction to Big Data: What, Why, Use cases, Various solutions
• 10 mins. break
• Spark & Hadoop Architecture
• Hands-On
Notes:
• Please introduce yourself using chat window while others are joining
• The session is being recorded; the recording & presentation will be shared
• This is the first of 20 sessions in the Big Data with Spark & Hadoop specialization
• On its own, it serves as an introduction to the Big Data technology stack
Asking Questions?
• Everyone except Instructor is muted
• Please ask questions by typing in Q&A Window
• Instructor will read out the questions before answering
• To get better answers, keep your messages short and avoid chat language
About the Course
Specialization in Big Data with Hadoop & Spark
Learn HDFS, ZooKeeper, Hive, HBase, NoSQL, Oozie, Flume, Sqoop, Spark, Spark RDD,
Spark Streaming, Kafka, SparkR, SparkSQL, MLlib, and GraphX From Industry Experts
• 60+ hours training
• Projects & Lab
• Compatible with Hortonworks and Cloudera Certifications
• Certificate
Next Session:
Date: 18 Nov
Time: 8 pm - 11 pm (IST) or 6:30 am - 9:30 am (PST)
About CloudxLab
Making learning fun and for life
Videos, Quizzes, Hands-On, Projects, Case Studies
Real-life use cases
Learn by doing
Automated Hands-on Assessments
• Problem Statement → Hands-On Assessment → Evaluation
• Cloud-based lab
• Python assessments in Jupyter Notebook
Course Instructor
Sandeep Giri
• Founder
• Software Engineer
• Worked on large-scale computing
• Graduated from IIT Roorkee
• Loves explaining technologies
Course Objective
Learn to process Big Data with Hadoop, Spark & related technologies
Data Variety
ETL - Extract, Transform, Load
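The course returns to ETL with Spark later; for a first taste, here is a minimal ETL sketch in PySpark. The HDFS paths and column names are hypothetical:

```python
# Minimal ETL sketch in PySpark (hypothetical paths and columns).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw CSV logs from HDFS
raw = spark.read.csv("hdfs:///data/raw/logs.csv", header=True)

# Transform: drop invalid rows and cast types
clean = (raw
         .filter(col("status").isNotNull())
         .withColumn("bytes", col("bytes").cast("long")))

# Load: write the cleaned data back as Parquet
clean.write.mode("overwrite").parquet("hdfs:///data/clean/logs")
```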
Distributed Systems
1. Groups of networked computers
2. Interact with each other
3. To achieve a common goal
Question
How Many Bytes in One Petabyte?
≈ 1.1259 × 10^15 (= 2^50) bytes
Question
How Much Data Facebook Stores in
One Day?
600 TB
What is Big Data?
• Simply: Data of Very Big Size
• Can’t process with usual tools
• Distributed architecture needed
• Structured / Unstructured
Characteristics of Big Data
VOLUME - Data at Rest
Problems related to storing huge data reliably.
e.g. storage of a website's logs, storage of data by Gmail. FB: 300 PB.

VELOCITY - Data in Motion
Problems involving the handling of data arriving at a fast rate.
e.g. the number of requests received by Facebook (600 TB/day), YouTube streaming, Google Analytics.

VARIETY - Data in Many Forms
Problems involving complex data structures.
e.g. maps, social graphs, recommendations.
Characteristics of Big Data - Variety
Problems involving complex data structures
e.g. Maps, Social Graphs, Recommendations
Question
Time taken to read 1 TB from HDD?
Around 6 hours (at a typical sequential read speed of ~50 MB/s: 10^12 bytes ÷ 50 MB/s ≈ 20,000 s ≈ 5.5 hours)
Is One PetaByte Big Data?
If you had to count just the vowels in 1 petabyte of data every day, would you need a distributed system?
Is One PetaByte Big Data?
Yes.
Most existing systems can't handle it: even just reading 1 PB from a single disk at ~50 MB/s would take around 230 days, so the work must be spread across many machines.
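To make the thought experiment concrete, here is a toy sketch of the vowel count as a Spark RDD job; the input path is hypothetical. The point is that Spark splits the file into blocks across the cluster, so each machine scans only its own share:

```python
# Toy sketch: distributed vowel count with a Spark RDD (hypothetical path).
from pyspark import SparkContext

sc = SparkContext(appName="vowel-count")
VOWELS = set("aeiouAEIOU")

total = (sc.textFile("hdfs:///data/huge_corpus")               # one partition per HDFS block
           .map(lambda line: sum(ch in VOWELS for ch in line)) # count vowels per line
           .sum())                                             # aggregate across the cluster

print("Total vowels:", total)
```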
Why Big Data?
Why is It Important Now?
Devices × Connectivity => Applications
• Devices: smartphones - 4.6 billion mobile phones; the devices became cheaper, faster, and smaller
• Connectivity: WiFi, 4G, NFC, GPS - 1-2 billion people accessing the internet; the connectivity improved
• Result: many applications - social networks, Internet of Things
Computing Components
To process & store data
we need
1. CPU - speed
2. RAM - speed & size
3. HDD or SSD - size & speed
4. Network - speed
Which Components Impact the Speed of
Computing?
A. CPU
B. Memory Size
C. Memory Read Speed
D. Disk Speed
E. Disk Size
F. Network Speed
G. All of Above
Answer: G. All of Above
Example Big Data Customers
1. Ecommerce - Recommendations
Example Big Data Problems
Recommendations - How?
USER ID | MOVIE ID       | RATING
KUMAR   | matrix         | 4.0
KUMAR   | Ice age        | 3.5
KUMAR   | apocalypse now | 3.6

USER ID | MOVIE ID       | RATING
GIRI    | apocalypse now | 3.6
GIRI    | Ice age        | 3.5
GIRI    | matrix         | 4.0
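One common way to turn such rating tables into recommendations is collaborative filtering. Below is a minimal sketch using Spark MLlib's ALS (covered later in the course, assuming Spark 2.2+); the integer IDs stand in for the users and movies above, since ALS expects numeric columns:

```python
# Minimal collaborative-filtering sketch with Spark MLlib ALS.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-sketch").getOrCreate()

# userId 0 = KUMAR, 1 = GIRI; movieId 0 = matrix, 1 = Ice age, 2 = apocalypse now
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 3.5), (0, 2, 3.6),
     (1, 0, 4.0), (1, 1, 3.5), (1, 2, 3.6)],
    ["userId", "movieId", "rating"])

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=2, maxIter=5)
model = als.fit(ratings)

# Top-2 movie recommendations for every user
model.recommendForAllUsers(2).show(truncate=False)
```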
Example Big Data Customers
2. Ecommerce - A/B Testing
Big Data Customers
Government
1. Fraud Detection
2. Cyber Security
3. Welfare
4. Justice
Telecommunications
1. Customer Churn Prevention
2. Network Performance Optimization
3. Call Detail Record (CDR) Analysis
4. Analyzing Network to Predict Failure
Example Big Data Customers
Healthcare & Life Sciences
1. Health information exchange
2. Gene sequencing
3. Healthcare improvements
4. Drug safety
Big Data Solutions
1. Apache Hadoop
○ Apache Spark
2. Cassandra
3. MongoDB
4. Google Compute Engine
5. AWS
What is Hadoop?
A. Created by Doug Cutting (of Yahoo)
B. Built for Nutch search engine project
C. Joined by Mike Cafarella
D. Based on GFS, GMR & Google Big Table
E. Named after Toy Elephant
F. Open Source - Apache
G. Powerful, Popular & Supported
H. Framework to handle Big Data
I. For distributed, scalable and reliable computing
J. Written in Java
Hadoop Distributions
A. Hortonworks - HDP (Hortonworks Data Platform)
○ Apache Hadoop + Spark
○ Ambari - Provisioning + Workbench
B. Cloudera
○ Apache Hadoop + Spark
○ Cloudera Manager - Provisioning
○ Hue - Workbench
C. MapR
Hadoop hosting in the cloud
A. On Microsoft Azure
○ HDInsight (uses Hortonworks HDP)
○ Extensions with .NET (plus Java)
B. On Amazon EC2/S3 services
○ See CloudxLab Video using HDP
■ https://www.youtube.com/watch?v=3uYY2sMBCz4
C. On Amazon Elastic MapReduce
○ Running and terminating jobs
○ Handling data transfer between EC2 (VM) and S3 (object storage)
○ Provides Apache Hive
○ Support for using Spot Instances
D. On Google Cloud Platform
Components
• Workflow - Oozie
• SQL-like interface - Hive
• Machine learning / stats - Spark MLlib, SparkR
• Compute engine - Spark, MapReduce
• NoSQL datastore - HBase
• Resource manager - YARN
• File storage - HDFS
• Coordination - ZooKeeper
Oozie - Use Case
Building a recommendation Engine
Run daily using an Oozie workflow:
1. Flume ingests data from the web server into HDFS
2. Spark / Hive / Pig transform the data, writing back to HDFS
3. Spark MLlib builds the recommendations, writing to HDFS
4. Sqoop exports the results to MySQL
Apache Spark
• Really fast MapReduce
• 100x faster than Hadoop MapReduce in memory, 10x faster on disk
• Builds on similar paradigms as MapReduce
• Integrated with Hadoop
Spark Core - a fast and general engine for large-scale data processing
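Much of the in-memory speedup comes from caching: a dataset read once can be kept in RAM and reused across passes, instead of being re-read from disk for every job as in classic MapReduce. A minimal sketch, with a hypothetical log path:

```python
# Caching sketch: two passes over the same data, one disk read.
from pyspark import SparkContext

sc = SparkContext(appName="cache-sketch")

logs = sc.textFile("hdfs:///data/logs").cache()  # keep in memory after first use

errors = logs.filter(lambda line: "ERROR" in line).count()   # first pass: reads disk
warnings = logs.filter(lambda line: "WARN" in line).count()  # second pass: from memory

print(errors, warnings)
```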
Spark Architecture
• Languages: SQL, SparkR, Java, Python, Scala
• Libraries: DataFrames, Streaming, MLlib, GraphX
• Engine: Spark Core
• Resource/cluster managers: Amazon EC2, Standalone, Apache Mesos, YARN
• Data sources: HDFS, HBase, Hive, Tachyon (Alluxio), Cassandra, Hadoop
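As a small illustration of the languages and libraries layers, the same query can be written against the DataFrame API or as SQL; both compile down to the same Spark Core execution:

```python
# DataFrame API and SQL over the same data (toy in-memory example).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arch-sketch").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# DataFrame API
df.filter(df.age > 30).show()

# Equivalent SQL via a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```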
Thank you. For the full course please enroll at
https://cloudxlab.com/
Next Instructor-Led Batch
Date: 25 Aug Time: 8 pm - 11 pm (IST) or 6:30 am - 9:30 am (PST)
A Typical Workflow
Gathering Data (ETL) → Cleaning Data → Basic Analysis → Using ML Algorithms → Optimizing ML Algorithms → Moving to Prod
Tools along the way:
• Gathering (ETL): NoSQL databases, HBase, HDFS, Kafka, Spark RDD, MapReduce, Informatica
• Cleaning & basic analysis: SparkSQL, Hive
• ML algorithms: Linear Regression and other algorithms, Spark MLlib, GraphX, Scikit-learn, R
• Optimizing ML: TensorFlow
• Moving to prod: exposing REST APIs, building an Android app
Machine Learning Careers
Roles: Software Engineer - ML, Software Engineer - Big Data, Applied ML Engineer, Core ML Engineer, Data Analyst
Each role needs the following skill areas at either an "only basics" or a "good knowledge" level:
• CS Basics
• Probability and Statistics
• Data Modeling and Evaluation
• Applying ML Algos & Libs
• SW Engg & System Design
• Big Data
• Machine Learning
For the full course please enroll at https://cloudxlab.com/
Questions?
https://discuss.cloudxlab.com
[email protected]