Welcome to
Big Data
with
Spark & Hadoop
Please introduce yourself while others are
joining
Session 1 - Big Data with Spark & Hadoop
Duration: 3 hours
Agenda:
• Introduction to Big Data: What, Why, Use cases, Various solutions
• 10 mins. break
• Spark & Hadoop Architecture
• Hands-On
Notes:
• Please introduce yourself using chat window while others are joining
• The session is being recorded; the recording & presentation will be shared
• This is the first of 20 sessions in the Big Data with Spark & Hadoop specialization
• On its own, it serves as an introduction to the Big Data technology stack
Asking Questions?
• Everyone except Instructor is muted
• Please ask questions by typing in Q&A Window
• Instructor will read out the questions before answering
• To get better answers, keep your messages short and avoid chat language
About the Course
Specialization in Big Data with Hadoop & Spark
Learn HDFS, ZooKeeper, Hive, HBase, NoSQL, Oozie, Flume, Sqoop, Spark, Spark RDD,
Spark Streaming, Kafka, SparkR, SparkSQL, MLlib, and GraphX From Industry Experts
• 60+ hours training
• Projects & Lab
• Compatible with Hortonworks and Cloudera Certifications
• Certificate
Next Session:
Date: 18 Nov
Time: 8 pm - 11 pm (IST) or 6:30 am - 9:30 am (PST)
About CloudxLab
Making learning fun and for life
Videos, Quizzes, Hands-On, Projects, Case Studies
Real-life use cases
Learn by doing
Automated Hands-on Assessments
• Problem Statement → Hands-On Assessment → Evaluation
• Cloud-based lab
• Python assessments in Jupyter Notebook
Course Instructor
Sandeep Giri
• Founder
• Software Engineer
• Worked on large-scale computing
• Graduated from IIT Roorkee
• Loves explaining technologies
Course Objective
Learn to process Big Data with Hadoop, Spark & related technologies
Data Variety
ETL - Extract, Transform, Load
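The course returns to ETL with Spark later; for a first taste, here is a minimal ETL sketch in PySpark. The HDFS paths and column names are hypothetical:

```python
# Minimal ETL sketch in PySpark (hypothetical paths and columns).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw CSV logs from HDFS
raw = spark.read.csv("hdfs:///data/raw/logs.csv", header=True)

# Transform: drop invalid rows and cast types
clean = (raw
         .filter(col("status").isNotNull())
         .withColumn("bytes", col("bytes").cast("long")))

# Load: write the cleaned data back as Parquet
clean.write.mode("overwrite").parquet("hdfs:///data/clean/logs")
```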
Distributed Systems
1. Groups of networked computers
2. Interact with each other
3. To achieve a common goal
Question
How Many Bytes in One Petabyte?
≈ 1.1259 × 10^15 (= 2^50) bytes
Question
How Much Data Facebook Stores in
One Day?
600 TB
What is Big Data?
• Simply: Data of Very Big Size
• Can’t process with usual tools
• Distributed architecture needed
• Structured / Unstructured
Characteristics of Big Data
VOLUME - Data at Rest
Problems related to storing huge data reliably.
e.g. storage of a website's logs, storage of data by Gmail. FB: 300 PB.

VELOCITY - Data in Motion
Problems involving the handling of data arriving at a fast rate.
e.g. the number of requests received by Facebook (600 TB/day), YouTube streaming, Google Analytics.

VARIETY - Data in Many Forms
Problems involving complex data structures.
e.g. maps, social graphs, recommendations.
Characteristics of Big Data - Variety
Problems involving complex data structures
e.g. Maps, Social Graphs, Recommendations
Question
Time taken to read 1 TB from HDD?
Around 6 hours (at a typical sequential read speed of ~50 MB/s: 10^12 bytes ÷ 50 MB/s ≈ 20,000 s ≈ 5.5 hours)
Is One PetaByte Big Data?
If you had to count just the vowels in 1 petabyte of data every day, would you need a distributed system?
Is One PetaByte Big Data?
Yes.
Most existing systems can't handle it: even just reading 1 PB from a single disk at ~50 MB/s would take around 230 days, so the work must be spread across many machines.
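To make the thought experiment concrete, here is a toy sketch of the vowel count as a Spark RDD job; the input path is hypothetical. The point is that Spark splits the file into blocks across the cluster, so each machine scans only its own share:

```python
# Toy sketch: distributed vowel count with a Spark RDD (hypothetical path).
from pyspark import SparkContext

sc = SparkContext(appName="vowel-count")
VOWELS = set("aeiouAEIOU")

total = (sc.textFile("hdfs:///data/huge_corpus")               # one partition per HDFS block
           .map(lambda line: sum(ch in VOWELS for ch in line)) # count vowels per line
           .sum())                                             # aggregate across the cluster

print("Total vowels:", total)
```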
Why Big Data?
Why is It Important Now?
Devices × Connectivity => Applications
• Devices: smartphones - 4.6 billion mobile phones; the devices became cheaper, faster, and smaller
• Connectivity: WiFi, 4G, NFC, GPS - 1-2 billion people accessing the internet; the connectivity improved
• Result: many applications - social networks, Internet of Things
Computing Components
To process & store data
we need
1. CPU - speed
2. RAM - speed & size
3. HDD or SSD - size & speed
4. Network - speed
Which Components Impact the Speed of
Computing?
A. CPU
B. Memory Size
C. Memory Read Speed
D. Disk Speed
E. Disk Size
F. Network Speed
G. All of Above
Answer: G. All of Above
Example Big Data Customers
1. Ecommerce - Recommendations
Example Big Data Problems
Recommendations - How?
USER ID | MOVIE ID       | RATING
KUMAR   | matrix         | 4.0
KUMAR   | Ice age        | 3.5
KUMAR   | apocalypse now | 3.6

USER ID | MOVIE ID       | RATING
GIRI    | apocalypse now | 3.6
GIRI    | Ice age        | 3.5
GIRI    | matrix         | 4.0
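One common way to turn such rating tables into recommendations is collaborative filtering. Below is a minimal sketch using Spark MLlib's ALS (covered later in the course, assuming Spark 2.2+); the integer IDs stand in for the users and movies above, since ALS expects numeric columns:

```python
# Minimal collaborative-filtering sketch with Spark MLlib ALS.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-sketch").getOrCreate()

# userId 0 = KUMAR, 1 = GIRI; movieId 0 = matrix, 1 = Ice age, 2 = apocalypse now
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 3.5), (0, 2, 3.6),
     (1, 0, 4.0), (1, 1, 3.5), (1, 2, 3.6)],
    ["userId", "movieId", "rating"])

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=2, maxIter=5)
model = als.fit(ratings)

# Top-2 movie recommendations for every user
model.recommendForAllUsers(2).show(truncate=False)
```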
Example Big Data Customers
2. Ecommerce - A/B Testing
Big Data Customers
Government
1. Fraud Detection
2. Cyber Security
3. Welfare
4. Justice
Telecommunications
1. Customer Churn Prevention
2. Network Performance Optimization
3. Call Detail Record (CDR) Analysis
4. Analyzing Network to Predict Failure
Example Big Data Customers
Healthcare & Life Sciences
1. Health information exchange
2. Gene sequencing
3. Healthcare improvements
4. Drug safety
Big Data Solutions
1. Apache Hadoop
○ Apache Spark
2. Cassandra
3. MongoDB
4. Google Compute Engine
5. AWS
What is Hadoop?
A. Created by Doug Cutting (of Yahoo)
B. Built for Nutch search engine project
C. Joined by Mike Cafarella
D. Based on GFS, GMR & Google Big Table
E. Named after Toy Elephant
F. Open Source - Apache
G. Powerful, Popular & Supported
H. Framework to handle Big Data
I. For distributed, scalable and reliable computing
J. Written in Java
Hadoop Distributions
A. Hortonworks - HDP (Hortonworks Data Platform)
○ Apache Hadoop + Spark
○ Ambari - Provisioning + Workbench
B. Cloudera
○ Apache Hadoop + Spark
○ Cloudera Manager - Provisioning
○ Hue - Workbench
C. MapR
Hadoop hosting in the cloud
A. On Microsoft Azure
○ HDInsight (uses Hortonworks HDP)
○ Extensions with .NET (plus Java)
B. On Amazon EC2/S3 services
○ See CloudxLab Video using HDP
■ https://www.youtube.com/watch?v=3uYY2sMBCz4
C. On Amazon Elastic MapReduce
○ Running and terminating jobs
○ Handling data transfer between EC2 (VM) and S3 (object storage)
○ Provides Apache Hive
○ Support for using Spot Instances
D. On Google Cloud Platform
Components
• Workflow - Oozie
• SQL-like interface - Hive
• Machine learning / stats - Spark MLlib, SparkR
• Compute engine - Spark, MapReduce
• NoSQL datastore - HBase
• Resource manager - YARN
• File storage - HDFS
• Coordination - ZooKeeper
Oozie - Use Case
Building a recommendation Engine
Run daily using an Oozie workflow:
1. Flume ingests data from the web server into HDFS
2. Spark / Hive / Pig transform the data, writing back to HDFS
3. Spark MLlib builds the recommendations, writing to HDFS
4. Sqoop exports the results to MySQL
Apache Spark
• Really fast MapReduce
• 100x faster than Hadoop MapReduce in memory, 10x faster on disk
• Builds on similar paradigms as MapReduce
• Integrated with Hadoop
Spark Core - a fast and general engine for large-scale data processing
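Much of the in-memory speedup comes from caching: a dataset read once can be kept in RAM and reused across passes, instead of being re-read from disk for every job as in classic MapReduce. A minimal sketch, with a hypothetical log path:

```python
# Caching sketch: two passes over the same data, one disk read.
from pyspark import SparkContext

sc = SparkContext(appName="cache-sketch")

logs = sc.textFile("hdfs:///data/logs").cache()  # keep in memory after first use

errors = logs.filter(lambda line: "ERROR" in line).count()   # first pass: reads disk
warnings = logs.filter(lambda line: "WARN" in line).count()  # second pass: from memory

print(errors, warnings)
```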
Spark Architecture
• Languages: SQL, SparkR, Java, Python, Scala
• Libraries: DataFrames, Streaming, MLlib, GraphX
• Engine: Spark Core
• Resource/cluster managers: Amazon EC2, Standalone, Apache Mesos, YARN
• Data sources: HDFS, HBase, Hive, Tachyon (Alluxio), Cassandra, Hadoop
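As a small illustration of the languages and libraries layers, the same query can be written against the DataFrame API or as SQL; both compile down to the same Spark Core execution:

```python
# DataFrame API and SQL over the same data (toy in-memory example).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arch-sketch").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# DataFrame API
df.filter(df.age > 30).show()

# Equivalent SQL via a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```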
Thank you. For the full course please enroll at
https://cloudxlab.com/
Next Instructor-Led Batch
Date: 25 Aug Time: 8 pm - 11 pm (IST) or 6:30 am - 9:30 am (PST)
A Typical Workflow
Gathering Data (ETL) → Cleaning Data → Basic Analysis → Using ML Algorithms → Optimizing ML Algorithms → Moving to Prod
Tools along the way:
• Gathering (ETL): NoSQL databases, HBase, HDFS, Kafka, Spark RDD, MapReduce, Informatica
• Cleaning & basic analysis: SparkSQL, Hive
• ML algorithms: Linear Regression and other algorithms, Spark MLlib, GraphX, Scikit-learn, R
• Optimizing ML: TensorFlow
• Moving to prod: exposing REST APIs, building an Android app
Machine Learning Careers
Roles: Software Engineer - ML, Software Engineer - Big Data, Applied ML Engineer, Core ML Engineer, Data Analyst
Each role needs the following skill areas at either an "only basics" or a "good knowledge" level:
• CS Basics
• Probability and Statistics
• Data Modeling and Evaluation
• Applying ML Algos & Libs
• SW Engg & System Design
• Big Data
• Machine Learning
For the full course please enroll at https://cloudxlab.com/
Questions?
https://discuss.cloudxlab.com
[email protected]