Big Data With Hadoop & Spark - Introduction

This document introduces a session on Big Data with Spark and Hadoop. It outlines the agenda, the session notes, the process for asking questions, and information about the course and instructor. Details of the next session are also provided.

Uploaded by

wordpressbugs

Welcome to Big Data with Spark & Hadoop

Please introduce yourself while others are joining

[email protected] Introduction to Hadoop & Spark


Session 1 - Big Data with Spark & Hadoop
Duration: 3 hours
Agenda:
• Introduction to Big Data: What, Why, Use cases, Various solutions
• 10 mins. break
• Spark & Hadoop Architecture, Hands-On
• Hands-On
Notes:
• Please introduce yourself using the chat window while others are joining
• The session is being recorded; the recording and presentation will be shared
• This is the first of 20 sessions in the Big Data with Spark & Hadoop specialization.
• It also serves as a standalone introduction to the Big Data technology stack.
Asking Questions?
• Everyone except the instructor is muted
• Please ask questions by typing in the Q&A window
• The instructor will read out the questions before answering
• To get better answers, keep your messages short and avoid chat language

About the Course

Specialization in Big Data with Hadoop & Spark

Learn HDFS, ZooKeeper, Hive, HBase, NoSQL, Oozie, Flume, Sqoop, Spark, Spark RDD, Spark Streaming, Kafka, SparkR, SparkSQL, MLlib, and GraphX from industry experts

• 60+ hours of training
• Projects & Lab
• Compatible with Hortonworks and Cloudera Certifications
• Certificate

Next Session:

Date: 18 Nov
Time: 8 pm - 11 pm (IST) or 6:30 am - 9:30 am (PST)

About CloudxLab
Making learning fun and for life

• Videos
• Quizzes
• Hands-On Projects
• Case Studies
• Real Life Use Cases

Automated Hands-on Assessments
Learn by doing

• Problem Statement, Hands On, Assessment
• Cloud based Lab
• Problem Statement, Evaluation
• Python Assessment in Jupyter Notebook



Course Instructor

Sandeep Giri
• Founder
• Software Engineer
• Loves Explaining Technologies
• Worked On Large Scale Computing
• Graduated from IIT Roorkee

Course Objective

Learn to process Big Data with Hadoop, Spark & related technologies

Data Variety

ETL
Extract Transform Load

Distributed Systems

1. Groups of networked computers
2. Interact with each other
3. To achieve a common goal.

Question

How Many Bytes in One Petabyte?

1.1259 x 10^15 (that is, 2^50 bytes)
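
The figure above follows from binary prefixes, where each step (KB, MB, GB, TB, PB) multiplies by 1024. A minimal sketch of that arithmetic is below; the function name `bytes_in_petabyte` is made up for this illustration, and the decimal variant (10^15) is included only for comparison.

```python
def bytes_in_petabyte(binary: bool = True) -> int:
    """Bytes in one petabyte.

    binary=True  uses 1024-based prefixes: 1024**5 == 2**50
    binary=False uses 1000-based prefixes: 1000**5 == 10**15
    """
    base = 1024 if binary else 1000
    # KB -> MB -> GB -> TB -> PB is five steps up from a single byte.
    return base ** 5


if __name__ == "__main__":
    print(bytes_in_petabyte())             # 1125899906842624
    print(f"{bytes_in_petabyte():.4e}")    # 1.1259e+15
    print(bytes_in_petabyte(binary=False)) # 1000000000000000
```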

Question

How much data does Facebook store in one day?

600 TB

What is Big Data?

• Simply: data of very big size
• Can't be processed with the usual tools
• Distributed architecture needed
• Structured / Unstructured

Characteristics of Big Data

VOLUME (Data At Rest)
Problems related to the reliable storage of huge data.
e.g. Storage of logs of a website, storage of data by Gmail.
FB: 300 PB, 600 TB/day

VELOCITY (Data In Motion)
Problems involving the handling of data coming at a fast rate.
e.g. Number of requests being received by Facebook, YouTube streaming, Google Analytics

VARIETY (Data in Many Forms)
Problems involving complex data structures.
e.g. Maps, Social Graphs, Recommendations

Characteristics of Big Data - Variety

Problems involving complex data structures


e.g. Maps, Social Graphs, Recommendations

Question

Time taken to read 1 TB from HDD?

Around 6 hours
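
The estimate above can be reproduced with simple arithmetic. The sketch below assumes a sustained sequential read speed of 50 MB/s for a single hard disk; that number is an assumption chosen for illustration (the slide does not state one), and real drives vary widely. The helper `read_time_hours` is likewise hypothetical.

```python
TB = 10 ** 12  # one terabyte in bytes (decimal prefix)
MB = 10 ** 6   # one megabyte in bytes

# ASSUMPTION: sustained sequential read throughput of one HDD.
ASSUMED_HDD_BYTES_PER_SEC = 50 * MB


def read_time_hours(size_bytes: float, bytes_per_sec: float) -> float:
    """Hours needed to read size_bytes at a constant throughput."""
    seconds = size_bytes / bytes_per_sec
    return seconds / 3600


if __name__ == "__main__":
    # One disk reading the whole terabyte.
    print(round(read_time_hours(TB, ASSUMED_HDD_BYTES_PER_SEC), 2))       # 5.56
    # Ten disks each reading one tenth of the data in parallel.
    print(round(read_time_hours(TB / 10, ASSUMED_HDD_BYTES_PER_SEC), 2))  # 0.56
```

The second line hints at why distributed storage helps: splitting the data across ten disks that read in parallel cuts the wall-clock time by roughly ten, ignoring coordination overhead.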

Is One PetaByte Big Data?

If you have to count just the vowels in 1 petabyte of data every day, do you need a distributed system?

Yes.
Most of the existing systems can’t handle it.
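
To make the vowel-counting question concrete, here is a small sketch of the divide-and-combine idea behind a distributed count: split the data into chunks, count each chunk independently, then add up the partial counts. Threads on one machine stand in for separate cluster nodes here, and all names (`count_vowels`, `split_into_chunks`, `distributed_vowel_count`) are hypothetical. It is not how Hadoop or Spark are implemented, only an illustration of the pattern.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

VOWELS = set("aeiouAEIOU")


def count_vowels(chunk: str) -> int:
    """Count vowels in one chunk; this is the work a single node would do."""
    return sum(1 for ch in chunk if ch in VOWELS)


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Cut the input into fixed-size pieces, like blocks of a large file."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def distributed_vowel_count(text: str, chunk_size: int = 8, workers: int = 4) -> int:
    """Count vowels chunk by chunk in parallel, then combine the partial counts."""
    chunks = split_into_chunks(text, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_vowels, chunks))
    return sum(partial_counts)


if __name__ == "__main__":
    print(distributed_vowel_count("Big Data with Spark and Hadoop"))  # 9
```

Because each chunk is counted independently and addition is associative, the result does not depend on how the data is split, which is what makes the task easy to spread over many machines.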

Why Big Data?

Why is It Important Now?

Devices x Connectivity => Applications

Devices: Smart phones. 4.6 billion mobile phones.
Connectivity: Wifi, 4G, NFC, GPS. 1 - 2 billion people accessing the internet.
Applications: Social networks, Internet of Things

The devices became cheaper, faster and smaller. The connectivity improved. Result: many applications.

Computing Components
To process & store data we need

1. CPU - Speed
2. RAM - Speed & Size
3. HDD or SSD - Disk Size + Speed
4. Network

Which Components Impact the Speed of Computing?
A. CPU
B. Memory Size
C. Memory Read Speed
D. Disk Speed
E. Disk Size
F. Network Speed
G. All of Above



Example Big Data Customers
1. Ecommerce - Recommendations



Example Big Data Problems
Recommendations - How?

USER ID   MOVIE ID         RATING
KUMAR     matrix           4.0
KUMAR     Ice age          3.5

USER ID   MOVIE ID         RATING
GIRI      apocalypse now   3.6
KUMAR     apocalypse now   3.6
GIRI      Ice age          3.5
GIRI      matrix           4.0
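
One plausible reading of the tables above is that KUMAR and GIRI rate their shared movies identically, so GIRI's rating for a movie KUMAR has not seen can be borrowed as a prediction for KUMAR. The sketch below implements that nearest-neighbour idea in plain Python. It is an assumption about what the slide intends, not the course's actual method, and it treats the KUMAR / apocalypse now row as the inferred result rather than as input. The functions `similarity` and `predict` are made up for this example.

```python
from typing import Dict

Ratings = Dict[str, Dict[str, float]]

# Known ratings taken from the slide (the KUMAR / apocalypse now row is
# treated as the value to be predicted, so it is left out of the input).
ratings: Ratings = {
    "KUMAR": {"matrix": 4.0, "Ice age": 3.5},
    "GIRI": {"apocalypse now": 3.6, "Ice age": 3.5, "matrix": 4.0},
}


def similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Negative mean absolute rating difference on shared movies (higher = more alike)."""
    shared = set(a) & set(b)
    if not shared:
        return float("-inf")
    return -sum(abs(a[m] - b[m]) for m in shared) / len(shared)


def predict(user: str, all_ratings: Ratings) -> Dict[str, float]:
    """Borrow ratings for unseen movies from the most similar other user."""
    others = {u: r for u, r in all_ratings.items() if u != user}
    if not others:
        return {}
    nearest = max(others, key=lambda u: similarity(all_ratings[user], others[u]))
    seen = all_ratings[user]
    return {movie: score for movie, score in others[nearest].items() if movie not in seen}


if __name__ == "__main__":
    print(predict("KUMAR", ratings))  # {'apocalypse now': 3.6}
```

With millions of users and items, comparing every pair of users no longer fits on one machine, which is why recommendation is listed here as a Big Data problem.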

Example Big Data Customers
2. Ecommerce - A/B Testing

Big Data Customers
Government
1. Fraud Detection
2. Cyber Security
3. Welfare
4. Justice

Telecommunications
1. Customer Churn Prevention
2. Network Performance Optimization
3. Call Detail Record (CDR) Analysis
4. Analyzing Network to Predict Failure

Example Big Data Customers

Healthcare & Life Sciences


1. Health information exchange
2. Gene sequencing
3. Healthcare improvements
4. Drug Safety

Big Data Solutions
1. Apache Hadoop
   ○ Apache Spark
2. Cassandra
3. MongoDB
4. Google Compute Engine
5. AWS

What is Hadoop?

A. Created by Doug Cutting (of Yahoo)
B. Built for the Nutch search engine project
C. Joined by Mike Cafarella
D. Based on GFS (Google File System), GMR (Google MapReduce) & Google Bigtable
E. Named after a toy elephant
F. Open Source - Apache
G. Powerful, Popular & Supported
H. Framework to handle Big Data
I. For distributed, scalable and reliable computing
J. Written in Java

Hadoop Distributions

A. Hortonworks - HDP (Hortonworks Data Platform)
○ Apache Hadoop + Spark
○ Ambari - Provisioning + Workbench
B. Cloudera
○ Apache Hadoop + Spark
○ Cloudera Manager - Provisioning
○ Hue - Workbench
C. MapR

Hadoop hosting in the cloud
A. On Microsoft Azure
○ HDInsight (uses Hortonworks HDP)
○ Extensible with .NET (in addition to Java).
B. On Amazon EC2/S3 services
○ See CloudxLab Video using HDP
■ https://www.youtube.com/watch?v=3uYY2sMBCz4
C. On Amazon Elastic MapReduce
○ Running and terminating jobs
○ Handling data transfer between EC2 (VM) and S3 (Object Storage)
○ Provides Apache Hive
○ Support for using Spot Instances
D. On Google Cloud Platform

Components

[Figure: Hadoop ecosystem stack. Layer labels: WorkFlow; SQL like interface; Machine learning / STATS; Spark; SQL Interface; Compute Engine; NoSQL Datastore; Resource Manager; File Storage; Coordination]

Oozie - Use Case
Building a recommendation Engine

[Figure: eight-step pipeline, run daily using an Oozie workflow]
Flume -> HDFS -> Spark, Hive, Pig -> HDFS -> Spark MLlib -> HDFS -> Sqoop -> MySQL -> Web Server

Apache Spark
• Really fast MapReduce
• Up to 100x faster than Hadoop MapReduce in memory
• Up to 10x faster on disk
• Builds on paradigms similar to MapReduce
• Integrated with Hadoop

Spark Core - A fast and general engine for large-scale data processing.
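
The MapReduce paradigm mentioned above can be illustrated without a cluster. The sketch below is plain Python, not Spark or Hadoop code: it walks a word count through the three classic phases (map, shuffle, reduce). The function names are hypothetical. In Spark the same shape is usually written with RDD operations such as `flatMap`, `map` and `reduceByKey`.

```python
from collections import defaultdict
from functools import reduce
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    counts = word_count(["big data with spark", "big data with hadoop"])
    print(counts["big"], counts["spark"])  # 2 1
```

In a real cluster the map and reduce steps run on many machines and the shuffle moves data across the network; keeping intermediate results in memory rather than on disk is one reason commonly given for Spark's speed advantage.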

Spark Architecture

[Figure: Spark architecture layers]
Languages: SQL, SparkR, Java, Python, Scala
Libraries: Dataframes, Streaming, MLlib, GraphX
Engine: Spark Core
Data Sources: HDFS, HBase, Hive, Tachyon (Alluxio), Cassandra
Resource/cluster managers: Hadoop YARN, Amazon EC2, Standalone, Apache Mesos

Thank you. For the full course please enroll at
https://cloudxlab.com/

Next Instructor-Led Batch

Date: 25 Aug Time: 8 pm - 11 pm (IST) or 6:30 am - 9:30 am (PST)

A Typical WorkFlow

[Figure: workflow stages with example tools]
1. Gathering Data: NoSQL Databases, HBase, HDFS, Kafka
2. Cleaning Data (ETL): Spark RDD, MapReduce, Informatica
3. Basic Analysis: SparkSQL, Hive
4. Using ML Algorithms: Spark MLlib, GraphX, Scikit-learn, R
5. Optimizing ML Algorithms: TensorFlow, Linear Regression, Algorithms
6. Moving to Prod: Exposing REST APIs, Building Android app

Machine Learning Careers

[Figure: matrix of roles against skills, each cell marked "Only Basics" or "Good Knowledge"]
Roles: Software Engineer - ML, Software Engineer - Big Data, Applied ML Engineer, Core ML Engineer, Data Analyst
Skills: CS Basics; Probability and Statistics; Data Modeling and Evaluation; Applying ML Algos & Libs; SW Engg & System Design; Big Data; Machine Learning

Questions?
https://discuss.cloudxlab.com
[email protected]
