Big Data Course Student

The document discusses the concept of Big Data, highlighting its significance as a valuable resource surpassing oil. It covers various aspects of Big Data including its history, the differences between small and big data, cloud computing architectures (SaaS, PaaS, IaaS), and the five Vs of Big Data: Volume, Variety, Velocity, Veracity, and Value. Additionally, it outlines the Big Data pipeline, detailing the stages from data collection to visualization and security.

Uploaded by

yosri jemai

Big Data

Mohamed Ali HADJ TAIEB

Faculty of Sciences, University of Sfax, Tunisia

[email protected]
Big Data: The New Oil

Back in 2017, The Economist published a story titled, "The world's most valuable resource is no longer oil, but data."

M. A. HADJ TAIEB 2
Introduction

Big Data history

Introduction

Big Data history – Small Data vs Big Data

Virtualization

Cloud computing system architecture

SaaS: Software as a service

Software as a Service is a model for distributing software across the cloud. Applications are hosted by the service provider.

[Figure: cloud stack with Application, Platform and Infrastructure layers, all provided by the cloud computing providers]

Virtualization

Cloud computing system architecture

PaaS: Platform as a service

PaaS is a Cloud Computing model that offers hardware and software tools as a service over the internet, allowing the user to develop applications. The hardware and software are hosted on the supplier's infrastructure. Thus, users do not need to install their own hardware and software internally to develop or launch new applications.

[Figure: cloud stack in which the Application layer is the cloud user's own application, while the Platform and Infrastructure layers are provided by the cloud computing providers]

Virtualization

Cloud computing system architecture

IaaS: Infrastructure as a service

IaaS is a form of Cloud Computing that provides computing resources in a virtualized environment (the Cloud) over the Internet or other connection.

[Figure: cloud stack in which the Application and Platform layers are the cloud user's own application and OS, while the Infrastructure layer is provided by the cloud computing providers]

Virtualization

Cloud computing system architecture

Three cloud computing stacks
▪ SaaS: Software as a service
▪ PaaS: Platform as a service
▪ IaaS: Infrastructure as a service

[Figure: cloud stack with Application, Platform and Infrastructure layers]

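As a rough illustration of the three stacks, the split of responsibilities shown in the SaaS, PaaS and IaaS diagrams can be written down as a small lookup table. This is only a sketch: the names `LAYERS`, `USER_MANAGED` and `managed_by` are hypothetical and chosen for this example.

```python
# Hypothetical sketch: who manages each layer in the three service models,
# following the slides (SaaS: the provider runs everything; PaaS: the user
# brings the application; IaaS: the user brings the application and OS).
LAYERS = ("Application", "Platform", "Infrastructure")

USER_MANAGED = {
    "SaaS": set(),
    "PaaS": {"Application"},
    "IaaS": {"Application", "Platform"},
}

def managed_by(model: str, layer: str) -> str:
    """Return 'user' or 'provider' for a layer under a given service model."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return "user" if layer in USER_MANAGED[model] else "provider"

# One row per service model, one column per layer.
summary = {
    model: {layer: managed_by(model, layer) for layer in LAYERS}
    for model in USER_MANAGED
}
```

Reading `summary` row by row shows the provider's share shrinking from SaaS to IaaS, while the Infrastructure layer stays with the provider in all three models.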
Virtualization

Popular Cloud Computing (CC) Services

• Amazon Web Services (AWS): IaaS

• Google App Engine (GAE): PaaS

• Microsoft Azure Services Platform: PaaS

Virtualization

Virtualized vs. Traditional

What is Virtualization?

Virtualization allows one computer to do the job of multiple computers, by sharing the resources of a single piece of hardware across multiple environments.

[Figure: two virtual containers holding App. A to App. D, on top of a virtualization layer above the hardware]

Virtualization

Virtualized vs. Traditional

[Figure: Traditional Stack (apps on a single operating system on hardware) vs. Virtualized Stack (apps, each with its own OS, on a hypervisor on hardware)]

Virtual Machines

A hypervisor, or virtual machine monitor, is software, firmware, or hardware that creates and runs VMs. It sits between the hardware and the virtual machine and is necessary to virtualize the server.

Containers

Containers sit on top of a physical server and its host OS, for example Linux or Windows. Each container shares the host OS kernel and, usually, the binaries and libraries, too. Shared components are read-only. Containers are thus exceptionally “light”: they are only megabytes in size and take just seconds to start, versus gigabytes and minutes for a VM.

Context
Issues

Course outline

Big Data

Open source Big Data landscape

Big Data

Big Data Analytics

Big Data

Data Engineer vs. Data Scientist vs Data Analyst

Big Data

Vs
Big Data Vs

5 Vs of Big Data

Big Data Vs

5 Vs - Volume

Volume refers to the amount of data, which is growing day by day at a very fast pace. This includes the size of the data generated by humans.

Big Data Vs

5 Vs - Variety

Many sources contribute to Big Data, and the types of data they generate differ. Data can be structured, semi-structured or unstructured, and comes in the form of images, audio, video, sensor data, relational databases, etc.

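The three categories of Variety can be made concrete with a small Python sketch. The sample data below is made up for illustration, and the choice of CSV, JSON and free text as representatives of structured, semi-structured and unstructured data is an assumption, not something the slide specifies.

```python
import csv
import io
import json

# Structured: fixed columns, e.g. a CSV export of a relational table.
structured = io.StringIO("id,temperature\n1,21.5\n2,19.0\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing, fields may vary from record to record.
semi_structured = json.loads('{"id": 3, "tags": ["sensor", "roof"]}')

# Unstructured: free text with no schema; here we only count words.
unstructured = "Sensor on the roof reported a sudden drop in temperature."
word_count = len(unstructured.split())
```

Each kind of data needs its own parsing step before it can be analyzed together with the others, which is the practical cost of Variety.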
Big Data Vs

5 Vs - Velocity

Velocity is defined as the pace at which different sources generate data every day. This flow of data is massive and continuous. At the time the figure was reported, Facebook had 1.03 billion Daily Active Users (DAU) on mobile, an increase of 22% year-over-year.

If you are able to handle the velocity, you will be able to generate insights and take decisions based on real-time data.

[Figure labeled "2020"]

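To get a feel for what a 22% year-over-year increase means, the quoted figures can be turned into a short worked calculation. The two input values come from the slide; the variable names are hypothetical.

```python
# Worked arithmetic for the figures quoted on the slide.
current_dau = 1.03e9      # daily active users on mobile
yoy_growth = 0.22         # 22% year-over-year increase

# If current = previous * (1 + growth), then previous = current / (1 + growth).
previous_dau = current_dau / (1 + yoy_growth)
added_users = current_dau - previous_dau

print(f"previous year: {previous_dau / 1e6:.0f} million")
print(f"added in one year: {added_users / 1e6:.0f} million")
```

Under these figures the user base a year earlier was roughly 844 million, so about 186 million daily users were added in twelve months.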
Big Data Vs

5 Vs - Veracity

Veracity refers to the quality or trustworthiness of the data. Data users have to be able to transform the data into trustworthy insight and discard noise.

Even the best analytics systems are only as good as the data they crunch.

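Discarding noise can be as simple as applying explicit validation rules before analysis. The following is a minimal sketch; the record layout, the plausibility range and the function names `is_trustworthy` and `discard_noise` are all assumptions made for this example.

```python
from typing import Iterable

def is_trustworthy(record: dict) -> bool:
    """Hypothetical rule: a reading needs a sensor id and a plausible value."""
    value = record.get("temperature")
    return (
        bool(record.get("sensor_id"))
        and isinstance(value, (int, float))
        and -50.0 <= value <= 60.0
    )

def discard_noise(records: Iterable[dict]) -> list:
    """Keep only the records that pass the validation rule."""
    return [r for r in records if is_trustworthy(r)]

readings = [
    {"sensor_id": "s1", "temperature": 21.5},
    {"sensor_id": "s2", "temperature": 999.0},   # implausible value
    {"sensor_id": "", "temperature": 20.0},      # missing sensor id
    {"sensor_id": "s3", "temperature": None},    # missing value
]
clean = discard_noise(readings)
```

Real pipelines usually keep the rejected records aside for inspection rather than dropping them silently, since a high rejection rate is itself a signal about data quality.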
Big Data Vs

5 Vs- Value

Value refers to the ability to transform a tsunami of data into business.

Create != Extract

Big Data value chain

Big Data Pipeline

Big Data Pipeline

Big Data Pipeline – Data collector

Big Data collection involves connecting to various data sources, extracting the data, and detecting the changed data. It is about moving data, especially unstructured data, from where it originated into a system where it can be stored and analyzed.

Data Serialization in Big Data

Different types of users have different data consumption needs. Because the data to be shared is variable, we must plan how users can access it in a meaningful way. That is why a single representation of the variable data is used, optimizing the data for human readability.

Example: Apache Avro

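One simple way to detect changed data between two extractions is to compare a hash of each record. This is a sketch under stated assumptions: records are JSON-serializable dictionaries keyed by an id, and the helper names `fingerprint` and `detect_changes` are hypothetical. Production collectors often rely on database logs or timestamps instead.

```python
import hashlib
import json

def fingerprint(record: dict) -> str:
    """Stable hash of a record (keys sorted so field order does not matter)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def detect_changes(previous: dict, current: dict) -> dict:
    """Compare two snapshots keyed by record id."""
    prev_fp = {key: fingerprint(rec) for key, rec in previous.items()}
    curr_fp = {key: fingerprint(rec) for key, rec in current.items()}
    return {
        "added": sorted(curr_fp.keys() - prev_fp.keys()),
        "removed": sorted(prev_fp.keys() - curr_fp.keys()),
        "changed": sorted(
            k for k in curr_fp.keys() & prev_fp.keys() if curr_fp[k] != prev_fp[k]
        ),
    }

# Made-up snapshots of the same source at two points in time.
old = {"u1": {"name": "Ada"}, "u2": {"name": "Bob"}}
new = {"u1": {"name": "Ada"}, "u2": {"name": "Robert"}, "u3": {"name": "Cy"}}
changes = detect_changes(old, new)
```

Only the records listed in `changes` need to be moved downstream, which keeps repeated extractions from re-sending data that has not changed.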
Big Data Pipeline

Big Data Pipeline – Data ingestion layer

This layer concerns data transportation from the ingestion layer to the rest of the data pipeline. A messaging system acts as a mediator between all the programs that can send and receive messages.

Ingestion is the process of bringing data into the data processing system. Data can be streamed in real time or ingested in batches. When data is ingested in real time, each item is ingested as soon as it arrives. When data is ingested in batches, data items are ingested in chunks at periodic intervals.

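The difference between the two ingestion modes can be sketched with Python's standard `queue.Queue` standing in for the messaging system. This is an assumption for illustration only: a real pipeline would use a dedicated message broker, and the function names below are hypothetical.

```python
import queue
from typing import Iterator

def ingest_realtime(source: queue.Queue) -> Iterator[str]:
    """Yield each message as soon as it is available (one at a time)."""
    while not source.empty():
        yield source.get()

def ingest_batches(source: queue.Queue, batch_size: int) -> Iterator[list]:
    """Yield messages in chunks; a real system would also flush on a timer."""
    batch = []
    while not source.empty():
        batch.append(source.get())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                      # flush the final, possibly smaller, chunk
        yield batch

def make_source(n: int) -> queue.Queue:
    """Build a queue preloaded with n made-up events."""
    q = queue.Queue()
    for i in range(n):
        q.put(f"event-{i}")
    return q

streamed = list(ingest_realtime(make_source(5)))
batched = list(ingest_batches(make_source(5), batch_size=2))
```

The same five events come out either as five individual items or as chunks of two, two and one; the trade-off is latency per item against overhead per delivery.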
Big Data Pipeline

Big Data Pipeline – Data storage layer

Storage becomes a challenge when the size of the data you are dealing with becomes large, and finding a storage solution then becomes very important. Several possible solutions can help with this problem. This layer focuses on “where to store such large data efficiently.”

• HDFS: Hadoop Distributed File System
• GlusterFS: Dependable Distributed File System
• Amazon S3 (Amazon Simple Storage Service)

Big Data Pipeline

Big Data Pipeline – Data processing layer

Data collected in the previous layer is processed in this layer. Here the data is routed to different destinations and the data flow is classified; it is the first point where analytics may take place.

Big Data Batch Processing System

A simple batch processing system for offline analytics.

Near Real-Time Processing System

Near real-time processing applies when speed is important, but a processing time of minutes rather than seconds is acceptable. An example of near real-time processing is the identification of threats (detecting an intruder in the network).

Real-Time Processing System

Real-time processing requires continual input, constant processing, and a steady output of data. Good examples of real-time processing are data streaming, radar systems, customer service systems, and bank ATMs, where immediate processing is crucial to make the system work properly.

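The contrast between batch and stream-style processing can be sketched on the intrusion-detection example. Everything here is an assumption made for illustration: the event format, the 60-second window, the threshold of three failed logins, and the function names.

```python
from collections import Counter, deque

# Made-up events: (timestamp in seconds, source ip, login succeeded?)
events = [
    (0, "10.0.0.5", False),
    (10, "10.0.0.5", False),
    (20, "10.0.0.7", True),
    (25, "10.0.0.5", False),
    (400, "10.0.0.5", False),
]

def batch_failed_logins(all_events):
    """Batch: run once over the full stored data set (offline analytics)."""
    return Counter(ip for _, ip, ok in all_events if not ok)

def stream_alerts(event_stream, window_seconds=60, threshold=3):
    """Streaming: handle events one by one, keeping only a recent window."""
    recent = {}  # ip -> deque of timestamps of recent failed logins
    for ts, ip, ok in event_stream:
        if ok:
            continue
        window = recent.setdefault(ip, deque())
        window.append(ts)
        # Drop failures that fell out of the time window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) >= threshold:
            yield (ts, ip)

totals = batch_failed_logins(events)
alerts = list(stream_alerts(events))
```

The batch function only answers after all data has been stored, whereas the streaming function can raise an alert at the moment the third failure arrives inside the window.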
Big Data Pipeline

Big Data Pipeline – Data query layer

This is the layer where active analytic processing takes place. Here, the primary focus is to gather the value of the data so that it is more helpful for the next layer.

This is a field where interactive queries are necessary, and it is a zone traditionally dominated by expert SQL developers. Before Hadoop, storage was insufficient, which made the analytics process long.

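An interactive query of the kind this layer serves might look like the following. The sketch uses Python's built-in `sqlite3` module purely as a stand-in, since the slide does not name a query engine; the `sales` table and its contents are made up.

```python
import sqlite3

# In-memory database standing in for the query engine of the pipeline.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# An interactive, ad hoc aggregation of the kind an analyst might type.
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
"""
results = conn.execute(query).fetchall()
conn.close()
```

The same SQL shape carries over to SQL-on-big-data engines; what changes is that the table is spread over many machines rather than held in one process.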
Big Data Pipeline

Big Data Pipeline – Data visualisation layer

The visualization, or presentation, tier is probably the most prestigious tier, where the data pipeline users may feel the VALUE of DATA. We need something that will grab people's attention, pull them in, and make the findings well understood.

The data visualization layer is often the thermometer that measures the success of the project. This is where the data value is perceived by the user. While they are designed for handling and storing large volumes of data, Hadoop and other tools have no built-in provisions for data visualization and information distribution, leaving no way to make that data easily consumable by end business users.

Big Data Pipeline

Big Data Pipeline – Data security

Kerberos
Kerberos is a computer-network authentication protocol that works
on the basis of tickets
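
The idea of a ticket can be illustrated with a toy example: an issuer signs a short-lived statement about the user, and a service checks the signature and expiry instead of asking for a password. This is not the Kerberos protocol, which involves a key distribution center, ticket-granting tickets and encrypted session keys; it is only a sketch, and `SERVICE_KEY`, `issue_ticket` and `verify_ticket` are hypothetical names.

```python
import hashlib
import hmac
import time

SERVICE_KEY = b"demo-key-shared-by-issuer-and-service"  # hypothetical secret

def issue_ticket(user: str, lifetime_s: int, now: float) -> str:
    """Issuer side: bind a user name to an expiry time and sign both."""
    expires = int(now) + lifetime_s
    body = f"{user}|{expires}"
    tag = hmac.new(SERVICE_KEY, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}|{tag}"

def verify_ticket(ticket: str, now: float) -> bool:
    """Service side: accept only an unexpired ticket with a valid signature."""
    try:
        user, expires, tag = ticket.split("|")
        expiry = int(expires)
    except ValueError:
        return False
    expected = hmac.new(
        SERVICE_KEY, f"{user}|{expires}".encode(), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(tag, expected) and now < expiry

now = time.time()
ticket = issue_ticket("alice", lifetime_s=300, now=now)
```

A ticket that has expired, or whose user name has been altered, fails verification, which is the property that lets a service trust the ticket without contacting the issuer on every request.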
Big Data Pipeline

Big Data Pipeline – Data security

Lightweight Directory Access Protocol (LDAP) is a protocol originally designed for querying and modifying directory services.

Big Data platforms

Big Data platforms

A big data platform is a type of IT solution that combines the features and capabilities of several big data applications and utilities within a single solution.

It is an enterprise-class IT platform that enables organizations to develop, deploy, operate and manage a big data infrastructure/environment.

Big Data platforms

Big Data platforms – Lambda architecture

Lambda architecture is currently one of the most commonly used architectures for real-time data processing. It was proposed by Nathan Marz.

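The Lambda architecture is usually described as three cooperating layers: a batch layer, a speed layer and a serving layer. The slide does not spell these out, so the following is a minimal sketch based on that common description, with made-up page-view events and hypothetical function names.

```python
from collections import Counter

master_dataset = ["home", "cart", "home"]      # immutable, already stored
recent_events = ["home", "checkout"]           # arrived since the last batch run

def batch_layer(events):
    """Recompute a complete view from all historical data (slow, accurate)."""
    return Counter(events)

def speed_layer(events):
    """Incrementally update a view over recent data only (fast)."""
    view = Counter()
    for event in events:
        view[event] += 1
    return view

def serving_layer(batch_view, realtime_view):
    """Answer queries by merging the batch view with the real-time view."""
    return batch_view + realtime_view

page_views = serving_layer(batch_layer(master_dataset), speed_layer(recent_events))
```

The cost of this design is that the same counting logic lives in two code paths, one for the batch layer and one for the speed layer, and both must be kept consistent.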
Big Data platforms

Big Data platforms – Kappa architecture

Kappa architecture focuses only on processing the data as a stream. It is not intended to replace the Lambda architecture, except for certain specific use cases. It was proposed by Jay Kreps.

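In a Kappa-style design there is a single stream-processing code path, and reprocessing is done by replaying the event log rather than by running a separate batch job. The sketch below assumes an in-memory list as the append-only log; the event names and the functions `process_stream` and `merge_shop_events` are hypothetical.

```python
from collections import Counter

event_log = ["home", "cart", "home", "checkout"]   # append-only log (made-up data)

def process_stream(log, start_offset=0, transform=lambda e: e):
    """Single code path: consume the log in order from a given offset."""
    view = Counter()
    for event in log[start_offset:]:
        view[transform(event)] += 1
    return view

def merge_shop_events(event):
    """New version of the logic: group cart and checkout under one label."""
    return "shop" if event in {"cart", "checkout"} else event

# Normal operation: process the stream as it is.
view_v1 = process_stream(event_log)

# Logic change: replay the same log from offset 0 through the new processor.
view_v2 = process_stream(event_log, transform=merge_shop_events)
```

Because the log is retained, switching from `view_v1` to `view_v2` needs no second system; this only works when the log can be kept and replayed in reasonable time.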
