Big Data
Mohamed Ali HADJ TAIEB
Faculty of Sciences, University of Sfax, Tunisia
[email protected]
Big Data: The New Oil
Back in 2017, The Economist published a story titled "The world's most valuable resource is no longer oil, but data."
M. A. HADJ TAIEB 2
Introduction
Big Data history
Introduction
Big Data history – Small Data vs Big Data
Virtualization
Cloud computing system architecture
SaaS: Software as a service
Software as a Service is a model for distributing software across the cloud. Applications are hosted by the service provider.
Stack layers: Application, Platform, Infrastructure (all provided by the cloud computing providers)
Virtualization
Cloud computing system architecture
PaaS: Platform as a service
PaaS is a Cloud Computing model that offers hardware and software tools as a service over the internet, allowing the user to develop applications. The hardware and software are hosted on the supplier’s infrastructure. Thus, users do not need to install their own hardware and software internally to develop or launch new applications.
Stack layers: Application (a cloud user’s own application); Platform and Infrastructure (provided by the cloud computing providers)
Virtualization
Cloud computing system architecture
IaaS: Infrastructure as a service
IaaS is a form of Cloud Computing that provides computing resources in a virtualized environment (the Cloud) over the Internet or another connection.
Stack layers: Application and Platform (a cloud user’s own application and OS); Infrastructure (provided by the cloud computing providers)
Virtualization
Cloud computing system architecture
Three cloud computing stacks
▪SaaS: Software as a service
▪PaaS: Platform as a service
▪IaaS: Infrastructure as a service
Stack layers: Application, Platform, Infrastructure
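The split of responsibilities between the cloud user and the provider under the three models can be summarized in a short Python sketch. The names `LAYERS`, `PROVIDER_LAYERS` and `user_layers` are illustrative assumptions and do not belong to any cloud provider's API.

```python
# Layers of the cloud computing stack, from top to bottom.
LAYERS = ("application", "platform", "infrastructure")

# Layers supplied by the cloud provider under each service model,
# as described on the SaaS, PaaS and IaaS slides.
PROVIDER_LAYERS = {
    "SaaS": {"application", "platform", "infrastructure"},
    "PaaS": {"platform", "infrastructure"},
    "IaaS": {"infrastructure"},
}


def user_layers(model):
    """Return the layers the cloud user still has to bring under `model`."""
    return [layer for layer in LAYERS if layer not in PROVIDER_LAYERS[model]]


for model in PROVIDER_LAYERS:
    print(model, "-> user provides:", user_layers(model) or "nothing")
```

The lower a model sits in the stack, the more the user has to supply: nothing for SaaS, the application for PaaS, and both application and platform (OS) for IaaS.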
Virtualization
Popular Cloud Computing (CC) Services
• Amazon Web Services (AWS): IaaS
• Google App Engine (GAE): PaaS
• Microsoft Azure Services Platform: PaaS
Virtualization
Virtualized vs. Traditional
What is Virtualization?
[Diagram: two virtual containers running App. A, App. B, App. C and App. D on a virtualization layer above the hardware]
Virtualization allows one computer to do the job of multiple computers by sharing the resources of a single piece of hardware across multiple environments.
Virtualization
Virtualized vs. Traditional
Traditional Stack (top to bottom): App, App, App | Operating System | Hardware
Virtualized Stack (top to bottom): App + OS, App + OS, App + OS | Hypervisor | Hardware
Virtual Machines
A hypervisor, or virtual machine monitor, is software, firmware, or hardware that creates and runs VMs. It sits between the hardware and the virtual machines and is necessary to virtualize the server.
Containers
Containers sit on top of a physical server and its host OS, for example Linux or Windows. Each container shares the host OS kernel and, usually, the binaries and libraries, too. Shared components are read-only. Containers are thus exceptionally “light”: they are only megabytes in size and take just seconds to start, versus gigabytes and minutes for a VM.
Context
Issues
Course outline
Big Data
Open source Big Data landscape
Big Data
Big Data Analytics
Big Data
Data Engineer vs. Data Scientist vs. Data Analyst
Data Engineer vs. Data Scientist vs. Data Analyst
Big Data
Vs
Big Data Vs
5 Vs of Big Data
Big Data Vs
5 Vs - Volume
Volume refers to the amount of data, such as the size of the data generated by humans, which is growing day by day at a very fast pace.
Big Data Vs
5 Vs - Variety
Many sources contribute to Big Data, and the types of data they generate differ. Data can be structured, semi-structured or unstructured, and arrives in the form of images, audio, video, sensor data, relational databases, etc.
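The three categories of data can be illustrated with a minimal Python sketch that uses only the standard library. The sample records below are made up for the illustration.

```python
import csv
import io
import json

# Structured: fixed columns, as in a relational table or a CSV export.
structured = io.StringIO("id,name,age\n1,Amira,29\n")
row = next(csv.DictReader(structured))

# Semi-structured: self-describing, and fields may vary between records.
semi_structured = json.loads(
    '{"id": 2, "tags": ["sensor", "iot"], "reading": {"temp": 21.5}}'
)

# Unstructured: free text (or image/audio bytes) with no schema;
# any structure has to be extracted by processing.
unstructured = "Customer called about a late delivery and asked for a refund."
word_count = len(unstructured.split())

print(row["name"], semi_structured["reading"]["temp"], word_count)
```

Structured data can be addressed by column name right away, semi-structured data by navigating nested keys, while unstructured data needs a processing step (here a naive word count) before anything can be measured.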
Big Data Vs
5 Vs - Velocity
Velocity is defined as the pace at which different sources generate data every day. This flow of data is massive and continuous. There are 1.03 billion Daily Active Users (Facebook DAU) on Mobile as of now, which is an increase of 22% year-over-year.
If you are able to handle the velocity, you will be able to generate insights and make decisions based on real-time data.
[Figure, labeled 2020]
Big Data Vs
5 Vs - Veracity
Veracity refers to the quality or trustworthiness of the data. Data users have to be able to transform the data into trustworthy insight and discard noise.
Even the best analytics systems are only as good as the data they crunch.
Big Data Vs
5 Vs - Value
Value refers to the ability to transform a tsunami of data into business value.
Create != Extract
Big Data value chain
Big Data Pipeline
Big Data Pipeline
Big Data Pipeline
Big Data Pipeline – Data collector
Big Data Collection involves connecting to various data sources, extracting the data, and detecting changed data. It is about moving data, especially unstructured data, from where it originates into a system where it can be stored and analyzed.
Data Serialization in Big Data (tool shown: Apache Avro)
Different types of users have different data consumption needs. Here we want to share variable data, so we must plan how the user can access the data in a meaningful way. That is why a single image of the variable data is used to optimize the data for human readability.
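The general idea of serialization can be sketched with the Python standard library. Apache Avro is deliberately not used here, since it requires a third-party library; the record fields and the binary layout `SCHEMA` are assumptions made for the illustration.

```python
import json
import struct

record = {"sensor_id": 7, "temperature": 21.5}

# Text serialization: self-describing and human-readable, but larger.
as_json = json.dumps(record).encode("utf-8")

# Binary serialization: writer and reader agree on a schema beforehand
# (here: unsigned 32-bit int followed by a 64-bit float, little-endian).
SCHEMA = "<Id"
as_binary = struct.pack(SCHEMA, record["sensor_id"], record["temperature"])

# Deserialization restores the original values from the bytes.
sensor_id, temperature = struct.unpack(SCHEMA, as_binary)
restored = {"sensor_id": sensor_id, "temperature": temperature}

print(len(as_json), "bytes as JSON,", len(as_binary), "bytes as binary")
```

The text form can be read directly by a person, whereas the binary form is more compact but only meaningful to a reader that knows the schema, which is the trade-off schema-based formats have to manage.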
Big Data Pipeline
Big Data Pipeline – Data ingestion layer
This layer concerns data transportation from the ingestion layer to the rest of the Data Pipeline. A messaging system acts as a mediator between all the programs that send and receive messages.
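The mediator role of a messaging system can be sketched in Python, with an in-process `queue.Queue` standing in for a real message broker. The names `broker`, `producer`, `consumer` and the event fields are assumptions made for the illustration.

```python
import queue
import threading

broker = queue.Queue()   # stands in for the messaging system
received = []
STOP = object()          # sentinel telling the consumer to finish


def producer():
    for i in range(3):
        broker.put({"event_id": i})   # the sender only knows the broker
    broker.put(STOP)


def consumer():
    while True:
        message = broker.get()        # the receiver only knows the broker
        if message is STOP:
            break
        received.append(message)


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(received)
```

Neither side calls the other directly, so senders and receivers can be added, removed or slowed down independently.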
Data can be streamed in real time or ingested in batches. When data is ingested in real time, each item is ingested as soon as it arrives. When data is ingested in batches, data items are ingested in chunks at periodic intervals. Ingestion is the process of bringing data into the Data Processing system.
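The difference between the two ingestion modes can be sketched as follows. To keep the example deterministic, batches are cut by item count rather than by a timer, which is a simplifying assumption; the function names are hypothetical.

```python
from itertools import islice


def ingest_streaming(source, sink):
    """Ingest each item as soon as it arrives."""
    for item in source:
        sink.append([item])          # one ingestion call per item


def ingest_batches(source, sink, batch_size):
    """Ingest items in chunks of up to `batch_size`."""
    iterator = iter(source)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        sink.append(batch)           # one ingestion call per chunk


events = range(1, 8)
stream_calls, batch_calls = [], []
ingest_streaming(events, stream_calls)
ingest_batches(events, batch_calls, batch_size=3)

print(len(stream_calls), "streaming calls;", "batches:", batch_calls)
```

Streaming minimizes the delay before an item is available downstream, while batching reduces the number of ingestion calls at the cost of latency.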
Big Data Pipeline
Big Data Pipeline – Data storage layer
Storage becomes a challenge when the size of the data you are dealing with becomes large, so finding a suitable storage solution is very important. This layer focuses on “where to store such large data efficiently.” Several solutions are available:
• HDFS: Hadoop Distributed File System
• GlusterFS: Dependable Distributed File System
• Amazon S3 (Amazon Simple Storage Service)
Big Data Pipeline
Big Data Pipeline – Data processing layer
Data collected in the previous layer is processed in this layer. Here the data is routed to different destinations and the data flow is classified; it is the first point where analytics may take place.
Big Data Batch Processing System
A simple batch processing system for offline analytics.
Near Real-Time Processing System
Near real-time processing applies when speed is important, but a processing time of minutes rather than seconds is acceptable. An example of near real-time processing is the identification of threats (detecting an intruder in the network).
Real-Time Processing System
Real-time processing requires a continual input, constant processing, and steady output of data. Good examples of real-time processing are data streaming, radar systems, customer service systems, and bank ATMs, where immediate processing is crucial to make the system work properly.
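The contrast between batch and real-time processing can be sketched with a simple average. The function names and the sample readings are assumptions made for the illustration.

```python
def batch_average(readings):
    """Batch processing: the full dataset is available before computing."""
    return sum(readings) / len(readings)


def streaming_average(readings):
    """Real-time style: update the result after every incoming item."""
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0, 40.0]
print(batch_average(readings))              # one result at the end
print(list(streaming_average(readings)))    # one result per item
```

The batch version produces a single answer once all data is in, whereas the streaming version keeps only a small running state and emits an up-to-date answer after each item.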
Big Data Pipeline
Big Data Pipeline – Data query layer
This is the layer where active analytic processing takes place. Here, the primary focus is to extract value from the data so that it is more useful for the next layer.
This is a field where interactive queries are necessary, and it is a zone traditionally dominated by expert SQL developers. Before Hadoop, insufficient storage made the analytics process slow.
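An interactive analytic query of the kind handled by this layer can be sketched with Python's built-in `sqlite3` module. The in-memory SQLite database only stands in for a Big Data query engine, and the `sales` table and its rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# An interactive, analytic query: total sales per region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(totals)
```

The same `GROUP BY` aggregation pattern is what SQL developers typically express against large datasets, with the engine distributing the work behind the scenes.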
Big Data Pipeline
Big Data Pipeline – Data visualisation layer
The visualization, or presentation, tier is probably the most prestigious tier, where the data pipeline users may feel the VALUE of DATA. We need something that will grab people’s attention, draw them in, and make the findings well understood.
The data visualization layer is often the thermometer that measures the success of the project. This is where the data value is perceived by the user. Although they are designed for handling and storing large volumes of data, Hadoop and other tools have no built-in provisions for data visualization and information distribution, leaving no way to make that data easily consumable by business end users.
Big Data Pipeline
Big Data Pipeline – Data security
Kerberos
Kerberos is a computer-network authentication protocol that works
on the basis of tickets
Big Data Pipeline
Big Data Pipeline – Data security
Lightweight Directory Access Protocol (LDAP) is a protocol originally designed for querying and modifying directory services.
Big Data platforms
Big Data platforms
A big data platform is a type of IT solution that combines the features and capabilities of several big data applications and utilities within a single solution.
It is an enterprise-class IT platform that enables organizations to develop, deploy, operate and manage a big data infrastructure/environment.
Big Data platforms
Big Data platforms – Lambda architecture
Lambda architecture is currently one of the most commonly used architectures for real-time data processing.
Proposed by Nathan Marz
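As it is commonly described, the Lambda architecture combines a batch layer, a speed layer and a serving layer. The following Python sketch shows that idea on a toy page-view counter; the function names and the sample events are assumptions made for the illustration, not a reference implementation.

```python
from collections import Counter

# Hypothetical page-view events; the master dataset is append-only.
master_dataset = ["home", "about", "home"]   # already covered by the batch run
recent_events = ["home", "contact"]          # arrived since the last batch run


def batch_layer(events):
    """Recompute the batch view from the full master dataset."""
    return Counter(events)


def speed_layer(events):
    """Compute a real-time view over recent events only."""
    return Counter(events)


def serving_layer(batch_view, realtime_view, key):
    """Answer a query by merging the batch and real-time views."""
    return batch_view[key] + realtime_view[key]


batch_view = batch_layer(master_dataset)
realtime_view = speed_layer(recent_events)
print(serving_layer(batch_view, realtime_view, "home"))
```

The batch view is complete but stale, the real-time view is fresh but partial, and a query only becomes accurate once both are merged.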
Big Data platforms
Big Data platforms – Kappa architecture
Kappa architecture focuses only on processing the data in a stream. It is not
intended to replace the Lambda architecture, except for certain specific use
cases.
Proposed by Jay Kreps
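The stream-only approach of the Kappa architecture can be sketched in Python as a single processing path fed by an append-only log, where reprocessing means replaying that log through the same job. The names `event_log` and `stream_job` and the sample events are assumptions made for the illustration.

```python
# Hypothetical append-only event log (e.g. a topic in a log-based broker).
event_log = [("home", 1), ("about", 1), ("home", 1)]


def stream_job(events, state=None):
    """The single processing path: fold events into a view, one at a time."""
    view = dict(state or {})
    for page, hits in events:
        view[page] = view.get(page, 0) + hits
    return view


# Normal operation: process events as they arrive.
view = stream_job(event_log)

# New events keep flowing through the same job.
view = stream_job([("contact", 1)], state=view)

# Reprocessing: replay the whole log through the same job; no batch layer.
event_log.append(("contact", 1))
rebuilt = stream_job(event_log)

print(view == rebuilt)
```

Because there is only one code path, the incrementally maintained view and the view rebuilt from a full replay agree by construction.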