Big Data Course Student

The document discusses the concept of Big Data, highlighting its significance as a valuable resource surpassing oil. It covers various aspects of Big Data including its history, the differences between small and big data, cloud computing architectures (SaaS, PaaS, IaaS), and the five Vs of Big Data: Volume, Variety, Velocity, Veracity, and Value. Additionally, it outlines the Big Data pipeline, detailing the stages from data collection to visualization and security.

Uploaded by

yosri jemai

Big Data

Mohamed Ali HADJ TAIEB


Faculty of Sciences, University of Sfax, Tunisia

Big Data: The New Oil

Back in 2017, The Economist published a story titled "The world's most valuable resource is no longer oil, but data."

M. A. HADJ TAIEB 2
Introduction

Big Data history

Big Data history – Small Data vs Big Data

Virtualization

Cloud computing system architecture

SaaS: Software as a service

Software as a Service is a model for distributing software across the cloud. Applications are hosted by the service provider.

[Diagram: the Application, Platform, and Infrastructure layers are all provided by the cloud computing provider]

PaaS: Platform as a service

PaaS is a Cloud Computing model that offers hardware and software tools as a service over the internet, allowing the user to develop applications. The hardware and software are hosted on the supplier's infrastructure. Thus, users do not need to install their own hardware and software internally to develop or launch new applications.

[Diagram: the Application layer is the cloud user's own application; the Platform and Infrastructure layers are provided by the cloud computing provider]

IaaS: Infrastructure as a service

IaaS is a form of Cloud Computing that provides computing resources in a virtualized environment (the Cloud) over the Internet or other connection.

[Diagram: the Application and Platform layers are the cloud user's own application and OS; the Infrastructure layer is provided by the cloud computing provider]

Three cloud computing stacks

▪ SaaS: Software as a service
▪ PaaS: Platform as a service
▪ IaaS: Infrastructure as a service

[Diagram: Application, Platform, and Infrastructure layers]
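The split of responsibilities between the cloud user and the provider across the three stacks can be sketched as a small lookup table. This is only an illustration of the slide diagrams; the names `LAYERS`, `PROVIDER_LAYERS`, and `user_managed` are hypothetical.

```python
# Layers of the stack, top to bottom, as drawn on the slides.
LAYERS = ("Application", "Platform", "Infrastructure")

# Layers supplied by the cloud provider under each model (read off the slide diagrams).
PROVIDER_LAYERS = {
    "SaaS": {"Application", "Platform", "Infrastructure"},
    "PaaS": {"Platform", "Infrastructure"},
    "IaaS": {"Infrastructure"},
}

def user_managed(model: str) -> list[str]:
    """Return the layers the cloud user still brings themselves."""
    provided = PROVIDER_LAYERS[model]
    return [layer for layer in LAYERS if layer not in provided]

for model in PROVIDER_LAYERS:
    print(model, "-> user manages:", user_managed(model) or "nothing")
```

Reading the table from SaaS to IaaS, the user takes over one more layer at each step.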

Popular Cloud Computing (CC) Services

• Amazon Web Services (AWS): IaaS

• Google App Engine (GAE): PaaS

• Microsoft Azure Services Platform: PaaS

Virtualized vs. Traditional

What is Virtualization?

[Diagram: two virtual containers hosting App. A, App. B, App. C, and App. D on top of a virtualization layer, which runs on the hardware]

Virtualization allows one computer to do the job of multiple computers by sharing the resources of a single piece of hardware across multiple environments.

Traditional stack (top to bottom): App, App, App / Operating System / Hardware

Virtualized stack (top to bottom): App, App, App / OS, OS, OS / Hypervisor / Hardware

Virtual Machines

A hypervisor, or virtual machine monitor, is software, firmware, or hardware that creates and runs VMs. It sits between the hardware and the virtual machines and is necessary to virtualize the server.

Containers

Containers sit on top of a physical server and its host OS, for example Linux or Windows. Each container shares the host OS kernel and, usually, the binaries and libraries too. Shared components are read-only. Containers are thus exceptionally "light": they are only megabytes in size and take just seconds to start, versus gigabytes and minutes for a VM.

Context
Issues

Course outline

Big Data

Open source Big Data landscape

Big Data Analytics

Data Engineer vs. Data Scientist vs Data Analyst

Big Data Vs

5 Vs of Big Data

5 Vs - Volume

Volume refers to the amount of data, which is growing day by day at a very fast pace. It includes the size of the data generated by humans.

5 Vs - Variety

Many sources contribute to Big Data, and the types of data they generate differ. The data can be structured, semi-structured, or unstructured, and it arrives in the form of images, audio, video, sensor data, relational databases, etc.
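The three kinds of data named above can be illustrated with a small sketch. The sample values and the `describe` helper are hypothetical, and the classification rule is deliberately crude: it only checks whether all records share the same fields.

```python
import csv
import io
import json

# Structured: fixed columns, as in a relational table (hypothetical sample rows).
structured = list(csv.DictReader(io.StringIO("sensor_id,temp_c\ns1,21.5\ns2,19.0\n")))

# Semi-structured: self-describing records whose fields may vary.
semi_structured = json.loads('[{"user": "a", "tags": ["x"]}, {"user": "b"}]')

# Unstructured: raw bytes with no schema (stand-in for image, audio, or video content).
unstructured = bytes([0x89, 0x50, 0x4E, 0x47])

def describe(value) -> str:
    """Very rough classification, for illustration only."""
    if isinstance(value, (bytes, bytearray)):
        return "unstructured"
    field_sets = {frozenset(record) for record in value}
    return "structured" if len(field_sets) == 1 else "semi-structured"

for sample in (structured, semi_structured, unstructured):
    print(describe(sample))
```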

5 Vs - Velocity

Velocity is defined as the pace at which different sources generate data every day. This flow of data is massive and continuous. The slide cites 1.03 billion mobile Daily Active Users (Facebook DAU), an increase of 22% year-over-year.

[Figure label: 2020]

If you are able to handle the velocity, you will be able to generate insights and make decisions based on real-time data.

5 Vs - Veracity

Veracity refers to the quality or trustworthiness of the data. Data users have to be able to transform the data into trustworthy insight and discard noise.

Even the best analytics systems are only as good as the data they crunch.
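Discarding noise before analysis can be sketched as a simple validity filter. The readings, the plausibility bounds, and the `is_trustworthy` helper are all assumptions made for this illustration; real pipelines typically apply many more checks.

```python
# Hypothetical raw sensor readings; None and out-of-range values stand in for noise.
raw_readings = [
    {"sensor": "s1", "temp_c": 21.4},
    {"sensor": "s2", "temp_c": None},    # missing value
    {"sensor": "s3", "temp_c": 999.0},   # physically implausible
    {"sensor": "s4", "temp_c": 19.8},
]

# Assumed plausibility bounds for this illustration.
MIN_TEMP_C, MAX_TEMP_C = -50.0, 60.0

def is_trustworthy(reading: dict) -> bool:
    """Keep a reading only if it has a value inside the plausible range."""
    value = reading.get("temp_c")
    return value is not None and MIN_TEMP_C <= value <= MAX_TEMP_C

clean = [r for r in raw_readings if is_trustworthy(r)]
discarded = len(raw_readings) - len(clean)
print(f"kept {len(clean)} readings, discarded {discarded} as noise")
```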

5 Vs - Value

Value refers to the ability to transform a tsunami of data into business value.

[Figure: Big Data value chain; Create != Extract]


Big Data Pipeline

Big Data Pipeline – Data collector

Big Data Collection involves connecting to various data sources, extracting the data, and detecting the changed data. It is about moving data, especially unstructured data, from where it originates into a system where it can be stored and analyzed.
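One way to picture "detecting the changed data" is to compare two snapshots of a source by content hash. This is a minimal sketch under that assumption, not how any particular collector works; `fingerprint`, `detect_changes`, and the snapshots are hypothetical.

```python
import hashlib
import json

def fingerprint(record: dict) -> str:
    """Stable hash of a record's content (keys sorted so order does not matter)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def detect_changes(previous: dict, current: dict) -> dict:
    """Compare two snapshots keyed by record id and report what changed."""
    added = [k for k in current if k not in previous]
    removed = [k for k in previous if k not in current]
    updated = [
        k for k in current
        if k in previous and fingerprint(current[k]) != fingerprint(previous[k])
    ]
    return {"added": added, "removed": removed, "updated": updated}

# Hypothetical snapshots of one source taken at two collection runs.
snapshot_1 = {"u1": {"name": "Ada"}, "u2": {"name": "Bob"}}
snapshot_2 = {"u1": {"name": "Ada"}, "u2": {"name": "Robert"}, "u3": {"name": "Cy"}}

changes = detect_changes(snapshot_1, snapshot_2)
print(changes)
```

Only the added and updated records would then need to be moved into the storage system.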
Data Serialization in Big Data

Different types of users have different data consumption needs. Because the data being shared is variable, we must plan how users can access it in a meaningful way. That is why a single image of the variable data is used to optimize the data for human readability.

[Logo: Apache Avro]
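The trade-off between a human-readable encoding and a compact, schema-driven one can be sketched with the standard library alone. This does not use Apache Avro or its API; the `struct` layout below is only a stand-in for the idea of keeping field names in a schema rather than in every payload, and the record is hypothetical.

```python
import json
import struct

# Hypothetical record and a fixed "schema": a 4-byte unsigned id and an 8-byte float.
record = {"sensor_id": 42, "temp_c": 21.5}
SCHEMA = struct.Struct("<Id")  # little-endian: unsigned int, double

def to_json(rec: dict) -> bytes:
    """Human-readable, self-describing text encoding."""
    return json.dumps(rec).encode("utf-8")

def to_binary(rec: dict) -> bytes:
    """Compact encoding; field names live in the schema, not in the payload."""
    return SCHEMA.pack(rec["sensor_id"], rec["temp_c"])

def from_binary(payload: bytes) -> dict:
    """Decode a payload using the same schema that wrote it."""
    sensor_id, temp_c = SCHEMA.unpack(payload)
    return {"sensor_id": sensor_id, "temp_c": temp_c}

json_bytes = to_json(record)
binary_bytes = to_binary(record)
print(len(json_bytes), "bytes as JSON;", len(binary_bytes), "bytes as binary")
assert from_binary(binary_bytes) == record  # round trip preserves the data
```

The JSON form is easier for people to read, while the binary form is smaller but can only be decoded by a reader that knows the schema.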

Big Data Pipeline – Data ingestion layer


This layer concerns transporting data from the ingestion layer to the rest of the data pipeline. A messaging system acts as a mediator between all the programs that send and receive messages.
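The mediator role of a messaging system can be sketched with an in-process queue: the sender and the receiver never call each other directly. This is a toy stand-in under that assumption, not a real message broker; `broker`, `producer`, and `consumer` are hypothetical names.

```python
import queue
import threading

# The queue plays the role of the messaging system between programs.
broker = queue.Queue()
received: list[str] = []

def producer(messages: list[str]) -> None:
    """Send messages to the broker, then signal that there are no more."""
    for message in messages:
        broker.put(message)
    broker.put(None)  # sentinel: end of stream

def consumer() -> None:
    """Receive messages from the broker until the sentinel arrives."""
    while True:
        message = broker.get()
        if message is None:
            break
        received.append(message)

sender = threading.Thread(target=producer, args=(["click", "view", "purchase"],))
receiver = threading.Thread(target=consumer)
sender.start()
receiver.start()
sender.join()
receiver.join()
print(received)
```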

Ingestion is the process of bringing data into the data processing system. Data can be streamed in real time or ingested in batches. When data is ingested in real time, each item is ingested as soon as it arrives. When data is ingested in batches, items are ingested in chunks at periodic intervals.
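The difference between the two ingestion modes can be sketched with generators. As a simplifying assumption, the batch variant below groups items by count rather than by a time interval; `ingest_stream`, `ingest_batches`, and `events` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator

def ingest_stream(source: Iterable[int]) -> Iterator[list[int]]:
    """Real-time style: hand over each item as soon as it arrives."""
    for item in source:
        yield [item]

def ingest_batches(source: Iterable[int], batch_size: int) -> Iterator[list[int]]:
    """Batch style: collect items into chunks of at most batch_size."""
    iterator = iter(source)
    while chunk := list(islice(iterator, batch_size)):
        yield chunk

events = range(1, 8)  # hypothetical incoming items
streamed = list(ingest_stream(events))
batched = list(ingest_batches(events, batch_size=3))
print(streamed)
print(batched)
```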

Big Data Pipeline – Data storage layer

Storage becomes a challenge when the size of the data you are dealing with becomes large, so finding a suitable storage solution is very important. This layer focuses on "where to store such large data efficiently." Several solutions address this problem:

• HDFS: Hadoop Distributed File System

• GlusterFS: Dependable Distributed File System

• Amazon S3 (Amazon Simple Storage Service)

Big Data Pipeline – Data processing layer


Data collected in the previous layer is processed in this layer. Here the data is routed to different destinations and the data flow is classified; it is the first point where analytics may take place.

Big Data Batch Processing System

A simple batch processing system for offline analytics.

Near Real-Time Processing System

Near real-time processing is used when speed is important, but a processing time of minutes rather than seconds is acceptable. An example of near real-time processing is the identification of threats (detecting an intruder in the network).

Real-Time Processing System

Real-time processing requires continual input, constant processing, and a steady output of data. Examples of real-time processing include data streaming, radar systems, customer service systems, and bank ATMs, where immediate processing is crucial for the system to work properly.
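The contrast between batch and real-time processing can be sketched with one computation done both ways: an average computed once over a complete dataset, and the same average updated after every incoming value. The function names and the sample measurements are hypothetical.

```python
from typing import Iterable, Iterator

def batch_mean(values: list[float]) -> float:
    """Batch style: the whole dataset is available before processing starts."""
    return sum(values) / len(values)

def running_mean(values: Iterable[float]) -> Iterator[float]:
    """Real-time style: emit an updated result after every incoming value."""
    count, mean = 0, 0.0
    for value in values:
        count += 1
        mean += (value - mean) / count  # incremental update, no history kept
        yield mean

latencies_ms = [10.0, 20.0, 30.0, 40.0]  # hypothetical measurements
offline_result = batch_mean(latencies_ms)
live_results = list(running_mean(latencies_ms))
print(offline_result, live_results)
```

The last value of the running mean matches the batch result; the difference is that intermediate answers are available while data is still arriving.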

Big Data Pipeline – Data query layer

This is the layer where active analytic processing takes place. Here, the primary focus is to gather the value of the data so that it is more helpful for the next layer.

This is a field where interactive queries are necessary, and it is a zone traditionally dominated by expert SQL developers. Before Hadoop, storage was insufficient, which made the analytics process take a long time.
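An interactive SQL query of the kind this layer serves can be sketched with Python's built-in `sqlite3` module. SQLite is used here only because it needs no setup; it is not a big data query engine, and the `events` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical table of processed events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("TN", 12.0), ("FR", 30.0), ("TN", 8.0), ("FR", 5.0), ("DE", 7.5)],
)

# An ad hoc question: total amount per country, largest first.
rows = conn.execute(
    "SELECT country, SUM(amount) AS total "
    "FROM events GROUP BY country ORDER BY total DESC"
).fetchall()
conn.close()

for country, total in rows:
    print(country, total)
```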

Big Data Pipeline – Data visualisation layer

The visualization, or presentation, tier is probably the most prestigious tier, where the data pipeline users may feel the VALUE of DATA. We need something that will grab people's attention, pull them in, and make the findings well understood.

The data visualization layer is often the thermometer that measures the success of the project. This is where the data value is perceived by the user. While they are designed for handling and storing large volumes of data, Hadoop and other tools have no built-in provisions for data visualization and information distribution, leaving no way to make that data easily consumable by end business users.

Big Data Pipeline – Data security

Kerberos

Kerberos is a computer-network authentication protocol that works on the basis of tickets.
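The general idea of ticket-based authentication can be sketched as follows: a trusted issuer hands out a signed, time-limited ticket, and a service accepts the ticket instead of asking for a password. This is a toy illustration only and is not the Kerberos protocol (which involves a key distribution center, encrypted tickets, and more); every name and the shared key below are assumptions.

```python
import hashlib
import hmac
import json
import time

# Assumption: the issuing authority and the service share this secret key.
SERVICE_KEY = b"demo-secret-shared-with-the-service"

def issue_ticket(user: str, lifetime_s: int, now: float) -> dict:
    """Issuer side: sign the user name and an expiry time."""
    body = {"user": user, "expires": now + lifetime_s}
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    tag = hmac.new(SERVICE_KEY, payload, hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}

def accept_ticket(ticket: dict, now: float) -> bool:
    """Service side: check the signature and the expiry, never a password."""
    payload = json.dumps(ticket["body"], sort_keys=True).encode("utf-8")
    expected = hmac.new(SERVICE_KEY, payload, hashlib.sha256).hexdigest()
    genuine = hmac.compare_digest(expected, ticket["tag"])
    return genuine and now < ticket["body"]["expires"]

issued_at = time.time()
ticket = issue_ticket("alice", lifetime_s=300, now=issued_at)
forged = {"body": {**ticket["body"], "user": "mallory"}, "tag": ticket["tag"]}

print(accept_ticket(ticket, now=issued_at + 10))   # ticket within its lifetime
print(accept_ticket(forged, now=issued_at + 10))   # tampered ticket
print(accept_ticket(ticket, now=issued_at + 600))  # ticket past its lifetime
```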

Lightweight Directory Access Protocol (LDAP) was originally a protocol for querying and modifying directory services.

Big Data platforms

A big data platform is a type of IT solution that combines the features and capabilities of several big data applications and utilities within a single solution.

It is an enterprise-class IT platform that enables organizations to develop, deploy, operate, and manage a big data infrastructure or environment.

Big Data platforms – Lambda architecture


Lambda architecture is currently one of the most commonly used architectures for real-time data processing.

[Photo: Nathan Marz, who introduced the Lambda architecture]
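The slide does not describe the internals, but the Lambda architecture is usually presented as three parts: a batch layer that recomputes views from all historical data, a speed layer that covers recent data, and a serving layer that merges both to answer queries. The sketch below illustrates that idea with in-memory counters; the events and function names are hypothetical.

```python
from collections import Counter

# Hypothetical page-view events: those already in the master dataset,
# and recent ones that arrived after the last batch run.
historical_events = ["home", "cart", "home", "pay"]
recent_events = ["home", "cart"]

def batch_layer(events: list[str]) -> Counter:
    """Recompute a complete view from all historical data."""
    return Counter(events)

def speed_layer(events: list[str]) -> Counter:
    """Maintain an incremental view over recent data only."""
    view = Counter()
    for event in events:
        view[event] += 1
    return view

def serving_layer(batch_view: Counter, realtime_view: Counter, page: str) -> int:
    """Answer a query by merging the batch view with the real-time view."""
    return batch_view[page] + realtime_view[page]

batch_view = batch_layer(historical_events)
realtime_view = speed_layer(recent_events)
print(serving_layer(batch_view, realtime_view, "home"))
```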

Big Data platforms – Kappa architecture

Kappa architecture focuses only on processing the data in a stream. It is not intended to replace the Lambda architecture, except for certain specific use cases.

[Photo: Jay Kreps, who proposed the Kappa architecture]
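Processing everything as a stream can be sketched as a single code path fed from an append-only log, where changing the logic means replaying the log through the new code. This is a minimal in-memory illustration of that idea; `event_log`, `process_stream`, and the two job versions are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical append-only event log, treated as the single source of truth.
event_log = ["home", "cart", "home", "pay", "home", "cart"]

def process_stream(events: Iterable[str], keep: Callable[[str], bool]) -> Counter:
    """The only processing path: consume events in order and update a view."""
    view = Counter()
    for event in events:
        if keep(event):
            view[event] += 1
    return view

# Version 1 of the job counts every page.
view_v1 = process_stream(event_log, keep=lambda page: True)

# Changing the logic means replaying the same log through the new code,
# rather than maintaining a separate batch pipeline.
view_v2 = process_stream(event_log, keep=lambda page: page != "home")

print(view_v1)
print(view_v2)
```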
