Big Data Challenges and Solutions



Large-Scale and Multi-Structured Databases
The Big Data Era
Prof Pietro Ducange
Big Data: What, Who, Where, How
• Big Data is a fashionable buzzword

• Everybody, everywhere, talks about Big Data

• Enterprises, universities, research centers, researchers, software developers, consultants, etc., state that they work with Big Data

• The question is: what can we do with Big Data? How can we use Big Data?
The Big Data Era (I)
The Big Data Era (II)

Genomics
The Big Data Era (II)

Image extracted from: http://www.talismaniandigital.com/importance-of-big-data-analytics-in-digital-marketing/
Big Data in a Snapshot

Image extracted from: https://www.edureka.co/blog/what-is-big-data/

10 Vs of Big Data

Image extracted from: https://towardsdatascience.com/big-data-analysis-spark-and-hadoop-a11ba591c057

42 Vs of Big Data (in 2017)

Image extracted from: https://www.kdnuggets.com/2017/04/42-vs-big-data-data-science.html

Big data has many faces

Source: Yvan Saeys. A gentle Introduction to Big Data

The Main Big Data Issue
Big Data refers to any characteristic of a problem that makes it a challenge to process with traditional applications.

Big Data involves data whose volume, diversity and complexity require new techniques, algorithms and analyses to fetch, store and process them.
The Main Big Data Computing Issue
• Classical data processing algorithms, such as the ones for analytics and data mining, cannot be directly applied to Big Data

• New technologies are needed for both storage and computational tasks

• New paradigms are needed for designing, implementing and experimenting with algorithms for handling Big Data
How to deal with data intensive applications?
Scale-up vs. Scale-out

Image extracted from: https://hadoop4usa.wordpress.com/2012/04/13/scale-out-up/

Distributed systems in Big Data
• Objective: To apply an operation to all data
– One machine cannot process or store all data
• Data is distributed in a cluster of computing nodes
• It does not matter which machine executes the operation
• It does not matter if it is run twice on different nodes (because of failures or straggler nodes)
• We look for an abstraction of the complexity behind
distributed systems
– DATA LOCALITY is crucial
• Avoid data transfers between machines as much as possible

Slide courtesy of F. Herrera, A. Fernandez, Isaac Triguero, Fuzzy Models for Data Science and Big Data, Tutorial @
FuzzIEEE 2017, Naples, Italy
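
As an illustrative sketch (not part of the original slides), the properties listed above can be mimicked in a few lines of Python: the data is split into partitions, the same operation is applied to each partition, and re-running a task does no harm as long as the operation is deterministic and results are stored per partition. The names `partitions`, `run_task` and `results` are hypothetical, and the "nodes" are only simulated.

```python
# Toy model: each partition would live on some node; a task applies op to one partition.
partitions = {0: [1, 2, 3], 1: [4, 5], 2: [6, 7, 8, 9]}  # hypothetical data layout

def run_task(partition_id, op):
    """Apply op to every element of one partition and return (id, result)."""
    return partition_id, [op(x) for x in partitions[partition_id]]

def square(x):
    return x * x

results = {}
# Schedule every task once, then run task 1 again as if a straggler had been duplicated.
for pid in [0, 1, 2, 1]:
    key, value = run_task(pid, square)
    results[key] = value  # keyed by partition: a duplicate run stores the same value again

flat = [y for pid in sorted(results) for y in results[pid]]
print(flat)
```

The duplicate run of task 1 leaves the final result unchanged, which is the property that lets a scheduler re-launch failed or slow tasks freely.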

Data-intensive jobs with data locality
[Figure: nodes connected by a communication network (limited communication); jobs with low compute intensity over lots of input data.]

Solution: store data on local disks of the nodes that perform computations on that data (“data locality”)
Slide courtesy of F. Herrera, A. Fernandez, Isaac Triguero, Fuzzy Models for Data Science and Big Data, Tutorial @
FuzzIEEE 2017, Naples, Italy

MapReduce (I)
New programming model: MapReduce
– “Moving computation is cheaper than moving computation
and data at the same time”
– Idea
• Data is distributed among nodes (distributed file system)
• Functions/operations to process data are distributed to
all the computing nodes
• Each computing node works with the data stored in it
• Only the necessary data is moved across the network
Slide courtesy of F. Herrera, A. Fernandez, Isaac Triguero, Fuzzy Models for Data Science and Big Data, Tutorial @
FuzzIEEE 2017, Naples, Italy
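
To make the idea concrete, here is a minimal sketch, under the assumption of a three-node cluster simulated with a plain dictionary, that compares how many values cross the network when the data is shipped to one machine versus when the function is run where the data lives. The node names and the helper functions are hypothetical.

```python
# Hypothetical cluster: node name -> block of numbers stored on that node's local disk.
cluster = {
    "node-a": list(range(0, 1000)),
    "node-b": list(range(1000, 2000)),
    "node-c": list(range(2000, 3000)),
}

def move_data_to_driver(cluster):
    """Ship every record to one machine, then compute there."""
    transferred = sum(len(block) for block in cluster.values())  # records sent over the network
    total = sum(x for block in cluster.values() for x in block)
    return total, transferred

def move_computation_to_data(cluster):
    """Run sum() where each block lives; ship back one partial result per node."""
    partials = [sum(block) for block in cluster.values()]  # computed locally on each node
    transferred = len(partials)                            # only the partial sums travel
    return sum(partials), transferred

total_central, moved_central = move_data_to_driver(cluster)
total_local, moved_local = move_computation_to_data(cluster)
print(total_central == total_local, moved_central, moved_local)
```

Both strategies give the same total, but the second one moves three partial sums instead of three thousand records.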

MapReduce (II)
• Parallel programming model

• Introduced by Google in 2004

• Divide & conquer strategy

– divide: partition the dataset into smaller, independent chunks to be processed in parallel (map)
– conquer: combine, merge or otherwise aggregate the results from the previous step (reduce)
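
The divide and conquer steps can be sketched in plain Python. This is only an illustration of the strategy, run sequentially on one machine; `chunks`, `map_chunk` and `combine` are hypothetical helper names, and finding a maximum is an arbitrary choice of operation.

```python
from functools import reduce

def chunks(data, size):
    """Divide: split data into independent chunks of at most `size` items."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_chunk(chunk):
    """Map: process one chunk in isolation (here, find its maximum)."""
    return max(chunk)

def combine(a, b):
    """Reduce: merge two partial results."""
    return a if a >= b else b

data = [7, 3, 19, 4, 11, 25, 8, 2, 16]
partials = [map_chunk(c) for c in chunks(data, 4)]  # each call could run in parallel
overall = reduce(combine, partials)
print(partials, overall)
```

Because each chunk is processed without looking at the others, the map calls could be handed to different machines without changing the result.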

MapReduce: Basic Working (I)

Image extracted from: Elkano, Mikel, et al. "CHI-BD: A fuzzy rule-based classification system for Big Data classification problems." Fuzzy Sets and Systems
(2017).
MapReduce: Basic Working (II)

Image extracted from: Elkano, Mikel, et al. "CHI-BD: A fuzzy rule-based classification system for Big Data classification problems." Fuzzy Sets and Systems
(2017).
MapReduce: WordCount Example
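
The slide's figure is not available in this text, so the following is a hedged, single-machine sketch of how WordCount is usually expressed as map, shuffle and reduce phases. It is not Hadoop code; the function names and the two input lines are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

lines = ["big data is big", "data locality is crucial"]  # hypothetical input split
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real cluster the map calls would run on the nodes holding each input split, and the shuffle would move only the (word, count) pairs across the network.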
Google Software Architecture for Big Data

Image extracted from: “Guy Harrison, Next Generation Databases, Apress, 2015”
Hadoop
• Hadoop is:
– An open-source framework (the first of its kind) written in Java
– A platform for distributed storage of very large data sets (Big Data)
– A platform for distributed processing of very large data sets

• This framework consists of a number of modules

– Hadoop Common: the common utilities that support the other Hadoop
modules.
– Hadoop Distributed File System (HDFS)
– Hadoop YARN – A framework for job scheduling and cluster resource
management
– Hadoop MapReduce – programming model

http://hadoop.apache.org/

Distributed File System: HDFS
HDFS – Hadoop Distributed File System
– Distributed File System written in Java
– Scales to clusters with thousands of computing nodes
• Each node stores part of the data in the system
– Fault tolerant due to data replication
– Designed for big files and low-cost hardware
• GBs, TBs, PBs
– Efficient for read and append operations (random
updates are rare)
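
As a rough sketch of how splitting a big file into blocks and replicating them gives fault tolerance, the snippet below assumes a 128 MB block size and a replication factor of 3 (common HDFS defaults, both configurable) and places replicas with a naive round-robin rule. Real HDFS uses a rack-aware placement policy, so `place_blocks` and the node names are purely hypothetical.

```python
import math

BLOCK_SIZE_MB = 128   # assumed block size
REPLICATION = 3       # assumed replication factor

def place_blocks(file_size_mb, nodes):
    """Return {block_index: [nodes holding a replica]} using naive round-robin placement."""
    if len(nodes) < REPLICATION:
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for b in range(n_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]     # hypothetical DataNode names
placement = place_blocks(1000, nodes)    # a 1000 MB file
print(len(placement), placement[0], placement[7])
```

With three replicas on distinct nodes, the loss of any single node still leaves at least two copies of every block.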

Hadoop Limits
• Hadoop is optimized for one-pass batch processing of on-disk data

• It performs poorly on interactive data exploration and on more complex multi-pass analytics algorithms

• Because of its poor inter-communication capability and its inadequacy for in-memory computation, Hadoop is not suitable for applications that require iterative and/or online computation
SPARK
• Apache Spark is an open-source framework that has emerged as the next-generation big data processing tool thanks to its enhanced flexibility and efficiency.

• Spark allows employing different distributed programming models, such as MapReduce and Pregel, and has proved to perform faster than Hadoop, especially for iterative and online applications.

• Unlike the disk-based MapReduce paradigm supported by Hadoop, Spark employs the concept of in-memory cluster computing, where datasets are cached in memory to reduce their access latency.
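
Why caching matters for iterative workloads can be illustrated with a toy model. This is not Spark or Hadoop code: `DiskDataset` is a hypothetical stand-in that simply counts full scans of its input, and the iterative algorithm is an arbitrary example that makes one pass over the data per iteration.

```python
class DiskDataset:
    """Stand-in for on-disk data; counts how many times it is read."""
    def __init__(self, values):
        self._values = values
        self.reads = 0

    def load(self):
        self.reads += 1          # each call models a full scan from disk
        return list(self._values)

def mean_by_iteration(load, iterations=10, step=0.5):
    """Toy iterative algorithm: move an estimate toward the data mean, one pass per iteration."""
    estimate = 0.0
    for _ in range(iterations):
        data = load()
        estimate += step * (sum(data) / len(data) - estimate)
    return estimate

disk_style = DiskDataset([2.0, 4.0, 6.0])
disk_result = mean_by_iteration(disk_style.load)        # re-reads the input on every pass

cached_style = DiskDataset([2.0, 4.0, 6.0])
in_memory = cached_style.load()                         # read once, keep in memory
cached_result = mean_by_iteration(lambda: in_memory)    # later passes reuse the cached copy

print(disk_style.reads, cached_style.reads)
```

Both runs compute the same estimate, but the disk-style run scans its input once per iteration while the cached run scans it once in total.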
Driver and Executors
At a high level, a Spark application runs as a set of independent processes on top of the dataset distributed across the machines of the cluster, and consists of one driver program and several executors.

The driver program, hosted on the master machine, runs the user’s main function and distributes operations on the cluster by sending several units of work, called tasks, to the executors.

Each executor, hosted on a worker machine, runs tasks in parallel and keeps data in memory or on disk across them.
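
The driver and executor roles can be imitated on a single machine. In the sketch below, threads from Python's standard `concurrent.futures` module stand in for executors, and the `driver` and `task` functions are hypothetical names, so this shows only the division of labour, not Spark's actual scheduling.

```python
from concurrent.futures import ThreadPoolExecutor

def task(partition):
    """Unit of work sent to an executor: count the even numbers in one partition."""
    return sum(1 for x in partition if x % 2 == 0)

def driver(data, num_partitions=4, num_executors=2):
    """Split the data into partitions, send one task per partition, merge the results."""
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_counts = list(executors.map(task, partitions))
    return sum(partial_counts)

even_count = driver(list(range(100)))
print(even_count)
```

There are more tasks than executors here, so each executor processes several tasks in turn, as is typical when a dataset has many partitions.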
Driver and Executors

Image extracted from: https://spark.apache.org/docs/latest/cluster-overview.html

SPARK RDD
• The main abstraction provided by Spark is the resilient distributed dataset (RDD)

• An RDD is a fault-tolerant collection of elements, partitioned across the machines of the cluster, that can be processed in parallel

• These collections are resilient because they can be rebuilt if a portion of the dataset is lost

• Applications developed with the Spark framework are independent of the file system or the database management system used for storing data

• Indeed, there exist connectors for reading data, creating the RDD and writing results back to files or databases

• In recent years, DataFrames and Datasets have been released as abstractions on top of the RDD.
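
How a lost portion of a dataset can be rebuilt is easiest to see with a toy example. The class below is not Spark's RDD API: `ToyRDD` is a hypothetical, single-machine sketch in which every derived dataset remembers its parent and the function that produced it, so a lost partition can be recomputed from that lineage.

```python
class ToyRDD:
    """Toy partitioned collection that remembers how each partition was derived."""
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions   # list of lists; None marks a lost partition
        self.parent = parent           # lineage: the dataset this one was derived from
        self.fn = fn                   # lineage: the per-element function that was applied

    def map(self, fn):
        return ToyRDD([[fn(x) for x in p] for p in self.partitions], parent=self, fn=fn)

    def recover(self, index):
        """Rebuild one lost partition by re-applying fn to the parent's partition."""
        if self.parent is None:
            raise RuntimeError("source data lost; nothing to recompute from")
        if self.parent.partitions[index] is None:
            self.parent.recover(index)
        self.partitions[index] = [self.fn(x) for x in self.parent.partitions[index]]

    def collect(self):
        for i, p in enumerate(self.partitions):
            if p is None:
                self.recover(i)
        return [x for p in self.partitions for x in p]

base = ToyRDD([[1, 2], [3, 4], [5, 6]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None          # simulate losing one partition
print(doubled.collect())
```

Only the lost partition is recomputed; the other partitions are used as they are.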
Spark Ecosystem

Image extracted from: https://www.zdnet.com/article/the-future-of-the-future-spark-big-data-insights-streaming-and-deep-learning-in-the-cloud/
Hadoop vs SPARK
[Figure: iteration time (s) versus number of machines (25, 50, 100) for Hadoop, HadoopBinMem and Spark. Bar labels as extracted, in seconds: 274, 197, 157, 143, 121, 106, 87, 61, 33.]

Slide callouts: up to 10× faster on disk and 100× in memory; 2-5× less code.

More details can be found in: Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction
for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design
and Implementation. USENIX Association, 2012.
Open source solutions for Big Data
A distributed algorithm for large-scale graph
partitioning

[Figure panels: Shoddy Partitioning; Optimal Partitioning]

https://journalofbigdata.springeropen.com/articles/10.1186/s40537-020-00357-y
Big Data architecture for
intelligent maintenance

https://link.springer.com/article/10.1186/s40537-020-00340-7
Evaluation of distributed stream processing
frameworks for IoT applications in Smart Cities

https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0215-2
Suggested Readings
Chapter 2 of the book “Guy Harrison, Next
Generation Databases, Apress, 2015”

https://hadoop.apache.org/

https://spark.apache.org/
