
Chapter 3: Data Science Engineering

01  Introduction to Parallel Computing
02  Hadoop Ecosystem and MapReduce
03  Data-Driven Application Design and Deployment
Introduction to Parallel Computing
• Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
• The problem is run using multiple CPUs.
• The problem is broken into discrete parts that can be solved concurrently.
• Each part is further broken down into a series of instructions.
• Instructions from each part execute simultaneously on different CPUs.
• Languages that support parallel computing include C/C++, Fortran, MATLAB, Python, R, Perl, Julia and others; a minimal Python sketch follows below.
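A minimal sketch of these ideas in Python (illustrative, not from the slides): the problem is broken into discrete parts, and the parts are solved concurrently on multiple CPU cores using the standard multiprocessing module.

# Split a sum-of-squares problem into discrete parts and solve them concurrently.
from multiprocessing import Pool

def part_sum(chunk):
    # each "part" is a series of instructions executed on its own CPU core
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]     # break the problem into 4 parts
    with Pool(processes=4) as pool:
        partials = pool.map(part_sum, chunks)   # parts execute simultaneously
    print("sum of squares:", sum(partials))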
Why Use Parallel Computing
• Save time and/or money: Using more compute resources shortens a task's time to completion, with potential cost savings. Parallel clusters can be built from cheap, commodity components.
• Solve larger problems: Many problems are so large and/or complex
that it is impractical or impossible to solve them on a single computer,
especially given limited computer memory.
• Provide concurrency: A single compute resource can only do one
thing at a time. Multiple computing resources can be doing many
things simultaneously.
• Use of non-local resources: Using compute resources on a wide area
network, or even the Internet when local compute resources are scarce.
Parallel Computer Memory Architectures:

1. Shared Memory:
• Multiple processors can operate independently, but share the same
memory resources
• Changes in a memory location caused by one CPU are visible to all
processors
Continued….

• Advantages:
• Global address space provides a user-friendly programming
perspective to memory
• Fast and uniform data sharing due to proximity of memory to CPUs
• Disadvantages:
• Lack of scalability between memory and CPUs: adding more CPUs increases traffic on the shared memory-CPU path
• The programmer is responsible for "correct", properly synchronized access to global memory (a minimal sketch follows below)
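A minimal shared-memory sketch in Python (illustrative, not from the slides): several processes update one counter that lives in shared memory, and the explicit lock shows the programmer's responsibility for "correct" access to the shared location.

from multiprocessing import Process, Value, Lock

def deposit(counter, lock, times):
    for _ in range(times):
        with lock:                  # without the lock, concurrent updates can be lost
            counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)         # an integer placed in shared memory
    lock = Lock()
    workers = [Process(target=deposit, args=(counter, lock, 10_000)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)            # 40000: changes made by each CPU are visible to all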
Continued ….
2. Distributed Memory:
• Requires a communication network to connect inter-processor memory
• Processors have their own local memory; changes made by one CPU have no effect on the memory of other processors
• Explicit communication is required to exchange data among processors
Continued ….
• Advantages:
• Memory is scalable with the number of CPUs
• Each CPU can rapidly access its own memory without the overhead incurred in maintaining global cache coherency
• Disadvantages:
• The programmer is responsible for many of the details associated with data communication between processors (see the message-passing sketch below)
• Existing data structures that assume a global memory are usually difficult to map onto this memory organization
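A minimal message-passing sketch in Python (illustrative, not from the slides): each worker process holds only its own local data, and results are exchanged through an explicit queue, which stands in for the communication network of a distributed-memory system.

from multiprocessing import Process, Queue

def worker(rank, local_data, out_queue):
    local_result = sum(local_data)           # computed from local memory only
    out_queue.put((rank, local_result))      # explicit communication step

if __name__ == "__main__":
    q = Queue()
    data = list(range(100))
    procs = [Process(target=worker, args=(r, data[r::4], q)) for r in range(4)]
    for p in procs:
        p.start()
    results = [q.get() for _ in procs]       # gather the partial results
    for p in procs:
        p.join()
    print("total:", sum(partial for _, partial in results))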
Continued …..
3. Hybrid Distributed-Shared Memory:
• The largest and fastest computers in the world today employ both
shared and distributed memory architectures.
• The shared memory component can be a shared memory machine and/or a GPU
• Processors on a compute node share the same memory space
• Communication is required to exchange data between compute nodes
Continued.........
• Advantages and Disadvantages:
• Whatever is common to both shared and distributed memory architectures
• Increased scalability is an important advantage
• Increased programming complexity is a major disadvantage (a hybrid sketch follows below)
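A rough sketch of the hybrid pattern (an illustration with assumptions: it uses the third-party mpi4py package and an MPI runtime, neither of which is mentioned in the slides). Message passing connects the compute nodes, while threads inside each process share that node's memory.

# Run with something like: mpiexec -n 4 python hybrid.py
from mpi4py import MPI
from concurrent.futures import ThreadPoolExecutor

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# distributed-memory part: each rank owns only its local slice of the data
local_data = list(range(rank * 1000, (rank + 1) * 1000))

def square(x):
    return x * x

# shared-memory part: threads on the node work on the local slice together
with ThreadPoolExecutor(max_workers=4) as pool:
    local_result = sum(pool.map(square, local_data))

# inter-node communication: combine the partial results on rank 0
total = comm.reduce(local_result, op=MPI.SUM, root=0)
if rank == 0:
    print("global sum of squares:", total)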
Hadoop Ecosystem
• Hadoop is an open source framework that is meant for storage and
processing of big data in a distributed manner.
• It is one of the most widely used solutions for handling big data challenges.
Some important features of Hadoop are
• Open Source – Hadoop is an open source framework, which means it is available free of cost, and users may modify the source code to suit their requirements.
• Distributed Processing – Hadoop supports distributed processing of data, which enables faster processing: data in HDFS is stored in a distributed manner, and MapReduce is responsible for processing that data in parallel.
Continued …..

• Fault Tolerance – Hadoop is highly fault-tolerant. By default it creates three replicas of each block, placed on different nodes.
• Reliability – Hadoop stores data on the cluster in a reliable manner that is independent of any single machine, so data stored in a Hadoop environment is not affected by the failure of a machine.
• Scalability – Hadoop runs on commodity hardware, and nodes can easily be added to or removed from the cluster.
• High Availability – Data stored in Hadoop remains accessible even after a hardware failure; in that case, the data can be read from another node.
The core components of Hadoop are
1. HDFS (Hadoop Distributed File System) – HDFS is the basic storage system of Hadoop.
Large data files are stored in HDFS across a cluster of commodity hardware.
It can store data reliably even when hardware fails (a brief interaction sketch follows the list below).
The key aspects of HDFS are:
a. Storage component
b. Distributes data across several nodes
c. Natively redundant.
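The slides describe HDFS at a high level only; as an illustration, the sketch below drives the standard hdfs dfs shell commands from Python to copy a file into HDFS. It assumes a working Hadoop installation with the hdfs command on the PATH, and the directory and file names are hypothetical.

import subprocess

def hdfs(*args):
    # run an HDFS shell command, e.g. hdfs("-ls", "/user/demo")
    return subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/user/demo/input")            # create a directory in HDFS
hdfs("-put", "local_logs.txt", "/user/demo/input")  # copy a local file into HDFS
hdfs("-ls", "/user/demo/input")                     # list it; its blocks are replicated across nodes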
Continued …..
2. MapReduce – MapReduce is the Hadoop layer that is responsible for data processing.
Applications written with it process unstructured and structured data stored in HDFS.
It performs parallel processing of high volumes of data by dividing the work into independent tasks.
The processing is done in two phases: Map and Reduce.
Continued …..

The Map phase is the first phase of processing; it specifies the complex logic code.
The Reduce phase is the second phase of processing; it specifies lightweight operations, such as aggregating the Map output.
The key aspects of MapReduce are:
a. Computational framework
b. Splits a task across multiple nodes
c. Processes data in parallel (a word-count sketch follows below)
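As an illustration of the two phases, below is the classic word-count example written in the Hadoop Streaming style (a sketch, not taken from the slides): the mapper emits a (word, 1) pair for every word, and the reducer sums the counts for each word. The file names mapper.py and reducer.py are just examples.

# mapper.py -- Map phase: read raw text from stdin, emit one (word, 1) pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- Reduce phase: input arrives sorted by word, so counts for the same
# word are adjacent and can be summed with a lightweight running total
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Under Hadoop Streaming, these two scripts would be supplied as the -mapper and -reducer of a streaming job; Hadoop sorts the mapper output by key before it reaches the reducer, which is what allows the reducer to sum adjacent counts.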
Hadoop Architecture
Data Driven Application Design and Deployment

• Data-driven apps have become a major growth engine for the worldwide software market. Analysts predict that smart computing software will become a $48 billion market and have proclaimed that we are in an era of data-driven marketing and sales. From personalized portals to wearable devices, data-driven apps are all around us.
Five Best Practices for Designing Data-Driven Applications

1. Recognize How Data Impacts the Customer Journey

Understanding the customer journey and how to use it to deliver relevant data is a key business differentiator.

2. Focus on the “Last Mile” of Big Data

The last mile of Big Data is where opinions are formed and actions are
taken. For application designers, meeting the last mile challenge
requires understanding self-service use cases and leveraging tools that
turn Big Data into small data that helps people perform specific tasks.
Continued …..

3. Build to Scale (Sources, Formats and Devices)

Even though we are surrounded by data-driven devices, it is a challenge to design compelling data-driven experiences.

4. Listen to the Crowd (Open is Better)

An open community gives customers access to many resources that help ensure projects are delivered on time and with minimal risk.

5. Start Small, Then Think Big

Focus and agility are keys to designing data-driven apps quickly.


End of Chapter
