
Chapter 3: Data Science Engineering

01  Introduction to Parallel Computing
02  Hadoop Ecosystem and MapReduce
03  Data-Driven Application Design and Deployment
Introduction to Parallel Computing
• Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
• The problem is run using multiple CPUs.
• The problem is broken into discrete parts that can be solved concurrently.
• Each part is further broken down into a series of instructions.
• Instructions from each part execute simultaneously on different CPUs.
• Languages that support parallel computing include C/C++, Fortran, MATLAB, Python, R, Perl, Julia and others; a minimal Python sketch follows below.
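A minimal sketch of these ideas in Python (illustrative, not from the slides): the problem is broken into discrete parts, and the parts are solved concurrently on multiple CPU cores using the standard multiprocessing module.

# Split a sum-of-squares problem into discrete parts and solve them concurrently.
from multiprocessing import Pool

def part_sum(chunk):
    # each "part" is a series of instructions executed on its own CPU core
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]     # break the problem into 4 parts
    with Pool(processes=4) as pool:
        partials = pool.map(part_sum, chunks)   # parts execute simultaneously
    print("sum of squares:", sum(partials))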
Why Use Parallel Computing
• Save time and/or money: Using more compute resources shortens a task's time to completion, with potential cost savings. Parallel clusters can be built from cheap, commodity components.
• Solve larger problems: Many problems are so large and/or complex
that it is impractical or impossible to solve them on a single computer,
especially given limited computer memory.
• Provide concurrency: A single compute resource can only do one
thing at a time. Multiple computing resources can be doing many
things simultaneously.
• Use of non-local resources: Using compute resources on a wide area
network, or even the Internet when local compute resources are scarce.
Parallel Computer Memory Architectures:

1. Shared Memory:
• Multiple processors can operate independently, but share the same
memory resources
• Changes in a memory location caused by one CPU are visible to all
processors
Continued….

• Advantages:
• Global address space provides a user-friendly programming
perspective to memory
• Fast and uniform data sharing due to proximity of memory to CPUs
• Disadvantages:
• Lack of scalability between memory and CPUs: adding more CPUs increases traffic on the shared memory-CPU path
• The programmer is responsible for "correct", properly synchronized access to global memory (a minimal sketch follows below)
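A minimal shared-memory sketch in Python (illustrative, not from the slides): several processes update one counter that lives in shared memory, and the explicit lock shows the programmer's responsibility for "correct" access to the shared location.

from multiprocessing import Process, Value, Lock

def deposit(counter, lock, times):
    for _ in range(times):
        with lock:                  # without the lock, concurrent updates can be lost
            counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)         # an integer placed in shared memory
    lock = Lock()
    workers = [Process(target=deposit, args=(counter, lock, 10_000)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)            # 40000: changes made by each CPU are visible to all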
Continued ….
2. Distributed Memory:
• Requires a communication network to connect inter-processor memory
• Processors have their own local memory; changes made by one CPU have no effect on the memory of other processors
• Explicit communication is required to exchange data among processors
Continued ….
• Advantages:
• Memory is scalable with the number of CPUs
• Each CPU can rapidly access its own memory without the overhead incurred in maintaining global cache coherency
• Disadvantages:
• The programmer is responsible for many of the details associated with data communication between processors (see the message-passing sketch below)
• Existing data structures that assume a global memory are usually difficult to map onto this memory organization
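A minimal message-passing sketch in Python (illustrative, not from the slides): each worker process holds only its own local data, and results are exchanged through an explicit queue, which stands in for the communication network of a distributed-memory system.

from multiprocessing import Process, Queue

def worker(rank, local_data, out_queue):
    local_result = sum(local_data)           # computed from local memory only
    out_queue.put((rank, local_result))      # explicit communication step

if __name__ == "__main__":
    q = Queue()
    data = list(range(100))
    procs = [Process(target=worker, args=(r, data[r::4], q)) for r in range(4)]
    for p in procs:
        p.start()
    results = [q.get() for _ in procs]       # gather the partial results
    for p in procs:
        p.join()
    print("total:", sum(partial for _, partial in results))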
Continued …..
3. Hybrid Distributed-Shared Memory:
• The largest and fastest computers in the world today employ both
shared and distributed memory architectures.
• The shared memory component can be a shared memory machine and/or a GPU
• Processors on a compute node share the same memory space
• Communication is required to exchange data between compute nodes
Continued.........
• Advantages and Disadvantages:
• Whatever is common to both shared and distributed memory architectures
• Increased scalability is an important advantage
• Increased programming complexity is a major disadvantage (a hybrid sketch follows below)
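A rough sketch of the hybrid pattern (an illustration with assumptions: it uses the third-party mpi4py package and an MPI runtime, neither of which is mentioned in the slides). Message passing connects the compute nodes, while threads inside each process share that node's memory.

# Run with something like: mpiexec -n 4 python hybrid.py
from mpi4py import MPI
from concurrent.futures import ThreadPoolExecutor

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# distributed-memory part: each rank owns only its local slice of the data
local_data = list(range(rank * 1000, (rank + 1) * 1000))

def square(x):
    return x * x

# shared-memory part: threads on the node work on the local slice together
with ThreadPoolExecutor(max_workers=4) as pool:
    local_result = sum(pool.map(square, local_data))

# inter-node communication: combine the partial results on rank 0
total = comm.reduce(local_result, op=MPI.SUM, root=0)
if rank == 0:
    print("global sum of squares:", total)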
Hadoop Ecosystem
• Hadoop is an open source framework that is meant for storage and
processing of big data in a distributed manner.
• It is one of the most widely used solutions for handling big data challenges.
Some important features of Hadoop are
• Open Source – Hadoop is an open source framework, which means it is available free of cost, and users may modify the source code to suit their requirements.
• Distributed Processing – Hadoop supports distributed processing of data, which enables faster processing: data in HDFS is stored in a distributed manner, and MapReduce is responsible for processing that data in parallel.
Continued …..

• Fault Tolerance – Hadoop is highly fault-tolerant. By default it creates three replicas of each block, placed on different nodes.
• Reliability – Hadoop stores data on the cluster in a reliable manner that is independent of any single machine, so data stored in a Hadoop environment is not affected by the failure of a machine.
• Scalability – Hadoop runs on commodity hardware, and nodes can easily be added to or removed from the cluster.
• High Availability – Data stored in Hadoop remains accessible even after a hardware failure; in that case, the data can be read from another node.
The core components of Hadoop are
1. HDFS (Hadoop Distributed File System) – HDFS is the basic storage system of Hadoop.
Large data files are stored in HDFS across a cluster of commodity hardware.
It can store data reliably even when hardware fails (a brief interaction sketch follows the list below).
The key aspects of HDFS are:
a. Storage component
b. Distributes data across several nodes
c. Natively redundant.
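The slides describe HDFS at a high level only; as an illustration, the sketch below drives the standard hdfs dfs shell commands from Python to copy a file into HDFS. It assumes a working Hadoop installation with the hdfs command on the PATH, and the directory and file names are hypothetical.

import subprocess

def hdfs(*args):
    # run an HDFS shell command, e.g. hdfs("-ls", "/user/demo")
    return subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/user/demo/input")            # create a directory in HDFS
hdfs("-put", "local_logs.txt", "/user/demo/input")  # copy a local file into HDFS
hdfs("-ls", "/user/demo/input")                     # list it; its blocks are replicated across nodes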
Continued …..
2. MapReduce – MapReduce is the Hadoop layer that is responsible for data processing.
Applications written with it process unstructured and structured data stored in HDFS.
It performs parallel processing of high volumes of data by dividing the work into independent tasks.
The processing is done in two phases: Map and Reduce.
Continued …..

The Map phase is the first phase of processing; it specifies the complex logic code.
The Reduce phase is the second phase of processing; it specifies lightweight operations, such as aggregating the Map output.
The key aspects of MapReduce are:
a. Computational framework
b. Splits a task across multiple nodes
c. Processes data in parallel (a word-count sketch follows below)
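As an illustration of the two phases, below is the classic word-count example written in the Hadoop Streaming style (a sketch, not taken from the slides): the mapper emits a (word, 1) pair for every word, and the reducer sums the counts for each word. The file names mapper.py and reducer.py are just examples.

# mapper.py -- Map phase: read raw text from stdin, emit one (word, 1) pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- Reduce phase: input arrives sorted by word, so counts for the same
# word are adjacent and can be summed with a lightweight running total
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Under Hadoop Streaming, these two scripts would be supplied as the -mapper and -reducer of a streaming job; Hadoop sorts the mapper output by key before it reaches the reducer, which is what allows the reducer to sum adjacent counts.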
Hadoop Architecture
Data Driven Application Design and Deployment

• Data-driven apps have become a major growth engine for the worldwide software market. Analysts predict that smart computing software will become a $48 billion market and have proclaimed that we are in an era of data-driven marketing and sales. From personalized portals to wearable devices, data-driven apps are all around us.
Five Best Practices for Designing Data-Driven Applications

1. Recognize How Data Impacts the Customer Journey

Understanding the customer journey and how to use it to deliver relevant data is a key business differentiator.

2. Focus on the “Last Mile” of Big Data

The last mile of Big Data is where opinions are formed and actions are
taken. For application designers, meeting the last mile challenge
requires understanding self-service use cases and leveraging tools that
turn Big Data into small data that helps people perform specific tasks.
Continued …..

3. Build to Scale (Sources, Formats and Devices)

Even though we are surrounded by data-driven devices, it is a challenge to design compelling data-driven experiences.

4. Listen to the Crowd (Open is Better)

An open community gives customers access to many resources that help ensure projects are delivered on time and with minimal risk.

5. Start Small, Then Think Big

Focus and agility are keys to designing data-driven apps quickly.


End of Chapter
