Data-Intensive Computing Overview

Data-intensive computing
• Data-intensive computing is concerned with the production, manipulation, and analysis of large-scale data, in the range of hundreds of megabytes (MB) to petabytes (PB).
• Data size units (smallest to largest): Byte, Kilobyte, Megabyte, Gigabyte, Terabyte, Petabyte, Exabyte, Zettabyte, Yottabyte.
• Datasets are maintained in repositories, which are infrastructures supporting the storage, retrieval, and indexing of large amounts of information.
• To facilitate classification and search, relevant bits of information, called metadata, are attached to datasets.
Traditional Approach
• A single computer to store and process big data.
• For storage, programmers take the help of their choice of database vendors such as Oracle, IBM, etc.
• Limitation: processing a huge and growing amount of data through a single database becomes a bottleneck.
Big Data and Cloud
• Big Data is a collection of large datasets that cannot be processed using traditional computing techniques.
• It involves many areas of business and technology, for example geospatial data, social network data, search engine (Google) data, etc.
• Clouds are used for Big Data storage on distributed networks.
• Common cloud service providers for Big Data are Amazon, Microsoft, IBM, and many more.
How Big is BIG DATA?
(Source: Map Reduce and the art of “Thinking Parallel” by Shailesh Kumar, Third Leap, Inc.)
Characterizing Big Data: the Five V’s

Here's a detailed and structured explanation of your notes on Data-Intensive Computing, Big Data,
Cloud, and the Five Vs of Big Data:

📌 Data-Intensive Computing:
Data-intensive computing is an approach that focuses on the production, manipulation, and analysis
of large-scale datasets, which range from:

Hundreds of Megabytes (MB) to

Petabytes (PB) and beyond.

🔢 Data Size Units (From Small to Large):


| Unit | Size |
|---|---|
| Byte | 1 character (approx.) |
| Kilobyte (KB) | 1,024 Bytes |
| Megabyte (MB) | 1,024 KB = ~1 million bytes |
| Gigabyte (GB) | 1,024 MB = ~1 billion bytes |
| Terabyte (TB) | 1,024 GB = ~1 trillion bytes |
| Petabyte (PB) | 1,024 TB = ~1 quadrillion bytes |
| Exabyte (EB) | 1,024 PB |
| Zettabyte (ZB) | 1,024 EB |
| Yottabyte (YB) | 1,024 ZB |

📚 Repositories and Metadata:


Repositories are infrastructures (often databases or storage systems) used to:

Store

Retrieve
Index large-scale data efficiently.

Metadata:

Data about data.


Helps in classification, searching, and retrieval.

Example: A photograph may have metadata such as date taken, camera model, location, etc.

🖥️ Traditional Approach to Big Data:


One central computer system used for:

Storing large datasets.

Processing data using traditional databases (e.g., Oracle, IBM DB2).


Challenges:

Scalability issues: Single system can't handle exponential data growth.

Bottlenecks in processing speed.

High cost of upgrading infrastructure.

☁️ Big Data and Cloud Computing:


🔍 What is Big Data?
Refers to huge volumes of data that are too complex to be handled by traditional systems.

Includes:

Structured data (tables, databases)

Semi-structured data (XML, JSON)


Unstructured data (images, videos, social media)

🧠 Areas Involved:
Geospatial data (e.g., maps, GPS)

Social network data (e.g., Facebook, Twitter)


Search engine data (e.g., Google logs)

☁️ Role of Cloud:
Cloud providers offer distributed storage and processing power.

Key Providers:
Amazon Web Services (AWS)

Microsoft Azure

IBM Cloud

Google Cloud Platform (GCP)

🧮 MapReduce and Parallel Thinking:


MapReduce is a programming model for processing large datasets with parallel, distributed
algorithms across a cluster.

Concept: Divide and conquer


Map: Breaks tasks into smaller chunks.
Reduce: Aggregates results from the chunks.

Book Reference: "Map Reduce and the art of Thinking Parallel" by Shailesh Kumar.

🔠 Five V's of Big Data (Key Characteristics):


| V | Meaning | Description |
|---|---|---|
| Volume | Scale of data | Data is generated in huge amounts. Example: Facebook generates ~4 petabytes of data per day. |
| Velocity | Speed of data | Data flows in real time from sensors, devices, social media. |
| Variety | Different forms | Data comes in many formats – text, images, videos, logs, etc. |
| Veracity | Quality/trustworthiness | Data can be messy or uncertain; it needs cleaning and validation. |
| Value | Meaningful insights | Extracting useful business or scientific insights from raw data. |

✅ Summary:
Data-intensive computing is vital in the modern data-driven world.

Traditional systems can't handle Big Data efficiently.


Cloud platforms provide scalable solutions.
Understanding the Five Vs helps organizations manage and extract value from Big Data.

Technologies like MapReduce and distributed computing make processing large datasets
practical.


Hadoop History
• Two main Problem with Big Data
• Storage
• Processing Power
• 2002 (Nutch Project)
• Doug Cutting and Mike Cafarella.
• Building a Search Engine to Index 1
billion Pages
• Cost: a million dollars in Hardware +
monthly running cost (around $30000)
• Started Looking for feasible solutions
• 2003: White Paper (Half Solution)
• a paper on Google Distributed File System
(Storage)
• 2004: Another Solution
• Map Reduce: Processing Large Datasets
• 2005
• Apache Nutch is limited to 20-40 nodes per
cluster.
• 2006
• Doug Cutting joined Yahoo with Nutch
Project
• 2007
• Successfully tested Hadoop on 1000 Nodes
How Does Hadoop Work?

• It is quite expensive to build bigger servers with heavy configurations.
• As an alternative, we can tie together many commodity single-CPU computers into a single functional distributed system.
• The clustered machines can read the dataset in parallel and provide much higher throughput.
• Moreover, it is cheaper than one high-end server.
Hadoop Components
• Distributed File System: the storage layer; handles file management and replication.
• MapReduce programming framework: the processing/computation layer; runs jobs submitted by users and manages work distribution and fault tolerance.
Hadoop Architecture
• MapReduce – distributed processing
• HDFS – distributed storage
• YARN – Yet Another Resource Negotiator
• Common Utilities – Java libraries & utilities

Here is a detailed and easy-to-understand explanation of the History of Hadoop, how it works, and
its architecture and components:

📜 Hadoop History
🌐 The Big Problem in Early 2000s
Big Data faced two major issues:

1. Storage: How to store huge datasets efficiently?


2. Processing Power: How to process such large datasets quickly?

📆 Timeline of Hadoop's Development


2002 – Nutch Project Begins

Creators: Doug Cutting and Mike Cafarella.

Goal: Build an open-source web search engine that could index 1 billion web pages.
Problem: Estimated cost was $1 million in hardware + $30,000/month for running.

2003 – Google GFS White Paper Released

A research paper by Google introduced the Google File System (GFS).

Solved the storage problem by distributing files across many machines.

2004 – MapReduce Introduced by Google

Another Google paper introduced MapReduce, a model for processing large datasets efficiently.

Solved the processing power problem using parallelism.

2005 – Apache Nutch Limitations

Apache Nutch (built by Doug & Mike) could only scale to 20–40 nodes (limited scalability).

2006 – Yahoo Joins the Game

Doug Cutting joined Yahoo, bringing the Nutch project.

Yahoo invested in expanding and scaling the framework.

2007 – Hadoop is Born

Yahoo successfully tested Hadoop on 1000 nodes, proving it could scale efficiently.
Hadoop became a top-level Apache project soon after.

⚙️ How Does Hadoop Work?


💡 Core Idea:
Instead of building one expensive server with high processing power, use many cheap computers
(nodes) connected over a network to form a cluster.

These commodity machines:

Work in parallel.

Share storage and computing tasks.


Can handle failures gracefully.

🔁 Benefits:
Cost-effective: Cheaper than a supercomputer.

Scalable: More machines = more power.


Fault-tolerant: If one node fails, others take over.

High throughput: Many machines read/process data at once.

🧩 Hadoop Components
Hadoop consists of two core components:

1. 🔐 HDFS (Hadoop Distributed File System)
Storage layer.
Breaks files into blocks (default 128 MB).

Distributes these blocks across multiple machines.

Replication: Each block is copied to 2–3 machines to ensure data safety.

Handles:
File Storage

Data Access

Fault Tolerance

2. ⚙️ MapReduce
Processing layer.

A programming model for parallel data processing.

Consists of:
Map function: Filters and sorts data.

Reduce function: Aggregates results.

Automatically handles:

Task distribution
Job scheduling

Recovery from failure

🏛️ Hadoop Architecture Overview


🧱 1. HDFS (Storage Layer)
Distributed storage across multiple machines.

Stores data reliably, even in case of hardware failure.

⚙️ 2. MapReduce (Processing Layer)


Parallel computation model.
Processes large-scale data using Map and Reduce tasks.

🧠 3. YARN (Yet Another Resource Negotiator)


Introduced in Hadoop 2.x.

Manages:

Resource allocation
Job scheduling across the cluster.
Separates resource management from data processing, improving scalability.

🧰 4. Common Utilities
A set of Java libraries and utilities that support the other Hadoop modules.
Provides tools for:

Data serialization

Configuration
I/O operations

🔄 Summary Table
| Aspect | Details |
|---|---|
| Founders | Doug Cutting & Mike Cafarella |
| Goal | Open-source web search engine (Nutch) |
| Challenge | Huge cost of storage & processing |
| Solution | Google GFS (2003) + MapReduce (2004) |
| Apache Hadoop | Born in 2006–07 with Yahoo support |
| Key Components | HDFS, MapReduce, YARN, Common Utilities |
| Key Benefit | Cost-effective distributed computing |


Map Reduce Framework


• A software framework for distributed processing of large data sets.
• The framework takes care of scheduling tasks, monitoring them and
re-executing any failed tasks.
• It splits the input dataset into independent chunks that are
processed in a completely parallel manner.
• MapReduce framework sorts the outputs of the maps, which are
then input to the reduce tasks.
Map Reduce Framework
• MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
• The programmer defines two functions, map and reduce:
  – Map(k1, v1) → list(k2, v2): takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs.
  – Reduce(k2, list(v2)) → list(v3): executed once for each unique key k2 in sorted order; iterates through the values associated with that key and produces zero or more outputs.
• The system “shuffles” data between map and reduce (so the reduce function has the whole set of data for its given keys) and automatically handles system failures, etc.
Map Reduce Architecture
JobTracker
• JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop.
• JobTracker performs the following actions in Hadoop:
  ✓ Accepts MapReduce jobs from client applications
  ✓ Talks to the NameNode to determine data location
  ✓ Locates an available TaskTracker node
  ✓ Submits the work to the chosen TaskTracker node
TaskTracker
• A TaskTracker node accepts map, reduce, or shuffle operations from a JobTracker.
• It is configured with a set of slots; these indicate the number of tasks that it can accept.
• The JobTracker looks for a free slot to assign a job.
• The TaskTracker notifies the JobTracker about job success status.
• The TaskTracker also sends heartbeat signals to the JobTracker to ensure its availability; it also reports the number of free slots available.
Hadoop Distributed File System
• Files are split into 128 MB blocks.
• Blocks are replicated across several datanodes (often 3 copies).
• The namenode stores metadata (file names, block locations, etc.).
[Figure: File1 is split into blocks 1–4; each block is stored on multiple datanodes so that every block has several replicas.]
HDFS Architecture
• HDFS follows the master-slave architecture and has the following elements:
  • Namenode
  • Datanode
  • Block

Here is a comprehensive and clear explanation of the MapReduce framework, its architecture, and
how it integrates with the Hadoop Distributed File System (HDFS):

🧮 MapReduce Framework (Overview)


🔹 What is MapReduce?
A software framework for distributed processing of large datasets.

Executes on a Hadoop cluster using parallel and distributed algorithms.

🔧 What it does:
Splits large input data into independent chunks.

Each chunk is processed in parallel by different Map tasks.


The output of the Map tasks is sorted and passed to Reduce tasks.

Framework handles:

Task scheduling
Monitoring

Re-execution of failed tasks

🧠 MapReduce Programming Model


🔹 Two core functions defined by the programmer:
| Function | Role | Input | Output |
|---|---|---|---|
| `map(k1, v1)` | Processes input data | (key1, value1) | list of (key2, value2) |
| `reduce(k2, list(v2))` | Aggregates intermediate results | (key2, list of values) | list of values (v3) |

🔄 Example:
Suppose we are counting word occurrences in documents.

Map function: emits each word as `(word, 1)`

Reduce function: sums up the `1`s for each word to get total count

```python
# Simplified, conceptual example
map("line1", "hadoop map reduce hadoop") → [("hadoop", 1), ("map", 1), ("reduce", 1), ("hadoop", 1)]
reduce("hadoop", [1, 1]) → [("hadoop", 2)]
```

📦 Shuffle and Sort Phase:


The framework automatically shuffles and sorts intermediate data:

Groups all values by the same key (`k2`) before sending to reducer.

Example: all values for "hadoop" go to one reducer.
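
To make the shuffle concrete, here is a minimal illustrative Python sketch (not part of the original notes) that groups intermediate (key, value) pairs by key, the way the framework does before calling the reducer:

```python
from collections import defaultdict

# Intermediate pairs emitted by the map phase
mapped = [("hadoop", 1), ("map", 1), ("reduce", 1), ("hadoop", 1)]

# Shuffle/sort: group all values by key, then visit keys in sorted order
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

for key in sorted(groups):           # e.g. "hadoop" -> [1, 1]
    print(key, sum(groups[key]))     # reducer output: ("hadoop", 2), ...
```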

🏗️ MapReduce Architecture
🔸 Components:
✅ JobTracker (Master)
Accepts and manages MapReduce jobs from clients.

Responsibilities:

Communicates with the NameNode (for data location)


Identifies free TaskTrackers (worker nodes)

Assigns Map and Reduce tasks

Tracks job progress, reassigns failed tasks


Sends jobs to appropriate TaskTracker nodes

✅ TaskTracker (Slave)
Located on each worker node.

Executes Map or Reduce tasks as instructed by JobTracker.

Configured with a number of task slots.


Reports:

Heartbeat to JobTracker (ensures it’s alive)

Available slots
Task success/failure

📂 HDFS (Hadoop Distributed File System)


🔹 Purpose:
Acts as the storage system for the input/output of MapReduce jobs.

Highly fault-tolerant, scalable, and optimized for large files.

📦 File Handling in HDFS:


Files are split into 128 MB blocks (default).
Each block is replicated (typically 3 copies) across different nodes.
Provides data locality (process data where it is stored).
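
As a quick illustration of the arithmetic (a sketch, assuming the 128 MB default block size and replication factor of 3 mentioned above, and a hypothetical 1 GB file):

```python
import math

file_size_mb = 1024          # a hypothetical 1 GB file
block_size_mb = 128          # HDFS default block size
replication_factor = 3       # default number of copies

blocks = math.ceil(file_size_mb / block_size_mb)    # 8 blocks
stored_copies = blocks * replication_factor         # 24 block replicas across the cluster
print(blocks, stored_copies)                         # 8 24
```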

🧱 HDFS Architecture
🔸 Follows a Master-Slave Architecture:
✅ NameNode (Master)
Stores metadata (no actual data):

File names, block IDs, block locations


Permissions
Manages the filesystem namespace and directory structure

Single point of control for the HDFS cluster

✅ DataNode (Slave)
Stores actual blocks of the files.
Sends heartbeat to NameNode.
Handles read/write requests from clients.

📊 Example Block Allocation Diagram (simplified)


```
File1 split into blocks: B1, B2, B3, B4

DataNode1: B1, B2
DataNode2: B2, B3
DataNode3: B1, B4
DataNode4: B3, B4
```

Each block is replicated on multiple DataNodes to prevent data loss.

When a node fails, the data can still be accessed from the replicas.

🔄 How It All Works Together


1. Client submits a job to JobTracker.
2. JobTracker:

Gets data locations from NameNode


Assigns map/reduce tasks to TaskTrackers
3. TaskTrackers:

Execute the assigned tasks.
Read data from DataNodes (local if possible).
4. Map tasks emit intermediate key/value pairs.

5. Framework shuffles and sorts data.


6. Reduce tasks process grouped data and generate output.
7. Final result is written back to HDFS.

✅ Summary Table
| Component | Description |
|---|---|
| Map Function | Processes raw data to generate intermediate key-value pairs |
| Reduce Function | Aggregates results by key |
| JobTracker | Master daemon, manages jobs |
| TaskTracker | Worker daemon, executes tasks |
| NameNode | Master of HDFS, manages metadata |
| DataNode | Stores actual data blocks |
| Shuffle Phase | Sorts and groups intermediate output |
| HDFS Block Size | Default 128 MB |
| Replication | Default 3 copies of each block |

NameNode
• The namenode is the commodity hardware that contains
the namenode software.
• The system having the namenode acts as the master server
and it does the following tasks −
• Manages the file-system namespace.
• Regulates client’s access to files.
• It also executes file system operations such as
renaming, closing, and opening files and directories.
NameNode (diagram; see https://www.netjstech.com/2018/02/namenode-datanode-and-secondary-namenode-hdfs-hadoop.html)
DataNode
• Datanode is a commodity hardware having the datanode software.
• For every node (Commodity hardware/System) in a cluster, there will be
a datanode.
• These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per
client request.
• They also perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.
Block
• Generally the user data is stored in the files of HDFS.
• The file in a file system will be divided into one or more
segments and/or stored in individual data nodes.
• These file segments are called as blocks.
• In other words, a block is the minimum amount of data that HDFS can read or write.
• The default block size was 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later); it can be changed in the HDFS configuration as needed.
Roles of Components
Data Replication
• Replica placement: replicating to all machines would have a very high initialization time.
• An approximate solution: keep only 3 replicas.
  • One replica resides on the current node.
  • One replica resides in the current rack.
  • One replica resides in another rack.
Goals of HDFS
§ Fault detection and recovery: since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
§ Huge datasets: HDFS should have hundreds of nodes per cluster to manage applications having huge datasets.
§ Hardware at data: a requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
Map Reduce and the art of “Thinking Parallel” by Shailesh Kumar, Third Leap, Inc.
Example: Word Count
    def mapper(line):
        foreach word in line.split():
            output(word, 1)

    def reducer(key, values):
        output(key, sum(values))
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Word Count Execution
[Figure: dataflow of the word count job. Three input splits ("the quick brown fox", "the fox ate the mouse", "how now brown cow") are processed by separate map tasks, which emit (word, 1) pairs. The shuffle groups pairs by word, and the reduce tasks produce the final counts: ate 1, brown 2, cow 1, fox 2, how 1, mouse 1, now 1, quick 1, the 3.]
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
An Optimization: The Combiner
• Local reduce function for repeated keys produced by
same map
• For associative ops. like sum, count, max
• Decreases amount of intermediate data
• Example: local counting for Word Count:
    def combiner(key, values):
        output(key, sum(values))
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Word Count with Combiner
[Figure: the same word count dataflow, but each map task runs the combiner locally, so e.g. the split "the fox ate the mouse" emits (the, 2) instead of two separate (the, 1) pairs before the shuffle. The reduce output is unchanged: ate 1, brown 2, cow 1, fox 2, how 1, mouse 1, now 1, quick 1, the 3.]
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Word Count in Python with Hadoop Streaming
Mapper.py:
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word.lower() + "\t" + "1")
Reducer.py:
    import sys

    counts = {}
    for line in sys.stdin:
        word, count = line.split("\t")
        counts[word] = counts.get(word, 0) + int(count)

    for word, count in counts.items():
        print(word + "\t" + str(count))
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
MapReduce Execution Details
• Mappers are preferentially scheduled on the same node or the same rack as their input block, to minimize network use and improve performance.
• Mappers save outputs to local disk before serving them to reducers:
  – Allows recovery if a reducer crashes.
  – Allows running more reducers than the number of nodes.
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Fault Tolerance in MapReduce
1. If a task crashes: retry it on another node.
   • OK for a map task because it had no dependencies.
   • OK for a reduce task because the map outputs are on disk.
   • If the same task repeatedly fails, fail the job or ignore that input block.
2. If a node crashes: relaunch its current tasks on other nodes, and relaunch any maps the node previously ran.
   • Necessary because their output files were lost along with the crashed node.
3. If a task is going slowly (a straggler): launch a second copy of the task on another node, take the output of whichever copy finishes first, and kill the other one.
   • Critical for performance in large clusters (there are many possible causes of stragglers).
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Takeaways
• By providing a restricted data-parallel programming model, MapReduce can control job execution in useful ways: automatic division of a job into tasks, placement of computation near data, load balancing, and recovery from failures and stragglers.
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Example: Word Count
    def mapper(line):
        foreach word in line.split():
            output(word, 1)

    def reducer(key, values):
        output(key, sum(values))
• Mapper: input: lines of input text; output: key = word, value = 1.
• Reducer: input: key = word, value = set of counts; output: key = word, value = sum.
• Launching program: defines this job and submits the job to the cluster.
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Word Count Dataflow / Map Reduce Pipeline
[Figure slides; see “Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab.]
Example Task - 1
[Worked-example slides from Map Reduce and the art of “Thinking Parallel” by Shailesh Kumar, Third Leap, Inc.]
2. Search
• Input: (lineNumber, line) records
• Output: lines matching a given pattern
• Map:
    if (line matches pattern):
        output(line)
• Reduce: no reducer (map-only job)
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
3. Sort
• Input: (key, value) records
• Output: same records, sorted by key
• Map: emits each record unchanged; records are range-partitioned by key across reducers.
[Figure: three map tasks emit records (aardvark, elephant; ant, bee, zebra; cow, pig, sheep, yak). The reducer for key range [A-M] outputs aardvark, ant, bee, cow, elephant; the reducer for [N-Z] outputs pig, sheep, yak, zebra.]
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
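A minimal Python sketch (illustrative, not from the slides) of the range partitioning shown in the figure, sending keys A-M to one reducer and N-Z to the other:

```python
records = ["aardvark", "elephant", "ant", "bee", "zebra", "cow", "pig", "sheep", "yak"]

# Partition function: keys starting with a-m go to reducer 0, n-z to reducer 1
def partition(key):
    return 0 if key[0].lower() <= "m" else 1

partitions = {0: [], 1: []}
for key in records:                  # the "map" emits each record unchanged
    partitions[partition(key)].append(key)

for reducer_id, keys in partitions.items():
    print(reducer_id, sorted(keys))  # each reducer outputs its keys in sorted order
```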
4. Inverted Index
• Input: (filename, text) records
• Output: list of files containing each word
• Map:
    foreach word in text.split():
        output(word, filename)
• Combine: uniquify filenames for each word
• Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Inverted Index Example
[Figure: hamlet.txt contains "to be or not to be" and 12th.txt contains "be not afraid of greatness". The map phase emits (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt); the reduce phase groups the filenames for each word.]
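A small, self-contained Python simulation of this inverted-index job (an illustration, not part of the original slides), run on the two example files above:

```python
docs = {
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
}

# Map: emit (word, filename) pairs
pairs = [(word, name) for name, text in docs.items() for word in text.split()]

# Shuffle + Reduce: group filenames per word, de-duplicate, and sort
index = {}
for word, name in pairs:
    index.setdefault(word, set()).add(name)

for word in sorted(index):
    print(word, sorted(index[word]))   # e.g. be ['12th.txt', 'hamlet.txt']
```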

Thanks for sharing all the detailed content! Here's a comprehensive explanation of each concept in
your notes, organized for clarity and deeper understanding:

🔷 What is Data-Intensive Computing?


Data-intensive computing deals with producing, manipulating, and analyzing massive datasets
—ranging from hundreds of MBs to petabytes (PB).

Requires high-capacity storage and efficient processing mechanisms.

📏 Data Units (for scale):


1 Byte < 1 KB < 1 MB < 1 GB < 1 TB < 1 PB < 1 EB < 1 ZB < 1 YB

🔷 Big Data & Traditional Limitations


Traditional systems store data using databases like Oracle or IBM DB2.
Problems: Not scalable, single-point bottlenecks in databases, expensive hardware requirements.

🔷 Big Data and Cloud


Big Data refers to large, complex data sets that traditional computing can't handle.

Cloud computing offers scalable storage and processing, using providers like AWS, Azure, IBM
Cloud.
Examples of Big Data sources:
Social Media (e.g., Facebook, Twitter)

Geospatial data (Google Maps)


Search engines (Google Search)

🔷 Hadoop: History & Evolution


2002: Doug Cutting & Mike Cafarella started Nutch (search engine project)

Problem: Cost to index 1 billion pages = ~$1M in hardware


2003: Google published a paper on GFS (Google File System)
2004: Google introduced MapReduce

2005: Nutch supported only 20–40 nodes


2006: Cutting joined Yahoo; began building Hadoop
2007: Hadoop ran successfully on 1,000 nodes

🔷 What is Hadoop?
A framework for distributed storage and processing of big data using clusters of commodity hardware.

✅ Why use Hadoop?


Inexpensive (uses commodity machines)

Scalable
Fault-tolerant
Suitable for massive data processing

🔷 Hadoop Core Components


1. HDFS – Hadoop Distributed File System (storage)

2. MapReduce – Processing engine


3. YARN – Resource management
4. Common Utilities – Java libraries, scripts, etc.

🔷 HDFS Architecture
Master-slave model:
NameNode (Master): Stores metadata (file structure, locations, etc.)
DataNodes (Slaves): Store actual data in blocks

Default block size: 128MB (older default: 64MB)


Data replication: Default = 3 copies for fault tolerance
One local

One in same rack


One in different rack

🔷 MapReduce Framework
🧩 Basic Model:
Map Function:

Processes input key-value pairs
Outputs intermediate key-value pairs
Reduce Function:
Merges values with the same key
Outputs final key-value pairs

📦 Behind the scenes:


Splitting: Input split into chunks

Shuffling: Grouping of map outputs for each key


Sorting: Keys are sorted before reduction

🔷 MapReduce Components
JobTracker:
Manages job scheduling

Assigns tasks to TaskTrackers


Monitors status and failures
TaskTracker:

Executes map/reduce tasks


Sends heartbeat to JobTracker
Reports progress and status

🔷 MapReduce Fault Tolerance


1. Task Crash: Retry task on another node

2. Node Failure: Re-run all tasks from failed node


3. Straggler (slow task): Run a backup; keep the faster one

🔷 Word Count Example


✍️ Pseudocode:
```python
def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
```

🧠 Combiner:
Optimizes MapReduce by doing local aggregation
Reduces network traffic

```python
def combiner(key, values):
    output(key, sum(values))
```

🔷 MapReduce in Python using Hadoop Streaming


Mapper.py:

```python
import sys

for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t" + "1")
```

Reducer.py:

```python
import sys

counts = {}
for line in sys.stdin:
    word, count = line.split("\t")
    counts[word] = counts.get(word, 0) + int(count)

for word, count in counts.items():
    print(word + "\t" + str(count))
```

🔷 Other MapReduce Tasks


1. Search (Map-only job):

Map: If line matches pattern, output it


Reduce: Not needed

2. Sort:

Input records sorted by key


Map: Emits records

Reduce: Groups and outputs sorted keys

3. Inverted Index:

Find which files contain which words

```python
# Map:
for word in text.split():
    output(word, filename)

# Combine/Reduce:
def reduce(word, filenames):
    output(word, sorted(set(filenames)))
```

🔷 Goals of HDFS
Fault tolerance
Handle massive datasets
Data locality for efficient processing

🧠 Takeaways:
Hadoop enables parallel processing of massive data
MapReduce simplifies data processing using map and reduce logic
HDFS efficiently handles storage with replication and fault-tolerance
Ideal for distributed, scalable, and fault-tolerant big data systems


Hive
• Open source project developed by Facebook.
• An SQL like interface to Hadoop.
• Provide data summarization, query and analysis.
• Query execution via MapReduce.
• Also used by Netflix, Cnet, Digg, eHarmony etc.
• HiveQL example:
SELECT customerId, max(total_cost) from hive_purchases GROUP BY
customerId HAVING count(*) > 3;
Hive Architecture
Contd…
There are 4 main components as part of Hive Architecture.
• Hadoop core components (Hdfs, MapReduce): Hive tables internally
stores data in HDFS path and queries are executed as MapReduce
programs.
• Metastore: is a namespace for tables and store details related to table,
column, partition and location.
• Driver: it parses the query and perform semantic analysis on different
query blocks and expressions.
• Hive Clients: are the interfaces through which we can submit queries to the Hive system.
Pig
• Open source project developed by Yahoo
• A scripting platform for processing and analyzing large data sets.
• Apache Pig allows to write complex MapReduce programs using a
simple scripting language.
• Pig Latin language is used which is a data flow language.
• Pig translate Pig Latin script into MapReduce to execute within
Hadoop.
• Pig example:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name,$2;
DUMP X;
Pig Architecture
Components:
•Pig Latin language is used to analyze data.
•Grunt shell is used to execute the program.
•Parser checks the syntax and type of the script.
•Optimizer performs the job of local optimization
such as projection and pushdown.
•Compiler compiles the optimized code into series
of MapReduce jobs.
•Execution Engine produces the final results.
Zookeeper
• It coordinates distributed systems the way a zookeeper coordinates a zoo.
• ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services

Here is a detailed and clear explanation of the content you provided on Hive, Pig, and ZooKeeper—
along with their architecture, purpose, and usage in Hadoop ecosystems:

🐝 Apache Hive
🔹 What is Hive?
Developed by: Facebook
Purpose: Hive is a data warehouse tool built on top of Hadoop that provides an SQL-like
interface (HiveQL) for querying and analyzing large datasets.
How it works: Converts HiveQL queries into MapReduce jobs which run on the Hadoop cluster.

Use Cases: Summarization, data analysis, ad-hoc queries

🔹 Who uses Hive?


Big companies like Netflix, CNET, Digg, eHarmony, and Facebook.

🔹 HiveQL Example:
```sql
SELECT customerId, MAX(total_cost)
FROM hive_purchases
GROUP BY customerId
HAVING COUNT(*) > 3;
```

This query finds the maximum total_cost per customer but only for those customers who made more
than 3 purchases.
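
For readers coming from the Spark side of the stack (covered later in these notes), roughly the same query can also be issued from Python through Spark SQL. This is an illustrative sketch only, and the `hive_purchases` table name is taken from the example above; it assumes a Hive metastore is configured:

```python
from pyspark.sql import SparkSession

# Hive support lets Spark read tables registered in the Hive metastore
spark = SparkSession.builder.appName("HiveExample").enableHiveSupport().getOrCreate()

result = spark.sql("""
    SELECT customerId, MAX(total_cost) AS max_cost
    FROM hive_purchases
    GROUP BY customerId
    HAVING COUNT(*) > 3
""")
result.show()
```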

🔹 Hive Architecture Overview


1. Hadoop Core Components (HDFS & MapReduce)

Hive stores data in HDFS paths.


When you run a Hive query, it is internally translated into MapReduce jobs.

2. Metastore

Stores metadata about:


Tables
Columns
Partitions
HDFS locations

Example: Keeps info like “Table sales has columns: date, revenue, customerID”.

3. Driver

Acts like the controller.


Tasks:
Parses the query

Performs semantic analysis


Manages query execution and monitoring

4. Hive Clients

Interface for users to interact with Hive.


Examples:

Hive CLI
JDBC/ODBC connections
Web UIs

🐷 Apache Pig
🔹 What is Pig?
Developed by: Yahoo
Purpose: Platform for analyzing large datasets using a high-level language called Pig Latin.
How it works: Converts Pig Latin scripts into MapReduce jobs.
Designed for:
ETL (Extract, Transform, Load)

Data cleansing
Simple transformations

🔹 Pig Latin Example:


```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name, $2;
DUMP X;
```

`LOAD`: Reads data from the `student` file


`FOREACH`: Extracts `name` and the third field (`$2` = gpa)
`DUMP`: Outputs the result

🔹 Pig Architecture Overview


1. Pig Latin Language

Describes data flow in a pipeline format.


Easy to use for data analysts who are not Java programmers.

2. Grunt Shell

Interactive shell to write and run Pig Latin scripts.

3. Parser

Parses Pig script


Checks for syntax and type errors

4. Optimizer

Improves performance by:


Pushing filters early
Combining operations
Removing unnecessary steps

5. Compiler

Translates the optimized script into MapReduce jobs

6. Execution Engine

Executes the MapReduce jobs


Sends results back to the Grunt shell or HDFS

🦓 Apache ZooKeeper
🔹 What is ZooKeeper?
A coordination service for distributed systems
Developed to manage:

Configuration
Naming services
Distributed synchronization
Group services (leader election, locking)

🔹 Why is ZooKeeper important?


Distributed applications (like Hadoop, HBase, Kafka) need:

To keep their state consistent


Manage leadership roles
Maintain availability and reliability

🔹 Analogy:
Think of a Zoo with many animals (services). ZooKeeper ensures:

Every animal knows the rules (config)


Animals don't bump into each other (synchronization)
There is always a zookeeper in charge (leader election)
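
As a concrete, hypothetical illustration of what "configuration and coordination" looks like in code, here is a minimal sketch using the third-party kazoo Python client for ZooKeeper (not part of the original notes); the znode path and connection string are invented for the example:

```python
from kazoo.client import KazooClient

# Connect to a (hypothetical) ZooKeeper ensemble
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Store a piece of shared configuration under a znode
zk.ensure_path("/app/config")
zk.create("/app/config/max_workers", b"8")

# Any other process in the cluster can read the same value
data, stat = zk.get("/app/config/max_workers")
print(data.decode())   # "8"

zk.stop()
```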

✅ Summary Table
| Feature | Hive | Pig | ZooKeeper |
|---|---|---|---|
| Developed By | Facebook | Yahoo | Apache Foundation |
| Language | HiveQL (SQL-like) | Pig Latin (data flow scripting) | N/A (API-based service) |
| Goal | Query and analyze structured data | Scripting and data transformations | Coordination of distributed systems |
| Execution | MapReduce | MapReduce | Independent (used by Hadoop, HBase) |
| Common Use | BI, analytics | ETL, data processing | Leader election, config mgmt |


Introduction to Spark
• Apache Spark is an open-source framework designed for
large-scale distributed data processing, on premises in data
centers or in the cloud.
• It works in distributed environment across clusters of
computers.
• It is designed to scale up from single server to thousands of
machines.
History of Spark
• Spark was developed by researchers at UC Berkeley in the year 2009.
• Previously, the research team was working on MapReduce.
• Limitations of Hadoop MapReduce:
  – Hard to manage and administer, with cumbersome operational complexity.
  – MapReduce was inefficient (or intractable) for interactive or iterative computing jobs.
  – Each map/reduce pair's intermediate computed result is written to the local disk for the subsequent stage of its operation.

History of Spark
• MapReduce was inefficient (or intractable) for interactive or iterative computing jobs, which reuse intermediate results across multiple computations in multistage applications.
• For example, a user may run ad-hoc queries on the same subset of data; each query does disk I/O on stable storage, which can dominate application execution time.
• Hadoop MapReduce was efficient for large-scale batch processing applications, but fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries.
• Spark extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.

History of Spark
• The main feature of Apache Spark is its in-memory cluster computing, which increases the processing speed of an application.
• It was initially found to be 10-20x faster than Hadoop (based on a paper published in 2009).
• Spark was first open sourced in the year 2009.
• Spark can create distributed datasets from any file stored in the Hadoop distributed filesystem (HDFS) or other storage systems supported by the Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.).
Iterative Operations on Spark
• Spark stores intermediate results in distributed memory (RAM) instead of stable storage (disk), which makes the system faster.
Interactive Operations on Spark
• If different queries are run repeatedly on the same set of data, that data can be kept in memory for better execution times.
Spark Features
• Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing, thanks to in-memory retention of intermediate results and to query optimization and DAG building.
• Powerful Caching: a simple programming layer provides powerful caching and persistence capabilities.
• Polyglot: Spark provides high-level APIs in Java, Scala, Python, and R.
• Extensibility: unlike Hadoop, Spark focuses on its fast, parallel computation engine rather than on storage. So, we can use Spark to read data stored in multiple sources—Apache Hadoop, Apache Cassandra, Apache HBase, MongoDB, Apache Hive, RDBMSs, and more—and process it all in memory.
• Ease of Use: Spark allows you to write scalable applications in Java, Scala, Python, and R. It also provides a shell in Scala and Python.
• Real-time Stream Processing: Spark is designed to handle real-time data streaming. While MapReduce is built to handle and process data that is already stored in Hadoop clusters, Spark can do both and can also manipulate data in real time via Spark Streaming.
• Deployment: Spark can run independently in cluster mode, and it can also run on Hadoop YARN, Apache Mesos, Kubernetes, and even in the cloud.
Spark Features: Deployment Modes
• Client Mode: the "driver" component of the Spark job runs on the machine from which the job is submitted. The job-submitting machine can be near or very remote to the Spark infrastructure.
• Cluster Mode: the "driver" component of the Spark job does not run on the local machine from which the job is submitted. Here the Spark job launches the "driver" component inside the cluster.
https://techvidvan.com/tutorials/spark-modes-of-deployment/
https://docs.cloudera.com/runtime/7.2.0/running-spark-applications/topics/spark-yarn-deployment-modes.html
Apache Spark Components: Unified Stack
• The Spark ecosystem is composed of various components: Spark SQL, Spark Streaming, MLlib, GraphX, and the Core API component.
• Spark Core
  – Base engine for large-scale distributed data processing.
  – Provides the API to create RDDs. RDDs are a collection of items distributed across many compute nodes that can be manipulated in parallel.
  – Responsible for memory management, fault recovery, scheduling and monitoring jobs on a cluster, and interacting with storage systems.
• Spark SQL
  – Package for working with structured data; supports querying data via SQL.
  – Example: you can read data stored in an RDBMS table or from file formats with structured data and then construct permanent or temporary tables in Spark.
• MLlib (Machine Learning)
  – MLlib stands for Machine Learning Library; it is used to perform machine learning in Apache Spark.
  – Spark comes with a library containing common machine learning (ML) algorithms, called MLlib.
  – MLlib provides fast and distributed implementations of common ML algorithms, statistical analysis, feature extraction, convex optimization, and distributed linear algebra.
  – The Spark MLlib library has extensive documentation describing all the supported utilities.
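
To give a flavour of what using MLlib looks like from Python, here is a minimal illustrative sketch (not from the slides) that fits a logistic regression model with the DataFrame-based pyspark.ml API; the tiny inline dataset is invented for the example:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()

# A made-up two-feature training set: (label, features)
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)),
     (1.0, Vectors.dense(2.0, 1.0)),
     (1.0, Vectors.dense(2.3, 0.1))],
    ["label", "features"],
)

model = LogisticRegression(maxIter=10).fit(train)   # training is distributed by Spark
model.transform(train).select("label", "prediction").show()
```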
• Spark Streaming
  – Component of Spark used to process real-time streaming data.
  – Enables high-throughput and fault-tolerant stream processing of live data streams.
  – Examples of data streams: log files generated by production web servers, or queues of messages containing status updates posted by users of a web service.
• GraphX
  – A library for manipulating graphs (e.g., social network graphs, routes and connection points, or network topology graphs) and performing graph-parallel computations.
Partitioning
• Partitioning allows for efficient parallelism. A distributed scheme of breaking data up into chunks or partitions allows Spark executors to process only data that is close to them, minimizing network bandwidth.
Spark Jobs, Stages, and Tasks
• Spark Jobs: the driver converts your Spark application into one or more Spark jobs. It then transforms each job into a DAG.
• Stages: stages are created based on what operations can be performed serially or in parallel. A Spark job may be divided into a number of stages.
• Spark Tasks: each stage is comprised of Spark tasks (a unit of execution). Each task maps to a single core and works on a single partition of data.
• Example: “an executor with 16 cores can have 16 or more tasks working on 16 or more partitions in parallel, making the execution of Spark’s tasks exceedingly parallel”.
Spark is a distributed data processing engine with its components working collaboratively on a cluster of machines.
Spark Driver
• Every Spark application consists of a driver program that is responsible for launching and managing parallel operations on the Spark cluster.
• For example, if you are using the interactive shell, the shell acts as the driver program.
• It is responsible for instantiating a SparkSession, i.e. a gateway to all the Spark functionalities.
Spark Driver (Other roles)
§ Communicates with the cluster manager.
§ Requests resources (CPU, memory, etc.) from the cluster manager for Spark's executors.
§ Transforms all the Spark operations into DAG computations.
§ Distributes tasks across the Spark executors.
§ Stores the metadata about all the RDDs and their partitions.
Spark Driver (Summary)
§ Runs on the master node of the Spark cluster.
§ Schedules the job execution and negotiates with the cluster manager.
§ Translates the RDDs into the execution graph and splits the graph into multiple stages.
§ Stores the metadata about all the RDDs.
§ Converts a user application into smaller execution units known as tasks; tasks are then executed by the executors.
Spark is a distributed data processing engine with its components working collaboratively on a cluster of machines.
Spark Session
§ Provides a single unified entry point to all of Spark's functionality.
§ Allows you to create JVM runtime parameters, define DataFrames and Datasets, read from data sources, access catalog metadata, and issue Spark SQL queries.
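A minimal sketch (illustrative, not from the slides) of creating a SparkSession in Python and using it as that single entry point:

```python
from pyspark.sql import SparkSession

# The builder pattern configures and creates (or reuses) the session
spark = (SparkSession.builder
         .appName("EntryPointExample")
         .getOrCreate())

# The session is the gateway to DataFrames, SQL, and the catalog
df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```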
Spark is a distributed data processing engine with its components working collaboratively on a cluster of machines.
Cluster Manager
§ The Spark context works with the cluster manager to manage various jobs.
§ Takes care of job execution within the cluster.
Spark Executors
§ Their job is to execute the tasks.
§ They return the results back to the Spark context.
Overall Context
• The Spark context takes the job, breaks the job into tasks, and distributes them to the worker nodes.
• These tasks work on the partitioned RDD, perform operations, collect the results, and return them to the main Spark context.
Step 1
• The client submits Spark user application code.
• The driver implicitly converts user code that contains transformations and actions into a logical directed acyclic graph (DAG).
Step 2
• The driver performs certain optimizations and then converts the logical DAG into a physical execution plan with a set of stages.
• For the physical execution plan, this stage includes creating physical execution units called tasks under each stage.
Step 3
• The driver interacts with the cluster manager and negotiates the resources.
• The cluster manager launches executors in worker nodes on behalf of the driver.
• The driver sends tasks to the cluster manager based on data placement.
• When executors start, they register themselves with the driver.
Step 4
• Executors start executing the various tasks assigned by the driver program.
• During the course of execution of tasks, the driver program monitors the set of executors that run.
• The driver node also schedules future tasks based on data placement.
Other Important Points:
• The driver program in the Spark architecture also schedules future tasks based on data placement by tracking the location of cached data.
• When the driver program's main() method exits, it terminates all the executors and releases the resources.

Here's a detailed, point-wise explanation of the provided content on Hive, Pig, Zookeeper, and Apache
Spark, structured for clarity:

🐝 Hive (Developed by Facebook)


🔹 Introduction
Hive is an open-source data warehouse system built on top of Hadoop.

It provides an SQL-like interface (HiveQL) for querying and analyzing large datasets stored in
HDFS.
Originally developed by Facebook and used by companies like Netflix, eHarmony, Digg, CNET, etc.
Hive converts queries into MapReduce jobs to be executed on a Hadoop cluster.

🔹 Sample HiveQL
```sql
SELECT customerId, max(total_cost)
FROM hive_purchases
GROUP BY customerId
HAVING count(*) > 3;
```

🔹 Hive Architecture Components


1. Hadoop Core (HDFS + MapReduce)

Data in Hive tables is stored on HDFS.


Queries are executed via MapReduce.
2. Metastore
Stores metadata about tables, columns, partitions, and their locations.
3. Driver

Parses queries, performs semantic analysis, and generates execution plans.


4. Hive Clients
Interfaces like JDBC, ODBC, CLI, Web UI to interact with Hive.

🐷 Pig (Developed by Yahoo)


🔹 Introduction
Apache Pig is a high-level platform for creating MapReduce programs using a scripting
language.
It uses Pig Latin, a procedural language for analyzing large datasets.
Designed to simplify complex data transformations.

🔹 Pig Latin Example


```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name, $2;
DUMP X;
```

🔹 Pig Architecture Components


1. Pig Latin Language – Used to write scripts.
2. Grunt Shell – Interactive shell to execute Pig commands.
3. Parser – Validates syntax and data types.
4. Optimizer – Applies logical optimizations (e.g., projection pushdowns).
5. Compiler – Converts scripts to MapReduce jobs.
6. Execution Engine – Runs the generated jobs and returns results.

🦓 Zookeeper
🔹 Introduction
Apache Zookeeper coordinates and manages distributed systems.
It acts like a centralized service to maintain:

Configuration info
Naming services
Distributed synchronization
Group services

⚡ Apache Spark (Developed by UC Berkeley)


🔹 Introduction
Spark is an open-source cluster computing framework.
Designed for in-memory data processing across thousands of nodes.
Works on-premises or in the cloud.
Created to overcome the limitations of Hadoop MapReduce in iterative and interactive
processing.
🔹 History of Spark
Developed at UC Berkeley in 2009.
Limitations of MapReduce it addressed:
Slow due to repeated disk I/O.

Not ideal for interactive queries and machine learning.


Spark retains intermediate results in memory, making it significantly faster.
Found to be 10–20x faster than Hadoop MapReduce in early benchmarks.

🔹 Iterative & Interactive Operations


Iterative: Stores intermediate results in memory across iterations.
Interactive: Keeps frequently accessed data in memory for faster querying.

🔹 Key Spark Features


| Feature | Description |
|---|---|
| 🚀 Speed | Up to 100x faster than Hadoop MapReduce. |
| 🧠 In-memory | Intermediate results stored in memory. |
| 🗃️ Caching | Powerful data caching for repeated operations. |
| 🔤 Polyglot | APIs in Java, Scala, Python, R. |
| 🔄 Extensibility | Works with HDFS, Cassandra, MongoDB, HBase, etc. |
| 📊 Real-time | Supports streaming data via Spark Streaming. |
| ☁️ Deployment | Works with YARN, Mesos, Kubernetes, or standalone. |

🔹 Deployment Modes
Client Mode: Driver runs on the machine submitting the job.
Cluster Mode: Driver runs inside the cluster for better fault tolerance.

🔹 Spark Ecosystem (Unified Stack)


| Component | Description |
|---|---|
| Spark Core | Base engine for distributed data processing. Manages RDDs, memory, and scheduling. |
| Spark SQL | Module to run SQL queries and work with structured data. |
| MLlib | Machine learning library with classification, regression, clustering, etc. |
| Spark Streaming | Processes real-time data streams (e.g., server logs, Kafka). |
| GraphX | Library for graph processing and analytics (e.g., social networks). |

🔹 Spark Processing Workflow


1. Spark Driver

Main coordinator of the application.


Converts code into DAG (Directed Acyclic Graph).
Communicates with the Cluster Manager.

Requests resources (CPU, memory).


Schedules tasks on Executors.

2. Cluster Manager

Manages job execution.


Allocates resources across the cluster.

Supported managers: YARN, Mesos, Kubernetes.

3. Spark Executors

Run the tasks assigned by the driver.


Report results back to the driver.
Cache data for performance.

4. Spark Session

Entry point for Spark functionality.


Enables reading data, defining DataFrames, running SQL queries.

🔹 Spark Job Execution Stages


Step-by-Step Execution

1. Submit Application
Client submits the job to the Spark cluster.
2. DAG Formation
Driver parses transformations and actions into a logical DAG.
3. DAG to Physical Plan

DAG converted into stages and tasks (units of execution).


4. Resource Allocation
Cluster manager launches executors.
Executors register with the driver.
5. Task Execution
Executors process partitions of data.

Results sent back to the driver.


Driver tracks task completion and data locality.


Resilient Distributed Datasets (RDDs)
• RDDs are the fundamental data structure of Spark (since its inception).
• An RDD is an immutable distributed collection of objects.
• Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
• Resilient: fault-tolerant with the help of the RDD lineage graph (DAG); able to recompute missing or damaged partitions due to node failures.
• Distributed: data resides on multiple nodes.
• Two ways to create RDDs: loading an external dataset, or parallelizing a collection in your driver program (see the sketch below).
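A minimal PySpark sketch (illustrative, not from the slides) showing both ways of creating an RDD; the file path is hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext("local", "RDD Creation Example")

# 1) Parallelize an in-memory collection from the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# 2) Load an external dataset (hypothetical path) as an RDD of lines
lines = sc.textFile("hdfs:///data/input.txt")

print(numbers.count())             # action: 5
print(numbers.getNumPartitions())  # 2
```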
There are three vital characteristics associated with an RDD:
• Dependencies: a list of dependencies that instructs Spark how an RDD is constructed. Beneficial for reproducing results.
• Partitions: the ability to split the work to parallelize computation on partitions across executors.
• Compute function: produces the data for the RDD's partitions.
Spark operations on distributed data can be classified into two types: transformations and actions.
Transformations
• Transformations transform a Spark RDD or DataFrame into a new RDD or DataFrame, giving it the property of immutability.
• For example, select() or filter() will not change the original RDD/DataFrame; instead, it will return the transformed result of the operation as a new RDD/DataFrame.
• Transformations are lazily evaluated.
• Transformations can be narrow or wide:
  – Narrow: a single output partition can be computed from a single input partition.
  – Wide: data from other partitions is read in and combined.

Lineage Graph
• Keeps track of the set of dependencies between different RDDs.
• Provides important information to compute each RDD on demand and to recover lost data if part of an RDD is lost.
• Spark uses lazy evaluation to reduce the number of passes it has to take over the data by grouping operations together.
Map
• Takes in a function and applies it to each element in the RDD.
• Returns a new RDD containing the result for each element.
Filter
• Takes in a function and returns an RDD that only has the elements that pass the filter() function.
• Other mathematical set operations are also available.

The second type of Spark operation on distributed data is the action.
Actions
Compute a result based on an RDD and either return it to the driver program or save it to an external storage system.
An action triggers the lazy evaluation of all the recorded transformations.
Example: first() returns the first element of an RDD.
Reduce
Takes a function that operates on two elements of the element type in your RDD and returns a new element of the same type.
Example: reduce with ' + ' sums the elements, and aggregate() can be used to compute the average of an RDD (see the sketch below).
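A sketch of reduce() and of using aggregate() to compute an average (assuming a SparkContext `sc`; the data is arbitrary):

```python
nums = sc.parallelize([1, 2, 3, 4])

total = nums.reduce(lambda a, b: a + b)   # 10: '+' applied pairwise across elements

# aggregate() carries a (sum, count) pair so the average can be derived afterwards
sum_count = nums.aggregate(
    (0, 0),                                       # zero value
    lambda acc, x: (acc[0] + x, acc[1] + 1),      # fold an element into a partition's accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))      # merge accumulators from different partitions
average = sum_count[0] / sum_count[1]             # 2.5
```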

Sometimes, depending on the application, the same RDD needs to be used multiple times.
Spark will recompute the RDD and all of its dependencies each time we call an action on it.
This can be very expensive for iterative algorithms, which access the data multiple times.
To avoid computing an RDD multiple times, Spark allows the data to be persisted.
In case of persistence, the nodes that compute the RDD store their partitions.
Spark UI
A graphical user interface (GUI) that can be used to inspect or monitor Spark applications in their various stages of decomposition.
It is accessible on default port 4040; in local mode, http://localhost:4040.
Provides details on:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Information about the environment
Information about the running executors
All the Spark queries

Spark provides special operations on RDDs containing key/value pairs.
Key/value pairs allow us to act on each key in parallel or to regroup data across the network.
Creating pair RDDs: can be done by running a map function that returns key/value pairs.

Aggregations
reduceByKey runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key.
Tuning the Degree of Parallelism
Most pair-RDD aggregation and grouping operations also accept an optional argument for the number of partitions to use.
from pyspark import SparkContext

# create Spark context with the necessary configuration
sc = SparkContext("local", "PySpark Word Count Example")

# read data from a text file and split each line into words
words = sc.textFile("D:/workspace/spark/input.txt").flatMap(lambda line: line.split(" "))

# count the occurrences of each word
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# save the counts to the output directory
wordCounts.saveAsTextFile("D:/workspace/spark/output/")
Expressivity and Simplicity
For example, consider a very common query where we want to aggregate all the ages for each name, group by name, and then average the ages.
Defining Schema
A schema in Spark defines the column names and associated data types for a DataFrame.
Two ways to define a schema:
Define it programmatically
Use a DDL string
Defining Schema (Example)
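A hedged sketch of both approaches, assuming a SparkSession named `spark` (the column names and sample rows are illustrative, not from the slides):

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# 1. Programmatically
schema_prog = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# 2. Using a DDL string
schema_ddl = "name STRING, age INT"

df = spark.createDataFrame([("Brooke", 20), ("Denny", 31)], schema=schema_prog)
df.printSchema()
```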
Projections and Filters
Example queries on the fire-calls dataset:
How many distinct CallTypes were recorded as the causes of the fire calls?
List the distinct CallTypes recorded as the causes of the fire calls.
Aggregation example: what were the most common types of fire calls?
Find all flights between San Francisco (SFO) and Chicago (ORD) with at least a two-hour delay; the same query can be written in SQL or with the DataFrame API (a sketch of both follows below).
For example, we want to label all US flights, regardless of origin and destination, with an indication of the delays they experienced: Very Long Delays (> 6 hours), Long Delays (2–6 hours), etc.
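A hedged sketch of the SFO-to-ORD query in both styles; the DataFrame `df`, its column names (date, delay in minutes, origin, destination), and the view name are assumptions, not taken from the slides:

```python
from pyspark.sql.functions import col

df.createOrReplaceTempView("us_delay_flights")   # hypothetical view name

# SQL version
spark.sql("""
    SELECT date, delay, origin, destination
    FROM us_delay_flights
    WHERE origin = 'SFO' AND destination = 'ORD' AND delay >= 120
""").show()

# DataFrame API version of the same query
(df.select("date", "delay", "origin", "destination")
   .where((col("origin") == "SFO") &
          (col("destination") == "ORD") &
          (col("delay") >= 120))
   .show())
```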

Spark query optimization proceeds in phases:
Analysis
Resolves unresolved attributes and relations (for example, checking that a column name is valid).
Logical Optimization
Applies standard rule-based optimizations to the logical plan (including constant folding, predicate pushdown, and projection pruning).
Physical Planning
Takes a logical plan and generates one or more physical plans, using physical operators that match the Spark execution engine.
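These plans can be inspected from code with explain(); a small sketch, reusing the hypothetical flights DataFrame from the earlier example:

```python
from pyspark.sql.functions import col

query = (df.select("origin", "destination", "delay")
           .where(col("delay") > 120))

# prints the parsed, analyzed, and optimized logical plans plus the physical plan
query.explain(True)
```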

Below is a cleaned-up, detailed, and structured explanation of the content above, covering RDDs, their characteristics, types of operations, performance optimizations, Spark's UI, and schema-based operations.

🔥 Spark Core Concept: Resilient Distributed Datasets (RDDs)


🧱 What is an RDD?
RDD is the fundamental data structure in Apache Spark.
Represents an immutable, distributed collection of objects across a cluster.
Each RDD is logically partitioned, enabling parallel computation across nodes.

💡 RDD Core Characteristics


| Feature | Description |
| --- | --- |
| Resilient | Fault-tolerant using lineage graphs (DAGs), which allow recomputation of lost data. |
| Distributed | Stored across multiple nodes for scalable parallel processing. |
| Immutable | Once created, cannot be changed. All transformations create new RDDs. |

🏗️ How to Create an RDD
1. From External Datasets
Using data sources like HDFS, local files, HBase, etc.

python

rdd = sc.textFile("data.txt")

2. By Parallelizing a Collection
Converts a local Python collection to an RDD.

python

rdd = sc.parallelize([1, 2, 3, 4, 5])

🔍 RDD Components
1. Dependencies

Track how an RDD was derived from other RDDs.

Useful for recomputing lost partitions and debugging.

2. Partitions

Splits data to parallelize computation.


More partitions = better parallelism.

3. Compute Function

Defines how to compute each partition.

🔄 Spark Operations on RDDs


🔧 1. Transformations (Lazy)
Create new RDDs from existing ones.
Lazy evaluation: transformations are not executed immediately.

Examples:

`map(func)` → Applies `func` to each element.


`filter(func)` → Filters elements that return `True`.

`flatMap(func)` → Like map, but flattens the result.

`distinct()` → Removes duplicates.

Types:

| Type | Description |
| --- | --- |
| Narrow | Each output partition depends on one input partition (e.g., `map`, `filter`). |
| Wide | Requires data shuffling across partitions (e.g., `reduceByKey`, `groupByKey`). |

▶️ 2. Actions (Eager)
Triggers the execution of the transformations.

Examples:

`collect()` → Returns all elements to the driver.


`count()` → Returns number of elements.
`first()` → Returns the first element.
`reduce(func)` → Combines elements using `func`.
`saveAsTextFile(path)` → Saves RDD to external storage.
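A quick sketch of these actions (a SparkContext `sc` is assumed; the numbers are arbitrary):

```python
nums = sc.parallelize([5, 3, 8, 1])

nums.collect()                    # [5, 3, 8, 1] returned to the driver
nums.count()                      # 4
nums.first()                      # 5
nums.reduce(lambda a, b: a + b)   # 17
```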

♻️ Lineage Graph (DAG)


Directed Acyclic Graph showing dependencies between RDDs.
Used by Spark to:
Recompute lost data.
Optimize execution.

Minimize I/O and job stages via lazy evaluation.
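As a small illustration (assuming a SparkContext `sc`), toDebugString() prints an RDD's lineage, including the shuffle boundary introduced by reduceByKey:

```python
rdd = sc.parallelize(range(100))
pairs = rdd.map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)

# The lineage lists the parent RDDs this result depends on
print(pairs.toDebugString().decode())
```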

🧠 Persisting and Caching RDDs


Why?

If an RDD is used multiple times in a program, recomputation is expensive.


Caching/persisting avoids recomputing by storing partitions in memory/disk.

Methods:

`rdd.cache()` – Stores in memory.


`rdd.persist()` – Can choose storage level (memory, disk, both).
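A small sketch of persisting with an explicit storage level (the file path is a placeholder; `sc` is assumed):

```python
from pyspark import StorageLevel

records = sc.textFile("data.txt").map(lambda line: line.split(","))

# cache() would keep the RDD in memory only; persist() lets us choose the level explicitly
records.persist(StorageLevel.MEMORY_AND_DISK)   # spill partitions to disk if memory is full

records.count()   # first action materializes and stores the partitions
records.count()   # later actions reuse the stored partitions instead of recomputing
```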

🖥️ Spark UI
Web interface to monitor jobs, stages, and tasks.
Accessible at http://localhost:4040 (default in local mode).

Key Info:

Job DAGs and execution plans.


RDD size and memory usage.
Executor status.
Environment info.

🧩 Pair RDDs (Key-Value Pairs)


Creation

python

pairRDD = rdd.map(lambda x: (x, 1))

Operations:

`reduceByKey(func)` → Combines values by key using a function.


`groupByKey()` → Groups all values with the same key (less efficient).
`sortByKey()` → Sorts RDD by key.
`join()` → Joins two pair RDDs by key.
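A brief sketch of a few pair-RDD operations (the sample data is made up; `sc` is assumed):

```python
sales = sc.parallelize([("apples", 3), ("pears", 2), ("apples", 5)])
prices = sc.parallelize([("apples", 0.5), ("pears", 0.8)])

totals = sales.reduceByKey(lambda a, b: a + b)   # [("apples", 8), ("pears", 2)]
ordered = totals.sortByKey()                     # sorted by key
joined = totals.join(prices)                     # [("apples", (8, 0.5)), ("pears", (2, 0.8))]

print(joined.collect())
```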

⚙️ Example Program: Word Count (PySpark)


python

from pyspark import SparkContext, SparkConf

# Create Spark context


sc = SparkContext("local", "PySpark Word Count Example")

# Read data and split into words


words = sc.textFile("D:/workspace/spark/input.txt").flatMap(lambda line: line.split(" "))

# Map words to (word, 1) and reduce by key


wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Save results
wordCounts.saveAsTextFile("D:/workspace/spark/output/")

📊 DataFrames and Schema


Defining Schema

A schema defines the structure of a DataFrame: column names and types.

Two ways:
1. Programmatically
2. Using DDL (Data Definition Language)

Example Use Cases:

Project/Filter specific columns.


Aggregate values using `groupBy()`, `agg()`, etc.
Query example: Flights between SFO and ORD with delay > 2 hours.

Labeling Delays:

python

from pyspark.sql.functions import col, when

# assuming the delay column is in minutes
df = df.withColumn("Delay_Label",
    when(col("delay") > 360, "Very Long Delays")
    .when(col("delay") > 120, "Long Delays")
    .otherwise("Short Delays"))

🧠 Spark Query Analysis


Logical Plan

Initial query representation with unresolved attributes.

Optimizations:

Logical Optimization: Constant folding, predicate pushdown, projection pruning.


Physical Plan: Converts logical plan into physical execution strategies.
Spark chooses best plan using a cost-based optimizer.

✅ Summary Table

| Topic | Key Point |
| --- | --- |
| RDD | Immutable, distributed data structure |
| Transformations | Lazy operations producing new RDDs |
| Actions | Trigger execution and return a result |
| Lineage | Graph of dependencies for recovery |
| Persistence | Caches RDDs to avoid recomputation |
| Spark UI | Monitor jobs and performance |
| Pair RDD | Key-value data operations |
| DataFrames | Structured API with schema and SQL support |
| Optimization | Logical → Physical plan with rule-based tuning |


