Data-Intensive Computing Overview
Data-intensive computing
Data-intensive computing is concerned with the production, manipulation, and analysis of large-scale data, in the range of hundreds of megabytes (MB) to petabytes (PB).
Data size scale: Byte → Kilobyte (KB) → Megabyte (MB) → Gigabyte (GB) → Terabyte (TB) → Petabyte (PB) → Exabyte (EB) → Zettabyte (ZB) → Yottabyte (YB)
Datasets are maintained in repositories, which are
infrastructures supporting the storage, retrieval, and
indexing of large amounts of information.
To facilitate classification and search, relevant bits of information, called metadata, are attached to datasets.
Traditional Approach
• A single computer is used to store and process big data.
• For storage, programmers rely on a database vendor of their choice, such as Oracle, IBM, etc.
• Limitation: processing huge, ever-growing volumes of data through a single database becomes a bottleneck.
Big Data and Cloud
• Big Data is a collection of large datasets that cannot be processed using
traditional computing techniques.
• It involves many areas of business and technology, for example geospatial data, social network data, search engine data (e.g., Google), etc.
• Clouds are used for Big Data storage on distributed networks.
• Common cloud service providers for Big Data are Amazon, Microsoft, IBM, and many more.
How Big is BIGDATA?
Map Reduce and the art of “Thinking Parallel” by Shailesh Kumar, Third Leap, Inc.
Characterizing Big Data: The Five V's
📌 Data-Intensive Computing:
Repositories store, retrieve, and index large-scale data efficiently.
Metadata:
Example: A photograph may have metadata such as date taken, camera model, location, etc.
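For illustration only, such metadata can be represented as simple key-value pairs; the field names below are hypothetical:

```python
# Hypothetical metadata record attached to one photograph in a dataset
photo_metadata = {
    "date_taken": "2024-05-14",
    "camera_model": "Canon EOS R5",
    "gps_location": (28.6139, 77.2090),  # (latitude, longitude)
    "format": "JPEG",
}
print(photo_metadata["camera_model"])
```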
Scalability issues: a single system can't handle exponential data growth.
🧠 Areas Involved:
Geospatial data (e.g., maps, GPS)
☁️ Role of Cloud:
Cloud providers offer distributed storage and processing power.
Key Providers:
Amazon Web Services (AWS)
Microsoft Azure
IBM Cloud
Book Reference: "Map Reduce and the art of Thinking Parallel" by Shailesh Kumar.
| V | Meaning | Description |
|---|---|---|
| Volume | Scale of data | Massive amounts of data generated from many sources. |
| Velocity | Speed of data | Data flows in real time from sensors, devices, social media. |
| Variety | Different forms | Data comes in many formats – text, images, videos, logs, etc. |
| Veracity | Quality/trustworthiness | Data can be messy or uncertain; needs cleaning and validation. |
| Value | Meaningful insights | Extracting useful business or scientific insights from raw data. |
✅ Summary:
Data-intensive computing is vital in the modern data-driven world.
Technologies like MapReduce and distributed computing make processing large datasets
practical.
Hadoop History
• Two main problems with Big Data:
• Storage
• Processing Power
• 2002 (Nutch Project)
• Doug Cutting and Mike Cafarella.
• Building a Search Engine to Index 1
billion Pages
• Cost: a million dollars in Hardware +
monthly running cost (around $30000)
• Started Looking for feasible solutions
• 2003: White Paper (Half Solution)
• a paper on Google Distributed File System
(Storage)
• 2004: Another Solution
• Map Reduce: Processing Large Datasets
• 2005
• Apache Nutch is limited to 20-40 nodes per
cluster.
• 2006
• Doug Cutting joined Yahoo with Nutch
Project
• 2007
• Successfully tested Hadoop on 1000 Nodes
How Does Hadoop Work?
• It is quite expensive to build bigger servers with heavy
configurations.
• As an alternative, we can tie together many commodity computers, each with a single CPU, into a single functional distributed system.
• The clustered machines can read the dataset in parallel and
provide a much higher throughput.
• Moreover, it is cheaper than one high-end server.
Hadoop Components
Distributed File System
Storage layer
File management
Replication
Map Reduce Programming Framework
Processing/Computation layer
Run jobs submitted by users.
Manages work distribution and Fault
Tolerance
Hadoop Architecture
MapReduce – Distributed processing
HDFS – Distributed storage
YARN – Yet Another Resource Negotiator
Common Utilities – Java libraries & utilities
📜 Hadoop History
🌐 The Big Problem in Early 2000s
Big Data faced two major issues:
Goal: Build an open-source web search engine that could index 1 billion web pages.
Problem: Estimated cost was $1 million in hardware + $30,000/month for running.
2003 (GFS paper): solved the storage problem by distributing files across many machines.
2004 (MapReduce paper): another Google paper introduced MapReduce, a model for processing large datasets efficiently.
2005: Apache Nutch (built by Doug & Mike) could only scale to 20–40 nodes (limited scalability).
2007: Yahoo successfully tested Hadoop on 1000 nodes, proving it could scale efficiently.
Hadoop became a top-level Apache project soon after.
🔁 Benefits:
Machines in the cluster work in parallel.
Cost-effective: cheaper than a supercomputer.
🧩 Hadoop Components
Hadoop consists of the following core components:
1. 🔐 HDFS (Hadoop Distributed File System)
Storage layer.
Breaks files into blocks (default 128 MB).
Handles:
File Storage
Data Access
Fault Tolerance
2. ⚙️ MapReduce
Processing layer.
Consists of:
Map function: filters and sorts data.
Reduce function: aggregates the intermediate results.
Automatically handles:
Task distribution
Job scheduling
System failures
3. 🧵 YARN (Yet Another Resource Negotiator)
Manages:
Resource allocation
Job scheduling across the cluster
Separates resource management from data processing, improving scalability.
🧰 4. Common Utilities
A set of Java libraries and utilities that support the other Hadoop modules.
Provides tools for:
Data serialization
Configuration
I/O operations
MapReduce Architecture
JobTracker
• JobTracker is the daemon service for submitting and
tracking MapReduce jobs in Hadoop.
• JobTracker performs following actions in Hadoop:
• It accepts MapReduce jobs from client applications
• Talks to the NameNode to determine data location
• Locates available TaskTracker nodes
• Submits the work to the chosen TaskTracker node
TaskTracker
• A TaskTracker node accepts map, reduce or shuffle operations from
a JobTracker.
• It is configured with a set of slots; these indicate the number of tasks it can accept.
• The JobTracker looks for a free slot to assign a job.
• The TaskTracker notifies the JobTracker about job success status.
• The TaskTracker also sends heartbeat signals to the JobTracker to confirm its availability and reports the number of free slots it has.
Hadoop Distributed File System
• Files split into 128MB blocks
• Blocks replicated across several
datanodes (often 3)
• Namenode stores metadata
(file names, locations, etc)
[Diagram: a file (File1) is split into blocks 1–4; each block is replicated across several Datanodes, while the Namenode holds the metadata.]
HDFS Architecture
• HDFS follows the master-slave architecture and has the following elements:
• Namenode
• Datanode
• Block
⚙️ The MapReduce Framework
🔧 What it does:
Splits large input data into independent chunks.
Framework handles:
Task scheduling
Monitoring
`map(k1, v1)` → list of (k2, v2): processes input key-value pairs into intermediate key-value pairs.
`reduce(k2, list(v2))` → list of (v3): aggregates all intermediate values that share the same key.
🔄 Example:
Suppose we are counting word occurrences in documents.
Map function: emits (word, 1) for every word in a line.
Reduce function: sums up the `1`s for each word to get the total count.
# Simplified example (pseudocode)
map("line1", "hadoop map reduce hadoop") → [("hadoop", 1), ("map", 1), ("reduce", 1), ("hadoop", 1)]
reduce("hadoop", [1, 1]) → [("hadoop", 2)]
Groups all values by the same key (`k2`) before sending to reducer.
🏗️ MapReduce Architecture
🔸 Components:
✅ JobTracker (Master)
Accepts and manages MapReduce jobs from clients.
Responsibilities:
✅ TaskTracker (Slave)
Located on each worker node.
Reports to the JobTracker:
Available slots
Task success/failure
🧱 HDFS Architecture
🔸 Follows a Master-Slave Architecture:
✅ NameNode (Master)
Stores metadata (no actual data):
✅ DataNode (Slave)
Stores actual blocks of the files.
Sends heartbeat to NameNode.
Handles read/write requests from clients.
Example block placement (replication factor 2):
DataNode1: B1, B2
DataNode2: B2, B3
DataNode3: B1, B4
DataNode4: B3, B4
When a node fails, the data can still be accessed from the replicas.
Worker nodes execute the assigned tasks and read data from DataNodes (local if possible); map tasks emit intermediate key/value pairs.
✅ Summary Table
| Component | Description |
|---|---|
| Map Function | Processes raw data to generate intermediate key-value pairs |
| Reduce Function | Aggregates results by key |
| JobTracker | Master daemon, manages jobs |
NameNode
• The NameNode runs on commodity hardware that contains the NameNode software.
• The system having the namenode acts as the master server
and it does the following tasks −
• Manages the file-system namespace.
• Regulates client’s access to files.
• It also executes file system operations such as
renaming, closing, and opening files and directories.
NameNode
Reference: https://www.netjstech.com/2018/02/namenode-datanode-and-secondary-namenode-hdfs-hadoop.html
DataNode
• A DataNode is commodity hardware running the DataNode software.
• For every node (Commodity hardware/System) in a cluster, there will be
a datanode.
• These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per
client request.
• They also perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.
Block
• Generally the user data is stored in the files of HDFS.
• The file in a file system will be divided into one or more
segments and/or stored in individual data nodes.
• These file segments are called as blocks.
• In other words, the minimum amount of data that HDFS can
read or write is called a Block.
• The default block size is 128 MB in Hadoop 2.x and later (64 MB in Hadoop 1.x); it can be changed as needed in the HDFS configuration.
Roles of Components
Data Replication
• Replication placement
• Replicating to every machine would have a high initialization time.
• An approximate solution: keep only 3 replicas.
• One replica resides on the local node.
• One replica resides on another node in the same rack.
• One replica resides on a node in a different rack.
Goals of HDFS
§ Fault detection and recovery −
§ Since HDFS includes a large number of commodity hardware components, failures are frequent.
§ Therefore HDFS should have mechanisms for quick and automatic fault
detection and recovery.
§ Huge datasets −
§ HDFS should have hundreds of nodes per cluster to manage the applications
having huge datasets.
§ Hardware at data −
§ A requested task can be done efficiently, when the computation takes place
near the data. Especially where huge datasets are involved, it reduces the
network traffic and increases the throughput.
Map Reduce and the art of “Thinking Parallel” by Shailesh Kumar, Third Leap, Inc.
Example: Word Count
def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Word Count Execution
Input lines: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
Map output: each mapper emits (word, 1) pairs, e.g. (the, 1), (quick, 1), (brown, 1), (fox, 1), ...
Shuffle: pairs are grouped by key and routed to reducers.
Reduce output: (ate, 1), (brown, 2), (cow, 1), (fox, 2), (how, 1), (mouse, 1), (now, 1), (quick, 1), (the, 3)
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
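A minimal pure-Python simulation of the map–shuffle–reduce flow shown above (a sketch for intuition, not Hadoop code; `output` from the pseudocode is replaced by returning lists):

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the line
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    # Sum the counts for one word
    return (key, sum(values))

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

# Map phase
intermediate = [pair for line in lines for pair in mapper(line)]

# Shuffle phase: group values by key
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Reduce phase
counts = dict(reducer(k, v) for k, v in sorted(groups.items()))
print(counts)  # {'ate': 1, 'brown': 2, 'cow': 1, 'fox': 2, 'how': 1, 'mouse': 1, 'now': 1, 'quick': 1, 'the': 3}
```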
An Optimization: The Combiner
• Local reduce function for repeated keys produced by
same map
• For associative ops. like sum, count, max
• Decreases amount of intermediate data
• Example: local counting for Word Count:
def combiner(key, values):
    output(key, sum(values))
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Word Count with Combiner
Input lines: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
Map output with combiner: each mapper locally pre-aggregates repeated keys, e.g. the second mapper emits (the, 2), (fox, 1), (ate, 1), (mouse, 1) instead of two separate (the, 1) pairs.
Shuffle: the (already partially combined) pairs are grouped by key.
Reduce output: (ate, 1), (brown, 2), (cow, 1), (fox, 2), (how, 1), (mouse, 1), (now, 1), (quick, 1), (the, 3)
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Word Count in Python with Hadoop Streaming
Mapper.py:

import sys
for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t" + "1")

Reducer.py:

import sys
counts = {}
for line in sys.stdin:
    word, count = line.split("\t")
    counts[word] = counts.get(word, 0) + int(count)
for word, count in counts.items():
    print(word + "\t" + str(count))
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
MapReduce Execution Details
• Mappers are preferentially scheduled on the same node or same rack as their input block
  – Minimizes network use to improve performance
• Mappers save outputs to local disk before serving them to reducers
  – Allows recovery if a reducer crashes
  – Allows running more reducers than the number of nodes
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Fault Tolerance in MapReduce
1. If a task crashes:
   – Retry on another node
     • OK for a map because it had no dependencies
     • OK for a reduce because map outputs are on disk
   – If the same task repeatedly fails, fail the job or ignore that input block
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Fault Tolerance in MapReduce
2. If a node crashes:
   – Relaunch its current tasks on other nodes
   – Relaunch any maps the node previously ran
     • Necessary because their output files were lost along with the crashed node
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Fault Tolerance in MapReduce
3. If a task is going slowly (straggler):
   – Launch a second copy of the task on another node
   – Take the output of whichever copy finishes first, and kill the other one
     • Critical for performance in large clusters (many possible causes of stragglers)
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Takeaways
• By providing a restricted data-parallel programming model, MapReduce can control job execution in useful ways:
  – Automatic division of a job into tasks
  – Placement of computation near data
  – Load balancing
  – Recovery from failures & stragglers
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Example: Word Count
def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
Mapper
Input: value: lines of text of input
Output: key: word, value: 1
Reducer
Input: key: word, value: set of counts
Output: key: word, value: sum
Launching program
Defines this job
Submits job to cluster
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Word Count Dataflow
Map Reduce Pipeline
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Example Task - 1
Map Reduce and the art of “Thinking Parallel” by Shailesh Kumar, Third Leap, Inc.
2. Search
• Input: (lineNumber, line) records
• Output: lines matching a given pattern
• Map:
if(line matches pattern):
output(line)
• Reduce: no reducer (map-only job)
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
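A minimal sketch of this map-only job as a Hadoop Streaming style mapper (there is no reducer, as on the slide; the ERROR pattern is an assumed example):

```python
# grep_mapper.py - emits only the lines that match a given pattern (map-only job, no reducer)
import re
import sys

PATTERN = re.compile(r"ERROR")  # assumed example pattern

for line in sys.stdin:
    if PATTERN.search(line):
        sys.stdout.write(line)  # emit the matching line unchanged
```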
3. Sort
• Input: (key, value) records
• Output: same records, sorted by key
Map: pass each record through unchanged, e.g. inputs "aardvark, elephant", "ant, bee", "zebra", "cow", "pig", "sheep, yak".
Shuffle: a range partitioner sends keys in [A-M] to one reducer and keys in [N-Z] to another.
Reduce [A-M] output: aardvark, ant, bee, cow, elephant
Reduce [N-Z] output: pig, sheep, yak, zebra
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
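A hedged PySpark sketch of the same idea: sortByKey range-partitions the keys, so each output partition covers a key range (the local SparkContext setup is illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local", "Sort Example")
records = sc.parallelize([("zebra", 1), ("ant", 1), ("cow", 1), ("pig", 1), ("bee", 1)])

# sortByKey range-partitions the keys; partitions are in key order and sorted internally
sorted_records = records.sortByKey(numPartitions=2)
print(sorted_records.collect())  # [('ant', 1), ('bee', 1), ('cow', 1), ('pig', 1), ('zebra', 1)]
sc.stop()
```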
4. Inverted Index
• Input: (filename, text) records
• Output: list of files containing each word
• Map:
    for word in text.split():
        output(word, filename)
• Combine: uniquify filenames for each word
• Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))
“Cloud Computing with MapReduce and Hadoop”, Matei Zaharia, UC Berkeley AMP Lab
Inverted Index Example
hamlet.txt: "to be or not to be"
12th.txt: "be not afraid of greatness"
Map output: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt)
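A minimal pure-Python simulation of the inverted-index job for the two files shown above (plain dictionaries stand in for the MapReduce machinery):

```python
from collections import defaultdict

docs = {
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
}

# Map + combine: emit (word, filename) pairs, uniquified per document
pairs = {(word, fname) for fname, text in docs.items() for word in text.split()}

# Shuffle + reduce: collect the sorted list of files containing each word
index = defaultdict(set)
for word, fname in pairs:
    index[word].add(fname)

for word in sorted(index):
    print(word, sorted(index[word]))
# e.g. be ['12th.txt', 'hamlet.txt'], not ['12th.txt', 'hamlet.txt'], to ['hamlet.txt'], ...
```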
Cloud computing offers scalable storage and processing, using providers like AWS, Azure, IBM
Cloud.
Examples of Big Data sources:
Social Media (e.g., Facebook, Twitter)
🔷 What is Hadoop?
A framework for distributed storage and processing of big data using clusters of commodity hardware.
Scalable
Fault-tolerant
Suitable for massive data processing
🔷 HDFS Architecture
Master-slave model:
NameNode (Master): Stores metadata (file structure, locations, etc.)
DataNodes (Slaves): Store actual data in blocks
🔷 MapReduce Framework
🧩 Basic Model:
Map Function:
Processes input key-value pairs
Outputs intermediate key-value pairs
Reduce Function:
Merges values with the same key
Outputs final key-value pairs
🔷 MapReduce Components
JobTracker:
Manages job scheduling
Example map function (pseudocode):

def mapper(line):
    for word in line.split():
        output(word, 1)
🧠 Combiner:
Optimizes MapReduce by doing local aggregation
Reduces network traffic
Mapper.py:

import sys
for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t" + "1")
Reducer.py:
import sys
counts = {}
for line in sys.stdin:
    word, count = line.split("\t")
    counts[word] = counts.get(word, 0) + int(count)
for word, count in counts.items():
    print(word + "\t" + str(count))
2. Sort:
3. Inverted Index:
# Map:
for word in text.split():
    output(word, filename)

# Combine/Reduce:
def reduce(word, filenames):
    output(word, sorted(set(filenames)))
🔷 Goals of HDFS
Fault tolerance
Handle massive datasets
Data locality for efficient processing
🧠 Takeaways:
Hadoop enables parallel processing of massive data
MapReduce simplifies data processing using map and reduce logic
HDFS efficiently handles storage with replication and fault-tolerance
Ideal for distributed, scalable, and fault-tolerant big data systems
Hive
• Open source project developed by Facebook.
• An SQL-like interface to Hadoop.
• Provide data summarization, query and analysis.
• Query execution via MapReduce.
• Also used by Netflix, Cnet, Digg, eHarmony etc.
• HiveQL example:
SELECT customerId, max(total_cost) from hive_purchases GROUP BY
customerId HAVING count(*) > 3;
Hive Architecture
Contd…
There are 4 main components as part of Hive Architecture.
• Hadoop core components (Hdfs, MapReduce): Hive tables internally
stores data in HDFS path and queries are executed as MapReduce
programs.
• Metastore: is a namespace for tables and store details related to table,
column, partition and location.
• Driver: parses the query and performs semantic analysis on the different query blocks and expressions.
• Hive Clients: the interfaces through which we can submit queries to the Hive system.
Pig
• Open source project developed by Yahoo
• A scripting platform for processing and analyzing large data sets.
• Apache Pig allows writing complex MapReduce programs using a simple scripting language.
• Pig Latin language is used which is a data flow language.
• Pig translate Pig Latin script into MapReduce to execute within
Hadoop.
• Pig example:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name,$2;   -- $2 refers to the third field (gpa)
DUMP X;
Pig Architecture
Components:
•Pig Latin language is used to analyze data.
•Grunt shell is used to execute the program.
•Parser checks the syntax and type of the script.
•Optimizer performs the job of local optimization
such as projection and pushdown.
•Compiler compiles the optimized code into series
of MapReduce jobs.
•Execution Engine produces the final results.
Zookeeper
• It coordinates distributed systems, much like a zookeeper manages a zoo.
• ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services
🐝 Apache Hive
🔹 What is Hive?
Developed by: Facebook
Purpose: Hive is a data warehouse tool built on top of Hadoop that provides an SQL-like
interface (HiveQL) for querying and analyzing large datasets.
How it works: Converts HiveQL queries into MapReduce jobs which run on the Hadoop cluster.
🔹 HiveQL Example:
This query finds the maximum total_cost per customer, but only for customers who made more than 3 purchases.
2. Metastore
Example: Keeps info like “Table sales has columns: date, revenue, customerID”.
3. Driver
4. Hive Clients
Hive CLI
JDBC/ODBC connections
Web UIs
🐷 Apache Pig
🔹 What is Pig?
Developed by: Yahoo
Purpose: Platform for analyzing large datasets using a high-level language called Pig Latin.
How it works: Converts Pig Latin scripts into MapReduce jobs.
Designed for:
ETL (Extract, Transform, Load)
Data cleansing
Simple transformations
2. Grunt Shell
3. Parser
4. Optimizer
5. Compiler
6. Execution Engine
🦓 Apache ZooKeeper
🔹 What is ZooKeeper?
A coordination service for distributed systems
Developed to manage:
Configuration
Naming services
Distributed synchronization
Group services (leader election, locking)
🔹 Analogy:
Think of a Zoo with many animals (services). ZooKeeper ensures:
✅ Summary Table

| Feature | Hive | Pig | ZooKeeper |
|---|---|---|---|
| Developed By | Facebook | Yahoo | Apache Foundation |
| Language | HiveQL (SQL-like) | Pig Latin (data flow scripting) | N/A (API-based service) |
| Goal | Query and analyze structured data | Scripting and data transformations | Coordination of distributed systems |
| Execution | MapReduce | MapReduce | Independent (used by Hadoop, HBase) |
| Common Use | BI, analytics | ETL, data processing | Leader election, config mgmt |
Introduction to Spark
• Apache Spark is an open-source framework designed for
large-scale distributed data processing, on premises in data
centers or in the cloud.
• It works in distributed environment across clusters of
computers.
• It is designed to scale up from single server to thousands of
machines.
History of Spark
• Spark was developed by researchers at UC Berkeley in 2009.
• Previously, the research team was working on MapReduce.
• Limitations of Hadoop MapReduce:
  – Hard to manage and administer, with cumbersome operational complexity.
  – MapReduce was inefficient (or intractable) for interactive or iterative computing jobs.
• Each map/reduce pair's intermediate computed result is written to the local disk for the subsequent stage of its operation.
History of Spark
• Map Reduce was inefficient (or intractable) for interactive or iterative
computing jobs.
• The need: reuse intermediate results across multiple computations in multistage applications.
• Users run ad-hoc queries on the same subset of data.
• With MapReduce, each query does disk I/O against stable storage, which can dominate application execution time.
History of Spark
• Limitation of Hadoop MapReduce: it was efficient for large-scale batch processing applications, but fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries.
• Spark extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.
History of Spark
• Main feature of Apache Spark is its in-memory cluster computing that
increases the processing speed of an application.
• Initially found to be 10–20x faster than Hadoop MapReduce (based on a paper published in 2009).
• Spark was first open sourced in 2010.
• Spark can create distributed datasets from any file stored in the
Hadoop distributed filesystem (HDFS) or other storage systems
supported by the Hadoop APIs (including your local filesystem,
Amazon S3, Cassandra, Hive, HBase, etc.)
Iterative Operations on Spark
• Spark stores intermediate results in distributed memory (RAM) instead of stable storage (disk), making the system faster.
Interactive Operations on Spark
• If different queries are run on the same set of data
repeatedly, this particular data can be kept in memory for
better execution times.
Spark Features
Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing, thanks to:
• In-memory retention of intermediate results.
• Query optimization and DAG building.
Powerful Caching: a simple programming layer provides powerful caching and persistence capabilities.
Polyglot: Spark provides high-level APIs in Java, Scala, Python, and R.
Spark Features
Extensibility: unlike Hadoop, Spark focuses on its fast, parallel computation engine rather than on storage. So, we can use Spark to read data stored in multiple sources—Apache Hadoop, Apache Cassandra, Apache HBase, MongoDB, Apache Hive, RDBMSs, and more—and process it all in memory.
Ease of Use: Spark allows you to write scalable applications in Java, Scala, Python, and R. It also provides shells in Scala and Python.
Spark Features
Real-time Stream Processing
• Spark is designed to handle real-time
data streaming.
• While MapReduce is built to handle and process data that is already stored in Hadoop clusters, Spark can do both, and can also manipulate data in real time via Spark Streaming.
Deployment
• Spark can run independently in cluster
mode, and it can also run on Hadoop
YARN, Apache Mesos, Kubernetes, and
even in the cloud.
Spark Features
Deployment
• Client Mode: the “driver” component of the Spark job runs on the machine from which the job is submitted.
  • The job-submitting machine can be near or very remote from the “Spark infrastructure”.
• Cluster Mode: the “driver” component of the Spark job does not run on the local machine from which the job is submitted.
  • Here the Spark job launches the “driver” component inside the cluster.
https://techvidvan.com/tutorials/spark-modes-of-deployment/
https://docs.cloudera.com/runtime/7.2.0/running-spark-applications/topics/spark-yarn-deployment-modes.html
Apache Spark Components: Unified Stack
• Spark ecosystem is composed of various components like Spark SQL, Spark
Streaming, MLlib, GraphX, and the Core API component.
Apache Spark Components: Unified Stack
• Spark Core
  – The base engine for large-scale distributed data processing.
  – Provides the API to create RDDs.
  – RDDs are a collection of items distributed across many compute nodes that can be manipulated in parallel.
  – Responsible for:
    • memory management,
    • fault recovery,
    • scheduling and monitoring jobs on a cluster,
    • and interacting with storage systems.
Apache Spark Components: Unified Stack
• Spark SQL
  – A package for working with structured data.
  – Supports querying data via SQL.
  – Example: you can read data stored in an RDBMS table or from file formats with structured data, and then construct permanent or temporary tables in Spark.
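A short, hedged PySpark sketch of this workflow; the file name and column names are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Read a structured file (header and schema inference are assumptions about the input)
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# Register a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
```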
Apache Spark Components: Unified Stack
• MLlib (Machine Learning)
  – MLlib stands for Machine Learning Library.
  – Used to perform machine learning in Apache Spark.
  – Spark comes with a library containing common machine learning (ML) algorithms, called MLlib.
Apache Spark Components: Unified Stack
• MLlib (Machine Learning)
  – MLlib provides fast and distributed implementations of common ML algorithms, statistical analysis, feature extraction, convex optimization, and distributed linear algebra.
  – The Spark MLlib library has extensive documentation describing all the supported utilities.
Apache Spark Components: Unified Stack
• Spark Streaming
  – The component of Spark used to process real-time streaming data.
  – It enables high-throughput, fault-tolerant stream processing of live data streams.
  – Examples of data streams:
    • log files generated by production web servers,
    • queues of messages containing status updates posted by users of a web service.
Apache Spark Components: Unified Stack
• GraphX
  – A library for manipulating graphs (e.g., social network graphs, routes and connection points, or network topology graphs).
  – Performs graph-parallel computations.
Partitioning allows for efficient parallelism. The distributed scheme of breaking data up into chunks, or partitions, allows Spark executors to process only data that is close to them, minimizing network bandwidth.
Spark Jobs
Driver converts your Spark application into one or more Spark jobs.
It then transforms each job into a DAG.
Stages
Stages are created based on what operations can be performed serially or
in parallel.
A Spark job may be divided into a number of stages.
Spark Tasks
Each stage is comprised of Spark tasks (a unit of execution).
Each task maps to a single core and works on a single partition of
data
Example: “an executor with 16 cores can have 16 or more tasks working on 16 or more partitions in parallel, making the execution of Spark's tasks exceedingly parallel.”
Spark is a distributed data processing engine with its components
working collaboratively on a cluster of machines.
Spark Driver
• Every spark application consists of a
driver program that is responsible for
launching and managing parallel
operations on the Spark cluster.
• For example, if you are using the interactive shell, the shell acts as the driver program.
• Responsible for instantiating a SparkSession, i.e. a gateway to all of Spark's functionality.
Spark is a distributed data processing engine with its components
working collaboratively on a cluster of machines.
Spark Driver (Other roles)
§ communicates with the cluster
manager.
§ requests resources (CPU, memory,
etc.) from the cluster manager for
Spark’s executors.
§ transforms all the Spark operations
into DAG computations.
§ Distributes tasks across the Spark
executors.
§ Driver stores the metadata about all
the RDDs and their partitions.
Spark Driver (Summary)
§ runs on the master node of the spark cluster.
§ schedules the job execution and negotiates with the cluster
manager.
§ translates the RDD’s into the execution graph and splits the graph
into multiple stages.
§ stores the metadata about all the RDDs.
§ converts a user application into smaller execution units known as tasks.
§ Tasks are then executed by the executors.
Spark is a distributed data processing engine with its
components working collaboratively on a cluster of machines.
Spark Session
§ Provides a single unified entry point to all of Spark's functionality.
§ Allows you to create JVM runtime parameters, define DataFrames and Datasets, read from data sources, access catalog metadata, and issue Spark SQL queries.
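A minimal hedged sketch of creating a SparkSession in PySpark and using it as that single entry point (the application name and config value are only examples):

```python
from pyspark.sql import SparkSession

# Build (or reuse) the session; the config value here is illustrative
spark = (SparkSession.builder
         .appName("ExampleApp")
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())

# The session is the gateway to DataFrames, Spark SQL, and catalog metadata
df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
print(spark.catalog.listTables())
```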
Spark is a distributed data processing engine with its components
working collaboratively on a cluster of machines.
Cluster manager
§ Spark context works with
the cluster manager to manage
various jobs.
§ takes care of the job execution
within the cluster.
Spark Executors
§ Their job is to execute the tasks.
§ They return the results back to the Spark Context.
Spark is a distributed data processing engine with its components
working collaboratively on a cluster of machines.
Overall Context
• The Spark Context takes the job, breaks the job into tasks, and distributes them to the worker nodes.
• These tasks work on the partitioned RDD, perform operations, collect the results, and return them to the main Spark Context.
Step 1
• Client submits Spark user application code.
• Driver implicitly converts user code that contains transformations and
actions into a logically directed acyclic graph called DAG.
Step 2
• The Driver performs certain optimizations and then converts the logical DAG into a physical execution plan with a set of stages.
• For the physical execution plan, this step includes creating physical execution units called tasks under each stage.
Step 3
• Driver interacts with the cluster manager and negotiates the resources.
• Cluster manager launches executors in worker nodes on behalf of the
driver.
• Driver sends tasks to the cluster manager based on data placement.
• When executors start, they register themselves with drivers.
Step 4
• Now, the executors start executing the various tasks assigned by the driver program.
• During the course of execution of tasks, the driver program monitors the set of executors that run.
• The driver node also schedules future tasks based on data placement.
Other Important Points:
• The driver program in the Spark architecture also schedules future tasks based on data placement, by tracking the location of cached data.
• When the driver program's main() method exits, it terminates all the executors and releases the resources.
It provides an SQL-like interface (HiveQL) for querying and analyzing large datasets stored in
HDFS.
Originally developed by Facebook and used by companies like Netflix, eHarmony, Digg, CNET, etc.
Hive converts queries into MapReduce jobs to be executed on a Hadoop cluster.
🔹 Sample HiveQL
SELECT customerId, max(total_cost) from hive_purchases GROUP BY customerId HAVING count(*) > 3;
🦓 Zookeeper
🔹 Introduction
Apache Zookeeper coordinates and manages distributed systems.
It acts like a centralized service to maintain:
Configuration info
Naming services
Distributed synchronization
Group services
🔹 Deployment Modes
Client Mode: Driver runs on the machine submitting the job.
Cluster Mode: Driver runs inside the cluster for better fault tolerance.
2. Cluster Manager
3. Spark Executors
4. Spark Session
1. Submit Application
Client submits the job to the Spark cluster.
2. DAG Formation
Driver parses transformations and actions into a logical DAG.
3. DAG to Physical Plan
Lineage Graph
Keeps track of the set of dependencies between different RDDs.
Provides important information to compute each RDD on demand and to recover lost data if part of an RDD is lost.
Spark uses lazy evaluation to reduce the number of passes it has to take over our data by grouping operations together.
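A small hedged PySpark sketch of lazy evaluation and lineage: nothing runs until the action, and toDebugString shows the recorded dependency chain (the local SparkContext is illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local", "Lineage Example")

nums = sc.parallelize(range(10))            # base RDD
evens = nums.filter(lambda x: x % 2 == 0)   # transformation (lazy)
squares = evens.map(lambda x: x * x)        # transformation (lazy)

# Nothing has executed yet; the lineage records how to (re)build `squares`
print(squares.toDebugString().decode())

# The action triggers a single pass over the data
print(squares.collect())  # [0, 4, 16, 36, 64]
sc.stop()
```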
Map
Takes in a function and applies it to each element in the RDD.
Returns a new RDD containing the result for each element.
Filter
Takes in a function and returns an RDD that only has the elements that pass the filter() function.
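A tiny hedged example of map and filter on an RDD (assumes an existing SparkContext `sc`):

```python
# Assumes `sc` is an existing SparkContext
words = sc.parallelize(["hadoop", "spark", "hive", "pig"])

# map: apply a function to every element, producing a new RDD
lengths = words.map(lambda w: (w, len(w)))

# filter: keep only the elements that satisfy the predicate
short = words.filter(lambda w: len(w) <= 4)

print(lengths.collect())  # [('hadoop', 6), ('spark', 5), ('hive', 4), ('pig', 3)]
print(short.collect())    # ['hive', 'pig']
```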
Other mathematical set operations
Aggregations
reduceByKey runs several parallel reduce operations, one for
each key in the dataset, where each operation combines values that
have the same key.
Tuning the Degree of Parallelism
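A hedged sketch of controlling the degree of parallelism: shuffle operations such as reduceByKey accept a numPartitions argument, and repartition()/coalesce() change partitioning explicitly (the numbers below are arbitrary examples; `sc` is an existing SparkContext):

```python
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)], numSlices=4)

# Ask the shuffle to produce 8 output partitions
counts = pairs.reduceByKey(lambda a, b: a + b, numPartitions=8)

# Or change the partitioning of an existing RDD
more = counts.repartition(16)   # full shuffle
fewer = counts.coalesce(2)      # avoids a full shuffle when shrinking

print(counts.getNumPartitions(), more.getNumPartitions(), fewer.getNumPartitions())
```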
import sys
from pyspark import SparkContext, SparkConf

# create Spark context with the necessary configuration
sc = SparkContext("local", "PySpark Word Count Example")

# read data from a text file and split each line into words
words = sc.textFile("D:/workspace/spark/input.txt").flatMap(lambda line: line.split(" "))

# count the occurrence of each word
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# save the counts to output
wordCounts.saveAsTextFile("D:/workspace/spark/output/")
Expressivity and Simplicity
For example, consider a very common query where we want to
aggregate all the ages for each name, group by name, and then
average the ages
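A hedged DataFrame sketch of that query, in the spirit of the Learning Spark example (the data and SparkSession are assumed):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("AverageAges").getOrCreate()

data = [("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)]
df = spark.createDataFrame(data, ["name", "age"])

# Group by name and average the ages for each name
df.groupBy("name").agg(avg("age").alias("avg_age")).show()
```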
Defining Schema
A schema in Spark defines the column names and associated data
types for a DataFrame.
Two Ways
Define it programmatically
Using DDL
Defining Schema (Example)
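A hedged sketch of both approaches; the column names are illustrative, not from the original example:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# 1. Define the schema programmatically
schema_prog = StructType([
    StructField("author", StringType(), False),
    StructField("title", StringType(), False),
    StructField("pages", IntegerType(), False),
])

# 2. Define the schema with a DDL string
schema_ddl = "author STRING, title STRING, pages INT"

# Either form can be passed to a reader or to createDataFrame, e.g.:
# df = spark.createDataFrame(data, schema=schema_prog)
# df = spark.read.csv("books.csv", schema=schema_ddl, header=True)
```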
Projections and filters
Example questions:
• How many distinct CallTypes were recorded as the causes of the fire calls?
• List the different distinct CallTypes recorded as the causes of the fire calls.
Aggregation example: what were the most common types of fire calls?
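A hedged PySpark sketch answering these questions, in the style of the Learning Spark fire-calls example; the DataFrame fire_df and its CallType column are assumed to be already loaded:

```python
from pyspark.sql.functions import col, countDistinct

# How many distinct CallTypes were recorded? (nulls excluded)
(fire_df.select("CallType")
        .where(col("CallType").isNotNull())
        .agg(countDistinct("CallType").alias("DistinctCallTypes"))
        .show())

# List the distinct CallTypes
(fire_df.select("CallType")
        .where(col("CallType").isNotNull())
        .distinct()
        .show(10, False))

# Most common types of fire calls
(fire_df.select("CallType")
        .where(col("CallType").isNotNull())
        .groupBy("CallType")
        .count()
        .orderBy("count", ascending=False)
        .show(10, False))
```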
find all flights between San Francisco (SFO) and
Chicago (ORD) with at least a two-hour delay
Same Query using DataFrame API
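A hedged sketch of that query in both forms; the table, DataFrame, and column names follow the Learning Spark flights example and are assumptions here:

```python
from pyspark.sql.functions import col

# SQL form (assumes a temporary view `us_delay_flights_tbl` with delay in minutes)
spark.sql("""
    SELECT date, delay, origin, destination
    FROM us_delay_flights_tbl
    WHERE delay >= 120 AND origin = 'SFO' AND destination = 'ORD'
    ORDER BY delay DESC
""").show(10)

# Same query using the DataFrame API (assumes `df` holds the same data)
(df.select("date", "delay", "origin", "destination")
   .where((col("delay") >= 120) & (col("origin") == "SFO") & (col("destination") == "ORD"))
   .orderBy(col("delay").desc())
   .show(10))
```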
For example, we want to label all US flights,
regardless of origin and destination, with an
indication of the delays they experienced:
Very Long Delays (> 6 hours),
Long Delays (2–6 hours), etc.
Analysis
Example: looking for unresolved attributes or relations (e.g., checking that a column name is valid).
Logical Optimizations
applies standard rule-based optimizations to the logical plan (including
constant folding, predicate pushdown, projection pruning)
Physical Plan
Takes a logical plan and generates one or more physical plans, using physical operators that match the Spark execution engine.
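These stages can be inspected with DataFrame.explain; a hedged example, assuming a DataFrame df with name and age columns:

```python
# Show the parsed, analyzed, and optimized logical plans plus the physical plan
df.filter(df.age > 30).select("name").explain(True)

# In Spark 3.x a specific output mode can also be requested
df.filter(df.age > 30).select("name").explain(mode="formatted")
```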
Immutable: once created, an RDD cannot be changed; all transformations create new RDDs.
🏗️ How to Create an RDD
1. From External Datasets
Using data sources like HDFS, local files, HBase, etc.
rdd = sc.textFile("data.txt")
2. By Parallelizing a Collection
Converts a local Python collection to an RDD.
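For example, a minimal sketch assuming an existing SparkContext `sc`:

```python
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)   # distribute a local Python list as an RDD
```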
🔍 RDD Components
1. Dependencies
2. Partitions
3. Compute Function
Examples:
`distinct()` → Removes duplicates.
Types:

| Type | Description |
|---|---|
| Narrow | Each output partition depends on one input partition (e.g., `map`, `filter`) |
| Wide | Requires data shuffling across partitions (e.g., `reduceByKey`, `groupByKey`) |
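A small hedged illustration of the difference (assumes an existing SparkContext `sc`):

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow transformations: each output partition depends on one input partition
doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))
positive = doubled.filter(lambda kv: kv[1] > 0)

# Wide transformation: values with the same key must be shuffled to one partition
totals = positive.reduceByKey(lambda a, b: a + b)

print(totals.collect())  # [('a', 8), ('b', 4)] (order may vary)
```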
▶️ 2. Actions (Eager)
Triggers the execution of the transformations.
Examples:
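For instance, a hedged sketch of common actions (assumes an existing SparkContext `sc`):

```python
rdd = sc.parallelize([3, 1, 2, 1])

print(rdd.collect())                   # [3, 1, 2, 1] - bring all elements to the driver
print(rdd.count())                     # 4
print(rdd.first())                     # 3
print(rdd.take(2))                     # [3, 1]
print(rdd.reduce(lambda a, b: a + b))  # 7
rdd.saveAsTextFile("out_dir")          # write each partition as a text file (directory must not exist)
```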
💾 Persistence methods:
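A hedged sketch of the standard caching calls (the log file and filter are illustrative; StorageLevel comes from pyspark):

```python
from pyspark import StorageLevel

logs = sc.textFile("access.log")
errors = logs.filter(lambda line: "ERROR" in line)

errors.cache()                                  # default in-memory storage level
# errors.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if it does not fit in memory

print(errors.count())   # first action materializes and caches the RDD
print(errors.count())   # second action reuses the cached data
errors.unpersist()
```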
🖥️ Spark UI
Web interface to monitor jobs, stages, and tasks.
Accessible at http://localhost:4040 (default in local mode).
Key Info:
Operations:
# Save results
wordCounts.saveAsTextFile("D:/workspace/spark/output/")
A schema defines the structure of a DataFrame: column names and types.
Two ways:
1. Programmatically
2. Using DDL (Data Definition Language)
Labeling Delays:

from pyspark.sql.functions import when, col

df = df.withColumn("Delay_Label",
    when(col("delay") > 360, "Very Long Delays")
    .when(col("delay") > 120, "Long Delays")
    .otherwise("Short Delays"))
Optimizations:
✅ Summary Table

| Topic | Key Point |
|---|---|
| RDD | Immutable, distributed data structure |
| Transformations | Lazy operations producing new RDDs |
| Actions | Trigger execution and return result |
| Lineage | Graph of dependencies for recovery |
| Persistence | Caches RDDs to avoid recomputation |
| Spark UI | Monitor jobs and performance |
| Pair RDD | Key-value data operations |
| DataFrames | Structured API with schema and SQL support |