Unit 2
Big Data Platforms Using Examples
Topics: HDFS, YARN, Map-Reduce Programming Model, HBase, Cassandra, CAP Theorem
HDFS ARCHITECTURE
Hadoop Architecture
[Diagram: one Master Machine and several Slave Machines]
HADOOP COMPONENTS
HDFS ARCHITECTURE OF GEN 2 (VERSION 2)
HDFS Daemons
• Name Node (NN)
• Data Node (DN)
[Diagram: NN on the Master Machine, DNs on the Slave Machines]
NameNode Daemon
Name Node:
• Master daemon which maintains and manages the Data Nodes (slave nodes).
• Records the metadata of all the files stored in the cluster.
• Regularly receives a Heartbeat and a block report from all the Data Nodes in the cluster.
DataNode Daemon
Data Node:
• Slave daemons which run on each slave machine.
• The actual data is stored on Data Nodes.
• Responsible for serving read & write requests from the clients.
SECONDARY NAME NODE
• Job of the Secondary Name Node is to contact the Name Node in a periodic manner after a certain time interval (by default 1 hour) and pull a copy of the metadata information out of the Name Node.
NameNode Metadata
• The NameNode stores the metadata of all the files in the cluster, e.g. filename, file path, number of blocks, block IDs, block locations, slave-related configurations, etc.
• This metadata is stored in memory for faster retrieval, to reduce the latency that would otherwise be caused by disk seeks.
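As a concrete illustration of how a client interacts with these daemons, here is a minimal sketch using the third-party HdfsCLI Python package (hdfs), which talks to the NameNode's WebHDFS interface; the NameNode answers metadata queries and redirects the actual block reads/writes to DataNodes. The endpoint URL, user name, and file paths below are assumptions for illustration only.

# Minimal HDFS client sketch (assumes: pip install hdfs, and a WebHDFS
# endpoint reachable at http://namenode-host:9870 -- hypothetical host).
from hdfs import InsecureClient

client = InsecureClient('http://namenode-host:9870', user='hadoop')

# The NameNode answers metadata queries such as directory listings ...
print(client.list('/'))

# ... while the actual block data for reads/writes is served by DataNodes.
client.write('/tmp/example.txt', data=b'hello hdfs', overwrite=True)
with client.read('/tmp/example.txt') as reader:
    print(reader.read())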
YARN Daemons
AppMaster:
• One per application.
• Coordinates and manages MR jobs.
• Negotiates resources from the RM.
Container:
• Allocates a certain amount of resources (memory, CPU, etc.) on a slave node (NM).
Resource Manager:
• The Resource Manager is the master that manages the cluster's resources.
Functions:
• Manages Nodes.
• Manages Containers.
• Manages Application Masters.
• The RM has a scheduler which is responsible for allocating resources to the various running applications.
Application Master
• The Application Master is the framework-specific entity that manages execution of an application.
• Each application has its own unique Application Master.
• It communicates with the RM on one side and with the NM on the other.
Functions:
• Negotiates resources (containers) from the Resource Manager.
• Periodically sends heartbeats to the RM to affirm its health and to update the record of its resource demands.
• Works with the Node Manager to execute the tasks.
• Tracks status and monitors progress of tasks and their resource consumption.
• AM allows individual applications to utilize cluster resources in a shared, secure and multitenant manner.
Container
• Basic unit of resource allocation.
• Application Master runs as a normal container.
• A Container is the resource allocation that results from the RM granting a specific Resource Request.
• Fine-grained resource allocation replaces the fixed MapReduce slots.
Apache Spark is an open-source, in-memory,
cluster computing framework for real-time
processing.
It provides high-level APIs in Java, Scala, Python, and R.
[Figure: features of Spark, including deployment options and fault tolerance]
2. Spark SQL
• Spark SQL: A module for working with structured data. It allows querying of data using SQL as
well as a DataFrame API, making it easy to work with structured and semi-structured data.
Spark SQL integrates with a variety of data sources like Hive, Parquet, and JDBC.
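To make the DataFrame and SQL APIs concrete, here is a minimal PySpark sketch; the column names and data are illustrative assumptions, not from the slides.

# Minimal Spark SQL sketch (column names and data are illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Build a DataFrame from an in-memory list (could equally come from Hive, Parquet, or JDBC).
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# DataFrame API
df.filter(df.id > 1).show()

# SQL over the same data via a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id > 1").show()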
3. Spark Streaming
• Spark Streaming: A real-time processing module that processes live data streams. It builds on
top of Spark Core and provides a high-level API for stream processing. It processes data in small
batches and supports integration with sources like Kafka, Flume, and HDFS.
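A minimal DStream sketch of the micro-batch model described above; the socket host and port are placeholders (a Kafka or Flume source would use the corresponding connector instead).

# Minimal Spark Streaming (DStream) sketch; host/port are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingExample")
ssc = StreamingContext(sc, 5)  # process data in 5-second micro-batches

# Read lines from a TCP socket and count words per batch
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()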
7. PySpark
• PySpark: The Python API for Spark. It enables Python developers to interact with Spark's
distributed computing framework using familiar Python constructs. PySpark supports the Spark
Core, Spark SQL, and MLlib APIs.
8. Structured Streaming
• Structured Streaming: An extension of Spark SQL that supports continuous stream processing.
It provides a more declarative approach to stream processing by allowing users to define
streaming queries similar to batch queries.
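A small Structured Streaming sketch showing this declarative, batch-like style; the socket source and console sink are illustrative choices, and the host/port are placeholders.

# Minimal Structured Streaming sketch; source host/port are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

# Define a streaming DataFrame over a socket source
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The query is written just like a batch query
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously write the running counts to the console
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()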
Spark Architecture
• Apache Spark architecture is designed to provide a unified computing engine for big data
processing across various data processing scenarios like batch processing, streaming, machine
learning, and more. Spark’s architecture is based on a master-slave model that provides
scalability, fault tolerance, and high performance through in-memory computation.
• Key Components of Spark Architecture:
1. Driver Program
2. Cluster Manager
3. Workers/Executors
4. Distributed Storage (HDFS, S3, etc.)
1. Driver Program
•Role: The Driver Program is the entry point for any Spark application. It contains the main
function and is responsible for creating the SparkContext, which coordinates the execution of the
application.
•Responsibilities:
•Converting user code into a directed acyclic graph (DAG) of stages
and tasks.
•Scheduling tasks on the worker nodes.
•Collecting and aggregating results from the worker nodes.
•Handling user-defined actions (e.g., collect(), count()).
•Managing job lifecycle and fault recovery.
The Driver program communicates with the Cluster Manager to request resources and assigns
tasks to worker nodes (executors).
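For instance, the driver is simply the process that runs code like the following; the app name and master URL are placeholder values.

# The driver program creates the SparkSession/SparkContext and issues jobs.
# The master URL and app name below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("DriverExample")
         .master("local[*]")          # in a cluster this would point at the cluster manager
         .getOrCreate())
sc = spark.sparkContext              # the SparkContext coordinates execution

# Transformations build the DAG; the action below triggers scheduling on executors
rdd = sc.parallelize(range(1, 1001))
total = rdd.map(lambda x: x * 2).reduce(lambda a, b: a + b)
print(total)                          # result is collected back to the driver

spark.stop()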
• 2. Cluster Manager
• Role: The Cluster Manager is responsible for managing
the resources across the cluster. Spark can work with
various cluster managers, including:
• Standalone Cluster Manager: Spark’s built-in manager.
• Apache YARN: Used in Hadoop ecosystems.
• Apache Mesos: A general-purpose cluster manager.
• Kubernetes: A container orchestration system for
managing distributed Spark jobs in containers.
• The Cluster Manager allocates resources (CPU,
memory) across the cluster and manages the execution
of applications submitted by the Driver.
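The choice of cluster manager is usually expressed through the master URL when the session is built or the application is submitted; the URLs below are illustrative sketches (host names and ports are assumptions).

# Illustrative master URLs for the different cluster managers (hosts/ports are placeholders):
#   Standalone:  spark://master-host:7077
#   YARN:        yarn            (cluster details come from the Hadoop configuration)
#   Mesos:       mesos://mesos-master:5050
#   Kubernetes:  k8s://https://k8s-apiserver:6443
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ClusterManagerExample")
         .master("spark://master-host:7077")   # swap in one of the URLs above
         .getOrCreate())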
• 3. Workers/Executors
• Workers: These are nodes in the cluster that run the
tasks assigned by the Driver program. A worker node
runs one or more Executors.
• Executors:
• Role: Executors are the core components of Spark's
execution model. They are distributed across worker
nodes in the cluster and are responsible for executing
individual tasks in parallel.
• Responsibilities:
• Running tasks (units of work) assigned by the Driver.
• Storing and caching data in memory or on disk across distributed nodes.
• Communicating with the Driver program to send task results.
• Fault tolerance: If an executor fails, the Driver can reschedule the failed tasks to be executed on another
executor.
• Each executor typically runs for the entire lifetime of a Spark application and has two major
functions:
• Task Execution: It executes tasks on the partitioned data.
• Data Storage: It stores intermediate data in memory/disk and caches RDDs if required.
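Executor resources are typically requested by the driver through standard Spark configuration properties; a minimal sketch follows, and the values chosen are arbitrary examples.

# Requesting executor resources via Spark config properties (values are examples).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ExecutorConfigExample")
         .config("spark.executor.instances", "4")   # number of executors
         .config("spark.executor.cores", "2")       # cores per executor
         .config("spark.executor.memory", "4g")     # heap per executor
         .getOrCreate())

# cache() asks executors to keep this dataset in memory across jobs
df = spark.range(0, 1000000).cache()
print(df.count())   # first action materialises and caches the data on the executors
print(df.count())   # subsequent actions reuse the cached partitions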
• 4. Distributed Storage
• Role: Spark works with various distributed storage systems to load and store data across the
cluster. The commonly used systems include:
• HDFS: Hadoop Distributed File System.
• Amazon S3: Cloud-based storage.
• Apache HBase: NoSQL database for real-time read/write access.
• Apache Cassandra: Distributed database system.
• Other file systems: Local filesystem, NFS, etc.
• The storage systems provide persistence and fault tolerance for the data being processed.
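Reading from these storage systems looks the same from the application's point of view; only the URI scheme changes. The paths and host names below are placeholders, and S3 access additionally assumes the hadoop-aws connector and credentials are configured.

# Loading data from different distributed storage systems (paths are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StorageExample").getOrCreate()

# HDFS (namenode host/port are assumptions)
hdfs_df = spark.read.text("hdfs://namenode-host:9000/data/input.txt")

# Amazon S3 (requires the hadoop-aws connector and credentials to be configured)
s3_df = spark.read.parquet("s3a://example-bucket/data/")

# Local filesystem
local_df = spark.read.csv("file:///tmp/data.csv", header=True)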
RDDs are the building blocks of any Spark application. RDD stands for:
• Resilient: Fault tolerant and capable of rebuilding data on failure
• Distributed: Data is distributed among the multiple nodes in a cluster
• Dataset: Collection of data
There are two ways to create RDDs − parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, etc.
1.Immutability: Once created, RDDs cannot be altered. However, you can transform them into new RDDs
using operations like map, filter, and reduce.
2.Fault Tolerance: RDDs are resilient to failures because they can be reconstructed from their lineage.
If a partition of an RDD is lost, Spark can recompute it using the original dataset and the transformations applied (see the lineage sketch after this list).
3.Distributed: RDDs are distributed across multiple nodes in a cluster, allowing for parallel processing of large datasets.
4.Partitioning: RDDs are divided into partitions, which are processed in parallel. The number of partitions can be configured,
and partitioning can be used to optimize operations like joins by ensuring related data is co-located.
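A small sketch of points 2 and 4: the lineage that makes recomputation possible can be inspected with toDebugString(), and the partition layout with getNumPartitions() and glom(). The data values used here are arbitrary.

# Inspecting lineage (fault tolerance) and partitioning of an RDD; data is arbitrary.
from pyspark import SparkContext

sc = SparkContext("local[4]", "RDD Internals Example")

rdd = sc.parallelize(range(10), numSlices=4)      # explicitly request 4 partitions
doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 5)

# Lineage: the chain of transformations Spark would replay to rebuild a lost partition
print(doubled.toDebugString().decode())

# Partitioning: how the elements are laid out across partitions
print(doubled.getNumPartitions())                 # -> 4
print(doubled.glom().collect())                   # elements grouped per partition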
Operations on RDDs:
1.Transformations: These operations create a new RDD from an existing one. Examples include:
•map(func): Applies a function to each element of the RDD.
•filter(func): Filters the elements of the RDD based on a predicate function.
•flatMap(func): Similar to map, but can return multiple values for each element.
•union(rdd): Combines two RDDs into a single one.
2.Actions: These operations trigger the execution of transformations and return a result to the driver program
or write data to external storage. Examples include:
•reduce(func): Aggregates the elements of the RDD using the specified function.
• Creating RDDs
from pyspark import SparkContext
# Initialize SparkContext
sc = SparkContext("local", "RDD Example")
# Create an RDD from a list
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
# Create an RDD from an external file
file_rdd = sc.textFile("path/to/textfile.txt")
Transformations
# map: multiply each element by 2
mapped_rdd = rdd.map(lambda x: x * 2)
# filter: keep only even numbers
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
Actions:
# Collect all elements of the RDD
collected_data = rdd.collect()
print(collected_data)
# Count the number of elements in the RDD
count = rdd.count()
print(f"Count: {count}")
# Reduce: sum all elements
total_sum = rdd.reduce(lambda a, b: a + b)
print(f"Sum: {total_sum}")
Partitioning and Caching
# Repartition the RDD into 4 partitions
repartitioned_rdd = rdd.repartition(4)
# Cache the RDD in memory
rdd.cache()
# Perform an action to see the cached effect
cached_result = rdd.count()
HISTORY OF HBASE
[Timeline: Google's Big Table paper → Hadoop's sub-project → 0.92 release]
HBase is a sparse, distributed, consistent, multi-dimensional, sorted map.
HBase Architecture
HBase Components
• HBase has three major components: the HMaster server, the Region Servers, and ZooKeeper.
HBase Read and Write
• HBase Write Path:
– Clients don't interact directly with the underlying HFiles during writes; writes go first to the write-ahead log and the MemStore.
• HBase Read Path:
– Data is reconciled from the BlockCache, the MemStore and the HFiles to give the client an up-to-date view of the row(s) it asked for.
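As an illustration of the client's view of these paths, here is a minimal sketch using the third-party happybase Python library over HBase's Thrift gateway; the host, table name ('users'), and column family ('info') are assumptions, and the table is presumed to already exist.

# Minimal HBase client sketch via the happybase library (Thrift gateway).
# Host, table name, and column family below are assumptions for illustration.
import happybase

connection = happybase.Connection('hbase-thrift-host')   # Thrift server, default port 9090
table = connection.table('users')

# Write path from the client's perspective: a put, not a direct HFile write
table.put(b'row-1', {b'info:name': b'Alice', b'info:city': b'Pune'})

# Read path: the server reconciles BlockCache, MemStore and HFiles for us
print(table.row(b'row-1'))

# Scan a range of rows
for key, data in table.scan(row_prefix=b'row-'):
    print(key, data)

connection.close()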
HBASE                       RDBMS
Column-Oriented             Row-Oriented
Tight integration with MR   Not really
HISTORY OF CASSANDRA
[Timeline: 2008 – 2009 – 2010]
Cassandra Distributed Design
COMPONENTS OF CASSANDRA
• Node − The place where data is stored.
• Data center − A collection of related nodes.
• Cluster − A component that contains one or more data centers.
• Commit log − A crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table − A memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single-column family, there will be multiple mem-tables.
• SSTable − A disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
Data Model
The outermost container is known as the cluster.
User-Defined Datatype Commands (see the sketch after this list for how they are issued from a client):
1. CREATE TYPE − Creates a user-defined datatype.
2. ALTER TYPE − Modifies a user-defined datatype.
3. DROP TYPE − Drops a user-defined datatype.
4. DESCRIBE TYPE − Describes a user-defined datatype.
5. DESCRIBE TYPES − Describes user-defined datatypes.
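A minimal sketch of issuing CREATE/ALTER/DROP TYPE from Python using the DataStax cassandra-driver; the contact point, keyspace name ('demo'), and type definition are assumptions (DESCRIBE is a cqlsh shell command rather than a CQL statement the driver executes).

# Minimal sketch with the DataStax cassandra-driver (pip install cassandra-driver).
# Contact point, keyspace name, and type fields are assumptions for illustration.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.set_keyspace('demo')

# CREATE TYPE: define a user-defined datatype
session.execute("CREATE TYPE IF NOT EXISTS address (street text, city text, zip int)")

# ALTER TYPE: add a field to the datatype
session.execute("ALTER TYPE address ADD country text")

# DROP TYPE: remove the datatype (DESCRIBE TYPE/TYPES are cqlsh commands, not CQL)
session.execute("DROP TYPE IF EXISTS address")

cluster.shutdown()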
RDBMS vs Cassandra
RDBMS               Cassandra
Database Server     Cluster
Database            Keyspace
Table               Column Family
Rows and Columns    Rows and Columns
SQL                 CQL
RDBMS vs Cassandra
• RDBMS deals with structured data; Cassandra deals with unstructured data.
• In RDBMS, a table is an array of arrays (ROW x COLUMN); in Cassandra, a table is a list of "nested key-value pairs" (ROW x COLUMN key x COLUMN value).
• In RDBMS, the database is the outermost container that contains data corresponding to an application; in Cassandra, the keyspace is the outermost container.
• In RDBMS, tables are the entities of a database; in Cassandra, tables or column families are the entities of a keyspace.
• RDBMS supports the concepts of foreign keys and joins; in Cassandra, relationships are represented using collections.
CAP THEOREM
[Diagram: Consistent, Available, Partition Tolerant]
• In 2000, Eric Brewer presented a theory that he had been working on for a few years at the University of California, Berkeley, and at his company Inktomi, at the Symposium on Principles of Distributed Computing.
• He presented the concept that three core systemic requirements
need to be considered when it comes to designing and deploying
applications in a distributed environment, and further stated the
relationship among these requirements will create shear in terms
of which requirement can be given up to accomplish the
scalability requirements of your situation.
• The three requirements are consistency, availability, and partition tolerance, giving Brewer's theorem its other name: CAP.
• In simple terms, the CAP theorem states that in a distributed data
system, you can guarantee two of the following three
requirements: consistency (all data available at all nodes or
systems), availability (every request will get a response), and
partition tolerance (the system will operate irrespective of
availability or a partition or loss of data or communication).
• The system architected on this model will be called BASE
(basically available soft state eventually consistent) architecture
as opposed to ACID.
• Combining the principles of the CAP theorem and the data architecture of
BigTable or Dynamo, there are several solutions that have evolved: HBase,
MongoDB, Riak, Voldemort, Neo4J, Cassandra, HyperTable, HyperGraphDB,
Memcached, Tokyo Cabinet, Redis, CouchDB, and more niche solutions.
• Cassandra, Dynamo, and Voldemort are architected on AP (from CAP).
• Broadly, NoSQL databases have been classified into four subcategories:
1. Key-value pairs. This model is implemented using a hash table where there is a
unique key and a pointer to a particular item of data creating a key-value pair; for
example, Voldemort.
2. Column family stores. An extension of the key-value architecture with columns
and column families, the overall goal was to process distributed data over a pool of
infrastructure; for example, HBase and Cassandra.
3. Document databases. This class of databases is modeled after Lotus Notes and
similar to key-value stores. The data is stored as a document and is represented in
JSON or XML formats. The biggest design feature is the flexibility to list multiple
levels of key-value pairs; for example, Riak and CouchDB.
4. Graph databases. Based on graph theory, this class of databases supports
scalability across a cluster of machines. The complexity of representation for
extremely complex sets of documents is evolving; for example, Neo4J.