
UNIT 2

BIG DATA PLATFORMS


Topics: HDFS, Map-Reduce, YARN, Spark, HBase, Cassandra, CAP Theorem, and the Programming Model, illustrated using example big data platforms.

HDFS ARCHITECTURE

Hadoop Architecture
[Diagram: a Hadoop cluster with one master machine and several slave machines.]
HADOOP COMPONENTS

Hadoop's core components:
1. HDFS (Storage)
2. YARN (Processing)

HDFS Daemons
• Name Node (NN)
• Data Node (DN)

YARN Daemons
• Resource Manager (RM)
• Node Manager (NM)

[Diagram: the NameNode (NN) and ResourceManager (RM) run on the master machine; a DataNode (DN) and NodeManager (NM) run on each slave machine.]
HDFS ARCHITECTURE (GEN 2 / VERSION 2)

HDFS Daemons
• Name Node (NN)
• Data Node (DN)

Name Node:
• Master daemon which maintains and manages the Data Nodes (slave nodes).
• Records the metadata of all the files stored in the cluster.
• Regularly receives a heartbeat and a block report from all the Data Nodes in the cluster.

[Diagram: the NameNode (NN) runs on the master machine; DataNodes (DN) run on the slave machines.]

DATANODE DAEMON

Data Node:
• Slave daemon which runs on each slave machine.
• The actual data is stored on the Data Nodes.
• Responsible for serving read and write requests from the clients.
SECONDARY NAME NODE

• The job of the Secondary Name Node is to contact the Name Node periodically after a certain time interval (by default 1 hour) and pull a copy of the metadata information out of the Name Node.
• Checkpointing is the process of combining the edit logs with the FsImage.
• The Secondary Name Node takes over the responsibility of checkpointing, therefore making the Name Node more available.
Functions of NameNode
• Stores all the metadata (data about data) of all the slave nodes in a Hadoop cluster, e.g. filename, file path, number of blocks, block IDs, block locations, slave-related configurations, etc.
• This metadata is stored in memory for faster retrieval, to reduce the latency that would be caused by disk seeks.
• Hence, it is recommended that the master node on which the Name Node daemon runs should be very reliable hardware with a high configuration and plenty of RAM.
• Keeps track of all the slave nodes (whether they are alive or dead). This is done using the heartbeat mechanism.
• Replication (provides high availability, reliability and fault tolerance): the Name Node replicates the data on a slave node to various other slave nodes based on the configured replication factor.
• Balancing: the Name Node balances data replication, i.e., blocks of data should not be under- or over-replicated. This needs to be manually configured.
Functions of DataNode
• The Data Nodes perform the low-level read and write requests from the file system's clients.
• The client writes data to one slave node, and it is then the responsibility of the Data Node to replicate the data to other slave nodes according to the replication factor.
• Every Data Node sends a heartbeat message to the Name Node every 3 seconds to convey that it is alive. If the Name Node does not receive a heartbeat from a Data Node for 10 minutes, it considers that Data Node dead and starts the process of block replication on some other Data Node.
• All Data Nodes are synchronized in the Hadoop cluster so that they can communicate with one another and take care of:
  i. Balancing the data in the system
  ii. Moving data to keep replication high
  iii. Copying data when required
Hadoop Version 1

Hadoop's core components:
1. HDFS (Storage)
2. Map-Reduce (Processing)

HDFS Daemons
• Name Node (NN)
• Data Node (DN)

Map-Reduce Daemons
• Job Tracker (JT)
• Task Tracker (TT)

[Diagram: the NameNode (NN) and JobTracker (JT) run on the master machine; a DataNode (DN) and TaskTracker (TT) run on each slave machine.]
MapReduce
• MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.
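To make the programming model concrete, below is a minimal, self-contained word-count sketch that simulates the map, shuffle and reduce phases in plain Python; the sample documents are illustrative, and a real Hadoop job would run the same map and reduce logic in parallel across Task Trackers.

from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input split
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: aggregate all values emitted for the same key
    return (word, sum(counts))

documents = ["big data platforms", "hadoop and spark", "big data with hadoop"]

# Shuffle/sort: group the intermediate (key, value) pairs by key
grouped = defaultdict(list)
for line in documents:
    for word, count in map_phase(line):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(sorted(results))   # e.g. [('and', 1), ('big', 2), ('data', 2), ...]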
Hadoop Version 1: Map-Reduce Daemons

Job Tracker (JT)
• This is the master node of the MapReduce system; it manages the jobs and resources in the cluster (Task Trackers).
• The Job Tracker tries to schedule each map as close as possible to the actual data being processed, on the Task Tracker which is running on the same Data Node as the underlying block.
• One per Hadoop cluster.
• Receives job requests submitted by the client.
Task Tracker (TT)
• These are the slave daemons that are deployed on each machine.
• They are responsible for running the map and reduce tasks as instructed by the Job Tracker.
• Reports the status of execution to the Job Tracker.
• Executes MapReduce operations.
YARN (YET ANOTHER RESOURCE NEGOTIATOR)

What is YARN?
• Hadoop 2.0 introduced a new framework, YARN (Yet Another Resource Negotiator), which provides the ability to run non-MapReduce applications.
• The YARN framework is responsible for cluster resource management.
YARN Architecture

Main Components of YARN

Resource Manager:
• Master daemon that manages all other daemons and accepts job submissions.
• Allocates a container for the AppMaster.

Node Manager:
• Responsible for containers: monitors their resource usage (CPU, memory, disk, network) and reports the same to the RM.

AppMaster:
• One per application.
• Coordinates and manages MR jobs.
• Negotiates resources from the RM.

Container:
• Allocates a certain amount of resources (memory, CPU, etc.) on a slave node (NM).
Resource Manager
• The Resource Manager is the master that manages the division of resources among all the applications in the system.
Functions:
• Manages Nodes.
• Manages Containers.
• Manages Application Masters.
• The RM has a scheduler which is responsible for allocating resources to the various running applications.
Node Manager

• The Node Manager is the per-machine "worker" agent, taking care of the individual compute nodes in a Hadoop cluster.
The Node Manager is responsible for:
• Launching the application’s containers.
• Monitoring their resource usage (CPU, memory, disk, network)
• Reporting the same to the Resource Manager.
• Killing containers as directed by the Resource Manager.
Application Master

• The Application Master is the framework-specific entity that manages the execution of an application.
• Each application has its own unique Application Master.
• It communicates with the RM on one side and with the NM on the other.
Functions:
• Negotiates resources (containers) from the Resource Manager.
• Periodically sends heartbeats to the RM to affirm its health and to update the record of its resource demands.
• Works with the Node Manager to execute the tasks.
• Tracks status and monitors progress of tasks and their resource consumption.
• The AM allows individual applications to utilize cluster resources in a shared, secure and multi-tenant manner.
Container
• Basic unit of resource allocation.
• The Application Master itself runs as a normal container.
• A container is the resource allocation granted by the RM for a specific resource request.
• Fine-grained resource allocation replaces the fixed MapReduce slots.
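As a rough illustration of how fine-grained container allocation looks from the application side, the hedged PySpark sketch below asks YARN for a fixed number of executors, each launched inside a YARN container; the resource values are illustrative, and it assumes HADOOP_CONF_DIR points at the cluster configuration.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("YarnResourceExample")
         .master("yarn")                            # use YARN as the cluster manager
         .config("spark.executor.instances", "4")   # number of executor containers requested
         .config("spark.executor.memory", "2g")     # memory per container
         .config("spark.executor.cores", "2")       # vcores per container
         .getOrCreate())

print(spark.sparkContext.applicationId)             # the YARN application id assigned by the RM
spark.stop()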
APACHE SPARK

Apache Spark is an open-source, in-memory, cluster computing framework for real-time processing.
It provides high-level APIs in Java, Scala, Python, and R.
Spark performs up to 100 times faster in memory and up to 10 times faster on disk when compared to Hadoop MapReduce.
It is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.
Features of Apache Spark
• In-Memory Computing
• Swift Processing
• Powerful Caching
• Deployment
• Fault Tolerance
• Polyglot
• Real-Time Stream Processing
• Dynamic in Nature
• Lazy Evaluation
• Reusability

Spark Ecosystem
1. Core Spark
• Spark Core: The foundation of the Apache Spark ecosystem. It provides basic functionality like
task scheduling, memory management, fault recovery, and distributed storage system
integration. Spark Core is also responsible for in-memory computation, which makes Spark
faster than traditional MapReduce.

2. Spark SQL
• Spark SQL: A module for working with structured data. It allows querying data using SQL as well as a DataFrame API, making it easy to work with structured and semi-structured data. Spark SQL integrates with a variety of data sources like Hive, Parquet, and JDBC. (A short PySpark sketch follows this list.)
3. Spark Streaming
• Spark Streaming: A real-time processing module that processes live data streams. It builds on
top of Spark Core and provides a high-level API for stream processing. It processes data in small
batches and supports integration with sources like Kafka, Flume, and HDFS.

4. MLlib (Machine Learning Library)


• MLlib: A distributed machine learning library that provides a variety of algorithms for
classification, regression, clustering, collaborative filtering, and dimensionality reduction. It also
includes tools for feature extraction, transformation, and statistical analysis.
5. GraphX
• GraphX: A distributed graph processing framework built on Spark. It provides an API for
creating and manipulating graphs and includes algorithms like PageRank, connected
components, and shortest paths. GraphX combines the benefits of graph-parallel and
data-parallel systems.

6. PySpark
• PySpark: The Python API for Spark. It enables Python developers to interact with Spark's
distributed computing framework using familiar Python constructs. PySpark supports the Spark
Core, Spark SQL, and MLlib APIs.
7. Structured Streaming
• Structured Streaming: An extension of Spark SQL that supports continuous stream processing.
It provides a more declarative approach to stream processing by allowing users to define
streaming queries similar to batch queries.
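A short, hedged PySpark sketch of two of the modules above: a Spark SQL query over a DataFrame, followed by a minimal Structured Streaming word count. The file path, column names, and the socket source are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EcosystemExample").getOrCreate()

# Spark SQL: query structured data through the DataFrame API and SQL
df = spark.read.json("path/to/people.json")        # load structured/semi-structured data
df.createOrReplaceTempView("people")               # expose the DataFrame as a SQL view
spark.sql("SELECT name, age FROM people WHERE age > 21").show()

# Structured Streaming: the same declarative style over an unbounded stream
lines = (spark.readStream
         .format("socket")                         # read a text stream from a TCP socket
         .option("host", "localhost")
         .option("port", 9999)
         .load())
word_counts = lines.groupBy("value").count()       # a continuously updated aggregation
query = (word_counts.writeStream
         .outputMode("complete")
         .format("console")                        # print each micro-batch to the console
         .start())
query.awaitTermination()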
Spark Architecture
• Apache Spark architecture is designed to provide a unified computing engine for big data
processing across various data processing scenarios like batch processing, streaming, machine
learning, and more. Spark’s architecture is based on a master-slave model that provides
scalability, fault tolerance, and high performance through in-memory computation.
• Key Components of Spark Architecture:
1. Driver Program
2. Cluster Manager
3. Workers/Executors
4. Distributed Storage (HDFS, S3, etc.)
1. Driver Program
• Role: The Driver Program is the entry point for any Spark application. It contains the main function and is responsible for creating the SparkContext, which coordinates the execution of the application.
• Responsibilities:
  • Converting user code into a directed acyclic graph (DAG) of stages and tasks.
  • Scheduling tasks on the worker nodes.
  • Collecting and aggregating results from the worker nodes.
  • Handling user-defined actions (e.g., collect(), count()).
  • Managing the job lifecycle and fault recovery.
The Driver program communicates with the Cluster Manager to request resources and assigns tasks to worker nodes (executors).
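A minimal sketch of a driver program, assuming an illustrative input path: the transformations only build up the DAG, and the action at the end is what makes the driver schedule tasks on the executors.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DriverExample").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("path/to/input.txt")            # transformation: lazily defines an RDD
long_lines = lines.filter(lambda l: len(l) > 80)    # transformation: extends the DAG, still lazy
print(long_lines.count())                           # action: driver schedules tasks on executors

spark.stop()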
• 2. Cluster Manager
• Role: The Cluster Manager is responsible for managing
the resources across the cluster. Spark can work with
various cluster managers, including:
• Standalone Cluster Manager: Spark’s built-in manager.
• Apache YARN: Used in Hadoop ecosystems.
• Apache Mesos: A general-purpose cluster manager.
• Kubernetes: A container orchestration system for
managing distributed Spark jobs in containers.
• The Cluster Manager allocates resources (CPU,
memory) across the cluster and manages the execution
of applications submitted by the Driver.
• 3. Workers/Executors
• Workers: These are nodes in the cluster that run the
tasks assigned by the Driver program. A worker node
runs one or more Executors.
• Executors:
• Role: Executors are the core components of Spark's
execution model. They are distributed across worker
nodes in the cluster and are responsible for executing
individual tasks in parallel.
• Responsibilities:
• Running tasks (units of work) assigned by the Driver.
• Storing and caching data in memory or on disk across distributed nodes.
• Communicating with the Driver program to send task results.
• Fault tolerance: If an executor fails, the Driver can reschedule the failed tasks to be executed on another
executor.

• Each executor typically runs for the entire lifetime of a Spark application and has two major
functions:
• Task Execution: It executes tasks on the partitioned data.
• Data Storage: It stores intermediate data in memory/disk and caches RDDs if required.
• 4. Distributed Storage
• Role: Spark works with various distributed storage systems to load and store data across the
cluster. The commonly used systems include:
• HDFS: Hadoop Distributed File System.
• Amazon S3: Cloud-based storage.
• Apache HBase: NoSQL database for real-time read/write access.
• Apache Cassandra: Distributed database system.
• Other file systems: Local filesystem, NFS, etc.
• The storage systems provide persistence and fault tolerance for the data being processed.
Resilient Distributed Dataset (RDD)

RDDs are the building blocks of any Spark application. RDD stands for:
• Resilient: Fault tolerant and capable of rebuilding data on failure
• Distributed: Data is distributed among the multiple nodes in a cluster
• Dataset: Collection of data
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, etc.
With RDDs, you can perform two types of operations:
1. Transformations: The operations that are applied to create a new RDD.
2. Actions: Applied on an RDD to instruct Apache Spark to apply computation and pass the result back to the driver.
Key Features of RDDs:

1. Immutability: Once created, RDDs cannot be altered. However, you can transform them into new RDDs using operations like map, filter, and reduce.
2. Fault Tolerance: RDDs are resilient to failures because they can be reconstructed from their lineage. If a partition of an RDD is lost, Spark can recompute it using the original dataset and the transformations applied.
3. Distributed: RDDs are distributed across multiple nodes in a cluster, allowing for parallel processing of large datasets.
4. Partitioning: RDDs are divided into partitions, which are processed in parallel. The number of partitions can be configured, and partitioning can be used to optimize operations like joins by ensuring related data is co-located.
Operations on RDDs:

There are two types of operations that can be performed on RDDs:

1. Transformations: These operations create a new RDD from an existing one. Examples include:
• map(func): Applies a function to each element of the RDD.
• filter(func): Filters the elements of the RDD based on a predicate function.
• flatMap(func): Similar to map, but can return multiple values for each element.
• union(rdd): Combines two RDDs into a single one.

2. Actions: These operations trigger the execution of transformations and return a result to the driver program or write data to external storage. Examples include:
• collect(): Returns all elements of the RDD to the driver program.
• count(): Returns the number of elements in the RDD.
• reduce(func): Aggregates the elements of the RDD using the specified function.
Creating RDDs

from pyspark import SparkContext

# Initialize SparkContext
sc = SparkContext("local", "RDD Example")

# Create an RDD from a list
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Create an RDD from an external file
file_rdd = sc.textFile("path/to/textfile.txt")
Transformations
# map: multiply each element by 2
mapped_rdd = rdd.map(lambda x: x * 2)
# filter: keep only even numbers
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
Actions:
# Collect all elements of the RDD
collected_data = rdd.collect()
print(collected_data)

# Count the number of elements in the RDD
count = rdd.count()
print(f"Count: {count}")

# Reduce: sum all elements
total_sum = rdd.reduce(lambda a, b: a + b)
print(f"Sum: {total_sum}")
Partitioning and Caching
# Repartition the RDD into 4 partitions
repartitioned_rdd = rdd.repartition(4)

# Cache the RDD in memory
rdd.cache()

# Perform an action to materialize the cache
cached_result = rdd.count()
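Building on the snippet above, the lines below inspect the partitioning and print the lineage that Spark keeps for fault tolerance (in PySpark, toDebugString() returns bytes).

# Number of partitions after repartitioning
print(repartitioned_rdd.getNumPartitions())   # 4

# Lineage: the chain of transformations Spark can replay to rebuild lost partitions
transformed_rdd = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)
print(transformed_rdd.toDebugString().decode())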
HISTORY OF HBASE
• 2006: Google publishes the BigTable paper.
• 2007: HBase begins as a Hadoop contrib project.
• 2008: HBase becomes a Hadoop sub-project.
• 2010: HBase becomes an Apache top-level project.
• 2011: 0.92 release.
HBase is a sparse, distributed, multi-dimensional, consistent, sorted map.

HBase Architecture

HBase Components
HBase has three major components: the HMaster, the Region Servers, and ZooKeeper.
HBase Read and Write

• HBase Write Path:
  – Clients don't interact directly with the underlying HFiles during writes; writes go to the write-ahead log and the MemStore and are later flushed to HFiles.
• HBase Read Path:
  – Data is reconciled from the BlockCache, the MemStore and the HFiles to give the client an up-to-date view of the row(s) it asked for.
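A hedged client-side sketch of the paths above using happybase, a third-party Python client that talks to HBase through the Thrift gateway; the table name, column family, and row key are illustrative assumptions, and a Thrift server on localhost is assumed.

import happybase

connection = happybase.Connection('localhost')   # connect to the HBase Thrift gateway
table = connection.table('users')

# Write path: the client sends the put to the Region Server, which records it in the
# WAL and MemStore; the client never touches the underlying HFiles directly.
table.put(b'user-1', {b'info:name': b'Asha', b'info:city': b'Pune'})

# Read path: the Region Server reconciles BlockCache, MemStore and HFiles to return
# an up-to-date view of the requested row.
print(table.row(b'user-1'))

connection.close()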
HBASE vs RDBMS

HBASE | RDBMS
Column-oriented | Row-oriented
Flexible schema, add columns on the fly | Fixed schema
Good with sparse tables | Not optimized for sparse tables
Tight integration with MapReduce | Not really
Horizontal scalability: just add hardware | Hard to scale
Good for semi-structured as well as structured data | Good for structured data
CASSANDRA

HISTORY OF CASSANDRA
• 2008: Cassandra was open-sourced by Facebook.
• 2009: Cassandra was accepted into the Apache Incubator.
• 2010: Cassandra was made an Apache top-level project.
Cassandra

• Apache Cassandra is an open-source, NoSQL, column-oriented distributed database.
• It is scalable, fault-tolerant, and consistent.
• Its distribution design is based on Amazon's Dynamo and its data model on Google's Bigtable.
Cassandra: Distributed Design vs Data Model

COMPONENTS OF CASSANDRA (Distributed Design)

• Node: The place where data is stored.
• Data center: A collection of related nodes.
• Cluster: A component that contains one or more data centers.
• Commit log: The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table: A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
• SSTable: A disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
Data Model

• The outermost container is known as the cluster.
• Each cluster is assigned a keyspace.
• The basic attributes of a keyspace are:
  • Replication Factor: The number of machines in the cluster that will receive copies of the same data.
  • Replica Placement Strategy: The strategy used to place replicas in the ring (see the replication strategies below).
  • Column Families: A keyspace is a container for a list of one or more column families. A column family, in turn, is a container of a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
Data Replication Strategy

Two strategies: Simple Strategy and Network Topology Strategy.

Simple Strategy
• Simple Strategy is used when you have just one data center.
• Simple Strategy places the first replica on the node selected by the partitioner.
• After that, the remaining replicas are placed in clockwise direction in the node ring.
Network Topology Strategy
• Network Topology Strategy is used when you have more than one data center.
• In Network Topology Strategy, replicas are set for each data center separately. Network Topology Strategy places replicas in the clockwise direction in the ring until it reaches the first node in another rack.
• This strategy tries to place replicas on different racks in the same data center, because failures or problems can occur at the rack level; replicas on other racks can then still provide the data.
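A hedged sketch of creating a keyspace under each strategy, using the DataStax Python driver (the cassandra-driver package); the contact point, keyspace names, and data-center names are illustrative assumptions.

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# SimpleStrategy: one data center, replicas placed clockwise around the node ring
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_simple
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# NetworkTopologyStrategy: replica counts configured per data center
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_multi_dc
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2}
""")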
Data Types
• Built-In Data Types
• Collection Data Types
• Custom (User-Defined) Data Types

Built-In Data Types

Data Type | Constants | Description
ascii | strings | Represents ASCII character string
bigint | bigint | Represents 64-bit signed long
blob | blobs | Represents arbitrary bytes
boolean | booleans | Represents true or false
counter | integers | Represents counter column
decimal | integers, floats | Represents variable-precision decimal
double | integers | Represents 64-bit IEEE-754 floating point
float | integers, floats | Represents 32-bit IEEE-754 floating point
inet | strings | Represents an IP address, IPv4 or IPv6
int | integers | Represents 32-bit signed int
text | strings | Represents UTF8 encoded string
timestamp | integers, strings | Represents a timestamp
timeuuid | uuids | Represents type 1 UUID
uuid | uuids | Represents type 1 or type 4 UUID
varchar | strings | Represents UTF8 encoded string
varint | integers | Represents arbitrary-precision integer
Collection Data Types

Collection | Description
list | A list is a collection of one or more ordered elements.
map | A map is a collection of key-value pairs.
set | A set is a collection of one or more elements.
User-Defined Datatypes

1. CREATE TYPE: Creates a user-defined datatype.
2. ALTER TYPE: Modifies a user-defined datatype.
3. DROP TYPE: Drops a user-defined datatype.
4. DESCRIBE TYPE: Describes a user-defined datatype.
5. DESCRIBE TYPES: Describes user-defined datatypes.
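Continuing with the session from the keyspace sketch above, a small example of creating and dropping a user-defined type; the type name and fields are illustrative assumptions (DESCRIBE TYPE is typically run from the cqlsh shell).

# CREATE TYPE: define a user-defined datatype inside a keyspace
session.execute("""
    CREATE TYPE IF NOT EXISTS demo_simple.address (
        street text,
        city text,
        zip_code int
    )
""")

# DROP TYPE: remove the user-defined datatype again
session.execute("DROP TYPE IF EXISTS demo_simple.address")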
RDBMS vs Cassandra

RDBMS | Cassandra
Database Server | Cluster
Database | KeySpace
Table | Column Family
Rows and Columns | Rows and Columns
SQL | CQL
RDBMS vs Cassandra

RDBMS | Cassandra
RDBMS deals with structured data. | Cassandra deals with unstructured data.
It has a fixed schema. | Cassandra has a flexible schema.
In RDBMS, a table is an array of arrays (ROW x COLUMN). | In Cassandra, a table is a list of "nested key-value pairs" (ROW x COLUMN key x COLUMN value).
Database is the outermost container that contains data corresponding to an application. | Keyspace is the outermost container that contains data corresponding to an application.
Tables are the entities of a database. | Tables or column families are the entities of a keyspace.
Row is an individual record in RDBMS. | Row is a unit of replication in Cassandra.
Column represents the attributes of a relation. | Column is a unit of storage in Cassandra.
RDBMS supports the concept of foreign keys and joins. | Relationships are represented using collections.
CAP THEOREM
(Consistency, Availability, Partition Tolerance)
• In 2000, Eric Brewer presented, at the Symposium on Principles of Distributed Computing, a theory that he had been working on for a few years at the University of California, Berkeley, and at his company Inktomi.
• He presented the concept that three core systemic requirements need to be considered when designing and deploying applications in a distributed environment, and further stated that the relationship among these requirements creates tension over which requirement can be given up to accomplish the scalability requirements of your situation.
• The three requirements are consistency, availability, and partition tolerance, giving Brewer's theorem its other name: CAP.
• In simple terms, the CAP theorem states that in a distributed data system, you can guarantee only two of the following three requirements: consistency (all data available at all nodes or systems), availability (every request will get a response), and partition tolerance (the system will operate irrespective of a partition or loss of data or communication).
• A system architected on this model is called a BASE (basically available, soft state, eventually consistent) architecture, as opposed to ACID.
• Combining the principles of the CAP theorem and the data architecture of BigTable or Dynamo, several solutions have evolved: HBase, MongoDB, Riak, Voldemort, Neo4J, Cassandra, HyperTable, HyperGraphDB, Memcached, Tokyo Cabinet, Redis, CouchDB, and more niche solutions.
• Of these, the most popular and widely distributed are:
  • HBase, HyperTable, and BigTable, which are architected on CP (from CAP).
  • Cassandra, Dynamo, and Voldemort, which are architected on AP (from CAP).
• Broadly, NoSQL databases have been classified into four subcategories:
1. Key-value pairs. This model is implemented using a hash table where there is a unique key and a pointer to a particular item of data, creating a key-value pair; for example, Voldemort.
2. Column family stores. An extension of the key-value architecture with columns and column families; the overall goal is to process distributed data over a pool of infrastructure; for example, HBase and Cassandra.
3. Document databases. This class of databases is modeled after Lotus Notes and is similar to key-value stores. The data is stored as a document and is represented in JSON or XML formats. The biggest design feature is the flexibility to list multiple levels of key-value pairs; for example, Riak and CouchDB.
4. Graph databases. Based on graph theory, this class of database supports scalability across a cluster of machines. The complexity of representation for extremely complex sets of documents is still evolving; for example, Neo4J.
