
UNIT 2

BIG DATA PLATFORMS


Topics: HDFS, Map-Reduce, YARN, Spark, HBase, Cassandra, CAP Theorem, and the Programming Model, illustrated using example big data platforms.

HDFS ARCHITECTURE

Hadoop Architecture
[Diagram: a Hadoop cluster with one master machine and several slave machines.]
HADOOP COMPONENTS

Hadoop's core components:
1. HDFS (Storage)
2. YARN (Processing)

HDFS Daemons
• Name Node (NN)
• Data Node (DN)

YARN Daemons
• Resource Manager (RM)
• Node Manager (NM)

[Diagram: the NameNode (NN) and ResourceManager (RM) run on the master machine; a DataNode (DN) and NodeManager (NM) run on each slave machine.]
HDFS ARCHITECTURE (GEN 2 / VERSION 2)

HDFS Daemons
• Name Node (NN)
• Data Node (DN)

Name Node:
• Master daemon which maintains and manages the Data Nodes (slave nodes).
• Records the metadata of all the files stored in the cluster.
• Regularly receives a heartbeat and a block report from all the Data Nodes in the cluster.

[Diagram: the NameNode (NN) runs on the master machine; DataNodes (DN) run on the slave machines.]

DATANODE DAEMON

Data Node:
• Slave daemon which runs on each slave machine.
• The actual data is stored on the Data Nodes.
• Responsible for serving read and write requests from the clients.
SECONDARY NAME NODE

• The job of the Secondary Name Node is to contact the Name Node periodically after a certain time interval (by default 1 hour) and pull a copy of the metadata information out of the Name Node.
• Checkpointing is the process of combining the edit logs with the FsImage.
• The Secondary Name Node takes over the responsibility of checkpointing, therefore making the Name Node more available.
Functions of NameNode
• Stores all the metadata (data about data) of all the slave nodes in a Hadoop cluster, e.g. filename, file path, number of blocks, block IDs, block locations, slave-related configurations, etc.
• This metadata is stored in memory for faster retrieval, to reduce the latency that would be caused by disk seeks.
• Hence, it is recommended that the master node on which the Name Node daemon runs should be very reliable hardware with a high configuration and plenty of RAM.
• Keeps track of all the slave nodes (whether they are alive or dead). This is done using the heartbeat mechanism.
• Replication (provides high availability, reliability and fault tolerance): the Name Node replicates the data on a slave node to various other slave nodes based on the configured replication factor.
• Balancing: the Name Node balances data replication, i.e., blocks of data should not be under- or over-replicated. This needs to be manually configured.
Functions of DataNode
• The Data Nodes perform the low-level read and write requests from the file system's clients.
• The client writes data to one slave node, and it is then the responsibility of the Data Node to replicate the data to other slave nodes according to the replication factor.
• Every Data Node sends a heartbeat message to the Name Node every 3 seconds to convey that it is alive. If the Name Node does not receive a heartbeat from a Data Node for 10 minutes, it considers that Data Node dead and starts the process of block replication on some other Data Node.
• All Data Nodes are synchronized in the Hadoop cluster so that they can communicate with one another and take care of:
  i. Balancing the data in the system
  ii. Moving data to keep replication high
  iii. Copying data when required
Hadoop Version 1

Hadoop's core components:
1. HDFS (Storage)
2. Map-Reduce (Processing)

HDFS Daemons
• Name Node (NN)
• Data Node (DN)

Map-Reduce Daemons
• Job Tracker (JT)
• Task Tracker (TT)

[Diagram: the NameNode (NN) and JobTracker (JT) run on the master machine; a DataNode (DN) and TaskTracker (TT) run on each slave machine.]
MapReduce
• MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.
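To make the programming model concrete, below is a minimal, self-contained word-count sketch that simulates the map, shuffle and reduce phases in plain Python; the sample documents are illustrative, and a real Hadoop job would run the same map and reduce logic in parallel across Task Trackers.

from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input split
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: aggregate all values emitted for the same key
    return (word, sum(counts))

documents = ["big data platforms", "hadoop and spark", "big data with hadoop"]

# Shuffle/sort: group the intermediate (key, value) pairs by key
grouped = defaultdict(list)
for line in documents:
    for word, count in map_phase(line):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(sorted(results))   # e.g. [('and', 1), ('big', 2), ('data', 2), ...]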
Hadoop Version 1: Map-Reduce Daemons

Job Tracker (JT)
• This is the master node of the MapReduce system; it manages the jobs and resources in the cluster (Task Trackers).
• The Job Tracker tries to schedule each map as close as possible to the actual data being processed, on the Task Tracker which is running on the same Data Node as the underlying block.
• One per Hadoop cluster.
• Receives job requests submitted by the client.
Task Tracker (TT)
• These are the slave daemons that are deployed on each machine.
• They are responsible for running the map and reduce tasks as instructed by the Job Tracker.
• Reports the status of execution to the Job Tracker.
• Executes MapReduce operations.
YARN (YET ANOTHER RESOURCE NEGOTIATOR)

What is YARN?
• Hadoop 2.0 introduced a new framework, YARN (Yet Another Resource Negotiator), which provides the ability to run non-MapReduce applications.
• The YARN framework is responsible for cluster resource management.
YARN Architecture

Main Components of YARN

Resource Manager:
• Master daemon that manages all other daemons and accepts job submissions.
• Allocates a container for the AppMaster.

Node Manager:
• Responsible for containers: monitors their resource usage (CPU, memory, disk, network) and reports the same to the RM.

AppMaster:
• One per application.
• Coordinates and manages MR jobs.
• Negotiates resources from the RM.

Container:
• Allocates a certain amount of resources (memory, CPU, etc.) on a slave node (NM).
Resource Manager
• The Resource Manager is the master that manages the division of resources among all the applications in the system.
Functions:
• Manages Nodes.
• Manages Containers.
• Manages Application Masters.
• The RM has a scheduler which is responsible for allocating resources to the various running applications.
Node Manager

• The Node Manager is the per-machine "worker" agent, taking care of the individual compute nodes in a Hadoop cluster.
The Node Manager is responsible for:
• Launching the application’s containers.
• Monitoring their resource usage (CPU, memory, disk, network)
• Reporting the same to the Resource Manager.
• Killing containers as directed by the Resource Manager.
Application Master

• The Application Master is the framework-specific entity that manages the execution of an application.
• Each application has its own unique Application Master.
• It communicates with the RM on one side and with the NM on the other.
Functions:
• Negotiates resources (containers) from the Resource Manager.
• Periodically sends heartbeats to the RM to affirm its health and to update the record of its resource demands.
• Works with the Node Manager to execute the tasks.
• Tracks status and monitors progress of tasks and their resource consumption.
• The AM allows individual applications to utilize cluster resources in a shared, secure and multi-tenant manner.
Container
• Basic unit of resource allocation.
• The Application Master itself runs as a normal container.
• A container is the resource allocation granted by the RM for a specific resource request.
• Fine-grained resource allocation replaces the fixed MapReduce slots.
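As a rough illustration of how fine-grained container allocation looks from the application side, the hedged PySpark sketch below asks YARN for a fixed number of executors, each launched inside a YARN container; the resource values are illustrative, and it assumes HADOOP_CONF_DIR points at the cluster configuration.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("YarnResourceExample")
         .master("yarn")                            # use YARN as the cluster manager
         .config("spark.executor.instances", "4")   # number of executor containers requested
         .config("spark.executor.memory", "2g")     # memory per container
         .config("spark.executor.cores", "2")       # vcores per container
         .getOrCreate())

print(spark.sparkContext.applicationId)             # the YARN application id assigned by the RM
spark.stop()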
APACHE SPARK

Apache Spark is an open-source, in-memory, cluster computing framework for real-time processing.
It provides high-level APIs in Java, Scala, Python, and R.
Spark performs up to 100 times faster in memory and up to 10 times faster on disk when compared to Hadoop MapReduce.
It is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.
Features of Apache Spark
• In-Memory Computing
• Swift Processing
• Powerful Caching
• Deployment
• Fault Tolerance
• Polyglot
• Real-Time Stream Processing
• Dynamic in Nature
• Lazy Evaluation
• Reusability

Spark Ecosystem
1. Core Spark
• Spark Core: The foundation of the Apache Spark ecosystem. It provides basic functionality like
task scheduling, memory management, fault recovery, and distributed storage system
integration. Spark Core is also responsible for in-memory computation, which makes Spark
faster than traditional MapReduce.

2. Spark SQL
• Spark SQL: A module for working with structured data. It allows querying data using SQL as well as a DataFrame API, making it easy to work with structured and semi-structured data. Spark SQL integrates with a variety of data sources like Hive, Parquet, and JDBC. (A short PySpark sketch follows this list.)
3. Spark Streaming
• Spark Streaming: A real-time processing module that processes live data streams. It builds on
top of Spark Core and provides a high-level API for stream processing. It processes data in small
batches and supports integration with sources like Kafka, Flume, and HDFS.

4. MLlib (Machine Learning Library)


• MLlib: A distributed machine learning library that provides a variety of algorithms for
classification, regression, clustering, collaborative filtering, and dimensionality reduction. It also
includes tools for feature extraction, transformation, and statistical analysis.
5. GraphX
• GraphX: A distributed graph processing framework built on Spark. It provides an API for
creating and manipulating graphs and includes algorithms like PageRank, connected
components, and shortest paths. GraphX combines the benefits of graph-parallel and
data-parallel systems.

6. PySpark
• PySpark: The Python API for Spark. It enables Python developers to interact with Spark's
distributed computing framework using familiar Python constructs. PySpark supports the Spark
Core, Spark SQL, and MLlib APIs.
7. Structured Streaming
• Structured Streaming: An extension of Spark SQL that supports continuous stream processing.
It provides a more declarative approach to stream processing by allowing users to define
streaming queries similar to batch queries.
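A short, hedged PySpark sketch of two of the modules above: a Spark SQL query over a DataFrame, followed by a minimal Structured Streaming word count. The file path, column names, and the socket source are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EcosystemExample").getOrCreate()

# Spark SQL: query structured data through the DataFrame API and SQL
df = spark.read.json("path/to/people.json")        # load structured/semi-structured data
df.createOrReplaceTempView("people")               # expose the DataFrame as a SQL view
spark.sql("SELECT name, age FROM people WHERE age > 21").show()

# Structured Streaming: the same declarative style over an unbounded stream
lines = (spark.readStream
         .format("socket")                         # read a text stream from a TCP socket
         .option("host", "localhost")
         .option("port", 9999)
         .load())
word_counts = lines.groupBy("value").count()       # a continuously updated aggregation
query = (word_counts.writeStream
         .outputMode("complete")
         .format("console")                        # print each micro-batch to the console
         .start())
query.awaitTermination()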
Spark Architecture
• Apache Spark architecture is designed to provide a unified computing engine for big data
processing across various data processing scenarios like batch processing, streaming, machine
learning, and more. Spark’s architecture is based on a master-slave model that provides
scalability, fault tolerance, and high performance through in-memory computation.
• Key Components of Spark Architecture:
1. Driver Program
2. Cluster Manager
3. Workers/Executors
4. Distributed Storage (HDFS, S3, etc.)
1. Driver Program
• Role: The Driver Program is the entry point for any Spark application. It contains the main function and is responsible for creating the SparkContext, which coordinates the execution of the application.
• Responsibilities:
  • Converting user code into a directed acyclic graph (DAG) of stages and tasks.
  • Scheduling tasks on the worker nodes.
  • Collecting and aggregating results from the worker nodes.
  • Handling user-defined actions (e.g., collect(), count()).
  • Managing the job lifecycle and fault recovery.
The Driver program communicates with the Cluster Manager to request resources and assigns tasks to worker nodes (executors).
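A minimal sketch of a driver program, assuming an illustrative input path: the transformations only build up the DAG, and the action at the end is what makes the driver schedule tasks on the executors.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DriverExample").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("path/to/input.txt")            # transformation: lazily defines an RDD
long_lines = lines.filter(lambda l: len(l) > 80)    # transformation: extends the DAG, still lazy
print(long_lines.count())                           # action: driver schedules tasks on executors

spark.stop()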
• 2. Cluster Manager
• Role: The Cluster Manager is responsible for managing
the resources across the cluster. Spark can work with
various cluster managers, including:
• Standalone Cluster Manager: Spark’s built-in manager.
• Apache YARN: Used in Hadoop ecosystems.
• Apache Mesos: A general-purpose cluster manager.
• Kubernetes: A container orchestration system for
managing distributed Spark jobs in containers.
• The Cluster Manager allocates resources (CPU,
memory) across the cluster and manages the execution
of applications submitted by the Driver.
• 3. Workers/Executors
• Workers: These are nodes in the cluster that run the
tasks assigned by the Driver program. A worker node
runs one or more Executors.
• Executors:
• Role: Executors are the core components of Spark's
execution model. They are distributed across worker
nodes in the cluster and are responsible for executing
individual tasks in parallel.
• Responsibilities:
• Running tasks (units of work) assigned by the Driver.
• Storing and caching data in memory or on disk across distributed nodes.
• Communicating with the Driver program to send task results.
• Fault tolerance: If an executor fails, the Driver can reschedule the failed tasks to be executed on another
executor.

• Each executor typically runs for the entire lifetime of a Spark application and has two major
functions:
• Task Execution: It executes tasks on the partitioned data.
• Data Storage: It stores intermediate data in memory/disk and caches RDDs if required.
• 4. Distributed Storage
• Role: Spark works with various distributed storage systems to load and store data across the
cluster. The commonly used systems include:
• HDFS: Hadoop Distributed File System.
• Amazon S3: Cloud-based storage.
• Apache HBase: NoSQL database for real-time read/write access.
• Apache Cassandra: Distributed database system.
• Other file systems: Local filesystem, NFS, etc.
• The storage systems provide persistence and fault tolerance for the data being processed.
Resilient Distributed Dataset (RDD)

RDDs are the building blocks of any Spark application. RDD stands for:
• Resilient: Fault tolerant and capable of rebuilding data on failure
• Distributed: Data is distributed among the multiple nodes in a cluster
• Dataset: Collection of data
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, etc.
With RDDs, you can perform two types of operations:
1. Transformations: The operations that are applied to create a new RDD.
2. Actions: Applied on an RDD to instruct Apache Spark to apply computation and pass the result back to the driver.
Key Features of RDDs:

1. Immutability: Once created, RDDs cannot be altered. However, you can transform them into new RDDs using operations like map, filter, and reduce.
2. Fault Tolerance: RDDs are resilient to failures because they can be reconstructed from their lineage. If a partition of an RDD is lost, Spark can recompute it using the original dataset and the transformations applied.
3. Distributed: RDDs are distributed across multiple nodes in a cluster, allowing for parallel processing of large datasets.
4. Partitioning: RDDs are divided into partitions, which are processed in parallel. The number of partitions can be configured, and partitioning can be used to optimize operations like joins by ensuring related data is co-located.
Operations on RDDs:

There are two types of operations that can be performed on RDDs:

1. Transformations: These operations create a new RDD from an existing one. Examples include:
• map(func): Applies a function to each element of the RDD.
• filter(func): Filters the elements of the RDD based on a predicate function.
• flatMap(func): Similar to map, but can return multiple values for each element.
• union(rdd): Combines two RDDs into a single one.

2. Actions: These operations trigger the execution of transformations and return a result to the driver program or write data to external storage. Examples include:
• collect(): Returns all elements of the RDD to the driver program.
• count(): Returns the number of elements in the RDD.
• reduce(func): Aggregates the elements of the RDD using the specified function.
Creating RDDs

from pyspark import SparkContext

# Initialize SparkContext
sc = SparkContext("local", "RDD Example")

# Create an RDD from a list
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Create an RDD from an external file
file_rdd = sc.textFile("path/to/textfile.txt")
Transformations
# map: multiply each element by 2
mapped_rdd = rdd.map(lambda x: x * 2)
# filter: keep only even numbers
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
Actions:
# Collect all elements of the RDD
collected_data = rdd.collect()
print(collected_data)

# Count the number of elements in the RDD
count = rdd.count()
print(f"Count: {count}")

# Reduce: sum all elements
total_sum = rdd.reduce(lambda a, b: a + b)
print(f"Sum: {total_sum}")
Partitioning and Caching
# Repartition the RDD into 4 partitions
repartitioned_rdd = rdd.repartition(4)

# Cache the RDD in memory
rdd.cache()

# Perform an action to materialize the cache
cached_result = rdd.count()
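Building on the snippet above, the lines below inspect the partitioning and print the lineage that Spark keeps for fault tolerance (in PySpark, toDebugString() returns bytes).

# Number of partitions after repartitioning
print(repartitioned_rdd.getNumPartitions())   # 4

# Lineage: the chain of transformations Spark can replay to rebuild lost partitions
transformed_rdd = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)
print(transformed_rdd.toDebugString().decode())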
HISTORY OF HBASE
• 2006: Google publishes the BigTable paper.
• 2007: HBase begins as a Hadoop contrib project.
• 2008: HBase becomes a Hadoop sub-project.
• 2010: HBase becomes an Apache top-level project.
• 2011: 0.92 release.
HBase is a sparse, distributed, multi-dimensional, consistent, sorted map.

HBase Architecture

HBase Components
HBase has three major components: the HMaster, the Region Servers, and ZooKeeper.
HBase Read and Write

• HBase Write Path:
  – Clients don't interact directly with the underlying HFiles during writes; writes go to the write-ahead log and the MemStore and are later flushed to HFiles.
• HBase Read Path:
  – Data is reconciled from the BlockCache, the MemStore and the HFiles to give the client an up-to-date view of the row(s) it asked for.
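A hedged client-side sketch of the paths above using happybase, a third-party Python client that talks to HBase through the Thrift gateway; the table name, column family, and row key are illustrative assumptions, and a Thrift server on localhost is assumed.

import happybase

connection = happybase.Connection('localhost')   # connect to the HBase Thrift gateway
table = connection.table('users')

# Write path: the client sends the put to the Region Server, which records it in the
# WAL and MemStore; the client never touches the underlying HFiles directly.
table.put(b'user-1', {b'info:name': b'Asha', b'info:city': b'Pune'})

# Read path: the Region Server reconciles BlockCache, MemStore and HFiles to return
# an up-to-date view of the requested row.
print(table.row(b'user-1'))

connection.close()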
HBASE vs RDBMS

HBASE | RDBMS
Column-oriented | Row-oriented
Flexible schema, add columns on the fly | Fixed schema
Good with sparse tables | Not optimized for sparse tables
Tight integration with MapReduce | Not really
Horizontal scalability: just add hardware | Hard to scale
Good for semi-structured as well as structured data | Good for structured data
CASSANDRA

HISTORY OF CASSANDRA
• 2008: Cassandra was open-sourced by Facebook.
• 2009: Cassandra was accepted into the Apache Incubator.
• 2010: Cassandra was made an Apache top-level project.
Cassandra

• Apache Cassandra is an open-source, NoSQL, column-oriented distributed database.
• It is scalable, fault-tolerant, and consistent.
• Its distribution design is based on Amazon's Dynamo and its data model on Google's Bigtable.
Cassandra: Distributed Design vs Data Model

COMPONENTS OF CASSANDRA (Distributed Design)

• Node: The place where data is stored.
• Data center: A collection of related nodes.
• Cluster: A component that contains one or more data centers.
• Commit log: The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table: A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
• SSTable: A disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
Data Model

• The outermost container is known as the cluster.
• Each cluster is assigned a keyspace.
• The basic attributes of a keyspace are:
  • Replication Factor: The number of machines in the cluster that will receive copies of the same data.
  • Replica Placement Strategy: The strategy used to place replicas in the ring (see the replication strategies below).
  • Column Families: A keyspace is a container for a list of one or more column families. A column family, in turn, is a container of a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
Data Replication Strategy

Two strategies: Simple Strategy and Network Topology Strategy.

Simple Strategy
• Simple Strategy is used when you have just one data center.
• Simple Strategy places the first replica on the node selected by the partitioner.
• After that, the remaining replicas are placed in clockwise direction in the node ring.
Network Topology Strategy
• Network Topology Strategy is used when you have more than one data center.
• In Network Topology Strategy, replicas are set for each data center separately. Network Topology Strategy places replicas in the clockwise direction in the ring until it reaches the first node in another rack.
• This strategy tries to place replicas on different racks in the same data center, because failures or problems can occur at the rack level; replicas on other racks can then still provide the data.
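A hedged sketch of creating a keyspace under each strategy, using the DataStax Python driver (the cassandra-driver package); the contact point, keyspace names, and data-center names are illustrative assumptions.

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# SimpleStrategy: one data center, replicas placed clockwise around the node ring
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_simple
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# NetworkTopologyStrategy: replica counts configured per data center
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_multi_dc
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2}
""")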
Data Types
• Built-In Data Types
• Collection Data Types
• Custom (User-Defined) Data Types

Built-In Data Types

Data Type | Constants | Description
ascii | strings | Represents ASCII character string
bigint | bigint | Represents 64-bit signed long
blob | blobs | Represents arbitrary bytes
boolean | booleans | Represents true or false
counter | integers | Represents counter column
decimal | integers, floats | Represents variable-precision decimal
double | integers | Represents 64-bit IEEE-754 floating point
float | integers, floats | Represents 32-bit IEEE-754 floating point
inet | strings | Represents an IP address, IPv4 or IPv6
int | integers | Represents 32-bit signed int
text | strings | Represents UTF8 encoded string
timestamp | integers, strings | Represents a timestamp
timeuuid | uuids | Represents type 1 UUID
uuid | uuids | Represents type 1 or type 4 UUID
varchar | strings | Represents UTF8 encoded string
varint | integers | Represents arbitrary-precision integer
Collection Data Types

Collection | Description
list | A list is a collection of one or more ordered elements.
map | A map is a collection of key-value pairs.
set | A set is a collection of one or more elements.
User-Defined Datatypes

1. CREATE TYPE: Creates a user-defined datatype.
2. ALTER TYPE: Modifies a user-defined datatype.
3. DROP TYPE: Drops a user-defined datatype.
4. DESCRIBE TYPE: Describes a user-defined datatype.
5. DESCRIBE TYPES: Describes user-defined datatypes.
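Continuing with the session from the keyspace sketch above, a small example of creating and dropping a user-defined type; the type name and fields are illustrative assumptions (DESCRIBE TYPE is typically run from the cqlsh shell).

# CREATE TYPE: define a user-defined datatype inside a keyspace
session.execute("""
    CREATE TYPE IF NOT EXISTS demo_simple.address (
        street text,
        city text,
        zip_code int
    )
""")

# DROP TYPE: remove the user-defined datatype again
session.execute("DROP TYPE IF EXISTS demo_simple.address")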
RDBMS vs Cassandra

RDBMS | Cassandra
Database Server | Cluster
Database | KeySpace
Table | Column Family
Rows and Columns | Rows and Columns
SQL | CQL
RDBMS vs Cassandra

RDBMS | Cassandra
RDBMS deals with structured data. | Cassandra deals with unstructured data.
It has a fixed schema. | Cassandra has a flexible schema.
In RDBMS, a table is an array of arrays (ROW x COLUMN). | In Cassandra, a table is a list of "nested key-value pairs" (ROW x COLUMN key x COLUMN value).
Database is the outermost container that contains data corresponding to an application. | Keyspace is the outermost container that contains data corresponding to an application.
Tables are the entities of a database. | Tables or column families are the entities of a keyspace.
Row is an individual record in RDBMS. | Row is a unit of replication in Cassandra.
Column represents the attributes of a relation. | Column is a unit of storage in Cassandra.
RDBMS supports the concept of foreign keys and joins. | Relationships are represented using collections.
CAP THEOREM
(Consistency, Availability, Partition Tolerance)
• In 2000, Eric Brewer presented, at the Symposium on Principles of Distributed Computing, a theory that he had been working on for a few years at the University of California, Berkeley, and at his company Inktomi.
• He presented the concept that three core systemic requirements need to be considered when designing and deploying applications in a distributed environment, and further stated that the relationship among these requirements creates tension over which requirement can be given up to accomplish the scalability requirements of your situation.
• The three requirements are consistency, availability, and partition tolerance, giving Brewer's theorem its other name: CAP.
• In simple terms, the CAP theorem states that in a distributed data system, you can guarantee only two of the following three requirements: consistency (all data available at all nodes or systems), availability (every request will get a response), and partition tolerance (the system will operate irrespective of a partition or loss of data or communication).
• A system architected on this model is called a BASE (basically available, soft state, eventually consistent) architecture, as opposed to ACID.
• Combining the principles of the CAP theorem and the data architecture of BigTable or Dynamo, several solutions have evolved: HBase, MongoDB, Riak, Voldemort, Neo4J, Cassandra, HyperTable, HyperGraphDB, Memcached, Tokyo Cabinet, Redis, CouchDB, and more niche solutions.
• Of these, the most popular and widely distributed are:
  • HBase, HyperTable, and BigTable, which are architected on CP (from CAP).
  • Cassandra, Dynamo, and Voldemort, which are architected on AP (from CAP).
• Broadly, NoSQL databases have been classified into four subcategories:
1. Key-value pairs. This model is implemented using a hash table where there is a unique key and a pointer to a particular item of data, creating a key-value pair; for example, Voldemort.
2. Column family stores. An extension of the key-value architecture with columns and column families; the overall goal is to process distributed data over a pool of infrastructure; for example, HBase and Cassandra.
3. Document databases. This class of databases is modeled after Lotus Notes and is similar to key-value stores. The data is stored as a document and is represented in JSON or XML formats. The biggest design feature is the flexibility to list multiple levels of key-value pairs; for example, Riak and CouchDB.
4. Graph databases. Based on graph theory, this class of database supports scalability across a cluster of machines. The complexity of representation for extremely complex sets of documents is still evolving; for example, Neo4J.
