HBase & Hive Architecture Guide

The HBase architecture has three main components: the HMaster, Region Servers, and ZooKeeper.

HMaster –

HMaster is the HBase architecture's implementation of a master server. It acts as a monitoring agent for all Region Server instances in the cluster and as the interface for all metadata changes. In a distributed cluster, HMaster typically runs on the NameNode host.

HMaster has the following critical responsibilities in HBase:

• It plays a central role in cluster performance and node maintenance.
• It handles administrative operations, distributes services to Region Servers, and assigns regions to them.
• It provides load balancing and failover to distribute demand evenly across cluster nodes.
• When a client requests a schema or metadata change (for example, creating or altering a table), HMaster takes responsibility for applying that change, as sketched in the example below.
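For illustration, here is a minimal sketch of such a schema operation using the HBase Java client (HBase 2.x style API); the "messages" table and "cf" column family are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class SchemaChangeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Schema/metadata operations like createTable are coordinated by HMaster.
            TableName tableName = TableName.valueOf("messages"); // hypothetical table
            if (!admin.tableExists(tableName)) {
                admin.createTable(TableDescriptorBuilder.newBuilder(tableName)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf")) // hypothetical column family
                        .build());
            }
        }
    }
}

Because only schema and metadata operations go through HMaster, routine reads and writes place no load on it.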

Region Server

The Region Server, also known as HRegionServer, is in charge of managing and serving specific regions of data in HBase. A region is a portion of a table's data consisting of a contiguous range of rows, ordered by row key. Each Region Server is responsible for one or more regions that the HMaster dynamically assigns to it. The Region Server handles clients' read and write requests, directing each request to the appropriate region based on its row key. Clients can talk to a Region Server directly, without going through HMaster, allowing efficient access to HBase data.

Advantages:
• Region Servers enable distributed data management in HBase, allowing data to be partitioned across Hadoop cluster nodes for parallel processing, fault tolerance, and scalability.
• Region Servers process read/write requests directly, eliminating centralized coordination and thereby reducing network overhead and latency (see the sketch after this list).
• Region Servers allow HBase to automatically split regions into smaller ones as data accumulates, which keeps data evenly distributed and improves query speed and load balancing.
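As a sketch of direct client access, the following HBase Java client snippet reads one row; the client library locates the hosting Region Server via the hbase:meta table and sends the request straight to that server. The table, row key, and column names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DirectReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("messages"))) { // hypothetical table
            // The client finds the region for this row key and talks to its
            // Region Server directly; HMaster is not involved in the request.
            Get get = new Get(Bytes.toBytes("row-0001"));   // hypothetical row key
            get.addFamily(Bytes.toBytes("cf"));             // hypothetical column family
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("body"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}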

ZooKeeper

ZooKeeper is a centralized service in HBase that maintains configuration information and provides distributed synchronization, naming, and grouping services.

ZooKeeper provides the following services in HBase:

• Distributed coordination: allows HBase components to coordinate for consistent operation.
• Cluster membership and health monitoring: tracks node membership and health for stability.
• Metadata storage: stores essential metadata needed for operation and coordination.
• Synchronization and notification: provides mechanisms for synchronization and event notification.
• Leader election: enables reliable selection of a leader (the active HMaster) for cluster management.

ZooKeeper's advantages in HBase:

• High availability: a fault-tolerant design that ensures operational continuity.
• Scalability: horizontal scalability to handle rising demands and larger clusters.
• Reliable coordination: ensures that actions across components are consistent and ordered.
• Simplified development: offloads coordination tasks, simplifying distributed system development.
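A minimal sketch of how an HBase client is pointed at the ZooKeeper ensemble; the quorum hostnames are placeholders. The client contacts ZooKeeper first to discover the hbase:meta location and the active HMaster, then talks to Region Servers directly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ZkQuorumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Point the client at the ZooKeeper ensemble; hostnames are placeholders.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected via ZooKeeper quorum");
        }
    }
}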

HBase Data Flow

Write Path:

1. The client sends a write request to the Region Server that hosts the target region.
2. The mutation is first appended to the write-ahead log (WAL) to ensure durability.
3. The data is then written to the in-memory store (MemStore).
4. Once the MemStore reaches a threshold, its contents are flushed to HDFS as HFiles.
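From the client's side, this write path is triggered by a simple Put; a hedged sketch using the HBase Java client follows (table, row, and column names are hypothetical). The WAL append and MemStore update happen on the Region Server, not in client code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("messages"))) { // hypothetical table
            Put put = new Put(Bytes.toBytes("row-0001")); // row key determines the target region
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
            // On the Region Server this mutation is appended to the WAL and applied to
            // the MemStore; later flushes persist it to HFiles on HDFS.
            table.put(put);
        }
    }
}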

Read Path:

1. The client sends a read request to the Region Server that hosts the target region.
2. The Region Server checks the MemStore for recent updates.
3. If the data is not found there, it is read from the HFiles stored in HDFS.
4. The data is returned to the client.
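A corresponding read sketch using a Scan (the table and row-key range are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadPathExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("messages"))) { // hypothetical table
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("row-0001"))
                    .withStopRow(Bytes.toBytes("row-0100"));
            // Each Region Server answers from its MemStore (recent writes) and the
            // HFiles on HDFS, merging both views before returning rows to the client.
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()));
                }
            }
        }
    }
}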

Advantages of HBase

• Compared to typical relational systems, operations such as reading and processing data in HBase are much faster.
• HBase supports random read and write operations, providing quick access to individual records.
• HBase integrates with the Java client API, Thrift, and REST APIs, allowing greater flexibility in application development and integration.
• HBase works smoothly with MapReduce, Hive, and Pig, enabling efficient and scalable data processing and analytics inside the Hadoop ecosystem.

Disadvantages of HBase

• HBase integration with Pig and Hive jobs can cause memory pressure on the cluster.
• In a shared cluster environment, fewer task slots are available per node because capacity must be reserved for HBase's CPU and memory requirements.
• With a single active HMaster, it has a single point of failure: if the HMaster fails and no backup master is configured, the cluster cannot function properly.
• Joins and normalization are very difficult to express in HBase tables.
• HBase offers only limited built-in authentication and permission controls.

Features of HBase Architecture

Distributed and Scalable

• HBase is designed to be naturally distributed, providing parallel processing, high availability, fault
tolerance, and the ability to handle massive volumes of data. It also offers horizontal scalability,
allowing easy addition of nodes to accommodate expanding data volumes and workloads while
maintaining speed and responsiveness.

Column-oriented Storage

• HBase employs a column-oriented storage model: data is organized and stored by column family rather than by row.
• This storage layout allows rapid read and write operations for analytical queries and aggregations, as well as quick retrieval of specific columns (see the sketch below).
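As a small sketch of column-scoped reads (hypothetical table and column names, assuming an open HBase connection), a Get can be restricted to a single column so data in other column families is never touched:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnReadExample {
    // Read a single column of a single row, given an open HBase connection.
    static byte[] readSingleColumn(Connection connection) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("messages"))) { // hypothetical table
            Get get = new Get(Bytes.toBytes("row-0001"));
            // Restrict the read to one column family/qualifier; other column
            // families are not read, which column-oriented storage makes cheap.
            get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("body"));
            Result result = table.get(get);
            return result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("body"));
        }
    }
}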

Consistency and Replication

• HBase guarantees strong consistency: data modifications are immediately visible and consistent across all copies once a write is acknowledged.
• The HBase architecture also includes built-in replication mechanisms for copying data across nodes or clusters.

HBase Real Life Connect - Example

You may be aware that Facebook introduced a Social Inbox integrating email, IM, SMS, text messages, and on-site Facebook messages, and needed to store over 135 billion messages a month.

Facebook chose HBase because it needed a system that could handle two kinds of data patterns:

• An ever-growing dataset that is rarely accessed: you read what's in your Inbox and then rarely look at it again.
• A smaller set of recent, temporal data that is highly volatile.
HIVE ARCHITECTURE

Apache Hive is a data warehouse system built on top of Hadoop. It facilitates querying and managing large datasets stored in the Hadoop Distributed File System (HDFS) using an SQL-like language called HiveQL. Hive converts these SQL-like queries into MapReduce, Tez, or Spark jobs for execution on the underlying Hadoop infrastructure.
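As an illustrative sketch, a HiveQL query can be submitted through the HiveServer2 JDBC driver; Hive plans the query and runs it as MapReduce/Tez/Spark jobs. The URL, credentials, and the sales table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 JDBC URL; host, port, database, user, and table are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Hive parses this HiveQL, builds an execution plan, and runs it on Hadoop.
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM sales GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}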

Working of Hive

The following steps describe the workflow between Hive and Hadoop.

• Step-1: Execute Query – A Hive interface such as the command line or web UI sends the query to the driver for execution; the UI calls the driver's execute interface (for example, over JDBC or ODBC).
• Step-2: Get Plan – The driver creates a session handle for the query and passes the query to the compiler to generate an execution plan; in other words, the driver interacts with the compiler.
• Step-3: Get Metadata – The compiler sends a metadata request to the metastore.
• Step-4: Send Metadata – The metastore returns the required metadata to the compiler.
• Step-5: Send Plan – The compiler sends the generated execution plan back to the driver.
• Step-6: Execute Plan – The driver sends the execution plan to the execution engine, which runs the job on Hadoop (execute job, job done, DFS/metadata operations).
• Step-7: Fetch Results – The user interface fetches the results from the driver.
• Step-8: Send Results – The execution engine retrieves the results from the data nodes and returns them to the driver, which passes them on to the user interface (UI).
HIVE DATA TYPE

Advantages of Hive Architecture:

• Scalability: Hive is a distributed system that can easily scale to handle large volumes of data by adding more nodes to the cluster.
• Data accessibility: Hive lets users query data stored in Hadoop without complex programming skills; queries are written in HiveQL, an SQL-like language based on SQL syntax.
• Data integration: Hive integrates easily with other tools and systems in the Hadoop ecosystem, such as Pig, HBase, and MapReduce.
• Flexibility: Hive can handle both structured and semi-structured data and supports various file formats, including CSV, JSON, and Parquet.
• Security: Hive provides security features such as authentication, authorization, and encryption to help protect data.

Apache Pig - Architecture

The language used to analyze data in Hadoop with Pig is known as Pig Latin. It is a high-level data processing language that provides a rich set of data types and operators for performing various operations on the data.

Apache Pig Components

The Apache Pig framework consists of several components. Let us look at the major ones.
Parser
Initially, Pig scripts are handled by the Parser. It checks the syntax of the script and performs type checking and other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the
nodes and the data flows are represented as edges.

Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and filter pushdown.

Compiler
The compiler compiles the optimized logical plan into a series of
MapReduce jobs.

Execution engine
The MapReduce jobs are then submitted to Hadoop in a sorted order, where they are executed to produce the desired results.
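As a sketch of this pipeline in action, Pig Latin statements can be embedded in a Java program via PigServer; each registered statement goes through the parser, optimizer, and compiler described above before jobs are produced. The input file people.txt and the aliases are hypothetical, and local mode is used here only for illustration.

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; MAPREDUCE mode would submit jobs to the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("people = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);");
        pig.registerQuery("adults = FILTER people BY age >= 18;");
        // openIterator triggers plan compilation and execution, then streams results back.
        Iterator<Tuple> it = pig.openIterator("adults");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}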

Pig Latin Data Model

The data model of Pig Latin is fully nested and allows complex non-atomic data types such as map and tuple. The elements of the data model are described below.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic types of Pig. A piece of data or a simple atomic value is known as a field.

Example − ‘raja’ or ‘30’

Tuple
A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.

Example − (Raja, 30)

Bag
A bag is an unordered set of tuples. In other words, a collection of tuples
(non-unique) is known as a bag. Each tuple can have any number of fields
(flexible schema). A bag is represented by ‘{}’. It is similar to a table in
RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple
contain the same number of fields or that the fields in the same position
(column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as inner bag.

Example − {Raja, 30, {9848022338, [email protected],}}


Map
A map (or data map) is a set of key-value pairs. The key must be of type chararray and should be unique. The value can be of any type. A map is represented by '[]'.

Example − [name#Raja, age#30]
