HBase & Hive Architecture Guide

The HBase architecture has three main components: the HMaster, Region Servers, and ZooKeeper.

HMaster –

HMaster is the HBase architecture's implementation of a master server. It acts as a monitoring agent for all Region Server instances in the cluster and as the interface for all metadata changes. In a distributed cluster, HMaster typically runs on the NameNode host.

HMaster has the following critical responsibilities in HBase:

• It plays a central role in cluster performance and node maintenance.
• It handles administrative operations, distributes services to Region Servers, and assigns regions to them.
• It provides load balancing and failover to distribute demand evenly across cluster nodes.
• When a client requests a schema or metadata change (for example, creating or altering a table), HMaster takes responsibility for applying that change, as sketched in the example below.
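For illustration, here is a minimal sketch of such a schema operation using the HBase Java client (HBase 2.x style API); the "messages" table and "cf" column family are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class SchemaChangeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Schema/metadata operations like createTable are coordinated by HMaster.
            TableName tableName = TableName.valueOf("messages"); // hypothetical table
            if (!admin.tableExists(tableName)) {
                admin.createTable(TableDescriptorBuilder.newBuilder(tableName)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf")) // hypothetical column family
                        .build());
            }
        }
    }
}

Because only schema and metadata operations go through HMaster, routine reads and writes place no load on it.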

Region Server

The Region Server, also known as HRegionServer, is in charge of managing and serving specific regions of data in HBase. A region is a portion of a table's data consisting of a contiguous range of rows, ordered by row key. Each Region Server is responsible for one or more regions that the HMaster dynamically assigns to it. The Region Server handles clients' read and write requests, directing each request to the appropriate region based on its row key. Clients can talk to a Region Server directly, without going through HMaster, allowing efficient access to HBase data.

Advantages:
• Region Servers enable distributed data management in HBase, allowing data to be partitioned across Hadoop cluster nodes for parallel processing, fault tolerance, and scalability.
• Region Servers process read/write requests directly, eliminating centralized coordination and thereby reducing network overhead and latency (see the sketch after this list).
• Region Servers allow HBase to automatically split regions into smaller ones as data accumulates, which keeps data evenly distributed and improves query speed and load balancing.
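As a sketch of direct client access, the following HBase Java client snippet reads one row; the client library locates the hosting Region Server via the hbase:meta table and sends the request straight to that server. The table, row key, and column names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DirectReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("messages"))) { // hypothetical table
            // The client finds the region for this row key and talks to its
            // Region Server directly; HMaster is not involved in the request.
            Get get = new Get(Bytes.toBytes("row-0001"));   // hypothetical row key
            get.addFamily(Bytes.toBytes("cf"));             // hypothetical column family
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("body"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}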

ZooKeeper

ZooKeeper is a centralized service in HBase that maintains configuration information and provides distributed synchronization, naming, and grouping services.

ZooKeeper provides the following services in HBase:

• Distributed coordination: allows HBase components to coordinate for consistent operation.
• Cluster membership and health monitoring: tracks node membership and health for stability.
• Metadata storage: stores essential metadata needed for operation and coordination.
• Synchronization and notification: provides mechanisms for synchronization and event notification.
• Leader election: enables reliable selection of a leader (the active HMaster) for cluster management.

ZooKeeper's advantages in HBase:

• High availability: a fault-tolerant design that ensures operational continuity.
• Scalability: horizontal scalability to handle rising demands and larger clusters.
• Reliable coordination: ensures that actions across components are consistent and ordered.
• Simplified development: offloads coordination tasks, simplifying distributed system development.
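A minimal sketch of how an HBase client is pointed at the ZooKeeper ensemble; the quorum hostnames are placeholders. The client contacts ZooKeeper first to discover the hbase:meta location and the active HMaster, then talks to Region Servers directly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ZkQuorumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Point the client at the ZooKeeper ensemble; hostnames are placeholders.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected via ZooKeeper quorum");
        }
    }
}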

HBase Data Flow

Write Path:

1. The client sends a write request to the Region Server that hosts the target region.
2. The mutation is first appended to the write-ahead log (WAL) to ensure durability.
3. The data is then written to the in-memory store (MemStore).
4. Once the MemStore reaches a threshold, its contents are flushed to HDFS as HFiles.
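From the client's side, this write path is triggered by a simple Put; a hedged sketch using the HBase Java client follows (table, row, and column names are hypothetical). The WAL append and MemStore update happen on the Region Server, not in client code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("messages"))) { // hypothetical table
            Put put = new Put(Bytes.toBytes("row-0001")); // row key determines the target region
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
            // On the Region Server this mutation is appended to the WAL and applied to
            // the MemStore; later flushes persist it to HFiles on HDFS.
            table.put(put);
        }
    }
}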

Read Path:

1. The client sends a read request to the Region Server that hosts the target region.
2. The Region Server checks the MemStore for recent updates.
3. If the data is not found there, it is read from the HFiles stored in HDFS.
4. The data is returned to the client.
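A corresponding read sketch using a Scan (the table and row-key range are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadPathExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("messages"))) { // hypothetical table
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("row-0001"))
                    .withStopRow(Bytes.toBytes("row-0100"));
            // Each Region Server answers from its MemStore (recent writes) and the
            // HFiles on HDFS, merging both views before returning rows to the client.
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()));
                }
            }
        }
    }
}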

Advantages of HBase

• Compared to typical relational systems, operations such as reading and processing data in HBase are much faster.
• HBase supports random read and write operations, providing quick access to individual records.
• HBase integrates with the Java client API, Thrift, and REST APIs, allowing greater flexibility in application development and integration.
• HBase works smoothly with MapReduce, Hive, and Pig, enabling efficient and scalable data processing and analytics inside the Hadoop ecosystem.

Disadvantages of HBase

• HBase integration with Pig and Hive jobs can cause memory pressure on the cluster.
• In a shared cluster environment, fewer task slots are available per node because capacity must be reserved for HBase's CPU and memory requirements.
• With a single active HMaster, it has a single point of failure: if the HMaster fails and no backup master is configured, the cluster cannot function properly.
• Joins and normalization are very difficult to express in HBase tables.
• HBase offers only limited built-in authentication and permission controls.

Features of HBase Architecture

Distributed and Scalable

• HBase is designed to be naturally distributed, providing parallel processing, high availability, fault
tolerance, and the ability to handle massive volumes of data. It also offers horizontal scalability,
allowing easy addition of nodes to accommodate expanding data volumes and workloads while
maintaining speed and responsiveness.

Column-oriented Storage

• HBase employs a column-oriented storage model: data is organized and stored by column family rather than by row.
• This storage layout allows rapid read and write operations for analytical queries and aggregations, as well as quick retrieval of specific columns (see the sketch below).
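As a small sketch of column-scoped reads (hypothetical table and column names, assuming an open HBase connection), a Get can be restricted to a single column so data in other column families is never touched:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnReadExample {
    // Read a single column of a single row, given an open HBase connection.
    static byte[] readSingleColumn(Connection connection) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("messages"))) { // hypothetical table
            Get get = new Get(Bytes.toBytes("row-0001"));
            // Restrict the read to one column family/qualifier; other column
            // families are not read, which column-oriented storage makes cheap.
            get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("body"));
            Result result = table.get(get);
            return result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("body"));
        }
    }
}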

Consistency and Replication

• HBase guarantees strong consistency: data modifications are immediately visible and consistent across all copies once a write is acknowledged.
• The HBase architecture also includes built-in replication mechanisms for copying data across nodes or clusters.

HBase Real Life Connect - Example

You may be aware that Facebook introduced a Social Inbox integrating email, IM, SMS, text messages, and on-site Facebook messages, and needed to store over 135 billion messages a month.

Facebook chose HBase because it needed a system that could handle two kinds of data patterns:

• An ever-growing dataset that is rarely accessed: you read what's in your Inbox and then rarely look at it again.
• A smaller set of recent, temporal data that is highly volatile.
HIVE ARCHITECTURE

Apache Hive is a data warehouse system built on top of Hadoop. It facilitates querying and managing large datasets stored in the Hadoop Distributed File System (HDFS) using an SQL-like language called HiveQL. Hive converts these SQL-like queries into MapReduce, Tez, or Spark jobs for execution on the underlying Hadoop infrastructure.
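As an illustrative sketch, a HiveQL query can be submitted through the HiveServer2 JDBC driver; Hive plans the query and runs it as MapReduce/Tez/Spark jobs. The URL, credentials, and the sales table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 JDBC URL; host, port, database, user, and table are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Hive parses this HiveQL, builds an execution plan, and runs it on Hadoop.
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM sales GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}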

Working of Hive

The following steps describe the workflow between Hive and Hadoop.

• Step-1: Execute Query – A Hive interface such as the command line or web UI sends the query to the driver for execution; the UI calls the driver's execute interface (for example, over JDBC or ODBC).
• Step-2: Get Plan – The driver creates a session handle for the query and passes the query to the compiler to generate an execution plan; in other words, the driver interacts with the compiler.
• Step-3: Get Metadata – The compiler sends a metadata request to the metastore.
• Step-4: Send Metadata – The metastore returns the required metadata to the compiler.
• Step-5: Send Plan – The compiler sends the generated execution plan back to the driver.
• Step-6: Execute Plan – The driver sends the execution plan to the execution engine, which runs the job on Hadoop (execute job, job done, DFS/metadata operations).
• Step-7: Fetch Results – The user interface fetches the results from the driver.
• Step-8: Send Results – The execution engine retrieves the results from the data nodes and returns them to the driver, which passes them on to the user interface (UI).
HIVE DATA TYPE

Advantages of Hive Architecture:

• Scalability: Hive is a distributed system that can easily scale to handle large volumes of data by adding more nodes to the cluster.
• Data accessibility: Hive lets users query data stored in Hadoop without complex programming skills; queries are written in HiveQL, an SQL-like language based on SQL syntax.
• Data integration: Hive integrates easily with other tools and systems in the Hadoop ecosystem, such as Pig, HBase, and MapReduce.
• Flexibility: Hive can handle both structured and semi-structured data and supports various file formats, including CSV, JSON, and Parquet.
• Security: Hive provides security features such as authentication, authorization, and encryption to help protect data.

Apache Pig - Architecture

The language used to analyze data in Hadoop with Pig is known as Pig Latin. It is a high-level data processing language that provides a rich set of data types and operators for performing various operations on the data.

Apache Pig Components

The Apache Pig framework consists of several components. Let us look at the major ones.
Parser
Initially, Pig scripts are handled by the Parser. It checks the syntax of the script and performs type checking and other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the
nodes and the data flows are represented as edges.

Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and filter pushdown.

Compiler
The compiler compiles the optimized logical plan into a series of
MapReduce jobs.

Execution engine
The MapReduce jobs are then submitted to Hadoop in a sorted order, where they are executed to produce the desired results.
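As a sketch of this pipeline in action, Pig Latin statements can be embedded in a Java program via PigServer; each registered statement goes through the parser, optimizer, and compiler described above before jobs are produced. The input file people.txt and the aliases are hypothetical, and local mode is used here only for illustration.

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; MAPREDUCE mode would submit jobs to the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("people = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);");
        pig.registerQuery("adults = FILTER people BY age >= 18;");
        // openIterator triggers plan compilation and execution, then streams results back.
        Iterator<Tuple> it = pig.openIterator("adults");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}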

Pig Latin Data Model

The data model of Pig Latin is fully nested and allows complex non-atomic data types such as map and tuple. The elements of the data model are described below.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic types of Pig. A piece of data or a simple atomic value is known as a field.

Example − ‘raja’ or ‘30’

Tuple
A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.

Example − (Raja, 30)

Bag
A bag is an unordered set of tuples. In other words, a collection of tuples
(non-unique) is known as a bag. Each tuple can have any number of fields
(flexible schema). A bag is represented by ‘{}’. It is similar to a table in
RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple
contain the same number of fields or that the fields in the same position
(column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as inner bag.

Example − {Raja, 30, {9848022338, [email protected],}}


Map
A map (or data map) is a set of key-value pairs. The key must be of type chararray and should be unique. The value can be of any type. A map is represented by '[]'.

Example − [name#Raja, age#30]
