HBase Architecture

Prasanth Kothuri, CERN

Why HBase?
- Hadoop without HBase
- Distributed, fault-tolerant, throughput-optimized data storage (HDFS)
- Distributed, batch-oriented processing frameworks such as
MapReduce
- Distributed in-memory processing of the entire dataset (Spark); SQL
executed over HDFS data as MapReduce jobs (Hive) or via an MPP engine (Impala)
- What's missing
- No low-latency random access to your big data
- Can't update or delete existing rows
- HBase does all this + much more

What is HBase?
- Distributed column-oriented key-value data store
- Designed to handle petabytes of data and billions of rows
- Fast random reads and writes
- Schemaless data model («NoSQL»)
- Not an RDBMS: no SQL, no joins and no indexes
- Self-managed data partitions, aka auto-sharding

- Random access to your planet-sized data

Building Blocks
- The most basic unit in HBase is a column
- Each column can have multiple versions, with each distinct value stored in
a separate cell

- One or more columns form a row, which is uniquely addressed by a row key

- A table is a collection of rows

- All rows are stored in sorted order of the row key
hbase(main):025:0> scan 'blogposts'
ROW COLUMN+CELL
post1 column=post:author, timestamp=1440601435122, value=Prasanth Kothuri
post1 column=post:body, timestamp=1440601483940, value=This is a test blog entry
post1 column=post:title, timestamp=1440601427153, value=Hello World
post2 column=post:author, timestamp=1440601550401, value=Prasanth Kothuri
post2 column=post:body, timestamp=1440601839670, value=HDFS is a distributed file system that is ...
post2 column=post:title, timestamp=1440601528247, value=Introduction to HDFS
2 row(s) in 0.0060 seconds

Building Blocks
- Column Families
- Columns are grouped into column families
- Defined when the table is created
- Should not be changed too often
- The number of column families should be kept small

- Referencing columns
- The column name is called the qualifier
- Reference using family:qualifier

Building Blocks
- A note on the NULL value
- In an RDBMS, NULLs occupy space
- In HBase, NULL columns are simply not stored

- Cells
- Every column value, or cell, is timestamped
- This can be used to save multiple versions of a value
- Versions are stored in decreasing timestamp order
- Cell versions can be constrained by predicate deletes
- e.g. keep only values from the last month
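A minimal Python sketch of versioned cells and a predicate delete (a hypothetical model for illustration, not HBase code; the `Cell` class and its method names are assumptions):

```python
import time

# Hypothetical model: a cell holds multiple timestamped versions,
# kept in decreasing timestamp order; a predicate delete prunes
# versions older than a cutoff.
class Cell:
    def __init__(self):
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, value, ts=None):
        ts = ts if ts is not None else int(time.time() * 1000)
        self.versions.append((ts, value))
        # keep versions sorted newest-first, as HBase does
        self.versions.sort(key=lambda v: v[0], reverse=True)

    def get(self):
        # a plain read returns the newest version
        return self.versions[0][1] if self.versions else None

    def prune_older_than(self, cutoff_ts):
        # predicate delete: drop all versions older than the cutoff
        self.versions = [(t, v) for t, v in self.versions if t >= cutoff_ts]

cell = Cell()
cell.put("v1", ts=100)
cell.put("v2", ts=200)
print(cell.get())            # the newest value, v2
cell.prune_older_than(150)   # e.g. keep only recent versions
print(len(cell.versions))    # only one version survives
```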

Tables, Rows, Columns and Cells
- Access to data
- (Table, RowKey, Family, Column, Timestamp) -> Value
- SortedMap<RowKey, List<SortedMap<Column, List<Value, Timestamp>>>>

- Which means:
- The first SortedMap is the table, containing a List of column families
- The families contain another SortedMap, representing columns and a List
of value-timestamp tuples
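The nested-map view above can be sketched in plain Python (a hypothetical model, not HBase code: plain dicts stand in for the SortedMaps, with sorting done at scan time):

```python
# Hypothetical model of the logical layout:
#   table -> row key -> "family:qualifier" -> [(value, timestamp), ...]
table = {}

def put(row, family, qualifier, value, ts):
    col = f"{family}:{qualifier}"
    table.setdefault(row, {}).setdefault(col, []).append((value, ts))
    # versions are kept in decreasing timestamp order
    table[row][col].sort(key=lambda v: v[1], reverse=True)

def scan():
    # rows come back in sorted row-key order, columns sorted within a row
    for row in sorted(table):
        for col in sorted(table[row]):
            value, ts = table[row][col][0]  # newest version wins
            yield row, col, value, ts

put("post2", "post", "title", "Introduction to HDFS", 1440601528247)
put("post1", "post", "title", "Hello World", 1440601427153)
for row, col, value, ts in scan():
    print(row, col, value)   # post1 is emitted before post2
```

Note how `scan` returns `post1` first even though `post2` was inserted first: row keys, not insertion order, define the sort.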

Rows and Columns in HBase

Time-oriented and spreadsheet view

Time-oriented view

Spreadsheet view

Building Blocks – scan of blogposts

hbase(main):044:0* scan 'blogposts'

ROW COLUMN+CELL
guestpost1 column=image:bodyimage, timestamp=1440698372345, value=image3.jpg
guestpost1 column=image:header, timestamp=1440698372323, value=image3
guestpost1 column=post:author, timestamp=1440698372251, value=Barack Obama
guestpost1 column=post:body, timestamp=1440698372298, value=blah bla blah...
guestpost1 column=post:title, timestamp=1440698372276, value=How to play Golf
post1 column=image:bodyimage, timestamp=1440698351420, value=image1.jpg
post1 column=image:header, timestamp=1440698351395, value=image1
post1 column=post:author, timestamp=1440698351284, value=Prasanth Kothuri
post1 column=post:body, timestamp=1440698351372, value=This is a test blog post
post1 column=post:title, timestamp=1440698351346, value=Hello World
post2 column=image:bodyimage, timestamp=1440698372220, value=image2.jpg
post2 column=image:header, timestamp=1440698372182, value=image2
post2 column=post:author, timestamp=1440698372095, value=Prasanth Kothuri
post2 column=post:body, timestamp=1440698372148, value=HDFS is a distributed file system that is...
post2 column=post:title, timestamp=1440698372123, value=Introduction to HDFS
post4 column=post:author, timestamp=1440698372385, value=Prasanth Kothuri
post4 column=post:body, timestamp=1440698372422, value=Distributed sorted key value store
post4 column=post:title, timestamp=1440698372403, value=HBase Architecture

4 row(s) in 0.0770 seconds

How does it scale?
- Region
- This is the basic unit of scalability and load balancing
- Regions are contiguous ranges of rows stored together
- Regions are split by the system when they become too large
- Regions can also be merged to reduce the number of files
- How it works
- Initially, there is one region
- The system monitors region size: if a threshold is reached, SPLIT
- Regions are split in two at the middle key
- This creates two roughly equal halves
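The split at the middle key can be sketched as follows (a toy model, not HBase code; the threshold counts rows rather than bytes purely for illustration):

```python
# Hypothetical model: a region is a sorted, contiguous range of row keys;
# when it grows past a threshold it splits at the middle key.
THRESHOLD = 4  # rows per region, tiny for illustration

def maybe_split(region):
    # region: sorted list of row keys
    if len(region) <= THRESHOLD:
        return [region]
    mid = len(region) // 2
    # split at the middle key into two roughly equal daughter regions
    return [region[:mid], region[mid:]]

region = ["post1", "post2", "post3", "post4", "post5", "post6"]
daughters = maybe_split(region)
print([d[0] for d in daughters])   # start keys of the daughter regions
```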

Table Regions

Region Servers
- Region Servers
- Each region is served by exactly one Region Server
- Region servers can serve multiple regions
- from the same table and also from other tables

- Failures
- Regions allow for fast recovery upon failure
- Fine-grained load balancing is also achieved using regions, as
they can be easily moved across servers

Sharding in HBase

Anatomy of a Region Server
[Diagram: a RegionServer holds one WAL and one BlockCache plus several
HRegions; each HRegion has one HStore per column family; each HStore holds a
MemStore and StoreFiles (HFiles); HFiles and the WAL are persisted on HDFS]

Legend:
- A RegionServer contains a single WAL, a single BlockCache, and multiple Regions
- A Region contains multiple Stores, one for each Column Family
- A Store consists of multiple StoreFiles and a MemStore
- A StoreFile corresponds to a single HFile
- HFiles and the WAL are persisted on HDFS

HBase Cluster View

Legend:
- RegionServer is collocated with an HDFS DataNode
- Clients communicate directly with Region Servers for sending and receiving data
- Master manages region assignment and DDLs
- Online configuration state is maintained in ZooKeeper

HBase Architecture
- HBase Master
- Assigns regions to region servers using ZooKeeper
- Handles load balancing
- Not part of the data path
- Holds metadata and schema

- Region Server
- Handles READs and WRITEs
- Manages the WAL and HFiles
- Handles region splitting

HBase Storage
- HFiles
- A block-indexed file format that stores sorted key-value pairs

- Where are HBase files stored?
- HFiles are divided into blocks and stored in HDFS
- HBase has a root directory, set to /hbase in HDFS
- There is a subdirectory for each HBase table under /hbase

HBase Storage
- HBase Files in HDFS
HFiles
[hdfs@itrac925 ~]$ hdfs dfs -ls -R /hbase
drwxr-xr-x - hbase hbase 0 2015-08-26 17:03 /hbase/data/default/blogposts
drwxr-xr-x - hbase hbase 0 2015-08-26 17:03 /hbase/data/default/blogposts/.tabledesc
-rw-r--r-- 3 hbase hbase 535 2015-08-26 17:03 /hbase/data/default/blogposts/.tabledesc/.tableinfo.0000000001
drwxr-xr-x - hbase hbase 0 2015-08-26 17:03 /hbase/data/default/blogposts/.tmp
drwxr-xr-x - hbase hbase 0 2015-08-26 18:03 /hbase/data/default/blogposts/26467f89a1aaa52a7d48493a9f88549e
-rw-r--r-- 3 hbase hbase 44 2015-08-26 17:03 /hbase/data/default/blogposts/26467f89a1aaa52a7d48493a9f88549e/.regioninfo
drwxr-xr-x - hbase hbase 0 2015-08-26 18:03 /hbase/data/default/blogposts/26467f89a1aaa52a7d48493a9f88549e/.tmp
drwxr-xr-x - hbase hbase 0 2015-08-26 17:03 /hbase/data/default/blogposts/26467f89a1aaa52a7d48493a9f88549e/image
drwxr-xr-x - hbase hbase 0 2015-08-26 18:03 /hbase/data/default/blogposts/26467f89a1aaa52a7d48493a9f88549e/post
-rw-r--r-- 3 hbase hbase 1313 2015-08-26 18:03
/hbase/data/default/blogposts/26467f89a1aaa52a7d48493a9f88549e/post/a7331f6dd5544d109ce9ce1b1a4696dd
drwxr-xr-x - hbase hbase 0 2015-08-26 17:03 /hbase/data/default/blogposts/26467f89a1aaa52a7d48493a9f88549e/recovered.edits
-rw-r--r-- 3 hbase hbase 0 2015-08-26 17:03 /hbase/data/default/blogposts/26467f89a1aaa52a7d48493a9f88549e/recovered.edits/2.seqid

WALs
-rw-r--r-- 3 hbase hbase 87291610 2015-08-27 08:32 /hbase/WALs/itrac901.cern.ch,60020,1440455544980/itrac901.cern.ch%2C60020%2C1440455544980.null0.1440653566417
-rw-r--r-- 3 hbase hbase 63860405 2015-08-27 08:32 /hbase/WALs/itrac902.cern.ch,60020,1440455544763/itrac902.cern.ch%2C60020%2C1440455544763.null0.1440653563334
-rw-r--r-- 3 hbase hbase 16737365 2015-08-27 08:32 /hbase/WALs/itrac903.cern.ch,60020,1440455544535/itrac903.cern.ch%2C60020%2C1440455544535.null0.1440653562699
-rw-r--r-- 3 hbase hbase 153680074 2015-08-27 08:32 /hbase/WALs/itrac904.cern.ch,60020,1440455545250/itrac904.cern.ch%2C60020%2C1440455545250.null0.1440653567897

HBase Physical Architecture

ZooKeeper
- ZooKeeper is a high-performance cluster coordination
service for distributed applications like HBase
- ZooKeeper maintains region server membership and
health in the HBase cluster
- ZooKeeper also holds the location of the .META. table
regions

HBase Read operations
- The client contacts ZooKeeper for the location of the
.META. table
- The client scans .META. to find the region server hosting
the required region
- A quick exclusion of store files is done using Bloom filters
and timestamps
- Then the MemStore and the remaining store files are
scanned to find the matching key
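The lookup within a region server can be sketched as below (a simplified model, not HBase code; plain sets stand in for Bloom filters, which in reality allow false positives but never false negatives):

```python
# Hypothetical model of the read path inside a region server:
# check the MemStore first, skip store files whose Bloom filter
# rules the key out, then probe the remaining files.
memstore = {"post5": "in-memory value"}
# each store file: (set of keys it may contain, its actual data)
storefiles = [
    ({"post1", "post2"}, {"post1": "v1", "post2": "v2"}),
    ({"post3"}, {"post3": "v3"}),
]

def get(key):
    if key in memstore:                  # newest data lives in the MemStore
        return memstore[key]
    for bloom, data in storefiles:
        if key not in bloom:             # Bloom filter excludes this file
            continue
        if key in data:
            return data[key]
    return None

print(get("post5"))   # served from the MemStore
print(get("post3"))   # served from a store file
```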

HBase Write operations
- First, data is written to the WAL (Write-Ahead Log)
- Then data is placed in an in-memory structure called the
MemStore
- When the size of the MemStore reaches its threshold, data
is flushed to a new HFile
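These three steps can be sketched as follows (a toy model, not HBase code; the flush threshold counts entries rather than bytes purely for illustration):

```python
# Hypothetical model of the write path: append to the WAL first,
# then to the MemStore; flush to a new sorted file at a threshold.
FLUSH_THRESHOLD = 3  # entries, tiny for illustration

wal, memstore, hfiles = [], {}, []

def put(key, value):
    wal.append((key, value))        # 1. durability: write-ahead log
    memstore[key] = value           # 2. buffered in memory
    if len(memstore) >= FLUSH_THRESHOLD:
        flush()

def flush():
    # 3. persist the MemStore as a new immutable, sorted file
    hfiles.append(sorted(memstore.items()))
    memstore.clear()

for i in range(4):
    put(f"row{i}", f"value{i}")
print(len(hfiles), len(memstore))   # one flushed file, one entry pending
```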

HBase Write operations – contd.
- Deletes are written as new tombstone markers
- Updates are written as separate KeyValue instances,
possibly spread across multiple store files
- Minor compactions merge HFiles into a smaller number of
files
- A relatively low-cost operation
- Major compactions merge HFiles into a single HFile,
removing any deleted items
- A costly operation
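A major compaction can be sketched as below (a simplified model, not HBase code; dicts stand in for store files and a sentinel object stands in for a tombstone marker):

```python
TOMBSTONE = object()  # stand-in for the marker written by a delete

# Hypothetical model: a major compaction merges all store files into one,
# keeping only the newest value per key and dropping tombstoned entries.
def major_compact(storefiles):
    merged = {}
    # files ordered oldest to newest, so later writes win
    for sf in storefiles:
        merged.update(sf)
    # keys marked with a tombstone are removed for good
    return {k: v for k, v in merged.items() if v is not TOMBSTONE}

older = {"post1": "v1", "post2": "v2"}
newer = {"post1": "v1-updated", "post2": TOMBSTONE}
print(major_compact([older, newer]))  # only the surviving, newest values
```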

Client API
- Java API
- HTable class in the org.apache.hadoop.hbase.client package
- Python
- HappyBase, starbase
- REST interface
- Stargate
- Thrift interface
- HBase shell
- No SQL
- But there are gateways from Hive, Impala and other components

HBase use cases / workloads
- Good for
- Large datasets
- Sparse datasets
- Denormalized data records
- A need for large-volume random reads

- Avoid if you have
- Small datasets
- Relational data
- Transactions

- The choice of row key is crucial to avoid region server ‘hotspotting’
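One common remedy for hotspotting is key salting, sketched here as a hypothetical example (the bucket count and key format are assumptions, not part of the original slides):

```python
import hashlib

# Hypothetical sketch: monotonically increasing keys (e.g. timestamps)
# all land in the same region; prefixing a hash-derived salt spreads
# writes across regions.
NUM_BUCKETS = 4  # assumed number of salt buckets

def salted_key(key):
    # derive a stable salt from the key itself so reads can recompute it
    salt = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{salt}-{key}"

keys = [f"2015-08-27-{i:04d}" for i in range(8)]
for k in keys:
    print(salted_key(k))
# sequential keys now scatter across the salt prefixes 0..3
```

The trade-off is that a range scan over the original key order now needs one scan per salt bucket.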

Hands On – 1
CRUD operations using HBase shell

Step 1) Start the HBase shell


hbase shell

Step 2) Create a table called blogposts with post and image column families
create 'blogposts', 'post', 'image'

Hands On – 1 contd
Step 3) Insert some data into the table
put 'blogposts', 'post1', 'post:author', 'Prasanth Kothuri'
put 'blogposts', 'post1', 'post:title', 'Hello World'
put 'blogposts', 'post1', 'post:body', 'This is a test blog post'
put 'blogposts', 'post1', 'image:header', 'image1'
put 'blogposts', 'post1', 'image:bodyimage', 'image1.jpg'

put 'blogposts', 'post2', 'post:author', 'Prasanth Kothuri'


put 'blogposts', 'post2', 'post:title', 'Introduction to HDFS'
put 'blogposts', 'post2', 'post:body', 'HDFS is a distributed file system that is...'
put 'blogposts', 'post2', 'image:header', 'image2'
put 'blogposts', 'post2', 'image:bodyimage', 'image2.jpg'

put 'blogposts', 'guestpost1', 'post:author', 'Barack Obama'


put 'blogposts', 'guestpost1', 'post:title', 'How to play Golf'
put 'blogposts', 'guestpost1', 'post:body', 'blah bla blah...'
put 'blogposts', 'guestpost1', 'image:header', 'image3'
put 'blogposts', 'guestpost1', 'image:bodyimage', 'image3.jpg'

put 'blogposts', 'post4', 'post:author', 'Prasanth Kothuri'


put 'blogposts', 'post4', 'post:title', 'HBase Architecture'
put 'blogposts', 'post4', 'post:body', 'Distributed sorted key value store'

Hands On 1 - contd
Step 4) Scan the full table

scan 'blogposts'

Step 5) Look up a specific key

get 'blogposts', 'post1'

Step 6) Update a row

put 'blogposts', 'guestpost1', 'post:title', 'How to save the world'

Step 7) Delete a row

deleteall 'blogposts', 'guestpost1'

Conclusion
- Logical data model
- Tables, Rows, Column Families, Columns and Cells
- Logical HBase architecture
- Regions, Region Servers, HBase Master and ZooKeeper
- Physical HBase architecture
- WAL, HFiles, HBase on HDFS
- Data semantics supported by HBase
- GET, SCAN, PUT, CREATE and DELETE
- Use cases suited for HBase
Q&A

E-mail: [email protected]
Blog: http://prasanthkothuri.wordpress.com
See also: https://db-blog.web.cern.ch/
