HBase's Architecture
is inspired by Google's Bigtable
Recap
HBase vs RDBMS
This is how data is stored in traditional databases:

id  type                   for user  from user  timestamp
1   Friend request status  Ryan      Jessica    146710201
2   Comment                Chaz      Daniel     146711200
3   Comment                Rick      Brendan    1467112205
4   Like                   Rick      Brendan    1467112213
Column oriented storage
Data is stored in a map:
Key = <Row id, Col id>
Value = <data>
For example:
Key = <2, for_user>
Value = Chaz
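As a rough illustration of the map described above, the sketch below flattens the example rows into a Python dict keyed by (row id, column id). The row ids 1 to 4, the underscore-style column names other than for_user, and the variable names are assumptions made for this example; it says nothing about how HBase lays out bytes on disk.

```python
# Example rows from the slides; the row ids 1..4 are assumed.
rows = [
    (1, "Friend request status", "Ryan", "Jessica", 146710201),
    (2, "Comment", "Chaz", "Daniel", 146711200),
    (3, "Comment", "Rick", "Brendan", 1467112205),
    (4, "Like", "Rick", "Brendan", 1467112213),
]
columns = ("type", "for_user", "from_user", "timestamp")

# Build the map: key = (row id, column id), value = the cell's data.
cells = {}
for row_id, *values in rows:
    for column, value in zip(columns, values):
        cells[(row_id, column)] = value

print(cells[(2, "for_user")])  # Chaz
```

Each cell becomes its own entry, so a row with a missing column simply has no entry for that key.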
In column oriented storage, each row is broken up into one entry per column:

row id  column     value
1       type       Friend request status
1       for user   Ryan
1       from user  Jessica
1       timestamp  146710201
2       type       Comment
2       for user   Chaz
2       from user  Daniel
2       timestamp  146711200
3       type       Comment
3       for user   Rick
3       from user  Brendan
3       timestamp  1467112205
4       type       Like
4       for user   Rick
4       from user  Brendan
4       timestamp  1467112213
An HBase table is in fact a sorted map: the (row id, column) pairs are the keys, and the cell contents are the values.
A sorted nested map:
<Row id, ColumnFamily, <Column, <Timestamp, Value>>>
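The nesting can be pictured with plain Python dicts. This is only a sketch: the column family name "activity", the version timestamps 100 and 200, the value "Charles", and the helper names are all hypothetical. Integer row ids are used to match the slides, whereas real HBase sorts row keys as byte arrays.

```python
# Shape: table[row_id][column_family][column][timestamp] = value
table = {}

def put(row_id, family, column, timestamp, value):
    """Store one version of a cell, creating the nested levels on demand."""
    versions = table.setdefault(row_id, {}).setdefault(family, {}).setdefault(column, {})
    versions[timestamp] = value

def get_latest(row_id, family, column):
    """Return the value with the highest timestamp for a cell."""
    versions = table[row_id][family][column]
    return versions[max(versions)]

def scan():
    """Return row ids in sorted order, as a scan over a sorted map would."""
    return sorted(table)

put(2, "activity", "for_user", 100, "Chaz")
put(1, "activity", "for_user", 100, "Ryan")
put(2, "activity", "for_user", 200, "Charles")  # hypothetical newer version

print(scan())                                 # [1, 2]
print(get_latest(2, "activity", "for_user"))  # Charles
```

The innermost level keeps every version of a cell, so an update adds a new timestamp rather than overwriting the old value.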
When you read data from HBase, it performs a lookup for the specified row id.
When you write data to HBase, it needs to insert the row id in the right place, so that the rows stay sorted.
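A minimal sketch of both operations on a sorted structure, using Python's standard bisect module. The class name SortedRows and its methods are made up for illustration and are not HBase APIs.

```python
import bisect

class SortedRows:
    """Toy sorted map: row ids are kept in ascending order at all times."""

    def __init__(self):
        self._keys = []  # sorted list of row ids
        self._data = {}  # row id -> row contents

    def put(self, row_id, row):
        # A new row id is inserted at its sorted position.
        if row_id not in self._data:
            bisect.insort(self._keys, row_id)
        self._data[row_id] = row

    def get(self, row_id):
        # Binary search for the row id, as a lookup in a sorted map would do.
        i = bisect.bisect_left(self._keys, row_id)
        if i < len(self._keys) and self._keys[i] == row_id:
            return self._data[row_id]
        return None

    def row_ids(self):
        return list(self._keys)

rows = SortedRows()
rows.put(3, "Rick")
rows.put(1, "Ryan")
rows.put(2, "Chaz")
print(rows.row_ids())  # [1, 2, 3]
print(rows.get(2))     # Chaz
```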
HBase does this using Region Servers.
Region Servers
Row ids in a table are divided into ranges called regions:

row ids 1-4    Region 1
row ids 5-8    Region 2
row ids 9-12   Region 3
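To make the idea of ranges concrete, the sketch below cuts a sorted list of row ids into contiguous regions. The function name and the fixed number of rows per region are assumptions chosen to reproduce the twelve-row example; they are not a description of how HBase decides where to split.

```python
def split_into_regions(sorted_row_ids, rows_per_region):
    """Cut a sorted list of row ids into contiguous (first, last) ranges."""
    regions = []
    for start in range(0, len(sorted_row_ids), rows_per_region):
        chunk = sorted_row_ids[start:start + rows_per_region]
        regions.append((chunk[0], chunk[-1]))
    return regions

regions = split_into_regions(list(range(1, 13)), rows_per_region=4)
print(regions)  # [(1, 4), (5, 8), (9, 12)]
```

Because the row ids are sorted, every region covers one unbroken range and no two regions overlap.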
Each region is handled by a Region Server:

Region Server 1: Region 1, Region 3
Region Server 2: Region 2
Regions serve as an index for quickly looking up where a row key belongs.
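One way to picture regions acting as an index is a binary search over the first row id of each region. The split points 1, 5 and 9 are assumed from the twelve-row example, and locate is a hypothetical helper, not an HBase call.

```python
import bisect

# (first row id of the region, region name, region server), sorted by first row id.
REGION_INDEX = [
    (1, "Region 1", "Region Server 1"),
    (5, "Region 2", "Region Server 2"),
    (9, "Region 3", "Region Server 1"),
]
REGION_STARTS = [start for start, _, _ in REGION_INDEX]

def locate(row_id):
    """Return (region, region server) for a row id via binary search."""
    i = bisect.bisect_right(REGION_STARTS, row_id) - 1
    if i < 0:
        raise KeyError(f"row id {row_id} is below the first region")
    _, region, server = REGION_INDEX[i]
    return region, server

print(locate(6))   # ('Region 2', 'Region Server 2')
print(locate(10))  # ('Region 3', 'Region Server 1')
```

The search only touches the handful of region start keys, not the rows themselves, which is what keeps the lookup fast. In this sketch the last region is open-ended, so any row id of 9 or above maps to Region 3.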
A region server handles all read and write operations for the regions that are allotted to it.
Initially, all writes are stored in memory, in the Memstore.
Whenever there is a new change, the data is updated in the Memstore and a change log entry is written to disk, in the WriteAheadLog (WAL).
The WriteAheadLog is created for recovery in case the Region Server crashes.
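The write path and the recovery step can be sketched with a local file standing in for the WriteAheadLog. The class TinyRegionServer, the JSON-lines log format and the temp-file path are all assumptions for illustration. The sketch appends to the log before updating memory, a design choice that ensures a write which has returned is never held in memory only.

```python
import json
import os
import tempfile

class TinyRegionServer:
    """Toy write path: append to a log file, then update an in-memory dict."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memstore = {}
        self._replay_wal()

    def _replay_wal(self):
        """Rebuild the Memstore from the log, as after a crash and restart."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as wal:
            for line in wal:
                row_id, column, value = json.loads(line)
                self.memstore[(row_id, column)] = value

    def put(self, row_id, column, value):
        # Log the change to disk first, so that it survives a crash.
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps([row_id, column, value]) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        self.memstore[(row_id, column)] = value

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
server = TinyRegionServer(wal_path)
server.put(2, "for_user", "Chaz")

# Simulate a crash: the in-memory state is lost, only the log file remains.
restarted = TinyRegionServer(wal_path)
print(restarted.memstore[(2, "for_user")])  # Chaz
```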
Periodically the Memstore gets full, and the data in the Memstore is flushed to disk as an HFile.
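A small sketch of the flush step, assuming the Memstore counts as full once it holds a fixed number of entries. The class FlushingStore and the entry-count threshold are hypothetical; a count is used here only to keep the example short.

```python
class FlushingStore:
    """Toy Memstore that is flushed to an immutable, sorted 'HFile' when full."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.memstore = {}
        self.hfiles = []  # each entry: a tuple of (row id, value) pairs sorted by row id

    def put(self, row_id, value):
        self.memstore[row_id] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the Memstore contents out in sorted order and start afresh."""
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

store = FlushingStore(flush_threshold=3)
for row_id, user in [(3, "Rick"), (1, "Ryan"), (2, "Chaz")]:
    store.put(row_id, user)

print(store.memstore)  # {}
print(store.hfiles)    # [((1, 'Ryan'), (2, 'Chaz'), (3, 'Rick'))]
```

Writes arrive in any order, but each flushed file is sorted by row id, which is what makes later lookups inside it cheap.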
The data for a row key is either in the Memstore or in an HFile.
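A read therefore has two places to check. The sketch below looks in the Memstore first and then in the flushed files from newest to oldest, so that the most recent write wins; that ordering, the function name read and the sample values are assumptions of this example.

```python
def read(row_id, memstore, hfiles):
    """Look in the Memstore first, then in HFiles from newest to oldest."""
    if row_id in memstore:
        return memstore[row_id]
    for hfile in reversed(hfiles):  # hfiles is ordered oldest first
        if row_id in hfile:
            return hfile[row_id]
    return None

hfiles = [{1: "Ryan", 2: "Chaz"}, {2: "Daniel"}]  # hypothetical flushed data
memstore = {3: "Rick"}

print(read(3, memstore, hfiles))  # Rick   (still in memory)
print(read(2, memstore, hfiles))  # Daniel (newest HFile wins)
print(read(1, memstore, hfiles))  # Ryan   (only in the oldest HFile)
```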
HFiles are stored in HDFS.
HDFS breaks up each HFile into blocks and stores them on different nodes.
To minimize disk seeks, the region server keeps an index of row key to HFile block in memory.
It performs only one disk seek to find a row key.
When you try to read or insert data:
1. The region server containing the row key is identified.
2. The region server looks up the Memstore or the HFile and carries out the operation.
Clients interact directly with the Region Server handling the relevant row keys.
They need to know which region server is handling their row key.
HBase uses a Master server to manage Regions and Region Servers.
The Master assigns regions to region servers, manages load balancing, and so on.
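As a rough picture of assignment with load balancing, the sketch below hands each region to whichever server currently holds the fewest regions. This greedy rule and the function name assign_regions are assumptions for illustration, not a description of HBase's actual balancer.

```python
def assign_regions(regions, servers):
    """Give each region to the server that currently has the fewest regions."""
    assignment = {server: [] for server in servers}
    for region in regions:
        least_loaded = min(servers, key=lambda s: len(assignment[s]))
        assignment[least_loaded].append(region)
    return assignment

assignment = assign_regions(
    ["Region 1", "Region 2", "Region 3"],
    ["Region Server 1", "Region Server 2"],
)
print(assignment)
# {'Region Server 1': ['Region 1', 'Region 3'], 'Region Server 2': ['Region 2']}
```

With three regions and two servers this happens to reproduce the layout from the earlier example: Region Server 1 gets Regions 1 and 3, Region Server 2 gets Region 2.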
The Master uses Apache ZooKeeper to help assign regions to region servers.
ZooKeeper helps clients look up the relevant region server for a specific row id.
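A simplified sketch of the client side, assuming the region for a row id is already known. Directory is a hypothetical stand-in for the lookup service, not the ZooKeeper API, and Client remembers each answer so that repeated requests for the same region do not trigger another lookup.

```python
class Directory:
    """Hypothetical stand-in for the lookup service the client consults."""

    def __init__(self, locations):
        self._locations = locations  # region name -> region server
        self.lookups = 0

    def server_for(self, region):
        self.lookups += 1
        return self._locations[region]

class Client:
    """Toy client that caches region locations after the first lookup."""

    def __init__(self, directory):
        self.directory = directory
        self._cache = {}

    def server_for(self, region):
        # Ask the directory only on a cache miss, then reuse the answer.
        if region not in self._cache:
            self._cache[region] = self.directory.server_for(region)
        return self._cache[region]

directory = Directory({"Region 1": "Region Server 1", "Region 2": "Region Server 2"})
client = Client(directory)
print(client.server_for("Region 2"))  # Region Server 2
print(client.server_for("Region 2"))  # Region Server 2
print(directory.lookups)              # 1
```

After the first lookup the client talks to the region server directly, which matches the point that clients interact with region servers rather than routing every request through a central component.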
HBase overview: Region Servers (each with a WAL, HFiles and a Memstore), the Master and ZooKeeper, with HDFS as the underlying storage.