HBase Tutorial
Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop Distributed
File System (HDFS)
HBase is a versioned database.
Installing HBase (screenshots omitted) involves the following steps:
1. Download HBase from the HBase website.
2. Unzip the downloaded file.
3. Make a new directory for HBase and move HBase into it.
4. Set JAVA_HOME.
5. Open the bashrc file and set HBASE_HOME in it.
6. Start HBase and verify that it has started.
7. Enter the HBase shell.
8. Stop HBase.
9. Set the HBase database destination path.
The HBase data model stores semi-structured data with varying data types, column sizes, and
field sizes. The layout of the HBase data model eases data partitioning and distribution across the
cluster. The data model consists of several logical components: row key, column family, table
name, timestamp, and so on. The row key uniquely identifies each row in an HBase table. Column
families in HBase are static, whereas the columns themselves are dynamic.
HBase provides low-latency random reads and writes on top of HDFS. In HBase, tables are
dynamically distributed by the system whenever they become too large to handle (auto-sharding).
The basic and foundational unit of horizontal scalability in HBase is the region: a
continuous, sorted set of rows that are stored together (a subset of a table's data).
The HBase architecture has a single HBase master node (HMaster) and several slaves, i.e.
region servers. Each region server (slave) serves a set of regions, and a region can be served only
by a single region server. When a client sends a read or write request, it looks up (via ZooKeeper
and the META table) the region server that hosts the target region and sends the request directly
to that region server.
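As an illustration of this client path, here is a minimal sketch using the HBase 1.x Java client API. The table name test_table and column family cf are assumptions; region lookup through ZooKeeper and the hbase:meta table happens inside the client library.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();     // reads hbase-site.xml from the classpath
    Connection connection = ConnectionFactory.createConnection(conf);
    try {
      // The client library locates the region server for each row key;
      // HMaster is not on the read/write path.
      Table table = connection.getTable(TableName.valueOf("test_table"));
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value1"));
      table.put(put);
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));
      table.close();
    } finally {
      connection.close();
    }
  }
}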
i. HMaster
HBase HMaster is a lightweight process that assigns regions to region servers in the Hadoop
cluster for load balancing. Responsibilities of HMaster include coordinating the region servers,
assigning regions at startup and re-assigning them for recovery or load balancing, monitoring all
region server instances in the cluster, and handling DDL operations (creating, deleting, and
altering tables).
ii. Region Server
Region servers are the worker nodes which handle read, write, update, and delete requests from
clients. The Region Server process runs on every node in the Hadoop cluster, typically
co-located with an HDFS DataNode, and consists of the following components –
● Block Cache – This is the read cache. The most frequently read data is stored in the read cache,
and when the block cache is full, the least recently used data is evicted.
● MemStore – This is the write cache and stores new data that has not yet been written to disk.
Every column family in a region has its own MemStore.
● Write Ahead Log (WAL) – This is a file that stores new data that has not yet been persisted to
permanent storage, so it can be replayed if a region server crashes.
● HFile – This is the actual storage file that stores the rows as sorted key-values on disk.
iii. ZooKeeper
HBase uses ZooKeeper as a distributed coordination service for region assignments and to
recover from region server crashes by reassigning the failed server's regions to other,
functioning region servers. ZooKeeper is a centralized monitoring service that maintains
configuration information and provides distributed synchronization. Whenever a client wants to
communicate with regions, it has to approach ZooKeeper first. HMaster and the region servers
are registered with the ZooKeeper service, and a client needs to access the ZooKeeper quorum in
order to connect with region servers and HMaster. In case of a node failure within an HBase
cluster, the ZooKeeper quorum raises error notifications and initiates recovery of the failed
nodes.
The ZooKeeper service keeps track of all the region servers in an HBase cluster: how many
region servers there are and which regions each region server holds. HMaster contacts
ZooKeeper to get the details of the region servers. Services that ZooKeeper provides include
maintaining configuration information, establishing client communication with region servers,
tracking server failures and network partitions, and electing the active HMaster.
https://www.dezyre.com/article/overview-of-hbase-architecture-and-its-components/295
Reading Data from HBase
You can perform Gets and Scans using the Java API, the HBase Shell, the REST API, or the Thrift API.
When scanning, specify a startrow, a stoprow, or both; neither startrow nor stoprow needs to exist.
Because HBase sorts rows lexicographically, it returns the first row at or after where startrow
would have occurred, and stops returning rows after where stoprow would have occurred. The goal is to
reduce IO and network.
● The startrow is inclusive and the stoprow is exclusive. Given a table with rows a,
b, c, d, e, f, and startrow of c and stoprow of f, rows c-e are returned.
● If you omit startrow, the first row of the table is the startrow.
● If you omit the stoprow, all results after startrow (including startrow) are
returned.
● If startrow is lexicographically after stoprow, and you set Scan
setReversed(boolean reversed) to true, the results are returned in reverse
order. Given the same table above, with rows a-f, if you specify c as the stoprow and f
as the startrow, rows f, e, and d are returned.
Example syntax
Scan()
Scan(byte[] startRow)
Scan(byte[] startRow, byte[] stopRow)
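For example, a short sketch (reusing the Connection from the earlier example, a table here called TestTable, and the rows a-f described above; imports follow the earlier sketch plus org.apache.hadoop.hbase.client.Scan and ResultScanner):
Table table = connection.getTable(TableName.valueOf("TestTable"));
// Forward scan: startrow "c" is inclusive, stoprow "f" is exclusive.
Scan scan = new Scan(Bytes.toBytes("c"), Bytes.toBytes("f"));
ResultScanner scanner = table.getScanner(scan);
try {
  for (Result row : scanner) {
    System.out.println(Bytes.toString(row.getRow()));   // prints c, d, e
  }
} finally {
  scanner.close();
}
// Reversed scan: startrow "f" sorts after stoprow "c", so rows f, e, d are returned.
Scan reversed = new Scan(Bytes.toBytes("f"), Bytes.toBytes("c"));
reversed.setReversed(true);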
Specify a scanner cache that will be filled before the Scan result is returned, setting
setCaching to the number of rows to cache before returning the result. By default, the
caching setting on the table is used. The goal is to balance IO and network load.
Example syntax
public Scan setCaching(int caching)
To limit the number of columns if your table has very wide rows (rows with a large number of columns),
use setBatch(int batch) and set it to the number of columns you want to return in one batch. A large
number of columns is not a recommended design pattern.
Example syntax
public Scan setBatch(int batch)
To specify a maximum result size, use setMaxResultSize(long), with the number of bytes. The
goal is to reduce IO and network.
Example syntax
public Scan setMaxResultSize(long maxResultSize)
When you use setCaching and setMaxResultSize together, single server requests are limited by
either number of rows or maximum result size, whichever limit comes first.
You can limit the scan to specific column families or columns by using addFamily or addColumn. The
goal is to reduce IO and network. IO is reduced because each column family is represented by a Store on
each RegionServer, and only the Stores representing the specific column families in question need to be
accessed.
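Putting these options together, a minimal sketch (the family and column names cf1, cf2, and col are assumptions, and the table comes from the earlier example) that limits the scan to specific Stores and bounds each server round trip:
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("cf1"));                        // only the cf1 Store is read
scan.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("col"));  // plus a single column from cf2
scan.setCaching(500);                      // rows fetched per RPC
scan.setMaxResultSize(2L * 1024 * 1024);   // bytes per single server request
ResultScanner scanner = table.getScanner(scan);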
You can use a filter by using setFilter. Filters are discussed in detail in HBase Filtering and the Filter
API.
You can disable the server-side block cache for a specific scan using the API
setCacheBlocks(boolean). This is an expert setting and should only be used if you know what
you are doing.
Hedged Reads
Hadoop 2.4 introduced a new feature called hedged reads. If a read from a block is slow, the
HDFS client starts up another parallel, 'hedged' read against a different block replica. The result
of whichever read returns first is used, and the outstanding read is cancelled. This feature helps
in situations where a read occasionally takes a long time rather than when there is a systemic
problem. Hedged reads can be enabled for HBase when the HFiles are stored in HDFS. This
feature is disabled by default.
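Hedged reads are controlled by two HDFS client properties, normally set in the RegionServer configuration (hbase-site.xml). As a sketch, the same properties can also be set programmatically; the values below are illustrative, not recommendations:
Configuration conf = HBaseConfiguration.create();
// Number of threads dedicated to hedged reads; 0 (the default) disables the feature.
conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
// How long (in milliseconds) to wait for the first read before starting a hedged read.
conf.setLong("dfs.client.hedged.read.threshold.millis", 10);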
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/admin_hbase_scanning.html
HBase Filtering
When reading data from HBase using Get or Scan operations, you can use custom filters to return a
subset of results to the client. While this does not reduce server-side IO, it does reduce network
bandwidth and reduces the amount of data the client needs to process. Filters are generally used using
the Java API, but can be used from HBase Shell for testing and debugging purposes.
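For example, a minimal Java sketch (the row-key prefix user_ is an assumption, and the table comes from the earlier examples) that applies a filter so only matching rows are returned to the client:
import org.apache.hadoop.hbase.filter.PrefixFilter;

Scan scan = new Scan();
scan.setFilter(new PrefixFilter(Bytes.toBytes("user_")));   // evaluated server-side
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
  // Only rows whose keys start with "user_" reach the client.
}
scanner.close();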
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/admin_hbase_filtering.html
Variations on Put
There are several different ways to write data into HBase. Some of them are listed below.
Versions
When you put data into HBase, a timestamp is required. The timestamp can be generated
automatically by the RegionServer or can be supplied by you. The timestamp must be unique per
version of a given cell, because the timestamp identifies the version. To modify a previous
version of a cell, for instance, you would issue a Put with a different value for the data itself, but
the same timestamp.
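For example, a minimal sketch (the row key, family, and qualifier are assumed, and the timestamp is arbitrary) that writes a cell version and then modifies that same version by reusing its timestamp:
long ts = 1265875194289L;   // the timestamp identifies the version
Put original = new Put(Bytes.toBytes("row1"));
original.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), ts, Bytes.toBytes("v1"));
table.put(original);

// Same row, column, and timestamp, different value: the existing version is modified.
Put modified = new Put(Bytes.toBytes("row1"));
modified.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), ts, Bytes.toBytes("v2"));
table.put(modified);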
HBase's behavior regarding versions is highly configurable. The maximum number of versions
defaults to 1 in CDH 5, and 3 in previous versions. You can change the default value for HBase
by configuring hbase.column.max.version in hbase-site.xml, either using an
advanced configuration snippet if you use Cloudera Manager, or by editing the file directly
otherwise.
You can also configure the maximum and minimum number of versions to keep for a given column, or
specify a default time-to-live (TTL), which is the number of seconds before a version is deleted. You
can set these characteristics with alter statements in HBase Shell when creating a new column family,
and you can use the same syntax when creating a new table or altering an existing column family. This
is only a fraction of the options you can specify for a given column family.
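The same column-family characteristics can also be set through the Java Admin API. A minimal sketch, assuming an existing table t1 with a column family cf1 (names and values are illustrative only, and the connection comes from the earlier example):
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;

Admin admin = connection.getAdmin();
HColumnDescriptor cf = new HColumnDescriptor("cf1");
cf.setMaxVersions(5);      // keep at most five versions of each cell
cf.setMinVersions(2);      // always keep at least two versions, even past the TTL
cf.setTimeToLive(86400);   // expire versions older than one day (in seconds)
admin.modifyColumn(TableName.valueOf("t1"), cf);
admin.close();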
HBase sorts the versions of a cell from newest to oldest, by sorting the timestamps lexicographically.
When a version needs to be deleted because a threshold has been reached, HBase always chooses the
"oldest" version, even if it is in fact the most recent version to be inserted. Keep this in mind when
designing your timestamps. Consider using the default generated timestamps and storing other version-
specific data elsewhere in the row, such as in the row key. If MIN_VERSIONS and TTL conflict,
MIN_VERSIONS takes precedence.
Deletion
When you request that HBase delete data, either explicitly using a Delete method or implicitly
using a threshold such as the maximum number of versions or the TTL, HBase does not delete
the data immediately. Instead, it writes a deletion marker, called a tombstone, to the HFile, which
is the physical file where a given RegionServer stores its region of a column family. The
tombstone markers are processed during major compaction operations, when HFiles are rewritten
without the deleted data included.
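For example, a minimal sketch (row keys, family, and qualifier are assumed, and the table comes from the earlier examples) of explicit deletes; both calls only write tombstones, and the data is physically removed at the next major compaction:
Delete oneColumn = new Delete(Bytes.toBytes("row1"));
oneColumn.addColumns(Bytes.toBytes("cf"), Bytes.toBytes("col"));  // all versions of one column
table.delete(oneColumn);

Delete wholeRow = new Delete(Bytes.toBytes("row2"));   // tombstones the entire row
table.delete(wholeRow);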
Even after major compactions, "deleted" data may not actually be deleted. You can specify the
KEEP_DELETED_CELLS option for a given column family, and the tombstones will be
preserved in the HFile even after major compaction. One scenario where this approach might be
useful is for data retention policies.
Another reason deleted data may not actually be deleted is if the data would be required to
restore a table from a snapshot which has not been deleted. In this case, the data is moved to an
archive during a major compaction, and only deleted when the snapshot is deleted. This is a good
reason to monitor the number of snapshots saved in HBase.
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/admin_writing_data_to_hbase.html
Importing Data Into HBase
The method you use for importing data into HBase depends on several factors:
● To migrate data between HBase versions that are not wire-compatible, such as from CDH 4 to
CDH 5, see Importing HBase Data From CDH 4 to CDH 5.
Using CopyTable
CopyTable uses HBase read and write paths to copy part or all of a table to a new table in
either the same cluster or a different cluster. CopyTable causes read load when reading from
the source, and write load when writing to the destination. Region splits occur on the destination
table in real time as needed. To avoid these issues, use snapshot and export commands
instead of CopyTable. Alternatively, you can pre-split the destination table to avoid excessive
splits. The destination table can be partitioned differently from the source table. See this section
of the Apache HBase documentation for more information.
Edits to the source table after the CopyTable starts are not copied, so you may need to do an
additional CopyTable operation to copy new data into the destination table. Run CopyTable as
follows, using --help to see details about possible parameters.
The following example creates a new table using HBase Shell in non-interactive mode, and then
copies data in three column families, in rows starting with timestamp 1265875194289 and
including the last row before the CopyTable started, to the new table.
$ echo "create 'NewTestTable', 'cf1', 'cf2', 'cf3'" | bin/hbase shell --non-interactive
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 \
    --families=cf1,cf2,cf3 --new.name=NewTestTable TestTable
In CDH 5, snapshots are recommended instead of CopyTable for most situations.
Using Snapshots
As of CDH 4.7, Cloudera recommends snapshots instead of CopyTable where possible. A
snapshot captures the state of a table at the time the snapshot was taken. Because no data is
copied when a snapshot is taken, the process is very quick. As long as the snapshot exists, cells
in the snapshot are never deleted from HBase, even if they are explicitly deleted by the API.
Instead, they are archived so that the snapshot can restore the table to its state at the time of the
snapshot.
After taking a snapshot, use the clone_snapshot command to copy the data to a new
(immediately enabled) table in the same cluster, or the Export utility to create a new table based
on the snapshot, in the same cluster or a new cluster. This is a copy-on-write operation. The new
table shares HFiles with the original table until writes occur in the new table but not the old
table, or until a compaction or split occurs in either of the tables. This can improve performance
in the short term compared to CopyTable.
To export the snapshot to a new cluster, use the ExportSnapshot utility, which uses
MapReduce to copy the snapshot to the new cluster. Run the ExportSnapshot utility on the
source cluster, as a user with HBase and HDFS write permission on the destination cluster, and
HDFS read permission on the source cluster. This creates the expected amount of IO load on the
destination cluster. Optionally, you can limit bandwidth consumption, which affects IO on the
destination cluster. After the ExportSnapshot operation completes, you can see the snapshot in
the new cluster using the list_snapshots command, and you can use the
clone_snapshot command to create the table in the new cluster from the snapshot.
For full instructions for the snapshot and clone_snapshot HBase Shell commands, run
the HBase Shell and type help snapshot. The following example takes a snapshot of a table,
uses it to clone the table to a new table in the same cluster, and then uses the
ExportSnapshot utility to copy the table to a different cluster, with 16 mappers and limited
to 200 Mb/sec bandwidth.
$ bin/hbase shell
hbase(main):005:0> snapshot 'TestTable', 'TestTableSnapshot'
0 row(s) in 2.3290 seconds
Using BulkLoad
HBase uses the well-known HFile format to store its data on disk. In many situations, writing HFiles
programmatically with your data, and bulk-loading that data into HBase on the RegionServer, has
advantages over other data ingest mechanisms. BulkLoad operations bypass the write path completely,
providing the following benefits:
● The data is available to HBase immediately but does not cause additional load or latency on the
cluster when it appears.
● BulkLoad operations do not use the write-ahead log (WAL) and do not cause flushes or split
storms.
● BulkLoad operations do not cause excessive garbage collection.
Note: Because they bypass the WAL, BulkLoad operations are not propagated between clusters
using replication. If you need the data on all replicated clusters, you must perform the BulkLoad
on each cluster.
If you use BulkLoads with HBase, your workflow is similar to the following:
1. Extract your data from its existing source. For instance, if your data is in a MySQL
database, you might run the mysqldump command. The process you use depends on
your data. If your data is already in TSV or CSV format, skip this step and use the
included ImportTsv utility to process your data into HFiles. See the ImportTsv
documentation for details.
2. Process your data into HFile format. See
http://hbase.apache.org/book.html#_hfile_format_2 for details about the HFile format.
Usually you use a MapReduce job for the conversion, and you often need to write the
Mapper yourself because your data is unique. The job must emit the row key as the
Key, and either a KeyValue, a Put, or a Delete as the Value. The Reducer is
handled by HBase; configure it using HFileOutputFormat.configureIncrementalLoad(),
which does the following (a minimal driver sketch appears after this list):
o Inspects the table to configure a total order partitioner
o Uploads the partitions file to the cluster and adds it to the DistributedCache
o Sets the number of reduce tasks to match the current number of regions
o Sets the output key/value class to match HFileOutputFormat requirements
o Sets the Reducer to perform the appropriate sorting (either
KeyValueSortReducer or PutSortReducer)
3. One HFile is created per region in the output folder. Input data is almost completely
re-written, so you need available disk space at least twice the size of the original data set.
For example, for a 100 GB output from mysqldump, you should have at least 200 GB
of available disk space in HDFS. You can delete the original input file at the end of the
process.
4. Load the files into HBase. Use the LoadIncrementalHFiles command (more
commonly known as the completebulkload tool), passing it a URL that locates the files in
HDFS. Each file is loaded into the relevant region on the RegionServer for the region.
You can limit the number of versions that are loaded by passing the --versions=N
option, where N is the maximum number of versions to include, from newest to oldest
(largest timestamp to smallest timestamp).
5. If a region was split after the files were created, the tool automatically splits the HFile
according to the new boundaries. This process is inefficient, so if your table is being
written to by other processes, you should load as soon as the transform step is done.
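A rough driver sketch for steps 2 and 4 follows, shown as a fragment. The mapper class, driver class, paths, and table name are hypothetical, and HFileOutputFormat2 is used in place of HFileOutputFormat, as in more recent HBase versions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = HBaseConfiguration.create();
Job job = Job.getInstance(conf, "bulkload-prepare");
job.setJarByClass(BulkLoadDriver.class);            // hypothetical driver class
job.setMapperClass(MyTsvToPutMapper.class);         // user-written Mapper (step 2)
job.setMapOutputKeyClass(ImmutableBytesWritable.class);   // the row key
job.setMapOutputValueClass(Put.class);                    // or KeyValue

Connection connection = ConnectionFactory.createConnection(conf);
TableName tableName = TableName.valueOf("TestTable");
Table table = connection.getTable(tableName);
RegionLocator locator = connection.getRegionLocator(tableName);
// Wires in the total order partitioner, reducer, and output classes described above.
HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

FileInputFormat.addInputPath(job, new Path("/imported_data/input"));
FileOutputFormat.setOutputPath(job, new Path("/imported_data/hfiles"));
job.waitForCompletion(true);

// Step 4: completebulkload moves the generated HFiles into the running regions.
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path("/imported_data/hfiles"), connection.getAdmin(), table, locator);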
Using Cluster Replication
If your data is already in an HBase cluster, replication is useful for getting the data into
additional HBase clusters. Common replication topologies include:
● A central source cluster might propagate changes to multiple destination clusters, for failover or
due to geographic distribution.
● A source cluster might push changes to a destination cluster, which might also push its own
changes back to the original cluster.
● Many different low-latency clusters might push changes to one centralized cluster for backup or
resource-intensive data-analytics jobs. The processed data might then be replicated back to the
low-latency clusters.
● Multiple levels of replication can be chained together to suit your needs. The following diagram
shows a hypothetical scenario. Use the arrows to follow the data paths.
At the top of the diagram, the San Jose and Tokyo clusters, shown in red, replicate changes
to each other, and each also replicates changes to a User Data and a Payment Data cluster.
Each cluster in the second row, shown in blue, replicates its changes to the All Data
Backup 1 cluster, shown in grey. The All Data Backup 1 cluster replicates changes to
the All Data Backup 2 cluster (also shown in grey), as well as the Data Analysis
cluster (shown in green). All Data Backup 2 also propagates any of its own changes back
to All Data Backup 1.
The Data Analysis cluster runs MapReduce jobs on its data, and then pushes the processed
data back to the San Jose and Tokyo clusters.
Using Pig and HCatalog
Apache Pig is a platform for analyzing large data sets using a high-level language. Apache HCatalog is a
sub-project of Apache Hive, which enables reading and writing of data from one Hadoop utility to
another. You can use a combination of Pig and HCatalog to import data into HBase. The initial format of
your data and other details about your infrastructure determine the steps you follow to accomplish this
task. The following simple example assumes that you can get your data into a delimited text
format, such as a tab-delimited (TSV) or comma-delimited text file.
1. Format the data as a TSV file. You can work with other file formats; see the Pig and HCatalog
project documentation for more details.
The following example shows a subset of data from Google's NGram Dataset, which
shows the frequency of specific phrases or letter-groupings found in publications indexed
by Google. Here, the first column has been added to this dataset as the row ID. The first
column is formulated by combining the n-gram itself (in this case, Zones) with the line
number of the file in which it occurs (z_LINE_NUM). This creates a format such as
"Zones_z_6230867." The second column is the n-gram itself, the third column is the
year of occurrence, the fourth column is the frequency of occurrence of that Ngram in
that year, and the fifth column is the number of distinct publications. This extract is from
the z file of the 1-gram dataset from version 20120701. The data is truncated at the ...
mark, for the sake of readability of this document. In most real-world scenarios, you will
not work with tables that have five columns. Most HBase tables have one or two
columns.
Zones_z_6230867 Zones 1507 1 1
Zones_z_6230868 Zones 1638 1 1
Zones_z_6230869 Zones 1656 2 1
Zones_z_6230870 Zones 1681 8 2
...
Zones_z_6231150 Zones 1996 17868 4356
Zones_z_6231151 Zones 1997 21296 4675
Zones_z_6231152 Zones 1998 20365 4972
Zones_z_6231153 Zones 1999 20288 5021
Zones_z_6231154 Zones 2000 22996 5714
Zones_z_6231155 Zones 2001 20469 5470
Zones_z_6231156 Zones 2002 21338 5946
Zones_z_6231157 Zones 2003 29724 6446
Zones_z_6231158 Zones 2004 23334 6524
Zones_z_6231159 Zones 2005 24300 6580
Zones_z_6231160 Zones 2006 22362 6707
Zones_z_6231161 Zones 2007 22101 6798
Zones_z_6231162 Zones 2008 21037 6328
2. Using the hadoop fs command, put the data into HDFS. This example places the file into an
/imported_data/ directory.
3. Create a DDL file (for example, zones_frequency_table.ddl) that defines the HBase-backed table, and
run it with hcat:
CREATE TABLE
zones_frequency_table (id STRING, ngram STRING, year STRING, freq STRING, sources STRING)
STORED BY 'org.apache.hcatalog.hbase.HBaseHCatStorageHandler'
TBLPROPERTIES (
  'hbase.table.name' = 'zones_frequency_table',
  'hbase.columns.mapping' = 'd:ngram,d:year,d:freq,d:sources',
  'hcat.hbase.output.bulkMode' = 'true'
);
$ hcat -f zones_frequency_table.ddl
4. Create a Pig file to process the TSV file created in step 1, using the DDL file created in step 3.
Modify the file names and other parameters in this command to match your values if you use
data different from this working example. USING PigStorage('\t') indicates that the
input file is tab-delimited. For more details about Pig syntax, see the Pig Latin reference
documentation.
5. Use the pig command to bulk-load the data into HBase.
Using the Java API
The following fragment uses the HBase Java API (the older HTable interface) to write a batch of Puts
to a table.
...
HTable table = null;
try {
  table = myCode.createTable(tableName, fam);
  int i = 1;
  List<Put> puts = new ArrayList<Put>();
  // Build one Put per label, using "row1", "row2", ... as row keys.
  for (String labelExp : labelExps) {
    Put put = new Put(Bytes.toBytes("row" + i));
    put.add(fam, qual, HConstants.LATEST_TIMESTAMP, value);
    puts.add(put);
    i++;
  }
  // Send all Puts to the region servers in one batch.
  table.put(puts);
} finally {
  if (table != null) {
    // Flush any client-side buffered writes before releasing the table.
    table.flushCommits();
  }
}
...
Using the Thrift Proxy API
The following commands generate Python bindings from the Hbase.thrift definition and copy the Thrift
Python library alongside them.
$ mkdir HBaseThrift
$ cd HBaseThrift/
$ thrift -gen py /path/to/Hbase.thrift
$ mv gen-py/* .
$ rm -rf gen-py/
$ mkdir thrift
$ cp -rp ~/Downloads/thrift-0.9.0/lib/py/src/* ./thrift/
The following example shows a simple Python application using the Thrift Proxy API.
mutations = [
Hbase.Mutation(column=messagecolumncf, value=line.strip()),
Hbase.Mutation(column=linenumbercolumncf,
value=encode(linenumber)),
Hbase.Mutation(column=usernamecolumncf, value=username)
]
mutationsbatch.append(Hbase.BatchMutation(row=rowkey,mutations=mutations))
transport.close()
The Thrift Proxy API does not support writing to HBase clusters that are secured using Kerberos.
This example was modified from the following two blog posts on http://www.cloudera.com. See
them for more details.
Using Flume
Apache Flume can stream events into HBase using the AsyncHBaseSink together with a custom event
serializer such as the following.
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.AsyncHbaseEventSerializer;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.PutRequest;

/**
 * A serializer for the AsyncHBaseSink, which splits the event body into
 * multiple columns and inserts them into a row whose key is available in
 * the headers.
 */
public class SplittingSerializer implements AsyncHbaseEventSerializer {
  private byte[] table;
  private byte[] colFam;
  private Event currentEvent;
  private byte[][] columnNames;
  private final List<PutRequest> puts = new ArrayList<PutRequest>();
  private final List<AtomicIncrementRequest> incs =
      new ArrayList<AtomicIncrementRequest>();
  private byte[] currentRowKey;
  private final byte[] eventCountCol = "eventCount".getBytes();

  @Override
  public void initialize(byte[] table, byte[] cf) {
    this.table = table;
    this.colFam = cf;
  }

  @Override
  public void setEvent(Event event) {
    // Set the event and verify that the row key is present in the headers.
    this.currentEvent = event;
    String rowKeyStr = currentEvent.getHeaders().get("rowKey");
    if (rowKeyStr == null) {
      throw new FlumeException("No row key found in headers!");
    }
    currentRowKey = rowKeyStr.getBytes();
  }

  @Override
  public List<PutRequest> getActions() {
    // Split the event body and get the values for the columns.
    String eventStr = new String(currentEvent.getBody());
    String[] cols = eventStr.split(",");
    puts.clear();
    for (int i = 0; i < cols.length; i++) {
      // Generate a PutRequest for each column.
      PutRequest req = new PutRequest(table, currentRowKey, colFam,
          columnNames[i], cols[i].getBytes());
      puts.add(req);
    }
    return puts;
  }

  @Override
  public List<AtomicIncrementRequest> getIncrements() {
    incs.clear();
    // Increment the number of events received.
    incs.add(new AtomicIncrementRequest(table, "totalEvents".getBytes(),
        colFam, eventCountCol));
    return incs;
  }

  @Override
  public void cleanUp() {
    table = null;
    colFam = null;
    currentEvent = null;
    columnNames = null;
    currentRowKey = null;
  }

  @Override
  public void configure(Context context) {
    // Get the column names from the configuration and assign them to the
    // columnNames field (do not shadow the field with a local variable).
    String cols = context.getString("columns");
    String[] names = cols.split(",");
    columnNames = new byte[names.length][];
    int i = 0;
    for (String name : names) {
      columnNames[i++] = name.getBytes();
    }
  }

  @Override
  public void configure(ComponentConfiguration conf) {
  }
}
Using Spark
You can write data to HBase from Apache Spark by using def saveAsHadoopDataset(conf:
JobConf): Unit. This example is adapted from a post on the spark-users mailing list.
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.client._
// ... some other settings
new PairRDDFunctions(localData.map(convert)).saveAsHadoopDataset(jobConfig)
package org.apache.spark.streaming.examples
import java.util.Properties
import kafka.producer._
object MetricAggregatorHBase {
def main(args : Array[String]) {
if (args.length < 6) {
System.err.println("Usage: MetricAggregatorTest <master> <zkQuorum>
<group> <topics> <destHBaseTableName> <numThreads>")
System.exit(1)
}
ssc.start
ssc.awaitTermination
}
record.add(Bytes.toBytes("metric"), Bytes.toBytes("col"),
Bytes.toBytes(value.toString))
producer.send(messages : _*)
Thread.sleep(100)
}
}
}
Error in Hadoop
Finally found the solution: the bashrc file had somehow become corrupted, so the fix is to restore a
fresh bashrc file from the system and then set up the paths again.
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
http://askubuntu.com/questions/319882/problem-in-bashrc
Now the major task is to set up all the paths in the bashrc file again:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop-2.6.2
export HIVE_HOME=/usr/lib/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin
export PATH=$PATH:$HADOOP_HOME/bin
export FLUME_HOME=/usr/local/flume
export FLUME_CONF_DIR=$FLUME_HOME/conf
export FLUME_CLASSPATH=$FLUME_CONF_DIR
export PATH=$PATH:$FLUME_HOME/bin
Starting to test Hadoop