HBase Installation and Data Storage

HBase version: 1.2.4
Hadoop version: 2.7.3

Apache HBase is a member of the Hadoop product family. HBase is modeled after Google's Bigtable. I am interested in finding out how capable HBase is at handling data other than billions of web pages.

I have discussed the installation of Hadoop, including Yarn, and this blog uses the same setup. On top of it, we will add HBase 1.2.4. We have 6 nodes in total: nnode1 as the master node, nnode2 as the secondary (backup) master node, and dnode1 to dnode4 for data. For a distributed deployment, we also need Zookeeper, which keeps key metadata for the cluster. Zookeeper needs a quorum, a group of members, to function: the ensemble stays available as long as a simple majority of its members are up, and a failed member catches up again after it recovers. So a quorum of 3 allows only 1 failure, and a quorum of 5 allows 2 failures. A quorum of 4 is no better than a quorum of 3, because the simple majority of 4 is 3, so it still allows only 1 node failure. In this case, I chose 5.

Step 1: install HBase

The stable release can be found on the Apache HBase download page. At the time of writing, it is 1.2.4.

On nnode1, while logged in as the hadoop user, download the tarball and execute the following in a terminal.

[hadoop@nnode1 ~]$ cd Downloads/
[hadoop@nnode1 Downloads]$ sudo mv hbase-1.2.4-bin.tar.gz /opt/app
[hadoop@nnode1 Downloads]$ cd /opt/app
[hadoop@nnode1 app]$ sudo tar xvfz hbase-1.2.4-bin.tar.gz
[hadoop@nnode1 app]$ sudo mv hbase-1.2.4-bin hbase-1.2.4
[hadoop@nnode1 app]$ sudo chown -R hadoop:dev hbase-1.2.4

Step 2: configure HBase

Add the following to .bashrc or .bash_profile (one of them only; .bashrc is executed every time a new interactive shell starts, while .bash_profile is executed only for login shells, such as when you log in to the Linux desktop or first connect to the host via SSH).

export HBASE_HOME=/opt/app/hbase-1.2.4
export PATH=$PATH:$HBASE_HOME/bin
export CLASSPATH=$CLASSPATH:$HBASE_HOME/lib/*
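
As a quick sanity check (not part of the original steps), reload the profile and confirm the hbase command is on the PATH:

[hadoop@nnode1 ~]$ source ~/.bashrc
[hadoop@nnode1 ~]$ hbase version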

Increase user limits by adding the following to /etc/security/limits.conf.

hadoop - nofile 32768
hadoop - nproc 32000
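
After logging out and back in as hadoop, you can confirm the new limits are in effect; ulimit -n should report 32768 (open files) and ulimit -u should report 32000 (processes):

[hadoop@nnode1 ~]$ ulimit -n
[hadoop@nnode1 ~]$ ulimit -u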

On nnode1, while logged in as hadoop, add the following to $HBASE_HOME/conf/hbase-site.xml.

<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://nnode1:9000/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/opt/data/zookeeper</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>nnode1,nnode2,dnode1,dnode2,dnode3</value>
</property>
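
The startup scripts also need to know where to run the region servers and the backup master, which they read from $HBASE_HOME/conf/regionservers and $HBASE_HOME/conf/backup-masters. These two files are not shown in my original notes, so the contents below simply reflect the node roles described earlier:

$HBASE_HOME/conf/regionservers:
dnode1
dnode2
dnode3
dnode4

$HBASE_HOME/conf/backup-masters:
nnode2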

Next, we will create the data folders for Zookeeper.

[hadoop@nnode1 ~]$ su -
[root@nnode1 ~]# mkdir /opt/data/zookeeper
[root@nnode1 ~]# chown hadoop:dev /opt/data/zookeeper
[root@nnode1 ~]# ssh nnode2 'mkdir /opt/data/zookeeper && chown hadoop:dev /opt/data/zookeeper'
[root@nnode1 ~]# ssh dnode1 'mkdir /opt/data/zookeeper && chown hadoop:dev /opt/data/zookeeper'
[root@nnode1 ~]# ssh dnode2 'mkdir /opt/data/zookeeper && chown hadoop:dev /opt/data/zookeeper'
[root@nnode1 ~]# ssh dnode3 'mkdir /opt/data/zookeeper && chown hadoop:dev /opt/data/zookeeper'
[root@nnode1 ~]# ssh dnode4 'mkdir /opt/data/zookeeper && chown hadoop:dev /opt/data/zookeeper'

Next, we will synchronize the configuration to other nodes.

On nnode1, while logged in as root, do the following:

for i in `cat /home/hadoop/mysites | grep -v nnode1`; do
rsync -avzhe ssh /etc/security/limits.conf $i:/etc/security
rsync -avzhe ssh /opt/app/hbase-1.2.4 $i:/opt/app
done

On nnode1, while logged in as hadoop, do the following:
for i in `cat /home/hadoop/mysites | grep -v nnode1`; do
rsync -avzhe ssh /home/hadoop/.bashrc $i:/home/hadoop
rsync -avzhe ssh /home/hadoop/.bash_profile $i:/home/hadoop
done

On nnode1, while logged in as hadoop, do the following to create the HBase root directory on HDFS and an intermediate data staging area.

[hadoop@nnode1 ~]$ hadoop fs -mkdir /user/hadoop/hbasetest
[hadoop@nnode1 ~]$ hadoop fs -mkdir /hbase
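
A quick listing verifies both directories are in place before HBase is started (a simple check, not part of the original session):

[hadoop@nnode1 ~]$ hadoop fs -ls /
[hadoop@nnode1 ~]$ hadoop fs -ls /user/hadoop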

Step 3: start HBase

On nnode1, while logged in as hadoop, start HDFS with:

start-dfs.sh

On nnode2, while logged in as hadoop, start Yarn with the following. Note that we are using nnode2 as the resource manager.

start-yarn.sh

On nnode1, while logged in as hadoop, start HBase with:

start-hbase.sh

That will output messages like those in the screenshot. Note the HBase startup sequence:

- start the Zookeepers first on the 5 designated nodes
- start the first master node
- start the region servers on all data nodes
- start the second master node (the backup master)
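
To verify the processes, you can run jps on each node (this check is not shown in the original session): on nnode1 you should see an HMaster plus an HQuorumPeer (the Zookeeper instance managed by HBase), on nnode2 the backup HMaster plus an HQuorumPeer, and on each data node an HRegionServer (plus an HQuorumPeer on the nodes in the Zookeeper quorum).

[hadoop@nnode1 ~]$ jps
[hadoop@dnode1 ~]$ jps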

To check the health of the HBase cluster, run the following in the HBase command line, which is started with "hbase shell". At the prompt, type "status", which shows a summary of the system; "status 'detailed'" provides more detailed information. Note how the commands work here: in many cases, command parameters need to be in quotes. The "list" command lists the tables in the default database; in this case, we don't have any yet.

[hadoop@nnode1 ~]$ hbase shell
hbase(main):001:0> status
1 active master, 1 backup masters, 4 servers, 0 dead, 1.5000 average load
hbase(main):011:0> status 'detailed'
hbase(main):011:0> list

Next, check the configuration from the HBase administration web UI at http://nnode1:16010/.

Next, check the default folders created by HBase under the HBase root directory.

[hadoop@nnode1 ~]$ hadoop fs -ls /hbase
Found 9 items
drwxr-xr-x - hadoop supergroup 0 2017-01-27 10:59 /hbase/.tmp
drwxr-xr-x - hadoop supergroup 0 2017-01-27 10:59 /hbase/MasterProcWALs
drwxr-xr-x - hadoop supergroup 0 2017-01-27 10:59 /hbase/WALs
drwxr-xr-x - hadoop supergroup 0 2017-01-27 11:10 /hbase/archive
drwxr-xr-x - hadoop supergroup 0 2017-01-26 16:42 /hbase/corrupt
drwxr-xr-x - hadoop supergroup 0 2017-01-26 16:43 /hbase/data
-rw-r--r-- 3 hadoop supergroup 42 2017-01-26 09:46 /hbase/hbase.id
-rw-r--r-- 3 hadoop supergroup 7 2017-01-26 09:46 /hbase/hbase.version
drwxr-xr-x - hadoop supergroup 0 2017-01-27 11:09 /hbase/oldWALs

/hbase/data stores the actual data files (store files). Data files are organized by database, officially called a "namespace". When the system is started, there are two default namespaces, "default" and "hbase".

[hadoop@nnode1 ~]$ hadoop fs -ls /hbase/data
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2017-01-26 09:46 /hbase/data/default
drwxr-xr-x - hadoop supergroup 0 2017-01-26 09:46 /hbase/data/hbase
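
The same two namespaces can also be listed from the HBase shell (this check is not in the original session; output abbreviated):

hbase(main):012:0> list_namespace
NAMESPACE
default
hbase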

Step 4: load HBase

Next, we will load some data into HBase. We will create a database (namespace) called dynabi, and then create a table called sales. The table has the following structure; I have added one sample value for each field. HBase uses a column-family-based storage system: data are stored by column families, which are groups of columns, and within each column family the data are physically ordered by row key.

HBASE_ROW_KEY
  SALES_ID            19980629:4115:125:3:999
customer (column family)
  CUST_ID             4115
  CUST_EMAIL          Aaron@company.example.com
  CUST_GENDER         M
  CUST_POSTAL_CODE    78247
product (column family)
  PROD_ID             125
  PROD_NAME           1/2" Bulk diskettes, Box of 50
  PROD_CATEGORY       Software/Other
sales (column family)
  CHANNEL_ID          3
  PROMO_ID            999
  QUANTITY_SOLD       1
  AMOUNT_SOLD         16.63

Execute the following in the hbase shell to create the table, and do a list to confirm that the table is created.

hbase(main):015:0> create_namespace 'dynabi'
hbase(main):015:0> create 'dynabi:sales','customer','product','sales'
hbase(main):015:0> list
TABLE
dynabi:sales
1 row(s) in 0.0760 seconds
=> ["dynabi:sales"]
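
You can also inspect the table definition with describe, which shows the three column families and their default settings (this step is not in the original session):

hbase(main):016:0> describe 'dynabi:sales'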

Now, you will see three databases (namespaces) under /hbase/data in HDFS.

[hadoop@nnode1 ~]$ hadoop fs -ls /hbase/data
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2017-01-26 09:46 /hbase/data/default
drwxr-xr-x - hadoop supergroup 0 2017-01-27 09:33 /hbase/data/dynabi
drwxr-xr-x - hadoop supergroup 0 2017-01-26 09:46 /hbase/data/hbase

We have dumped some sales sample data and saved the files under /home/hadoop.

[hadoop@nnode1 ~]$ ls -ltr sales1998*
-rw-r--r--. 1 hadoop dev 1832362 Jan 27 09:25 sales199802.txt
-rw-r--r--. 1 hadoop dev 1835671 Jan 27 09:28 sales199803.txt
-rw-r--r--. 1 hadoop dev 1520871 Jan 27 09:42 sales199804.txt
-rw-r--r--. 1 hadoop dev 1593157 Jan 27 10:00 sales199805.txt
-rw-r--r--. 1 hadoop dev 1498920 Jan 27 10:03 sales199806.txt

To load these files into HBase, we will stage them on HDFS first using the HDFS put command.

[hadoop@nnode1 ~]$ hadoop fs -put /home/hadoop/sales1998* /user/hadoop/hbasetest
[hadoop@nnode1 ~]$ hadoop fs -ls /user/hadoop/hbasetest
Found 5 items
-rw-r--r-- 3 hadoop supergroup 1832362 2017-01-27 11:31 /user/hadoop/hbasetest/sales199802.txt
-rw-r--r-- 3 hadoop supergroup 1835671 2017-01-27 11:31 /user/hadoop/hbasetest/sales199803.txt
-rw-r--r-- 3 hadoop supergroup 1520871 2017-01-27 11:31 /user/hadoop/hbasetest/sales199804.txt
-rw-r--r-- 3 hadoop supergroup 1593157 2017-01-27 11:31 /user/hadoop/hbasetest/sales199805.txt
-rw-r--r-- 3 hadoop supergroup 1498920 2017-01-27 11:31 /user/hadoop/hbasetest/sales199806.txt

Once the files are staged on HDFS, we can use the importtsv utility to load the data into the HBase table. Note that data in the files are tab-separated, which is importtsv's default separator.

[hadoop@nnode1 ~]$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,customer:cust_id,customer:email,customer:gender,customer:postal_code,product:product_id,product:name,product:category,sales:channel_id,sales:promo_id,sales:quantity,sales:amount \
  'dynabi:sales' /user/hadoop/hbasetest
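
To spot-check the load once the MapReduce job finishes, you can scan a couple of rows and count the table from the HBase shell (these checks are not part of the original session; count can take a while on larger tables):

hbase(main):020:0> scan 'dynabi:sales', {LIMIT => 2}
hbase(main):021:0> count 'dynabi:sales'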

You can now examine the store files of HBase from the HBase administration web UI or using HDFS commands.

Here you see how the table is split into different regions, and how the regions are stored on 3 region servers. By default a table starts with 1 region, and it only begins splitting after it grows to a certain size. In this case, I manually split the table into 4 regions before loading the data.
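
For reference, a manual pre-split can be issued from the HBase shell with the split command; the split points below are hypothetical row-key prefixes chosen to match the SALES_ID key format, not necessarily the ones I used:

hbase(main):030:0> split 'dynabi:sales', '19980401'
hbase(main):031:0> split 'dynabi:sales', '19980501'
hbase(main):032:0> split 'dynabi:sales', '19980601'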

Using HDFS fs commands, you can better understand how HBase stores data by column family. In the following, I removed some lines to keep the output clean.

[hadoop@nnode1 ~]$ hadoop fs -ls -R /hbase/data/dynabi/sales
drwxr-xr-x - hadoop supergroup 0 2017-01-27 09:39 /hbase/data/dynabi/sales/74f0b40dde4587d594b67cfdfa86f480/customer
-rw-r--r-- 3 hadoop supergroup 1925648 2017-01-27 09:39 /hbase/data/dynabi/sales/74f0b40dde4587d594b67cfdfa86f480/customer/b8c508806f8748d8b463645fe4635f42
drwxr-xr-x - hadoop supergroup 0 2017-01-27 09:39 /hbase/data/dynabi/sales/74f0b40dde4587d594b67cfdfa86f480/product
-rw-r--r-- 3 hadoop supergroup 1567268 2017-01-27 09:39 /hbase/data/dynabi/sales/74f0b40dde4587d594b67cfdfa86f480/product/2ba7483609444ddb8fa00611fc5807eb
drwxr-xr-x - hadoop supergroup 0 2017-01-27 09:39 /hbase/data/dynabi/sales/74f0b40dde4587d594b67cfdfa86f480/sales
-rw-r--r-- 3 hadoop supergroup 1675165 2017-01-27 09:39 /hbase/data/dynabi/sales/74f0b40dde4587d594b67cfdfa86f480/sales/70d0cde1dd5d47a0a0b3ca053abb586a
drwxr-xr-x - hadoop supergroup 0 2017-01-27 09:39 /hbase/data/dynabi/sales/a941b753dad82b11fdd172fec29ec84f/customer
-rw-r--r-- 3 hadoop supergroup 1991644 2017-01-27 09:39 /hbase/data/dynabi/sales/a941b753dad82b11fdd172fec29ec84f/customer/c44ed0041ef14831aaa4f9bbd362f1d7
drwxr-xr-x - hadoop supergroup 0 2017-01-27 09:39 /hbase/data/dynabi/sales/a941b753dad82b11fdd172fec29ec84f/product
-rw-r--r-- 3 hadoop supergroup 1611997 2017-01-27 09:39 /hbase/data/dynabi/sales/a941b753dad82b11fdd172fec29ec84f/product/9b3b5347c3b944058c9eff606a759435
drwxr-xr-x - hadoop supergroup 0 2017-01-27 09:39 /hbase/data/dynabi/sales/a941b753dad82b11fdd172fec29ec84f/sales
-rw-r--r-- 3 hadoop supergroup 1731788 2017-01-27 09:39 /hbase/data/dynabi/sales/a941b753dad82b11fdd172fec29ec84f/sales/9c9c3d26ca174795a5a4661689495f23
drwxr-xr-x - hadoop supergroup 0 2017-01-27 09:39 /hbase/data/dynabi/sales/cba9f583b476219b4105bccf0c288340/customer
-rw-r--r-- 3 hadoop supergroup 1991582 2017-01-27 09:39 /hbase/data/dynabi/sales/cba9f583b476219b4105bccf0c288340/customer/b144e837380b440d82825f94d7f574fe
drwxr-xr-x - hadoop supergroup 0 2017-01-27 09:39 /hbase/data/dynabi/sales/cba9f583b476219b4105bccf0c288340/product
-rw-r--r-- 3 hadoop supergroup 1621641 2017-01-27 09:39 /hbase/data/dynabi/sales/cba9f583b476219b4105bccf0c288340/product/edb6fc43d63740c5ba7d4ccb9914005e
drwxr-xr-x - hadoop supergroup 0 2017-01-27 09:39 /hbase/data/dynabi/sales/cba9f583b476219b4105bccf0c288340/sales
-rw-r--r-- 3 hadoop supergroup 1731688 2017-01-27 09:39 /hbase/data/dynabi/sales/cba9f583b476219b4105bccf0c288340/sales/0c13121a514c42be9a263131d23c1835
drwxr-xr-x - hadoop supergroup 0 2017-01-27 10:58 /hbase/data/dynabi/sales/d12de2806c12ec3c63582cc4a9ed3dc4/customer
-rw-r--r-- 3 hadoop supergroup 1950428 2017-01-27 09:39 /hbase/data/dynabi/sales/d12de2806c12ec3c63582cc4a9ed3dc4/customer/136d9bbcdf454cab88661805f4e148a3
-rw-r--r-- 3 hadoop supergroup 9816731 2017-01-27 10:58 /hbase/data/dynabi/sales/d12de2806c12ec3c63582cc4a9ed3dc4/customer/633c932766254b9fb9cd43346a1ba7a6
drwxr-xr-x - hadoop supergroup 0 2017-01-27 10:58 /hbase/data/dynabi/sales/d12de2806c12ec3c63582cc4a9ed3dc4/product
-rw-r--r-- 3 hadoop supergroup 7985693 2017-01-27 10:58 /hbase/data/dynabi/sales/d12de2806c12ec3c63582cc4a9ed3dc4/product/b12ea291b6ef449d85c814647ed6ff28
-rw-r--r-- 3 hadoop supergroup 1586352 2017-01-27 09:39 /hbase/data/dynabi/sales/d12de2806c12ec3c63582cc4a9ed3dc4/product/d372e5d9d77c4995aaa8b2cf71371ca3
drwxr-xr-x - hadoop supergroup 0 2017-01-27 10:58 /hbase/data/dynabi/sales/d12de2806c12ec3c63582cc4a9ed3dc4/sales
-rw-r--r-- 3 hadoop supergroup 8531493 2017-01-27 10:58 /hbase/data/dynabi/sales/d12de2806c12ec3c63582cc4a9ed3dc4/sales/ba8f405884bf4401a00ba1b00e131175
-rw-r--r-- 3 hadoop supergroup 1695088 2017-01-27 09:39 /hbase/data/dynabi/sales/d12de2806c12ec3c63582cc4a9ed3dc4/sales/d00ecc6a800b417688a8f4212218ee43
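
Finally, to see the column-family layout from the client side, you can fetch a single row by its key; the result is returned grouped by column family and qualifier (the key below is the sample SALES_ID from the table structure above, so substitute a key that actually exists in your data):

hbase(main):040:0> get 'dynabi:sales', '19980629:4115:125:3:999'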