Explore HDFS using Hive

Hadoop version: 2.7.3
Hadoop cluster size: 6 nodes (1 primary name node, 1 secondary name node / resource manager, 4 data nodes)
Hive version: 2.1.1
Metastore database: Oracle 12c single node RAC multitenant

At the beginning of this year, I went back to learning Java, and I was very excited about all the new features and new packages. Java has truly become the king of all programming languages, very powerful and very impressive.

I then wrote a few MapReduce programs to explore some ideas on HDFS. It turned out the conventional wisdom is correct: MapReduce is not for ordinary IT teams. For a typical non-internet company, it is very hard to have an IT team capable of writing meaningful applications in Java.

My next steps are, of course, to explore more “IT-ready” tools; I am thinking about Cloudera and Spark. But first I would like to explore a core tool that inspired them: Hive. Here we will explore HDFS using Hive.

My setup is listed at the beginning. Note that Hive needs a metastore that keeps the structural information of the data stored on HDFS (via Hive). A typical case is CSV files. The structure of a CSV file, that is, its fields and their definitions, is metadata about the file. If that metadata can be stored somewhere, then a generic engine can load and read the data without custom-built MapReduce Java code. Hive is such a generic engine: it provides a SQL-like interface and translates the SQL into MapReduce jobs.
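
As a minimal sketch of the idea (the path, file contents, and columns here are hypothetical, purely for illustration), a pipe-delimited file already sitting on HDFS can be described to Hive as an external table, after which it is queryable with SQL:

-- hypothetical: files under /data/people/ hold lines like 1|Alice|2017-01-25
CREATE EXTERNAL TABLE people (
  id          int,
  name        string,
  signup_date string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/data/people';

-- Hive compiles this into a MapReduce job behind the scenes
SELECT count(*) FROM people;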

Step 1: start Oracle database

I am using an Oracle 12c single-node RAC with a multitenant configuration, so these steps are specific to my environment. The database runs on a 2-core, 8 GB Dell Precision 2400 laptop on which I previously installed Oracle Linux 6.8.

First things first, start the Oracle database.

[oracle@dp2400 ~]$ . oraenv
ORACLE_SID = [oracle] ? orcl
The Oracle base has been set to /u01/app/oracle
[oracle@dp2400 ~]$ srvctl start database -db orcl
[oracle@dp2400 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Wed Jan 25 08:18:15 2017

Copyright (c) 1982, 2014, Oracle. All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 – 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Advanced Analytics
and Real Application Testing options

SQL>
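
Since this is a multitenant database, it is worth confirming that the pluggable database that will hold the Hive schema is open before moving on. A quick check from the same SQL*Plus session (assuming the PDB is named pdb1, as in my setup); only run the second statement if pdb1 shows MOUNTED rather than READ WRITE:

select name, open_mode from v$pdbs;
alter pluggable database pdb1 open;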

Step 2: create Hive schema on metastore database

You can run these in SQL Developer or, like me, in a SQL*Plus command window.

SQL> alter session set container=pdb1;
SQL> create user hive identified by xxxxxx;
SQL> grant connect,resource to hive;
SQL> grant select_catalog_role to hive;
SQL> alter user hive quota unlimited on users;
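
Optionally, verify that the new account can connect through the listener using the same host, port, and service name that the Hive JDBC URL will use in Step 5 (port 1539 in my environment):

sqlplus hive/xxxxxx@//dp2400:1539/pdb1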

Step 3: download and install Hive

You can download Hive from the Apache Hive download page. I used version 2.1.1. The binary tarball works fine. You can also get the source and compile it, but that is not necessary. Execute the following commands to unpack it.

[hadoop@nnode1 ~]$ sudo mv apache-hive-2.1.1-bin.tar.gz /opt/app
[hadoop@nnode1 ~]$ cd /opt/app
[hadoop@nnode1 app]$ sudo tar xvfz apache-hive-2.1.1-bin.tar.gz
[hadoop@nnode1 app]$ sudo mv apache-hive-2.1.1-bin hive-2.1.1
[hadoop@nnode1 app]$ sudo chown -R hadoop:dev hive-2.1.1

Next, add the following content to .bashrc or .bash_profile, depending on which one you use.

export HIVE_HOME=/opt/app/hive-2.1.1
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*

On nnode1, while logged in as hadoop, copy them to other nodes.

# /home/hadoop/mysites lists the cluster hostnames, one per line
for i in `cat /home/hadoop/mysites | grep -v nnode1`; do
  rsync -avzhe ssh /home/hadoop/.bashrc $i:/home/hadoop
  rsync -avzhe ssh /home/hadoop/.bash_profile $i:/home/hadoop
done

Next, install the Oracle client if you do not already have one. If you can get ojdbc7.jar from another Oracle installation, you do not need to; I copied ojdbc7.jar from my Oracle server installation.

Save ojdbc7.jar in $HIVE_HOME/lib
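
For example, something like the following, assuming the standard ORACLE_HOME layout on the database server (adjust the path to wherever your installation keeps the driver):

scp oracle@dp2400:/u01/app/oracle/product/12.1.0/dbhome_1/jdbc/lib/ojdbc7.jar $HIVE_HOME/lib/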

Step 4: create Hive metastore schema

The Hive metastore holds the schemas of the data stored on HDFS, so the metastore schema is, in effect, the schema of schemas. It can be confusing.

Connect to the Oracle database using SQL*Plus and execute $HIVE_HOME/scripts/metastore/upgrade/oracle/hive-schema-2.1.0.oracle.sql.

Note that this script calls another script, hive-txn-schema-2.1.0.oracle.sql (in the same folder), so make sure your session can access both of them.

That will create a bunch of tables and insert a few pieces of seed data.
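
As a concrete sketch (run from wherever SQL*Plus is available, with the oracle scripts folder copied there if needed, and with your own password), executing the script from its own folder lets the nested script resolve:

cd $HIVE_HOME/scripts/metastore/upgrade/oracle
sqlplus hive/xxxxxx@//dp2400:1539/pdb1 @hive-schema-2.1.0.oracle.sql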

Step 5: configure Hive

Next, copy $HIVE_HOME/conf/hive-default.xml.template to hive-site.xml, and then modify the following properties in hive-site.xml. Note that I removed the property tags to keep the list clean.

javax.jdo.option.ConnectionDriverName
oracle.jdbc.driver.OracleDriver

javax.jdo.option.ConnectionURL
jdbc:oracle:thin:hive@dp2400:1539/pdb1

javax.jdo.option.ConnectionUserName
hive

javax.jdo.option.ConnectionPassword
xxxxx

hive.metastore.orm.retrieveMapNullsAsEmptyStrings
true

hive.exec.local.scratchdir
/home/hadoop/hive/logs

hive.querylog.location
/home/hadoop/hive/logs

hive.downloaded.resources.dir
/tmp/hadoop/${hive.session.id}_resources

hive.server2.logging.operation.log.location
/tmp/hadoop/operation_logs

hive.metastore.uris
thrift://nnode1:9083
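
Since the property tags are stripped in the list above, here is what one entry looks like in its full form inside hive-site.xml (the others follow the same pattern):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://nnode1:9083</value>
</property>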

Next, run the following on nnode1 while logged in as hadoop to create the logging folders.

[hadoop@nnode1 hive-2.1.1]$ mkdir ~/hive
[hadoop@nnode1 hive-2.1.1]$ mkdir ~/hive/logs
[hadoop@nnode1 hive-2.1.1]$ ssh nnode2 "mkdir ~/hive && mkdir ~/hive/logs"
[hadoop@nnode1 hive-2.1.1]$ ssh dnode2 "mkdir ~/hive && mkdir ~/hive/logs"
[hadoop@nnode1 hive-2.1.1]$ ssh dnode1 "mkdir ~/hive && mkdir ~/hive/logs"
[hadoop@nnode1 hive-2.1.1]$ ssh dnode3 "mkdir ~/hive && mkdir ~/hive/logs"
[hadoop@nnode1 hive-2.1.1]$ ssh dnode4 "mkdir ~/hive && mkdir ~/hive/logs"

Next, on nnode1 while logged in as hadoop, copy the hive files to other nodes.

for i in `cat /home/hadoop/mysites | grep -v nnode1`; do
  rsync -avzhe ssh /opt/app/hive-2.1.1 $i:/opt/app
done

Step 6: start HDFS and create HDFS folders for Hive

Next, on nnode1 while logged in as hadoop, start HDFS.

[hadoop@nnode1 ~]$ start-dfs.sh
[hadoop@nnode1 ~]$ hdfs dfsadmin -report | grep Live
Live datanodes (4):

Next, on nnode1 while logged in as hadoop, create the two folders Hive requires: /tmp and /user/hive/warehouse.

[hadoop@nnode1 ~]$ hadoop fs -mkdir /tmp
[hadoop@nnode1 ~]$ hadoop fs -mkdir /user/hive
[hadoop@nnode1 ~]$ hadoop fs -mkdir /user/hive/warehouse
[hadoop@nnode1 ~]$ hadoop fs -chmod g+w /tmp
[hadoop@nnode1 ~]$ hadoop fs -chmod g+w /user/hive/warehouse
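
A quick way to confirm the folders exist with group write permission:

hadoop fs -ls -d /tmp /user/hive/warehouse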

Next, on nnode2 while logged in as hadoop, start Yarn.

[hadoop@nnode2 ~]$ start-yarn.sh
[hadoop@nnode2 ~]$ yarn node -list

17/01/25 08:56:29 INFO client.RMProxy: Connecting to ResourceManager at nnode2/192.168.0.222:8032
Total Nodes:4
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
dnode2:48257 RUNNING dnode2:8042 0
dnode1:54810 RUNNING dnode1:8042 0
dnode3:44207 RUNNING dnode3:8042 0
dnode4:50431 RUNNING dnode4:8042 0

Step 7: start Hive metastore service and Hive CLI

Next, on nnode1 while logged in as hadoop, start Hive metastore service.

[hadoop@nnode1 ~]$ nohup hive --service metastore &
[hadoop@nnode1 ~]$ hive

That will start the Hive CLI. Issuing “show databases” should list the “default” database.

hive> show databases;
hive> create database dynabi;
hive> create database hdfs;
hive> show databases;

OK
default
dynabi
hdfs
Time taken: 0.02 seconds, Fetched: 3 row(s)
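
If you are curious, the new databases are also recorded in the Oracle metastore schema we created in Step 4. For example, the DBS table holds one row per Hive database; a quick look from SQL*Plus, connected as hive:

select DB_ID, NAME, DB_LOCATION_URI from DBS order by DB_ID;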

Now, open the Hadoop administration web UI at http://nnode1:50070 and go to the “Browse the file system” utility.

You can explore the structure in the web UI; it is quite intuitive. Note that two sub-folders, dynabi.db and hdfs.db, are created under /user/hive/warehouse.
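
The same structure can be listed from the command line:

hadoop fs -ls /user/hive/warehouse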

Step 8: create table and load data

Next, we will create a sales table in Hive and load some sample data. The sample data comes from the SALES table in the Oracle sample schemas; its structure can be seen in the CREATE TABLE statement below.

I downloaded the table data and saved it as /home/hadoop/sales.txt. The file size is 29,908,033 bytes. I chose | as the field delimiter and did not include a header row or quotation marks.
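
For reference, one way to produce such a file (a sketch only; it assumes the SH sample schema is installed, and the exact export method and date format I actually used may differ) is to spool a concatenated query from SQL*Plus on the database server and then copy the result to /home/hadoop/sales.txt on nnode1:

set heading off feedback off pagesize 0 linesize 200 trimspool on
spool /tmp/sales.txt
select prod_id ||'|'|| cust_id ||'|'|| time_id ||'|'|| channel_id ||'|'|| promo_id ||'|'|| quantity_sold ||'|'|| amount_sold from sh.sales;
spool off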

First, in the Hive CLI, switch to the newly created database hdfs:

hive> use hdfs;

Paste the following into the Hive CLI to create the sales table.

CREATE TABLE IF NOT EXISTS sales (
prod_id int,
cust_id int,
time_id string,
channel_id int,
promo_id int,
quantity_sold decimal(10,2),
amount_sold decimal(10,2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;
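
To confirm the definition landed in the metastore, you can describe the table:

describe sales;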

Next, load data into the sales table.

hive> LOAD DATA LOCAL INPATH '/home/hadoop/sales.txt' OVERWRITE INTO TABLE sales;
Loading data to table hdfs.sales
OK
Time taken: 3.816 seconds

Next, browse the HDFS admin web UI to review the newly created table.

The file size looks slightly larger than the original file, and the data is replicated across dnode1, dnode2, and dnode3. Clicking the download link would let you examine the file content, but we will not do that here, as the file is pretty big.
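
Block placement and replication can also be checked from the command line (the exact report depends on your cluster):

hdfs fsck /user/hive/warehouse/hdfs.db/sales -files -blocks -locations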

Next, let's count the records from the Hive CLI and monitor the job status in Yarn (http://nnode2:8088/). Note that Yarn allocated two nodes to run the job.

hive> select count(*) from sales;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20170125102054_78afc66d-f0da-4f1b-940f-33c24210e7ba
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1485363340972_0003, Tracking URL = http://nnode2:8088/proxy/application_1485363340972_0003/
Kill Command = /opt/app/hadoop-2.7.3/bin/hadoop job -kill job_1485363340972_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-01-25 10:21:18,528 Stage-1 map = 0%, reduce = 0%
2017-01-25 10:21:37,048 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.33 sec
2017-01-25 10:22:38,123 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.5 sec
2017-01-25 10:23:38,931 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.48 sec
2017-01-25 10:24:39,674 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.91 sec
2017-01-25 10:25:05,079 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.2 sec
MapReduce Total cumulative CPU time: 15 seconds 200 msec
Ended Job = job_1485363340972_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 15.2 sec HDFS Read: 29916811 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 200 msec
OK
918843 <<<< this matches the source
Time taken: 252.505 seconds, Fetched: 1 row(s)
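
From here you can explore further with ordinary SQL. For example, a simple aggregation, again compiled into MapReduce behind the scenes:

select channel_id, sum(amount_sold) as total_sold
from sales
group by channel_id;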