Install and Configure OOZIE for workflow management

Oozie Version: 4.3.0
Hadoop Version: 2.7.3 (4-node cluster)
MySQL Version: 5.7 (for Oozie database)

Oozie, as a workflow manager, involves many Apache projects. Because of that, its installation and configuration can be quite tricky. The online documentation for Oozie installation is good enough if you are working on a single-node installation and have Hadoop, Hive, and the other components at the same release level as those configured in the default Oozie release.

This article documents a number of critical configurations that are scattered across the Internet. Hopefully it can help you avoid many pitfalls in Oozie configuration.

Step 1: Configure Hadoop proxy user

This setting is not common, but it is critical for Oozie to work with Hadoop. On the primary name node, nnode1, add the following to $HADOOP_HOME/etc/hadoop/core-site.xml. This allows the "hadoop" user, which in my case is the superuser that will also own the Oozie installation, to impersonate any other Hadoop user. If you will install Oozie under a different software owner, the properties need to be changed accordingly.
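The addition is the standard pair of Hadoop proxy user properties, scoped here to the "hadoop" user (substitute your own Oozie owner in the property names; the wildcard values can be narrowed to specific hosts and groups in production):

```xml
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>
```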

It is important to synchronize the file to all other nodes, or impersonation will not work.

for i in `cat /home/hadoop/mysites | grep -v nnode1`; do
  rsync -avzhe ssh /opt/app/hadoop-2.7.3/etc/hadoop/core-site.xml $i:/opt/app/hadoop-2.7.3/etc/hadoop
done

Next, make sure the HDFS daemons, YARN daemons, and JobHistory daemon are all stopped and restarted. Carrying out these steps in an organized way can save you a lot of debugging time.
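Assuming the stock Hadoop 2.x control scripts under $HADOOP_HOME/sbin, the restart sequence is roughly the following (run each script on the node that hosts the corresponding daemon; in this cluster the JobHistory server runs on nnode2):

```shell
# Stop everything first, in reverse dependency order.
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh

# Restart so core-site.xml changes take effect.
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
```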

Step 2: Download Oozie software and build binaries

Download oozie-4.3.0.tar.gz from http://oozie.apache.org/, following the links to a proper mirror site. In my case, the file is saved under the /home/hadoop/Downloads folder. Unpack the tarball with the "tar xvfz" command in the download folder. The unpacked files will be under /home/hadoop/Downloads/oozie-4.3.0. We will refer to this folder as OOZIE_BUILD_ROOT.

The downloaded files are sources, so we need to build them into binaries. Oozie uses Maven to manage dependencies and build the code. To install Maven on OEL7, execute "sudo yum install maven". It only needs to be installed on one of the nodes; in my case, nnode1.

Next, the pom.xml file needs to be modified to avoid a bug. With the default pom.xml, two conflicting servlet-api jars will be packaged into the final installed war file. Here we change the scope of the servlet-api artifact to provided, which means the servlet-api jar will be pulled from the online Maven repository for compilation only and excluded from the final distribution.
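In pom.xml this amounts to a change along these lines on the servlet-api dependency (the groupId and version shown are what typical Oozie 4.x poms use; double-check against your copy, as only the provided scope is the point here):

```xml
<dependency>
  <groupId>javax.servlet</groupId>
  <artifactId>servlet-api</artifactId>
  <version>2.5</version>
  <scope>provided</scope>
</dependency>
```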

To build the Oozie binaries, run bin/mkdistro.sh as follows under the OOZIE_BUILD_ROOT folder. Note that the documentation lists several other options you can set depending on the versions of other software components you will use, such as the Tomcat, Hive, or HBase version. You are better off leaving the other options at their defaults.

[hadoop@nnode1 oozie-4.3.0]$ bin/mkdistro.sh -DskipTests -Phadoop-2 -Dhadoop.version=2.7.3
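If your cluster genuinely runs different component versions, the same script accepts additional -D overrides, for example (the hive.version and tomcat.version property names are assumed from the version properties in the Oozie build poms; verify them in pom.xml before relying on this):

```shell
bin/mkdistro.sh -DskipTests -Phadoop-2 -Dhadoop.version=2.7.3 \
    -Dhive.version=1.2.1 -Dtomcat.version=6.0.44
```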

At the completion, you should see all modules are built successfully.

Step 3: Install and configure Oozie

Since there are no installation scripts, we will copy all binaries to the Oozie installation folder. In this case, we will install Oozie under /opt/app/oozie-4.3.0, which we will call OOZIE_HOME.

sudo mkdir /opt/app/oozie-4.3.0
sudo chown hadoop:dev /opt/app/oozie-4.3.0

Next, move to OOZIE_HOME and copy over all the binaries (note the trailing dot, which makes the current folder, OOZIE_HOME, the destination):
cp -R /distro/target/oozie-4.3.0-distro/oozie-4.3.0/* .

Next, create a libext folder under OOZIE_HOME, and then copy the Oozie Hadoop interface jar files.
mkdir /opt/app/oozie-4.3.0/libext
cp /hadooplibs/hadoop-auth-2/target/oozie-hadoop-auth-hadoop-2-4.3.0.jar /libext/
cp /hadooplibs/hadoop-distcp-2/target/oozie-hadoop-distcp-hadoop-2-4.3.0.jar /libext/
cp /hadooplibs/hadoop-utils-2/target/oozie-hadoop-utils-hadoop-2-4.3.0.jar /libext/

Next, download http://archive.cloudera.com/gplextras/misc/ext-2.2.zip, and also save it under /libext/.

Next, copy following Hadoop libraries to /libext/.

  • /share/hadoop/common/*.jar
  • /share/hadoop/common/lib/*.jar
  • /share/hadoop/hdfs/*.jar
  • /share/hadoop/hdfs/lib/*.jar
  • /share/hadoop/mapreduce/*.jar
  • /share/hadoop/mapreduce/lib/*.jar
  • /share/hadoop/yarn/*.jar
  • /share/hadoop/yarn/lib/*.jar
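Copying eight jar directories by hand is error-prone; the same can be scripted with a small helper loop (a sketch: the copy_hadoop_jars function and its arguments are my own naming, and it assumes the jars live under <hadoop root>/share/hadoop and that libext/ already exists):

```shell
#!/bin/sh
# Copy every Hadoop jar group listed above into Oozie's libext/ folder.
# $1 = Hadoop install root, $2 = Oozie install root; for this article
# those are /opt/app/hadoop-2.7.3 and /opt/app/oozie-4.3.0.
copy_hadoop_jars() {
  hadoop_home="$1"
  oozie_home="$2"
  for d in common common/lib hdfs hdfs/lib mapreduce mapreduce/lib yarn yarn/lib; do
    # Skip groups that are not present rather than failing mid-way.
    if [ -d "$hadoop_home/share/hadoop/$d" ]; then
      cp "$hadoop_home/share/hadoop/$d"/*.jar "$oozie_home/libext/"
    fi
  done
}

copy_hadoop_jars /opt/app/hadoop-2.7.3 /opt/app/oozie-4.3.0
```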

Next, copy MySQL JDBC driver to libext/.
cp /usr/share/java/mysql-connector-java.jar /libext/

Next, we need to fix a build problem. If you examine the default oozie.war package with the "jar tvf" command, you will notice that Hadoop 2.6.0 libraries are packaged in it, even though we specified Hadoop 2.7.3 at build time. Since we will run Oozie with Hadoop 2.7.3, we have to replace these jar files with the 2.7.3 ones; otherwise there will be authorization issues when running Oozie workflows.

Since we have already copied all the Hadoop 2.7.3 jar files to libext/, we just need to delete the bad ones from the oozie.war file. We do that by unpacking the war file, deleting the bad files manually, and then repackaging the war.

cd /opt/app/oozie-4.3.0
mkdir warfix
mv oozie.war warfix
cd warfix
jar xvf oozie.war
rm WEB-INF/lib/hadoop*2.6.0.jar
jar cvf ../oozie.war ./*
cd ..
rm -rf warfix

Next, we need to deploy the oozie.war file to the built-in Tomcat web server. To prepare the war file, execute oozie-setup.sh with the prepare-war command:
/bin/oozie-setup.sh prepare-war

Oozie shared libraries need to be stored on HDFS. Next, we will create the /user/hadoop/share/lib folder for that purpose. Execute the following "hadoop fs" commands:
hadoop fs -mkdir /user/hadoop/share
hadoop fs -mkdir /user/hadoop/share/lib

To upload the shared libraries to HDFS, we run oozie-setup.sh with the "sharelib create" command.
/bin/oozie-setup.sh sharelib create -fs hdfs://nnode1:9000

Note the HDFS path where the shared libraries are saved (/user/hadoop/share/lib/lib_20170328132010). We will need it in a subsequent configuration step.

Next, we will create the Oozie schema on MySQL. By default, Oozie creates a Derby database to hold its metadata; using MySQL is more production-ready. I installed MySQL on nnode1 previously; the steps are in blog BD013. Here we assume the MySQL daemon is running. We connect to the server and create a user called "oozie" and a database called "ooziedb".

mysql -uroot -p
mysql> CREATE DATABASE ooziedb;
mysql> CREATE USER 'oozie'@'nnode1' IDENTIFIED BY 'ob!ee11G';
mysql> GRANT ALL PRIVILEGES ON ooziedb.* TO 'oozie'@'nnode1';
mysql> FLUSH PRIVILEGES;

Before we can create the Oozie schema in MySQL, we have to configure oozie-site.xml under the /conf/ folder. Just delete all the existing properties and put in the following:

<property>
  <name>oozie.service.ProxyUserService.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>
<property>
  <name>oozie.db.schema.name</name>
  <value>ooziedb</value>
</property>
<property>
  <name>oozie.service.JPAService.create.db.schema</name>
  <value>true</value>
</property>
<property>
  <name>oozie.service.JPAService.validate.db.connection</name>
  <value>true</value>
</property>
<property>
  <name>oozie.service.JPAService.jdbc.driver</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>oozie.service.JPAService.jdbc.url</name>
  <value>jdbc:mysql://nnode1/ooziedb</value>
</property>
<property>
  <name>oozie.service.JPAService.jdbc.username</name>
  <value>oozie</value>
</property>
<property>
  <name>oozie.service.JPAService.jdbc.password</name>
  <value>ob!ee11G</value>
</property>
<property>
  <name>oozie.service.JPAService.pool.max.active.conn</name>
  <value>10</value>
</property>
<property>
  <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
  <value>*=/opt/app/hadoop-2.7.3/etc/hadoop/</value>
</property>
<property>
  <name>oozie.service.WorkflowAppService.system.libpath</name>
  <value>hdfs:///user/hadoop/share/lib/lib_20170328132010</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>nnode2:10020</value>
</property>


Please note that we are making multiple configuration changes here.
Oozie database-related configurations:

  • oozie.db.schema.name
  • oozie.service.JPAService.create.db.schema
  • oozie.service.JPAService.validate.db.connection
  • oozie.service.JPAService.jdbc.driver
  • oozie.service.JPAService.jdbc.url
  • oozie.service.JPAService.jdbc.username
  • oozie.service.JPAService.jdbc.password
  • oozie.service.JPAService.pool.max.active.conn

Hadoop proxy configurations:

  • oozie.service.ProxyUserService.proxyuser.hadoop.hosts
  • oozie.service.ProxyUserService.proxyuser.hadoop.groups
  • oozie.service.HadoopAccessorService.hadoop.configurations

Custom shared library path:

  • oozie.service.WorkflowAppService.system.libpath

Job history server address:

  • mapreduce.jobhistory.address

Next, we will create a soft link to the MySQL JDBC driver under the Oozie lib directory.
ln -s /usr/share/java/mysql-connector-java.jar /lib/mysql-connector-java.jar

Next, we will create the Oozie schema through oozie-setup.sh with the "db create" command.
/bin/oozie-setup.sh db create -run

Verify the tables are created properly; you should see Oozie tables such as WF_JOBS and COORD_JOBS.

mysql -uoozie -hnnode1 -p
mysql> use ooziedb
mysql> show tables;

Next, modify .bashrc for the Oozie software owner (in this case, hadoop) and add the following environment settings.

export OOZIE_HOME=/opt/app/oozie-4.3.0
export PATH=$PATH:$OOZIE_HOME/bin
export OOZIE_URL=http://nnode1:11000/oozie

Step 4: Run examples

To start the Oozie server, run oozied.sh with the start command, followed by the "oozie admin" command to check its status.

[hadoop@nnode1 oozie-4.3.0]$ oozied.sh start
[hadoop@nnode1 oozie-4.3.0]$ oozie admin -status
System mode: NORMAL

Next, we will unpack the examples tarball and upload it to HDFS.
cd /opt/app/oozie-4.3.0
tar xvf oozie-examples.tar.gz
hadoop fs -put examples /user/hadoop

Next, modify examples/apps/map-reduce/job.properties (on local file system, not HDFS file system).

nameNode=hdfs://nnode1:9000
jobTracker=nnode2:8032
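For context, the rest of the stock job.properties usually contains entries along these lines (exact contents vary by release; only the two values above need to change for this cluster, and the names below reflect typical defaults in the shipped examples):

```
queueName=default
examplesRoot=examples
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/map-reduce
```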

Next, start a map-reduce workflow and monitor its status via the "oozie job -info" command. Note that the job ID returned by the first command is used in the second.

[hadoop@nnode1 oozie-4.3.0]$ oozie job -config examples/apps/map-reduce/job.properties -run
job: 0000000-170328142601690-oozie-hado-W
[hadoop@nnode1 oozie-4.3.0]$ oozie job -info 0000000-170328142601690-oozie-hado-W

You can also check the job status from the JobHistory server once the workflow is finished. The JobHistory server Web UI is at nnode2:19888.