Tuesday, 19 August 2014

How to read the data from Apache Hive Database through IBM Infosphere DataStage

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as Amazon S3 filesystem.

This blog provides an overview of how to read data from Hive database using Sample ODBC application (Example utility) supplied along with the ODBC driver bundles (also comes with Infosphere DataStage) and through ODBC Connector stage of DataStage. If you just want to read the data from Hive Database without logging into the database server, you can do that with Sample ODBC application. 

InfoSphere DataStage provides comprehensive support for ODBC through ODBC Enterprise Stage as well as ODBC Connector Stage on parallel canvas.

Here I am showing you the two ways to read data from Hive.
  1. Through sample odbc application of DataStage
  2. Through ODBC Connector stage of DataStage. 
 1. How to connect to Hive database through Sample odbc application

To connect to a hive database, you must first configure an ODBC data source definition (DSN) for the database by using the IBM Apache Hive Wire Protocol ODBC driver.
Make sure that the ODBC driver for Hive is installed.

a) Steps to be followed on Windows
         Create a Data Source Name (DSN)
  1. Start the Microsoft ODBC Data Source Administrator.
    On a 32-bit Windows computer, click Start > Control panel > Administrative Tools > Data Sources (ODBC). On a 64-bit Windows computer, navigate to C:\Windows\SysWOW64\odbcad32.exe.
    Note: On Windows, InfoSphere® Information Server is a 32-bit application. Even on a 64-bit Windows computer, the DataStage is running as a 32-bit application, therefore you must use the 32-bit version of the ODBC Data Source Administrator.
  1. On the System DSN page, click Add.
  2. On the Create New Data Source page, select the IBM Apache Hive Wire Protocol and click Finish.

    Configure the DSN
  1. Select the DSN which you just created then click on the Configure button to configure the driver. 

                                              Figure(1): DSN Configuration

where Database is the name of the Hive database the DSN connects to and Host Name is where the Hive server resides.
Wire Protocol Version Indicates which protocol to use when connecting to the Apache Hive server. If set to 0 -(AutoDetect), the driver tries to connect using the HiveServer1 protocol and the HiveServer2 protocol simultaneously.
Wireprotocol Version 2 represents Hive Server version-2, It is a server interface that enables remote clients to execute queries against Hive and retrieve the results. It is designed to provide better support for open API clients like JDBC and ODBC.

You need to set WireProtocolVersion=2 if you want to connect to Database through DataStage while Database is installed on another server.

Note: If the protocol specified for WireProtocolVersion is not accepted by the Hive server to which you are connecting, the connection fails with a login timeout error, if LoginTimeout is enabled. For example, a login timeout error would be returned if you specify WireProtocolVersion=2 and the driver attempts to connect to an Apache Hive server that only accepts connections using the HiveServer1 protocol.

Now click on Advanced Tab and set the entries as depicted below.


                                                             Figure(2) :DSN Configuration

2. Click on Test Connect and check if you are able to establish the connection with Database.
If Connection is established you can use the DSN to extract the data from Database else you need to check for the connection details (Host name or port ) if Hive Database is running properly on the given host name or port.

Ensure that ODBC Driver Installation Path (by default C:\IBM\ODBCdrivers) is included in your System Path variable.

Connection to Hive Database through Sample ODBC Application

Open the command prompt and go to the ODBC driver install location then run the Example.exe.

b) Steps to be followed on UNIX:
         DSN Configuration in the odbc.ini file

1. Add a new ODBC DSN definition for the Apache Hive to the .odbc.ini file.
2. Specify the Hostname and PortNumber to the host and port where Hive database server is running.
3. Specify the Database name from which you want to connect.
4. Save the .odbc.ini file.



Description=DataDirect Apache Hive Wire Protocol

Hive_Host is the host name where the Hive server resides and conn_db is the name of the Hive Database the DSN connects to.

Connection to Hive Database through Sample ODBC Application

1. Login to the machine where you installed the DataStage, set the DSHOME and ODBCINI environment variable which is updated in the dsenv script. You can either source the dsenv using the command “ source /opt/IBM/InformationServer/Server/DSEngine/dsenv “
or set the DSHOME with /opt/IBM/InformationServer/Server/DSEngine and ODBCINI with /opt/IBM/InformationServer/Server/DSEngine/.odbc.ini values.

2. Go to the directory where example program resides by default it will be under /opt/IBM/InformationServer/Server/branded_odbc/samples/example directory, run the example.


 2. How to connect to Hive database through ODBC Connector Stage

         Following are the steps required to connect to Hive from Infosphere DataStage.
  1. Add a DSN entry in your .odbc.ini file as mentioned above.
  2. Design a ODBC Read job in Parallel Canvas of DataStage with ODBC Connector as the source, pick any stage from palette you want to use for target like Peek, Sequential File etc.
          Here is an example to design a Parallel job using ODBC Connector stage.


                                                         Figure(3) : ODBC Read Job 
   3.  Double click on ODBC Connector to set the properties. Enter the DSN , user and password. The  user and password generally set during installation of Database, use the same user and password in the job. Check the screen shot below for these settings.

                                            Figure(4) : Properties of ODBC_Connector
  1. Set the properties of Sequential File, give the location of the file where you want to store the data, select the File Update Mode whatever you want like overwrite, append. In the Format tab you can choose the delimiters etc. The screen shot below depicts an example of how to set these properties.

                                                      Figure(5) : Properties of Sequential File

5. Save your Job, compile and then Run it.

Conclusion : The above blog illustrates how a user can configure a connection to the Hive Data Source using the IBM Apache Hive Wire Protocol ODBC driver supplied by DataDirect and use the DSN in the ODBC Connector to read the data from the Hive Datasource into the ETL job flow. Because of the limitations in the support for the HiveQL in the ODBC Connector, the connector only supports reading of the data from Hive, writing into the Hive is currently not supported. Users might have to use alternative mechanisms for writing into Hive.

Disclaimer: The postings on this site are those of the authors and don’t necessarily represent IBM’s positions, strategies or opinions.