Apache Hive is a data warehouse
infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. It supports analysis of large
datasets stored in Hadoop's HDFS and compatible file systems such as
Amazon S3 filesystem.
This blog provides an overview of how
to read data from Hive database using Sample ODBC application
(Example utility) supplied along with the ODBC driver bundles (also
comes with Infosphere DataStage) and through ODBC Connector stage of
DataStage. If you just want to read the data from Hive Database
without logging into the database server, you can do that with Sample
ODBC application.
InfoSphere DataStage provides
comprehensive support for ODBC through ODBC Enterprise Stage as well
as ODBC Connector Stage on parallel canvas.
Here I am showing you the two ways
to read data from Hive.
- Through sample odbc application of DataStage
- Through ODBC Connector stage of DataStage.
To connect to a hive database, you
must first configure an ODBC data source definition (DSN) for the
database by using the IBM Apache Hive Wire Protocol ODBC driver.
Make sure that the ODBC driver for Hive
is installed.
a) Steps to be followed on Windows
Create a Data Source Name
(DSN)
- Start the Microsoft ODBC Data Source Administrator.On a 32-bit Windows computer, click Start > Control panel > Administrative Tools > Data Sources (ODBC). On a 64-bit Windows computer, navigate to C:\Windows\SysWOW64\odbcad32.exe.Note: On Windows, InfoSphere® Information Server is a 32-bit application. Even on a 64-bit Windows computer, the DataStage is running as a 32-bit application, therefore you must use the 32-bit version of the ODBC Data Source Administrator.
- On the System DSN page, click Add.
- On the Create New Data Source page, select the IBM Apache Hive Wire Protocol and click Finish.Configure the DSN
- Select the DSN which you just created then click on the Configure button to configure the driver.
Figure(1): DSN Configuration
where Database is the name of the Hive
database the DSN connects to and Host Name is where the Hive server
resides.
Wire Protocol Version Indicates which
protocol to use when connecting to the Apache Hive server. If set to
0 -(AutoDetect), the driver tries to connect using the HiveServer1
protocol and the HiveServer2 protocol simultaneously.
Wireprotocol Version 2 represents
Hive Server version-2, It is a server interface that enables remote
clients to execute queries against Hive and retrieve the results. It
is designed to provide better support for open API clients like JDBC
and ODBC.
You need to set WireProtocolVersion=2
if you want to connect to Database through DataStage while Database
is installed on another server.
Note: If the protocol specified
for WireProtocolVersion is not accepted by the Hive server to which
you are connecting, the connection fails with a login timeout error,
if LoginTimeout is enabled. For example, a login timeout error would
be returned if you specify WireProtocolVersion=2 and the driver
attempts to connect to an Apache Hive server that only accepts
connections using the HiveServer1 protocol.
Now click on Advanced Tab and set the
entries as depicted below.
Figure(2)
:DSN Configuration
2. Click on Test Connect and check
if you are able to establish the connection with Database.
If Connection is established you can
use the DSN to extract the data from Database else you need to check
for the connection details (Host name or port ) if Hive Database is
running properly on the given host name or port.
Ensure that ODBC Driver Installation
Path (by default C:\IBM\ODBCdrivers) is included in your System Path
variable.
Connection to Hive Database through
Sample ODBC Application
Open the command prompt and go to the
ODBC driver install location then run the Example.exe.
b) Steps to be followed on UNIX:
DSN Configuration in the
odbc.ini file
1. Add a new ODBC DSN definition for
the Apache Hive to the .odbc.ini file.
2. Specify the Hostname and PortNumber
to the host and port where Hive database server is running.
3. Specify the Database name from which
you want to connect.
4. Save the .odbc.ini file.
Example
[HIVE65]
Driver=/opt/IBM/InformationServer/Server/branded_odbc/lib/VMhive00.so
Description=DataDirect Apache Hive Wire
Protocol
Hostname=<Hive_Host>
Database=conn_db
PortNumber=10000
MaxVarcharSize=8192
TransactionMode=1
LogonID=
Password=
WireProtocolVersion=2
RemoveColumnQualifiers=0
StringDescribeType=12
UseCurrentSchema=0
QEWSD=2456861
Hive_Host is the host name where the
Hive server resides and conn_db is the name of the Hive Database the
DSN connects to.
Connection to Hive Database through
Sample ODBC Application
1. Login to the machine where you
installed the DataStage, set the DSHOME and ODBCINI environment
variable which is updated in the dsenv script. You can either source
the dsenv using the command “ source
/opt/IBM/InformationServer/Server/DSEngine/dsenv “
or set the DSHOME with
/opt/IBM/InformationServer/Server/DSEngine and ODBCINI with
/opt/IBM/InformationServer/Server/DSEngine/.odbc.ini values.
2. Go to the directory where example
program resides by default it will be under
/opt/IBM/InformationServer/Server/branded_odbc/samples/example
directory, run the example.
Example
2. How to connect to Hive
database through ODBC Connector Stage
Following are the steps required to
connect to Hive from Infosphere DataStage.
- Add a DSN entry in your .odbc.ini file as mentioned above.
- Design a ODBC Read job in Parallel Canvas of DataStage with ODBC Connector as the source, pick any stage from palette you want to use for target like Peek, Sequential File etc.
Figure(3) : ODBC Read Job
3. Double click on ODBC Connector to
set the properties. Enter the DSN , user and password. The user and
password generally set during installation of Database, use the same
user and password in the job. Check the screen shot below for these
settings.
Figure(4) :
Properties of ODBC_Connector
- Set the properties of Sequential File, give the location of the file where you want to store the data, select the File Update Mode whatever you want like overwrite, append. In the Format tab you can choose the delimiters etc. The screen shot below depicts an example of how to set these properties.
Figure(5) :
Properties of Sequential File
5. Save your Job, compile and then Run
it.
Conclusion : The above blog
illustrates how a user can configure a connection to the Hive Data
Source using the IBM Apache Hive Wire Protocol ODBC driver supplied by
DataDirect and use the DSN in the ODBC Connector to read the data
from the Hive Datasource into the ETL job flow. Because of the
limitations in the support for the HiveQL in the ODBC Connector, the
connector only supports reading of the data from Hive, writing
into the Hive is currently not supported. Users might have to use
alternative mechanisms for writing into Hive.
Disclaimer: The postings on this site
are those of the authors and don’t necessarily represent IBM’s
positions, strategies or opinions.
No comments:
Post a Comment