Information Integration Blog

Friday, 19 August 2016

Moving data from on-premises to cloud using IBM DataWorks Forge

In this blog post we will show how you can easily move data from an on-premises source such as IBM Db2 database to a cloud based database such as IBM dash DB using IBM DataWorks Forge. We are assuming that you have already created a service instance of IBM DataWorks Forge and have setup a Secure Gateway. If not you can refer to the following blog posts:

Open the IBM DataWorks Forge GUI. The first thing that you need to do is to create a Data Connection to your source and to your target. Steps for creating the Data Connection are given below:

Click on "Create Connection" link on the left hand side panel.
Select the source type (IBM Db2 or IBM dash DB)
Specify a name and description for the connection.
Enter the connection details such as Hostname, port, database name and credentials for connecting to the database.
For IBM Db2 (and any other on-premises source) you will also have to provide the secure gateway details. If you have created a secure gateway in the same org and space it will be listed in the drop down list. Select the secure gateway which is present in the same VLAN as that our your source.
Click on "Create Connection".
DataWorks will try to connect to the source and if the connection is successful, it will create the connection.

Once the source and target data connections have been created, you can create a DataWorks Forge Activity which will do the actual data movement. There are two kinds of activities: (a) One that simply copies the source data to the target and (b) others that read the data from the source, do some amount of data shaping such as join, etc., and then copy the transformed data to the target. In this blog post we will showcase how you can do a straight copy of the source data to the target without any shaping.

Click on "Refine & Copy" link from the left hand side panel. It will open the asset browser.
You will be show the list of source connections that you have created. Select the one which connects to the source IBM Db2 database.
It will then show the list of schemas which are available in the source database. Select the schema in which your source data is present.
It will then show the list of tables present in the selected schema. Select one or more tables from the list. Data from these selected tables will be copied to the target.
When you click on a table, the GUI will show the list of columns which are present in the table. If you only want to see the columns in a table you can click on the table name. If you want to select all the columns in the table, then click on the checkbox next to the table name. If you wish to select a sub-set of the columns you can click on the checkbox next to the column names.
The list of selected tables are show on the right hand side. You can click on the three dots next to a selected table to either (a) see some sample data in the table or (b) delete to table from the selection.
Once the source table(s) have been selected, click on "Copy to Target" button at the top. This will directly copy the data to the target without doing any shaping. If you wish to shape the data, then click on the "Refine Data" button. In this blog we will focus on the "Copy to Target" option.
When you click on "Copy to Target" the asset browser is again opened and it show the list of target connections which you have created. Select the one for dash db which you have created.
It will show the list of schemas present in the dash db. Please note that if you are using the multi-tenant dash db instance, then only the schema with name starting with DASH<some number> has write permissions. Hence select this schema.
It will then show the supported table actions. There are four options:

Append to the table: If the target table already exists, then data will be appended to it.
Recreate the table: If the table does not exists, it will be created. If it exists, then it will be deleted and created again. In this process if there are any key constraints and permissions defined on the original table, they will be lost.
Replace the table contents: If the table does not exist, it will be created. If it does exists, then all the data from the table will be deleted and the data from the source table will be copied to it. The key constraints and permission will remain unchanged.
Merge with the table: This is similar to upsert. In other words, for each record in the source table, if the target table does not contain a record with the same primary key values as the source records, then the source record will be inserted into the target table. However, if the target table contains a record with the same primary key values as that of the source record, then the target record will be updated with the values of the source record. If the target table does not contain a primary key, then the source record will be added to the table without removing the tables existing data. If the target table does not exist, then it will be created without any primary key and the source records will be copied to the target.

On the right hand side panel, IBM DataWorks Forge will list the set of tables which will get created in the target. The name of the target table can be changed by clicking on the edit icon next to the table name.
Finally, provide a name for the activity at the top and then click on Run. The activity is now running and you will be taken to the activities page. You can find the newly created activity in the list.
Click on the activity name to see the details of the activity. It will show the status of the activity run. You will have to refresh the activity details page by clicking on the refresh button at the top right corner to check the latest status of the activity run.
Once the run is finished, you can check the data movement logs by clicking on the activity run in the "Activity Runs" section of the activity details page.

Thats all there is to using DataWorks Forge to move data from an on-prem source such as IBM Db2 to IBM dash DB on the cloud!

-Manish

Setting up Secure Gateway for connecting to On-Prem sources using IBM DataWorks Forge

One of the key use cases for IBM DataWorks Forge is to move data from different on-prem sources into cloud based targets. Examples of on-prem sources supported by IBM DataWorks Forge are: IBM Db2, Oracle, IBM Pure Data For Analytics, IBM Informix, etc. Similarly examples of cloud based targets include IBM dash DB, IBM Db2, Amazon S3, Bluemix Object Storage, etc.

In order to connect to any on-prem source, IBM DataWorks Forge makes use of IBM Secure Gateway service. The Secure Gateway Service brings Hybrid Integration capability to your Bluemix environment. It provides secure connectivity from Bluemix to other applications and data sources running on-premise or in other clouds. A remote client is provided to enable secure connectivity. In this blog post we will explain how to configure the Secure Gateway for usage with IBM DataWorks Forge.

Open the IBM DataWorks Forge GUI in Bluemix. For details on how to create a IBM DataWorks Forge service instance you can refer to the following blog post.
Click on the "Secure Gateway" link on the left hand side panel under the "ADMIN" section
Click on "Add Gateway" button.
Provide a name for the secure gateway. If you do not want the token which is used by on-prem clients for connecting to the gateway to expire, uncheck the box of "Token expiration". Click on "Add Gateway".
A new gateway will be created and will be shown in the list. Click on the newly created gateway and it will show the details of the gateway. On that page, click on "Add Clients" button at the top.
There are three ways to connect to the gateway. We will provide the details of using a docker based approach. Hence select the bullet for "docker".
It will show you a command which needs to be run on the machine where the secure gateway client is to be run. Copy this command.
You will need a machine in the on-prem environment where the secure gateway client needs to be installed. On this environment, you need to first install docker (if it is not already installed).
In our testing we installed docker on a machine with OS Windows 7.1FP1 or higher. The steps for installing are available at: https://docs.docker.com/installation/windows/. Once docker is installed, it also installs SSH on the windows machine. We need to ensure that the path for SSH is added to the PATH environment variable in System properties. SSH is typically installed in C:\Program Files (x86)\Git\bin folder in windows. Hence add this folder to the PATH environment variable in System properties.
Once the above is done, open an command prompt and execute the following commands:

Change the directory to the folder where docker is installed. It is typically installed in c:\Program Files\Boot2Docker
Run the following command: "boot2docker init". This is a onetime activity which is to be run the first time you install docker on the machine.
Then run the following two commands

boot2docker start
boot2docker ssh

Commands mentioned in step 10.3 are to be run in a window which needs to be kept open whenever you want to read data from an on-prem source. If the window is closed, you can open a new window and run the commands (in 10.3) again.

You can ensure that your docker is up-to-date by running the following command:

docker pull ibmcom/secure-gateway-client

Once the above steps have been done, you can run the command mentioned in Step 7 above, e.g., docker run -it bluemix/secure-gateway-client <gateway_id>. This will download the Secure Gateway client, run it, and connect to the gateway. Please note that whenever you wish to connect to an on-prem source, you will have to ensure that the command is running in a window on the on-premises machine.

Please ensure that there is no firewall between the machine on which docker is installed and the source. An easy way to check this is to ensure that you can telnet to the source from the machine on which docker is installed.

Once this is done, you are good to go!

-Manish

Creating a IBM DataWorks Forge service instance in Bluemix

IBM has a data integration service offering on the cloud called as IBM DataWorks Forge. It is targeted towards the citizen analyst persona (business user). Citizen analyst make use of various kinds of analytic tools available in the cloud for doing analytics. One of the first things that they need is to get their data into the cloud. This is where IBM DataWorks Forge helps.

IBM DataWorks Forge allows customers to move data from on-prem systems to Cloud as well as from Cloud based sources to other cloud based targets. IBM DataWorks Forge also supports shaping of data such as joins, filter, string manipulation, standardisation, etc. In this post I explain how you can create a service instance of IBM DataWorks Forge and start using it.

1. Navigate to http://www.bluemix.net and login with your IBM ID.
2. Click on the "Catalog" link in the top navigation bar
3. Search for "DataWorks" in the search bar
4. The DataWorks Icon will be displayed (shown below)

5. However over it and it will show a button for "View more". Click on the button.
6. Select the plan which you wish to subscribe to. Change the service name if needed and click on "Create"
7. Navigate to the Dashboard using the top navigation bar. It will show the newly created DataWorks Forge service instance in the list of services.
8. Click on the DataWorks service instance. It will open the DataWorks landing page.
9. Click on the arrow at the right part of the screen.

10. This will launch the DataWorks GUI and you are all set to use DataWorks!

Tuesday, 19 August 2014

How to read the data from Apache Hive Database through IBM Infosphere DataStage

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as Amazon S3 filesystem.

This blog provides an overview of how to read data from Hive database using Sample ODBC application (Example utility) supplied along with the ODBC driver bundles (also comes with Infosphere DataStage) and through ODBC Connector stage of DataStage. If you just want to read the data from Hive Database without logging into the database server, you can do that with Sample ODBC application.

InfoSphere DataStage provides comprehensive support for ODBC through ODBC Enterprise Stage as well as ODBC Connector Stage on parallel canvas.

Here I am showing you the two ways to read data from Hive.

Through sample odbc application of DataStage
Through ODBC Connector stage of DataStage.

1. How to connect to Hive database through Sample odbc application

To connect to a hive database, you must first configure an ODBC data source definition (DSN) for the database by using the IBM Apache Hive Wire Protocol ODBC driver.

Make sure that the ODBC driver for Hive is installed.

a) Steps to be followed on Windows

Create a Data Source Name (DSN)

Start the Microsoft ODBC Data Source Administrator.

On a 32-bit Windows computer, click Start > Control panel > Administrative Tools > Data Sources (ODBC). On a 64-bit Windows computer, navigate to C:\Windows\SysWOW64\odbcad32.exe.

Note: On Windows, InfoSphere® Information Server is a 32-bit application. Even on a 64-bit Windows computer, the DataStage is running as a 32-bit application, therefore you must use the 32-bit version of the ODBC Data Source Administrator.

On the System DSN page, click Add.
On the Create New Data Source page, select the IBM Apache Hive Wire Protocol and click Finish.

Configure the DSN

Select the DSN which you just created then click on the Configure button to configure the driver.

Figure(1): DSN Configuration

where Database is the name of the Hive database the DSN connects to and Host Name is where the Hive server resides.

Wire Protocol Version Indicates which protocol to use when connecting to the Apache Hive server. If set to 0 -(AutoDetect), the driver tries to connect using the HiveServer1 protocol and the HiveServer2 protocol simultaneously.

Wireprotocol Version 2 represents Hive Server version-2, It is a server interface that enables remote clients to execute queries against Hive and retrieve the results. It is designed to provide better support for open API clients like JDBC and ODBC.

You need to set WireProtocolVersion=2 if you want to connect to Database through DataStage while Database is installed on another server.

Note: If the protocol specified for WireProtocolVersion is not accepted by the Hive server to which you are connecting, the connection fails with a login timeout error, if LoginTimeout is enabled. For example, a login timeout error would be returned if you specify WireProtocolVersion=2 and the driver attempts to connect to an Apache Hive server that only accepts connections using the HiveServer1 protocol.

Now click on Advanced Tab and set the entries as depicted below.

Figure(2) :DSN Configuration

2. Click on Test Connect and check if you are able to establish the connection with Database.

If Connection is established you can use the DSN to extract the data from Database else you need to check for the connection details (Host name or port ) if Hive Database is running properly on the given host name or port.

Ensure that ODBC Driver Installation Path (by default C:\IBM\ODBCdrivers) is included in your System Path variable.

Connection to Hive Database through Sample ODBC Application

Open the command prompt and go to the ODBC driver install location then run the Example.exe.

b) Steps to be followed on UNIX:

DSN Configuration in the odbc.ini file

1. Add a new ODBC DSN definition for the Apache Hive to the .odbc.ini file.

2. Specify the Hostname and PortNumber to the host and port where Hive database server is running.

3. Specify the Database name from which you want to connect.

4. Save the .odbc.ini file.

Example

[HIVE65]

Driver=/opt/IBM/InformationServer/Server/branded_odbc/lib/VMhive00.so

Description=DataDirect Apache Hive Wire Protocol

Hostname=<Hive_Host>

Database=conn_db

PortNumber=10000

MaxVarcharSize=8192

TransactionMode=1

LogonID=

Password=

WireProtocolVersion=2

RemoveColumnQualifiers=0

StringDescribeType=12

UseCurrentSchema=0

QEWSD=2456861

Hive_Host is the host name where the Hive server resides and conn_db is the name of the Hive Database the DSN connects to.

Connection to Hive Database through Sample ODBC Application

1. Login to the machine where you installed the DataStage, set the DSHOME and ODBCINI environment variable which is updated in the dsenv script. You can either source the dsenv using the command “ source /opt/IBM/InformationServer/Server/DSEngine/dsenv “

or set the DSHOME with /opt/IBM/InformationServer/Server/DSEngine and ODBCINI with /opt/IBM/InformationServer/Server/DSEngine/.odbc.ini values.

2. Go to the directory where example program resides by default it will be under /opt/IBM/InformationServer/Server/branded_odbc/samples/example directory, run the example.

Example

2. How to connect to Hive database through ODBC Connector Stage

Following are the steps required to connect to Hive from Infosphere DataStage.

Add a DSN entry in your .odbc.ini file as mentioned above.
Design a ODBC Read job in Parallel Canvas of DataStage with ODBC Connector as the source, pick any stage from palette you want to use for target like Peek, Sequential File etc.

Here is an example to design a Parallel job using ODBC Connector stage.

Figure(3) : ODBC Read Job

3. Double click on ODBC Connector to set the properties. Enter the DSN , user and password. The user and password generally set during installation of Database, use the same user and password in the job. Check the screen shot below for these settings.

Figure(4) : Properties of ODBC_Connector

Set the properties of Sequential File, give the location of the file where you want to store the data, select the File Update Mode whatever you want like overwrite, append. In the Format tab you can choose the delimiters etc. The screen shot below depicts an example of how to set these properties.

Figure(5) : Properties of Sequential File

5. Save your Job, compile and then Run it.

Conclusion : The above blog illustrates how a user can configure a connection to the Hive Data Source using the IBM Apache Hive Wire Protocol ODBC driver supplied by DataDirect and use the DSN in the ODBC Connector to read the data from the Hive Datasource into the ETL job flow. Because of the limitations in the support for the HiveQL in the ODBC Connector, the connector only supports reading of the data from Hive, writing into the Hive is currently not supported. Users might have to use alternative mechanisms for writing into Hive.

Disclaimer: The postings on this site are those of the authors and don’t necessarily represent IBM’s positions, strategies or opinions.