Friday, 19 August 2016

Moving data from on-premises to cloud using IBM DataWorks Forge

In this blog post we will show how you can easily move data from an on-premises source such as IBM Db2 database to a cloud based database such as IBM dash DB using IBM DataWorks Forge.  We are assuming that you have already created a service instance of IBM DataWorks Forge and have setup a Secure Gateway. If not you can refer to the following blog posts:

  1. Creating an IBM DataWorks Forge service instance in Bluemix
  2. Setting up a Secure Gateway for connecting to on-prem sources using IBM DataWorks Forge.

Open the IBM DataWorks Forge GUI.    The first thing that you need to do is to create a Data Connection to your source and to your target.  Steps for creating the Data Connection are given below:
  1. Click on "Create Connection" link on the left hand side panel.
  2. Select the source type (IBM Db2 or IBM dash DB)
  3. Specify a name and description for the connection.
  4. Enter the connection details such as Hostname, port, database name and credentials for connecting to the database.
  5. For IBM Db2 (and any other on-premises source) you will also have to provide the secure gateway details. If you have created a secure gateway in the same org and space it will be listed in the drop down list.  Select the secure gateway which is present in the same VLAN as that our your source.
  6. Click on "Create Connection". 
  7. DataWorks will try to connect to the source and if the connection is successful, it will create the connection.
Once the source and target data connections have been created, you can create a DataWorks Forge Activity which will do the actual data movement.  There are two kinds of activities: (a) One that simply copies the source data to the target and (b) others that read the data from the source, do some amount of data shaping such as join, etc., and then copy the transformed data to the target.  In this blog post we will showcase how you can do a straight copy of the source data to the target without any shaping.  
  1. Click on "Refine & Copy" link from the left hand side panel.  It will open the asset browser.
  2. You will be show the list of source connections that you have created.  Select the one which connects to the source IBM Db2 database.
  3. It will then show the list of schemas which are available in the source database.  Select the schema in which your source data is present.  
  4. It will then show the list of tables present in the selected schema.   Select one or more tables from the list.  Data from these selected tables will be copied to the target.  
  5. When you click on a table, the GUI will show the list of columns which are present in the table.  If you only want to see the columns in a table you can click on the table name. If you want to select all the columns in the table, then click on the checkbox next to the table name.  If you wish to select a sub-set of the columns you can click on the checkbox next to the column names. 
  6. The list of selected tables are show on the right hand side.  You can click on the three dots next to a selected table to either (a) see some sample data in the table or (b) delete to table from the selection.
  7. Once the source table(s) have been selected, click on "Copy to Target" button at the top. This will directly copy the data to the target without doing any shaping.  If you wish to shape the data, then click on the "Refine Data" button.  In this blog we will focus on the "Copy to Target" option.
  8. When you click on "Copy to Target" the asset browser is again opened and it show the list of target connections which you have created.  Select the one for dash db which you have created.
  9. It will show the list of schemas present in the dash db.  Please note that if you are using the multi-tenant dash db instance, then only the schema with name starting with DASH<some number> has write permissions.  Hence select this schema.
  10. It will then show the supported table actions.  There are four options: 
    1. Append to the table: If the target table already exists, then data will be appended to it.
    2. Recreate the table: If the table does not exists, it will be created. If it exists, then it will be deleted and created again.  In this process if there are any key constraints and permissions defined on the original table, they will be lost.
    3. Replace the table contents: If the table does not exist, it will be created. If it does exists, then all the data from the table will be deleted and the data from the source table will be copied to it.  The key constraints and permission will remain unchanged.
    4. Merge with the table: This is similar to upsert. In other words, for each record in the source table, if the target table does not contain a record with the same primary key values as the source records, then the source record will be inserted into the target table. However, if the target table contains a record with the same primary key values as that of the source record, then the target record will be updated with the values of the source record. If the target table does not contain a primary key, then the source record will be added to the table without removing the tables existing data.  If the target table does not exist, then it will be created without any primary key and the source records will be copied to the target.
  11. On the right hand side panel, IBM DataWorks Forge will list the set of tables which will get created in the target.  The name of the target table can be changed by clicking on the edit icon next to the table name.  
  12. Finally, provide a name for the activity at the top and then click on Run.  The activity is now running and you will be taken to the activities page.  You can find the newly created activity in the list.
  13. Click on the activity name to see the details of the activity. It will show the status of the activity run.  You will have to refresh the activity details page by clicking on the refresh button at the top right corner to check the latest status of the activity run.  
  14. Once the run is finished, you can check the data movement logs by clicking on the activity run in the "Activity Runs" section of the activity details page.

Thats all there is to using DataWorks Forge to move data from an on-prem source such as IBM Db2 to IBM dash DB on the cloud!


Setting up Secure Gateway for connecting to On-Prem sources using IBM DataWorks Forge

One of the key use cases for IBM DataWorks Forge is to move data from different on-prem sources into cloud based targets.  Examples of on-prem sources supported by IBM DataWorks Forge are: IBM Db2, Oracle, IBM Pure Data For Analytics, IBM Informix, etc.  Similarly examples of cloud based targets include IBM dash DB, IBM Db2, Amazon S3, Bluemix Object Storage, etc.  

In order to connect to any on-prem source, IBM DataWorks Forge makes use of IBM Secure Gateway service.  The Secure Gateway Service brings Hybrid Integration capability to your Bluemix environment. It provides secure connectivity from Bluemix to other applications and data sources running on-premise or in other clouds. A remote client is provided to enable secure connectivity.  In this blog post we will explain how to configure the Secure Gateway for usage with IBM DataWorks Forge.

  1. Open the IBM DataWorks Forge GUI in Bluemix.  For details on how to create a IBM DataWorks Forge service instance you can refer to the following blog post.
  2. Click on the "Secure Gateway" link on the left hand side panel under the "ADMIN" section
  3. Click on "Add Gateway" button.
  4. Provide a name for the secure gateway.  If you do not want the token which is used by on-prem clients for connecting to the gateway to expire, uncheck the box of "Token expiration".  Click on "Add Gateway".
  5. A new gateway will be created and will be shown in the list.  Click on the newly created gateway and it will show the details of the gateway.  On that page, click on "Add Clients" button at the top.
  6. There are three ways to connect to the gateway.  We will provide the details of using a docker based approach.  Hence select the bullet for "docker".
  7. It will show you a command which needs to be run on the machine where the secure gateway client is to be run.  Copy this command.
  8. You will need a machine in the on-prem environment where the secure gateway client needs to be installed.  On this environment, you need to first install docker (if it is not already installed). 
  9. In our testing we installed docker on a machine with OS Windows 7.1FP1 or higher.  The steps for installing are available at:  Once docker is installed, it also installs SSH on the windows machine. We need to ensure that the path for SSH is added to the PATH environment variable in System properties.  SSH is typically installed in C:\Program Files (x86)\Git\bin folder in windows.  Hence add this folder to the PATH environment variable in System properties.
  10. Once the above is done, open an command prompt  and execute the following commands:
    1. Change the directory to the folder where docker is installed.  It is typically installed in c:\Program Files\Boot2Docker
    2. Run the following command: "boot2docker init".  This is a onetime activity which is to be run the first time you install docker on the machine.
    3. Then run the following two commands
      1. boot2docker start
      2. boot2docker ssh

Commands mentioned in step 10.3 are to be run in a window which needs to be kept open whenever you want to read data from an on-prem source.  If the window is closed, you can open a new window and run the commands (in 10.3) again.

You can ensure that your docker is up-to-date by running the following command: 

docker pull ibmcom/secure-gateway-client

Once the above steps have been done, you can run the command mentioned in Step 7 above, e.g., docker run -it bluemix/secure-gateway-client <gateway_id>. This will download the Secure Gateway client, run it, and connect to the gateway.    Please note that whenever you wish to connect to an on-prem source, you will have to ensure that the command is running in a window on the on-premises machine.

Please ensure that there is no firewall between the machine on which docker is installed and the source.  An easy way to check this is to ensure that you can telnet to the source from the machine on which docker is installed. 

Once this is done, you are good to go!  


Creating a IBM DataWorks Forge service instance in Bluemix

IBM has a data integration service offering on the cloud called as IBM DataWorks Forge.  It is targeted towards the citizen analyst persona (business user).   Citizen analyst make use of various kinds of analytic tools available in the cloud for doing analytics.  One of the first things that they need is to get their data into the cloud.  This is where IBM DataWorks Forge helps.

IBM DataWorks Forge allows customers to move data from on-prem systems to Cloud as well as from Cloud based sources to other cloud based targets.  IBM DataWorks Forge also supports shaping of data such as joins, filter, string manipulation, standardisation, etc.  In this post I explain how you can create a service instance of IBM DataWorks Forge and start using it.

1. Navigate to and login with your IBM ID.
2. Click on the "Catalog" link in the top navigation bar
3. Search for "DataWorks" in the search bar
4. The DataWorks Icon will be displayed (shown below)

5.  However over it and it will show a button for "View more".  Click on the button.
6. Select the plan which you wish to subscribe to.  Change the service name if needed and click on "Create"
7. Navigate to the Dashboard using the top navigation bar.  It will show the newly created DataWorks Forge service instance in the list of services.
8.  Click on the DataWorks service instance. It will open the DataWorks landing page.
9.  Click on the arrow at the right part of the screen.
10. This will launch the DataWorks GUI and you are all set to use DataWorks!