Wednesday 30 July 2014

How to read and write data from JDBC Connector using InfoSphere DataStage?



JDBC Connector is a generic connector which comes under the Connectivity component category in IBM InfoSphere Information Server. It is used by IBM InfoSphere DataStage to perform extract, load, lookup and metadata operations on databases which provide Java Database Connectivity. This blog provides an overview of how to read and write data from a database using the JDBC Connector through DataStage.

To connect to a database through the JDBC Connector, the proper JDBC driver for that database must be loaded before data can be extracted, loaded or looked up. For example, to connect to a Derby database through the JDBC CC, we need the proper Derby JDBC driver; the driver class is packaged inside a jar file. Information about which driver to use for a given database can be found in that database's JDBC documentation; for example, the Derby JDBC driver is described at the link below:
https://builds.apache.org/job/Derby-docs/lastSuccessfulBuild/artifact/trunk/out/tools/rtoolsijtools16879.html

Java Database Connectivity is useful in many ways. Because it is based on Java, the same driver class can be used across all platforms, thanks to Java's platform independence. The same applies to the JDBC CC, because it uses the same technology to connect to any database that provides JDBC connectivity.

Advantages of using JDBC CC:

JDBC CC uses Java connectivity to connect to any data source. It only needs the driver class name and the jar file that contains that class. Because the driver class and jar file are Java, they are the same across all platforms, so there are no platform-specific issues.
Another advantage is the ability to connect to data sources for which there is no native connector and no ODBC connectivity, such as Apache Derby.
Also, it is easy to configure.

JDBC CC Configuration:

The JDBC CC comes with Information Server. It is easy to configure: the driver jar file only needs to be placed on the system where Information Server is installed, and the isjdbc.config file must point to it; there are no platform-specific issues as it is all Java. To configure the JDBC CC, the prerequisites are:

1) Install Information Server: the JDBC CC was introduced in IS 9.1.2.
2) Configure the isjdbc.config file under the Information Server home path: this configuration file contains the name of the driver class and the jar file that contains that driver class. For more information, please refer to:
http://pic.dhe.ibm.com/infocenter/iisinfsv/v9r1/index.jsp?topic=%2Fcom.ibm.swg.im.iis.conn.jdbc.usage.doc%2Ftopics%2Fjdbc_config_driver.html
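As an illustrative example (the jar path is an assumption for this sketch; see the linked documentation for the exact file location and syntax), an isjdbc.config entry for the Derby client driver could look like this:

CLASSPATH=/opt/derby/lib/derbyclient.jar
CLASS_NAMES=org.apache.derby.jdbc.ClientDriver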



Often, Java-based data sources such as Derby offer only a Java connectivity option (JDBC). In that case we can use the JDBC CC to connect to the database to extract and load the data.

JDBC Connector as a Source:

The following figure shows a sample job which can extract the data from Derby database and put it into a Sequential File.





From the job canvas, double-click on the JDBC Connector stage and go to the Output tab.



For Connection:

The JDBC CC needs a URL to connect to any data source. The URL syntax varies according to the data source from which we need to extract data. For a Derby database, the URL syntax looks like:

jdbc:<datasource_name>://<hostname>:<port>/<database>
jdbc:derby://localhost:1527/conn_db
localhost denotes that the Derby database is set up locally.
1527 is the port number on which Derby is running.
conn_db denotes the name of the database which has been created and is being used to fetch or load data.

Provide the Username and Password information and the Table name from which you want to fetch the data.
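Before running the job, it can help to verify the driver, URL and credentials outside DataStage. Below is a minimal standalone Java sketch, assuming derbyclient.jar is on the classpath and an illustrative table named EMPLOYEE with illustrative credentials:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DerbyConnectionTest {
    public static void main(String[] args) throws Exception {
        // Same URL that will be entered in the JDBC Connector stage
        String url = "jdbc:derby://localhost:1527/conn_db";
        try (Connection conn = DriverManager.getConnection(url, "dbuser", "dbpassword");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM EMPLOYEE")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // print the first column of each row
            }
        }
    }
}

If this program connects and returns rows, the same URL, user name and table name should also work in the connector's connection and usage properties.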

For Column Definition:

Go to the Columns tab under the Output tab and specify the columns of the table which you want to extract from the database.




JDBC Connector as a target:

The following figure shows a sample job which can load the data from a sequential file into Derby database.




From the job canvas, double-click on the JDBC Connector stage and go to the Input tab.

For Connection:

Please follow the same steps described above for using the JDBC CC as a source.

For Column Definition:

Go to the Columns tab under the Input tab and specify the columns of the table into which you want to load data.




Conclusion: The JDBC CC is an easy-to-learn connector with several advantages: platform independence (one driver can be used on all platforms), easy configuration, and access to databases for which no native connector is provided or which offer only JDBC connectivity.




Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

Friday 11 July 2014

How to read and write data from Greenplum Connector using InfoSphere DataStage

Greenplum Connector is a connector which comes under the Connectivity component category in IBM InfoSphere Information Server. It is used by IBM InfoSphere DataStage to perform extract, load, lookup and metadata operations (by using InfoSphere Metadata Asset Manager (IMAM)) on the Greenplum database. The Greenplum Connector is supported on the parallel canvas in InfoSphere DataStage. This blog provides an overview of how to read and write data from a Pivotal Greenplum database using the Greenplum Connector through DataStage.

To connect to a Pivotal Greenplum database through the Greenplum Connector, the proper ODBC driver must be configured before data can be extracted, loaded or looked up.

Greenplum CC Configuration:
To connect to a Greenplum database, you must first configure an ODBC data source definition (DSN) for the database by using the IBM Greenplum Wire Protocol ODBC driver. Before you begin, ensure that the ODBC driver libraries for Greenplum are installed.
a) Windows OS Procedure:
1. Start the Microsoft ODBC Data Source Administrator.
On a 32-bit Windows computer, click Start > Control Panel > Administrative Tools > Data Sources (ODBC).
On a 64-bit Windows computer, navigate to C:\Windows\SysWOW64\odbcad32.exe.
Note: On Windows, InfoSphere® Information Server is a 32-bit application. Even on a 64-bit Windows computer, the connector is running as a 32-bit application. Therefore you must use the 32-bit version of the ODBC Data Source Administrator as the Greenplum connector will not be able to locate DSN definitions created in the 64-bit ODBC Data Source Administrator.
2. On the System DSN page, click Add.
3. On the Create New Data Source page, select the IBM Greenplum Wire Protocol driver and click Finish. For information about configuring the driver options, see the Greenplum Wire Protocol Driver chapter in the DataDirect Connect Series for ODBC User's Guide.
b) UNIX OS Procedure:
1. Add a new ODBC DSN definition for the Greenplum Wire Protocol to the .odbc.ini file.
2. Set HostName and PortNumber to the host and port where the Greenplum database server is running.
3. Set Database to the default database to use for connections that use the new ODBC DSN.
4. Save the .odbc.ini file.
Example
[Greenplum_DEV_SERVER]
Driver=/opt/IBM/InformationServer/Server/branded_odbc/lib/VMgplm00.so
Description=DataDirect 7.0 Greenplum Wire Protocol
Database=gp_dev
HostName=gp_host
PortNumber=5432
where gp_dev is the name of the Greenplum database the DSN connects to, and gp_host is the host name where the Greenplum server resides.
* For more information about configuring and other driver options, see the Greenplum Wire Protocol Driver chapter in the DataDirect Connect Series for ODBC User's Guide.
c) Greenplum parallel file distribution program (gpfdist)
The Greenplum Connector stage exchanges data with the Greenplum server by using the Greenplum file distribution program, which is called gpfdist. The gpfdist program runs on the database client, and it must be installed on the InfoSphere Information Server engine tier computer. For data to be transferred by using the gpfdist protocol, a network route must be present to enable bidirectional access by using an IP address, and optionally a DNS server to facilitate name resolution. The connector invokes a gpfdist process on every physical computer node and creates the external table. The host of the external table data is identified by the fastname entry in the parallel engine configuration file ($APT_CONFIG_FILE).
In order for the connector to invoke gpfdist on each engine tier, the location of gpfdist (%GPHOME_LOADERS%\bin) must be in the system path. In addition, the location of the gpfdist dependent libraries (%GPHOME_LOADERS%\lib) must be in the system library path. On Windows, the system environment variable PATH is updated in the Advanced system settings. On Linux, the PATH environment variable is updated in the dsenv script (see the example after the note below).
Note: On Windows, the Greenplum installer adds %GPHOME_LOADERS%\bin and %GPHOME_LOADERS%\lib to the PATH system environment variable. Verify that these directories are in the PATH.
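As an illustration, the Linux dsenv additions might look like the following (the Greenplum loaders install path here is only an example; use the actual location of your gpfdist installation):

GPHOME_LOADERS=/usr/local/greenplum-loaders; export GPHOME_LOADERS
PATH=$GPHOME_LOADERS/bin:$PATH; export PATH
LD_LIBRARY_PATH=$GPHOME_LOADERS/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH

After editing dsenv, source it and restart the DataStage engine so that the connector picks up the new paths.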

Greenplum Connector as a Source:
The following figure shows a sample job which can extract the data from Greenplum database and put it into a Sequential File.















From the job canvas, double-click on the Greenplum Connector stage and go to the Output tab.


 















Connection:
Data source (Required) - Data source name (DSN) as defined in the ODBC driver manager or the odbc.ini file for a Greenplum database
Database (Optional) - Name of the Greenplum database on the server defined in the ODBC Data source. If not specified, the database specified in the Data source will be used.
Username (Required) - The user name for the ODBC connection
Password (Optional)* - The password for the ODBC connection
* Greenplum Database allows passwordless authentication and can potentially support other forms of authentication (SSL, LDAP) through the ODBC driver. To allow the use of these authentication methods, the connector keeps the Password property optional.
Usage: Provide the source table name in the Table name option.

 








For Column Definition:
Go to the Columns tab under the Output tab and specify the columns of the table which you want to extract from the Greenplum database.

Greenplum Connector as a target:

The following figure shows a sample job which can load the data from a sequential file into Greenplum database.








 

From the job canvas, double-click on the Greenplum Connector stage and go to the Input tab.


 













Connection:
For Connection, please follow the same steps described above for using the Greenplum CC as a source.
Usage:
When insert mode is selected, the connector performs a direct insert into the target table. Select the options below:
Write mode = Insert
Table name = the name of the target table

For Column Definition:
Go to the Columns tab under Inputs tab and specify the columns of the table in which you want to load data.










Conclusion
The Greenplum connector can be used to read data from, write data to, or look up data in Greenplum databases in the context of DataStage jobs. Also, by using InfoSphere Metadata Asset Manager (IMAM), it can be used to import metadata from a Greenplum database.

Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

Wednesday 9 July 2014

Create a QualityStage Match Specification in 8 easy steps!!!

  
        The QualityStage Match Wizard is a simple interactive tool which can be used to create template-based match specifications. We just need to answer a few guided questions, make simple selections and we'll be all set!! A basic match specification can be created quickly and easily with minimal knowledge of matching concepts, Match Designer functionality and its workflow.
     The match specifications created using the Match Wizard serve as a starting point for many purposes. Customers can use them to learn and understand the match specification creation process and the concept of matching. They can be used to understand how to choose blocking columns, match commands, the match threshold, reliability and chance agreements (m probability and u probability) for a given data set, and how to configure the test environment. Sales executives can use them in their demos instead of building match specifications from scratch, with a minimal learning curve involved.
      From IIS version 8.7, the Match Wizard is available as an enhancement to the Match Designer; it should be noted that it is not an alternative to or a substitute for the Match Designer. Once the Match Wizard steps are completed, the Match Designer is launched for any further development, refinement, saving and testing of the wizard-generated match specification so that it can subsequently be used in a match job. Currently, we can use the Match Wizard to create match specifications for matching data standardized using the QualityStage US Name and US Address rule sets.
        For the matching process, we need sample data and its frequency distribution information. It is always recommended to standardize the sample data before using it in a match specification, as the standardization process ensures uniformity in the data. The QualityStage Standardize stage and the rule sets can be used to achieve this. The frequency distribution information of the sample plays a very important role in the matching process: data values that occur more frequently are less significant while matching, because the chances of them agreeing by coincidence are very high, and vice versa. The frequency distributions of the sample data can be obtained by using the QualityStage Match Frequency stage.
     The Match Designer expects the input sample data and frequency distributions to be DataStage data set files. We can use the sample data and the predefined jobs that come with the product to standardize the data and create the sample and frequency data sets. From IIS version 9.1, sample data can be found at ISInstallationDirectory/Server/PXEngine/DataQuality/MatchTemplates/StandardizationInput. A DataStage export (dsx) file of the predefined jobs, which can be imported into any DataStage project, can be found at ISInstallationDirectory\Clients\Samples\DataQuality\MatchTemplates\Jobs\PredefinedJobs.dsx. This dsx also contains match jobs which can be used to deploy the completed match specifications.

 Steps to create a match specification using Match Wizard: 
  1. Launch the match wizard
  2. Select the Match Form
  3. Select the Match Type
  4. Select the Match Threshold
  5. Select the additional column(s)
  6. Configure Test Environment
    1. Source data set
    2. Frequency data set
    3. Database Connection
  7. Summary
  8. Save the Match Specification in the Match Designer


  Let's see each of these steps in detail:

    Step # 1: Launch the match wizard:
  • In the DataStage Designer client, click on File → New → Data Quality and select Match Specification (Fig 1).
 

  • In the 'Select Match Build Method' dialog, click on the 'Help me get started' link (Fig 2). This will launch the Match Specification Setup Wizard.


  • Let's get familiar with the Match Wizard design (refer Fig 3):
      • The Match Specification Setup Wizard is a three-pane form, with
        • the left pane showing the steps that need to be completed,
        • the center pane showing the options to choose from, and
        • the right pane showing examples and explanations to help us choose from the options in the center pane.
      • The Next and Back buttons are used to navigate from one form to the other.
      • The Cancel button can be used to exit the wizard at any step.
      • The Finish button launches the match specification in the Match Designer for further processing once all the required steps are completed.
Default selections are provided wherever possible, as in the form below.






Step # 2 - Select the Match Form (refer Fig 3 above)
     There are two kinds of matching available:
  • Un-duplicate Matching – The option 'Within a single source' is for creating an Un-duplicate match specification where matching is done within a data source (generally used to eliminate duplicates in a source file)
  • Reference Matching – The option 'One source to another source' is for creating a Reference match specification where data source is matched with a reference source (generally used to enrich a source file from a reference file)
     Appropriate Match Form should be selected according to the requirement. Now let's continue with the default selection 'Within a single source'.

Step # 3 - Select the Match Type (refer Fig 4)
      The Match Wizard provides us with 4 types of matching for each match form.
    • Individual Deduplication – This match type helps us identify duplicate record entries for a person residing in an address
    • Individual Householding – This match type helps us identify duplicate record entries for people residing in an address
    • Business Deduplication – This match type helps us identify duplicate record entries for a business in an address
    • Business Householding - This match type helps us identify duplicate record entries for businesses in an address
      The match type should be determined based on the business goal for matching. In this form too, let's continue with the default option selected, 'Individual Deduplication'.



Step # 4 - Select the Match Threshold (refer Fig 5)

         Match Tolerance or Match Threshold is determined based on whether we want to be certain about matched records or we want to consider all possible or potential matches. Based on the match threshold selected, predetermined match and clerical cut-off values will be assigned for each match pass.
    • Lower the Match Threshold – This results in more matches, with lower certainty and more false positives (a false positive is a record categorized as a match that is actually a non-match).
    • Raise the Match Threshold – This results in fewer matches, with higher certainty and more false negatives (a false negative is a record categorized as a non-match that is actually a match).
Let's continue with the default selection.



Step # 5 - Select the additional column(s) (Optional Step) (refer Fig 6)

         We can improve the match results by including more columns in the matching. For each match type, the Match Wizard provides a set of additional columns which we can include in the match to get better results. However, we can add these columns to the match specification only if the source data has been standardized with one or more of the QualityStage rule sets VDATE, VEMAIL, VPHONE or USTAXID. There is a requirements twisty under each column which can be expanded to see the conditions that must be met to use that column in the matching. For each additional column selected, an individual match pass is created.
    To keep our match specification simple, I am not selecting any of the additional columns here.





Step # 6 - Configure Test Environment (Optional Step)
           In order to execute the match specification, the Match Designer needs to know where it can access the sample input data and reference data (if it is a reference match), the frequency distributions of the sample input and reference data, and the details of the database into which match results are stored on successful completion of the execution. Providing these details is called configuring the test environment.
          This step is optional, and if we don't intend to complete it now, we can do it in the Match Designer before executing the match specification. We'll select the check box for each item for which we intend to provide the information (Fig 7).

Step # 6a - Source data set (Optional Step) (refer Fig 8)
      We need to provide the location of the dataset which contains the sample input for the Match Designer. Here since we are creating a single source match specification, we see only one file selection dialog. For a two source match (reference match) we would see an additional reference input data set file selection dialog.

 
Step # 6b - Frequency data set (Optional Step) (refer Fig 9)
          We need to provide the location of the dataset which contains the frequency distribution of the sample input for the Match Designer. Here too since we are creating a single source match specification, we see only one file selection dialog. For a two source match (reference match) we would see an additional reference frequency data set file selection dialog. 

      

   Step # 6c - Database Connection (Optional Step) (refer Fig 10)
           We need to provide the database connection details or the data connection object which the Match Designer and the QualityStage server will use to connect to the Match Designer results database.

   
 
Step # 7 - Summary (refer Fig 11)
           That's it!! We are almost done! The summary of all the selections made in the Match Wizard will be displayed. Any optional step completed will have a check mark and those not completed will be greyed out. Finish, Back and Cancel buttons will be enabled. We can go back to any form and change any of the selections made and the changes will be reflected in the Summary form.

    


  Step # 8 - Save the Match Specification in the Match Designer (refer Fig 12)
         On clicking the Finish button in the Summary form, the Match Designer is launched with the template-generated one-source de-duplication match specification with predetermined default match passes. Each match pass is composed of predetermined blocking columns, match commands and cut-off values set to a lower or higher threshold as per the selection made in the wizard. The test environment will be populated with the details entered in the Match Wizard (to open the Test Environment window, under the Compose tab, go to Configure Specification → Test Environment). Save all the match passes and the match specification with the default names or with names of your choice, test them and get the match results.


     
 

    Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

Tuesday 8 July 2014

Access XMETA repository using SQL Views


Background

The contents of the XMETA metadata repository for InfoSphere Information Server 9.1 and prior releases are not directly accessible to any external application. This is because the repository is designed using the EMF model. These models are often very abstract and generic, which makes them difficult for an application to understand, especially if the application is not familiar with the semantics of the model. Further, the database tables and views of the metadata repository are not uniform across the different database platforms supported by Information Server, and the table and view names are very cryptic. Proper relationships are also not defined between the database tables. The solution to this problem is SQL Views.

What are SQL Views
SQL Views are a set of predefined/custom database views defined on top of the XMETA metadata repository, allowing users to access a subset of the repository in read-only mode. These views not only hide the internal details and complexities of the XMETA repository (e.g. XMETA stores dates as long values, whereas the views return them as date fields), they also provide uniform access across different database platforms. SQL Views are grouped into separate schemas (view schemas) based on product/model area. The views can be accessed directly by any external or internal application using ODBC/JDBC. A dedicated model schema is also available for each view schema, with tables derived from the views having the same names and structure. These model schemas can be used for development purposes. Primary and foreign keys that represent the original model references and inheritance relationships are defined on the tables. The model tables and associated keys can be imported to generate the reporting model from which different reports can be designed.
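As a sketch of how an external application might consume these views over JDBC, assuming a DB2-hosted XMETA repository, the IBM DB2 JDBC driver jar on the classpath, and placeholder host, credentials and view names (the actual view schemas and view names are listed in the product documentation):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class XmetaViewReader {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port and credentials for the XMETA repository database
        String url = "jdbc:db2://xmeta-host:50000/xmeta";
        try (Connection conn = DriverManager.getConnection(url, "xmetauser", "password");
             Statement stmt = conn.createStatement();
             // CMVIEWS.SOMEVIEW is a hypothetical schema/view name used only for illustration
             ResultSet rs = stmt.executeQuery(
                 "SELECT * FROM CMVIEWS.SOMEVIEW FETCH FIRST 10 ROWS ONLY")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // print the first column of each row
            }
        }
    }
}

The same kind of query could also be issued through an ODBC DSN from a reporting tool instead of through JDBC.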

Why to use SQL Views

SQL Views make the XMETA metadata repository accessible to internal and external applications via JDBC/ODBC. SQL Views decouple applications from the underlying persistence model and minimize the impact of model upgrades and development locksteps. They also provide a workaround for some internal proprietary XMQL limitations (e.g. support for scalar functions). The views are uniform across database platforms and are easily consumable. Persistent data representations are converted to friendlier types in the views when applicable (e.g. long integers to SQL timestamps). All internal system attributes (e.g. lock IDs, blobs holding objects) are filtered out of the views. Because the SQL Views are created in separate, dedicated schemas, security is enforced at the database level, where access can be restricted. SQL Views also make it possible to filter out unwanted data in the view definitions.


Where to use them:
SQL Views make it possible to integrate any application which needs to access XMETA's metadata using JDBC/ODBC. They can also be used to integrate with reporting tools such as IBM Cognos. Starting with InfoSphere Information Server 11.3, the suite bundle comes with a predefined set of sample Cognos reports which can be used directly in an IBM Cognos environment with minimal or no changes.

Available SQL Views with InfoSphere Information Server 11.3

Starting with InfoSphere Information Server 11.3, SQL Views are available in three different functional/domain areas: Common Metadata, Data Quality and Information Governance.
- Common Metadata Views
Common Metadata contains SQL Views for a subset of Common Model object types that represent database metadata which is usually imported by bridges or connectors. These SQL Views describe hosts, databases, schemas, tables, views, stored procedures, and columns. In addition, there are views for notes (annotations), users, and stewardship assignments.
- Data Quality Views
Data Quality contains SQL Views for object types from the data quality domain that are created in InfoSphere Information Analyzer: projects, data rules, data rule sets, data rule definitions, and data rule set definitions.
- Information Governance Views
Information Governance contains SQL Views for object types from the information governance domain that are created in InfoSphere Information Governance Catalog, including categories, terms, labels, information governance policies, and information governance rules. This also includes SQL Views that link information governance objects to the data rules that implement them, and to other common metadata assets.



Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

Monday 7 July 2014

How to import Metadata using Oracle Connector from IMAM (InfoSphere Metadata Asset Manager) Window (or Information Server DataStage Client) on Unix OS



For any connector metadata import, the user needs to restart the Node Agent after setting all the required parameters for the connector. For the Oracle Connector, we can do MDI (Metadata Import) using the steps below.
Steps to set up the Oracle environment prior to import (please use the root user for the steps below)
1   Install the Oracle client/server on the Information Server machine

2 Set the Oracle-related environment variables such as ORACLE_HOME, LD_LIBRARY_PATH (LIBPATH on AIX) and PATH using one of the approaches below
export ORACLE_HOME=/home/oracle/client/product/11.2.0/client_1
export LD_LIBRARY_PATH=/home/oracle/client/product/11.2.0/client_1/lib:$LD_LIBRARY_PATH
export PATH=/home/oracle/client/product/11.2.0/client_1/bin:$PATH
OR
Add the below parameters in dsenv
            PATH=/home/oracle/client/product/11.2.0/client_1/bin:$PATH; export PATH

LD_LIBRARY_PATH=/home/oracle/client/product/11.2.0/client_1/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH

ORACLE_HOME=/home/oracle/client/product/11.2.0/client_1; export ORACLE_HOME
Now source dsenv and restart the DSEngine.

*LIBPATH needs to be used on AIX instead of LD_LIBRARY_PATH

3  Source dsenv; this can be done with the command below
 source /opt/IBM/InformationServer/Server/DSEngine/dsenv

4  Go to /opt/IBM/InformationServer/ASBNode/bin/ and restart the Node Agent with ./NodeAgents.sh restart

5  Go to the DataStage Client or the IMAM GUI window and run the Oracle Connector metadata import steps.

Steps to import metadata using the IMAM GUI
1) Connect to the IMAM GUI by using the link below
https://localhost:9443/ibm/imam/console/loginForm.jsp  (where localhost is your machine's hostname or IP)
2) Go to the Administration tab and create a Metadata Interchange Server.


3) Go to the Import tab and click on New Import Area


4) Select the IBM InfoSphere Oracle Connector from the bridge/connector selection after selecting the Metadata Interchange Server, give the import area a name and click on the Next button

5) Select the Data Connection


6) Select the assets to import and click on the Next button

7) Provide the host system name and click on the Next button
8) On the next page, select Express Import and click on the Import button
9) You can see the imported assets in the Shared Import tab of the IMAM window


Summary: We can import metadata using the Oracle Connector from the IMAM GUI (or the DataStage Client) after setting all the Oracle client-related environment variables and restarting the Node Agent as the root user.


Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”