Friday, 11 July 2014

How to read and write data from Greenplum Connector using InfoSphere DataStage

Greenplum Connector is a generic connector which comes under Connectivity component category in IBM InfoSphere Information Server. It is used by IBM InfoSphere DataStage to perform extract, load, lookup and metadata operations (by using InfoSphere Metadata Asset Manager (IMAM)) on Greenplum Database. Greenplum Connector is supported on Parallel Canvas in InfoSphere DataStage. This blog provides an overview of how to read and write data from Pivotal Greenplum database using Greenplum Connector through DataStage.

To connect to Pivotal Greenplum database through Greenplum Connector, it must have the proper driver to be loaded to connect to a particular database in order to extract, load or lookup the data.

Greenplum CC Configuration:
To connect to a Greenplum database, you must first configure an ODBC data source definition (DSN) for the database by using the IBM Greenplum Wire Protocol ODBC driver. Before you begin ensure that the ODBC driver for Greenplum libraries is installed.
a) Windows OS Procedure:
1. Start the Microsoft ODBC Data Source Administrator.
 On a 32-bit Windows computer, click Start > Control panel > Administrative Tools > Data Sources (ODBC)
On a 64-bit Windows computer, navigate to C:\Windows\SysWOW64\odbcad32.exe.
Note: On Windows, InfoSphere® Information Server is a 32-bit application. Even on a 64-bit Windows computer, the connector is running as a 32-bit application. Therefore you must use the 32-bit version of the ODBC Data Source Administrator as the Greenplum connector will not be able to locate DSN definitions created in the 64-bit ODBC Data Source Administrator.
2. On the System DSN page, click Add.
3. On the Create New Data Source page, select the IBM Greenplum Wire Protocol driver and click Finish. For information about configuring the driver options, see the Greenplum Wire Protocol Driver chapter in the DataDirect Connect Series for ODBC User's Guide.
b) UNIX OS Procedure:
1. Add a new ODBC DSN definition for the Greenplum Wire Protocol to the .odbc.ini file.
2. Specify the Hostname and PortNumber to the host and port where Greenplum database server is running.
3. Specify the Database name to default database to use for connections using the new ODBC DSN.
4. Save the .odbc.ini file.
Description=DataDirect 7.0 Greenplum Wire Protocol
where, gp_dev is the name of the Greenplum database the DSN connects to and gp_host is the host name where the Greenplum server resides.
* For more information about configuring and other driver options, see the Greenplum Wire Protocol Driver chapter in the DataDirect Connect Series for ODBC User's Guide.
c) Greenplum parallel file distribution program (gpfdist)
The Greenplum Connector stage exchanges data with the Greenplum server by using the Greenplum file distribution program, which is called gpfdist.  The gpfdist program runs on the database client, and it must be installed on the InfoSphere Information Server engine tier computer. For data to be transferred by using the gpfdist protocol, a network route must be present to enable bidirectional access by using an IP address and optionally the presence of a DNS server to facilitate the name resolution. The connector invokes a gpfdist process on every physical computer node and creates the external table. The host of the external table data is identified by the fastname entry in the parallel engine configuration file ($APT_CONFIG_FILE).
In order for the connector to invoke gpfdist on each engine tier, the location of gpfdist(%GPHOME_LOADERS%\bin) must be in the system path. In addition, the location of gpfdist dependent libraries (%GPHOME_LOADERS%\lib) must be in the system library path. On Windows, the system environment variable PATH is updated in the Advanced system settings. On Linux, the PATH environment variable is updated in the dsenv script. 
Note: On Windows, the Greenplum installer adds %GPHOME_LOADERS%\bin and %GPHOME_LOADERS%\lib to the PATH system environment variable. Verify that these directories are in the PATH.

Greenplum Connector as a Source:
The following figure shows a sample job which can extract the data from Greenplum database and put it into a Sequential File.

From the Job Canvas, Double click on Greenplum Connector Stage and go to the Output tab.


Data source (Required) - Data source name (DSN) as defined in the ODBC driver manager or the odbc.ini file for a Greenplum database
Database (Optional) - Name of the Greenplum database on the server defined in the ODBC Data source. If not specified, the database specified in the Data source will be used.
Username (Required) - The user name for the ODBC connection
Password (Optional)* - The password for the ODBC connection
* Greenplum Database allows password less authentication, and can potentially support other forms of authentication (SSL, LDAP) through the ODBC driver. If the user would like to use these types of authentication methods then the connector should allow the Password property to be optional.
Usage :  Provide the source table name in Table Name option.


For Column Definition:
Go to the Columns tab under Outputs tab and specify the columns of the table which you want to extract from Greenplum Database.

Greenplum Connector as a target:

The following figure shows a sample job which can load the data from a sequential file into Greenplum database.


From the Job Canvas, Double click on Greenplum Connector Stage and go to the Input tab.


For Connection, please follow the same steps above which were performed when Greenplum CC
When the user selects the insert mode, then the user should be able to choose the direct insert mode. Select the below options
Write Mode=Insert
Table Action= Target table name.

For Column Definition:
Go to the Columns tab under Inputs tab and specify the columns of the table in which you want to load data.

Greenplum connector can be use to perform operations like read data from Greenplum databases or write data to Greenplum databases or look up data in the contexts of those jobs. Also by using InfoSphere Metadata Asset Manager (IMAM), it can be used to import metadata from Greenplum database.

Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

1 comment: