Information Integration Blog: Extract Unstructured Data using Information Server Connectivity Stage

Abstract

Unstructured data is information that does not have a predefined data model or does not fit well into relational tables. Unstructured data can be text from books, journals, metadata, audio, video files, the body of word processor documents, web pages, and presentation charts. This blog describes about the stage that handles the unstructured data.

Introduction:

With the release of Information Server 9.1, it supports the Unstructured Data with Microsoft Excel files as data source.In this blog,we are going to create a job to extract the unstructured data and explain about the properties with an example.

Configure to extract data

Create a stage in information server with the following two components:

Unstructured data stage
Output sequential file

Following figure illustrates how to extract data by using unstructured data stage. Double click the unstructured data stage icon to enter or modify the options. Options for the respective stage need to be specified at stage level appropriately and in configure window to make the job run.

Configure Window

Following figure describes the Configure window of unstructured stage which is displayed when configure button in the Stage properties is clicked. Configure data source,import,map sections appropriately and click OK .Respective stage properties and column schema is populated.

Configure Window Properties

Specify filename,Password if file is password protected in Data source Section and specify template file for design time when wild cards specified in the input excel files. Select Use template file for design time and specify a template file name. Template file is used for subsequent configuration steps, and not used at runtime.Specify a value for Template password if the specified template file is password protected.

Specify range expression and click load button. Appropriate excel columns will be displayed in Import Section.

From the Column header field, select First row of data ranges. When First row of data ranges is selected, first row is regarded as the header and the Unstructured Data stage starts extracting from the second row.

Different types of data like File/Document properties , Sheet/Row/Cell Information values can be extracted using property tab and Custom properties can be specified in Advanced tab in Import Section.

Click Import button to get the data stage columns. Other cell information like formula,comment,author of the comment,Hyperlink label,type,address can be extracted by specifying the appropriate value for Import option in Map section. Click OK

Stage Level Properties

Unstructured data stage can be configured to skip null values in the cells that are being extracted by specifying skip null rows option in stage properties

Run time column propagation should be enabled and appropriate property should be set for hidden columns option to extract hidden column data

When extracting data from the file, specify how you want to handle errors. Select Fail to abort the job. Select Skip to skip the data and continue to run the job

Click OK , compile and execute the job to get the required output.

Example:

In this example there are two input excel file names (Employee1.xls and Employee2.xls). To retrieve the data from both the files mention the File name as Employee*.xls. To load the excel columns in Import section template file ie., original file name either employee1.xls or employee2.xls should be specified in the Template File option in Configure window.

Input Excel Sheet Data (Employee1.xls)

Input Excel Sheet Data (Employee2.xls)

Output Data (sequential File)

Column_A – Column_G contains actual data.RowNumber Column the details of the row from where the data is extracted. Sheet Name columns provides the details from which sheet the data is extracted. Cell_D7 is the value extracted for the property specified in the particular cell property in Advanced tab of Import section.

Conclusion

Information Server 9.1 release supports single output link which has single schema. Upcoming release supports multiple output links to extract multiple data structures.

Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

Information Integration Blog

Saturday, 14 June 2014

Extract Unstructured Data using Information Server Connectivity Stage – Unstructured Connector

1 comment: