Information Integration Blog: Create a QualityStage Match Specification in 8 easy steps!!!

The QualityStage Match Wizard is a simple interactive tool for creating template-based match specifications. We just need to answer a few guided questions and make some simple selections, and we'll be all set! A basic match specification can be created quickly and easily with minimal knowledge of matching concepts, Match Designer functionality, and its workflow.

The match specifications created using the Match Wizard serve as a starting point for many purposes. Customers can use them to learn the match specification creation process and the concepts of matching. They show how to choose blocking columns, match commands, match thresholds, and reliability and chance agreement values (m probability and u probability) for a given data set, and how to configure the test environment. Sales executives can use them in their demos instead of building match specifications from scratch, with a minimal learning curve involved.
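To make the m and u probabilities a little more concrete, here is a minimal Python sketch of how they are commonly turned into agreement and disagreement weights in probabilistic record linkage. The function names and the sample values are my own assumptions for illustration; the wizard picks these values for you, and the exact computation inside QualityStage may differ.

```python
import math

def agreement_weight(m: float, u: float) -> float:
    """Weight added when a column agrees (assumed log2(m/u) formulation)."""
    return math.log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    """Weight added when a column disagrees (assumed log2((1-m)/(1-u)))."""
    return math.log2((1 - m) / (1 - u))

# Hypothetical values: the column agrees in 90% of true matches (m)
# and agrees by chance in 1% of non-matching pairs (u).
m, u = 0.9, 0.01
print("agreement weight:", agreement_weight(m, u))        # positive
print("disagreement weight:", disagreement_weight(m, u))  # negative
```

A column that is reliable (high m) and rarely agrees by chance (low u) contributes a large positive weight when it agrees, which is why these two values matter when tuning a match pass.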

From IIS version 8.7, the Match Wizard is available as an enhancement to the Match Designer; it is not an alternative to or a substitute for the Match Designer. Once the Match Wizard steps are completed, the Match Designer is launched for any further development, refinement, saving and testing of the wizard-generated match specification, so that it can subsequently be used in a match job. Currently, we can use the Match Wizard to create match specifications for data standardized using the QualityStage US Name and US Address rule sets.

For the matching process, we need sample data and its frequency distribution information. It is always recommended to standardize the sample data before using it in a match specification, as the standardization process ensures uniformity in the data. The QualityStage Standardize stage and its rule sets can be used to achieve this. The frequency distribution of the sample plays a very important role in the matching process: a value that occurs more frequently is less significant while matching, because the chance of two records agreeing on it by coincidence is high, and vice versa. The distributions of the sample data can be obtained by using the QualityStage Match Frequency stage.
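The idea that frequent values carry less significance can be illustrated with a small Python sketch. This is not what the Match Frequency stage produces internally; the column values and the `rarity_weight` helper are hypothetical and only show the principle.

```python
from collections import Counter
import math

def value_frequencies(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def rarity_weight(frequency):
    """Illustrative significance score: rarer values score higher."""
    return -math.log2(frequency)

# Hypothetical standardized surname column from a sample source.
surnames = ["SMITH", "SMITH", "SMITH", "SMITH",
            "JONES", "JONES", "ZABRISKIE", "NGUYEN"]

freqs = value_frequencies(surnames)
for value, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(value, freq, rarity_weight(freq))
```

Two records that both say SMITH tell us less than two records that both say ZABRISKIE, simply because SMITH is so common in the sample.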

The Match Designer expects the input sample data and frequency distributions to be DataStage data set files. We can use the sample data and the predefined jobs that come with the product to standardize the data and create the sample and frequency data sets. Sample data can be found at - ISInstallationDirectory/Server/PXEngine/DataQuality/MatchTemplates/StandardizationInput from IIS version 9.1. The DataStage export (dsx) file of the predefined jobs, which can be imported into any DataStage project, can be found at - ISInstallationDirectory\Clients\Samples\DataQuality\MatchTemplates\Jobs\PredefinedJobs.dsx. This dsx also contains match jobs, which can be used to deploy the completed match specifications.

Steps to create a match specification using the Match Wizard:

1. Launch the Match Wizard
2. Select the Match Form
3. Select the Match Type
4. Select the Match Threshold
5. Select the additional column(s)
6. Configure Test Environment
   a. Source data set
   b. Frequency data set
   c. Database Connection
7. Summary
8. Save the Match Specification in the Match Designer

Let's see each of these steps in detail:

Step # 1 - Launch the Match Wizard

In the DataStage Designer client, click File → New → Data Quality and select Match Specification (Fig 1).

In the 'Select Match Build Method' dialog, click the 'Help me get started' link (Fig 2). This will launch the Match Specification Setup Wizard.

Let's get familiar with the Match Wizard design (refer Fig 3).

Default selections are provided wherever possible, as in the one below.

Step # 2 - Select the Match Form (refer Fig 3 above)

There are 2 kinds of matching available:

Un-duplicate Matching – The option 'Within a single source' is for creating an Un-duplicate match specification, where matching is done within a single data source (generally used to eliminate duplicates in a source file)
Reference Matching – The option 'One source to another source' is for creating a Reference match specification, where a data source is matched against a reference source (generally used to enrich a source file from a reference file)
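The difference between the two match forms comes down to which record pairs are candidates for comparison. Here is a rough Python sketch of that idea; the function names and record ids are made up for illustration and say nothing about how QualityStage implements it.

```python
from itertools import combinations, product

def unduplicate_pairs(source):
    """Within a single source: every record is paired with every other record."""
    return list(combinations(source, 2))

def reference_pairs(source, reference):
    """One source to another: each source record is paired with each reference record."""
    return list(product(source, reference))

# Hypothetical record ids.
source_ids = ["S1", "S2", "S3"]
reference_ids = ["R1", "R2"]

print(unduplicate_pairs(source_ids))
print(reference_pairs(source_ids, reference_ids))
```

In practice the number of candidate pairs is cut down drastically by blocking, but the shape of the comparison stays the same for each form.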

The appropriate Match Form should be selected according to the requirement. For now, let's continue with the default selection, 'Within a single source'.

Step # 3 - Select the Match Type (refer Fig 4)

The Match Wizard provides us with 4 types of matching for each match form.

Individual Deduplication – This match type helps us identify duplicate record entries for a person residing at an address
Individual Householding – This match type helps us identify duplicate record entries for people residing at an address
Business Deduplication – This match type helps us identify duplicate record entries for a business at an address
Business Householding – This match type helps us identify duplicate record entries for businesses at an address

The match type should be determined based on the business goal for matching. In this form too, let's continue with the default option, 'Individual Deduplication'.

Step # 4 - Select the Match Threshold (refer Fig 5)

The Match Tolerance or Match Threshold is chosen based on whether we want to be certain about the matched records or want to consider all possible or potential matches. Based on the match threshold selected, predetermined match and clerical cutoff values are assigned for each match pass.

Lower the Match Threshold – This results in more matches with lower certainty, and so more false positives (a false positive is a record pair categorized as a match that is actually a non-match)
Raise the Match Threshold – This results in fewer matches with higher certainty, and so more false negatives (a false negative is a record pair categorized as a non-match that is actually a match)
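To see how the match and clerical cutoffs work together, here is a small Python sketch that sorts record pairs into match, clerical and non-match groups by their composite weight. The cutoff numbers and pair weights are hypothetical; the wizard assigns its own predetermined values for each pass.

```python
def classify(weight, match_cutoff, clerical_cutoff):
    """Categorize a record pair by its composite weight."""
    if clerical_cutoff > match_cutoff:
        raise ValueError("clerical cutoff must not exceed match cutoff")
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"

# Hypothetical composite weights for four record pairs.
pair_weights = [12.5, 8.0, 5.5, 1.0]

# Hypothetical cutoff settings for a lower and a higher threshold.
lower = [classify(w, match_cutoff=6.0, clerical_cutoff=4.0) for w in pair_weights]
higher = [classify(w, match_cutoff=10.0, clerical_cutoff=7.0) for w in pair_weights]

print("lower threshold: ", lower)
print("higher threshold:", higher)
```

With the lower setting three of the four pairs are declared matches, while the higher setting accepts only the strongest pair and sends one more for clerical review.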

More information on this can be found at http://www-01.ibm.com/support/knowledgecenter/SSZJPZ_8.5.0/com.ibm.swg.im.iis.qs.ug.doc/topics/c_Defining_cutoff_values.html?lang=en

Let's continue with the default selection.

Step # 5 - Select the additional column(s) (Optional Step) (refer Fig 6)

We can improve the match results by including more columns in the matching. For each match type, the Match Wizard provides a set of additional columns which we can include in the match to get better results. However, we can add these columns to the match specification only if the source data has been standardized with one or more of the QualityStage rule sets VDATE, VEMAIL, VPHONE and USTAXID. There is a requirements twisty under each column which can be expanded to see the conditions that must be met to use that column in the matching. For each additional column selected, an individual match pass is created.

To keep our match specification simple, we are not selecting any of the additional columns here.

Step # 6 - Configure Test Environment (Optional Step)

In order to execute the match specification, the Match Designer needs to know where it can access the sample input data and reference data (if it is a reference match), the frequency distributions of the sample input and reference data, and the details of the database in which match results are stored when the execution completes successfully. Providing these details is called configuring the test environment.

This step is optional; if we don't intend to complete it now, we can do it in the Match Designer before executing the match specification. We select the check box for each item for which we intend to provide the information. (Fig 7)

Step # 6a - Source data set (Optional Step) (refer Fig 8)

We need to provide the location of the data set which contains the sample input for the Match Designer. Since we are creating a single source match specification here, we see only one file selection dialog. For a two source match (reference match), we would see an additional reference input data set file selection dialog.

Step # 6b - Frequency data set (Optional Step) (refer Fig 9)

We need to provide the location of the data set which contains the frequency distribution of the sample input for the Match Designer. Here too, since we are creating a single source match specification, we see only one file selection dialog. For a two source match (reference match), we would see an additional reference frequency data set file selection dialog.

Step # 6c - Database Connection (Optional Step) (refer Fig 10)

We need to provide the database connection details or the data connection object which the Match Designer and the QS server will use to connect to the Match Designer results database.

Step # 7 - Summary (refer Fig 11)

That's it!! We are almost done! The summary of all the selections made in the Match Wizard will be displayed. Any optional step completed will have a check mark and those not completed will be greyed out. Finish, Back and Cancel buttons will be enabled. We can go back to any form and change any of the selections made and the changes will be reflected in the Summary form.

Step # 8 - Save the Match Specification in the Match Designer (refer Fig 12)

On clicking the Finish button in the Summary form, the Match Designer is launched with the template-generated, one-source de-duplication match specification and its predetermined default match passes. Each match pass is composed of predetermined blocking columns, match commands and cutoff values set to a lower or higher threshold as per the selection made in the wizard. The test environment is populated with the details entered in the Match Wizard (to open the Test Environment window, under the Compose tab, go to Configure Specification → Test Environment). Save all the match passes and the match specification with the default names or with names of your choice, test them and get the match results.
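If the role of blocking columns in a match pass is new to you, the following Python sketch may help. It groups records by a blocking key and only compares records that land in the same block, once per pass. The column names, records and the `block_pairs` helper are all hypothetical; the wizard-generated passes use their own predetermined blocking columns.

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, blocking_columns):
    """Return candidate id pairs for records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in blocking_columns)
        blocks[key].append(record["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical standardized records.
records = [
    {"id": 1, "last_name": "SMITH", "zip": "10001"},
    {"id": 2, "last_name": "SMITH", "zip": "10001"},
    {"id": 3, "last_name": "SMYTH", "zip": "10001"},
    {"id": 4, "last_name": "SMITH", "zip": "94105"},
]

pass1 = block_pairs(records, ["zip"])        # pass 1 blocks on zip
pass2 = block_pairs(records, ["last_name"])  # pass 2 blocks on last name
print(sorted(pass1))
print(sorted(pass2))
print(sorted(pass1 | pass2))
```

Each pass looks at only a subset of all possible pairs, and using several passes with different blocking columns gives a pair that was missed by one pass another chance to be compared.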

Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

Wednesday, 9 July 2014