The QualityStage Match Wizard is a simple interactive tool for
creating template-based match specifications. We just need to answer
a few guided questions and make some simple selections, and we're all
set! A basic match specification can be created quickly and easily
with minimal knowledge of matching concepts, Match Designer
functionality, and its workflow.
The match specifications created using the Match Wizard serve as a
starting point for many purposes. Customers can use them to learn the
match specification creation process and the concept of matching.
They can be used to understand how to choose blocking columns, match
commands, match thresholds, and the reliability and chance agreement
values (m probability and u probability) for a given set of data, and
how to configure the test environment. Sales executives can use them
in their demos instead of building match specifications from scratch,
with a minimal learning curve involved.
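To get a feel for what the m and u probabilities do, here is a minimal Python sketch of the weights used in standard probabilistic record linkage. It assumes the usual base-2 formulation: an agreement weight of log2(m/u) and a disagreement weight of log2((1-m)/(1-u)). The function names and the sample values are hypothetical and chosen only for illustration; they are not taken from a wizard-generated specification.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Weight contributed when a column agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Weight contributed when a column disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


# Hypothetical values: m = 0.9 (reliability), u = 0.01 (chance agreement).
m, u = 0.9, 0.01
print(f"agreement weight:    {agreement_weight(m, u):.2f}")
print(f"disagreement weight: {disagreement_weight(m, u):.2f}")
```

A column with high reliability and low chance agreement contributes a large positive weight when it agrees and a negative weight when it disagrees.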
From IIS (IBM InfoSphere Information Server) version 8.7, the Match
Wizard is available as an enhancement to the Match Designer. Note
that it is not an alternative to or a substitute for the Match
Designer. Once the Match Wizard steps are completed, the Match
Designer is launched for any further development, refinement, saving,
and testing of the wizard-generated match specification so that it
can subsequently be used in a match job. Currently, we can use the
Match Wizard to create match specifications for matching data
standardized using the QualityStage US Name and US Address rule sets.
For the matching process, we need sample data and its frequency
distribution information. It is always recommended to standardize the
sample data before using it in a match specification, as the
standardization process ensures uniformity in the data. The
QualityStage Standardize stage and its rule sets can be used to
achieve this. The frequency distribution of the sample plays a very
important role in the matching process: a value that occurs more
frequently is less significant while matching, because the chance of
it agreeing between two records is very high, and vice versa. The
frequency distributions of the sample data can be obtained by using
the QualityStage Match Frequency stage.
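The idea that frequent values carry less significance can be sketched in a few lines of Python. This is only an illustration of the principle, not the computation that the Match Frequency stage performs: it assumes a simple score of log2(1 / relative frequency), and the function name and sample surnames are hypothetical.

```python
from collections import Counter
import math


def value_weights(values):
    """Map each value to log2(total / count); rarer values score higher."""
    counts = Counter(values)
    total = len(values)
    return {value: math.log2(total / count) for value, count in counts.items()}


# Hypothetical standardized surnames from a sample data set.
surnames = ["SMITH", "SMITH", "SMITH", "JONES", "JONES", "ZABRISKIE", "SMITH", "SMITH"]
weights = value_weights(surnames)

# Print values from most to least significant.
for value, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{value:10s} {weight:.2f}")
```

An agreement on a rare surname is stronger evidence of a true match than an agreement on a very common one, which is why the frequency data matters.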
The Match Designer expects the input sample data and frequency
distributions to be DataStage dataset files. We can use the sample
data and the predefined jobs that come with the product to
standardize the data and create the sample and frequency datasets.
From IIS version 9.1, sample data can be found at
ISInstallationDirectory/Server/PXEngine/DataQuality/MatchTemplates/StandardizationInput
The DataStage export (dsx) file of the predefined jobs, which can be
imported into any DataStage project, can be found at
ISInstallationDirectory\Clients\Samples\DataQuality\MatchTemplates\Jobs\PredefinedJobs.dsx
This dsx file also contains match jobs, which can be used to deploy
the completed match specifications.
Steps to create a match specification using the Match Wizard:
- Launch the Match Wizard
- Select the Match Form
- Select the Match Type
- Select the Match Threshold
- Select the additional column(s)
- Configure the Test Environment
  - Source data set
  - Frequency data set
  - Database Connection
- Summary
- Save the Match Specification in the Match Designer
Let's look at each of these steps in detail:
Step # 1 - Launch the Match Wizard
- In the DataStage Designer Client, click File → New → Data Quality, then select Match Specification (Fig 1).
- In the 'Select Match Build Method' dialog, click the 'Help me get started' link (Fig 2). This launches the Match Specification Setup Wizard.
- Let's get familiar with the Match Wizard design (refer Fig 3). The Match Specification Setup Wizard is a three-pane form with:
  - a left pane showing the steps that need to be completed,
  - a center pane showing the options to choose from, and
  - a right pane showing examples and explanations to help us choose from the options in the center pane.
- The Next and Back buttons are used to navigate from one form to another.
- The Cancel button is used to exit the wizard at any step.
- The Finish button is used to launch the match specification in the Match Designer for further processing once all the required steps are completed.
Default selections are provided wherever possible, as in the form
shown in Fig 3.
Step # 2 - Select the Match Form (refer Fig 3 above)
There are two kinds of matching available:
- Un-duplicate Matching – The option 'Within a single source' is for creating an Un-duplicate match specification, where matching is done within a single data source (generally used to eliminate duplicates in a source file).
- Reference Matching – The option 'One source to another source' is for creating a Reference match specification, where a data source is matched with a reference source (generally used to enrich a source file from a reference file).
The appropriate Match Form should be selected according to the
requirement. For now, let's continue with the default selection
'Within a single source'.
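The difference between the two match forms comes down to which record pairs are candidates for comparison. The Python sketch below illustrates this with hypothetical function names and toy record identifiers. It ignores blocking, so it shows the idea rather than how the product actually selects pairs.

```python
from itertools import combinations, product


def unduplicate_pairs(source):
    """Within a single source: every record is compared with every other record once."""
    return list(combinations(source, 2))


def reference_pairs(source, reference):
    """One source to another: every source record is compared with every reference record."""
    return list(product(source, reference))


# Hypothetical record identifiers.
source = ["s1", "s2", "s3"]
reference = ["r1", "r2"]

print(unduplicate_pairs(source))
print(reference_pairs(source, reference))
```

In the single-source form, records are never paired with themselves and each pair appears only once. In the two-source form, pairs are only formed across the two inputs.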
Step # 3 - Select the Match Type (refer Fig 4)
The Match Wizard provides us with four types of matching for each
match form:
- Individual Deduplication – This match type helps us identify duplicate record entries for a person residing at an address.
- Individual Householding – This match type helps us identify duplicate record entries for people residing at an address.
- Business Deduplication – This match type helps us identify duplicate record entries for a business at an address.
- Business Householding – This match type helps us identify duplicate record entries for businesses at an address.
The match type should be determined based on the business goal for
matching. In this form too, let's continue with the default option
selected, 'Individual De-duplication'.
Step # 4 - Select the Match Threshold (refer Fig 5)
The Match Tolerance or Match Threshold is determined based on whether
we want to be certain about matched records or we want to consider
all the possible or potential matches. Based on the match threshold
selected, predetermined match and clerical cutoff values are assigned
for each match pass.
- Lower the Match Threshold – This results in more matches with lower certainty and more false positives (a false positive is a pair of records categorized as a match that is actually a non-match).
- Raise the Match Threshold – This results in fewer matches with higher certainty and more false negatives (a false negative is a pair of records categorized as a non-match that is actually a match).
More information on this can be found at
http://www-01.ibm.com/support/knowledgecenter/SSZJPZ_8.5.0/com.ibm.swg.im.iis.qs.ug.doc/topics/c_Defining_cutoff_values.html?lang=en
Let's continue with the default selection.
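To see how the match and clerical cutoffs interact with the threshold choice, here is a small Python sketch. It assumes the common convention that a pair with a composite weight at or above the match cutoff is a match, a pair between the clerical cutoff and the match cutoff goes to clerical review, and anything below is a non-match. The function name, the weights, and the cutoff values are all hypothetical; the wizard assigns its own predetermined values.

```python
def classify(weight: float, match_cutoff: float, clerical_cutoff: float) -> str:
    """Classify a record pair by its composite weight."""
    if clerical_cutoff > match_cutoff:
        raise ValueError("clerical cutoff must not exceed match cutoff")
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"


# Hypothetical composite weights for five record pairs.
weights = [14.2, 9.5, 7.1, 4.0, -2.3]

# Hypothetical (match cutoff, clerical cutoff) values for two threshold settings.
settings = {"higher threshold": (12.0, 8.0), "lower threshold": (7.0, 3.0)}

for name, (match_cutoff, clerical_cutoff) in settings.items():
    labels = [classify(w, match_cutoff, clerical_cutoff) for w in weights]
    print(name, labels)
```

With the same weights, the lower setting moves more pairs into the match category, which is the trade-off between false positives and false negatives described above.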
Step # 5 - Select the additional column(s) (Optional Step) (refer Fig 6)
We can improve the match results by including more columns in the
matching. For each match type, the Match Wizard provides a set of
additional columns that we can include in the match to get better
match results. However, we can add these columns to the match
specification only if the source data has been standardized with one
or more of the QualityStage rule sets VDATE, VEMAIL, VPHONE, and
USTAXID. There is a Requirements twisty under each column, which can
be expanded to see the conditions that must be met to use that column
in the matching. For each additional column selected, an individual
match pass is created.
To keep our match specification simple, we are not selecting any of
the additional columns here.
Step # 6 - Configure Test Environment (Optional Step)
In order to execute the match specification, the Match Designer needs
to know where it can access the sample input data and reference data
(for a reference match), the frequency distributions of the sample
input and reference data, and the details of the database in which
match results are stored on successful completion of the execution.
Providing these details is called configuring the test environment.
This step is optional; if we don't intend to complete it now, we can
do it in the Match Designer before executing the match specification.
We'll select the check boxes for the items for which we intend to
provide information (Fig 7).
Step # 6a - Source data set (Optional Step) (refer Fig 8)
We need to provide the location of the dataset that contains the
sample input for the Match Designer. Since we are creating a
single-source match specification here, we see only one file
selection dialog. For a two-source match (reference match), we would
see an additional file selection dialog for the reference input data
set.
Step # 6b - Frequency data set (Optional Step) (refer Fig 9)
We need to provide the location of the dataset that contains the
frequency distribution of the sample input for the Match Designer.
Here too, since we are creating a single-source match specification,
we see only one file selection dialog. For a two-source match
(reference match), we would see an additional file selection dialog
for the reference frequency data set.
Step # 6c - Database Connection (Optional Step) (refer Fig 10)
We need to provide the database connection details or the data
connection object that the Match Designer and the QualityStage server
will use to connect to the Match Designer results database.
Step # 7 - Summary (refer Fig 11)
That's it! We are almost done. A summary of all the selections made
in the Match Wizard is displayed. Any optional step that was
completed has a check mark, and those not completed are greyed out.
The Finish, Back, and Cancel buttons are enabled. We can go back to
any form and change any of the selections made, and the changes will
be reflected in the Summary form.
Step # 8 - Save the Match Specification in the Match Designer (refer Fig 12)
On clicking the Finish button in the Summary form, the Match Designer
is launched with the template-generated single-source de-duplication
match specification and its predetermined default match passes. Each
match pass is composed of predetermined blocking columns, match
commands, and cutoff values set to a lower or higher threshold as per
the selection made in the wizard. The test environment is populated
with the details entered in the Match Wizard (to open the Test
Environment window, under the Compose tab, go to Configure
Specification → Test Environment). Save all the match passes and the
match specification with the default names or with names of your
choice, then test them and get the match results.
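To make the role of blocking columns in a match pass more concrete, here is a simplified Python sketch. It assumes that blocking means only records sharing the same values in all blocking columns are compared with each other. The function name, the column names, and the records are hypothetical, and a real match pass would go on to apply match commands and cutoffs to each candidate pair.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_columns):
    """Return pairs of records that share the same values in all blocking columns."""
    blocks = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in blocking_columns)
        blocks[key].append(record)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


# Hypothetical standardized records; the column names are illustrative only.
records = [
    {"id": 1, "zip": "10001", "last_name": "SMITH"},
    {"id": 2, "zip": "10001", "last_name": "SMITH"},
    {"id": 3, "zip": "10001", "last_name": "JONES"},
    {"id": 4, "zip": "94105", "last_name": "SMITH"},
]

# Two hypothetical passes, each with its own blocking columns.
passes = {"pass 1": ["zip", "last_name"], "pass 2": ["zip"]}
for name, columns in passes.items():
    ids = [(a["id"], b["id"]) for a, b in candidate_pairs(records, columns)]
    print(name, ids)
```

Tighter blocking produces fewer candidate pairs to compare, while looser blocking in a later pass can pick up pairs that an earlier pass never brought together.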
Disclaimer:
“The postings on this site are my own and don’t necessarily
represent IBM’s positions, strategies or opinions.”