Victor

Victor Mendes' Blog

Data Quality Services (RC0) - Cleanse and Train the KB

This post will show a new KB (Knowledge Base) being trained using the new feature via domain management which allows the KB to import the knowledge gained from a cleansing project. In the previous release, CTP3, the KB was trained using Knowledge Discovery from either a table or excel file. This could also have been done interactively via domain management, but not directly from a cleansed project.

In this example we will retrain the new KB by importing knowledge learnt from two cleansed samples of data. The outputs of the cleansed projects will be imported into the KB. Before retraining our KB we will need to tweak one of the cleansing configuration settings. This will retrain our new KB faster.

New Knowledge Base

If you have not created a KB before a step by step guide can be found here.

The new KB was called UK Counties. A single domain value was used and it was called UK County (Full Name).

Using the Knowledge Discovery feature we first trained our empty KB with a unique list of UK Counties. The Domain Management feature could also have been used by entering domain values or importing them from an excel file.

Training the KB doesn't stop there, further training is required to expand the KB knowledge. Interactively we could add new domain values or synonyms to the already existing values such as Bucks for Buckinghamshire if we knew this was a potential synonym. For this example we will be using the knowledge gained from our cleansing project to retrain our KB.

Below is a screen shot of some of the domain values prior to retraining and being published to the DQS server.

A90_NewKB

Configuring Cleansing Projects

During cleansing a computer-assisted process will analyse how the source data conforms to the knowledge in the KB and a confidence level is determined. There are two confidence level thresholds: suggestions and auto corrections. Suggestions can be approved or rejected. Auto corrections are already approved, but it is possible to reject them as well.

These thresholds can be set in the general settings for a cleansing project from the DQS configuration area.

By default 0.6 (60%) and 0.8 (80%) are set respectively for suggestions and auto corrections.

As the KB is still young we need to retrain it with some samples of our real data and therefore make more suggestions. Hence we will reduce the confidence level percentage for suggestions down to 0.4 (40%).

B02_Config

N.B. This can go the other way where too many suggestions are given or a source value could be mapped to a completely incorrect domain value. As the knowledge of KB increases the confidence level for suggestions should also be increased to isolate new values and minimise interactive cleaning.

Training the KB

There are a number of ways to train the KB

  1. Knowledge Discovery,  or
  2. Domain Management by either
    1. Interactively entering or editing domain values,
    2. Importing domain values from an excel file, or
    3. Using the completed output of a cleansing project.

The final option is the new feature added to RC0 which makes managing knowledge within the KB simpler.

Let’s look at our first sample of data.

B03_Config

You will notice a number of data quality issues such as postcode and county, town and county or a short county name. Our objective is to clean this data to the County Full Name domain value in the KB.

If you have not created a data quality project before a step by step guide can be found here.

We will skip to the end of our cleansing project and analyse the results.

On successful completion we can see from the results that 0 records were corrected and 7 records have suggested domain values. We can now perform an interactive cleanse by which we approve or reject the suggestions made. Where the data confidence level has not been met for suggestions of 40% or auto corrections of 80% it will be considered as a new value. For this particular example we should not be expecting new values as we have a definitive list of counties and therefore we will need to add corrected values manually.

Let’s start with the suggested values. All are correct and therefore only need to be approved. We approve all values by clicking the check box under the Approve column or we can click the Approve All icon C08_DQP. All suggested values will move into the Corrected tab area.

C05_DQP

Now let’s review the new tab below. Three new values in the source data have been found, the analysis could not confidently associate any of the three values with a confidence level greater than or equal to 40% and therefore has presented them as new values.

C06_DQP

Looking at the values we know the county is Middlesex as is in the KB, hence we manually enter this value into the Correct to column. Now we can approve each value or we can click the Approve All icon.

C07_DQP

All values have now moved to the corrected tab.

C09_DQP

Our data cleanse results are now ready for final review before finishing the project. Finishing a project no longer allows the project to be re-run for cleansing purposes as the knowledge gained would be lost, however you will be able to interactively clean to alter the final results. It’s these results and knowledge that is imported back into the KB, but the project must be marked as finished.

It is the CountyName_Source that we want to feed back to the KB as synonyms for the domain values found or modified in the CountyName_Output.

C10_DQP

To feed this new acquired knowledge we open our KB via Domain Management and can now use the new feature Import project values.

D01_KBDM

We select the finished cleansing project and uncheck Add values from New Tab.

D02_KBDM

Only the new knowledge from the project has been added to the KB. The Show Only New check box is checked automatically.

D03_KBDM

If we uncheck the Show Only New check box and scroll down to London and Middlesex you will see the synonyms for these counties have been added to the KB.

D04_KBDM

We now publish the KB to the DQS server and are ready to analyse another data sample.

E01_DQP

As above we create a new cleansing project and analyse the data. The newly found knowledge from the previous cleansing project has corrected five values with 4 values suggested, leaving one value as new.

If we view the reason column for the corrected tab, four of the values were associated with the synonyms of domain values within the KB from the knowledge learned. 'London.' was cleansed and had a confidence greater than or equal to 80% and therefore auto corrected.

E03_DQP

In the suggested tab the correct to values are all correct and only need approving.

E04_DQP

The new value failed the confidence level of 40%, however it could have been found and a suggestion made if the threshold was reduced further. This is not a problem as the correct value was entered and approved.

E06_DQP

Finally we review our results and finish the project.

E07_DQP

We now feed this new knowledge back into the KB via Domain Management.

E08_DQP

The new synonyms will now be added to the existing KB domain values.

E09_DQP

This has been a simple example on how to train and retrain a new KB. By having a definitive list of counties the need to analyse our data prior to creating the KB is somewhat reduced.

When building a KB it is important to acquire knowledge from samples of source data using any of the features available. From this example the output of a cleansing project provides a simpler method where the confidence level for suggested values is much lower. Once the KB contains sufficient knowledge for cleansing the confidence level can be increased and larger sets of data can be cleansed.

DQS is a great tool for the power user to maintain content with little input from IT. In posts to follow I will also review Matching projects and the SSIS DQS component.