Jeremy Kashel

Jeremy Kashel's Blog

Data Quality Services Leading Values

The Leading Values checkbox is one of the options available when you setup a domain within DQS Domain Management. I’ve run a couple of courses and demos recently on DQS and one of the questions that cropped up was what does the Leading Values checkbox do? This blog post will explain the Leading Values option by showing a few examples.

Domain Management

In this case we’re going to setup a domain called Country, which will be fairly simple for the purposes of this blog post. All of the defaults have been left, aside from the Use Leading Values, which is unchecked.


Knowledge Discovery

The next step is to add some knowledge to the Knowledge Base (KB), which we’ll do via knowledge discovery. To assist in this process I have a simple Excel file that contains a list of countries – this needs to be fed into the KB:


The first step within knowledge discovery is to pick a data source. In this case the data source is Excel and the Source Column of Name has been mapped to the domain of Country:


Once we get to the part of managing the Domain Values, then here is where we can have a look at the data in order to spot valid and invalid values. Towards the bottom of the list, we can see that there are values for both USA and for United States of America:


We can set these as synonyms, as shown above. If we go back into Domain Management afterwards, then we can see that USA is a synonym of United States of America:


Data Quality Projects

If we move on and create a data quality project to clean some data, then we’ll see these values in action. The new source file has some new countries, but also has both USA and United States of America. If we run the new source file through the Data Quality Project then we’ll see the following output:


As we’ve turned off “Use Leading Values”, USA has been deemed a correct “Domain Value” as its a synonym of United States of America.

To contrast this – lets take a look at what would have happened if we left the default setting – which is “Use Leading Values” checked. Therefore with the exact same setup, but with “Use Leading Values” checked, then we will get the following output:


This time USA is now on the Corrected tab. Although its value is “correct” as far as the domain is concerned, because we’ve specified to Use Leading Values, then DQS has altered USA to its leading value. The reason for the correction is very clear – “Corrected to leading value”.

I’ve personally not had a need to alter the setting, as I’ve always wanted the leading value to be the primary output. But its good that DQS does give an element of flexibility in its setup to suit different requirements.

DQS Reference Data Services with SSIS

Rather than just using the the local domain values in its internal knowledge base, Data Quality Services (DQS) can instead clean data by using external reference data. This is a feature called Reference Data Services and in this blog post I’m going to show an example of cleaning SSIS data by connecting to the Azure Data Market.

Azure Data Market

After you’ve signed up for Azure Data Market, you’ll need to subscribe to datasets that are compatible with DQS. Unfortunately they generally cost money, but I have found one (namely DMTI Spatial) that allows you to sign up for a free trial. I do think this is a good idea and I’d like to see free trials offered on Data Market by other providers. I don’t happen to have a huge data quality problem with Canadian addresses, but it does allow me to show how Reference Data Services works with SSIS.

Domain Management

Once we’ve signed up for some data, the next step is to build the knowledge base in the Data Quality Client. In my case, I’m going to setup a simple set of address domains based on addresses:


For a detailed guide on how to setup domains, see the following article.

Reference Data Services

In order to create the link to the external data, settings need to be changed within Domain Management. Having selected the composite address domain, the first step is to click on the Reference Data tab, then click on the browse button, which will cause the following window to appear:


As there is only one dataset available, I’m going to pick it and then map each of the domains to the schema given by the reference dataset:


After clicking ok, there are some optional settings to adjust in order to influence auto correction, the number of candidates and the minimum confidence:



Now its time to look at how the combination of DQS and the external data can be used within SSIS. I’ve put together just a simple CSV containing a few commercial Canadian addresses to clean, with a few spelling mistakes etc:


Within SSIS, the first step is to create a data flow that pulls data from the CSV. Then, within the data flow, we need to connect the the CSV source to the DQS Cleansing Transform:


The key part within SSIS is to edit the DQS Cleansing transform, as we need to first select a knowledge base, then match the incoming columns to the DQS domains:


Interestingly on the advanced tab of the transform there are checkboxes to return additional columns from the reference data provider, which is a provider-dependant feature and is not actually available within the DQ client:


Now we’re ready to go. When running the SSIS package with a data viewer, the following results are output:


Most of our addresses have been looked up and corrected, e.g. “4370 Lorimer” to “4370 Lorimer Road” and “Gerard St” to “Gerrard St”. The confidence of the cleaning is included in the output, as well as the aforementioned ‘extra’ appended data. In this case, as we’re using the DMTI provider, the type of property and latitude/longitude are amongst the columns returned.

As a final point, I noticed that the initial results from the external provider were changing my address values of “Road” and “Street” to “Rd” and “St”. As you can see from my Excel screenshot earlier, my input addresses were a mix of “Rd” and “Road”, but I wanted to standardise on “Road” and “Street”. This is exactly what Term Based Relations are for in DQS, but i didn’t expect them to work with external reference data as the data returned by the external data is in theory correct. Surprisingly, they do work with Reference Data Services, meaning its possible to gain a bit of extra control over the output.

Data Quality Services - Kimball ETL Screening

Although it’s arguably not the most popular in the Kimball series, The Data Warehouse ETL Toolkit contains a huge amount of vital information that’s focussed on the ETL side of data warehousing. In particular I’ve found the data quality screens design to be useful, so I thought it might be worth exploring if this concept can be applied to Data Quality Services.

What is a Kimball Data Quality Screen?

A screen is simply a check that is made against data that comes into the ETL pipeline and is used to measure data quality. Physically, the Kimball screen design is typically implemented as a table called Screen, which contains, as rows, the actual checks to perform against staged data. In addition, the screen table will contain metadata to aid data quality measurement, such as screen severity, category, source system and possibly exception actions.

Each time that a screen yields some results (i.e. data quality issues) then another table, Error Event Fact, is populated with the results. Therefore, by using the data generated by the screening process, The Error Event Fact becomes the place to look for an overall measurement of data quality. The schema for this design from the Kimball book (with a few modifications) is as follows:


The check performed by each screen is typically a snippet of SQL that checks for a particular condition, e.g. Is the customer’s postcode missing, or is the tax amount within permitted ranges?

Can Data Quality Services assist with ETL screening?

DQS is capable of auditing and checking data, as each domain within a knowledge base can have its own domain rules. In order to see how this might work as Kimball screens, I’ve set up a simple knowledge base around purchase orders. Here’s the domain management window from the DQS client:


Within the Delivery Charge domain, I’ve setup a domain rule called ‘Delivery Charge Negative’ in order to ensure that the delivery charge cannot be negative:


I’ve done something similar with the PO Date domain, setting up a rule to state that the PO date must be greater than 01/01/2012 in this case. Data run against the PO knowledge base will now be enforced against these two rules, so these are effectively my two Kimball style screens. To recap, DQS cleaning can be run either from the Data Quality Client or from SSIS.

Populating the Error Event Fact with SSIS

I’m going to be running my DQS domain rules via SSIS, as my end goal is to populate the Error Event Fact. Remember, if any DQ issues occur, I want the Error Event Fact to know about them.

In this scenario, the data from the source system has already been staged to a SQL table, which gives the following results in the SSIS source adaptor preview:


This data is passed to the DQS Cleansing SSIS task, with the data mapped to the DQS Knowledge Base in the following way:


Once the package is run, the data will be passed to DQS, which will run the data against its knowledge base, including running the domain rules. The data that is returned from DQS is as shown below. Note, PO55 row has failed both rules, whereas the other two rows have failed one row each.


Now we know exactly which rows have failed each data quality screen. The only trouble is that the data is in the wrong format for our error event fact. We want a total of 4 rows in the error event fact has PO Id 5 has failed 2 screens, whereas the other two rows have failed 1 screen. Therefore, we carry out a unpivot before inserting into the Error Event Fact. For this blog post my ScreenId is hard coded, but in reality you would have to lookup an appropriate ID for the DQS screen. The full SSIS package is as follows:


Data Quality Reporting

With the Error Event Fact populated, a variety of reports can be produced. In this case I’ve made a very simple Analysis Services cube and made a quick report in Excel 2013, using slicers, sparklines and also the new Timeline feature:


From this report I can see the screen that’s failed the most rows, see a trend of data quality and filter on specific issues.


With a bit of work, DQS is capable of producing the rows needed for the Kimball Error Event Fact, but how good is it? I see the advantages of using DQS as follows:

  • Front end user interface in order to allow data stewards to set up their own rules to screen the data – much better than holding rules in SSIS or manually in a table;
  • DQS becomes a one stop shop for all data quality, as, in addition to holding the screens as domain rules, DQS will clean your data for you, according to the valid/invalid domain values that you set up in the knowledge base.

There are a few disadvantages to it also:

  • The domain rules are not as flexible as pure SQL or SSIS, meaning that you may not be able to express all the screening rules that you had in mind;
  • DQS can't maintain all the metadata that you need to implement Kimball screening, so you’ll need to supplement it with some SQL tables to hold things like screen severity, screen type etc;
  • Even though the performance has been improved in CUI, it’s still not suitable for large amounts of data. Here’s an overview of how long you can expect the SSIS cleansing to take.

So my view is that if you’re already using DQS and data stewards like it using it to manage data quality, and your data volumes are not too big, then implement the classic Kimball screening design using DQS. But there are more cons than pros unfortunately, so if you’ve not already made an investment in DQS, and you have the time to invest in your own Kimball style screening solution, then I think you can come up with something that’s a bit more flexible and scalable.

Matching with Master Data Services, DQS and SSIS

If you have an MDM requirement to consolidate multiple sources of Master Data together into a single golden record, then you have a few different ways to achieve this on the Microsoft platform. This blog post gives an overview of the different ways that various matching methods can be used in conjunction with Master Data Services for consolidation in Master Data Management, outlining the pros and cons of each option. In summary, the options are:

  • Data Quality Services (DQS)
  • SQL Server Integration Services (SSIS)
  • Plus Master Data Services itself has a few (not that well known) matching capabilities

Data Quality Services

A key requirement in all but the simplest MDM solutions is that the matching/consolidation must be carried out in an automated manner, with a Data Steward alerted to take action if needed (e.g. the matching engine incorrectly matches two customers, the Data Steward opens MDS and corrects this). This scenario would be hard to achieve with DQS, as it’s possible to automate the DQS cleaning, but it’s not possible to automate the DQS matching. This is something that I’ve raised connect issues about, here and here.

If your data to be matched into MDS is not coming in on a daily basis, and you therefore need to do more ad-hoc matching to produce your golden Master Data records, then DQS could be for you. The MDS Excel Add-in will give you the capability of matching data that you import into Excel with MDS members, harnessing the power of DQS. An overview of how this works is available here.

Integration Services

SSIS has been around for a long time now and, as many of you will know, contains fuzzy matching components. With the right MDS model design, its possible to carry out a batch based fuzzy match between your master records and end up with a mapping between your the records that exist in your source systems and your MDS golden records. The rough conceptual steps to do this are:

  1. Load the new and changed records from the source systems into a staging area.
  2. Clean and standardise your data. This is actually something that DQS cleaning can help with.
  3. Query your staging area to get the new records that you want to insert/update into Master Data Services.
  4. Now the question arises, do we have an exact or close match for these records already in MDS? While the exact matches are easy to deal with, use the SSIS Fuzzy Lookup component to establish whether there are any approximate matches.
  5. Link the source records to master records (if the match is high enough) using MDS Domain Attributes.
  6. Carry out appropriate inserts and updates into MDS using the MDS staging tables.
  7. Ensure that a Data Steward is alerted in some way if necessary (e.g. if the match threshold is below x% confidence). This can be done with Email or MDS Notifications, for example.

This process can run in batch overnight, with the Data Steward approving or rejecting the matches that SSIS has carried out the following morning. Whilst the above over-simplifies the process and technical work required, hopefully the process makes sense at a high level.

Master Data Services

Although you cannot feed MDS your source data and get it to automatically carry out matching for you, it does actually contain the raw components in order to do this. By this I mean the MDS database contains an assembly called Microsoft.MasterDataServices.DataQuality, which gives you a number of fuzzy matching T-SQL functions. These are called from the MDS front end when you carry out some filtering when viewing entities. Using them just for filtering in the front end really isn’t using the functions to their true capability, but thankfully you can use these functions in your own code.

You can use the MDS T-SQL functions in a similar way to the conceptual SSIS method outlined above, in order to match and eliminate duplicates. In addition, the MDS web API can also be used to carry out a fuzzy match, as mentioned in this forum post. Retrieving match candidates using a web service may be an attractive option if you’re trying to do real time MDM.


Essentially until it’s possible to automate DQS matching, we have a choice between SSIS and the MDS matching functions. The following e-book gives a very detailed overview of the matches that both are capable of doing. The MDS T-SQL functions are more flexible than the SSIS fuzzy components as you can choose what fuzzy algorithm you want to use, but the SSIS components let you choose between Fuzzy Grouping and Fuzzy Lookup out of the box, without having to write SQL. Although I tend find that both give very good matching results, the MDS T-SQL functions produce slightly better matches in my experience, plus give you the option of trying different algorithms to suit different scenarios.

It’s also worth mentioning that Profisee Maestro (full disclosure, we are a partner) integrates with MDS, offering its own matching algorithms. Maestro also has a front end with the ability to assist with survivorship and approval, which I think is a useful addition to MDS. Speaking of survivorship and approval, there are two options in MDS out-of-the box. The new Master Data Manager web front-end is much improved, but potentially the MDS Excel Add-In allows a bit more flexibility for survivorship carried out by a Data Steward, just due to its natural ability for filling/copying/pasting/filtering.

So overall, due to the various improvements, Master Data Services is now capable of tackling more complex MDM scenarios than in the 2008 R2 version.

DQS Matching Vs SSIS Fuzzy Grouping and Lookup

When the term fuzzy matching comes up in the Microsoft world, it’s natural for anyone who’s used the BI stack to think of SQL Server Integration Services (SSIS), due to the fact that it has both Fuzzy Grouping and Fuzzy Matching components. Therefore, when using the matching in Data Quality Services (DQS) the other day, I thought it might be worth contrasting the matching capabilities in the two products.


The SSIS Fuzzy Grouping transformation is typically used on a set of data containing duplicates, as it will operate across all of the rows, grouping similar rows together based on fuzzy comparisons, with a view to eliminating duplicates. In contrast, the Fuzzy Lookup transformation takes a value in the SSIS pipeline and uses fuzzy matching to match the input value against a set of clean reference data in a database.

The matching in DQS takes a set of data and groups the data into clusters, based on a number of rules and other factors, again with a view to removing duplicates. It’s therefore only directly comparable against the SSIS Fuzzy Grouping, rather than the Fuzzy Lookup.

Test Scenario

For the purposes of this blog post, I’ve got a sample query from AdventureWorksDW, taking data from the DimCustomer table. I’ve unioned all that data with a few sample records that contain typos. E.g. I’ve got 'Pamela Fernndez' misspelt for one customer and '4610 Pinto Rd' instead of ‘Road’ for another. I’m going to attempt to de-duplicate data in SSIS first and then do the same thing in DQS to see if there is any difference.

SSIS Fuzzy Grouping

As shown below, I’ve just got a very simple OLE DB source passing some data into a Fuzzy Grouping, then I’m going to examine the data via a data viewer:

Control flow

The columns that I’m passing into the Fuzzy Grouping for the Customers are Full Name, Address Line 1, Gender and City just for the purposes of this blog post. I’ve configured the Fuzzy Grouping transform to carry out Fuzzy Matching on Address Line 1 and Full Name, as these are the most inconsistent (intentionally) across my made up data. I’ve also dropped the threshold right down to 0.40 to see a full range of matches that we may or may not get:


I’ve filtered the data for customers with a surname of Fernandez or a first name of Pamela, which includes two made up records. When I run the package, the data viewer produces the following results:

Data Viewer

The _score column has a high score of 0.8245935, which is based on the similarity scores of FullName and AddressLine1. There are 3 records included in the cluster, including someone called ‘P Fernandez’ who is male. Although that’s been matched, it’s only because I decreased the threshold, plus it can easily be excluded by adding an exact match for Gender, as shown below:


DQS Matching

So overall SSIS copes well with the test scenario, how does DQS compare?

After creating a knowledge base, creating domains and carrying out knowledge discovery, it’s possible to use DQS to create a Matching Policy. The Matching Policy involves building a number of rules that determine how the data matches. The starting point in creating a matching rule is determining which domains you want to match on and whether they should be matched using the fuzzy algorithms (similar) or matched exactly:

Rule editor

So not too dissimilar to SSIS. You then choose to give each domain a percentage weight, which must add up to 100%. For example, if you give Full Name a weight 30% and, when run, its match score is 61% accuracy, then you have Address Line 1 at a 70% weight and a 45% accuracy, then an overall match score for a row will be (0.3 * 0.61) + (0.7 * 0.45) = 49.8% match score.

Starting off with a 50% weight on the two domains, I get the following results when operating on the same data:

dqs clusters

As with SSIS, it hasn’t got it quite right, but using the Gender domain will help. What I quite like in DQS is that I can flag a domain as being a pre-requisite:


Flagging a domain as a pre-requisite means that the domain will be used to eliminate matches, but will not contribute to the matching score. If I wanted it to contribute to the matching score I would just have to uncheck pre-requisite and give it a weight. Running the matching again gives the following results:

dqs results 2

So the duplicate is correctly mapped with a score of 75%. Drilling down on the match was achieved is quite nice in DQS also:


This brings me to an interesting observation about the matching. I originally had my dummy record with a value of “3 Pierr Demoulainn” for the address. So in addition to having a typo in the second name and missing the word “Rue”, my other test also missed the second e out of the word “Pierre”. If I run my test data with these values, I get no matches at all. The lack of the second e in “Pierr” seems to have made all the difference. I can get it to match if I give a lot of weighting to Full Name. If I do this, we see the following results:

drill down 2

Address Line 1 has been given a score of 0%. Making the same change to my test data and then running in SSIS gives different results. The similarity for the address drops from 0.81 to 0.75, but it certainly doesn’t drop to zero. Although it will depend massively on your own data, the matching in DQS seems a bit more stringent. This can be mitigated by using more domains for the matching, rather than the simple 2 or 3 domains that I’ve used. To back that statement up, using a composite domain on both Address Line 1 and City does yield a strong match (71%) for the same data.


The rule based matching in DQS gives a lot of flexibility on how to weight matching across different domains. Coupled with the knowledge base components that interact with the matching, such as cleaning and Term-Based Relations, DQS has more features specifically aimed at matching than SSIS. It all depends on what sort of data you’re working with, but in theory, as you maintain the knowledge base over time, DQS should give you strong matches based on the knowledge you’ve given it.

However, there are some drawbacks – it’s a shame that the matching can’t be automated (please vote for my connect suggestion if you agree). SSIS by its very nature can be automated, meaning the matching in SSIS will be more suitable for a lot of implementations. Secondly, I’d like to be able to match external records against correct values already in the knowledge base, rather than just doing a fuzzy grouping on the external records. I’m a bit surprised DQS can’t do a fuzzy lookup, maybe this will change in V2. Finally, as I’ve shown above, some of the matching results are a little unexpected, whereas in SSIS they were good pretty much straight away. To be fair to DQS this could probably be mitigated with a better choice of matching fields and also running cleaning against a fully populated knowledge base beforehand.

Data Quality Services

As I’m sure you’ve heard, CTP3 of SQL Server Denali was released yesterday, and can be downloaded here. Denali includes Data Quality Services (DQS), Microsoft’s new data cleansing and matching component that’s based on the Zoomix acquisition that occurred a couple of years back. Data Quality Services didn’t make it into the first CTP, but is now available, so I though it would be worth a quick blog post.

Installing Data Quality Services

Data Quality Services is an option in the main install, but when you go to run the Data Quality Client, you’ll get a message stating that DQS is not installed. As far as I can tell, DQS needs to be installed manually, by running the DQSInstaller.exe, which you can find in the SQL Server Binn directory. This will create two SQL Server databases:


Data Quality Client

Once DQS is configured, you’ll be in a position to use the Data Quality Client, which is a windows application, available in 32 or 64 bit. Once you connect, you’ll get the following screen:


The knowledge base is the key to how DQS works, being utilised to determine the data quality of your source data. You can create your own knowledge base, using your own data, or even cloud-based data. For this example, I’m going to use the built in knowledge base called DQS Data.

As an example, I’ve created some data that I want to cleanse. It’s adventure works country data that I’ve put into a separate table that I’ve called dbo.Country.

My task now is is to clean this data, which I can do by creating a data quality project:


I’ve called my project Countries, and I’ve picked the built-in DQS Data knowledge base, which I noticed contains reference data for countries. The activity that I’ve selected is Cleansing, and then I’ve clicked Create.

DQS will then prompt for mapping the source data to one of the domains in the Knowledge Base. Here I’ve mapping my country name to the Country domain from the knowledge base:


After clicking the Start button on the next screen, the cleaning process starts, which gives the following results:


I put a couple of typos into my source data, such as changing Algeria to ‘Algerian’ and Albania to ‘Albana’. These were picked up by DQS, along with a few others, but a user has the opportunity to approve or reject, via the radio buttons shown above. I chose to approve the first two, and then clicked next. In the final screen, DQS allows you to output the cleaned data to a separate table, as shown below:


If you go into the table that DQS has created, you will see that there is a Name_Status column, which holds the cleaning status on a per record basis:


This can be used to update the source data, if required, and therefore address the data quality issues.


This has been a quick introduction to cleaning data with Data Quality Services. There’s plenty more to look at, and I hope to find time to do so at some point - In particular I’m keen to take a look at how DQS works with Master Data Services. I’ve noticed that there’s also a new MS DQS blog ( – I’m sure that will be a great resource for DQS learning.