Jeremy Kashel's Blog

Debug MDX with Generate

On occasion you may find that your MDX queries don’t give you the results that you expect, sometimes because a particular piece of MDX has changed the context of part of your query.

This can sometimes be tough to figure out, especially with functions like Sum() or Filter(), where we loop over a set and only return the end product. For example, take the following query, where I’m trying to find out the count of UK male customers who have bought Road Bikes in both the current and the previous year:

WITH MEMBER [Measures].[Repeat Customer Count] AS
    Sum([Customer].[Customer].[Customer].Members,
    IIF([Measures].[Internet Sales Amount] <> 0 AND
    ([Measures].[Internet Sales Amount], [Date].[Calendar Year].PrevMember) <> 0, 1, NULL))

SELECT      {[Measures].[Repeat Customer Count]} ON 0,
            [Date].[Calendar Year].[Calendar Year] ON 1
FROM        [Adventure Works]
            --Road Bikes, UK, Male
WHERE       ([Product].[Subcategory].&[2], [Sales Territory].[Sales Territory Country].&[United Kingdom],
            [Customer].[Gender].&[M])

It returns the following result set, which I know to be incorrect:


Unlike debugging either of the axes, where of course we can see what gets returned, debugging examples such as the above might be difficult. The MDX Generate function can help here, as it works in a similar way to Sum(), by iterating over a set and then evaluating an argument once per iteration. The idea is that Generate will build us a new set, by using the looping context of the first set that you pass it. In this case, however, we can use it to return us a concatenated string based on our input set.
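
To make the looping context concrete, here’s a hypothetical Python analogue of what Generate is doing (the data, names and helper function are made up for illustration; this is not how MDX works internally):

```python
# Hypothetical Python analogue of MDX Generate: iterate over a set,
# evaluate a string expression once per member, and concatenate the results.
def generate(member_set, string_expression):
    parts = []
    for member in member_set:
        result = string_expression(member)
        if result:  # IIF(..., string, NULL): NULL evaluations contribute nothing
            parts.append(result)
    return "".join(parts)

# Toy tuples standing in for (customer, gender, sales, prior-year sales).
customers = [
    ("Jon Yang", "M", 100, 50),
    ("Eugene Huang", "M", 80, 0),    # no prior-year sales: not a repeat customer
    ("Ruben Torres", "F", 60, 30),
]

debug_string = generate(
    customers,
    lambda c: f" | {c[0]}, {c[1]}" if c[2] != 0 and c[3] != 0 else None,
)
print(debug_string)  # -> " | Jon Yang, M | Ruben Torres, F"
```

The point is that the expression is evaluated once per member of the first set, in that member’s context, which is exactly what makes Generate useful for printing out debug strings.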

What I’ve done here is to use the CurrentMember of attributes that I think might be causing problems (Customer, Gender) and then just print them out as a string in order to see if the attributes have the context that I expect. Therefore I’ve replaced the ‘1’ in my original query with the string of:

" | " + [Customer].[Customer].CurrentMember.MEMBER_CAPTION + ", " 
    + [Customer].[Gender].CurrentMember.MEMBER_CAPTION

The full query and results then become:

WITH MEMBER [Measures].[Repeat Customer Count] AS
    Generate([Customer].[Customer].[Customer].Members,
        IIF([Measures].[Internet Sales Amount] <> 0 AND
        ([Measures].[Internet Sales Amount], [Date].[Calendar Year].PrevMember) <> 0,
        " | " + [Customer].[Customer].CurrentMember.MEMBER_CAPTION + ", "
        + [Customer].[Gender].CurrentMember.MEMBER_CAPTION, NULL))

SELECT      {[Measures].[Repeat Customer Count]} ON 0,
            [Date].[Calendar Year].[Calendar Year] ON 1
FROM        [Adventure Works]
            --Road Bikes, UK, Male
WHERE       ([Product].[Subcategory].&[2], [Sales Territory].[Sales Territory Country].&[United Kingdom],
            [Customer].[Gender].&[M])

I can spot straight away that female customers have been returned in 2007, which is due to the fact that sets specified inside the WITH MEMBER section are not evaluated in the context of the slicer. In this case the problem can be solved by using the EXISTING keyword inside the Sum, so that the customers get sliced by the gender attribute:

WITH MEMBER [Measures].[Repeat Customer Count] AS
    Sum(EXISTING [Customer].[Customer].[Customer].Members,
    IIF([Measures].[Internet Sales Amount] <> 0 AND
    ([Measures].[Internet Sales Amount], [Date].[Calendar Year].PrevMember) <> 0, 1, NULL))

This won’t be applicable in all MDX debugging scenarios, but hopefully it will help someone out.

Excel Fuzzy Lookup Add-In and SSIS

Although it’s been out for a while, I thought I’d mention, for those who don’t know, that you can download a Fuzzy Lookup Add-In for Microsoft Excel.

The Add-In allows you to perform a fuzzy lookup for Excel data against another Excel dataset. I won’t go into too much detail, as this has already been covered here, but essentially the Add-In requires you to set up two datasets, namely a “left” and “right” table. Once you carry out some configuration/mappings, the fuzzy match will return all of the rows in your left table, with the appropriate values looked up from your right table. As an example I’ve chosen to do my lookup based on reseller names and an address:


Part of the reason for this post is that I think this add-in could be useful from a self-service perspective for power users to assist in MDM/Matching/DW projects. There’s often a lot of trial and error in trying to get the right balance for matching rules, and, given that the Fuzzy Add-in gives similar results to SSIS, it may well be a good place to start with matching a dataset as part of an MDM project, for example. Alternatively, it could be used to carry out pre-matching/clean-up on data before an MDM project starts.


I also wanted to contrast this Add-in to the Fuzzy Lookup transformation that you get in SSIS. Both the Fuzzy Lookup Add-In and the SSIS variant do the same thing – they both match one dataset against another in a fuzzy manner, returning you 1-n matches per row. In fact, according to this paper, they use the same Microsoft Fuzzy algorithm.

With the default settings, at least with my small test datasets, I get similar results in Excel and SSIS, albeit with different similarity numbers. These are the key differences that I’ve noticed:

  • Excel returns only overall row similarity, whereas SSIS returns similarity per column also.
  • Excel exposes an additional set of tuning options. As an example, you can set the Containment Bias, whereby you decide on the penalty for tokens in the right record that are not in the left record. You can also edit a number of other settings in an XML file.
  • SSIS outputs both the confidence and similarity, whereas Excel outputs only the similarity.
  • Excel will output the similarity XML per row, in which you can see how it has split up a string into differently weighted tokens.
  • As you might expect, SSIS outperforms Excel.

As a quick test, I started working with Adventure Works resellers to see if I could get the same match results from both products. To highlight this, I’ve filtered the data down to just one made-up reseller (“Bikes World Comp. Store”) to see what top 5 resellers each product returns. SSIS, with its default settings, gives the following when matching just on the reseller name:


Excel, using the Fuzzy Add-In, gives the following with its default settings:


So overall some very good matches out of the box from the Excel Add-In, certainly comparable to SSIS. What’s interesting about this result set is that “Bike World” is ranked higher in Excel than “World Bike Discount Store”. We can align the behaviour of the two products, in this case anyway, by changing the Containment Bias from a default of 0.8 to 0. This will assign a full penalty to tokens in the right record that are not in the left record. So, as the left record is “Bikes World Comp. Store”, and the right record is “Bike World”, we are by default being lenient on the fact that it’s missing tokens. Increasing the penalty for missing tokens gives the following, with the top record now matching SSIS:
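
The effect of a containment bias can be illustrated with a toy score. To be clear, this is not Microsoft’s actual fuzzy algorithm (which is far more sophisticated, with token weighting and transformations); the formula, tokeniser and numbers below are made up purely to show how the ranking can flip:

```python
def similarity(left, right, containment_bias=0.8):
    """Toy token-containment score, NOT the real Microsoft algorithm.
    A bias near 1 is lenient when the right record is missing tokens
    that the left record has; a bias of 0 applies the full penalty."""
    tokenise = lambda s: {t.rstrip("s") for t in s.lower().replace(".", "").split()}
    lt, rt = tokenise(left), tokenise(right)
    matched = len(lt & rt)
    unmatched_left = len(lt - rt)    # left tokens the right record is missing
    unmatched_right = len(rt - lt)   # extra right tokens, always penalised here
    return matched / (matched + unmatched_right
                      + (1 - containment_bias) * unmatched_left)

left = "Bikes World Comp. Store"
for bias in (0.8, 0.0):
    scores = {r: round(similarity(left, r, bias), 3)
              for r in ("Bike World", "World Bike Discount Store")}
    print(bias, scores)
# With bias 0.8 "Bike World" ranks first; with bias 0 the longer,
# fuller match "World Bike Discount Store" overtakes it.
```

The shape of the behaviour matches what the screenshots showed: being lenient about missing tokens favours the short candidate, while a full penalty favours the candidate that covers more of the input.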


Of course, you would normally match on more than just the name, plus what’s worked for me here might not work in other scenarios. But the bottom line is that the Excel Fuzzy Lookup Add-In gives very good matches out-of-the-box, is easy to use for the Excel user and therefore could well be of some use as part of data matching projects.

Debug Custom MDS Workflows

The custom MDS workflows, AKA External Action Business Rules or custom MDS assemblies, are one of the more advanced features of MDS. They allow you to extend the standard MDS business rules by getting MDS to fire entity member data at an external assembly of your choice.

There are a lot of moving parts to this feature: MDS business rules, the custom .NET assembly, Service Broker and a Windows service! When a business rule meets your condition, the rule is written to a Service Broker queue. The Windows service interrogates the queue and then passes the information to the custom assembly. I’m not going to show how to do this, as it’s already covered by Johan Machielse and also on MSDN. What I would like to show is how to debug the custom business rules, just due to the number of moving parts and also a couple of quirks that I’ve encountered when using them.

Custom Assembly Deployment

One of the first things that may catch you out is the deployment of the assembly. To troubleshoot this, I recommend the “-console” argument of the workflow service. If you run it in command line mode you may encounter a few errors:

“Could not load file or assembly 'YourAssembly' or one of its dependencies. This assembly is built by a runtime newer than
the currently loaded runtime and cannot be loaded.”

This means that you need to go to the properties of your assembly in Visual Studio and change the Target Framework from .NET Framework 4 to 3.5.

Another error that you may get is “ERROR: Could not create type. Ignored.” Assuming that you have actually put the assembly in the bin directory, this is most likely because either the namespace or the class name in your DLL doesn’t match what you have in your config file. Check that the contents of Microsoft.MasterDataServices.Workflow.exe.config match your DLL.

You may also get, “ERROR: Type does not implement IWorkflowTypeExtender.” This means that your assembly has been picked up correctly, but it does not implement IWorkflowTypeExtender.

If the assembly does start up ok without errors, you may still get errors when the exe tries to take a message off the service broker queue. An error such as “Could not dispatch due to missing extender <tag>” means that the tag that you’ve given to the assembly in the config file does not match the “Workflow Type” value of the MDS business rule:

<setting name="WorkflowTypeExtenders" serializeAs="String">
  <value>MDSTEST=MDS.CustomWorkflow, CustomWorkflow</value>
</setting>

Debugging the Custom Assembly

Once you do get the assembly working, you may find that the code isn’t quite doing what you want. A good way to troubleshoot this is to choose “Attach To Process” from the Debug menu in Visual Studio, as shown below:


If you do this, then you can add breakpoints to your .Net code as normal and step through in order to troubleshoot issues.

To conclude, this probably isn’t an exhaustive list of all the sort of errors that could occur with custom workflows, but with any luck it will save someone a few hours of troubleshooting.

Data Quality Services - Kimball ETL Screening

Although it’s arguably not the most popular in the Kimball series, The Data Warehouse ETL Toolkit contains a huge amount of vital information that’s focussed on the ETL side of data warehousing. In particular I’ve found the data quality screens design to be useful, so I thought it might be worth exploring if this concept can be applied to Data Quality Services.

What is a Kimball Data Quality Screen?

A screen is simply a check that is made against data that comes into the ETL pipeline and is used to measure data quality. Physically, the Kimball screen design is typically implemented as a table called Screen, which contains, as rows, the actual checks to perform against staged data. In addition, the screen table will contain metadata to aid data quality measurement, such as screen severity, category, source system and possibly exception actions.

Each time that a screen yields some results (i.e. data quality issues), another table, the Error Event Fact, is populated with the results. Therefore, by using the data generated by the screening process, the Error Event Fact becomes the place to look for an overall measurement of data quality. The schema for this design from the Kimball book (with a few modifications) is as follows:


The check performed by each screen is typically a snippet of SQL that checks for a particular condition, e.g. Is the customer’s postcode missing, or is the tax amount within permitted ranges?
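
The screen-driven pattern can be sketched end-to-end. Here’s a minimal Python/SQLite illustration, where the table names, columns and rules are made up for the example rather than taken verbatim from the Kimball book:

```python
import sqlite3

# Kimball-style screening sketch: screens are stored as rows of SQL,
# and any rows a screen returns are logged to an Error Event Fact.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE StagedOrder (POId TEXT, PODate TEXT, DeliveryCharge REAL);
    CREATE TABLE Screen (ScreenId INTEGER, ScreenSQL TEXT, Severity TEXT);
    CREATE TABLE ErrorEventFact (ScreenId INTEGER, POId TEXT);

    INSERT INTO StagedOrder VALUES ('PO54', '2012-03-01', 5.0);
    INSERT INTO StagedOrder VALUES ('PO55', '2011-06-01', -2.5);

    INSERT INTO Screen VALUES
      (1, 'SELECT POId FROM StagedOrder WHERE DeliveryCharge < 0', 'High'),
      (2, 'SELECT POId FROM StagedOrder WHERE PODate <= ''2012-01-01''', 'Medium');
""")

# Run every screen and record each failing row as an error event.
for screen_id, screen_sql, _severity in conn.execute("SELECT * FROM Screen").fetchall():
    for (po_id,) in conn.execute(screen_sql).fetchall():
        conn.execute("INSERT INTO ErrorEventFact VALUES (?, ?)", (screen_id, po_id))

print(conn.execute("SELECT * FROM ErrorEventFact ORDER BY ScreenId").fetchall())
# -> [(1, 'PO55'), (2, 'PO55')]
```

In a real implementation the Screen table would carry the extra metadata mentioned above (severity, category, source system, exception actions), but the core loop is just this: run each screen, log each failure.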

Can Data Quality Services assist with ETL screening?

DQS is capable of auditing and checking data, as each domain within a knowledge base can have its own domain rules. In order to see how this might work as Kimball screens, I’ve set up a simple knowledge base around purchase orders. Here’s the domain management window from the DQS client:


Within the Delivery Charge domain, I’ve setup a domain rule called ‘Delivery Charge Negative’ in order to ensure that the delivery charge cannot be negative:


I’ve done something similar with the PO Date domain, setting up a rule to state that the PO date must be greater than 01/01/2012 in this case. Data run against the PO knowledge base will now be checked against these two rules, so these are effectively my two Kimball-style screens. To recap, DQS cleaning can be run either from the Data Quality Client or from SSIS.

Populating the Error Event Fact with SSIS

I’m going to be running my DQS domain rules via SSIS, as my end goal is to populate the Error Event Fact. Remember, if any DQ issues occur, I want the Error Event Fact to know about them.

In this scenario, the data from the source system has already been staged to a SQL table, which gives the following results in the SSIS source adaptor preview:


This data is passed to the DQS Cleansing SSIS task, with the data mapped to the DQS Knowledge Base in the following way:


Once the package is run, the data will be passed to DQS, which will run the data against its knowledge base, including running the domain rules. The data that is returned from DQS is as shown below. Note that the PO55 row has failed both rules, whereas the other two rows have failed one rule each.


Now we know exactly which rows have failed each data quality screen. The only trouble is that the data is in the wrong format for our Error Event Fact. We want a total of 4 rows in the Error Event Fact, as the PO55 row has failed two screens, whereas the other two rows have failed one screen each. Therefore, we carry out an Unpivot before inserting into the Error Event Fact. For this blog post my ScreenId is hard-coded, but in reality you would have to look up an appropriate ID for the DQS screen. The full SSIS package is as follows:
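
The unpivot step itself is simple enough to sketch in Python (the column names, flags and ScreenIds below are illustrative stand-ins for the DQS output, not its real column names):

```python
# The DQS output is one row per record with a pass/fail status per domain
# rule; the Error Event Fact wants one row per failure, so we unpivot.
dqs_output = [
    {"POId": "PO54", "DeliveryChargeValid": False, "PODateValid": True},
    {"POId": "PO55", "DeliveryChargeValid": False, "PODateValid": False},
    {"POId": "PO56", "DeliveryChargeValid": True,  "PODateValid": False},
]

# Hard-coded screen IDs, as in the post; in reality these would be looked up.
screen_ids = {"DeliveryChargeValid": 1, "PODateValid": 2}

error_event_fact = [
    {"ScreenId": screen_ids[rule], "POId": row["POId"]}
    for row in dqs_output
    for rule in ("DeliveryChargeValid", "PODateValid")
    if not row[rule]
]
print(len(error_event_fact))  # -> 4 rows: PO55 failed both screens
```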


Data Quality Reporting

With the Error Event Fact populated, a variety of reports can be produced. In this case I’ve made a very simple Analysis Services cube and made a quick report in Excel 2013, using slicers, sparklines and also the new Timeline feature:


From this report I can see the screen that’s failed the most rows, see a trend of data quality and filter on specific issues.


With a bit of work, DQS is capable of producing the rows needed for the Kimball Error Event Fact, but how good is it? I see the advantages of using DQS as follows:

  • Front end user interface in order to allow data stewards to set up their own rules to screen the data – much better than holding rules in SSIS or manually in a table;
  • DQS becomes a one stop shop for all data quality, as, in addition to holding the screens as domain rules, DQS will clean your data for you, according to the valid/invalid domain values that you set up in the knowledge base.

There are a few disadvantages to it also:

  • The domain rules are not as flexible as pure SQL or SSIS, meaning that you may not be able to express all the screening rules that you had in mind;
  • DQS can't maintain all the metadata that you need to implement Kimball screening, so you’ll need to supplement it with some SQL tables to hold things like screen severity, screen type etc;
  • Even though the performance has been improved in CU1, it’s still not suitable for large amounts of data. Here’s an overview of how long you can expect the SSIS cleansing to take.

So my view is that if you’re already using DQS, your data stewards like using it to manage data quality, and your data volumes are not too big, then implement the classic Kimball screening design using DQS. But there are more cons than pros unfortunately, so if you’ve not already made an investment in DQS, and you have the time to invest in your own Kimball-style screening solution, then I think you can come up with something that’s a bit more flexible and scalable.

Unique Values in Master Data Services

I’ve been doing quite a few demos and Proof of Concept (POC) projects recently for MDS. Understandably there are always a raft of questions that come up during demos and POCs, one of which recently was “how do I ensure that this attribute’s value is unique?”.

MDS can easily deal with simple uniqueness enforcement, which is something that anyone who has done any MDS development will no doubt know straight away. But if the requirement is more complex, such as ensuring that the attribute is unique in conjunction with other attributes, then I don’t think it’s as well known. Therefore I thought it would be worth a quick post on how to do this.

Code Attribute

To get the simple case out of the way, each MDS entity is always created with an attribute called Code, which must be unique. Therefore if you try and create/update a member with a duplicate Code, then you get the following error message:


Business Rules

To enforce uniqueness on any other attribute, you have to use business rules. Therefore, if I want to ensure that the attribute called Name must be unique, then I must carry out the following:

  • Go to the business rules editor and create a new rule
  • Drag the action of “must be unique” over to the actions node
  • Drag the Name attribute over to the “Select attribute” node, meaning that the action should look as follows:
  • Once that’s done, save the rule step and then re-publish the rule.

Trying to enter a duplicate value for the Name attribute will now give the following validation error:


Business Rules – Multiple Attributes

Now for the harder scenario. In the sample MDS product model, there is a Country attribute and also a Product Sub Category. I want to ensure that Name is unique, but only within a Country and Sub Category combination. To do this we go to the same business rules interface, and add the same action as before. Once that is done we carry out the following additional steps:

  • Change the radio button to “In combination with the following”
  • This is the part that I don’t think is that obvious. Drag over the Country attribute and then the Product Sub Category attribute onto the “attributes” node. This should give the following:
  • As before, save the rule step, then publish the rule.

Now a duplicate name will be permitted, but only within a country and sub category combination. If we do enter a duplicate value within a given Country/Product Sub Category group, then the following validation message is shown:


Master Data Services SQL Server 2012 Vs 2008 R2

I’ve given quite a few MDS demos recently and one question I seem to get asked often is “What features does MDS 2012 have compared to MDS 2008 R2?”. As it’s happened quite a few times, I thought it would be a good idea to put a quick post together to explain the differences. The table below details all the features MDS has and compares the two versions:


Front End

  • MDS 2012: New Silverlight UI added for the Explorer area. Better performance, with fewer clicks to get tasks done. Excel Add-in introduced.
  • MDS 2008 R2: HTML-based ASP.NET implementation. No Excel Add-in.

Modelling

  • MDS 2012: Carry out modelling in the web UI, plus some tasks can be carried out in Excel. Allows the auto-generation of a Code without using business rules.
  • MDS 2008 R2: Carry out modelling in the web UI only.

Hierarchies

  • MDS 2012: Three types of hierarchies. Manage hierarchies in the new Silverlight UI.
  • MDS 2008 R2: Three types of hierarchies. Manage hierarchies in the older UI style.

Collections

  • MDS 2012: Add members from Explicit Hierarchies or other Collections. Maintain a weighting on a member-by-member basis.
  • MDS 2008 R2: Add members from Explicit Hierarchies or other Collections.

Staging Process

  • MDS 2012: A staging table is created per entity that you create in MDS. The table mirrors the exact structure that you see when you view the entity in the front end, making loading easier.
  • MDS 2008 R2: A generic staging table is used for all entities. This means, for example, you must load attribute changes in attribute name/value pairs.

Model Deployment

  • MDS 2012: Deploy the model structure from the front end. Deploy the data from the new command line deployment tool.
  • MDS 2008 R2: Deploy model structure and data from the front end.

Security

  • MDS 2012: Grant permissions to models, entities, attributes and attribute groups. Security cleaned up and simplified (no longer necessary to grant permissions to hierarchies).
  • MDS 2008 R2: Grant permissions to models, entities, attributes, attribute groups and hierarchies.

Business Rules

  • MDS 2012: Alter attribute values, validate data integrity, initiate workflow. Some minor validation changes around using business rules to auto-generate Code values.
  • MDS 2008 R2: Alter attribute values, validate data integrity, initiate workflow.

SharePoint Integration

  • MDS 2012: Initiate a SharePoint workflow from MDS business rules. New functionality to embed the web UI inside a SharePoint page.
  • MDS 2008 R2: Initiate a SharePoint workflow from MDS business rules.

Data Quality Services (DQS) Integration

  • MDS 2012: Connect to a DQS matching policy from Excel, against MDS master data.
  • MDS 2008 R2: No DQS capability, as DQS is new in SQL Server 2012.

Version Management

  • MDS 2012: Create multiple copies of your master data and work with them independently as needed. Some minor UI improvements have been added.
  • MDS 2008 R2: As SQL 2012, create multiple copies of your master data and work with them independently as needed.

Transactions (auditing)

  • MDS 2012: Transactions removed from Explorer and now only available in Version Management; therefore users cannot reverse their own transactions, only administrators can do this. Annotations are now permanent and cannot be deleted.
  • MDS 2008 R2: Transactions viewable in two areas, Explorer and Version Management. Users can reverse their own transactions. Annotations can be deleted.

Metadata

  • MDS 2012: No support to view metadata – this is now a deprecated feature.
  • MDS 2008 R2: Customise the metadata for most objects in MDS. View metadata in the metadata explorer.

Installation & Configuration

  • MDS 2012: Installation from the main SQL Server setup program. Minor changes to the configuration settings.
  • MDS 2008 R2: Installation from the SQL Server media, but not included in the main setup program.

For me the improvements to the performance and the addition of the Excel Add-in are the most compelling reasons to upgrade. The great thing in particular about the Excel Add-in is that data stewards can manage data with the familiarity of Excel, but the data itself will be stored in SQL Server 2012.

Matching with Master Data Services, DQS and SSIS

If you have an MDM requirement to consolidate multiple sources of Master Data together into a single golden record, then you have a few different ways to achieve this on the Microsoft platform. This blog post gives an overview of the different ways that various matching methods can be used in conjunction with Master Data Services for consolidation in Master Data Management, outlining the pros and cons of each option. In summary, the options are:

  • Data Quality Services (DQS)
  • SQL Server Integration Services (SSIS)
  • Plus Master Data Services itself has a few (not that well known) matching capabilities

Data Quality Services

A key requirement in all but the simplest MDM solutions is that the matching/consolidation must be carried out in an automated manner, with a Data Steward alerted to take action if needed (e.g. the matching engine incorrectly matches two customers, the Data Steward opens MDS and corrects this). This scenario would be hard to achieve with DQS, as it’s possible to automate the DQS cleaning, but it’s not possible to automate the DQS matching. This is something that I’ve raised connect issues about, here and here.

If your data to be matched into MDS is not coming in on a daily basis, and you therefore need to do more ad-hoc matching to produce your golden Master Data records, then DQS could be for you. The MDS Excel Add-in will give you the capability of matching data that you import into Excel with MDS members, harnessing the power of DQS. An overview of how this works is available here.

Integration Services

SSIS has been around for a long time now and, as many of you will know, contains fuzzy matching components. With the right MDS model design, it’s possible to carry out a batch-based fuzzy match between your master records and end up with a mapping between the records that exist in your source systems and your MDS golden records. The rough conceptual steps to do this are:

  1. Load the new and changed records from the source systems into a staging area.
  2. Clean and standardise your data. This is actually something that DQS cleaning can help with.
  3. Query your staging area to get the new records that you want to insert/update into Master Data Services.
  4. Now the question arises, do we have an exact or close match for these records already in MDS? While the exact matches are easy to deal with, use the SSIS Fuzzy Lookup component to establish whether there are any approximate matches.
  5. Link the source records to master records (if the match is high enough) using MDS Domain Attributes.
  6. Carry out appropriate inserts and updates into MDS using the MDS staging tables.
  7. Ensure that a Data Steward is alerted in some way if necessary (e.g. if the match threshold is below x% confidence). This can be done with Email or MDS Notifications, for example.

This process can run in batch overnight, with the Data Steward approving or rejecting the matches that SSIS has carried out the following morning. Whilst the above over-simplifies the process and technical work required, hopefully the process makes sense at a high level.
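
The exact/fuzzy/alert decision at the heart of the process can be sketched in a few lines of Python. Here difflib’s SequenceMatcher stands in for the SSIS Fuzzy Lookup component, and the golden records, names and threshold are all made up:

```python
from difflib import SequenceMatcher

# Sketch of the conceptual flow: exact match first, then a fuzzy fallback,
# with low-confidence matches routed to a Data Steward for review.
golden_records = {"C1": "Bike World", "C2": "World Bike Discount Store"}
incoming = ["Bike World", "Bikes World", "Roadway Supplies"]

THRESHOLD = 0.75  # illustrative confidence cut-off

def best_fuzzy_match(name):
    scored = [(code, SequenceMatcher(None, name.lower(), golden.lower()).ratio())
              for code, golden in golden_records.items()]
    return max(scored, key=lambda pair: pair[1])

for name in incoming:
    exact = next((c for c, g in golden_records.items() if g == name), None)
    if exact:
        print(f"{name}: exact match -> {exact}")
    else:
        code, score = best_fuzzy_match(name)
        if score >= THRESHOLD:
            print(f"{name}: fuzzy match -> {code} ({score:.2f})")
        else:
            print(f"{name}: no confident match, alert the Data Steward")
```

In the real solution the exact matches would be handled in SQL, the fuzzy step by the SSIS Fuzzy Lookup component, and the steward alert by email or MDS notifications, but the branching logic is the same.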

Master Data Services

Although you cannot feed MDS your source data and get it to automatically carry out matching for you, it does actually contain the raw components in order to do this. By this I mean the MDS database contains an assembly called Microsoft.MasterDataServices.DataQuality, which gives you a number of fuzzy matching T-SQL functions. These are called from the MDS front end when you carry out some filtering when viewing entities. Using them just for filtering in the front end really isn’t using the functions to their true capability, but thankfully you can use these functions in your own code.

You can use the MDS T-SQL functions in a similar way to the conceptual SSIS method outlined above, in order to match and eliminate duplicates. In addition, the MDS web API can also be used to carry out a fuzzy match, as mentioned in this forum post. Retrieving match candidates using a web service may be an attractive option if you’re trying to do real time MDM.


Essentially, until it’s possible to automate DQS matching, we have a choice between SSIS and the MDS matching functions. The following e-book gives a very detailed overview of the matches that both are capable of doing. The MDS T-SQL functions are more flexible than the SSIS fuzzy components, as you can choose which fuzzy algorithm you want to use, but the SSIS components let you choose between Fuzzy Grouping and Fuzzy Lookup out of the box, without having to write SQL. Although I tend to find that both give very good matching results, the MDS T-SQL functions produce slightly better matches in my experience, plus they give you the option of trying different algorithms to suit different scenarios.
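
Why does being able to pick the algorithm matter? A quick Python sketch (my own toy implementations, not the MDS functions) shows how an edit-distance measure and a token-based measure can score the same pair very differently:

```python
def levenshtein_similarity(a, b):
    """Normalised edit-distance similarity (classic dynamic programming)."""
    a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))

def jaccard_similarity(a, b):
    """Token-set overlap: insensitive to word order, unlike edit distance."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

a, b = "World Bike Store", "Bike Store World"
print(levenshtein_similarity(a, b))  # low: many character edits needed
print(jaccard_similarity(a, b))      # 1.0: same tokens, different order
```

A token-based algorithm suits reordered company names, while edit distance suits typos, which is exactly the kind of per-scenario choice the MDS functions allow.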

It’s also worth mentioning that Profisee Maestro (full disclosure, we are a partner) integrates with MDS, offering its own matching algorithms. Maestro also has a front end with the ability to assist with survivorship and approval, which I think is a useful addition to MDS. Speaking of survivorship and approval, there are two options in MDS out-of-the box. The new Master Data Manager web front-end is much improved, but potentially the MDS Excel Add-In allows a bit more flexibility for survivorship carried out by a Data Steward, just due to its natural ability for filling/copying/pasting/filtering.

So overall, due to the various improvements, Master Data Services is now capable of tackling more complex MDM scenarios than in the 2008 R2 version.

Analysis Services Member Properties and Excel

When working with users who are browsing a cube using Excel 2010, a common requirement that I find is that the users want to create tabular or grid based reports. E.g. “I have a list of currencies, I want to display the currency code in an adjacent column”. Or “Here are my account codes, I want to also see the account type”.

The common way that I see users attempt this is to stack attribute hierarchies on top of one another. For example, using Adventure Works, to get the currency names and codes together, I can stack the two attribute hierarchies:


This produces the following report:


Not exactly what we want – ideally we want the currency codes in separate columns, on the same line as the name.

An Alternative Approach

One well known way to achieve this layout is to right click on the relevant member and display its member properties, whereas another is to set the layout to Tabular Form. I tend to prefer the member properties route as I think it produces slightly better reports, so I’ll be covering a simple member properties design tip shortly. First of all, here’s the sort of report that I’m able to create using member properties:


I’m able to do this because the Adventure Works Source Currency Code Attribute Hierarchy has the Source Currency (names) as a member property. Therefore, I get the following option when I right click on the currency code:


Displaying member properties will only work if the attribute relationships are set up correctly. The reason that I’m able to see the member property ‘Source Currency’ is because of the following attribute relationship:


However, if we just display the currency name, using the Source Currency attribute hierarchy and then try and right click to see the currency code, we’ll unfortunately see nothing in the ‘Show Properties in Report’ submenu. This is because the currency name is a property of the code, rather than the other way around. I’d argue that for a user it’s intuitive to get the name from the code, or vice versa, as they sit at the same level. With a few simple changes we can achieve this, essentially by adding an extra attribute to the dimension…

Within the dimension editor in BIDS, the first step is to add the new attribute. In this case it should be based on the Currency Code column that is, of course, already in use. That doesn’t matter; we want to use it twice. Therefore, I’ve created a new attribute called Currency Code and have set AttributeHierarchyEnabled = False, as there’s no need for anyone to browse this attribute. Now onto the all-important attribute relationships. The attribute hierarchy that contains the names is called Source Currency. As it’s not possible to get the currency codes by right clicking on this attribute hierarchy, Source Currency is the attribute that needs to be related to the new Currency Code attribute:


Finally, the end result for the user is that they now get the ability to easily get access to the currency codes by right clicking on the names:


Power View in SQL Server 2012 RTM

Power View has been given a couple of small changes from RC0 to the RTM release:

Firstly, you can now choose the colour of the various visualisations via the Themes section, which you’ll find on the new Styles ribbon tab:


This will allow you to change the colour for a slicer, table or card, whereas for a chart it will set the palette that gets used for the chart series. Surprisingly it seems to set the theme globally, rather than allowing you to set the colour/palette for a single object:


The other change that I’ve noticed is that the Card visualisation now has a Style option on the ribbon:


This allows the cards to be displayed without the normal coloured background and with a much larger font, which is potentially useful for highlighting key numeric facts:


I do really like Power View, in particular the clean look & feel, fast performance, export to PowerPoint and animations to name a few. However, even though RC0 added a lot of new features, I think it’s still missing some of the functionality offered by other reporting tools. I know that it’s marketed as a data visualisation tool rather than a reporting tool, but those lines will become a bit blurred for users. This is where Connect comes in…

One Connect suggestion that I've created is to allow drill-down in Power View charts. If you've used Power View to spot a trend or spike in the data, it makes sense to be able to manipulate the view to find the root cause, which drill-down would help with. Other suggestions that I think would be useful are heat maps and allowing Power View to access multidimensional models. Please vote on Connect if you agree!

DQS Matching Vs SSIS Fuzzy Grouping and Lookup

When the term fuzzy matching comes up in the Microsoft world, it's natural for anyone who's used the BI stack to think of SQL Server Integration Services (SSIS), due to the fact that it has both Fuzzy Grouping and Fuzzy Lookup components. Therefore, when using the matching in Data Quality Services (DQS) the other day, I thought it might be worth contrasting the matching capabilities of the two products.


The SSIS Fuzzy Grouping transformation is typically used on a set of data containing duplicates, as it will operate across all of the rows, grouping similar rows together based on fuzzy comparisons, with a view to eliminating duplicates. In contrast, the Fuzzy Lookup transformation takes a value in the SSIS pipeline and uses fuzzy matching to match the input value against a set of clean reference data in a database.

The matching in DQS takes a set of data and groups the data into clusters, based on a number of rules and other factors, again with a view to removing duplicates. It’s therefore only directly comparable against the SSIS Fuzzy Grouping, rather than the Fuzzy Lookup.
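As an aside, the two concepts are easy to sketch in code. The following Python snippet is a minimal illustration only, using difflib's SequenceMatcher as a stand-in similarity measure (this is not the algorithm that either SSIS or DQS actually uses, and the sample names are made up): fuzzy_group clusters a list of raw values, in the spirit of Fuzzy Grouping, while fuzzy_lookup matches a single value against clean reference data, in the spirit of Fuzzy Lookup.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude similarity between two strings, in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_group(values, threshold=0.8):
    """Fuzzy Grouping idea: each value joins the first existing
    cluster it resembles closely enough, otherwise starts a new one."""
    clusters = []
    for value in values:
        for cluster in clusters:
            if similarity(value, cluster[0]) >= threshold:
                cluster.append(value)
                break
        else:
            clusters.append([value])
    return clusters

def fuzzy_lookup(value, reference, threshold=0.8):
    """Fuzzy Lookup idea: match one incoming value against a set of
    clean reference values, returning the best match above a threshold."""
    best = max(reference, key=lambda ref: similarity(value, ref))
    return best if similarity(value, best) >= threshold else None

customers = ["Pamela Fernandez", "Pamela Fernndez", "John Smith"]
print(fuzzy_group(customers))
print(fuzzy_lookup("Pamela Fernndez", ["Pamela Fernandez", "John Smith"]))
```

With real data you'd want token-based measures and blocking to avoid comparing every pair of rows, but the grouping-versus-lookup distinction is the same.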

Test Scenario

For the purposes of this blog post, I've got a sample query from AdventureWorksDW, taking data from the DimCustomer table. I've unioned all that data with a few sample records that contain typos. E.g. one customer's name is misspelt as 'Pamela Fernndez', and another has '4610 Pinto Rd' instead of 'Road'. I'm going to attempt to de-duplicate the data in SSIS first and then do the same thing in DQS to see if there is any difference.

SSIS Fuzzy Grouping

As shown below, I’ve just got a very simple OLE DB source passing some data into a Fuzzy Grouping, then I’m going to examine the data via a data viewer:

Control flow

The columns that I'm passing into the Fuzzy Grouping for the customers are Full Name, Address Line 1, Gender and City, just for the purposes of this blog post. I've configured the Fuzzy Grouping transform to carry out fuzzy matching on Address Line 1 and Full Name, as these are (intentionally) the most inconsistent across my made-up data. I've also dropped the threshold right down to 0.40 to see the full range of matches that we may or may not get:


I’ve filtered the data for customers with a surname of Fernandez or a first name of Pamela, which includes two made up records. When I run the package, the data viewer produces the following results:

Data Viewer

The _score column has a high score of 0.8245935, which is based on the similarity scores of FullName and AddressLine1. There are 3 records included in the cluster, including someone called 'P Fernandez' who is male. Although that record has been matched, it's only because I decreased the threshold, and it can easily be excluded by adding an exact match on Gender, as shown below:
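The effect of mixing fuzzy columns with an exact-match column can be sketched in a few lines of Python. This is a hypothetical illustration only (difflib's ratio is not the SSIS similarity algorithm, and the rows are invented): a Gender mismatch eliminates a candidate outright, however well its fuzzy columns score.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Made-up rows in the spirit of the test data
base = {"FullName": "Pamela Fernandez", "AddressLine1": "4610 Pinto Road", "Gender": "F"}
rows = [
    {"FullName": "Pamela Fernndez", "AddressLine1": "4610 Pinto Rd",   "Gender": "F"},
    {"FullName": "P Fernandez",     "AddressLine1": "4610 Pinto Road", "Gender": "M"},
]

def is_match(a, b, threshold=0.40):
    # Exact-match column: any mismatch eliminates the pair outright
    if a["Gender"] != b["Gender"]:
        return False
    # Fuzzy columns: average the similarity scores and compare to the threshold
    score = (similarity(a["FullName"], b["FullName"]) +
             similarity(a["AddressLine1"], b["AddressLine1"])) / 2
    return score >= threshold

matches = [r for r in rows if is_match(base, r)]
```

Here only the female 'Pamela Fernndez' record survives; the male 'P Fernandez' is eliminated by the exact match before any fuzzy scoring happens.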


DQS Matching

So overall SSIS copes well with the test scenario; how does DQS compare?

After creating a knowledge base, creating domains and carrying out knowledge discovery, it’s possible to use DQS to create a Matching Policy. The Matching Policy involves building a number of rules that determine how the data matches. The starting point in creating a matching rule is determining which domains you want to match on and whether they should be matched using the fuzzy algorithms (similar) or matched exactly:

Rule editor

So not too dissimilar to SSIS. You then give each domain a percentage weight, and the weights must add up to 100%. For example, if Full Name has a weight of 30% and scores 61% similarity when run, while Address Line 1 has a weight of 70% and scores 45% similarity, then the overall match score for the row will be (0.3 * 0.61) + (0.7 * 0.45) = 49.8%.
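That weighted calculation is simple enough to express directly. Here's the worked example above as a few lines of Python, using just the illustrative numbers from the text:

```python
# Per-domain weights (must sum to 100%) and per-domain similarity scores
weights = {"Full Name": 0.30, "Address Line 1": 0.70}
scores  = {"Full Name": 0.61, "Address Line 1": 0.45}

# Overall row score = sum of (weight x score) across the domains
overall = sum(weights[d] * scores[d] for d in weights)
print(f"{overall:.1%}")  # prints 49.8%
```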

Starting off with a 50% weight on the two domains, I get the following results when operating on the same data:

dqs clusters

As with SSIS, it hasn’t got it quite right, but using the Gender domain will help. What I quite like in DQS is that I can flag a domain as being a pre-requisite:


Flagging a domain as a pre-requisite means that the domain will be used to eliminate matches, but will not contribute to the matching score. If I wanted it to contribute to the matching score, I would just uncheck pre-requisite and give it a weight. Running the matching again gives the following results:
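The difference between a pre-requisite domain and a weighted domain can be sketched as follows. Again, this is a hypothetical Python illustration of the behaviour (difflib is standing in for the real similarity algorithm, and the records are invented), not how DQS is actually implemented:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(a, b, weighted, prerequisite):
    # Pre-requisite domains eliminate candidate pairs outright,
    # but contribute nothing to the overall score
    for domain in prerequisite:
        if a[domain] != b[domain]:
            return None  # eliminated, regardless of the other domains
    # Weighted domains each contribute (weight x similarity)
    return sum(w * similarity(a[d], b[d]) for d, w in weighted.items())

weighted = {"Full Name": 0.5, "Address Line 1": 0.5}
a = {"Full Name": "Pamela Fernandez", "Address Line 1": "4610 Pinto Road", "Gender": "F"}
b = {"Full Name": "P Fernandez",      "Address Line 1": "4610 Pinto Road", "Gender": "M"}
print(match_score(a, b, weighted, prerequisite=["Gender"]))  # None: eliminated on Gender
```

With Gender as a pre-requisite the pair is rejected outright; drop the pre-requisite and the same pair scores well on the weighted domains alone.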

dqs results 2

So the duplicate is correctly matched with a score of 75%. Drilling down on how the match was achieved is also quite nice in DQS:


This brings me to an interesting observation about the matching. I originally had my dummy record with a value of "3 Pierr Demoulainn" for the address. So in addition to a typo in the second name and the missing word "Rue", this test also dropped the second 'e' from the word "Pierre". If I run my test data with these values, I get no matches at all. The lack of the second 'e' in "Pierr" seems to have made all the difference. I can get it to match if I give a lot of weighting to Full Name; if I do this, we see the following results:

drill down 2

Address Line 1 has been given a score of 0%. Making the same change to my test data and then running it in SSIS gives different results: the similarity for the address drops from 0.81 to 0.75, but it certainly doesn't drop to zero. Although it will depend massively on your own data, the matching in DQS seems a bit more stringent. This can be mitigated by using more domains for the matching, rather than the simple 2 or 3 domains that I've used. To back that statement up, using a composite domain across both Address Line 1 and City does yield a strong match (71%) for the same data.
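The gradual-degradation behaviour seen in SSIS can be illustrated with a quick sketch. Using difflib's SequenceMatcher (which is not the algorithm either product uses, so the exact numbers won't reproduce the 0.81 and 0.75 above, and the clean address is my assumed reference value), an edit-style similarity measure falls gradually as typos accumulate rather than dropping straight to zero:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

clean = "3 Rue Pierre Demoulin"   # assumed clean reference value
typo1 = "3 Pierre Demoulainn"     # missing "Rue", surname typo
typo2 = "3 Pierr Demoulainn"      # as above, plus the second 'e' dropped

print(round(similarity(clean, typo1), 2))
print(round(similarity(clean, typo2), 2))
```

Both variants still score well above zero here, which is the behaviour I saw in SSIS; the DQS matching, by contrast, wrote the whole domain off for the second variant.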


The rule based matching in DQS gives a lot of flexibility on how to weight matching across different domains. Coupled with the knowledge base components that interact with the matching, such as cleaning and Term-Based Relations, DQS has more features specifically aimed at matching than SSIS. It all depends on what sort of data you’re working with, but in theory, as you maintain the knowledge base over time, DQS should give you strong matches based on the knowledge you’ve given it.

However, there are some drawbacks – it's a shame that the matching can't be automated (please vote for my Connect suggestion if you agree). SSIS by its very nature can be automated, meaning the matching in SSIS will be more suitable for a lot of implementations. Secondly, I'd like to be able to match external records against correct values already in the knowledge base, rather than just doing a fuzzy grouping on the external records. I'm a bit surprised DQS can't do a fuzzy lookup; maybe this will change in V2. Finally, as I've shown above, some of the matching results are a little unexpected, whereas in SSIS they were good pretty much straight away. To be fair to DQS, this could probably be mitigated with a better choice of matching fields, and also by running cleaning against a fully populated knowledge base beforehand.