Jeremy Kashel's Blog

Analysis Services Dimension Processing Errors

I’ve seen a few data quality issues recently in some client data warehouses/marts where I’ve been asked to carry out some Analysis Services work. These data quality issues have often led to the classic SSAS dimension processing error, namely:

“Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'dbo_DimProduct', Column: 'SubCategory', Value: 'Other'. The attribute is 'Sub Category'”

Normally this error is easy to track down, but there are a few scenarios that I’ve encountered where SSAS attribute properties actually make it harder for you to troubleshoot.

Scenario 1 – Trimming

When hitting the above error, the first thing to do would be to run a query to see where the duplicate attribute is. In my example, I’ve just produced a very simple dimension that’s loosely based on Adventure Works – here are the attribute relationships:

image

Therefore the first place to start would be to write a query that checks if we have any Sub Categories that are duplicated across Categories, such as:

SELECT        SubCategory
FROM          DimProduct
GROUP BY      SubCategory
HAVING        COUNT(DISTINCT Category) > 1

In my case, this actually yields no results. So what’s happening here? We can’t find any duplicates in SQL, but SSAS thinks there are duplicates. Well….Analysis Services dimension attributes contain a property called Trimming, which, by default, will remove spaces at the end of an attribute key or name. Needless to say, a dimension shouldn’t actually contain leading or trailing spaces, but in this specific scenario, the ETL wasn’t as robust as it could be. In the actual example I encountered, there was a tab at the end of one of the attribute names. To visualise that in the DimProduct example that I’ve created, the highlighted attribute has a tab at the end:

image
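
With the trailing tab in mind, a variation on the earlier query – trimming trailing spaces and stripping out tabs before grouping, to roughly mimic what SSAS ends up comparing – will surface the duplicate that the processing error is complaining about. This is just a sketch based on the example table above; CHAR(9) is the tab character:

SELECT        REPLACE(RTRIM(SubCategory), CHAR(9), '') AS CleanSubCategory
FROM          DimProduct
GROUP BY      REPLACE(RTRIM(SubCategory), CHAR(9), '')
HAVING        COUNT(DISTINCT Category) > 1

-- Alternatively, find the individual rows that contain tabs or trailing spaces
SELECT        *
FROM          DimProduct
WHERE         SubCategory LIKE '%' + CHAR(9) + '%'
              OR DATALENGTH(SubCategory) <> DATALENGTH(RTRIM(SubCategory))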

Scenario 2 – NullProcessing

The next scenario I want to highlight is the same error, but a different dimension. Again I’m going to create a simple example to illustrate what actually happened. This time I’ve got a customer dimension table, as follows:

image

The error is: A duplicate attribute key has been found when processing: Table: 'dbo_DimCustomer', Column: 'CustomerCountryKey', Value: '0'. The attribute is 'Customer Country'

A quick query on the above data will reveal that a CustomerCountryKey of 0 doesn’t actually exist twice for the same country, so what’s happened here? Again it’s due to a dimension property, this time on the attribute key, called NullProcessing. As its default is Automatic, NULLs get converted to zero during dimension processing. As we already have a member with a CustomerCountryKey of 0, we get a duplicate key error.
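
A quick check along these lines would have highlighted the clash before processing – the table and column names are taken from the example above:

-- How many NULL keys will Automatic NullProcessing convert to zero,
-- and does a genuine key of 0 already exist?
SELECT        SUM(CASE WHEN CustomerCountryKey IS NULL THEN 1 ELSE 0 END) AS NullKeys,
              SUM(CASE WHEN CustomerCountryKey = 0 THEN 1 ELSE 0 END) AS ZeroKeys
FROM          DimCustomer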

ETL and Data Quality

These sorts of issues highlight just why handling data quality during the ETL is so important! For the first scenario, some basic trimming/cleaning can fix the issue before it hits Analysis Services. As for the last issue, this is partly due to data warehouse/mart design. Personally I would never allow a dimension attribute to be null, partly because you shouldn’t present null attributes to a user, but also due to the lack of control that can result, as shown above. Therefore a bit of design work up front to ensure that you have a robust data model is essential. Interestingly, the properties that I’ve mentioned don’t exist for tables in Analysis Services Tabular. While I think there are a few SSAS MD features that need to make it over into the next version of Tabular, I can personally live without the two properties outlined above.

Tabular Analysis Services Perspectives and Measure Groups

I’ve always liked how Excel and other client tools deal with measure groups in a multi dimensional SSAS cube. Once you connect to the cube, you get a drop down in the Pivot Table fields pane that lets you choose which measure group you want. From there, you get a filtered list of measures and dimensions. For example, if I connect to the Adventure Works 2012 sample cube, I get the following:

image

By picking Internet Orders, I will see only the measures and dimensions that relate to Internet Orders, which provides me with a good way of navigating a large cube.

Tabular

Unfortunately though, if I connect to a Tabular model (e.g. the sample Adventure Works 2012 model that I got from CodePlex) and click the same drop down, I will see every table in the entire model, whether the table is a dimension, a fact or something else. For example, for a tabular model that contains 3 fact tables, but 15 tables in total, I see all 15 tables:

image

This isn’t ideal in my opinion, as I’d much rather that users had the ability to quickly jump to an area of interest, which measure groups achieve fairly well. As I have 3 fact tables, I would expect to see those 3 fact tables/measure groups in the drop down.

I did accept this as a quirk of tabular until I stumbled across something recently. If you connect to a tabular perspective, then Excel behaves exactly as Multi Dimensional does, i.e. it shows you only measure groups in the drop down. To illustrate I’ve added a new perspective to the model that contains all tables and I’ve called this perspective ‘Adventure Works’:

image

Now I can connect to this new perspective via Excel. Remember the perspective contains all my tables, so it shouldn’t be any different from connecting to the model itself:

image

The perspective does give a different result though – now only the actual measure groups are displayed to me in the drop down, which is much more user friendly:

image

Summary

This is completely different to the way that perspectives work in multi dimensional. When making MD cubes, I wouldn’t always need perspectives, just because the measure groups provided a ‘natural’ way for users to pick the subset of the cube that was of interest to them. Now with tabular, it seems that using a perspective will actually improve the user experience.

MDS Derived Hierarchy Permissions

The MDS Hierarchy Member permissions have been given a complete revamp in SQL Server 2012. As explained here, the hierarchy member permissions are purely optional and allow you to limit the access to a specific set of members.

Even though the actual member permissions are optional, you may simply need to allow a user to browse and/or update a Derived Hierarchy. This is also something that’s been affected by the aforementioned change, so I want to explore here how we would grant users access to a Derived Hierarchy.

In 2008 R2, on the model tab, you got a Derived Hierarchies node, as shown in the image below:

image

If we contrast that to SQL Server 2012 (albeit with a different model), we don’t have a Derived Hierarchies node:

image

Therefore, to access the actual hierarchy, according to this article, you need to give the user Update permission on the model. That’s not the only way to do it, but certainly working only at the entity level, or the member level, is not good enough. To explore, let’s look at an example. I have the sample Customer model installed and I want to give a test user access to the following CustomerType hierarchy, which is just made up of the CustomerType and Customer entities:

image

Going to the User and Group Permissions, here’s a quick confirmation that assigning Update permission to the Customer and CustomerType entities is not good enough. As shown below, the user does not have access to any of the Derived Hierarchies that I wanted:

image

Upping the permissions level a bit by granting read only permission to the model does the trick – the test user can now see the Derived Hierarchy and can also add members. This is thanks to 1) a Read Only permission at the Model level and 2) Update permission on the two entities that make up the Derived Hierarchy:

image

However, the simplified security comes with a slight drawback, which is that, to get access to the Derived Hierarchy, we’ve now given the user Read Only access to all entities in the model:

image

You may not want to grant a user access to all entities, but thankfully it can be avoided by granting an explicit Deny at the entity level. For example, as shown below I’ve set an explicit Deny for the AddressType entity:

image

This will ensure that the user now doesn’t see the AddressType entity on the Entities menu. Now that the user can see the hierarchy, what about restricting access to the members? That’s a bigger topic, so I don’t want to cover it here, but I will just provide a useful tip. When you’re testing your security, don’t forget that member permissions are not applied immediately! The following article explains how to immediately apply the member permissions, which is something that I’d recommend carrying out whilst testing.

MDX With Microsoft SQL Server 2012 Analysis Services Cookbook Book Review

Packt have just released a second version of their ‘MDX Cookbook’, titled ‘MDX With Microsoft SQL Server 2012 Analysis Services’, authored by Sherry Li and Tomislav Piasevoli. As I’m one of the technical reviewers of this book, it’s put me in a good position to be able to put together a review of the book in this blog post.

The new SQL Server 2012 version of the book is similar in terms of content and structure to the original, presenting a number of solutions to common MDX problems in a ‘recipe’ format.

In the first two chapters, the book covers basic MDX techniques, such as the layout of a basic query and a variety of techniques for working with sets. The pace of the book warms up nicely here, starting with a few basic concepts, before gradually getting more complex.

After a useful chapter on working with time in MDX, there are some excellent chapters that give what should be useful recipes on common reporting techniques, such as top/bottom members, ranking and averages.

As the book progresses, there are some more complex chapters that cover areas such as linear regression, complex sorts and recursion – to name a few. Whether the recipes are simple or complex, they work well, as each recipe is packed with step-by-step instructions and screenshots, all based on the SQL Server 2012 Adventure Works cube.

image

Although it’s also in the first edition, I particularly like the chapter ‘When MDX is Not Enough’, which talks about how modifying the cube design can assist you when writing some MDX queries. Essentially the chapter promotes evaluating if you can modify the cube design in some way, rather than writing very complex MDX, which is something that I completely agree with.

The book differs slightly from the 2008 R2 version in that a few more of the basics (e.g. basic MDX query structure) are covered in the introductory chapters. This seems to work well, meaning someone relatively new to MDX can pick up some simple concepts, whereas someone who is more experienced in MDX will also get a massive amount out of the book.

So overall this comes highly recommended for your bookshelf or Kindle…

SSIS Fuzzy Lookup Transformation Tokens

After doing my recent blog post on the Excel Fuzzy Lookup Add-In Vs SSIS, it got me thinking about the capabilities of the SSIS Fuzzy Lookup Transformation.

Although I’m aware that the Fuzzy Lookup is a token-based solution, I’ve only ever really tuned it by altering the similarity thresholds on both the columns and also on the task itself. What you can also do is alter how the transform deals with tokens. Therefore, I thought it would be worth a quick post to show how the tokens help produce matches.

Tokens

The Fuzzy Lookup works by splitting up strings into several different components, known as tokens. For example, the Adventure Works reseller “Cross-Country Riding Supplies” might be split up into the tokens “Cross”, “Country”, “Riding” and “Supplies”. This is key to the matching process, as the fuzzy algorithms will attempt a match based on the commonality between tokens in the input and reference datasets.

SSIS Fuzzy Lookup

The advanced tab of the fuzzy lookup will show the delimiters used to create the tokens. As you can see, a hyphen is included in the list, meaning words separated by a hyphen get interpreted as separate tokens:

image

On the “Reference Table” tab of the transformation, there is the option to store a new index, which means that SQL Server will create a table in the data source to hold the tokens found in the reference table:

image

In my made-up dbo.Retailers table, I’ve got two retailers – “Nearest Bike Store” and “Gears+Chain Supply”. After running the SSIS package once, the dbo.IndexTest table will be created, which is where the reference tokens get stored in order to help with matching. As it’s just stored in SQL Server we can select from it. Before we do that, I should just say that I’m only passing this fuzzy lookup 1 input string, which is a misspelt company called “Bikes Chaiin Supply”. If I run the package, I get the following matches and scores:

image

This has a fairly low similarity and confidence, so we can probably do better. Going back to the index, if we select from dbo.IndexTest, we get the following:

image
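
For reference, there’s nothing clever going on here – it’s just a plain select against the index table that was named on the Reference Table tab:

SELECT        *
FROM          dbo.IndexTest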

The index table also contains substrings found in the reference table, known as q-grams, which are designed to assist with matches that contain errors. What’s interesting about these results is that “gears+chain” has been entered on its own, meaning that the word “chain” has not been interpreted as a token. The solution, in this case, is to alter the token delimiters to include “+”:

image

If we re-run the package and then look again at the index table, we now see that the word “chain” has been successfully interpreted as a token. In addition we also have a few variants on the word chain, such as “chai” and “hain”:

image

After re-running the matching process with our new delimiters, we can now look at the results of the matching process. If we take a look at the data viewer output, we can see that this has now resulted in a better similarity and confidence for the first row:

image

The delimiters work across all columns for the whole fuzzy lookup,  so that’s something to test when changing the delimiters – it may not be beneficial in all situations. However the new delimiter has clearly worked very well in this example, so it’s very much worth considering if you’re not quite getting the level of matches that you had hoped for.

Debug MDX with Generate

On occasion you may find that your MDX queries don’t quite give you the result that you expect, which sometimes may be due to the way in which a certain bit of MDX has changed the context of part of your query.

This can sometimes be tough to figure out, especially with functions like Sum() or Filter(), whereby we are looping over a set and then only returning the end product. For example, take the following query, where I’m trying to find out the count of UK male customers who have bought Road Bikes in both the current and the previous year:

WITH MEMBER [Measures].[Repeat Customer Count] AS
SUM(
    [Customer].[Customer].[Customer].Members,
    IIF([Measures].[Internet Sales Amount] <> 0 AND
    ([Measures].[Internet Sales Amount], [Date].[Calendar Year].PrevMember) <> 0, 1, NULL)
    )

SELECT      {[Measures].[Repeat Customer Count]} ON 0,
            [Date].[Calendar Year].[Calendar Year] ON 1
FROM        [Adventure Works]
            --Road Bikes, UK, Male
WHERE       ([Product].[Subcategory].&[2], [Sales Territory].[Sales Territory Country].&[United Kingdom],
            [Customer].[Gender].&[M])

It returns the following result set, which I know to be incorrect:

image

Unlike debugging either of the axes, where of course we can see what gets returned, debugging examples such as the above might be difficult. The MDX Generate function can help here, as it works in a similar way to Sum(), by iterating over a set and then evaluating an argument once per iteration. The idea is that Generate will build us a new set, by using the looping context of the first set that you pass it. In this case, however, we can use it to return us a concatenated string based on our input set.

What I’ve done here is to use the CurrentMember of attributes that I think might be causing problems (Customer, Gender) and then just print them out as a string in order to see if the attributes have the context that I expect. Therefore I’ve replaced the ‘1’ in my original query with the string of:

" | " + [Customer].[Customer].CurrentMember.MEMBER_CAPTION + ", " 
    + [Customer].[Gender].CurrentMember.MEMBER_CAPTION

The full query and results then become:

WITH MEMBER [Measures].[Repeat Customer Count] AS
GENERATE(
        [Customer].[Customer].[Customer].Members,
        IIF([Measures].[Internet Sales Amount] <> 0 AND
        ([Measures].[Internet Sales Amount], [Date].[Calendar Year].PrevMember) <> 0, 
        " | " + [Customer].[Customer].CurrentMember.MEMBER_CAPTION + ", "
        + [Customer].[Gender].CurrentMember.MEMBER_CAPTION, NULL)
        )

SELECT        {[Measures].[Repeat Customer Count]} ON 0,
            [Date].[Calendar Year].[Calendar Year] ON 1
FROM        [Adventure Works]
            --Road Bikes, UK, Male
WHERE        ([Product].[Subcategory].&[2], [Sales Territory].[Sales Territory Country].&[United Kingdom],
            [Customer].[Gender].&[M])

image

I can spot straight away that female customers have been returned in 2007, which is due to the fact that sets specified inside the WITH MEMBER section are not evaluated in the context of the slicer. In this case the problem can be solved by using the EXISTING keyword inside the Sum, so that the customers get sliced by the gender attribute:

SUM(
    EXISTING [Customer].[Customer].[Customer].Members,
    IIF([Measures].[Internet Sales Amount] <> 0 AND
    ([Measures].[Internet Sales Amount], [Date].[Calendar Year].PrevMember) <> 0, 1, NULL)
    )

This won’t be applicable in all MDX debugging scenarios, but hopefully it will help someone out.

Excel Fuzzy Lookup Add-In and SSIS

Although it’s been out for a while, I thought I’d mention, for those who don’t know, that you can download a Fuzzy Lookup Add-In for Microsoft Excel.

The Add-In allows you to perform a fuzzy lookup for Excel data against another Excel dataset. I won’t go into too much detail, as this has already been covered here, but essentially the Add-In requires you to set up two datasets, namely a “left” and a “right” table. Once you carry out some configuration/mappings, the fuzzy match will return all of the rows in your left table, with the appropriate values looked up from your right table. As an example I’ve chosen to do my lookup based on reseller names and an address:

image

Part of the reason for this post is that I think this add-in could be useful from a self service perspective for power users to assist in MDM/Matching/DW projects. There’s often a lot of trial and error in trying to get the right balance for matching rules, and, given that the Fuzzy Add-in gives similar results to SSIS, it may well be a good place to start with matching a dataset as part of an MDM project, for example. Alternatively, it could be used to carry out pre-matching/clean-up on data before an MDM project starts.

SSIS

I also wanted to contrast this Add-in to the Fuzzy Lookup transformation that you get in SSIS. Both the Fuzzy Lookup Add-In and the SSIS variant do the same thing – they both match one dataset against another in a fuzzy manner, returning you 1-n matches per row. In fact, according to this paper, they use the same Microsoft Fuzzy algorithm.

With the default settings, at least with my small test datasets, I get similar results in Excel and SSIS, albeit with different similarity numbers. These are the key differences that I’ve noticed:

  • Excel returns only overall row similarity, whereas SSIS returns similarity per column also.
  • Excel exposes an additional set of tuning options. As an example, you can set the Containment Bias, whereby you decide on the penalty for tokens in the right record that are not in the left record. You can also edit a number of other settings in an XML file.
  • SSIS outputs both the confidence and similarity, whereas Excel outputs only the similarity.
  • Excel will output the similarity XML per row, whereby you can see how it’s split up a string into differently weighted tokens.
  • As you might expect, SSIS outperforms Excel.

As a quick test, I started working with Adventure Works resellers to see if I could get the same match results from both products. To highlight this, I’ve filtered the data down to just 1 made-up reseller (“Bikes World Comp. Store”) and looked at the top 5 resellers that each product returns. SSIS, with its default settings, gives the following when matching just on the reseller name:

image

Excel, using the Fuzzy Add-In, gives the following with its default settings:

image

So overall some very good matches out of the box from the Excel Add-In, certainly comparable to SSIS. What’s interesting about this result set is that “Bike World” is ranked higher in Excel than “World Bike Discount Store”. We can align the behaviour of the two products, in this case anyway, by changing the Containment Bias from a default of 0.8 to 0. This will assign a full penalty to tokens in the right record that are not in the left record. So, as the left record is “Bikes World Comp. Store”, and the right record is “Bike World”, we are by default being lenient on the fact that it’s missing tokens. Increasing the penalty for missing tokens gives the following, with the top record now matching SSIS:

image

Of course, you would normally match on more than just the name, plus what’s worked above here for me might not work in other scenarios. But the bottom line is that the Excel Fuzzy Lookup Add-In gives very good matches out-of-the-box, is easy to use for the Excel user and therefore could well be of some use as part of data matching projects.

Debug Custom MDS Workflows

The custom MDS workflows, AKA External Action Business Rules or custom MDS assemblies, are one of the more advanced features of MDS. They allow you to extend the standard MDS business rules by getting MDS to fire entity member data at an external assembly of your choice.

There are a lot of moving parts to this feature: MDS business rules, the custom .NET assembly, Service Broker and a Windows service! When a business rule meets your condition, a message is written to a Service Broker queue. The Windows service interrogates the queue and then passes the information to the custom assembly. I’m not going to show how to do this, as it’s already been done by Johan Machielse and also on MSDN. What I would like to show is how to debug the use of the custom business rules, just due to the number of moving parts and also a couple of quirks that I’ve encountered when using them.

Custom Assembly Deployment

One of the first things that may catch you out is the deployment of the assembly. To troubleshoot this, I recommend the “-console” argument of the workflow service. If you run it in command line mode you may encounter a few errors:

“Could not load file or assembly 'YourAssembly' or one of its dependencies. This assembly is built by a runtime newer than the currently loaded runtime and cannot be loaded.”

This means that you need to go to the properties of your assembly in Visual Studio and change the Target Framework from .NET Framework 4 to 3.5.

Another error that you may get is "ERROR: Could not create type. Ignored.” Assuming that you have actually put the assembly in the bin directory, this is most likely to be because either the namespace or the class name in your DLL doesn’t match what you have in your config file. Check that the contents of Microsoft.MasterDataServices.Workflow.exe.config match your DLL.

You may also get, “ERROR: Type does not implement IWorkflowTypeExtender.” This means that your assembly has been picked up correctly, but it does not implement IWorkflowTypeExtender.

If the assembly does start up ok without errors, you may still get errors when the exe tries to take a message off the service broker queue. An error such as “Could not dispatch due to missing extender <tag>” means that the tag that you’ve given to the assembly in the config file does not match the “Workflow Type” value of the MDS business rule:

<setting name="WorkflowTypeExtenders" serializeAs="String">
    <value>MDSTEST=MDS.CustomWorkflow, CustomWorkflow</value>
</setting>

image

Debugging the Custom Assembly

Once you do get the assembly working, you may find that the code isn’t quite doing what you want. A good way to troubleshoot this is to choose “Attach To Process” from the Debug menu in Visual Studio, as shown below:

image

If you do this, then you can add breakpoints to your .Net code as normal and step through in order to troubleshoot issues.

To conclude, this probably isn’t an exhaustive list of all the sorts of errors that could occur with custom workflows, but with any luck it will save someone a few hours of troubleshooting.

Data Quality Services - Kimball ETL Screening

Although it’s arguably not the most popular in the Kimball series, The Data Warehouse ETL Toolkit contains a huge amount of vital information that’s focussed on the ETL side of data warehousing. In particular I’ve found the data quality screens design to be useful, so I thought it might be worth exploring if this concept can be applied to Data Quality Services.

What is a Kimball Data Quality Screen?

A screen is simply a check that is made against data that comes into the ETL pipeline and is used to measure data quality. Physically, the Kimball screen design is typically implemented as a table called Screen, which contains, as rows, the actual checks to perform against staged data. In addition, the screen table will contain metadata to aid data quality measurement, such as screen severity, category, source system and possibly exception actions.

Each time a screen yields some results (i.e. data quality issues), another table, the Error Event Fact, is populated with the results. Therefore, by using the data generated by the screening process, the Error Event Fact becomes the place to look for an overall measurement of data quality. The schema for this design from the Kimball book (with a few modifications) is as follows:

image
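
As a rough T-SQL sketch of the two tables described above (the column names here are illustrative rather than lifted from the book):

CREATE TABLE dbo.Screen
(
    ScreenId            INT IDENTITY(1,1) PRIMARY KEY,
    ScreenName          NVARCHAR(100) NOT NULL,
    ScreenSQL           NVARCHAR(MAX) NOT NULL,  -- the check to run against staged data
    ScreenSeverity      NVARCHAR(20),
    ScreenCategory      NVARCHAR(50),
    SourceSystem        NVARCHAR(50),
    ExceptionAction     NVARCHAR(50)
)

CREATE TABLE dbo.ErrorEventFact
(
    ErrorEventId        INT IDENTITY(1,1) PRIMARY KEY,
    ScreenId            INT NOT NULL,            -- FK to dbo.Screen
    ErrorEventDate      DATETIME NOT NULL,
    SourceRowIdentifier NVARCHAR(50)             -- e.g. the PO Id of the failing row
)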

The check performed by each screen is typically a snippet of SQL that checks for a particular condition, e.g. Is the customer’s postcode missing, or is the tax amount within permitted ranges?
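For example, the two checks mentioned above could be nothing more than queries like these, each returning the rows that fail the screen (the table and column names are made up for illustration):

-- Screen: customer postcode is missing
SELECT        CustomerId
FROM          Staging.Customer
WHERE         Postcode IS NULL OR Postcode = ''

-- Screen: tax amount outside permitted range
SELECT        PurchaseOrderId
FROM          Staging.PurchaseOrder
WHERE         TaxAmount < 0 OR TaxAmount > NetAmount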

Can Data Quality Services assist with ETL screening?

DQS is capable of auditing and checking data, as each domain within a knowledge base can have its own domain rules. In order to see how this might work as Kimball screens, I’ve set up a simple knowledge base around purchase orders. Here’s the domain management window from the DQS client:

image

Within the Delivery Charge domain, I’ve set up a domain rule called ‘Delivery Charge Negative’ in order to ensure that the delivery charge cannot be negative:

image

I’ve done something similar with the PO Date domain, setting up a rule to state that the PO date must be greater than 01/01/2012 in this case. Data run against the PO knowledge base will now be enforced against these two rules, so these are effectively my two Kimball style screens. To recap, DQS cleaning can be run either from the Data Quality Client or from SSIS.

Populating the Error Event Fact with SSIS

I’m going to be running my DQS domain rules via SSIS, as my end goal is to populate the Error Event Fact. Remember, if any DQ issues occur, I want the Error Event Fact to know about them.

In this scenario, the data from the source system has already been staged to a SQL table, which gives the following results in the SSIS source adaptor preview:

image

This data is passed to the DQS Cleansing SSIS task, with the data mapped to the DQS Knowledge Base in the following way:

image

Once the package is run, the data will be passed to DQS, which will run the data against its knowledge base, including running the domain rules. The data that is returned from DQS is as shown below. Note that the PO55 row has failed both rules, whereas the other two rows have each failed one rule.

image

Now we know exactly which rows have failed each data quality screen. The only trouble is that the data is in the wrong format for our error event fact. We want a total of 4 rows in the error event fact, as PO Id 5 has failed 2 screens, whereas the other two rows have failed 1 screen. Therefore, we carry out an unpivot before inserting into the Error Event Fact. For this blog post my ScreenId is hard coded, but in reality you would have to look up an appropriate ID for the DQS screen. The full SSIS package is as follows:

image
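
If you’d rather do that reshaping in T-SQL than with the SSIS Unpivot transformation, the equivalent would be something along these lines. Note that the DQS output column names and the staging table are illustrative, and, as above, the ScreenIds are hard coded:

-- One row per failed screen, ready for the Error Event Fact sketched earlier
INSERT INTO   dbo.ErrorEventFact (ScreenId, ErrorEventDate, SourceRowIdentifier)
SELECT        CASE FailedRule
                  WHEN 'DeliveryCharge_Status' THEN 1  -- Delivery Charge Negative screen
                  WHEN 'PODate_Status'         THEN 2  -- PO Date range screen
              END AS ScreenId,
              GETDATE() AS ErrorEventDate,
              POId AS SourceRowIdentifier
FROM          Staging.DQSCleansingOutput
UNPIVOT       (RuleStatus FOR FailedRule IN (DeliveryCharge_Status, PODate_Status)) AS u
WHERE         RuleStatus = 'Invalid'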

Data Quality Reporting

With the Error Event Fact populated, a variety of reports can be produced. In this case I’ve made a very simple Analysis Services cube and made a quick report in Excel 2013, using slicers, sparklines and also the new Timeline feature:

image

From this report I can see the screen that’s failed the most rows, see a trend of data quality and filter on specific issues.

Summary

With a bit of work, DQS is capable of producing the rows needed for the Kimball Error Event Fact, but how good is it? I see the advantages of using DQS as follows:

  • Front end user interface in order to allow data stewards to set up their own rules to screen the data – much better than holding rules in SSIS or manually in a table;
  • DQS becomes a one stop shop for all data quality, as, in addition to holding the screens as domain rules, DQS will clean your data for you, according to the valid/invalid domain values that you set up in the knowledge base.

There are a few disadvantages to it also:

  • The domain rules are not as flexible as pure SQL or SSIS, meaning that you may not be able to express all the screening rules that you had in mind;
  • DQS can't maintain all the metadata that you need to implement Kimball screening, so you’ll need to supplement it with some SQL tables to hold things like screen severity, screen type etc;
  • Even though the performance has been improved in CU1, it’s still not suitable for large amounts of data. Here’s an overview of how long you can expect the SSIS cleansing to take.

So my view is that if you’re already using DQS and data stewards like using it to manage data quality, and your data volumes are not too big, then implement the classic Kimball screening design using DQS. But there are more cons than pros unfortunately, so if you’ve not already made an investment in DQS, and you have the time to invest in your own Kimball style screening solution, then I think you can come up with something that’s a bit more flexible and scalable.

Unique Values in Master Data Services

I’ve been doing quite a few demos and Proof of Concept (POC) projects recently for MDS. Understandably there are always a raft of questions that come up during demos and POCs, one of which recently was “how do I ensure that this attribute’s value is unique?”.

MDS can easily deal with simple uniqueness enforcement, which is something that anyone who has done any MDS development will no doubt know straight away. But if the requirement is more complex, such as ensuring that the attribute is unique in conjunction with other attributes, then I don’t think it’s as well known. Therefore I thought it would be worth a quick post on how to do this.

Code Attribute

To get the simple case out of the way, each MDS entity is always created with an attribute called Code, which must be unique. Therefore if you try and create/update a member with a duplicate Code, then you get the following error message:

image

Business Rules

To enforce uniqueness on any other attribute, you have to use business rules. Therefore, if I want to ensure that the attribute called Name must be unique, then I must carry out the following:

  • Go to the business rules editor and create a new rule
  • Drag the action of “must be unique” over to the actions node
  • Drag the Name attribute over to the “Select attribute” node, meaning that the action should look as follows:
  • image
  • Once that’s done, save the rule step and then re-publish the rule.

Trying to enter a duplicate value for the Name attribute will now give the following validation error:

image

Business Rules – Multiple Attributes

Now for the harder scenario. In the sample MDS product model, there is a Country attribute and also a Product Sub Category. I want to ensure that Name is unique, but only within a Country and Sub Category combination. To do this we go to the same business rules interface, and add the same action as before. Once that is done we carry out the following additional steps:

  • Change the radio button to “In combination with the following”
  • This is the part that I don’t think is that obvious. Drag over the Country attribute and then the Product Sub Category attribute onto the “attributes” node. This should give the following:
  • image
  • As before, save the rule step, then publish the rule.

Now a duplicate name will be permitted, but only within a country and sub category combination. If we do enter a duplicate value within a given Country/Product Sub Category group, then the following validation message is shown:

image

Master Data Services SQL Server 2012 Vs 2008 R2

I’ve given quite a few MDS demos recently and one question I seem to get asked often is “What features does MDS 2012 have compared to MDS 2008 R2?”. As it’s happened quite a few times, I thought it would be a good idea to put a quick post together to explain the differences. The comparison below details all the features MDS has and compares the two versions:

Front End
  • MDS 2012: New Silverlight UI added for the Explorer area – better performance, with fewer clicks to get tasks done; Excel Add-in introduced.
  • MDS 2008 R2: HTML based ASP.NET implementation; no Excel Add-in.

Modelling
  • MDS 2012: Carry out modelling in the web UI, plus some tasks can be carried out in Excel; allows the auto generation of a code without using business rules.
  • MDS 2008 R2: Carry out modelling in the web UI only.

Hierarchies
  • MDS 2012: Three types of hierarchies; manage hierarchies in the new Silverlight UI.
  • MDS 2008 R2: Three types of hierarchies; manage hierarchies in the older UI style.

Collections
  • MDS 2012: Add members from Explicit Hierarchies or other Collections; maintain a weighting on a member by member basis.
  • MDS 2008 R2: Add members from Explicit Hierarchies or other Collections.

Staging Process
  • MDS 2012: A staging table is created per entity that you create in MDS. The table mirrors the exact structure that you see when you view the entity in the front end, making loading easier.
  • MDS 2008 R2: A generic staging table is used for all entities. This means, for example, you must load attribute changes in attribute name/value pairs.

Deployment
  • MDS 2012: Deploy the model structure from the front end; deploy the data from the new command line deployment tool.
  • MDS 2008 R2: Deploy model structure and data from the front end.

Security
  • MDS 2012: Grant permissions to models, entities, attributes and attribute groups; security cleaned up and simplified (no longer necessary to grant permissions to hierarchies).
  • MDS 2008 R2: Grant permissions to models, entities, attributes, attribute groups and hierarchies.

Business Rules
  • MDS 2012: Alter attribute values, validate data integrity, initiate workflow; some minor validation changes around using business rules to auto-generate Code values.
  • MDS 2008 R2: Alter attribute values, validate data integrity, initiate workflow.

SharePoint Integration
  • MDS 2012: Initiate a SharePoint workflow from MDS business rules; new functionality to embed the web UI inside a SharePoint page.
  • MDS 2008 R2: Initiate a SharePoint workflow from MDS business rules.

Data Quality Services (DQS) Integration
  • MDS 2012: Connect to a DQS matching policy from Excel, against MDS master data.
  • MDS 2008 R2: No DQS capability, as DQS is new in SQL Server 2012.

Version Management
  • MDS 2012: Create multiple copies of your master data and work with them independently as needed; some minor UI improvements have been added.
  • MDS 2008 R2: As SQL 2012, create multiple copies of your master data and work with them independently as needed.

Transactions (auditing)
  • MDS 2012: Transactions removed from Explorer and now only available in Version Management, so users cannot reverse their own transactions – only administrators can do this; annotations are now permanent and cannot be deleted.
  • MDS 2008 R2: Transactions viewable in two areas, Explorer and Version Management; users can reverse their own transactions; annotations can be deleted.

Metadata
  • MDS 2012: No support to view metadata – this is now a deprecated feature.
  • MDS 2008 R2: Customise the metadata for most objects in MDS; view metadata in the metadata explorer.

Installation & Configuration
  • MDS 2012: Installation from the main SQL Server setup program; minor changes to the configuration settings.
  • MDS 2008 R2: Installation from the SQL Server media, but not included in the main setup program.

For me the improvements to the performance and the addition of the Excel Add-in are the most compelling reasons to upgrade. What’s particularly great about the Excel Add-in is that data stewards can manage data with the familiarity of Excel, but the data itself will be stored in SQL Server 2012.

Matching with Master Data Services, DQS and SSIS

If you have an MDM requirement to consolidate multiple sources of Master Data together into a single golden record, then you have a few different ways to achieve this on the Microsoft platform. This blog post gives an overview of the different ways that various matching methods can be used in conjunction with Master Data Services for consolidation in Master Data Management, outlining the pros and cons of each option. In summary, the options are:

  • Data Quality Services (DQS)
  • SQL Server Integration Services (SSIS)
  • Plus Master Data Services itself has a few (not that well known) matching capabilities

Data Quality Services

A key requirement in all but the simplest MDM solutions is that the matching/consolidation must be carried out in an automated manner, with a Data Steward alerted to take action if needed (e.g. the matching engine incorrectly matches two customers, the Data Steward opens MDS and corrects this). This scenario would be hard to achieve with DQS, as it’s possible to automate the DQS cleaning, but it’s not possible to automate the DQS matching. This is something that I’ve raised connect issues about, here and here.

If your data to be matched into MDS is not coming in on a daily basis, and you therefore need to do more ad-hoc matching to produce your golden Master Data records, then DQS could be for you. The MDS Excel Add-in will give you the capability of matching data that you import into Excel with MDS members, harnessing the power of DQS. An overview of how this works is available here.

Integration Services

SSIS has been around for a long time now and, as many of you will know, contains fuzzy matching components. With the right MDS model design, it’s possible to carry out a batch-based fuzzy match between your master records and end up with a mapping between the records that exist in your source systems and your MDS golden records. The rough conceptual steps to do this are:

  1. Load the new and changed records from the source systems into a staging area.
  2. Clean and standardise your data. This is actually something that DQS cleaning can help with.
  3. Query your staging area to get the new records that you want to insert/update into Master Data Services.
  4. Now the question arises, do we have an exact or close match for these records already in MDS? While the exact matches are easy to deal with, use the SSIS Fuzzy Lookup component to establish whether there are any approximate matches.
  5. Link the source records to master records (if the match is high enough) using MDS Domain Attributes.
  6. Carry out appropriate inserts and updates into MDS using the MDS staging tables (see the sketch after this list).
  7. Ensure that a Data Steward is alerted in some way if necessary (e.g. if the match threshold is below x% confidence). This can be done with Email or MDS Notifications, for example.
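
To give a flavour of step 6, MDS 2012’s entity-based staging means the load is essentially an insert into the entity’s staging table followed by a call to its staging stored procedure. The sketch below assumes a Customer leaf entity – the source table, attribute columns and batch tag are made up, and it’s worth checking the exact table/procedure names and ImportType values that MDS generates for your entity:

-- Load matched source records into the entity-based staging table...
INSERT INTO   stg.Customer_Leaf (ImportType, ImportStatus_ID, BatchTag, Code, Name)
SELECT        0, 0, 'CRM_20130611', SourceCustomerCode, CustomerName
FROM          Staging.MatchedCustomers

-- ...then kick off the staging process for that batch
EXEC stg.udp_Customer_Leaf @VersionName = 'VERSION_1', @LogFlag = 1, @BatchTag = 'CRM_20130611'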

This process can run in batch overnight, with the Data Steward approving or rejecting the matches that SSIS has carried out the following morning. Whilst the above over-simplifies the process and technical work required, hopefully the process makes sense at a high level.

Master Data Services

Although you cannot feed MDS your source data and get it to automatically carry out matching for you, it does actually contain the raw components in order to do this. By this I mean the MDS database contains an assembly called Microsoft.MasterDataServices.DataQuality, which gives you a number of fuzzy matching T-SQL functions. These are called from the MDS front end when you carry out some filtering when viewing entities. Using them just for filtering in the front end really isn’t using the functions to their true capability, but thankfully you can use these functions in your own code.
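
To give a quick example of calling one of these functions directly, the query below scores staged customer names against existing master records using mdq.Similarity. Treat it as a sketch – the table names are invented (in practice you’d point it at your staged data and an MDS subscription view), and it’s worth double checking the function’s parameters (matching method, containment bias and minimum score hint) against the version installed in your MDS database:

-- Score staged names against master records; method 2 is Jaro-Winkler on the
-- installs I've used (0 = Levenshtein, 1 = Jaccard, 3 = longest common subsequence)
SELECT        s.CustomerName AS StagedName,
              m.CustomerName AS MasterName,
              mdq.Similarity(s.CustomerName, m.CustomerName, 2, 0.0, 0.0) AS MatchScore
FROM          Staging.Customer s
              CROSS JOIN dbo.MasterCustomers m
WHERE         mdq.Similarity(s.CustomerName, m.CustomerName, 2, 0.0, 0.0) >= 0.8
ORDER BY      MatchScore DESC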

You can use the MDS T-SQL functions in a similar way to the conceptual SSIS method outlined above, in order to match and eliminate duplicates. In addition, the MDS web API can also be used to carry out a fuzzy match, as mentioned in this forum post. Retrieving match candidates using a web service may be an attractive option if you’re trying to do real time MDM.

Conclusion

Essentially, until it’s possible to automate DQS matching, we have a choice between SSIS and the MDS matching functions. The following e-book gives a very detailed overview of the matches that both are capable of doing. The MDS T-SQL functions are more flexible than the SSIS fuzzy components as you can choose which fuzzy algorithm you want to use, but the SSIS components let you choose between Fuzzy Grouping and Fuzzy Lookup out of the box, without having to write SQL. Although I tend to find that both give very good matching results, the MDS T-SQL functions produce slightly better matches in my experience, plus give you the option of trying different algorithms to suit different scenarios.

It’s also worth mentioning that Profisee Maestro (full disclosure, we are a partner) integrates with MDS, offering its own matching algorithms. Maestro also has a front end with the ability to assist with survivorship and approval, which I think is a useful addition to MDS. Speaking of survivorship and approval, there are two options in MDS out of the box. The new Master Data Manager web front-end is much improved, but potentially the MDS Excel Add-In allows a bit more flexibility for survivorship carried out by a Data Steward, just due to its natural ability for filling/copying/pasting/filtering.

So overall, due to the various improvements, Master Data Services is now capable of tackling more complex MDM scenarios than in the 2008 R2 version.

Analysis Services Member Properties and Excel

When working with users who are browsing a cube using Excel 2010, a common requirement that I find is that the users want to create tabular or grid based reports. E.g. “I have a list of currencies, I want to display the currency code in an adjacent column”. Or “Here are my account codes, I want to also see the account type”.

The common way that I see users attempt this is to stack attribute hierarchies on top of one another. For example, using Adventure Works, to get the currency names and codes together, I can stack the two attribute hierarchies:

image

This produces the following report:

image

Not exactly what we want – ideally we want the currency codes in separate columns, on the same line as the name.

An Alternative Approach

One well known way to achieve this layout is to right click on the relevant member and display its member properties, whereas another is to set the layout to Tabular Form. I tend to prefer the member properties route as I think it produces slightly better reports, so I’ll be covering a simple member properties design tip shortly. First of all, here’s the sort of report that I’m able to create using member properties:

image

I’m able to do this because the Adventure Works Source Currency Code Attribute Hierarchy has the Source Currency (names) as a member property. Therefore, I get the following option when I right click on the currency code:

image

Displaying member properties will only work if the attribute relationships are set up correctly. The reason that I’m able to see the member property ‘Source Currency’ is because of the following attribute relationship:

image

However, if we just display the currency name, using the Source Currency attribute hierarchy and then try and right click to see the currency code, we’ll unfortunately see nothing in the ‘Show Properties in Report’ submenu. This is because the currency name is a property of the code, rather than the other way around. I’d argue that for a user it’s intuitive to get the name from the code, or vice versa, as they sit at the same level. With a few simple changes we can achieve this, essentially by adding an extra attribute to the dimension…

Within the dimension editor in BIDS, the first step is to add the new attribute. In this case it should be based on the Currency Code column that is, of course, already in use. That doesn’t matter – we want to use it twice. Therefore, I’ve created a new attribute called Currency Code and have set AttributeHierarchyEnabled = False, as there’s no need for anyone to browse this attribute. Now onto the all-important attribute relationships. The attribute hierarchy that contains the names is called Source Currency. As it’s not possible to get the currency codes by right clicking on this attribute hierarchy, Source Currency is the attribute that needs to be related to the new Currency Code attribute:

image

Finally, the end result for the user is that they now get the ability to easily get access to the currency codes by right clicking on the names:

image

Power View in SQL Server 2012 RTM

Power View has been given a couple of small changes from RC0 to the RTM release:

Firstly, you can now choose the colour of the various visualisations via the Themes section, which you’ll find on the new Styles ribbon tab:

image

This will allow you to change the colour for a slicer, table or card, whereas for a chart it will set the palette that gets used for the chart series. Surprisingly it seems to set the theme globally, rather than allowing you to set the colour/palette for a single object:

image

The other change that I’ve noticed is that the Card visualisation now has a Style option on the ribbon:

image

This allows the cards to be displayed without the normal coloured background, with a much larger font. So potentially useful for highlighting key numeric facts:

image

I do really like Power View, in particular the clean look & feel, fast performance, export to PowerPoint and animations to name a few. However, even though RC0 added a lot of new features, I think it’s still missing some of the functionality offered by other reporting tools. I know that it’s marketed as a data visualisation tool rather than a reporting tool, but those lines will become a bit blurred for users. This is where Connect comes in…

One connect suggestion that I’ve created is to allow drill down in Power View charts. If you’ve used Power View to perhaps spot a trend or spike in the data, then it makes sense for you to manipulate the view to find the root cause, which I think drill down would help with. Other suggestions that I think would be useful are heat maps and allowing Power View to access multidimensional models. Please vote on Connect if you agree!

DQS Matching Vs SSIS Fuzzy Grouping and Lookup

When the term fuzzy matching comes up in the Microsoft world, it’s natural for anyone who’s used the BI stack to think of SQL Server Integration Services (SSIS), due to the fact that it has both Fuzzy Grouping and Fuzzy Lookup components. Therefore, when using the matching in Data Quality Services (DQS) the other day, I thought it might be worth contrasting the matching capabilities in the two products.

Overview

The SSIS Fuzzy Grouping transformation is typically used on a set of data containing duplicates, as it will operate across all of the rows, grouping similar rows together based on fuzzy comparisons, with a view to eliminating duplicates. In contrast, the Fuzzy Lookup transformation takes a value in the SSIS pipeline and uses fuzzy matching to match the input value against a set of clean reference data in a database.

The matching in DQS takes a set of data and groups the data into clusters, based on a number of rules and other factors, again with a view to removing duplicates. It’s therefore only directly comparable against the SSIS Fuzzy Grouping, rather than the Fuzzy Lookup.

Test Scenario

For the purposes of this blog post, I’ve got a sample query from AdventureWorksDW, taking data from the DimCustomer table. I’ve unioned all that data with a few sample records that contain typos. E.g. I’ve got 'Pamela Fernndez' misspelt for one customer and '4610 Pinto Rd' instead of ‘Road’ for another. I’m going to attempt to de-duplicate data in SSIS first and then do the same thing in DQS to see if there is any difference.

SSIS Fuzzy Grouping

As shown below, I’ve just got a very simple OLE DB source passing some data into a Fuzzy Grouping, then I’m going to examine the data via a data viewer:

Control flow

The columns that I’m passing into the Fuzzy Grouping for the Customers are Full Name, Address Line 1, Gender and City just for the purposes of this blog post. I’ve configured the Fuzzy Grouping transform to carry out Fuzzy Matching on Address Line 1 and Full Name, as these are the most inconsistent (intentionally) across my made up data. I’ve also dropped the threshold right down to 0.40 to see a full range of matches that we may or may not get:

image

I’ve filtered the data for customers with a surname of Fernandez or a first name of Pamela, which includes two made up records. When I run the package, the data viewer produces the following results:

Data Viewer

The _score column has a high score of 0.8245935, which is based on the similarity scores of FullName and AddressLine1. There are 3 records included in the cluster, including someone called ‘P Fernandez’ who is male. Although that’s been matched, it’s only because I decreased the threshold, plus it can easily be excluded by adding an exact match for Gender, as shown below:

Gender

DQS Matching

So overall SSIS copes well with the test scenario – how does DQS compare?

After creating a knowledge base, creating domains and carrying out knowledge discovery, it’s possible to use DQS to create a Matching Policy. The Matching Policy involves building a number of rules that determine how the data matches. The starting point in creating a matching rule is determining which domains you want to match on and whether they should be matched using the fuzzy algorithms (similar) or matched exactly:

Rule editor

So not too dissimilar to SSIS. You then choose to give each domain a percentage weight, which must add up to 100%. For example, if you give Full Name a weight of 30% and its match score comes out at 61%, and Address Line 1 a weight of 70% with a match score of 45%, then the overall match score for the row will be (0.3 * 0.61) + (0.7 * 0.45) = 49.8%.

Starting off with a 50% weight on the two domains, I get the following results when operating on the same data:

dqs clusters

As with SSIS, it hasn’t got it quite right, but using the Gender domain will help. What I quite like in DQS is that I can flag a domain as being a pre-requisite:

image

Flagging a domain as a pre-requisite means that the domain will be used to eliminate matches, but will not contribute to the matching score. If I wanted it to contribute to the matching score I would just have to uncheck pre-requisite and give it a weight. Running the matching again gives the following results:

dqs results 2

So the duplicate is correctly mapped with a score of 75%. Drilling down on how the match was achieved is quite nice in DQS also:

image

This brings me to an interesting observation about the matching. I originally had my dummy record with a value of “3 Pierr Demoulainn” for the address. So in addition to having a typo in the second name and missing the word “Rue”, my other test also missed the second e out of the word “Pierre”. If I run my test data with these values, I get no matches at all. The lack of the second e in “Pierr” seems to have made all the difference. I can get it to match if I give a lot of weighting to Full Name. If I do this, we see the following results:

drill down 2

Address Line 1 has been given a score of 0%. Making the same change to my test data and then running in SSIS gives different results. The similarity for the address drops from 0.81 to 0.75, but it certainly doesn’t drop to zero. Although it will depend massively on your own data, the matching in DQS seems a bit more stringent. This can be mitigated by using more domains for the matching, rather than the simple 2 or 3 domains that I’ve used. To back that statement up, using a composite domain on both Address Line 1 and City does yield a strong match (71%) for the same data.

Conclusion

The rule based matching in DQS gives a lot of flexibility on how to weight matching across different domains. Coupled with the knowledge base components that interact with the matching, such as cleaning and Term-Based Relations, DQS has more features specifically aimed at matching than SSIS. It all depends on what sort of data you’re working with, but in theory, as you maintain the knowledge base over time, DQS should give you strong matches based on the knowledge you’ve given it.

However, there are some drawbacks – it’s a shame that the matching can’t be automated (please vote for my connect suggestion if you agree). SSIS by its very nature can be automated, meaning the matching in SSIS will be more suitable for a lot of implementations. Secondly, I’d like to be able to match external records against correct values already in the knowledge base, rather than just doing a fuzzy grouping on the external records. I’m a bit surprised DQS can’t do a fuzzy lookup, maybe this will change in V2. Finally, as I’ve shown above, some of the matching results are a little unexpected, whereas in SSIS they were good pretty much straight away. To be fair to DQS this could probably be mitigated with a better choice of matching fields and also running cleaning against a fully populated knowledge base beforehand.
