Jeremy Kashel's Blog

DQS Reference Data Services with SSIS

Rather than just using the local domain values in its internal knowledge base, Data Quality Services (DQS) can instead clean data by using external reference data. This feature is called Reference Data Services, and in this blog post I’m going to show an example of cleaning SSIS data by connecting to the Azure Data Market.

Azure Data Market

After you’ve signed up for Azure Data Market, you’ll need to subscribe to datasets that are compatible with DQS. Unfortunately they generally cost money, but I have found one (namely DMTI Spatial) that allows you to sign up for a free trial. I do think this is a good idea and I’d like to see free trials offered on Data Market by other providers. I don’t happen to have a huge data quality problem with Canadian addresses, but it does allow me to show how Reference Data Services works with SSIS.

Domain Management

Once we’ve signed up for some data, the next step is to build the knowledge base in the Data Quality Client. In my case, I’m going to set up a simple set of address domains:

image

For a detailed guide on how to set up domains, see the following article.

Reference Data Services

In order to create the link to the external data, settings need to be changed within Domain Management. Having selected the composite address domain, the first step is to click on the Reference Data tab, then click on the browse button, which will cause the following window to appear:

image

As there is only one dataset available, I’m going to pick it and then map each of the domains to the schema given by the reference dataset:

image

After clicking OK, there are some optional settings to adjust in order to influence auto correction, the number of candidates and the minimum confidence:

image

SSIS and DQS

Now it’s time to look at how the combination of DQS and the external data can be used within SSIS. I’ve put together a simple CSV containing a few commercial Canadian addresses to clean, complete with a few spelling mistakes:

image

Within SSIS, the first step is to create a data flow that pulls data from the CSV. Then, within the data flow, we need to connect the CSV source to the DQS Cleansing Transform:

image

The key part within SSIS is to edit the DQS Cleansing transform, as we need to first select a knowledge base, then match the incoming columns to the DQS domains:

image

Interestingly, on the Advanced tab of the transform there are checkboxes to return additional columns from the reference data provider, which is a provider-dependent feature that is not actually available within the Data Quality Client:

image

Now we’re ready to go. When running the SSIS package with a data viewer, the following results are output:

image

Most of our addresses have been looked up and corrected, e.g. “4370 Lorimer” to “4370 Lorimer Road” and “Gerard St” to “Gerrard St”. The confidence of the cleaning is included in the output, as well as the aforementioned ‘extra’ appended data. In this case, as we’re using the DMTI provider, the type of property and latitude/longitude are amongst the columns returned.

As a final point, I noticed that the initial results from the external provider were changing my address values of “Road” and “Street” to “Rd” and “St”. As you can see from my Excel screenshot earlier, my input addresses were a mix of “Rd” and “Road”, but I wanted to standardise on “Road” and “Street”. This is exactly what Term-Based Relations are for in DQS, but I didn’t expect them to work with external reference data, as the data returned by the external provider is in theory correct. Surprisingly, they do work with Reference Data Services, meaning it’s possible to gain a bit of extra control over the output.

Master Data Services Regular Expression Business Rules

As those of you who use MDS will know, the Code attribute within a Master Data Services entity is mandatory, as it’s used to uniquely identify the members within the entity. Not only is it mandatory, but unfortunately you cannot change its underlying SQL Server data type – it always gets created as an nvarchar. This is useful in some cases, but what do you do if you want to ensure that the Code can only be numeric? This post outlines two options that may help:

Master Data Services Security

Although it will vary across different MDS implementations, the Code is quite often used as an auto-incrementing integer, which can be achieved easily by changing the entity properties. If all you want is to auto-generate the Code and prevent users from changing it, then you can just set the Code to be read only, as shown below:

image

Business Rules

The security approach will work fine if users aren’t allowed to update the Code. On the other hand, if the Code can be updated, then a different solution is needed.

Here an MDS Validation business rule can help, as one of the validation actions is called “must contain the pattern”:

image

By dragging this business rule onto the Action node, it’s now possible to write the rule action. If we choose the Code as the attribute to validate and then enter ^[0-9]+$ as the regex pattern, then the rule looks as follows:

image

Finally, after publishing the new Business Rule, members with a non-numeric Code will now be invalid, as shown with the sample MDS Customer model:

image
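
If you want to double-check for existing offenders directly in SQL, before or after publishing the rule, a simple LIKE pattern does an equivalent job. This is just a sketch – mdm.Customer is a hypothetical subscription view name, so substitute whichever view or staging table holds your members:

SELECT      Code,
            Name
FROM        mdm.Customer              -- hypothetical subscription view
WHERE       Code LIKE '%[^0-9]%'      -- any character outside 0-9
    OR      Code = '';                -- an empty Code is also non-numeric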

Analytics Platform System Polybase Example

To me one of the most exciting parts of the Microsoft Analytics Platform System (APS) is Polybase, which in a nutshell allows you to access data residing in Hadoop or Windows Azure Blob Storage, all via T-SQL in PDW. What this means is that data can be transparently queried by a user or developer, in real time, regardless of whether the data lives in PDW or Hadoop/Azure. James Roland-Jones gives a thorough overview of all things Polybase here.

What I’m going to do in this post is to show an example of how existing data within Hadoop can be combined with data that resides in an APS PDW region.

Polybase Example - Setup

There are two key tasks to complete before we’re able to start querying data. We need to setup a data source, so that Polybase knows where to get the data from, plus we need to describe the file format of the external file that we’re going to read. The data source here specifies that we’re going to use the Hadoop nodes that are co-located with the PDW nodes, but this could be a non-appliance Hadoop cluster:

CREATE EXTERNAL DATA SOURCE HadoopRegion_DataSource
WITH (
        TYPE = HADOOP,
        LOCATION = 'hdfs://H12345-C-HHN01',
        --Optionally specify a job tracker location to enable predicate
        --pushdown to Hadoop
        JOB_TRACKER_LOCATION = 'H12345-C-HHN01:50300'
    );

The next task is to set up a file format. In this case we’re defining that the file is pipe-delimited, although we can use Polybase with other formats (e.g. RCFile):

CREATE EXTERNAL FILE FORMAT HadoopRegion_DelimitedPipe_NoCompression
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|'
    )
);
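
As mentioned above, delimited text isn’t the only option. For reference, an RCFile-based format can be declared along the following lines – treat this as an untested sketch and check the SerDe name against your own Hadoop distribution:

CREATE EXTERNAL FILE FORMAT HadoopRegion_RcFile
WITH (
    FORMAT_TYPE = RCFILE,
    SERDE_METHOD = 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
);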

Accessing External Data

Before we access data in Hadoop, I want to show the file for this example. In this case I’ve got a customer dimension within PDW, but I have customer survey data within Hadoop that I wish to combine with my data in the warehouse. The survey data has been uploaded in this case to the Hadoop region within APS, via the portal:

image

The following shows a sample of the Hadoop file. Note that it’s pipe-delimited, with columns for Date, Survey Id, Product, Customer Id and Survey Score:

image

Now I want to access the Hadoop data and combine it with other data in the warehouse. To do this we need to use the CREATE EXTERNAL TABLE command.

CREATE EXTERNAL TABLE [dbo].[HDFS_CustomerSurvey]
(
    DateId              int NOT NULL,
    SurveyResponseId    int NOT NULL,
    ProductCategoryName varchar(100) NULL,
    CustomerId          varchar(50) NULL,
    SurveyResult        int NULL
)
WITH (
    LOCATION = '/user/data/survey/',
    DATA_SOURCE = HadoopRegion_DataSource,
    FILE_FORMAT = HadoopRegion_DelimitedPipe_NoCompression
);

This results in an external table being available within the PDW region, as shown below:

image

Now if we want to query the Hadoop data it’s just a simple case of T-SQL. Here I’m joining Hadoop survey data to the customer dimension:

SELECT      CS.CustomerId,
            C.FirstName,
            C.LastName,
            C.EmailAddress, 
            C.Gender,
            CS.DateId AS SurveyDate,
            CS.ProductCategoryName AS SurveyProduct,
            CS.SurveyResult
FROM        dbo.HDFS_CustomerSurvey CS
INNER JOIN  dbo.DimCustomer C ON C.CustomerAlternateKey = CS.CustomerId

Which gives the following results:

image

So that’s it, just a simple T-SQL query in the end. Polybase has taken away the complexity and allowed us to integrate different data sources using a widely used standard query language.

Prototyping Analysis Services Cubes

After running a modelling workshop with end users, my next activity on a project is generally to produce a prototype model with some test data, in order to check with users that the planned star schema design will work. My tool of choice for this job tends to be Power Pivot, but I was presented with a slightly different solution when working with a client recently. This got me thinking it would be worth contrasting the two approaches….

Power Pivot

First of all, let’s start off with Power Pivot. The goal here is to quickly produce a prototype to verify the star schema design. I tend to just request that a business user brings along an Excel workbook with samples of each dimension and fact pre-extracted in Excel, which can then be loaded into Power Pivot. I find that this saves time, although as an alternative Power Pivot could of course extract directly from SQL Server, Oracle etc if needed.

As an example, we can quickly walk through creating a very simple model to create a star schema:

  • I’ve copied the data that represents my single fact table and related dimensions into Excel. In this case it’s to model a simple Sales Order Transactions fact, therefore my Excel workbook has dimensions such as Product, Customer and Sales Territory, as well as the data that represents the fact table:

image

  • After completing and closing the workbook, we now create a new workbook and open the Power Pivot window.
  • Within the Power Pivot window, the next step is to import the data from the Excel workbook created in step 1. This can be done by choosing From Other Sources->Excel workbook and then browsing to the file.
  • Power Pivot will now present us with all of the sheets that it finds in the workbook, so we select the sheets that represent the dimensions and facts:

image

  • The next step is to set up relationships between the various tables. Once this is done, we have a simple prototype data model, as shown below:

image

After a bit of tidying up (e.g. creating some hierarchies and removing unwanted columns), we can now connect to a Pivot Table in Excel that will help verify the star schema:

image

The key point here is that we have rapidly built a prototype, without the need for any ETL. This means we can quickly cover the design issues with the users, but this time with their actual data, rather than just on a whiteboard.

Analysis Services

The alternative approach that was presented to me recently is to develop the Analysis Services cube/Tabular model before carrying out the ETL. Essentially this means that the users connect to the cube as a production system, unaware that under the hood the ETL is not yet complete.

How is this achieved? Essentially by putting logic in the SQL view layer. It’s a well-known Analysis Services best practice to bind your SSAS objects (e.g. dimensions, measure groups) to SQL Server views, rather than using objects within the DSV. Rather than each view pulling its data from a complete dimension or fact table, instead each view would pull its data from a staging area or a copy of the source database. The idea being that, over time, the contents of each view would be updated to point to the actual dimensions and facts, once they are built.

Therefore a normal view (in this case for a product dimension) that feeds Analysis Services might look like:

SELECT          ProductKey,
                ProductName,
                ProductCategory,
                ProductSubCategory
FROM            Dim.Product

With the ‘No ETL’ approach, the view looks very different, as any transformation and cleaning will have to be carried out within the view:

SELECT          ROW_NUMBER() OVER(ORDER BY PROD.ProductID) AS ProductKey,
                PROD.Name AS ProductName,
                CAST(ISNULL(SUB.Name, 'Unknown') AS VARCHAR(100)) AS ProductSubCategory,
                CAST(ISNULL(CAT.Name, 'Unknown') AS VARCHAR(100)) AS ProductCategory
FROM            [$(AdventureWorks)].Production.Product PROD
LEFT JOIN       [$(AdventureWorks)].Production.ProductSubcategory SUB ON SUB.ProductSubcategoryID = PROD.ProductSubcategoryID
LEFT JOIN       [$(AdventureWorks)].Production.ProductCategory CAT ON CAT.ProductCategoryID = SUB.ProductCategoryID

This therefore incurs some technical debt, as the cube gets built before the ETL or even the physical dimension table. But the idea is that you can get the users using the cube, on production even, then gradually replace the views with proper ETL and dimension tables once you have some more feedback.

Conclusion

The Power Pivot method is tried and tested for me many times – I find that it’s very quick to get up and running. The Analysis Services approach is not quite as quick to get up and running, but offers the benefit that the cube can continually evolve in a fairly agile manner. It’s early days for me using this approach, but a disadvantage is that the views can get quite complex, impacting performance depending on the data volumes.

Both methods probably have their place. If I wanted to validate my design, I think I would continue to produce a quick Power Pivot model. On the other hand, if quick delivery to production is a must, and the complexity/data volumes are not an issue, then I would use the view approach.

Analysis Services Dimension Processing Errors

I’ve seen a few data quality issues recently in some client data warehouses/marts where I’ve been asked to carry out some Analysis Services work. These data quality issues have often led to the classic SSAS dimension processing error, namely:

“Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'dbo_DimProduct', Column: 'SubCategory', Value: 'Other'. The attribute is 'Sub Category'”

Normally this error is easy to track down, but there are a few scenarios that I’ve encountered where SSAS attribute properties actually make it harder for you to troubleshoot.

Scenario 1 – Trimming

When hitting the above error, the first thing to do would be to run a query to see where the duplicate attribute is. In my example, I’ve just produced a very simple dimension that’s loosely based on Adventure Works – here are the attribute relationships:

image

Therefore the first place to start would be to write a query that checks if we have any Sub Categories that are duplicated across Categories, such as:

SELECT        SubCategory
FROM          DimProduct
GROUP BY      SubCategory
HAVING        COUNT(DISTINCT Category) > 1

In my case, this actually yields no results. So what’s happening here? We can’t find any duplicates in SQL, but SSAS thinks there are duplicates. Well, Analysis Services dimension attributes contain a property called Trimming, which, by default, will remove spaces at the end of an attribute key or name. Needless to say, a dimension shouldn’t actually contain leading or trailing spaces, but in this specific scenario, the ETL wasn’t as robust as it could be. In the actual example I encountered, there was a tab at the end of one of the attribute names. To visualise that in the DimProduct example that I’ve created, the highlighted attribute has a tab at the end:

image
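
A query along the following lines will flag such rows before they ever reach Analysis Services. It simply compares the stored length of each value with its length once tabs are converted to spaces and the result is trimmed – a sketch based on my example DimProduct table:

SELECT      SubCategory,
            Category
FROM        DimProduct
-- Length changes once tabs are treated as spaces and leading/trailing whitespace is trimmed
WHERE       DATALENGTH(SubCategory)
            <> DATALENGTH(LTRIM(RTRIM(REPLACE(SubCategory, CHAR(9), ' '))));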

Scenario 2 – NullProcessing

The next scenario I want to highlight is the same error, but a different dimension. Again I’m going to create a simple example to illustrate what actually happened. This time I’ve got a customer dimension table, as follows:

image

The error is: A duplicate attribute key has been found when processing: Table: 'dbo_DimCustomer', Column: 'CustomerCountryKey', Value: '0'. The attribute is 'Customer Country'

A quick query on the above data will reveal that a CustomerCountryKey of 0 doesn’t actually exist twice for the same country, so what’s happened here? Again it’s due to a dimension property, this time on the attribute key, called NullProcessing. As its default is Automatic, Nulls will get converted to zero during dimension processing. As we already have a member with a CustomerCountryKey of 0, we get an error.
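
To confirm the clash in SQL, it’s enough to look for NULL keys sitting alongside a genuine zero key – if both appear, Automatic NullProcessing will produce the duplicate described above. Again, just a sketch using my example table:

SELECT      CustomerCountryKey,
            COUNT(*) AS MemberCount
FROM        dbo.DimCustomer
WHERE       CustomerCountryKey IS NULL      -- will be converted to 0 during processing
    OR      CustomerCountryKey = 0          -- the existing, genuine 0 key
GROUP BY    CustomerCountryKey;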

ETL and Data Quality

These sorts of issues highlight just why handling data quality during the ETL is so important! For the first scenario, some basic trimming/cleaning can fix the issue before it hits Analysis Services. As for the last issue, this is partly due to data warehouse/mart design. Personally I would never allow a dimension attribute to be null, partly because you shouldn’t present null attributes to a user, but also due to the lack of control that can result, as shown above. Therefore a bit of design work up front to ensure that you have a robust data model is essential. Interestingly, the properties that I’ve mentioned don’t exist for tables in Analysis Services Tabular. While I think there are a few SSAS MD features that need to make it over into the next version of Tabular, I can personally live without the two properties outlined above.

Tabular Analysis Services Perspectives and Measure Groups

I’ve always liked how Excel and other client tools deal with measure groups in a multi dimensional SSAS cube. Once you connect to the cube, you get a drop down in the Pivot Table fields pane that lets you choose which measure group you want. From there, you get a filtered list of measures and dimensions. For example, if I connect to the Adventure Works 2012 sample cube, I get the following:

image

By picking Internet Orders, I will see only the measures and dimensions that relate to Internet Orders, which provides me with a good way of navigating a large cube.

Tabular

Unfortunately, if I connect to a Tabular model (e.g. the sample Adventure Works 2012 model that I got from CodePlex) and click the same drop down, I will see every table in the entire model, whether the table is a dimension, a fact or something else. For example, for a tabular model that contains 3 fact tables, but 15 tables in total, I see all 15 tables:

image

This isn’t ideal in my opinion, as I’d much rather that users had the ability to quickly jump to an area of interest, which measure groups achieve fairly well. As I have 3 fact tables, I would expect to see those 3 fact tables/measure groups in the drop down.

I did accept this as a quirk of tabular until I stumbled across something recently. If you connect to a tabular perspective, then Excel behaves exactly as Multi Dimensional does, i.e. it shows you only measure groups in the drop down. To illustrate I’ve added a new perspective to the model that contains all tables and I’ve called this perspective ‘Adventure Works’:

image

Now I can connect to this new perspective via Excel. Remember the perspective contains all my tables, so it shouldn’t be any different from connecting to the model itself:

image

The perspective does give a different result though – now only the actual measure groups are displayed to me in the drop down, which is much more user friendly:

image

Summary

This is completely different to the way that perspectives work in multi dimensional. When making MD cubes, I wouldn’t always need perspectives, just because the measure groups provided a ‘natural’ way for users to pick the subset of the cube that was of interest to them. Now with tabular, it seems that using a perspective will actually improve the user experience.

MDS Derived Hierarchy Permissions

The MDS Hierarchy Member permissions have been given a complete revamp in SQL Server 2012. As explained here, the hierarchy member permissions are purely optional and allow you to limit the access to a specific set of members.

Even though the actual member permissions are optional, you may simply need to allow a user to browse and/or update a Derived Hierarchy. This is also something that’s been affected by the aforementioned change, so I want to explore here how we would grant users access to a Derived Hierarchy.

In 2008 R2, on the model tab, you got a Derived Hierarchies node, as shown in the image below:

image

If we contrast that to SQL Server 2012 (albeit with a different model), we don’t have a Derived Hierarchies node:

image

Therefore, to access the actual hierarchy, according to this article, you need to give the user Update permission on the model. That’s not the only way to do it, but certainly working only at the entity level, or the member level, is not good enough. To explore, let’s look at an example. I have the sample Customer model installed and I want to give a test user access to the following CustomerType hierarchy, which is just made up of the CustomerType and Customer entities:

image

Going to the User and Group Permissions, just a quick confirmation that assigning Update permission to the Customer and CustomerType entities is not good enough. As shown below, the user does not have access to any of the Derived Hierarchies that I wanted:

image

Upping the permissions level a bit by granting read only permission to the model does the trick – the test user can now see the Derived Hierarchy and can also add members. This is thanks to 1) a Read Only permission at the Model level and 2) Update permission on the two entities that make up the Derived Hierarchy:

image

However, the simplified security comes with a slight drawback, which is that, to get access to the Derived Hierarchy, we’ve now given the user Read Only access to all entities in the model:

image

You may not want to grant a user access to all entities, but thankfully it can be avoided by granting an explicit Deny at the entity level. For example, as shown below I’ve set an explicit Deny for the AddressType entity:

image

This will ensure that the user now doesn’t see the AddressType entity on the Entities menu. Now that the user can see the hierarchy, what about restricting access to the members? That’s a bigger topic, so I don’t want to cover it here, but I will just provide a useful tip. When you’re testing your security, don’t forget that member permissions are not applied immediately! The following article explains how to immediately apply the member permissions, which is something that I’d recommend carrying out whilst testing.

MDX With Microsoft SQL Server 2012 Analysis Services Cookbook Book Review

Packt have just released a second version of their ‘MDX Cookbook’, titled ‘MDX With Microsoft SQL Server 2012 Analysis Services’, authored by Sherry Li and Tomislav Piasevoli. As I’m one of the technical reviewers of this book, it’s put me in a good position to be able to put together a review of the book in this blog post.

The new SQL Server 2012 version of the book is similar in terms of content and structure to the original, presenting a number of solutions to common MDX problems in a ‘recipe’ format.

In the first two chapters, the book covers basic MDX techniques, such as the layout of a basic query and a variety of techniques for working with sets. The pace of the book warms up nicely here, starting with a few basic concepts, before gradually getting more complex.

After a useful chapter on working with time in MDX, there are some excellent chapters that give what should be useful recipes on common reporting techniques, such as top/bottom members, ranking and averages.

As the book progresses, there are some more complex chapters that cover areas such as linear regression, complex sorts and recursion – to name a few. Whether the recipes are simple or complex, they work well, as each recipe is packed with step-by-step instructions and screenshots, all based on the SQL Server 2012 Adventure Works cube.

image

Although it’s also in the first edition, I particularly like the chapter ‘When MDX is Not Enough’, which talks about how modifying the cube design can assist you when writing some MDX queries. Essentially the chapter promotes evaluating if you can modify the cube design in some way, rather than writing very complex MDX, which is something that I completely agree with.

The book differs slightly from the 2008 R2 version in that a few more of the basics (e.g. basic MDX query structure) are covered in the introductory chapters. This seems to work well, meaning someone relatively new to MDX can pick up some simple concepts, whereas someone who is more experienced in MDX will also get a massive amount out of the book.

So overall this comes highly recommended for your bookshelf or Kindle…

SSIS Fuzzy Lookup Transformation Tokens

After doing my recent blog post on the Excel Fuzzy Lookup Add-In Vs SSIS, it got me thinking about the capabilities of the SSIS Fuzzy Lookup Transformation.

Although I’m aware that the Fuzzy Lookup is a token-based solution, I’ve only ever really tuned it by altering the similarity thresholds on both the columns and also on the task itself. What you can also do is alter how the transform deals with tokens. Therefore, I thought it would be worth a quick post to show how the tokens help produce matches.

Tokens

The Fuzzy Lookup works by splitting up strings into several different components, known as tokens. For example, the Adventure Works reseller “Cross-Country Riding Supplies” might be split up into the tokens “Cross”, “Country”, “Riding” and “Supplies”. This is key to the matching process, as the fuzzy algorithms will attempt a match based on the commonality between tokens in the input and reference datasets.

SSIS Fuzzy Lookup

The advanced tab of the fuzzy lookup will show the delimiters used to create the tokens. As you can see, a hyphen is included in the list, meaning words separated by a hyphen get interpreted as separate tokens:

image

On the “Reference Table” tab of the transformation, there is the option to store a new index, which means that SQL Server will create a table in the data source to hold the tokens found in the reference table:

image

In my made up dbo.Retailers table, I’ve got two retailers – “Nearest Bike Store” and “Gears+Chain Supply”. After running the SSIS package once, the dbo.IndexTest table will be created, which is where the reference tokens get stored in order to help with matching. As it’s just stored in SQL Server, we can select from it. Before we do that, I should just say that I’m only passing this fuzzy lookup one input string, which is a misspelt company called “Bikes Chaiin Supply”. If I run the package, I get the following matches and scores:

image

This has a fairly low similarity and confidence, so we can probably do better. Going back to the index, if we select from dbo.IndexTest, we get the following:

image
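
For reference, the query behind that screenshot is nothing more than a straight select against the index table (its exact column layout is managed by the Fuzzy Lookup itself, so treat it as read-only):

SELECT  *
FROM    dbo.IndexTest;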

The index table also contains substrings found in the reference table, known as q-grams, which are designed to assist with matches that contain errors. What’s interesting about these results is that “gears+chain” has been entered on its own, meaning that the word “chain” has not been interpreted as a token in this case. The solution, in this case, is to alter the token delimiters to include “+”:

image

If we re-run the package and then look again at the index table, we now see that the word “chain” has been successfully interpreted as a token. In addition we also have a few variants on the word chain, such as “chai” and “hain”:

image

After re-running the matching process with our new delimiters, we can now look at the results of the matching process. If we take a look at the data viewer output, we can see that this has now resulted in a better similarity and confidence for the first row:

image

The delimiters work across all columns for the whole fuzzy lookup,  so that’s something to test when changing the delimiters – it may not be beneficial in all situations. However the new delimiter has clearly worked very well in this example, so it’s very much worth considering if you’re not quite getting the level of matches that you had hoped for.

Debug MDX with Generate

On occasion you may find that your MDX queries don’t quite give you the result that you expect, which sometimes may be due to the way in which a certain bit of MDX has changed the context of part of your query.

This can sometimes be tough to figure out, especially with functions like Sum() or Filter(), whereby we are looping over a set and then only returning the end product. For example, take the following query, where I’m trying to find out the count of UK male customers who have bought Road Bikes in both the current and the previous year:

WITH MEMBER [Measures].[Repeat Customer Count] AS
SUM(
    [Customer].[Customer].[Customer].Members,
    IIF([Measures].[Internet Sales Amount] <> 0 AND
    ([Measures].[Internet Sales Amount], [Date].[Calendar Year].PrevMember) <> 0, 1, NULL)
    )

SELECT      {[Measures].[Repeat Customer Count]} ON 0,
            [Date].[Calendar Year].[Calendar Year] ON 1
FROM        [Adventure Works]
            --Road Bikes, UK, Male
WHERE       ([Product].[Subcategory].&[2], [Sales Territory].[Sales Territory Country].&[United Kingdom],
            [Customer].[Gender].&[M])

It returns the following result set, which I know to be incorrect:

image

Unlike debugging either of the axes, where of course we can see what gets returned, debugging examples such as the above might be difficult. The MDX Generate function can help here, as it works in a similar way to Sum(), by iterating over a set and then evaluating an argument once per iteration. The idea is that Generate will build us a new set, by using the looping context of the first set that you pass it. In this case, however, we can use it to return us a concatenated string based on our input set.

What I’ve done here is to use the CurrentMember of attributes that I think might be causing problems (Customer, Gender) and then just print them out as a string in order to see if the attributes have the context that I expect. Therefore I’ve replaced the ‘1’ in my original query with the string of:

" | " + [Customer].[Customer].CurrentMember.MEMBER_CAPTION + ", " 
    + [Customer].[Gender].CurrentMember.MEMBER_CAPTION

The full query and results then become:

WITH MEMBER [Measures].[Repeat Customer Count] AS
GENERATE(
        [Customer].[Customer].[Customer].Members,
        IIF([Measures].[Internet Sales Amount] <> 0 AND
        ([Measures].[Internet Sales Amount], [Date].[Calendar Year].PrevMember) <> 0, 
        " | " + [Customer].[Customer].CurrentMember.MEMBER_CAPTION + ", "
        + [Customer].[Gender].CurrentMember.MEMBER_CAPTION, NULL)
        )

SELECT        {[Measures].[Repeat Customer Count]} ON 0,
            [Date].[Calendar Year].[Calendar Year] ON 1
FROM        [Adventure Works]
            --Road Bikes, UK, Male
WHERE        ([Product].[Subcategory].&[2], [Sales Territory].[Sales Territory Country].&[United Kingdom],
            [Customer].[Gender].&[M])
image

I can spot straight away that female customers have been returned in 2007, which is due to the fact that sets specified inside the WITH MEMBER section are not evaluated in the context of the slicer. In this case the problem can be solved by using the EXISTING keyword inside the Sum, so that the customers get sliced by the gender attribute:

SUM(
    EXISTING [Customer].[Customer].[Customer].Members,
    IIF([Measures].[Internet Sales Amount] <> 0 AND
    ([Measures].[Internet Sales Amount], [Date].[Calendar Year].PrevMember) <> 0, 1, NULL)
    )

This won’t be applicable in all MDX debugging scenarios, but hopefully it will help someone out.

Excel Fuzzy Lookup Add-In and SSIS

Although it’s been out for a while, I thought I’d mention, for those who don’t know, that you can download a Fuzzy Lookup Add-In for Microsoft Excel.

The Add-In allows you to perform a fuzzy lookup for Excel data against another Excel dataset. I won’t go into too much detail, as this has already been covered here, but essentially the Add-In requires you to setup two datasets, namely a “left” and “right” table. Once you carry out some configuration/mappings, then the fuzzy match will return all of the rows in your left table, with the appropriate values looked-up from your right table. As an example I’ve chosen to do my lookup based on reseller names and an address:

image

Part of the reason for this post is that I think this add-in could be useful from a self-service perspective for power users to assist in MDM/Matching/DW projects. There’s often a lot of trial and error in trying to get the right balance for matching rules, and, given that the Fuzzy Add-in gives similar results to SSIS, it may well be a good place to start with matching a dataset as part of an MDM project, for example. Alternatively, it could be used to carry out pre-matching/clean-up on data before an MDM project starts.

SSIS

I also wanted to contrast this Add-in to the Fuzzy Lookup transformation that you get in SSIS. Both the Fuzzy Lookup Add-In and the SSIS variant do the same thing – they both match one dataset against another in a fuzzy manner, returning you 1-n matches per row. In fact, according to this paper, they use the same Microsoft Fuzzy algorithm.

With the default settings, at least with my small test datasets, I get similar results in Excel and SSIS, albeit with different similarity numbers. These are the key differences that I’ve noticed:

  • Excel returns only overall row similarity, whereas SSIS returns similarity per column also.
  • Excel exposes an additional set of tuning options. As an example, you can set the Containment Bias, whereby you decide on the penalty for tokens in the right record that are not in the left record. You can also edit a number of other settings in an XML file.
  • SSIS outputs both the confidence and similarity, whereas Excel outputs only the similarity.
  • Excel will output the similarity XML per row, whereby you can see how its split up a string into differently weighted tokens.
  • As you might expect, SSIS outperforms Excel.

As a quick test, I started working with Adventure Works resellers to see if I could get the same match results from both products. To highlight this, I’ve filtered this down to just one made-up reseller (“Bikes World Comp. Store”) and looked at the top 5 resellers that both products return. SSIS, with its default settings, gives the following when matching just on the reseller name:

image

Excel, using the Fuzzy Add-In, gives the following with its default settings:

image

So overall some very good matches out of the box from the Excel Add-In, certainly comparable to SSIS. What’s interesting about this result set is that “Bike World” is ranked higher in Excel than “World Bike Discount Store”. We can align the behaviour of the two products, in this case anyway, by changing the Containment Bias from a default of 0.8 to 0. This will assign a full penalty to tokens in the right record that are not in the left record. So, as the left record is “Bikes World Comp. Store”, and the right record is “Bike World”, we are by default being lenient on the fact that it’s missing tokens. Increasing the penalty for missing tokens gives the following, with the top record now matching SSIS:

image

Of course, you would normally match on more than just the name, plus what’s worked above here for me might not work in other scenarios. But the bottom line is that the Excel Fuzzy Lookup Add-In gives very good matches out-of-the-box, is easy to use for the Excel user and therefore could well be of some use as part of data matching projects.

Debug Custom MDS Workflows

The custom MDS workflows, AKA External Action Business Rules or custom MDS assemblies, are one of the more advanced features of MDS. They allow you to extend the standard MDS business rules by getting MDS to fire entity member data at an external assembly of your choice.

There are a lot of moving parts to this feature: MDS business rules, the custom .NET assembly, Service Broker and a Windows service! When a business rule meets your condition, the rule is written to a Service Broker queue. The Windows service interrogates the queue and then passes the information to the custom assembly. I’m not going to show how to do this – it’s already covered by Johan Machielse and also on MSDN. What I would like to show is how to debug the use of the custom business rules, just due to the number of moving parts and also a couple of quirks that I’ve encountered when using them.

Custom Assembly Deployment

One of the first things that may catch you out is the deployment of the assembly. To troubleshoot this, I recommend the “-console” argument of the workflow service. If you run it in command-line mode you may encounter a few errors:

“Could not load file or assembly 'YourAssembly' or one of its dependencies. This assembly is built by a runtime newer than
the currently loaded runtime and cannot be loaded.”

This means that you need to go to the properties of your assembly in Visual Studio and change the Target Framework from .NET Framework 4 to 3.5.

Another error that you may get is "ERROR: Could not create type. Ignored." Assuming that you have actually put the assembly in the bin directory, this most likely means that either the namespace or the class name in your DLL doesn’t match what you have in your config file. Check that the contents of Microsoft.MasterDataServices.Workflow.exe.config match your DLL.

You may also get, “ERROR: Type does not implement IWorkflowTypeExtender.” This means that your assembly has been picked up correctly, but it does not implement IWorkflowTypeExtender.

If the assembly does start up ok without errors, you may still get errors when the exe tries to take a message off the service broker queue. An error such as “Could not dispatch due to missing extender <tag>” means that the tag that you’ve given to the assembly in the config file does not match the “Workflow Type” value of the MDS business rule:

<setting name="WorkflowTypeExtenders" serializeAs="String">
<value>MDSTEST=MDS.CustomWorkflow, CustomWorkflow</value>
</setting>
image
Debugging the Custom Assembly

Once you do get the assembly working, you may find that the code isn’t quite doing what you want. A good way to troubleshoot this is to choose “Attach To Process” from the Debug menu in Visual Studio, as shown below:

image

If you do this, then you can add breakpoints to your .Net code as normal and step through in order to troubleshoot issues.

To conclude, this probably isn’t an exhaustive list of all the sorts of errors that could occur with custom workflows, but with any luck it will save someone a few hours of troubleshooting.

Data Quality Services - Kimball ETL Screening

Although it’s arguably not the most popular in the Kimball series, The Data Warehouse ETL Toolkit contains a huge amount of vital information that’s focussed on the ETL side of data warehousing. In particular I’ve found the data quality screens design to be useful, so I thought it might be worth exploring if this concept can be applied to Data Quality Services.

What is a Kimball Data Quality Screen?

A screen is simply a check that is made against data that comes into the ETL pipeline and is used to measure data quality. Physically, the Kimball screen design is typically implemented as a table called Screen, which contains, as rows, the actual checks to perform against staged data. In addition, the screen table will contain metadata to aid data quality measurement, such as screen severity, category, source system and possibly exception actions.

Each time that a screen yields some results (i.e. data quality issues), another table, the Error Event Fact, is populated with the results. Therefore, by using the data generated by the screening process, the Error Event Fact becomes the place to look for an overall measurement of data quality. The schema for this design from the Kimball book (with a few modifications) is as follows:

image

The check performed by each screen is typically a snippet of SQL that checks for a particular condition, e.g. Is the customer’s postcode missing, or is the tax amount within permitted ranges?
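
As a rough illustration of the design (the column names below are my own simplification of the Kimball schema, not something generated by DQS), the two tables might be sketched as follows, with the SQL snippet stored per screen:

CREATE TABLE dbo.Screen
(
    ScreenId        INT IDENTITY(1,1) PRIMARY KEY,
    ScreenName      VARCHAR(100) NOT NULL,
    ScreenCategory  VARCHAR(50)  NOT NULL,    -- e.g. Completeness, Validity
    Severity        INT          NOT NULL,    -- drives the exception action
    SourceSystem    VARCHAR(50)  NOT NULL,
    ScreenSQL       VARCHAR(MAX) NOT NULL     -- the check itself, e.g. the missing postcode test
);

CREATE TABLE dbo.ErrorEventFact
(
    ErrorEventId    BIGINT IDENTITY(1,1) PRIMARY KEY,
    ScreenId        INT          NOT NULL REFERENCES dbo.Screen (ScreenId),
    ErrorDate       DATE         NOT NULL,
    SourceRowId     VARCHAR(50)  NOT NULL     -- identifies the offending staged row, e.g. the PO Id
);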

Can Data Quality Services assist with ETL screening?

DQS is capable of auditing and checking data, as each domain within a knowledge base can have its own domain rules. In order to see how this might work as Kimball screens, I’ve set up a simple knowledge base around purchase orders. Here’s the domain management window from the DQS client:

image

Within the Delivery Charge domain, I’ve set up a domain rule called ‘Delivery Charge Negative’ in order to ensure that the delivery charge cannot be negative:

image

I’ve done something similar with the PO Date domain, setting up a rule to state that the PO date must be greater than 01/01/2012 in this case. Data run against the PO knowledge base will now be enforced against these two rules, so these are effectively my two Kimball style screens. To recap, DQS cleaning can be run either from the Data Quality Client or from SSIS.

Populating the Error Event Fact with SSIS

I’m going to be running my DQS domain rules via SSIS, as my end goal is to populate the Error Event Fact. Remember, if any DQ issues occur, I want the Error Event Fact to know about them.

In this scenario, the data from the source system has already been staged to a SQL table, which gives the following results in the SSIS source adaptor preview:

image

This data is passed to the DQS Cleansing SSIS task, with the data mapped to the DQS Knowledge Base in the following way:

image

Once the package is run, the data will be passed to DQS, which will run the data against its knowledge base, including running the domain rules. The data that is returned from DQS is as shown below. Note that the PO55 row has failed both rules, whereas the other two rows have failed one rule each.

image

Now we know exactly which rows have failed each data quality screen. The only trouble is that the data is in the wrong format for our error event fact. We want a total of 4 rows in the error event fact, as PO Id 5 has failed 2 screens, whereas the other two rows have failed 1 screen each. Therefore, we carry out an unpivot before inserting into the Error Event Fact. For this blog post my ScreenId is hard coded, but in reality you would have to look up an appropriate ID for the DQS screen. The full SSIS package is as follows:

image

Data Quality Reporting

With the Error Event Fact populated, a variety of reports can be produced. In this case I’ve made a very simple Analysis Services cube and made a quick report in Excel 2013, using slicers, sparklines and also the new Timeline feature:

image

From this report I can see the screen that’s failed the most rows, see a trend of data quality and filter on specific issues.

Summary

With a bit of work, DQS is capable of producing the rows needed for the Kimball Error Event Fact, but how good is it? I see the advantages of using DQS as follows:

  • Front end user interface in order to allow data stewards to set up their own rules to screen the data – much better than holding rules in SSIS or manually in a table;
  • DQS becomes a one stop shop for all data quality, as, in addition to holding the screens as domain rules, DQS will clean your data for you, according to the valid/invalid domain values that you set up in the knowledge base.

There are a few disadvantages to it also:

  • The domain rules are not as flexible as pure SQL or SSIS, meaning that you may not be able to express all the screening rules that you had in mind;
  • DQS can't maintain all the metadata that you need to implement Kimball screening, so you’ll need to supplement it with some SQL tables to hold things like screen severity, screen type etc;
  • Even though the performance has been improved in CU1, it’s still not suitable for large amounts of data. Here’s an overview of how long you can expect the SSIS cleansing to take.

So my view is that if you’re already using DQS and data stewards like using it to manage data quality, and your data volumes are not too big, then implement the classic Kimball screening design using DQS. But there are more cons than pros unfortunately, so if you’ve not already made an investment in DQS, and you have the time to invest in your own Kimball style screening solution, then I think you can come up with something that’s a bit more flexible and scalable.

Unique Values in Master Data Services

I’ve been doing quite a few demos and Proof of Concept (POC) projects recently for MDS. Understandably there are always a raft of questions that come up during demos and POCs, one of which recently was “how do I ensure that this attribute’s value is unique?”.

MDS can easily deal with simple uniqueness enforcement, which is something that anyone who has done any MDS development will no doubt know straight away. But if the requirement is more complex, such as ensuring that the attribute is unique in conjunction with other attributes, then I don’t think it’s as well known. Therefore I thought it would be worth a quick post on how to do this.

Code Attribute

To get the simple case out of the way, each MDS entity is always created with an attribute called Code, which must be unique. Therefore if you try and create/update a member with a duplicate Code, then you get the following error message:

image

Business Rules

To enforce uniqueness on any other attribute, you have to use business rules. Therefore, if I want to ensure that the attribute called Name must be unique, then I must carry out the following:

  • Go to the business rules editor and create a new rule
  • Drag the action of “must be unique” over to the actions node
  • Drag the Name attribute over to the “Select attribute” node, meaning that the action should look as follows:
  • image
  • Once that’s done, save the rule step and then re-publish the rule.

Trying to enter a duplicate value for the Name attribute will now give the following validation error:

image

Business Rules – Multiple Attributes

Now for the harder scenario. In the sample MDS product model, there is a Country attribute and also a Product Sub Category. I want to ensure that Name is unique, but only within a Country and Sub Category combination. To do this we go to the same business rules interface, and add the same action as before. Once that is done we carry out the following additional steps:

  • Change the radio button to “In combination with the following”
  • This is the part that I don’t think is that obvious. Drag over the Country attribute and then the Product Sub Category attribute onto the “attributes” node. This should give the following:
  • image
  • As before, save the rule step, then publish the rule.

Now a duplicate name will be permitted, but only within a country and sub category combination. If we do enter a duplicate value within a given Country/Product Sub Category group, then the following validation message is shown:

image
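
If you want to verify the combination rule against the underlying data, a simple grouping query does the job. mdm.Product is a hypothetical subscription view name for the sample Product model, so adjust as needed:

SELECT      Country,
            ProductSubCategory,
            Name,
            COUNT(*) AS DuplicateCount
FROM        mdm.Product                   -- hypothetical subscription view
GROUP BY    Country,
            ProductSubCategory,
            Name
HAVING      COUNT(*) > 1;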

Master Data Services SQL Server 2012 Vs 2008 R2

I’ve given quite a few MDS demos recently and one question I seem to get asked often is “What features does MDS 2012 have compared to MDS 2008 R2?”. As it’s happened quite a few times, I thought it would be a good idea to put a quick post together to explain the differences. The table below details all the features MDS has and compares the two versions:

Front End
  • MDS 2012: New Silverlight UI added for the Explorer area, giving better performance with fewer clicks to get tasks done; Excel Add-in introduced.
  • MDS 2008 R2: HTML-based ASP.NET implementation; no Excel Add-in.

Modelling
  • MDS 2012: Carry out modelling in the web UI, plus some tasks can be carried out in Excel; allows the auto-generation of a Code without using business rules.
  • MDS 2008 R2: Carry out modelling in the web UI only.

Hierarchies
  • MDS 2012: Three types of hierarchies; manage hierarchies in the new Silverlight UI.
  • MDS 2008 R2: Three types of hierarchies; manage hierarchies in the older UI style.

Collections
  • MDS 2012: Add members from Explicit Hierarchies or other Collections; maintain a weighting on a member-by-member basis.
  • MDS 2008 R2: Add members from Explicit Hierarchies or other Collections.

Staging Process
  • MDS 2012: A staging table is created per entity that you create in MDS. The table mirrors the exact structure that you see when you view the entity in the front end, making loading easier.
  • MDS 2008 R2: A generic staging table is used for all entities, meaning, for example, that you must load attribute changes as attribute name/value pairs.

Deployment
  • MDS 2012: Deploy the model structure from the front end; deploy the data from the new command-line deployment tool.
  • MDS 2008 R2: Deploy model structure and data from the front end.

Security
  • MDS 2012: Grant permissions to models, entities, attributes and attribute groups; security cleaned up and simplified (no longer necessary to grant permissions to hierarchies).
  • MDS 2008 R2: Grant permissions to models, entities, attributes, attribute groups and hierarchies.

Business Rules
  • MDS 2012: Alter attribute values, validate data integrity, initiate workflow; some minor validation changes around using business rules to auto-generate Code values.
  • MDS 2008 R2: Alter attribute values, validate data integrity, initiate workflow.

SharePoint Integration
  • MDS 2012: Initiate a SharePoint workflow from MDS business rules; new functionality to embed the web UI inside a SharePoint page.
  • MDS 2008 R2: Initiate a SharePoint workflow from MDS business rules.

Data Quality Services (DQS) Integration
  • MDS 2012: Connect to a DQS matching policy from Excel, against MDS master data.
  • MDS 2008 R2: No DQS capability, as DQS is new in SQL Server 2012.

Version Management
  • MDS 2012: Create multiple copies of your master data and work with them independently as needed; some minor UI improvements have been added.
  • MDS 2008 R2: As SQL 2012, create multiple copies of your master data and work with them independently as needed.

Transactions (auditing)
  • MDS 2012: Transactions removed from Explorer and now only available in Version Management, so users cannot reverse their own transactions (only administrators can); annotations are now permanent and cannot be deleted.
  • MDS 2008 R2: Transactions viewable in two areas, Explorer and Version Management; users can reverse their own transactions; annotations can be deleted.

Metadata
  • MDS 2012: No support to view metadata – this is now a deprecated feature.
  • MDS 2008 R2: Customise the metadata for most objects in MDS; view metadata in the metadata explorer.

Installation & Configuration
  • MDS 2012: Installation from the main SQL Server setup program; minor changes to the configuration settings.
  • MDS 2008 R2: Installation from the SQL Server media, but not included in the main setup program.

For me the improvements to the performance and the addition of the Excel Add-in are the most compelling reasons to upgrade. The great thing in particular about the Excel Add-in is that data stewards can manage data with the familiarity of Excel, but the data itself will be stored in SQL Server 2012.