Jeremy Kashel's Blog

March 2012 - Posts

Power View in SQL Server 2012 RTM

Power View has been given a couple of small changes from RC0 to the RTM release:

Firstly, you can now choose the colour of the various visualisations via the Themes section, which you’ll find on the new Styles ribbon tab:

image

This will allow you to change the colour for a slicer, table or card, whereas for a chart it will set the palette that gets used for the chart series. Surprisingly it seems to set the theme globally, rather than allowing you to set the colour/palette for a single object:

image

The other change that I’ve noticed is that the Card visualisation now has a Style option on the ribbon:

image

This allows the cards to be displayed without normal coloured background, with a much larger font. So potentially useful for highlighting key numeric facts:

image

I do really like Power View, in particular the clean look & feel, fast performance, export to PowerPoint and animations to name a few. However, even though RC0 added a lot of new features, I think it’s still missing some of the functionality offered by other reporting tools. I know that it’s marketed as a data visualisation tool rather than a reporting tool, but those lines will become a bit blurred for users. This is where Connect comes in…

One connect suggestion that I’ve created is to allow drill down in Power View charts. If you’ve used Power View to perhaps spot a trend or spike in the data, then it makes sense for you to manipulate the view to find the root cause, which I think drill down would help with. Other suggestions that I think would be useful are heat maps and allowing Power View to access multidimensional models. Please vote on Connect if you agree!

DQS Matching Vs SSIS Fuzzy Grouping and Lookup

When the term fuzzy matching comes up in the Microsoft world, it’s natural for anyone who’s used the BI stack to think of SQL Server Integration Services (SSIS), due to the fact that it has both Fuzzy Grouping and Fuzzy Matching components. Therefore, when using the matching in Data Quality Services (DQS) the other day, I thought it might be worth contrasting the matching capabilities in the two products.

Overview

The SSIS Fuzzy Grouping transformation is typically used on a set of data containing duplicates, as it will operate across all of the rows, grouping similar rows together based on fuzzy comparisons, with a view to eliminating duplicates. In contrast, the Fuzzy Lookup transformation takes a value in the SSIS pipeline and uses fuzzy matching to match the input value against a set of clean reference data in a database.

The matching in DQS takes a set of data and groups the data into clusters, based on a number of rules and other factors, again with a view to removing duplicates. It’s therefore only directly comparable against the SSIS Fuzzy Grouping, rather than the Fuzzy Lookup.

Test Scenario

For the purposes of this blog post, I’ve got a sample query from AdventureWorksDW, taking data from the DimCustomer table. I’ve unioned all that data with a few sample records that contain typos. E.g. I’ve got 'Pamela Fernndez' misspelt for one customer and '4610 Pinto Rd' instead of ‘Road’ for another. I’m going to attempt to de-duplicate data in SSIS first and then do the same thing in DQS to see if there is any difference.

SSIS Fuzzy Grouping

As shown below, I’ve just got a very simple OLE DB source passing some data into a Fuzzy Grouping, then I’m going to examine the data via a data viewer:

Control flow

The columns that I’m passing into the Fuzzy Grouping for the Customers are Full Name, Address Line 1, Gender and City just for the purposes of this blog post. I’ve configured the Fuzzy Grouping transform to carry out Fuzzy Matching on Address Line 1 and Full Name, as these are the most inconsistent (intentionally) across my made up data. I’ve also dropped the threshold right down to 0.40 to see a full range of matches that we may or may not get:

image

I’ve filtered the data for customers with a surname of Fernandez or a first name of Pamela, which includes two made up records. When I run the package, the data viewer produces the following results:

Data Viewer

The _score column has a high score of 0.8245935, which is based on the similarity scores of FullName and AddressLine1. There are 3 records included in the cluster, including someone called ‘P Fernandez’ who is male. Although that’s been matched, it’s only because I decreased the threshold, plus it can easily be excluded by adding an exact match for Gender, as shown below:

Gender

DQS Matching

So overall SSIS copes well with the test scenario, how does DQS compare?

After creating a knowledge base, creating domains and carrying out knowledge discovery, it’s possible to use DQS to create a Matching Policy. The Matching Policy involves building a number of rules that determine how the data matches. The starting point in creating a matching rule is determining which domains you want to match on and whether they should be matched using the fuzzy algorithms (similar) or matched exactly:

Rule editor

So not too dissimilar to SSIS. You then choose to give each domain a percentage weight, which must add up to 100%. For example, if you give Full Name a weight 30% and, when run, its match score is 61% accuracy, then you have Address Line 1 at a 70% weight and a 45% accuracy, then an overall match score for a row will be (0.3 * 0.61) + (0.7 * 0.45) = 49.8% match score.

Starting off with a 50% weight on the two domains, I get the following results when operating on the same data:

dqs clusters

As with SSIS, it hasn’t got it quite right, but using the Gender domain will help. What I quite like in DQS is that I can flag a domain as being a pre-requisite:

image

Flagging a domain as a pre-requisite means that the domain will be used to eliminate matches, but will not contribute to the matching score. If I wanted it to contribute to the matching score I would just have to uncheck pre-requisite and give it a weight. Running the matching again gives the following results:

dqs results 2

So the duplicate is correctly mapped with a score of 75%. Drilling down on the match was achieved is quite nice in DQS also:

image

This brings me to an interesting observation about the matching. I originally had my dummy record with a value of “3 Pierr Demoulainn” for the address. So in addition to having a typo in the second name and missing the word “Rue”, my other test also missed the second e out of the word “Pierre”. If I run my test data with these values, I get no matches at all. The lack of the second e in “Pierr” seems to have made all the difference. I can get it to match if I give a lot of weighting to Full Name. If I do this, we see the following results:

drill down 2

Address Line 1 has been given a score of 0%. Making the same change to my test data and then running in SSIS gives different results. The similarity for the address drops from 0.81 to 0.75, but it certainly doesn’t drop to zero. Although it will depend massively on your own data, the matching in DQS seems a bit more stringent. This can be mitigated by using more domains for the matching, rather than the simple 2 or 3 domains that I’ve used. To back that statement up, using a composite domain on both Address Line 1 and City does yield a strong match (71%) for the same data.

Conclusion

The rule based matching in DQS gives a lot of flexibility on how to weight matching across different domains. Coupled with the knowledge base components that interact with the matching, such as cleaning and Term-Based Relations, DQS has more features specifically aimed at matching than SSIS. It all depends on what sort of data you’re working with, but in theory, as you maintain the knowledge base over time, DQS should give you strong matches based on the knowledge you’ve given it.

However, there are some drawbacks – it’s a shame that the matching can’t be automated (please vote for my connect suggestion if you agree). SSIS by its very nature can be automated, meaning the matching in SSIS will be more suitable for a lot of implementations. Secondly, I’d like to be able to match external records against correct values already in the knowledge base, rather than just doing a fuzzy grouping on the external records. I’m a bit surprised DQS can’t do a fuzzy lookup, maybe this will change in V2. Finally, as I’ve shown above, some of the matching results are a little unexpected, whereas in SSIS they were good pretty much straight away. To be fair to DQS this could probably be mitigated with a better choice of matching fields and also running cleaning against a fully populated knowledge base beforehand.

Mean, Median and Mode in MDX

I recently delivered an Analysis Services training course to a client who were a bit surprised to learn that the standard list of Aggregation Functions for an Analysis Services measure doesn’t include Mean, Median and Mode. My answer was of course that all three can delivered using MDX, so I thought it might be a good blog post to show how it’s done.

Before I begin, just a quick refresher for anyone who needs it, the following article seems to be a good reference on the difference between Mean, Median and Mode. Essentially Mean is the average, Median is the middle value, whereas Mode is the value that occurs the most.

Mean and Median

The Mean and the Median are both relatively easy to do in MDX as there are built-in functions for both AVG() and Median(). As they’re well documented here and here, I won’t spend too long covering them. In the following MDX query I’m producing the Mean and the Median of the product sub categories in Adventure Works:

WITH 
MEMBER  [Measures].[Mean] AS AVG([Product].[Subcategory].[Subcategory].Members,
        [Measures].[Internet Sales Amount])

MEMBER  [Measures].[Median] AS Median([Product].[Subcategory].[Subcategory].Members,
        [Measures].[Internet Sales Amount])

SELECT  {[Measures].[Internet Sales Amount], [Measures].[Mean], [Measures].[Median]} ON 0,
        NonEmpty([Product].[Subcategory].[Subcategory].Members, [Measures].[Internet Sales Amount]) ON 1
FROM    [Adventure Works]
WHERE   ([Date].[Date].&[20070727])
image

If you run the query and copy the data out to Excel you’ll see that it matches the result of Excel’s Median and Mode functions.

Mode

Mode is harder to achieve as there’s no built in MDX function. Fortunately, Excel is on hand to help out, as Analysis Services allows you to use some of the Excel functions in MDX. There are warnings that come with this approach, as you need to have Excel installed on the server and there can also be performance problems. But, if you’re comfortable with these caveats, then the Excel mode function may be an option for you.

If you do use the Excel Mode function then there are a few things to watch out for. Firstly, you’ll have to use the MDX SetToArray() function to pass the set in the format that Excel expects. Secondly, you may get an error returned with the description: “#Error The following system error occurred:  Invalid flags.” This is due to the data type of the Measure – it seems that Currency measures are not supported by this function. Therefore, I’m using Cdbl() on Internet Sales Amount to get this to work:

WITH 
MEMBER  [Measures].[Demo] AS Cdbl([Measures].[Internet Sales Amount])
    
MEMBER  [Measures].[ExcelMode] AS
        Excel!MODE(SetToArray(NONEMPTY([Product].[Subcategory].[Subcategory].Members, [Measures].[Demo])
        * [Measures].[Demo]))

Applying this to the example query I’ve got above, the mode returned is 8.99, as its the only number that appears twice.

Bi Modal Result Set

There is a bit more complexity to Mode in some situations, essentially as you can get a bi-modal result set - where there is more than one value occurring the most. E.g. the mode of 1,2,2,3 is just 2, but the mode of 1,2,2,3,3,4 is both 2 and 3 – so it’s known as bi-modal.

The Excel Mode function won’t help in this scenario, as it will just return one of the mode values. Interestingly Excel 2010 does support this, with the Mode.Mult() function, but I’ve been unable to get this to work, presumably as the Excel function returns an array.

All is not lost though, it’s possible to produce mode using MDX. First of all, the numbers that I’m operating on are shown below. As you can see, there are two sets of two numbers that are the same:

image

As the following forum thread shows, mode can be achieved using MDX. I’m adapting that approach to give the following MDX:

WITH 
--Produce a result set that will guarantee bi modal results
MEMBER    [Measures].[Demo] AS
        CASE WHEN [Product].[Subcategory].CurrentMember IS [Product].[Subcategory].&[31]  THEN 65.91
        ELSE Cdbl([Measures].[Internet Sales Amount]) END

--Count how often each value appears
MEMBER [Measures].[ValueCount] AS 
SUM( 
    Union([Product].[Subcategory].CurrentMember.Level.Members,
        {[Product].[Subcategory].CurrentMember} AS Currentsub)
    , IIF(([Product].[Subcategory].CurrentMember, [Measures].[Demo]) = 
        (Currentsub.Item(0).Item(0), [Measures].[Demo]), 1, null)
)
--Only get the items that appear the most
SET [MaxModes] AS 
    ORDER(FILTER(NONEMPTY([Product].[Subcategory].[Subcategory].Members, {[Measures].[Demo]}),
    [Measures].[ValueCount] = MAX(NONEMPTY([Product].[Subcategory].[Subcategory].Members, [Measures].[Demo]), 
        [Measures].[ValueCount])), [Measures].[Demo], ASC)

SELECT   {[Measures].[Demo], [Measures].[ValueCount]} on 0,
        [MaxModes]
        --Filter out the duplicates 
        HAVING [MaxModes].CurrentOrdinal = 0 OR [Measures].[Demo] <> 
            ([Measures].[Demo], [MaxModes].Item([MaxModes].CurrentOrdinal - 2)) ON 1
FROM    [Adventure Works]
WHERE    ([Date].[Date].&[20070727])

This gives the following correct result set, assuming we just want the two bi modal values:

image

An alternative is to use an MDX Filter() or HAVING clause to just display all the sub categories that have the mode values, which would just require a small modification to the above code.

I’ve not used these approaches with big data volumes etc, but they should at least give you a few options if you’ve got to do these sort of calculations in your own environment.