Jeremy Kashel's Blog

Core and Custom Dimensions and Facts

An interesting challenge in data warehousing is how to cope with a dimension that is heterogeneous in nature, e.g. a product dimension that has only a handful of attributes for one particular product type, but many more attributes for a different product type. One way to deal with this problem is to create ‘Core and Custom’ dimensions or fact tables, which is a concept I will introduce in this first post, before outlining some of the technical challenges in later posts.

The ‘Core and Custom’ concept (AKA Supertype and Subtype) is covered by Ralph Kimball in the Data Warehouse Toolkit, where he uses the example of a retail bank selling different types of products, e.g. a mortgage ‘product’ and a current account ‘product’. They both have a balance and an interest amount, which are the facts, but they have a varying number of different dimension attributes. E.g. a Current Account product may have attributes such as ‘Monthly Fee Indicator’ and ‘Debit Card Eligible Indicator’, whereas a Mortgage product may have attributes such as ‘Term’ and a ‘Fixed/Adjustable’ indicator to represent fixed vs variable rates.

I’ve seen an example of this myself recently, where a company sells Commercial Vehicles, Cars, Spare Parts and Bikes. This is the example that I’ll use to illustrate the concept a bit further.

Core and Custom Dimensions

If a company wants a global view of all of its product sales, whether they be commercial vehicles, parts, cars or bikes, then the ideal scenario is to have one sales fact table and one product dimension table. Unfortunately, due to the diverse nature of the products offered, it could become unmanageable and messy to have a single dimension table that simply has all attributes from all of the different product lines. For example, a car might have attributes such as Engine Size, Transmission, Airbag Indicator and Fuel Type, whereas a bike might have attributes such as Size, Gender, Speed and Rear Travel – they are clearly very different.

The Core and Custom dimension approach is essentially to have a ‘Core’ dimension table that holds all products, with a column for the surrogate key and then all of the common attributes. The Core dimension table would look like the following:

[Image: the Core Product dimension table, with a surrogate key column and the common attributes]

This is then complemented by the ‘Custom’ dimension tables which are highly specialised dimension tables that are specific to a product line. A given product will be present in both the Core dimension table and one of the Custom dimension tables, with the same surrogate key. This is an important point – a product in the Core table will have the same surrogate key as a product in one of the Custom dimension tables. Therefore in the example below, the highlighted rows in the Core dimension table also appear in the Custom bike product dimension table:

[Image: the Core Product dimension table, with the bike rows highlighted]

[Image: the Custom bike Product dimension table, sharing the same surrogate keys]
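
To make this a bit more concrete, here is a minimal T-SQL sketch of the two tables. The table and column definitions are illustrative rather than lifted from the screenshots above, but they show the key point – the Custom bike dimension re-uses the Core dimension’s surrogate key:

    -- Core dimension: one row per product, common attributes only
    CREATE TABLE dbo.DimProduct
    (
        ProductKey   INT          NOT NULL PRIMARY KEY,  -- surrogate key
        ProductCode  VARCHAR(25)  NOT NULL,              -- business key
        ProductName  VARCHAR(100) NOT NULL,
        ProductLine  VARCHAR(50)  NOT NULL               -- e.g. Bike, Car, Spare Part, Commercial Vehicle
    );

    -- Custom dimension: bike-specific attributes, keyed on the same surrogate key
    CREATE TABLE dbo.DimProductBike
    (
        ProductKey   INT          NOT NULL PRIMARY KEY
            REFERENCES dbo.DimProduct (ProductKey),
        Size         VARCHAR(10)  NULL,
        Gender       VARCHAR(10)  NULL,
        Speed        INT          NULL,
        RearTravel   VARCHAR(25)  NULL
    );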

Fact Tables

The idea is that the fact table can be joined interchangeably with either the Core dimension table to get an overview of all products, or to the Custom dimension table to get highly specialised analysis. The Core dimension table would join to a fact table just like any other dimension table:

[Image: the sales fact table joined to the Core Product dimension table]

In contrast, a Custom dimension table could be joined to the fact table according to the type of analysis required. The example below shows how any one of three Custom dimension tables could be joined to the fact table:

[Image: the three Custom dimension tables and their optional join paths to the fact table]

Note – The above diagram shows the optional join paths from the Custom dimensions to the fact table. From the user’s perspective, a single dimension would always be presented – either all products, in which case the Core dimension would be used along with the whole fact table, or one of the Custom dimensions, in which case the fact table should only contain the products that relate to that Custom dimension, e.g. with a view.
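
As a rough sketch of the view mentioned above (object names are again illustrative), restricting the fact table to bike products is just a join to the Custom bike dimension:

    -- Present only the bike-related rows of the fact table to users of the
    -- Custom bike dimension
    CREATE VIEW dbo.vwBikeSales
    AS
    SELECT  f.*
    FROM    dbo.FactSales f
            INNER JOIN dbo.DimProductBike b
                ON b.ProductKey = f.ProductKey;

The Core dimension, in contrast, simply joins to the whole fact table with no view required.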

The alternative approach is to have separate Custom fact tables for each business line – this could be necessary if there are specific facts or dimensions that only apply to a particular product line. Therefore an approach with both Custom fact and dimension tables might look as follows:

[Image: Custom fact tables alongside the Custom dimension tables for each product line]
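
As a purely hypothetical sketch of that approach, a Custom bike fact table might carry a measure that only makes sense for bikes, alongside the standard measures:

    -- Hypothetical Custom fact table for the bike product line
    CREATE TABLE dbo.FactBikeSales
    (
        ProductKey        INT           NOT NULL REFERENCES dbo.DimProductBike (ProductKey),
        DateKey           INT           NOT NULL,
        SalesAmount       DECIMAL(18,2) NOT NULL,
        ServicePlanAmount DECIMAL(18,2) NULL   -- hypothetical bike-only measure, for illustration
    );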

Summary

My parting thoughts are that whilst I like the Core and Custom approach, you should only use it where appropriate. For example, if you have a few product lines and they vary only slightly, then it’s fine to create a few attributes within a single dimension table that do not apply to all products. On the other hand, if you want to do detailed analysis across highly varied product lines, then Core and Custom dimension and fact tables provide a very flexible method of presenting different information to different groups of users.

Data Quality Services - Kimball ETL Screening

Although it’s arguably not the most popular in the Kimball series, The Data Warehouse ETL Toolkit contains a huge amount of vital information that’s focussed on the ETL side of data warehousing. In particular I’ve found the data quality screens design to be useful, so I thought it might be worth exploring whether this concept can be applied to Data Quality Services.

What is a Kimball Data Quality Screen?

A screen is simply a check that is made against data that comes into the ETL pipeline and is used to measure data quality. Physically, the Kimball screen design is typically implemented as a table called Screen, which contains, as rows, the actual checks to perform against staged data. In addition, the screen table will contain metadata to aid data quality measurement, such as screen severity, category, source system and possibly exception actions.

Each time a screen yields results (i.e. data quality issues), another table, the Error Event Fact, is populated with them. Therefore, by using the data generated by the screening process, the Error Event Fact becomes the place to look for an overall measurement of data quality. The schema for this design from the Kimball book (with a few modifications) is as follows:

[Image: the Screen and Error Event Fact schema]

The check performed by each screen is typically a snippet of SQL that checks for a particular condition, e.g. Is the customer’s postcode missing, or is the tax amount within permitted ranges?
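
As a rough illustration of the postcode example (the staging table, column names and Error Event Fact columns below are my own assumptions rather than anything from the book), the SQL held against a screen both identifies the offending rows and drives the insert into the Error Event Fact:

    DECLARE @ScreenId INT = 1,      -- FK to the Screen row that holds this check
            @BatchId  INT = 100;    -- identifies the ETL run

    -- Log one Error Event Fact row per staged row that fails the screen
    INSERT INTO dbo.ErrorEventFact (ScreenId, BatchId, ErrorDate, SourceRowIdentifier)
    SELECT  @ScreenId, @BatchId, GETDATE(), c.CustomerID
    FROM    stg.Customer c
    WHERE   c.Postcode IS NULL OR c.Postcode = '';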

Can Data Quality Services assist with ETL screening?

DQS is capable of auditing and checking data, as each domain within a knowledge base can have its own domain rules. In order to see how this might work as Kimball screens, I’ve set up a simple knowledge base around purchase orders. Here’s the domain management window from the DQS client:

[Image: the DQS domain management window for the purchase order knowledge base]

Within the Delivery Charge domain, I’ve set up a domain rule called ‘Delivery Charge Negative’ to ensure that the delivery charge cannot be negative:

[Image: the ‘Delivery Charge Negative’ domain rule]

I’ve done something similar with the PO Date domain, setting up a rule to state that the PO date must be greater than 01/01/2012 in this case. Data run against the PO knowledge base will now be validated against these two rules, so these are effectively my two Kimball-style screens. To recap, DQS cleansing can be run either from the Data Quality Client or from SSIS.

Populating the Error Event Fact with SSIS

I’m going to be running my DQS domain rules via SSIS, as my end goal is to populate the Error Event Fact. Remember, if any DQ issues occur, I want the Error Event Fact to know about them.

In this scenario, the data from the source system has already been staged to a SQL table, which gives the following results in the SSIS source adaptor preview:

[Image: the SSIS source adaptor preview of the staged purchase order data]

This data is passed to the DQS Cleansing SSIS task, with the data mapped to the DQS Knowledge Base in the following way:

[Image: the mapping of the source columns to the DQS Knowledge Base domains]

Once the package is run, the data will be passed to DQS, which will run the data against its knowledge base, including the domain rules. The data returned from DQS is shown below. Note that the PO55 row has failed both rules, whereas the other two rows have failed one rule each.

[Image: the data returned from DQS, showing which rows have failed which rules]

Now we know exactly which rows have failed each data quality screen. The only trouble is that the data is in the wrong format for our Error Event Fact. We want a total of 4 rows in the Error Event Fact, as PO Id 5 has failed 2 screens, whereas the other two rows have failed 1 screen each. Therefore, we carry out an unpivot before inserting into the Error Event Fact. For this blog post my ScreenId is hard coded, but in reality you would have to look up an appropriate ID for the DQS screen. The full SSIS package is as follows:

[Image: the full SSIS package]
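
If you preferred to do the same reshaping in T-SQL rather than with the SSIS Unpivot transformation, a sketch might look like the following – the table and column names below (including the _Status columns) and the ScreenId values are assumptions based on my setup, and the ScreenIds are still hard coded:

    -- Reshape the wide DQS output (one row per PO, one status column per domain)
    -- into one Error Event Fact row per failed screen
    INSERT INTO dbo.ErrorEventFact (ScreenId, ErrorDate, SourceRowIdentifier)
    SELECT  s.ScreenId,
            GETDATE(),
            d.POId
    FROM    stg.DQSCleansedPurchaseOrders d
            CROSS APPLY ( VALUES (1, d.DeliveryCharge_Status),   -- 1 = Delivery Charge Negative screen
                                 (2, d.PODate_Status)            -- 2 = PO Date screen
                        ) s (ScreenId, DomainStatus)
    WHERE   s.DomainStatus = 'Invalid';

Against the three staged rows above, this would produce the four Error Event Fact rows we’re after.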

Data Quality Reporting

With the Error Event Fact populated, a variety of reports can be produced. In this case I’ve built a very simple Analysis Services cube and made a quick report in Excel 2013, using slicers, sparklines and also the new Timeline feature:

[Image: the Excel 2013 report, with slicers, sparklines and a Timeline]

From this report I can see the screen that’s failed the most rows, see a trend of data quality and filter on specific issues.

Summary

With a bit of work, DQS is capable of producing the rows needed for the Kimball Error Event Fact, but how good is it? I see the advantages of using DQS as follows:

  • A front-end user interface that allows data stewards to set up their own rules to screen the data – much better than holding rules in SSIS or manually in a table;
  • DQS becomes a one-stop shop for all data quality, as, in addition to holding the screens as domain rules, DQS will cleanse your data for you, according to the valid/invalid domain values that you set up in the knowledge base.

There are a few disadvantages to it also:

  • The domain rules are not as flexible as pure SQL or SSIS, meaning that you may not be able to express all the screening rules that you had in mind;
  • DQS can't maintain all the metadata that you need to implement Kimball screening, so you’ll need to supplement it with some SQL tables to hold things like screen severity, screen type etc.;
  • Even though the performance has been improved in CU1, it’s still not suitable for large amounts of data. Here’s an overview of how long you can expect the SSIS cleansing to take.

So my view is that if you’re already using DQS, your data stewards like using it to manage data quality, and your data volumes are not too big, then implement the classic Kimball screening design using DQS. But unfortunately there are more cons than pros, so if you’ve not already made an investment in DQS, and you have the time to invest in your own Kimball-style screening solution, then I think you can come up with something that’s a bit more flexible and scalable.

Microsoft Data Warehouse Toolkit Second Edition - Book Review

There is now a second edition of the Kimball Microsoft Data Warehouse Toolkit, which contains updated material for SQL Server 2008 R2. The book is out now in the US, and will be released in the UK towards the end of March. I’ve got hold of a copy, so I thought it would be worth giving it a review.

As with the previous edition, the new version of the book starts off by covering the Business Requirements, before explaining the core concepts associated with the Kimball approach, such as surrogate keys and slowly changing dimensions.

An overview of the Microsoft BI stack is given, before explaining how to build each part of the data warehouse in turn, covering the database engine, SSIS, SSAS, SSRS and SharePoint. There are also new sections on both Master Data Services and PowerPivot – personally I found the PowerPivot chapter to be really interesting – there’s a good explanation of how PowerPivot can complement the data warehouse, as well as how to use it, of course.

The latter part of the book explains management of the data warehouse, covering security, backups, metadata and maintenance. The final chapter, “Present Imperatives and Future Outlook”, includes a narrative on common problems in DW projects, as well as an honest appraisal of the MS BI stack which you can’t help but agree with.

[Image: book cover]

So the burning question is: should you buy this book? That needs to be answered from two perspectives – 1) if you own the original and 2) if you don’t. If you do own the original, then you’ll find that there’s a huge amount of content that’s the same across the two editions. There is new content – particularly around MDS, PowerPivot, Change Data Capture and compression, for example – as well as updates to some of the old content. £30 is quite a lot to pay for a book that’s similar to one you already own, but overall I think it’s worth going for.

If you don’t own the original then you will find this book to be an extremely useful asset in implementing MS data warehousing solutions, and I wouldn’t hesitate to buy it. One point I will make is that this doesn’t represent an exhaustive how-to for each of the components (e.g. SSAS), which I personally think is fine. I know a handful of people made this gripe about the first edition, but realistically there’s only so much you can cram into 648 pages on the huge subject of data warehousing AND then cover each element of the MS BI stack.

PowerPivot and Data Warehousing

A consultant that I know has recently been through the process of reviewing his company's management information requirements, and also the process of evaluating what technology platform can deliver those requirements.

Naturally, a data warehouse was one of the options. A data warehouse, in my opinion, will provide the enterprise-level platform for any organisation to deliver their management information.

In this case though, due diligence flagged up some alternative approaches to a data warehouse, namely the raft of self-service BI tools that are out there, including PowerPivot. Other tools that can take in data from a variety of sources are Tableau (which is also a good front end for Analysis Services) and Spotfire, for example.

I think that all of these tools have their merits. PowerPivot is incredibly fast and easy to use, whereas Tableau and Spotfire, for example, have some handy data visualisation capabilities. Some vendors actually claim that self-service tools can replace a data warehouse. Can this really be the case? While these tools do a good job of presenting data in a dimensional model, my personal view is that a data warehouse of course also does this, but in addition delivers the following benefits, to name a few:

  • Data Cleansing - The power of SQL in the backend and also features like Fuzzy Matching in SSIS mean that you can address the huge problem of data quality. The bonus here is that all data quality issues could be logged, meaning that you can go back to the owners of source systems and let them know that their data is in a mess.
  • Control - If you do encounter data quality issues, you may decide that they are actually too bad to load into the data warehouse and present to your users.
  • Transparency - if you want, you can choose to expose the data quality level of your facts to users, via the Kimball-style audit dimension. E.g. we may have noticed that within our retail data, one store has abnormally low sales. Therefore, we can assign this fact a lower data quality score than the other sales by store for the day, which lets the users know that the data quality steward is aware of the issue, and is looking into it.
  • Slowly Changing Dimensions - A lot of source systems don't behave. E.g. if we're tracking a customer's Marital Status of 'single' or 'married', you often find that source systems only record the latest change to a given field. Without addressing slowly changing dimension issues in a data warehouse, we may never know that Customer X is now single, but once was married (there's a small sketch of this after the list).
  • Conformed Dimensions - When we have multiple data sources for a single dimension, the goal is to ensure that each dimension available to end users represents a single, consistent view of that dimension. Due to the often complex matching and de-duplication that's required, it's difficult to see how this would be possible without a data warehouse.
  • Late Arriving Dimension Rows - Going back to the 'Marital Status' attribute, we can get the situation where a source system only updates a particular field a long time after the event actually occurred. E.g. we think that a customer called John Doe is single, but then today we're told he's married, and has actually been married for the last two years. If Marital Status is an important piece of management information to the business, then we need to update all data associated with John Doe over the last two years, so that it reflects his correct Marital Status over that period.
  • Complex Transformations - I recently encountered a source system where one of the columns in a SQL table held data in a varying-length format of ‘#Product Type#Currency#Value, #Product Type#Currency#Value’. So one row in the SQL table could potentially hold, say, 25 records, whereas another row might hold just 7. Rows could also be badly formed! We ended up using a bit of C# within SSIS to split this data out so that the sub-elements could be analysed correctly.
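
To make the Slowly Changing Dimensions point a little more concrete, here's a minimal Type 2 sketch in T-SQL, assuming a hypothetical DimCustomer with row validity dates and a staging table holding the latest source values:

    -- Expire the current dimension row where the tracked attribute has changed
    UPDATE  d
    SET     d.RowEndDate = GETDATE(),
            d.IsCurrent  = 0
    FROM    dbo.DimCustomer d
            INNER JOIN stg.Customer s ON s.CustomerCode = d.CustomerCode
    WHERE   d.IsCurrent = 1
      AND   d.MaritalStatus <> s.MaritalStatus;

    -- Insert a new current row for any customer without one (i.e. changed or brand new customers)
    INSERT INTO dbo.DimCustomer (CustomerCode, CustomerName, MaritalStatus, RowStartDate, RowEndDate, IsCurrent)
    SELECT  s.CustomerCode, s.CustomerName, s.MaritalStatus, GETDATE(), NULL, 1
    FROM    stg.Customer s
    WHERE   NOT EXISTS ( SELECT 1
                         FROM   dbo.DimCustomer d
                         WHERE  d.CustomerCode = s.CustomerCode
                           AND  d.IsCurrent    = 1 );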

I could go on, but I think as a few simple examples the above will do.

Whilst I'm an advocate of data warehousing, I also think that self-service tools, such as PowerPivot, can be complementary to a data warehouse. You'll often find, for example, that there's some data that's not in the data warehouse yet - after all, some organisations have a lot of data, and it will take time before all the business analysis etc. can be carried out to get this data into the warehouse. Another example is prototyping: analysts can very quickly build a prototype model, which can then be productionised at a later stage. I think this video from Chris Webb is really worth watching - it covers these examples and a few more.

In summary, there's always going to be a need for users to go digging for data and produce their own reports - the data warehouse can't always deliver everything immediately. PowerPivot is going to be great in this situation, but it will work best when it's used alongside cleaned and conformed data, which is exactly what a data warehouse provides.

Master Data Services - Reversing Transactions

MDM tools give control of the enterprise master data over to the data stewards and power users, rather than relying on automated data integration alone.

Master Data Services is no exception to the above. One of the ways that this is true for MDS is that it allows users to inspect the transactions that have occurred (either internal to MDS or from a source system) and choose whether they want to reverse them.

In order to achieve this, MDS keeps a useful log of all transactions that's viewable by users. Here's an example of some transactions that have occurred in my test system - some are from data that I've loaded via the staging tables, some are from manual member additions that I've carried out in the front end, and some are from business rules that have run automatically:

[Image: the MDS transaction log]

In the model that this example is taken from, I've got some business rules that look to address data quality issues. Taking the Kimball view on data quality issues in a data warehousing context - many can, and should, be addressed in the source system and then re-loaded. That isn't always possible, which is one of the reasons why we have business rules in MDS. However, as good as any sort of automated rule is, there are always exceptions.

In the transactions shown above, an automatic business rule has run that checks a Customer's overdraft limit and sets it to 10,000 if it's over 10,000. Therefore, when a value of 50,000 was encountered for Member Code 10311, the MDS business rules kicked in and quite correctly did their job. This was not what I wanted in this particular case.

Thankfully we can click on the undo button that's shown above the grid and reverse a chosen transaction, whether it's come from a source system, a business rule or a manual edit. It doesn't seem possible to reverse multiple transactions at once, but that may just be due to the CTP. In my example, by selecting the first transaction in the list, then clicking the undo button, I've reversed my automatic business rule. Therefore, the user Kaylee Adams (10311) shown below now has her original overdraft limit:

[Image: member 10311 (Kaylee Adams) with her original overdraft limit restored]

In conclusion, when some sort of manual intervention is needed to successfully manage master data, MDM tools allow that intervention to come from the power users, rather than having to wait for someone technical to address the issue.