|There is now a second edition of the Kimball Microsoft Data Warehouse Toolkit, which contains updated material for SQL Server 2008 R2. The book is out now in the US, and will be released in the UK in towards the end of March. I’ve got hold of a copy, so I thought it would be worth giving it a review. |
As with the previous edition, the new version of the book starts off by covering the Business Requirements, before explaining the core concepts associated with the Kimball approach, such as surrogate keys and slowly changing dimensions, for example.
An overview of the Microsoft BI stack is given, before explaining how to build each part of the data warehouse in turn, covering the database engine, SSIS, SSAS, SSRS and SharePoint. There are also new sections on both Master Data Services and PowerPivot – personally I found the PowerPivot chapter to be really interesting – there’s a good explanation of how PowerPivot can compliment the data warehouse, as well as explaining how to use it of course.
The latter part of the book explains management of the data warehouse, covering security, backups, metadata, maintenance. The final chapter, “Present Imperatives and Future Outlook” includes a narrative on common problems in DW projects, as well as an honest appraisal of the MS BI stack which you can’t help but agree with.
So the burning question is should you buy this book? That needs to be answered from two perspectives – 1)If you’re an owner of the original and 2)If you don’t own the original. If you do own the original, then you’ll find that there’s a huge amount of content that’s the same across the two editions. There is new content – particularly around MDS, PowerPivot, Change Data Capture and Compression, for example, as well as updates to some of the old content. £30 is quite a lot to pay for a book that’s similar to a book that you already own, but overall I think its worth going for.
If you don’t own the original then you will find this book to be an extremely useful asset in implementing MS data warehousing solutions, and I wouldn’t hesitate to buy it. One point I will make is that this doesn’t represent an exhaustive how to for each of the components (E.g. SSAS), which I personally think is fine. I know a handful of people made this gripe at the first edition, but realistically there’s only so much you can cram into 648 pages on the huge subject of data warehousing AND then cover each element of the MS BI stack.
I’m pleased to say that I’m delivering a couple of Master Data Services sessions at Microsoft events over the next few months.
Firstly, as I’m sure you’ve heard, SQL Bits 8 is running in Brighton from 7th – 9th April. I’m delivering a session on the Saturday that will show how Master Data Services business rules work, which will include looking at how to create and use a SharePoint workflow.
Although last I heard it was getting pretty full, you can hopefully still register for SQL Bits here.
Tech Days 2011
Tech Days is running again this year, from the 23rd – 25th May. There are a number of sessions running for Tech Days in London over the 3 days, covering areas such as Windows Azure, Windows Phone and of course SQL Server. The session that I’m doing is to show how we’ve implemented Master Data Services at one of clients, EMI, and how they benefit from using MDS.
You can register for Tech Days 2011 here.
Derived Hierarchies in Master Data Services are hierarchies created from the domain-based attribute relationships between entities, as explained in the following documentation, resulting in a fixed-depth hierarchy. A less obvious feature is that they can also be recursive, which by its very nature implies a ragged, variable-depth hierarchy. This is a characteristic that’s shared by Explicit Hierarchies, so it got me thinking that it would be worth a blog post to contrast Explicit Hierarchies and Recursive Hierarchies.
Explicit Hierarchies use Consolidated Members to group other Consolidated and Leaf Members in order to make a hierarchy. The structure can be completely ragged, with each Consolidated Member having an unlimited number of Leaf or Consolidated Members beneath it.
An example of using Explicit Hierarchies is in the sample Product model that comes with MDS, which has an Explicit Hierarchy called Product Management. This groups members as shown in the (slightly edited) image below:
Note that the hierarchy is ragged, and that the Consolidated Members (PL, SL, CL, AC, ACC and CPT) can all have different attributes to the leaf members.
Recursive Hierarchies allow an Entity Member to contain an attribute that points to another Member within the same Entity. Essentially this is a classical parent-child structure, which can of course also be ragged.
An example of using recursive hierarchies is in the Employee_Hierarchies sample model which can be downloaded from here. The image below shows one of the recursive hierarchies in the model, for employee managers:
So, as we can see, both can be ragged and have a variable depth, so which one should we use? The following key points may help decide:
- Explicit Hierarchies must use Consolidated Members for the grouping. So a Leaf Member can’t suddenly become a parent. E.g. in an employee manager type hierarchy, where it would be common for an employee to become a manager once they have the relevant experience, with Explicit Hierarchies it would involve converting the Leaf Member to a Consolidated Member, which would be a pain.
- Its slightly easier to work with Recursive Hierarchies as you don’t need to worry about Consolidated Members. It can be a bit awkward to continually switch between Leaf and Consolidated Members in the explorer grid, so good to avoid it if possible.
- It’s not possible to set individual member permissions on a Recursive Hierarchy, whereas it is with an Explicit Hierarchy.
- Explicit Hierarchies can be added to Collections for more flexibility, whereas Recursive Hierarchies cannot.
- You won’t be able to have multiple hierarchies by using a Recursive Hierarchy (unless you go down the root of having multiple self-pointing attributes), whereas you can create multiple Explicit Hierarchies quite easily.
Considering all the above points, in my opinion anyway, a Recursive Hierarchy is more suitable when you have a genuine parent child requirement, e.g. for an employee hierarchy, or a chart of accounts. This is just down to the ease of use. An Explicit Hierarchy is more suitable when you just want to create groupings for your leaf members, but have the requirement to potentially have multiple ragged reporting hierarchies.
For any given entity in Master Data Services, its only the Code attribute that cannot be left blank when entering data in the front end or via the MDS staging tables. For any of the other attributes, you can then use the MDS business rules to enforce that the attribute must have a value, using the Is Required action, for example.
All this works well in my opinion, but I have found a difference in how blanks and NULLs are handled in the front end Vs in the MDS staging process, which has a knock on impact on the business rules.
To explain I’ve created an example model called Supplier, with an entity called Supplier and a test attribute called Attribute1. There is also a simple business rule that will set the value of Attribute1 to the value ‘Name blank’ when the built in Name attribute is left blank:
Adding Data via Master Data Manager
My first step is to add a new member to the Supplier entity, which I will just give a code of 1 and intentionally leave the name blank:
Clicking the save button runs the single business rule that exists and correctly sets Attribute1 to the value of ‘Name blank’
If we determine the entity table for the Supplier entity (which can be done by looking at the [mdm].[viw_SYSTEM_SCHEMA_ENTITY] view) then in my case the entity table is called tbl_24_97_EN. Running a select statement on this table shows the following result:
The name has been entered as NULL, which is fine, as we’ve got the behaviour that we want.
Adding Data via the MDS Staging Tables
Now lets look what happens when we do the same operation via the MDS staging tables. First of all, we need to load a supplier member into the mdm.tblStgMember table, which we can carry out with the following SQL:
INSERT INTO mdm.tblStgMember
--Insert member with no name and a Code of 2
The null has been left in the statement as we’re simulating, for example, not being given a name for the member from our data source. You could then have an MDS business rule set up to handle missing names, perhaps setting them to ‘Unknown Supplier’.
To load from the staging tables into the entity table, we run the staging sweep stored procedure:
DECLARE @ModelName nVarchar(50) = 'Supplier'
DECLARE @Version_ID int
SET @Version_ID = (SELECT MAX(ID)
WHERE Model_Name = @ModelName)
--This will cause the members in the staging tables to be loaded into MDS
EXECUTE mdm.udpStagingSweep 1, @Version_ID, 1
To finish up we will also validate the model, which will cause the business rule to fire:
DECLARE @ModelName nVarchar(50) = 'Supplier'
DECLARE @Model_id int
DECLARE @Version_ID int
SET @Version_ID = (SELECT MAX(ID)
WHERE Model_Name = @ModelName)
SET @Model_ID = (SELECT Model_ID
WHERE Model_Name = @ModelName)
--Cause validation to occur
EXECUTE mdm.udpValidateModel 1, @Model_ID, @Version_ID, 1
If we now do a select from the entity table, we will see that the member with the code of 2 doesn’t have a NULL for the name column, but instead has a blank string:
If we now go and take a look at the Master Data Manager front end, we will see that the business rule condition hasn’t been met for the the member that we added, as its ‘Attribute1’ attribute is still blank:
This has happened because the stored procedure that implements the business rules carries out the ‘Name is equal to Blank’ check that we’re relying on by checking if the name column is NULL.
How this affects you will obviously depend on how you’re using the MDS business rules. It doesn’t affect the other member attributes, only the Name attribute. If you want to handle null Names on entities then a good work around is to use the business rules in the front end, and then to use SSIS to replace null names with ‘Unknown’ or something similar.
For your Master Data services business rules that have a validation action, a notification, if you’ve set one up, will be sent if the rule condition gets satisfied.
I’ve noticed a quirk to the way that the notifications are sent – I was getting unexpected email notifications when creating members – which as it happens is not a bug, but is worth exploring to understand how the MDS rules work.
The above was essentially due to the two step save process that occurs in the MDS front end. When adding a member, you first of all enter the name and the code, then you click the save button. At this point, the business rules are run, and validation issues are created in a table called [mdm].[tblValidationLog], and then displayed to the user at the bottom of the screen, before the user has entered the attributes. This is shown below on the sample Product model, which I have changed to enable notifications on a few of the rules:
Then click the save button, which will display the attributes pane below. As the sample Product model contains a rule called “Required Fields”, with a condition of “None”, then the validation issues will also be displayed:
At this point we’ve got no way to avoid getting the validation issues, as the rule has no condition, and we’ve not had any opportunity yet to alter the attributes that the rule checks. The email notifications behaviour to be aware of at this point is:
- The notifications are sent on an interval, defined by the system setting called “Notification e-mail interval”. Its default is 120 seconds.
- If you rectify the issues that you’ve been warned about in the gap between the last email notification cycle, and the next one, then [mdm].[tblValidationLog] will be updated to say that the validation issue doesn’t need a notification, and you will get no emails.
- On the other hand, if you’re in the middle of a few things and essentially don’t act quick enough, then you will get a notification with the above issues displayed, as the email interval has kicked in during the middle of your save.
Hence I was sometimes getting email notifications as shown below:
But sometimes I was getting no notification for carrying out what I perceived to be exactly the same behaviour. So by no means a bug, but worth explaining to your users how the notifications work.
I’ve just found out that the first CTP for SQL Server codename Denali is now available for download. The link to get both the 32 and 64 bit versions is:
On a related note, Simon Sabin has recently posted here about the Tech-Ed Europe keynote that shows a glimpse of a new Silverlight reporting tool for SQL that’s part of Denali. Well worth watching the keynote video…
Subscription Views are one of the ways to extract data from Master Data Services. Created via the ‘Data Integration’ section of the MDS Master Data Manager front end, they result in the creation of a physical SQL Server view in the underlying database.
All this you will probably know from the MSDN article. The purpose of this post, however, is to explain a few of the more subtle differences between the different types of subscription views that get created, and looking how they can be used to feed a data warehouse.
You can create a subscription view from a derived hierarchy, which will return all the levels from your hierarchy in a de-normalised form. If you choose the sample Product model, then you will see that one of the Derived Hierarchies that you can pick from is called Category. As your Product dimension in a data warehouse would include attributes such as Product Category and Sub Category, then on the surface this view looks to be ideal.
Unfortunately there is a slight issue with subscription views based on derived hierarchies. Users could have altered one of the members that exist within the hierarchy before the data warehouse nightly load has taken place. If the MDS validation of that member hasn’t occurred yet (which it doesn’t in some situations), then the member could potentially contain some incorrect attribute values that we don’t want to expose to the data warehouse.
For example, a user has edited a product, as shown below, but chosen a different Sub Category to that of its adjacent members:
As indicated by the yellow question mark above, the member is awaiting validation. However, when I run a select statement against my view that’s based on Derived Hierarchy Subscription view, then the member that is marked above will still appear in the view results. We would not want to load this change into the data warehouse, as until the business rules have been run, we dont know if the new Sub Category is valid in the context of the other attributes. For reference, the same behaviour exists for members that have explicitly failed validation.
You can also create a subscription view by picking an entity, and then choosing the format of Leaf Attributes, as shown below:
In my case, I’ve set up a subscription view called ProductLeafAtributes, and when I run a select statement against this view in SQL Server, then I get the following results:
As with the subscription view that’s based on a Derived Hierarchy format, I still get all the members returned, but there’s also a column called Validation Status, that shows that one of the members is awaiting revalidation. Therefore, this column can be filtered on in the data warehouse load, just to take through the changes that have definitely passed validation.
Creating the subscription views based on entities, rather than Derived Hierarchies, exposes the validation status of a member, and is therefore the preferred subscription view format to use in this case. The one catch is that the Leaf Attributes format will not return attributes of any related entities, e.g. Product Category. It will return the Sub Category code, so therefore its necessary to join to separate Subscription Views for both Sub Category and Category in order to pick up the missing attributes.
The SQL Bits 7 agenda was published recently, and I’m pleased to say that I will be delivering a session titled ‘End to End Master Data Management with SQL Server Master Data Services’.
The heart of this session will be an end-to-end technical demo that will show the role of MDS in extracting master data from source systems, before demonstrating how data is fed via MDS to subscribing downstream systems. I’ll cover areas such as the MDS staging process, business rules and notifications amongst others.
I’ve also seen from Simon Sabin that the SQLCAT team will be showing some of the features of SQL 11, code name Denali. Well worth going for that alone surely?!
You can register for SQL Bits here.
Back to a bit of Master Data Services this week – there’s been a few learning resources that have appeared here and there for MDS over the past couple of months. I thought that it would be worth rounding up the ones that I’ve noticed:
There was a comment on my last blog post by Peter Eb asking about whether Reporting Services 2008 R2 knows how to render SSAS KPIs automatically. The short answer is that it doesn't unfortunately, but there are only a few steps needed to get it working.
For this blog post I've built a simple report that uses the Adventure Works DW 2008 R2 Analysis Services database as a data source. The first step is to build a data set that contains a query that references the Month attribute and the Channel Revenue KPI:
I'm interested in reporting on the status and trend part of the KPI, so I've added the month and value parts of the KPI, plus blank columns for the status and the trend:
For the status and the trend columns, the next step is then to drag on (or right click and insert) an indicator in each of the cells. This is when the manual steps come in:
- Right click on the indicator to open the properties window;
- Set the Value of the indicator to be the Status part of the Analysis Services KPI;
- Change the States Measurement Unit to Numeric;
- Configure the indicator Start and End properties to be -1, 0 and 1. SSAS KPIs can have an unlimited number of status values, so it may be necessary to add more states here if SSAS returns more than 3 values for the Status.
This changes that are needed to the indicator properties window is shown below:
Applying the same method to the Trend part of the KPI produces the following simple report:
This post wasn't really supposed to compare SSRS Vs PerformancePoint Services, but I've found out that PerformancePoint Services does pick up the SSAS KPI meta data (slight mismatch on the colours only), saving you having to set up the indicator yourself:
It's a shame that the KPIs aren't picked up automatically by SSRS, but then again the above process isn't too difficult to master.
Although you've always been able to deliver KPIs in Reporting Services by using images and expressions, Reporting Services in SQL Server 2008 R2 now includes a feature called Indicators, which assists with producing KPIs out of the box.
This got me thinking how producing KPIs and/or scorecards in Reporting Services compares to doing the same in PerformancePoint Services. This blog post gives a short overview of creating KPIs in each technology, before comparing the pros and cons of each.
Reporting Services 2008 R2 Indicators
The indicators in Reporting Services come as an extra item on your toolbox, and can be dragged onto a Tablix for example. Once you do add the indicator to your report, you'll be prompted to select the shape that you want for your indicator:
So no more messing about with images! Once you click OK your indicator will be added to your report. If you then right click on the properties of the indicator, you can choose what to base the indicator's value on on, and also set its state, both of which essentially control the points at which your indicator changes its colour and shape. What I quite like here is that it's easy to change the colour for each state, and it's also easy to add new states, if you need four colours rather than three for example:
You seem to be able to add a huge number of indicator states here, certainly more than enough to deal with KPIs. Taking some data from AdventureWorks, based around employee sales targets, I've created the following very basic report using the Indicators:
PerformancePoint Services KPIs
The KPI and Scorecard have had a few decent changes in SharePoint 2010 (e.g. drill up/down), but Dashboard Designer is still the tool that's used to create KPIs in PerformancePoint Services, and its a different experience to developing in Visual Studio.
Assuming that you get to the point where you have a data source set up, the first thing to do in order to create a KPI is to pick an indicator. The choice here is very similar to the choice in SSRS:
The alternative here is to pick a blank indicator, where you'll have to decide yourself how many levels you want for your indicator, and what the colours and shapes should be. You can pick an image for an indicator state (as you can with SSRS), although you are limited to a maximum of 10 states. Again, that should be more than enough for KPIs.
The objects within PerformancePoint Services are very modularised, meaning that you create indicators, KPIs and scorecards separately. If you need to re-use an indicator in a new KPI, then you can just point to an existing indicator, or alternatively you can create a new one.
Again, taking some sample data from AdventureWorks, I've produced the following basic scorecard, based on similar data:
The following table lists some KPI/Scorecard features and compares how SSRS and PPS are able to deal with them:
Reporting Services 2008 R2
Out of the box Parent KPIs
No built in parent KPIs
- Two built in scoring methods;
- Weighting concept to assist with scoring.
No built in scoring capability
- Analysis Services;
- SQL Server;
- SharePoint List;
- SQL Server;
- Analysis Services;
- OLE DB, ODBC;
- SharePoint List;
- SAP Netweaver BI;
- KPIs can only exist within Scorecards.
- A scorecard can only contain KPIs and their associated Indicator and data items.
- Indicators can be embedded in other objects or exist by themselves.
- Other objects (e.g. spark lines) can co-exist alongside Indicators in the same Tablix.
- Colours & number formats of the Scorecard grid can be changed;
- Colours/fonts in data cells cannot be changed.
If the Indicator is embedded in a Tablix, then each cell can be altered in terms of Colours and number formats.
Ease of Use
As long as your data is in good shape, its possible for power users to build KPIs and Scorecards with Dashboard Designer relatively easily.
The indicator windows are very intuitive in SSRS, but I'd say that building an SSRS report is harder for a power user than building a PPS dashboard.
Scorecards allow users enter comments against KPI values, which is a great feature.
The standard SSRS expressions allow you to fine tune indicators, states, values and formatting of your KPIs.
In summary, PerformancePoint Services in SharePoint 2010 will allow you to get some great looking KPIs and Scorecards up and running much quicker, as that's what it's designed to do. It is, however, more restrictive than Reporting Services, so SSRS is the better choice if your KPI requirements are complex in terms of layout or formatting.
With the release of SQL Server 2008 R2 nearly upon us, it's a safe bet that the number of technical articles for MDS will start to increase a bit. In the meantime, I felt it was worth mentioning a few older MDM White Papers that may come in useful:
Although I didn't mention it, the inspiration for my last post was the excellent MDX Solutions 2nd Edition. Whilst flicking through the book a while back, I though it would be interesting to see how DAX and PowerPivot could handle the so called 'Common Calculations and Selections in MDX'.
This post continues that theme, focusing on one calculation in particular that's in the aforementioned MDX book, namely 'Unweighted Allocations down the Hierarchy'. Essentially this business problem is dealing with allocating data that is entered at a higher level (e.g. Quarter) down to a lower level (e.g. Day). It varies from business to business what the users actually want to see at the day level in this situation. Some expect to see the Quarter amount, some want to see a blank value and then some users want the amount in the quarters allocated down to the day level.
If the expectation is that data should be allocated, then one way of doing the allocation is to use a ratio. In the MDX book, this is achieved by the following MDX:
If we put this into a query and run it against the Calendar hierarchy in AdventureWorksDW we get the following results:
Therefore our quarter-entered value can simply be multiplied by this ratio to get the correct value at each level in the Calendar hierarchy.
So the challenge is how to be able to do this in PowerPivot. I started off by looking for some data in AdventureWorksDW that would fit the bill. A good candidate seems to be the FactSalesQuota table, which I imported into PowerPivot along with its related dimensions:
Once in the pivot table, I wanted to be able to calculate the correct ratio at the daily and monthly levels. For the daily level this is 1 / 90, as there are 90 days in the Q1 that year, and for the month level it is 1/3 as there are 3 months in the quarter.
Given that there's no MDX style Ancestor() function in DAX, I ended up having to have a different calculation at different levels. Something that I learnt from this forum post is that you need to use IF() in order to isolate a calculation at a different level. Therefore the DAX that I came up with was:
=IF( COUNTROWS(VALUES('DimDate'[FullDateAlternateKey])) = 1,
1 / CALCULATE( COUNTROWS('DimDate'), ALL('DimDate'[FullDateAlternateKey], 'DimDate'[EnglishMonthName]) ),
1 / CALCULATE( COUNTROWS( DISTINCT( 'DimDate'[EnglishMonthName]) ), ALL('DimDate'[EnglishMonthName]) )
The COUNTROWS(VALUES('DimDate'[FullDateAlternateKey])) = 1 allows me to check that we're at the day level, and if we are at the day level, then count the number of days in the quarter. Otherwise, if we're at the month level we calculate the number of months in the current quarter. When this DAX is put into a new measure we get the following results:
This produces the same results as the MDX statement. The next step was then to apply this ratio to the Sales Amount Quota for the current quarter. This was achieved using the following DAX:
COUNTROWS( VALUES('DimDate'[FullDateAlternateKey]) ) < 90,
All('DimDate'[FullDateAlternateKey], 'DimDate'[EnglishMonthName]) ) * 'FactSalesQuota'[Ratio],
The < 90 at the start of the statement allows me to check if we're at the day level or the month level. If we're at either of these levels, then we want to pick up the Quarter-level SalesAmountQuota and multiply by the ratio. Otherwise, at the Quarter and Year levels, we just apply the original SalesAmountQuota value. This produces the following results:
The use of the MDX Scope statement would have been nice to pinpoint exactly where I wanted the calculation to occur, but then again, the use of IF() keeps it quite simple, which is better for an Excel power user. Whilst MDX is very very powerful, can you imagine a business user getting to grips with MDX statements such as Scope or Freeze?
The bottom line is that the DAX calculation produces exactly what I want - the new AllocatedSalesQuota measure is correct at all levels in the Calendar hierarchy. Even with an IF() statement, like everything else in PowerPivot, the calculation performs very quickly.
One of the often required calculations for users in reports is to be able to calculate a point-in-time closing balance, which is especially relevant when dealing with stock levels or financial data.
In the Analysis Services world, at least in the Enterprise Edition, we're lucky that we have a variety of Aggregation Functions that can deal with semi-additive measures such as a closing balance. Or we can always turn to MDX if we need to which will achieve the same thing.
In PowerPivot, we don't have Aggregation Functions, but there is DAX, which has a variety of time intelligence functions up its sleeve. The idea behind this post is to see how PowerPivot would deal with the requirement of reporting a closing balance measure.
DAX Time Intelligence Functions
One of the DAX functions that we can use for our closing balance is called ClosingBalanceMonth(), which will simply evaluate an expression at the end of a given month. There's also the similar ClosingBalanceQuarter() and ClosingBalanceYear().
Having noticed these functions within PowerPivot, my first port of call was to set up a simple PowerPivot model by taking data from the FactInternetSales table in AdventureWorksDW, plus all it's related dimensions of course. What I ended up with was the following simple pivot table in Excel 2010:
Sales Amount obviously isn't a semi-additive measure, but I'm treating it as one in this example. My first port of call was to use ClosingBalanceMonth() to try and get the value for the month to be equal to the value for last day in the month. I managed to come up with the following formula:
=CLOSINGBALANCEMONTH(Sum([SalesAmount]), 'DimDate'[FullDateAlternateKey], ALL('DimDate'[FullDateAlternateKey]))
This produces the results that are shown below, in the measure called ClosingBalanceFormula:
It does a good job of calculating a closing balance for each month, but it presents that closing monthly balance on each day. Therefore, I started looking for an alternative approach and found the LastDate() function. It returns the last date that's in context for the passed date column. The last date that's in context at the month level will be the last date in the month, but thankfully the last date that's in context at the day level will be the current day. Therefore I came up with the following formula:
='FactInternetSales'[Sum of SalesAmount](LASTDATE('DimDate'[FullDateAlternateKey]))
This produced the results that I wanted - the daily closing balance at the day level and the monthly closing balance at the month level:
A bit more digging lead me to find out that a combination of the Calculate() and LastDate() functions gets to the same result, as an alternative. That formula is:
It's a shame that the ClosingBalanceMonth() formula didn't quite work, but perhaps getting that approach to work will come out in the wash with more DAX experience. However, as with MDX or any kind of expression language, there's always going to be a load of different ways to get to the same result. It's much more efficient to work knowing that you've got several ways around a particular problem, rather than being forced down a particular path, and in fact, I think the range of calculation functions available shows the richness of the DAX language.
Technorati Tags: PowerPivot
A consultant that I know has recently been through the process of reviewing his company's management information requirements, and also the process of evaluating what technology platform can deliver those requirements.
Naturally, a data warehouse was one of the options. A data warehouse, in my opinion, will provide the enterprise-level platform for any organisation to deliver their management information.
In this case though, due diligence flagged up some alternative approaches to a data warehouse, namely the raft of self service BI tools that are out there, including PowerPivot. Other tools that can take data in from a variety of data sources are Tableau (which is also a good front end for Analysis Services) and Spotfire for example.
I think that all of these tools have their merits. PowerPivot is incredibly fast and easy to use, whereas Tableau and Spotfire for example have some handy data visualisation capabilities. Some vendors actually claim that self service tools can replace a data warehouse. Can this really be the case? Whereas these tools do a good job of presenting data in a dimensional model, my personal view is that a data warehouse of course also carries this out, but in addition delivers the following benefits to name a few:
- Data Cleansing - The power of SQL in the backend and also features like Fuzzy Matching in SSIS mean that you can address the huge problem of data quality. The bonus here is that all data quality issues could be logged, meaning that you can go back to the owners of source systems and let them know that their data is in a mess.
- Control - If you do encounter data quality issues, you may decide that they are actually too bad to load into the data warehouse and present to your users.
- Transparency - if you want, you can choose to expose the data quality level of your facts to users, via the Kimball-style audit dimension. E.g. we may have noticed that within our retail data, one store has abnormally low sales. Therefore, we can assign this fact a lower data quality score than the other sales by store for the day, which lets the users know that the data quality steward is aware of the issue, and is looking into it.
- Slowly Changing Dimensions - A lot of source systems don't behave. E.g. if we're tracking a customer's Marital Status of 'single' or 'married', you often find that source systems only record the latest change to a given field. Without addressing slowly changing dimension issues in a data warehouse, we may never know that Customer X is now single, but once was married.
- Conformed Dimensions - When we have multiple data sources for a single dimension, the goal is to ensure that each dimension available to end users represents a single, consistent view of that dimension. Due to the often complex matching and de-duplication that's required, it's difficult to see how this would be possible without a data warehouse.
- Late Arriving Dimension Rows - Going back to the 'Marital Status' attribute, we can get the situation where a source system only updates a particular field a long time after the event actually occurred. E.g, if we think that a customer called John Doe is single, but then today we're told he's married, but has actually been married for the last two years. If Marital Status is an important piece of management information to the business, then we need to update all data associated with John Doe over the last two years, in order to reflect his correct Marital Status over the last two years.
- Complex Transformations - I recently encountered a source system where one of the columns in a SQL table held data in a varying length format of ‘#Product Type#Currency#Value, #Product Type#Currency#Value’. So what you basically have here is potentially say 25 records held in one row in a SQL table, whereas another table might have just 7 records within one row. Rows could also be badly formed! We ended up using a bit of C# within SSIS to split this data out so that the sub-elements could be analysed correctly.
I could go on, but I think as a few simple examples the above will do.
Whilst I'm an advocate of data warehousing, I also think that the self service tools, such as PowerPivot, can be complementary to a data warehouse. You'll often find, for example, that there's some data that's not in the data warehouse yet - after all some organisations have a lot of data, and it will take time before all the business analysis etc can be carried out to get this data into the warehouse. Another example is proto-typing, analysts can very quickly build a proto-type model, which can then be production-ised at a later stage. I think this video from Chris Webb is really worth watching - it covers these examples and a few more.
In summary there's always going to be a need for users to go digging for data and producing their own reports - the data warehouse can't always deliver everything immediately. PowerPivot is going to be great in this situation, but it will work best when it's used alongside cleaned and conformed data, which is exactly what a data warehouse provides.
More Posts « Previous page
- Next page »