
Adatis BI Blogs

Slowly-Changing Parent-Child Dimensions – Part 2: The Implementation

First of all, an apology. It’s always hard to recognise whether your posts are useful for people out there, and this one slipped through the cracks. After several requests, I thought I’d finally get around to posting the related solution to a post I wrote 2 years ago! Apologies to anyone this could have helped quite some time ago.

To set the scene, I had to build a slowly-changing parent-child dimension and manage the relationships accordingly. I described the theory behind the solution in this post – probably best to give that a skim if you’re not familiar with the approach. To summarise – if a parent node changes, you create new SCD records for all of its descendants; essentially you create historic “branches” as each change happens.

Initial Data Setup

For the example, we’ll use the AdventureWorksDW2012 sample DB as it has an org chart we can very easily play with. You can find the download here. We’re going to create a new parent-child SCD dimension using dbo.DimOrganization by running the script below:

CREATE TABLE [dbo].[DimOrganizationSCD](
    [OrgKey] [int] IDENTITY(1,1),
    [ParentOrgKey] [int] NULL,
    [OrganizationID] [int] NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL,
    [StartDate] datetime NOT NULL,
    [EndDate] datetime NULL,
    [CurrentFlag] [int] NOT NULL
) ON [PRIMARY]
GO

INSERT INTO [dbo].[DimOrganizationSCD]
    (ParentOrgKey, OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag)
SELECT -1 [ParentOrgKey]
    ,[OrganizationKey]
    ,[ParentOrganizationKey]
    ,[PercentageOfOwnership]
    ,[OrganizationName]
    ,[CurrencyKey]
    ,'2014-01-01 00:00:00' [StartDate]
    ,NULL [EndDate]
    ,1 [CurrentFlag]
FROM [dbo].[DimOrganization]

UPDATE C
SET ParentOrgKey = P.OrgKey
FROM [dbo].[DimOrganizationSCD] C
INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID

Take a look in the table and you’ll see that we now have a separation between the unique business identifier for each organisational entity (OrganizationID) and its historical version (OrgKey).

The Org Change Occurs

Next, let’s make a change to track – maybe the North American operation has been split off as a subsidiary and we need to insert a hierarchy level for the new brand “SuperCycles US”. We’ll create an ‘updates’ table to represent the delta coming from our ETL and put two records in it – the new organisation record for SuperCycles, and the update to North America Operations to reflect its new parent ID.

CREATE TABLE [dbo].[DimOrganization_Updates](
    [OrganizationID] [int] NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL
) ON [PRIMARY]
GO

INSERT INTO [dbo].[DimOrganization_Updates]
VALUES (15, 1, 1, 'SuperCycles US', 100),
       (2, 15, 1, 'North America Operations', 100)
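Before running any ETL against it, it can be worth sanity-checking the delta against the dimension. Purely as an illustration, assuming the two tables created above, a quick comparison of each incoming parent against the parent currently held in the dimension might look like this:

--Illustrative check only: compare the incoming parent with the current dimension parent
SELECT U.OrganizationID,
    U.OrganizationName,
    D.ParentOrganizationID AS CurrentParentID,
    U.ParentOrganizationID AS IncomingParentID
FROM [dbo].[DimOrganization_Updates] U
LEFT JOIN [dbo].[DimOrganizationSCD] D
    ON D.OrganizationID = U.OrganizationID
    AND D.CurrentFlag = 1

A NULL CurrentParentID indicates a brand new member (SuperCycles US in this example), whilst a changed parent indicates a member whose branch is about to be re-versioned.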
The ETL Logic

For this example, I’ve written it as one big stored procedure, with all the logic occurring in one script. I’d normally break this into separate procedures, with SSIS handling data flows, lookups etc. and logging the results of individual scripts, but a single SQL flow is easier to demonstrate. Firstly, we create a couple of variables used within the proc:

--Create a table variable used to hold all records impacted by the hierarchy changes
DECLARE @Updates TABLE
(
    [OrganizationID] [int] NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL
);

--Use a common timestamp across all SCD updates
DECLARE @SCDDate datetime = getdate();

Now, as far as any children of North America Operations go, they technically haven’t changed and so we won’t see an update for them. Our ETL logic needs to take that into account, so we take our update rows and build a recursive CTE that finds all active child rows in our dimension:

--Use a recursive CTE to find all updates AND all descendants of updated records
WITH OrgUpdates AS
(
    --Our Changed Records
    SELECT U.OrganizationID,
        U.ParentOrganizationID,
        U.PercentageOfOwnership,
        U.OrganizationName,
        U.CurrencyKey
    FROM [dbo].[DimOrganization_Updates] U
    LEFT JOIN [dbo].[DimOrganizationSCD] D
        ON U.OrganizationID = D.OrganizationID
        AND D.CurrentFlag = 1

    UNION ALL

    --Our Descendants (this will recurse until no more descendants are found)
    SELECT D.OrganizationID,
        D.ParentOrganizationID,
        D.PercentageOfOwnership,
        D.OrganizationName,
        D.CurrencyKey
    FROM [dbo].[DimOrganizationSCD] D
    INNER JOIN OrgUpdates U
        ON D.ParentOrganizationID = U.OrganizationID
        AND D.CurrentFlag = 1
)
--Store the results of the CTE in a table variable
INSERT INTO @Updates
SELECT OrganizationID,
    ParentOrganizationID,
    PercentageOfOwnership,
    OrganizationName,
    CurrencyKey
FROM OrgUpdates

The recursion will run to a maximum depth of 100 by default; you’ll need to override this (via the MAXRECURSION query hint) if your parent-child structure has more than 100 levels. Hopefully this isn’t an issue for anyone!

We now have a table variable populated with each of the business IDs that relates to a changed record, or to a descendant of a changed record. From here on, we can follow a standard SCD Type 2 method – we stamp the existing records with an end date and amend the current flag, then insert the new records. The end-dating of the old records needs to be done first – once the new records have been inserted, both old and new versions would share the same current flag and we would no longer be able to tell them apart. So the historical update can be something like:

--Mark the existing records as old by finding all related records that are currently active.
--It is very important that we do this before inserting the new records as otherwise we would not be able to distinguish between old and new!
UPDATE D
SET EndDate = @SCDDate,
    CurrentFlag = 0
FROM @Updates U
INNER JOIN [dbo].[DimOrganizationSCD] D
    ON U.OrganizationID = D.OrganizationID
    AND D.CurrentFlag = 1

And then we insert the new records:

--Insert the new rows into the dimension, these will each have new surrogate IDs associated
--These will all have NULL ParentOrgKeys as we need to wait for all surrogates to be generated
INSERT INTO [dbo].[DimOrganizationSCD]
    (OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag)
SELECT OrganizationID,
    ParentOrganizationID,
    PercentageOfOwnership,
    OrganizationName,
    CurrencyKey,
    @SCDDate StartDate,
    NULL EndDate,
    1 CurrentFlag
FROM @Updates

Note that at this point, we’re taking the full cut of records from our “Updates” table, so this will include any new records.
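At this stage it can be reassuring to look at what has just been inserted. As a quick, purely illustrative check (run in the same batch, while @SCDDate is still in scope):

--Illustrative check only: the freshly inserted rows are current, share the batch StartDate
--and have no parent surrogate key yet
SELECT OrgKey,
    OrganizationID,
    ParentOrgKey,
    OrganizationName,
    StartDate
FROM [dbo].[DimOrganizationSCD]
WHERE CurrentFlag = 1
    AND StartDate = @SCDDate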
Our “SuperCycles US” record will have been inserted here, along with the updated records. The newly inserted records have deliberately not had their ParentOrgKey populated, as for many of the records the parent’s surrogate key has not yet been generated. We need to run an update on the dimension table after the insert to go through and fill this in.

--Finally, update the ParentOrgKey for all Current records
UPDATE C
SET ParentOrgKey = P.OrgKey
FROM [dbo].[DimOrganizationSCD] C
INNER JOIN [dbo].[DimOrganizationSCD] P
    ON C.ParentOrganizationID = P.OrganizationID
WHERE C.CurrentFlag = 1
    AND P.CurrentFlag = 1

And that’s it – we now have a new branch of the organisation chart to represent today’s changes. Taking another look at our dimension table, we now have a new set of active records representing the current org hierarchy.

Improvements

This particular logic has been written to demonstrate an approach; there are many improvements you may want to make if including it in a large, resilient ETL process. Firstly, as mentioned above, I would separate the logic into individual stored procedures that each return rowcounts for better logging. You can also amend the insert procedure to take original values if you need to mix and match SCD types and include a Type 3 “original value” column. Finally, you may want to include a HierarchyId-typed column and populate it using the full grandparent > parent > child surrogate key path; this allows you to query for such things as “all descendants” without having to build in recursion, which can vastly improve performance if you’re going to be running aggregations over huge volumes directly on the SQL layer.

Further Reading

Recursive Query using Common Table Expressions
Hierarchical Data

Code Extract

Putting all the code together, our ETL query is as follows:

--Create a table variable used to hold all records impacted by the hierarchy changes
DECLARE @Updates TABLE
(
    [OrganizationID] [int] NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL
);

--Use a common timestamp across all SCD updates
DECLARE @SCDDate datetime = getdate();

--Use a recursive CTE to find all updates AND all descendants of updated records
WITH OrgUpdates AS
(
    --Our Changed Records
    SELECT U.OrganizationID,
        U.ParentOrganizationID,
        U.PercentageOfOwnership,
        U.OrganizationName,
        U.CurrencyKey
    FROM [dbo].[DimOrganization_Updates] U
    LEFT JOIN [dbo].[DimOrganizationSCD] D
        ON U.OrganizationID = D.OrganizationID
        AND D.CurrentFlag = 1

    UNION ALL

    --Our Descendants (this will recurse until no more descendants are found)
    SELECT D.OrganizationID,
        D.ParentOrganizationID,
        D.PercentageOfOwnership,
        D.OrganizationName,
        D.CurrencyKey
    FROM [dbo].[DimOrganizationSCD] D
    INNER JOIN OrgUpdates U
        ON D.ParentOrganizationID = U.OrganizationID
        AND D.CurrentFlag = 1
)
--Store the results of the CTE in a table variable
INSERT INTO @Updates
SELECT OrganizationID,
    ParentOrganizationID,
    PercentageOfOwnership,
    OrganizationName,
    CurrencyKey
FROM OrgUpdates

--Mark the existing records as old by finding all related records that are currently active.
--It is very important that we do this before inserting the new records as otherwise we would not be able to distinguish between old and new!
UPDATE D
SET EndDate = @SCDDate,
    CurrentFlag = 0
FROM @Updates U
INNER JOIN [dbo].[DimOrganizationSCD] D
    ON U.OrganizationID = D.OrganizationID
    AND D.CurrentFlag = 1

--Insert the new rows into the dimension, these will each have new surrogate IDs associated
--These will all have NULL ParentOrgKeys as we need to wait for all surrogates to be generated
INSERT INTO [dbo].[DimOrganizationSCD]
    (OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag)
SELECT OrganizationID,
    ParentOrganizationID,
    PercentageOfOwnership,
    OrganizationName,
    CurrencyKey,
    @SCDDate StartDate,
    NULL EndDate,
    1 CurrentFlag
FROM @Updates

--Finally, update the ParentOrgKey for all Current records
UPDATE C
SET ParentOrgKey = P.OrgKey
FROM [dbo].[DimOrganizationSCD] C
INNER JOIN [dbo].[DimOrganizationSCD] P
    ON C.ParentOrganizationID = P.OrganizationID
WHERE C.CurrentFlag = 1
    AND P.CurrentFlag = 1
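If you want to see the branch that has just been created, a recursive query over the current rows will render the hierarchy. This isn’t part of the ETL above, just a sketch against the tables created earlier:

--Sketch: walk the current version of the hierarchy from the root down
WITH OrgTree AS
(
    SELECT OrgKey, ParentOrgKey, OrganizationName, 0 AS OrgLevel
    FROM [dbo].[DimOrganizationSCD]
    WHERE CurrentFlag = 1
        AND (ParentOrgKey = -1 OR ParentOrgKey IS NULL) --the root keeps the dummy parent key from the initial load

    UNION ALL

    SELECT C.OrgKey, C.ParentOrgKey, C.OrganizationName, P.OrgLevel + 1
    FROM [dbo].[DimOrganizationSCD] C
    INNER JOIN OrgTree P
        ON C.ParentOrgKey = P.OrgKey
    WHERE C.CurrentFlag = 1
)
SELECT OrgKey, ParentOrgKey, OrganizationName, OrgLevel
FROM OrgTree
ORDER BY OrgLevel, OrganizationName

Swapping the CurrentFlag filters for a StartDate/EndDate range should give you the equivalent view of the hierarchy as it stood at any earlier point in time.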

Introduction to Slowly Changing Dimensions (SCD) Types

When setting about writing a recent blog post, I wanted to link to a clear, concise introduction to the different SCD types for anyone not familiar with the topic. Whilst there are a variety of thorough introductions out there, I didn’t find one as clear and concise as I’d like. I therefore give you my own offering: a quick introduction to Slowly Changing Dimensions, or SCD, in a data warehousing scenario. For a more detailed discussion of slowly changing dimensions, I’d suggest looking at the Kimball Group’s own posts on type 1 and types 2 and 3.

What are slowly changing dimensions?

When organising a data warehouse into Kimball-style star schemas, you relate fact records to a specific dimension record with its related attributes. But what if the information in the dimension changes? Do you now associate all fact records with the new value? Do you ignore the change to keep historical accuracy? Or do you treat facts before the dimension change differently from those after? It is this decision that determines whether to make your dimension a slowly changing one. There are several different types of SCD depending on how you treat incoming change.

What are the types of SCD?

Very simply, there are six types of Slowly Changing Dimension that are commonly used. They are as follows:

Type 0 – Fixed Dimension: no changes allowed, the dimension never changes.
Type 1 – No History: update the record directly; there is no record of historical values, only the current state.
Type 2 – Row Versioning: track changes as version records, with a current flag, active dates and other metadata.
Type 3 – Previous Value Column: track change to a specific attribute by adding a column to show the previous value, which is updated as further changes occur.
Type 4 – History Table: show the current value in the dimension table, but track all changes in a separate table.
Type 6 – Hybrid SCD: utilise techniques from SCD Types 1, 2 and 3 to track change.

In reality, only types 0, 1 and 2 are widely used, with the others reserved for very specific requirements. Confusingly, there is no SCD type 5 in commonly agreed definitions. After you have implemented your chosen dimension type, you can then point your fact records at the relevant business or surrogate key. Surrogate keys in these examples relate to a specific historical version of the record, removing join complexity from later data structures.

Practical Examples

We have a very simple ‘customer’ dimension with just two attributes – Customer Name and Country. However, Bob has just informed us that he has now moved to the US, and we want to update our dimension record to reflect this. We can see how the different SCD types will handle this change, and the pros and cons of each method.

Type 0

Our table remains the same. This means our existing reports will continue to show the same figures – maybe it is a business requirement that each customer is always allocated to the country they signed up from. All future transactions associated with Bob will also be allocated to the ‘United Kingdom’ country.

Type 1

The table is updated to reflect Bob’s new country. All fact records associated with Bob will now be associated with the ‘United States’ country, regardless of when they occurred. We often just want to see the current value of a dimension attribute – it could be that the only dimension changes that occur are corrections to mistakes, or maybe there is no requirement for historical reporting.
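As a minimal sketch of the Type 1 approach, assuming a DimCustomer table with the two attributes above (the table and column names here are purely illustrative):

--Type 1: overwrite the value in place, so no history of the old country survives
UPDATE dbo.DimCustomer
SET Country = 'United States'
WHERE CustomerName = 'Bob';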
Type 2

In order to support type 2 changes, we need to add four columns to our table:

· Surrogate Key – the original ID will no longer be sufficient to identify the specific record we require, so we need to create a new ID that the fact records can join to specifically.
· Current Flag – a quick method of returning only the current version of each record.
· Start Date – the date from which the specific historical version is active.
· End Date – the date to which the specific historical version is active.

With these elements in place, our table will hold a row for every version of each customer. This method is very powerful – you maintain the history for the entire record and can easily perform change-over-time analysis. However, it also comes with more maintenance overhead, increased storage requirements and potential performance impacts if used on very large dimensions. Type 2 is the most common method of tracking change in data warehouses.

Type 3

Here, we add a new column called “Previous Country” to track what the last value for our attribute was. Note how this will only provide a single historical value for Country. If the customer changes his name, we will not be able to track it without adding a new column. Likewise, if Bob moved country again, we would either need to add further “Previous Previous Country” columns or lose the fact that he once lived in the United Kingdom.

Type 4

There is no change to our existing table here; we simply update the record as if a Type 1 change had occurred. However, we simultaneously maintain a history table to keep track of these changes – the dimension table holds only the current values, whilst the Type 4 history table holds every version. Depending on your requirements, you may place both the ID and the Surrogate Key onto the fact record so that you can optimise performance whilst maintaining functionality. Separating out the historical data makes your dimension smaller, which reduces complexity and improves performance if the majority of uses only need the current value. However, if you do require historical values, this structure adds complexity and data redundancy overheads. It is generally assumed that a system will use Type 1 or Type 2 rather than Type 4.

Type 6

The ‘Hybrid’ method simply takes SCD Types 1, 2 and 3 and applies all three techniques. We maintain a history of all changes whilst simultaneously updating a “current value” column on all records. This gives you the ability to provide an element of change comparison without additional calculation, whilst still maintaining a full, detailed history of all changes in the system. Personally, if this requirement came up, I would avoid the data redundancy of the extra column and simply calculate the current value at run-time using the LAST_VALUE() window function, although this depends on your priorities between data storage and direct querying performance.
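To illustrate that last point, the current value can be derived at query time from the Type 2 history rather than stored on every row. A sketch, assuming a Type 2 customer dimension keyed on CustomerID with a StartDate per version (column names are assumptions):

--Derive the current country for each customer from the version history
SELECT CustomerID,
    CustomerName,
    Country,
    StartDate,
    LAST_VALUE(Country) OVER (
        PARTITION BY CustomerID
        ORDER BY StartDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS CurrentCountry
FROM dbo.DimCustomer;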

Data Quality Services - Kimball ETL Screening

Although it’s arguably not the most popular in the Kimball series, The Data Warehouse ETL Toolkit contains a huge amount of vital information that’s focussed on the ETL side of data warehousing. In particular, I’ve found the data quality screens design to be useful, so I thought it might be worth exploring whether this concept can be applied to Data Quality Services.

What is a Kimball Data Quality Screen?

A screen is simply a check that is made against data that comes into the ETL pipeline, and is used to measure data quality. Physically, the Kimball screen design is typically implemented as a table called Screen, which contains, as rows, the actual checks to perform against staged data. In addition, the Screen table contains metadata to aid data quality measurement, such as screen severity, category, source system and possibly exception actions. Each time a screen yields some results (i.e. data quality issues), another table, the Error Event Fact, is populated with the results. Therefore, by using the data generated by the screening process, the Error Event Fact becomes the place to look for an overall measurement of data quality. The Kimball book gives a schema for this design, which I’ve adopted with a few modifications. The check performed by each screen is typically a snippet of SQL that checks for a particular condition, e.g. is the customer’s postcode missing, or is the tax amount within permitted ranges?

Can Data Quality Services assist with ETL screening?

DQS is capable of auditing and checking data, as each domain within a knowledge base can have its own domain rules. In order to see how this might work as Kimball screens, I’ve set up a simple knowledge base around purchase orders in the DQS client. Within the Delivery Charge domain, I’ve set up a domain rule called ‘Delivery Charge Negative’ to ensure that the delivery charge cannot be negative. I’ve done something similar with the PO Date domain, setting up a rule to state that the PO date must be greater than 01/01/2012. Data run against the PO knowledge base will now be enforced against these two rules, so these are effectively my two Kimball-style screens. To recap, DQS cleansing can be run either from the Data Quality Client or from SSIS.

Populating the Error Event Fact with SSIS

I’m going to be running my DQS domain rules via SSIS, as my end goal is to populate the Error Event Fact. Remember, if any DQ issues occur, I want the Error Event Fact to know about them. In this scenario, the data from the source system has already been staged to a SQL table; this staged data is passed to the DQS Cleansing SSIS task, with the columns mapped to the domains in the DQS knowledge base. Once the package is run, the data is passed to DQS, which runs it against its knowledge base, including the domain rules. Looking at the data returned from DQS, the PO55 row has failed both rules, whereas the other two rows have failed one rule each.

Now we know exactly which rows have failed each data quality screen. The only trouble is that the data is in the wrong format for our Error Event Fact. We want a total of 4 rows in the Error Event Fact, as PO55 has failed 2 screens, whereas the other two rows have failed 1 screen each. Therefore, we carry out an unpivot before inserting into the Error Event Fact.
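In the package this is done with an Unpivot transformation. Purely as an illustration of the reshaping involved, a SQL equivalent might look like the sketch below – the staging table and status column names are assumptions rather than the actual DQS output columns, and the screen IDs are hard coded:

--Sketch only: one error event row per failed screen, per purchase order
INSERT INTO dbo.ErrorEventFact (ScreenId, PurchaseOrderId, EventDate)
SELECT Failure.ScreenId,
    C.PurchaseOrderId,
    GETDATE()
FROM dbo.DQSCleansedPurchaseOrders C
CROSS APPLY (VALUES
    (1, C.DeliveryChargeStatus),  --screen: delivery charge cannot be negative
    (2, C.PODateStatus)           --screen: PO date must be after 01/01/2012
) AS Failure (ScreenId, ColumnStatus)
WHERE Failure.ColumnStatus = 'Invalid';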
For this blog post my ScreenId is hard coded, but in reality you would have to look up the appropriate ID for each DQS screen. The full SSIS package simply strings these steps together – source, DQS cleansing, unpivot and the insert into the Error Event Fact.

Data Quality Reporting

With the Error Event Fact populated, a variety of reports can be produced. In this case I’ve made a very simple Analysis Services cube and built a quick report in Excel 2013, using slicers, sparklines and also the new Timeline feature. From this report I can see which screen has failed the most rows, see a trend of data quality over time and filter on specific issues.

Summary

With a bit of work, DQS is capable of producing the rows needed for the Kimball Error Event Fact, but how good is it? I see the advantages of using DQS as follows:

· A front-end user interface allows data stewards to set up their own rules to screen the data – much better than holding rules in SSIS or manually in a table;
· DQS becomes a one-stop shop for all data quality as, in addition to holding the screens as domain rules, DQS will clean your data for you, according to the valid/invalid domain values that you set up in the knowledge base.

There are a few disadvantages too:

· The domain rules are not as flexible as pure SQL or SSIS, meaning that you may not be able to express all the screening rules that you had in mind;
· DQS can’t maintain all the metadata that you need to implement Kimball screening, so you’ll need to supplement it with some SQL tables to hold things like screen severity, screen type and so on;
· Even though the performance has been improved in CU1, it’s still not suitable for large amounts of data. Here’s an overview of how long you can expect the SSIS cleansing to take.

So my view is that if you’re already using DQS, your data stewards like using it to manage data quality, and your data volumes are not too big, then implement the classic Kimball screening design using DQS. But there are more cons than pros unfortunately, so if you’ve not already made an investment in DQS and you have the time to invest in your own Kimball-style screening solution, then I think you can come up with something that’s a bit more flexible and scalable.

Microsoft Data Warehouse Toolkit Second Edition - Book Review

There is now a second edition of the Kimball Microsoft Data Warehouse Toolkit, which contains updated material for SQL Server 2008 R2. The book is out now in the US and will be released in the UK towards the end of March. I’ve got hold of a copy, so I thought it would be worth giving it a review.

As with the previous edition, the new version of the book starts off by covering the business requirements, before explaining the core concepts associated with the Kimball approach, such as surrogate keys and slowly changing dimensions. An overview of the Microsoft BI stack is given, before explaining how to build each part of the data warehouse in turn, covering the database engine, SSIS, SSAS, SSRS and SharePoint. There are also new sections on both Master Data Services and PowerPivot – personally I found the PowerPivot chapter really interesting – there’s a good explanation of how PowerPivot can complement the data warehouse, as well as how to use it, of course. The latter part of the book explains management of the data warehouse, covering security, backups, metadata and maintenance. The final chapter, “Present Imperatives and Future Outlook”, includes a narrative on common problems in DW projects, as well as an honest appraisal of the MS BI stack which you can’t help but agree with.

So the burning question is: should you buy this book? That needs to be answered from two perspectives: 1) if you own the original, and 2) if you don’t own the original. If you do own the original, then you’ll find that there’s a huge amount of content that’s the same across the two editions. There is new content – particularly around MDS, PowerPivot, Change Data Capture and compression – as well as updates to some of the old content. £30 is quite a lot to pay for a book that’s similar to one you already own, but overall I think it’s worth going for. If you don’t own the original, then you will find this book to be an extremely useful asset in implementing MS data warehousing solutions, and I wouldn’t hesitate to buy it.

One point I will make is that this doesn’t represent an exhaustive how-to for each of the components (e.g. SSAS), which I personally think is fine. I know a handful of people made this gripe about the first edition, but realistically there’s only so much you can cram into 648 pages on the huge subject of data warehousing AND then cover each element of the MS BI stack.

PowerPivot and Data Warehousing

A consultant that I know has recently been through the process of reviewing his company’s management information requirements, and also of evaluating which technology platform can deliver those requirements. Naturally, a data warehouse was one of the options. A data warehouse, in my opinion, will provide the enterprise-level platform for any organisation to deliver their management information. In this case though, due diligence flagged up some alternative approaches to a data warehouse, namely the raft of self-service BI tools that are out there, including PowerPivot. Other tools that can take data in from a variety of data sources are Tableau (which is also a good front end for Analysis Services) and Spotfire, for example. I think that all of these tools have their merits. PowerPivot is incredibly fast and easy to use, whereas Tableau and Spotfire, for example, have some handy data visualisation capabilities.

Some vendors actually claim that self-service tools can replace a data warehouse. Can this really be the case? Whereas these tools do a good job of presenting data in a dimensional model, my personal view is that a data warehouse of course also carries this out, but in addition delivers the following benefits, to name a few:

Data Cleansing – The power of SQL in the back end, and features like fuzzy matching in SSIS, mean that you can address the huge problem of data quality. The bonus here is that all data quality issues could be logged, meaning that you can go back to the owners of source systems and let them know that their data is in a mess.

Control – If you do encounter data quality issues, you may decide that they are actually too bad to load into the data warehouse and present to your users.

Transparency – If you want, you can choose to expose the data quality level of your facts to users, via the Kimball-style audit dimension. E.g. we may have noticed that within our retail data, one store has abnormally low sales. Therefore, we can assign this fact a lower data quality score than the other sales by store for the day, which lets the users know that the data quality steward is aware of the issue and is looking into it.

Slowly Changing Dimensions – A lot of source systems don’t behave. E.g. if we’re tracking a customer’s Marital Status of ‘single’ or ‘married’, you often find that source systems only record the latest change to a given field. Without addressing slowly changing dimension issues in a data warehouse, we may never know that Customer X is now single, but once was married.

Conformed Dimensions – When we have multiple data sources for a single dimension, the goal is to ensure that each dimension available to end users represents a single, consistent view of that dimension. Due to the often complex matching and de-duplication that’s required, it’s difficult to see how this would be possible without a data warehouse.

Late Arriving Dimension Rows – Going back to the ‘Marital Status’ attribute, we can get the situation where a source system only updates a particular field a long time after the event actually occurred. E.g. we think that a customer called John Doe is single, but then today we’re told he’s married, and has actually been married for the last two years. If Marital Status is an important piece of management information to the business, then we need to update all data associated with John Doe over the last two years, in order to reflect his correct Marital Status over that period.
Complex Transformations – I recently encountered a source system where one of the columns in a SQL table held data in a varying-length format of ‘#Product Type#Currency#Value, #Product Type#Currency#Value’. So what you basically have here is potentially, say, 25 records held in one row of a SQL table, whereas another row might hold just 7 records. Rows could also be badly formed! We ended up using a bit of C# within SSIS to split this data out so that the sub-elements could be analysed correctly.

I could go on, but I think the few simple examples above will do. Whilst I’m an advocate of data warehousing, I also think that the self-service tools, such as PowerPivot, can be complementary to a data warehouse. You’ll often find, for example, that there’s some data that’s not in the data warehouse yet – after all, some organisations have a lot of data, and it will take time before all the business analysis etc. can be carried out to get this data into the warehouse. Another example is prototyping: analysts can very quickly build a prototype model, which can then be productionised at a later stage. I think this video from Chris Webb is really worth watching – it covers these examples and a few more.

In summary, there’s always going to be a need for users to go digging for data and producing their own reports – the data warehouse can’t always deliver everything immediately. PowerPivot is going to be great in this situation, but it will work best when it’s used alongside cleaned and conformed data, which is exactly what a data warehouse provides.

Master Data Services - Reversing Transactions

MDM tools give control of the enterprise master data to the data stewards and power users, rather than relying on automated data integration alone. Master Data Services is no exception. One of the ways that this is true for MDS is that it allows users to inspect the transactions that have occurred (either internal to MDS or from a source system) and choose whether they want to reverse them. In order to achieve this, MDS has a useful log of all transactions that’s viewable by users. Here’s an example of some transactions that have occurred in my test system – some are from data that I’ve loaded up via the staging tables, some are from manual member additions that I’ve carried out in the front end, and some are from business rules that have automatically run.

In the model that this example is taken from, I’ve got some business rules that look to address data quality issues. Taking the Kimball view on data quality issues in a data warehousing context – many can, and should, be addressed in the source system and then re-loaded. That isn’t always possible, which is one of the reasons why we have business rules in MDS. However, as good as any automated rule is, there are always exceptions. In the transactions shown above, an automatic business rule has run that checks a Customer’s overdraft limit and sets it to 10,000 if it’s over 10,000. Therefore, when a value of 50,000 was encountered for Member Code 10311, the MDS business rules kicked in and quite correctly did their job. This was not what I wanted in this particular case.

Thankfully, we can click on the undo button shown above the grid and reverse a chosen transaction, whether it’s come from a source system, a business rule or a manual edit. It doesn’t seem possible to reverse many transactions at once, but that may just be due to the CTP. In my example, by selecting the first transaction in the list and clicking the undo button, I’ve reversed my automatic business rule. Therefore, the user Kaylee Adams (10311) now has her original overdraft limit.

In conclusion, when some sort of manual intervention is needed to successfully manage master data, MDM tools allow that intervention to come from the power users, rather than having to wait for someone technical to address the issue.