Adatis

Adatis BI Blogs

Slowly-Changing Parent-Child Dimensions – Part 2: The Implementation

Distributing the slides from my talk last night, it was well received and I think everyone was able to take something away. Credit goes to Marco Russo and Alberto Ferrari for their data modelling ideas, visit www.sqlbi.com  http://dl.dropbox.com/u/25925341/Learning%27s%20from%20large%20scale%20dw%20project.pptx Hopefully not my last talk as i enjoyed doing it.  Look out for further Bath based events at: http://avonim.wordpress.com/ All comments welcome CalvinFirst of all, an apology. It’s always hard to recognise whether your posts are useful for people out there and this one slipped through the cracks. After several requests, I thought I’d finally get around to posting the related solution to a post I wrote 2 years ago! Apologies to anyone this could have helped quite some time ago. To set the scene, I had to build a slowly-changing parent child dimension and manage the relationships accordingly. I described the theory behind the solution in this post, probably best give that a skim if you’re not familiar with the approach. To summarise – if a parent node changes, you create new SCD records for all descendants, essentially you create historic “branches” as each change happens. Initial Data Setup For the example, we’ll use the AdventureWorksDW2012 sample DB as it has an org chart we can very easily play with. You can find the download here. We’re going to create a new parent-child SCD dimension using dbo.DimOrganization by running the script below:CREATE TABLE [dbo].[DimOrganizationSCD]( [OrgKey] [int] IDENTITY(1,1), [ParentOrgKey] [int] NULL, [OrganizationID] [int] NOT NULL, [ParentOrganizationID] [int] NULL, [PercentageOfOwnership] [nvarchar](16) NULL, [OrganizationName] [nvarchar](50) NULL, [CurrencyKey] [int] NULL, [StartDate] datetime NOT NULL, [EndDate] datetime NULL, [CurrentFlag] [int] NOT NULL ) ON [PRIMARY] GO INSERT INTO [dbo].[DimOrganizationSCD] (ParentOrgKey, OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag) SELECT -1 [ParentOrgKey] ,[OrganizationKey] ,[ParentOrganizationKey] ,[PercentageOfOwnership] ,[OrganizationName] ,[CurrencyKey] ,'2014-01-01 00:00:000' [StartDate] ,NULL [EndDate] ,1 [CurrentFlag] FROM [dbo].[DimOrganization] UPDATE C SET ParentOrgKey = P.OrgKey FROM [dbo].[DimOrganizationSCD] C INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID Take a look in the table and you’ll see something like this: We now have a separation between the unique business identifier for each organisational entity (OrganizationID), and its historical version (OrganizationKey). The Org Change Occurs Next let’s make a change to track – maybe the North American operation has been split off as a subsidiary and we need to insert a hierarchy level for the new brand “SuperCycles US”. We’ll create an ‘updates’ table to represent the delta coming from our ETL and put in two records – the new organisation record for SuperCycles, and the update to North American Operation to represent the new Parent ID.CREATE TABLE [dbo].[DimOrganization_Updates]( [OrganizationID] [int] NOT NULL, [ParentOrganizationID] [int] NULL, [PercentageOfOwnership] [nvarchar](16) NULL, [OrganizationName] [nvarchar](50) NULL, [CurrencyKey] [int] NULL ) ON [PRIMARY] GO INSERT INTO [dbo].[DimOrganization_Updates] VALUES (15, 1, 1, 'SuperCycles US', 100), (2, 15, 1, 'North America Operations', 100) The ETL Logic For this example, I’ve written it as one big stored procedure, will all the logic occurring in one script. I’d normally break this into separate procedures with SSIS handling data flows, lookups etc and logging results of individual scripts, but a single SQL flow is easier to demonstrate. Firstly, we create a couple of variables used within the proc:--Createa a table used to hold all records impacted by the hierarchy changes DECLARE @Updates TABLE ( [OrganizationID] [int] NOT NULL, [ParentOrganizationID] [int] NULL, [PercentageOfOwnership] [nvarchar](16) NULL, [OrganizationName] [nvarchar](50) NULL, [CurrencyKey] [int] NULL ); --Use a common timestamp across all SCD updates DECLARE @SCDDate datetime = getdate(); Now, as far as any children of the North American Operation go, they technically haven’t changed and so we won’t see an update for them. So our ETL logic needs to take that into account. We take our update rows and build a recursive CTE that finds all active child rows in our dimension:--Use a recursive CTE to find all updates AND all descendants of updated records WITH OrgUpdates AS ( --Our Changed Records SELECT U.OrganizationID, U.ParentOrganizationID, U.PercentageOfOwnership, U.OrganizationName, U.CurrencyKey FROM [dbo].[DimOrganization_Updates] U LEFT JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID and D.CurrentFlag = 1 UNION ALL --Our Descendants (this will recurse until no more descendants are found) SELECT D.OrganizationID, D.ParentOrganizationID, D.PercentageOfOwnership, D.OrganizationName, D.CurrencyKey FROM [dbo].[DimOrganizationSCD] D INNER JOIN OrgUpdates U on D.ParentOrganizationID = U.OrganizationID and D.CurrentFlag = 1 ) --Store the results of the CTE in a table variable INSERT INTO @Updates SELECT OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey FROM OrgUpdates This will loop a maximum of 100 times by default, you’ll need to override this if your parent-child structure has more than 100 levels. Hopefully this isn’t an issue for anyone! We now have a table variable populated with each of the business IDs that relates to a changed record, or a descendant of a changed record. From here on, we can follow a standard SCD Type 2 method – we stamp the existing records with an end date & amend the current flag, then insert the new records. The end-dating of the old records needs to be done first, as it is easier to identify the new records by their current flag once they have been inserted. So the historical update can be something like:--Mark the existing records as old by finding all related records that are currently active. --It is very important that we do this before inserting the new records as otherwise we would not be able to distinguish between old and new! UPDATE D SET EndDate = @SCDDate, CurrentFlag = 0 FROM @Updates U INNER JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID and D.CurrentFlag = 1 And then we insert new records:--Insert the new rows into the dimension, these will each have new surrogate IDs associated --These will all have NULL ParentOrganizationKeys as we need to wait for all surrogates to be generated INSERT INTO [dbo].[DimOrganizationSCD] (OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag) SELECT OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, @SCDDate StartDate, NULL EndDate, 1 CurrentFlag FROM @Updates Note that at this point, we’re taking the full cut of records from our “Updates” table, so this will include any new records. Our “SuperCycles US” record will have been inserted here, along with the updated records. The newly inserted records have deliberately not had their ParentOrganizationKey populated as for many of the record, the parent’s surrogate key has not yet been generated. We need to run an update on the dimension table after the insert to go through and fill this in.--Finally, update the ParentOrganizationKey for all Current records UPDATE C SET ParentOrgKey = P.OrgKey FROM [dbo].[DimOrganizationSCD] C INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID WHERE C.CurrentFlag = 1 AND P.CurrentFlag = 1 And that’s it, we now have a new branch of the organisation chart to represent today’s changes. Taking another look at our dimension table, we now have a new set of active records representing the current Org hierarchy. Improvements This particular logic has been written to demonstrate an approach, there are many improvements you may want to make if including in a large, resilient ETL process. Firstly, as mentioned above, I would separate the logic into individual stored procedures that each returned rowcounts for better logging. You can also amend the insertion procedure to take original values if you need to mix and match SCD types and include a Type 3 “original value” column. Finally, you may want to include a HierarchyId type column and populate using the full grandparent>parent>child surrogate key path, this allows you to query for such things as “all descendants” without having to build in recursion. This can vastly improve performance if you’re going to be running aggregations over huge volumes directly on the SQL layer. Further Reading Recursive Query using Common Table Expressions Hierarchal Data Code Extract Putting all the code together, our ETL query is as follows:--Create a table used to hold all records impacted by the hierarchy changes DECLARE @Updates TABLE ( [OrganizationID] [int] NOT NULL, [ParentOrganizationID] [int] NULL, [PercentageOfOwnership] [nvarchar](16) NULL, [OrganizationName] [nvarchar](50) NULL, [CurrencyKey] [int] NULL ); --Use a common timestamp across all SCD updates DECLARE @SCDDate datetime = getdate(); --Use a recursive CTE to find all updates AND all descendants of updated records WITH OrgUpdates AS ( --Our Changed Records SELECT U.OrganizationID, U.ParentOrganizationID, U.PercentageOfOwnership, U.OrganizationName, U.CurrencyKey FROM [dbo].[DimOrganization_Updates] U LEFT JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID and D.CurrentFlag = 1 UNION ALL --Our Descendants (this will recurse until no more descendants are found) SELECT D.OrganizationID, D.ParentOrganizationID, D.PercentageOfOwnership, D.OrganizationName, D.CurrencyKey FROM [dbo].[DimOrganizationSCD] D INNER JOIN OrgUpdates U on D.ParentOrganizationID = U.OrganizationID and D.CurrentFlag = 1 ) --Store the results of the CTE in a table variable INSERT INTO @Updates SELECT OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey FROM OrgUpdates --Mark the existing records as old by finding all related records that are currently active. --It is very important that we do this before inserting the new records as otherwise we would not be able to distinguish between old and new! UPDATE D SET EndDate = @SCDDate, CurrentFlag = 0 FROM @Updates U INNER JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID and D.CurrentFlag = 1 --Insert the new rows into the dimension, these will each have new surrogate IDs associated --These will all have NULL ParentOrganizationKeys as we need to wait for all surrogates to be generated INSERT INTO [dbo].[DimOrganizationSCD] (OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag) SELECT OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, @SCDDate StartDate, NULL EndDate, 1 CurrentFlag FROM @Updates --Finally, update the ParentOrganizationKey for all Current records UPDATE C SET ParentOrgKey = P.OrgKey FROM [dbo].[DimOrganizationSCD] C INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID WHERE C.CurrentFlag = 1 AND P.CurrentFlag = 1

SQL - Using The MERGE Statement To Apply Type 2 SCD Logic

Introduced in SQL 2008 the merge function is a useful way of inserting, updating and deleting data inside one SQL statement. In the example below I have 2 tables one containing historical data using type 2 SCD (Slowly changing dimensions) called DimBrand and another containing just the latest dimension data called LatestDimBrand. Using the merge function I will insert new records from LatestDimBrand into DimBrand, I will archive (apply an end date) to any DimBrand records which do not appear in the latest data, and finally enter a new record and archive the old record for any Brands which may have changed. DimBrand (the target of our Inserts, Updates, Deletes) and DimLatestBrand (the source for Inserts,Updates,Deletes):      The merge code in it’s entirety:INSERT #DimBrand ([BrandCode],[BrandName],[StartDate]) SELECT [BrandCode],[BrandName],getdate() FROM ( MERGE #DimBrand AS Target USING ( SELECT [BrandCode],[BrandName],[StartDate],[EndDate] FROM #LatestDimBrand ) AS Source ON (Target.[BrandCode] = Source.[BrandCode]) ------------------------------- WHEN MATCHED AND Target.[BrandName] <> Source.[BrandName] THEN UPDATE SET Target.[EndDate] = getdate() ------------------------------- WHEN NOT MATCHED BY TARGET THEN INSERT ( [BrandCode] ,[BrandName] ,[StartDate] ,[EndDate] ) VALUES ( Source.[BrandCode], Source.[BrandName], Source.[StartDate], Source.[EndDate] ) ------------------------------- WHEN NOT MATCHED BY SOURCE THEN UPDATE SET Target.[EndDate] = getdate() ------------------------------- OUTPUT $Action, Source.* ) As i([Action],[BrandCode],[BrandName],[StartDate],[EndDate]) ------------------------------- WHERE [Action] = 'UPDATE' AND BrandCode IS NOT NULL   The insert statement, although appearing at the top of the SQL statement is the last thing to be executed and uses the results set of the merge function specified in the OUTPUT clause further down. Next is the start of our merge statement here we specify a Target and a Source table. We also specify which columns the 2 tables should match on, in this case only the BrandCode. The next step is to specify and handle any matches, here we are looking for Brand Code’s which appear in both tables, but with differing Brand Names. If a match is found the row in the target table is given an end date.  As with the following “NOT MATCHED” section inserts,updates or deletes can be applied to either table here. This handles the first part of our Slowly Changing Dimension requirement. Where there is no match between the Target table and the Source table, the relevant records from the Source table are inserted into the Target table. The penultimate part of the merge statement takes any Target records which do not appear in our “Latest” Source table and sets an EndDate for them as they are no longer a valid record. The OUTPUT clause populates the outer INSERT statement, here the Source rows which are involved in the UPDATE statements above are pulled out to the outer FROM clause. We have also selected $Action so that we can filter out the INSERTED rows under the NOT MATCHED by Target statement. The result of all the above is the brand “Googles” is archived along with the “MikeRowSoft” which as you can see from the BrandCode has been rebranded as MickRowSoft.

Slowly-Changing Parent-Child Dimensions – Part 1: The Theory

A common structure in OLTP systems is a parent-child relationship with Object and ObjectParent tables creating a recursive structure. This is easily represented as a Warehouse dimension table, usually flattened out but occasionally left as native Parent-Child if required. I recently encountered an issue where the client not only required a flexible, dynamic parent-child dimension but also required it to be slowly changing. Each fact record joining to the dimension, at any granularity, had to be aware of its hierarchal context at that point in time, despite there only being one source record for dimension object. We end up with something like this: Throughout the course of the engagement I implemented two different models for this as requirements changed, I’ll detail the solutions in part 2 of this post. If you’re not familiar with Slowly Changing Dimensions, namely Type 2 SCD, I put together a quick introduction in a blog post here to bring you up to speed. So how do you slowly change a Parent-Child relationship? Since we have source records for each node in our hierarchy, it makes sense to keep this structure. However, over time that same node can fit into the hierarchy in different places, changing parent records, gaining/losing children or even moving levels. We therefore need to differentiate between the different hierarchal contexts of node. As with other SCD implementations, we give each historical version a surrogate key, so we can accurately identify the node in the relevant context. Parent-Child makes this tricky however – if we just created historical versions when the individual nodes changed, records joining in at lower granularities would not know which historical version to use. We therefore need to amend child nodes to point to the new surrogate keys. This, in turn, means we have to create new references to those nodes, and so on down the hierarchy. Essentially, anytime a node in the hierarchy changes, we need to create new historical versions for all descendants of that node. Our original structure, using this method, would now look like this: You can see that when Parent N’s own parent changed, we had to propagate that change to Object A to ensure our lowest granularity object has a key for each temporal version of the structure. We can now use this structure with a fact table – we know that a fact record occurring on 10/02/14, for example, would aggregate up the orange-marked path through A2 > N1 > Y. That’s the key point to implementing SCD for Parent-Child structures. If any changes occur, anywhere in the hierarchy, all descendants will need a new type 2 record created. By using Type 2 SCD, each object is referencing the surrogate key of its parent, not the business key, this way every join in the structure is based upon a specific historical version of that record and thus historical context is implied by the foreign key relationships. Whilst complex in theory, once implemented your fact > dimension relationship is very simple. Your fact record has a single foreign key which holds the full historical context of that record. In the next post, I’ll discuss a couple of techniques for implementing the above transformation inside a standard ETL structure.

Introduction to Slowly Changing Dimensions (SCD) Types

When setting about writing a recent blog post, I wanted to link to a clear, concise blog post on the different SCD types for anyone not familiar with the topic. Whilst there are a variety of thorough introductions out there, I didn’t find one as clear and concise as I’d like. I therefore give you my own offering, a quick introduction to Slowly Changing Dimensions, or SCD, in a datawarehousing scenario. For a more detailed discussion of slowly changing dimensions, I’d suggest looking at Kimball Group’s own posts on type 1 and types 2 and 3. What are slowly changing dimensions? When organising a datawarehouse into Kimball-style star schemas, you relate fact records to a specific dimension record with its related attributes. But what if the information in the dimension changes? Do you now associate all fact records with the new value? Do you ignore the change to keep historical accuracy? Or do you treat facts before the dimension change differently to those after? It is this decision that determines whether to make your dimension a slowly changing one. There are several different types of SCD depending on how you treat incoming change. What are the types of SCD? Very simply, there are 6 types of Slowly Changing Dimension that are commonly used, they are as follows: Type 0 – Fixed Dimension No changes allowed, dimension never changes Type 1 – No History Update record directly, there is no record of historical values, only current state Type 2 – Row Versioning Track changes as version records with current flag & active dates and other metadata Type 3 – Previous Value column Track change to a specific attribute, add a column to show the previous value, which is updated as further changes occur Type 4 – History Table Show current value in dimension table but track all changes in separate table Type 6 – Hybrid SCD Utilise techniques from SCD Types 1, 2 and 3 to track change In reality, only types 0, 1 and 2 are widely used, with the others reserved for very specific requirements. Confusingly, there is no SCD type 5 in commonly agreed definitions. After you have implemented your chosen dimension type, you can then point your fact records at the relevant business or surrogate key. Surrogate keys in these examples relate to a specific historical version of the record, removing join complexity from later data structures. Practical Examples We have a very simple ‘customer’ dimension, with just 2 attributes – Customer Name and Country:   However, Bob has just informed us that he has now moved to the US and we want to update our dimension record to reflect this. We can see how the different SCD types will handle this change and the pro/cons of each method. Type 0 Our table remains the same. This means our existing reports will continue to show the same figures, maybe it is a business requirement that each customer is always allocated to the country they signed up from. All future transactions associated to Bob will also be allocated to the ‘United Kingdom’ country. Type 1 The table is updated to reflect Bob’s new country: All fact records associated with Bob will now be associated with the ‘United States’ country, regardless of when they occurred. We often just want to see the current value of a dimension attribute – it could be that the only dimension changes that occur are corrections to mistakes, maybe there is no requirement for historical reporting. Type 2 In order to support type 2 changes, we need to add four columns to our table: · Surrogate Key – the original ID will no longer be sufficient to identify the specific record we require, we therefore need to create a new ID that the fact records can join to specifically. · Current Flag – A quick method of returning only the current version of each record · Start Date – The date from which the specific historical version is active · End Date – The date to which the specific historical version record is active With these elements in place, our table will now look like: This method is very powerful – you maintain the history for the entire record and can easily perform change-over-time analysis. However, it also comes with more maintenance overhead, increased storage requirement and potential performance impacts if used on very large dimensions. Type 2 is the most common method of tracking change in data warehouses. Type 3 Here, we add a new column called “Previous Country” to track what the last value for our attribute was. Note how this will only provide a single historical value for Country. If the customer changes his name, we will not be able to track it without adding a new column. Likewise, if Bob moved country again, we would either need to add further “Previous Previous Country” columns or lose the fact that he once lived in the United Kingdom. Type 4 There is no change to our existing table here, we simply update the record as if a Type 1 change had occurred. However, we simultaneously maintain a history table to keep track of these changes: Our Dimension table reads: Whilst our Type 4 historical table is created as: Depending on your requirements, you may place both ID and Surrogate Key onto the fact record so that you can optimise performance whilst maintaining functionality. Separating the historical data makes your dimensions smaller and therefore reduces complexity and improves performance if the majority of uses only need the current value. However, if you do require historical values, this structure adds complexity and data redundancy overheads. It is generally assumed that the system will use Type 1 or Type 2 rather than Type 4. Type 6 The ‘Hybrid’ method simply takes SCD types 1, 2 and 3 and applies all techniques. We would maintain a history of all changes whilst simultaneously updating a “current value” column on all records.   This gives you the ability to provide an element of change comparison without additional calculation, whilst still maintaining a full, detailed history of all changes in the system. Personally, if this requirement came up, I would avoid the data redundancy of this extra column and simply calculate the current value using the “LAST_VALUE()” window function at run-time. Although this depends on your priorities between data storage and direct querying performance.

Type 2 SCDs - Calculating End Date when you only have Effective Date

Here's a quick one for creating an end date column when you only have an effective date column in your data source to work with.  The secret is to join the table to itself with a greater than join on the effective date.  You then use the effective date from your joined table to give you your end date. SELECT DT1.AccountID ,DT1.EffectiveDate --Add a default end date for current record and subtract a day so the end date is one day before the next start date ,ISNULL(DATEADD(d,-1, DT2.EffectiveDate), '31/12/2099') AS EndDate ,DT1.Price FROM dbo.PriceTable AS DT2 RIGHT OUTER JOIN dbo.PriceTable AS DT1 DT2.AccountID = DT1.AccountID AND DT2.EffectiveDate > DT1.EffectiveDate