Shane

Shane Grimes' Blog

SQL - Using The MERGE Statement To Apply Type 2 SCD Logic

Introduced in SQL 2008 the merge function is a useful way of inserting, updating and deleting data inside one SQL statement.

In the example below I have 2 tables one containing historical data using type 2 SCD (Slowly changing dimensions) called DimBrand and another containing just the latest dimension data called LatestDimBrand. Using the merge function I will insert new records from LatestDimBrand into DimBrand, I will archive (apply an end date) to any DimBrand records which do not appear in the latest data, and finally enter a new record and archive the old record for any Brands which may have changed.

DimBrand (the target of our Inserts, Updates, Deletes) and DimLatestBrand (the source for Inserts,Updates,Deletes):

image     image

The merge code in it’s entirety:

INSERT #DimBrand    ([BrandCode],[BrandName],[StartDate])
SELECT                 [BrandCode],[BrandName],getdate()                                    
FROM                (
                    MERGE #DimBrand AS Target
                    USING    (
                            SELECT    [BrandCode],[BrandName],[StartDate],[EndDate]                       
                            FROM    #LatestDimBrand
                            ) AS Source
                    ON     (Target.[BrandCode] = Source.[BrandCode])    
                    -------------------------------                       
                    WHEN MATCHED AND Target.[BrandName] <> Source.[BrandName]
                        THEN
                        UPDATE SET Target.[EndDate] = getdate()
                    -------------------------------
                    WHEN NOT MATCHED BY TARGET 
                        THEN
                        INSERT ( 
                                [BrandCode]                        
                                ,[BrandName]                    
                                ,[StartDate]              
                                ,[EndDate]                                            
                        )
                        VALUES (      
                                Source.[BrandCode], 
                                Source.[BrandName], 
                                Source.[StartDate],
                                Source.[EndDate]   
                        )
                    -------------------------------
                    WHEN NOT MATCHED BY SOURCE 
                        THEN 
                        UPDATE SET Target.[EndDate] = getdate()
                    -------------------------------
                    OUTPUT $Action, Source.*
                    ) As i([Action],[BrandCode],[BrandName],[StartDate],[EndDate])
                    -------------------------------
WHERE                [Action] = 'UPDATE'
AND                    BrandCode IS NOT NULL

 

The insert statement, although appearing at the top of the SQL statement is the last thing to be executed and uses the results set of the merge function specified in the OUTPUT clause further down.

image

Next is the start of our merge statement here we specify a Target and a Source table. We also specify which columns the 2 tables should match on, in this case only the BrandCode.

image

The next step is to specify and handle any matches, here we are looking for Brand Code’s which appear in both tables, but with differing Brand Names. If a match is found the row in the target table is given an end date.  As with the following “NOT MATCHED” section inserts,updates or deletes can be applied to either table here. This handles the first part of our Slowly Changing Dimension requirement.

image

Where there is no match between the Target table and the Source table, the relevant records from the Source table are inserted into the Target table.

image

The penultimate part of the merge statement takes any Target records which do not appear in our “Latest” Source table and sets an EndDate for them as they are no longer a valid record.

image

The OUTPUT clause populates the outer INSERT statement, here the Source rows which are involved in the UPDATE statements above are pulled out to the outer FROM clause. We have also selected $Action so that we can filter out the INSERTED rows under the NOT MATCHED by Target statement.

image

The result of all the above is the brand “Googles” is archived along with the “MikeRowSoft” which as you can see from the BrandCode has been rebranded as MickRowSoft.

image

SQL Order by with aggregation

In this post we will look into applying an order by clause to a query containing an aggregate function. The purpose of this is to allow us to order a result set by a column which does not appear in the select or group by clause.  Without using an aggregate function, or including the column to order by in your select/group by, you will be unable to execute the query successfully.

This is quite a straight forward problem, but one that can easily trip people up.  I started by creating a simple table containing Colours which appear multiple times over multiple dates and against multiple IDs.

image

image

Using the count() function to count how many times a colour appears on a certain date doesn’t cause us any problems.

The result set shows us 7 rows of data, we note that the colour Aqua Marine Green appears first in the list and while black appears second in the list as we will use this information later.

image

image

However if we wanted to see the result set in the order of their respective ID’s this is where we run into issues,

The error message is quite clear if not a little confusing on what needs to happen here, either we need to add the ID column to our group by clause, or wrap it in an aggregate function. 

image

image

Adding the ID to the group by clause will change your result set, if we take the query above and simply add the ID column to our select and group by we no longer get an error message, however the data has lost all meaning as now we are back to our 10 original rows of data each being counted once.

image

image

Using an aggregate function in the order by clause “"eg: Min(ID) fixes the problem and provides us with the result set we are after. If we look above we can see that “Black” appears 3 times against Date “2015-09-03” and the ID’s for those 3 records are 1,3,8.  It is only after specifying which ID to order by is SQL able to correctly execute the query.

Without adding an aggregate function be that MIN() SUM() AVG() SQL is unable to determine which ID it should use for each group to order the data by.

image

image

Be careful when picking the aggregate function to use as your choice will affect the result set.  The screenshots below show the result set ordered by MIN(ID), AVG(ID), SUM(ID). Each returning the same set of data but not in the same order. (the aggregated ID column has been added to the screenshots for reference only.)

The first example orders by the MIN() function and results in Black appearing first in the result set.

image

The next example is ordered by the AVG() function and results in Black appearing fourth in the result set.

image

Finally ordering by the SUM() function results in Black appearing last in the result set.

image