
Adatis BI Blogs

SQL - Using The MERGE Statement To Apply Type 2 SCD Logic

Introduced in SQL Server 2008, the MERGE statement is a useful way of inserting, updating and deleting data inside one SQL statement.

In the example below I have two tables: one containing historical data tracked with type 2 SCDs (slowly changing dimensions) called DimBrand, and another containing just the latest dimension data called LatestDimBrand. Using MERGE I will insert new records from LatestDimBrand into DimBrand, archive (apply an end date to) any DimBrand records which do not appear in the latest data, and finally enter a new record and archive the old record for any brands which have changed. DimBrand is the target of our inserts, updates and deletes; LatestDimBrand is the source.

The merge code in its entirety:

INSERT #DimBrand ([BrandCode],[BrandName],[StartDate])
SELECT [BrandCode],[BrandName],getdate()
FROM (
    MERGE #DimBrand AS Target
    USING (
        SELECT [BrandCode],[BrandName],[StartDate],[EndDate]
        FROM #LatestDimBrand
    ) AS Source
    ON (Target.[BrandCode] = Source.[BrandCode])
    WHEN MATCHED AND Target.[BrandName] <> Source.[BrandName] THEN
        UPDATE SET Target.[EndDate] = getdate()
    WHEN NOT MATCHED BY TARGET THEN
        INSERT ([BrandCode],[BrandName],[StartDate],[EndDate])
        VALUES (Source.[BrandCode], Source.[BrandName], Source.[StartDate], Source.[EndDate])
    WHEN NOT MATCHED BY SOURCE THEN
        UPDATE SET Target.[EndDate] = getdate()
    OUTPUT $Action, Source.*
) AS i([Action],[BrandCode],[BrandName],[StartDate],[EndDate])
WHERE [Action] = 'UPDATE'
AND BrandCode IS NOT NULL

The INSERT statement, although appearing at the top of the SQL statement, is the last thing to be executed: it uses the result set of the MERGE exposed through the OUTPUT clause further down.

Next is the start of our MERGE statement, where we specify a Target and a Source table. We also specify which columns the two tables should match on, in this case only the BrandCode.

The next step is to specify and handle any matches. Here we are looking for BrandCodes which appear in both tables but with differing BrandNames. If a match is found, the row in the target table is given an end date. As with the following "NOT MATCHED" sections, an insert, update or delete can be applied to the target here. This handles the first part of our slowly changing dimension requirement.

Where there is no match between the Target table and the Source table, the relevant records from the Source table are inserted into the Target table.

The penultimate part of the MERGE statement takes any Target records which do not appear in our "latest" Source table and sets an EndDate for them, as they are no longer valid records.

The OUTPUT clause populates the outer INSERT statement: the Source rows involved in the UPDATE actions above are pulled out to the outer FROM clause. We have also selected $Action so that we can filter out the rows INSERTED under the "NOT MATCHED BY TARGET" branch. The BrandCode IS NOT NULL filter removes the rows archived under "NOT MATCHED BY SOURCE", whose Source columns are all NULL, so that only the changed brands are re-inserted.

The result of all the above is that the brand "Googles" is archived, along with "MikeRowSoft", which, as you can see from its BrandCode, has been rebranded as "MickRowSoft".
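If you want to try the statement end to end, here is a minimal setup sketch. The table shapes follow the column lists used in the MERGE above, but the BrandCodes, dates and the extra "Pear" brand are invented for illustration.

CREATE TABLE #DimBrand
    (BrandCode NVARCHAR(10), BrandName NVARCHAR(50), StartDate DATETIME, EndDate DATETIME);
CREATE TABLE #LatestDimBrand
    (BrandCode NVARCHAR(10), BrandName NVARCHAR(50), StartDate DATETIME, EndDate DATETIME);

--Historical dimension: all rows currently open ended
INSERT #DimBrand VALUES
    ('BR01', 'Googles',     '20150101', NULL)  --absent from the latest data: will be end dated
   ,('BR02', 'MikeRowSoft', '20150101', NULL); --renamed in the latest data: end dated and re-inserted

--Latest dimension data
INSERT #LatestDimBrand VALUES
    ('BR02', 'MickRowSoft', GETDATE(), NULL)   --the rebranded MikeRowSoft
   ,('BR04', 'Pear',        GETDATE(), NULL);  --a hypothetical brand new record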

SQL Order by with aggregation

In this post we will look at applying an ORDER BY clause to a query containing an aggregate function. The purpose of this is to allow us to order a result set by a column which does not appear in the SELECT or GROUP BY clause. Without using an aggregate function, or including the column to order by in your SELECT/GROUP BY, you will be unable to execute the query successfully. This is quite a straightforward problem, but one that can easily trip people up.

I started by creating a simple table containing colours which appear multiple times, over multiple dates and against multiple IDs. Using the COUNT() function to count how many times a colour appears on a certain date doesn't cause us any problems. The result set shows us 7 rows of data; we note that the colour Aqua Marine Green appears first in the list and Black appears second, as we will use this information later.

However, if we want to see the result set in the order of the respective IDs, this is where we run into issues. The error message is quite clear, if a little confusing, on what needs to happen: either we need to add the ID column to our GROUP BY clause, or wrap it in an aggregate function. Adding the ID to the GROUP BY clause will change your result set: if we take the query above and simply add the ID column to our SELECT and GROUP BY, we no longer get an error message, but the data has lost all meaning, as we are back to our 10 original rows of data, each being counted once.

Using an aggregate function in the ORDER BY clause, e.g. MIN(ID), fixes the problem and provides us with the result set we are after. If we look above we can see that Black appears 3 times against the date 2015-09-03, and the IDs for those 3 records are 1, 3 and 8. Only after we specify which ID to use for each group can SQL correctly execute the query; without an aggregate function, be that MIN(), SUM() or AVG(), SQL is unable to determine which ID it should use for each group to order the data by.

Be careful when picking the aggregate function, as your choice will affect the result set. The screenshots below show the result set ordered by MIN(ID), AVG(ID) and SUM(ID), each returning the same set of data but not in the same order (the aggregated ID column has been added to the screenshots for reference only). The first example orders by MIN() and results in Black appearing first in the result set. The next example is ordered by AVG() and results in Black appearing fourth. Finally, ordering by SUM() results in Black appearing last.
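The queries in the post were shown as screenshots; here is a minimal sketch of the pattern. The table name and column names are assumed, and the sample values are invented apart from the Black rows with IDs 1, 3 and 8 mentioned above.

--Minimal reproduction of the problem and the fix
CREATE TABLE #Colours (ID INT, ColourDate DATE, Colour NVARCHAR(30));

INSERT #Colours VALUES
    (1, '2015-09-03', 'Black')
   ,(3, '2015-09-03', 'Black')
   ,(8, '2015-09-03', 'Black')
   ,(2, '2015-09-03', 'Aqua Marine Green');

--Fails: ID is neither grouped nor aggregated
--SELECT Colour, ColourDate, COUNT(*) FROM #Colours
--GROUP BY Colour, ColourDate ORDER BY ID;

--Works: the aggregate tells SQL which ID to use for each group
SELECT Colour, ColourDate, COUNT(*) AS Occurrences
FROM #Colours
GROUP BY Colour, ColourDate
ORDER BY MIN(ID);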

SQL PASS Summit - Day 3

Today was the final day of the PASS Summit. The schedule was a little different in the morning, with no keynote speech. Instead I opted to see Jeremiah Peschka's session 'Dynamic SQL: Build Fast, Flexible Queries'. This was an interesting talk which covered the security aspects of dynamic SQL, how to avoid things such as SQL injection, and how to ensure that queries generated from dynamic SQL run quickly, for example by ensuring plan reuse on each execution. Whilst I very rarely find myself in a situation in which dynamic SQL is essential, I found this an interesting talk and I am sure it will come in very handy when the need does arise.

Following this I attended a Microsoft lab on Azure Machine Learning (AML), 'Authoring a hand-written digit recognizer in Microsoft Azure Machine Learning Studio'. This was a hands-on session where computers, a set of exercises and sample data were supplied. Essentially the lesson consisted of providing a model with vector images of numbers, training the model to recognise the numbers, and then passing a new set of number vector images to the model to see it identify them (over 85%) correctly. The session was interesting and gave me a great overview of using AML Studio. My main issue was that the data was already provided in vector format (CSV files which were imported). Whilst I can see that time constraints meant the conversion from image to vector during the class would have been difficult, it would have been very interesting to have seen a few examples of the images, so that we could see exactly what was being identified and how different the numbers were; this would have illustrated how clever the algorithms are.

I finished the conference with Brian Larson's talk entitled 'Effective Reporting through SSRS Advanced Authoring Features'. As you may know, SSRS has not been updated for some time now, and with the current focus on Power View I wasn't sure how much I would benefit from this or what the plans are for the tool in the future. I was quite surprised, therefore, when the room filled up and overflowed, so that quite a crowd of people had to stand in order to watch. The talk showed some interesting uses of SSRS: images embedded in the background of reports, adding strip lines and markers, and an example of creating an interactive report which allowed users to set their preferences for running other reports. The examples given were very good, as was the delivery. My only issue is that without new features added to SSRS (for example, the ability to do an asynchronous refresh of some report elements) I am not sure what the uptake of SSRS is going to be.

All in all I have had a fantastic and informative time here. I leave you with some pictures, taken earlier in the week :)

Optimising SSIS Lookups with HASHBYTES()

I was recently working on a load for a dimension table. The package follows a fairly standard routine: data for the dimension is transformed into the required format using SSIS and then inserted into the target table. On the way, an initial check ('LKU – Check Exists') determines whether the record is new, and a secondary check determines whether the record has changed, which is subsequently used in an update routine.

Although the package was loading incrementally, it sometimes had to check several million rows to determine whether they had changed. As the table was quite wide, I decided to implement the HASHBYTES() function in order to compare a single hash of the columns rather than compare each column individually.

The initial results were quite promising, with the package running in around half the time. However, on a particularly big load I noticed the package failed as it had run out of memory for the second lookup (full cache was used). I found this odd, because HASHBYTES returns only a single value which I anticipated would be smaller than the sum of the columns. On deeper investigation I found that the value returned by HASHBYTES is typed as VARBINARY(8000) by default, so each row in the lookup cache reserves 8000 bytes for it. A quick check of the SQL documentation for HASHBYTES at http://msdn.microsoft.com/en-GB/library/ms174415.aspx states that the size of the data returned when using the SHA2_256 algorithm is 32 bytes, meaning most of this space was being wasted. Therefore, changing the formula from 1 to 2 below significantly increases the efficiency of the lookup and makes the required cache size smaller by a factor of around 250!

1. HASHBYTES('SHA2_256',[ProductName] + '|' + [CategoryName]) AS HashValue

2. CONVERT(VARBINARY(32),HASHBYTES('SHA2_256',[ProductName] + '|' + [CategoryName])) AS NewHash

Upon doing this the size of each row was significantly reduced, the package is running faster than before and, most importantly, there are no memory issues :)
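To see where the 8000 bytes comes from, you can inspect the declared metadata of the two expressions. A minimal sketch, assuming SQL Server 2012 or later for the sys.dm_exec_describe_first_result_set DMF, and using throwaway literals in place of the real columns:

SELECT name, system_type_name, max_length
FROM sys.dm_exec_describe_first_result_set(N'
    SELECT
        HASHBYTES(''SHA2_256'', [ProductName] + ''|'' + [CategoryName]) AS HashValue
        ,CONVERT(VARBINARY(32), HASHBYTES(''SHA2_256'', [ProductName] + ''|'' + [CategoryName])) AS NewHash
    FROM (VALUES (N''Bikes'', N''Mountain'')) v([ProductName],[CategoryName])', NULL, 0);

--HashValue is reported as varbinary(8000) and NewHash as varbinary(32):
--the roughly 250x difference in the lookup cache row size described above.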

Different Methods Of Type 2 Dimension Population

This post is to share some of my findings about the different ways of dealing with Type 2 changes in an SSIS environment. I wanted to provide a comparison of the methods in terms of performance, reusability and ease of maintenance. For the purposes of the testing, let's take the scenario where a client would like to implement historical tracking in their DW environment for one of their dimensions, and let's say that they have given the following requirements:

1. The ordered online flag should be replaced with a more user-friendly text field containing 'Yes' or 'No' (done in a derived column transformation/SQL CASE statement).
2. The client would like to track any changes to the Purchase Order Number, Customer Account Number or Carrier Tracking Number so that they can preserve the history of these changes. These will be set up as Type 2 changing attributes, and we will need to record the time at which each change was applied.
3. Any changes to the OrderedOnline field will be the result of corrections made to the source data; historical tracking of these changes will not be required in the DW, and all rows should be updated with the new value (a Type 1 change).

I created the following table to hold the data:

CREATE TABLE DimOrder
    (Order_Key INT IDENTITY(1,1) PRIMARY KEY NOT NULL --Surrogate key
    ,SalesOrderID INT NOT NULL --Source key
    ,LineNumber TINYINT NOT NULL --Source key
    ,OrderNumber NVARCHAR(25) NOT NULL
    ,PurchaseOrderNumber NVARCHAR(25) NOT NULL
    ,AccountNumber NVARCHAR(15) NOT NULL
    ,CarrierTrackingNumber NVARCHAR(25)
    ,OrderedOnline NVARCHAR(3) NOT NULL
    ,RecordValidFromDate DATETIME NOT NULL
    ,RecordExpiredDate DATETIME)

The three methods I have selected are:

1. The SSIS in-built Slowly Changing Dimension (SCD) task.
2. A custom-built SSIS package to perform the Type 2 changes.
3. A T-SQL procedure to perform the dimension population.

In each of the comparisons I have performed the following tests:

1. An initial load of data (all of the records are new and there are no changes).
2. A load of the same data from step 1 (no changes, all records exist).
3. A load of data which contains both Type 1 and Type 2 changes.

I used AdventureWorks as a source, and steps 1 and 2 both use the same source statement, which looks like this:

SELECT
    OrdHed.SalesOrderID
    ,SalesOrderDetailID
    ,OnlineOrderFlag
    ,SalesOrderNumber
    ,PurchaseOrderNumber
    ,AccountNumber
    ,CarrierTrackingNumber
FROM Sales.SalesOrderHeader OrdHed
INNER JOIN Sales.SalesOrderDetail OrdDet
    ON Ordhed.SalesOrderID = Orddet.SalesOrderID

For the third test I created some copies of the source tables and then ran some basic statements on them to change the data as below:

SELECT * INTO [Sales].SalesOrderDetailTest FROM [Sales].SalesOrderDetail
SELECT * INTO [Sales].SalesOrderHeaderTest FROM [Sales].SalesOrderHeader

--this will exhibit only Type 1 changes
UPDATE Sales.SalesOrderHeaderTest
SET OnlineOrderFlag = CASE WHEN OnlineOrderFlag = 1 THEN 0 ELSE 1 END
WHERE SalesOrderID BETWEEN 59392 AND 67257

--this will exhibit only a Type 2 change
UPDATE Sales.SalesOrderHeaderTest
SET PurchaseOrderNumber = 'X' + PurchaseOrderNumber
WHERE SalesOrderID < 51525

--this does a mix of both
UPDATE Sales.SalesOrderHeaderTest
SET
    PurchaseOrderNumber = 'X' + PurchaseOrderNumber
    ,OnlineOrderFlag = CASE WHEN OnlineOrderFlag = 1 THEN 0 ELSE 1 END
WHERE SalesOrderID BETWEEN 51524 AND 59391

The data is then sourced from the new tables, which provide the changes for testing.
1. SSIS Slowly Changing Dimension Task

Configuration of the package is very straightforward; you add a data source, drag a Slowly Changing Dimension task onto the data flow and double-click it to configure how you want it to run. The whole process takes less than 10 minutes to complete, and there are many options that should suit a variety of needs. Once the configuration is completed we end up with a generated dataflow.

2. Custom-Built SSIS Package

Next I designed a package in SSIS, but without using the SCD wizard. I used the following principles during design to try and improve performance:

1. Changes can be detected in the data flow by using simple lookup transformations (we could consider hash checking to improve performance if there were very many columns) to see if there is any change to the data; by using several lookups we can determine if the change is Type 1, Type 2 or both (a sketch of what these lookup queries might look like follows the control flow statements below).

2. Updates in the data flow are performed in SSIS using the 'OLE DB Command' task. As this works on a row-by-row basis, it has a large impact on performance. To avoid this I created tables to hold the Type 1 and Type 2 changes and then used an 'Execute SQL task' in the control flow to make the updates in one go.

In the completed dataflow, new records go straight into the target table, and the ones which have changes are split using a multicast; where a change has occurred the records are inserted into holding tables and then processed in the control flow. The control flow executes the statements below:

--Make Type 1 changes
UPDATE TGT
SET TGT.OrderedOnline = SRC.OrderedOnline
FROM DimOrder TGT
INNER JOIN DimOrderType1 SRC
    ON TGT.SalesOrderID = SRC.SalesOrderID
    AND TGT.LineNumber = SRC.LineNumber;

--Type 2 expire old rows
UPDATE TGT
SET TGT.RecordExpiredDate = SRC.TransactionDate
FROM DimOrder TGT
INNER JOIN DimOrderType2 SRC
    ON TGT.SalesOrderID = SRC.SalesOrderID
    AND TGT.LineNumber = SRC.LineNumber;

--Type 2 insert new records
INSERT INTO DimOrder (SalesOrderID, LineNumber, OrderNumber, PurchaseOrderNumber, AccountNumber
                     ,CarrierTrackingNumber, OrderedOnline, RecordValidFromDate)
SELECT
    SalesOrderID
    ,LineNumber
    ,OrderNumber
    ,PurchaseOrderNumber
    ,AccountNumber
    ,CarrierTrackingNumber
    ,OrderedOnline
    ,TransactionDate AS RecordValidFromDate
FROM DimOrderType2;
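For reference, here is a minimal sketch of what the two lookup reference queries described in principle 1 might look like. This is my illustration rather than the package's exact queries, assuming the DimOrder table above with current rows identified by RecordExpiredDate IS NULL:

--LKU 1: does the business key already exist? (current rows only)
SELECT SalesOrderID, LineNumber
FROM dbo.DimOrder
WHERE RecordExpiredDate IS NULL;

--LKU 2: match on the Type 2 columns as well; a miss here for an existing
--business key means the incoming row differs from the current row
SELECT SalesOrderID, LineNumber, PurchaseOrderNumber, AccountNumber, CarrierTrackingNumber
FROM dbo.DimOrder
WHERE RecordExpiredDate IS NULL;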
3. T-SQL Script

I also thought I would try a basic SQL script to see how this performed.

CREATE PROC popDimOrder
AS
BEGIN
    CREATE TABLE #OrderData
        (SalesOrderID INT NOT NULL
        ,LineNumber INT NOT NULL
        ,OrderNumber NVARCHAR(25) NOT NULL
        ,PurchaseOrderNumber NVARCHAR(25) NOT NULL
        ,AccountNumber NVARCHAR(15) NOT NULL
        ,CarrierTrackingNumber NVARCHAR(25) NOT NULL
        ,OrderedOnline NVARCHAR(3) NOT NULL
        ,RecordExists SMALLINT NOT NULL DEFAULT(0))

    DECLARE @TransactionDate DATETIME
    SELECT @TransactionDate = GETDATE()

    INSERT INTO #OrderData (SalesOrderID, LineNumber, OrderNumber, PurchaseOrderNumber, AccountNumber
                           ,[CarrierTrackingNumber], [OrderedOnline], RecordExists)
    SELECT
        OrdHed.SalesOrderID
        ,SalesOrderDetailID AS LineNumber
        ,SalesOrderNumber AS OrderNumber
        ,ISNULL(PurchaseOrderNumber, '') AS PurchaseOrderNumber
        ,AccountNumber
        ,ISNULL(CarrierTrackingNumber, '') AS CarrierTrackingNumber
        ,CASE OnlineOrderFlag
            WHEN 0 THEN 'No'
            WHEN 1 THEN 'Yes'
         END AS OrderedOnline
        ,0 AS RecordExists
    FROM [AdventureWorks2012].Sales.SalesOrderHeaderTest OrdHed
    INNER JOIN [AdventureWorks2012].Sales.SalesOrderDetailTest OrdDet
        ON Ordhed.SalesOrderID = Orddet.SalesOrderID

    --Check if records exist
    UPDATE SRC
    SET RecordExists = 1
    FROM #OrderData SRC
    INNER JOIN POC.dbo.DimOrder TGT
        ON TGT.SalesOrderID = SRC.SalesOrderID
        AND TGT.LineNumber = SRC.LineNumber

    --Insert new records
    INSERT INTO DimOrder (SalesOrderID, LineNumber, OrderNumber, PurchaseOrderNumber, AccountNumber
                         ,CarrierTrackingNumber, OrderedOnline, RecordValidFromDate, RecordExpiredDate)
    SELECT
        SalesOrderID
        ,LineNumber
        ,OrderNumber
        ,PurchaseOrderNumber
        ,AccountNumber
        ,CarrierTrackingNumber
        ,OrderedOnline
        ,@TransactionDate
        ,NULL
    FROM #OrderData
    WHERE RecordExists = 0

SSRS Strip Lines

Once in a while you happen upon a really useful feature in SSRS that you were unaware of. For me, strip lines are definitely one of those features that you don't see much mention of, but which can add valuable insight to a report. Below is a simple example of a scatter chart with a horizontal and a vertical strip line, each marking the average value of its axis.

In order to add strip lines to the report you need to do the following:

1. Select the chart axis to which you wish to add the strip line and go to the properties window. In this there is an option marked 'StripLines'; when you click in the cell you can open the Strip Line Collection editor.

2. Click the 'Add' button in the strip line editor (note: you can add multiple strip lines to each axis if required).

3. In the option marked 'IntervalOffset', specify the required value for the strip line; you can either specify a fixed value or use an expression from your dataset. If you require the line to repeat over fixed values you can fill in the properties for 'Interval'.

4. You then need to set the display properties for the strip lines under Appearance in order for them to be visible.
To produce the example strip lines in this post I have set the 'BorderColor' to black and the 'BorderStyle' to dashed.

5. If you wish to add text after the strip line on the graph, for example to illustrate that values after the strip line are above average as shown below, then the settings are under the Title section.

That's it for now, hope that you've found this helpful :)

Type 2 SCDs - Calculating End Date when you only have Effective Date

Here's a quick one for creating an end date column when you only have an effective date column in your data source to work with. The secret is to join the table to itself with a greater-than join on the effective date; you then use the effective date from the joined table to give you your end date.

SELECT
    DT1.AccountID
    ,DT1.EffectiveDate
    --Add a default end date for the current record, and subtract a day so the
    --end date is one day before the next start date
    ,ISNULL(DATEADD(d, -1, DT2.EffectiveDate), '20991231') AS EndDate
    ,DT1.Price
FROM dbo.PriceTable AS DT2
RIGHT OUTER JOIN dbo.PriceTable AS DT1
    ON DT2.AccountID = DT1.AccountID
    AND DT2.EffectiveDate > DT1.EffectiveDate
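One caveat with the simple greater-than self-join: if an account has three or more effective dates, each row matches every later row rather than just the next one, duplicating rows in the output. Here is a sketch of the same idea, against the same (hypothetical) dbo.PriceTable, using OUTER APPLY to pick out only the nearest following effective date:

SELECT
    DT1.AccountID
    ,DT1.EffectiveDate
    --Default end date for current records; otherwise one day before the next start date
    ,ISNULL(DATEADD(d, -1, nxt.NextEffectiveDate), '20991231') AS EndDate
    ,DT1.Price
FROM dbo.PriceTable AS DT1
OUTER APPLY (
    --The nearest following effective date for the same account, if any
    SELECT MIN(DT2.EffectiveDate) AS NextEffectiveDate
    FROM dbo.PriceTable AS DT2
    WHERE DT2.AccountID = DT1.AccountID
      AND DT2.EffectiveDate > DT1.EffectiveDate
) AS nxt

On SQL Server 2012 or later, LEAD(EffectiveDate) OVER (PARTITION BY AccountID ORDER BY EffectiveDate) would achieve the same thing without a self-join.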