
Shane Grimes' Blog

SQL Server 2016 New String Split Function

With the release of SQL Server 2016 RC0 comes a function I have waited a long time for: the STRING_SPLIT function. Everyone has their own way of splitting strings, so I shall walk you through how I used to do this, and how I will be splitting strings from SQL Server 2016 RC0 onwards.

Take the following example, where I have one record containing two columns, the second column holding multiple values which I want to split out.

image

Previously I would have used some string manipulation and the XML data type to split the values in column 2 out into separate values, using code like that shown below.

;WITH CTE AS
(
    SELECT UPPER('Adatis') AS Brand_Code
          ,CAST('<M>' + REPLACE('1,2,3,4,5,6', ',', '</M><M>') + '</M>' AS XML) AS Ord
)
SELECT      CTE.Brand_Code, Ord.a.value('.', 'VARCHAR(100)') AS OrderID
FROM        CTE
OUTER APPLY Ord.nodes('/M') AS Ord(a) --Use the node method to join back to the original data

Now all I have to do is call the STRING_SPLIT function in SQL Server 2016, simply specifying the string to be split and the separator to split by.

SELECT UPPER('Adatis') as Brand_Code, Value
FROM string_split('1,2,3,4,5,6',',')

This works fine if you just want to split a single string, but if you have multiple rows of values you want to split then you will need to use CROSS APPLY.

Take the table below, with two columns which replicate the query above.

image
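If you want to recreate this example, a minimal table along the following lines will do. The table and column names match the query below, but the exact rows shown in the screenshot are assumed for illustration:

CREATE TABLE StringSplitExample
(
    BrandCode VARCHAR(50),
    OrderID   VARCHAR(100) -- comma separated list of order IDs
);

INSERT INTO StringSplitExample (BrandCode, OrderID)
VALUES ('ADATIS', '1,2,3,4,5,6'),
       ('OTHER1', '7,8,9'),
       ('Other2', '10, 11, 12'); -- note the spaces after the commas, which become relevant later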

I now only need to write the following query, which is a lot neater and easier for anyone reviewing my work to understand. Here I am simply asking it to split out the values found in the OrderID column, using a comma as the separator to split by.

SELECT BrandCode, Value
FROM StringSplitExample a
CROSS APPLY STRING_SPLIT(a.OrderID,',')

 

The Results:

image

One thing to note is that this function will not strip out any leading/trailing spaces, so this will have to be handled either by using RTRIM and LTRIM or a REPLACE function. The following screenshots show the issue and my resolution.

Before:

image

 

SELECT BrandCode, LTRIM(RTRIM(Value))
FROM    StringSplitExample a
CROSS APPLY STRING_SPLIT(a.OrderID,',')
WHERE    BrandCode = 'Other2'

After:

image
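Alternatively, the REPLACE approach mentioned above strips the spaces from the source string before it is split. A rough sketch, using the same example table:

SELECT BrandCode, Value
FROM   StringSplitExample a
CROSS APPLY STRING_SPLIT(REPLACE(a.OrderID, ' ', ''), ',')
WHERE  BrandCode = 'Other2'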

Deleting Date Stamped Files Over X Days Old Using SSIS

One challenge I have faced recently is to automatically delete files which are over X days old.  Below I have documented how I overcame this using a simple For Each Loop container in SSIS to loop through files which contain a date stamp in their file names.

What we need in SSIS

  • A variable – used to handle the names of the files we are looping through to delete
  • A file connection manager – used to connect to the file(s) we wish to delete
  • A For each loop container – this will be used to loop through all the files in a directory
  • A script task – this is a blank script task, that serves no purpose other than to allow us to set a precedence constraint on the file system task
  • A file system task – used to actually delete the relevant files

image

 

Firstly, create a simple variable with an example file name to get you started.

image

Once you have created a variable which contains an example file name, right click on your file connection, select Properties and then click the ellipsis next to the Expressions property. Next select “ConnectionString” from the property drop down menu and then, in the expression builder, add the folder path plus your variable name, as shown in the screenshot below. If you wish, you can also add the folder path into your variable, or you could have a separate variable for your folder path.

image
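As a rough guide, the ConnectionString expression ends up looking something like the line below. The folder path here is purely illustrative; swap in your own path and variable name:

"C:\\SourceFiles\\Archive\\" + @[User::strSourceArchiveFileName]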

After creating your variable and setting up a connection to an example file, you will need to go into your For Each Loop container and specify the folder containing the files we wish to delete. Under Files I have also specified that I am looking for files containing a “_” and a “.” in the file name/extension. Under Retrieve file name I have selected Name and extension, as my directory won’t change, so I do not need to bring back the fully qualified path.

image

Still in the For Each Loop editor, click on the Variable Mappings pane and select the variable created earlier.

image

The next step is to double click the precedence constraint (arrow) between the blank script task and the file system task. Next, change the evaluation operation to Expression and insert the code below, tweaked to use your variable name and adjusted for any differences in your date format. This expression currently works for files with a date suffix of _YYYYMMDD.

image

The full code used:

DATEDIFF("dd",
    (DT_Date)(SUBSTRING(@[User::strSourceArchiveFileName], FINDSTRING(@[User::strSourceArchiveFileName], "_", 1) + 1, 4) + "-"
            + SUBSTRING(@[User::strSourceArchiveFileName], FINDSTRING(@[User::strSourceArchiveFileName], "_", 1) + 5, 2) + "-"
            + SUBSTRING(@[User::strSourceArchiveFileName], FINDSTRING(@[User::strSourceArchiveFileName], "_", 1) + 7, 2)),
    GETDATE()) > 30

 

What the code is doing:

The code compares today's date against the date found in the file name, e.g. File1_20161030.txt. If the difference between today's date and the date in the file name is greater than 30 days, the expression returns true and the file is deleted in the next step.  Otherwise the file is left alone.

The way it obtains the date is by using the FINDSTRING function to identify the underscore in my file names.

A SUBSTRING function is then used to identify the various elements of the date in the file name. E.g. the year starts at the 1st character after the underscore with a length of 4, the month at the 5th character after the underscore with a length of 2, and the day at the 7th character after the underscore with a length of 2.

This information is then concatenated together, separating the various elements using a “-“, and compared against today's date.
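If you find the SSIS expression syntax hard to read, the same logic can be sketched in T-SQL purely for illustration (this is not part of the package, and the file name is hard coded here):

DECLARE @FileName      VARCHAR(100) = 'File1_20161030.txt';
DECLARE @UnderscorePos INT          = CHARINDEX('_', @FileName);

-- Rebuild the date as YYYY-MM-DD from the characters following the underscore
DECLARE @FileDate DATE = CAST(
      SUBSTRING(@FileName, @UnderscorePos + 1, 4) + '-'  -- year
    + SUBSTRING(@FileName, @UnderscorePos + 5, 2) + '-'  -- month
    + SUBSTRING(@FileName, @UnderscorePos + 7, 2)        -- day
    AS DATE);

-- Returns 1 (delete) when the file date is more than 30 days ago
SELECT CASE WHEN DATEDIFF(DAY, @FileDate, GETDATE()) > 30 THEN 1 ELSE 0 END AS DeleteFile;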

After setting up the precedence constraint, all that is left is to set up the File System task, setting the Source Connection to our file connection created earlier, and changing the operation to “Delete File”.

image

As mentioned previously, the solution does require you to have a character such as an underscore separating the file name and the date stamp within the file name, and for your date stamps to be delivered in a consistent format, be that YYYYMMDD or any other format.

These final two screenshots are to show the before and after results of running the above tasks by executing the SSIS package.

Before:

image

After:

image

SQL - Using The MERGE Statement To Apply Type 2 SCD Logic

Introduced in SQL Server 2008, the MERGE statement is a useful way of inserting, updating and deleting data inside one SQL statement.

In the example below I have two tables: one containing historical data using type 2 SCD (slowly changing dimensions) called DimBrand, and another containing just the latest dimension data called LatestDimBrand. Using the MERGE statement I will insert new records from LatestDimBrand into DimBrand, archive (apply an end date to) any DimBrand records which do not appear in the latest data, and finally enter a new record and archive the old record for any brands which may have changed.

DimBrand (the target of our inserts, updates and deletes) and LatestDimBrand (the source for the inserts, updates and deletes):

image     image
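If you would like to run the merge below yourself, a rough setup of the two temp tables might look like this. The column types and most of the sample rows are assumed; only the Googles, MikeRowSoft and MickRowSoft brands are taken from the example data:

CREATE TABLE #DimBrand
(
    BrandCode VARCHAR(10),
    BrandName VARCHAR(50),
    StartDate DATETIME,
    EndDate   DATETIME NULL
);

CREATE TABLE #LatestDimBrand
(
    BrandCode VARCHAR(10),
    BrandName VARCHAR(50),
    StartDate DATETIME,
    EndDate   DATETIME NULL
);

INSERT INTO #DimBrand (BrandCode, BrandName, StartDate, EndDate)
VALUES ('MRS', 'MikeRowSoft', '2015-01-01', NULL), -- renamed in the latest data, so archived and re-inserted
       ('GGL', 'Googles',     '2015-01-01', NULL); -- missing from the latest data, so archived

INSERT INTO #LatestDimBrand (BrandCode, BrandName, StartDate, EndDate)
VALUES ('MRS', 'MickRowSoft', GETDATE(), NULL);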

The merge code in its entirety:

INSERT #DimBrand ([BrandCode], [BrandName], [StartDate])
SELECT [BrandCode], [BrandName], GETDATE()
FROM
(
    MERGE #DimBrand AS Target
    USING
    (
        SELECT [BrandCode], [BrandName], [StartDate], [EndDate]
        FROM   #LatestDimBrand
    ) AS Source
    ON (Target.[BrandCode] = Source.[BrandCode])
    -------------------------------
    WHEN MATCHED AND Target.[BrandName] <> Source.[BrandName]
        THEN UPDATE SET Target.[EndDate] = GETDATE()
    -------------------------------
    WHEN NOT MATCHED BY TARGET
        THEN INSERT
        (
             [BrandCode]
            ,[BrandName]
            ,[StartDate]
            ,[EndDate]
        )
        VALUES
        (
             Source.[BrandCode]
            ,Source.[BrandName]
            ,Source.[StartDate]
            ,Source.[EndDate]
        )
    -------------------------------
    WHEN NOT MATCHED BY SOURCE
        THEN UPDATE SET Target.[EndDate] = GETDATE()
    -------------------------------
    OUTPUT $Action, Source.*
) AS i ([Action], [BrandCode], [BrandName], [StartDate], [EndDate])
-------------------------------
WHERE [Action] = 'UPDATE'
AND   BrandCode IS NOT NULL

 

The insert statement, although appearing at the top of the SQL statement, is the last thing to be executed and uses the result set of the merge, as specified in the OUTPUT clause further down.

image

Next is the start of our merge statement; here we specify a Target and a Source table. We also specify which columns the two tables should match on, in this case only the BrandCode.

image

The next step is to specify and handle any matches; here we are looking for BrandCodes which appear in both tables, but with differing BrandNames. If a match is found, the row in the target table is given an end date. As with the following “NOT MATCHED” sections, an insert, update or delete can be applied to the target table here. This handles the first part of our slowly changing dimension requirement.

image

Where there is no match between the Target table and the Source table, the relevant records from the Source table are inserted into the Target table.

image

The penultimate part of the merge statement takes any Target records which do not appear in our “Latest” Source table and sets an EndDate for them as they are no longer a valid record.

image

The OUTPUT clause populates the outer INSERT statement; here the Source rows which are involved in the UPDATE actions above are pulled out to the outer FROM clause. We have also selected $Action so that we can filter out the rows inserted under the NOT MATCHED BY TARGET clause.

image

The result of all the above is that the brand “Googles” is archived, along with “MikeRowSoft”, which, as you can see from the BrandCode, has been rebranded as “MickRowSoft”.

image

Power BI Dynamically Filtering Dates

One problem which forced me to pause for thought recently was how to dynamically filter data in Power BI, e.g. filtering data by a date range to only show sales for the last 7 days.

I believe the two simplest options are using either the Q&A natural query language or Power BI Desktop's built-in Date/Time filter options found in the Query Editor, both of which are documented below.

Option 1 – Using the Power BI web app

Using the Q&A text box of your dashboard you can simply ask in “natural language” to see your data from the last 7 days, as shown in the screenshot below.  This can then be pinned to the dashboard if it is a question you are going to ask on a regular basis.

This does take a little bit of tweaking to get the desired results as you are relying on the Power BI engine to understand your query.

image

Fortunately, if Power BI cannot find useful answers to your question, it does give you some suggestions, as shown in the screenshot below.

image

Option 2 – Using the Power BI Desktop app

When you import data using Power BI Desktop you are greeted with the Query Editor; from here you can click the drop down arrow for the date column in question. Next select Date/Time Filters and you will be exposed to a number of filtering options unavailable in the report or dashboard tabs. From here I have selected “In the Previous”.

image

Once you have selected your filter (e.g. In the Previous), all you need to do is set how many days you want to count back.  In this instance the filter means that I will only see data in my reports where the Created Date is in the previous 7 days.

image

The Microsoft Power BI team have already included a few useful built-in filters, such as the Year To Date filter shown in the screenshot below.

image

Power BI Maps Handling Duplicate City Names

The Bing map engine behind the map visualisation in Power BI is very intuitive, allowing users to provide textual data such as city, country or postcode to map metrics, instead of just latitude and longitude as most other applications require. However, one thing which is not immediately obvious is how to get around the issue of duplicate city/town names.

In this blog I will explain how to map your metrics when your data source contains duplicate cities/towns.

To start with we have a simple data set with quarterly sales for 6 different cities based in 5 different states, which is being loaded from a CSV into Power BI. Straight away you can see that we only have 2 distinct city names.

image  image

As soon as we try to map the sales data by city, we get an obvious problem: all of the Bristol sales are being assigned to Bristol, England, while the Georgetown sales are appearing in Guyana.

image

Adding state to the Location field does nothing to help the problem as Power BI only reads a single input in the Location field.

image

So the solution is to create a new column containing both City and State data. To do this you need to complete the following steps:

1. Click “Edit Queries”
2. Select the data source in question.
3. Select the two or more columns which contain the data we want to merge, e.g. City and State.
    - If additional geographical data is available, such as Country, then this can be included in the merged column.
4. Navigate to the "Add Columns" menu and select "Merge Columns"
5. Choose the separator value and name the new column

image

For simplicity I have just called this “Merged” and separated the values using only a space.

image

Once the new column has been created it can be dropped into the Location field of the map visualization.

As you can see from the screenshot below I now have 6 data points, showing all three variations of Bristol, and all three variations of Georgetown.

image

One final tip is to ensure you have set the Data Category value for the column in question.  In this case I have set the Data Category to City to help Bing identify the type of data I believe I am providing it with.

image

The only problem with this is that if you set the Data Category value incorrectly, no data will be displayed, as shown in this final screenshot where I have changed the Data Category to “Continent”.

image

MDS 2012 Business Rules 208: A database error has occurred

There is an issue with MDS 2012 and business rules with a combination of "must be unique" attributes. If you do not list the attributes in the same order in which they were created, you will get a generic error message when trying to revalidate your data.

The workaround for this is to recreate the “must be unique” combination of attributes in the same order as the attributes appear in the leaf attributes list. The screenshots below provide an example of the issue and solution.

In the first screenshot below we have a unique constraint containing Customer Type and Order Type. In this scenario we wanted to add Brand to the uniqueness constraint.

image

We can drag and drop “Brand” from our entity specific attribute list onto our combination, save, and publish the updated business rules without problem.

Once the new rules have been published, a question mark appears next to all rows of data for the entity that the business rule covers.

image

When we click on Apply Rules to re-run the business rules and validate the data, we get the following error message.

image

image

As mentioned above, the workaround is to recreate the unique constraint with the attributes in the same order as they appear in the entity-specific attribute list.

In our example that means recreating the unique constraint so that Brand appears above Customer Type and Order Type.

image

Once the re-ordered rule has been published, applying it to the entity will work without error once more.

image

SQL Order by with aggregation

In this post we will look into applying an order by clause to a query containing an aggregate function. The purpose of this is to allow us to order a result set by a column which does not appear in the select or group by clause.  Without using an aggregate function, or including the column to order by in your select/group by, you will be unable to execute the query successfully.

This is quite a straightforward problem, but one that can easily trip people up.  I started by creating a simple table containing colours which appear multiple times, over multiple dates and against multiple IDs.

image

image
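To make the rest of the examples easier to follow, here is a rough sketch of such a table. The table name, column names and most of the values are assumed, so the exact orderings later on may differ slightly from the screenshots; the Black rows on 2015-09-03 with IDs 1, 3 and 8 are taken from the discussion further down.

CREATE TABLE ColourExample
(
    ID          INT,
    Colour      VARCHAR(50),
    CreatedDate DATE
);

INSERT INTO ColourExample (ID, Colour, CreatedDate)
VALUES (1,  'Black',             '2015-09-03'),
       (3,  'Black',             '2015-09-03'),
       (8,  'Black',             '2015-09-03'),
       (2,  'Aqua Marine Green', '2015-09-01'),
       (4,  'Aqua Marine Green', '2015-09-01'),
       (5,  'Blue',              '2015-09-01'),
       (6,  'Green',             '2015-09-02'),
       (7,  'Red',               '2015-09-02'),
       (9,  'White',             '2015-09-03'),
       (10, 'Yellow',            '2015-09-04');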

Using the COUNT() function to count how many times a colour appears on a certain date doesn't cause us any problems.
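A query along these lines produces that grouped result set (using the assumed names from the table sketch above):

SELECT   Colour, CreatedDate, COUNT(*) AS ColourCount
FROM     ColourExample
GROUP BY Colour, CreatedDate
ORDER BY Colour;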

The result set shows us 7 rows of data. Note that the colour Aqua Marine Green appears first in the list while Black appears second; we will use this information later.

image

image

However, if we wanted to see the result set in the order of the rows' respective IDs, this is where we run into issues.
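Attempting something like the following produces the error:

SELECT   Colour, CreatedDate, COUNT(*) AS ColourCount
FROM     ColourExample
GROUP BY Colour, CreatedDate
ORDER BY ID;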

The error message is quite clear, if a little confusing, about what needs to happen here: either we need to add the ID column to our group by clause, or wrap it in an aggregate function.

image

image

Adding the ID to the group by clause will change your result set. If we take the query above and simply add the ID column to our select and group by, we no longer get an error message; however, the data has lost all meaning, as we are now back to our 10 original rows of data, each being counted once.
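That version of the query looks something like this:

SELECT   Colour, CreatedDate, ID, COUNT(*) AS ColourCount
FROM     ColourExample
GROUP BY Colour, CreatedDate, ID
ORDER BY ID;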

image

image

Using an aggregate function in the order by clause, e.g. MIN(ID), fixes the problem and provides us with the result set we are after. If we look above we can see that “Black” appears 3 times against the date “2015-09-03”, and the IDs for those 3 records are 1, 3 and 8. It is only after specifying which ID to order by that SQL is able to correctly execute the query.

Without an aggregate function, be that MIN(), SUM() or AVG(), SQL is unable to determine which ID it should use for each group to order the data by.
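The working version of the query, again sketched with the assumed names:

SELECT   Colour, CreatedDate, COUNT(*) AS ColourCount
FROM     ColourExample
GROUP BY Colour, CreatedDate
ORDER BY MIN(ID);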

image

image

Be careful when picking the aggregate function to use, as your choice will affect the result set.  The screenshots below show the result set ordered by MIN(ID), AVG(ID) and SUM(ID). Each returns the same set of data, but not in the same order. (The aggregated ID column has been added to the screenshots for reference only.)

The first example orders by the MIN() function and results in Black appearing first in the result set.

image

The next example is ordered by the AVG() function and results in Black appearing fourth in the result set.

image

Finally ordering by the SUM() function results in Black appearing last in the result set.

image