Adatis BI Blogs

Getting started with Azure Data Factory Mapping Data Flows

Azure Data Factory V2 is the go-to service for moving large amounts of data within the Azure platform and, until relatively recently, was focussed predominantly on control flow rather than data flow. Those familiar with SQL Server Integration Services will be aware of the difference but, to clarify, control flow is used to orchestrate activities within a data pipeline whilst data flow is used to physically move and transform the data. In May 2019, the Azure Data Factory team released the Mapping Data Flows (MDF) functionality into public preview, which effectively moves ADF from being an orchestration tool to a fully fledged ETL tool. This new feature allows developers to configure data transformation logic using a no-code, drag-and-drop approach, and it implements many of the transformation concepts that existed in SSIS. In fact, there is an excellent comparison between SSIS, SQL and ADF MDF written by Kamil Nowinski, available through this link, which highlights how the three technologies match up. Points to note are that ADF does not have event handlers in the way SSIS does, and it has no equivalent to the script component within Data Flows. It does have a custom activity that allows you to write C# within the control flow, but currently you cannot write any custom code within a Mapping Data Flow. Prior to the release of MDF, Data Factory could move data from one place to another with its copy activity; however, it could not modify the file in any meaningful way, and external services such as SQL Data Warehouse or Data Lake Analytics (RIP) were needed to fill this gap. The copy activity performs the data movement using the Azure Integration Runtime, which provides the compute power needed to complete the operation. So does that mean that Mapping Data Flows run on the same compute resource? NO is the answer.
In fact, your graphical data flow is converted into Scala and then compiled into a JAR library to be executed on a Databricks (Spark) cluster which is deployed and managed by Microsoft solely for running your data flow activities. This does mean that any pipeline utilising MDF has a slight overhead to allow the cluster to start up and configure; however, Microsoft are looking at ways to reduce this, and an ETL developer looking to build MDFs does not need to know anything about Databricks, Scala or Spark clusters – although it will certainly help! So, it's in public preview, so let's get using it! This blog will walk through the process of creating a basic cleaning data flow that will populate a SQL Database table with the values from a delimited text file. To begin, we need some Data Factory objects; anyone familiar with Data Factory will understand we need Linked Services and Datasets to connect to our data and create schemas, and Pipelines to give structure to the activities and create our control flow. In this example I have a small comma separated file that has some data quality issues such as:
- Leading and trailing white space
- Weird column names
- Non-standardised NULL values
- Weak data types
The data flow we will create will address all of these issues before depositing the file into a SQL Database. First, we should create a linked service that connects to the blob store or data lake account where our dirty file is stored. We then need a dataset that sits on top of the linked service and allows us to locate and read the file using specified parameters such as file path, column delimiter, row delimiter etc. Additionally, in this dataset we can import the schema of the file so that we have some column names available. Next, we can create a linked service that connects to our output SQL Database and also a dataset that points to the correct table, again importing the schema.
After creating these objects we also need a pipeline to orchestrate the process, and this will ultimately call our fancy new data flow. You only need a single activity in this pipeline, found under the "Move & Transform" heading and called "Data Flow (Preview)". At this point you should have something similar to the following, with the exception of the data flow object under the Data Flows (Preview) tab. NOTE: I have not gone into much detail about creating these objects as they will be familiar to most ADF developers. For pointers on how to create the above, follow this link. Now we can begin creating our Data Flow. Firstly, we will add the source and the sink to our data flow; this gives us our bookends as a clear starting point, and we can then play about with the middle bit afterwards. Also, at this point, toggle the switch at the top of the Data Factory UI named "Data Flow Debug" – this will start a cluster up for you and avoid you having to wait later on. Hover over the Data Flows (Preview) header, click on the ellipsis and choose "Add Data Flow"; this will take you into the Mapping Data Flow UI where you can begin creating your Data Flows. Remember to set a sensible name for your data flow in the general tab of the data flow UI. Click "Add Source" and we can begin configuring the source activity of our data flow. Provide a clear name that identifies the source. This source is fed from a dataset, so the names should closely match. Choose the dataset that is linked to your blob account and uncheck "Allow schema drift". This is a useful option that allows constantly changing source files to be handled; however, we will cover that in a later blog. On the "Source Options" tab you can choose to add actions that occur after completion, such as deleting or moving source files. On the "Projection" tab you can tailor your schema, which is defined in the source dataset.
My preference is to leave these all as strings to avoid any early typing errors, as we will address the types later in the flow. Finally, on the Optimize, Inspect and Data Preview tabs, all defaults can remain the same. Now click on the tiny + icon in the bottom right corner of the source activity and choose "Sink" from the drop-down list. Here we can configure the output of our flow, which in this case will be a SQL Database. Specify the name, again relating it to your sink dataset, and choose the SQL Database dataset created earlier. On the "Settings" tab you can choose which methods can be used by ADF when working with your table. These can be any combination of insertion, deletion, upserting or updating. Also, you can define actions to occur in the database before loading the data, such as recreating the entire table or truncating the existing table before loading into it. Finally, on the "Mapping" tab you can map columns from source to sink. Be aware that any columns that are not strings in your database will not be able to be mapped until the data typing has occurred. Now we have a basic copy structure that does nothing clever yet but does connect our dirty source file up to our sink dataset, and we can begin the actual transformation. The first transformation will be to trim white space from the columns. Click on the + icon and choose "Derived Column". Within the "Derived Column Settings" tab you should add each of the columns in your source dataset and then enter the following expression for each one in the expression editor: trim({column name}). This expression will remove any whitespace from the column's value, ensuring the database receives a clean value. Next we will standardise any NULL values and also transform the columns into their correct data types. To do this, click the + icon again and choose "Derived Column" again.
Similar to the above step, you can add an entry in the "Derived Column Settings" tab for each column, using the below expression: replace(replace({column name}, ' ', ''), 'Unknown', ''). This expression will replace any empty values with NULL, and any 'Unknown' values will also be replaced with NULL, so that we have some standardisation before loading into the database. Any NULL values already present will be untouched. In my dataset I need to change one column from its originating string data type into an int so that it can be dropped into the DB. Rather than making this change in place, it is best to create a new column, so that you have both the original column and the new column with the correct type. Whilst still in the expression editor, hover over any column name in the "OUTPUT SCHEMA" window visible on the left hand side and choose the + icon. This allows you to add a new column to your data flow, and you can use any of the conversion functions (toInteger, toLong, toString, toDate, toBoolean etc.) to coerce the value into its correct type. At this point you should have four steps that resemble the below screenshot. Once your Data Flow Debug session is online you can debug the data flow and hopefully see the nice clean values pass through into the database. Throughout this process I recommend taking a peek at the Inspect and Data Preview tabs. The Inspect tab gives a bit more information about what steps are taking place on the data in that activity, and the Data Preview tab will show you how the data will look, although the Debug session needs to be active for this to work. Finally, the Optimize tab allows you to set the partitioning of the data using techniques such as round robin, hash and range distribution, although these are out of the scope of this blog.
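If you want to sanity-check what the two Derived Column steps are doing before wiring them up, the expressions above behave roughly like the following Python sketch. This is only an illustration of the logic, not ADF code, and the column names used here are invented:

```python
def clean_value(value):
    """Mimic the data flow's two derived-column expressions:
    trim({column name}), then standardise empty and 'Unknown' values to NULL."""
    if value is None:
        return None          # existing NULLs pass through untouched
    value = value.strip()    # trim({column name})
    if value == '' or value == 'Unknown':
        return None          # standardise non-values to NULL
    return value

def to_int_column(value):
    """Mimic adding a new typed column with toInteger(...)."""
    cleaned = clean_value(value)
    return int(cleaned) if cleaned is not None else None

# A dirty row like the ones in the sample file
row = {'customer_name': '  Jane  ', 'region': 'Unknown', 'order_count': ' 12 '}
cleaned_row = {k: clean_value(v) for k, v in row.items()}
cleaned_row['order_count_int'] = to_int_column(row['order_count'])
print(cleaned_row)
```

Running this shows leading/trailing whitespace removed, 'Unknown' mapped to None (NULL), and the new strongly typed column sitting alongside the original string column, just as the data flow produces.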
Hopefully this brief tutorial has been helpful and allowed you to gain some early knowledge on Data Factory Mapping Data Flows meaning that you can go on to create Flows that are tailored to your needs and cover off a wider variety of scenarios. Any questions, thoughts or feedback, please catch me on twitter @MattTheHow.

Naming downloaded files with Azure Data Lake Store File System Task

I was recently working on a hybrid project where we download files from a lake and transform the data with SSIS. I was stunned to find that there’s no native ability to name the file you download from the lake! Even more frustrating, the downloaded file was inconsistently named as Data.<GUID>, rendering the SSIS File System Task useless in this case also. PowerShell to the rescue… Using an Execute Process Task to call the following PowerShell script, we were able to overcome this challenge.

param([string] $NewFileName, [string] $LocalFolder, [string] $FileNameFilter)

$file = Get-ChildItem -Path $LocalFolder -Filter $FileNameFilter |
    ? { $_.LastWriteTime -gt (Get-Date).AddSeconds(-15) } |
    select -Last 1

Move-Item -Path $file.FullName -Destination $LocalFolder"\"$NewFileName -Force

The script accepts three parameters:
- $NewFileName – what you want to name the file, including the file extension.
- $LocalFolder – the local folder in which the file resides.
- $FileNameFilter – a mask to apply when searching for the downloaded file. In this case we used Data.*, where * is a wildcard for the GUID.
Get-ChildItem is used to obtain the details of the latest file written to our $LocalFolder within the last 15 seconds. This just adds an element of security, minimizing the risk of the script being used outside of the SSIS process and renaming files it shouldn’t. Move-Item is used instead of Rename-Item as, in our case, we wanted to overwrite the file if it already existed. If you have multiple packages using this script, called in parallel by a master package, I would highly recommend adding a completion constraint between all of the Execute Package Tasks to ensure no file is accidentally renamed by another package running at the same time. If removing parallelism isn’t an option for performance reasons, you could set up a different local folder per package.

Using ADF V2 Activities in Logic Apps

Logic Apps recently introduced the ability to connect to an Azure Data Factory V2 instance and perform a number of actions, including cancelling a pipeline, invoking a pipeline, and interrogating the service for information about a particular pipeline run. This blog will focus on how to utilise the full potential of ADF V2 via a Logic App connector when calling a pipeline. In its current form there is no way to provide input parameters to ADF from Logic Apps, thereby hamstringing one of the best features of ADF V2! Later in this blog I will show how to get around that. Despite this, there are a number of reasons why a Logic App calling an ADF V2 pipeline with a simple activity is a great thing, and I have gone into some detail below.

Event Driven Execution
The Azure Data Factory V2 team are now starting to bring in event driven triggers, but currently this is limited only to the creation or modification of blobs in a blob storage account. This is a good start but pales in comparison to the overwhelming number of events that can trigger a Logic App, notably things like Service Bus, Event Grid, HTTP calls etc. etc… the list goes on! More info on Logic App connectors here. By utilising Logic Apps as a wrapper for your ADF V2 pipelines, you open up a huge number of opportunities to diversify what triggers a pipeline run.

Simplifying Loops, Conditionals and Failure Paths
In addition to event driven triggers, the ADF team have also brought in an IF activity and a number of looping activities, which are really useful in a lot of scenarios. However, this kind of logic can be simplified when it's built in Logic Apps. The visual editor in Logic Apps makes understanding the flow of a loop or an IF much simpler for those not so familiar with the business logic that's been implemented. Additionally, failure paths can be handled much more efficiently and clearly, with many more options for logging or notifying of failure right out of the box – e.g. the Send Email action.
Heavy Lifting of Data
Logic Apps, for all its benefits, is definitely not a heavy lifter of data. It really excels at lightweight messaging and orchestration, whereas Data Factory is great at moving big chunks of data when using the Copy activity. When you pair the two together you get something that resembles SSIS Control Flow (Logic Apps) and SSIS Data Flow (Data Factory). Now that they can be closely and easily integrated, orchestrating the logical flow and movement of data in the cloud becomes much simpler.

So now we know that pairing Logic Apps and Data Factory V2 is a great idea, let's look at how to do it. Understandably there are no triggers from Data Factory at this point, so you will need to trigger your Logic App in any one of the million ways that are available. Once you have your trigger sorted, you can search for the Data Factory connector and choose the action you need. You will then need to connect to the Data Factory service as below by logging into your Azure tenant. Once logged in, you will need to supply a few details to locate the Data Factory instance and pipeline, as below. Now you have completed this, you can trigger your Logic App and you will see that your Data Factory pipeline is invoked as normal. However, by using only this method we are depriving ourselves of the ability to derive parameter values outside of Data Factory and pass them in at execution time. Currently there is nowhere to supply the input data! This is where the blog gets a bit hacky and we venture into the Code View of the Logic App. [Side note: you can usually get around a lot of tricky situations by manipulating the code in Logic Apps rather than relying only on what the UI gives you.] Click the “Code View” button in the designer and then locate your Data Factory V2 activity definition in the code page.
It will look something like this (I have whited out my subscription id). In order to pass data into this Data Factory pipeline, all we need to do is add a “body” attribute within the “inputs” object, like the below image. We can then reference any of the variables called out in the Logic App, or any other value that might be available to us. In case you’re wondering, these input values need to be configured as input parameters to the Data Factory pipeline you will be calling. More on how to do that here. Once you have updated your code accordingly, save and run your Logic App and you should be able to see the pipeline invoked with your inputs in the Data Factory service, as below. In conclusion, the marriage of Logic Apps and Data Factory is a happy and harmonious one with many benefits to be exploited. Now armed with the key to utilise input parameters for pipelines called from Logic Apps, you can cater for a vast number of data integration scenarios that require complex but clear logic and heavy lifting of data. Any questions or comments, please supply below or catch me on twitter @MattTheHow.
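As the original screenshots are not reproduced here, the shape of the edited action is roughly as follows. This is a hedged sketch of what a Data Factory "Create a pipeline run" action can look like in Code View after adding the "body" attribute; the subscription, resource group, factory, pipeline and parameter names are placeholders, not values from a real instance:

```json
{
  "Create_a_pipeline_run": {
    "type": "ApiConnection",
    "inputs": {
      "host": {
        "connection": {
          "name": "@parameters('$connections')['azuredatafactory']['connectionId']"
        }
      },
      "method": "post",
      "path": "/subscriptions/<subscription-id>/resourcegroups/<resource-group>/providers/Microsoft.DataFactory/factories/<factory-name>/pipelines/<pipeline-name>/CreateRun",
      "body": {
        "MyPipelineParameter": "@triggerBody()?['someValue']"
      }
    }
  }
}
```

The key point is the "body" object inside "inputs": each key must match an input parameter defined on the Data Factory pipeline, and each value can be any Logic Apps expression available at that point in the workflow.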

Bulk updating SSIS package protection level with DTUTIL

I was recently working with an SSIS project with well over 100 packages. It had been developed using the ‘Encrypt Sensitive with User Key’ package and project protection level, the project deployment model, and components from the Azure Feature Pack. The project was to be deployed to a remote server within another domain, using a different user to the one who developed the packages, which caused an error with the ‘Encrypt Sensitive with User Key’ setting upon deployment. As a side note regarding package protection level and the Azure Feature Pack, ‘Don’t Save Sensitive’ cannot be used, as you’ll receive an error stating: “Error when copying file from Azure Data Lake Store to local drive using SSIS. [Azure Data Lake Store File System Task] Error: There is an internal error to refresh token. Value cannot be null. Parameter name: input”. There seems to be an issue with the components not accepting values from an environment variable if ‘Don’t Save Sensitive’ is used. With the project deployment model, the project and all packages within it need to have the same protection level, but there’s no easy way from within Visual Studio to apply the update to all packages, and I didn’t want to have to open 100 packages individually to change the setting manually! SQL and DTUTIL to the rescue…

Step One
Check the entire project and all packages out of source control, making sure you have the latest version. Close Visual Studio.

Step Two
Change the protection level at the project level by right clicking the project and selecting Properties.
A dialogue box should open; on the Project tab you should see a Security section, under which you can change the protection level.

Step Three
Using SQL against the SSISDB (assuming the project is deployed locally), you can easily produce a list of all the packages you need to update, embedded within a DTUTIL string to change the encryption setting:

USE SSISDB

DECLARE @FolderPath VARCHAR(1000) = 'C:\SourceControl\FolderX'
DECLARE @DtutilString VARCHAR(1000) =
    'dtutil.exe /file "' + @FolderPath + '\XXX" /encrypt file;"' + @FolderPath + '\XXX";2;Password1! /q'

SELECT DISTINCT REPLACE(@DtutilString, 'XXX', [name])
FROM internal.packages
WHERE project_id = 2

This query will produce a string like the below for each package in your project:

dtutil.exe /file "C:\SourceControl\FolderX\Package1.dtsx" /encrypt file;"C:\SourceControl\FolderX\Package1.dtsx";2;Password1! /q

This executes DTUTIL for the file specified, encrypting the package using ‘Encrypt with password’ (2) and, in this case, the password Password1!. The /q tells the command to execute “quietly”, meaning you won’t be prompted with an “Are you sure?” message each time. More information about DTUTIL can be found here. If the project isn’t deployed, the same can be achieved through PowerShell against the folder the packages live in.

Step Four
Copy the list of packages generated by Step Three, open Notepad, and paste. Save the file as a batch file (.bat). Run the batch file as an administrator (right click the file, select Run as administrator). A command prompt window should open, and you should see it stream through your packages with a success message for each.

Step Five
The encryption setting for each package is also stored in the project file, and needs to be changed here too. Find the project file for the project you’re working with (.dtproj), right click it and select Open with, then Notepad or your preferred text editor.
Within the SSIS:PackageMetaData node for each package, there’s a nested <SSIS:Property SSIS:Name=”ProtectionLevel”> element containing the integer value for its corresponding protection level. Run a find and replace for this full string, swapping only the integer value to the desired protection level. In this example we’re going from ‘Don’t Save Sensitive’ (0) to ‘Encrypt with password’ (2). Save and close the project file.

Step Six
Re-open Visual Studio and your project, and spot check a few packages to ensure the correct protection level is now selected. Build your project, and check back in to source control once satisfied.
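Step Five's find and replace can also be scripted if you prefer. The sketch below is a hypothetical Python helper (not part of the original walkthrough) that rewrites every ProtectionLevel property value in the .dtproj text; it assumes the element appears exactly as <SSIS:Property SSIS:Name="ProtectionLevel">n</SSIS:Property>, as described above:

```python
import re

def set_protection_level(dtproj_text: str, new_level: int) -> str:
    """Replace the integer inside every ProtectionLevel property element."""
    pattern = r'(<SSIS:Property\s+SSIS:Name="ProtectionLevel">)\d+(</SSIS:Property>)'
    return re.sub(pattern, rf'\g<1>{new_level}\g<2>', dtproj_text)

# Example: move from 'Don't Save Sensitive' (0) to 'Encrypt with password' (2)
snippet = '<SSIS:Property SSIS:Name="ProtectionLevel">0</SSIS:Property>'
print(set_protection_level(snippet, 2))
# <SSIS:Property SSIS:Name="ProtectionLevel">2</SSIS:Property>
```

You would read the .dtproj file, run it through this function, and write it back – achieving the same result as the manual find and replace, but repeatably.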

SSIS Azure Data Lake Store Destination Mapping Problem

The Problem
My current project involves Azure’s Data Lake Store and Analytics. We’re using the SSIS Azure Feature Pack’s Azure Data Lake Store Destination to move data from our client’s on premise system into the lake, then using U-SQL to generate a delta file which goes on to be loaded into the warehouse. U-SQL is a “schema-on-read” language, which means you need a consistent and predictable format to be able to define the schema as you pull data out. We ran into an issue with this schema-on-read approach, but once you understand the issue, it’s simple to rectify. The Data Lake Store Destination task does not use the column ordering shown in the destination mapping. Instead, it appears to rely on an underlying column identifier. This means that if you apply any conversions to a column in the data flow, that column will automatically be placed at the end of the file – taking away the predictability of the file format, and potentially making your schema inconsistent if you have historic data in the lake.

An Example
Create a simple package which pulls data from a flat file and moves it into the lake. The mappings of the destination are as follows: Running the package and viewing the file in the lake gives us the following (as we’d expect, based on the mappings): Now add a conversion task – the one in my package just converts Col2 to a DT_I4 – update the mappings in the destination, and run the package. Open the file up in the lake again, and you’ll find that Col2 is now at the end and contains the name of the input column, not the destination column:

The Fix
As mentioned in “The Problem” section, the fix is extremely simple – just handle it in your U-SQL by re-ordering the columns appropriately during extraction! This article is more about giving a heads up and highlighting the problem than a mind-blowing solution.

Extracting Users from AD using SSIS

I've recently been working on a project which required KPI level security alongside the traditional row level security secured at a geography level. This would limit what financial data a user could see within a cube, without having to create multiple cubes or use perspectives (which would not actually secure the data). To achieve this, I needed to populate a set of 'KPI User/Role' tables stored in Master Data Services (MDS) with a list of users who were stored in a particular AD group. I would need these tables updated on a regular basis to grant/revoke access. We could then use these names along with the USERNAME() function in DAX to filter.

The Solution
One method to solve my problem is SSIS. The package can be set up to run as part of a SQL Agent Job, either on a schedule or on demand. My list of users is stored in an AD group called LH_FIN. To start with, you will need to truncate and clear the MDS staging tables that you are about to populate. You can then use the data flow to process the majority of the logic by creating a Script Component task. The purpose of this is to loop through Active Directory and pick up the user details that belong to the specified AD group, or set of AD groups if dealing with multiple roles. A number of variables are defined which the task uses to complete the lookup:
- strLDAP – the LDAP directory on which to perform the lookup
- strDomain – the domain on which the AD group(s) belong
- strADPrefix – the AD group prefix from which to return user information
- strADParent – the parent group which contains the AD groups being looked up (may not be needed if only looking up a single AD group)
To extract users from multiple groups, make sure the prefix stored in the variable strADPrefix covers both groups. Once the rows are extracted, it is then a case of using SSIS to split the data accordingly on the AD group name.
The following code can be used in the script. The first section sets up the objects required to interrogate the directory, and the fields we expect to return from the accounts – the most important of which is memberof, which is used to check against our AD prefix. It also filters out items such as service accounts and disabled accounts to speed up the interrogation process. Note that the Filter property holds a single LDAP filter string, so the individual conditions need to be combined with an & – assigning them one after another would simply overwrite each previous filter:

Public Overrides Sub CreateNewOutputRows()

    Dim domain As String = Variables.strDomain
    Dim searchRoot As New DirectoryEntry(Variables.strLDAP, Nothing, Nothing, AuthenticationTypes.Secure)
    Dim dirSearch As New DirectorySearcher(searchRoot)
    dirSearch.SearchScope = SearchScope.Subtree

    'LogonName, GroupsUserBelongsTo, Department, JobTitle, MailAddress, DisplayName
    dirSearch.PropertiesToLoad.Add("samaccountname")
    dirSearch.PropertiesToLoad.Add("memberof")
    dirSearch.PropertiesToLoad.Add("department")
    dirSearch.PropertiesToLoad.Add("title")
    dirSearch.PropertiesToLoad.Add("mail")
    dirSearch.PropertiesToLoad.Add("displayname")

    'Filter to user objects, excluding disabled accounts and
    'password-never-expires accounts, i.e. service accounts
    dirSearch.Filter = "(&(objectCategory=person)(objectClass=user)" &
                       "(!userAccountControl:1.2.840.113556.1.4.803:=2)" &
                       "(!userAccountControl:1.2.840.113556.1.4.803:=65536))"

    'Sets chunk size for retrieving items
    dirSearch.PageSize = 1000

The next section of code performs the search and, for any LDAP objects it finds within the filter set, returns the properties requested. These properties are then stored in key/value pairs.
    Dim props As ResultPropertyCollection
    Dim values As ResultPropertyValueCollection
    Dim key As String
    Dim userAccountName As String
    Dim departmentHome As String
    Dim jobtitle As String
    Dim GroupName As String
    Dim email As String
    Dim displayName As String
    Dim groups As New ArrayList

    Using searchRoot
        'Return all LDAP objects, e.g. LDAP://acl/CN=Tristan Robinson,OU=Employees,DC=ACL,DC=local
        'CN = Common Name, OU = Organisational Unit, DC = Domain Component
        Using results As SearchResultCollection = dirSearch.FindAll()
            For Each result As SearchResult In results
                'For each object return properties, i.e. displayname, memberof, etc
                props = result.Properties
                For Each entry As DictionaryEntry In props
                    key = CType(entry.Key, String)
                    'For each property, inspect the property and record its value
                    'Logon Name
                    If key = "samaccountname" Then
                        values = CType(entry.Value, ResultPropertyValueCollection)
                        userAccountName = CType(values.Item(0), String)
                    End If
                    'Department
                    If key = "department" Then
                        values = CType(entry.Value, ResultPropertyValueCollection)
                        departmentHome = CType(values.Item(0), String)
                    End If
                    'Job Title
                    If key = "title" Then
                        values = CType(entry.Value, ResultPropertyValueCollection)
                        jobtitle = CType(values.Item(0), String)
                    End If
                    'E-Mail
                    If key = "mail" Then
                        values = CType(entry.Value, ResultPropertyValueCollection)
                        email = CType(values.Item(0), String)
                    End If
                    'Display Name
                    If key = "displayname" Then
                        values = CType(entry.Value, ResultPropertyValueCollection)
                        displayName = CType(values.Item(0), String)
                    End If
                    'Groups User Belongs To (array/collection)
                    If key = "memberof" Then
                        values = CType(entry.Value, ResultPropertyValueCollection)
                        groups = GetGroups(values)
                    End If
                Next

The final section filters the data into the output buffer if, from the array list extracted above, we have strings matching our original AD prefix variable. It then resets and loops round for the next account.
                'Export user details to buffer if it passes the logical test
                For Each item As String In groups
                    'Avoids computer accounts, i.e. ending with $
                    If userAccountName.EndsWith("$") = False And item.ToString.StartsWith(Variables.strADPrefix) Then 'And item.ToString <> (Variables.strADParent)
                        Output0Buffer.AddRow()
                        If String.IsNullOrEmpty(userAccountName) Then
                            Output0Buffer.UserAccountName_IsNull = True
                        Else
                            Output0Buffer.UserAccountName = userAccountName
                        End If
                        If String.IsNullOrEmpty(domain) Then
                            Output0Buffer.Domain_IsNull = True
                        Else
                            Output0Buffer.Domain = domain
                        End If
                        If String.IsNullOrEmpty(item.ToString) Then
                            Output0Buffer.GroupName_IsNull = True
                        Else
                            Output0Buffer.GroupName = item.ToString
                        End If
                        If String.IsNullOrEmpty(jobtitle) Then
                            Output0Buffer.JobTitle_IsNull = True
                        Else
                            Output0Buffer.JobTitle = jobtitle
                        End If
                        If String.IsNullOrEmpty(email) Then
                            Output0Buffer.Email_IsNull = True
                        Else
                            Output0Buffer.Email = email
                        End If
                        If String.IsNullOrEmpty(displayName) Then
                            Output0Buffer.DisplayName_IsNull = True
                        Else
                            Output0Buffer.DisplayName = displayName
                        End If
                    End If
                Next

                groups.Clear()
                userAccountName = ""
                departmentHome = ""
                jobtitle = ""
                GroupName = ""
                email = ""
                displayName = ""
            Next
        End Using
    End Using

End Sub

I also required a function to split the list of groups a user belongs to and store them in another array list.

Private Function GetGroups(ByVal values As ResultPropertyValueCollection) As ArrayList
    Dim valueList As ArrayList = New ArrayList()
    For Each Item As Object In values
        Dim memberof As String = Item.ToString()
        Dim pairs As String() = memberof.Split(",".ToCharArray)
        Dim group As String() = pairs(0).Split("=".ToCharArray)
        valueList.Add(group(1))
    Next
    Return valueList
End Function

End Class

Once a list of users has been extracted, you will need to do a lookup against the existing list and only stage those that are new. This can be achieved through a simple Lookup component.
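The GetGroups function simply takes the first relative distinguished name from each memberOf entry. If you want to test that parsing logic in isolation, here is an equivalent sketch in Python (the sample distinguished name below is invented for illustration):

```python
def get_groups(member_of_values):
    """Extract the group name (CN) from each memberOf distinguished name,
    mirroring the VB GetGroups function: split on ',', then on '='."""
    groups = []
    for dn in member_of_values:
        first_pair = dn.split(',')[0]        # e.g. 'CN=LH_FIN'
        groups.append(first_pair.split('=')[1])
    return groups

print(get_groups(['CN=LH_FIN,OU=Groups,DC=ACL,DC=local']))
# ['LH_FIN']
```

The output list is what the script then matches against strADPrefix before writing rows to the output buffer.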
You can then move the rows into the MDS staging table ready for the load into MDS. After the data flow has processed successfully, the next stage is to sweep the records into MDS using one of the built in stored procedures in the product. The results can be seen in the MDS Import View on the MDS site. The following entities can then be used to power the DAX:
- User (Username, Team, AD Login – populated from the script above)
- Role (Role – populated manually)
- User Role (User, Role – joins users to roles, populated manually)
- KPI (KPI – populated manually)
- KPI Role (Role, KPI – joins roles to KPIs, populated manually)
These are then processed to the DW by ETL, and a UserKPI view is written across the tables to provide an AD Login to KPI pair. To filter the measures in the cube, you can then apply the following piece of DAX to the original measure – Gross Sales in the example below:

Gross Sales :=
IF (
    CONTAINS (
        CALCULATETABLE ( 'UserKPI', UserKPI[KPIName] = "Gross Sales" ),
        UserKPI[ADLogin], USERNAME ()
    ),
    [Gross Sales (ACTUAL)],
    BLANK ()
)

This concludes this blog post – hopefully it will be useful for anyone who wants to extract users from AD to control access/security within an application. Data could also be loaded directly into DB tables rather than MDS if required. Please feel free to comment!

Optimising Lookups with Caching

One of, if not the, most used transformations in SQL Server Integration Services (SSIS) must be the Lookup transform. Pretty much every ETL routine will make use of the component at some point, especially at the time of loading fact tables. Given the nature of fact and dimension tables, these lookups can be tasked with loading large amounts of data into memory to perform high speed lookups. There are occasions when a few optimisations on these lookups can greatly improve the efficiency of your ETL.

Choosing the Best Cache Option
One of the most important options featured on the Lookup transform is the Cache Mode. By understanding the differences between these choices, you can be sure to pick the option that best suits your ETL needs.

Using the Full Cache Option (Default)
The Full Cache option tells SSIS to read all your reference data into memory and then perform its lookups on that data. This has the benefit of only hitting the database once to fetch the data, but can cause delay on package start-up as all the lookup data has to be loaded prior to executing the package. You can optimise this by using a SQL command to fetch only the columns you need but, as you can imagine, a dimension with over 1 million rows will take some time to cache – and this is before you have even done any lookups! Be aware that if you cannot fit your entire reference set into memory, your package will FAIL!! SSIS cannot spool data out to disk if Full Cache is chosen, and there is no graceful way of handling this error either. Luckily there is a great blog on how to calculate the cache size here. Use this option when:
- You want to avoid excessive traffic to your database, or it is remote
- You have a small reference table that can be cached quickly
- You can fit your whole lookup dataset into memory

Using the Partial Cache Option
The Partial Cache is a good half way house.
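Before trusting a full cache with a large dimension, a rough back-of-envelope check is worth doing: rows multiplied by the byte width of the cached columns gives a floor for the memory needed. This is only a hypothetical sketch, not a substitute for the detailed calculation linked above:

```python
def estimate_full_cache_bytes(row_count, column_byte_sizes):
    """Rough full-cache estimate: rows x total bytes per cached row.
    Real SSIS caches carry extra per-row overhead, so treat this as a minimum."""
    return row_count * sum(column_byte_sizes)

# e.g. 1 million rows, caching an INT key (4 bytes)
# plus a 50-character string (~100 bytes as UTF-16)
size = estimate_full_cache_bytes(1_000_000, [4, 100])
print(f"{size / 1024**2:.0f} MB minimum")  # 99 MB minimum
```

Trimming the lookup to only the key and the columns you actually need, as suggested above, is the easiest way to shrink this number.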
This option tells SSIS to fetch the data at runtime, meaning a quicker start-up time but also a lot more queries to the database. Any value not currently in the cache will be queried for in the database and then stored in the cache, making it available for future lookups. The efficiency here comes from not having to charge the cache on package start-up, and from having any previously fetched items stored in memory. One very specific benefit of the partial cache is that if your lookup table is being written to during the data flow, a full cache wouldn't see those rows, as it only queries the database once, prior to execution. The partial cache, however, works perfectly here, as it sees all the rows in the table at runtime. Additional performance can be gained by making use of the Miss Cache. This option (found on the Advanced page) allows a portion of the cache to be reserved for rows that don't have a match in the database. SSIS will then know not to look for these rows again, saving unnecessary queries. Use this option when:

- You have a small reference set not worth charging a full cache for
- You want to optimise your cache for the most frequently accessed rows
- Your reference table is being updated within your data flow

Using the No Cache Option

Finally, No Cache literally means that no data will be cached. Every single row will be looked up against the database, even if that row has already been looked up before. Use this option when:

- You only need to process a small number of rows
- You have severe memory limitations
- For some reason, the partial cache wasn't a better option

The Cascading Lookup Pattern

A nice way of optimising your lookups is to combine a full and a partial cache, creating what is known as a Cascading Lookup. The idea is to use the full cache to store a subset of your reference data that is frequently looked up, and a partial cache for the rest.
This works because the full cache shouldn't take too long to charge, as it's only a subset of the data, but will still have the benefits of being in memory. The partial cache is used for the larger, infrequently looked-up data, as it has no charging time but can still use its cache to avoid querying the database twice for the same lookup value. An added benefit is that if the reference table is written to during the data flow, those rows will be caught by the partial cache lookup. In this example, we have a large customer dimension, but we know that only 500 of those customers make a lot of purchases, so we can use a cascading lookup to optimise the data flow:

1. First, choose the Full Cache option, then use a SELECT statement to get only the data that is most looked up against – in this case, the 500 most loyal customers. Remember to redirect non-matching rows to the non-matching output!
2. Next, choose the Partial Cache option and use a SELECT statement to get the rest of the dataset, excluding the loyal customers.
3. Finally, union the rows back together so that you have a single result set.

The Cache Connection Manager

Another option is to use the Cache Connection Manager. This is akin to a persisted full cache that can be referenced several times throughout your package. Better yet, you can convert a package cache connection manager to a project cache connection manager so that it can be referenced by other packages in the project! This offers multiple benefits:

- Lookup caches can be charged in parallel to speed up package start-up
- Caches can be loaded at any time (although it must be before the lookups need them), so if you need to load the dimension during your ETL, you can
- Caches only need to be charged once for your whole ETL project, meaning similar lookups avoid querying the same data twice

Implementing a Cache Connection Manager is seriously easy!
To start, create a data flow task and drag a data source onto the design surface. Below it, add a Cache Transform and connect the two. Once you have configured your data source (remembering to select only the columns you absolutely need!), open the editor for the cache transform. Where prompted for a cache connection manager, select New. On the General page, you can add a name and description. On the Columns page, you will need to do some configuration: for the column (or columns) you want to look up on, you should add an Index Position. The lowest index number greater than 0 is looked up first, followed by any other columns with an Index Position greater than 0. Any columns with a 0 index are columns that you want to pull into the data flow – in this case, the CustomerKey. You can also set data types here. Once that's complete, you can map the columns in the cache transform. Finally, you will now see a new connection manager in your Connection Managers window; right click on the new cache connection manager and click Convert to Project. An implementation of a cache connection manager might have a parent package like the one below. Within each "Populate XXXXXX Cache" is a data flow that extracts source data and puts it into a cache transform. The fact table load packages below that can then reference all the dimensional data using the cache connection manager. In conclusion, there is a lot of optimisation that can be done with regard to lookups and their caching options. Not all of it needs to be implemented all of the time – it is more of a case-by-case decision – but knowing that the options exist is half the battle.
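To make the partial cache and miss cache behaviour concrete, here is a hypothetical sketch in Python (the class and names are mine, not part of SSIS): check the in-memory cache first, fall back to the database on a miss, and remember keys that are known not to exist so they are never queried twice.

```python
# Hypothetical sketch of SSIS partial-cache lookup behaviour (names are mine).
class PartialCacheLookup:
    def __init__(self, db_lookup, miss_cache_size=1000):
        self.db_lookup = db_lookup        # stand-in for a database query: key -> row or None
        self.cache = {}                   # previously matched rows
        self.miss_cache = set()           # keys known to have no match
        self.miss_cache_size = miss_cache_size
        self.db_calls = 0                 # instrumentation for the example

    def lookup(self, key):
        if key in self.cache:
            return self.cache[key]        # cache hit: no database round trip
        if key in self.miss_cache:
            return None                   # known miss: no database round trip
        self.db_calls += 1                # only now do we hit the "database"
        row = self.db_lookup(key)
        if row is None:
            if len(self.miss_cache) < self.miss_cache_size:
                self.miss_cache.add(key)  # reserve space for known misses
        else:
            self.cache[key] = row
        return row

# Two lookups of the same key cost one query; a repeated miss costs nothing.
customers = {"C001": "Loyal Customer Ltd"}
lkp = PartialCacheLookup(customers.get)
lkp.lookup("C001"); lkp.lookup("C001")    # 1 database call in total
lkp.lookup("C999"); lkp.lookup("C999")    # 1 more call; the miss is then cached
```

The cascading lookup pattern above is essentially this structure with a pre-charged dictionary for the 500 loyal customers sitting in front of it.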

Extracting Data from Salesforce using SSIS

This blog looks at a couple of methods of extracting data from Salesforce and gives a demo of how to output tables and queries into flat files.

Salesforce is HUGE! How do I find what I'm looking for?

Thankfully SFDC make it easy for us to find the tables that hold custom objects, as they are suffixed with "__c" (e.g. "Assessment_Questions__c"). This convention carries through to column level as well, meaning that when a Salesforce standard table is used, such as SFDC's Event table, any custom columns are also suffixed with "__c".

Can I use SQL?

Salesforce utilises a modified version of SQL called SOQL (Salesforce Object Query Language). There are some keyword similarities, such as starting a query with "SELECT" and calling a table with "FROM", but if you want to limit a query result to a fixed number of rows you need to add "LIMIT n" to the end of the query string, not the start! Additionally, SFDC provide some handy functions to make SOQL slightly easier on us developers. I have listed a few that I used below, but full documentation is available here:

- LAST_YEAR – provides a date range from Jan 1st of the previous year to Dec 31st of the previous year. Example: SELECT AccountId FROM Customers WHERE CreatedDate > LAST_YEAR
- LAST_N_MONTHS:n – provides a date range from the last day of the previous month, continuing back over the past n months. Example: SELECT Id FROM Account WHERE CreatedDate > LAST_N_MONTHS:12
- BLANKVALUE – similar to ISNULL; provides a substitute value if the field is null. Example: SELECT BLANKVALUE(AddressLine2, "N/A") FROM…

How do I connect to Salesforce?

In order to access the Salesforce system, you will need a username, password and security token. From my research, there are two viable methods for connecting to Salesforce using SSIS: an API component, or a linked server. Both have their pros and cons, but generally I have found the API route to be more efficient.
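The SOQL conventions above can be sketched in a few lines of Python. This is an illustrative helper of my own (the function name and fields are assumptions, not from Salesforce), showing the "__c" suffix for custom objects and the row limit going at the end of the query rather than the start:

```python
# Illustrative SOQL string builder (names are mine, not a Salesforce API).
def soql_recent_records(table, fields, months, limit):
    """Build a SOQL query using the LAST_N_MONTHS date literal."""
    return ("SELECT " + ", ".join(fields) +
            " FROM " + table +
            " WHERE CreatedDate > LAST_N_MONTHS:" + str(months) +
            " LIMIT " + str(limit))       # LIMIT goes last, unlike T-SQL's TOP

query = soql_recent_records("Assessment_Questions__c", ["Id", "Name"], 12, 100)
# query is now:
# "SELECT Id, Name FROM Assessment_Questions__c WHERE CreatedDate > LAST_N_MONTHS:12 LIMIT 100"
```

A string like this is exactly what the API components below expect in their "Object Query" source type.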
There are a number of 3rd-party providers around, and I have listed a few of these below:

- C-Data (API)
- Task Factory (API) – forces you to download all 50 components; can use SSIS variables and SOQL
- Kingsway Soft (API) – simple interface and can use SSIS variables; auto-generates SOQL; can't edit error output from the main component
- DB-Amp (Linked Server) – allows you to use an OLE DB connection manager

You can use the steps below to access SFDC and start retrieving data.

1. Download an SFDC SSIS component. For this demo I have used the Kingsway Soft component, available from the link above.
2. Run through the install wizard and, once complete, open Visual Studio.
3. Create a new SSIS project and add a Data Flow Task. Rename this to Kingsway Soft Demo.
4. Go into the data flow, right click in the toolbox and hit "Refresh Toolbox". You should now see two new components, "Salesforce Destination" and "Salesforce Source".
5. Drag a Salesforce Source onto the design surface and double click to edit the properties. Drop down "Connection Manager" and select "New". Fill out the details required using the connection info below.

   Username:
   Password: S4lesforce
   Security Token: oKar72dhKcBAWlq0i4M0RF7ua

6. Test the connection, and click "OK" once it has connected successfully. From the component properties screen, use the Source Type to select "Object Query". Add the SOQL statement into the text editor below and click "OK".

   TIP! If you select "Object" from the Source Type drop-down and choose your table, you can then switch the source type back to "Object Query" and the full SOQL statement to retrieve all columns will be generated automatically.

7. Add a flat file destination and configure the connection manager. Enable the data viewer between your SFDC component and the flat file destination and preview the data that is being extracted from SFDC.

8.
From here you can then repeat the process to build up a package that will bring down all the files you might need for your solution. There are additional tools, such as Salesforce Workbench, that allow developers greater access to the underlying solution, but these will be covered in a different blog. If this solution does or doesn't work for you then let me know!

Microsoft BI – Coming Soon to SQL Server 2016

After attending the PASS Summit 2016 a couple of weeks ago, I sat in a number of sessions that provided an insight into the direction Microsoft are heading with BI, and I thought I'd share this with the community. Looking back to October 2015, the official Reporting Roadmap blog from Microsoft stated their intent to dramatically improve the visualisations, integration and harmonisation of both on-prem (SSRS) and cloud-based services (Power BI). Whilst reporting appeared to be a renewed focus, they are constantly driving other areas of BI, such as Analysis Services and database/data warehousing development in the cloud. Now, for the interesting bit. Here is what we can expect in the SQL Server 2016 BI stack in the near future:

- SQL Server 2018 CTP (to be released within the next month)
  - This seems very early, considering 2016 has only just been released! Let's wait for an official Microsoft announcement.
- Azure DW
  - Auto Create Statistics
    - Currently, statistics have to be generated and updated on the fly.
  - Replicated Distributed Table
    - One that will excite any cloud-based SQL Server developer.
    - This will reduce data transfer between distributions/nodes and consequently improve performance.
- Azure Data Lake
  - General Availability (GA) is imminent.
  - Future features:
    - PolyBase, so that you can connect the two big MPP platforms (ADL and DW).
    - SSIS connectors (released with GA) for the Store.
    - Python and R in U-SQL.
- SSIS
  - Lots of new connectors, including ADLS and CRM.
- SSAS
  - Migrating Power BI models into Tabular.
    - This is coming very soon apparently, but will be developed in Azure SSAS first.
  - Object-level security in Tabular.
    - We currently have row-level security, but there is talk of securing a physical object, not just a row.
    - Even better news: Microsoft want to integrate the two together, which will make security awesome in Tabular.
- SSRS
  - Supporting (not pinning) Excel reports in RS reports.
    - This will come, but Power BI is the focus right now and we may have to wait a while.
- Power BI
  - Additional and better pivot table functionality.
  - Integrating Active Directory dynamically.
  - Potential to use Direct Query and Import modes together, as a hybrid.
    - The functionality is possible, but performance needs to be weighed up by Microsoft before anything will emerge.
  - Quick Calcs.
    - Only 'Percent of Current Total' is currently available.
    - The potential is to offer lots more, such as YTD, MAT, Current Year vs. Previous Year, etc.
    - This is for users who aren't familiar with DAX.
  - Template organisational content packs.
    - The ability to let the user personalise colours, fonts, etc. within a structured (organisational) content pack.
- Power BI Embedded
  - Application developers will be able to limit user sessions and therefore reduce the per-hour charges that come with them.

There are some features/issues Microsoft do not plan to change. Although, the good thing about Microsoft is that they are community driven, so if you feel strongly about anything (and get support from your peers), they may change their minds.

- SSRS
  - Q&A, Query Editor (Power Query), R integration, etc. are not being developed.
    - Pretty obvious really. Whilst they are going to introduce almost everything from Power BI, some elements of functionality are just not needed for on-prem purposes.
    - R scripts may come one day, but are not a focus right now.
- Power BI
  - Source control
    - No immediate plans to integrate with TFS or modularise the pbix files (for a more developer-based solution).
    - Not surprising, as this is a self-service tool rather than one aimed at development teams.
    - A workaround is to upload pbix files into OneDrive and use its versioning as an element of source control, or add the file into Visual Studio.
    - Keep voting if you want this!
(Currently 278 votes.)

- Power BI Embedded
  - Licence model
    - 'Speak to Marketing' is what we were told.

Is everyone excited? I certainly am. Microsoft are openly admitting that any new BI-related feature will be developed in the cloud first, but this was pretty obvious. For all you on-prem fanatics (me included), at least the bugs/undocumented features will be ironed out before we get to use them! My personal favourite has to be the direction SSRS is going. It was recently labelled a 'mature' product, but now RS has suddenly become a cool and current tool to use. The fact that Power BI and SSRS will become almost the same product tells us that on-prem is still an important part of the Microsoft licensing strategy. I am sure there are lots of other features coming into the BI stack over the next six months to a year, and these were just the ones I had heard about. Feel free to comment if you have additional ones to share.

Setup Of Oracle Attunity Connectors

Fairly often I come across a project where there is a requirement to extract data from an Oracle source. Whilst this can be done in SSIS using a standard ODBC connection, there are some drawbacks – the main one for me being that, when working with larger data volumes, the speed of data transfer can be quite slow. As such, I have found the best solution is generally to set up the Attunity connectors in order to facilitate quick extraction from the Oracle system. For a great comparison of the performance between straight ODBC and Attunity, please see this post from Len Wyatt. Unfortunately the setup can be a little fiddly, so I thought I would document the process here (as much for me as anyone else).

1. The first step is to determine the version of the Attunity connectors required, which depends upon your SQL Server version:

   SQL Server 2016          – v4.0 – Oracle 10.x, Oracle 11.x, Oracle 12c
   SQL Server 2014          – v3.0 – Oracle 10.x, Oracle 11.x, Oracle 12c
   SQL Server 2012          – v2.0 – Oracle and higher, Oracle 10.x, Oracle 11.x
   SQL Server 2008/2008 R2  – v1.2 – Oracle and higher, Oracle 10.x, Oracle 11.x

2. Following this, you need to install the Oracle client. Generally you will be working in a 64-bit environment, which will require both the 64-bit and 32-bit versions of the Oracle client to be installed (Data Tools uses a 32-bit connection, but deployed packages should run in 64-bit mode for optimum performance). The Oracle client can be downloaded from the Oracle website; you will need to sign up and create an account if you don't have one already. NOTE: it is not enough to only download the connectors for SSIS and run the installation packages – in order to retrieve data you must install the Oracle client. Once you have determined the correct version of the Attunity connectors, you will need to select the relevant version of the client which you wish to use.
Because Visual Studio runs SSIS packages in 32-bit mode, you will need the 32-bit version of the client installed on development machines. However, when you deploy the packages and run them from the SSIS Catalog, they run in 64-bit mode, so you need to install the 64-bit version to get the best performance in production environments. There are several versions of the Oracle client available for download on the site; most recently I used the version of the Oracle components shown below. I find that these install relatively simply and provide the required features without requiring too much disk space or consuming too much in the way of resources. Depending on the version of the client you have selected, the installation instructions from Oracle will vary. However, I would suggest you follow these points:

1. Run the installation packages as an administrator by right clicking and selecting 'Run as Administrator' – previously, when I have not done this it has led to unusual results.
2. Install the clients on all machines which will be running the SSIS packages – this may be several servers, including developer machines used for creating SSIS packages and servers used for scheduled ETL runs.
3. The installation should be done in the following order:
   - Install the Oracle 32-bit client (if required)
   - Install the Oracle 64-bit client (if required)
   - Restart the machine

Once the client is installed, a simple way to test that it is working is to use SQL Developer to open a connection to the Oracle database. Once this is working correctly, you then need to install the Attunity connectors. Again, depending on the system you are working on, you will need to install either the 64-bit or 32-bit Attunity components for SSIS. However, here the rules are slightly different:

- If you need to run in 64-bit mode, or in both 64-bit and 32-bit mode, you should install only the 64-bit version of the components.
- If you need to run in 32-bit mode only, then you should install only the 32-bit version of the components.

The latest version of the connectors (for SQL 2016) are available from: The installation of the components is pretty straightforward – once you have downloaded them, you run the executable and then restart the machine. Once complete, the Attunity components will be available for selection within the SSIS toolbox in the Data Flow, as per the below. The connectors are very versatile components and support many of the features you would expect from an SSIS component. Be aware that if you are looking for expressions, these are to be found on the Data Flow properties rather than the properties of the actual component.

Deleting Date Stamped Files Over X Days Old Using SSIS

One challenge I have faced recently is to automatically delete files which are over X days old. Below I have documented how I overcame this using a simple For Each Loop container in SSIS to loop through files which contain a date stamp in their file names.

What we need in SSIS:

- A variable – used to hold the names of the files we are looping through to delete
- A file connection manager – used to connect to the file(s) we wish to delete
- A For Each Loop container – used to loop through all the files in a directory
- A script task – a blank script task that serves no purpose other than to allow us to set a precedence constraint on the file system task
- A file system task – to actually delete the relevant files

Firstly, create a simple variable with an example file name to get you started. Once you have created a variable which contains an example file name, right click on your file connection, select Properties, and then click the ellipses next to the Expressions property. Next, select "ConnectionString" from the property drop-down menu and then, in the expression builder, add the folder path plus your variable name, as shown in the screenshot below. If you wish, you can also add the folder path into your variable, or you could have a separate variable for the folder path. After creating your variable and setting up a connection to an example file, you will need to go into your For Each Loop container and specify the folder containing the files we wish to delete. I have also specified under Files that I am looking for files containing a "_" and a "." in the file name/extension. Under Retrieve File Name I have selected Name and Extension, as my directory won't change, so I do not need to bring back the fully qualified path. Still in the For Each Loop editor, click on the Variable Mappings pane and select the variable created earlier.
The next step is to double click the precedence constraint (arrow) between the blank script task and the file system task. Change the evaluation operation to Expression, and insert the code below, tweaked to use your variable name and adjusted for any differences in your date format. This script currently works for files with a date suffix of _YYYYMMDD. The full code used:

DATEDIFF("dd",
  (DT_Date) (
    SUBSTRING(@[User::strSourceArchiveFileName], FINDSTRING(@[User::strSourceArchiveFileName], "_", 1) + 1, 4)
    + "-" +
    SUBSTRING(@[User::strSourceArchiveFileName], FINDSTRING(@[User::strSourceArchiveFileName], "_", 1) + 5, 2)
    + "-" +
    SUBSTRING(@[User::strSourceArchiveFileName], FINDSTRING(@[User::strSourceArchiveFileName], "_", 1) + 7, 2)
  ),
  GETDATE()) > 30

What the code is doing: it compares today's date against the date found in the file name (e.g. File1_20161030.txt) and, if the difference between today's date and the date in the file is greater than 30 days, the expression returns true and the file is deleted in the next step. Otherwise the file is left alone. The date is obtained by using the FINDSTRING function to identify the underscore in the file name. SUBSTRING is then used to pick out the various elements of the date: the year starts at the first character after the underscore with a length of 4, the month at the fifth character after the underscore with a length of 2, and the day at the seventh character with a length of 2. These elements are concatenated together, separated by "-", and compared against today's date. After setting up the precedence constraint, all that is left is to set up the File System task, setting the Source Connection to the file connection created earlier and changing the operation to "Delete File".
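The same age check can be expressed compactly in Python. This is my own sketch (not part of the SSIS solution), assuming file names of the form "File1_20161030.txt", i.e. an underscore followed by a YYYYMMDD stamp before the extension:

```python
# Python equivalent of the SSIS expression above (assumes "name_YYYYMMDD.ext").
from datetime import datetime, date

def is_older_than(filename, days, today):
    stamp = filename.rsplit("_", 1)[1].split(".")[0]   # -> "20161030"
    file_date = datetime.strptime(stamp, "%Y%m%d").date()
    return (today - file_date).days > days             # mirrors DATEDIFF(...) > 30

# With today fixed at 2016-12-15, File1_20161030.txt is 46 days old:
is_older_than("File1_20161030.txt", 30, date(2016, 12, 15))   # True  -> delete
is_older_than("File1_20161130.txt", 30, date(2016, 12, 15))   # False -> keep
```

Seeing the logic in a general-purpose language also makes it easier to unit test the 30-day boundary before committing to the SSIS expression.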
As mentioned previously, the solution does require you have a character such as an underscore to split the file name and the date stamp within the filename, and for your date stamps to be delivered in a consistent format, be that YYYYMMDD or any other format. These final two screenshots are to show the before and after results of running the above tasks by executing the SSIS package. Before: After:

Optimise SSIS Fuzzy Lookup Matches

This is a quick post to highlight how some small changes to the SSIS Fuzzy Lookup settings can make a significant difference to the matches that are returned. As I've mentioned before, SSIS fuzzy matching relies on matching substrings within a string (known as q-grams) in order to produce a result. A given input string (and the reference strings) is split up into several different substrings; simply put, the more substrings, or q-grams, the input string has in common with the reference string, the better the match. For example, a word such as 'sandwich' would be split up into 'sand', 'andw' and 'dwic' in order to aid the matching process. I hit an issue recently whereby the substrings being produced were causing false matches – let's start by going through a quick example, loosely based on Adventure Works data. Here is a package that takes data from a data source before using a Fuzzy Lookup: I've narrowed the data source down to just two rows for this example: There are a few more so-called Master Products; these are what we want to match against: By default, if we run the SSIS package then the following results will appear in a data viewer after the fuzzy lookup. Note – the lookup is intentionally returning the top 2 matches to illustrate this point. The Source Product column reflects the actual situation I experienced recently, whereby the brand name (Apollo and RS in this case) appears in the column to match. Although this is made up, it closely follows the situation I experienced whereby the wrong match is produced in both cases – "Apollo X.C. Large" looks very similar to "Apollo XC", yet it's not the top match. The reason this occurs is actually the Token Delimiters setting on the Fuzzy Lookup. By default, the delimiters include both ampersands and full stops, meaning in the case of 'Apollo X.C.
Large' the following tokens/substrings are created by default: Note there is no token created for 'XC', meaning this will not match well against a substring of 'XC'. One option is to remove the full stops and ampersands as delimiters, as shown below in the Advanced tab: This produces the following results in the data viewer: In this case, the correct master product for each row (namely 'Apollo XC' and 'RS Shorts') is returned, so the settings have made a difference.

Conclusion

This shows the effect that the token delimiters can have on the matching process. Unfortunately, what works in one situation may not work in another, so one option may be to try an initial match, followed by a second match for any low-matching rows. Data quality will always have a big impact on matching, so cleaning and enriching data before matching is always a good idea.
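The effect of the delimiter set can be sketched with a simplified tokeniser. This is my own re-implementation of the behaviour described above (the real SSIS tokeniser is internal), showing how "X.C." fragments into single letters while "." is a delimiter, and survives as one token once it is removed:

```python
# Simplified tokeniser illustrating the Token Delimiters behaviour (mine, not SSIS).
import re

def tokenize(text, delimiters=" .&-"):
    """Split text into tokens on any character in the delimiter set."""
    return [t for t in re.split("[" + re.escape(delimiters) + "]+", text.lower()) if t]

tokenize("Apollo X.C. Large")           # ['apollo', 'x', 'c', 'large']
tokenize("Apollo X.C. Large", " -")     # ['apollo', 'x.c.', 'large']
```

With the default delimiters, 'x' and 'c' are separate one-character tokens with little in common with 'xc'; with '.' removed, the whole 'x.c.' survives as a single token that shares far more material with 'xc'.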

New Features in SQL Server 2016 – Part 4: Integration Services

In the current SQL Server 2016 CTP2, there are not many exciting improvements made to SSIS. The hope is that the rumours are true and that the more exciting and beneficial features will come later, probably closer to the actual release. The below have been announced by Microsoft and are available in CTP2:

- AlwaysOn support
- Incremental package deployment
- Project upgrade

The speculation and noises coming out of the SQL community imply two more features will be available soon. As nothing has been officially announced, I cannot offer much more insight right now; look out for an updated blog as and when something is announced.

- Azure Data Factory integration with SSIS
- Enhanced SSIS – designer support for previous SSIS versions and support for Power Query

AlwaysOn Support

This high-availability, disaster recovery solution is used for lots of large-scale databases as an alternative to mirroring. Integrating this into SSIS will provide some great benefits:

1. Easily deploy to a centralised SSIS catalog (e.g. the SSISDB database).
2. SSISDB can now be added to an AlwaysOn Availability Group.
3. When a failover occurs, a secondary node will automatically become the primary.

Incremental Package Deployment

You can now deploy one or more packages to an existing or new project, without the need to deploy the whole project. The incremental process can be actioned from:

1. The Deployment Wizard.
2. SQL Server Management Studio (SSMS).
3. Stored procedures.
4. The Management Object Model (MOM) API.

I have tested the package deployment in CTP2 and it works in exactly the same way as project deployment. The wizard, especially, has the exact same look and feel.
In order to run this in CTP2, you will need to navigate to the standard SQL Server 2016 directory, which is usually:

64-bit: C:\Program Files\Microsoft SQL Server\130\DTS\Binn
32-bit: C:\Program Files (x86)\Microsoft SQL Server\130\DTS\Binn

To run the wizard, simply double click the 'ISDeploymentWizard.exe' executable in the folder or run it from the command line. You can now select 'Package' or 'Project' deployment in the options – see below. There is an option to set password security on individual packages and to individually select which ones you want to deploy at a time. When SQL Server 2016 is officially released, I would imagine the package deployment wizard will be integrated into Visual Studio and SSDT and implemented without the need for the command line or an exe.

Project Upgrade

Not much has changed here, apart from:

1. Project-level connection managers work as normal.
2. Package layout/annotations are always retained.

References

For more information on all of the new SSIS SQL Server 2016 features, the below resources/blogs are highly recommended:

- Official Microsoft page
- Gilbert Quevauvilliers – BI blog
- Manoj Pandey's blog

Optimising SSIS Lookups with HASHBYTES()

I was recently working on a load for a dimension table which worked like the below. The package follows a fairly standard routine: data for the dimension is transformed into the required format using SSIS and then inserted into the target table. On the way, an initial check is made to determine if the record is new ('LKU – Check Exists') and a secondary check determines if the record has changed, which is subsequently used in an update routine. Although the package was loading incrementally, it sometimes had to check several million rows to determine if they had changed. As the table was quite wide, I decided to implement the HASHBYTES() function in order to compare a single hash of the columns rather than compare each column individually. The initial results were quite promising, with the package running in around half the time. However, on a particularly big load I noticed the package failed, as it had run out of memory for the second lookup (full cache was used). I found this odd because the HASHBYTES function only returns a single value, which I anticipated would be smaller than the sum of the columns. On a little deeper investigation, I found that the value returned by HASHBYTES is typed as VARBINARY(8000) by default, so each row in the lookup cache reserves the full 8000 bytes for it. A quick check of the SQL documentation for the HASHBYTES function states that the size of the data returned when using the SHA2-256 algorithm is 32 bytes, meaning that most of this space was being wasted. Therefore, changing the formula from 1 to 2 below significantly increases the efficiency of the lookup and makes the required cache size smaller by a factor of around 250:

1. HASHBYTES('SHA2_256', [ProductName] + '|' + [CategoryName]) AS HashValue
2.
CONVERT(VARBINARY(32), HASHBYTES('SHA2_256', [ProductName] + '|' + [CategoryName])) AS NewHash

Upon doing this, the size of each row was significantly reduced, the package runs faster than before and, most importantly, there are no memory issues :)
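The 32-byte figure can be sanity-checked outside SQL Server with Python's hashlib: SHA-256 always produces a 32-byte digest regardless of input length, which is why VARBINARY(32) is all the lookup cache ever needs for the hash column.

```python
# SHA-256 digests are always 32 bytes, however wide the concatenated row is.
import hashlib

row = "ProductName" + "|" + "CategoryName"   # same concatenation pattern as the SQL
digest = hashlib.sha256(row.encode("utf-8")).digest()
print(len(digest))   # 32

wide_row = "x" * 10000                       # a much wider row hashes to the same size
print(len(hashlib.sha256(wide_row.encode("utf-8")).digest()))   # still 32
```

The same fixed-size property is what makes hash comparison attractive for wide dimension tables in the first place: one 32-byte comparison replaces dozens of column comparisons.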

SSIS Fuzzy Lookup Transformation Tokens

After my recent blog post on the Excel Fuzzy Lookup Add-In vs SSIS, it got me thinking about the capabilities of the SSIS Fuzzy Lookup transformation. Although I'm aware that the Fuzzy Lookup is a token-based solution, I've only ever really tuned it by altering the similarity thresholds on the columns and on the task itself. What you can also do is alter how the transform deals with tokens, so I thought it would be worth a quick post to show how the tokens help produce matches.

Tokens

The Fuzzy Lookup works by splitting strings up into several components, known as tokens. For example, the Adventure Works reseller "Cross-Country Riding Supplies" might be split up into the tokens "Cross", "Country", "Riding" and "Supplies". This is key to the matching process, as the fuzzy algorithms attempt a match based on the commonality between tokens in the input and reference datasets.

SSIS Fuzzy Lookup

The Advanced tab of the fuzzy lookup shows the delimiters used to create the tokens. As you can see, a hyphen is included in the list, meaning words separated by a hyphen get interpreted as separate tokens: On the "Reference Table" tab of the transformation, there is the option to store a new index, which means that SQL Server will create a table in the data source to hold the tokens found in the reference table: In my made-up dbo.Retailers table, I've got two retailers – "Nearest Bike Store" and "Gears+Chain Supply". After running the SSIS package once, the dbo.IndexTest table will be created, which is where the reference tokens get stored in order to help with matching. As it's just stored in SQL Server, we can select from it. Before we do that, I should just say that I'm only passing this fuzzy lookup one input string, which is a misspelt company called "Bikes Chaiin Supply". If I run the package, I get the following matches and scores: This has a fairly low similarity and confidence, so we can probably do better.
Going back to the index, if we select from dbo.IndexTest, we get the following: The index table also contains substrings found in the reference table, known as q-grams, which are designed to assist with matches that contain errors. What’s interesting about these results is that “gears+chain” has been entered on its own, meaning that the word “chain” has not been interpreted as a token in this case. The solution, in this case, is to alter the token delimiters to include “+”: If we re-run the package and then look again at the index table, we now see that the word “chain” has been successfully interpreted as a token. In addition we have a few variants on the word chain, such as “chai” and “hain”: After re-running the matching process with our new delimiters, the data viewer output shows that this has resulted in a better similarity and confidence for the first row: The delimiters work across all columns for the whole fuzzy lookup, so that’s something to test when changing the delimiters – it may not be beneficial in all situations. However the new delimiter has clearly worked very well in this example, so it’s very much worth considering if you’re not quite getting the level of matches that you had hoped for.
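For illustration, the q-grams mentioned above – substrings such as “chai” and “hain” – can be generated like this (a sketch only; the choice of q = 4 and this simple enumeration are assumptions, not the index's actual internals):

```python
# Illustrative q-gram generation: all contiguous substrings of length q,
# which is how variants like "chai" and "hain" can arise from "chain".
# The value q = 4 is an assumption for this example.
def qgrams(token, q=4):
    """Return the q-grams of a token, or the whole token if it is shorter."""
    if len(token) <= q:
        return [token]
    return [token[i:i + q] for i in range(len(token) - q + 1)]

print(qgrams("chain"))
```

Storing substrings like these is what lets a misspelling such as “Chaiin” still share material with “Chain” during matching.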

Excel Fuzzy Lookup Add-In and SSIS

Although it’s been out for a while, I thought I’d mention, for those who don’t know, that you can download a Fuzzy Lookup Add-In for Microsoft Excel. The Add-In allows you to perform a fuzzy lookup of Excel data against another Excel dataset. I won’t go into too much detail, as this has already been covered here, but essentially the Add-In requires you to set up two datasets, namely a “left” and a “right” table. Once you carry out some configuration/mappings, the fuzzy match will return all of the rows in your left table, with the appropriate values looked up from your right table. As an example I’ve chosen to do my lookup based on reseller names and an address: Part of the reason for this post is that I think this Add-In could be useful from a self-service perspective for power users to assist in MDM/matching/DW projects. There’s often a lot of trial and error in trying to get the right balance for matching rules and, given that the Fuzzy Add-In gives similar results to SSIS, it may well be a good place to start with matching a dataset as part of an MDM project, for example. Alternatively, it could be used to carry out pre-matching/clean-up on data before an MDM project starts.

SSIS

I also wanted to contrast this Add-In with the Fuzzy Lookup transformation that you get in SSIS. Both the Fuzzy Lookup Add-In and the SSIS variant do the same thing – they both match one dataset against another in a fuzzy manner, returning you 1-n matches per row. In fact, according to this paper, they use the same Microsoft fuzzy algorithm. With the default settings, at least with my small test datasets, I get similar results in Excel and SSIS, albeit with different similarity numbers. These are the key differences that I’ve noticed: Excel returns only overall row similarity, whereas SSIS also returns similarity per column. Excel exposes an additional set of tuning options.
As an example, you can set the Containment Bias, whereby you decide on the penalty for tokens in the right record that are not in the left record. You can also edit a number of other settings in an XML file. SSIS outputs both the confidence and similarity, whereas Excel outputs only the similarity. Excel will output the similarity XML per row, whereby you can see how it has split up a string into differently weighted tokens. As you might expect, SSIS outperforms Excel. As a quick test, I started working with the Adventure Works resellers to see if I could get the same match results from both products. To highlight this, I’ve filtered the data down to just one made-up reseller (“Bikes World Comp. Store”) and looked at the top 5 resellers each product returns. SSIS, with its default settings, gives the following when matching just on the reseller name: Excel, using the Fuzzy Add-In, gives the following with its default settings: So overall some very good matches out of the box from the Excel Add-In, certainly comparable to SSIS. What’s interesting about this result set is that “Bike World” is ranked higher in Excel than “World Bike Discount Store”. We can align the behaviour of the two products, in this case anyway, by changing the Containment Bias from its default of 0.8 to 0. This will assign a full penalty to tokens in the right record that are not in the left record. So, as the left record is “Bikes World Comp. Store” and the right record is “Bike World”, we are by default being lenient about its missing tokens. Increasing the penalty for missing tokens gives the following, with the top record now matching SSIS: Of course, you would normally match on more than just the name, plus what’s worked here for me might not work in other scenarios.
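One plausible way to model the Containment Bias effect described above is as a discounted penalty on right-only tokens. This formula is purely illustrative – it is not the add-in's actual maths – but it shows the direction of the setting: a bias of 1 ignores right-only tokens entirely, while a bias of 0 penalises them in full.

```python
# Purely illustrative model of the Containment Bias (not the add-in's actual
# maths): right-only tokens attract a penalty scaled by (1 - bias), so the
# default bias of 0.8 is lenient and a bias of 0 applies the full penalty.
def containment_score(left_tokens, right_tokens, bias=0.8):
    matched = left_tokens & right_tokens
    denominator = (len(matched)
                   + len(left_tokens - right_tokens)
                   + (1 - bias) * len(right_tokens - left_tokens))
    return len(matched) / denominator if denominator else 0.0

left = {"bikes", "world", "comp", "store"}   # "Bikes World Comp. Store"
right = {"bike", "world"}                    # "Bike World"
# Lowering the bias can only lower the score for a right record with extra tokens
print(containment_score(left, right, bias=0.8), containment_score(left, right, bias=0.0))
```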
But the bottom line is that the Excel Fuzzy Lookup Add-In gives very good matches out-of-the-box, is easy to use for the Excel user and therefore could well be of some use as part of data matching projects.

Different Methods Of Type 2 Dimension Population

This post is to share some of my findings about the different ways of dealing with Type 2 changes in an SSIS environment. I wanted to provide a comparison of the methods in terms of performance, reusability and ease of maintenance. For the purposes of the testing let’s take the scenario where a client would like to implement historical tracking in their DW environment for one of their dimensions. Let’s say that they have given the following requirements:

1. The ordered online flag should be replaced with a more user friendly text field containing ‘Yes’ or ‘No’ (done in a derived column transformation/SQL CASE statement).
2. The client would like to track any changes to the Purchase Order Number, Customer Account Number or Carrier Tracking Number so that they can preserve the history of these changes. These will be set up as Type 2 changing attributes and we will need to record the time that each change was induced.
3. Any changes to the OrderedOnline field will be the result of corrections made to the source data; historic tracking of these changes will not be required in the DW and all rows should be updated with the new value (a Type 1 change).

I created the following table to hold the data:

CREATE TABLE DimOrder
(
    Order_Key INT IDENTITY(1,1) PRIMARY KEY NOT NULL --Surrogate Key
    ,SalesOrderID INT NOT NULL --Source Key
    ,LineNumber TINYINT NOT NULL --Source Key
    ,OrderNumber NVARCHAR(25) NOT NULL
    ,PurchaseOrderNumber NVARCHAR(25) NOT NULL
    ,AccountNumber NVARCHAR(15) NOT NULL
    ,CarrierTrackingNumber NVARCHAR(25)
    ,OrderedOnline NVARCHAR(3) NOT NULL
    ,RecordValidFromDate DATETIME NOT NULL
    ,RecordExpiredDate DATETIME
)

The three methods I have selected are:

1. The SSIS inbuilt Slowly Changing Dimension (SCD) task.
2. A custom built SSIS package to perform the Type 2 changes.
3. Using a T-SQL procedure to perform the dimension population.
In each of the comparisons I have performed the following tests:

1. An initial load of data (all of the records are new and there are no changes).
2. A load of the same data from Step 1 (no changes, all records exist).
3. A load of data which contains both Type 1 and Type 2 changes.

I used Adventure Works as a source and Steps 1 and 2 both use the same source statement, which looks like this:

SELECT
    OrdHed.SalesOrderID
    ,SalesOrderDetailID
    ,OnlineOrderFlag
    ,SalesOrderNumber
    ,PurchaseOrderNumber
    ,AccountNumber
    ,CarrierTrackingNumber
FROM Sales.SalesOrderHeader OrdHed
INNER JOIN Sales.SalesOrderDetail OrdDet
    ON OrdHed.SalesOrderID = OrdDet.SalesOrderID

For the third test I created some copies of the source tables and then ran some basic statements on them to change the data as below:

SELECT * INTO [Sales].SalesOrderDetailTest FROM [Sales].SalesOrderDetail
SELECT * INTO [Sales].SalesOrderHeaderTest FROM [Sales].SalesOrderHeader

--this will only exhibit Type 1 changes
UPDATE Sales.SalesOrderHeaderTest
SET OnlineOrderFlag = CASE WHEN OnlineOrderFlag = 1 THEN 0 ELSE 1 END
WHERE SalesOrderID BETWEEN 59392 AND 67257

--this will exhibit only a Type 2 change
UPDATE Sales.SalesOrderHeaderTest
SET PurchaseOrderNumber = 'X' + PurchaseOrderNumber
WHERE SalesOrderID < 51525

--this does a mix of both
UPDATE Sales.SalesOrderHeaderTest
SET
    PurchaseOrderNumber = 'X' + PurchaseOrderNumber
    ,OnlineOrderFlag = CASE WHEN OnlineOrderFlag = 1 THEN 0 ELSE 1 END
WHERE SalesOrderID BETWEEN 51524 AND 59391

The data is then sourced from the new tables which provide the changes for testing.

1. SSIS Slowly Changing Dimension Task

Configuration of the package is very straightforward; you add a data source, drag a Slowly Changing Dimension task onto the data flow and double click it to configure how you want it to run. The whole process takes less than 10 minutes to complete and there are many options that should suit a variety of needs.
Once the configuration is completed we end up with a dataflow which looks like this:

2. Custom Built SSIS Package

Next I designed a package in SSIS but without using the SCD wizard. I used the following principles during design to try and improve performance:

1. Changes can be detected in the data flow by using simple lookup transformations (we could consider hash checking to improve performance if there were very many columns) to see if there is any change to the data; by using several lookups we can determine if the change is Type 1, Type 2 or both.
2. Updates in the data flow are performed in SSIS using the ‘OLE DB Command’ task. As this works on a row-by-row basis it has a large impact on performance. To avoid this I created tables to hold the Type 1 and Type 2 changes and then used an ‘Execute SQL Task’ in the control flow to make the updates in one go.

The completed dataflow looks like this: New records go straight into the target tables and the ones which have changes are split using a multicast; where a change has occurred the records are inserted into holding tables and then processed in the control flow.
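The change-detection rule behind the custom package can be sketched in plain code (an illustration, not the package itself; the column lists mirror the example dimension from the requirements):

```python
# Illustrative change-detection rule for the custom package: compare the
# incoming row with the current dimension row, column by column. A change in
# a Type 2 column expires the old row and inserts a new one; a change only in
# a Type 1 column updates in place.
TYPE1_COLUMNS = ("OrderedOnline",)
TYPE2_COLUMNS = ("PurchaseOrderNumber", "AccountNumber", "CarrierTrackingNumber")

def classify_change(current, incoming):
    """Return 'none', 'type1', 'type2' or 'both' for a matched row pair."""
    type1 = any(current[c] != incoming[c] for c in TYPE1_COLUMNS)
    type2 = any(current[c] != incoming[c] for c in TYPE2_COLUMNS)
    if type1 and type2:
        return "both"
    if type2:
        return "type2"   # expire the old row, insert a new one
    if type1:
        return "type1"   # update in place
    return "none"
```

In the package this classification is what the chain of lookups achieves, with the multicast then routing rows into the Type 1 and Type 2 holding tables.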
The control flow looks like this:

The control flow executes the statements below:

--Make Type 1 changes
UPDATE TGT
SET TGT.OrderedOnline = SRC.OrderedOnline
FROM DimOrder TGT
INNER JOIN DimOrderType1 SRC
    ON TGT.SalesOrderID = SRC.SalesOrderID
    AND TGT.LineNumber = SRC.LineNumber;

--Type 2 expire old rows
UPDATE TGT
SET TGT.RecordExpiredDate = SRC.TransactionDate
FROM DimOrder TGT
INNER JOIN DimOrderType2 SRC
    ON TGT.SalesOrderID = SRC.SalesOrderID
    AND TGT.LineNumber = SRC.LineNumber;

--Type 2 insert new records
INSERT INTO DimOrder (SalesOrderID, LineNumber, OrderNumber, PurchaseOrderNumber, AccountNumber, CarrierTrackingNumber, OrderedOnline, RecordValidFromDate)
SELECT
    SalesOrderID
    ,LineNumber
    ,OrderNumber
    ,PurchaseOrderNumber
    ,AccountNumber
    ,CarrierTrackingNumber
    ,OrderedOnline
    ,TransactionDate AS RecordValidFromDate
FROM DimOrderType2;

3. T-SQL Script

I also thought I would try a basic SQL script to see how this performed.
CREATE PROC popDimOrder
AS
BEGIN
    CREATE TABLE #OrderData
    (
        SalesOrderID INT NOT NULL
        ,LineNumber INT NOT NULL
        ,OrderNumber NVARCHAR(25) NOT NULL
        ,PurchaseOrderNumber NVARCHAR(25) NOT NULL
        ,AccountNumber NVARCHAR(15) NOT NULL
        ,CarrierTrackingNumber NVARCHAR(25) NOT NULL
        ,OrderedOnline NVARCHAR(3) NOT NULL
        ,RecordExists SMALLINT NOT NULL DEFAULT(0)
    )

    DECLARE @TransactionDate DATETIME
    SELECT @TransactionDate = GETDATE()

    INSERT INTO #OrderData (SalesOrderID, LineNumber, OrderNumber, PurchaseOrderNumber, AccountNumber, CarrierTrackingNumber, OrderedOnline, RecordExists)
    SELECT
        OrdHed.SalesOrderID
        ,SalesOrderDetailID AS LineNumber
        ,SalesOrderNumber AS OrderNumber
        ,ISNULL(PurchaseOrderNumber, '') AS PurchaseOrderNumber
        ,AccountNumber
        ,ISNULL(CarrierTrackingNumber, '') AS CarrierTrackingNumber
        ,CASE OnlineOrderFlag
            WHEN 0 THEN 'No'
            WHEN 1 THEN 'Yes'
        END AS OrderedOnline
        ,0 AS RecordExists
    FROM [AdventureWorks2012].Sales.SalesOrderHeaderTest OrdHed
    INNER JOIN [AdventureWorks2012].Sales.SalesOrderDetailTest OrdDet
        ON OrdHed.SalesOrderID = OrdDet.SalesOrderID

    --Check if records exist
    UPDATE SRC
    SET RecordExists = 1
    FROM #OrderData SRC
    INNER JOIN POC.dbo.DimOrder TGT
        ON TGT.SalesOrderID = SRC.SalesOrderID
        AND TGT.LineNumber = SRC.LineNumber

    --Insert new records
    INSERT INTO DimOrder (SalesOrderID, LineNumber, OrderNumber, PurchaseOrderNumber, AccountNumber, CarrierTrackingNumber, OrderedOnline, RecordValidFromDate, RecordExpiredDate)
    SELECT
        SalesOrderID
        ,LineNumber
        ,OrderNumber
        ...

Strip Lines In SSRS

Once in a while you happen upon a really useful feature in SSRS that you were unaware of. For me strip lines are definitely one of those features that you don’t see that much mention of but which can add valuable insight to a report. Below is a simple example of a scatter chart with a horizontal and a vertical strip line, each marking the average value of its axis. In order to add strip lines to the report you need to do the following:

1. Select the chart axis to which you wish to add the strip line and go to the Properties window. In this there is an option marked ‘StripLines’. When you click in the cell you can open up the Strip Line Collection editor.
2. Click the ‘Add’ button in the strip line editor (note: you can add multiple strip lines to each axis if required).
3. In the option marked ‘IntervalOffset’ specify the required value for the strip line; you can either specify a fixed value or use an expression from your dataset. If you require the line to repeat over fixed values you can fill in the properties for ‘Interval’.
4. You then need to set the display properties for the strip lines under Appearance in order for them to be visible. To produce the example strip lines in this post I have set the ‘BorderColour’ to black and the ‘BorderStyle’ to dashed.
5. If you wish to add text after the strip line on the graph – for example to illustrate that values after the strip line are above average as shown below – then the settings are under the Title section.

That’s it for now, hope that you’ve found this helpful :)

Matching with Master Data Services, DQS and SSIS

If you have an MDM requirement to consolidate multiple sources of master data into a single golden record, then you have a few different ways to achieve this on the Microsoft platform. This blog post gives an overview of the ways that various matching methods can be used in conjunction with Master Data Services for consolidation in Master Data Management, outlining the pros and cons of each option. In summary, the options are:

- Data Quality Services (DQS)
- SQL Server Integration Services (SSIS)
- Master Data Services itself, which has a few (not that well known) matching capabilities

Data Quality Services

A key requirement in all but the simplest MDM solutions is that the matching/consolidation must be carried out in an automated manner, with a Data Steward alerted to take action if needed (e.g. the matching engine incorrectly matches two customers, so the Data Steward opens MDS and corrects this). This scenario would be hard to achieve with DQS, as it’s possible to automate the DQS cleaning, but it’s not possible to automate the DQS matching. This is something that I’ve raised Connect issues about, here and here. If your data to be matched into MDS is not coming in on a daily basis, and you therefore need to do more ad-hoc matching to produce your golden master data records, then DQS could be for you. The MDS Excel Add-in will give you the capability of matching data that you import into Excel with MDS members, harnessing the power of DQS. An overview of how this works is available here.

Integration Services

SSIS has been around for a long time now and, as many of you will know, contains fuzzy matching components. With the right MDS model design, it’s possible to carry out a batch-based fuzzy match between your master records and end up with a mapping between the records that exist in your source systems and your MDS golden records.
The rough conceptual steps to do this are:

1. Load the new and changed records from the source systems into a staging area.
2. Clean and standardise your data. This is actually something that DQS cleaning can help with.
3. Query your staging area to get the new records that you want to insert/update into Master Data Services.
4. Now the question arises: do we have an exact or close match for these records already in MDS? While the exact matches are easy to deal with, use the SSIS Fuzzy Lookup component to establish whether there are any approximate matches.
5. Link the source records to master records (if the match is high enough) using MDS domain attributes.
6. Carry out appropriate inserts and updates into MDS using the MDS staging tables.
7. Ensure that a Data Steward is alerted in some way if necessary (e.g. if the match threshold is below x% confidence). This can be done with email or MDS notifications, for example.

This process can run in batch overnight, with the Data Steward approving or rejecting the matches that SSIS has carried out the following morning. Whilst the above over-simplifies the process and technical work required, hopefully it makes sense at a high level.

Master Data Services

Although you cannot feed MDS your source data and get it to carry out matching for you automatically, it does actually contain the raw components to do this. By this I mean the MDS database contains an assembly called Microsoft.MasterDataServices.DataQuality, which gives you a number of fuzzy matching T-SQL functions. These are called from the MDS front end when you carry out some filtering when viewing entities. Using them just for filtering in the front end really isn’t using the functions to their true capability, but thankfully you can use these functions in your own code. You can use the MDS T-SQL functions in a similar way to the conceptual SSIS method outlined above, in order to match and eliminate duplicates.
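The alerting rule in the SSIS process described above – auto-accept strong matches, queue weak ones for the Data Steward – can be sketched as follows (the 85% threshold and the record shape are assumptions for illustration):

```python
# Sketch of the alerting rule: auto-link source records whose best fuzzy
# match clears a confidence threshold and queue the rest for the Data
# Steward. The 0.85 threshold and the (source_key, mds_code, confidence)
# record shape are illustrative assumptions.
def route_matches(candidates, threshold=0.85):
    auto_link, for_review = [], []
    for source_key, mds_code, confidence in candidates:
        target = auto_link if confidence >= threshold else for_review
        target.append((source_key, mds_code, confidence))
    return auto_link, for_review
```

In the batch pattern, the auto-linked set feeds the MDS staging tables directly, while the review set drives the email or MDS notification to the Data Steward.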
In addition, the MDS web API can also be used to carry out a fuzzy match, as mentioned in this forum post. Retrieving match candidates using a web service may be an attractive option if you’re trying to do real-time MDM.

Conclusion

Essentially, until it’s possible to automate DQS matching, we have a choice between SSIS and the MDS matching functions. The following e-book gives a very detailed overview of the matches that both are capable of doing. The MDS T-SQL functions are more flexible than the SSIS fuzzy components, as you can choose which fuzzy algorithm you want to use, whereas the SSIS components let you choose between Fuzzy Grouping and Fuzzy Lookup out of the box, without having to write SQL. Although I tend to find that both give very good matching results, the MDS T-SQL functions produce slightly better matches in my experience, plus give you the option of trying different algorithms to suit different scenarios. It’s also worth mentioning that Profisee Maestro (full disclosure, we are a partner) integrates with MDS, offering its own matching algorithms. Maestro also has a front end with the ability to assist with survivorship and approval, which I think is a useful addition to MDS. Speaking of survivorship and approval, there are two options in MDS out of the box. The new Master Data Manager web front end is much improved, but the MDS Excel Add-In potentially allows a bit more flexibility for survivorship carried out by a Data Steward, just due to its natural ability for filling/copying/pasting/filtering. So overall, due to the various improvements, Master Data Services is now capable of tackling more complex MDM scenarios than in the 2008 R2 version.

DQS Matching Vs SSIS Fuzzy Grouping and Lookup

When the term fuzzy matching comes up in the Microsoft world, it’s natural for anyone who’s used the BI stack to think of SQL Server Integration Services (SSIS), due to the fact that it has both Fuzzy Grouping and Fuzzy Lookup components. Therefore, when using the matching in Data Quality Services (DQS) the other day, I thought it might be worth contrasting the matching capabilities of the two products.

Overview

The SSIS Fuzzy Grouping transformation is typically used on a set of data containing duplicates, as it will operate across all of the rows, grouping similar rows together based on fuzzy comparisons, with a view to eliminating duplicates. In contrast, the Fuzzy Lookup transformation takes a value in the SSIS pipeline and uses fuzzy matching to match the input value against a set of clean reference data in a database. The matching in DQS takes a set of data and groups the data into clusters, based on a number of rules and other factors, again with a view to removing duplicates. It’s therefore only directly comparable against the SSIS Fuzzy Grouping, rather than the Fuzzy Lookup.

Test Scenario

For the purposes of this blog post, I’ve got a sample query from AdventureWorksDW, taking data from the DimCustomer table. I’ve unioned all that data with a few sample records that contain typos. E.g. I’ve got ‘Pamela Fernndez’ misspelt for one customer and ‘4610 Pinto Rd’ instead of ‘Road’ for another. I’m going to attempt to de-duplicate the data in SSIS first and then do the same thing in DQS to see if there is any difference.

SSIS Fuzzy Grouping

As shown below, I’ve just got a very simple OLE DB source passing some data into a Fuzzy Grouping, then I’m going to examine the data via a data viewer: The columns that I’m passing into the Fuzzy Grouping for the customers are Full Name, Address Line 1, Gender and City, just for the purposes of this blog post.
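The grouping behaviour described in the overview can be sketched as a simple clustering pass. This is illustrative only – the real transform uses Microsoft's token and q-gram based similarity rather than the stdlib difflib used here:

```python
# Illustrative sketch of fuzzy grouping (not Microsoft's algorithm): each row
# is compared with the canonical (first) row of existing clusters and joins
# the first cluster whose similarity clears the threshold, otherwise it
# starts a new cluster of its own.
import difflib

def similarity(a, b):
    """Crude character-level similarity, standing in for the fuzzy score."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_group(rows, threshold=0.8):
    clusters = []  # each cluster is a list; its first element is the canonical row
    for row in rows:
        for cluster in clusters:
            if similarity(row, cluster[0]) >= threshold:
                cluster.append(row)
                break
        else:
            clusters.append([row])
    return clusters

print(fuzzy_group(["Pamela Fernandez", "Pamela Fernndez", "John Smith"]))
```

The threshold plays the same role as the transform's similarity threshold: drop it and more rows collapse into the same cluster, raise it and only near-identical rows group together.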
I’ve configured the Fuzzy Grouping transform to carry out fuzzy matching on Address Line 1 and Full Name, as these are (intentionally) the most inconsistent across my made-up data. I’ve also dropped the threshold right down to 0.40 to see the full range of matches that we may or may not get: I’ve filtered the data for customers with a surname of Fernandez or a first name of Pamela, which includes two made-up records. When I run the package, the data viewer produces the following results: The _score column has a high score of 0.8245935, which is based on the similarity scores of FullName and AddressLine1. There are 3 records included in the cluster, including someone called ‘P Fernandez’ who is male. Although that’s been matched, it’s only because I decreased the threshold, plus it can easily be excluded by adding an exact match on Gender, as shown below:

DQS Matching

So overall SSIS copes well with the test scenario – how does DQS compare? After creating a knowledge base, creating domains and carrying out knowledge discovery, it’s possible to use DQS to create a Matching Policy. The Matching Policy involves building a number of rules that determine how the data matches. The starting point in creating a matching rule is determining which domains you want to match on and whether they should be matched using the fuzzy algorithms (similar) or matched exactly: So not too dissimilar to SSIS. You then give each domain a percentage weight, which must add up to 100%. For example, if you give Full Name a weight of 30% and, when run, its match score is 61% accuracy, and you have Address Line 1 at a 70% weight with a 45% accuracy, then the overall match score for the row will be (0.3 * 0.61) + (0.7 * 0.45) = 49.8%. Starting off with a 50% weight on each of the two domains, I get the following results when operating on the same data: As with SSIS, it hasn’t got it quite right, but using the Gender domain will help.
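The weighted scoring described above can be written out as a tiny calculation (a sketch; the domain names are just those from the example):

```python
# The DQS-style weighted match score: each domain's similarity is multiplied
# by its weight and the results are summed. Weights must total 100%.
def weighted_match_score(scores, weights):
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must total 100%"
    return sum(weights[domain] * scores[domain] for domain in weights)

# The example from the text: Full Name at 30% weight scoring 61%,
# Address Line 1 at 70% weight scoring 45%
score = weighted_match_score({"FullName": 0.61, "AddressLine1": 0.45},
                             {"FullName": 0.30, "AddressLine1": 0.70})
print(round(score, 3))  # -> 0.498
```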
What I quite like in DQS is that I can flag a domain as being a pre-requisite: Flagging a domain as a pre-requisite means that the domain will be used to eliminate matches, but will not contribute to the matching score. If I wanted it to contribute to the matching score I would just have to uncheck pre-requisite and give it a weight. Running the matching again gives the following results: So the duplicate is correctly mapped with a score of 75%. Drilling down into how the match was achieved is quite nice in DQS also: This brings me to an interesting observation about the matching. I originally had my dummy record with a value of “3 Pierr Demoulainn” for the address. So in addition to having a typo in the second name and missing the word “Rue”, my other test also missed the second e out of the word “Pierre”. If I run my test data with these values, I get no matches at all. The lack of the second e in “Pierr” seems to have made all the difference. I can get it to match if I give a lot of weighting to Full Name; if I do this, we see the following results: Address Line 1 has been given a score of 0%. Making the same change to my test data and then running it in SSIS gives different results. The similarity for the address drops from 0.81 to 0.75, but it certainly doesn’t drop to zero. Although it will depend massively on your own data, the matching in DQS seems a bit more stringent. This can be mitigated by using more domains for the matching, rather than the simple 2 or 3 domains that I’ve used. To back that statement up, using a composite domain on both Address Line 1 and City does yield a strong match (71%) for the same data.

Conclusion

The rule-based matching in DQS gives a lot of flexibility in how to weight matching across different domains. Coupled with the knowledge base components that interact with the matching, such as cleaning and Term-Based Relations, DQS has more features specifically aimed at matching than SSIS.
It all depends on what sort of data you’re working with, but in theory, as you maintain the knowledge base over time, DQS should give you strong matches based on the knowledge you’ve given it. However, there are some drawbacks – it’s a shame that the matching can’t be automated (please vote for my connect suggestion if you agree). SSIS by its very nature can be automated, meaning the matching in SSIS will be more suitable for a lot of implementations. Secondly, I’d like to be able to match external records against correct values already in the knowledge base, rather than just doing a fuzzy grouping on the external records. I’m a bit surprised DQS can’t do a fuzzy lookup, maybe this will change in V2. Finally, as I’ve shown above, some of the matching results are a little unexpected, whereas in SSIS they were good pretty much straight away. To be fair to DQS this could probably be mitigated with a better choice of matching fields and also running cleaning against a fully populated knowledge base beforehand.

A Pattern To Load Data to Master Data Services via SSIS–Part 2

Introduction

This is the second part of a series of blog posts intended to show a way to load data into MDS via SSIS. In part 1 we loaded the staging tables with new members and attributes for several entities. In this blog post we are going to extend the SSIS package with tasks to move the data from the staging tables into the MDS Product model and validate these newly inserted members.

Completing The Solution

We need to move the data from the staging tables into the model. This is carried out by executing the MDS staging sweep process. To achieve this we need to add an Execute SQL Task to the control flow of our package. Rename the task – I’ve called mine ‘SQL – Sweep Stage’ – and connect it up to the ‘DFL – Load Staging Tables’ task with a success constraint. On the General tab set the connection to MasterDataServices and the SQL Statement as follows:

DECLARE @ModelName NVARCHAR(50) = ?
DECLARE @UserName NVARCHAR(50) = ?
DECLARE @User_ID INT
DECLARE @Version_ID INT

SET @User_ID =
(
    SELECT ID
    FROM mdm.tblUser u
    WHERE u.UserName = @UserName
)

SET @Version_ID =
(
    SELECT MAX(ID)
    FROM mdm.viw_SYSTEM_SCHEMA_VERSION
    WHERE Model_Name = @ModelName
)

EXECUTE mdm.udpStagingSweep @User_ID, @Version_ID, 1

Then add the parameter mapping as shown below: That’s all there is to do to get our data into the model. However this process is asynchronous, and before we can validate the model we need to know when the staging sweep has finished. Add a For Loop Container task to the control flow of the package and connect it up to the ‘SQL – Sweep Stage’ task with a success constraint. Rename the task – I’ve called mine ‘FLC – Wait Until Batch Completes’. Add an Execute SQL Task inside the loop container task and rename it. Mine is called ‘SQL – Get Staging Batch Status’.
Change the connection to MasterDataServices, change the ResultSet property to ‘Single row’ and then add the following SQL script to the SQLStatement property:

DECLARE @Version_ID INT
SET @Version_ID =
(
    SELECT MAX(ID)
    FROM mdm.viw_SYSTEM_SCHEMA_VERSION
    WHERE Model_Name = ?
)

SELECT TOP 1 Status_ID
FROM mdm.tblStgBatch
WHERE Version_ID = @Version_ID
ORDER BY ID DESC

Add the parameter mapping as shown below: And the result set as shown below: Add three more variables to the package as shown in the table below: Next configure the For Loop properties as shown in the table below: The InitExpression value sets @dtMDSLoopTimer to the current time plus the interval set in our @intMDSTimeout variable. The EvalExpression keeps the loop running while @strMDSBatchStatus is not equal to 2 (Success) and the timeout has not expired. The For Loop Container can only succeed if the staging batch is successfully loaded. Now we can validate the model, so add an Execute SQL Task to the control flow and connect it to the ‘FLC – Wait Until Batch Completes’ task with a success constraint. Rename the task – mine is called ‘SQL – Validate Model’. Change the connection to MasterDataServices and the SQLStatement as follows:

DECLARE @ModelName NVARCHAR(50) = ?
DECLARE @UserName NVARCHAR(50) = ?
DECLARE @User_ID INT
DECLARE @Version_ID INT
DECLARE @Model_ID INT

SET @User_ID =
(
    SELECT ID
    FROM mdm.tblUser u
    WHERE u.UserName = @UserName
)

SET @Version_ID =
(
    SELECT MAX(ID)
    FROM mdm.viw_SYSTEM_SCHEMA_VERSION
    WHERE Model_Name = @ModelName
)

SET @Model_ID =
(
    SELECT Model_ID
    FROM mdm.viw_SYSTEM_SCHEMA_VERSION
    WHERE Model_Name = @ModelName
)

EXECUTE mdm.udpValidateModel @User_ID, @Model_ID, @Version_ID, 1

Set the parameter mapping as follows: Almost done. Just to finish it off, let’s add a Script Task to the control flow and connect it to the ‘FLC – Wait Until Batch Completes’ task with a completion constraint.
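Stepping back, the wait logic the For Loop Container implements – poll the batch status until it succeeds or a timeout expires – can be sketched in plain code (a hypothetical illustration only; the function names, the injected clock, and the 5-second interval are assumptions):

```python
# Sketch of the For Loop Container's polling logic: keep checking the staging
# batch status until it reports success (status 2) or a timeout passes.
# poll() stands in for the 'SQL - Get Staging Batch Status' query; the clock
# and sleep functions are injectable so the logic can be tested.
import time

def wait_for_batch(poll, timeout_seconds=300, interval=5,
                   clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        if poll() == 2:   # 2 = batch completed successfully
            return True
        sleep(interval)
    return False          # timed out -> raise an error, as the Script Task does
```

A False return corresponds to the completion-constraint path in the package, where the Script Task fires the MDS timeout error.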
Change the Evaluation Operation of the constraint to ‘Expression and Constraint’ and set the Expression to ‘@strMDSBatchStatus != 2’. Edit the script and add the following line of code under // TODO: Add your code here:

Dts.Events.FireError(0, "SCR - Fire Error", "MDS Timeout Occurred", string.Empty, 0);

This task will fire an error event if the MDS staging batch does not complete successfully. The finished package control flow should look similar to the following image: Execute this package and then check the Product entity in MDS. It should look something like the following: Looking at the other entities you will see that we have added members to three entities and validated all these new members.

Summary

Over the last two blog posts I have shown a way of automating the loading of data to Master Data Services via SSIS. This pattern can be used to cater for most of your loading requirements. That’s it – the completed Integration Services project source code and MDS model can be downloaded from here (you will need to create a login first). Your comments are very welcome.