Adatis BI Blogs

Optimising Lookups with Caching

One of the most heavily used transformations in SQL Server Integration Services (SSIS), if not the most used, must be the Lookup transform. Pretty much every ETL routine will make use of the component at some point, especially at the time of loading fact tables. Given the nature of fact and dimension tables, these lookups can be tasked with loading large amounts of data into memory to perform high-speed lookups. There are occasions when a few optimisations on these lookups can greatly improve the efficiency of your ETL.

Choosing the Best Cache Option

One of the most important options featured on the Lookup transform is the Cache Mode. By understanding the differences between these choices, you can be sure to pick the option that best suits your ETL needs.

Using the Full Cache Option (Default)

The Full Cache option tells SSIS to load all of your reference data into memory and then perform its lookups on that data. This has the benefit of only hitting the database once to fetch the data, but it can delay package start-up as all the lookup data has to be loaded prior to executing the package. You can optimise this by using a SQL command to fetch only the columns you need, but as you can imagine, a dimension with more than 1 million rows will take some time to cache, and this is before you have even done any lookups! Be aware that if you cannot fit your entire reference set into memory your package will FAIL! SSIS cannot spool data out to disk if Full Cache is chosen, and there is no graceful way of handling this error either. Luckily there is a great blog on how to calculate the cache size here.

Use this option when:
- You want to avoid excessive traffic to your database or it is remote
- You have a small reference table that can be cached quickly
- You can fit your whole lookup dataset into memory

Using the Partial Cache Option

The Partial Cache is a good halfway house. This option tells SSIS to fetch the data at runtime, meaning a quicker start-up time but also a lot more queries to the database. Any value not currently in the cache will be queried for in the database and then stored in the cache, making it available for future lookups. The efficiency here comes from not having to charge the cache on package start-up, and then also from having any previously fetched items stored in memory. One very specific benefit the partial cache has is that if your lookup table is being written to during the data flow, a full cache wouldn't see those rows, as it only queries the database once, prior to execution. The partial cache works perfectly here as it will see all the rows in the table at runtime. Additional performance can be gained by making use of the Miss Cache. This option (found on the Advanced page) allows a portion of the cache to be reserved for rows that don't have a match in the database. SSIS will then know not to look for these rows again, thereby saving any unnecessary queries.

Use this option when:
- You have a small reference set not worth charging a full cache for
- You want to optimise your cache for the more frequently accessed rows
- Your reference table is being updated within your data flow

Using the No Cache Option

Finally, No Cache literally means that no data will be cached. This means that every single row will be looked up against the database, even if that row has already been looked up before.

Use this option when:
- You only need to process a small number of rows
- You have severe memory limitations
- For some reason, the partial cache wasn't a better option
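Before moving on, a rough back-of-the-envelope check can help you decide whether a full cache will actually fit in memory: multiply the row count by the width of the columns you are caching. The query below is only a sketch – the table name and the byte estimates are assumptions for illustration, and the blog linked above gives a proper formula:

SELECT  COUNT(*) AS CachedRows,
        COUNT(*) * (4 + 50) / 1048576.0 AS ApproxCacheMB   -- assumes a 4-byte surrogate key plus ~50 bytes for the cached business key
FROM    dbo.DimCustomer;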
The Cascading Lookup Pattern

A nice way of optimising your lookups is to combine a full and a partial cache, creating what is known as a cascading lookup. The idea here is to use the full cache to store a subset of your reference data that is frequently looked up, and then a partial cache for the rest. This works because the full cache shouldn't take too long to charge, as it's only a subset of the data, but will still have the benefits of being in memory. The partial cache will be used for the larger, less frequently looked-up data, as it has no charging time but can still make use of the cache to avoid querying the database twice for the same lookup value. An added benefit is that if the reference table is written to during the data flow, these rows will be caught by the partial cache lookup. In this example, we have a large customer dimension, but we know that only 500 of those customers make a lot of purchases, so we can use a cascading lookup to optimise this data flow.

Firstly, choose the Full Cache option, then use a SELECT statement to get only the data that is most looked up against – in this case, the 500 most loyal customers. Remember to redirect non-matching rows to the non-matching output! Now choose the Partial Cache option for the second lookup and use a SELECT statement to get the rest of the dataset, excluding the loyal customers (a sketch of the two queries is included at the end of this post). Finally, union the rows back together so that you have a single result set.

The Cache Connection Manager

Another option that is available is to use the Cache Connection Manager. This is akin to having a persisted full cache that can be referenced several times throughout your package. Better yet, you can convert a package cache connection manager to a project cache connection manager so that it can be referenced by other packages in the project! This offers multiple benefits, including:
- Charging your lookup caches in parallel to speed up package start-up
- Caches can be loaded at any given time (although it must be before the lookups need them), so if you need to load the dimension during your ETL, you can
- Caches only need to be charged once for your whole ETL project, meaning you can avoid similar lookups querying the same data twice

The way to implement a Cache Connection Manager is seriously easy! To start, create a data flow task and drag a data source onto the design surface. Below that, add a Cache Transform and connect the two up. Once you have configured your data source (remembering to select only the columns you absolutely need!), open the editor for the Cache Transform. Where prompted for a cache connection manager, select New. On the General page, you can add a name and description. On the Columns page, you will need to do some configuration. For the column or columns you want to look up on, you should add an Index Position. The lowest index number greater than 0 will be looked up first, followed by any subsequent columns with an Index Position greater than 0. Any columns with a 0 index are columns that you want to pull into the data flow – in this case, that is the CustomerKey. You can also set data types here. Once that's complete you can map the columns in the Cache Transform. Finally, you will now see a new connection manager in your Connection Managers window. Right click on the new cache connection manager and click Convert to Project.
An implementation of a cache connection manager might have a parent package like the below. Within each "Populate XXXXXX Cache" task is a data flow that extracts source data and puts it into a Cache Transform. The fact table load packages below that can now reference all the dimensional data using the cache connection managers. In conclusion, there is a lot of optimisation that can be done with regards to lookups and their caching options. Not all of these need to be implemented all of the time – it is more a case-by-case decision – but knowing that the options are there is half the battle.
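As promised above, here is a rough sketch of the two source queries a cascading lookup might use for the loyal-customer scenario. The table and column names (including PurchaseCount) are assumptions for illustration only:

-- Full cache lookup: only the 500 most frequently looked-up customers
SELECT TOP (500) CustomerKey, CustomerAlternateKey
FROM dbo.DimCustomer
ORDER BY PurchaseCount DESC;

-- Partial cache lookup: everyone else
SELECT CustomerKey, CustomerAlternateKey
FROM dbo.DimCustomer
WHERE CustomerKey NOT IN (SELECT TOP (500) CustomerKey
                          FROM dbo.DimCustomer
                          ORDER BY PurchaseCount DESC);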

Extracting Data from Salesforce using SSIS

This blog looks at a couple of methods of extracting data from Salesforce and gives a demo of how to output tables and queries into flat files.

Salesforce is HUGE! How do I find what I'm looking for?

Thankfully SFDC makes it easy for us to find the tables that are custom objects, as they are suffixed with "__c" (e.g. "Assessment_Questions__c"). This convention carries through to column level as well, meaning that when a Salesforce standard table is used, such as SFDC's Event table, any custom columns are also suffixed with "__c".

Can I use SQL?

Salesforce utilises a modified version of SQL called SOQL (Salesforce Object Query Language). There are some keyword similarities, such as starting a query with "SELECT" and calling a table with "FROM", but if you want to limit a query result to a fixed number of rows you need to add "LIMIT n" to the end of the query string, not the start! Additionally, SFDC provide some handy functions to make SOQL slightly easier on us developers. I have listed a few that I used below, but full documentation is available here - https://help.salesforce.com/articleView?id=customize_functions.htm&type=0

- LAST_YEAR – provides a date range from Jan 1st of the previous year to Dec 31st of the previous year. Example: SELECT AccountId FROM Customers WHERE CreatedDate > LAST_YEAR
- LAST_N_MONTHS:n – provides a date range from the last day of the previous month, continuing back for the past n months. Example: SELECT Id FROM Account WHERE CreatedDate > LAST_N_MONTHS:12
- BLANKVALUE – similar to ISNULL; will provide a substitute value if the field is null. Example: SELECT BLANKVALUE(AddressLine2,"N/A") FROM…

How do I connect to Salesforce?

In order to access the Salesforce system, you will need a username, password and security token. From my research there are two viable methods for connecting to Salesforce using SSIS: an API component or a linked server. Both have their pros and cons, but generally I have found the API route to be more efficient. There are a number of 3rd party providers around and I have listed a few of these below.

- CData (API) – http://www.cdata.com/drivers/salesforce/ssis/
- Task Factory (API) – forces you to download all 50 components. Can use SSIS variables and SOQL. http://pragmaticworks.com/Products/Task-Factory/Free-Trial
- Kingsway Soft (API) – simple interface and can use SSIS variables. Auto-generates SOQL. Can't edit the error output from the main component. http://www.kingswaysoft.com/products/ssis-integration-toolkit-for-salesforce/download
- DBAmp (Linked Server) – allows you to use an OLE DB connection manager. http://forceamp.com/trial.htm

You can use the steps below to access SFDC and start retrieving data.

1. Download an SFDC SSIS component. For this demo I have used the Kingsway Soft component available from the link above.
2. Run through the install wizard and, once complete, open Visual Studio.
3. Create a new SSIS project and add a Data Flow Task. Rename this to Kingsway Soft Demo.
4. Go into the data flow, right click in the toolbox and hit "Refresh Toolbox". You should now see 2 new components called "Salesforce Destination" and "Salesforce Source".
5. Drag a Salesforce Source onto the design surface and double click to edit the properties. Drop down "Connection Manager" and select "New". Fill out the details required using the connection info below.

   Username: mih@adatis.co.uk
   Password: S4lesforce
   Security Token: oKar72dhKcBAWlq0i4M0RF7ua
6. Test the connection, and click "OK" once it has connected successfully. From the component properties screen, use the Source Type to select "Object Query" and add your SOQL statement into the text editor below it. Click "OK".

   TIP!! – If you select "Object" from the Source Type drop down and choose your table, you can then switch the source type back to "Object Query" and the full SOQL statement to retrieve all columns will be generated automatically.

7. Add a flat file destination and configure the connection manager. Enable the data viewer between your SFDC component and the flat file destination and preview the data that is being extracted from SFDC.

8. From here you can then repeat the process to build up a package that will bring down all the files that you might need for your solution.

There are additional tools, such as Salesforce Workbench, that will allow developers greater access to the underlying solution, but these will be covered in a different blog. If this solution does or doesn't work for you then let me know!

Microsoft BI – Coming Soon to SQL Server 2016

At the PASS Summit 2016 a couple of weeks ago, I attended a number of sessions that provided an insight into the direction Microsoft are heading with BI, and I thought I'd share this with the community. Looking back to October 2015, the official Reporting Roadmap blog from Microsoft stated their intent to dramatically improve the visualisations, integration and harmonisation of both the on-prem (SSRS) and cloud-based services (Power BI). Whilst reporting appeared to be a renewed focus, they are constantly driving other areas of BI, such as Analysis Services and database/data warehousing development in the cloud.

Now, for the interesting bit. Here is what we can expect in the SQL Server 2016 BI stack in the near future:

- SQL Server 2018 CTP (to be released within the next month).
  - This seems very early, considering 2016 has only just been released! Let's wait for an official Microsoft announcement.
- Azure DW
  - Auto Create Statistics.
    - Currently statistics have to be created and updated manually.
  - Replicated Distributed Tables.
    - One that will excite any cloud-based SQL Server developer.
    - This will reduce data transfer between distributions/nodes and consequently improve performance.
- Azure Data Lake
  - General Availability (GA) is imminent.
  - Future features:
    - PolyBase, so that you can connect the two big MPP platforms (ADL and DW).
    - SSIS connectors (released with GA) for the Store.
    - Python and R in U-SQL.
- SSIS
  - Lots of new connectors, including:
    - ADLS.
    - CRM.
- SSAS
  - Migrating Power BI models into Tabular.
    - This is coming very soon apparently, but will be developed in Azure SSAS first.
  - Object-level security in Tabular.
    - We currently have row-level security, but there is talk of securing a physical object, not just a row.
    - Even better news – Microsoft want to integrate the two together, which will make security awesome in Tabular.
- SSRS
  - Supporting (not pinning) Excel reports in an RS report.
    - This will come, but Power BI is the focus right now and we may have to wait a while.
- Power BI
  - Additional and better pivot table functionality.
  - Integrating Active Directory dynamically.
  - Potential to use Direct Query and Imported modes together – as a hybrid.
    - The functionality is possible, but performance needs to be weighed up by Microsoft before anything will emerge.
  - Quick Calcs.
    - Only 'Percent of Current Total' is currently available.
    - The potential is to offer lots more – such as YTD, MAT, Current Year vs. Previous Year, etc.
    - This is for the users who aren't familiar with DAX.
  - Template organisational content packs.
    - The ability to let the user personalise colours, fonts, etc. within a structured (organisational) content pack.
- Power BI Embedded
  - Application developers will be able to limit user sessions and therefore reduce the per-hour charges that come with them.

There are some features/issues Microsoft do not plan to change. Although, the good thing about Microsoft is that they are community driven, so if you feel strongly about anything (and get support from your peers), they may change their minds.

- SSRS
  - Q&A, Query Editor (Power Query), R integration, etc. not being developed.
    - Pretty obvious really. Whilst they are going to introduce almost everything from Power BI, some elements of functionality are just not needed for on-prem purposes.
    - R scripts may come one day, but are not a focus right now.
- Power BI
  - Source control.
    - No immediate plans to integrate with TFS or modularise the pbix files (for a more developer-based solution).
    - Not surprising, as this is a self-service tool rather than one aimed at development teams.
    - A workaround is to upload pbix files into OneDrive and use its versioning as an element of source control, or to add the file into Visual Studio.
    - Keep voting on PowerBI.com if you want this! (Currently 278 votes.)
      - https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/9677517-source-control
- Power BI Embedded
  - Licence model.
    - 'Speak to Marketing' is what we were told.

Is everyone excited? I certainly am. Microsoft are openly admitting that any new BI-related feature will be developed in the cloud first, but this was pretty obvious. For all you on-prem fanatics (me included), at least the bugs/undocumented features will be ironed out before we get to use them! My personal favourite has to be the direction SSRS is going. It was recently labelled a 'mature' product, but now RS has suddenly become a cool and current tool to use. The fact that Power BI and SSRS will become almost the same product tells us that on-prem is still an important part of the Microsoft licensing strategy.

I am sure there are lots of other features coming into the BI stack over the next 6 months to a year and these were just the ones I had heard about. Feel free to comment if you have additional ones to share.

Setup Of Oracle Attunity Connectors

Fairly often I come across a project where there is a requirement to extract data from an Oracle source. Whilst this can be done in SSIS using a standard ODBC connection, there are some drawbacks – the main one for me is that when working with larger data volumes the speed of data transfer can be quite slow. As such, I have found the best solution is generally to set up the Attunity connectors in order to facilitate quick extraction from the Oracle system. For a great comparison of the performance between straight ODBC and Attunity, please see this post from Len Wyatt. Unfortunately the setup can be a little bit fiddly, so I thought I would document the process here (as much for me as anyone else).

1. The first step is to determine the version of the Attunity connectors required – this depends upon your SQL Server version:

   - SQL Server 2016 – Attunity connector v4.0 – supports Oracle 10.x, Oracle 11.x, Oracle 12c
   - SQL Server 2014 – Attunity connector v3.0 – supports Oracle 10.x, Oracle 11.x, Oracle 12c
   - SQL Server 2012 – Attunity connector v2.0 – supports Oracle 9.2.0.4 and higher, Oracle 10.x, Oracle 11.x
   - SQL Server 2008 / 2008 R2 – Attunity connector v1.2 – supports Oracle 9.2.0.4 and higher, Oracle 10.x, Oracle 11.x

2. Following this you need to install the Oracle client. Generally you will be working in a 64-bit environment, which will require both the 64-bit and 32-bit versions of the Oracle client to be installed (data tools uses a 32-bit connection, but deployed packages should run in 64-bit mode for optimum performance). The Oracle client can be downloaded from http://www.oracle.com/technetwork/database/features/instant-client/index-097480.html; you will need to sign up on the site and create an account if you don't have one already.

   NOTE: It is not enough to only download the connectors for SSIS and run the installation packages – in order to retrieve data you must install the Oracle client.

   Once you have determined the correct version of the Attunity connectors, you will need to select the relevant version of the client which you wish to use. Because Visual Studio runs SSIS packages in 32-bit mode, you will need the 32-bit version of the client installed on development machines. However, when you deploy the packages and run them from the SSIS Catalog they run in 64-bit mode, so you need to install the 64-bit version to get the best performance in production environments. There are several versions of the Oracle client available for download on the site; most recently I used the following version of the Oracle components. I find that these install relatively simply and provide the required features without requiring too much disk space or consuming too much in the way of resources.

   Depending on the version of the client you have selected, the installation instructions for Oracle will vary. However, I would suggest you follow these points:

   1. Run the installation packages as an administrator by right clicking and selecting 'Run as Administrator' – previously when I have not done this it has led to unusual results.
   2. You will need to install the clients on all machines which will be running the SSIS packages – this may be several servers, including developer machines used for creating SSIS packages and servers used for scheduled ETL runs.
   3. The installation should be done in the following order:
      - Install the Oracle 32-bit client (if required)
      - Install the Oracle 64-bit client (if required)
      - Restart the machine

Once the client is installed, a simple way to test that it is working is to use SQL Developer to open a connection to the Oracle database. Once this is working correctly, you then need to install the Attunity connectors. Again, depending on the system you are working on, you will need to install either the 64-bit or 32-bit Attunity components for SSIS. However, here the rules are slightly different:

- If you need to run in 64-bit mode, or in both 64-bit and 32-bit modes, you should install only the 64-bit version of the components.
- If you need to run in 32-bit mode only, then you should install only the 32-bit version of the components.

The latest version of the connectors (for SQL Server 2016) is available from: https://www.microsoft.com/en-us/download/details.aspx?id=52950

The installation of the components is pretty straightforward – once you have downloaded them, you run the executable and then restart the machine. Once completed, the Attunity components will be available for selection within the SSIS toolbox in the Data Flow, as per the below. The connectors are very versatile components and support many of the features you would expect from an SSIS component. Be aware that if you are looking for expressions, these are to be found on the Data Flow task's properties rather than the properties of the actual component.

Deleting Date Stamped Files Over X Days Old Using SSIS

One challenge I have faced recently is to automatically delete files which are over X days old. Below I have documented how I overcame this using a simple For Each Loop container in SSIS to loop through files which contain a date stamp in their file names.

What we need in SSIS:
- A variable – used to hold the names of the files we are looping through to delete
- A file connection manager – used to connect to the file(s) we wish to delete
- A For Each Loop container – this will be used to loop through all the files in a directory
- A script task – this is a blank script task that serves no purpose other than to allow us to set a precedence constraint on the file system task
- A file system task – to actually delete the relevant files

Firstly, create a simple variable with an example file name to get you started. Once you have created a variable which contains an example file name, right click on your file connection, select Properties and then click the ellipsis next to the Expressions property. Next select "ConnectionString" from the property drop-down menu and then, in the expression builder, add the folder path plus your variable name, as shown in the screenshot below. If you wish, you can also add the folder path into your variable, or you could have a separate variable for your folder path.

After creating your variable and setting up a connection to an example file, you will need to go into your For Each Loop container and specify the folder containing the files we wish to delete. I have also specified under Files that I am looking for files containing a "_" and a "." in the file name/extension. Under "Retrieve file name" I have selected "Name and extension", as my directory won't change, so I do not need to bring back the fully qualified path. Still in the For Each Loop editor, click on the Variable Mappings pane and select your variable created earlier.

The next step is to double click the precedence constraint (arrow) between the blank script task and the file system task. Change the evaluation operation to Expression, and insert the code below – tweaked to use your variable name, and adjusted for any differences in your date format. This expression currently works for files with a date suffix of _YYYYMMDD.

The full code used:

DATEDIFF("dd",
         (DT_Date) (SUBSTRING(@[User::strSourceArchiveFileName], FINDSTRING(@[User::strSourceArchiveFileName], "_", 1) + 1, 4)
                    + "-" + SUBSTRING(@[User::strSourceArchiveFileName], FINDSTRING(@[User::strSourceArchiveFileName], "_", 1) + 5, 2)
                    + "-" + SUBSTRING(@[User::strSourceArchiveFileName], FINDSTRING(@[User::strSourceArchiveFileName], "_", 1) + 7, 2)),
         GETDATE()) > 30

What the code is doing: it compares today's date against the date found in the file names (e.g. File1_20161030.txt) and, if the difference between today's date and the date in the file name is greater than 30 days, the expression returns true and the file is deleted in the next step. Otherwise the file is left alone. The date is obtained by using the FINDSTRING function to locate the underscore in the file name. A SUBSTRING function is then used to pick out the various elements of the date: the year starts at the 1st character after the underscore with a length of 4, the month at the 5th character after the underscore with a length of 2, and the day at the 7th character after the underscore with a length of 2. These elements are then concatenated together, separated by a "-", and compared against today's date.
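If you want to sanity-check the date logic outside SSIS, a rough T-SQL equivalent of the expression (using CHARINDEX in place of FINDSTRING) might look like the sketch below. The file name here is just an example and the 30-day threshold is the same as above:

DECLARE @FileName NVARCHAR(100) = N'File1_20161030.txt';

DECLARE @FileDate DATE =
    CAST(SUBSTRING(@FileName, CHARINDEX('_', @FileName) + 1, 4) + '-'     -- year
       + SUBSTRING(@FileName, CHARINDEX('_', @FileName) + 5, 2) + '-'     -- month
       + SUBSTRING(@FileName, CHARINDEX('_', @FileName) + 7, 2) AS DATE); -- day

SELECT CASE WHEN DATEDIFF(DAY, @FileDate, GETDATE()) > 30
            THEN 'Delete' ELSE 'Keep' END AS Action;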
After setting up the precedence constraint, all that is left is to set up the file system task, setting the source connection to our file connection created earlier and changing the operation to "Delete File". As mentioned previously, the solution does require that you have a character such as an underscore to split the file name and the date stamp within the file name, and that your date stamps arrive in a consistent format, be that YYYYMMDD or any other format. These final two screenshots show the before and after results of running the above tasks by executing the SSIS package. Before: After:

Optimise SSIS Fuzzy Lookup Matches

This is a quick post to highlight how making some small changes to the SSIS Fuzzy Lookup settings can make a significant difference to the matches that are returned. As I've mentioned before, SSIS fuzzy matching relies on matching substrings within a string (known as q-grams) in order to produce a result. A given input string (and each reference string) is split up into several different substrings. Simply put, the more substrings, or q-grams, that the input string has in common with the reference string, the better the match. For example, a word such as 'sandwich' would be split up into 'sand', 'andw' and 'dwic' in order to aid the matching process.

I've hit an issue recently whereby the substrings being produced were causing false matches – let's start by going through a quick example, loosely based on Adventure Works data. Here is a package that takes data from a data source, before using a Fuzzy Lookup: I've narrowed down the data source to have just two rows for this example, which are: There are a few more so-called Master Products; these are what we want to match against: By default, if we run the SSIS package then the following results will appear in a data viewer after the fuzzy lookup. Note – the lookup is intentionally returning the top 2 matches to illustrate this point.

The Source Product column reflects the actual situation I experienced recently, whereby the brand name (Apollo and RS in this case) appears in the column to match. Although this is made up, it closely follows the situation I experienced whereby the wrong match is produced in both cases – "Apollo X.C. Large" looks very similar to "Apollo XC", yet it's not the top match. The reason this occurs is actually the Token Delimiters setting on the Fuzzy Lookup. By default the delimiters include both ampersands and full stops, meaning that in the case of 'Apollo X.C. Large' the following tokens/substrings are created by default: Note there is no token created for XC, meaning this will not match well against a substring of 'XC'. One option is to remove the full stops and ampersands as delimiters, as shown below in the Advanced tab: This produces the following results in the data viewer: In this case, the correct master product per row (namely 'Apollo XC' and 'RS Shorts') is returned, so the settings have made a difference.

Conclusion

This shows the effect that the token delimiters can have on the matching process. Unfortunately, what works in one situation may not work in another, so one option may be to try an initial match, followed by a second match for any low-matching rows. Data quality will always have a big impact on matching, so cleaning and enriching data before matching is always a good idea.
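As a footnote, if you want a rough feel for how the delimiter list changes the tokens, the sketch below approximates the behaviour in T-SQL by treating each delimiter as a space. This is only an illustration of the idea, not what the Fuzzy Lookup does internally, and STRING_SPLIT needs SQL Server 2016 or later:

DECLARE @Product NVARCHAR(50) = N'Apollo X.C. Large';

-- With '.' in the delimiter list (the default): 'X' and 'C' become separate tokens
SELECT value AS Token
FROM STRING_SPLIT(REPLACE(@Product, '.', ' '), ' ')
WHERE value <> '';

-- With '.' removed from the delimiter list: 'X.C.' survives as a single token
SELECT value AS Token
FROM STRING_SPLIT(@Product, ' ')
WHERE value <> '';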

New Features in SQL Server 2016 – Part 4: Integration Services

In the current SQL Server 2016 CTP2, there are not many exciting improvements made to SSIS. The hope is that the rumours are true and that the more exciting and beneficial features will come later, probably closer to the actual release. The below have been announced by Microsoft and are available in CTP2:

- AlwaysOn support.
- Incremental package deployment.
- Project upgrade.

The speculation and noises coming out of the SQL community imply two further features will be available soon. As nothing has been officially released, I cannot offer much more insight right now. Look out for an updated blog as and when something is announced.

- Azure Data Factory integration with SSIS.
- Enhanced SSIS – designer support for previous SSIS versions and support for Power Query.

AlwaysOn Support

This high-availability, disaster recovery solution is used for lots of large-scale databases as an alternative to mirroring. Integrating it with SSIS will provide some great benefits:

1. Easily deploy to a centralised SSIS Catalog (e.g. the SSISDB database).
2. SSISDB can now be added to an AlwaysOn Availability Group.
3. When a failover occurs, a secondary node will automatically become the primary.

Incremental Package Deployment

You can now deploy one or more packages to an existing or new project, without the need to deploy the whole project. The incremental process can be actioned from:

1. The Deployment Wizard.
2. SQL Server Management Studio (SSMS).
3. Stored procedures.
4. The Management Object Model (MOM) API.

I have tested the package deployment in CTP2 and it works in exactly the same way as project deployment. The wizard, especially, has the exact same look and feel. In order to run this in CTP2, you will need to navigate to the standard SQL Server 2016 directory, which is usually:

64-bit: C:\Program Files\Microsoft SQL Server\130\DTS\Binn
32-bit: C:\Program Files (x86)\Microsoft SQL Server\130\DTS\Binn

To run the wizard, simply double click the 'ISDeploymentWizard.exe' executable in the folder or run it from the command line. You can now select 'Package' or 'Project' deployment in the options – see below. There is an option to set password security on individual packages and to choose individually which ones you want to deploy at a time. When SQL Server 2016 is officially released, I would imagine the package deployment wizard will be integrated into Visual Studio and SSDT and implemented without the need for the command line or a separate executable.

Project Upgrade

Not much has changed here, apart from:

1. Project-level connection managers work as normal.
2. Package layout/annotations are always retained.

References

For more information on all of the new SSIS SQL Server 2016 features, the below resources/blogs are highly recommended.

- Official Microsoft page - https://msdn.microsoft.com/en-us/library/bb522534(v=sql.130).aspx
- Gilbert Quevauvilliers – BI blog - https://gqbi.wordpress.com/2015/05/07/bi-nsight-sql-server-2016-power-bi-updates-microsoft-azure-stack/
- Manoj Pandey blog - http://sqlwithmanoj.com/2015/05/09/microsoft-announced-sql-server-2016-new-features-and-enhancements/

Optimising SSIS Lookups with HASHBYTES()

I was recently working on a load for a dimension table which worked like the below. The package follows a fairly standard routine: data for the dimension is transformed into the required format using SSIS and then inserted into the target table. On the way, an initial check is made to determine if the record is new ('LKU – Check Exists') and a secondary check determines if the record has changed, which is subsequently used in an update routine. Although the package was loading incrementally, it sometimes had to check several million rows to determine if these had changed. As the table was quite wide, I decided to use the HASHBYTES() function to take a hash of the columns for comparison, rather than comparing each of the columns individually.

The initial results were quite promising, with the package running in around half the time. However, on a particularly big load I noticed the package failed as it had run out of memory for the second lookup (full cache was used). I found this odd because the HASHBYTES function only returns a single value, which I anticipated would be smaller than the sum of the columns. On a little deeper investigation I found that, by default, the value returned by HASHBYTES is typed as VARBINARY(8000), so each row in the lookup cache reserves 8,000 bytes for it. A quick check of the SQL documentation for HASHBYTES at http://msdn.microsoft.com/en-GB/library/ms174415.aspx states that the data returned by HASHBYTES when used with the SHA2-256 algorithm is 32 bytes, meaning that most of this space was being wasted. Therefore, changing the formula from 1 to 2 below significantly increases the efficiency of the lookup and also makes the required cache size smaller by a factor of around 250!

1. HASHBYTES('SHA2_256', [ProductName] + '|' + [CategoryName]) AS HashValue
2. CONVERT(VARBINARY(32), HASHBYTES('SHA2_256', [ProductName] + '|' + [CategoryName])) AS NewHash

Upon doing this the size of each row was significantly reduced, the package is running faster than before and, most importantly, there are no memory issues :)
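To put formula 2 in context, a lookup source query following this pattern might look something like the sketch below. The table, the surrogate key and the ISNULL defaults are assumptions for illustration; the pipe delimiter simply stops adjacent values running together:

SELECT  ProductKey,
        CONVERT(VARBINARY(32),
                HASHBYTES('SHA2_256',
                          ISNULL([ProductName], N'') + N'|' + ISNULL([CategoryName], N'')
                         )) AS HashValue   -- 32 bytes per row instead of 8,000
FROM    dbo.DimProduct;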

SSIS Fuzzy Lookup Transformation Tokens

After writing my recent blog post on the Excel Fuzzy Lookup Add-In vs SSIS, I got thinking about the capabilities of the SSIS Fuzzy Lookup transformation. Although I'm aware that the Fuzzy Lookup is a token-based solution, I've only ever really tuned it by altering the similarity thresholds on both the columns and the task itself. What you can also do is alter how the transform deals with tokens. Therefore, I thought it would be worth a quick post to show how the tokens help produce matches.

Tokens

The Fuzzy Lookup works by splitting up strings into several different components, known as tokens. For example, the Adventure Works reseller "Cross-Country Riding Supplies" might be split up into the tokens "Cross", "Country", "Riding" and "Supplies". This is key to the matching process, as the fuzzy algorithms will attempt a match based on the commonality between tokens in the input and reference datasets.

SSIS Fuzzy Lookup

The Advanced tab of the Fuzzy Lookup shows the delimiters used to create the tokens. As you can see, a hyphen is included in the list, meaning words separated by a hyphen get interpreted as separate tokens: On the Reference Table tab of the transformation, there is the option to store a new index, which means that SQL Server will create a table in the data source to hold the tokens found in the reference table: In my made-up dbo.Retailers table, I've got two retailers – "Nearest Bike Store" and "Gears+Chain Supply". After running the SSIS package once, the dbo.IndexTest table will be created, which is where the reference tokens get stored in order to help with matching. As it's just stored in SQL Server, we can select from it. Before we do that, I should just say that I'm only passing this fuzzy lookup one input string, which is a misspelt company called "Bikes Chaiin Supply". If I run the package, I get the following matches and scores: This has a fairly low similarity and confidence, so we can probably do better.

Going back to the index, if we select from dbo.IndexTest, we get the following: The index table also contains substrings found in the reference table, known as q-grams, which are designed to assist with matches that contain errors. What's interesting about these results is that "gears+chain" has been entered on its own, meaning that the word "chain" has not been interpreted as a token in this case. The solution, in this case, is to alter the token delimiters to include "+": If we re-run the package and then look again at the index table, we now see that the word "chain" has been successfully interpreted as a token. In addition we also have a few variants on the word chain, such as "chai" and "hain": After re-running the matching process with our new delimiters, we can now look at the results. If we take a look at the data viewer output, we can see that this has now resulted in a better similarity and confidence for the first row:

The delimiters work across all columns for the whole fuzzy lookup, so that's something to test when changing them – it may not be beneficial in all situations. However, the new delimiter has clearly worked very well in this example, so it's very much worth considering if you're not quite getting the level of matches that you had hoped for.

Excel Fuzzy Lookup Add-In and SSIS

Although it's been out for a while, I thought I'd mention, for those who don't know, that you can download a Fuzzy Lookup Add-In for Microsoft Excel. The Add-In allows you to perform a fuzzy lookup of Excel data against another Excel dataset. I won't go into too much detail, as this has already been covered here, but essentially the Add-In requires you to set up two datasets, namely a "left" and a "right" table. Once you carry out some configuration/mappings, the fuzzy match will return all of the rows in your left table, with the appropriate values looked up from your right table. As an example I've chosen to do my lookup based on reseller names and an address:

Part of the reason for this post is that I think this Add-In could be useful from a self-service perspective for power users to assist in MDM/matching/DW projects. There's often a lot of trial and error in trying to get the right balance for matching rules and, given that the Fuzzy Add-In gives similar results to SSIS, it may well be a good place to start with matching a dataset as part of an MDM project, for example. Alternatively, it could be used to carry out pre-matching/clean-up on data before an MDM project starts.

SSIS

I also wanted to contrast this Add-In to the Fuzzy Lookup transformation that you get in SSIS. Both the Fuzzy Lookup Add-In and the SSIS variant do the same thing – they both match one dataset against another in a fuzzy manner, returning you 1-n matches per row. In fact, according to this paper, they use the same Microsoft fuzzy algorithm. With the default settings, at least with my small test datasets, I get similar results in Excel and SSIS, albeit with different similarity numbers. These are the key differences that I've noticed:

- Excel returns only an overall row similarity, whereas SSIS also returns the similarity per column.
- Excel exposes an additional set of tuning options. As an example, you can set the Containment Bias, whereby you decide on the penalty for tokens in the right record that are not in the left record. You can also edit a number of other settings in an XML file.
- SSIS outputs both the confidence and similarity, whereas Excel outputs only the similarity.
- Excel will output the similarity XML per row, whereby you can see how it has split up a string into differently weighted tokens.
- As you might expect, SSIS outperforms Excel.

As a quick test, I started working with Adventure Works resellers to see if I could get the same match results from both products. To highlight this, I've filtered the data down to just one made-up reseller ("Bikes World Comp. Store") and looked at the top 5 resellers each product returns. SSIS, with its default settings, gives the following when matching just on the reseller name: Excel, using the Fuzzy Add-In, gives the following with its default settings: So overall some very good matches out of the box from the Excel Add-In, certainly comparable to SSIS. What's interesting about this result set is that "Bike World" is ranked higher in Excel than "World Bike Discount Store". We can align the behaviour of the two products, in this case anyway, by changing the Containment Bias from its default of 0.8 to 0. This will assign a full penalty to tokens in the right record that are not in the left record. So, as the left record is "Bikes World Comp. Store" and the right record is "Bike World", we are by default being lenient on the fact that it's missing tokens.
Increasing the penalty for missing tokens gives the following, with the top record now matching SSIS: Of course, you would normally match on more than just the name, and what's worked for me here might not work in other scenarios. But the bottom line is that the Excel Fuzzy Lookup Add-In gives very good matches out of the box, is easy for the Excel user to work with, and therefore could well be of some use as part of data matching projects.

Different Methods Of Type 2 Dimension Population

This post is to share some of my findings about the different ways of dealing with Type 2 changes in an SSIS environment. I wanted to provide a comparison of the methods in terms of performance, reusability and ease of maintenance. For the purposes of the testing, let's take the scenario where a client would like to implement historical tracking in their DW environment for one of their dimensions, and let's say that they have given the following requirements:

- The ordered online flag should be replaced with a more user-friendly text field containing 'Yes' or 'No' (done in a derived column transformation/SQL CASE statement).
- The client would like to track any changes to the Purchase Order Number, Customer Account Number or Carrier Tracking Number so that they can preserve the history of these changes. These will be set up as Type 2 changing attributes and we will need to record the time at which each change occurred.
- Any changes to the OrderedOnline field will be the result of corrections made to the source data; historic tracking of changes is not required for this in the DW and all rows should be updated with the new value (a Type 1 change).

I created the following table to hold the data:

CREATE TABLE DimOrder
    (Order_Key INT IDENTITY(1,1) PRIMARY KEY NOT NULL --Surrogate Key
    ,SalesOrderID INT NOT NULL --Source Key
    ,LineNumber TINYINT NOT NULL --Source Key
    ,OrderNumber NVARCHAR(25) NOT NULL
    ,PurchaseOrderNumber NVARCHAR(25) NOT NULL
    ,AccountNumber NVARCHAR(15) NOT NULL
    ,CarrierTrackingNumber NVARCHAR(25)
    ,OrderedOnline NVARCHAR(3) NOT NULL
    ,RecordValidFromDate DATETIME NOT NULL
    ,RecordExpiredDate DATETIME
    )

The three methods I have selected are:

1. The SSIS in-built Slowly Changing Dimension (SCD) task.
2. A custom-built SSIS package to perform the Type 2 changes.
3. A T-SQL procedure to perform the dimension population.

In each of the comparisons I performed the following tests:

1. An initial load of data (all of the records are new and there are no changes).
2. A load of the same data from step 1 (no changes, all records already exist).
3. A load of data which contains both Type 1 and Type 2 changes.

I used Adventure Works as a source, and steps 1 and 2 both use the same source statement, which looks like this:

SELECT
     OrdHed.SalesOrderID
    ,SalesOrderDetailID
    ,OnlineOrderFlag
    ,SalesOrderNumber
    ,PurchaseOrderNumber
    ,AccountNumber
    ,CarrierTrackingNumber
FROM Sales.SalesOrderHeader OrdHed
INNER JOIN Sales.SalesOrderDetail OrdDet
    ON Ordhed.SalesOrderID = Orddet.SalesOrderID

For the third test I created some copies of the source tables and then ran some basic statements on them to change the data, as below:

SELECT * INTO [Sales].SalesOrderDetailTest FROM [Sales].SalesOrderDetail

SELECT * INTO [Sales].SalesOrderHeaderTest FROM [Sales].SalesOrderHeader

--this will only exhibit Type 1 changes
UPDATE Sales.SalesOrderHeaderTest
SET OnlineOrderFlag = CASE WHEN OnlineOrderFlag = 1 THEN 0 ELSE 1 END
WHERE SalesOrderID BETWEEN 59392 AND 67257

--this will exhibit only a Type 2 change
UPDATE Sales.SalesOrderHeaderTest
SET PurchaseOrderNumber = 'X' + PurchaseOrderNumber
WHERE SalesOrderID < 51525

--this does a mix of both
UPDATE Sales.SalesOrderHeaderTest
SET
    PurchaseOrderNumber = 'X' + PurchaseOrderNumber
    ,OnlineOrderFlag = CASE WHEN OnlineOrderFlag = 1 THEN 0 ELSE 1 END
WHERE SalesOrderID BETWEEN 51524 AND 59391

The data is then sourced from the new tables, which provide the changes for testing.
1. SSIS Slowly Changing Dimension Task

Configuration of the package is very straightforward; you add a data source, drag a Slowly Changing Dimension task onto the data flow and double click it to configure how you want it to run. The whole process takes less than 10 minutes to complete and there are many options that should suit a variety of needs. Once the configuration is completed we end up with a dataflow which looks like this:

2. Custom Built SSIS Package

Next I designed a package in SSIS but without using the SCD wizard. I used the following principles during the design to try and improve performance:

1. Changes can be detected in the data flow by using simple lookup transformations (we could consider hash checking to improve performance if there were very many columns) to see if there is any change to the data; by using several lookups we can determine if the change is Type 1, Type 2 or both.
2. Updates in the data flow are performed in SSIS using the 'OLE DB Command' task. As this works on a row-by-row basis, it has a large impact on performance. To avoid this I created tables to hold the Type 1 and Type 2 changes and then used an 'Execute SQL Task' in the control flow to make the updates in one go.

The completed dataflow looks like this: New records go straight into the target table and the ones which have changes are split using a multicast; where a change has occurred the records are inserted into holding tables and then processed in the control flow. The control flow looks like this:

The control flow executes the statements below:

--Make Type 1 changes
UPDATE TGT
SET TGT.OrderedOnline = SRC.OrderedOnline
FROM DimOrder TGT
INNER JOIN DimOrderType1 SRC
    ON TGT.SalesOrderID = SRC.SalesOrderID
    AND TGT.LineNumber = SRC.LineNumber;

--Type 2 expire old rows
UPDATE TGT
SET TGT.RecordExpiredDate = SRC.TransactionDate
FROM DimOrder TGT
INNER JOIN DimOrderType2 SRC
    ON TGT.SalesOrderID = SRC.SalesOrderID
    AND TGT.LineNumber = SRC.LineNumber;

--Type 2 insert new records
INSERT INTO DimOrder (SalesOrderID, LineNumber, OrderNumber, PurchaseOrderNumber, AccountNumber
                     ,CarrierTrackingNumber, OrderedOnline, RecordValidFromDate)
SELECT
    SalesOrderID
    ,LineNumber
    ,OrderNumber
    ,PurchaseOrderNumber
    ,AccountNumber
    ,CarrierTrackingNumber
    ,OrderedOnline
    ,TransactionDate AS RecordValidFromDate
FROM DimOrderType2;

3. T-SQL Script

I also thought I would try a basic SQL script to see how this performed.
CREATE PROC popDimOrder AS
BEGIN
    CREATE TABLE #OrderData
        (
        SalesOrderID INT NOT NULL
        ,LineNumber INT NOT NULL
        ,OrderNumber NVARCHAR(25) NOT NULL
        ,PurchaseOrderNumber NVARCHAR(25) NOT NULL
        ,AccountNumber NVARCHAR(15) NOT NULL
        ,CarrierTrackingNumber NVARCHAR(25) NOT NULL
        ,OrderedOnline NVARCHAR(3) NOT NULL
        ,RecordExists SMALLINT NOT NULL DEFAULT(0)
        )

    DECLARE @TransactionDate DATETIME
    SELECT @TransactionDate = GETDATE()

    INSERT INTO #OrderData (SalesOrderID, LineNumber, OrderNumber, PurchaseOrderNumber, AccountNumber, CarrierTrackingNumber, OrderedOnline, RecordExists)
    SELECT
        OrdHed.SalesOrderID
        ,SalesOrderDetailID AS LineNumber
        ,SalesOrderNumber AS OrderNumber
        ,ISNULL(PurchaseOrderNumber, '') AS PurchaseOrderNumber
        ,AccountNumber
        ,ISNULL(CarrierTrackingNumber, '') AS CarrierTrackingNumber
        ,CASE OnlineOrderFlag
            WHEN 0 THEN 'No'
            WHEN 1 THEN 'Yes'
         END AS OrderedOnline
        ,0 AS RecordExists
    FROM [AdventureWorks2012].Sales.SalesOrderHeaderTest OrdHed
    INNER JOIN [AdventureWorks2012].Sales.SalesOrderDetailTest OrdDet
        ON OrdHed.SalesOrderID = OrdDet.SalesOrderID

    --Check if records exist
    UPDATE SRC
    SET RecordExists = 1
    FROM #OrderData SRC
    INNER JOIN POC.dbo.DimOrder TGT
        ON TGT.SalesOrderID = SRC.SalesOrderID
        AND TGT.LineNumber = SRC.LineNumber

    --Insert new records
    INSERT INTO DimOrder (SalesOrderID, LineNumber, OrderNumber, PurchaseOrderNumber, AccountNumber, CarrierTrackingNumber, OrderedOnline, RecordValidFromDate, RecordExpiredDate)
    SELECT
        SalesOrderID
        ,LineNumber
        ,OrderNumber
        -- [the rest of the procedure is missing from the original post]

SSRS Strip Lines

Once in a while you happen upon a really useful feature in SSRS that you were unaware of. For me, strip lines are definitely one of those features that you don't see much mention of but which can add valuable insight to a report. Below is a simple example of a scatter chart with a horizontal and a vertical strip line, each marking the average value of its axis. In order to add strip lines to the report you need to do the following:

1. Select the chart axis to which you wish to add the strip line and go to the Properties window. In this there is an option marked 'StripLines'. When you click in the cell you can open up the Strip Line Collection editor.
2. Click the 'Add' button in the strip line editor (note: you can add multiple strip lines to each axis if required).
3. In the option marked 'IntervalOffset', specify the required value for the strip line; you can either specify a fixed value or use an expression from your dataset. If you require the line to repeat over fixed values you can fill in the properties for 'Interval'.
4. You then need to set the display properties for the strip lines under Appearance in order for them to be visible.
   To produce the example strip lines in this post I have set the 'BorderColour' to black and the 'BorderStyle' to dashed.
5. If you wish to add text after the strip line on the graph – for example, to illustrate that values after the strip line are above average, as shown below – then the settings are under the Title section.

That's it for now, hope that you've found this helpful :)

Matching with Master Data Services, DQS and SSIS

If you have an MDM requirement to consolidate multiple sources of master data together into a single golden record, then you have a few different ways to achieve this on the Microsoft platform. This blog post gives an overview of the different ways that various matching methods can be used in conjunction with Master Data Services for consolidation in Master Data Management, outlining the pros and cons of each option. In summary, the options are:

- Data Quality Services (DQS)
- SQL Server Integration Services (SSIS)
- Plus Master Data Services itself has a few (not that well known) matching capabilities

Data Quality Services

A key requirement in all but the simplest MDM solutions is that the matching/consolidation must be carried out in an automated manner, with a Data Steward alerted to take action if needed (e.g. the matching engine incorrectly matches two customers, the Data Steward opens MDS and corrects this). This scenario would be hard to achieve with DQS, as it's possible to automate the DQS cleaning, but it's not possible to automate the DQS matching. This is something that I've raised connect issues about, here and here. If your data to be matched into MDS is not coming in on a daily basis, and you therefore need to do more ad-hoc matching to produce your golden master data records, then DQS could be for you. The MDS Excel Add-In will give you the capability of matching data that you import into Excel with MDS members, harnessing the power of DQS. An overview of how this works is available here.

Integration Services

SSIS has been around for a long time now and, as many of you will know, contains fuzzy matching components. With the right MDS model design, it's possible to carry out a batch-based fuzzy match between your master records and end up with a mapping between the records that exist in your source systems and your MDS golden records. The rough conceptual steps to do this are:

1. Load the new and changed records from the source systems into a staging area.
2. Clean and standardise your data. This is actually something that DQS cleaning can help with.
3. Query your staging area to get the new records that you want to insert/update into Master Data Services. Now the question arises: do we have an exact or close match for these records already in MDS?
4. While the exact matches are easy to deal with, use the SSIS Fuzzy Lookup component to establish whether there are any approximate matches.
5. Link the source records to master records (if the match is high enough) using MDS domain attributes.
6. Carry out the appropriate inserts and updates into MDS using the MDS staging tables.
7. Ensure that a Data Steward is alerted in some way if necessary (e.g. if the match threshold is below x% confidence). This can be done with email or MDS notifications, for example.

This process can run in batch overnight, with the Data Steward approving or rejecting the matches that SSIS has carried out the following morning. Whilst the above over-simplifies the process and technical work required, hopefully it makes sense at a high level.

Master Data Services

Although you cannot feed MDS your source data and get it to automatically carry out matching for you, it does actually contain the raw components to do this. By this I mean the MDS database contains an assembly called Microsoft.MasterDataServices.DataQuality, which gives you a number of fuzzy matching T-SQL functions. These are called from the MDS front end when you carry out some filtering when viewing entities; a rough usage sketch is shown below.
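This sketch assumes you are connected to (or cross-database querying) the MDS database, where the functions live in the mdq schema, and it uses a hypothetical staging table of candidate matches; the method codes and the exact signature are worth checking against the documentation for your MDS version:

-- Score a source value against candidate master records using the MDS similarity function.
-- The third parameter picks the algorithm (e.g. 2 for a Jaro-Winkler style comparison),
-- followed by a containment bias and a minimum score hint.
SELECT  c.CustomerName,
        mdq.Similarity(N'Bikes World Comp. Store', c.CustomerName, 2, 0.0, 0.6) AS SimilarityScore
FROM    dbo.CandidateCustomers AS c   -- hypothetical staging table
ORDER BY SimilarityScore DESC;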
Using them just for filtering in the front end really isn't using the functions to their true capability, but thankfully, as the sketch above suggests, you can use these functions in your own code. You can use the MDS T-SQL functions in a similar way to the conceptual SSIS method outlined above, in order to match and eliminate duplicates. In addition, the MDS web API can also be used to carry out a fuzzy match, as mentioned in this forum post. Retrieving match candidates using a web service may be an attractive option if you're trying to do real-time MDM.

Conclusion

Essentially, until it's possible to automate DQS matching, we have a choice between SSIS and the MDS matching functions. The following e-book gives a very detailed overview of the matching that both are capable of. The MDS T-SQL functions are more flexible than the SSIS fuzzy components, as you can choose which fuzzy algorithm you want to use, whereas the SSIS components let you choose between Fuzzy Grouping and Fuzzy Lookup out of the box, without having to write SQL. Although I tend to find that both give very good matching results, the MDS T-SQL functions produce slightly better matches in my experience, plus they give you the option of trying different algorithms to suit different scenarios. It's also worth mentioning that Profisee Maestro (full disclosure: we are a partner) integrates with MDS, offering its own matching algorithms. Maestro also has a front end with the ability to assist with survivorship and approval, which I think is a useful addition to MDS. Speaking of survivorship and approval, there are two options in MDS out of the box. The new Master Data Manager web front end is much improved, but potentially the MDS Excel Add-In allows a bit more flexibility for survivorship carried out by a Data Steward, just due to its natural ability for filling/copying/pasting/filtering. So overall, due to the various improvements, Master Data Services is now capable of tackling more complex MDM scenarios than the 2008 R2 version could.

DQS Matching Vs SSIS Fuzzy Grouping and Lookup

When the term fuzzy matching comes up in the Microsoft world, it's natural for anyone who's used the BI stack to think of SQL Server Integration Services (SSIS), due to the fact that it has both Fuzzy Grouping and Fuzzy Lookup components. Therefore, when using the matching in Data Quality Services (DQS) the other day, I thought it might be worth contrasting the matching capabilities in the two products. Overview The SSIS Fuzzy Grouping transformation is typically used on a set of data containing duplicates, as it will operate across all of the rows, grouping similar rows together based on fuzzy comparisons, with a view to eliminating duplicates. In contrast, the Fuzzy Lookup transformation takes a value in the SSIS pipeline and uses fuzzy matching to match the input value against a set of clean reference data in a database. The matching in DQS takes a set of data and groups the data into clusters, based on a number of rules and other factors, again with a view to removing duplicates. It's therefore only directly comparable against the SSIS Fuzzy Grouping, rather than the Fuzzy Lookup. Test Scenario For the purposes of this blog post, I've got a sample query from AdventureWorksDW, taking data from the DimCustomer table. I've unioned all that data with a few sample records that contain typos. E.g. I've got 'Pamela Fernndez' misspelt for one customer and '4610 Pinto Rd' instead of 'Road' for another. I'm going to attempt to de-duplicate the data in SSIS first and then do the same thing in DQS to see if there is any difference. SSIS Fuzzy Grouping As shown below, I've just got a very simple OLE DB source passing some data into a Fuzzy Grouping, then I'm going to examine the data via a data viewer: The columns that I'm passing into the Fuzzy Grouping for the Customers are Full Name, Address Line 1, Gender and City, just for the purposes of this blog post. I've configured the Fuzzy Grouping transform to carry out fuzzy matching on Address Line 1 and Full Name, as these are the most inconsistent (intentionally) across my made up data. I've also dropped the threshold right down to 0.40 to see a full range of matches that we may or may not get: I've filtered the data for customers with a surname of Fernandez or a first name of Pamela, which includes two made up records. When I run the package, the data viewer produces the following results: The _score column has a high score of 0.8245935, which is based on the similarity scores of FullName and AddressLine1. There are 3 records included in the cluster, including someone called 'P Fernandez' who is male. Although that's been matched, it's only because I decreased the threshold, plus it can easily be excluded by adding an exact match for Gender, as shown below: DQS Matching So overall SSIS copes well with the test scenario; how does DQS compare? After creating a knowledge base, creating domains and carrying out knowledge discovery, it's possible to use DQS to create a Matching Policy. The Matching Policy involves building a number of rules that determine how the data matches. The starting point in creating a matching rule is determining which domains you want to match on and whether they should be matched using the fuzzy algorithms (similar) or matched exactly: So not too dissimilar to SSIS. You then choose to give each domain a percentage weight, which must add up to 100%. 
For example, if you give Full Name a weight of 30% and, when run, its match score is 61%, and you have Address Line 1 at a 70% weight with a 45% score, then the overall match score for a row will be (0.3 * 0.61) + (0.7 * 0.45) = 49.8%. Starting off with a 50% weight on the two domains, I get the following results when operating on the same data: As with SSIS, it hasn't got it quite right, but using the Gender domain will help. What I quite like in DQS is that I can flag a domain as being a pre-requisite: Flagging a domain as a pre-requisite means that the domain will be used to eliminate matches, but will not contribute to the matching score. If I wanted it to contribute to the matching score I would just have to uncheck pre-requisite and give it a weight. Running the matching again gives the following results: So the duplicate is correctly mapped with a score of 75%. Drilling down on how the match was achieved is quite nice in DQS also: This brings me to an interesting observation about the matching. I originally had my dummy record with a value of "3 Pierr Demoulainn" for the address. So in addition to having a typo in the second name and missing the word "Rue", this version of the test record also missed the second 'e' out of the word "Pierre". If I run my test data with these values, I get no matches at all. The lack of the second 'e' in "Pierr" seems to have made all the difference. I can get it to match if I give a lot of weighting to Full Name. If I do this, we see the following results: Address Line 1 has been given a score of 0%. Making the same change to my test data and then running in SSIS gives different results. The similarity for the address drops from 0.81 to 0.75, but it certainly doesn't drop to zero. Although it will depend massively on your own data, the matching in DQS seems a bit more stringent. This can be mitigated by using more domains for the matching, rather than the simple 2 or 3 domains that I've used. To back that statement up, using a composite domain on both Address Line 1 and City does yield a strong match (71%) for the same data. Conclusion The rule based matching in DQS gives a lot of flexibility on how to weight matching across different domains. Coupled with the knowledge base components that interact with the matching, such as cleaning and Term-Based Relations, DQS has more features specifically aimed at matching than SSIS. It all depends on what sort of data you're working with, but in theory, as you maintain the knowledge base over time, DQS should give you strong matches based on the knowledge you've given it. However, there are some drawbacks – it's a shame that the matching can't be automated (please vote for my Connect suggestion if you agree). SSIS by its very nature can be automated, meaning the matching in SSIS will be more suitable for a lot of implementations. Secondly, I'd like to be able to match external records against correct values already in the knowledge base, rather than just doing a fuzzy grouping on the external records. I'm a bit surprised DQS can't do a fuzzy lookup; maybe this will change in V2. Finally, as I've shown above, some of the matching results are a little unexpected, whereas in SSIS they were good pretty much straight away. To be fair to DQS this could probably be mitigated with a better choice of matching fields and also by running cleaning against a fully populated knowledge base beforehand.
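Going back to the test scenario above, the source data can be reproduced with a query along the lines of the sketch below. The DimCustomer and DimGeography column names come from AdventureWorksDW, but the dummy rows are only there to show the shape of the union – apart from the typo values quoted above ('Pamela Fernndez', '4610 Pinto Rd'), their names, genders and cities are placeholders you would replace with values that clash with real customers.

```sql
-- Real customers from AdventureWorksDW, unioned with a couple of deliberately
-- mistyped rows to give the fuzzy matching something to find.
SELECT  c.FirstName + ' ' + c.LastName AS FullName,
        c.AddressLine1,
        c.Gender,
        g.City
FROM    dbo.DimCustomer AS c
INNER JOIN dbo.DimGeography AS g
        ON c.GeographyKey = g.GeographyKey
UNION ALL
SELECT 'Pamela Fernndez', 'Placeholder Address', 'F', 'Placeholder City'   -- misspelt name
UNION ALL
SELECT 'Placeholder Name', '4610 Pinto Rd', 'M', 'Placeholder City';       -- 'Rd' instead of 'Road'
```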

A Pattern To Load Data to Master Data Services via SSIS–Part 2

Introduction This is the second part of a series of blog posts intended to show a way to load data in to MDS via SSIS. In part 1 we loaded the staging tables with new members and attributes for several entities. In this blog post we are going to extend the SSIS package with tasks to move the data from the staging tables into the MDS Product model and validate these newly inserted members. Completing The Solution We need to move the data from the staging tables into the model. This is carried out by executing the MDS staging sweep process. To achieve this we need to add an Execute SQL Task to the control flow of our package. Rename the task – I've called mine 'SQL – Sweep Stage' – and connect it up to the 'DFL – Load Staging Tables' task with a success constraint. On the General tab set the connection to MasterDataServices and the SQL Statement as follows: DECLARE @ModelName NVARCHAR(50) = ? DECLARE @UserName NVARCHAR(50) = ? DECLARE @User_ID INT DECLARE @Version_ID INT SET @User_ID = (SELECT ID FROM mdm.tblUser u WHERE u.UserName = @UserName ) SET @Version_ID = (SELECT MAX(ID) FROM mdm.viw_SYSTEM_SCHEMA_VERSION WHERE Model_Name = @ModelName) EXECUTE mdm.udpStagingSweep @User_ID, @Version_ID, 1 Then add the Parameter Mapping as shown below: That's all there is to do to get our data into the model. However, this process is asynchronous, and before we can validate the model we need to know when the staging sweep has finished. Add a For Loop Container task to the control flow of the package and connect it up to the 'SQL – Sweep Stage' task with a success constraint. Rename the task – I've called mine 'FLC – Wait Until Batch Completes'. Add an Execute SQL Task inside the loop container task and rename it. Mine is called 'SQL – Get Staging Batch Status'. Change the connection to MasterDataServices, change the ResultSet property to 'Single row' and then add the following SQL script to the SQLStatement property: DECLARE @Version_ID INT SET @Version_ID = (SELECT MAX(ID) FROM mdm.viw_SYSTEM_SCHEMA_VERSION WHERE Model_Name = ?) SELECT TOP 1 Status_ID FROM mdm.tblStgBatch WHERE Version_ID = @Version_ID ORDER BY ID DESC Add the parameter mapping as shown below: And the Result Set as shown below: Add three more variables to the package as shown in the table below: Next configure the For Loop properties as shown in the table below: The InitExpression value sets @dtMDSLoopTimer to the current time plus the interval set in our @intMDSTimeout variable. The EvalExpression keeps the loop running while @strMDSBatchStatus is not equal to 2 (Success) and the timeout has not yet expired, so the loop exits on either success or timeout. The For Loop Container can only succeed if the staging batch is successfully loaded. Now we can validate the model, so add an Execute SQL Task to the control flow and connect it to the 'FLC – Wait Until Batch Completes' task with a success constraint. Rename the task – mine is called 'SQL – Validate Model'. Change the connection to MasterDataServices and the SQLStatement as follows: DECLARE @ModelName NVARCHAR(50) = ? DECLARE @UserName NVARCHAR(50) = ? DECLARE @User_ID INT DECLARE @Version_ID INT DECLARE @Model_ID INT SET @User_ID = (SELECT ID FROM mdm.tblUser u WHERE u.UserName = @UserName ) SET @Version_ID = (SELECT MAX(ID) FROM mdm.viw_SYSTEM_SCHEMA_VERSION WHERE Model_Name = @ModelName) SET @Model_ID = (SELECT Model_ID FROM mdm.viw_SYSTEM_SCHEMA_VERSION WHERE Model_Name = @ModelName) EXECUTE mdm.udpValidateModel @User_ID, @Model_ID, @Version_ID, 1 Set the parameter mapping as follows: Almost done. 
Just to finish it off, let's add a Script Task to the control flow and connect it to the 'FLC – Wait Until Batch Completes' task with a completion constraint. Change the Evaluation Operation of the constraint to 'Expression and Constraint' and set the Expression to '@strMDSBatchStatus != 2'. Edit the script and add the following line of code under // TODO: Add your code here: Dts.Events.FireError(0, "SCR - Fire Error", "MDS Timeout Occurred", string.Empty, 0); This task will fire an error event if the MDS staging batch does not complete successfully. The finished package control flow should look similar to the following image: Execute this package and then check the Product entity in MDS. It should look something like the following: Looking at the other entities you will see that we have added members to three entities and validated all these new members. Summary Over the last two blog posts I have shown a way of automating the loading of data to Master Data Services via SSIS. This pattern can be used to cater for most of your loading requirements. That's it – the completed Integration Services project source code and MDS model can be downloaded from here (you will need to create a login first). Your comments are very welcome.
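As an aside, the polling that the For Loop Container performs can also be expressed purely in T-SQL if you would rather wait inside a stored procedure. The sketch below uses only the objects already referenced in this post (mdm.viw_SYSTEM_SCHEMA_VERSION, mdm.tblStgBatch and Status_ID 2 meaning success); the five-minute timeout and five-second polling interval are arbitrary assumptions.

```sql
-- Sketch of the same wait-for-batch logic the For Loop Container implements:
-- poll the latest staging batch for the model until it succeeds (Status_ID = 2)
-- or an assumed five-minute timeout expires.
DECLARE @ModelName  NVARCHAR(50) = N'Product';
DECLARE @Deadline   DATETIME     = DATEADD(MINUTE, 5, GETDATE());
DECLARE @Status_ID  INT          = -1;
DECLARE @Version_ID INT =
    (SELECT MAX(ID) FROM mdm.viw_SYSTEM_SCHEMA_VERSION WHERE Model_Name = @ModelName);

WHILE @Status_ID <> 2 AND GETDATE() < @Deadline
BEGIN
    SET @Status_ID = ISNULL(
        (SELECT TOP 1 Status_ID
         FROM   mdm.tblStgBatch
         WHERE  Version_ID = @Version_ID
         ORDER BY ID DESC), -1);

    IF @Status_ID <> 2
        WAITFOR DELAY '00:00:05';   -- assumed polling interval
END

IF @Status_ID <> 2
    RAISERROR('MDS staging batch did not complete successfully before the timeout.', 16, 1);
```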

A Pattern To Load Data to Master Data Services via SSIS

Part 1 Introduction Loading new members to an MDS entity will be a common requirement in all MDS implementations. In these blog posts I am going to walk you through building an SSIS package that performs the following processes: Load new members and attributes to several entities via staging tables Validate the MDS model that contains the entities In part one we will load the MDS staging tables ready to take our new members and attributes into our MDS model. For a thorough understanding of the staging process in MDS please see the Master Data Services Team blog post on Importing Data by Using the Staging Process. A pre-requisite is to have the AdventureWorks2008R2 database samples installed on the same instance of SQL Server as Master Data Services. In MDS I have created a model named 'Product' with an entity of the same name. The Product entity has the following attributes, which are set to the default type and length unless specified: Name Code Model (Domain Attribute) Culture (Domain Attribute) Description (Text, 500) We are going to load this entity with Product data from the AdventureWorks2008R2 database using an SSIS package. In addition to this there are two further entities in the Product model: Culture Model These entities have just the Code and Name attributes and are set to the default type and length. The MDS model and Integration Services project source code can be downloaded from here (you will need to create a login first). Building The Solution OK, enough of the intro; let's get on and build the package. Start a new Visual Studio Integration Services project and save the default package to a more suitable name. I've called mine 'LoadMDS.dtsx'. Create the following connection managers as shown below, remembering to replace the Server and MDS database names. Rename the connection managers to 'AdventureWorks' and 'MasterDataServices' respectively. Now we need to create some variables, so go ahead and create the variables shown in the table below: We are now ready to put our first task into our package. This task will optionally clear the staging tables of all successfully loaded members, attributes and relationships prior to loading, based on the value of the blnClearStage variable. Add an Execute SQL Task to the control flow of your package and rename it – I've called mine 'SQL – Clear Staging Tables'. On the General tab set the connection to MasterDataServices and the SQL Statement as follows: DECLARE @ModelName NVARCHAR(50) = ? DECLARE @UserName NVARCHAR(50) = ? DECLARE @User_ID INT SET @User_ID = (SELECT ID FROM mdm.tblUser u WHERE u.UserName = @UserName ) IF ? = 1 EXECUTE mdm.udpStagingClear @User_ID, 4, 1, @ModelName, DEFAULT ELSE SELECT 1 AS A On the Parameter Mapping tab add the variables exactly as shown below: Add a Data Flow task to the control flow of the package and connect it to the 'SQL – Clear Staging Tables' task with a success constraint. Rename the task to 'DFL – Load Staging Tables'. Add three further variables to our package as follows: In the data flow of our package add an OLE DB data source, set the connection to AdventureWorks and the Data Access Mode to SQL Command. 
Add the following SQL to the SQL command text window: SELECT CAST(p.ProductID AS VARCHAR(10)) + pmx.CultureID AS ProductCode ,p.Name ,p.ProductModelID ,pm.Name AS ProductModelName ,pmx.CultureID ,c.Name AS CultureName ,pd.Description FROM Production.Product p INNER JOIN Production.ProductModel pm ON p.ProductModelID = pm.ProductModelID INNER JOIN Production.ProductModelProductDescriptionCulture pmx ON pm.ProductModelID = pmx.ProductModelID INNER JOIN Production.ProductDescription pd ON pmx.ProductDescriptionID = pd.ProductDescriptionID INNER JOIN Production.Culture c ON pmx.CultureID = c.CultureID Don't worry if the formatting turns ugly; that's just what happens. Press the Preview button and you will see that this query returns the following columns to our data flow: ProductCode Name ProductModelID ProductModelName CultureID CultureName Description We need two more columns in our data flow, and to get them we will use a Derived Column transformation, so drag one onto the data flow from the toolbox and connect it up to the data source. Add the columns as shown in the image below: Next the data flow needs to be duplicated into multiple streams so that the different members and attributes can be loaded to the staging tables. This is achieved by adding a Multicast transformation to our data flow. This task does not require any configuration. There will be six outputs from the Multicast task and these will be used to load the following: Product Members Model Members Culture Members Product Model Attributes Product Culture Attributes Product Description Attributes Each of these outputs needs to be tailored as to whether it will be loading a member or an attribute, and also which member or attribute it is loading. Add six Derived Column transformations to the data flow and connect them to the Multicast transformation. At this point our data flow should look similar to the following: For each of the Derived Column transformations add the additional columns as specified below: OK, now we have all the information we need in our data flow to start loading to the staging tables, but before we do that there is one more thing to do. As we are loading new members to the Model and Culture entities as well as Product, we need to ensure that we have only distinct values for our member codes to prevent staging errors. To achieve this we add and connect Aggregate transformation shapes to the data flows underneath the 'Add Culture Member Information' and 'Add Model Member Information' shapes. The images below show how to configure these aggregate transformation shapes: Group By Culture and Group By Model. We are now ready to load the data to the MDS staging tables. Add six OLE DB destination shapes to the data flow. Three of the destinations will be to load new entity members and the other three will be to load attributes for these new members. Configure the Connection Manager properties of the destinations as shown below for the Member and Attribute destinations respectively. Connect the first destination shape to the 'Add Product Member Information' shape and configure it as a Member destination. Click the Mappings tab and set the Input and Destination column mappings as shown below: Connect the second destination shape to the 'Group By Culture' shape and configure it as a Member destination. 
The column mappings will be the same as above except for the MemberName and MemberCode columns, and these will be set to CultureName and CultureID respectively. Connect the third destination shape to the 'Group By Model' shape and configure it as a Member destination. The column mappings will be the same as above except for the MemberName and MemberCode columns, and these will be set to ProductModelName and ProductModelID respectively. Connect the fourth destination shape to the 'Add Culture Attribute Information' shape and configure it as an Attribute destination. The column mappings will be as follows: Configure the next two destinations as Attribute destinations and map the columns as for the other Attribute destination, replacing the AttributeValue mapping with ProductModelID and Description respectively. Now our completed data flow should look similar to the following: If you execute the package you will see that we have inserted 1764 Product member rows, 6 Culture member rows and 118 Model member rows into the mdm.tblStgMember table, and 1764 attribute rows for each of the Culture, Model and Description attributes into the mdm.tblStgMemberAttribute table in your MDS database. It is worth noting that the data has only been staged at this point and we will not see it in our MDS entities yet. OK, that's as far as we are going to go in part one. In part two we will extend the package to move the data from the staging tables into the MDS model and validate the newly inserted data.
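Before moving on to part two, it can be worth sanity-checking the staged row counts directly against the staging tables mentioned above. This is a sketch only: the column names on the 2008 R2 staging tables (EntityName, AttributeName and so on) are assumed here, so adjust them to whatever your MDS database exposes.

```sql
-- Count the member rows staged per entity
-- (expect 1764 Product, 118 Model and 6 Culture rows)
SELECT EntityName, COUNT(*) AS StagedMemberRows
FROM   mdm.tblStgMember
GROUP BY EntityName;

-- Count the attribute rows staged per attribute
-- (expect 1764 rows for each of the Culture, Model and Description attributes)
SELECT AttributeName, COUNT(*) AS StagedAttributeRows
FROM   mdm.tblStgMemberAttribute
GROUP BY AttributeName;
```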

MDM White Papers

With the release of SQL Server 2008 R2 nearly upon us, it's a safe bet that the number of technical articles for MDS will start to increase a bit. In the meantime, I felt it was worth mentioning a few older MDM White Papers that may come in useful: Master Data Management (MDM) Hub Architecture by Roger Wolter - Whereas you can find plenty of info on the web about the practice of MDM, this white paper provides some simple but good technical examples of how to approach MDM. Data Integration Solutions for Master Data Management by Elizabeth Vitt - This one is from a while back, but it brings SSIS into the mix so that you can see how SSIS can help with an MDM initiative. So again it's got some good technical examples. I've not seen too many articles covering MDM with SSIS on the web, so I felt that it was especially worth a mention. The What, Why, and How of Master Data Management by Roger Wolter and Kirk Haselden - This white paper covers the reasons to carry out MDM, plus explains the process of developing an MDM solution.

Type 2 SCDs - Calculating End Date when you only have Effective Date

Here's a quick one for creating an end date column when you only have an effective date column in your data source to work with. The secret is to join the table to itself with a greater-than join on the effective date. You then use the effective date from your joined table to give you your end date. Bear in mind that if an account has more than two price rows, the greater-than join will match every later effective date rather than just the next one, giving you duplicate rows – a variation that picks only the next effective date is sketched after the query.
SELECT DT1.AccountID
,DT1.EffectiveDate
--Add a default end date for the current record (an unambiguous yyyymmdd literal) and subtract a day so the end date is one day before the next start date
,ISNULL(DATEADD(d, -1, DT2.EffectiveDate), '20991231') AS EndDate
,DT1.Price
FROM dbo.PriceTable AS DT2
RIGHT OUTER JOIN dbo.PriceTable AS DT1
ON DT2.AccountID = DT1.AccountID
AND DT2.EffectiveDate > DT1.EffectiveDate
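The sketch below (table and column names taken from the query above) shows one way to restrict the join to the next effective date only, using a correlated MIN subquery, so that each price row gets exactly one end date even when an account has many versions.

```sql
-- For each row, find the earliest later effective date for the same account and
-- use the day before it as the end date; fall back to a far-future date for the
-- current record.
SELECT  DT1.AccountID,
        DT1.EffectiveDate,
        ISNULL(DATEADD(d, -1,
            (SELECT MIN(DT2.EffectiveDate)
             FROM   dbo.PriceTable AS DT2
             WHERE  DT2.AccountID = DT1.AccountID
             AND    DT2.EffectiveDate > DT1.EffectiveDate)), '20991231') AS EndDate,
        DT1.Price
FROM    dbo.PriceTable AS DT1;
```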