Adatis BI Blogs

Getting started with Azure Data Factory Mapping Data Flows

Azure Data Factory V2 is the go-to service for moving large amounts of data within the Azure platform and, until relatively recently, was focussed predominantly on control flow rather than data flow. Those familiar with SQL Server Integration Services will be aware of the difference; to clarify, control flow is used to orchestrate activities within a data pipeline whilst data flow is used to physically move and transform the data. In May 2019, the Azure Data Factory team released into public preview the Azure Data Factory Mapping Data Flows (MDF) functionality, which effectively moves ADF from being an orchestration tool to a fully fledged ETL tool. This new feature allows developers to configure data transformation logic using a no-code, drag-and-drop approach and implements many of the transformation concepts that existed in SSIS. In fact, there is an excellent comparison between SSIS, SQL and ADF MDF written by Kamil Nowinski which highlights how the three technologies match up: https://sqlplayer.net/2018/12/azure-data-factory-v2-and-its-available-components-in-data-flows/. Points to note are that ADF does not have event handlers in the way SSIS does, and it does not have an equivalent to the script component within data flows. It does have a custom activity which allows you to write C# within the control flow, but currently you cannot write any custom code within a Mapping Data Flow.

Prior to the release of MDF, Data Factory could move data from one place to another with its copy activity, however it could not modify the file in any meaningful way and required external services such as SQL Data Warehouse or Data Lake Analytics (RIP) to fill this gap. The copy activity performs the data movement using the Azure Integration Runtime, which provides the compute power needed to complete the operation, so does that mean that Mapping Data Flows run on the same compute resource? No. In fact, your graphical data flow is converted into Scala and then compiled into a JAR library to be executed on a Databricks (Spark) cluster which is deployed and managed by Microsoft solely for running your data flow activities. This does mean that any pipelines that utilise MDF have a slight overhead to allow for the cluster to start up and configure, however Microsoft are looking at ways to reduce this. There is no need for an ETL developer looking to build MDFs to know anything about Databricks, Scala or Spark clusters – although it will certainly help!

So, it’s in public preview; let’s get using it! This blog will walk through the process of creating a basic cleaning data flow that will populate a SQL Database table with the values from a delimited text file. To begin, we need some Data Factory objects; anyone familiar with Data Factory will understand we need Linked Services and Datasets to connect to our data and create schemas, and Pipelines to give structure to the activities and create our control flow. In this example I have a small comma-separated file that has some data quality issues such as:

- Leading and trailing white space
- Weird column names
- Non-standardised NULL values
- Weak data types

The data flow we will create will address all of these issues before depositing the file into a SQL Database. First, we should create a linked service that connects to a blob store or data lake account where our dirty file is stored.
We then need a dataset that sits on top of the Linked Service which allows us to locate and read the file using specified parameters such as file path, column delimiter, row delimiter etc. Additionally, in this dataset we can import the schema of the file so that we have some column names available. Next, we can create a linked service that connects to our output SQL Database and also a dataset that points to the correct table, again importing the schema. After creating these objects we also need a pipeline to orchestrate the process; this will ultimately call our fancy new data flow. You only need a single activity in this pipeline, which is found under the “Move & Transform” heading and is called “Data Flow (Preview)”. At this point you should have something similar to the following, with the exception of the data flow object under the Data Flows (Preview) tab.

NOTE: I have not gone into much detail about creating these objects as they will be familiar to most ADF developers. For pointers on how to create the above follow this link: https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-portal

Now we can begin creating our Data Flow. Firstly, we will add the source and the sink to our data flow; this gives us our bookends as a clear starting point, and we can then play about with the middle bit afterwards. Also, at this point, toggle the switch at the top of the Data Factory UI named “Data Flow Debug”; this will start a cluster up for you and avoid you having to wait later on. Hover over the Data Flows (Preview) header, click on the ellipsis and choose “Add Data Flow”; this will take you into the Mapping Data Flow UI where you can begin creating your Data Flows. Remember to set a sensible name for your data flow in the general tab of the data flow UI.

Click “Add Source” and we can begin configuring the source activity of our data flow. Provide a clear name that identifies the source; this source is fed from a dataset so the names should closely match. Choose the dataset that is linked to your blob account and uncheck “Allow schema drift”. This is a useful option that allows constantly changing source files to be handled, however we will cover that in a later blog. On the “Source Options” tab you can choose to add actions that occur after completion, such as deleting or moving source files. On the “Projection” tab you can tailor your schema, which is defined in the source dataset. My preference is to leave these all as strings to avoid any early typing errors, as we will address the types later in the flow. Finally, on the Optimize, Inspect and Data Preview tabs, all defaults can remain the same.

Now click on the tiny + icon in the bottom right corner of the source activity and choose “Sink” from the drop-down list. Here we can configure the output of our flow, which in this case will be a SQL Database. Specify the name, again relating it to your sink dataset, and choose the SQL Database dataset created earlier. On the “Settings” tab you can choose which methods can be used by ADF when working with your table; these can be any combination of insertion, deletion, upserting or updating. Also, you can define actions to occur in the database before loading the data, such as recreating the entire table or truncating the existing table before loading into it. Finally, on the “Mapping” tab you can map columns from source to sink. Be aware that any columns that are not strings in your database will not be able to be mapped until the data typing has occurred.
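For illustration only, a hypothetical sink table for a file like the one described might look like the sketch below. None of these names or types come from the original post, but it shows why a typed column cannot be mapped until the conversion step later in the flow:

-- Hypothetical sink table (names and types are illustrative assumptions, not from the post).
CREATE TABLE dbo.CleanedCustomer
(
    CustomerName varchar(100) NULL,  -- trimmed string straight from the source file
    Region       varchar(50)  NULL,  -- blanks and 'Unknown' standardised to NULL upstream
    Age          int          NULL   -- typed column: only mappable once a toInteger() derived column exists
);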
Now we have a basic copy structure that does nothing clever yet but does connect our dirty source file up to our sink dataset, we can begin doing the actual transformation. The first transformation will be to trim white space from columns. Click on the + icon and choose “Derived Column”. Within the “Derived Column Settings” tab you should add each of the columns in your source dataset and then enter the following expression for each one in the expression editor: trim({column name}). This expression will remove any whitespace from the column value, ensuring the database receives a clean value.

Now we will standardise any NULL values and also transform any columns into their correct data types. To do this, click the + icon again and choose “Derived Column” again. Similar to the above step, you can add an entry in the “Derived Column Settings” tab for each column, adding the below expression for each column: replace(replace({column name}, ' ',''),'Unknown',''). This expression will replace any empty values with NULL, and any values of ‘Unknown’ will also get replaced with NULL, so that we have some standardisation before loading into the database. Any NULL values already present will be untouched. In my dataset I need to change one column from its originating string data type into an int so that it can be dropped into the DB. Rather than doing this change in place, it is best to create a new column so that you have the original column and the new column with the correct type. Whilst still in the expression editor, hover over any column name in the “OUTPUT SCHEMA” window that is visible on the left-hand side and choose the + icon. This allows you to add a new column to your data flow, and you can use any of the conversion functions (toInteger, toLong, toString, toDate, toBoolean etc.) to coerce the value into its correct type.

At this point you should have four steps that resemble the below screenshot. Once your Data Flow Debug session is online you can debug the data flow and hopefully see the nice clean values pass through into the database. Throughout this process I recommend taking a peek at the Inspect and Data Preview tabs. The Inspect tabs give a bit more information about what steps are taking place on the data in that activity, and the Data Preview will show you how the data will look, although the Debug session needs to be active for this to work. Finally, the Optimize tab allows you to set the partitioning of the data using techniques such as round robin, hash and range distribution, although these are out of the scope of this blog.

Hopefully this brief tutorial has been helpful and allowed you to gain some early knowledge on Data Factory Mapping Data Flows, meaning that you can go on to create flows that are tailored to your needs and cover a wider variety of scenarios. Any questions, thoughts or feedback, please catch me on Twitter @MattTheHow.

Setup Of Oracle Attunity Connectors

Fairly often I come across a project where there is a requirement to extract data from an Oracle source. Whilst this can be done in SSIS using a standard ODBC connection there are some drawbacks to this – the main one for me is that when working with larger data volumes the speed of data transfer can be quite slow. As such I have found the best solution is generally to set up the Attunity connectors in order to facilitate quick extraction from the Oracle system. For a great comparison of the performance between straight ODBC and Attunity, please see this post from Len Wyatt. Unfortunately the setup can be a little bit fiddly so I thought I would document the process here (as much for me as anyone else).

1. The first step is to determine the version of the Attunity connectors required – this depends upon your SQL Server version:

SQL Server Version          Attunity Connector Version   Supported Oracle Versions
SQL Server 2016             v4.0                         Oracle 10.x, Oracle 11.x, Oracle 12c
SQL Server 2014             v3.0                         Oracle 10.x, Oracle 11.x, Oracle 12c
SQL Server 2012             v2.0                         Oracle 9.2.0.4 and higher, Oracle 10.x, Oracle 11.x
SQL Server 2008 / 2008 R2   v1.2                         Oracle 9.2.0.4 and higher, Oracle 10.x, Oracle 11.x

2. Following this you need to install the Oracle client. Generally you will be working in a 64-bit environment, which will require both the 64-bit and 32-bit versions of the Oracle client to be installed (Data Tools uses a 32-bit connection but deployed packages should run in 64-bit mode for optimum performance). The Oracle client can be downloaded from http://www.oracle.com/technetwork/database/features/instant-client/index-097480.html (you will need to sign up on the site and create an account if you don’t have one already).

NOTE: It is not enough to only download the connectors for SSIS and run the installation packages; in order to retrieve data you must install the Oracle client.

Once you have determined the correct version of the Attunity connectors, you will need to select the relevant version of the client you wish to use. Because Visual Studio runs SSIS packages in 32-bit mode, you will need the 32-bit version of the client installed on development machines. However, when you deploy the packages and run them from the SSIS Catalog they run in 64-bit mode, so you need to install the 64-bit version to get the best performance in production environments. There are several versions of the Oracle client available for download on the site; most recently I used the following version of the Oracle components. I find that these install relatively simply and provide the required features without requiring too much disk space or consuming too much in the way of resources.

Depending on the version of the client you have selected, the installation instructions for Oracle will vary. However, I would suggest you follow these points:

1. Run the installation packages as an administrator by right-clicking and selecting ‘Run as Administrator’ – previously when I have not done this it has led to unusual results.

2. You will need to install the clients on all machines which will be running the SSIS packages – this may be several servers, including developer machines used for creating SSIS packages and servers used for scheduled ETL runs.
3. The installation should be done in the following order:

- Install the Oracle 32-bit client (if required)
- Install the Oracle 64-bit client (if required)
- Restart the machine

Once the client is installed, a simple way to test that it is working is to use SQL Developer to open a connection to the Oracle database. Once this is working correctly you then need to install the Attunity connectors. Again, depending on the system you are working on you will need to install either the 64-bit or 32-bit Attunity components for SSIS. However, here the rules are slightly different:

- If you need to run in 64-bit mode, or both 64-bit and 32-bit mode, you should install only the 64-bit version of the components.
- If you need to run in 32-bit mode only, then you should install only the 32-bit version of the components.

The latest version of the connectors (for SQL 2016) is available from: https://www.microsoft.com/en-us/download/details.aspx?id=52950

The installation of the components is pretty straightforward – once you have downloaded them you run the executable and then restart the machine. Once completed, the Attunity components will be available for selection within the SSIS toolbox in the Data Flow, as per the below. The connectors are very versatile components and support many of the features you would expect from an SSIS component. Be aware that if you are looking for expressions, these are to be found on the Data Flow properties rather than the properties of the actual component.

Slowly-Changing Parent-Child Dimensions – Part 2: The Implementation

Distributing the slides from my talk last night – it was well received and I think everyone was able to take something away. Credit goes to Marco Russo and Alberto Ferrari for their data modelling ideas; visit www.sqlbi.com. The slides are here: http://dl.dropbox.com/u/25925341/Learning%27s%20from%20large%20scale%20dw%20project.pptx Hopefully not my last talk as I enjoyed doing it. Look out for further Bath-based events at: http://avonim.wordpress.com/ All comments welcome. Calvin

First of all, an apology. It’s always hard to recognise whether your posts are useful for people out there and this one slipped through the cracks. After several requests, I thought I’d finally get around to posting the related solution to a post I wrote two years ago! Apologies to anyone this could have helped quite some time ago.

To set the scene, I had to build a slowly-changing parent-child dimension and manage the relationships accordingly. I described the theory behind the solution in this post, so it’s probably best to give that a skim if you’re not familiar with the approach. To summarise – if a parent node changes, you create new SCD records for all descendants; essentially you create historic “branches” as each change happens.

Initial Data Setup

For the example, we’ll use the AdventureWorksDW2012 sample DB as it has an org chart we can very easily play with. You can find the download here. We’re going to create a new parent-child SCD dimension using dbo.DimOrganization by running the script below:

CREATE TABLE [dbo].[DimOrganizationSCD](
    [OrgKey] [int] IDENTITY(1,1),
    [ParentOrgKey] [int] NULL,
    [OrganizationID] [int] NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL,
    [StartDate] datetime NOT NULL,
    [EndDate] datetime NULL,
    [CurrentFlag] [int] NOT NULL
) ON [PRIMARY]
GO

INSERT INTO [dbo].[DimOrganizationSCD]
    (ParentOrgKey, OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag)
SELECT
    -1 [ParentOrgKey]
    ,[OrganizationKey]
    ,[ParentOrganizationKey]
    ,[PercentageOfOwnership]
    ,[OrganizationName]
    ,[CurrencyKey]
    ,'2014-01-01 00:00:00.000' [StartDate]
    ,NULL [EndDate]
    ,1 [CurrentFlag]
FROM [dbo].[DimOrganization]

UPDATE C
SET ParentOrgKey = P.OrgKey
FROM [dbo].[DimOrganizationSCD] C
INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID

Take a look in the table and you’ll see something like this: we now have a separation between the unique business identifier for each organisational entity (OrganizationID) and its historical version (OrgKey).

The Org Change Occurs

Next let’s make a change to track – maybe the North American operation has been split off as a subsidiary and we need to insert a hierarchy level for the new brand “SuperCycles US”. We’ll create an ‘updates’ table to represent the delta coming from our ETL and put in two records – the new organisation record for SuperCycles, and the update to North America Operations to represent the new parent ID.

CREATE TABLE [dbo].[DimOrganization_Updates](
    [OrganizationID] [int] NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL
) ON [PRIMARY]
GO

INSERT INTO [dbo].[DimOrganization_Updates]
VALUES
    (15, 1, 1, 'SuperCycles US', 100),
    (2, 15, 1, 'North America Operations', 100)

The ETL Logic

For this example, I’ve written it as one big stored procedure, with all the logic occurring in one script.
I’d normally break this into separate procedures, with SSIS handling data flows, lookups etc. and logging the results of individual scripts, but a single SQL flow is easier to demonstrate. Firstly, we create a couple of variables used within the proc:

--Create a table variable used to hold all records impacted by the hierarchy changes
DECLARE @Updates TABLE (
    [OrganizationID] [int] NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL
);

--Use a common timestamp across all SCD updates
DECLARE @SCDDate datetime = getdate();

Now, as far as any children of North America Operations go, they technically haven’t changed and so we won’t see an update for them. Our ETL logic needs to take that into account. We take our update rows and build a recursive CTE that finds all active child rows in our dimension:

--Use a recursive CTE to find all updates AND all descendants of updated records
WITH OrgUpdates AS
(
    --Our Changed Records
    SELECT U.OrganizationID, U.ParentOrganizationID, U.PercentageOfOwnership, U.OrganizationName, U.CurrencyKey
    FROM [dbo].[DimOrganization_Updates] U
    LEFT JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID and D.CurrentFlag = 1

    UNION ALL

    --Our Descendants (this will recurse until no more descendants are found)
    SELECT D.OrganizationID, D.ParentOrganizationID, D.PercentageOfOwnership, D.OrganizationName, D.CurrencyKey
    FROM [dbo].[DimOrganizationSCD] D
    INNER JOIN OrgUpdates U on D.ParentOrganizationID = U.OrganizationID and D.CurrentFlag = 1
)
--Store the results of the CTE in a table variable
INSERT INTO @Updates
SELECT OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey
FROM OrgUpdates

This will recurse a maximum of 100 times by default; you’ll need to override this (with OPTION (MAXRECURSION)) if your parent-child structure has more than 100 levels. Hopefully this isn’t an issue for anyone! We now have a table variable populated with each of the business IDs that relates to a changed record, or a descendant of a changed record.

From here on, we can follow a standard SCD Type 2 method – we stamp the existing records with an end date and amend the current flag, then insert the new records. The end-dating of the old records needs to be done first, as it is easier to identify the new records by their current flag once they have been inserted. So the historical update can be something like:

--Mark the existing records as old by finding all related records that are currently active.
--It is very important that we do this before inserting the new records as otherwise we would not be able to distinguish between old and new!
UPDATE D
SET EndDate = @SCDDate, CurrentFlag = 0
FROM @Updates U
INNER JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID and D.CurrentFlag = 1

And then we insert the new records:

--Insert the new rows into the dimension, these will each have new surrogate IDs associated
--These will all have NULL ParentOrgKeys as we need to wait for all surrogates to be generated
INSERT INTO [dbo].[DimOrganizationSCD]
    (OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag)
SELECT OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, @SCDDate StartDate, NULL EndDate, 1 CurrentFlag
FROM @Updates

Note that at this point we’re taking the full cut of records from our @Updates table, so this will include any new records.
Our “SuperCycles US” record will have been inserted here, along with the updated records. The newly inserted records have deliberately not had their ParentOrgKey populated, as for many of the records the parent’s surrogate key has not yet been generated. We need to run an update on the dimension table after the insert to go through and fill this in.

--Finally, update the ParentOrgKey for all Current records
UPDATE C
SET ParentOrgKey = P.OrgKey
FROM [dbo].[DimOrganizationSCD] C
INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID
WHERE C.CurrentFlag = 1 AND P.CurrentFlag = 1

And that’s it, we now have a new branch of the organisation chart to represent today’s changes. Taking another look at our dimension table, we now have a new set of active records representing the current org hierarchy.

Improvements

This particular logic has been written to demonstrate an approach; there are many improvements you may want to make if including it in a large, resilient ETL process. Firstly, as mentioned above, I would separate the logic into individual stored procedures that each return rowcounts for better logging. You can also amend the insertion procedure to take original values if you need to mix and match SCD types and include a Type 3 “original value” column. Finally, you may want to include a HierarchyId-type column and populate it using the full grandparent > parent > child surrogate key path; this allows you to query for such things as “all descendants” without having to build in recursion. This can vastly improve performance if you’re going to be running aggregations over huge volumes directly on the SQL layer.

Further Reading

- Recursive Query using Common Table Expressions
- Hierarchical Data

Code Extract

Putting all the code together, our ETL query is as follows:

--Create a table variable used to hold all records impacted by the hierarchy changes
DECLARE @Updates TABLE (
    [OrganizationID] [int] NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL
);

--Use a common timestamp across all SCD updates
DECLARE @SCDDate datetime = getdate();

--Use a recursive CTE to find all updates AND all descendants of updated records
WITH OrgUpdates AS
(
    --Our Changed Records
    SELECT U.OrganizationID, U.ParentOrganizationID, U.PercentageOfOwnership, U.OrganizationName, U.CurrencyKey
    FROM [dbo].[DimOrganization_Updates] U
    LEFT JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID and D.CurrentFlag = 1

    UNION ALL

    --Our Descendants (this will recurse until no more descendants are found)
    SELECT D.OrganizationID, D.ParentOrganizationID, D.PercentageOfOwnership, D.OrganizationName, D.CurrencyKey
    FROM [dbo].[DimOrganizationSCD] D
    INNER JOIN OrgUpdates U on D.ParentOrganizationID = U.OrganizationID and D.CurrentFlag = 1
)
--Store the results of the CTE in a table variable
INSERT INTO @Updates
SELECT OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey
FROM OrgUpdates

--Mark the existing records as old by finding all related records that are currently active.
--It is very important that we do this before inserting the new records as otherwise we would not be able to distinguish between old and new!
UPDATE D
SET EndDate = @SCDDate, CurrentFlag = 0
FROM @Updates U
INNER JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID and D.CurrentFlag = 1

--Insert the new rows into the dimension, these will each have new surrogate IDs associated
--These will all have NULL ParentOrgKeys as we need to wait for all surrogates to be generated
INSERT INTO [dbo].[DimOrganizationSCD]
    (OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag)
SELECT OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, @SCDDate StartDate, NULL EndDate, 1 CurrentFlag
FROM @Updates

--Finally, update the ParentOrgKey for all Current records
UPDATE C
SET ParentOrgKey = P.OrgKey
FROM [dbo].[DimOrganizationSCD] C
INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID
WHERE C.CurrentFlag = 1 AND P.CurrentFlag = 1
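As a quick sanity check after running the load, a recursive query along the lines of the Further Reading link above can list the active hierarchy with its depth. This is a sketch added for this archive rather than part of the original post, and it assumes the root record is the one with no parent (ParentOrganizationID IS NULL):

--Sketch: list the current organisation hierarchy with a depth level.
WITH CurrentOrg AS
(
    --Anchor: the current root record
    SELECT OrgKey, ParentOrgKey, OrganizationName, 0 AS HierarchyLevel
    FROM [dbo].[DimOrganizationSCD]
    WHERE CurrentFlag = 1 AND ParentOrganizationID IS NULL

    UNION ALL

    --Recurse down through the current child records via the surrogate keys
    SELECT C.OrgKey, C.ParentOrgKey, C.OrganizationName, P.HierarchyLevel + 1
    FROM [dbo].[DimOrganizationSCD] C
    INNER JOIN CurrentOrg P ON C.ParentOrgKey = P.OrgKey
    WHERE C.CurrentFlag = 1
)
SELECT OrgKey, ParentOrgKey, OrganizationName, HierarchyLevel
FROM CurrentOrg
ORDER BY HierarchyLevel, OrganizationName;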

Nonclustered Columnstore Indexes And Stored Procedures

I was recently working on a project with a developer who needed to execute a stored procedure which followed this pattern (amended for simplicity):

CREATE PROC ExampleProc
AS
IF (SELECT COUNT(*) FROM sys.indexes WHERE name = 'NCCIDX_DemoIndex') > 0
    DROP INDEX [NCCIDX_DemoIndex] ON [dbo].[Demo]

INSERT INTO [dbo].[Demo] ([Field1], [Field2], [Field3])
VALUES ('Some','Test','Data')
      ,('More','Test','Data')

CREATE NONCLUSTERED COLUMNSTORE INDEX [NCCIDX_DemoIndex] ON [dbo].[Demo]
(
    [Field1],
    [Field2],
    [Field3]
) WITH (DROP_EXISTING = OFF)

Unfortunately, on execution the procedure would fail intermittently and the following error message would be returned:

Msg 35330, Level 15, State 1, Procedure ExampleProc, Line 7
INSERT statement failed because data cannot be updated in a table with a columnstore index. Consider disabling the columnstore index before issuing the INSERT statement, then rebuilding the columnstore index after INSERT is complete.

The reason for the error message is that, at the time of compilation, if the columnstore index exists the optimiser decides the INSERT statement will fail (although we drop the index within the procedure, this is not taken into account by the optimiser). If the procedure is executed and generates a plan at a time when the index does not exist, it will run as expected. However, we cannot resolve the issue reliably by first removing the index and then creating the procedure. The reason is that when the plan is subsequently flushed from the plan cache (something which can be forced using DBCC FREEPROCCACHE), the optimiser will create a new plan, and if the index exists when this happens the query will fail.

The solution I found was to add OPTION (RECOMPILE) to the INSERT statement; this causes it to be assessed at runtime, when the index has been dropped, stopping the error. Please be aware that this will force recompilation each time the procedure is run, adding a small overhead; whilst this was quite acceptable within the DW environment, you will need to make your own assessment. The amended procedure therefore follows this format:

CREATE PROC ExampleProc
AS
IF (SELECT COUNT(*) FROM sys.indexes WHERE name = 'NCCIDX_DemoIndex') > 0
    DROP INDEX [NCCIDX_DemoIndex] ON [dbo].[Demo]

INSERT INTO [dbo].[Demo] ([Field1], [Field2], [Field3])
VALUES ('Some','Test','Data')
      ,('More','Test','Data')
OPTION (RECOMPILE)

CREATE NONCLUSTERED COLUMNSTORE INDEX [NCCIDX_DemoIndex] ON [dbo].[Demo]
(
    [Field1],
    [Field2],
    [Field3]
) WITH (DROP_EXISTING = OFF)
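As an aside that isn’t covered in the original post, the error message itself suggests another pattern: disabling the columnstore index and rebuilding it afterwards, rather than dropping and recreating it. A rough sketch, assuming the index already exists on [dbo].[Demo], might look like the following; the same compile-time behaviour can still apply, so OPTION (RECOMPILE) is kept on the INSERT:

-- Sketch of the disable/rebuild pattern hinted at by the error message (not the approach used above).
-- Assumes [NCCIDX_DemoIndex] already exists on [dbo].[Demo].
CREATE PROC ExampleProcDisableRebuild
AS
BEGIN
    -- Disable the nonclustered columnstore index so the table becomes writable
    ALTER INDEX [NCCIDX_DemoIndex] ON [dbo].[Demo] DISABLE;

    INSERT INTO [dbo].[Demo] ([Field1], [Field2], [Field3])
    VALUES ('Some','Test','Data')
          ,('More','Test','Data')
    OPTION (RECOMPILE);  -- assessed at runtime, after the index has been disabled

    -- Rebuilding re-enables the columnstore index over the new data
    ALTER INDEX [NCCIDX_DemoIndex] ON [dbo].[Demo] REBUILD;
END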

Optimising SSIS Lookups with HASHBYTES()

I was recently working on a load for a dimension table which worked like the below. The package follows a fairly standard routine: data for the dimension is transformed into the required format using SSIS, and following this it is inserted into the target table. On the way, an initial check is made to determine if the record is new (‘LKU – Check Exists’) and a secondary check determines if the record has changed, which is subsequently used in an update routine.

Although the package was loading incrementally, it sometimes had to check several million rows to determine whether they had changed. As the table was quite wide, I decided to implement the HASHBYTES() function in order to take a hash of the columns for comparison, rather than comparing each of the columns individually. The initial results were quite promising, with the package running in around half the time, however on a particularly big load I noticed the package failed as it had run out of memory for the second lookup (full cache was used). I found this odd because the HASHBYTES function only returns a single value and I anticipated this would be smaller than the sum of the columns. On a little deeper investigation I found that HASHBYTES returns a VARBINARY(8000), so by default each hash value in the lookup cache is sized at 8000 bytes. A quick check of the SQL documentation for the HASHBYTES function at http://msdn.microsoft.com/en-GB/library/ms174415.aspx states that the size of the data returned when used with the SHA2-256 algorithm is 32 bytes, meaning that most of this space was being wasted. Therefore, changing the formula from 1 to 2 below significantly increases the efficiency of the lookup and makes the cache size required smaller by a factor of around 250!

1. HASHBYTES('SHA2_256', [ProductName] + '|' + [CategoryName]) AS HashValue

2. CONVERT(VARBINARY(32), HASHBYTES('SHA2_256', [ProductName] + '|' + [CategoryName])) AS NewHash

Upon doing this the size of each row was significantly reduced, the package is running faster than before and, most importantly, there are no memory issues :)
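To see the difference the CONVERT makes to the metadata that SSIS sizes its cache against, the declared types of the two expressions can be inspected with sp_describe_first_result_set. This is a quick sketch using made-up literal values rather than the real source query:

-- Sketch: inspect the declared type of the raw hash vs. the explicitly converted hash.
-- The system_type_name column should report varbinary(8000) for HashValue and varbinary(32) for NewHash.
EXEC sys.sp_describe_first_result_set N'
    SELECT  HASHBYTES(''SHA2_256'', N''Mountain Bike'' + ''|'' + N''Bikes'') AS HashValue,
            CONVERT(varbinary(32), HASHBYTES(''SHA2_256'', N''Mountain Bike'' + ''|'' + N''Bikes'')) AS NewHash;';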

Prototyping Analysis Services Cubes

After running a modelling workshop with end users, my next activity on a project is generally to produce a prototype model with some test data, in order to check with users that the planned star schema design will work. My tool of choice for this job tends to be Power Pivot, but I was presented with a slightly different solution when working with a client recently. This got me thinking it would be worth contrasting the two approaches.

Power Pivot

First of all, let’s start with Power Pivot. The goal here is to quickly produce a prototype to verify the star schema design. I tend to just request that a business user brings along an Excel workbook with samples of each dimension and fact pre-extracted into Excel, which can then be loaded into Power Pivot. I find that this saves time, although as an alternative Power Pivot could of course extract directly from SQL Server, Oracle etc. if needed. As an example, we can quickly walk through creating a very simple model for a star schema:

1. I’ve copied the data that represents my single fact table and related dimensions into Excel. In this case it’s to model a simple Sales Order Transactions fact, therefore my Excel workbook has dimensions such as Product, Customer and Sales Territory, as well as the data that represents the fact table.

2. After completing and closing the workbook, we create a new workbook and open the Power Pivot window.

3. Within the Power Pivot window, the next step is to import the data from the Excel workbook created in step 1. This can be done by choosing From Other Sources -> Excel workbook and then browsing to the file.

4. Power Pivot will now present us with all of the sheets that it finds in the workbook, so we select the sheets that represent the dimensions and facts.

5. The next step is to set up relationships between the various tables. Once this is done, we have a simple prototype data model, as shown below.

After a bit of tidying up (e.g. creating some hierarchies and removing unwanted columns), we can now connect to a Pivot Table in Excel that will help verify the star schema. The key point here is that we have rapidly built a prototype without the need for any ETL. This means we can quickly cover the design issues with the users, but this time with their actual data, rather than just on a whiteboard.

Analysis Services

The alternative approach that was presented to me recently is to develop the Analysis Services cube/Tabular model before carrying out the ETL. Essentially this means that the users connect to the cube as a production system, unaware that under the hood the ETL is not yet complete. How is this achieved? Essentially by putting logic in the SQL view layer. It’s a well-known Analysis Services best practice to bind your SSAS objects (e.g. dimensions, measure groups) to SQL Server views, rather than using objects within the DSV. Rather than each view pulling its data from a complete dimension or fact table, each view would instead pull its data from a staging area or a copy of the source database. The idea is that, over time, the contents of each view would be updated to point to the actual dimensions and facts, once they are built.
Therefore a normal view (in this case for a product dimension) that feeds Analysis Services might look like:

SELECT
    ProductKey,
    ProductName,
    ProductCategory,
    ProductSubCategory
FROM Dim.Product

With the ‘No ETL’ approach, the view looks very different, as any transformation and cleaning has to be carried out within the view:

SELECT
    ROW_NUMBER() OVER(ORDER BY PROD.ProductID) AS ProductKey,
    PROD.Name AS ProductName,
    CAST(ISNULL(SUB.Name, 'Unknown') AS VARCHAR(100)) AS ProductSubCategory,
    CAST(ISNULL(CAT.Name, 'Unknown') AS VARCHAR(100)) AS ProductCategory
FROM [$(AdventureWorks)].Production.Product PROD
LEFT JOIN [$(AdventureWorks)].Production.ProductSubcategory SUB ON SUB.ProductSubcategoryID = PROD.ProductSubcategoryID
LEFT JOIN [$(AdventureWorks)].Production.ProductCategory CAT ON CAT.ProductCategoryID = SUB.ProductCategoryID

This therefore incurs some technical debt, as the cube gets built before the ETL or even the physical dimension table. But the idea is that you can get the users using the cube, on production even, and then gradually replace the views with proper ETL and dimension tables once you have some more feedback.

Conclusion

The Power Pivot method is tried and tested for me many times – I find that it’s very quick to get up and running. The Analysis Services approach is not quite as quick to get up and running, but offers the benefit that the cube can continually evolve in a fairly agile manner. It’s early days for me using this approach, but a disadvantage is that the views can get quite complex, impacting performance depending on the data volumes. Both methods probably have their place. If I wanted to validate my design, I think I would continue to produce a quick Power Pivot model. On the other hand, if quick delivery to production is a must, and the complexity/data volumes are not an issue, then I would use the view approach.
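As a footnote to the view approach above, the eventual swap is just a view change once the real dimension and its ETL exist. This is a minimal sketch only, assuming a hypothetical view name of SSAS.Product and the columns shown in the examples above:

-- Sketch: repoint the SSAS-facing view at the physical dimension once the ETL is built.
-- The cube definition is untouched; only the view body changes.
ALTER VIEW SSAS.Product
AS
SELECT
    ProductKey,
    ProductName,
    ProductCategory,
    ProductSubCategory
FROM Dim.Product;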

SQL 2014: Boosting ETL Performance using Memory-Optimised Tables

With SQL Server 2014 due for release soon, there’s a lot of talk about how little is changing for those of us more interested in Business Intelligence. We’ll be getting clustered, updatable columnstore, which is very convenient, but I’ve been looking into the potential benefits of other new features. The real gem for me is the potential of memory-optimised tables, or in-memory OLTP (previously known as Project Hekaton).

I’ve had a few questions around how columnstore and memory-optimised tables differ – they’re both in-memory solutions for data storage, right? The big difference comes in disk usage – in order for a columnstore index to exist, there must also be a physical table written to the disk. Any updates to the table, even in SQL Server 2014, will require disk writes, and log writes too if you’re not bulk loading. Memory-optimised tables don’t have this overhead – we can push the staging data straight into memory and skip the disk entirely.

There are plenty of uses for this new type of table, but it’s the traditional ETL routine I’m interested in. We would normally pull data from our source systems, place it in a staging database while we clean, validate etc., then write the cleaned data to the warehouse. The data written to the staging database is transient; it’s only of use during the load process. It is therefore redundant to write this data to disk each time, and I believe this is where memory-optimised tables will really have an impact.

To prove this theory, I set up a quick demonstration. I started by creating two identical tables, one physical, one memory-optimised, both schema copies of [AdventureWorks2012].[Sales].[SalesOrderDetail]. The memory-optimised table was created with a CREATE TABLE script using the MEMORY_OPTIMIZED option (a representative sketch is included at the end of this post). I then built the simplest of SSIS packages: a data flow to move data from AdventureWorks2012.Sales.SalesOrderDetail to a ‘staging’ table, then read the data from the staging table and write it back to a disk-based table, representing our final warehouse. I configured two versions of the package, one where the staging table was disk-based, and one where it was our memory-optimised table. I deployed and scheduled these packages as alternating jobs and kicked them off for a few hours to see how performance varied over time. The results are as follows:

The initial results are very promising – we see a huge reduction in execution time when using in-memory tables for our transient ETL data. We’re still performing minimal logging with this setup – you can reduce this even further by switching to SCHEMA_ONLY durability, but by bypassing writing the data to disk we’ve cut our ETL window in half. There are obvious hardware limits to how this can be used, and in-memory tables have their own limitations, but implemented carefully it could change the face of warehouse loading drastically.
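The original CREATE TABLE script isn’t reproduced in this archive, so the following is only a representative sketch of a memory-optimised schema copy of Sales.SalesOrderDetail. The table name, simplified column types, bucket count and durability setting are assumptions rather than the values used in the original test:

-- Representative sketch only: a memory-optimised staging copy of Sales.SalesOrderDetail.
-- Assumes the database already has a MEMORY_OPTIMIZED_DATA filegroup.
CREATE TABLE dbo.SalesOrderDetail_Staging
(
    SalesOrderID          int              NOT NULL,
    SalesOrderDetailID    int              NOT NULL,
    CarrierTrackingNumber nvarchar(25)     NULL,
    OrderQty              smallint         NOT NULL,
    ProductID             int              NOT NULL,
    SpecialOfferID        int              NOT NULL,
    UnitPrice             money            NOT NULL,
    UnitPriceDiscount     money            NOT NULL,
    LineTotal             numeric(38, 6)   NOT NULL,   -- computed in the source table, stored as a plain column here
    rowguid               uniqueidentifier NOT NULL,
    ModifiedDate          datetime         NOT NULL,

    CONSTRAINT PK_SalesOrderDetail_Staging PRIMARY KEY NONCLUSTERED HASH (SalesOrderID, SalesOrderDetailID)
        WITH (BUCKET_COUNT = 131072)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
-- DURABILITY = SCHEMA_ONLY would remove logging of the data entirely, as mentioned in the post.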