
Adatis BI Blogs

Introduction to Azure Key Vault for SQL Server

What is it?

Azure Key Vault is a feature available in Microsoft Azure which can be used to manage and store the keys that are used to encrypt your data. A big benefit of using Azure Key Vault is that the process of managing and maintaining your keys is completely streamlined: keys can be created very quickly for development and testing purposes and can then be seamlessly migrated to production keys, where permissions can be granted or revoked as necessary. A variety of keys and secrets can be held in Azure Key Vault, including authentication keys, storage account keys, data encryption keys, .PFX files and passwords. Stored keys are protected by Hardware Security Modules (HSMs); keys can be imported into or generated in these HSMs, which are FIPS 140-2 Level 2 validated.

Uses and limitations for SQL Server

Key Vaults can be created and used by anyone with an Azure subscription. They are useful to Azure developers and security admins, but also to administrators who manage other Azure services for an organisation; those administrators can take responsibility for the management and maintenance of keys or secrets, and can provide users with URIs which can be applied directly to their applications.

Azure Key Vault can be integrated with SQL Server as an Extensible Key Management (EKM) provider to protect SQL Server encryption keys. This is particularly useful with Always Encrypted, available in SQL Server 2016, because SQL Server does not hold the keys used to decrypt the data stored in Always Encrypted columns, making Azure Key Vault a natural fit as a centralised key store for this functionality. SQL Server has a variety of encryption methods, including Transparent Data Encryption (TDE), Column Level Encryption (CLE) and Backup Encryption. These encryption methods implement a traditional key hierarchy, whereby the data is encrypted using a symmetric data encryption key (DEK), which is further protected by encrypting it with a hierarchy of keys stored in SQL Server. By instead using Azure Key Vault as the EKM provider, SQL Server can protect the data encryption keys with an asymmetric key stored externally to SQL Server, which adds an additional security layer and a separation between the management of keys and data. This functionality can be adopted both by cloud-based SQL Server instances on Azure virtual machines and by on-premises SQL Server instances.

To implement Azure Key Vault to protect your SQL Server encryption keys you will use the SQL Server Connector, which acts as a bridge between SQL Server and Azure Key Vault. The SQL Server Connector needs to be registered with the relevant SQL Server instance, which allows Azure Key Vault to be used as a cryptographic provider; next, the configuration and permissions are set up from within Azure, and the appropriate credentials are created from within SQL Server. Finally, an asymmetric key is opened in Azure Key Vault which can be used to protect database encryption keys on the SQL Server instance.

The SQL Server Connector is available as a download from Microsoft here and requires Windows Server 2012 or Windows Server 2012 R2 as your operating system. It currently supports the Enterprise 64-bit versions of SQL Server 2016, 2014, 2012 SP2, 2012 SP1 CU6 and 2008 R2 CU8; for earlier versions of SQL Server 2008 and 2012 there is a patch available which is linked from the aforementioned Microsoft download page.
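To give a flavour of the registration steps described above, below is a minimal, hedged T-SQL sketch of enabling EKM, registering the connector as a cryptographic provider, creating a credential and opening an existing Key Vault key. The install path, vault name, key name, credential name and login are all assumptions made for illustration; check the connector documentation for the exact values in your environment.

sp_configure 'show advanced options', 1;
RECONFIGURE;
sp_configure 'EKM provider enabled', 1;
RECONFIGURE;

-- Register the SQL Server Connector DLL as a cryptographic provider (install path assumed).
CREATE CRYPTOGRAPHIC PROVIDER AzureKeyVault_EKM_Provider
FROM FILE = 'C:\Program Files\SQL Server Connector for Microsoft Azure Key Vault\Microsoft.AzureKeyVaultService.EKM.dll';

-- Credential used to authenticate to the vault; IDENTITY is the vault name (assumed here)
-- and SECRET is the Azure AD client ID (without hyphens) concatenated with the client secret.
CREATE CREDENTIAL AzureKeyVault_Credential
WITH IDENTITY = 'MyKeyVaultName',
     SECRET = '<ClientIdWithoutHyphens><ClientSecret>'
FOR CRYPTOGRAPHIC PROVIDER AzureKeyVault_EKM_Provider;

ALTER LOGIN [DOMAIN\SQLAdmin] ADD CREDENTIAL AzureKeyVault_Credential;  -- assumed login

-- Open an asymmetric key that already exists in the vault (key name assumed), so it can
-- protect database encryption keys on this instance.
CREATE ASYMMETRIC KEY SQLServerKeyVaultKey
FROM PROVIDER AzureKeyVault_EKM_Provider
WITH PROVIDER_KEY_NAME = 'MyKeyVaultKey',
     CREATION_DISPOSITION = OPEN_EXISTING;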
For more information on the SQL Server Connector and other software requirements please see the Details and System Requirements sections on the aforementioned Microsoft download page.

Potential limitations and issues to keep in mind include the following:

· Azure Key Vault, like most cloud services, is a paid-for service; although there are no upfront costs or termination fees, there is a price for both key storage and operations. There are two available service tiers, but only Premium offers HSM-protected keys, which are required for SQL Server implementations. The prices as of November 2016 are:
     - £0.6109 per version of a key per month and £0.0183 per 10,000 operations for HSM-protected keys.
     - £0.0183 per 10,000 operations for secrets and software-protected keys.
     - £1.8327 per renewal request and £0.0183 per 10,000 operations for certificate operations.
· Authentication to Azure Key Vault requires Azure Active Directory; this is, however, included with Azure subscriptions.
· Key Vault permissions grant a principal access to all secrets and keys within a vault, which is something to keep in mind when assigning management and maintenance of the key vault. Best practice is to have a separate Azure Key Vault for each application instance storing secrets or keys.
· Where you have applications using Azure Key Vaults split across separate regions, it is best practice to create separate key vaults in each of the relevant regions used by those applications, for both performance and security reasons.
· Transaction limits apply, allowing the following maximum number of transactions in 10 seconds, per vault, per region:
     - 5 for ‘HSM - CREATE KEY’ transaction types.
     - 1,000 for ‘HSM - other transaction’ types.
     - 10 for ‘Soft-key CREATE KEY’ transaction types.
     - 1,500 for ‘Soft-key other transaction’ types.
     - 2,000 for ‘All secrets, vault related transaction’ types.

Further Reading

I will be posting a future blog showing an in-depth, real-life application of Azure Key Vault with a SQL Server instance using the SQL Server Connector, and the steps taken to create it, including initial set-up, testing and deployment strategies. I will also discuss the permissions and configurations that can be used and the roles that can be assigned for the management and maintenance of the implementation. For more detailed information see the following links:

· Microsoft Azure – Key Vault
· Microsoft Documents – What is Azure Key Vault
· Microsoft – SQL Server Connector for Microsoft Key Vault

Managing msdb in SQL Server

In the world of BI, the SQL Server system databases (with the exception of tempdb) are often forgotten about.  You won’t be surprised to learn that I am a BI Consultant and, until recently, I was also in this boat.  System databases are usually left to a DBA, whilst we develop ETL packages, model data or produce fancy reports. After attending Ryan Adams’ session at SQL PASS 2016, my mindset has changed.  The msdb database used to be the hub for SSIS logging, but since 2012 this is all stored in the SSISDB database.  However, there are still a number of processes and functions that are logged in msdb and we need to manage this, to prevent the database from becoming too large and impacting storage costs.  This blog will provide a brief overview of what is stored and logged in msdb, as well as a little trick for managing SQL Agent job history.

What is Stored in msdb?

Before I provide a brief list of what is in msdb, one obvious bit of advice is that it should be backed up regularly.  We all have backup strategies for our OLTP or data warehouse databases, but the system ones are often neglected.  All this logging and auditing history in msdb is useful! The msdb database consists of the following:

o   SSIS - only if packages are stored in SQL Server (not the SSIS Catalog).
o   Backup and Restore
o   Maintenance Plans
o   Database Mirroring
o   Policy Based Management
o   Log Shipping
o   Service Broker
o   Database Engine Tuning Advisor

SQL Agent Jobs

Whilst all of the logging tables in msdb are important for a DBA, one real area of interest in BI is around SQL Agent jobs.  There can be many SSIS packages being called from the Catalog, spanning multiple jobs.  It is important that logging is available, yet we don’t want job history to be stored forever. Out of the box, the logging for msdb looks like the below.  Right click on SQL Server Agent properties in SSMS and navigate to the ‘History’ tab.

Let’s work through a use case, extracted from Ryan Adams’ PASS presentation:

1.       Imagine you have 10 database backup jobs running every hour in your production environment.
a.       10 databases * 24 rows per day = 240 total rows per day.
b.      The total row limit for all jobs would be maxed out in 4 days and 4 hours.

Ok, so now we have a problem.  All DBAs and BI developers would want to see more than 4 days’ worth of history, which means the default 1,000-row ‘Maximum job history log size’ setting is not a good one to retain.  How about changing the Agent job to the below? Sounds perfect, right? Wrong.  The GUI setting is not persisted, meaning the Agent job will log history for 30 days on the next run only.  This seems crazy, but it is what we are dealing with unfortunately.  There is a solution, which involves executing a stored procedure within the msdb database.

DECLARE @DeletePrior DATETIME;
SET @DeletePrior = CONVERT(VARCHAR(10), DATEADD(dd, -30, GETDATE()), 101);
EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = @DeletePrior;

NOTE: Ensure the ‘row restriction’ property (shown above) is unchecked against the SQL Server Agent job settings before executing.

Although you cannot see the changes in the GUI settings, you will now retain 30 days of history in msdb.  The query can be tweaked to days, months or even years if you like.

Conclusion

If you take anything away from this article, then it should be to not ignore system databases.  Microsoft creates them automatically for us but, unfortunately, the out-of-the-box configurations are not always right for your situation or business.
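For the other big history consumers in msdb listed above, a similar purge can be scheduled. The sketch below combines the job history purge with msdb’s built-in backup history clean-up procedure; the 30-day retention window is just an assumption to match the example above.

-- Purge msdb history older than the chosen retention window (30 days assumed here).
DECLARE @OldestDate DATETIME = DATEADD(dd, -30, GETDATE());

-- SQL Agent job history.
EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = @OldestDate;

-- Backup and restore history.
EXEC msdb.dbo.sp_delete_backuphistory @oldest_date = @OldestDate;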
For a deeper dive into msdb and other system databases, try some of the recommended blogs below. If anyone would like to share their experiences with using or refining msdb, feel free to comment.

Recommended Reading

o   MSDN - https://msdn.microsoft.com/en-us/library/ms187112.aspx?f=255&MSPPError=-2147217396
o   Hemantgiri Goswami’s Blog - https://www.toadworld.com/platforms/sql-server/b/weblog/archive/2013/10/21/msdb-cleanup-is-necessary
o   Ryan Adams’ Blog - http://www.ryanjadams.com/

Executing SQL Scripts From TeamCity

Recently I was working in an environment where we used TeamCity to automate unit testing (further blogs on this coming later). Essentially the process was that the latest source would be deployed, a set of ETL processes would be run and tests would validate the results to ensure changes to the project over the day had not affected the reliability of the results. In the interests of ensuring that the tests were the same each time, my preferred method was to use a script to remove the databases from the SQL instance on the server (for other reasons I was keen not to change the deployment scripts in order to achieve this). Whilst in this example I delete databases using the SQL script, it should be noted that you can use this method to execute any SQL script on a target server.

1. Create a new Build Configuration (or add as a step to an existing one).
2. Create a SQL script that does whatever it is you want it to do and save it in a text file, e.g. ‘ClearDatabases.sql’ (a hypothetical sketch of such a script is shown after these steps).
3. Create a folder on the server running TeamCity – this will need to be a folder which the service running the TeamCity build agent has access to, otherwise the process will fail.
4. Add a build step, giving it a name that indicates what the script will be doing – in this case I have called it ‘Drop Existing Databases’. The step should be configured with a runner type of ‘Command Line’ and run a custom script. In the ‘Custom script’ section I have the following:

sqlcmd -e -b -d master -i "C:\UnitTestingScripts\ClearDatabases.sql" -S "DEVTESTDW\SQL2014"

5. If the SQL Server on which the script is to be executed is not the same as the TeamCity server (very likely) then you will need to ensure that the target server has the correct firewall and SQL configuration in order to accept the connection.
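As a reference point, a ClearDatabases.sql script along these lines might look like the sketch below. The database names are assumptions; the original script is not included in the post.

-- Hypothetical ClearDatabases.sql: drop the test databases if they exist,
-- kicking out any open connections first so the DROP cannot be blocked.
IF DB_ID('UnitTestStaging') IS NOT NULL
BEGIN
    ALTER DATABASE UnitTestStaging SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE UnitTestStaging;
END;

IF DB_ID('UnitTestDW') IS NOT NULL
BEGIN
    ALTER DATABASE UnitTestDW SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE UnitTestDW;
END;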

Adatis Coding Dojo – Session No.2

The Challenge

Write a program that generates all two-word anagrams of the string "documenting", in SQL Server. e.g. “Documenting” = “Document Gin”.

Introduction

This week’s challenge was… well, a challenge. In our first session everyone attempted it via paired programming; for this dojo we decided to attempt the problem with one larger group, consisting of 5 people. This technique encourages an open forum for developers to discuss ideas, with one person at the computer coding. The assigned coder must swap with another participant every 5–10 minutes.

Preparation

Before beginning the dojo, I asked one of the Senior Consultants to attempt to complete the problem and provide the code and concepts behind his thinking. The idea is that one challenge can be solved in a number of ways, with 5 people working collaboratively thinking in a completely different way to one individual coder. We provided a file containing 10,000 words (which would become the master list for the anagram solution). If you would like to try this yourself, the text file can be downloaded from here.

Senior Developer’s Solution

Most importantly, the senior developer DID NOT use the TDD development methodology. It took around 2 hours for the developer to implement a fully functioning stored procedure. Here is how it was achieved:

1. A stored procedure, encapsulating a long T-SQL script. The solution was focused around string manipulation and recursive querying.
2. A WHILE loop turns the word into a pivoted array of individual letters, which can then be compared against the words in the word list table.
3. A recursive CTE returns words consisting only of letters in that array.
4. XML PATH is used to create a CHECKSUM string of letters and counts.
5. All word combinations of the correct length are returned and checked against the checksum to validate letter counts.

The solution was built to be flexible from the start – it returned anagrams with ‘n’ number of words rather than the proposed 2. It would also work with any provided starting word and with special characters.

Code Dojo Solution

The first task was to brainstorm ideas on how to conquer the challenge, starting with whether it could even be achieved in SQL Server! The coding team weighed up string manipulation, recursive CTEs, a cursor and mathematical calculations, amongst other SQL Server functions. The general consensus was to avoid recursion (where possible) to ensure faster query results, and to use mathematical algorithms to compare against the data in the imported word list table. The development team used TDD to implement the mathematical approach to solving the problem. Each code enhancement contained a new automated test script that would capture any failures and enable a simple rollback to the last working codebase. This proved to be a successful technique, as ideas were changing constantly throughout development.

Actual Dojo Solution

After around 90 minutes of SQL coding and TDD, the team managed to come up with a working solution. This is how they achieved it:

1. A SQL function to work out a unique value for a character (letter).
a. Each letter of the word ‘Documenting’ has a case-sensitive ASCII value, e.g. ‘D’ is 68 and ‘d’ is 100.
b. All letters were converted to uppercase, to ensure a non-case-sensitive approach.
c. The POWER SQL function is applied to a given letter, which gives it a unique number that cannot be duplicated by another letter.
2. A simple stored procedure that looks up the existing words in the full word list table, referencing the function whilst comparing letter values.
a. Find all word combinations and their total POWER value.
b. Compare the total against the hardcoded word ‘Documenting’.
c. Return any records that have two words adding up to the total POWER of ‘Documenting’.
d. The hardcoded word is then parameterised within the procedure, which means any word can be compared at run time.
3. The TDD approach helped facilitate the iterative, code review process.

Conclusion

Although the Senior Developer’s and the dojo team’s solutions both met the minimum expectations for the challenge, there were a number of pros and cons to each. The Senior Developer thought ahead and made improvements to the requirements, such as dynamically handling anagrams of more than 2 words. He also demonstrated some of the underused and more powerful functionality within SQL. However, it takes around 2 minutes to execute one anagram and the code itself is not the most efficient. By not using TDD in his approach, he over-complicated the solution and did not encounter performance bottlenecks until the end of the build. On the other hand, the dojo team fully practised TDD. This was reflected in the much smaller codebase and, most importantly, the speed at which the anagram procedure executed, which was much quicker (23 seconds). Their solution is limited in that it is very rigid and cannot handle more than two-word anagrams. It also cannot handle special characters, whereas the Senior Developer’s solution can. However, these were not requirements of the solution – would a client accept a slower, more complicated product that handles requirements they do not need? Overall, both solutions work adequately, but it is clear that when confronting a technical challenge or problem, running a dojo and practising TDD can bring more efficient results. Had we added further requirements to the challenge, maybe the dojo team would have found another, even more efficient solution.

References

General Good Dojo Practices - http://codingdojo.org/
TDD - http://code.tutsplus.com/tutorials/the-newbies-guide-to-test-driven-development--net-13835
SQL Functions:
POWER – https://msdn.microsoft.com/en-us/library/ms174276.aspx
ASCII – https://msdn.microsoft.com/en-us/library/ms177545.aspx
XML PATH – http://blogs.msdn.com/b/mind_talks/archive/2012/01/18/xml-path-for-sql-server.aspx
CHECKSUM - https://msdn.microsoft.com/en-us/library/ms189788.aspx

Dojo Code

For access to the Senior Developer’s and the dojo team’s SQL solutions, please leave a comment and I will get back to you directly.  It would be interesting to see if anyone else has a better technique (both in terms of code length and performance) and I welcome any suggestions.
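In the meantime, here is a hedged sketch of the POWER-based letter-scoring idea described above. It is not the dojo team’s actual code: the function and table names are assumptions, and a sum-of-powers score like this can in theory collide if a letter repeats four or more times across a word pair, something a production version would need to guard against.

-- Give each letter A-Z a distinct POWER-based value. 'documenting' repeats no letter
-- more than twice; see the note above about pairs with heavily repeated letters.
CREATE FUNCTION dbo.LetterValue (@Letter CHAR(1))
RETURNS BIGINT
AS
BEGIN
    RETURN POWER(CAST(4 AS BIGINT), ASCII(UPPER(@Letter)) - 64);  -- 'A' -> 4^1 ... 'Z' -> 4^26
END;
GO

-- Score a whole word by summing its letter values.
CREATE FUNCTION dbo.WordValue (@Word VARCHAR(100))
RETURNS BIGINT
AS
BEGIN
    DECLARE @i INT = 1, @Total BIGINT = 0;
    WHILE @i <= LEN(@Word)
    BEGIN
        SET @Total = @Total + dbo.LetterValue(SUBSTRING(@Word, @i, 1));
        SET @i = @i + 1;
    END;
    RETURN @Total;
END;
GO

-- Two words are anagram candidates when their combined score matches the target, e.g.
-- (dbo.WordList and its Word column are assumed names for the imported word list):
-- SELECT w1.Word, w2.Word
-- FROM dbo.WordList w1 CROSS JOIN dbo.WordList w2
-- WHERE dbo.WordValue(w1.Word) + dbo.WordValue(w2.Word) = dbo.WordValue('documenting');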

SQL PASS Summit 2014 – Kick Off

Day 1 has kicked off in Seattle, a remarkable city. Having arrived a week early, I’ve had plenty of time to check out the sights and the food and have enjoyed it immensely - a brilliant venue for PASS! There were a few announcements at this morning’s keynote, mostly regarding Azure. Azure SQL Databases are gaining larger index handling, parallel queries, extended events and in-memory columnstore for data marts. Joseph Sirosh gave a talk about Machine Learning and information management, showing a cool example of Azure Stream Analytics using Microsoft Kinect sensor information of customer interactions in a shop being uploaded straight into Power Map! I am looking forward to hearing more on Machine Learning. There are also a handful of great improvements for Power BI. I am most looking forward to the new Live Ops Dashboards and drill-through actions! Combo Reports also look promising…

Moving onto the first session, I chose to attend ‘What’s new in Microsoft Power Query for Excel’. As it turns out there’s not a massive amount of new stuff – some new data sources and a tick box when you do a merge to remove name prefixes. However, one of these new sources is the long-awaited Analysis Services data source. The new ODBC data source is a great addition also. There was a mention of the possibility of a decoupled M-query SSIS component! We probably won’t hear anything until later in 2015, unfortunately. I would say this was not a level 300 session, more like 100/200.

The second session was ‘SQL Server 2014 Always On (High Availability and Disaster Recovery)’: a nice overview of what was brought in in 2012 and the newer 2014 features – these include an increased maximum number of secondary replicas, increased availability of readable secondaries and the Add Azure Replica Wizard. Despite not being a DBA and it being a level 300 session, I found it easy to follow and absorb. I feel many of the DBAs in the room may not have taken away much information they did not already know, however.

Niko Neugebauer gave a fantastic, in-depth session on ‘ETL Patterns with Clustered Columnstore Indexes’. It was a fast-moving talk, despite the time spent waiting for some executions. It demanded your full attention! Definitely worthy of its 400 level. It left me a little tired for Marco Russo’s ‘DAX Patterns’ session, which showed some examples of workarounds for common tabular weaknesses, like distinct counts for Type 2 slowly changing dimensions and cumulative quantities. Overall it was a strong day. I am looking forward to tomorrow. More to follow…

SQL Server NOT IN Clause - Avoid like the….

Background

Up until recently, I was one of the SQL Server developers adopting the bad habit that is known as the NOT IN clause.  It is an easy way of finding data in one table that does not exist in another.  For this purpose, I thought using NOT IN would help me conceptualise a query result, as well as make it easier for someone else looking at the code.  In fact, although the performance (within an execution plan) is OK, you can pull back incorrect results from the overall query.

The Problem

The NOT IN clause is problematic in only one, but VERY IMPORTANT, way… it DOES NOT include NULLs in the comparison table.  The original post illustrated this with screenshots of the two tables, their contents and the NOT IN query; a hedged recreation in T-SQL is included at the end of this post.  As you can see, 0 records were returned.  We would expect the record (containing Striker, Andy Cole) in the NewFootyPlayers table to be returned.  The NOT IN clause is ignoring any comparisons on NULLs.

NOTE: Adding an additional ‘WHERE Position IS NOT NULL’ filter to the NOT IN subquery would also give the expected result, but a lot of people will forget to add it and spend a substantial amount of time wondering why certain records are missing from their result set.

The Solution(s)

There are a number of clauses or pieces of SQL syntax that can be used instead of NOT IN.  Although most do not have any major performance benefits, they actually return what is expected.  The three examples in the original post (which included NOT EXISTS and EXCEPT) all return the one expected record, which is what we expected in the first place.

Recommended Solution

Whilst none of the solutions above cause major performance problems, there is one method that is better than the others.  If we are working with hundreds of millions of records in both tables, using NOT EXISTS is the most efficient query.  Its performance is similar to NOT IN and EXCEPT, and it produces an identical plan, but it is not prone to the potential issues caused by NULLs or duplicates.  I would be interested to see if anyone else has performance tested each query type and if there are better alternatives to NOT EXISTS.  One thing I am certain of, however, is that no one should have to use the NOT IN clause.
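Since the original screenshots are not reproduced here, below is a hedged recreation of the kind of example described above. The table definitions and sample rows are assumptions made for illustration; only NewFootyPlayers, the Position column and the Andy Cole record are named in the post.

-- Two hypothetical tables: the comparison table contains a NULL Position.
CREATE TABLE dbo.FootyPlayers (PlayerName VARCHAR(50), Position VARCHAR(20));
CREATE TABLE dbo.NewFootyPlayers (PlayerName VARCHAR(50), Position VARCHAR(20));

INSERT INTO dbo.FootyPlayers VALUES ('Player One', 'Goalkeeper'), ('Player Two', NULL);
INSERT INTO dbo.NewFootyPlayers VALUES ('Andy Cole', 'Striker');

-- Returns 0 rows: the NULL in the subquery means the NOT IN predicate can never be TRUE.
SELECT *
FROM dbo.NewFootyPlayers
WHERE Position NOT IN (SELECT Position FROM dbo.FootyPlayers);

-- NOT EXISTS is unaffected by the NULL and returns the expected Andy Cole row.
SELECT n.*
FROM dbo.NewFootyPlayers n
WHERE NOT EXISTS (SELECT 1 FROM dbo.FootyPlayers f WHERE f.Position = n.Position);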

Master Data Services Training in the UK

This is just a quick post to announce a range of SQL Server training courses, organised (and in some cases delivered) by Chris Webb. To start off there’s a SQL Server course delivered by Christian Bolton in December, followed by an Analysis Services course delivered by Chris in February. I’ll be delivering a Master Data Services course in February, before Chris delivers an MDX course in March. The details for all the courses are:

SQL Server Internals and Troubleshooting Workshop - Christian Bolton – 6th – 7th December 2011
The Advanced Troubleshooting Workshop for SQL Server 2005, 2008 and R2 provides attendees with SQL Server internals knowledge, practical troubleshooting skills and a proven methodical approach to problem solving. The workshop will enable attendees to tackle complex SQL Server problems with confidence. Full details and registration here.

Real World Cube Design and Performance Tuning with Analysis Services – Chris Webb – February 2012
A two-day course that draws on real-world experience to show you how to build a best practice Analysis Services cube, covering design issues such as data warehouse design and complex cube modelling. Day two then covers performance optimisation for Analysis Services, including MDX optimisation and cube processing. Full details and registration here.

Introduction to Master Data Services with Jeremy Kashel – February 2012
An end-to-end look inside Master Data Services, this full-day course will begin with a synopsis of Master Data Management before moving on to an overview of Microsoft SQL Server 2008 R2 Master Data Services (MDS). The remainder of the course will cover the major MDS topics, such as modelling and business rules, and will include a number of practical exercises. More details and registration here.

Introduction to MDX with Chris Webb – March 2012
The Introduction to MDX course aims to take you from the point of being a complete beginner with no previous knowledge of MDX up to the point where you can write 90% of the MDX calculations and queries you’ll ever need to write. The three-day course covers the basics, such as sets, tuples and members, through to more advanced concepts such as scoped assignments and performance tuning. Full details and registration here.

A Pattern To Load Data to Master Data Services via SSIS–Part 2

Introduction

This is the second part of a series of blog posts intended to show a way to load data into MDS via SSIS. In part 1 we loaded the staging tables with new members and attributes for several entities. In this blog post we are going to extend the SSIS package with tasks to move the data from the staging tables into the MDS Product model and validate these newly inserted members.

Completing The Solution

We need to move the data from the staging tables into the model. This is carried out by executing the MDS staging sweep process. To achieve this we need to add an Execute SQL Task to the control flow of our package. Rename the task – I’ve called mine ‘SQL – Sweep Stage’ – and connect it up to the ‘DFL – Load Staging Tables’ task with a success constraint. On the General tab set the connection to MasterDataServices and the SQL Statement as follows:

DECLARE @ModelName NVARCHAR(50) = ?
DECLARE @UserName NVARCHAR(50) = ?
DECLARE @User_ID INT
DECLARE @Version_ID INT

SET @User_ID = (SELECT ID FROM mdm.tblUser u WHERE u.UserName = @UserName)
SET @Version_ID = (SELECT MAX(ID) FROM mdm.viw_SYSTEM_SCHEMA_VERSION WHERE Model_Name = @ModelName)

EXECUTE mdm.udpStagingSweep @User_ID, @Version_ID, 1

Then add the parameter mapping as shown below. That’s all there is to do to get our data into the model. However, this process is asynchronous, and before we can validate the model we need to know when the staging sweep has finished.

Add a For Loop Container to the control flow of the package and connect it up to the ‘SQL – Sweep Stage’ task with a success constraint. Rename the task – I’ve called mine ‘FLC – Wait Until Batch Completes’. Add an Execute SQL Task inside the loop container and rename it; mine is called ‘SQL – Get Staging Batch Status’. Change the connection to MasterDataServices, change the ResultSet property to ‘Single row’ and then add the following SQL script to the SQLStatement property:

DECLARE @Version_ID INT
SET @Version_ID = (SELECT MAX(ID) FROM mdm.viw_SYSTEM_SCHEMA_VERSION WHERE Model_Name = ?)

SELECT TOP 1 Status_ID
FROM mdm.tblStgBatch
WHERE Version_ID = @Version_ID
ORDER BY ID DESC

Add the parameter mapping and the Result Set as shown below. Add three more parameters to the package as shown in the table below, then configure the For Loop properties as shown. The InitExpression sets @dtMDSLoopTimer to the current time plus the interval set in our @intMDSTimeout variable. The EvalExpression keeps the loop running while @strMDSBatchStatus is not equal to 2 (Success) and the timeout has not expired. The For Loop Container can only succeed if the staging batch is successfully loaded.

Now we can validate the model, so add an Execute SQL Task to the control flow and connect it to the ‘FLC – Wait Until Batch Completes’ task with a success constraint. Rename the task – mine is called ‘SQL – Validate Model’. Change the connection to MasterDataServices and the SQLStatement as follows:

DECLARE @ModelName NVARCHAR(50) = ?
DECLARE @UserName NVARCHAR(50) = ?
DECLARE @User_ID INT
DECLARE @Version_ID INT
DECLARE @Model_ID INT

SET @User_ID = (SELECT ID FROM mdm.tblUser u WHERE u.UserName = @UserName)
SET @Version_ID = (SELECT MAX(ID) FROM mdm.viw_SYSTEM_SCHEMA_VERSION WHERE Model_Name = @ModelName)
SET @Model_ID = (SELECT Model_ID FROM mdm.viw_SYSTEM_SCHEMA_VERSION WHERE Model_Name = @ModelName)

EXECUTE mdm.udpValidateModel @User_ID, @Model_ID, @Version_ID, 1

Set the parameter mapping as follows. As an aside, when testing the polling loop manually it can help to inspect the staging batch table directly – a quick sketch follows.
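This check uses only the objects referenced above; the ‘Product’ model name is the one used throughout this series, and a Status_ID of 2 indicates success, as noted in the loop expression.

-- Inspect the most recent staging batches for the Product model while the For Loop is polling.
DECLARE @Version_ID INT
SET @Version_ID = (SELECT MAX(ID) FROM mdm.viw_SYSTEM_SCHEMA_VERSION WHERE Model_Name = 'Product')

SELECT TOP 5 ID, Status_ID
FROM mdm.tblStgBatch
WHERE Version_ID = @Version_ID
ORDER BY ID DESC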
Almost done. Just to finish it off, let’s add a Script Task to the control flow and connect it to the ‘FLC – Wait Until Batch Completes’ task with a completion constraint. Change the Evaluation Operation of the constraint to ‘Expression and Constraint’ and set the Expression to ‘@strMDSBatchStatus != 2’. Edit the script and add the following line of code under // TODO: Add your code here:

Dts.Events.FireError(0, "SCR - Fire Error", "MDS Timeout Occurred", string.Empty, 0);

This task will fire an error event if the MDS staging batch does not complete successfully. The finished package control flow should look similar to the following image. Execute this package and then check the Product entity in MDS – it should look something like the following. Looking at the other entities you will see that we have added members to three entities and validated all these new members.

Summary

Over the last two blog posts I have shown a way of automating the loading of data to Master Data Services via SSIS. This pattern can be used to cater for most of your loading requirements. That’s it – the completed Integration Services project source code and MDS model can be downloaded from here (you will need to create a login first). Your comments are very welcome.

A Pattern To Load Data to Master Data Services via SSIS

Part 1

Introduction

Loading new members to an MDS entity will be a common requirement in all MDS implementations. In these blog posts I am going to walk you through building an SSIS package that performs the following processes:

Load new members and attributes to several entities via staging tables
Validate the MDS model that contains the entities

In part one we will load the MDS staging tables ready to take our new members and attributes into our MDS model. For a thorough understanding of the staging process in MDS please see the Master Data Services Team blog post on Importing Data by Using the Staging Process. A pre-requisite is to have the AdventureWorks2008R2 database samples installed on the same instance of SQL Server as Master Data Services.

In MDS I have created a model named ‘Product’ with an entity of the same name. The Product entity has the following attributes, which are set to the default type and length unless specified:

Name
Code
Model (Domain Attribute)
Culture (Domain Attribute)
Description (Text, 500)

We are going to load this entity with Product data from the AdventureWorks2008R2 database using an SSIS package. In addition to this there are two further entities in the Product model:

Culture
Model

These entities have just the Code and Name attributes and are set to the default type and length. The MDS model and Integration Services project source code can be downloaded from here (you will need to create a login first).

Building The Solution

OK, enough of the intro – let’s get on and build the package. Start a new Visual Studio Integration Services project and save the default package with a more suitable name; I’ve called mine ‘LoadMDS.dtsx’. Create the connection managers as shown below, remembering to replace the Server and MDS database names, and rename them to ‘AdventureWorks’ and ‘MasterDataServices’ respectively. Now we need to create some variables, so go ahead and create the variables shown in the table below.

We are now ready to put our first task into the package. This task will optionally clear the staging tables of all successfully loaded members, attributes and relationships prior to loading, based on the value of the blnClearStage parameter. Add an Execute SQL Task to the control flow of your package and rename it – I’ve called mine ‘SQL – Clear Staging Tables’. On the General tab set the connection to MasterDataServices and the SQL Statement as follows:

DECLARE @ModelName NVARCHAR(50) = ?
DECLARE @UserName NVARCHAR(50) = ?
DECLARE @User_ID INT

SET @User_ID = (SELECT ID FROM mdm.tblUser u WHERE u.UserName = @UserName)

IF ? = 1
EXECUTE mdm.udpStagingClear @User_ID, 4, 1, @ModelName, DEFAULT
ELSE
SELECT 1 AS A

On the Parameter Mapping tab add the variables exactly as shown below. Add a Data Flow Task to the control flow of the package and connect it to the ‘SQL – Clear Staging Tables’ task with a success constraint. Rename the task to ‘DFL – Load Staging Tables’. Add three further variables to our package as follows.

In the data flow of our package add an OLE DB data source, set the connection to AdventureWorks and the Data Access Mode to SQL Command.
Add the following SQL to the SQL command text window:

SELECT CAST(p.ProductID AS VARCHAR(10)) + pmx.CultureID AS ProductCode
    ,p.Name
    ,p.ProductModelID
    ,pm.Name AS ProductModelName
    ,pmx.CultureID
    ,c.Name AS CultureName
    ,pd.Description
FROM Production.Product p
INNER JOIN Production.ProductModel pm ON p.ProductModelID = pm.ProductModelID
INNER JOIN Production.ProductModelProductDescriptionCulture pmx ON pm.ProductModelID = pmx.ProductModelID
INNER JOIN Production.ProductDescription pd ON pmx.ProductDescriptionID = pd.ProductDescriptionID
INNER JOIN Production.Culture c ON pmx.CultureID = c.CultureID

Don’t worry if the formatting turns ugly; that’s just what happens. Press the Preview button and you will see that this query returns the following columns to our data flow:

ProductCode
Name
ProductModelID
ProductModelName
CultureID
CultureName
Description

We need two more columns in our data flow, and to get them we will use a Derived Column transformation, so drag one onto the data flow from the toolbox and connect it up to the data source. Add the columns as shown in the image below. Next, the data flow needs to be duplicated into multiple streams so that the different members and attributes can be loaded to the staging tables. This is achieved by adding a Multicast transformation to our data flow; this task does not require any configuration. There will be six outputs from the Multicast task and these will be used to load the following:

Product Members
Model Members
Culture Members
Product Model Attributes
Product Culture Attributes
Product Description Attributes

Each of these outputs needs to be tailored as to whether it will be loading a member or an attribute, and also which member or attribute it is loading. Add six Derived Column transformations to the data flow and connect them to the Multicast transformation. At this point our data flow should look similar to the following. For each of the Derived Column transformations add the additional columns as specified below.

OK, now we have all the information we need in our data flows to start loading to the staging tables, but before we do that there is one more thing to do. As we are loading new members to the Model and Culture entities as well as Product, we need to ensure that we have only distinct values for our member codes, to prevent staging errors. To achieve this we add and connect Aggregate transformations to the data flows underneath the ‘Add Culture Member Information’ and ‘Add Model Member Information’ shapes. The images below show how to configure these Aggregate transformations: Group By Culture and Group By Model.

We are now ready to load the data to the MDS staging tables. Add six OLE DB destinations to the data flow. Three of the destinations will be used to load new entity members and the other three will be used to load attributes for these new members. Configure the Connection Manager properties of the destinations as shown below (Members and Attributes). Connect the first destination shape to the ‘Add Product Member Information’ shape and configure it as a Member destination. Click the Mappings tab and set the Input and Destination column mappings as shown below. Connect the second destination shape to the ‘Group By Culture’ shape and configure it as a Member destination.
The column mappings will be the same as above, except that the MemberName and MemberCode columns will be set to CultureName and CultureID respectively. Connect the third destination shape to the ‘Group By Model’ shape and configure it as a Member destination. The column mappings will be the same as above, except that the MemberName and MemberCode columns will be set to ProductModelName and ProductModelID respectively. Connect the fourth destination shape to the ‘Add Culture Attribute Information’ shape and configure it as an Attribute destination. The column mappings will be as follows.

Configure the next two destinations as Attribute destinations and map the columns as for the other Attribute destination, replacing the AttributeValue mapping with ProductModelID and Description respectively. Now our completed data flow should look similar to the following.

If you execute the package you will see that we have inserted 1764 Product member rows, 6 Culture member rows and 118 Model member rows into the mdm.tblStgMember table, and 1764 attribute rows for each of the Culture, Model and Description attributes into the mdm.tblStgMemberAttribute table in your MDS database (a quick verification query is shown at the end of this post). It is worth noting that the data has only been staged at this point; we will not see it in our MDS entities yet.

OK, that’s as far as we are going to go in part one. In part two we will extend the package to move the data from the staging tables into the MDS model and validate the newly inserted data.
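As referenced above, a quick way to sanity-check what has been staged is to count the rows in the two staging tables named in this post. A minimal sketch, assuming the staging tables were cleared before the package ran:

-- Row counts after the package has run; expect 1888 member rows
-- (1764 Product + 6 Culture + 118 Model) and 5292 attribute rows (3 x 1764).
SELECT COUNT(*) AS MemberRows FROM mdm.tblStgMember;
SELECT COUNT(*) AS AttributeRows FROM mdm.tblStgMemberAttribute;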

Data Quality Services

As I’m sure you’ve heard, CTP3 of SQL Server Denali was released yesterday, and can be downloaded here. Denali includes Data Quality Services (DQS), Microsoft’s new data cleansing and matching component that’s based on the Zoomix acquisition that occurred a couple of years back. Data Quality Services didn’t make it into the first CTP, but is now available, so I thought it would be worth a quick blog post.

Installing Data Quality Services

Data Quality Services is an option in the main install, but when you go to run the Data Quality Client, you’ll get a message stating that DQS is not installed. As far as I can tell, DQS needs to be installed manually, by running DQSInstaller.exe, which you can find in the SQL Server Binn directory. This will create two SQL Server databases.

Data Quality Client

Once DQS is configured, you’ll be in a position to use the Data Quality Client, which is a Windows application, available in 32 or 64 bit. Once you connect, you’ll get the following screen. The knowledge base is the key to how DQS works, being utilised to determine the data quality of your source data. You can create your own knowledge base, using your own data, or even cloud-based data. For this example, I’m going to use the built-in knowledge base called DQS Data.

As an example, I’ve created some data that I want to cleanse. It’s Adventure Works country data that I’ve put into a separate table that I’ve called dbo.Country. My task now is to clean this data, which I can do by creating a data quality project. I’ve called my project Countries, and I’ve picked the built-in DQS Data knowledge base, which I noticed contains reference data for countries. The activity that I’ve selected is Cleansing, and then I’ve clicked Create. DQS will then prompt for mapping the source data to one of the domains in the knowledge base; here I’ve mapped my country name to the Country domain from the knowledge base.

After clicking the Start button on the next screen, the cleaning process starts, which gives the following results. I put a couple of typos into my source data, such as changing Algeria to ‘Algerian’ and Albania to ‘Albana’. These were picked up by DQS, along with a few others, but a user has the opportunity to approve or reject each one, via the radio buttons shown above. I chose to approve the first two, and then clicked next. In the final screen, DQS allows you to output the cleaned data to a separate table, as shown below. If you go into the table that DQS has created, you will see that there is a Name_Status column, which holds the cleaning status on a per-record basis (a quick example query against this output is shown at the end of this post). This can be used to update the source data, if required, and therefore address the data quality issues.

Summary

This has been a quick introduction to cleaning data with Data Quality Services. There’s plenty more to look at, and I hope to find time to do so at some point - in particular I’m keen to take a look at how DQS works with Master Data Services. I’ve noticed that there’s also a new MS DQS blog (http://blogs.msdn.com/b/dqs/) – I’m sure that will be a great resource for DQS learning.
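As referenced above, once DQS has written the cleansed output you can query it like any other table. The sketch below is hypothetical: the output table name and the ‘Correct’ status value are assumptions, and only the Name_Status column is taken from the post.

-- Inspect the rows DQS changed or flagged, using the per-record status column it adds.
SELECT *
FROM dbo.Country_Cleansed   -- assumed name of the output table chosen in the final screen
WHERE Name_Status <> 'Correct';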

SQL Server Denali CTP 1 Released

I’ve just found out that the first CTP for SQL Server codename Denali is now available for download. The link to get both the 32 and 64 bit versions is: http://www.microsoft.com/downloads/en/details.aspx?FamilyID=6a04f16f-f6be-4f92-9c92-f7e5677d91f9 On a related note, Simon Sabin has recently posted here about the Tech-Ed Europe keynote that shows a glimpse of a new Silverlight reporting tool for SQL that’s part of Denali. Well worth watching the keynote video…

Master Data Services - Reversing Transactions

MDM tools give control of the enterprise master data to the data stewards and power users, rather than relying on automated data integration alone. Master Data Services is no exception. One of the ways that this is true for MDS is that it allows users to inspect the transactions that have occurred (either internal to MDS or from a source system) and choose whether they want to reverse them. In order to achieve this MDS has a useful log of all transactions that’s viewable by users. Here’s an example of some transactions that have occurred in my test system - some are from data that I’ve loaded up via the staging tables, some are from manual member additions that I’ve carried out in the front end, and some are from business rules that have automatically run.

In the model that this example is taken from, I’ve got some business rules that look to address data quality issues. Taking the Kimball view on data quality issues in a data warehousing context - many can, and should, be addressed in the source system, then re-loaded. That isn’t always possible, which is one of the reasons why we have business rules in MDS. However, as good as any sort of automated rule is, there are always exceptions. In the transactions shown above, an automatic business rule has run that checks a Customer’s overdraft limit and sets it to 10,000 if it’s over 10,000. Therefore, when a value of 50,000 was encountered for Member Code 10311, the MDS business rules kicked in and quite correctly did their job. This was not what I wanted in this particular case.

Thankfully we can click on the undo button that’s shown above the grid, and reverse a chosen transaction, whether it’s come from a source system, a business rule or a manual edit. It doesn’t seem possible to reverse many transactions at once, but that may just be due to the CTP. In my example, by selecting the first transaction in the list, then clicking the undo button, I’ve reversed my automatic business rule. Therefore, the user Kaylee Adams (10311) shown below now has her original overdraft limit.

In conclusion, when some sort of manual intervention is needed to successfully manage master data, MDM tools allow that intervention to come from the power users, rather than having to wait for someone technical to address the issue.

Master Data Services - Business Rules

I've been keeping an eye on the SQL Server 2008 R2 CTPs over the past few months, but have been compelled to start blogging again following the release of Master Data Services (MDS) in the November CTP. The idea of a Microsoft MDM tool first caught my attention with the acquisition of Stratature, and since then I've seen a few talks on the subject, such as Kirk Haselden's talk back at the BI Conference last year. Now that I've got my hands on it, I've decided to cover the set-up of business rules in MDS.

Business rules are key to an MDM solution. If we want to use MDM to load data from disparate source systems, we will definitely have to carry out a lot of cleansing and conforming in order to ensure that the end users only consume clean and accurate data.

To set the scene a bit, I've created several entities in my MDM model, namely Customer, Country and City. These could form a Customer Geography hierarchy, for example, but for the moment I'm going to focus on Customer. The following shows the Customers that I've entered manually. When I add a Customer (via the button that is circled) or when I edit a customer, the third column of the grid for the relevant member will change from a tick to a question mark, indicating that data validation has not taken place.

For this example, what I want to happen is for the Overdraft Limit attribute to be validated against normal boundaries that have been set by the business, e.g. a bank. To do this, I'm going to set up a simple business rule. Selecting Manage->Business Rules will take you to the Business Rules Maintenance screen, where the 'plus' icon will create a new business rule. Editing the new blank rule gives a screen with a basic IF....THEN GUI to produce a basic business rule. On the IF part you pick conditions such as greater than, less than etc., alongside the all-important entity attribute. You do this by dragging and dropping conditions, in the screen below.

In my case I've specified that the overdraft limit can't be greater than £10,000, and that if it is greater, it should be set back to £10,000. This will do for now, but I could have prevented validation from succeeding, or caused MDM workflow to start. Clicking the MDS back button takes us back to the Business Rules Maintenance screen, where the rule is not active until we publish it. Once we do publish the rule, it will kick in whenever validation runs or when you manually run the business rules. In my grid of Customers above, I have an overdraft value which is a mistake. When I validate the Customer entity, the 5555555 value for the second customer automatically reverts to £10,000, as shown below.

This is quite a simple example of course, carried out via manual editing. The real power of these business rules will come when loading masses of data from source systems, with the added power of workflow to prompt business users to deal with the validation issues that may occur. I'll aim to post about integrating from other systems in my next post in due course....

BI for IT Professionals using PerformancePoint

The PerformancePoint downloads page has recently been updated to include a framework on how BI can be provided to the IT department. Normally the IT department would be helping to support BI solutions, not actually being end users themselves - but it makes sense when you consider the kind of information that Operations Manager captures. As this video explains, the end goal is to create solutions that allow effective monitoring of the IT infrastructure. An example of the kind of dashboards that can be produced is shown below. There is a white paper and also a sample solution available for download to learn more.

SQL Server 2005 Cumulative Update 8 for Service Pack 2

Microsoft have just released Cumulative Update 8 for SQL Server 2005 Service Pack 2. Amongst the changes are a few fixes for small problems in Analysis Services, such as:

Infinite recursion occurs in the CalculationPassValue function in SQL Server 2005 Analysis Services;
A data source view that connects to an Oracle database generates incorrect relationships when you develop a Microsoft SQL Server 2005 Analysis Services project by using BIDS;
The MeasureGroupMeasures function does not work correctly in SQL Server 2005 Analysis Services;
All the MDX queries that are running on an instance of SQL Server 2005 Analysis Services are cancelled when you start or stop a SQL Server Profiler trace for the instance.

Details of the above fixes and many more can be found here, as well as details of how to download the update. As usual, you can only apply the hotfix if you have Service Pack 2 installed. Details of all SQL Server 2005 builds released after Service Pack 2 can be found here.

PerformancePoint Evaluation Release

The first "official" release of PPS Server 2007 is now available on MS public downloads, albeit an evaluation copy x86 x64 I've not installed it yet so not sure whether there's anything in that didn't make it to CTP 4 but Nick Barclay has posted on the fact that the SAP BW connector for Monitoring has been cut from the release! Still no sign of a full release on MSDN? I'll update once I've had a look around.  Oh and the sample data is back!