Unless you’ve been living under a rock for the last week or so, you will have noticed that CTP2 for SQL Server 2016 has been released and is now available for download.
I finally got round to downloading the CTP myself yesterday and my first port of call was to take a look at the new version of Master Data Services, as it’s been given a several improvements. Included in this list of improvements is a revamped front end, so I thought it would be worth making a start by exploring the front end improvements in this blog post.
New Look and Feel
The first thing that you will notice is the new look and feel that will greet you in the web UI. Here’s the new look:
To recap - here’s the equivalent in SQL Server 2014 MDS:
The have been a few changes to the actual Explorer grid interface. One is that you now get a Copy Member button and also there is the ability to export your current view to an Excel MDS Add-In favourites file:
Another useful change is that on each entity you now get told who last updated the member and who created the member. In the previous version, MDS logged who last updated the member, but it wasn’t displayed to the end users. I’ve spoken to users who need this feature so its good to see it implemented:
If you go into the Explorer, the main grid is still present and looks the same, but there are a few changes to the menu. There used to be a Explorer menu but this is now gone – after all you can get to the same thing via the other menus. There is a new Entity Dependencies menu, which will list all of the entities within the model, as you can see here:
After each entity is some brackets, which indicates the number of dependencies that the entity has. If you click on the entity (e.g. the Region entity), then you will get the opportunity on the Entity Dependencies view to drill down on an entity member in order to locate its dependencies at different levels:
Hierarchies and Collections
There have been some major changes to MDS hierarchies as covered on the MDS Team blog. In summary, Explicit Hierarchies and collections are now depreciated with the idea that they can be replaced with some of the new features of derived hierarchies. One such feature is the “Unused” node, which shows any members of the Derived Hierarchy that are missing a parent. E.g. in the example below I’ve created a new Area member, but its not been assigned a value for its Big Area attribute:
A small but useful change that I’ve noticed is that there is a setting for hierarchies that will auto-collapse a hierarchy branch if you expand a sibling branch:
For example, looking back at the screenshot I’ve included above, if North America was expanded, then turning this setting on would mean that that North America would collapse if I expanded the International branch of the hierarchy.
These UI improvements to the web interface make a welcomed change to a product that is now in its fourth SQL Server version. It’s great to see Microsoft investing in MDS, this will no doubt help increase the uptake of a product that (aside from in SQL 2014) has now seen steady improvement.
This is just a quick tip for Master Data Services, specifically around the setup of new entities. When creating new Master Data Services entities one of the settings that you can choose is the Staging Table Name, which is an optional setting:
As it says in the screenshot, if you do not complete the field, the entity name will be used. This might be ok in simple, single model environments, but you can easily get your staging tables into a mess if you don’t complete this setting. MDS will try and default the staging table name to the name of the entity. But if you’ve got several models that all have the same entity, then MDS will add an underscore and number to the end of your staging table name. For example, here there is a Region entity within the Customer model and within the Engineer model. But which staging table belongs to which model?
The approach that I tend to take in the entity setup is give the staging table an explicit name, which is the Model name prefixed with the Entity name. E.g. I would choose EngineerRegion and CustomerRegion. This results in a much clearer set of staging tables:
We now know which table belongs to which model. As the MDS Staging Tables all follow a particular pattern, we can now create a generic set of ETL packages using SSIS. I won’t go into this in huge detail, but by specifying the model name and the entity name as variables you can create yourself a template MDS package that can be easily copied and pasted to load other entities. For example below, the staging table load, staging stored procedure and (optionally) model validation all are controlled by the following variables:
The variables are used as expressions in the Control Flow and Data Flow components resulting in a generic MDS loading template:
I’ve had a few questions recently on various projects around the Master Data Services Model Object permissions. At a simple level they can be quite easy to master, but can get more complex as you try and combine the different permissions to create your desired security setup.
This guide assumes that you’ve set the Functional Area Permissions and given the user or group access to the Explorer functional area. The tables below show the different permissions that can be applied to each object within a model:
This example shows the MS sample Customer model, which I’ve named as Customer Sample.
The user will be a Model Administrator, meaning that they can add/delete/modify entities and access all data within each entity in the model. If they also have access to the System Administration functional area, then they will be able to actually add/edit/delete entities and other model objects.
The user will be able to read the data within each entity in the entire model.
The user will not be able see the model.
This example shows the entities that are contained within the same Customer Sample model.
You can add/modify and delete members within the entity, in this case “Big Area”.
You can read all members within the entity.
Deny + Read Only on Model
This is where it starts to get interesting. By having Read Only on a Model it means that you will have access to all entities. But by setting Deny on a specific entity, the user will now not see that entity in the drop down list.
Update + Read Only on Model
This will result in Read Only for all entities, but the user will be able to add, edit and delete members within the Big Area entity.
Member Type Permissions
Depending on the entity, you could have both “Leaf” Member Type or Consolidated Member Type permissions (see here for the difference between the two). This table covers Leaf Members:
Only choosing Read Only at the “Member Type” level will cause the same result as choosing Read Only on the entity. Therefore you will be able to read all members in the entity.
You can add/modify and delete members within the entity, in this case “Big Area”.
The user will not be able see or access the entity.
As you can see, if you only have Leaf Members within the entity, then the permissions are the same as if you are setting them at the entity level.
With just Read Only on one attribute, this will actually result in the user being able to view the whole of the Area entity, even though the Read Only permission has been set on one attribute.
The user will be able to view the whole entity, but will only be able to modify the Big Area attribute on existing members.
The user will not be able to see anything, as Deny on just an attribute is not enough to give the user access to the entity. See the next row for how to correct this.
Deny + Read Only on Entity
The user will be able to view the Area entity, but the Big Area attribute will not be visible.
Update + Read Only on Entity
The user will be able to view the entity, but will only be able to update the Big Area attribute.
Read Only + Update on Entity
The user will be able view, add and edit members, but will not be able to specify a value for the Read Only attribute. The user can, however, delete members.
The last three examples really highlight how combining MDS permissions can be powerful. Lets look at these in more detail:
Deny on Attribute + Read Only on Entity
Setting Deny on the Big Area attribute, but Read Only on the entity will result in the user seeing the following:
The “Big Area” attribute is completely missing from the MDS Explorer grid.
Update on Attribute and Read Only on Entity
On the other hand, setting Update on the Big Area attribute and Read Only on the Area entity will result in the user seeing the following when editing a member in the Area entity:
The two highlighted attributes (Name and Code) are both Read Only as we set the entity to be Read Only. The user can edit the Big Area attribute for each member, as we gave the user Update permission on that attribute.
Read Only on Attribute and Update on Entity
Finally, in the last example we have Read Only on the Big Area Attribute, but Update on the Area Entity. This means the user cannot specify a value for the Big Area attribute:
So there we have it! Hopefully this post will be useful – as you can see the MDS Object Model permissions can be very granular and should be able to cover most of the scenarios that you need.
A standard practice in the data warehousing world is the process of delivering conformed dimensions, which is an activity that can involve data cleaning, standardisation, de-duplicating and matching, with a goal of ensuring that data from multiple different locations can be analysed in a consistent manner. For example, System A may have a customer called William Smith, whereas System B might hold the exact same customer as Mr W Smith. The data warehouse ultimately needs to output one version of William Smith to the end user in this example, which will allow all data associated with this customer to be easily analysed.
I’m currently facing the challenge of de-duplicating and matching across data sources, although in my case most of the data sources are not in English. Its one thing to match customers or products that have slightly different descriptions using fuzzy logic in your ETL, but how do you do this when your products are in multiple different languages? Although we’ve not actually ended up needing to use it, the experience got me thinking, could SSIS be used to translate values in one language to another?
Microsoft Translator API
Enter Microsoft Translator, a “statistical machine translation system”, that’s available on the Azure Marketplace for what looks to be a fairly reasonable price. For example, you can translate up to 2 million characters a month for free, whereas 4 million a month will set you back £26.36 a month. As an example I’ve translated the word ‘Bonjour’ here, without even telling it what the input language is (the service has auto detect capability):
SSIS and the Microsoft Translator API
Using an example from AdventureWorks, I’d like to see if its possible to translate the French descriptions found in dbo.DimProduct into English. To do this, I’ve set up a very simple data flow that starts off with extracting the top 100 products from Adventure Works that have a French description. There is then a Script Component, which makes the all important call to the Microsoft Translator API, then finally a Union All just to allow me to catch the rows while debugging. Overall the data flow is as follows:
The script component is where all the magic happens, as this contains the call to the API using following MS translator example code. The MS translator code needs adapting to fit into the SSIS environment and rather than re-posting every line of code, I’m just going to highlight the changes I made to get this working with SSIS. My first step was to add a script component with the following output column:
Then within the Edit Script window I added the following additional references:
I then setup my using section as follows:
The next step is to call the PreExecute() method once in order to get an access token from Azure Marketplace. The key is to pass in a Client ID and a Client Secret, both of which can be setup here.
Finally ProcessInputRow now needs to carry out the translation, which is carried out with the following code:
public override void Input0_ProcessInputRow(Input0Buffer Row)
//Speficy from french, to English
string uri = "http://api.microsofttranslator.com/v2/Http.svc/Translate?text=" +
System.Web.HttpUtility.UrlEncode(Row.FrenchInputName.ToString()) + "&from=fr&to=en";
System.Net.WebRequest translationWebRequest = System.Net.WebRequest.Create(uri);
System.Net.WebResponse response = null;
response = translationWebRequest.GetResponse();
Stream stream = response.GetResponseStream();
System.Text.Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
StreamReader translatedStream = new System.IO.StreamReader(stream, encode);
System.Xml.XmlDocument xTranslation = new System.Xml.XmlDocument();
Row.Translation = xTranslation.InnerText;
Et voila! After compiling the script code and running the package, we can see the following results in the data viewer:
The third column called Translation is the translated English column name from the API, whereas the last column is the correct English translation from Adventure Works. As you can see, for the first few rows its only missing hyphens and capitalisation and the other rows are very close to the correct English translation from Adventure Works. It’s not perfect in all scenarios that I tested, and this approach won’t be suitable for large data volumes, but certainly for the above data it gives a very good match – better than I expected.
Microsoft Translator seems to have done a very good job as a first pass of translating the data. There would be nothing stopping you enhancing this further of course, perhaps with something like DQS or similar cleansing tools, which in turn would mean that you could get the data into good shape before attempting to carry out fuzzy matching.
An interesting challenge in data warehousing is how to cope with a dimension that is heterogeneous in nature, e.g. a product dimension that only has a handful of attributes for one particular product type, but then may have many more attributes for a different product type. One way to deal with this problem is to create ‘Core and Custom’ dimensions or fact tables, which is a concept I will introduce in this first post, before outlining some of the technical challenges in a later posts.
The ‘Core and Custom’ concept (AKA Supertype and Subtype) is covered by Ralph Kimball in the Data Warehouse Toolkit, where he uses the example of a retail bank selling different types of products, e.g. a mortgage ‘product’ and a current account ‘product’. They both have a balance and an interest amount, which are the facts, but they have a varying number of different dimension attributes. E.g. a Current Account product may have attributes such as ‘Monthly Fee Indicator’ and ‘Debit Card Eligible Indicator’, whereas a Mortgage product may have attributes such as ‘Term’ and ‘Fixed/Adjustable’ to represent Fixed/Variable etc.
I’ve seen this example myself recently where a company sells Commercial Vehicles, Cars, Spare Parts and Bikes. This is the example that I’ll use in order to illustrate the concept a bit further.
Core and Custom Dimensions
If a company wants a global view of all of its product sales, whether they be commercial vehicles, parts, cars or bikes, then an ideal scenario is to have one sales fact table and one product dimension table. Unfortunately due to the diverse nature of the products offered, it could become unmanageable and messy to have a single dimension table that simple has all attributes from all different product lines. For example, a car might have attributes such as Engine Size, Transmission, Airbag Indicator and Fuel Type, whereas a bike might have attributes such as Size, Gender, Speed and Rear Travel – they are clearly very different.
The Core and Custom dimension approach is essentially to have a ‘Core’ dimension table that holds all products, with a column for the surrogate key and then all of the common attributes. The Core dimension table would look like the following:
This is then complemented by the ‘Custom’ dimension tables which are highly specialised dimension tables that are specific to a product line. A given product will be present in both the Core dimension table and one of the Custom dimension tables, with the same surrogate key. This is an important point – a product in the Core table will have the same surrogate key as a product in one of the Custom dimension tables. Therefore in the example below, the highlighted rows in the Core dimension table also appear in the Custom bike product dimension table:
The idea is that the fact table can be joined interchangeably with either the Core dimension table to get an overview of all products, or to the Custom dimension table to get highly specialised analysis. The Core dimension table would join to a fact table just like any other dimension table:
In contrast, either Custom dimension table could be joined to the fact table according to the type of analysis required. The example below shows how either one of three of the Custom dimension tables could be joined to the fact table:
Note – The above diagram shows the optional join paths for the Custom dimensions to the fact table. From the user’s perspective, a single dimension would always be presented – either all products, in which case the Core dimension would be used along with the whole fact table, or one of the Custom dimensions, in which case the fact table should only contain the products that relate to the Custom dimension, e.g. with a view.
The alternative approach is to have separate Custom fact tables for each business line – this could be necessary if there are specific facts or dimensions that only apply to a particular product line. Therefore an approach with both Custom fact and dimension tables might look as follows:
My parting thoughts are that whilst I like the Core and Custom approach, you should only use it where appropriate. For example if you have a few product lines and they vary only slightly, then its fine to create a few attributes within a single dimension table that do not apply to all products. On the other hand, if you want to do detailed analysis across highly varied product lines, then Core and Custom dimension and fact tables provide a very flexible method of presenting different information to different groups of users.
Predicting data is nothing new in the Microsoft BI world, with the data mining capabilities in Analysis Services, but there is now another option – namely Azure ML, Microsoft’s cloud based predictive analytics solution. Like many people, I’ve been keen to see what Azure ML can do. Luckily, you can still sign up for a completely free trial for Azure, meaning you can be up and running with Azure ML in a matter of minutes.
For this blog post I’ve selected an insurance use case scenario, as I’ve been working in the insurance industry quite a bit over the past couple of years. My reason for using Azure ML is that I would like to see if it’s possible to predict the retention rate, i.e out of all our policies that were due for renewal, what percentage of those policies actually renewed? Therefore I’m going to try and predict whether a car policy holder will renew based on a dataset comprising of previous policies that show whether or not the policy was renewed.
I didn’t want to use real data for this example, so I’ve got some made up data, which I’ve saved as a simple CSV. Thankfully this can be easily uploaded to Azure using ML Studio, which is the main development tool for Azure ML:
Having uploaded my test dataset, I’m ready to build a predictive model. Azure ML allows me to do this by creating an experiment, which is to ML Studio what a project is to Visual Studio. Within ML Studio, the starting point is to pick a dataset. Data can be consumed from a variety of sources, such as SQL Azure, Azure Blob Storage and http to name a few. But in this case I’m going to use my csv file, which is available under Saved Datasets, along with a host of other ‘modules’ that perform various tasks within ML. The modules can be dragged and dropped onto the canvas, so the starting point is to drag the insurance dataset onto the canvas:
Once I have a dataset on the canvas, I can right click on the saved dataset and choose Visualize, which will profile my data for me, showing me statistics such as the minimum, maximum and standard deviation for each column:
There are a myriad of factors that could potentially affect the retention rate, such as renewal price, number of other policies held (e.g. perhaps they have both life and car insurance with the insurance company), how long the policy holder has been a customer, as well as make/model, insurance add-ons (legal cover, hire car cover) age and other demographics to name a few. I’ve not got all of these factors in my dataset, but I have a fair few, just to show an example.
Once I’ve got my insurance dataset, the next task is to add a Split module, which will allow me to split the rows into two buckets, in a 75%/25% split. Firstly, a training set is needed, which will be fed into the actual model, allowing it to learn based on the past data. The other bucket will be the test set in this case, which will be used to test how the trained model performs. Here you can see the two dots at the bottom of the Split module are the two outputs:
The next task is to drag a Train Model onto the canvas and then connect the training set to the second input of the Train Model. For the first input, the Train model is expecting an input of untrained learner, so I’ve given it the input of an untrained Two-Class Neural Network. The second input expected by the Train Model is a dataset, which is the training set that comes from the Split. Therefore we now have the following shape to the experiment:
The red icon on the train model indicates that there is no column selected, so therefore I’ve picked the column “Renewed” (Yes/No) from the column selector to indicate that this is the outcome that we wish to predict.
In addition, we need to add a Score Model, which will allow us to score the now trained model against the test dataset. Therefore the Score Model accepts two inputs, namely a Trained Model and the test dataset:
If I now run the experiment then we can click on the Visualize option on the output of the Score Model. This will show a comparison of the predicted outcome against the outcome on the test dataset, on a row by row basis:
Finally we can add in an Evaluate Model, which will measure the performance of the model using a specific set of metrics. Again we can choose the Visualize option, which will give us metrics such as True/False Positive/Negative and Cumulative AUC:
What I think’s great about Azure ML is that its very easy to compare the performance of the Neural Network to other models. Therefore I can add in a Boosted Decision Tree, another Train Model and also another Score Model:
These can be quickly wired up to the relevant inputs. The final step is then to then set the Evaluate Model to accept a second input, which is the output of the second Score model. It all ends up looking like this:
As you can see, we now have two inputs into the Evaluate Model, which will allow us to compare the performance of the two models that have been chosen. In this example, when choosing Visualize on the Evaluate model, I can see that the Two Class Boosted Decision Tree gave the best overall results:
The final result is that we have a model that is capable of predicting whether a policy holder is likely to renew or not. Of course this is just a quick example but it hopefully gives you an idea of what Azure ML is capable of.
Rather than just using the the local domain values in its internal knowledge base, Data Quality Services (DQS) can instead clean data by using external reference data. This is a feature called Reference Data Services and in this blog post I’m going to show an example of cleaning SSIS data by connecting to the Azure Data Market.
Azure Data Market
After you’ve signed up for Azure Data Market, you’ll need to subscribe to datasets that are compatible with DQS. Unfortunately they generally cost money, but I have found one (namely DMTI Spatial) that allows you to sign up for a free trial. I do think this is a good idea and I’d like to see free trials offered on Data Market by other providers. I don’t happen to have a huge data quality problem with Canadian addresses, but it does allow me to show how Reference Data Services works with SSIS.
Once we’ve signed up for some data, the next step is to build the knowledge base in the Data Quality Client. In my case, I’m going to setup a simple set of address domains based on addresses:
For a detailed guide on how to setup domains, see the following article.
Reference Data Services
In order to create the link to the external data, settings need to be changed within Domain Management. Having selected the composite address domain, the first step is to click on the Reference Data tab, then click on the browse button, which will cause the following window to appear:
As there is only one dataset available, I’m going to pick it and then map each of the domains to the schema given by the reference dataset:
After clicking ok, there are some optional settings to adjust in order to influence auto correction, the number of candidates and the minimum confidence:
SSIS and DQS
Now its time to look at how the combination of DQS and the external data can be used within SSIS. I’ve put together just a simple CSV containing a few commercial Canadian addresses to clean, with a few spelling mistakes etc:
Within SSIS, the first step is to create a data flow that pulls data from the CSV. Then, within the data flow, we need to connect the the CSV source to the DQS Cleansing Transform:
The key part within SSIS is to edit the DQS Cleansing transform, as we need to first select a knowledge base, then match the incoming columns to the DQS domains:
Interestingly on the advanced tab of the transform there are checkboxes to return additional columns from the reference data provider, which is a provider-dependant feature and is not actually available within the DQ client:
Now we’re ready to go. When running the SSIS package with a data viewer, the following results are output:
Most of our addresses have been looked up and corrected, e.g. “4370 Lorimer” to “4370 Lorimer Road” and “Gerard St” to “Gerrard St”. The confidence of the cleaning is included in the output, as well as the aforementioned ‘extra’ appended data. In this case, as we’re using the DMTI provider, the type of property and latitude/longitude are amongst the columns returned.
As a final point, I noticed that the initial results from the external provider were changing my address values of “Road” and “Street” to “Rd” and “St”. As you can see from my Excel screenshot earlier, my input addresses were a mix of “Rd” and “Road”, but I wanted to standardise on “Road” and “Street”. This is exactly what Term Based Relations are for in DQS, but i didn’t expect them to work with external reference data as the data returned by the external data is in theory correct. Surprisingly, they do work with Reference Data Services, meaning its possible to gain a bit of extra control over the output.
As those of you who use MDS will know, the Code attribute within a Master Data Services entity is mandatory, as its used to uniquely identify the members within the entity. Not only is it mandatory, but unfortunately you cannot change its underlying SQL Server data type – it always gets created as an nvarchar. This is useful in some cases, but what do you do if you want to ensure that the Code can only be numeric? This post outlines two options that may help:
Master Data Services Security
Although it will vary across different MDS implementations, the Code is quite often used as as an auto incrementing integer, which can be achieved easily by changing the entity properties. If all you want is to auto generate the code and prevent users from changing it, then you can just set the Code to be read only, as shown below:
The security approach will work fine if users aren’t allowed to update the Code. On the other hand, if the Code can be updated, then a different solution is needed.
Here an MDS business Validation Business Rule type can help, as one of the Validation rules is called must contain the pattern:
By dragging this business rule onto the Action node, its now possible to write the rule action. If we chose the Code as the attribute to validate and then enter ^[0-9]+$ as the regex pattern, then the rule looks as follows:
Finally, after publishing the new Business Rule, members with an non-numeric code will now be invalid, as shown with the sample MDS Customer model:
To me one of the most exciting parts of the Microsoft Analytics Platform System (APS) is Polybase, which in a nutshell allows you to access data residing in Hadoop or Windows Azure Blob Storage, all via T-SQL in PDW. What this means is that data can be transparently queried by a user or developer, in real time, regardless of whether the data lives in PDW or Hadoop/Azure. James Roland-Jones gives a thorough overview of all things Polybase here.
What I’m going to do in this post is to show an example of how existing data within Hadoop can be combined with data that resides in an APS PDW region.
Polybase Example - Setup
There are two key tasks to complete before we’re able to start querying data. We need to setup a data source, so that Polybase knows where to get the data from, plus we need to describe the file format of the external file that we’re going to read. The data source here specifies that we’re going to use the Hadoop nodes that are co-located with the PDW nodes, but this could be a non-appliance Hadoop cluster:
CREATE EXTERNAL DATA SOURCE HadoopRegion_DataSource
TYPE = HADOOP,
LOCATION = 'hdfs://hdfs://H12345-C-HHN01',
--Optional specify a tracker location to enable predicate
--push down to Hadoop
JOB_TRACKER_LOCATION = 'H12345-C-HHN01:50300'
The next task is to setup a file format. In this case we’re defining that the file is pipe delimited, although we can use Polybase with other formats (e.g. RCFile):
CREATE EXTERNAL FILE FORMAT HadoopRegion_DelimitedPipe_NoCompression
FORMAT_TYPE = DELIMITEDTEXT,
Field_terminator = '|')
Accessing External Data
Before we access data in Hadoop, I want to show the file for this example. In this case I’ve got a customer dimension within PDW, but I have customer survey data within Hadoop that i wish to combine with my data in the warehouse. The survey data has been uploaded in this case to the Hadoop region within APS, via the portal:
The following shows a sample of the Hadoop file, note that its pipe delimited, with columns for Date, Survey Id, Product, Customer Id and Survey Score:
Now I want to access the Hadoop data and combine it with other data in the warehouse. To do this we need to use the CREATE EXTERNAL TABLE command.
CREATE EXTERNAL TABLE [dbo].[HDFS_CustomerSurvey]
DateId int NOT NULL,
SurveyResponseId int NOT NULL,
ProductCategoryName varchar(100) NULL,
CustomerId varchar(50) NULL,
SurveyResult INT NULL
LOCATION = '/user/data/survey/',
DATA_SOURCE = HadoopRegion_DataSource,
FILE_FORMAT = HadoopRegion_DelimitedPipe_NoCompression
This results in an external table being available within the PDW region, as shown below:
Now if we want to query the Hadoop data it’s just a simple case of T-SQL. Here I’m joining Hadoop survey data to the customer dimension:
CS.DateId AS SurveyDate,
CS.ProductCategoryName AS SurveyProduct,
FROM dbo.HDFS_CustomerSurvey CS
INNER JOIN dbo.DimCustomer C ON C.CustomerAlternateKey = CS.CustomerId
Which gives the following results:
So that’s it, just a simple T-SQL query in the end. Polybase has taken away the complexity and allowed us to integrate different data sources using a widely used standard query language.
After running a modelling workshop with end users, my next activity on a project is generally to produce a prototype model with some test data, in order to check with users that the planned star schema design will work. My tool of choice for this job tends to be Power Pivot, but I was presented with a slightly different solution when working with a client recently. This got me thinking it would be worth contrasting the two approaches….
First of all, lets start off with Power Pivot. The goal here is to quickly produce a prototype to verify the star schema design. I tend to just request that a business user brings along an Excel workbook with samples of each dimension and facts pre-extracted in Excel, which can then be loaded into Power Pivot. I find that this saves time, although as an alternative of course Power Pivot could extract directly from SQL Server, Oracle etc if needed.
As an example, we can quickly walk through creating a very simple model to create a star schema:
- I’ve copied the data that represents my single fact table and related dimensions into Excel. In this case it’s to model a simple Sales Order Transactions fact, therefore my Excel workbook has dimensions such a Product, Customer, Sales Territory etc, as we well as the data that represents the fact table:
- After completing and closing the workbook, we now create a new workbook and open the Power Pivot window.
- Within the Power Pivot window, the next step is to import the data from the Excel workbook created in step 1. This can be done by choosing From Other Sources->Excel workbook and then browsing to the file.
- Power Pivot will now present us with all of the sheets that it finds in the workbook, so we select the sheets that represent the dimensions and facts:
- The next step is to setup relationships between the various tables. Once this is done, we have a simple prototype data model, as shown below:
After a bit of tidying up (e.g. creating some hierarchies and removing unwanted columns), we can now connect to a Pivot Table in Excel that will help verify the star schema:
They key point here is that we have rapidly built a prototype, without the need for any ETL. This means we can quickly cover the design issues with the users, but this time with their actual data, rather than just on a whiteboard.
The alternative approach that was presented to me recently is to develop the Analysis Services cube/Tabular model before carrying out the ETL. Essentially this means that the users connect to cube, as a production system, unaware that under the hood the ETL is not yet complete.
How is this achieved? Essentially by putting logic in the SQL view layer. Its a well known Analysis Services best practice to bind your SSAS objects (e.g. dimensions, measure groups) to SQL Server views, rather than using objects within the DSV. Rather than each view pulling its data from a complete dimension or fact table, instead each view would pull its data from a staging area or a copy of the source database. The idea being that, over time, the contents of each view would be updated to point to the actual dimensions and facts, once they are built.
Therefore a normal view (in this case for a product dimension) that feeds Analysis Services might look like:
With the ‘No ETL’ approach, the view looks very different, as any transformation and cleaning will have to be carried out from the view:
SELECT ROW_NUMBER() OVER(ORDER BY PROD.ProductID) AS ProductKey,
PROD.Name AS ProductName,
CAST(ISNULL(SUB.Name, 'Unknown') AS VARCHAR(100)) AS ProductSubCategory,
CAST(ISNULL(CAT.Name, 'Unknown') AS VARCHAR(100)) AS ProductCategory
FROM [$(AdventureWorks)].Production.Product PROD
LEFT JOIN [$(AdventureWorks)].Production.ProductSubcategory SUB ON SUB.ProductSubcategoryID = PROD.ProductSubcategoryID
LEFT JOIN [$(AdventureWorks)].Production.ProductCategory CAT ON CAT.ProductCategoryID = SUB.ProductCategoryID
This therefore incurs some technical debt, as the cube gets built before the ETL or even the physical dimension table. But the idea is that you can get the users using the cube, on production even, then gradually replace the views with proper ETL and dimension tables once you have some more feedback.
The Power Pivot method is tried and tested for me many times – I find that its very quick to get up and running. The Analysis Services approach is not quite as quick to get up and running, but offers the benefit that the cube can continually evolve in a fairly agile manner. Its early days for me using this approach, but the a disadvantage is that the views can get quite complex, impacting performance depending on the data volumes.
Both methods probably have their place. If I wanted to validate my design, I think I would continue to produce a quick Power Pivot model. On the other hand, if quick delivery to production is a must, and the complexity/data volumes are not an issue, then I would use the view approach.
I’ve seen a few data quality issues recently in a some client data warehouses/marts where I’ve been asked to carry out some Analysis Services work. These data quality issues have often led to the classic SSAS dimension processing error, namely:
“Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'dbo_DimProduct', Column: 'SubCategory', Value: 'Other'. The attribute is 'Sub Category'”
Normally this error is easy to track down, but there are a few scenarios that I’ve encountered where SSAS attribute properties actually make it harder for you to troubleshoot.
Scenario 1 – Trimming
When hitting the above error, the first thing to do would be to run a query to see where the duplicate attribute is. In my example, I’ve just produced a very simple dimension that’s loosely based on Adventure Works – here are the attribute relationships:
Therefore the first place to start would be to write a query that checks if we have any Sub Categories that are duplicated across Categories, such as:
GROUP BY SubCategory
HAVING COUNT(DISTINCT Category) > 1
In my case, this actually yields no results. So what’s happening here? We can’t find any duplicates in SQL, but SSAS thinks there are duplicates. Well….Analysis Services dimension attributes contain a property called Trimming, which, by default, will remove spaces at the end of an attribute key or name. Needless to say, a dimension shouldn’t actually contain leading or trailing spaces, but in this specific scenario, the ETL wasn’t as robust as it could be. In the actual example I encountered, there was a tab at the end of one of the attribute names. To visualise that in the DimProduct example that I’ve created, the highlighted attribute has a tab at the end:
Scenario 2 – NullProcessing
The next scenario I want to highlight is the same error, but a different dimension. Again I’m going to create a simple example to illustrate what actually happened. This time I’ve got a customer dimension table, as follows:
The error is: A duplicate attribute key has been found when processing: Table: 'dbo_DimCustomer', Column: 'CustomerCountryKey', Value: '0'. The attribute is 'Customer Country'
A quick query on the above data will reveal that a CustomerCountryKey of 0 doesn’t actually exist twice for the same country, so what’s happened here? Again its due to a dimension property, this time on the attribute key, called NullProcessing. As its default is Automatic, it means that Nulls will get converted to zero during dimension processing. As we already have a member with CustomerCountryKey of 0, then we get an error.
ETL and Data Quality
These sort of issues highlight just why handling data quality during the ETL is so important! For the first scenario, some basic trimming/cleaning can fix the issue before it hits Analysis Services. As for the last issue, this is partly due to data warehouse/mart design. Personally I would never allow a dimension attribute to be null, partly because you shouldn’t present null attributes to a user, but also due to the lack of control that can result as shown above. Therefore a bit of design work up front to ensure that you have a robust data model is essential. Interestingly the properties that I’ve mentioned don’t exist for tables in Analysis Services Tabular. Whereas I think there are a few SSAS MD features that need to make it over into the next version of Tabular, I can personally live without the two properties outlined above.
I’ve always liked how Excel and other client tools deal with measure groups in a multi dimensional SSAS cube. Once you connect to the cube, you get a drop down in the Pivot Table fields pane that lets you choose which measure group you want. From there, you get a filtered list of measures and dimensions. For example, if I connect to the Adventure Works 2012 sample cube, I get the following:
By picking Internet Orders, I will see only the measures dimensions that relate to Internet Orders, which provides me with a good way of navigating a large cube.
Unfortunately though, if I connect to a Tabular model (e.g. the sample Adventure Works 2012 model that I got from codeplex), then if I click the same drop down, then I will see every table in the entire model, whether the table is a dimension, a fact or something else. E.g. here for a tabular model than contains 3 fact tables, but 15 tables in total, then I see all 15 tables:
This isn’t ideal in my opinion, as I’d much rather that users had the ability to quickly jump to an area of interest, which measure groups achieve fairly well. As I have 3 fact tables, I would expect to see those 3 fact tables/measure groups in the drop down.
I did accept this as a quirk of tabular until I stumbled across something recently. If you connect to a tabular perspective, then Excel behaves exactly as Multi Dimensional does, i.e. it shows you only measure groups in the drop down. To illustrate I’ve added a new perspective to the model that contains all tables and I’ve called this perspective ‘Adventure Works’:
Now I can connect to this new perspective via Excel. Remember the perspective contains all my tables, so it shouldn’t be any different than connecting the model itself:
The perspective does give a different result though – now only the actual measure groups are displayed to me in the drop down, which is much more user friendly:
This is completely different to the way that perspectives work in multi dimensional. When making MD cubes, I wouldn’t always need perspectives, just because the measure groups provided a ‘natural’ way of the users picking the subset of the cube that was of interest to them. Now with tabular, it seems that using a perspective will actually improve the user experience.
The MDS Hierarchy Member permissions have been given a complete revamp in SQL Server 2012. As explained here, the hierarchy member permissions are purely optional and allow you to limit the access to a specific set of members.
Even though the actual member permissions are optional, you may have a need to purely allow a user to browse and/or update the Derived Hierarchy. This is also something that’s been affected by the aforementioned change, so I want to explore here how we would grant users access to a Derived Hierarchy.
In 2008 R2, on the model tab, you got a Derived Hierarchies node, as shown in the image below:
If we contrast that to SQL Server 2012 (albeit with a different model), we don’t have a Derived Hierarchies node:
Therefore, to access the actual hierarchy, according to this article, you need to give the user Update permission on the model. That’s not the only way to do it, but certainly working only at the entity level, or the member level, is not good enough. To explore, lets look at an example. I have the sample Customer model installed and I want to give a test user access to the following CustomerType hierarchy, which is just made up of the CustomerType and Customer entities:
Going to the User and Group Permissions, just quick confirmation that assigning Update permission to the Customer and CustomerType entities is not good enough. As shown below, the user does not have access to any of the Derived Hierarchies that I wanted:
Upping the permissions level a bit by granting read only permission to the model does the trick – the test user can now see the Derived Hierarchy and can also add members. This is thanks to 1) a Read Only permission at the Model level and 2) Update permission on the two entities that make up the Derived Hierarchy:
However, the simplified security comes with a slight drawback, which is to get access to the Derived Hierarchy we’ve now given the user Read Only access to all entities in the model:
You may not want to grant a user access to all entities, but thankfully it can be avoided by granting an explicit Deny at the entity level. For example, as shown below I’ve set an explicit Deny for the AddressType entity:
This will ensure that the user now doesn’t see the AddressType entity on the Entities menu. Now that the user can see the hierarchy, what about restricting access to the members? That’s a bigger topic, so I don’t want to cover it here, but I will just provide a useful tip. When you’re testing your security, don’t forget that member permissions are not applied immediately! The following article explains how to immediately apply the member permissions, which is something that I’d recommend carrying out whilst testing.
|Packt have just released a second version of their ‘MDX Cookbook’, titled ‘MDX With Microsoft SQL Server 2012 Analysis Services’, authored by Sherry Li and Tomislav Piasevoli. As I’m one of the technical reviewers of this book, it’s put me in a good position to be able to put together a review of the book in this blog post. |
The new SQL Server 2012 version of the book is similar in terms of content and structure to the original, presenting a number of solutions to common MDX problems in a ‘recipe’ format.
In the first two chapters, the book covers basic MDX techniques, such as the layout of a basic query and a variety of techniques for working with sets. The pace of the book warms up nicely here, starting with a few basic concepts, before gradually getting more complex.
After a useful chapter on working with time in MDX, there are some excellent chapters that give what should be useful recipes on common reporting techniques, such as top/bottom members, ranking and averages.
As the book progresses, there are some more complex chapters that cover areas such as linear regression, complex sorts and recursion – to name a few. Whether the recipes are simple or complex, they works well, as each recipe is packed with step-by-step instructions and screenshots, all based on the SQL Server 2012 Adventure Works cube.
Although it’s also in the first edition, I particularly like the chapter ‘When MDX is Not Enough’, which talks about how modifying the cube design can assist you when writing some MDX queries. Essentially the chapter promotes evaluating if you can modify the cube design in some way, rather than writing very complex MDX, which is something that I completely agree with.
The book differs slightly to the 2008 R2 version in that a few more of the basics (e.g. basic MDX query structure) are covered in the introductory chapters. This seems to work well, meaning someone relatively new to MDX can pick up some simple concepts, whereas someone who is more experienced in MDX will also get a massive amount out of the book.
So overall this comes highly recommend for your bookshelf or Kindle…
After doing my recent blog post on the Excel Fuzzy Lookup Add-In Vs SSIS, it got me thinking about the capabilities of the SSIS Fuzzy Lookup Transformation.
Although I’m aware that the Fuzzy Lookup is a token-based solution, I’ve only ever tuned it really by altering the similarity thresholds on both the columns and also on the task itself. What you can also do is alter how the transform deals with tokens. Therefore, I thought it would be worth a quick post to show how the tokens help produce matches.
The Fuzzy Lookup works by splitting up strings into several different components, known as tokens. For example, the Adventure Works reseller “Cross-Country Riding Supplies” might be split up into the tokens “Cross”, “Country”, “Riding” and “Supplies”. This is key to the matching process, as the fuzzy algorithms will attempt a match based on the commonality between tokens in the input and reference datasets.
SSIS Fuzzy Lookup
The advanced tab of the fuzzy lookup will show the delimiters used to create the tokens. As you can see, a hyphen is included in the list, meaning words separated by a hyphen get interpreted as separate tokens:
On the “Reference Table” tab of the transformation, there is the option to store a new index, which means that SQL Server will create a table in the data source to hold the tokens found in the reference table:
In my made up dbo.Retailers table, I’ve got two retailers – “Nearest Bike Store” and “Gears+Chain Supply”. After running the SSIS package once, the dbo.IndexTest table will be created, which is where the reference tokens get stored in order to help with matching. As it’s just stored in SQL Server we can select from it.Before we do that, I should just say, I’m only passing this fuzzy lookup 1 input string, which is a misspelt company called “Bikes Chaiin Supply”. If I run the package, I get the following matches and scores:
This has a fairly low similarity and confidence, so we can probably do better. Going back to the index, if we select from dbo.IndexTest, we get the following:
The index table also contains substrings found in the reference table, known as q-grams, which are designed to assist with matches that contain errors. What’s interesting about these results is that “gears+chain” has been entered on its own, meaning that the word “chain” has not been interpreted as a token in this case. The solution, in this case, is to alter the token delimiters to include “+”:
If we re-run the package and then look again at the index table, we now see that the word “chain” has been successfully interpreted as a token. In addition we also have a few variants on the word chain, such as “chai” and “hain”:
After re-running the matching process with out new delimiters, we can now look at the results of the matching process. If we take a look at the data viewer output, we can see that this has now resulted in a better similarity and confidence for the first row:
The delimiters work across all columns for the whole fuzzy lookup, so that’s something to test when changing the delimiters – it may not be beneficial in all situations. However the new delimiter has clearly worked very well in this example, so it’s very much worth considering if you’re not quite getting the level of matches that you had hoped for.
More Posts Next page »