Tristan Robinson

Tristan Robinson's Blog

Adatibits - Scalable Deep Learning with Azure Batch AI



I recently had the pleasure of attending SQLBits, there were a number of great talks this year, and I think overall it was a fantastic event. As part of our commitment to learning at Adatis, we were also challenged to present back to the team something of interest we learnt at the event. This then becomes our internal Adatibits event where we can get sample of sessions from across the week.

As such, I was pleased when I saw a talk by Ben Keen on the Friday bill of SQLBits that revolves around deep learning. Having just come through the Microsoft Data Science program, this fell in line with my interest / research in data science, and was focusing on the bleeding edge of the subject area. It was also the talk I was probably looking forward to the most from the synopsis, and I thought it was very well delivered.

Anyway, on to the blog. In the following paragraphs, I’ll cover a cut down version of the talk and also talk about my experience using the MNIST dataset on Azure Batch AI. Credit to Ben for a few of the images I’ve used as they came off his slide deck.


What is Deep Learning?

So you’ve heard of machine learning (ML), and what that can do – deep learning is essentially a subset of ML, which uses neural network architectures (similar to human brains). It can work with both supervised (labelled) and unsupervised data, although its value today is tending towards learning from the labelled data. You can almost think of it as an evolution of ML.  It’s performance tends to improve the more data you can throw at it, where traditional ML algorithms seem to plateau. Deep learning also has an ability to perform automatic feature extraction from raw data (feature learning), where as the traditional routes have features provided as part of the dataset.


How does it help us? What are its use cases?

Deep learning excels at finding patterns in unstructured data. The following examples would be very difficult to write program to do, which is where the field of DL comes in. The use cases are usually split into 4 main areas – image, text, sound, and video.


Deep Neural Network Example

A simple example of how we can use deep learning is to understand the complexity around house prices. Taking an input layer of neurons for things such as Age, Sq. Footage, Bedrooms, and Location of a house – one can normally apply the traditional linear formula y = mx + c to apply weightings to the neurons to calculate the house price. This is a very simplistic view and there are many more factors that apply that can change the value. This often involves the different neurons intersecting with one another.

For example, people would think a house with a large number of bedrooms is a good thing and this would raise the price of the house, but if all these bedrooms were really small – then this wouldn’t be a very attractive offering for anyone other than a developer (and even then they might baulk at the effort involved) therefore lowing the price. This is especially true with people wanting more open plan houses nowadays. So traditionally people might be interested in bedrooms, this may shift in recent times to Sq. Footage as the main driver.




Therefore a number of weights can be attributed to an intermediary layer called the hidden layer.

The neural network architecture then uses this hidden layer to perform a better prediction. It does this by starting off with completely arbitrary weights (which will make a poor prediction). The process then multiplies the inputs by these weights, and applies an activation function to get to the hidden layer (a1, a2, a3). This neuron is then multiplied by another weight and another activation function to generate the prediction. This value is then scored, and evaluated via some form of loss function (Root Mean Squared Error). The weights are adjusted accordingly and the process is repeated. The weights are adjusted through a process called gradient decent.




To give you an idea of scale, and weight optimisation that's required – a 50x50 RGB image has 7,500 input neurons. Therefore we’re going to need to scale out as the training is very compute intensive. This is where Azure Batch AI comes in!


Azure Batch AI

Azure Batch AI is a managed service for training deep learning models in parallel at scale.

It’s currently in public preview, which I believe it entered around September 2017. It’s built on top of Azure Batch and essentially sells the standard Azure story where it provides the infrastructure for data scientists so they don’t need to worry about it and can get on with more practical work.

You can take your ML tools and workbooks (CNTK, TensorFlow, Python, etc.) and provision a GPU cluster on demand to run them against. Its important to note, the provision is of GPU, not CPU – similar cores, less money, less power consumed for this type of activity.

Once trained, the service can provide access to the trained model via apps and data services.




As part of my interest in the subject, I then went and looked at using the service to train a model off the MNIST dataset. This is a collection of handwritten digits between 1-9 with over 60,000 examples. It’s a great dataset to use to try out learning techniques and pattern recognition methods while spending minimal efforts on pre-processing and formatting. This is not always easy with most images as they contain a lot of noise and require time to convert into a format ready for training.




I then followed the following process within Azure Batch AI.

Created a Storage Account using Azure CLI along with a file share and directory.

# Login 
az login -u <username> -p <password>

# Register resource providers
az provider register -n Microsoft.BatchAI
az provider register -n Microsoft.Batch

# Create Resource Group
az group create --name AzureBatchAIDemo --location uksouth

# Create storage account to host data/scripts
az storage account create --name azurebatchaidemostorage --sku Standard_LRS --resource-group AzureBatchAIDemo

# Create File Share
az storage share create --account-name azurebatchaidemostorage --name batchaiquickstart

# Create Directory
az storage directory create --share-name batchaiquickstart --name mnistcntksample --account-name azurebatchaidemostorage


Uploaded the training / test datasets, and Python script.

# Upload train, test and script files
az storage file upload --share-name batchaiquickstart --source Train-28x28_cntk_text.txt --path mnistcntksample --account-name azurebatchaidemostorage
az storage file upload --share-name batchaiquickstart --source Test-28x28_cntk_text.txt --path mnistcntksample --account-name azurebatchaidemostorage
az storage file upload --share-name batchaiquickstart --source --path mnistcntksample --account-name azurebatchaidemostorage


Provisioned a GPU cluster. The NC6 consists of 1 GPU, which is 6 vCPUs, 56GB memory, and is roughly 80p/hour. This scales all the way up to an ND24 which is 4 GPUs, 448GB memory, for roughly £7.40/hour.

# Create GPU Cluster NC6 is a NVIDIA K80 GPU
az batchai cluster create --name azurebatchaidemocluster --vm-size STANDARD_NC6 --image UbuntuLTS --min 1 --max 1 --storage-account-name azurebatchaidemostorage --afs-name batchaiquickstart --afs-mount-path azurefileshare --user-name <username> --password <password> --resource-group AzureBatchAIDemo --location westeurope

# Cluster status overview
az batchai cluster list -o table


Created a training job from a JSON template – this tells the cluster where to find the scripts and the data, how many nodes to use, what container to use, and where to store the trained model. This can be then be run!

# Create a training job from a JSON template
az batchai job create --name batchaidemo --cluster-name azurebatchaidemocluster --config batchaidemo.json --resource-group AzureBatchAIDemo --location westeurope

# Job status
az batchai job list -o table


The output can be seen in real time along with the epochs and metrics. An epoch is essentially a full training cycle, and by having multiple epochs, you can cross validate your data, which leads the model to generalise more and fit real world data better.

# Output metadata
az batchai job list-files --name batchaidemo --output-directory-id stdouterr --resource-group AzureBatchAIDemo

# Observe realtime output
az batchai job stream-file --job-name batchaidemo  --output-directory-id stdouterr --name stderr.txt --resource-group AzureBatchAIDemo



The pipeline can also be seen in the Azure portal along with links to the output metadata.



Once the model has been trained, it can be extracted, and the resources can be cleared down.

# Clean Up
az batchai job delete --name batchaidemo  
az batchai cluster delete --name azurebatchaidemocluster 
az group delete --name AzureBatchAIDemo



By moving the compute into Azure, and having the ability to scale – this means we can generate a faster learning rate for our problem. This in turn will mean better hyperparameter tuning to generate better weightings, which will mean better models, and better predictions.

As Steph Locke also alluded to in her data science talk – this means data scientists can do more work on things that they are good at, rather than waiting around for models to train to re-evaluate. Deep learning is certainly an interesting space to be in currently!

Improving Machine Learning Models

As part of my MS Data Science Professional Program, a number of the topics recently have been based around getting the most out of an Azure ML model. As part of this blog, I will be looking at the techniques and ways in which you can model and improve a solution. While I was tackling this problem with Azure ML, these techniques apply to building better models through other languages/platforms such as Python or Scala.


Data Munging

A process that undoubtedly every data scientist goes through with every DS problem they face, is that of data munging. This is a term that is being used more frequently to describe the process of transforming data from its raw state into another format, something more valuable for downstream analytics. Models running on data that is poor in quality, missing or duplicated will produce poor predictions. Therefore a very simple pre cursor to any problem is to explore the data and understand what is required to turn the data set into a form which is better for the latter stages of the model design. The following techniques are used as part of this stage:

  • Removing Duplicate Rows
  • Cleaning Missing Data (custom substitution of numerics, usually 1 or 0 or replacing with the mean, medium, mode across the dataset)
  • Cleaning Missing Data (removing bad quality rows entirely or removing bad quality columns entirely – more extreme)
  • Creating Categorical Features from String Features
  • Normalising Numeric Features (ZScore or Min/Max) to bring everything on to the same scale.

These steps are quite basic so I won’t go into detail here, but none the less, they should be considered at the start of a modelling a better solution to a problem.


Feature Selection

One of the reoccurring principles that appears with machine learning is that of Ockham’s razor, which states that the best models are simple models that fit the data well; this is not an irrefutable principle of logic, but a preference for simplicity. Therefore there is a need of balance between accuracy and simplicity to limit the feature set which tends to lead to better predictions. Simpler models are also more interpretable to humans which also helps. While the data I was working with was limited to around 35 features, there are many data science problems which have thousands of features and so this technique is even more crucial.

There are multiple methods to perform feature selection, of which a few will be covered here. The first method is greedy backward selection which starts with all the features and then finds the feature that hurts predictive power the least when removed, and you remove it. This is done iteratively until a point is met (which will be discussed later). Its known as greedy since it never looks back after removing the feature each time.

An alternative method is greedy forward selection which is basically the inverse, starts with no features, and looks for the feature that by itself is the best model. This then carries on in a similar vein to the backward selection but adding features. The point at which you stop with forward selection is that of diminishing returns for your accuracy.

Defining accuracy is important here, and this is where a formula called Adjusted R² comes in. R² is a measure of how well the model fits the data, with being closer to 1 than 0 being a better fit. The adjusted part adds a penalty for every term in the model, thus it measures on a scale the size and accuracy of a model. Therefore you need enough features for your R² to be large but not too many that it brings the Adjusted R² down.


Permutation Feature Importance

Using the feature selection theory, and to prune the feature set down to those that are meaningful for prediction, you can use a module in Azure ML called Permutation Feature Importance. This essentially re-computes the model a number of times, leaving out each feature and looks at how much your metric changes because the feature was left out, and then ranks them in order of importance. Depending on what you are trying to model, i.e. a classification or regression problem – there are a number of options for the metrics to measure performance. In my instance, I was interested in the RMSE (Root Mean Squared Error) which in simple terms represents the sample standard deviation of the difference between prediction and observed values. It aggregates the magnitudes of error in prediction into a single measure of predictive power. The closer to 0, the better the predictive power – but it’s also good to note this is relative to what you are trying to measure.

Once the model has been run through, you can visualise the list of features and their contribution to the RMSE. At this point, it does not necessarily matter whether the feature contributes a positive or negative value to the RMSE, as long as the value is not 0. Any values of 0 indicate that they have zero contribution to feature importance, essentially whether they are part of the model or not, add nothing to it. You can then follow backward pruning techniques to remove these columns from the feature set. It is then worth running the model again, to check the feature importance as the removal of those features may impact other features. If more features then have a value of 0, you should remove those too, and repeat. You can then measure the impact of the changes using separate pipelines, and passing the output into the same evaluation model, and checking the ROC curve (described below). Even with the RMSE staying the same between the 2 pipelines, by removing features, you are able to build a model which is more likely to generalize be more effective in the real world when values change.




Picking the Best Model Type

There is no reason to believe that any particular machine learning model will have the best performance (although we always have favourites); a classification model type that works best for one set of features and labels in a dataset does not always work best for another. As part of modelling any dataset, testing and comparing multiple machine learning models is usually a good approach. Its also important to note that the performance achieved with any particular machine learning model can change after performing feature engineering, therefore it is best to run the selection after this stage. The following model evaluates logistic regression, boosted decision trees, neural networks and support vectors with the same dataset to find out which is best.




To understand the performance of a machine learning model, there are a number of techniques to use. The easiest way is to pass the output of each model into an Evaluate Model module, which accepts up to 2 datasets at a time (left and right inputs). After the experiment is run, you can visualise the output of the models using this module, and examine the ROC curve. The first scored dataset (blue) represents the original model (in this case a neural network), and the scored dataset to compare against (red) represents the second dataset (in this case a support vector machine). The higher and further to the left the curve, the better the performance of the model (in this case, the neutral network).

Scrolling down further, you can also use the Accuracy, Recall, and AUC performance metrics, which indicate the accuracy and area under the curve. The model with the higher metrics is performing better. In particular, the lower the recall metric, the higher the number of false negatives.




Parameter Sweeping

Once you’ve picked the ML model contributing to making your predictive power better, it will require a set of parameters. For instance, with decision trees, this is in the form of a leaf count to determine depth, or no. of trees to determine width, along with their samples per leaf, and the learning rate. By default there is always a set available, but these will always need tweaking to improve things further and generate a better RMSE.

This can be done by either sweeping a giant grid of parameters, or by a random sweep. The latter being a lot quicker to process at run time for obvious reasons Fortunately, the performance is not normally sensitive to a change in these values if you have done much of the previous analysis first. Parameter sweeping really starts to squeeze the best out of the model.

In Azure ML, this can be done via a tune model hyper parameter module. The same options are available to measure metrics as the feature selection module, so I was interested in the RMSE again. As part of tuning the parameters through this module, we will need to split the training data beforehand, this can be done 50:50. This is so that the parameters have a set of data to validate against. This is then kept separate to the scoring data set as usual which is another completely separate set of data. Once the model has run, we can again evaluate the best parameters, against the original model and evaluate the RMSE, as well as the Accuracy, Recall and AUC. This is very similar to the previous techniques of evaluation. Visualising the sweep results, will display the parameters used, and then these can be programmed back into the original ML model, while removing the tune hyper parameter module, to speed things up on future runs.

A process of nested cross validation can be used on top of this to build confidence that the correct parameters have been used and it wasn’t just luck that they ended up being better than another set.





Once you have been through this process, you will then want to run a process of cross-validation, which runs the data through multiple times (folds) where each time, different data is used for training, and scoring. You can then generate a mean and standard deviation for each fold and prove the model is consistent across the data set, and that it will not be skewed by any new data for future predictions. This will give you a good idea of whether the model will generalise well and be robust enough to move to production.

Of course, there are many more techniques to the ones listed here, but this should give you a good introduction to the ones to look for to deliver predictive power from your model.

Considerations for Creating a Power BI Enterprise Report Deck

Creating or re-creating an Enterprise report deck in Power BI should be reasonably straight forward given a specification, but there are a number of considerations which need to be made when building something of this nature. In the following blog post, I will detail some of these, and the route I would suggest taking. The contents of this blog revolve around more tabular reports than chart visuals, but the same themes can apply.



I think it goes without saying to keep the font consistent across both a single report, and a report deck.  The default for Power BI is Segoe UI which for the most part is pleasant, just be careful not to flick between this and Segoe UI Light as this can cause discrepancies. It is however the font size that will cause you more of an issue.

  1. The first consideration is to set a minimum size. As report developers we want the report to look as tidy as possible, and usually this means fitting everything on 1 page. The easiest way to do this is to set the font smaller if you are having space issues – but this does not always translate so well to the end user. Depending on the device, they may consider the minimum Power BI lets you set (size 8) as too small for consumption on something such as an iPad – so this is worth checking first.
  2. The second consideration is to set the font size for different elements of the report, i.e. row level data at something like a 10, and header level elements at a 12. Anything else that exists such as filter elements should be set the same as the header levels. I would usually set titles a number of points above this, at something like an 18. In general, having varying levels of font size on a single report between elements will look inconsistent so the aim here is consistency!
  3. The third consideration if possible is to keep the font size the same across all the reports within the report deck for the same type of element. Again, this adds a consistent feel to the deck. If one report has more rows than another, in my opinion its still better to use the same font size across both, rather than filling the page on both using varying sizes.
  4. The last consideration is to be careful when mixing text from a textbox and a card together in the same area of the report. Unfortunately Power BI does not currently support expressions like SSRS does, thus a textbox is for static text only. Dynamic text can only be created through a measure and assigned to a card. However having both elements side-by-side with one another does not give the expected result. The font size of the font in a text box is not the same as a card; a card size 10 is roughly equivalent to a text box size 13 (although you can only set 12-14), thus leaving you with some inconsistent fonts between elements. My suggestion is to create measures referring to elements on the report, and use them for both static/dynamic text, thus every textbox will be a card visual and have that consistent look and feel. If you only need static text, stick to text boxes.



The next consideration is around the number of objects on a report – keep it simple.  Avoid building a giant monolithic report, the more objects you use, the slower the report will perform on PBI service, iPad’s and even to develop.  This is especially true for tables/matrices which will each need to fire off separate DAX queries to return the data elements. Too many objects also has knock on effects for exporting to PowerPoint as objects will overlap with one another more which may not be as much of a case within PBI service but will affect other apps. You can use the selection pane (in the view tab) so move objects above/below one another which will bring forward/push back the elements.



Another scenario which I have come across is that sometimes it is necessary to include a column header in between the header for a measure and the actual data – for instance to specify the currency or unit. There are 2 options available; the first is to set the headers of the table as white text and insert text boxes over their position. While this achieves the goal, the final look and feel is not ideal as a large proportion of time can be spent aligning the columns with the text in the text boxes, and even then it can still be pixels out of alignment. Adding/removing measures then means you have to go through the same process again as everything shifts left/right. Fortunately, in the November release of Power BI, they have added the ability to align data within the tables better. A better approach for this scenario is to rename the measures within the table visual to whichever unit you want to show for that column. The downside of this is for a developer you will then need to hover the measures to see where the original measure came from, a small annoyance which is compensated by the amount of time saved trying to do alignment within a table. Also, this means less objects in the report, and less objects will generally create a quicker, more manageable report.

For anyone particularly new to Power BI, you an use the arrow keys to move around a single element pixel by pixel, to help with alignment. There’s also options on the format tab. I’m still waiting for the ability to pixel nudge multiple elements when selected together!




Hopefully you should be guided in terms of colour by a corporate colour scheme. This often comprises of a set of 4 or so main RGB values to use, complimented by a further set of colours. Pick 1 (or even 2) of these from the main set of colours and use that for the report framework, either the border/frame, or report header/footer, and then use another colour for the table headers, or two if the report requires double table headers. Again, consistency is the key across reports within the report deck. If using double headers for the columns, consider using the colours as background behind the header text rather than colouring in the text in the corporate colour. Consider white text on the darker backgrounds.


Parameter Selection

Most reports will contain some kind of slicer visual, to allow the user to change the context of the data – usually by period. As part of the report build, you’ll need to assess where the best position for this is on the report and to keep it consistent between reports within the deck. If your reports will require the real estate going across the page (i.e. wide tables), then consider the top of the report, else if they need the real estate going down the page (i.e. long tables), consider the right hand side. I think by default I would build it at the top, alongside any logos / titles. If you require multiple slicers, maybe move these to the side and build a panel to group them together.

Another little hidden option is that of switching the slicer visual between List/Dropdown (circled red below). For some reason, list is the default but most users will agree that the dropdown is neater, and saves space. I’m not sure why this appears here rather than in the standard visual configuration tab, maybe they will move it soon? The dropdown slicer visual still has some issues which I hope will be fixed soon such as not de-selecting the dropdown after a selection has been made. Another click is required outside of the dropdown to hide the options. This is not the best for the end users, and there seems to be no viable alternative.



Header Logic Swapping

Unfortunately as I mentioned previously, Power BI does not support expressions within tables, and therefore switching context based on a parameter is not easy to achieve. This is possible but it needs to be done entirely within DAX. To keep the DAX measures for this simple, consider specifying the position on the report as the name of the measure. Then within the definition of the measure, reference other created measures and keep the statement simple, allowing anyone debugging the report to trace the switching logic easily. Also use a DAX formatter such as this to make the DAX as readable as possible. It would be nice for this to be included within Power BI, hopefully it will soon! I’ve included an example DAX statement below to provide this functionality.


New Measure = 
IF (
        "Comments describing the logic definition", "",
    BLANK ()



What does all of this lead to? The answer is a template for the report deck. As well as having guidelines for the above items which make up a report, its also good to build a physical .pbix template for your suite of reports. This way, you are not starting from scratch for each report,and you will get a more consistent feel down to the pixel level of where the objects are. Don’t over complicate the template, but leave enough elements on it to save you re-creating them each time you build a new report. I would generally avoid copying from an existing report each time to then build another report, as this will sometimes include elements like PBI defined measures, which you do not want to carry between reports. Instead define a template which you take from each time.



Once decided on a number of these points, it is worth gaining a consensus from the product owner over whether this is acceptable to use moving forward. Do not get to the end of the deck, and demonstrate your decisions across the report set, this will leave you with far too much re-work. Instead aim to deliver maybe one of the more complex reports with a number of the items above, and then apply those decisions to the rest of the report deck.

Generating Usage Statistics from a SSAS Tabular Cube

Once you have users accessing your cube it’s almost inevitable at some point that someone will ask you to generate usage statistics from it, and there are a number of methods to achieve this. In this quick blog post, I’ll detail them and my experiences with each, and then use this data to create a PBI report.


Native Tabular Properties

The first method is natively through the tabular cube properties. This also has the added bonus (read impact) that it will optimise future aggregations based on usage – in fact that’s its main purpose.

This can be done by setting the CreateQueryLogTable to true, setting up the QueryLogConnectionString (to point to the DB where the usage table requires hosting), setting the QueryLogSamping rate (10 means every 10th query will be logged), and finally the name of the QueryLog table.

SQL Server Analysis Services Query Log Related Properties

Advantages of this method is that its very easy to setup with limited knowledge required and it could potentially improve performance if you have an environment where users submit repetitive queries. Unfortunately there are also a number of disadvantages which led me to find other methods. Firstly, it creates a degree of overhead on the cube if its sampling too often; we actually had visible performance related complaints once we turned it on – either through the sampling or change to the “optimised” aggregations. Depending on the sampling rate, you could also find that users who rarely use the cube are not picked up as part of the stats.  As well as this any changes to the cube structure will cause the logging table to be reset. The table is also limited in terms of what it actually logs (as you can see below) – useful if you just want just the user and timestamp info but not much else, and no real ability to configure.

Query the OlapQueryLog table for Cube activity


AS Trace

To that extent, I looked for other tools to do the same task but better and I found AS Trace.

Originally built for SQL Server 2012, it works fine on 2014 – and provides you the ability to run a trace against the cube activities (and log to a table) exactly like the SQL profiler but without the overhead of the GUI which adds unnecessary memory/processor power. It also runs as a windows service allowing it to restart automatically when the server reboots. If this is the case, the tool also logs the existing data to a History table and truncates the logging table. Exactly what I was after.

The tool collects information based on a preconfigured Analysis Services Profiler template, which can be optimised depending on which events you are interested in. I initially ran it using most events selected, and with a limited user set it was generating in the region of 25,000 rows a day. This was clearly not maintainable for a long period of time. I then used the following blog post to understand what each event of the profiler was giving me and then just created a lightweight trace definition file to give me what I wanted. I limited it to Query Begin, Query End (for DAX/MDX statements) and Audit Logon/Logout (for session data).

The setup is very straight forward, just run the install.bat as an escalated privileged account, and check it installs the service correctly. Next, add your SSAS service account to the Logon of the service, make sure the account has “Log on as Service” and membership to the database you are writing to in the form of DDL and DML access, i.e. able to create tables, write to tables – and lastly admin rights to the instance of SSAS you intend to use.

Next, configure the ASTrace.exe.config file with the parameters you want the tool to use. This includes the location of the cube (can handle multiple cubes), the location of the trace definition file, the location of the DB instance and table you want to log to and lastly whether you want to preserve history on restart. The only thing I couldn’t do here, is set the schema of the table it was using to log to, which defaults to dbo.

All that’s left is to start the service, and check the log file to see if it has created any errors on start-up. If not, the table should be created correctly and awaiting input.

I also saw another method while researching using Extended Events (XEvents) but did not implement this once AS Trace provided me with the information I needed.


View / Power BI Report

I initially used the data to run a limited set of queries to extract total users, and total queries for a given time period. This was useful to a degree but from the data collected I realised I could be doing so much more.

This lead me to do some analysis across the type of metrics being logged, and allowed me to create a view on top of the tables of what I thought might be useful on a report. I removed all the redundant columns it was tracking, and created some friendly names for the EventSubclass, and other columns. I used the PATINDEX function to check the query statement for existence of some important values – while not an exact science, it would give me a good picture of the split between certain user groups and KPIs being run. I’ve included the view definition below.

I ended up limiting the data to EventClass 10 as this seemed to capture all the necessary data. The only downside I have seen so far is that users querying through the Power BI web service are anonymised under the service account name. I’m currently looking into options to resolve this which I’ve seen as configuration options on Power BI – to allow through the username as long as it can be matched at the other end.

RowNumber AS ID, 
SPID AS SessionID,
CurrentTime AS DateQueried, 
NTUserName AS Username,
CASE EventSubClass 
WHEN 0 THEN 'MDX Query (Excel)' 
WHEN 3 THEN 'DAX Query (Power BI)' 
END AS QueryType, 
CASE Success WHEN 1 THEN 'Successful Query' ELSE 'Query Error' END AS SuccessfulQuery,
CONVERT(DECIMAL(10,2),CONVERT(DECIMAL(18,3),Duration)/1000) AS DurationSec, 
TextData AS Query, 
CASE PATINDEX('%Mexico%',TextData) WHEN 0 THEN 0 ELSE 1 END AS MexicoMarket,
CASE PATINDEX('%Colombia%',TextData) WHEN 0 THEN 0 ELSE 1 END AS ColombiaMarket,
EventClass = 10

Once I had the view, creating the report was relatively straight forward, and can be seen below.

I included metrics for number of queries by user (blurred out) which also doubled as a filter, the % split of queries for things such as Excel/Power BI, a measure of queries by timeframe, a logarithmic scaled display for queries by query duration, and lastly a split of queries by KPI. I intend to tweak these once I receive more data from the trace, but was relatively happy with the information that they were providing.

Untitled (002)

Please let me know if you have any comments.

A Crash Course in R Part 2

Following on from Part 1 which introduces the R basics (which can be found here), in Part 2 I’ll start to use lists, data frames, and more excitingly graphics.



Lists differ from vectors and matrices because they are able to store different data types together in the same data structure. A list can contain all kinds of R objects - vectors, matrices, data frames, factors and more. They can even store other lists. Unfortunately calculations are less straightforward because there's no predefined structure.

To store data within a list, we can use the list() function, trip <- c("London","Paris",220,3). As mentioned in my previous blog we can use str(list) to understand the structure, this will come in handy. Notice, the different data types.

Subsetting lists is not similar to subsetting vectors. If you try to subset a list using the same square brackets as when subsetting a vector, it will return a list element containing the first element, not just the first element "London" as a character vector. To extract just the vector element, we can use double square brackets [[ ]]. We can use similar syntax to before, to extract the first and third elements [c(1,3)] as a new list, but the same does not work with double brackets. The double brackets are reserved to select single elements from a list. If you have a list inside a list, then [[c(1,3)]] would work and would select the 3rd element of the 1st list! Subsetting by names, and logicals is exactly the same as vectors.

Another piece of syntax we can use is $ to select an element, but it only works on named lists. To select the destination of the trip, we can use trip$destination. We can also use this syntax to add elements to the list, trip$equipment <- c("bike","food","drink").

Interested in testing your knowledge, check out the DataCamp exercises here and here.


Data Frames

While using R, you'll often be working with data sets. These are typically comprised of observations and each observation has a number of variables against it, similar to a customer/product table you'd usually find in a database. This is where data frames come in, as the other data structures are not practical to store this type of data. In a data frame, the rows correspond to the observations, while the columns the variables. Similar to a matrix but we can store different data types (like a list!). Under the hood, a data frame is actually a list but with some special requirements such as vector length, and char vectors as factors. Creating a data frame is usually achieved by importing data from source rather than manually created but you can do this via the data.frame function as shown below:


# Definition of vectors 
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune") 
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",  
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant") 
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883) 
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67) 

# Encoded type as a factor
type_factor <- factor(type) 

# Constructed planets_df
planets_df <- data.frame(planets, type_factor, diameter, rotation, rings, stringsAsFactors = FALSE) 

# Displays the structure of planets_df 


Similar to SQL, sometimes you will want just a small portion of the dataset to inspect. This can be done by using the head() and tail() functions. We can also use dim() to show the dimensions which returns the number of rows and number of columns, but str() is preferable as you get a lot more detail.

Due to the fact a data frame is an intersection between a matrix and a list, the subsetting syntax can be used from both, so the familiar [ ], [[ ]], and $ will work to extract either elements, vectors, or data frames depending upon the type of subsetting performed. We can also use the subset() function, for instance subset(planets_df, subset = has_rings == TRUE). The subset argument should be a logical expression that indicates which rows to keep.

We can also extend the data frame in a similar format to lists/matrices. To add a new variable/column we can use either people$height <- new_vector or people[["height"]] <- new_vector. We can also use cbind() for example people <- cbind(people, “height” = new_vector), it works just the same. Similarly rbind() can add new rows, but you'll need to add a data frame made of new vectors to the existing data frame, the names between the data frames will also need to match.

Lastly, you may also want to sort your data frame - you can do this via the order() function. The function returns a vector with the rank position of each element. For example in a vector of 21,23,22,25,24 the order function will return 1,3,2,5,4 to correspond the position in the rank. We can then use ages[ranks, ] to re-arrange the order of the data frame. To sort in descending order, we can use the argument decreasing = TRUE.


# Created a desired ordering for planets_df
positions <- order(planets_df$diameter, decreasing = TRUE)

# Created a new, ordered data frame
largest_first_df <- planets_df[positions, ]

# Printed new data frame


Interested in testing your knowledge, check out the DataCamp exercises here and here.



One of the main reasons to use R is its graphics capabilities. The difference between R and a program such as Excel is that you can create plots with lines of R code which you can replicate each time. The default graphics functionality of R can do many things, but packages have also been developed to extend this – popular packages include ggplot2, ggvis, and lattice.

The first function to look at is plot() which is a very generic function to plot. For instance, take a data frame of MPs (members of parliament), which contains the name of the MP, their constituency area, their party, their constituency county, number of years as MP, etc. We can plot the parties in a bar chart by using plot(parliamentMembers$party). R realises that their party is a factor and you want to do a count across it in a bar chart format (by default). If you pick a different variable such as the continuous variable number of years as MP, the figures are shown on an indexed plot. Pick 2 continuous variables, such as number of years as MP vs. yearly expenses - plot(parliamentMembers$numOfYearsMP, parliamentMembers$yearlyExpenses) and the result is a scatter plot (each variable holds an axis).

Are these variables related? To make relationships clearer, we can use the log() function- plot(log(parliamentMembers$numOfYearsMP), log(parliamentMembers$yearlyExpenses)). For 2 categorical variables, R handles it differently again, and creates a stacked bar chart giving the user an idea of the proportion of the 2 variables. Notice how the type of variable you use impacts the type of chart displayed.  You will find that the first element of plot forms the x axis and the second element the y axis. For 3 variables, it gets better still – here I used plot(parliamentMembers[c("numOfYearsMP","votesReceived","yearlyExpenses")]). As you can see below, R plots the variables against one another in 9 different plots!




The second function to look at is hist(), which gives us a way to view the distribution of data. We can use this in a similar format to plot by specifying the syntax hist(parliamentMembers$numOfYearsMP). By default, R uses an algorithm to work out the number of bins to split the data into based on the data set. To create a more detailed representation, we can increase the bins by using the breaks argument - hist(parliamentMembers$numOfYearsMP, breaks = 10).

There are of course, many other functions we can use to create graphics, the most popular being barplot(), boxplot(), pie(), and pairs().


Customising the Layout

To make our plots more informative,we need to add a number of arguments to the plot function. For instance the following R script creates the plot below:


xlab = "Number of votes", 
ylab = "Year", 
main = "Starting Year of MP vs Votes Received", 
col = "orange",
col.main = "black",
cex.axis = 0.8
pch = 19

# xlab = horizontal axis label
# ylab = vertical axis label
# main = plot title
# col = colour of line
# col.main = colour of the title
# cex.axis = ratio of font size on axis tick marks
# pch = plot symbol (35 different types!!)



To inspect and control these graphical parameters, we can use the par() function. To get a complete list of options we can use ?par to bring up the R documentation. By default, the parameters are set per plot, but to specify session-wide parameters, just use par(col = “blue”).


Multiple Plots

The next natural step for plots is to incorporate multiple plots – either side by side or on top of one another.

To build a grid of plots and compare correlations we can use the mfrow parameter, like this par(mfrow = c(2,2)) by using par() and passing in a 2x2 vector which will build us 4 subplots on a 2 by 2 grid. Now, when you start creating plots, they are not replaced each time but are added to the grid one by one. Plots are added in a row-wise fashion - to use column-wise, we can use mfcol. To reset the graphical parameters, so that R plots once again to a single figure per layout, we pass in a 1x1 grid.

Apart from these 2 parameters, we can also use the layout() function that allows us to specify more complex arrangements. This takes in a matrix, where you specify the location of the figures. For instance:


grid <- matrix(c(1,1,2,3), nrow = 2, ncol = 2, byrow = TRUE)

[1]  1  1
[2]  2  3

plot(dataFrame$country, dataFrame$sales)
plot(dataFrame$time, dataFrame$sales)
plot(dataFrame$businessFunction, dataFrame$sales)

# Plot 1 stretches across the entire width of the figure
# Plot 2 and 3 sit underneath this and are half the size


To reset the layout we can use layout(1) or use the mfcol / mfrow parameters again. One clever trick to save time is to store the default parameters in a variable such as old_par <- par() and then reset once done using par(old_par).

Apart from showing the graphics next to one another, we can also stack them on top of one another in layers. There are functions such as abline(), text(), lines(), and segments() to add depth to a graphic. Using lm() we can create an object which contains the coefficients of the line representing the linear fit using 2 variables, for instance movies_lm <- lm(movies$rating ~ movies$sales). We can then add it to an existing plot such as plot(movies$rating, movies$sales) by using the abline() (adds the line) and coef() (extracts the coefficients) functions, for instance abline(coef(movies_lm)). To add some informative text we can use xco <- 7e5 yco <- 7 and text(xco, yco, label = “More votes? Higher rating!”) to generate the following:




Interested in testing your knowledge, check out the DataCamp exercises here, here and here.



This pretty much concludes the crash course in R. We’ve looked at the basics, vectors, matrices, factors, lists, data frames and graphics – and in each case looked at arithmetic and subsetting functions.

Next I’ll be looking at some programming techniques using R, reading/writing to data sources and some use cases – to demonstrate the power of R using some slightly more interesting data sets such as the crime statistics in San Francisco.




A Crash Course in R Part 1

What is R?  Also known as the language for statistical computing, it was developed in the 1990s, and provides the ability to use a wide variety of statistical techniques and visualization capabilities across a set of data. 

Pros for the language include the fact its open source, it has great graphical capabilities, runs via a CLI (provides the ability to script and automate), has a huge community behind it, and is gaining a wider adoption in business!

R can be used by a number of tools; the most common are R Tools for Visual Studio, RStudio standalone, or more recently R Scripts in Power BI.



Let’s start with the absolute basics. If you print 1+2 to the command line, the console will return 3. Print text to the command line, the console will return the same body of text.

Everyone loves variables. To be able to store values in variables, we can use the syntax apples <- 4 and pears <- 2 to the command line to store 4 in the apples, and 2 in the pears variable. There’s no print out here because its a variable. We can then do total_fruit <- apples + pears to create a new variable using existing variables.

As you create variables, you create a workspace, which you can reference using ls(). This details all the variables created within the R session. You can then use rm(variable) to clean up the workspace to maintain resource.

Now, not everyone loves commenting but you can comment your code via #. Here’s a simple example script to calculate the volume of a circle:

# Create the variables r and R 
r <- 2 
R <- 6 

# Calculate the volume of a circle
vol_circle <- 2*pi^2*r^2*R 

# Remove all intermediary variables that you've used with rm() 

# List the elements in your workspace 

[1] "vol_circle"


Data Types

As with any language, there are a number of data types supported:

  • TRUE/FALSE are "logical"
  • "This is text" is "character"
  • 2 or 2.5 are "numerics”
  • You can add L to numeric such as 2L to call this number an integer (the outputs the same, but the class is different). Here we have what's known as a type hierarchy.
  • Other types include double, complex, and raws.

We can use the function class() to determine the data type of variable. We can also use the dot function as to coerce or transform between the data types, such as as.character(4) = "4" and as.numeric("4.5") = 4.5.  To evaluate the type use is.numeric(2) = TRUE or is.numeric(“2”) = TRUE.  NA is returned when trying to convert “hello” to numeric.

Interested in testing your knowledge, check out the DataCamp exercises here and here.



A vector is a sequence of data elements of the same data type which can called using the c() function. Vectors can be of all the types seen previously.  The three properties of a vector are type, length, and attributes.

For example, a vector to provide us UK Government Parties (who doesn’t love politics) and assigned to the variable parties can be built parties <- c("Labour","Conservative","Libdems","SNP"). A check can done to see if the variable is of a vector type similar to before using is.vector(parties).

But what if the vector contains data that has slightly more meaning behind it, for instance seat_count <- c(262,318,12,35). You can attach labels to the vector elements by using the names() function: names(seat_count) <- parties.  You could also do this using one line - c(Labour = 262, Conservative = 318, Libdem = 12, SNP = 35).

The variables created previously are actually stored in a vector of length 1. R does not provide a data structure to hold a single number/character string.  If you do build a vector of different data types, R performs coercion to make sure they are all of the dame data type. For instance c(1,5,"A","B","C","D") becomes "1","5","A","B","C","D".  If you need to differentiate data types, you can use a list instead!


Vector Arithmetic

Computations can be performed between vectors and are done so in an element-wise fashion. This means you can do earnings <- c(10,20,30) * 3 to generate [30] [60] [90].

You can also do vector minus vector so earnings – expenses (again done element wise). Multiplication and division using this method does not result in a matrix!

Other functions include sum(bank) to sum all elements of the vector, sum(bank > 0) to return a count of elements in the vector (given bank contains numerics) or sum(bank == x) to return the count of element x in a vector.


Vector Subsetting

As the name suggests, this is basically selecting parts of your vector to end up as a new vector (a subset of your original vector).

If you want to select the first element from our seat_count vector, we write seat_count[1] and this will return Labour 262. Both the value and name are kept from the original vector. We can also use the name, so seat_count[“Labour”] will return the same result.

If you want to select multiple elements, you can use the syntax seat_count[c(1,4)] by passing in a vector. To select elements in a sequence you can use 2:5 instead of 2,3,4,5. You can also subset via an inverse, by using the syntax seat_count[-1] which returns all the seats, apart from the element in [1].

One last method to create subsetting is by logical vectoring, so by specifying seat_count[c(TRUE,FALSE,FALSE,FALSE)] we can return the equivalent of [1]. R is also able to “recycle” this type of vectoring so if your logical vector is length 2, it will loop over itself to fit the vector you are subsetting.

Remember we can also use the arithmetic from the previous section, to select our vector contents for subsetting, examples include main_parties <- seat_count[seat_count > 50].

Interested in testing your knowledge, check out the DataCamp exercises here, here, and here.



While a vector represents a sequence of data elements (1 dimensional), a matrix is a similar collection of data elements but arranged as rows/columns (2 dimensional) – a very natural extension to a vector.

To build a matrix you’ll need to use the following format; matrix(1:6, nrow = 2) which creates a 2-by-3 matrix for values 1-6. You can also specify columns rather than rows by using ncol = 3. R infers the other dimension by using the length of the input vector. To fill the vector by row instead of by columns, you can use the argument byrow = TRUE.

Another way to create a matrix is by using the functions rbind() and cbind(). These essentially take the 2 vectors you pass the function and stick them together. You can also use these functions to bind together a new vector with an existing matrix. For example my_matrix <- rbind(matrixA, vectorA)

To name the matrix, you can use rownames() and colnames(). For example rownames(my_matrix) <- c(“row1”,”row2”) and colnames(my_matrix) <- c(“col1”,”col2”).

You can also create a matrix using a one-liner, by using dimnames() and specifying a list of vector names.

my_matrix <- matrix(1:6, byrow = TRUE, nrow = 2,
dimnames = list(c("row1", "row2"), c("col1","col2","col3")))

Similar to vectors, matrices are also able to recycle themselves, only store a single atomic data type and perform coercion automatically.

Continuing on from vectors, matrices can also be subsetted. If you’re after a single element, you’ll need to specify both row and column elements of interest using the syntax m[1,3] for row 1 column 3. If you’re looking to subset an entire row or column, you can use the syntax m[3,] (notice the missing column value) to select the entirety of row 3. Columns can be selected using the inverse via m[,3]. You can also select multiple elements using a simple methodology to vectors. This can be achieved by using the syntax m[2, c(2,3)] to select the 2nd and 3rd column values of row 2. Subsetting the names works just the same as by index, you can even use a combination of both! The same is true of subsetting by a logical vector – just use c(FALSE,TRUE,TRUE) for the last 2 rows of a 3 row matrix. You can see some examples below.

# Create a matrix using 2 vectors
my_mega_matrix <- cbind(vectorA, vectorB)

# Subset the matrix to get columns 2 and 3
my_subsetted_mega_matrix <- my_mega_matrix[,c(FALSE,TRUE,TRUE,FALSE]

# Subset the matrix using names for columns 1 to 4
my_alt_subsetted_mega_matrix <- my_mega_matrix[,c("A","B","C","D")]

# Calculate totals for the columns 2 and 3
total_mega_matrix <- colSums(my_subsetted_mega_matrix)

As seen above there are another 2 functions we can use on matrices namely colSums() and rowSums() to do column and row arithmetic. This is addition to other standard arithmetic. All computations are performed element wise. So we can do my total_mega_matrix * 1.3 to convert the totals in GBP to USD (as an example). Performing calculations using 2 matrices is just the same (matrixA – MatrixB). Be careful here though, if they contain the same number of elements, everything will be done element wise, else recycling will occur.

Notice the similarity between vectors and matrices – they’re both data structures that store elements of the same type. Both perform coercion, and recycling. Arithmetic is also similar as everything is performed element wise.

Interested in testing your knowledge, check out the DataCamp exercises here, here, and here.



Unlike numeric variables, categorical variables can only take a limited number of different values. The specific data structure for this is what is known as a factor. A good example of this is blood, types can only be of type A, B, AB, or O – we then define a vector of peoples blood types blood <- c(“B”,”AB”,”O”,”O”,”A”,”B”,”B”). To convert this vector to a factor we can use factor(blood). R scans the vector to check for categories, and stores the distinct list as levels (sorted alphabetically). Values in the vector and then replaced with numeric values corresponding to the associated level. You can think of factors as integer vectors, where each integer refers to a category or level. To inspect the structure, we can use str(factor).

Similar to the names() function, you can also specify the levels() function and pass a vector to name the levels differently to those categories picked up in the scan, for instance levels(my_factor) <- c(“BT_A”,”BT_B”,”BT_O”,”BT_AB”). However its much safer to pass in both the levels and the labels because of the way in which it sets the levels alphabetically which means you have to be careful your names correspond correctly to the levels.

In statistics, there is also a difference between nominal categorical variables and ordinal categorical variables – nominal variables have no implied order, i.e. blood type A is not greater or less than B. There are examples however where ordering does exist, for example with t-shirt sizes, and you can use R to impose this on the factor. To do this, you can set the ordered function inside the vector to TRUE, and then specify the levels in the correct order. You can now evaluate the levels in the factor. An example can be seen below.

# Definition of temperature_vector
temperature_vector <- c("High", "Low", "High", "Low", "Medium")

# Encoded temperature_vector as a factor
temperature_factor <- factor(temperature_vector, 
                             ordered = TRUE,
                             levels = c("Low","Medium","High")

# Print out

Interested in testing your knowledge, check out the DataCamp exercises here.


This is only just the start of understanding R – in the next blog I’ll look at lists, data frames and most importantly graphics! We can then start to looking at some more complicated examples and use cases.

Extracting Users from AD using SSIS

I've recently been working on a project which required KPI level security alongside the traditional row level security secured at a geography level. This would limit what financial data a user could see within a cube, without having to create multiple cubes or use perspectives (which would not actually secure the data).

To achieve this, I needed to populate a set of 'KPI User/Role' tables stored in Master Data Services (MDS) with a list of users who were stored in a particular AD group. I would need these tables updated on a regularly basis to grant/revoke access. We could then use these names along with the USERNAME() function in DAX to filter.


The Solution

One method to solve my problem would be by using SSIS. The package could  be setup to run as part of a SQL Agent Job, either by a schedule or on demand. My list of users were stored in an AD group called LH_FIN.

image image

To start with you will need to truncate and clear your MDS staging tables that you are about to populate.  You can then use the data flow to process the majority of the logic, by creating a script component task. The purpose of this is to loop through Active Directory and pick up the user details that belong to the specified AD Group or set of AD groups if dealing with multiple roles.

A number of variables are defined which the task uses to complete the lookup.

  • strLDAP – the LDAP directory on which to perform the lookup
  • strDomain – the domain on which the AD group(s) belong
  • strADPrefix – the AD group prefix from which to return user information about
  • strADParent – the parent group which contains the AD groups which you are looking up (may not need to be used if only looking up a single AD group)

To extract users from multiple groups, make sure the prefix stored in the variable strADPrefix covers both groups. Once the rows are extracted it would then be a case of using SSIS to split the data accordingly on the AD Group Name.

The following code can be used in the script:

The first section sets up the objects required to interrogate the directory, and the fields we expect to return from the accounts – the most important of which is memberof which is used to check versus our AD Prefix. It also filters out items such as service accounts and disabled accounts to speed up the interrogation process.

    Public Overrides Sub CreateNewOutputRows()

        Dim domain As String = Variables.strDomain
        Dim searchRoot As New DirectoryEntry(Variables.strLDAP, Nothing, 
                                             Nothing, AuthenticationTypes.Secure)
        Dim dirSearch As New DirectorySearcher(searchRoot)
        dirSearch.SearchScope = SearchScope.Subtree
        'LogonName, GroupsUserBelongsTo, Department, JobTitle, MailAddress, DisplayName
        'filter to user objects
        dirSearch.Filter = "(objectCategory=person)" 
        'filter to user objects
        dirSearch.Filter = "(objectClass=user)" 
        'filter out disabled accounts
        dirSearch.Filter = "(!userAccountControl:1.2.840.113556.1.4.803:=2)" 
        'filter out password never expires accounts, i.e. service accounts
        dirSearch.Filter = "(!userAccountControl:1.2.840.113556.1.4.803:=65536)" 
        'sets chunk size for retrieving items
        dirSearch.PageSize = 1000

The next section of code performs the search, and for any LDAP objects it finds within the filter set, returns the properties requested. These properties are then stored in key/value pairs.

        Dim props As ResultPropertyCollection
        Dim values As ResultPropertyValueCollection
        Dim key As String
        Dim userAccountName As String
        Dim departmentHome As String
        Dim jobtitle As String
        Dim GroupName As String
        Dim email As String
        Dim displayName As String
        Dim groups As New ArrayList

        Using searchRoot
        'Return all LDAP objects, LDAP://acl/CN=Tristan Robinson,OU=Employees,DC=ACL,DC=local
        'CN = Common Name, OU = Organisational Unit, DC = Domain Component
            Using results As SearchResultCollection = dirSearch.FindAll()
                For Each result As SearchResult In results
                    'For each object return properties, i.e. displayname, memberof, etc
                    props = result.Properties
                    For Each entry As DictionaryEntry In props
                        key = CType(entry.Key, String)
                        'For each property, inspect the property and record its value
                        'Logon Name
                        If key = "samaccountname" Then
                            values = CType(entry.Value, ResultPropertyValueCollection)
                            userAccountName = CType(values.Item(0), String)
                        End If
                        If key = "department" Then
                            values = CType(entry.Value, ResultPropertyValueCollection)
                            departmentHome = CType(values.Item(0), String)
                        End If
                        'Job Title
                        If key = "title" Then
                            values = CType(entry.Value, ResultPropertyValueCollection)
                            jobtitle = CType(values.Item(0), String)
                        End If
                        If key = "mail" Then
                            values = CType(entry.Value, ResultPropertyValueCollection)
                            email = CType(values.Item(0), String)
                        End If
                        'Display Name
                        If key = "displayname" Then
                            values = CType(entry.Value, ResultPropertyValueCollection)
                            displayName = CType(values.Item(0), String)
                        End If
                        'Groups User Belongs To (array/collection)
                        If key = "memberof" Then
                            values = CType(entry.Value, ResultPropertyValueCollection)
                            groups = GetGroups(values)
                        End If

The final section filters the data into the output buffer if from the array list we’ve extracted above, we have matching strings from our original AD Prefix variable. It will then reset, and loop round for the next account.

                    'Export user details to buffer if it passes the logical test
                    For Each item As String In groups
                        'Avoids computer accounts, i.e. ending with $
                        If userAccountName.EndsWith("$") = False 
                        And item.ToString.StartsWith(Variables.strADPrefix) 
                        Then 'And item.ToString <> (Variables.strADParent) 
                            If String.IsNullOrEmpty(userAccountName) Then
                                Output0Buffer.UserAccountName_IsNull = True
                                Output0Buffer.UserAccountName = userAccountName
                            End If
                            If String.IsNullOrEmpty(domain) Then
                                Output0Buffer.Domain_IsNull = True
                                Output0Buffer.Domain = domain
                            End If
                            If String.IsNullOrEmpty(item.ToString) Then
                                Output0Buffer.GroupName_IsNull = True
                                Output0Buffer.GroupName = item.ToString
                            End If
                            If String.IsNullOrEmpty(jobtitle) Then
                                Output0Buffer.JobTitle_IsNull = True
                                Output0Buffer.JobTitle = jobtitle
                            End If
                            If String.IsNullOrEmpty(email) Then
                                Output0Buffer.Email_IsNull = True
                                Output0Buffer.Email = email
                            End If
                            If String.IsNullOrEmpty(displayName) Then
                                Output0Buffer.DisplayName_IsNull = True
                                Output0Buffer.DisplayName = displayName
                            End If
                        End If
                    userAccountName = ""
                    departmentHome = ""
                    jobtitle = ""
                    GroupName = ""
                    email = ""
                    displayName = ""                    
            End Using
        End Using
    End Sub

I also required a function to split the list of groups a user belonged to and store them in another array list.

    Private Function GetGroups(ByVal values As ResultPropertyValueCollection) As ArrayList
        Dim valueList As ArrayList = New ArrayList()
        For Each Item As Object In values
            Dim memberof As String = Item.ToString()
            Dim pairs As String() = memberof.Split(",".ToCharArray)
            Dim group As String() = pairs(0).Split("=".ToCharArray)
        Return valueList
    End Function

End Class

Once a list of users has been extracted, you will need to do a lookup against the existing list and only stage those that are new. This can be achieved through a simple lookup component . You can then move the rows into the MDS staging table ready for the load into MDS.

After the data flow has processed successfully, the next stage is to sweep the records into MDS using one of the built in stored procedures in the product. The results of which can be seen in the MDS Import View on the MDS site.

The following entities can then be used to power the DAX:

  • User (Username, Team, AD Login – populated from the script above)  
  • Role (Role – populated manually)
  • User Role (User, Role – joins users to roles, populated manually)
  • KPI (KPI – populated manually)
  • KPI Role (Role, KPI – joins roles to KPIs, populated manually)

These are then processed to the DW by ETL, and a UserKPI view is written across the tables to provide an AD Login to KPI pair.

For filtering the measures in the cube, you can then apply the following piece of DAX to the original measure – Gross Sales in the example below:

Gross Sales:=
        CALCULATETABLE( 'UserKPI', UserKPI[KPIName] = "Gross Sales"),
        UserKPI[ADLogin], USERNAME()
    [Gross Sales (ACTUAL)]
    BLANK ()

This concludes this blog post – hopefully this will be useful for anyone that wants to extract users from AD to control access / security within an application. Data could also be loaded direct into DB tables rather than MDS if required. 

Please feel free to comment!

Modelling Survey Style Data

In a recent project, one of the business areas which required modelling was that of survey data. This can often be quite tricky, due to the fact the data is not so quantitative in nature as other business areas such as sales.

How do you go about calculating measures against this type of data? In this blog I will go about explaining a relatively simple model you can use to achieve this goal.


Master Data Services

To aid with modelling I used Master Data Services (MDS) to help map the questions and responses from the surveys.

The idea behind using MDS is that regardless of the questions asked, whether it be in one format or another, English or Spanish, you can map them to a set of master questions. Usually these will align closely to your measures/KPIs and should be relatively finite in number so that the mapping process is feasible for a business user. In my business case, the master set of questions revolved around items such as quality, price, and promotion of products. For responses, I needed to map these to a flag which indicated we wanted the response to contribute towards a measure and ultimately a KPI.

I first created the following entities in MDS.

  • Survey (holds the survey name, a unique identifier for the survey, and in my case I also created a master survey lookup to group the surveys together)
  • Source Question  (holds the distinct set of questions assigned to each survey, along with identifying codes, and question orders - each question would also be mapped to a master question) 
  • Source Response (holds a set of response options for each question, along with identifying codes)
  • Master Question (holds the master set of questions and joins to the KPIs)
  • KPI (holds a list of KPIs that you need to address by aggregating response data)
  • Response of Interest (holds a list of responses that are regarded as positive / contributing towards the KPI when answered by the question)
  • Response of Interest Mapping (allows the user to map the response options available on each question to a response of interest)

In terms of the response of interest, I was only interested in responses where the answer from the survey was “Yes” so this was all that was required here. However for more complex response options, the model can provide the scalability required. For instance, if you were looking for an answer between 7-10 and the survey had been answered with a 7, 8, 9, or 10 – each of these could be mapped to 7-10 without having to create responses of interest for all particular combinations. This scales well and can cover scenarios for instance where the answer should be between 700 to 1000 in the same way.

I also created a Master Question and Response of Interest value for N/A. This way, only the blanks on the mapping entities required populating and the user was never unsure whether a blank represented a question/response that was not of interest, or something that required mapping still.

All the entities above apart from Master Question, KPI, and Response of Interest were populated automatically from ETL with a SQL script used to extract the contents of those entities from source. The other 3 entities were populated manually by a business user. I also locked the entities / columns that the user shouldn’t be changing by using read-only permissions.

Some examples of the manually populated tables can be seen below:




Data Warehouse

For modelling the tables in the data warehouse, I created a separate dimension for each of the Response, Question, Survey, and KPI entities, and a single Fact to capture the responses of interest against these dimensions.

The majority of dimension lookups were straight forward along with the response of interest measure which can be seen below:

SU.Code AS SurveyId, 
SR.Name AS ResponseName, 
1 AS ResponseOfInterest 
FROM mdm.PL_ResponseOfInterestMapping RM 
INNER JOIN mdm.PL_SourceResponse SR
ON RM.SourceResponse_Id = SR.Id AND RM.Survey_Id = SR.Survey_Id 
INNER JOIN mdm.PL_Survey SU 
ON RM.Survey_Id = SU.Id 
RM.ResponseOfInterest_Code IS NOT NULL AND RM.ResponseOfInterest_Name <> 'N/A'

During our ETL runs for the fact we also checked for responses that had not been yet mapped – and did not pull these through.

If you then have a cube sat on top of your DW, you can then write measures across the fact to count the number of responses of interest. An example of which can be seen here:

Price Activation Standard:=
        CALCULATETABLE('Outlet Survey','KPI'[Sales Driver] = "Price"),
        'Outlet Survey'[IsResponseOfInterest] = 1

This was then checked against a Target fact table to calculate the compliance and the KPI was essentially an aggregation of the compliance across geography.



Overall, the model has proved very popular with the business. It’s easy to understand and gives the business control over which responses to count towards the KPI – without having to hard code values into the ETL which had been seen in previous solutions. It can also be extended easily be adding new KPIs and mapping them to new master questions without having to change any ETL. From a development perspective it also means that nothing should go into the DW as Unknown for a Dimension since the SQL to populate MDS, can also be used for the DW and therefore should always match.

If you have any questions, please feel to ask them in the comments.

Upgrading your BI Solution (2012 to 2014)

In a recent project I was asked to upgrade a BI solution from SQL Server 2012 to 2014.  The aim of which was to consolidate a number of database sources onto a single version of SQL Server. This blog looks at the steps required in order to achieve the upgrade process, along with issues I hit along the way.


Where do I start?

In order to develop on SQL Server 2014, you’ll need to update your development tools! This involves downloading the following applications.




SSDT BI for VS 2013

Allows you to modify and deploy SSIS, SSAS, SSRS files. I would then suggest downloading Update 5 from within the application. This resolved some source control issues for me.

SSDT Database Designer for VS 2013

Allows you to modify and deploy SQL related files.

Team Explorer for VS 2013

Sometimes this is bundled in together with SSDT but not the case for 2013.

BIDS Helper for 2014

This is a must have for any solution with a tabular cube. It allows the user more functionality to customise the cube and organise measures.


You’ll also need to update any custom components within the solution. For me, this was the 2014 Attunity drivers (3.0) to connect to our Oracle sources and some custom Adatis Framework components, both will need GACing. Remember that if you develop using 32 bit components you may need to download the equivalent 64 bit components once your solution is deployed to a server.

Once you have the necessary tools locally, you can start the upgrade process. Open up the solution in your new environment.


The Solution Upgrade Process

The first thing I noticed was that during the upgrade I lost all my SSIS components layout and comments when opening up the packages in VS 2013 that were developed in VS 2012. The code was still in the package file, but did not seem to do anything. I’m not sure this happens for everyone as the information I found to resolve the issue was not particularly well documented around the internet. Its possible that upgrading to Update 5 beforehand may resolve this issue.  To fix the the problem, I had to open up the packages code and search for the line that says version=“sql11”. This needs to be changed to “sql12”. The next time you open the package, everything should appear again. I also found that if you open up the package first in 2013, the application will automatically change this to sql12 however the comments/layout did not seem to appear. You also won’t be able to amend the file at this point so will need to roll back if using version control such as TFS.

Another issue I ran into was that even after I ran this process, a few developers were unable to see the comments still. This is due to the fact they were running the dark themed Visual Studio which did not invert the text colour on the comments therefore making them appear hidden. Again, probably another bug with the base version of SSDT I downloaded off the Microsoft website before upgrading to Update 5.

At this point you can start to run the packages through the Upgrade Package Wizard (if you right click on the SSIS packages folder in the project) .

upgrade      upgrade2

If packages are not appearing in the wizard, then the application is not recognising the packages as coming from a previous version of SQL Server. Its possible that even after running through the wizard, it reports the packages as having failed the upgrade. It’s fairly safe to ignore this, and is usually the result of custom components or similar.

Depending upon how many packages you need to upgrade you are now faced with the unfortunate task of replacing any of the custom components in those packages, with the new ones you downloaded to work with SQL Server 2014. In my case, this was the custom Adatis components in the Control Flow (Initiate Task/Terminate Task – remember to set LoadStatusType to Success on the Terminate properties), Data Flow (Clean – Data Cleansing/Standardise/Validate Columns), and Event Handlers (Row Throughput/Terminate Task on Error – remember to set LoadStatusType to Failure on the Terminate properties).

In some cases, components before the upgrade display a different icon than afterwards so its easy to tell what version you have. When this is not the case, you’ll have to check the package code.

BeforeReplaceComponents     AfterReplaceComponents

Following this, I then modified any connection strings I had within the solution to point to our new server. If you are not changing servers during your upgrade, just skip this step.

The next step is to update the DB projects to 2014 via the properties of the project. Very simple.


The final step needed within the solution is to upgrade the tabular model (if you have one). Firstly, make sure the model is open and then via the model properties, select the SQL Server 2014 compatibility level. Note – that if the model is not open, then this field will display blank.



Master Data Services
As part of my upgrade, I also had to upgrade our Master Data Services (MDS) to 2014. One of the pre-requirements for this is to have an IIS Web Server installed. If you’re provided a fresh clean server like I was then you’ll need to install this. This can be done through the Server Manager. Make sure the following features are added.

Web Server

  • Common HTTP Features
  • Static Content
  • Default Document
  • Directory Browsing
  • HTTP Errors
  • DO NOT INSTALL WebDAV Publishing

Application Development

  • ASP.NET 3.5/4.5
  • .NET Extensibility 3.5/4.5
  • ISAPI Extensions
  • ISAPI Filters

Health and Diagnostics

  • HTTP Logging
  • Request Monitor


  • Windows Authentication
  • Request Filtering


  • Static Content Compression

Management Tools

  • IIS Management Console
  • .NET Framework 3.0 Features
  • WCF Activation
  • HTTP Activation
  • Non-HTTP Activation

Windows Process Activation Service

  • Process Model
  • .NET Environment
  • Configuration APIs

The MDS configuration tool is quite good as it identifies if you are missing any pre-reqs. Once you have the pre-requisites for MDS installed as well as MDS 2014 itself, you can begin the upgrade process – which comes in 2 parts, the database, and the web site

Unfortunately, model deployment packages can be used only in the edition of SQL Server they were created in. You cannot deploy model deployment packages created in SQL Server 2008 R2/SQL Server 2012 to SQL Server 2014. Therefore you’ll need to take a database-level backup of your MDS database and restore to the new 2014 instance. In my case, I also needed to give our service account that was running SQL Server security access to the folder in which the backup was stored as it wasn’t able to see it.

Once you have successfully restored the previous MDS database to the new instance, you will need to load up the Master Data Services Configuration tool. To start with click on Database Configuration. If you then select the database you have just restored, it will suggest the database is not compatible with the current version of MDS and requires upgrading. Click on the Upgrade Database button and you should now have a compatible 2014 MDS database. I also had to re-set up the MDS mail profile at this point as this does not transfer across with the database.

The next step is to configure a new MDS website which can be done via the Web Configuration button. Firstly, select the default website that is created when you install IIS. There is no need to create anything separate – in fact this caused me more hassle than it was worth. You can then go about creating the site which requires you to specify an alias, and the application pool information (name and service account information).

Create WebApp

You are then required to associate both the database and web site together. At this point, you should be able to load up the new MDS web site and explore your old model.

One issue I ran into that stopped me loading into MDS at this point was that the service account you specify above will then need to be added to a local security group called “Log on as a batch job”. This can be done via the Computer Management application, then Local Users and Groups, Groups, Log on a batch job.

I also had some minor issues when loading MDS where the page was malformed and I was unable to see the database. I resolved this by providing the service account access to the web config of the web site directly.


Deploying & Testing

At this point, all the necessary requirements should be in place for you to deploy and test the upgrade. Deploy the solution as you would normally.

Your first step once your at this point is to add a new SSISDB Catalog to your new instance.  If like me you did this via the old 2012 Management Studio, you’ll run into some errors potentially. Instead, load up the new 2014 Management Studio and you should be fine.

Depending on how your solution is setup, you’ll need to re-confirm any environment variables or script them out again and then configure the SSISDB projects to use them.

Then you will need to re-setup any SQL Agent Jobs you had on the previous instance, again this can be scripted across – just make sure you change the connection details.

Following this, make sure you GAC any components required on the new server similar to what you did locally.

Lastly, you will also need to add your SSAS Service Account into the new SSAS instance.

You can then run the solution out as per usual and check your execution status reports for any irregularities.



Hopefully this covers everything you need to upgrade your solution. Whether its from 2008/2012 to 2014 or 2014 to 2016, the steps should be roughly the same, you might just run into slightly different issues. Generally the process is fairly straightforward, the hardest part was keeping track of the number of steps required and then ticking them off as you do them. If you need to do this for both development and production servers, I’d advise running the steps side by side so they are set up in a similar fashion. Please let me know in the comments below if you’ve had any similar experiences / issues.