Tristan Robinson's Blog

Generating Usage Statistics from a SSAS Tabular Cube

Once you have users accessing your cube, it’s almost inevitable that at some point someone will ask you to generate usage statistics from it, and there are a number of methods to achieve this. In this quick blog post, I’ll detail each method and my experiences with it, and then use the data to create a Power BI report.

 

Native Tabular Properties

The first method is natively through the tabular cube properties. This also has the added bonus (read impact) that it will optimise future aggregations based on usage – in fact that’s its main purpose.

This can be done by setting CreateQueryLogTable to true, setting up the QueryLogConnectionString (to point to the database where the usage table should be hosted), setting the QueryLogSampling rate (10 means every 10th query will be logged), and finally setting the name of the QueryLog table.

SQL Server Analysis Services Query Log Related Properties

The advantages of this method are that it’s very easy to set up with limited knowledge required, and it could potentially improve performance if you have an environment where users submit repetitive queries. Unfortunately there are also a number of disadvantages, which led me to find other methods. Firstly, it creates a degree of overhead on the cube if it’s sampling too often; we actually had visible performance-related complaints once we turned it on – either through the sampling or the change to the “optimised” aggregations. Depending on the sampling rate, you could also find that users who rarely use the cube are not picked up as part of the stats. As well as this, any change to the cube structure will cause the logging table to be reset. The table is also limited in terms of what it actually logs (as you can see below) – useful if you just want the user and timestamp info, but not much else, and there is no real ability to configure it.

Query the OlapQueryLog table for Cube activity

 

AS Trace

To that end, I looked for other tools to do the same task but better, and I found AS Trace.

Originally built for SQL Server 2012, it works fine on 2014, and provides the ability to run a trace against the cube activity (and log it to a table) exactly like SQL Server Profiler, but without the overhead of the GUI, which adds unnecessary memory and processor load. It also runs as a Windows service, allowing it to restart automatically when the server reboots. When it restarts, the tool also logs the existing data to a history table and truncates the logging table. Exactly what I was after.

The tool collects information based on a preconfigured Analysis Services Profiler template, which can be optimised depending on which events you are interested in. I initially ran it using most events selected, and with a limited user set it was generating in the region of 25,000 rows a day. This was clearly not maintainable for a long period of time. I then used the following blog post to understand what each event of the profiler was giving me and then just created a lightweight trace definition file to give me what I wanted. I limited it to Query Begin, Query End (for DAX/MDX statements) and Audit Logon/Logout (for session data).

The setup is very straightforward: just run install.bat under an account with elevated privileges and check that it installs the service correctly. Next, add your SSAS service account to the Log On tab of the service, and make sure the account has the “Log on as a service” right, membership of the database you are writing to with DDL and DML access (i.e. able to create tables and write to them), and lastly admin rights to the instance of SSAS you intend to trace.

Next, configure the ASTrace.exe.config file with the parameters you want the tool to use. This includes the location of the cube (it can handle multiple cubes), the location of the trace definition file, the location of the database instance and table you want to log to, and lastly whether you want to preserve history on restart. The only thing I couldn’t do here is set the schema of the table it logs to, which defaults to dbo.

All that’s left is to start the service, and check the log file to see if it has created any errors on start-up. If not, the table should be created correctly and awaiting input.

I also saw another method while researching using Extended Events (XEvents) but did not implement this once AS Trace provided me with the information I needed.

 

View / Power BI Report

I initially used the data to run a limited set of queries to extract total users, and total queries for a given time period. This was useful to a degree but from the data collected I realised I could be doing so much more.

This led me to do some analysis across the type of metrics being logged, and allowed me to create a view on top of the tables containing what I thought might be useful on a report. I removed all the redundant columns it was tracking, and created some friendly names for the EventSubclass and other columns. I used the PATINDEX function to check the query statement for the existence of some important values – while not an exact science, it gives a good picture of the split between certain user groups and the KPIs being run. I’ve included the view definition below.

I ended up limiting the data to EventClass 10, as this seemed to capture all the necessary data. The only downside I have seen so far is that users querying through the Power BI web service are anonymised under the service account name. I’m currently looking into options to resolve this, which I’ve seen as configuration options in Power BI – to pass the username through as long as it can be matched at the other end.

SELECT 
RowNumber AS ID, 
SPID AS SessionID,
CurrentTime AS DateQueried, 
NTUserName AS Username,
CASE EventSubClass 
WHEN 0 THEN 'MDX Query (Excel)' 
WHEN 3 THEN 'DAX Query (Power BI)' 
WHEN 1 THEN 'METADATA Query'
END AS QueryType, 
CASE Success WHEN 1 THEN 'Successful Query' ELSE 'Query Error' END AS SuccessfulQuery,
CONVERT(DECIMAL(10,2),CONVERT(DECIMAL(18,3),CPUTime)/1000) AS CPUTimeSec, 
CONVERT(DECIMAL(10,2),CONVERT(DECIMAL(18,3),Duration)/1000) AS DurationSec, 
TextData AS Query, 
CASE PATINDEX('%Mexico%',TextData) WHEN 0 THEN 0 ELSE 1 END AS MexicoMarket,
CASE PATINDEX('%Colombia%',TextData) WHEN 0 THEN 0 ELSE 1 END AS ColombiaMarket,
CASE PATINDEX('%CS4%',TextData) WHEN 0 THEN 0 ELSE 1 END AS CS4,
ServerName
FROM 
[dbo].[ASTraceTable]
WHERE
EventClass = 10

Once I had the view, creating the report was relatively straight forward, and can be seen below.

I included metrics for the number of queries by user (blurred out), which also doubled as a filter; the % split of queries between tools such as Excel and Power BI; a measure of queries by timeframe; a logarithmically scaled display of queries by query duration; and lastly a split of queries by KPI. I intend to tweak these once I receive more data from the trace, but I was relatively happy with the information they were providing.

[Figure: Power BI usage report]

Please let me know if you have any comments.

A Crash Course in R Part 2

Following on from Part 1, which introduced the R basics (and can be found here), in Part 2 I’ll start to use lists, data frames and, more excitingly, graphics.

 

Lists

Lists differ from vectors and matrices because they are able to store different data types together in the same data structure. A list can contain all kinds of R objects - vectors, matrices, data frames, factors and more. They can even store other lists. Unfortunately calculations are less straightforward because there's no predefined structure.

To store data within a list, we can use the list() function: trip <- list("London","Paris",220,3). As mentioned in my previous blog, we can use str() to understand the structure, which will come in handy. Notice the different data types.

Subsetting lists is not quite the same as subsetting vectors. If you try to subset a list using the same single square brackets as when subsetting a vector, it will return a list containing the first element, not just the first element "London" as a character vector. To extract just the vector element, we can use double square brackets [[ ]]. We can use similar syntax to before to extract the first and third elements, [c(1,3)], as a new list, but the same does not work with double brackets. The double brackets are reserved for selecting single elements from a list. If you have a list inside a list, then [[c(1,3)]] would work and would select the 3rd element of the 1st element! Subsetting by names and by logicals is exactly the same as for vectors.

Another piece of syntax we can use is $ to select an element, but it only works on named lists. To select the destination of the trip, we can use trip$destination. We can also use this syntax to add elements to the list, trip$equipment <- c("bike","food","drink").
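To make this concrete, here is a minimal sketch using the trip example (the element names in the named version are my own illustration):

# A list mixing characters and numerics
trip <- list("London", "Paris", 220, 3)
str(trip)        # inspect the structure - note the different data types

trip[1]          # single brackets: a list containing "London"
trip[[1]]        # double brackets: the character vector "London" itself
trip[c(1, 3)]    # a new list holding the 1st and 3rd elements

# Named lists can be subset (and extended) with $
trip <- list(origin = "London", destination = "Paris", cost = 220, days = 3)
trip$destination                              # "Paris"
trip$equipment <- c("bike", "food", "drink")  # add a new element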

Interested in testing your knowledge? Check out the DataCamp exercises here and here.

 

Data Frames

While using R, you'll often be working with data sets. These are typically comprised of observations, and each observation has a number of variables against it, similar to a customer/product table you'd usually find in a database. This is where data frames come in, as the other data structures are not practical for storing this type of data. In a data frame, the rows correspond to the observations, while the columns correspond to the variables. It is similar to a matrix, but we can store different data types (like a list!). Under the hood, a data frame is actually a list, but with some special requirements, such as equal-length vectors and character vectors being converted to factors by default. A data frame is usually created by importing data from a source rather than manually, but you can do this via the data.frame() function as shown below:

 

# Definition of vectors 
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune") 
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",  
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant") 
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883) 
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67) 
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE) 

# Encoded type as a factor
type_factor <- factor(type) 

# Constructed planets_df
planets_df <- data.frame(planets, type_factor, diameter, rotation, rings, stringsAsFactors = FALSE) 

# Displays the structure of planets_df 
str(planets_df) 

 

Similar to SQL, sometimes you will want just a small portion of the dataset to inspect. This can be done by using the head() and tail() functions. We can also use dim() to show the dimensions which returns the number of rows and number of columns, but str() is preferable as you get a lot more detail.

Because a data frame is an intersection between a matrix and a list, the subsetting syntax from both can be used, so the familiar [ ], [[ ]], and $ will work to extract either elements, vectors, or data frames depending upon the type of subsetting performed. We can also use the subset() function, for instance subset(planets_df, subset = rings == TRUE). The subset argument should be a logical expression that indicates which rows to keep.
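As a quick sketch, inspecting and subsetting the planets_df built above might look like this:

head(planets_df, 3)        # first three observations
tail(planets_df, 3)        # last three observations
dim(planets_df)            # 8 rows, 5 columns

planets_df[1, 3]           # single element: diameter of Mercury
planets_df[1:4, ]          # first four rows, returned as a data frame
planets_df[["planets"]]    # the planets column as a vector
planets_df$diameter        # the diameter column as a vector

# Keep only the planets that have rings
subset(planets_df, subset = rings == TRUE)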

We can also extend the data frame in a similar way to lists/matrices. To add a new variable/column we can use either people$height <- new_vector or people[["height"]] <- new_vector. We can also use cbind(), for example people <- cbind(people, height = new_vector) – it works just the same. Similarly, rbind() can add new rows, but you'll need to add a data frame made of new vectors to the existing data frame, and the column names of the two data frames will also need to match.
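A minimal sketch of these options, using a hypothetical people data frame (the column names are just for illustration):

people <- data.frame(name = c("Anna", "Ben"), age = c(28, 34))
new_vector <- c(1.65, 1.80)

people$height <- new_vector                       # add a column with $
people[["weight"]] <- c(62, 80)                   # or with double brackets
people <- cbind(people, shoe_size = c(38, 44))    # or with cbind()

# rbind() needs a data frame whose column names match the existing ones
extra_person <- data.frame(name = "Cara", age = 41, height = 1.72,
                           weight = 70, shoe_size = 39)
people <- rbind(people, extra_person)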

Lastly, you may also want to sort your data frame – you can do this via the order() function. The function returns a vector of indices that puts the elements in ascending order. For example, for the vector 21,23,22,25,24 the order() function returns 1,3,2,5,4, the positions of the elements in sorted order. We can then use ages[ranks, ] to re-arrange the order of the data frame. To sort in descending order, we can use the argument decreasing = TRUE.

 

# Created a desired ordering for planets_df
positions <- order(planets_df$diameter, decreasing = TRUE)

# Created a new, ordered data frame
largest_first_df <- planets_df[positions, ]

# Printed new data frame
largest_first_df

 

Interested in testing your knowledge? Check out the DataCamp exercises here and here.

 

Graphics

One of the main reasons to use R is its graphics capabilities. The difference between R and a program such as Excel is that you can create plots with lines of R code which you can replicate each time. The default graphics functionality of R can do many things, but packages have also been developed to extend this – popular packages include ggplot2, ggvis, and lattice.

The first function to look at is plot(), which is a very generic plotting function. For instance, take a data frame of MPs (Members of Parliament), which contains the name of the MP, their constituency area, their party, their constituency county, number of years as an MP, etc. We can plot the parties in a bar chart by using plot(parliamentMembers$party). R recognises that party is a factor and does a count across it, displayed as a bar chart by default. If you pick a different variable, such as the continuous number of years as an MP, the figures are shown on an indexed plot. Pick 2 continuous variables, such as number of years as MP vs. yearly expenses – plot(parliamentMembers$numOfYearsMP, parliamentMembers$yearlyExpenses) – and the result is a scatter plot (each variable gets an axis).

Are these variables related? To make relationships clearer, we can use the log() function: plot(log(parliamentMembers$numOfYearsMP), log(parliamentMembers$yearlyExpenses)). For 2 categorical variables, R handles it differently again and creates a stacked bar chart, giving the user an idea of the proportions across the 2 variables. Notice how the type of variable you use impacts the type of chart displayed. You will find that the first element of plot() forms the x axis and the second element the y axis. For 3 variables, it gets better still – here I used plot(parliamentMembers[c("numOfYearsMP","votesReceived","yearlyExpenses")]). As you can see below, R plots the variables against one another in a grid of 9 plots!

 

[Figure: scatterplot matrix of numOfYearsMP, votesReceived and yearlyExpenses]

 

The second function to look at is hist(), which gives us a way to view the distribution of the data. We can use it in a similar way to plot() by specifying hist(parliamentMembers$numOfYearsMP). By default, R uses an algorithm to work out the number of bins to split the data into, based on the data set. To create a more detailed representation, we can increase the number of bins using the breaks argument: hist(parliamentMembers$numOfYearsMP, breaks = 10).
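To tie the plot() and hist() examples together, here is a minimal sketch using a made-up parliamentMembers data frame (the values are purely illustrative):

parliamentMembers <- data.frame(
  party          = factor(c("Labour", "Conservative", "Labour", "SNP", "Libdems")),
  numOfYearsMP   = c(12, 4, 7, 2, 9),
  yearlyExpenses = c(120000, 95000, 143000, 88000, 101000),
  votesReceived  = c(25000, 31000, 27000, 18000, 22000)
)

plot(parliamentMembers$party)             # factor: bar chart of counts
plot(parliamentMembers$numOfYearsMP)      # one continuous variable: indexed plot
plot(parliamentMembers$numOfYearsMP,      # two continuous variables: scatter plot
     parliamentMembers$yearlyExpenses)
plot(parliamentMembers[c("numOfYearsMP", "votesReceived", "yearlyExpenses")])  # grid of plots

hist(parliamentMembers$numOfYearsMP)              # default binning
hist(parliamentMembers$numOfYearsMP, breaks = 10) # more granular bins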

There are of course, many other functions we can use to create graphics, the most popular being barplot(), boxplot(), pie(), and pairs().

 

Customising the Layout

To make our plots more informative, we need to add a number of arguments to the plot() function. For instance, the following R script creates the plot below:

 

plot(
parliamentMembers$startingYearMP, 
parliamentMembers$votesReceived, 
xlab = "Number of votes", 
ylab = "Year", 
main = "Starting Year of MP vs Votes Received", 
col = "orange",
col.main = "black",
cex.axis = 0.8
pch = 19
)

# xlab = horizontal axis label
# ylab = vertical axis label
# main = plot title
# col = colour of the plotted points
# col.main = colour of the title
# cex.axis = ratio of font size on axis tick marks
# pch = plot symbol (35 different types!!)

[Figure: customised scatter plot of starting year vs votes received]

 

To inspect and control these graphical parameters, we can use the par() function. To get a complete list of options, we can use ?par to bring up the R documentation. By default, the parameters are set per plot, but to specify session-wide parameters, just use par(col = "blue").
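A quick sketch of inspecting and setting parameters with par():

# Inspect the current value of a single parameter
par("col")

# Set a session-wide parameter - every subsequent plot now draws in blue
par(col = "blue")
plot(1:10)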

 

Multiple Plots

The next natural step for plots is to incorporate multiple plots – either side by side or on top of one another.

To build a grid of plots and compare correlations we can use the mfrow parameter – calling par(mfrow = c(2,2)) passes in a length-2 vector and builds 4 subplots on a 2-by-2 grid. Now, when you start creating plots, they are not replaced each time but are added to the grid one by one. Plots are added in a row-wise fashion – to fill column-wise instead, use mfcol. To reset the graphical parameters, so that R plots once again to a single figure per layout, we pass in a 1x1 grid.
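For example, a minimal sketch of the mfrow workflow (the plotted data here is made up):

# Build a 2-by-2 grid of subplots, filled row by row
par(mfrow = c(2, 2))

plot(1:10)                            # top-left
hist(rnorm(100))                      # top-right
boxplot(rnorm(100))                   # bottom-left
barplot(table(c("A", "B", "A", "C"))) # bottom-right

# Reset to a single figure per layout
par(mfrow = c(1, 1))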

Apart from these 2 parameters, we can also use the layout() function that allows us to specify more complex arrangements. This takes in a matrix, where you specify the location of the figures. For instance:

 

grid <- matrix(c(1,1,2,3), nrow = 2, ncol = 2, byrow = TRUE)

grid
     [,1] [,2]
[1,]    1    1
[2,]    2    3

layout(grid)
plot(dataFrame$country, dataFrame$sales)
plot(dataFrame$time, dataFrame$sales)
plot(dataFrame$businessFunction, dataFrame$sales)

# Plot 1 stretches across the entire width of the figure
# Plot 2 and 3 sit underneath this and are half the size

 

To reset the layout we can use layout(1) or use the mfcol / mfrow parameters again. One clever trick to save time is to store the default parameters in a variable such as old_par <- par() and then reset once done using par(old_par).

Apart from showing graphics next to one another, we can also stack them on top of one another in layers. There are functions such as abline(), text(), lines(), and segments() to add depth to a graphic. Using lm() we can create an object which contains the coefficients of the line representing the linear fit between 2 variables, for instance movies_lm <- lm(movies$rating ~ movies$sales). We can then add it to an existing plot such as plot(movies$sales, movies$rating) by using the abline() (adds the line) and coef() (extracts the coefficients) functions, for instance abline(coef(movies_lm)). To add some informative text we can use xco <- 7e5; yco <- 7 and text(xco, yco, labels = "More votes? Higher rating!") to generate the following:

 

[Figure: scatter plot with fitted regression line and text annotation]
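Putting those pieces together, here is a minimal runnable sketch, using a made-up movies data frame in place of the real data:

# Hypothetical data standing in for the movies data set used above
movies <- data.frame(sales  = c(2e5, 4e5, 6e5, 8e5, 1e6),
                     rating = c(5.5, 6.2, 6.8, 7.4, 8.1))

# Linear fit of rating against sales
movies_lm <- lm(movies$rating ~ movies$sales)

plot(movies$sales, movies$rating)   # base scatter plot (x = sales, y = rating)
abline(coef(movies_lm))             # layer the fitted line on top

# Add an informative text annotation at the given coordinates
xco <- 7e5
yco <- 7
text(xco, yco, labels = "More votes? Higher rating!")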

 

Interested in testing your knowledge? Check out the DataCamp exercises here, here and here.

 

Summary

This pretty much concludes the crash course in R. We’ve looked at the basics, vectors, matrices, factors, lists, data frames and graphics – and in each case looked at arithmetic and subsetting functions.

Next I’ll be looking at some programming techniques using R, reading/writing to data sources and some use cases – to demonstrate the power of R using some slightly more interesting data sets such as the crime statistics in San Francisco.

 

 

 

A Crash Course in R Part 1

What is R?  Also known as the language for statistical computing, it was developed in the 1990s, and provides the ability to use a wide variety of statistical techniques and visualization capabilities across a set of data. 

Pros for the language include the fact it’s open source, it has great graphical capabilities, it runs via a CLI (providing the ability to script and automate), it has a huge community behind it, and it is gaining wider adoption in business!

R can be used by a number of tools; the most common are R Tools for Visual Studio, RStudio standalone, or more recently R Scripts in Power BI.

 

Basics

Let’s start with the absolute basics. If you type 1+2 at the command line, the console will return 3. Type some text at the command line and the console will return the same body of text.

Everyone loves variables. To store values in variables, we can use the syntax apples <- 4 and pears <- 2 at the command line to store 4 in the apples variable and 2 in the pears variable. There’s no print-out here because it’s an assignment. We can then do total_fruit <- apples + pears to create a new variable using existing variables.

As you create variables, you create a workspace, which you can inspect using ls(). This details all the variables created within the R session. You can then use rm(variable) to clean up the workspace and conserve resources.

Now, not everyone loves commenting, but you can comment your code via #. Here’s a simple example script to calculate the volume of a torus (a donut shape):

# Create the variables r and R 
r <- 2 
R <- 6 

# Calculate the volume of the torus
vol_torus <- 2*pi^2*r^2*R

# Remove all intermediary variables that you've used with rm() 
rm(r,R) 

# List the elements in your workspace 
ls() 

[1] "vol_circle"

 

Data Types

As with any language, there are a number of data types supported:

  • TRUE/FALSE are "logical"
  • "This is text" is "character"
  • 2 or 2.5 are "numeric"
  • You can add L to a numeric, such as 2L, to make the number an integer (the output is the same, but the class is different). Here we have what's known as a type hierarchy.
  • Other types include double, complex, and raw.

We can use the class() function to determine the data type of a variable. We can also use the as.* functions to coerce or transform between data types, such as as.character(4), which returns "4", and as.numeric("4.5"), which returns 4.5. To evaluate the type, use is.numeric(2), which returns TRUE, or is.numeric("2"), which returns FALSE. NA is returned (with a warning) when trying to convert "hello" to numeric.
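A few quick examples of checking and coercing types:

class(TRUE)            # "logical"
class("This is text")  # "character"
class(2.5)             # "numeric"
class(2L)              # "integer"

as.character(4)        # "4"
as.numeric("4.5")      # 4.5
as.numeric("hello")    # NA, with a warning - no sensible conversion exists

is.numeric(2)          # TRUE
is.numeric("2")        # FALSE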

Interested in testing your knowledge? Check out the DataCamp exercises here and here.

 

Vectors

A vector is a sequence of data elements of the same data type, which can be created using the c() function. Vectors can be of any of the types seen previously. The three properties of a vector are type, length, and attributes.

For example, a vector giving us the UK Government parties (who doesn’t love politics?) can be built and assigned to the variable parties with parties <- c("Labour","Conservative","Libdems","SNP"). A check can be done to see if the variable is a vector, similar to before, using is.vector(parties).

But what if the vector contains data that has slightly more meaning behind it, for instance seat_count <- c(262,318,12,35)? You can attach labels to the vector elements by using the names() function: names(seat_count) <- parties. You could also do this in one line: seat_count <- c(Labour = 262, Conservative = 318, Libdems = 12, SNP = 35).

The variables created previously are actually stored in vectors of length 1 – R does not provide a data structure to hold a single number or character string. If you do build a vector of different data types, R performs coercion to make sure they are all of the same data type. For instance c(1,5,"A","B","C","D") becomes "1","5","A","B","C","D". If you need to keep different data types, you can use a list instead!
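Pulling the vector examples above together:

parties    <- c("Labour", "Conservative", "Libdems", "SNP")
seat_count <- c(262, 318, 12, 35)

is.vector(parties)             # TRUE
names(seat_count) <- parties   # attach labels to the elements
seat_count

# Equivalent one-liner
seat_count <- c(Labour = 262, Conservative = 318, Libdems = 12, SNP = 35)

# Mixing data types forces coercion to a single type
c(1, 5, "A", "B")              # "1" "5" "A" "B"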

 

Vector Arithmetic

Computations can be performed between vectors and are done in an element-wise fashion. This means you can do earnings <- c(10,20,30) * 3 to generate 30, 60, 90.

You can also do vector minus vector, so earnings - expenses (again done element-wise). Multiplication and division using this method do not result in a matrix!

Other functions include sum(bank) to sum all the elements of the vector, sum(bank > 0) to return a count of the positive elements in the vector (given bank contains numerics), or sum(bank == x) to return the count of elements equal to x in the vector.
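A short sketch of element-wise arithmetic (the bank values are made up):

earnings <- c(10, 20, 30) * 3    # 30 60 90
expenses <- c(5, 25, 20)
earnings - expenses              # 25 35 70

bank <- c(100, -20, 50, 0, -20)
sum(bank)                        # 110 - sum of all elements
sum(bank > 0)                    # 2   - count of positive elements
sum(bank == -20)                 # 2   - count of elements equal to -20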

 

Vector Subsetting

As the name suggests, this is basically selecting parts of your vector to end up as a new vector (a subset of your original vector).

If you want to select the first element from our seat_count vector, we write seat_count[1] and this will return Labour 262. Both the value and the name are kept from the original vector. We can also use the name, so seat_count["Labour"] will return the same result.

If you want to select multiple elements, you can use the syntax seat_count[c(1,4)] by passing in a vector. To select elements in a sequence you can use 2:5 instead of 2,3,4,5. You can also subset via an inverse, by using the syntax seat_count[-1] which returns all the seats, apart from the element in [1].

One last method of subsetting is by logical vector, so specifying seat_count[c(TRUE,FALSE,FALSE,FALSE)] returns the equivalent of [1]. R is also able to "recycle" this type of vector, so if your logical vector is of length 2, it will loop over itself to fit the vector you are subsetting.

Remember we can also use the arithmetic from the previous section, to select our vector contents for subsetting, examples include main_parties <- seat_count[seat_count > 50].
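A summary sketch of these subsetting options, using the seat_count vector from above:

seat_count <- c(Labour = 262, Conservative = 318, Libdems = 12, SNP = 35)

seat_count[1]                              # Labour 262 (value and name kept)
seat_count["Labour"]                       # same result, by name
seat_count[c(1, 4)]                        # Labour and SNP
seat_count[2:4]                            # elements 2 to 4
seat_count[-1]                             # everything except the first element
seat_count[c(TRUE, FALSE, FALSE, FALSE)]   # logical subsetting, equivalent to [1]
seat_count[c(TRUE, FALSE)]                 # recycled to TRUE,FALSE,TRUE,FALSE

main_parties <- seat_count[seat_count > 50]   # Labour and Conservative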

Interested in testing your knowledge? Check out the DataCamp exercises here, here, and here.

 

Matrices

While a vector represents a sequence of data elements (1 dimensional), a matrix is a similar collection of data elements but arranged as rows/columns (2 dimensional) – a very natural extension to a vector.

To build a matrix you’ll need to use the following format: matrix(1:6, nrow = 2), which creates a 2-by-3 matrix for the values 1 to 6. You can also specify columns rather than rows by using ncol = 3. R infers the other dimension from the length of the input vector. To fill the matrix by row instead of by column, you can use the argument byrow = TRUE.

Another way to create a matrix is by using the functions rbind() and cbind(). These essentially take the 2 vectors you pass to the function and stick them together. You can also use these functions to bind a new vector to an existing matrix, for example my_matrix <- rbind(matrixA, vectorA).

To name the matrix, you can use rownames() and colnames(). For example rownames(my_matrix) <- c("row1","row2") and colnames(my_matrix) <- c("col1","col2").

You can also create and name a matrix in a one-liner, by using the dimnames argument and specifying a list of name vectors.

my_matrix <- matrix(1:6, byrow = TRUE, nrow = 2,
dimnames = list(c("row1", "row2"), c("col1","col2","col3")))

Similar to vectors, matrices are also able to recycle themselves, only store a single atomic data type and perform coercion automatically.

Continuing on from vectors, matrices can also be subsetted. If you’re after a single element, you’ll need to specify both the row and column of interest using the syntax m[1,3] for row 1, column 3. If you’re looking to subset an entire row or column, you can use the syntax m[3,] (notice the missing column value) to select the entirety of row 3. Columns can be selected using the inverse, via m[,3]. You can also select multiple elements using a similar methodology to vectors. This can be achieved by using the syntax m[2, c(2,3)] to select the 2nd and 3rd column values of row 2. Subsetting by names works just the same as by index – you can even use a combination of both! The same is true of subsetting by a logical vector – just use c(FALSE,TRUE,TRUE) to select the last 2 rows of a 3-row matrix. You can see some examples below.

# Create a matrix using 4 vectors (columns A to D)
my_mega_matrix <- cbind(A = vectorA, B = vectorB, C = vectorC, D = vectorD)

# Subset the matrix to get columns 2 and 3
my_subsetted_mega_matrix <- my_mega_matrix[, c(FALSE, TRUE, TRUE, FALSE)]

# Subset the matrix using names for columns 1 to 4
my_alt_subsetted_mega_matrix <- my_mega_matrix[,c("A","B","C","D")]

# Calculate totals for the columns 2 and 3
total_mega_matrix <- colSums(my_subsetted_mega_matrix)

As seen above, there are another 2 functions we can use on matrices, namely colSums() and rowSums(), to do column and row arithmetic. This is in addition to other standard arithmetic. All computations are performed element-wise, so we can do total_mega_matrix * 1.3 to convert the totals from GBP to USD (as an example). Performing calculations using 2 matrices works just the same (matrixA - matrixB). Be careful here though: if they contain the same number of elements, everything will be done element-wise, otherwise recycling will occur.
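A small sketch of matrix arithmetic, using a hypothetical sales matrix:

# Quarterly sales in GBP for two shops (made-up figures)
sales_gbp <- matrix(c(100, 200, 300, 400, 500, 600), nrow = 2, byrow = TRUE,
                    dimnames = list(c("shopA", "shopB"), c("Q1", "Q2", "Q3")))

colSums(sales_gbp)              # total per quarter
rowSums(sales_gbp)              # total per shop

sales_usd <- sales_gbp * 1.3            # element-wise scalar multiplication
costs     <- matrix(50, nrow = 2, ncol = 3)
profit    <- sales_gbp - costs          # element-wise matrix subtraction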

Notice the similarity between vectors and matrices – they’re both data structures that store elements of the same type. Both perform coercion, and recycling. Arithmetic is also similar as everything is performed element wise.

Interested in testing your knowledge? Check out the DataCamp exercises here, here, and here.

 

Factors

Unlike numeric variables, categorical variables can only take a limited number of different values. The specific data structure for this is known as a factor. A good example is blood type, which can only be A, B, AB, or O – we can then define a vector of people's blood types: blood <- c("B","AB","O","O","A","B","B"). To convert this vector to a factor we can use factor(blood). R scans the vector to check for categories and stores the distinct values as levels (sorted alphabetically). Values in the vector are then replaced with numeric values corresponding to the associated level. You can think of factors as integer vectors, where each integer refers to a category or level. To inspect the structure, we can use str().

Similar to the names() function, you can also use the levels() function and pass in a vector to name the levels differently from the categories picked up in the scan, for instance levels(my_factor) <- c("BT_A","BT_B","BT_O","BT_AB"). However, it's much safer to pass both the levels and the labels into factor(), because the levels are set alphabetically, which means you have to be careful that your names correspond correctly to the levels.
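A minimal sketch of that safer approach for the blood type example:

blood <- c("B", "AB", "O", "O", "A", "B", "B")

# Supply levels and labels together so the mapping is explicit
blood_factor <- factor(blood,
                       levels = c("A", "B", "AB", "O"),
                       labels = c("BT_A", "BT_B", "BT_AB", "BT_O"))
str(blood_factor)
levels(blood_factor)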

In statistics, there is also a difference between nominal categorical variables and ordinal categorical variables – nominal variables have no implied order, i.e. blood type A is not greater or less than B. There are cases where ordering does exist, however, for example with t-shirt sizes, and you can use R to impose this on the factor. To do this, set the ordered argument to TRUE and then specify the levels in the correct order. You can then compare the levels in the factor. An example can be seen below.

# Definition of temperature_vector
temperature_vector <- c("High", "Low", "High", "Low", "Medium")

# Encoded temperature_vector as a factor
temperature_factor <- factor(temperature_vector, 
                             ordered = TRUE,
                             levels = c("Low","Medium","High")
                             )

# Print out
temperature_factor

Interested in testing your knowledge? Check out the DataCamp exercises here.

 

This is only just the start of understanding R – in the next blog I’ll look at lists, data frames, and most importantly graphics! We can then start looking at some more complicated examples and use cases.

Extracting Users from AD using SSIS

I've recently been working on a project which required KPI level security alongside the traditional row level security secured at a geography level. This would limit what financial data a user could see within a cube, without having to create multiple cubes or use perspectives (which would not actually secure the data).

To achieve this, I needed to populate a set of 'KPI User/Role' tables stored in Master Data Services (MDS) with a list of users who were stored in a particular AD group. I would need these tables updated on a regular basis to grant/revoke access. We could then use these names along with the USERNAME() function in DAX to filter.

 

The Solution

One method to solve my problem was to use SSIS. The package could be set up to run as part of a SQL Agent job, either on a schedule or on demand. My list of users was stored in an AD group called LH_FIN.


To start with, you will need to truncate and clear the MDS staging tables that you are about to populate. You can then use a Data Flow to process the majority of the logic, by creating a Script Component task. The purpose of this is to loop through Active Directory and pick up the user details that belong to the specified AD group, or set of AD groups if dealing with multiple roles.

A number of variables are defined which the task uses to complete the lookup.

  • strLDAP – the LDAP directory on which to perform the lookup
  • strDomain – the domain on which the AD group(s) belong
  • strADPrefix – the AD group prefix to return user information for
  • strADParent – the parent group which contains the AD groups which you are looking up (may not need to be used if only looking up a single AD group)

To extract users from multiple groups, make sure the prefix stored in the variable strADPrefix covers both groups. Once the rows are extracted it would then be a case of using SSIS to split the data accordingly on the AD Group Name.

The following code can be used in the script:

The first section sets up the objects required to interrogate the directory, and the fields we expect to return from the accounts – the most important of which is memberof which is used to check versus our AD Prefix. It also filters out items such as service accounts and disabled accounts to speed up the interrogation process.

    Public Overrides Sub CreateNewOutputRows()

        Dim domain As String = Variables.strDomain
        Dim searchRoot As New DirectoryEntry(Variables.strLDAP, Nothing, 
                                             Nothing, AuthenticationTypes.Secure)
        Dim dirSearch As New DirectorySearcher(searchRoot)
        dirSearch.SearchScope = SearchScope.Subtree
        'LogonName, GroupsUserBelongsTo, Department, JobTitle, MailAddress, DisplayName
        dirSearch.PropertiesToLoad.Add("samaccountname")
        dirSearch.PropertiesToLoad.Add("memberof")
        dirSearch.PropertiesToLoad.Add("department")
        dirSearch.PropertiesToLoad.Add("title")
        dirSearch.PropertiesToLoad.Add("mail")
        dirSearch.PropertiesToLoad.Add("displayname")
        'filter to user objects, excluding disabled accounts and accounts where the
        'password never expires (i.e. service accounts) - combined into a single LDAP
        'filter, as each assignment to the Filter property overwrites the previous one
        dirSearch.Filter = "(&(objectCategory=person)(objectClass=user)" & _
                           "(!userAccountControl:1.2.840.113556.1.4.803:=2)" & _
                           "(!userAccountControl:1.2.840.113556.1.4.803:=65536))"
        'sets chunk size for retrieving items
        dirSearch.PageSize = 1000

The next section of code performs the search, and for any LDAP objects it finds within the filter set, returns the properties requested. These properties are then stored in key/value pairs.

        Dim props As ResultPropertyCollection
        Dim values As ResultPropertyValueCollection
        Dim key As String
        Dim userAccountName As String
        Dim departmentHome As String
        Dim jobtitle As String
        Dim GroupName As String
        Dim email As String
        Dim displayName As String
        Dim groups As New ArrayList

        Using searchRoot
        'Return all LDAP objects, LDAP://acl/CN=Tristan Robinson,OU=Employees,DC=ACL,DC=local
        'CN = Common Name, OU = Organisational Unit, DC = Domain Component
            Using results As SearchResultCollection = dirSearch.FindAll()
                For Each result As SearchResult In results
                    'For each object return properties, i.e. displayname, memberof, etc
                    props = result.Properties
                    For Each entry As DictionaryEntry In props
                        key = CType(entry.Key, String)
                        'For each property, inspect the property and record its value
                        'Logon Name
                        If key = "samaccountname" Then
                            values = CType(entry.Value, ResultPropertyValueCollection)
                            userAccountName = CType(values.Item(0), String)
                        End If
                        'Department
                        If key = "department" Then
                            values = CType(entry.Value, ResultPropertyValueCollection)
                            departmentHome = CType(values.Item(0), String)
                        End If
                        'Job Title
                        If key = "title" Then
                            values = CType(entry.Value, ResultPropertyValueCollection)
                            jobtitle = CType(values.Item(0), String)
                        End If
                        'E-Mail
                        If key = "mail" Then
                            values = CType(entry.Value, ResultPropertyValueCollection)
                            email = CType(values.Item(0), String)
                        End If
                        'Display Name
                        If key = "displayname" Then
                            values = CType(entry.Value, ResultPropertyValueCollection)
                            displayName = CType(values.Item(0), String)
                        End If
                        'Groups User Belongs To (array/collection)
                        If key = "memberof" Then
                            values = CType(entry.Value, ResultPropertyValueCollection)
                            groups = GetGroups(values)
                        End If
                    Next

The final section filters the data into the output buffer if from the array list we’ve extracted above, we have matching strings from our original AD Prefix variable. It will then reset, and loop round for the next account.

                    'Export user details to buffer if it passes the logical test
                    For Each item As String In groups
                        'Avoids computer accounts, i.e. ending with $
                        If userAccountName.EndsWith("$") = False _
                           And item.ToString.StartsWith(Variables.strADPrefix) Then
                            'Optional additional condition: And item.ToString <> (Variables.strADParent)
                            Output0Buffer.AddRow()
                            If String.IsNullOrEmpty(userAccountName) Then
                                Output0Buffer.UserAccountName_IsNull = True
                            Else
                                Output0Buffer.UserAccountName = userAccountName
                            End If
                            If String.IsNullOrEmpty(domain) Then
                                Output0Buffer.Domain_IsNull = True
                            Else
                                Output0Buffer.Domain = domain
                            End If
                            If String.IsNullOrEmpty(item.ToString) Then
                                Output0Buffer.GroupName_IsNull = True
                            Else
                                Output0Buffer.GroupName = item.ToString
                            End If
                            If String.IsNullOrEmpty(jobtitle) Then
                                Output0Buffer.JobTitle_IsNull = True
                            Else
                                Output0Buffer.JobTitle = jobtitle
                            End If
                            If String.IsNullOrEmpty(email) Then
                                Output0Buffer.Email_IsNull = True
                            Else
                                Output0Buffer.Email = email
                            End If
                            If String.IsNullOrEmpty(displayName) Then
                                Output0Buffer.DisplayName_IsNull = True
                            Else
                                Output0Buffer.DisplayName = displayName
                            End If
                        End If
                    Next
                    groups.Clear()
                    userAccountName = ""
                    departmentHome = ""
                    jobtitle = ""
                    GroupName = ""
                    email = ""
                    displayName = ""                    
                Next
            End Using
        End Using
    End Sub

I also required a function to split the list of groups a user belonged to and store them in another array list.

    Private Function GetGroups(ByVal values As ResultPropertyValueCollection) As ArrayList
        Dim valueList As ArrayList = New ArrayList()
        For Each Item As Object In values
            Dim memberof As String = Item.ToString()
            Dim pairs As String() = memberof.Split(",".ToCharArray)
            Dim group As String() = pairs(0).Split("=".ToCharArray)
            valueList.Add(group(1))
        Next
        Return valueList
    End Function

End Class

Once a list of users has been extracted, you will need to do a lookup against the existing list and only stage those that are new. This can be achieved through a simple Lookup component. You can then move the rows into the MDS staging table ready for the load into MDS.

After the data flow has processed successfully, the next stage is to sweep the records into MDS using one of the built in stored procedures in the product. The results of which can be seen in the MDS Import View on the MDS site.

The following entities can then be used to power the DAX:

  • User (Username, Team, AD Login – populated from the script above)  
  • Role (Role – populated manually)
  • User Role (User, Role – joins users to roles, populated manually)
  • KPI (KPI – populated manually)
  • KPI Role (Role, KPI – joins roles to KPIs, populated manually)

These are then processed to the DW by ETL, and a UserKPI view is written across the tables to provide an AD Login to KPI pair.

For filtering the measures in the cube, you can then apply the following piece of DAX to the original measure – Gross Sales in the example below:

Gross Sales:=
IF(
    CONTAINS(
        CALCULATETABLE( 'UserKPI', UserKPI[KPIName] = "Gross Sales"),
        UserKPI[ADLogin], USERNAME()
    ),
    [Gross Sales (ACTUAL)]
    ,
    BLANK ()
)

This concludes this blog post – hopefully this will be useful for anyone that wants to extract users from AD to control access / security within an application. Data could also be loaded direct into DB tables rather than MDS if required. 

Please feel free to comment!

Modelling Survey Style Data

In a recent project, one of the business areas which required modelling was that of survey data. This can often be quite tricky, due to the fact the data is not so quantitative in nature as other business areas such as sales.

How do you go about calculating measures against this type of data? In this blog I will go about explaining a relatively simple model you can use to achieve this goal.

 

Master Data Services

To aid with modelling I used Master Data Services (MDS) to help map the questions and responses from the surveys.

The idea behind using MDS is that regardless of the questions asked, whether it be in one format or another, English or Spanish, you can map them to a set of master questions. Usually these will align closely to your measures/KPIs and should be relatively finite in number so that the mapping process is feasible for a business user. In my business case, the master set of questions revolved around items such as quality, price, and promotion of products. For responses, I needed to map these to a flag which indicated we wanted the response to contribute towards a measure and ultimately a KPI.

I first created the following entities in MDS.

  • Survey (holds the survey name, a unique identifier for the survey, and in my case I also created a master survey lookup to group the surveys together)
  • Source Question  (holds the distinct set of questions assigned to each survey, along with identifying codes, and question orders - each question would also be mapped to a master question) 
  • Source Response (holds a set of response options for each question, along with identifying codes)
  • Master Question (holds the master set of questions and joins to the KPIs)
  • KPI (holds a list of KPIs that you need to address by aggregating response data)
  • Response of Interest (holds a list of responses that are regarded as positive / contributing towards the KPI when answered by the question)
  • Response of Interest Mapping (allows the user to map the response options available on each question to a response of interest)

In terms of the response of interest, I was only interested in responses where the answer from the survey was “Yes” so this was all that was required here. However for more complex response options, the model can provide the scalability required. For instance, if you were looking for an answer between 7-10 and the survey had been answered with a 7, 8, 9, or 10 – each of these could be mapped to 7-10 without having to create responses of interest for all particular combinations. This scales well and can cover scenarios for instance where the answer should be between 700 to 1000 in the same way.

I also created a Master Question and Response of Interest value for N/A. This way, only the blanks on the mapping entities required populating and the user was never unsure whether a blank represented a question/response that was not of interest, or something that required mapping still.

All the entities above apart from Master Question, KPI, and Response of Interest were populated automatically from ETL with a SQL script used to extract the contents of those entities from source. The other 3 entities were populated manually by a business user. I also locked the entities / columns that the user shouldn’t be changing by using read-only permissions.

Some examples of the manually populated tables can be seen below:

[Figures: examples of the manually populated MDS entities]

 

Data Warehouse

For modelling the tables in the data warehouse, I created a separate dimension for each of the Response, Question, Survey, and KPI entities, and a single Fact to capture the responses of interest against these dimensions.

The majority of the dimension lookups were straightforward, along with the response of interest measure, which can be seen below:

SELECT 
SU.Code AS SurveyId, 
SR.Name AS ResponseName, 
1 AS ResponseOfInterest 
FROM mdm.PL_ResponseOfInterestMapping RM 
INNER JOIN mdm.PL_SourceResponse SR
ON RM.SourceResponse_Id = SR.Id AND RM.Survey_Id = SR.Survey_Id 
INNER JOIN mdm.PL_Survey SU 
ON RM.Survey_Id = SU.Id 
WHERE 
RM.ResponseOfInterest_Code IS NOT NULL AND RM.ResponseOfInterest_Name <> 'N/A'

During the ETL runs for the fact, we also checked for responses that had not yet been mapped – and did not pull these through.

If you then have a cube sat on top of your DW, you can then write measures across the fact to count the number of responses of interest. An example of which can be seen here:

Price Activation Standard:=
COUNTROWS(
    FILTER(
        CALCULATETABLE('Outlet Survey','KPI'[Sales Driver] = "Price"),
        'Outlet Survey'[IsResponseOfInterest] = 1
    )
)

This was then checked against a Target fact table to calculate the compliance and the KPI was essentially an aggregation of the compliance across geography.

 

Summary

Overall, the model has proved very popular with the business. It’s easy to understand and gives the business control over which responses count towards the KPI – without having to hard-code values into the ETL, which had been seen in previous solutions. It can also be extended easily by adding new KPIs and mapping them to new master questions, without having to change any ETL. From a development perspective it also means that nothing should go into the DW as Unknown for a dimension, since the SQL used to populate MDS can also be used for the DW and should therefore always match.

If you have any questions, please feel free to ask them in the comments.

Upgrading your BI Solution (2012 to 2014)

In a recent project I was asked to upgrade a BI solution from SQL Server 2012 to 2014.  The aim of which was to consolidate a number of database sources onto a single version of SQL Server. This blog looks at the steps required in order to achieve the upgrade process, along with issues I hit along the way.

 

Where do I start?

In order to develop on SQL Server 2014, you’ll need to update your development tools! This involves downloading the following applications.

  • SSDT BI for VS 2013 – allows you to modify and deploy SSIS, SSAS, and SSRS files. I would then suggest downloading Update 5 from within the application; this resolved some source control issues for me. https://www.microsoft.com/en-gb/download/details.aspx?id=42313

  • SSDT Database Designer for VS 2013 – allows you to modify and deploy SQL-related files. https://msdn.microsoft.com/en-us/dn864412

  • Team Explorer for VS 2013 – sometimes this is bundled together with SSDT, but that is not the case for 2013. https://www.microsoft.com/en-gb/download/details.aspx?id=40776

  • BIDS Helper for 2014 – a must-have for any solution with a tabular cube. It gives the user more functionality to customise the cube and organise measures. https://bidshelper.codeplex.com/

 

You’ll also need to update any custom components within the solution. For me, this was the 2014 Attunity drivers (3.0) to connect to our Oracle sources and some custom Adatis framework components; both will need GACing. Remember that if you develop using 32-bit components, you may need to download the equivalent 64-bit components once your solution is deployed to a server.

Once you have the necessary tools locally, you can start the upgrade process. Open up the solution in your new environment.

 

The Solution Upgrade Process

The first thing I noticed was that during the upgrade I lost all my SSIS component layout and comments when opening packages in VS 2013 that were developed in VS 2012. The code was still in the package file, but did not seem to do anything. I’m not sure this happens for everyone, as the information I found to resolve the issue was not particularly well documented around the internet. It’s possible that upgrading to Update 5 beforehand may resolve the issue. To fix the problem, I had to open up the package code and search for the line that says version="sql11". This needs to be changed to "sql12". The next time you open the package, everything should appear again. I also found that if you open the package first in 2013, the application will automatically change this to sql12, however the comments/layout still did not appear. You also won’t be able to amend the file at this point, so you will need to roll back if using version control such as TFS.

Another issue I ran into was that even after I ran this process, a few developers were still unable to see the comments. This is due to the fact they were running the dark-themed Visual Studio, which did not invert the text colour of the comments, therefore making them appear hidden. Again, this is probably another bug with the base version of SSDT I downloaded from the Microsoft website before upgrading to Update 5.

At this point you can start to run the packages through the Upgrade Package Wizard (if you right-click on the SSIS Packages folder in the project).


If packages are not appearing in the wizard, then the application is not recognising the packages as coming from a previous version of SQL Server. It’s possible that even after running through the wizard, it reports the packages as having failed the upgrade. It’s fairly safe to ignore this; it is usually the result of custom components or similar.

Depending upon how many packages you need to upgrade you are now faced with the unfortunate task of replacing any of the custom components in those packages, with the new ones you downloaded to work with SQL Server 2014. In my case, this was the custom Adatis components in the Control Flow (Initiate Task/Terminate Task – remember to set LoadStatusType to Success on the Terminate properties), Data Flow (Clean – Data Cleansing/Standardise/Validate Columns), and Event Handlers (Row Throughput/Terminate Task on Error – remember to set LoadStatusType to Failure on the Terminate properties).

In some cases, components before the upgrade display a different icon than afterwards, so it’s easy to tell which version you have. When this is not the case, you’ll have to check the package code.


Following this, I then modified any connection strings I had within the solution to point to our new server. If you are not changing servers during your upgrade, just skip this step.

The next step is to update the DB projects to 2014 via the properties of the project. Very simple.


The final step needed within the solution is to upgrade the tabular model (if you have one). Firstly, make sure the model is open and then via the model properties, select the SQL Server 2014 compatibility level. Note – that if the model is not open, then this field will display blank.


 

Master Data Services

As part of my upgrade, I also had to upgrade our Master Data Services (MDS) to 2014. One of the pre-requirements for this is to have an IIS web server installed. If you’re provided with a fresh, clean server like I was, then you’ll need to install this, which can be done through Server Manager. Make sure the following features are added.

Web Server

  • Common HTTP Features
  • Static Content
  • Default Document
  • Directory Browsing
  • HTTP Errors
  • DO NOT INSTALL WebDAV Publishing

Application Development

  • ASP.NET 3.5/4.5
  • .NET Extensibility 3.5/4.5
  • ISAPI Extensions
  • ISAPI Filters

Health and Diagnostics

  • HTTP Logging
  • Request Monitor

Security

  • Windows Authentication
  • Request Filtering

Performance

  • Static Content Compression

Management Tools

  • IIS Management Console
  • .NET Framework 3.0 Features
  • WCF Activation
  • HTTP Activation
  • Non-HTTP Activation

Windows Process Activation Service

  • Process Model
  • .NET Environment
  • Configuration APIs

The MDS configuration tool is quite good as it identifies if you are missing any pre-requisites. Once you have the pre-requisites for MDS installed, as well as MDS 2014 itself, you can begin the upgrade process – which comes in 2 parts: the database and the web site.

Unfortunately, model deployment packages can only be used in the version of SQL Server they were created in. You cannot deploy model deployment packages created in SQL Server 2008 R2 or SQL Server 2012 to SQL Server 2014. Therefore you’ll need to take a database-level backup of your MDS database and restore it to the new 2014 instance. In my case, I also needed to give the service account running SQL Server access to the folder in which the backup was stored, as it wasn’t able to see it.

Once you have successfully restored the previous MDS database to the new instance, you will need to load up the Master Data Services Configuration tool. To start with click on Database Configuration. If you then select the database you have just restored, it will suggest the database is not compatible with the current version of MDS and requires upgrading. Click on the Upgrade Database button and you should now have a compatible 2014 MDS database. I also had to re-set up the MDS mail profile at this point as this does not transfer across with the database.

The next step is to configure a new MDS website which can be done via the Web Configuration button. Firstly, select the default website that is created when you install IIS. There is no need to create anything separate – in fact this caused me more hassle than it was worth. You can then go about creating the site which requires you to specify an alias, and the application pool information (name and service account information).


You are then required to associate both the database and web site together. At this point, you should be able to load up the new MDS web site and explore your old model.

One issue I ran into that stopped me loading into MDS at this point was that the service account you specify above will then need to be added to a local security group called “Log on as a batch job”. This can be done via the Computer Management application, then Local Users and Groups, Groups, Log on a batch job.

I also had some minor issues when loading MDS where the page was malformed and I was unable to see the database. I resolved this by providing the service account access to the web config of the web site directly.

 

Deploying & Testing

At this point, all the necessary requirements should be in place for you to deploy and test the upgrade. Deploy the solution as you would normally.

Your first step once you’re at this point is to add a new SSISDB catalog to your new instance. If, like me, you did this via the old 2012 Management Studio, you’ll potentially run into some errors. Instead, load up the new 2014 Management Studio and you should be fine.

Depending on how your solution is setup, you’ll need to re-confirm any environment variables or script them out again and then configure the SSISDB projects to use them.

Then you will need to re-setup any SQL Agent Jobs you had on the previous instance, again this can be scripted across – just make sure you change the connection details.

Following this, make sure you GAC any components required on the new server similar to what you did locally.

Lastly, you will also need to add your SSAS Service Account into the new SSAS instance.

You can then run the solution out as per usual and check your execution status reports for any irregularities.

 

Summary

Hopefully this covers everything you need to upgrade your solution. Whether it’s from 2008/2012 to 2014, or 2014 to 2016, the steps should be roughly the same – you might just run into slightly different issues. Generally the process is fairly straightforward; the hardest part was keeping track of the number of steps required and then ticking them off as I went. If you need to do this for both development and production servers, I’d advise running the steps side by side so they are set up in a similar fashion. Please let me know in the comments below if you’ve had any similar experiences or issues.