Tristan Robinson's Blog

Getting Started with Databricks Cluster Pricing

The use of Databricks for data engineering or data analytics workloads is becoming more prevalent as the platform grows, and it has made its way into most of our recent modern data architecture proposals – whether that be for PaaS warehouses or data science platforms.

To run any type of workload on the platform, you will need to set up a cluster to do the processing for you. While the Azure-based platform has made this relatively simple for development purposes – give it a name, select a runtime, select the type of VMs you want and away you go – for production workloads, a bit more thought needs to go into the configuration and cost. In this blog I'll start by looking at the pricing in a bit more detail, with the aim of adding a cost element to the cluster configuration process.

For argument's sake, the work that we tend to deliver with Databricks is based on data engineering usage – spinning up resource for an allocated period to perform a task. That is therefore the focus of the rest of this post.

 

To get started in this area, I think it would be useful to include some definitions.

  • DBU – a Databricks Unit (a unit of processing capability per hour, billed on per-second usage)
  • Data Engineering Workload – a job that both starts and terminates the cluster it runs on (via the job scheduler)
  • Data Analytics Workload – a non-automated workload, for example running a command manually within a Databricks notebook. Multiple users can share the cluster to perform interactive analysis
  • Cluster – made up of processing instances (VMs) and consists of a driver and workers. Workers can either be provisioned upfront, or autoscaled between a minimum and maximum number of workers
  • Tier – either Standard or Premium. Premium includes role-based access control, ODBC endpoint authentication, audit logs, and Databricks Delta (a unified data management system)

The billing for clusters primarily depends on the type of workload you initiate and the tier (or functionality) you require. As you might have guessed, data engineering workloads on the Standard tier offer the best price.

I’ve taken the DS3 v2 instance (VM) pricing from the Azure Databricks pricing page.

image

 

The pricing can be broken down as follows:

  • Each instance is charged at £0.262/hour. For example, the cost of a very simple cluster – 1 driver and 2 workers – is £0.262/hour x 3 = £0.786/hour. The VM cost does not depend on the workload type/tier.
  • The DBU cost is then calculated at £0.196/hour per node. For example, the cost for 3 nodes (as above) is £0.196/hour x 3 = £0.588/hour. This cost does change depending on the workload type/tier.
  • The total cost is then £0.786/hour (VM cost) + £0.588/hour (DBU cost) = £1.374/hour – also known as the pay-as-you-go price. Discounts are then applied accordingly for reserved processing power. The calculation is sketched in code below.
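To make the arithmetic above concrete, here is a minimal Python sketch of the calculation. The rates are the DS3 v2 example figures from the table and are assumptions that will vary by region, workload type and tier.

# Hypothetical sketch of the Databricks cluster cost calculation described above.
# Rates are the DS3 v2 example figures and will differ by region, workload type and tier.
VM_RATE_PER_NODE = 0.262   # £/hour per instance
DBU_RATE_PER_NODE = 0.196  # £/hour per instance (data engineering, standard tier)

def cluster_cost_per_hour(workers, drivers=1):
    nodes = drivers + workers
    vm_cost = nodes * VM_RATE_PER_NODE
    dbu_cost = nodes * DBU_RATE_PER_NODE
    return vm_cost + dbu_cost

# 1 driver + 2 workers => £1.374/hour (as above)
print(cluster_cost_per_hour(workers=2))

# 1 driver + 8 workers, 1-hour daily ETL window over a 30-day month => roughly £123/month
print(cluster_cost_per_hour(workers=8) * 1 * 30)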

I thought this was worth simplifying, since the pricing page doesn't make it abundantly clear with the way the table is laid out, and it is often overlooked. Given the vast number of options you have when setting up clusters, it's worth understanding this element to balance cost against time.

The DBU count is merely a reference for comparing the processing power of the different VMs and is not directly included in the calculations. It's also worth mentioning that by default Databricks services are set up as Premium and can be downgraded to Standard only by contacting support. In some cases this can bring massive cost savings depending on the type of work you are doing on the platform, so please take this into account before spinning up clusters and don't just go with the default.

With regards to configuration, clusters can be set up either in High Concurrency mode (previously known as Serverless) or as Standard. High Concurrency mode is optimised for concurrent workloads and is therefore more applicable to data analytics workloads and interactive notebooks used by multiple users simultaneously. This piece of configuration does not affect the pricing.

Using this cost model, we can estimate that a basic batch ETL run with a driver and 8 worker nodes on relatively small DS3 instances would cost £123.60/month, given a standard 1-hour daily ETL window. Hopefully this provides a very simple introduction to the pricing model used by Databricks.

Databricks UDF Performance Comparisons

I've recently been spending quite a bit of time on the Azure Databricks platform, and while learning decided it was worth using it to experiment with some common data warehousing tasks in the form of data cleansing. As Databricks provides us with a platform to run a Spark environment on, it offers cross-platform APIs that allow us to write code in Scala, Python, R, and SQL within the same notebook. As with most things in life, not everything is equal and there are potential differences in performance between them. In this blog, I will explain the tests I produced with the aim of outlining best practice for Databricks implementations for UDFs of this nature.

Scala is the native language for Spark – without going into too much detail here, Scala code compiles down to bytecode that runs directly on the JVM. Python, on the other hand, provides a wrapper around the code: in reality a Scala program tells the cluster what to do, and data has to be moved between the Python process and the JVM. Converting these objects into a form Python can read is called serialisation/deserialisation, and it's expensive, especially over time and across a distributed dataset. The most expensive scenario occurs with UDFs (functions) – the runtime process for which can be seen below. The overhead here is in steps (4) and (5), reading the data out of and writing it back into JVM memory.

image

Using Scala to create the UDFs, the execution process can skip these steps and keep everything native. Scala UDFs operate within the JVM of the executor so we can skip serialisation and deserialisation.

image

 

Experiments

As part of my data for this task I took a list of company names from a data set and then ran them through a process to codify them – essentially stripping out characters which cause them to be unique and converting them to upper case, thus grouping a set of companies together under the same name. For instance Adatis, Adatis Ltd, and Adatis (Ltd) would all become ADATIS. This is an example of a typical cleansing activity when working with data sets. The dataset in question was around 2.5GB and contained 10.5m rows. The cluster I used was Databricks runtime 4.2 (Spark 2.3.1 / Scala 2.11) with Standard_DS2_v2 VMs for the driver/worker nodes (14GB memory), with autoscaling disabled and limited to 2 workers. I disabled autoscaling as I was seeing wildly inconsistent timings each run, which impacted the tests. The good news is that with it enabled and using up to 8 workers, the timings were about 20% faster, albeit more erratic from a standard deviation point of view.

The following approaches were tested:

  • Scala program calls Scala UDF via Function
  • Scala program calls Scala UDF via SQL
  • Python program calls Scala UDF via SQL
  • Python program calls Python UDF via Function
  • Python program calls Python Vectorised UDF via Function
  • Python program uses SQL

While it was true in previous versions of Spark that there was a real performance gap between Scala and Python here, in the latest version of Spark (2.3) the playing field is believed to be far more level thanks to Apache Arrow, in the form of vectorised Pandas UDFs within Python.

As part of the tests I also wanted to use Python to call a Scala UDF via a function, but unfortunately this cannot be done without creating a JAR file of the Scala code and importing it separately. This would be done via SBT (a build tool) using the following guide here. I considered this too much of an overhead for the purposes of the experiment.

The following code was then used as part of a Databricks notebook to define the tests. A custom function to time the write was required for Scala whereas Python allows us to use %timeit for a similar purpose.

 

Scala program calls Scala UDF via Function

// Scala program calls Scala UDF via Function

%scala

def codifyScalaUdf = udf((string: String) => string.toUpperCase.replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP",""))  

spark.udf.register("ScalaUdf", codifyScalaUdf)  

val transformedScalaDf = table("DataTable").select(codifyScalaUdf($"CompanyName").alias("CompanyName"))
val ssfTime = timeIt(transformedScalaDf.write.mode("overwrite").format("parquet").saveAsTable("SSF"))

 

Scala program calls Scala UDF via SQL

// Scala program calls Scala UDF via SQL

%scala

val sss = spark.sql("SELECT ScalaUdf(CompanyName) as a from DataTable where CompanyName is not null")
val sssTime = timeIt(sss.write.mode("overwrite").format("parquet").saveAsTable("SSS"))

 

Python program calls Scala UDF via SQL

# Python program calls Scala UDF via SQL

pss = spark.sql("SELECT ScalaUdf(CompanyName) as a from DataTable where CompanyName is not null")
%timeit -r 1 pss.write.format("parquet").saveAsTable("PSS", mode='overwrite') 

 

Python program calls Python UDF via Function

# Python program calls Python UDF via Function

from pyspark.sql.functions import *
from pyspark.sql.types import StringType
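# df is assumed to be the source DataFrame for these tests, e.g. df = spark.table("DataTable")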

@udf(StringType())
def pythonCodifyUDF(string):
    return (string.upper().replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP",""))

pyDF = df.select(pythonCodifyUDF(col("CompanyName")).alias("CompanyName")).filter(col("CompanyName").isNotNull())
%timeit -r 1 pyDF.write.format("parquet").saveAsTable("PPF", mode='overwrite')

 

Python program calls Python Vectorised UDF via Function

# Python program calls Python Vectorised UDF via Function

from pyspark.sql.types import StringType
from pyspark.sql.functions import pandas_udf, col
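# Note: the vectorised UDF below receives a pandas Series. Series.replace matches whole cell
# values rather than substrings, so .str.replace(..., regex=False) may be closer to the intent.
# df is again assumed to be the source DataFrame, e.g. df = spark.table("DataTable")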

@pandas_udf(returnType=StringType())
def pythonCodifyVecUDF(string):
    return (string.replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP","")).str.upper()
  
pyVecDF = df.select(pythonCodifyVecUDF(col("CompanyName")).alias("CompanyName")).filter(col("CompanyName").isNotNull())
%timeit -r 1 pyVecDF.write.format("parquet").saveAsTable("PVF", mode='overwrite')

 

Python Program uses SQL

# Python Program uses SQL

sql = spark.sql("SELECT upper(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(CompanyName,' ',''),'&',''),';',''),'#',''),' AND ',''),' THE ',''),'LTD',''),'LIMITED',''),'PLC',''),'.',''),',',''),'[',''),']',''),'LLP',''),'INC',''),'CORP','')) as a from DataTable where CompanyName is not null")           

%timeit -r 1 sql.write.format("parquet").saveAsTable("SQL", mode='overwrite')

 

Results and Observations

image

It was interesting to note the following:

  • The hypothesis above does indeed hold true: the two methods expected to be slowest were indeed the slowest in the experiment, and by a considerable margin.
  • The Scala UDF performs consistently regardless of the method used to call the UDF.
  • The Python vectorised UDF now performs on par with the Scala UDFs and there is a clear difference between the vectorised and non-vectorised Python UDFs.
  • The standard deviation for the vectorised UDF was surprisingly low and the method was performing consistently each run. The non-vectorised Python UDF was the opposite.

To summarise – moving forward, as long as you write your UDFs in Scala or use the vectorised version of the Python UDF, the performance will be similar for this type of activity. It's worth noting that you should definitely avoid writing UDFs as standard Python functions, given the theory and results above. Over time, across a complete solution and with more data, this time would add up.

Lightweight Testing in Azure through Pester & Azure Automation

I'm currently working on a relatively lightweight PaaS modern data warehouse project, and as part of the build there was a requirement to add automated testing into the platform. While I was thinking about how I could achieve this, I remembered a DevOps session by the sabin.io team at SQLBits 2018 in which they did something very similar: they used PowerShell to add tests into part of a continuous integration pipeline hosted on VSTS, through a framework called Pester. In this blog, I will talk about my implementation of Pester, but specifically look at it in the context of hosting it in an Azure PaaS environment (something which adds a small amount of overhead to its implementation and took me a while to get working).

 

What is Pester?

Pester is basically a ubiquitous test and mock framework for PowerShell (hosted on GitHub here). It can literally be used to test anything – as long as you can invoke whatever you are trying to test through PowerShell. This includes SQL statements, Azure resources, Windows resources, etc. Pester provides the language to DEFINE and EXECUTE the test case. I won't go through the numerous advantages of automated testing, but let's just say it's pretty useful to have it as part of a platform to alert you about issues early. I won't talk about it so much in the context of DevOps CI/CD, but it has use cases there too. Microsoft is also on board with the framework, and it comes pre-installed with Windows builds nowadays.

 

How do I define a test case?

A Describe block defines a group of tests, and all test files must contain at least one Describe block. Within it, an It block defines a single test case – so in my example below it will invoke a SQL statement to return a result, which is then tested.

The test is then passed or failed using an assertion such as Should be 0 or Should BeNullOrEmpty based on the object passed to it. You can also use other variations such as ShouldBeExactly or even Should Throw to pass a test based on a terminating error.

In the code segment below, I'm checking for fact table keys that failed to look up against our dimensions. One thing to be careful of (as there is no related error message) is that you need to alias your COUNT or SUM in the SQL query and then reference that alias as a property of the result object afterwards to check against.

Describe "Warehouse Award Fact -1 Keys" {
    It "Award Fact - Supplier -1 Keys" {
        $Query = "SELECT COUNT(*) AS AwardSupplier FROM Warehouse.FctAward WHERE SupplierKey = -1 "
        $Result = Invoke-Sqlcmd -ServerInstance $AsqlServerName -Database $AsqlDatabaseName -Query $Query -Username $AsqlUsername -Password $AsqlPassword
        $Result.AwardSupplier | Should be 0
    } 
    It "Award Fact - Contract -1 Keys" {
        $Query = "SELECT COUNT(*) AS AwardContract FROM Warehouse.FctAward WHERE ContractKey = -1 "
        $Result = Invoke-Sqlcmd -ServerInstance $AsqlServerName -Database $AsqlDatabaseName -Query $Query -Username $AsqlUsername -Password $AsqlPassword
        $Result.AwardContract | Should be 0
    } 
}

 

As part of the definition I decided it would be best to split the test blocks up into logical groups, and then keep them all within the same file. However, if you wanted to test both data and something else such as Azure resources, then I would consider splitting the tests into separate files to keep things modular. In terms of tests, to give you an idea of the sort of things we were keen to monitor, I created the following:

  • Stage row counts match clean row counts
  • Business keys are distinct within source
  • Business keys were distinct within dimensions
  • Total £ amounts matched between source and warehouse
  • No MDS errors existed on data import
  • Fact table unknown keys for each dimension (above)
  • Misc tests that tripped us up during early build phases based on assumptions

These tests were added to over time and formed a group of acceptance tests for each DW run. As data is always changing, it's good to have these to validate the initial premises you put in place around a data set.

While I didn’t need to use it, there are also commands to mock variables therefore putting the PowerShell code into a specific state for a particular test. This is helpful to avoid changing the real environment while replicating states.

 

Standing up Pester in Azure

Implementing the Pester framework with traditional resources is already well documented and very simple to get started with, both manually and in an automated way. Unfortunately, as I was working on a fully PaaS project, I needed to implement the framework within Azure, which did not seem to be as well documented. To help me out with the task, I went straight to Azure Automation – for anyone that has not used this before, it's basically a way to host PowerShell scripts within runbooks. While it is fairly intuitive to import the Pester framework from its GitHub repo, there is also an option within Automation to select from a modules gallery. Pester is part of this gallery, so for ease of use I would download it from there, which also makes maintenance slightly easier. It's also worth defining important variables such as DB connections outside of the runbook within Automation and then passing them in as parameters – think of this as similar to environment variables in an SSIS project. Credentials such as our RunAs account are also defined externally to the runbook.

 

image

 

Now on to the important bit: defining the test execution script.

The first body of code will connect to Azure using the Runas account as a service principal. This allows us to execute the script without using our own credentials. Setup of this account is a whole separate blog in itself so I won’t go into that detail here.

Once this has been defined, the script will then connect to Blob Storage to extract the tests (defined above), and place them in the local Temp directory of Azure Automation. This was one of my major blockers initially, as I was trying to find a way to invoke the test scripts from within the same PS script as the execution, therefore not needing to host the tests anywhere and keeping everything contained in a runbook or parent/child runbooks. Unfortunately (as far as I can tell) the Invoke-Pester command which executes the tests needs a separate file location to be defined. Either way, we already had a blob storage area set up for the project, so creating a new container to store the tests in was not really an overhead. Automation uses a Temp folder internally to store anything, so I used this to land the tests ready for processing.

# Get required variables
$AutomationConnectionName = Get-AutomationVariable -Name 'AutomationConnectionName'

# Connect to an automation connection to get TenantId and SubscriptionId
$Connection = Get-AutomationConnection -Name $AutomationConnectionName
$TenantId = $Connection.TenantId
$SubscriptionId = $Connection.SubscriptionId

# Login to Azure using AzureRunAsConnection
Connect-AzureRmAccount -ServicePrincipal -TenantId $Connection.TenantId -ApplicationId $Connection.ApplicationId -CertificateThumbprint $Connection.CertificateThumbprint

# Connect to Storage Account to get tests script
$resourceGroup = Get-AutomationVariable -Name 'ResourceGroupName'
$storageAccountName = Get-AutomationVariable -Name 'PreStageBlobStorageAccountName'
$storageAccount = Get-AzureRmStorageAccount -ResourceGroupName $resourceGroup -Name $storageAccountName 
$ctx = $storageAccount.Context
$data = Get-AzureStorageBlobContent -Blob "PesterTests.ps1" -Container 'pester' -Destination "C:\Temp\" -Context $ctx 

 

The next section of the script will depend on what you want to do with the results. For me, we already had an ETL control schema set up in our SQL DB, so it felt right to submit the results into a new table in there for reporting purposes. Alternatively you could create a new schema called UnitTesting or similar. There are also lots of other things you could do with the results, i.e. to trigger other events or use them in a CI/CD environment.

The code below will open a SQL DB connection, and then create a function to insert rows into the DB, which is called in the next block of code. For security purposes, the parameters are defined outside the SQL command and added in at execution. While there are quite a few things you can extract from a Pester test result, I decided to take the test name, test result, and failure message – to keep things simple. I also included an inserted date so we can work out the latest tests, as well as a trigger Id to join back to the parent pipeline that called the test scripts. This ties in nicely with other parts of our ETL reporting framework.

# Open SQL connection 
$DBServer = Get-AutomationVariable -Name 'DatabaseServerFullName'
$DBName = Get-AutomationVariable -Name 'DatabaseName'
$DBUsername = Get-AutomationVariable -Name 'DatabaseAdminUsername'
$DBPassword = Get-AutomationVariable -Name 'DatabaseAdminPassword'
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection
$sqlConnection.ConnectionString = "Server=$DBServer;Database=$DBName;UID=$DBUsername;PWD=$DBPassword"
$sqlConnection.Open()

# Create GUID to group tests together
$GUID = [guid]::Newguid()
$GUID = $GUID.ToString()

# Create Inserted datetime
$Inserted = Get-Date

# Define function to submit to database
function Do-InsertRow {

    $sqlCommand = New-Object System.Data.SqlClient.SqlCommand
    $sqlCommand.Connection = $sqlConnection
    
    $sqlCommand.CommandText = "SET NOCOUNT ON; " +
        "INSERT INTO BISystem.UnitTests (TriggerId,TestName,TestResult,FailureMessage,Inserted)" +
        "VALUES (@TriggerId,@TestName,@TestResult,@FailureMessage,@Inserted); " 
    
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@TriggerId",[Data.SQLDBType]::NVarChar, 50))) | Out-Null
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@TestName",[Data.SQLDBType]::NVarChar, 200))) | Out-Null
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@TestResult",[Data.SQLDBType]::NVarChar, 10))) | Out-Null
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@FailureMessage",[Data.SQLDBType]::NVarChar, 500))) | Out-Null
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@Inserted",[Data.SQLDBType]::DateTime2))) | Out-Null

    $sqlCommand.Parameters[0].Value = $PipelineTriggerID
    $sqlCommand.Parameters[1].Value = $TestName
    $sqlCommand.Parameters[2].Value = $TestResult
    $sqlCommand.Parameters[3].Value = $TestError
    $sqlCommand.Parameters[4].Value = $Inserted

    $sqlCommand.ExecuteNonQuery()

}

 

Time to invoke the test scripts. This can be done through the Invoke-Pester command. As mentioned above, I had to pass in the test scripts' location. I also defined the PassThru parameter so that the results were passed into an $Output object ready to deconstruct and write to SQL DB. While it's not mandatory, you can also define the test groups to pick up as part of the invocation, which has the added bonus of supporting wildcards.

The script will then loop round each object in the array of the $Output object defined previously. For each one, it will extract the attributes of the test that I’m after and then call my Insert to DB function to write the results to SQLDB.

# Invoke Pester
$Output = Invoke-Pester -Script C:\Temp\PesterTests.ps1 -PassThru -TestName '*CaSIE*' 

# Output Results
Write-Output "TestNameFilter: $($Output.TestNameFilter)." 
Write-Output "Total Count: $($Output.TotalCount)." 
Write-Output "Passed Count: $($Output.PassedCount)." 
Write-Output "Failed Count: $($Output.FailedCount)."
Write-Output "Time: $($Output.Time)."
Write-Output ""

# Loop over TestResult objects to submit to database
$Output.TestResult | ForEach-Object {
    $TestName = $_.Name
    $TestResult = $_.Result
    $TestError = $_.FailureMessage 
    Write-Output "Test Result: $($TestName), $($TestResult), $($TestError)" 
    Do-InsertRow
} 

# Close the connection.
if ($sqlConnection.State -eq [Data.ConnectionState]::Open) {
    $sqlConnection.Close()
}

# Show done when finished (for testing/logging purpose only)
Write-Output "Finished Running Tests"

 

Once everything has been created, you've tested the scripts via the test pane, and you've checked the results populate into the database, you are ready to hook it into your ADF pipelines and integrate it into your solution! To do this is very simple – within each runbook you can create a webhook for any external resources to call. This comes in the form of a secret URL which will then kick off the runbook with the embedded script. Passing parameters into the runbook at this point requires a bit more work, and I go into this detail in a separate blog post. Be careful to take a copy of this URL, as you cannot view it after creation. It's then just a case of creating a web activity in your ADF pipeline to call it and run the scripts. All in all, a very straightforward mechanism.

Once the results were in SQL DB I also defined a view on top of the table, as I was having trouble extracting the test group. To make things simple, I just re-worded the individual test cases to include the group name, and then used CHARINDEX to split it out, meaning I could also report by group. I also added a field to flag the latest test result set, so that we only report on the most recent run. I then built a Power BI report on top of this to integrate into our existing solution (below). Test results were presented as a measure, % Passed (defined below), displayed overall, against each test group, and then against each individual test.

% Passed =
VAR PercentPassed =
    DIVIDE(
        CALCULATE(COUNT('BISystem UnitTestsView'[TestId]), 'BISystem UnitTestsView'[TestResult] = "Passed", 'BISystem UnitTestsView'[LatestTest] = 1),
        CALCULATE(COUNT('BISystem UnitTestsView'[TestId]), 'BISystem UnitTestsView'[LatestTest] = 1)
    )
RETURN
    IF(PercentPassed = BLANK(), 0, PercentPassed)

 

image

 

Summary

Pester is a great tool for adding automated testing to your project. It took a few hours to stand up, but now that's been done, it's just a case of defining tests. The ubiquitous nature of the framework means we can define all sorts of tests across data/software/hardware. The thing I like the most about using it is that it's simple. While I decided it was best to populate the test results into a database for ease of reporting, it might also be worth investigating populating the VSTS testing framework instead. Hopefully you will find this blog useful.

Using Azure Automation to Archive SFTP Files Orchestrated through ADF

I recently ran into a problem which required me to work with an SFTP site as part of our ETL process in Azure. In the traditional BI stack, FTP tasks would be natively supported, but this is no longer the case when working with Azure – it now requires a bit of plumbing. This blog will take a look at the ways in which you can interact with FTP/SFTP sites as part of your pipelines in ADF and the issues I ran into. Specifically, the task I was trying to solve was to archive files (move them to a separate folder) that had passed into our staging layer, so they would not be picked up on the next process run. The blog will also cover passing parameters into a webhook, as I also needed the solution to be flexible enough to handle different SFTP sources.

Like many similar tasks in Azure, you can go about this in a number of ways (Logic Apps, Functions, PS runbooks, etc.), but it's about finding the solution that fits your architecture and that you are comfortable developing and maintaining. I initially looked at using Logic Apps, since quite a bit of the work is done for you – they have the connectors already set up, and the framework to achieve some simple tasks, which sounded perfect for the job. Unfortunately, Logic Apps is not designed for any type of heavy lifting. You can move files using the platform, but you are capped at a mere 50MB when trying to pick up and put down a file. Rumours were that Microsoft would extend this to 1GB at some point during 2018, but this has not been the case so far. There was also no task to just change the remote path location as far as I could tell – it looks like people are up-voting it here.

Next I looked at using Azure Functions, since I would do something similar through SSIS if this were a traditional stack problem. This involves writing a bit of C# to talk to the SFTP site and do the tasks for you. While this is very possible, it didn't fit into our existing architecture. I also felt it was easier to build something maintainable using PowerShell – a language which I'm enjoying coding in more and more.

Lastly I looked at using Azure Automation and doing the tasks through PowerShell. There are a number of modules which you can use to achieve the goal, of which I used Posh-SSH. I also looked at WinSCP but found the Posh-SSH cmdlets more flexible.

The first thing I did was work out how to pass parameters from a webhook into the script. This was so that the script could be used across multiple SFTP folders. In ADF it was as simple as defining the headers/body to pass with the POST call to the webhook.

 

image

 

Within the PS script I then defined the following region to receive the parameters. The parameters are passed in as an object which is then deconstructed into the various header/body elements. The body is also deconstructed further from JSON into the variable I wanted. I believe it is also mandatory to name the object received by the script $WebhookData, as this is not configurable externally.

# Get parameter value
Param
([object]$WebhookData) 

#region Verify Webhook
if ($WebHookData){

    # Collect properties of WebhookData
    $WebhookName     =     $WebHookData.WebhookName
    $WebhookHeaders  =     $WebHookData.RequestHeader
    $WebhookBody     =     $WebHookData.RequestBody

    # Collect individual headers if required. Input converted from JSON.
    $Input = (ConvertFrom-Json -InputObject $WebhookBody)
    Write-Output  "WebhookBody: $($Input)"
    Write-Output -InputObject ('Runbook started from webhook {0}.' -f $WebhookName)
    
    # Extract variables
    $folderName = $Input.FolderName
}
else
{
   Write-Error -Message 'Runbook was not started from Webhook' -ErrorAction stop
}
#endregion

 

Once this is defined within the script, you can set up the webhook from the Automation end and paste the URL it generates back into ADF. It's also at this point that the webhook configuration allows you to define a parameter as part of the webhook. Without the body of code above, this will not be an option – it cannot be created first, as I originally assumed.

 

image

 

Once the parameter passing has been dealt with, the code block for connecting to the SFTP site and performing the task is straightforward.

# Get variable values
$userName = Get-AutomationVariable -Name 'SftpUserName'
$userPassword = Get-AutomationVariable -Name 'SftpPassword'
$hostName = Get-AutomationVariable -Name 'SftpHost'
$port = Get-AutomationVariable -Name 'SftpPort'

# Create PS credential from username/password
$userPassword = ConvertTo-SecureString -String $userPassword -AsPlainText -Force
$userCredential = New-Object -TypeName System.Management.Automation.PSCredential ($userName, $userPassword)

# Create new SFTP session
$session = New-SFTPSession -ComputerName $hostName -Credential $userCredential -Port $port -AcceptKey 

# Retrieve child objects on remote path and for every zip rename to zip.bk
$remotePath = "/Data/" + $folderName + "/"
Get-SFTPChildItem $session $remotePath| 
    ForEach-Object {
        if ($_.Fullname -like '*.zip' -Or $_.Fullname -like "*.txt") {  
            $NewName = $_.Name + ".bk"
            Rename-SFTPFile $session -Path $_.FullName -NewName $NewName
            Write-Output "$($_.FullName) has been renamed to $($NewName)"
        }
    }

 

The first task is to extract the variables from the Automation resource. These should not be hard-coded into the script, but stored as variables external to the runbook. Then a PSCredential requires configuration in the format above.

Finally the task to archive the files is performed. This involves creating a new SFTP session, and iterating over the child items within the remote path. For each file it finds, it will then perform the archiving process as long as they have the correct extension.

While I originally wanted to move the files between folders I realised this was not possible with the modules I was using. I’m sure with a bit more research something would be available out there to achieve this, specifically changing the path of a remote file.

Therefore I was limited to changing the name of a remote file by adding a .bk to the end of the filename. ADF would then only pass over files ending in .zip the next time round.

 

Conclusion

I'm hoping this will be useful for anyone in the same situation. I've specifically used the webhook parameters multiple times now. While I didn't achieve my original goal, with a small workaround I still satisfied my original requirement to archive the files so they are excluded from future process runs. There are also some other code snippets here that might be a good reference for future implementations, specifically around the PSCredential creation. Hope it helps!

Iconography in Design

In this blog I will go about discussing the importance of design, in particular looking at how icons can help add the finishing touches to a piece of front end development.

Whether you're building a Power BI report, an SSRS report, a PowerApp, or doing any kind of front-end visualisation and design, you'll know that 50% of the battle is making whatever you're building jump out of the page. You could have built the most useful report that a user could wish for, but unless it looks good, you're not going to get any plaudits. Essentially, good design goes a long way towards making a piece of development a success. This is even more important because the underlying architecture/ETL, for all our good practice and thoughtfulness in this area as developers, will rarely impress or have an impact on an end user.

It's important to get inspiration for design from somewhere, so before designing or even building something, take a look on Google, do some research and try to find similar things to what you're doing and see what you think looks good, or not so good. You should build up an idea of how you want to design your landing page, report or report headers, etc. within the tool. If you're in Power BI, take a look at the partner showcase for example. There are some really good examples to give you ideas – like this one!

Microsoft apps such as Power BI or PowerApps go a long way in helping us build the best when it comes to data visualisation, but unfortunately they sometimes rely on developers to go outside the box to finish things off.

 

Icon Finder

image

Recently I've started to use a site called iconfinder.com which has a large pool of useful icons you can use for building out certain corners of apps that the standard MS tooling will not support. The icons are mostly $2 each, but you can get a subscription for £20/month, or $10 with a discount code (which you should cancel as soon as you register). Don't jump straight into the subscription if you only think you might need it as a resource – alternatively, save the icon using Chrome and use it as a placeholder until you are happy to push to production. For non-subscribed users, the icon will always come on a background of faint grey lines. This isn't too bad, as the lines don't completely ruin the look and feel of the icon during development, and they are good for a placeholder for demos, etc. To get started, just type in a keyword for the type of icon you're looking for, and then it's just a case of wading through the results to find the icon that fits the look and feel you're inspired to build against. Sometimes you'll get another bit of inspiration off the back of this, which you can use as another keyword to find even more icons.

The site also comes with a very handy icon editor tool – essentially Paint Shop Pro on the web. There are lots of these tools out there, but it's useful having it all integrated in one place at no extra cost. If subscribed, it will load the SVG icon into the editor automatically, which then allows you to edit colours or shapes etc. In my instance, I found a nice % complete icon set which would look good on a white background. Unfortunately, I wanted it on a blue header bar, so I needed to change it slightly to fit the look and feel. No problem – it took less than a minute to modify and download.

image

It's also worth mentioning that the site does a good job of helping you find a pool of icons that fit together nicely with the same look and feel, by showing you icons from the same icon set. In one instance, I swapped an icon for one from the same set even though it wasn't exactly what I was looking for – because overall it just felt like it fitted together more nicely with the other icons on the screen.

 

Design in Practice

As mentioned above, doing some research before you build can really help you create a much better finish. For inspiration for a recent PowerApps design I did a quick search for landing pages on Google, and found a few I liked the look of (below). As long as the general elements you are working to are similar, it really doesn’t matter where the inspiration comes from. In these cases, they were in the form of mobile apps.

imageimage

From these images, I was able to identify the key components which made me bookmark them:

  • I wanted some kind of non-offensive background, possibly semi-transparent, or with overlay.
  • I wanted a title that stands out on the page, so white on grey or similar.
  • I wanted a small section for a blurb for the PowerApp.
  • I wanted 2 buttons, and the buttons to stand out.
  • I wanted logos in the top corners.
  • I wanted a nice look and feel for the colour palette.

From this, I then produced the following landing page.

image

I found the background on picjumbo.com which turned out to be quite a nice resource for some generic business style artwork, and then added a blurring filter across the top. This still interferes slightly with the buttons / title so I’m not completely happy but satisfied enough that it achieves the look and feel I was looking for. For the title, the range of fonts supplied with PowerApps is rather limited so I could go externally for this too but was happy enough for the time being. The layout also leaves room to shrink the title and add a small blurb if need be. The buttons are made up of a number of icons and fit with the theme for the app.

image

As mentioned above, I also added % complete icons to each page so users were able to understand how far they were along the scoring pages within the app. PowerApps provides sufficient icons for the back/refresh buttons that fit in with the white on blue theme, so I didn’t have to go externally for these. These were placed on the page header next to the logo.

 

 

Power BI

While this most recent bit of design was focused on PowerApps, I also add small touches to Power BI during report design. For instance, rather than just have a generic button that pushes you to a “details” page with a table giving a row-by-row breakdown of some aggregated data, I looked for an icon, edited the colour palette slightly and added this to the report. With recent Power BI functionality, I can make the image act as a button and redirect the user to another page.

image

I've also used icons in dashboard design where a single visual didn't really represent the content of the report to drill into. This can also be a good way of adding a bit of flavour to a dashboard to mix things up. In my case, it also made the drill-down into the report level less ambiguous, by posing the title as a question about whether that's what the user wants to do.

image

Design is always subjective of course, but its great to use other resources at your disposal to go about building out apps. Depending upon the current estate in which you develop, it also helps them stand out a bit more and add a unique context to the reports/apps within the project. Hopefully this blog has given you a few ideas for your next project!

Adatibits - Scalable Deep Learning with Azure Batch AI

image

 

I recently had the pleasure of attending SQLBits; there were a number of great talks this year, and I think overall it was a fantastic event. As part of our commitment to learning at Adatis, we were also challenged to present back to the team something of interest we learnt at the event. This then becomes our internal Adatibits event, where we get a sample of sessions from across the week.

As such, I was pleased when I saw a talk by Ben Keen on the Friday bill of SQLBits that revolved around deep learning. Having just come through the Microsoft Data Science program, this fell in line with my interest and research in data science, and focused on the bleeding edge of the subject area. It was also the talk I was probably looking forward to the most from the synopsis, and I thought it was very well delivered.

Anyway, on to the blog. In the following paragraphs, I’ll cover a cut down version of the talk and also talk about my experience using the MNIST dataset on Azure Batch AI. Credit to Ben for a few of the images I’ve used as they came off his slide deck.

 

What is Deep Learning?

So you've heard of machine learning (ML), and what it can do – deep learning is essentially a subset of ML which uses neural network architectures (loosely inspired by the human brain). It can work with both supervised (labelled) and unsupervised data, although its value today tends to come from learning from labelled data. You can almost think of it as an evolution of ML. Its performance tends to improve the more data you can throw at it, where traditional ML algorithms seem to plateau. Deep learning also has an ability to perform automatic feature extraction from raw data (feature learning), whereas the traditional routes have features provided as part of the dataset.

 

How does it help us? What are its use cases?

Deep learning excels at finding patterns in unstructured data. The kinds of examples it handles would be very difficult to write a program for directly, which is where the field of DL comes in. The use cases are usually split into 4 main areas – image, text, sound, and video.

 

Deep Neural Network Example

A simple example of how we can use deep learning is to understand the complexity around house prices. Taking an input layer of neurons for things such as age, sq. footage, bedrooms, and location of a house, one could apply the traditional linear formula y = mx + c to weight the neurons and calculate the house price. This is a very simplistic view, though, and there are many more factors that can change the value. This often involves the different inputs interacting with one another.

For example, people would think a house with a large number of bedrooms is a good thing and this would raise the price of the house, but if all those bedrooms were really small, then this wouldn't be a very attractive offering for anyone other than a developer (and even then they might baulk at the effort involved), therefore lowering the price. This is especially true with people wanting more open-plan houses nowadays. So while traditionally people might have been interested in the number of bedrooms, in recent times sq. footage may have become the main driver.

 

image

 

Therefore a number of weights can be attributed to an intermediary layer called the hidden layer.

The neural network architecture then uses this hidden layer to make a better prediction. It does this by starting off with completely arbitrary weights (which will make a poor prediction). The process then multiplies the inputs by these weights and applies an activation function to get to the hidden layer (a1, a2, a3). These neurons are then multiplied by another set of weights and passed through another activation function to generate the prediction. This value is then scored and evaluated via some form of loss function (such as Root Mean Squared Error). The weights are adjusted accordingly and the process is repeated; the adjustment is made through a process called gradient descent.
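To make the loop above a little more concrete, here is a heavily simplified Python sketch of a single-hidden-layer network trained by gradient descent. The data, layer sizes and learning rate are arbitrary assumptions for illustration only.

# A minimal, illustrative sketch of the training loop described above.
# The data, layer sizes and learning rate are arbitrary assumptions for the example.
import numpy as np

np.random.seed(0)

# 4 input features per house: age, sq. footage, bedrooms, location score (all made up)
X = np.random.rand(100, 4)
y = (X @ np.array([1.0, 3.0, 0.5, 2.0]) + 0.1).reshape(-1, 1)  # synthetic "house prices"

# Arbitrary starting weights for input -> hidden (3 neurons) and hidden -> output
W1, b1 = np.random.randn(4, 3) * 0.1, np.zeros((1, 3))
W2, b2 = np.random.randn(3, 1) * 0.1, np.zeros((1, 1))
lr = 0.1

for epoch in range(500):
    # Forward pass: inputs x weights, activation to get the hidden layer (a1, a2, a3)
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)          # ReLU activation
    y_hat = a1 @ W2 + b2            # prediction (linear output for regression)

    # Loss function: root mean squared error
    rmse = np.sqrt(np.mean((y_hat - y) ** 2))

    # Backward pass: adjust the weights via gradient descent on the squared error
    grad_y = 2 * (y_hat - y) / len(X)
    dW2 = a1.T @ grad_y
    db2 = grad_y.sum(axis=0, keepdims=True)
    grad_a1 = grad_y @ W2.T
    grad_z1 = grad_a1 * (z1 > 0)
    dW1 = X.T @ grad_z1
    db1 = grad_z1.sum(axis=0, keepdims=True)

    W1, b1, W2, b2 = W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2

print(f"Final RMSE: {rmse:.4f}")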

 

image

 

To give you an idea of the scale of weight optimisation required – a 50x50 RGB image has 7,500 input neurons. Therefore we're going to need to scale out, as the training is very compute intensive. This is where Azure Batch AI comes in!

 

Azure Batch AI

Azure Batch AI is a managed service for training deep learning models in parallel at scale.

It’s currently in public preview, which I believe it entered around September 2017. It’s built on top of Azure Batch and essentially sells the standard Azure story where it provides the infrastructure for data scientists so they don’t need to worry about it and can get on with more practical work.

You can take your ML tools and workbooks (CNTK, TensorFlow, Python, etc.) and provision a GPU cluster on demand to run them against. It's important to note the provisioning is of GPUs, not CPUs – similar cores, less money, and less power consumed for this type of activity.

Once trained, the service can provide access to the trained model via apps and data services.

 

image

 

As part of my interest in the subject, I then went and looked at using the service to train a model on the MNIST dataset. This is a collection of handwritten digits from 0 to 9 with 60,000 training examples. It's a great dataset for trying out learning techniques and pattern recognition methods while spending minimal effort on pre-processing and formatting. This is not always the case with images, as they usually contain a lot of noise and require time to convert into a format ready for training.

 

image

 

I then went through the following process within Azure Batch AI.

Created a Storage Account using Azure CLI along with a file share and directory.

# Login 
az login -u <username> -p <password>

# Register resource providers
az provider register -n Microsoft.BatchAI
az provider register -n Microsoft.Batch

# Create Resource Group
az group create --name AzureBatchAIDemo --location uksouth

# Create storage account to host data/scripts
az storage account create --name azurebatchaidemostorage --sku Standard_LRS --resource-group AzureBatchAIDemo

# Create File Share
az storage share create --account-name azurebatchaidemostorage --name batchaiquickstart

# Create Directory
az storage directory create --share-name batchaiquickstart --name mnistcntksample --account-name azurebatchaidemostorage

 

Uploaded the training / test datasets, and Python script.

# Upload train, test and script files
az storage file upload --share-name batchaiquickstart --source Train-28x28_cntk_text.txt --path mnistcntksample --account-name azurebatchaidemostorage
az storage file upload --share-name batchaiquickstart --source Test-28x28_cntk_text.txt --path mnistcntksample --account-name azurebatchaidemostorage
az storage file upload --share-name batchaiquickstart --source ConvNet_MNIST.py --path mnistcntksample --account-name azurebatchaidemostorage

 

Provisioned a GPU cluster. The NC6 consists of 1 GPU, 6 vCPUs and 56GB memory, and is roughly 80p/hour. This scales all the way up to an ND24, which has 4 GPUs and 448GB memory, for roughly £7.40/hour.

# Create GPU Cluster NC6 is a NVIDIA K80 GPU
az batchai cluster create --name azurebatchaidemocluster --vm-size STANDARD_NC6 --image UbuntuLTS --min 1 --max 1 --storage-account-name azurebatchaidemostorage --afs-name batchaiquickstart --afs-mount-path azurefileshare --user-name <username> --password <password> --resource-group AzureBatchAIDemo --location westeurope

# Cluster status overview
az batchai cluster list -o table

 

Created a training job from a JSON template – this tells the cluster where to find the scripts and the data, how many nodes to use, what container to use, and where to store the trained model. This can then be run!

# Create a training job from a JSON template
az batchai job create --name batchaidemo --cluster-name azurebatchaidemocluster --config batchaidemo.json --resource-group AzureBatchAIDemo --location westeurope

# Job status
az batchai job list -o table

 

The output can be seen in real time along with the epochs and metrics. An epoch is essentially a full training pass over the data; training for multiple epochs lets the model keep refining its weights, which helps it generalise and fit real-world data better.

# Output metadata
az batchai job list-files --name batchaidemo --output-directory-id stdouterr --resource-group AzureBatchAIDemo

# Observe realtime output
az batchai job stream-file --job-name batchaidemo  --output-directory-id stdouterr --name stderr.txt --resource-group AzureBatchAIDemo

image 

 

The pipeline can also be seen in the Azure portal along with links to the output metadata.

image

 

Once the model has been trained, it can be extracted, and the resources can be cleared down.

# Clean Up
az batchai job delete --name batchaidemo  
az batchai cluster delete --name azurebatchaidemocluster 
az group delete --name AzureBatchAIDemo

 

Conclusion

By moving the compute into Azure and having the ability to scale, we can train and iterate on models much faster. This in turn means better hyperparameter tuning to generate better weightings, which means better models and better predictions.

As Steph Locke also alluded to in her data science talk – this means data scientists can do more work on things that they are good at, rather than waiting around for models to train to re-evaluate. Deep learning is certainly an interesting space to be in currently!

Improving Machine Learning Models

As part of my MS Data Science Professional Program, a number of the recent topics have been based around getting the most out of an Azure ML model. In this blog, I will look at the techniques and ways in which you can model and improve a solution. While I was tackling this problem with Azure ML, these techniques apply to building better models in other languages/platforms such as Python or Scala.

 

Data Munging

A process that undoubtedly every data scientist goes through with every DS problem they face is that of data munging. This is a term being used more frequently to describe the process of transforming data from its raw state into another format – something more valuable for downstream analytics. Models running on data that is poor in quality, missing or duplicated will produce poor predictions. Therefore a very simple precursor to any problem is to explore the data and understand what is required to turn the data set into a form which is better for the latter stages of the model design. The following techniques are used as part of this stage:

  • Removing Duplicate Rows
  • Cleaning Missing Data (custom substitution of numerics, usually 1 or 0, or replacing with the mean, median or mode across the dataset)
  • Cleaning Missing Data (removing bad quality rows entirely or removing bad quality columns entirely – more extreme)
  • Creating Categorical Features from String Features
  • Normalising Numeric Features (ZScore or Min/Max) to bring everything on to the same scale.

These steps are quite basic so I won't go into detail here, but nonetheless they should be considered at the start of modelling a better solution to a problem – a rough sketch of what they might look like in code is shown below.
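As an illustration only (in Azure ML the equivalent work is done with the built-in modules rather than code), a minimal pandas sketch of these munging steps might look like this – the file and column names are assumptions:

# Illustrative data munging sketch using pandas; file and column names are made up.
import pandas as pd

df = pd.read_csv("raw_data.csv")            # hypothetical raw extract

# Remove duplicate rows
df = df.drop_duplicates()

# Clean missing data - custom substitution, or replacing with the mean/median/mode
df["NumericFeature"] = df["NumericFeature"].fillna(df["NumericFeature"].mean())
df["OtherNumeric"] = df["OtherNumeric"].fillna(0)

# More extreme option: drop rows that are mostly missing (or drop bad columns entirely)
df = df.dropna(thresh=int(0.5 * df.shape[1]))

# Create categorical features from string features
df["Category"] = df["StringFeature"].astype("category").cat.codes

# Normalise numeric features (Z-score) so everything is on the same scale
df["NumericFeature"] = (df["NumericFeature"] - df["NumericFeature"].mean()) / df["NumericFeature"].std()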

 

Feature Selection

One of the recurring principles in machine learning is Ockham's razor, which states that the best models are simple models that fit the data well; this is not an irrefutable principle of logic, but a preference for simplicity. There is therefore a need to balance accuracy against simplicity and to limit the feature set, which tends to lead to better predictions. Simpler models are also more interpretable to humans, which helps. While the data I was working with was limited to around 35 features, many data science problems have thousands of features, so this technique becomes even more crucial.

There are multiple methods to perform feature selection, of which a few will be covered here. The first is greedy backward selection, which starts with all the features, finds the feature that hurts predictive power the least when removed, and removes it. This is done iteratively until a stopping point is met (discussed below). It's known as greedy since it never looks back after removing a feature.

An alternative method is greedy forward selection, which is basically the inverse: it starts with no features and looks for the single feature that by itself makes the best model. It then carries on in a similar vein to backward selection, but adding features. The point at which you stop with forward selection is when you hit diminishing returns in accuracy.

Defining accuracy is important here, and this is where a formula called Adjusted R² comes in. R² is a measure of how well the model fits the data, with values closer to 1 indicating a better fit. The adjusted version adds a penalty for every term in the model, so it balances the size of the model against its accuracy. You therefore need enough features for your R² to be large, but not so many that they drag the Adjusted R² down.
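For reference, a minimal sketch of the standard Adjusted R² calculation (in Azure ML this sort of metric is reported for you rather than hand-computed):

# Adjusted R-squared: penalises R-squared for the number of features (p) given n observations.
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Example: the same R-squared looks worse once many more features are involved
print(adjusted_r2(0.85, n=1000, p=5))    # ~0.849
print(adjusted_r2(0.85, n=1000, p=300))  # ~0.786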


Permutation Feature Importance

Building on the feature selection theory, to prune the feature set down to those features that are meaningful for prediction you can use a module in Azure ML called Permutation Feature Importance. This essentially re-scores the model a number of times, randomly shuffling (permuting) the values of one feature at a time, looks at how much your metric changes as a result, and then ranks the features in order of importance. Depending on what you are trying to model, i.e. a classification or regression problem, there are a number of metric options for measuring performance. In my instance, I was interested in the RMSE (Root Mean Squared Error), which in simple terms represents the sample standard deviation of the differences between predicted and observed values. It aggregates the magnitudes of the prediction errors into a single measure of predictive power. The closer to 0, the better the predictive power – but it's also worth noting this is relative to what you are trying to measure.

Once the model has been run through, you can visualise the list of features and their contribution to the RMSE. At this point it does not necessarily matter whether a feature contributes a positive or negative value to the RMSE, as long as the value is not 0. Any values of 0 indicate that the feature contributes nothing to importance – essentially, whether it is part of the model or not, it adds nothing. You can then follow backward pruning techniques to remove these columns from the feature set. It is worth running the model again afterwards to check the feature importance, as the removal of those features may impact other features; if more features then have a value of 0, you should remove those too, and repeat. You can measure the impact of the changes using separate pipelines, passing the output into the same evaluation module and checking the ROC curve (described below). Even with the RMSE staying the same between the two pipelines, by removing features you build a model which is more likely to generalise and be more effective in the real world when values change.
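Outside of Azure ML, the same idea can be sketched with scikit-learn's permutation_importance – a rough equivalent of the module, not the module itself, and the model and data here are placeholders (assuming a recent scikit-learn):

# Rough scikit-learn equivalent of permutation feature importance; data is synthetic.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much the RMSE-based score degrades
result = permutation_importance(model, X_test, y_test,
                                scoring="neg_root_mean_squared_error",
                                n_repeats=10, random_state=42)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"Feature {idx}: {result.importances_mean[idx]:.3f}")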

 

image

 

Picking the Best Model Type

There is no reason to believe that any particular machine learning model will have the best performance (although we always have favourites); a classification model type that works best for one set of features and labels in a dataset does not always work best for another. When modelling any dataset, testing and comparing multiple machine learning models is usually a good approach. It's also important to note that the performance achieved with any particular machine learning model can change after feature engineering, so it is best to run the selection after that stage. The following experiment evaluates logistic regression, boosted decision trees, neural networks and support vector machines against the same dataset to find out which is best.

 

image

 

To understand the performance of a machine learning model, there are a number of techniques to use. The easiest is to pass the output of each model into an Evaluate Model module, which accepts up to 2 scored datasets at a time (left and right inputs). After the experiment is run, you can visualise the output of the models using this module and examine the ROC curve. The first scored dataset (blue) represents the original model (in this case a neural network), and the scored dataset to compare against (red) represents the second model (in this case a support vector machine). The higher and further to the left the curve, the better the performance of the model (in this case, the neural network).

Scrolling down further, you can also compare the Accuracy, Recall, and AUC (area under the curve) performance metrics. The model with the higher metrics is performing better. In particular, the lower the recall metric, the higher the number of false negatives.

 

clip_image002

 

Parameter Sweeping

Once you've picked the ML model giving you the best predictive power, it will require a set of parameters. For instance, with boosted decision trees this takes the form of the maximum number of leaves per tree (controlling depth), the number of trees (controlling width), the minimum samples per leaf, and the learning rate. A default set of values is always available, but these will usually need tweaking to improve things further and generate a better RMSE.

This can be done either by sweeping a giant grid of parameters, or by a random sweep – the latter being a lot quicker to process at run time, for obvious reasons. Fortunately, performance is not normally very sensitive to changes in these values if you have done much of the previous analysis first; parameter sweeping really just squeezes the best out of the model.
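As an aside, the difference between the two sweep types is easy to sketch in R; the snippet below samples a handful of combinations from a full grid and keeps the best score. The parameter names are illustrative and fit_and_score is a hypothetical stand-in for your own train-and-validate step:

# Random parameter sweep: evaluate a random subset of the full grid and keep
# the combination with the lowest validation RMSE
set.seed(1)
grid <- expand.grid(learning_rate   = c(0.01, 0.05, 0.1, 0.2),
                    num_trees       = c(50, 100, 200, 400),
                    leaves_per_tree = c(8, 16, 32, 64))

fit_and_score <- function(params) {
  # placeholder: train with these parameters on the training split and
  # return the RMSE on the validation split
  runif(1)
}

random_subset <- grid[sample(nrow(grid), 10), ]   # try 10 of the 64 combinations
scores <- apply(random_subset, 1, function(p) fit_and_score(as.list(p)))
random_subset[which.min(scores), ]                # best combination found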

In Azure ML, this can be done via the Tune Model Hyperparameters module. The same metric options are available as in the feature selection module, so I was interested in the RMSE again. As part of tuning the parameters through this module, the training data needs to be split beforehand – this can be done 50:50 – so that the parameters have a set of data to validate against. This is kept separate from the scoring dataset as usual, which is another completely separate set of data. Once the model has run, we can evaluate the best parameters against the original model using the RMSE, as well as Accuracy, Recall and AUC – very similar to the previous evaluation techniques. Visualising the sweep results will display the parameters used; these can then be programmed back into the original ML model, and the Tune Model Hyperparameters module removed, to speed things up on future runs.

A process of nested cross validation can be used on top of this to build confidence that the correct parameters have been used and it wasn’t just luck that they ended up being better than another set.

 

image

 

Conclusion

Once you have been through this process, you will want to run cross-validation, which runs the data through the model multiple times (folds), using different data for training and scoring each time. You can then generate a mean and standard deviation for each fold to prove the model is consistent across the dataset and will not be skewed by new data in future predictions. This gives you a good idea of whether the model will generalise well and be robust enough to move to production.

Of course, there are many more techniques than the ones listed here, but this should give you a good introduction to the ones to look at to deliver predictive power from your model.

Considerations for Creating a Power BI Enterprise Report Deck

Creating or re-creating an enterprise report deck in Power BI should be reasonably straightforward given a specification, but there are a number of considerations to make when building something of this nature. In the following blog post I will detail some of these, and the route I would suggest taking. The content of this blog revolves around tabular reports more than chart visuals, but the same themes apply.

 

Fonts

I think it goes without saying that you should keep the font consistent across both a single report and the report deck. The default for Power BI is Segoe UI, which for the most part is pleasant – just be careful not to flick between this and Segoe UI Light, as this can cause discrepancies. It is, however, the font size that will cause you more of an issue.

  1. The first consideration is to set a minimum size. As report developers we want the report to look as tidy as possible, and usually this means fitting everything on 1 page. The easiest way to do this is to set the font smaller if you are having space issues – but this does not always translate so well to the end user. Depending on the device, they may consider the minimum Power BI lets you set (size 8) as too small for consumption on something such as an iPad – so this is worth checking first.
  2. The second consideration is to set the font size for different elements of the report, i.e. row level data at something like a 10, and header level elements at a 12. Anything else that exists such as filter elements should be set the same as the header levels. I would usually set titles a number of points above this, at something like an 18. In general, having varying levels of font size on a single report between elements will look inconsistent so the aim here is consistency!
  3. The third consideration, if possible, is to keep the font size the same across all the reports within the report deck for the same type of element. Again, this adds a consistent feel to the deck. If one report has more rows than another, in my opinion it's still better to use the same font size across both, rather than filling the page on both using varying sizes.
  4. The last consideration is to be careful when mixing text from a textbox and a card together in the same area of the report. Unfortunately Power BI does not currently support expressions like SSRS does, so a textbox is for static text only; dynamic text can only be created through a measure and assigned to a card. However, having both elements side by side does not give the expected result: font sizing in a text box is not the same as in a card – a card at size 10 is roughly equivalent to a text box at size 13 (although you can only set 12 or 14) – leaving you with inconsistent fonts between elements. My suggestion is to create measures for the text elements on the report and use card visuals for both static and dynamic text, so everything has a consistent look and feel. If a report only needs static text, stick to text boxes.

 

Objects

The next consideration is around the number of objects on a report – keep it simple. Avoid building a giant monolithic report: the more objects you use, the slower the report will perform on the Power BI service, on iPads, and even during development. This is especially true for tables/matrices, which each need to fire off separate DAX queries to return their data. Too many objects also has knock-on effects when exporting to PowerPoint, as objects will overlap with one another more – which may not be as noticeable within the Power BI service but will affect other apps. You can use the selection pane (on the View tab) to move objects above/below one another, which will bring forward/push back the elements.

 

Alignment

Another scenario I have come across is that it is sometimes necessary to include a column header between the header for a measure and the actual data – for instance to specify the currency or unit. There are 2 options available. The first is to set the headers of the table to white text and insert text boxes over their position. While this achieves the goal, the final look and feel is not ideal: a large proportion of time is spent aligning the columns with the text in the text boxes, and even then it can still be pixels out of alignment. Adding or removing measures then means going through the same process again as everything shifts left or right. Fortunately, the November release of Power BI added the ability to align data within tables better. A better approach for this scenario is to rename the measures within the table visual to whichever unit you want to show for that column. The downside is that, as a developer, you then need to hover over the measures to see where the original measure came from – a small annoyance that is more than compensated for by the time saved trying to align things within a table. It also means fewer objects in the report, and fewer objects generally make for a quicker, more manageable report.

For anyone particularly new to Power BI, you can use the arrow keys to move a single element around pixel by pixel to help with alignment. There are also options on the Format tab. I'm still waiting for the ability to pixel-nudge multiple elements when selected together!

clip_image002

 

Colour

Hopefully you will be guided in terms of colour by a corporate colour scheme. This often comprises a set of 4 or so main RGB values, complemented by a further set of colours. Pick 1 (or even 2) of these from the main set and use them for the report framework – either the border/frame, or the report header/footer – and then use another colour for the table headers, or two if the report requires double table headers. Again, consistency is the key across reports within the deck. If using double headers for the columns, consider using the colours as a background behind the header text rather than colouring the text itself in the corporate colour, and consider white text on the darker backgrounds.

 

Parameter Selection

Most reports will contain some kind of slicer visual to allow the user to change the context of the data – usually by period. As part of the report build you'll need to assess the best position for this on the report, and keep it consistent between reports within the deck. If your reports need the real estate going across the page (i.e. wide tables), consider the top of the report; if they need the real estate going down the page (i.e. long tables), consider the right-hand side. By default I would build it at the top, alongside any logos/titles. If you require multiple slicers, consider moving these to the side and building a panel to group them together.

Another little hidden option is switching the slicer visual between List and Dropdown (circled red below). For some reason List is the default, but most users will agree that the dropdown is neater and saves space. I'm not sure why this option appears here rather than in the standard visual configuration tab – maybe it will move soon. The dropdown slicer still has some issues which I hope will be fixed, such as not de-selecting the dropdown after a selection has been made; another click is required outside of the dropdown to hide the options. This is not ideal for end users, and there seems to be no viable alternative.

image

 

Header Logic Swapping

Unfortunately, as I mentioned previously, Power BI does not support expressions within tables, so switching context based on a parameter is not easy to achieve. It is possible, but it needs to be done entirely within DAX. To keep the DAX measures for this simple, consider specifying the position on the report as the name of the measure. Then, within the definition of the measure, reference other created measures and keep the statement simple, allowing anyone debugging the report to trace the switching logic easily. Also use a DAX formatter such as this to make the DAX as readable as possible – it would be nice for this to be included within Power BI, hopefully soon! I've included an example DAX statement below to provide this functionality.

 

New Measure = 
IF (
    HASONEVALUE('Dim'[Value]),
    SWITCH(
        VALUES('Dim'[Value]),
        "Comments describing the logic definition", "",
        "Value", 
        [Measure],
        [AlternativeMeasure]
    ),
    BLANK ()
)

 

Template

What does all of this lead to? The answer is a template for the report deck. As well as having guidelines for the items above which make up a report, it's also good to build a physical .pbix template for your suite of reports. This way you are not starting from scratch for each report, and you will get a more consistent feel right down to the pixel level of where the objects sit. Don't over-complicate the template, but leave enough elements on it to save you re-creating them each time you build a new report. I would generally avoid copying an existing report each time to build another, as this will sometimes include elements like PBI-defined measures which you do not want to carry between reports. Instead, define a template and take from that each time.

 

Conclusion

Once you have decided on a number of these points, it is worth gaining a consensus from the product owner that the approach is acceptable to use moving forward. Do not wait until the end of the deck to demonstrate your decisions across the report set – this will leave you with far too much re-work. Instead, aim to deliver one of the more complex reports incorporating a number of the items above, and then apply those decisions to the rest of the deck.

Generating Usage Statistics from a SSAS Tabular Cube

Once you have users accessing your cube it’s almost inevitable at some point that someone will ask you to generate usage statistics from it, and there are a number of methods to achieve this. In this quick blog post, I’ll detail them and my experiences with each, and then use this data to create a PBI report.

 

Native Tabular Properties

The first method is natively through the tabular cube properties. This also has the added bonus (read impact) that it will optimise future aggregations based on usage – in fact that’s its main purpose.

This can be done by setting the CreateQueryLogTable property to true, setting the QueryLogConnectionString (to point to the database where the usage table should be hosted), setting the QueryLogSampling rate (10 means every 10th query will be logged), and finally the name of the QueryLog table.

SQL Server Analysis Services Query Log Related Properties

The advantages of this method are that it's very easy to set up with limited knowledge required, and it could potentially improve performance if you have an environment where users submit repetitive queries. Unfortunately, there are also a number of disadvantages which led me to find other methods. Firstly, it creates a degree of overhead on the cube if it samples too often; we actually had visible performance complaints once we turned it on – either through the sampling or the change to the "optimised" aggregations. Depending on the sampling rate, you could also find that users who rarely use the cube are not picked up as part of the stats. As well as this, any change to the cube structure will cause the logging table to be reset. The table is also limited in terms of what it actually logs (as you can see below) – useful if you just want user and timestamp info, but not much else, with no real ability to configure it.

Query the OlapQueryLog table for Cube activity

 

AS Trace

To that extent, I looked for other tools to do the same task but better and I found AS Trace.

Originally built for SQL Server 2012, it works fine on 2014 and provides the ability to run a trace against cube activity (and log it to a table) exactly like SQL Profiler, but without the overhead of the GUI, which adds unnecessary memory/processor load. It also runs as a Windows service, allowing it to restart automatically when the server reboots; when that happens, the tool logs the existing data to a history table and truncates the logging table. Exactly what I was after.

The tool collects information based on a preconfigured Analysis Services Profiler template, which can be tuned depending on which events you are interested in. I initially ran it with most events selected, and with a limited user set it was generating in the region of 25,000 rows a day – clearly not maintainable over a long period of time. I then used the following blog post to understand what each profiler event was giving me, and created a lightweight trace definition file to capture just what I wanted. I limited it to Query Begin, Query End (for DAX/MDX statements) and Audit Logon/Logout (for session data).

The setup is very straightforward: run install.bat from an elevated account, and check that it installs the service correctly. Next, add your SSAS service account to the Log On properties of the service, and make sure the account has "Log on as a Service" rights, membership of the database you are writing to with DDL and DML access (i.e. able to create and write to tables), and lastly admin rights on the instance of SSAS you intend to trace.

Next, configure the ASTrace.exe.config file with the parameters you want the tool to use. This includes the location of the cube (it can handle multiple cubes), the location of the trace definition file, the location of the database instance and table you want to log to, and lastly whether you want to preserve history on restart. The only thing I couldn't set here is the schema of the logging table, which defaults to dbo.

All that’s left is to start the service, and check the log file to see if it has created any errors on start-up. If not, the table should be created correctly and awaiting input.

I also saw another method while researching using Extended Events (XEvents) but did not implement this once AS Trace provided me with the information I needed.

 

View / Power BI Report

I initially used the data to run a limited set of queries to extract total users, and total queries for a given time period. This was useful to a degree but from the data collected I realised I could be doing so much more.

This led me to do some analysis across the types of metrics being logged, and allowed me to create a view on top of the tables containing what I thought might be useful for a report. I removed all the redundant columns it was tracking, and created friendly names for the EventSubclass and other columns. I used the PATINDEX function to check the query statement for the existence of some important values – while not an exact science, it gives a good picture of the split between certain user groups and the KPIs being run. I've included the view definition below.

I ended up limiting the data to EventClass 10 as this seemed to capture all the necessary data. The only downside I have seen so far is that users querying through the Power BI web service are anonymised under the service account name. I'm currently looking into the Power BI configuration options to resolve this – to pass through the username, as long as it can be matched at the other end.

SELECT
    RowNumber AS ID,
    SPID AS SessionID,
    CurrentTime AS DateQueried,
    NTUserName AS Username,
    CASE EventSubClass
        WHEN 0 THEN 'MDX Query (Excel)'
        WHEN 3 THEN 'DAX Query (Power BI)'
        WHEN 1 THEN 'METADATA Query'
    END AS QueryType,
    CASE Success WHEN 1 THEN 'Successful Query' ELSE 'Query Error' END AS SuccessfulQuery,
    CONVERT(DECIMAL(10,2), CONVERT(DECIMAL(18,3), CPUTime) / 1000) AS CPUTimeSec,
    CONVERT(DECIMAL(10,2), CONVERT(DECIMAL(18,3), Duration) / 1000) AS DurationSec,
    TextData AS Query,
    CASE PATINDEX('%Mexico%', TextData) WHEN 0 THEN 0 ELSE 1 END AS MexicoMarket,
    CASE PATINDEX('%Colombia%', TextData) WHEN 0 THEN 0 ELSE 1 END AS ColombiaMarket,
    CASE PATINDEX('%CS4%', TextData) WHEN 0 THEN 0 ELSE 1 END AS CS4,
    ServerName
FROM
    [dbo].[ASTraceTable]
WHERE
    EventClass = 10

Once I had the view, creating the report was relatively straightforward, and can be seen below.

I included metrics for number of queries by user (blurred out) which also doubled as a filter, the % split of queries for things such as Excel/Power BI, a measure of queries by timeframe, a logarithmic scaled display for queries by query duration, and lastly a split of queries by KPI. I intend to tweak these once I receive more data from the trace, but was relatively happy with the information that they were providing.

Untitled (002)

Please let me know if you have any comments.

A Crash Course in R Part 2

Following on from Part 1, which introduced the R basics (and can be found here), in Part 2 I'll start to use lists, data frames and, more excitingly, graphics.

 

Lists

Lists differ from vectors and matrices because they are able to store different data types together in the same data structure. A list can contain all kinds of R objects - vectors, matrices, data frames, factors and more. They can even store other lists. Unfortunately calculations are less straightforward because there's no predefined structure.

To store data within a list, we use the list() function, e.g. trip <- list("London", "Paris", 220, 3). As mentioned in my previous blog, we can use str(trip) to understand the structure – this will come in handy. Notice the different data types.

Subsetting lists is not quite the same as subsetting vectors. If you try to subset a list using the same single square brackets as a vector, it returns a list containing the first element, not just the first element "London" as a character vector. To extract the element itself, we use double square brackets [[ ]]. We can use similar syntax to before to extract the first and third elements as a new list, [c(1,3)], but the same does not work with double brackets – they are reserved for selecting a single element from a list. If you have a list inside a list, then [[c(1,3)]] would work, and would select the 3rd element of the 1st list! Subsetting by names and by logicals is exactly the same as with vectors.

Another piece of syntax we can use is $ to select an element, but it only works on named lists. To select the destination of the trip, we can use trip$destination. We can also use this syntax to add elements to the list, e.g. trip$equipment <- c("bike", "food", "drink"). A short worked example is below.
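Here is a minimal sketch pulling the above together; the element names (destination, origin, cost, days) are just illustrative:

# Build a named list and subset it in the different ways described above
trip <- list("London", "Paris", 220, 3)
names(trip) <- c("destination", "origin", "cost", "days")

trip[1]               # a list containing the first element
trip[[1]]             # just the character vector "London"
trip[c(1, 3)]         # a new list with the 1st and 3rd elements
trip$destination      # the same as trip[["destination"]]
trip$equipment <- c("bike", "food", "drink")   # add a new element
str(trip)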

Interested in testing your knowledge? Check out the DataCamp exercises here and here.

 

Data Frames

While using R, you'll often be working with data sets. These typically comprise observations, with each observation having a number of variables against it – similar to a customer/product table you'd usually find in a database. This is where data frames come in, as the other data structures are not practical for storing this type of data. In a data frame, the rows correspond to the observations and the columns to the variables. It is similar to a matrix, but can store different data types (like a list!). Under the hood, a data frame is actually a list, but with some special requirements, such as all vectors having equal length and character vectors being converted to factors by default (which is why stringsAsFactors = FALSE is specified below). A data frame is usually created by importing data from source rather than manually, but you can do the latter via the data.frame() function as shown below:

 

# Definition of vectors 
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune") 
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",  
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant") 
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883) 
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67) 
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE) 

# Encoded type as a factor
type_factor <- factor(type) 

# Constructed planets_df
planets_df <- data.frame(planets, type_factor, diameter, rotation, rings, stringsAsFactors = FALSE) 

# Displays the structure of planets_df 
str(planets_df) 

 

Similar to SQL, sometimes you will want just a small portion of the dataset to inspect. This can be done using the head() and tail() functions. We can also use dim() to show the dimensions (the number of rows and columns), but str() is preferable as you get a lot more detail – see the examples below.
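For example, against the planets_df data frame created above:

head(planets_df, 3)   # first 3 observations
tail(planets_df, 3)   # last 3 observations
dim(planets_df)       # 8 5 - i.e. 8 rows and 5 columns
str(planets_df)       # variable names, types and a preview of the values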

Because a data frame is an intersection between a matrix and a list, the subsetting syntax from both can be used, so the familiar [ ], [[ ]], and $ will work to extract elements, vectors, or data frames depending on the type of subsetting performed. We can also use the subset() function, for instance subset(planets_df, subset = rings == TRUE), where the subset argument is a logical expression indicating which rows to keep.
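A few examples of the different subsetting styles against planets_df:

planets_df[1, 3]                      # a single element: row 1, column 3
planets_df[1:3, "diameter"]           # diameter of the first three planets
planets_df$rings                      # the rings column as a logical vector
planets_df[planets_df$rings, ]        # only the planets with rings
subset(planets_df, subset = rings)    # the same result using subset()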

We can also extend the data frame in a similar way to lists/matrices. To add a new variable/column we can use either people$height <- new_vector or people[["height"]] <- new_vector. We can also use cbind(), for example people <- cbind(people, height = new_vector) – it works just the same. Similarly, rbind() can add new rows, but you'll need to add a data frame made up of new vectors to the existing data frame, and the column names between the two data frames need to match. An example against planets_df is shown below.
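A sketch of both against planets_df – the moon counts and the Pluto row are purely illustrative values:

# Add a column of (approximate) moon counts
planets_df$moons <- c(0, 0, 1, 2, 79, 82, 27, 14)

# Append a row: build a one-row data frame with matching column names.
# Expanding the factor levels first makes sure the new value is accepted.
levels(planets_df$type_factor) <- c(levels(planets_df$type_factor), "Dwarf planet")
pluto <- data.frame(planets = "Pluto", type_factor = "Dwarf planet",
                    diameter = 0.186, rotation = -6.39, rings = FALSE, moons = 5)
planets_df <- rbind(planets_df, pluto)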

Lastly, you may also want to sort your data frame, which you can do via the order() function. The function returns the indices of the elements in ascending order – for example, for the vector 21, 23, 22, 25, 24 the order() function returns 1, 3, 2, 5, 4, the positions that would put the values in sorted order. We can then use ages[ranks, ] to re-arrange the rows of the data frame. To sort in descending order, we can use the argument decreasing = TRUE.

 

# Created a desired ordering for planets_df
positions <- order(planets_df$diameter, decreasing = TRUE)

# Created a new, ordered data frame
largest_first_df <- planets_df[positions, ]

# Printed new data frame
largest_first_df

 

Interested in testing your knowledge? Check out the DataCamp exercises here and here.

 

Graphics

One of the main reasons to use R is its graphics capabilities. The difference between R and a program such as Excel is that you can create plots with lines of R code which you can replicate each time. The default graphics functionality of R can do many things, but packages have also been developed to extend this – popular packages include ggplot2, ggvis, and lattice.

The first function to look at is plot() which is a very generic function to plot. For instance, take a data frame of MPs (members of parliament), which contains the name of the MP, their constituency area, their party, their constituency county, number of years as MP, etc. We can plot the parties in a bar chart by using plot(parliamentMembers$party). R realises that their party is a factor and you want to do a count across it in a bar chart format (by default). If you pick a different variable such as the continuous variable number of years as MP, the figures are shown on an indexed plot. Pick 2 continuous variables, such as number of years as MP vs. yearly expenses - plot(parliamentMembers$numOfYearsMP, parliamentMembers$yearlyExpenses) and the result is a scatter plot (each variable holds an axis).

Are these variables related? To make relationships clearer, we can use the log() function – plot(log(parliamentMembers$numOfYearsMP), log(parliamentMembers$yearlyExpenses)). For 2 categorical variables, R handles it differently again and creates a stacked bar chart, giving the user an idea of the proportion between the 2 variables. Notice how the type of variable you use impacts the type of chart displayed. You will find that the first element of plot() forms the x axis and the second element the y axis. For 3 variables, it gets better still – here I used plot(parliamentMembers[c("numOfYearsMP","votesReceived","yearlyExpenses")]). As you can see below, R plots the variables against one another in 9 different plots!

 

clip_image001 

 

The second function to look at is hist(), which gives us a way to view the distribution of data. We can use this in a similar format to plot by specifying the syntax hist(parliamentMembers$numOfYearsMP). By default, R uses an algorithm to work out the number of bins to split the data into based on the data set. To create a more detailed representation, we can increase the bins by using the breaks argument - hist(parliamentMembers$numOfYearsMP, breaks = 10).

There are of course, many other functions we can use to create graphics, the most popular being barplot(), boxplot(), pie(), and pairs().

 

Customising the Layout

To make our plots more informative, we need to add a number of arguments to the plot() function. For instance, the following R script creates the plot below:

 

plot(
    parliamentMembers$startingYearMP, 
    parliamentMembers$votesReceived, 
    xlab = "Starting year as MP", 
    ylab = "Votes received", 
    main = "Starting Year of MP vs Votes Received", 
    col = "orange",
    col.main = "black",
    cex.axis = 0.8,
    pch = 19
)

# xlab = horizontal axis label
# ylab = vertical axis label
# main = plot title
# col = colour of the plotted points
# col.main = colour of the title
# cex.axis = ratio of font size on axis tick marks
# pch = plot symbol (26 standard symbols, 0-25)

clip_image001[7]

 

To inspect and control these graphical parameters, we can use the par() function. To get a complete list of options, use ?par to bring up the R documentation. By default, the parameters are set per plot, but to specify session-wide parameters, just use par(col = "blue").

 

Multiple Plots

The next natural step for plots is to incorporate multiple plots – either side by side or on top of one another.

To build a grid of plots and compare correlations, we can use the mfrow parameter, e.g. par(mfrow = c(2, 2)) – passing the vector c(2, 2) to par() builds 4 subplots on a 2-by-2 grid. Now, when you create plots they are not replaced each time but are added to the grid one by one. Plots are added in a row-wise fashion; to fill column-wise, use mfcol instead. To reset the graphical parameters so that R plots once again to a single figure per layout, pass in c(1, 1). A quick example is below.
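For instance, reusing the parliamentMembers columns mentioned earlier (the data frame itself is assumed to exist):

# Four plots on a 2-by-2 grid, filled row by row
par(mfrow = c(2, 2))
plot(parliamentMembers$party)
plot(parliamentMembers$numOfYearsMP)
hist(parliamentMembers$numOfYearsMP)
plot(parliamentMembers$numOfYearsMP, parliamentMembers$yearlyExpenses)
par(mfrow = c(1, 1))   # back to a single plot per figure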

Apart from these 2 parameters, we can also use the layout() function that allows us to specify more complex arrangements. This takes in a matrix, where you specify the location of the figures. For instance:

 

grid <- matrix(c(1,1,2,3), nrow = 2, ncol = 2, byrow = TRUE)

grid
     [,1] [,2]
[1,]    1    1
[2,]    2    3

layout(grid)
plot(dataFrame$country, dataFrame$sales)
plot(dataFrame$time, dataFrame$sales)
plot(dataFrame$businessFunction, dataFrame$sales)

# Plot 1 stretches across the entire width of the figure
# Plot 2 and 3 sit underneath this and are half the size

 

To reset the layout we can use layout(1) or use the mfcol / mfrow parameters again. One clever trick to save time is to store the default parameters in a variable such as old_par <- par() and then reset once done using par(old_par).

Apart from showing graphics next to one another, we can also stack them on top of one another in layers. There are functions such as abline(), text(), lines(), and segments() to add depth to a graphic. Using lm() we can create an object containing the coefficients of the line representing the linear fit between 2 variables, for instance movies_lm <- lm(movies$rating ~ movies$sales). We can then add it to an existing plot such as plot(movies$sales, movies$rating) by using the abline() (adds the line) and coef() (extracts the coefficients) functions, for instance abline(coef(movies_lm)). To add some informative text we can use xco <- 7e5; yco <- 7 and text(xco, yco, labels = "More votes? Higher rating!") to generate the following (a consolidated snippet follows the chart):

 

clip_image001[1]
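Putting those calls together in one place (assuming a movies data frame with sales and rating columns as described above):

# Scatter plot with a linear fit line and an annotation layered on top
movies_lm <- lm(movies$rating ~ movies$sales)   # rating modelled against sales

plot(movies$sales, movies$rating)
abline(coef(movies_lm))                         # add the fitted line

xco <- 7e5
yco <- 7
text(xco, yco, labels = "More votes? Higher rating!")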

 

Interested in testing your knowledge? Check out the DataCamp exercises here, here and here.

 

Summary

This pretty much concludes the crash course in R. We’ve looked at the basics, vectors, matrices, factors, lists, data frames and graphics – and in each case looked at arithmetic and subsetting functions.

Next I’ll be looking at some programming techniques using R, reading/writing to data sources and some use cases – to demonstrate the power of R using some slightly more interesting data sets such as the crime statistics in San Francisco.