Tristan Robinson

Tristan Robinson's Blog

Converting Data Factory Azure Function Activities from Web to Native

Microsoft have recently added support to call Azure Functions natively within Data Factory. Prior to this, you only had one option if you wanted to leverage the serverless compute – a web activity. While this certainly did a job, it wasn't ideal as it exposed the function/host key within the pipeline code and the HTTP message used to authenticate with the functions. As such, I expect quite a few people's pipelines will be in a similar state to my own. I have recently been through the process of upgrading them in line with the new native ADF functionality and, as part of this, I thought it would be worth sharing the process I went through so you don't suffer the same pains!

image


Linked Service

The first step in this process is to add a new linked service into your data factory. When you do, you'll probably end up doing what always seems to happen when first searching for an Azure Function connector – typing into the search prompt in front of you. This is filtered to data connectors by default, so the connector won't appear; you'll need to switch the tab at the top to Compute, at which point you'll see it. The next step is to add the URL to your functions endpoint, which will be in the format https://functionappname.azurewebsites.net. You'll then need to supply a key to the function app for authentication purposes. I strongly advise you to set up an Azure Key Vault, store the key in there, and then access the key in ADF via a secret. This is simple to do, promotes best practice from a security perspective, and is also useful for CI/CD purposes. The key can be either a function key (specific to a single function) OR a host key (access to all functions). Most of the tutorials on the web, as well as the official MS content, currently reference the function key, so I was slightly worried at first that you could only call functions via the function key. This is not the case – the host key can be used too. If you're not too familiar with functions, the keys can be found within the application settings of your function app.

image
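For reference, the JSON definition behind this linked service ends up looking something along the following lines – a sketch only, with the linked service and secret names being hypothetical:

{
    "name": "AzureFunctionLinkedService",
    "properties": {
        "type": "AzureFunction",
        "typeProperties": {
            "functionAppUrl": "https://functionappname.azurewebsites.net",
            "functionKey": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "KeyVaultLinkedService",
                    "type": "LinkedServiceReference"
                },
                "secretName": "FunctionHostKey"
            }
        }
    }
}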


Pipelines

The next step in the process is replacing those web activities with native Azure Function activities. While this is a fairly simple process, you'll no longer be calling a GET on an HTTP webhook directly via a URL string with parameters appended. Instead, you specify the function you need to call and pass in the parameters via a JSON body as part of a POST to the endpoint. For any functions not requiring parameters, you'll need to use a GET rather than a POST with an empty body, as ADF has issues validating a body without any content.

To pass the parameters into a body, you’ll need to wrap them in JSON with the following format. I’ve provided examples of strings, integers, and expressions but hopefully you get the point here.

image
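As a rough sketch of the format (the parameter names here are purely illustrative), a body mixing literal strings, integers and ADF expressions might look like this:

{
    "fileName": "sales.csv",
    "batchSize": 10,
    "runDate": "@{formatDateTime(pipeline().TriggerTime, 'yyyy-MM-dd')}",
    "parentPipeline": "@{pipeline().Pipeline}"
}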


Function Code

As part of this process, you’ll also need to modify your C# code within your function apps. Previous to this upgrade, you may be handling your functions through the HttpResponseMessage object (see below for an example – I’ve highlighted in red the code snippets that relate to the subject matter). Your web activity is happy to accept a response from this class in Data Factory!

Now that we are using the native function activity, Data Factory will no longer accept an HTTP message as a response and instead wants a JSON object (JObject). As a result, our HttpResponseMessage needs to be replaced by an IActionResult object (see below for an example of the changes). The change is relatively trivial – you only need to decide on the type of result object to create instead of creating a response message.

To use this object, you’ll need to import the Microsoft.AspNetCore.Mvc libraries through NuGet. At this point, I ran into an issue since the dependencies required by Microsoft.AspNetCore.Mvc conflicted with my existing libraries – to such an extent where separate NuGet packages wanted my Newtonsoft.Json library to be both equal to version 9.01 and greater than version 11.0.1. To solve this issue, I imported separate parts of the main library – in this instance it was Microsoft.AspNetCore.Mvc.Abstractions and Microsoft.AspNetCore.Mvc.Core. These 2 libraries did not have the same constraints as the entire framework library.

image

image


While the above code changes allowed me to send/receive a JObject, in most scenarios I also needed to POST parameters into the function. In this case, I also needed to modify the code to read them (see highlighted code block below). The content is now treated as a stream, which then needs to be deserialised from JSON into an object. The parameters can then be read by treating the object as a list.

image


Errors

If you’ve found this blog as part of a Google search, you’ll more likely than not of hit the following generic error { "errorCode": "3600", "message": "Error calling the endpoint.", "failureType": "UserError", "target": "Activate New Parent Pipeline" }. This not a particularly helpful error as it basically tells you, you're not able to receive a correctly formatted response. More often than not, this will be due to an issue in your function code, than something in ADF. I spent a while thinking it was an authentication error when it was not! You may receive a slightly more verbose error if you check the function app logs through something such as Azure Insights – again, I would strongly recommend you set this up as part of your Azure functions service as it’s not ideal to rely on ADF to debug this. Alternatively, I would also suggest setting up Postman to debug the issues locally to have a bit more control over the variables at runtime through the Locals output.

image


Conclusion

Hopefully you will find this guide useful if you're about to go through a similar process yourself. It might also be useful for anyone setting up functions in ADF for the first time, as it covers most of what's required for the two services to talk to one another.

PySpark DataFrame Transformation Chaining

After working with Databricks and PySpark for a while now, it's clear there needs to be as much best practice defined upfront as possible when coding notebooks. While you can get away with quite a bit when writing SQL – which is all too familiar to most of us now – the transition into other languages (from a BI background) requires a bit more thought in terms of coding standards. In this short blog, I will talk about multiple DataFrame transformations and what I believe to be the best way to go about structuring your code for them.

More often than not, if you have not come from a programming background or are unfamiliar with a language, you will end up writing spaghetti code – essentially code that reads a bit like a book, with no real structure to it, causing debugging/maintenance issues. We've all been there, especially at the start of a project, just wanting to get some output – but it's worth trying to bear in mind some coding rules while designing, to save the refactoring a few weeks down the line.

Jumping into the subject matter of the blog, this came about because I noticed some of our project code re-assigned to a new DataFrame on each transformation. While functionally it was acceptable, it didn't feel like the right way to go about things. The same can be said for re-using the same DataFrame (although this caused other dependency issues). It also left us with huge chunks of code which were only readable by using the comments to see where some transformations stopped and others started.

After a small bit of research I discovered the concept of monkey patching (modifying or extending a class at runtime) the DataFrame object to include a transform function. This function is missing from PySpark but already exists in the Scala Dataset API.

The following code can be used to achieve this, and can be stored in a generic wrapper functions notebook to separate it out from your main code. This can then be called to import the functions whenever you need them.

 

# Monkey patch the DataFrame object with a transform method
from pyspark.sql.dataframe import DataFrame

def transform(self, f):
    return f(self)

DataFrame.transform = transform
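To bring the patch into your main notebook, you can then reference the wrapper notebook with %run (the path below is hypothetical):

# In the main notebook – imports the transform() patch defined in the wrapper notebook
%run /Shared/WrapperFunctions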

 

As part of your main code, we can then wrap the transformations into functions, each taking and returning a DataFrame. A separate statement can then call transform on the original DataFrame, chaining the list of functions (transformations) you want to apply. By using this method, the code is almost self-documenting, as it's clear what transformations you've applied to move a DataFrame from one context into another. The example below only includes two transformations, but imagine you have 20+ to implement – this makes the code much easier to read and debug should there be an issue.

 

# Note: ConvertDateTimeToIntDate (a UDF) and the refHandStructure DataFrame are assumed to be defined elsewhere in the notebook.
from pyspark.sql.functions import col, when, to_utc_timestamp

# Create started/finished timestamps and int dates in UTC as new columns in the DataFrame.
def append_utc_dates(df):
  df = df.withColumn("finishedTimestampUTC", to_utc_timestamp(col("finished"), "EST")) \
         .withColumn("startedTimestampUTC", to_utc_timestamp(col("started"), "EST")) \
         .withColumn("startedDateUTC", ConvertDateTimeToIntDate(col("startedTimestampUTC"))) \
         .withColumn("finishedDateUTC", ConvertDateTimeToIntDate(col("finishedTimestampUTC")))
  return df

# Adds a new attribute based on hand structure to the DataFrame
def append_BB(df):
  df = df.join(refHandStructure, df["structure"] == refHandStructure["structure"], "left") \
         .select(df["*"], refHandStructure["handStructureName"]) \
         .withColumn("BB", when(col("handStructureName") == "Limit", col("x")).otherwise(col("y")))
  df = df.drop("handStructureName")
  return df

# Perform hand transformations
dfTransformed = dfRaw.transform(append_utc_dates) \
                     .transform(append_BB)
      
display(dfTransformed)

 

You should then use new DataFrames for changes in grain or changes in purpose (not just for each transformation).

This technique probably shouldn’t be used for some of the more basic transformations, especially if you only have 1 or 2, but more so when you have 50/100+ lines of code.  Not everything needs to be wrapped into functions, but it certainly reads better for larger notebooks!

Implementing Enterprise Security in Azure Databricks - Part 2

Following on from my last blog on the topic of security within Azure Databricks, which concentrated on the implementation model for data processing for a platform, the following blog concentrates on the alternative – data processing for users.

 

Data Processing for Users

By this, I mean data-related activities that a user is performing interactively, for instance data analysis from Data Lake.

With the addition of Databricks runtime 5.1, which was released in December 2018, comes the ability to use Azure AD credential pass-through. This is a huge step forward since there is no longer a need to control user permissions through Databricks groups / Bash and then assign these groups access to secrets to reach Data Lake at runtime. As mentioned previously, with the current lack of support for AAD within Databricks, ACL activities were done on an individual basis, which was not ideal. By using this feature, you can now pass the authentication on to Data Lake, and as we know, one of the advantages of Data Lake is its tight integration into Active Directory, so this simplifies things. It's worth noting that this feature is currently in public preview, but having tested it thoroughly, I am happy with the implementation/limitations. The feature also requires a premium workspace and only works with high concurrency clusters – both of which you'd expect to use in this scenario.

The other good thing is that it's incredibly easy to enable this functionality, as it is controlled through the cluster configuration process. To enable it, navigate to the cluster configuration page, select runtime 5.1 or higher, and expand the advanced options. At this point, you will see a tick box which needs to be checked (see below). This will add another line to your Spark config. It's actually good to see it implemented in this way – it helps stick to the Microsoft mantra of keeping things simple.

 

image
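For reference, the line that checking the box adds to the cluster's Spark config should look something like the following (worth confirming against your own cluster while the feature is in preview):

spark.databricks.passthrough.enabled true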

 

Once enabled, only connections into Data Lake via adl:// paths are valid – any existing connections through the DBFS mount, or through Databricks databases whose unmanaged tables route via the DBFS, will stop working. This is a current limitation and may be fixed at GA (although technically you could re-build the tables using the adl:// path if this was an issue).
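With the cluster running in this mode, reading from the lake is then simply a case of referencing the adl:// path directly – a minimal sketch, with the lake name and path being hypothetical:

# No service principal or spark.conf setup required – the user's AAD identity is passed through to Data Lake
df = spark.read.csv("adl://myadls.azuredatalakestore.net/RAW/Sales/", header=True)
display(df)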

Great – so now my ACL can be controlled entirely within Data Lake without the need for Service Principals! But there’s more..

I touched on this in my previous blog, but as part of the access implementation for Data Lake, it is preferable to define Azure Active Directory (AAD) groups to provide further flexibility moving forward. By this I mean creating AAD groups and assigning them to Data Lake, so as to create a level of abstraction away from users/user AD groups/service principals, meaning modifications will not need to be made to Data Lake permissions in the future – only to the AAD groups that are already assigned. From experience, by not going down this route, any additional user permissions that need applying in future have to be applied across all folders/files, which, depending on the size of the data lake, can take a particularly long time and be awkward to add. Therefore this needs to be done upfront as part of the design!

I would suggest the following conventions for this group setup, an example being AAD_PLATFORM_ADL_ENV_RAW_READER.

  • AAD – to separate out AD/AAD once sync is implemented.
  • PLATFORM – the platform or project/department this group is used by.
  • ADL – the resource on the platform that the group is used by.
  • ENV – the environment on which the resource resides (prod/non-prod).
  • RAW – the layer within lake the permissions will be applied to.
  • READER – the permission the group will have.

You would then need to create the following AAD groups to cover all areas across the Data Lake. This assumes using our standard Data Lake layer/folder pattern first introduced by Ust’s blog back in 2016.

  • AAD_PLATFORM_ADL_PROD_RAW_READER
  • AAD_PLATFORM_ADL_PROD_RAW_WRITER
  • AAD_PLATFORM_ADL_PROD_BASE_READER
  • AAD_PLATFORM_ADL_PROD_BASE_WRITER
  • AAD_PLATFORM_ADL_PROD_ENRICHED_READER
  • AAD_PLATFORM_ADL_PROD_ENRICHED_WRITER
  • AAD_PLATFORM_ADL_PROD_CURATED_READER
  • AAD_PLATFORM_ADL_PROD_CURATED_WRITER
  • AAD_PLATFORM_ADL_PROD_LABORATORY
  • AAD_PLATFORM_ADL_PROD_LIBRARY_READER
  • AAD_PLATFORM_ADL_PROD_ADMIN

The permissions applied to these groups can then be implemented using the following matrix. When adding these permissions, they need to be added to the current folder and all children, and added as both an access permission entry and a default permission entry. Without this, any new content added to Data Lake in the future will not inherit these permissions.

 

image

 

Once this is complete, you would add the users/AD user groups/service principals into these AAD groups to provide access, and the pass-through permissions will work as expected.

This method separates out both read/write access and the data lake layers, meaning that unless access is specifically granted, there will be a much more limited footprint in terms of access permissions into Data Lake. Using a combination of this and the AD credential pass-through from Databricks provides a suitable solution for implementing security with the Databricks/Data Lake combo.

Implementing Enterprise Security in Azure Databricks - Part 1

In recent weeks, I’ve spent a fair chunk of time working with different aspects of Databricks and as part of this, one topic that consumed a proportion of that time is the security and authentication aspects of the service. Our scenario was one that I expect most people will come across over the next few years, essentially the integration of Databricks with Azure Data Lake for data processing. In the following blogs (yup I need 2) I intend to document the way we went about implementing the security model, breaking it up into data processing for a platform and data processing for users – each has a slightly different implementation.

 

Data Processing for Platform

By this, I mean data processing activities that happen as part of a service – orchestrated through something such as Azure Data Factory – on a schedule or in real time.

I expect the very first thing most people will do once they have a Databricks workspace is to mount their Data Lake. It's by far the easiest option for getting your hands on your data, and one that's still worth using to start off with, as long as you understand the consequences. If you need to do this, I will point you in the direction of Hugh's blog post covering the topic, which I've used a number of times now to remember all the different secrets/GUIDs you need. By mounting the Data Lake in your Databricks workspace, you are able to interact with the data as if it were in the DBFS (the Databricks local file system). This proves incredibly powerful for tools such as dbutils, which allow you to perform a vast number of file system operations. Unfortunately the one downside to this method is that anyone with access to the workspace is able to directly interact with all layers in your Data Lake – not ideal. Since Data Lakes are not treated in the same way as something like an Azure SQL DB, you also won't have those handy recovery options if something happens to your data because of this unrestricted access. This obviously poses a risk for an enterprise solution. It's worth noting that the mount is done once, so your clusters can go down and be spun up again by other users and they would still be able to access the mount point until it is unmounted.

With that in mind, it is far better to implement the access to Data Lake directly at runtime using the Spark API. The config for this is very similar to the mounting, except the Spark config is set at runtime, and therefore if this is not included you will be unable to access the Data Lake. This avoids the issues mentioned above where anyone can access the data just because they can access the workspace. The following commands can be used to set this up at runtime.

 

# Get Secrets from Databricks
ClientId = dbutils.secrets.get(scope = "ScopeX", key = "ClientId")
ClientSecret = dbutils.secrets.get(scope = "ScopeX", key = "ClientSecret")
DirectoryId = dbutils.secrets.get(scope = "ScopeX", key = "DirectoryId")

# Apply variables to spark config
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", ClientId)
spark.conf.set("dfs.adls.oauth2.credential", ClientSecret)
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/" + DirectoryId + "/oauth2/token")

 

You’ll notice that as part of this I’m retrieving the secrets/GUIDS I need for the connection from somewhere else – namely the Databricks-backed secrets store. This avoids exposing those secrets in plain text in your notebook – again this would not be ideal. The secret access is then based on an ACL (access control list) so I can only connect to Data Lake if I’m granted access into the secrets. While it is also possible to connect Databricks up to the Azure Key Vault and use this for secrets store instead, when I tried to configure this I was denied based on permissions. After research I was unable to overcome the issue. This would be more ideal to use but unfortunately there is limited support currently and the fact the error message contained spelling mistakes suggests to me the functionality is not yet mature.

To configure the Databricks-backed secrets, the easiest method is to use an Azure Cloud Shell (Bash) console and go in via the Databricks CLI. To access the console, within the Azure portal you'll notice an icon similar to the one below as part of the top ribbon.

 

image

 

Clicking this will then prompt you to start either a PowerShell or Bash console – which will look similar to below.

 

image

 

The connection into the Databricks CLI can be setup as per the following commands.

 

virtualenv -p /usr/bin/python2.7 databrickscli
source databrickscli/bin/activate
pip install databricks-cli
databricks configure --token 

 

At this point, you’ll need to provide it both your databricks host – something similar to https://northeurope.azuredatabricks.net and a token. The token will need to be generated through your Databricks workspace – under User Settings / Access Tokens. This essentially lets you into the API without the need for a password. The important thing to mention at this point is that the token is a personal access token. While this doesn’t impact anything in particular with Azure Bash, its worth noting that the token is created under your Databricks account, and therefore using this token for something such as a linked service into ADF then will use your account to process the connection authenticating as you. Hopefully over time, this will be matured and the use of Managed Identity's or Service Principals directly connected into the platform will be possible. As you might of guessed then, you will need to make sure the account that generates the token is then used within the ACL to read the secret, otherwise at processing time – ADF will not be able to read through Databricks into Lake.

The configuration of the secrets is then required and can be done using the following commands. Simply replace [client id], [secret value], [directory id] with the necessary values from your service principal.

 

databricks secrets create-scope --scope ScopeX

databricks secrets put --scope ScopeX --key ClientId --string-value [client id]
databricks secrets put --scope ScopeX --key ClientSecret --string-value [secret value]
databricks secrets put --scope ScopeX --key DirectoryId --string-value [directory id]

databricks secrets put-acl --scope ScopeX --principal admins --permission READ

 

I then granted read access on these secrets to the admins group – this just keeps things simple, but you can obviously grant it to individual users as well, or to other Databricks groups. One of my hopes for 2019 is that the platform is integrated better into AAD moving forwards, as everyone currently needs to be named individually. You would then need to manage permissions through the groups, which can only be done via the CLI.

Once this is complete, you can now interact with your Data Lake and authenticate at runtime. It's also worth mentioning that this interaction changes the connection string from dbfs:// to adl://, which may have a knock-on effect if you use certain tools such as dbutils to do things within your code. This is also important to know since, with Databricks runtime 5.1 and AD credential pass-through, you will be unable to access anything other than Data Lake file systems. I'll explain this further in my next blog.
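In practice, that means references like the first line below need to become the second – a sketch with hypothetical paths:

# Mounted route – resolves via the DBFS mount point
df = spark.read.parquet("/mnt/datalake/RAW/Sales/")

# Direct route – only works once the spark.conf settings above have been applied at runtime
df = spark.read.parquet("adl://myadls.azuredatalakestore.net/RAW/Sales/")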

 

Conclusion

Hopefully this will prove useful to anyone venturing onto the platform and provide a basis on which to implement a security model. If you wanted to go a step further, you may also want to implement service principal access to Data Lake on a number of levels – both across folders AND read/write. This would add slightly more complexity to the solution but provide an even more secure method, avoiding the scenario where accounts can access the entire Data Lake. In my next blog, I will look at it from a user perspective, which takes on a slightly different implementation.

Power BI Composite Models and Aggregations

So something which feels like it's gone under the radar a bit is the addition of composite models and aggregations in Power BI. These have been around for a couple of months now in preview, but I've not seen much buzz around the features. After recently attending PASS Summit and seeing Christian Wade's 1 trillion row demo, I thought it was worth blogging about – especially as I expect parts of this functionality to be expanded to Azure AS in the near future.

If you think about it, the majority of BI query’s are done at some form of aggregation level but users will still want the detail, and so these features essentially unlock the ability to report against giant datasets at both an aggregated and granular level at the same time – something that was not physically possible beforehand. Power BI is now able to work with multiple petabytes of data with ease over trillions of rows with pretty much instant response times – so this is a bit of a game changer.

 

Composite Models

Previously with Power BI, you were restricted to either DirectQuery or Import for a single data source. With Composite models that has all changed. A report can now include more than one data connection in any combination, so you are able to use both DirectQuery AND Import in the same model and combine the data. This opens up a world of possibilities that were not available before. With this functionality we also get the ability to create many-to-many relationships, but I won't be going into detail on that in this blog. As part of this also comes some functionality called Storage Mode, which unlocks table-level storage features. This is the next part of the jigsaw.

 

Storage Mode

This allows you to specify whether tables are either imported or queried on the fly. There is now a third option, "Dual", which acts as either a cached or non-cached table depending on the context of the query that's submitted at runtime.

Setting the correct storage mode has many advantages such as:

  • Better query performance (no need to push real time queries to the dataset for relatively static tables such as those used for filters).
  • Ability to use larger datasets (interactive analysis is better for datasets you don't want to cache into memory such as tables with significant data volume).
  • Data refresh optimisation (only cache data that's necessary, meaning quicker refresh times).
  • Reducing latency in real-time data (no need to re-read those static tables each time).

To access the storage mode, you either select the table in the report pane and click Properties, or navigate to the table in the new Modelling View (this needs to be enabled in Options as it's in preview).

Changing a table's storage mode will then prompt you to change the related tables to Dual. This propagation logic is designed to help with models that contain many tables!

image

 

Modelling View

Before I get to aggregations, I also need to introduce the new modelling view. This will be changing to fall in line with what you may be familiar with on SSAS / Azure AS. It feels like this is one of the first steps to integrate SSAS further into the Power BI spectrum and facilitate enterprise datasets. By updating the modelling view, you now have the ability to create additional diagrams, thus allowing you to break out particularly complex models by subject area rather than trying to join it all up within one model. For instance, if you are working with 5 tables in your model, this isn't a particularly big deal – now multiply that by 10 and all of a sudden it becomes a pain to manage, and this new feature will help alleviate that. The feature also allows you to multi select objects in one go and update properties through a new side-pane. An example of this new feature can be seen below.

Modeling View

 

Aggregations

Finally, the most exciting feature of the lot – aggregations. Without the new modelling view, the composite models, and the new storage modes, this would not be possible. Aggregations allow us to create a newly defined table within the model, with the fields of the original table it is created from aggregated/grouped in some way or another. These aggregations/groupings include count, group by, max, min, sum, and count rows. The aggregation table can be set to either Import mode (with or without incremental refresh) or DirectQuery, optimised by using columnstore indexes. This then unlocks faster query performance over huge datasets via the cache at the aggregated level, using a fraction of the resource compared to the detailed level.

You then have the flexibility to set the aggregate table to Import while leaving the granular table as DirectQuery, which will speed up performance when navigating the report. The aggregate table can also be hidden so that users will not even be aware of the implementation.

You can also have more than one aggregation table – potentially one for a particular set of reports and then a lower-grain aggregate table for analysts – and this can be done through the precedence option. The tool will then query the aggregation table with the highest precedence level first to see if it can resolve the runtime query before moving down the levels.

Aggregations don’t just work for measures such as SUM or COUNT. It also works for more complex measures - all of the components of a measure are always folded down to the sum, min, max, count, level and then those sub query’s work out whether they can hit the cache or not.  Its also worth mentioning that you can check if its hit the cache via the DAX editor.

image

 

Conclusion

For me, all the features mentioned above will be a bit of a game changer in the right scenario. I've run into scaling problems before, and re-working the logic at the warehouse level for both granular and aggregated datasets certainly added some overhead. While the functionality obviously doesn't work with Live Connection into SSAS cubes, it opens up options for using Power BI to do some of the activities which historically would only have been done in SSAS. As Christian Wade also alluded to at PASS, Power BI will soon become a superset of SSAS, and so we may well be doing less and less with SSAS and more with Power BI in this area as time goes by.

Power BI Dataflows – Bringing self-service ETL to the Business User

Announced at the recent PASS Summit, Power BI Dataflows has now gone into public preview. Previously, any ETL that users applied via Power Query within Power BI Desktop was only applied to their own dataset. With Dataflows, ETL processes are now classified as a first-class citizen, and this update provides centralised self-service data prep as part of the Power BI product, so that business users can expose cleaned/schematised entities across the business in a similar way to a Data Warehouse. No longer will the same lightweight ETL processes be occurring across multiple users for the same dataset, and often in different ways. Traditionally, self-service BI is limited to analytic models and reports/dashboards, but with Dataflows that is changing as it shifts down a peg. In this blog, I'll take a look at the new functionality and give my thoughts.

Before we begin, it's worth noting that this is an entirely different piece of functionality to the similarly named Data Flows in Data Factory, which is in private preview. You can tell the difference by the capitalisation of the word "Flow" for Data Factory. I'm not sure what the marketing department would say about this, but it is slightly confusing having both released as news at the same time.

 

What is a Dataflow?

A dataflow is a collection of entities (which are similar to tables) that are created and managed from within the Power BI Service (powerbi.com). This is then stored under the Common Data Model (CDM) in an Azure Data Lake Storage (Gen2) resource as files and folders – no longer as the Tabular Model that we are used to with datasets. Dataflows are used to ingest, cleanse, transform, enrich and schematise/model the data, in a similar way to the ETL used to create a Data Warehouse. Once created, you can use Power BI Desktop to create datasets based on these entities in the usual manner. While dataflows can be created and edited by both Pro and Premium users, there is a larger set of functionality only available via Premium. This includes creating computed entities (calculation aggregates), linking to existing entities in other dataflows, and incremental refreshes.

 

Creating a Dataflow

To go about creating a dataflow, navigate to an App Workspace on the Power BI Service (note that this cannot be My Workspace for obvious reasons; also note this cannot be done in Power BI Desktop). You'll notice alongside Dashboards, Reports, Workbooks, and Datasets that there is a new artefact labelled Dataflows (Preview). If you have Premium capacity, this will need to be enabled via the Admin Portal capacity settings. Click + Create and select Dataflow. Each dataflow has only one owner and only the owner can edit it. To create a dataflow, you can either define new entities – which involves connecting to a data source and then mapping to Common Data Model entities or defining custom entities – OR link to other existing dataflows (see below).

image

 

If you are creating new entities, you are provided with a set of familiar data source connectors (see below). This set is not as complete as that in Power BI Desktop, but I expect more to appear in time. I suspect there is also more work to do here because some of the connectors shown below – Excel, for example – ask for a file path or URL without giving any functionality to browse for it, something we have come to expect as part of Power BI Desktop.

image

 

Once you have loaded your data, you will be presented with a fairly familiar query editor pane (see below). The first thing you will notice is that the options to manipulate and transform your data are limited compared to Power Query in Power BI Desktop. Again, I suspect this will mature when the functionality hits GA, but for the time being it gives some reasonable functionality to do what you need to create entities, in the form of column/row transformations. You are also provided with an option to "map to standard", which allows you to join into the Common Data Model (CDM) – essentially some generic templates for entities such as Account/Product/Customer/etc. While I can see how this could be useful for an average user, I think it would still be better to create your own based around your business's own definition of an Account/Product/Customer/etc. These are then stored as custom entities within the CDM. Once the entities are transformed/schematised how you want them, you save them to the Power BI Service with a name/description. Under the hood, this is then written to a data lake as a series of files along with a JSON schema.

image

 

Dataflows are then available via Power BI Desktop using Get Data (see below). I believe you will need the November 2018 release or later to see this option. The dataset is then treated in a similar way to everything else you would have seen previously within Power BI Desktop. When selected, you should be able to see all of the workspaces within your BI tenant that you have access to which contain dataflows. Within each dataflow you can see each entity.

That’s really it, they’ve kept it simple on purpose and abstract from what’s going on under the hood.

image

 

Wait, but what does this mean for the classic Enterprise Data Warehouse?

I certainly think from a business perspective it will blur the lines slightly between using traditional ETL processes to create a DW and going down this route, more so at smaller scales. It would be hard pressed to replace an enterprise-level solution, but there is definitely a use case somewhere around departmental level, instead of creating a data mart. It will allow users to leverage their data faster than waiting on IT to build out a solution.

It really depends on the complexity and maturity of the platform they are intending to develop and its purpose. The fact it sits on ADLS Gen2 is a positive and allows other applications to access the data at a raw level rather than relying on a specific connector. It also allows users of products such as PowerApps to connect into a cleaned CDM – another plus.

On the other side, it's worth bearing in mind that you will need the Premium version to facilitate incremental refresh, which is required by the majority of data warehouse solutions. The lowest cost of Power BI Premium is that of a single P1 node, which at the time of writing costs £3,766/month or £45k/pa – not small by any standards. This is worth taking into consideration.

The other fact is that traditional DWHs deal with concepts such as SCD Type II and creating history, which Dataflows will not be able to facilitate. You may also have noticed there was no mention of DAX or measures, which still need to be done at the usual level within Power BI Desktop – not ideal for sharing KPIs across users. Lastly, there is currently no data lineage, although this has been rumoured to be part of the roadmap.

 

Conclusion

While this feature will be a game changer for a number of end users and projects, I think it will join the set of questions such as "Do I need SSAS or can I use Power BI for my models?" – I expect to see this appear at next year's conferences! The answer will always be – it depends!

As we have seen with some aspects of Power BI, this new functionality also opens up businesses to governance issues with potentially having users create their own version of the truth rather than something which has been thought out and modelled by someone with a certain skillset or authoritative source. This could cause contention between departments.

As noted above, I expect the Power BI team to continue to mature the functionality of the product – but for the time being, it certainly opens up another avenue to draw people into the toolset and provide another level of functionality for business and users alike.

Spark Streaming in Azure Databricks

Real-time stream processing is becoming more prevalent on modern data platforms, and with a myriad of processing technologies out there, where do you begin? Stream processing involves the consumption of messages from either a queue or files, doing some processing in the middle (querying, filtering, aggregation) and then forwarding the result on to a sink – all with minimal latency. This is in direct contrast to batch processing, which usually occurs on an hourly or daily basis. Often, both of these will need to be combined to create a new data set.

In terms of options for real-time stream processing on Azure you have the following:

  • Azure Stream Analytics
  • Spark Streaming / Storm on HDInsight
  • Spark Streaming on Databricks
  • Azure Functions

Stream Analytics is a simple PaaS offering. It connects easily into other Azure resources such as Event Hubs, IoT Hub, and Blob storage, and outputs to the range of resources that you'd expect. It has its own intuitive query language, with the added benefit of letting you create functions in JavaScript. Scaling can be achieved by partitions, and it has the windowing and late-arrival event support that you'd expect from a processing option. For most jobs, this service will be the quickest/easiest to implement, as long as its relatively small set of limitations doesn't get in the way of what you want to achieve. It's also worth noting that the service does not currently support Azure network security such as Virtual Networks or IP filtering – I suspect this may only be a matter of time given the preview of this in Event Hubs.

Both Spark Streaming on HDInsight and Databricks open up the options for configurability and are possibly more suited to an enterprise level data platform, allowing us to use languages such as Scala/Python or even Java for the processing in the middle. The use of these options also allows us to integrate Kafka (an open source alternative to EventHubs) as well as HDFS, and Data Lake as inputs. Scalability is determined by the cluster sizes and the support for other events mentioned above is also included. These options also give us the flexibility for the future, and allow us to adapt moving forward depending on evolving technologies. They also come with the benefit of Azure network security support so we can peer our clusters onto a virtual network.

Lastly – I wouldn’t personally use this but we can also use Functions to achieve the same goal through C#/Node.js. This route however does not include support for those temporal/windowing/late arrival events since functions are serverless and act on a per execution basis.

In the following blog, I’ll be looking at Spark Streaming on Databricks (which is fast becoming my favourite research topic).

A good place to start is to understand the structured streaming model, which I've seen documented a few times now. Essentially, it treats the stream as an unbounded table, with new records from the stream being appended as new rows to the table. This allows us to treat both batch and streaming data as tables in a DataFrame, therefore allowing similar queries to be run across them.

 

image

 

At this point, it will be useful to include some code to help explain the process. Before beginning, it's worth mounting your data sink to your Databricks instance so you can reference it as if it were inside the DBFS (Databricks File System) – the mount is merely a pointer. For more info on this, refer to the Databricks documentation here. Only create a mount point if you want all users in the workspace to have access. If you wish to apply security, you will need to access the store directly (also documented in the same place) and then apply permissions to the notebook accordingly.
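For completeness, the mount being referred to looks something like the following – a sketch only, with the secret scope/keys and lake path being hypothetical (the OAuth config keys are the standard ones for Data Lake Store Gen1):

# Mount the data lake so it can be referenced as /mnt/datalake within the DBFS
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": dbutils.secrets.get(scope = "ScopeX", key = "ClientId"),
           "dfs.adls.oauth2.credential": dbutils.secrets.get(scope = "ScopeX", key = "ClientSecret"),
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/" + dbutils.secrets.get(scope = "ScopeX", key = "DirectoryId") + "/oauth2/token"}

dbutils.fs.mount(
  source = "adl://myadls.azuredatalakestore.net/",
  mount_point = "/mnt/datalake",
  extra_configs = configs)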

As the input for my stream was EventHubs, we can start by defining the read stream. You'll first need to add the Maven coordinate com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.2 to the cluster to make the EventHubs library available for the connection. Further options can be added for the consumer group, starting positions (for partitioning), timeouts and events per trigger. Positions can also be used to define starting and ending points in time so that the stream is not running continuously.

connectionString = "Endpoint=sb://{EVENTHUBNAMESPACE}.servicebus.windows.net/{EVENTHUBNAME};EntityPath={EVENTHUBNAME};SharedAccessKeyName={ACCESSKEYNAME};SharedAccessKey={ACCESSKEY}"

startingEventPosition = {
  "offset": "-1",         # start of stream
  "seqNo": -1,            # not in use
  "enqueuedTime": None,   # not in use
  "isInclusive": True
}

endingEventPosition = {
  "offset": None,                                             # not in use
  "seqNo": -1,                                                # not in use
  "enqueuedTime": dt.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ"), # point in time
  "isInclusive": True
}

ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
ehConf['eventhubs.startingPosition'] = json.dumps(startingEventPosition)
ehConf['eventhubs.endingPosition'] = json.dumps(endingEventPosition)

df = spark \
  .readStream \
  .format("eventhubs") \
  .options(**ehConf) \
  .load()

The streaming data that is output then follows the schema below – the body followed by a series of metadata about the streaming message.

 

image

 

It's important to note that the body comes out as a binary stream (this contains our message). We will need to cast the body to a string to deserialise the column into the JSON that we are expecting. This can be done by using some Spark SQL to turn the binary into a string, and then parsing the column into a StructType with a specified schema. If multiple records are coming through in the same message, you will need to explode out the result into separate records (see the sketch after the snippet below). Flattening out the nested columns is also useful, as long as the data frame is still manageable. Spark SQL provides some great functions here to make our life easy.

rawData = df. \
  selectExpr("cast(body as string) as json"). \
  select(from_json("json", Schema).alias("data")). \
  select("data.*")

While it's entirely possible to construct your schema manually, it's also worth noting that you can take a sample JSON file, read it into a data frame using spark.read.json(path) and then call printSchema() on top of it to return the inferred schema. This can then be used to create the StructType.
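For example, something along these lines (the sample file path is hypothetical) produces the inferred schema shown below:

# Infer the schema from a representative sample file
sampleDf = spark.read.json("/mnt/datalake/samples/sales_order_detail.json")
sampleDf.printSchema()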

# Inferred schema:

#   root
#    |-- LineTotal: string (nullable = true)
#    |-- OrderQty: string (nullable = true)
#    |-- ProductID: string (nullable = true)
#    |-- SalesOrderDetailID: string (nullable = true)
#    |-- SalesOrderID: string (nullable = true)
#    |-- UnitPrice: string (nullable = true)
#    |-- UnitPriceDiscount: string (nullable = true)
  
Schema = StructType([
    StructField('SalesOrderID', StringType(), False),
    StructField('SalesOrderDetailID', StringType(), False),
    StructField('OrderQty', StringType(), False),
    StructField('ProductID', StringType(), False),
    StructField('UnitPrice', StringType(), False),
    StructField('UnitPriceDiscount', StringType(), False),
    StructField('LineTotal', StringType(), False)
])

At this point, you have the data streaming into your data frame. To output to the console you can use display(rawData) to see the data visually, however this is only useful for debugging since the data is not actually going anywhere! To write the stream to somewhere such as the data lake you would then use the following code. The checkpoint location can be used to recover from failures when the stream is interrupted, and this is important if this code were to make it to a production environment. Should a cluster fail, the query can be restarted on a new cluster from a specific point and consistently recover, thus enabling exactly-once guarantees. This also means we can change the query, as long as the input source and output schema are the same, without directly interrupting the stream. Lastly, the trigger will check for new rows in the stream every 10 seconds.

rawData.writeStream \
    .format("json") \
    .outputMode("append") \
    .option("path", PATH) \
    .trigger(processingTime = "10 seconds") \
    .option("checkpointLocation", PATH) \
    .start()

Checking our data lake, you can now see the data has made its way over, broken up by the time intervals specified.

 

image

 

Hopefully this is useful for anyone getting going in the topic area. I'd advise sticking to Python given the extra capacity of the PySpark language over Scala, even though a lot of the Databricks documentation/tutorials use Scala. This was just something that felt more comfortable.

If you intend to do much in this area I would definitely suggest you use the PySpark SQL documentation which can be found here. This is pretty much a bible for all commands and I've been referencing it quite a bit. If this is not enough, there is also a cheat sheet available here. Again, very useful for reference when the language is still not ingrained.

Getting Started with Databricks Cluster Pricing

The use of databricks for data engineering or data analytics workloads is becoming more prevalent as the platform grows, and has made its way into most of our recent modern data architecture proposals – whether that be PaaS warehouses, or data science platforms.

To run any type of workload on the platform, you will need to set up a cluster to do the processing for you. While the Azure-based platform has made this relatively simple for development purposes – i.e. give it a name, select a runtime, select the type of VMs you want and away you go – for production workloads, a bit more thought needs to go into the configuration/cost. In the following blog I'll start by looking at the pricing in a bit more detail, which will aim to provide a cost element to the cluster configuration process.

For argument's sake, the work that we tend to deliver with Databricks is based on data engineering usage – spinning up resource for an allocated period to perform a task. This is therefore the focus of the following topic.

 

To get started in this area, I think it would be useful to include some definitions.

  • DBU – a databricks unit (unit of processing capability per hour billed on per minute usage)
  • Data Engineering Workload - a job that both starts and terminates the cluster which it runs on (via the job scheduler)
  • Data Analytics Workload – a non-automated workload, for example running a command manually within a Databricks notebook. Multiple users can share the cluster to perform interactive analysis.
  • Cluster – made up of instances of processing (VMs), consisting of a driver and workers. Workers can either be provisioned upfront, or autoscaled between a min/max number of workers.
  • Tier – either standard or premium. Premium includes role based access control, ODBC endpoint authentication, audit logs, Databricks Delta (unified data management system).

The billing for clusters primarily depends on the type of workload you initiate and the tier (or functionality) you require. As you might have guessed, data engineering workloads on the standard tier offer the best price.

I’ve taken the DS3 v2 instance (VM) pricing from the Azure Databricks pricing page.

image

 

The pricing can be broken down as follows:

  • Each instance is charged at £0.262/hour. So for example, the cost of a very simple cluster - 1 driver and 2 workers is £0.262/hour x 3 = £0.786/hour. The VM cost does not depend on the workload type/tier.
  • The DBU cost is then calculated at £0.196/hour. So for example, the cost of 3 nodes (as above) is £0.196/hour x 3 = £0.588/hour. This cost does change depending on workload type/tier.
  • The total cost is then £0.786/hour (VM Cost) + £0.588/hour (DBU Cost) = £1.374/hour. Also known as the pay as you go price. Discounts are then added accordingly for reserved processing power.

I thought this was worth simplifying since the pricing page doesn't make it abundantly clear with the way the table is laid out, and this is often overlooked. Due to the vast number of options you have to set up clusters, it's worth understanding this element to balance against time.

The DBU count is merely a reference to compare the processing power of the different VMs and is not directly included in the calculations. It's also worth mentioning that by default Databricks services are set up as premium and can be downgraded to standard only by contacting support. In some cases, this can bring massive cost savings depending upon the type of work you are doing on the platform, so please take this into account before spinning up clusters and don't just go with the default.

With regards to configuration, clusters can be set up either in High Concurrency mode (previously known as serverless) or as Standard. The high concurrency mode is optimised for concurrent workloads and is therefore more applicable to data analytics workloads and interactive notebooks which are used by multiple users simultaneously. This piece of configuration does not affect the pricing.

Using this cost model, we can then estimate that a basic batch ETL run with a driver and 8 worker nodes on relatively small DS3 instances would cost around £123.60/month, given a standard 1-hour daily ETL window. Hopefully this provides a very simple introduction to the pricing model used by Databricks.
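As a final sanity check of the figures above, the arithmetic can be scripted – a sketch using the list prices quoted earlier (small differences against the figures in this post come down to rounding):

# Pay-as-you-go cost model for a Databricks cluster (standard tier, data engineering workload)
vm_cost_per_hour = 0.262    # DS3 v2 instance price (GBP)
dbu_cost_per_hour = 0.196   # DBU price for this workload type/tier (GBP)

hourly_cost = 3 * (vm_cost_per_hour + dbu_cost_per_hour)        # 1 driver + 2 workers
monthly_cost = 9 * (vm_cost_per_hour + dbu_cost_per_hour) * 30  # 1 driver + 8 workers, 1 hour daily window

print(f"3 node cluster: £{hourly_cost:.3f}/hour")                    # ~£1.374/hour
print(f"9 node cluster, 30 hours/month: £{monthly_cost:.2f}/month")  # ~£123.66/month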

Databricks UDF Performance Comparisons

I’ve recently been spending quite a bit of time on the Azure Databricks platform, and while learning decided it was worth using it to experiment with some common data warehousing tasks in the form of data cleansing. As Databricks provides us with a platform to run a Spark environment on, it offers options to use cross-platform APIs that allow us to write code in Scala, Python, R, and SQL within the same notebook. As with most things in life, not everything is equal and there are potential differences in performance between them. In this blog, I will explain the tests I produced with the aim of outlining best practice for Databricks implementations for UDFs of this nature.

Scala is the native language for Spark – and without going into too much detail here, it will compile down to the JVM for processing faster. Python, on the other hand, provides a wrapper around the code, but in reality it is Scala telling the cluster what to do, with the objects being converted by Scala code. Converting these objects into a form Python can read is called serialisation/deserialisation, and it's expensive, especially over time and across a distributed dataset. The most expensive scenario occurs with UDFs (functions) – the runtime process for which can be seen below. The overhead here is in (4) and (5), to read the data and write it into JVM memory.

image

Using Scala to create the UDFs, the execution process can skip these steps and keep everything native. Scala UDFs operate within the JVM of the executor so we can skip serialisation and deserialisation.

image

 

Experiments

As part of my data for this task I took a list of company names from a data set and ran them through a process to codify them – essentially stripping out characters which cause them to be unique and converting them to upper case, thus grouping a set of companies together under the same name. For instance, Adatis, Adatis Ltd, and Adatis (Ltd) would all become ADATIS. This is an example of a typical cleansing activity when working with data sets. The dataset in question was around 2.5GB and contained 10.5m rows. The cluster I used was Databricks runtime 4.2 (Spark 2.3.1 / Scala 2.11) with Standard_DS2_v2 VMs for the driver/worker nodes (14GB memory), with autoscaling disabled and limited to 2 workers. I disabled autoscaling for this as I was seeing wildly inconsistent timings on each run, which impacted the tests. The good news is that with it enabled and using up to 8 workers, the timings were about 20% faster, albeit more erratic from a standard deviation point of view.

The following approaches were tested:

  • Scala program calls Scala UDF via Function
  • Scala program calls Scala UDF via SQL
  • Python program calls Scala UDF via SQL
  • Python program calls Python UDF via Function
  • Python program calls Python Vectorised UDF via Function
  • Python program uses SQL

While it was true in previous versions of Spark that there was a difference between these using Scala/Python, in the latest version of Spark (2.3) it is believed to be more of a level playing field by using Apache Arrow in the form of Vectorised Pandas UDFs within Python.

As part of the tests I also wanted to use Python to call a Scala UDF via a function but unfortunately we cannot do this without creating a Jar file of the Scala code and importing it separately. This would be done via SBT (build tool) using the following guide here. I considered this too much of an overhead for the purposes of the experiment.

The following code was then used as part of a Databricks notebook to define the tests. A custom function to time the write was required for Scala whereas Python allows us to use %timeit for a similar purpose.

 

Scala program calls Scala UDF via Function

// Scala program calls Scala UDF via Function

%scala

def codifyScalaUdf = udf((string: String) => string.toUpperCase.replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP",""))  

spark.udf.register("ScalaUdf", codifyScalaUdf)  

val transformedScalaDf = table("DataTable").select(codifyScalaUdf($"CompanyName").alias("CompanyName"))
val ssfTime = timeIt(transformedScalaDf.write.mode("overwrite").format("parquet").saveAsTable("SSF"))

 

Scala program calls Scala UDF via SQL

// Scala program calls Scala UDF via SQL

%scala

val sss = spark.sql("SELECT ScalaUdf(CompanyName) as a from DataTable where CompanyName is not null")
val sssTime = timeIt(sss.write.mode("overwrite").format("parquet").saveAsTable("SSS"))

 

Python program calls Scala UDF via SQL

# Python program calls Scala UDF via SQL

pss = spark.sql("SELECT ScalaUdf(CompanyName) as a from DataTable where CompanyName is not null")
%timeit -r 1 pss.write.format("parquet").saveAsTable("PSS", mode='overwrite') 

 

Python program calls Python UDF via Function

# Python program calls Python UDF via Function

from pyspark.sql.functions import *
from pyspark.sql.types import StringType

@udf(StringType())
def pythonCodifyUDF(string):
    return (string.upper().replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP",""))

# df is assumed to be the same source table used in the Scala tests above
df = spark.table("DataTable")

pyDF = df.select(pythonCodifyUDF(col("CompanyName")).alias("CompanyName")).filter(col("CompanyName").isNotNull())
%timeit -r 1 pyDF.write.format("parquet").saveAsTable("PPF", mode='overwrite')

 

Python program calls Python Vectorised UDF via Function

# Python program calls Python Vectorised UDF via Function

from pyspark.sql.types import StringType
from pyspark.sql.functions import pandas_udf, col

@pandas_udf(returnType=StringType())
def pythonCodifyVecUDF(string):
    return (string.replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP","")).str.upper()
  
pyVecDF = df.select(pythonCodifyVecUDF(col("CompanyName")).alias("CompanyName")).filter(col("CompanyName").isNotNull())
%timeit -r 1 pyVecDF.write.format("parquet").saveAsTable("PVF", mode='overwrite')

 

Python program uses SQL

# Python program uses SQL

sql = spark.sql("SELECT upper(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(CompanyName,' ',''),'&',''),';',''),'#',''),' AND ',''),' THE ',''),'LTD',''),'LIMITED',''),'PLC',''),'.',''),',',''),'[',''),']',''),'LLP',''),'INC',''),'CORP','')) as a from DataTable where CompanyName is not null")           

%timeit -r 1 sql.write.format("parquet").saveAsTable("SQL", mode='overwrite')

 

Results and Observations

image

It was interesting to note the following:

  • The hypothesis above does indeed hold true: the two methods expected to be slowest were the slowest in the experiment, and by a considerable margin.
  • The Scala UDF performs consistently regardless of the method used to call it.
  • The Python vectorised UDF now performs on a par with the Scala UDF, and there is a clear difference between the vectorised and non-vectorised Python UDFs.
  • The standard deviation for the vectorised UDF was surprisingly low and the method performed consistently on each run; the non-vectorised Python UDF was the opposite.

To summarise: moving forward, as long as you write your UDFs in Scala or use the vectorised version of the Python UDF, performance will be similar for this type of activity. It’s worth stressing that you should avoid writing UDFs as standard Python functions, given the theory and results above. Over time, across a complete solution and with more data, that extra time would add up.

Lightweight Testing in Azure through Pester & Azure Automation

I’m currently working on a relatively lightweight PaaS modern data warehouse project, and as part of the build there was a requirement to add automated testing into the platform. While I was thinking about how to achieve this, I remembered a DevOps session from SQLBits 2018 by the sabin.io team in which they did something very similar: they used PowerShell and a framework called Pester to add tests into a continuous integration pipeline hosted on VSTS. In this blog I will talk about my implementation of Pester, looking at it specifically in the context of hosting it in an Azure PaaS environment (something which adds a small amount of overhead to the implementation and took me a while to get working).

 

What is Pester?

Pester is basically a ubiquitous test and mock framework for PowerShell (hosted on GitHub here). It can be used to test almost anything, as long as you can invoke whatever you are testing through PowerShell. This includes SQL statements, Azure resources, Windows resources, etc. Pester provides the language to DEFINE and EXECUTE the test case. I won’t go through the numerous advantages of automated testing, but let’s just say it’s pretty useful to have it as part of a platform to alert you about issues early. I won’t talk about it much in the context of DevOps CI/CD, but it has use cases there too. Microsoft is also on board with the framework, and it comes pre-installed with Windows builds nowadays.

 

How do I define a test case?

A Describe block defines a group of tests, and all test files must contain at least one Describe block. An It block then defines a single test case, so in my example below each It block will invoke a SQL statement to return a result which is then tested.

The test is then passed or failed using an assertion such as Should be 0 or Should BeNullOrEmpty, based on the object passed to it. You can also use other variations such as Should BeExactly, or even Should Throw to pass a test based on a terminating error.
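For example, a terminating error can itself be the expected outcome; the line below is a trivial, self-contained illustration of the Should Throw syntax (the script block is expected to raise an error).

# Trivial illustration of Should Throw
{ throw "Connection failed" } | Should Throw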

In the code segment below, I’m checking for fact table keys that failed to look up against our dimensions. One thing to be careful of (as there is no related error message if you get it wrong) is that you need to alias your COUNT or SUM in the SQL query and then reference that alias as a property of the result object when checking the value.

Describe "Warehouse Award Fact -1 Keys" {
    It "Award Fact - Supplier -1 Keys" {
        $Query = "SELECT COUNT(*) AS AwardSupplier FROM Warehouse.FctAward WHERE SupplierKey = -1 "
        $Result = Invoke-Sqlcmd -ServerInstance $AsqlServerName -Database $AsqlDatabaseName -Query $Query -Username $AsqlUsername -Password $AsqlPassword
        $Result.AwardSupplier | Should be 0
    } 
    It "Award Fact - Contract -1 Keys" {
        $Query = "SELECT COUNT(*) AS AwardContract FROM Warehouse.FctAward WHERE ContractKey = -1 "
        $Result = Invoke-Sqlcmd -ServerInstance $AsqlServerName -Database $AsqlDatabaseName -Query $Query -Username $AsqlUsername -Password $AsqlPassword
        $Result.AwardContract | Should be 0
    } 
}

 

As part of the definition I decided it would be best to split the test blocks up into logical groups, and then keep them all within the same file. However, if you wanted to test both data and something else such as Azure resources, then I would consider splitting the tests out into separate files to keep things modular. In terms of tests, to give you an idea of the sort of things we were keen to monitor, I created the following:

  • Stage row counts match clean row counts (see the sketch after this list)
  • Business keys are distinct within the source
  • Business keys are distinct within the dimensions
  • Total £ amounts match between source and warehouse
  • No MDS errors exist on data import
  • No unknown (-1) keys in the fact tables for each dimension (as shown above)
  • Misc tests that tripped us up during early build phases based on assumptions

These tests were added to over time and formed a group of acceptance tests for each DW run. As the data is always changing, it’s good to have these to validate the initial assumptions made about a data set.
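As an illustration of the first check in the list above, a row count comparison might look something like the following. This is a hypothetical sketch: the table names (Stage.Award, Clean.Award) and the alias are made up for the example, and it reuses the same connection variables as the tests shown earlier.

Describe "Stage vs Clean Row Counts" {
    It "Award row counts match between Stage and Clean" {
        $Query = "SELECT (SELECT COUNT(*) FROM Stage.Award) - (SELECT COUNT(*) FROM Clean.Award) AS CountDiff"
        $Result = Invoke-Sqlcmd -ServerInstance $AsqlServerName -Database $AsqlDatabaseName -Query $Query -Username $AsqlUsername -Password $AsqlPassword
        $Result.CountDiff | Should be 0
    }
}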

While I didn’t need to use it, Pester also provides a Mock command to fake out other commands, putting the PowerShell code into a specific state for a particular test. This is helpful for replicating states without touching the real environment.
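A minimal, hypothetical sketch of what that could look like is below: the real Invoke-Sqlcmd is replaced with a stub that returns a fixed object, so the assertion runs without any database connection (this assumes the SqlServer module is available so the command exists to be mocked).

Describe "Row Count Logic (mocked)" {
    # Replace Invoke-Sqlcmd with a stub so no real database is touched
    Mock Invoke-Sqlcmd { return [pscustomobject]@{ CountDiff = 0 } }
    It "Treats a zero difference as a pass" {
        $Result = Invoke-Sqlcmd -ServerInstance "dummy" -Database "dummy" -Query "SELECT 1"
        $Result.CountDiff | Should be 0
    }
}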

 

Standing up Pester in Azure

Implementing the Pester framework with traditional resources is already well documented and very simple to get started with, in both a manual and an automated way. Unfortunately, as I was working on a fully PaaS project, I needed to implement the framework within Azure, which did not seem to be as well documented. To help with the task, I went straight to Azure Automation. For anyone that has not used this before, it’s basically a way to host PowerShell scripts within runbooks. While it was fairly intuitive to import the Pester framework from its GitHub repo, Automation also lets you pick modules from a gallery. Pester is part of this gallery, so for ease of use I would install it from there, which also makes maintenance slightly easier. It’s also worth defining important variables such as DB connection details outside of the runbook, within Automation, and then pulling them into the script at runtime (as the code below does with Get-AutomationVariable); think of this as similar to environment variables on an SSIS project. Credentials such as our Run As account are also defined externally to the runbook.

 

image

 

Now on to the important bit: defining the test execution script.

The first body of code connects to Azure using the Run As account as a service principal. This allows us to execute the script without using our own credentials. Setting up this account is a whole separate blog in itself, so I won’t go into that detail here.

Once this has been done, the script connects to Blob storage to fetch the test file (defined above) and places it in the local Temp directory of Azure Automation. This was one of my major blockers initially, as I was trying to find a way to invoke the test scripts from within the same PowerShell script as the execution, therefore not needing to host the tests anywhere and keeping everything contained in a runbook or parent/child runbooks. Unfortunately (as far as I can tell) the Invoke-Pester command which executes the tests needs a separate file location to be defined. Either way, we already had a blob storage area set up for the project, so creating a new container to store the tests in was not really an overhead. Automation uses a Temp folder internally for storage, so I used this to land the tests ready for processing.

# Get required variables
$AutomationConnectionName = Get-AutomationVariable -Name 'AutomationConnectionName'

# Connect to an automation connection to get TenantId and SubscriptionId
$Connection = Get-AutomationConnection -Name $AutomationConnectionName
$TenantId = $Connection.TenantId
$SubscriptionId = $Connection.SubscriptionId

# Login to Azure using AzureRunAsConnection
Connect-AzureRmAccount -ServicePrincipal -TenantId $Connection.TenantId -ApplicationId $Connection.ApplicationId -CertificateThumbprint $Connection.CertificateThumbprint

# Connect to Storage Account to get tests script
$resourceGroup = Get-AutomationVariable -Name 'ResourceGroupName'
$storageAccountName = Get-AutomationVariable -Name 'PreStageBlobStorageAccountName'
$storageAccount = Get-AzureRmStorageAccount -ResourceGroupName $resourceGroup -Name $storageAccountName 
$ctx = $storageAccount.Context
$data = Get-AzureStorageBlobContent -Blob "PesterTests.ps1" -Container 'pester' -Destination "C:\Temp\" -Context $ctx 

 

The next section of the script will depend on what you want to do with the results. For me, we already had an ETL control schema set up in our SQLDB, so it felt right to submit the results into a new table in there for reporting purposes. Alternatively, you could create a new schema called UnitTesting or similar. There are also lots of other things you could do with the results, e.g. triggering other events or feeding a CI/CD environment.

The code below opens a SQLDB connection and then creates a function to insert rows into the database, which is called in the next block of code. For security purposes, the parameters are defined outside the SQL command and added in at execution. While there are quite a few things you can extract from a Pester test result, I decided to keep things simple and take the test name, test result, and failure message. I also included an inserted date so we can work out the latest tests, as well as a trigger id to join back to the parent pipeline that called the test scripts. This ties in nicely with other bits of our ETL reporting framework.

# Open SQL connection 
$DBServer = Get-AutomationVariable -Name 'DatabaseServerFullName'
$DBName = Get-AutomationVariable -Name 'DatabaseName'
$DBUsername = Get-AutomationVariable -Name 'DatabaseAdminUsername'
$DBPassword = Get-AutomationVariable -Name 'DatabaseAdminPassword'
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection
$sqlConnection.ConnectionString = "Server=$DBServer;Database=$DBName;UID=$DBUsername;PWD=$DBPassword"
$sqlConnection.Open()

# Create GUID to group tests together
$GUID = [guid]::Newguid()
$GUID = $GUID.ToString()

# Create Inserted datetime
$Inserted = Get-Date

# Define function to submit to database
function Do-InsertRow {

    $sqlCommand = New-Object System.Data.SqlClient.SqlCommand
    $sqlCommand.Connection = $sqlConnection
    
    $sqlCommand.CommandText = "SET NOCOUNT ON; " +
        "INSERT INTO BISystem.UnitTests (TriggerId,TestName,TestResult,FailureMessage,Inserted)" +
        "VALUES (@TriggerId,@TestName,@TestResult,@FailureMessage,@Inserted); " 
    
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@TriggerId",[Data.SQLDBType]::NVarChar, 50))) | Out-Null
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@TestName",[Data.SQLDBType]::NVarChar, 200))) | Out-Null
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@TestResult",[Data.SQLDBType]::NVarChar, 10))) | Out-Null
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@FailureMessage",[Data.SQLDBType]::NVarChar, 500))) | Out-Null
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@Inserted",[Data.SQLDBType]::DateTime2))) | Out-Null

    # $PipelineTriggerID is assumed to be supplied to the runbook as a parameter (parameter passing via the webhook is covered in a separate post)
    $sqlCommand.Parameters[0].Value = $PipelineTriggerID
    $sqlCommand.Parameters[1].Value = $TestName
    $sqlCommand.Parameters[2].Value = $TestResult
    $sqlCommand.Parameters[3].Value = $TestError
    $sqlCommand.Parameters[4].Value = $Inserted

    $sqlCommand.ExecuteNonQuery()

}

 

Time to invoke the test scripts. This is done through the Invoke-Pester command. As mentioned above, I had to pass in the test script’s location. I also defined the PassThru parameter so that the results are passed into an $Output object, ready to deconstruct and write to SQLDB. While it’s not mandatory, you can also define the test groups to pick up as part of the invocation, with the added bonus of being able to use wildcards.

The script then loops over each test result within the $Output object defined previously. For each one, it extracts the attributes of the test that I’m after and then calls my insert function to write the results to SQLDB.

# Invoke Pester
$Output = Invoke-Pester -Script C:\Temp\PesterTests.ps1 -PassThru -TestName '*CaSIE*' 

# Output Results
Write-Output "TestNameFilter: $($Output.TestNameFilter)." 
Write-Output "Total Count: $($Output.TotalCount)." 
Write-Output "Passed Count: $($Output.PassedCount)." 
Write-Output "Failed Count: $($Output.FailedCount)."
Write-Output "Time: $($Output.Time)."
Write-Output ""

# Loop over TestResult objects to submit to database
$Output.TestResult | ForEach-Object {
    $TestName = $_.Name
    $TestResult = $_.Result
    $TestError = $_.FailureMessage 
    Write-Output "Test Result: $($TestName), $($TestResult), $($TestError)" 
    Do-InsertRow
} 

# Close the connection.
if ($sqlConnection.State -eq [Data.ConnectionState]::Open) {
    $sqlConnection.Close()
}

# Show done when finished (for testing/logging purpose only)
Write-Output "Finished Running Tests"

 

Once everything has been created, you’ve tested the scripts via the test pane, and you’ve checked that the results populate into the database, you are ready to hook it into your ADF pipelines and integrate it into your solution! To do this is very simple: within each runbook you can create a webhook for any external resources to call. This comes in the form of a secret URL which will kick off the runbook with the embedded script. Passing parameters into the runbook at this point requires a bit more work, and I go into this detail in a separate blog post. Be careful to take a copy of the URL, as you cannot view it after creation. It’s then just a case of creating a web activity in your ADF pipeline to call this and run the scripts. All in all, a very straightforward mechanism.
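As a quick sanity check before wiring it into ADF, the webhook can be called from any PowerShell session; the variable name and token below are placeholders for the secret URL copied at creation time, not real values.

# Hypothetical example - POST to the secret webhook URL to start the runbook
$WebhookUri = "https://s2events.azure-automation.net/webhooks?token=<copied-at-creation>"
$Response = Invoke-RestMethod -Method Post -Uri $WebhookUri

# The response should contain the id(s) of the automation job(s) that were started
Write-Output $Response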

Once the results were in SQLDB, I also defined a view on top of the table, as I was having trouble extracting the test group. To make things simple, I re-worded the individual test cases to include the group name and then used CHARINDEX to split it out, meaning I could now report by group as well. I also added a field to flag the latest test result set, so we were only reporting on the most relevant set of tests. I then built a Power BI report on top of this to integrate into our existing solution (below). Test results were presented as a % Passed measure (defined below), displayed overall, against each test group, and then against each individual test.
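For illustration, the view could look something like the sketch below. This is a hypothetical definition: it assumes the test names follow a "<Group> - <Test>" convention and that every run shares a single Inserted timestamp (as in the script above); the column names mirror those used in the insert function and the DAX measure that follows.

-- Rough sketch of the reporting view (hypothetical, assumes '<Group> - <Test>' naming)
CREATE VIEW BISystem.UnitTestsView
AS
SELECT  TestId,
        TriggerId,
        TestName,
        LEFT(TestName, CHARINDEX(' - ', TestName) - 1)                AS TestGroup,
        SUBSTRING(TestName, CHARINDEX(' - ', TestName) + 3, 200)      AS TestCase,
        TestResult,
        FailureMessage,
        Inserted,
        -- Flag the most recent run so reports can filter to the latest results
        CASE WHEN Inserted = MAX(Inserted) OVER () THEN 1 ELSE 0 END  AS LatestTest
FROM    BISystem.UnitTests;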

% Passed = 
VAR PercentPassed =
DIVIDE(
CALCULATE(COUNT('BISystem UnitTestsView'[TestId]),'BISystem UnitTestsView'[TestResult] = "Passed", 'BISystem UnitTestsView'[LatestTest] = 1),
CALCULATE(COUNT('BISystem UnitTestsView'[TestId]),'BISystem UnitTestsView'[LatestTest] = 1)
) 
RETURN
IF(PercentPassed = BLANK(),0,PercentPassed)

 

image

 

Summary

Pester is a great tool for adding automated testing into your project. It took a few hours to stand up, but now that’s been done, it’s just a case of defining tests. The ubiquitous nature of the framework means we can define all sorts of tests across data/software/hardware. The thing I like the most about using it is that it’s simple. While I decided it was best to populate the test results into a database for ease of reporting, it might also be worth investigating publishing them into the VSTS testing framework. Hopefully you will find this blog useful.