Tristan Robinson

Tristan Robinson's Blog

Power BI Composite Models and Aggregations

So something which feels like its gone under the radar a bit is the addition of composite models and aggregations in Power BI. These have been around a couple of months now in Preview but I’ve not seen much buzz around the features. After recently attending PASS Summit and seeing Christian Wade’s 1 Trillion row demo, I thought it was worth blogging about – especially as I expect parts of this functionality to be expanded to Azure AS in the near future.

If you think about it, the majority of BI query’s are done at some form of aggregation level but users will still want the detail, and so these features essentially unlock the ability to report against giant datasets at both an aggregated and granular level at the same time – something that was not physically possible beforehand. Power BI is now able to work with multiple petabytes of data with ease over trillions of rows with pretty much instant response times – so this is a bit of a game changer.

 

Composite Models

Previously with Power BI, you were restricted to either DirectQuery or Import for a single data source. With Composite models that has all changed. A report can now include data connections from more than one connection in any combination, so you are able to connect to both DirectQuery AND Import in the same model and combine the data. This then opens up a world of possibilities that were not possible before.  With this functionality, we also have the ability to create many-to-many relationships, but I won’t be doing into detail in this for this blog. As part of this also comes some functionality called Storage Mode which unlocks table-level storage features. This is the next part of the jigsaw.

 

Storage Mode

This allows you to specify whether tables are either imported or queried on the fly. There is now a third option “Dual” which acts as either a cached or not cached table depending on the context of the query that's submitted at runtime.

Setting the correct storage mode has many advantages such as:

  • Better query performance (no need to push real time queries to the dataset for relatively static tables such as those used for filters).
  • Ability to use larger datasets (interactive analysis is better for datasets you don't want to cache into memory such as tables with significant data volume).
  • Data refresh optimisation (only cache data that's necessary, meaning quicker refresh times).
  • Reducing latency in real-time data (no need to re-read those static tables each time).

To access the storage mode, you either select the table in the report pane and click Properties, or navigate to the table in the new Modelling View (requires exposing this in Options as its in preview).

Changing a table storage mode will then prompt us to change the related tables to Dual. This propagation logic is designed to help with models that contain many tables!

image

 

Modelling View

Before I get to aggregations, I also need to introduce the new modelling view. This will be changing to fall in line with what you may be familiar with on SSAS / Azure AS. It feels like this is one of the first steps to integrate SSAS further into the Power BI spectrum and facilitate enterprise datasets. By updating the modelling view, you now have the ability to create additional diagrams, thus allowing you to break out particularly complex models by subject area rather than trying to join it all up within one model. For instance, if you are working with 5 tables in your model, this isn't a particularly big deal – now multiply that by 10 and all of a sudden it becomes a pain to manage, and this new feature will help alleviate that. The feature also allows you to multi select objects in one go and update properties through a new side-pane. An example of this new feature can be seen below.

Modeling View

 

Aggregations

Finally, the most exciting feature of the lot – aggregations. Without the new modelling view, without the composite models, without the new storage modes – this would not be possible. Aggregations allow us to create a newly defined table within the model but with all fields of the original table it is created from to be aggregated/grouped in some or another. These aggregations/groups include count, group by, max, min, sum, and count rows. This can be set to either set to Import mode with/without incremental refresh, or via DirectQuery and optimised by using columnstore indexes. This then unlocks faster query performance over huge datasets via the cache at aggregated level using fractions of resource compared to detailed levels.

You then have the flexibility to set the aggregate table to import, while leaving the granular table to DirectQuery, which will speed up performance when navigating the report. The aggregate table can also be hidden so that user will not even be aware of the implementation.

You can also have more than one aggregation table, potentially one for a particular set of reports and then a lower grain aggregate table for analysts, and this can be done through the precedence option. The tool then will then query the aggregation table with highest precedence level first to see if it can resolve the runtime query before moving down the levels.

Aggregations don’t just work for measures such as SUM or COUNT. It also works for more complex measures - all of the components of a measure are always folded down to the sum, min, max, count, level and then those sub query’s work out whether they can hit the cache or not.  Its also worth mentioning that you can check if its hit the cache via the DAX editor.

image

 

Conclusion

For me, all these features mentioned above will be a bit of a game changer in the right scenario. I’ve ran into scaling problems before and re-working the logic at the Warehouse level for both a granular and aggregated datasets certainly added some overhead. While the functionality obviously doesn't work with Live Connection into SSAS cubes, this new functionality opens up options for using Power BI to do of the activities which historically would have only been done in SSAS. As Christian Wade also alluded to at PASS, Power BI will soon become a superset of SSAS and so we may well be doing less and less with SSAS and more with Power BI in this area as time goes by.

Power BI Dataflows – Bringing self-service ETL to the Business User

Announced at the recent PASS Summit, Power BI Dataflows has now gone into public preview. Previously, any ETL that users applied via Power Query within Power BI Desktop was only applied to their own dataset. With Dataflows, ETL processes are now classified as a first-class citizen and this update provides centralised self-service data prep as part of the Power BI product so that business users can expose cleaned/schematised entities across the business in a similar way to a Data Warehouse. No longer will the same lightweight ETL processes be occurring across multiple users for the same dataset, and often in different ways. Traditionally, self-service BI is limited to analytic models and reports/dashboards but with Dataflows that is changing as it shifts down a peg.  In this blog, I’ll take a look at the new functionality and give my thoughts.

Before we begin, its worth noting that this is an entirely different piece of functionality to that of the similarly named Data Flows in Data Factory which is in private preview. You can tell the difference by the capitalisation of the word “Flow” for Data Factory. I’m not sure what the marketing department would say about this but it is slightly confusing having both been released as news at the same time.

 

What is a Dataflow?

A dataflow is a collection of entities (which are similar to tables) that are created and managed from within the Power BI Service (powerbi.com). This is then stored under the Common Data Model (CDM) in an Azure Data Lake Storage (Gen2) resource as files and folders – no longer as a Tabular Model that we are used to with Datasets.  Dataflows are used to ingest, cleanse, transform, enrich and schematize/model the data in a similar way to that of ETL to create a Data Warehouse. Once created you can use Power BI Desktop to create datasets based off of these entities in the usual manner. While dataflows can be created and edited by both Pro and Premium users, there is a larger set of functionality only available via premium. This includes creating computed entities (calculation aggregates), linking to existing entities in other dataflows, and incremental refreshes.

 

Creating a Dataflow

To go about creating a dataflow, navigate to an App Workspace on the Power BI Service (note that this cannot be My Workspace for obvious reasons, also note this cannot be done in Power BI Desktop). You’ll notice alongside Dashboards, Reports, Workbooks, and Datasets that there is a new artefact labelled Dataflows (Preview). If you have Premium capacity this will need to be enabled via the Admin Portal Capacity Settings. Click the + Create and select Dataflow. Each dataflow has only one owner and only the owner can edit it. To create a dataflow, you can either define new entities which involves connecting to a data source and then mapping to the Common Data Model entities / defining custom entities OR linking to other existing dataflows (see below).

image

 

If you are creating new entities, you are provided with a set of familiar data source connectors (see below). This set is not as complete as those through Power BI Desktop but I expect more to appear within time. I suspect there is also more work to do here because some of these connectors documented below, i.e. Excel - ask for a File Path or URL without giving any functionality to go get this through a lookup. This is something we have come to expect as part of Power BI Desktop.

image

 

Once you have loaded your data, you will then be provided with a fairly familiar query editor pane (see below). The first thing you will notice is that the options to manipulate and transform your data are limited to that of Power Query as part of Power BI Desktop. Again, I suspect this will mature when the functionality hits GA but for the time being gives some reasonable functionality to do what you need to do to create entities in the form of column/row transformations. You are also provided an option to “map to standard” which allows you to join into the Common Data Model (CDM), essentially some generic templates for entities such as Account/Product/Customer/etc. While I can see how this could be useful for an average user, I think it would still be better to create your own based around your businesses own definition of an Account/Product/Customer/etc. These are then stored as custom entities within the CDM. Once the entities are transformed/schematised how you want them, you save them to the Power BI Service with a name/description. Under the hood, this is then written to a Data Lake as a series of files along with a JSON schema.

image

 

Dataflows are then available via Power BI Desktop using Get Data (see below). I believe you will need the November 2018 release or later to see this option. The dataset is then treated in a similar way to everything else you would have seen previously within Power BI Desktop. When selected, you should be able to see all of the workspaces within your BI tenant that you have access to which contain dataflows. Within each dataflow you can see each entity.

That’s really it, they’ve kept it simple on purpose and abstract from what’s going on under the hood.

image

 

Wait, but what does this mean for the classic Enterprise Data Warehouse?

I certainly think from a business perspective it will blur the lines slightly between using traditional ETL processes to create a DW or going down this route, more so at smaller scales. It would be hard pressed to replace an Enterprise level solution, but there is definitely a use case somewhere around departmental level instead of creating a data mart. It will allow users to leverage their data faster than waiting on IT to build out a solution.

It really depends on the complexity and maturity of the platform they are intending to develop and its purpose. The fact is sits on ADLS Gen2 is a positive and allows other applications to access the data at a raw level rather than relying on a specific connector. It also allows users of products such as PowerApps to connect into a cleaned CDM, another plus.

On the other side, its worth bearing in mind that you will need the Premium version to facilitate incremental refresh which is required by the majority of data warehouse solutions.  The lowest cost of Power BI Premium is that of a single P1 node which costs at the time of writing £3,766/month or £45k/pa – not small by any standards. This is worth taking into consideration.

The other fact is that traditional DWHs deal with concepts such as SCD Type II and creating History which Dataflows will not be able to facilitate. You may also have noticed there was no mention of DAX or measures, which still needs to be down at the usual level within Power BI Desktop, not ideal to share KPIs across users. Lastly, there is currently no data lineage, although this has been rumoured to be part of the roadmap.

 

Conclusion

While this feature will be a game changer for a number of end users and projects, I think it will join the set of questions such as “Do I need SSAS or can I use Power BI for my models?” I expect to see this appear at next years conferences! The answer will always be – it depends!

As we have seen with some aspects of Power BI, this new functionality also opens up businesses to governance issues with potentially having users create their own version of the truth rather than something which has been thought out and modelled by someone with a certain skillset or authoritative source. This could cause contention between departments.

As noted above, I expect the Power BI team to continue to mature the functionality of the product – but for the time being, it certainly opens up another avenue to draw people into the toolset and provide another level of functionality for business and users alike.

Spark Streaming in Azure Databricks

Real-time stream processing is becoming more prevalent on modern day data platforms, and with a myriad of processing technologies out there, where do you begin? Stream processing involves the consumption of messages from either queue/files, doing some processing in the middle (querying, filtering, aggregation) and then forwarding the result to a sink – all with a minimal latency. This is in direct contrast to batch processing which usually occurs on an hourly or daily basis. Often is this the case, both of these will need to be combined to create a new data set.

In terms of options for real-time stream processing on Azure you have the following:

  • Azure Stream Analytics
  • Spark Streaming / Storm on HDInsight
  • Spark Streaming on Databricks
  • Azure Functions

Stream Analytics is a simple PaaS offering. It connects easily into other Azure resources such as Event Hubs, IoT Hub, and Blob, and outputs to a range of resources that you’d expect. It has its own intuitive query language, with the added benefit of letting you create functions in JavaScript. Scaling can be achieved by partitions, and it has windowing and late arrival event support that you’d expect from a processing option. For most jobs, this service will be the quickest/easiest to implement as long as its relatively small amount of limitations fall outside the bounds of what you want to achieve. Its also worth noting that the service does not currently support Azure network security such as Virtual Networks or IP Filtering. I suspect this may only be time with the Preview of this in EventHubs.

Both Spark Streaming on HDInsight and Databricks open up the options for configurability and are possibly more suited to an enterprise level data platform, allowing us to use languages such as Scala/Python or even Java for the processing in the middle. The use of these options also allows us to integrate Kafka (an open source alternative to EventHubs) as well as HDFS, and Data Lake as inputs. Scalability is determined by the cluster sizes and the support for other events mentioned above is also included. These options also give us the flexibility for the future, and allow us to adapt moving forward depending on evolving technologies. They also come with the benefit of Azure network security support so we can peer our clusters onto a virtual network.

Lastly – I wouldn’t personally use this but we can also use Functions to achieve the same goal through C#/Node.js. This route however does not include support for those temporal/windowing/late arrival events since functions are serverless and act on a per execution basis.

In the following blog, I’ll be looking at Spark Streaming on Databricks (which is fast becoming my favourite research topic).

A good place to start this is to understand the structured streaming model which I’ve seen a documented a few times now. Essentially treating the stream as an unbounded table, with new records from the stream being appended as a new rows to the table. This allows us to treat both batch and streaming data as tables in a DataFrame, therefore allowing similar queries to be run across them.

 

image

 

At this point, it will be useful to include some code to help explain the process. Before beginning its worth mounting your data sink to your databricks instance so you can reference it as if it were inside the DBFS (Databricks File System) – this is merely a pointer. For more info on this, refer to the databricks documentation here. Only create a mount point if you want all users in the workspace to have access. If you wish to apply security, you will need to access the store directly (also documented in the same place) and then apply permissions to the notebook accordingly.

As my input for my stream was from EventHubs, we can start by defining the reading stream. You’ll firstly need to add the maven coordinate com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.2 to add the EventHub library to the cluster to allow the connection. Further options can be added for the consumer group, starting positions (for partitioning), timeouts and events per trigger. Positions can also be used to define starting and ending points in time so that the stream is not running continuously.

connectionString = "Endpoint=sb://{EVENTHUBNAMESPACE}.servicebus.windows.net/{EVENTHUBNAME};EntityPath={EVENTHUBNAME};SharedAccessKeyName={ACCESSKEYNAME};SharedAccessKey={ACCESSKEY}"

startingEventPosition = {
  "offset": "-1",         # start of stream
  "seqNo": -1,            # not in use
  "enqueuedTime": None,   # not in use
  "isInclusive": True
}

endingEventPosition = {
  "offset": None,                                             # not in use
  "seqNo": -1,                                                # not in use
  "enqueuedTime": dt.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ"), # point in time
  "isInclusive": True
}

ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
ehConf['eventhubs.startingPosition'] = json.dumps(startingEventPosition)
ehConf['eventhubs.endingPosition'] = json.dumps(endingEventPosition)

df = spark \
  .readStream \
  .format("eventhubs") \
  .options(**ehConf) \
  .load()

The streaming data that is then output then follows the following schema – the body followed by a series of metadata about the streaming message.

 

image

 

Its important to note that the body comes out as a binary stream (this contains our message). We will need to cast the body to a String to deserialize the column to the JSON that we are expecting. This can be done by using some Spark SQL to turn the binary into a string as JSON and then parsing the column into a StructType with specified schema. If multiple records are coming through in the same message, you will need to explode out the result into separate records. Flattening out the nested columns is also useful as long as the data frame is still manageable. Spark SQL provides some great functions here to make our life easy.

rawData = df. \
  selectExpr("cast(body as string) as json"). \
  select(from_json("json", Schema).alias("data")). \
  select("data.*")

While its entirely possible to construct your schema manually, its also worth noting that you can take a sample JSON, read it into a data frame using spark.read.json(path) and then calling printSchema() on top of it to return the inferred schema. This can then used be used to create the StructType.

# Inferred schema:

#   root
#    |-- LineTotal: string (nullable = true)
#    |-- OrderQty: string (nullable = true)
#    |-- ProductID: string (nullable = true)
#    |-- SalesOrderDetailID: string (nullable = true)
#    |-- SalesOrderID: string (nullable = true)
#    |-- UnitPrice: string (nullable = true)
#    |-- UnitPriceDiscount: string (nullable = true)
  
Schema = StructType([
    StructField('SalesOrderID', StringType(), False),
    StructField('SalesOrderDetailID', StringType(), False),
    StructField('OrderQty', StringType(), False),
    StructField('ProductID', StringType(), False),
    StructField('UnitPrice', StringType(), False),
    StructField('UnitPriceDiscount', StringType(), False),
    StructField('LineTotal', StringType(), False)
])

At this point, you have the data streaming into your data frame. To output to the console you can use display(rawData) to see the data visually. However this is only useful for debugging since the data is not actually going anywhere! To write the stream into somewhere such as data lake you would then use the following code. The checkpoint location can be used to recover from failures when the stream is interrupted, and this is important if this code were to make it to a production environment. Should a cluster fail, the query be restarted on a new cluster from a specific point and consistently recover, thus enabling exactly-once guarantees. This also means we can change the query as long as the input source and output schema are the same, and not directly interrupt the stream. Lastly, the trigger will check for new rows in to stream every 10 seconds.

rawData.writeStream \
    .format("json") \
    .outputMode("append") \
    .option("path", PATH) \
    .trigger(processingTime = "10 seconds") \
    .option("checkpointLocation", PATH) \
    .start()

Checking our data lake, you can now see the data has made its way over, broken up by the time intervals specified.

 

image

 

Hopefully this is useful for anyone getting going in the topic area. I’d advise to stick to Python given the extra capacity of the PySpark language over Scala, even though a lot of the Databricks documentation / tutorials uses Scala. This was just something that felt more comfortable.

If you intend to do much in this area I would definitely suggest you use the PySpark SQL documentation which can be found here. This is pretty much a bible for all commands and I’ve been referencing it quite a bit. If this is not enough there is also a cheat sheet available here. Again, very useful for reference when the language is still not engrained.

Getting Started with Databricks Cluster Pricing

The use of databricks for data engineering or data analytics workloads is becoming more prevalent as the platform grows, and has made its way into most of our recent modern data architecture proposals – whether that be PaaS warehouses, or data science platforms.

To run any type of workload on the platform, you will need to setup a cluster to do the processing for you. While the Azure-based platform has made this relatively simple for development purposes, i.e. give it a name, select a runtime, select the type of VMs you want and away you go – for production workloads, a bit more thought needs to go into the configuration/cost.  In the following blog I’ll start by looking at the pricing in a bit more detail which will aim to provide a cost element to the cluster configuration process.

For arguments sake, the work that we tend to deliver with databricks is based on data engineering usage – spinning up resource for an allocated period to perform a task. Therefore this is generally the focus for the following topic.

 

To get started in this area, I think it would be useful to included some definitions.

  • DBU – a databricks unit (unit of processing capability per hour billed on per second usage)
  • Data Engineering Workload - a job that both starts and terminates the cluster which it runs on (via the job scheduler)
  • Data Analytics Workload – a non automated workload, for example running a command manually within a databricks notebook. Multiple users can share the cluster to perform interactive analysis
  • Cluster – made up of instances of processing (VMs) and constitute of a driver, and workers. Workers can either be provisioned upfront, or autoscaled between a min no. workers / max no. workers.
  • Tier – either standard or premium. Premium includes role based access control, ODBC endpoint authentication, audit logs, Databricks Delta (unified data management system).

The billing for the clusters primarily works depending on the type of workload you initiate and tier (or functionality) you require. As you might of guessed data engineering workloads on the standard tier offer the best price.

I’ve taken the DS3 v2 instance (VM) pricing from the Azure Databricks pricing page.

image

 

The pricing can be broken down as follows:

  • Each instance is charged at £0.262/hour. So for example, the cost of a very simple cluster - 1 driver and 2 workers is £0.262/hour x 3 = £0.786/hour. The VM cost does not depend on the workload type/tier.
  • The DBU cost is then calculated at £0.196/hour. So for example, the cost of 3 nodes (as above) is £0.196/hour x 3 = £0.588/hour. This cost does change depending on workload type/tier.
  • The total cost is then £0.786/hour (VM Cost) + £0.588/hour (DBU Cost) = £1.374/hour. Also known as the pay as you go price. Discounts are then added accordingly for reserved processing power.

I thought this was worth simplifying since the pricing page doesn’t make this abundantly clear with the way the table is laid out and often this is overlooked. Due to the vast amount of options you can have to setup clusters, its worth understanding this element to balance against time.

The DBU count is merely to be used as reference to compare the different VMs processing power and is not directly included in the calculations. Its also worth mentioning that by default databricks services are setup as premium and can be downgraded to standard only by contacting support. In some cases, this can add some massive cost savings depending upon the type of work you are doing on the platform so please take this into account before spinning up clusters and don’t just go with the default.

With regards to configuration, clusters can either be setup under a High Concurrency mode (previously known as serverless) or as Standard. The high concurrency mode is optimised for concurrent workloads and therefore is more applicable to data analytics workloads and interactive notebooks which are used by multiple users simultaneously. This piece of configuration does not effect the pricing.

By using the following cost model, we can then assume for a basic batch ETL run where we have a driver and 8 worker nodes on relatively small DS3 instances, would cost £123.60/month given a standard 1 hour daily ETL window. Hopefully this provides as a very simple introduction into the pricing model used by Databricks.

Databricks UDF Performance Comparisons

I’ve recently been spending quite a bit of time on the Azure Databricks platform, and while learning decided it was worth using it to experiment with some common data warehousing tasks in the form of data cleansing. As Databricks provides us with a platform to run a Spark environment on, it offers options to use cross-platform APIs that allow us to write code in Scala, Python, R, and SQL within the same notebook. As with most things in life, not everything is equal and there are potential differences in performance between them. In this blog, I will explain the tests I produced with the aim of outlining best practice for Databricks implementations for UDFs of this nature.

Scala is the native language for Spark – and without going into too much detail here, it will compile down faster to the JVM for processing. Under the hood, Python on the other hand provides a wrapper around the code but in reality is a Scala program telling the cluster what to do, and being transformed by Scala code. Converting these objects into a form Python can read is called serialisation / deserialisation, and its expensive, especially over time and across a distributed dataset. This most expensive scenario occurs through UDFs (functions) – the runtime process for which can be seen below. The overhead here is in (4) and (5) to read the data and write into JVM memory.

image

Using Scala to create the UDFs, the execution process can skip these steps and keep everything native. Scala UDFs operate within the JVM of the executor so we can skip serialisation and deserialisation.

image

 

Experiments

As part of my data for this task I took a list of company names from a data set and then run them through a process to codify them, essentially stripping out characters which cause them to be unique and converting them to upper case, thus grouping a set of companies together under the same name. For instance Adatis, Adatis Ltd, and Adatis (Ltd) would become ADATIS. This was an example of a typical cleansing activity when working with data sets. The dataset in question was around 2.5GB and contained 10.5m rows. The cluster I used was Databricks runtime 4.2 (Spark 2.3.1 / Scala 2.11) with Standard_DS2_v2 VMs for the driver/worker nodes (14GB memory) with autoscaling disabled and limited to 2 workers. I disabled the autoscaling for this as I was seeing wildly inconsistent timings each run which impacted the tests. The goods news is that with it enabled and using up to 8 workers, the timings were about 20% faster albeit more erratic from a standard deviation point of view.

The following approaches were tested:

  • Scala program calls Scala UDF via Function
  • Scala program calls Scala UDF via SQL
  • Python program calls Scala UDF via SQL
  • Python program calls Python UDF via Function
  • Python program calls Python Vectorised UDF via Function
  • Python program uses SQL

While it was true in previous versions of Spark that there was a difference between these using Scala/Python, in the latest version of Spark (2.3) it is believed to be more of a level playing field by using Apache Arrow in the form of Vectorised Pandas UDFs within Python.

As part of the tests I also wanted to use Python to call a Scala UDF via a function but unfortunately we cannot do this without creating a Jar file of the Scala code and importing it separately. This would be done via SBT (build tool) using the following guide here. I considered this too much of an overhead for the purposes of the experiment.

The following code was then used as part of a Databricks notebook to define the tests. A custom function to time the write was required for Scala whereas Python allows us to use %timeit for a similar purpose.

 

Scala program calls Scala UDF via Function

// Scala program calls Scala UDF via Function

%scala

def codifyScalaUdf = udf((string: String) => string.toUpperCase.replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP",""))  

spark.udf.register("ScalaUdf", codifyScalaUdf)  

val transformedScalaDf = table("DataTable").select(codifyScalaUdf($"CompanyName").alias("CompanyName"))
val ssfTime = timeIt(transformedScalaDf.write.mode("overwrite").format("parquet").saveAsTable("SSF"))

 

Scala program calls Scala UDF via SQL

// Scala program calls Scala UDF via SQL

%scala

val sss = spark.sql("SELECT ScalaUdf(CompanyName) as a from DataTable where CompanyName is not null")
val sssTime = timeIt(sss.write.mode("overwrite").format("parquet").saveAsTable("SSS"))

 

Python program calls Scala UDF via SQL

# Python program calls Scala UDF via SQL

pss = spark.sql("SELECT ScalaUdf(CompanyName) as a from DataTable where CompanyName is not null")
%timeit -r 1 pss.write.format("parquet").saveAsTable("PSS", mode='overwrite') 

 

Python program calls Python UDF via Function

# Python program calls Python UDF via Function

from pyspark.sql.functions import *
from pyspark.sql.types import StringType

@udf(StringType())
def pythonCodifyUDF(string):
    return (string.upper().replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP",""))

pyDF = df.select(pythonCodifyUDF(col("CompanyName")).alias("CompanyName")).filter(col("CompanyName").isNotNull())
%timeit -r 1 pyDF.write.format("parquet").saveAsTable("PPF", mode='overwrite')

 

Python program calls Python Vectorised UDF via Function

# Python program calls Python Vectorised UDF via Function

from pyspark.sql.types import StringType
from pyspark.sql.functions import pandas_udf, col

@pandas_udf(returnType=StringType())
def pythonCodifyVecUDF(string):
    return (string.replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP","")).str.upper()
  
pyVecDF = df.select(pythonCodifyVecUDF(col("CompanyName")).alias("CompanyName")).filter(col("CompanyName").isNotNull())
%timeit -r 1 pyVecDF.write.format("parquet").saveAsTable("PVF", mode='overwrite')

 

Python Program uses SQL

# Python Program uses SQL

sql = spark.sql("SELECT upper(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(CompanyName,' ',''),'&',''),';',''),'#',''),' AND ',''),' THE ',''),'LTD',''),'LIMITED',''),'PLC',''),'.',''),',',''),'[',''),']',''),'LLP',''),'INC',''),'CORP','')) as a from DataTable where CompanyName is not null")           

%timeit -r 1 sql.write.format("parquet").saveAsTable("SQL", mode='overwrite')

 

Results and Observations

image

It was interesting to note the following:

  • The hypothesis above does indeed hold true and the 2 methods which were expected to be slowest were within the experiment, and by a considerable margin.
  • The Scala UDF performs consistently regardless of the method used to call the UDF.
  • The Python vectorised UDF now performs on par with the Scala UDFs and there is a clear difference between the vectorised and non-vectorised Python UDFs.
  • The standard deviation for the vectorised UDF was surprisingly low and the method was performing consistently each run. The non-vectorised Python UDF was the opposite.

To summarise, moving forward – as long as you adopt to writing your UDFs in Scala or use the vectorised version of the Python UDF, the performance will be similar for this type of activity. Its worth noting to definitely avoid writing the UDFs as standard Python functions due to the theory and results above. Over time, across a complete solution and with more data, this time would add up.

Lightweight Testing in Azure through Pester & Azure Automation

I’m currently working on a relatively lightweight PaaS modern data warehouse project, and as part of the build there was a requirement to add automated testing into the platform. While I was thinking about how I could achieve this - I remembered during the 2018 SQLBits I attended there was a DevOps session by the sabin.io team in which they did something very similar. They used PowerShell to add tests into part of a continuous integration pipeline hosted on VSTS through a framework called Pester. In this blog, I will talk about my implementation of Pester, but specifically look at it through the context of hosting it in a Azure PaaS environment (something which adds a small amount of overhead to its implementation and took me a while to get working).

 

What is Pester?

Pester is basically a ubiquitous test and mock framework for PowerShell (hosted on GitHub here). It can literally be used to test anything - as long as you can invoke whatever you are trying to test through PowerShell. This includes SQL statements, Azure resources, Windows resources, etc. Pester provides the language to DEFINE and EXECUTE the test case. I won’t go through the numerous advantages of automated testing, but lets just say its pretty useful to have it as part of a platform to alert you about issues early. I won’t talk about it so much in the context of DevOps CI/CD, but its got use cases here too. Microsoft is also on board with the framework, and it comes pre-installed with Windows builds nowadays.

 

How do I define a test case?

Describe defines a group of tests, and all test files must contain at least one describe block. It then defines a single test case, so in my example below it will invoke a SQL statement to return a result which is then tested.

The test is then passed or failed using an assertion such as Should be 0 or Should BeNullOrEmpty based on the object passed to it. You can also use other variations such as ShouldBeExactly or even Should Throw to pass a test based on a terminating error.

In the code segment below, I’m checking keys that were unable to lookup against our dimensions from the fact tables. One thing to be careful of (as there was no related error message) was that you will need to alias your COUNT or SUM in the SQL query and then call that as a property of your result object afterwards to check against.

Describe "Warehouse Award Fact -1 Keys" {
    It "Award Fact - Supplier -1 Keys" {
        $Query = "SELECT COUNT(*) AS AwardSupplier FROM Warehouse.FctAward WHERE SupplierKey = -1 "
        $Result = Invoke-Sqlcmd -ServerInstance $AsqlServerName -Database $AsqlDatabaseName -Query $Query -Username $AsqlUsername -Password $AsqlPassword
        $Result.AwardSupplier | Should be 0
    } 
    It "Award Fact - Contract -1 Keys" {
        $Query = "SELECT COUNT(*) AS AwardContract FROM Warehouse.FctAward WHERE ContractKey = -1 "
        $Result = Invoke-Sqlcmd -ServerInstance $AsqlServerName -Database $AsqlDatabaseName -Query $Query -Username $AsqlUsername -Password $AsqlPassword
        $Result.AwardContract | Should be 0
    } 
}

 

As part of the definition I decided it would be best to spit the test blocks up into logical groups, and then keep them all within the same file. However, if you wanted to test both data and something else such as Azure resources, then I would consider splitting out the tests into separate files to keep things modular. In terms of tests, to give you an idea of the sort of things we were keen to monitor, I created the following:

  • Stage row counts match clean row counts
  • Business keys are distinct within source
  • Business keys were distinct within dimensions
  • Total £ amounts matched between source and warehouse
  • No MDS errors existed on data import
  • Fact table unknown keys for each dimension (above)
  • Misc tests that tripped us up during early build phases based on assumptions

These tests were added to over time and formed a group of acceptance tests for each DW run. As data is always changing, its good to have these to validate your initial premises put in place around a data set.

While I didn’t need to use it, there are also commands to mock variables therefore putting the PowerShell code into a specific state for a particular test. This is helpful to avoid changing the real environment while replicating states.

 

Standing up Pester in Azure

Implementing the pester framework with traditional resources is already well documented and very simple to get started in both a manual and automated way. Unfortunately as I was working on a fully PaaS project, I needed to implement the framework within Azure. This did not seem to be as well documented. To help me out with the task, I went straight to Azure Automation – for anyone that has not used this before, its basically a way to host PowerShell scripts within Runbooks. While it was fairly intuitive to import the Pester framework from their GitHub repo, there is also an option with Automation to select from a modules gallery. Pester is part of this gallery, so for ease of use, I would download it here which also makes maintenance slightly easier. Its also worth mentioning that its also worth defining important variables such as DB connections, etc outside of the Runbook within Automation and then passing them in as parameters. Think of this as similar to environment variables on a SSIS project. Credentials such as our Runas account are also defined externally to the runbook.

 

image

 

Now on to the important bit; defining the testing execution script.

The first body of code will connect to Azure using the Runas account as a service principal. This allows us to execute the script without using our own credentials. Setup of this account is a whole separate blog in itself so I won’t go into that detail here.

Once this has been defined, the script will then connect into Blob storage to extract the tests (defined above), and place it in the local Temp directory of Azure Automation. This was one of my major blockers initially as I was trying to find a way to Invoke the test scripts from within the same PS script as the execution, therefore not needing to host the tests anywhere and keep everything contained in a runbook or parent/child runbooks. Unfortunately (as far as I can tell) the Invoke-Pester command which executes the tests needs a separate file location to be defined. Either way, we had a blob storage area already setup for the project so this was not really an overhead to create a new container to store the tests in. Automation uses a Temp folder internally to store anything, so I used this to land the tests ready for processing.

# Get required variables
$AutomationConnectionName = Get-AutomationVariable -Name 'AutomationConnectionName'

# Connect to an automation connection to get TenantId and SubscriptionId
$Connection = Get-AutomationConnection -Name $AutomationConnectionName
$TenantId = $Connection.TenantId
$SubscriptionId = $Connection.SubscriptionId

# Login to Azure using AzureRunAsConnection
Connect-AzureRmAccount -ServicePrincipal -TenantId $Connection.TenantId -ApplicationId $Connection.ApplicationId -CertificateThumbprint $Connection.CertificateThumbprint

# Connect to Storage Account to get tests script
$resourceGroup = Get-AutomationVariable -Name 'ResourceGroupName'
$storageAccountName = Get-AutomationVariable -Name 'PreStageBlobStorageAccountName'
$storageAccount = Get-AzureRmStorageAccount -ResourceGroupName $resourceGroup -Name $storageAccountName 
$ctx = $storageAccount.Context
$data = Get-AzureStorageBlobContent -Blob "PesterTests.ps1" -Container 'pester' -Destination "C:\Temp\" -Context $ctx 

 

The next section of the script will depend on what you want to do with the results. For me, we already had an ETL control schema setup in our SQLDB, so it felt right to submit the results into a new table in there for reporting purposes. Alternatively you could create a new schema called UnitTesting or similar. There are also lots of other things you could do with the results, i.e. to trigger other events or use in a CI/CD environment.

The below code will open up a SQLDB connection, and then create a function to insert rows into the DB which will call in the next block of code. For security purposes, the parameters are defined outside the SQL command and added in at execution. While there are quite a few things you can extract from a pester test result, I decided to take the Test Name, Test Result, and Failure Message – to keep things simple. I also included an inserted date so we can work out the latest tests, as well as a Trigger Id to join it back into the parent pipeline that called the test scripts. This ties in nicely to other bits of our ETL reporting framework.

# Open SQL connection 
$DBServer = Get-AutomationVariable -Name 'DatabaseServerFullName'
$DBName = Get-AutomationVariable -Name 'DatabaseName'
$DBUsername = Get-AutomationVariable -Name 'DatabaseAdminUsername'
$DBPassword = Get-AutomationVariable -Name 'DatabaseAdminPassword'
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection
$sqlConnection.ConnectionString = "Server=$DBServer;Database=$DBName;UID=$DBUsername;PWD=$DBPassword"
$sqlConnection.Open()

# Create GUID to group tests together
$GUID = [guid]::Newguid()
$GUID = $GUID.ToString()

# Create Inserted datetime
$Inserted = Get-Date

# Define function to submit to database
function Do-InsertRow {

    $sqlCommand = New-Object System.Data.SqlClient.SqlCommand
    $sqlCommand.Connection = $sqlConnection
    
    $sqlCommand.CommandText = "SET NOCOUNT ON; " +
        "INSERT INTO BISystem.UnitTests (TriggerId,TestName,TestResult,FailureMessage,Inserted)" +
        "VALUES (@TriggerId,@TestName,@TestResult,@FailureMessage,@Inserted); " 
    
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@TriggerId",[Data.SQLDBType]::NVarChar, 50))) | Out-Null
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@TestName",[Data.SQLDBType]::NVarChar, 200))) | Out-Null
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@TestResult",[Data.SQLDBType]::NVarChar, 10))) | Out-Null
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@FailureMessage",[Data.SQLDBType]::NVarChar, 500))) | Out-Null
    $sqlCommand.Parameters.Add((New-Object Data.SqlClient.SqlParameter("@Inserted",[Data.SQLDBType]::DateTime2))) | Out-Null

    $sqlCommand.Parameters[0].Value = $PipelineTriggerID
    $sqlCommand.Parameters[1].Value = $TestName
    $sqlCommand.Parameters[2].Value = $TestResult
    $sqlCommand.Parameters[3].Value = $TestError
    $sqlCommand.Parameters[4].Value = $Inserted

    $sqlCommand.ExecuteNonQuery()

}

 

Time to Invoke the test scripts. This can be done through the Invoke-Pester command. As mentioned above, I had to pass in the test scripts location. I also defined the PassThru parameter so that the results were passed into an $Output object ready to deconstruct and write to SQLDB. While its not mandatory you can also define the test groups to pickup as part of the invocation, which has the added bonus of being able to use wildcards against.

The script will then loop round each object in the array of the $Output object defined previously. For each one, it will extract the attributes of the test that I’m after and then call my Insert to DB function to write the results to SQLDB.

# Invoke Pester
$Output = Invoke-Pester -Script C:\Temp\PesterTests.ps1 -PassThru -TestName '*CaSIE*' 

# Output Results
Write-Output "TestNameFilter: $($Output.TestNameFilter)." 
Write-Output "Total Count: $($Output.TotalCount)." 
Write-Output "Passed Count: $($Output.PassedCount)." 
Write-Output "Failed Count: $($Output.FailedCount)."
Write-Output "Time: $($Output.Time)."
Write-Output ""

# Loop over TestResult objects to submit to database
$Output.TestResult | ForEach-Object {
    $TestName = $_.Name
    $TestResult = $_.Result
    $TestError = $_.FailureMessage 
    Write-Output "Test Result: $($TestName), $($TestResult), $($TestError)" 
    Do-InsertRow
} 

# Close the connection.
if ($sqlConnection.State -eq [Data.ConnectionState]::Open) {
    $sqlConnection.Close()
}

# Show done when finished (for testing/logging purpose only)
Write-Output "Finished Running Tests"

 

Once everything has been created and you’ve tested the scripts via the test pane and checked the results populate into the database, you are ready to hook it into your ADF pipelines and integrate it into your solution! To do this is very simple – within each Runbook you can create a webhook to the runbook for any external resources to call. This comes in the form of a secret URL which will then kick off the runbook with the embedded script. Passing parameters into the runbook at this point requires a bit more work and I go into this detail in a separate blog post. Be careful to take a copy of this URL, as you cannot view it after creation. Its then just a case of creating a web activity in your ADF pipeline to call this, to run the scripts. All in all, a very straightforward mechanism.

Once the results were in SQLDB I also defined a view on top of the table as I was having trouble extracting the test group out. To make things simple, I just re-worded the individual test cases to include this, and then use a CHARINDEX to split them out, thus meaning I could now report by group also. I also added a field to calculate the Latest Test result set, thus meaning we were only reporting on the most relevant test set. I then built a Power BI report on top of this to integrate into our existing solution (below). Test results were presented as a measure % Passed (defined below). This was then displayed overall, vs each test group, and then vs each individual test.

% Passed = 
VAR PercentPassed =
DIVIDE(
CALCULATE(COUNT('BISystem UnitTestsView'[TestId]),'BISystem UnitTestsView'[TestResult] = "Passed", 'BISystem UnitTestsView'[LatestTest] = 1),
CALCULATE(COUNT('BISystem UnitTestsView'[TestId]),'BISystem UnitTestsView'[LatestTest] = 1)
) 
RETURN
IF(PercentPassed = BLANK(),0,PercentPassed)

 

image

 

Summary

Pester is a great tool to use to add automated testing in to your project. It’s taken a few hours to stand up but now that’s been done, it’s just a case of defining tests. The ubiquitous nature of the framework means we can define all sorts of test across data/software/hardware. The thing I like the most about using it is that its simple. While I decided it was best to populate the test results into a database for ease of use for reporting, it might also be worth investigating population into the VSTS testing framework. Hopefully you will find this blog useful.

Using Azure Automation to Archive SFTP Files Orchestrated through ADF

I recently ran into a problem which required me to work with a SFTP site as part of our ETL process in Azure. Using the traditional BI stack, FTP tasks would be natively supported but this is no longer the case when working with Azure – this now requires a bit of plumbing. This blog will take a look at the ways in which you can interact with FTP/SFTP sites as part of your pipelines in ADF and the issues I ran into. Specifically the task I was trying to solve was to archive files (move them to a separate folder) that had passed into our staging layer, so they would not be picked up on the next process run. The blog will also cover passing parameters into a webhook as I also needed the solution to be flexible to handle different SFTP sources.

Like many similar tasks through Azure, you can go about them in a number of ways (Logic Apps, Functions, PS Runbooks, etc) but its about finding the solution that fits with your architecture and you find comfortable developing and maintaining. I initially started looked at using Logic Apps since quite a bit of the work was done for you, they have the connectors already setup, and the framework to achieve some simple tasks which sounded perfect for the job. Unfortunately, Logic Apps is not designed for any type of heavy lifting. You can move files using the platform but are capped at a mere 50MB when trying to pick up and put down a file across the platform. Rumours were that Microsoft would extend this to 1GB at some point during 2018 but this has not been the case so far.  There was also no such task to just change the remote path location as far as I could tell – it looks like people are up-voting it here.

Next I looked at using Azure functions, since I would do something similar through SSIS if this was a traditional stack problem. This involves writing a bit of C# to talk to the SFTP site and do the tasks for you. While this is very possible, it didn’t fit into our existing architecture. I also felt it was easier to build something to maintain using PowerShell – a language which I’m enjoying coding in more and more.

Lastly I looked at using Azure Automation and doing the tasks through PowerShell. There are a number of modules which you can use to achieve the goal, for which I used Posh SSH. I also looked at WinSCP but found the Posh cmdlets more flexible.

The first thing I did was work out how to pass parameters from a webhook into the script. This was so that the script could be used across multiple SFTP folders. In ADF it was as simple as defining the headers/body to pass with the POST call to the webhook.

 

image

 

Within the PS script I then defined the following region to pass the parameters into. The parameters are passed in as an object which is then deconstructed into the various header/body elements. The body is also deconstructed further from JSON into the variable I wanted. I believe it is also mandatory to name the object received by the script as $WebhookData as this is not configurable externally.

# Get parameter value
Param
([object]$WebhookData) 

#region Verify Webhook
if ($WebHookData){

    # Collect properties of WebhookData
    $WebhookName     =     $WebHookData.WebhookName
    $WebhookHeaders  =     $WebHookData.RequestHeader
    $WebhookBody     =     $WebHookData.RequestBody

    # Collect individual headers if required. Input converted from JSON.
    $Input = (ConvertFrom-Json -InputObject $WebhookBody)
    Write-Output  "WebhookBody: $($Input)"
    Write-Output -InputObject ('Runbook started from webhook {0}.' -f $WebhookName)
    
    # Extract variables
    $folderName = $Input.FolderName
}
else
{
   Write-Error -Message 'Runbook was not started from Webhook' -ErrorAction stop
}
#endregion

 

Once this is then defined within the script, you can setup the webhook from the Automation end and paste the URL it generates back into ADF. Its also at this point that the webhook configuration allows you to define a parameter as part of the webhook. Without the body of code above, this will not be an option, and cannot be created first as I originally assumed.

 

image

 

Once the parameter passing has been dealt with, the code block for connecting to the SFTP site and performing the task is straight forward.

# Get variable values
$userName = Get-AutomationVariable -Name 'SftpUserName'
$userPassword = Get-AutomationVariable -Name 'SftpPassword'
$hostName = Get-AutomationVariable -Name 'SftpHost'
$port = Get-AutomationVariable -Name 'SftpPort'

# Create PS credential from username/password
$userPassword = ConvertTo-SecureString -String $userPassword -AsPlainText -Force
$userCredential = New-Object -TypeName System.Management.Automation.PSCredential ($userName, $userPassword)

# Create new SFTP session
$session = New-SFTPSession -ComputerName $hostName -Credential $userCredential -Port $port -AcceptKey 

# Retrieve child objects on remote path and for every zip rename to zip.bk
$remotePath = "/Data/" + $folderName + "/"
Get-SFTPChildItem $session $remotePath| 
    ForEach-Object {
        if ($_.Fullname -like '*.zip' -Or $_.Fullname -like "*.txt") {  
            $NewName = $_.Name + ".bk"
            Rename-SFTPFile $session -Path $_.FullName -NewName $NewName
            Write-Output "$($_.FullName) has been renamed to $($NewName)"
        }
    }

 

The first task is to extract the variables from the Automation resource. These should not be hard-coded into the script, but stored as variables external to the runbook. Then a PSCredential requires configuration in the format above.

Finally the task to archive the files is performed. This involves creating a new SFTP session, and iterating over the child items within the remote path. For each file it finds, it will then perform the archiving process as long as they have the correct extension.

While I originally wanted to move the files between folders I realised this was not possible with the modules I was using. I’m sure with a bit more research something would be available out there to achieve this, specifically changing the path of a remote file.

Therefore I was limited to changing the name of a remote file by adding a .bk to the end of the filename. ADF would then only pass over files ending in .zip the next time round.

 

Conclusion

I’m hoping this will be useful for anyone in the same situation. I’ve specifically used the webhook parameters multiple times now.  While I didn’t achieve my original goal - with a small workaround I still satisfied my original requirement to archive the files from future process runs.  There’s also some other code snippets that might be of good reference for future implementations here too, specifically thinking around the PS credential creation. Hope it helps!

Iconography in Design

In this blog I will go about discussing the importance of design, in particular looking at how icons can help add the finishing touches to a piece of front end development.

Whether you’re building a Power BI report, SSRS report, PowerApp, or doing any kind of front end visualisation and design, you’ll know that 50% of the battle is making whatever you’re building jump out of the page. You could have built the most useful report that a user could wish for, but unless it looks good, you’re not going to get any plaudits. Essentially, good design goes a long way to making a success on a piece of development. This is even more important if you’re building the underlying architecture / ETL, and for all our good practice and thoughtfulness in this area as developers, will rarely impress or impact on an end user.

Its important to get inspiration for design from somewhere, so before going about designing or even building something, take a look on Google, do some research and try to look for similar things to what you’re doing and see what you think looks good, or not so good. You should build an idea up of how you want to design you’re landing page, report or report headers, etc. within the tool. If you’re in Power BI, take a look at the partner showcase for example. There’s some really good examples to give you ideas – like this one!

Microsoft apps such as Power BI or PowerApps go a long way in helping us build the best when it comes to data visualisation, but unfortunately they sometimes rely on developers to go outside the box to finish things off.

 

Icon Finder

image

Recently I’ve started to use a site called iconfinder.com which has a large pool of useful icons you can use for building out certain corners of apps that the standard MS tooling will not support. The icons are mostly $2 each, but you can get a subscription for £20/month or $10 with a discount code (which you should cancel as soon as you register). Please, don’t jump in if you think you need to use this as a resource. Alternatively, save the icon using Chrome and use it as a placeholder until you are happy to push to production. For non-subscribed users, the icon will always come on a background of faint grey lines. This isn’t too bad as they don’t completely ruin the look and feel of the icon in the development and are good for a placeholder for demos, etc. To get started, just type in your keyword for the type of icon you’re looking for, and then its just a case of wading through the results to find the icon that fits the look and feel your inspired to build against. Sometimes, you’ll get another bit of inspiration off the back of this which you can use as another key word to find even more icons.

The site also comes with a very handy icon editor tool, essential Paint Shop Pro on the web. There’s lots of these sites out there but its useful its all integrated into one place at no extra cost. It will load the SVG icon into it automatically if subscribed which then allows you to edit colours or shapes etc. In my instance, I found a nice % complete icon set which would look good on a white background. Unfortunately, I wanted it on a blue header bar, so needed to change it up slightly to fit the look and feel. No problem, took less than a minute to modify and download.

image

Its also worth mentioning that the site does a good job at helping you find a pool of icons which will fit together nicely using the same look and feel, showing you icons from the same icon set. In one instance, I replaced an icon I was looking for to be from the same pool even though the icon wasn’t exactly what I was looking for – because overall it just felt like it fitted together nicer with the other icons on the screen.

 

Design in Practice

As mentioned above, doing some research before you build can really help you create a much better finish. For inspiration for a recent PowerApps design I did a quick search for landing pages on Google, and found a few I liked the look of (below). As long as the general elements you are working to are similar, it really doesn’t matter where the inspiration comes from. In these cases, they were in the form of mobile apps.

imageimage

From these images, I was able to identify the key components which made me bookmark them:

  • I wanted some kind of non-offensive background, possibly semi-transparent, or with overlay.
  • I wanted a title that stands out the page, so white on grey or similar.
  • I wanted a small section for a blurb for the PowerApp.
  • I wanted 2 buttons, and the buttons to stand out.
  • I wanted logos in the top corners.
  • I wanted a nice look and feel for the colour palette.

From this, I then produced the following landing page.

image

I found the background on picjumbo.com which turned out to be quite a nice resource for some generic business style artwork, and then added a blurring filter across the top. This still interferes slightly with the buttons / title so I’m not completely happy but satisfied enough that it achieves the look and feel I was looking for. For the title, the range of fonts supplied with PowerApps is rather limited so I could go externally for this too but was happy enough for the time being. The layout also leaves room to shrink the title and add a small blurb if need be. The buttons are made up of a number of icons and fit with the theme for the app.

image

As mentioned above, I also added % complete icons to each page so users were able to understand how far they were along the scoring pages within the app. PowerApps provides sufficient icons for the back/refresh buttons that fit in with the white on blue theme, so I didn’t have to go externally for these. These were placed on the page header next to the logo.

 

 

Power BI

While this most recent bit of design was focused on PowerApps, I also add small bits into PowerBI during report design. For instance, rather than just have a generic button that can push you to a “details” page which has a table for the row by row breakdown of some aggregated data - I looked for an icon, edited the colour palette slightly and added this to the report. With recent Power BI functionality, I can make the image act as a button and redirect the user to another page.

image

I’ve also used icons in dashboard design where a single visual didn’t really represent the content of the report to drill into. This can also be a good way to go about adding a bit of flavour to a dashboard to mix things up. In my case, it also meant the drill down into the report level was less ambiguous, by asking a question as the title if that’s what the user wants to do.

image

Design is always subjective of course, but its great to use other resources at your disposal to go about building out apps. Depending upon the current estate in which you develop, it also helps them stand out a bit more and add a unique context to the reports/apps within the project. Hopefully this blog has given you a few ideas for your next project!

Adatibits - Scalable Deep Learning with Azure Batch AI

image

 

I recently had the pleasure of attending SQLBits, there were a number of great talks this year, and I think overall it was a fantastic event. As part of our commitment to learning at Adatis, we were also challenged to present back to the team something of interest we learnt at the event. This then becomes our internal Adatibits event where we can get sample of sessions from across the week.

As such, I was pleased when I saw a talk by Ben Keen on the Friday bill of SQLBits that revolves around deep learning. Having just come through the Microsoft Data Science program, this fell in line with my interest / research in data science, and was focusing on the bleeding edge of the subject area. It was also the talk I was probably looking forward to the most from the synopsis, and I thought it was very well delivered.

Anyway, on to the blog. In the following paragraphs, I’ll cover a cut down version of the talk and also talk about my experience using the MNIST dataset on Azure Batch AI. Credit to Ben for a few of the images I’ve used as they came off his slide deck.

 

What is Deep Learning?

So you’ve heard of machine learning (ML), and what that can do – deep learning is essentially a subset of ML, which uses neural network architectures (similar to human brains). It can work with both supervised (labelled) and unsupervised data, although its value today is tending towards learning from the labelled data. You can almost think of it as an evolution of ML.  It’s performance tends to improve the more data you can throw at it, where traditional ML algorithms seem to plateau. Deep learning also has an ability to perform automatic feature extraction from raw data (feature learning), where as the traditional routes have features provided as part of the dataset.

 

How does it help us? What are its use cases?

Deep learning excels at finding patterns in unstructured data. The following examples would be very difficult to write program to do, which is where the field of DL comes in. The use cases are usually split into 4 main areas – image, text, sound, and video.

 

Deep Neural Network Example

A simple example of how we can use deep learning is to understand the complexity around house prices. Taking an input layer of neurons for things such as Age, Sq. Footage, Bedrooms, and Location of a house – one can normally apply the traditional linear formula y = mx + c to apply weightings to the neurons to calculate the house price. This is a very simplistic view and there are many more factors that apply that can change the value. This often involves the different neurons intersecting with one another.

For example, people would think a house with a large number of bedrooms is a good thing and this would raise the price of the house, but if all these bedrooms were really small – then this wouldn’t be a very attractive offering for anyone other than a developer (and even then they might baulk at the effort involved) therefore lowing the price. This is especially true with people wanting more open plan houses nowadays. So traditionally people might be interested in bedrooms, this may shift in recent times to Sq. Footage as the main driver.

 

image

 

Therefore a number of weights can be attributed to an intermediary layer called the hidden layer.

The neural network architecture then uses this hidden layer to perform a better prediction. It does this by starting off with completely arbitrary weights (which will make a poor prediction). The process then multiplies the inputs by these weights, and applies an activation function to get to the hidden layer (a1, a2, a3). This neuron is then multiplied by another weight and another activation function to generate the prediction. This value is then scored, and evaluated via some form of loss function (Root Mean Squared Error). The weights are adjusted accordingly and the process is repeated. The weights are adjusted through a process called gradient decent.

 

image

 

To give you an idea of scale, and weight optimisation that's required – a 50x50 RGB image has 7,500 input neurons. Therefore we’re going to need to scale out as the training is very compute intensive. This is where Azure Batch AI comes in!

 

Azure Batch AI

Azure Batch AI is a managed service for training deep learning models in parallel at scale.

It’s currently in public preview, which I believe it entered around September 2017. It’s built on top of Azure Batch and essentially sells the standard Azure story where it provides the infrastructure for data scientists so they don’t need to worry about it and can get on with more practical work.

You can take your ML tools and workbooks (CNTK, TensorFlow, Python, etc.) and provision a GPU cluster on demand to run them against. Its important to note, the provision is of GPU, not CPU – similar cores, less money, less power consumed for this type of activity.

Once trained, the service can provide access to the trained model via apps and data services.

 

image

 

As part of my interest in the subject, I then went and looked at using the service to train a model off the MNIST dataset. This is a collection of handwritten digits between 1-9 with over 60,000 examples. It’s a great dataset to use to try out learning techniques and pattern recognition methods while spending minimal efforts on pre-processing and formatting. This is not always easy with most images as they contain a lot of noise and require time to convert into a format ready for training.

 

image

 

I then followed the following process within Azure Batch AI.

Created a Storage Account using Azure CLI along with a file share and directory.

# Login 
az login -u <username> -p <password>

# Register resource providers
az provider register -n Microsoft.BatchAI
az provider register -n Microsoft.Batch

# Create Resource Group
az group create --name AzureBatchAIDemo --location uksouth

# Create storage account to host data/scripts
az storage account create --name azurebatchaidemostorage --sku Standard_LRS --resource-group AzureBatchAIDemo

# Create File Share
az storage share create --account-name azurebatchaidemostorage --name batchaiquickstart

# Create Directory
az storage directory create --share-name batchaiquickstart --name mnistcntksample --account-name azurebatchaidemostorage

 

Uploaded the training / test datasets, and Python script.

# Upload train, test and script files
az storage file upload --share-name batchaiquickstart --source Train-28x28_cntk_text.txt --path mnistcntksample --account-name azurebatchaidemostorage
az storage file upload --share-name batchaiquickstart --source Test-28x28_cntk_text.txt --path mnistcntksample --account-name azurebatchaidemostorage
az storage file upload --share-name batchaiquickstart --source ConvNet_MNIST.py --path mnistcntksample --account-name azurebatchaidemostorage

 

Provisioned a GPU cluster. The NC6 consists of 1 GPU, which is 6 vCPUs, 56GB memory, and is roughly 80p/hour. This scales all the way up to an ND24 which is 4 GPUs, 448GB memory, for roughly £7.40/hour.

# Create GPU Cluster NC6 is a NVIDIA K80 GPU
az batchai cluster create --name azurebatchaidemocluster --vm-size STANDARD_NC6 --image UbuntuLTS --min 1 --max 1 --storage-account-name azurebatchaidemostorage --afs-name batchaiquickstart --afs-mount-path azurefileshare --user-name <username> --password <password> --resource-group AzureBatchAIDemo --location westeurope

# Cluster status overview
az batchai cluster list -o table

 

Created a training job from a JSON template – this tells the cluster where to find the scripts and the data, how many nodes to use, what container to use, and where to store the trained model. This can be then be run!

# Create a training job from a JSON template
az batchai job create --name batchaidemo --cluster-name azurebatchaidemocluster --config batchaidemo.json --resource-group AzureBatchAIDemo --location westeurope

# Job status
az batchai job list -o table

 

The output can be seen in real time along with the epochs and metrics. An epoch is essentially a full training cycle, and by having multiple epochs, you can cross validate your data, which leads the model to generalise more and fit real world data better.

# Output metadata
az batchai job list-files --name batchaidemo --output-directory-id stdouterr --resource-group AzureBatchAIDemo

# Observe realtime output
az batchai job stream-file --job-name batchaidemo  --output-directory-id stdouterr --name stderr.txt --resource-group AzureBatchAIDemo

image 

 

The pipeline can also be seen in the Azure portal along with links to the output metadata.

image

 

Once the model has been trained, it can be extracted, and the resources can be cleared down.

# Clean Up
az batchai job delete --name batchaidemo  
az batchai cluster delete --name azurebatchaidemocluster 
az group delete --name AzureBatchAIDemo

 

Conclusion

By moving the compute into Azure, and having the ability to scale – this means we can generate a faster learning rate for our problem. This in turn will mean better hyperparameter tuning to generate better weightings, which will mean better models, and better predictions.

As Steph Locke also alluded to in her data science talk – this means data scientists can do more work on things that they are good at, rather than waiting around for models to train to re-evaluate. Deep learning is certainly an interesting space to be in currently!

Improving Machine Learning Models

As part of my MS Data Science Professional Program, a number of the topics recently have been based around getting the most out of an Azure ML model. As part of this blog, I will be looking at the techniques and ways in which you can model and improve a solution. While I was tackling this problem with Azure ML, these techniques apply to building better models through other languages/platforms such as Python or Scala.

 

Data Munging

A process that undoubtedly every data scientist goes through with every DS problem they face, is that of data munging. This is a term that is being used more frequently to describe the process of transforming data from its raw state into another format, something more valuable for downstream analytics. Models running on data that is poor in quality, missing or duplicated will produce poor predictions. Therefore a very simple pre cursor to any problem is to explore the data and understand what is required to turn the data set into a form which is better for the latter stages of the model design. The following techniques are used as part of this stage:

  • Removing Duplicate Rows
  • Cleaning Missing Data (custom substitution of numerics, usually 1 or 0 or replacing with the mean, medium, mode across the dataset)
  • Cleaning Missing Data (removing bad quality rows entirely or removing bad quality columns entirely – more extreme)
  • Creating Categorical Features from String Features
  • Normalising Numeric Features (ZScore or Min/Max) to bring everything on to the same scale.

These steps are quite basic so I won’t go into detail here, but none the less, they should be considered at the start of a modelling a better solution to a problem.

 

Feature Selection

One of the reoccurring principles that appears with machine learning is that of Ockham’s razor, which states that the best models are simple models that fit the data well; this is not an irrefutable principle of logic, but a preference for simplicity. Therefore there is a need of balance between accuracy and simplicity to limit the feature set which tends to lead to better predictions. Simpler models are also more interpretable to humans which also helps. While the data I was working with was limited to around 35 features, there are many data science problems which have thousands of features and so this technique is even more crucial.

There are multiple methods to perform feature selection, of which a few will be covered here. The first method is greedy backward selection which starts with all the features and then finds the feature that hurts predictive power the least when removed, and you remove it. This is done iteratively until a point is met (which will be discussed later). Its known as greedy since it never looks back after removing the feature each time.

An alternative method is greedy forward selection which is basically the inverse, starts with no features, and looks for the feature that by itself is the best model. This then carries on in a similar vein to the backward selection but adding features. The point at which you stop with forward selection is that of diminishing returns for your accuracy.

Defining accuracy is important here, and this is where a formula called Adjusted R² comes in. R² is a measure of how well the model fits the data, with being closer to 1 than 0 being a better fit. The adjusted part adds a penalty for every term in the model, thus it measures on a scale the size and accuracy of a model. Therefore you need enough features for your R² to be large but not too many that it brings the Adjusted R² down.

.

Permutation Feature Importance

Using the feature selection theory, and to prune the feature set down to those that are meaningful for prediction, you can use a module in Azure ML called Permutation Feature Importance. This essentially re-computes the model a number of times, leaving out each feature and looks at how much your metric changes because the feature was left out, and then ranks them in order of importance. Depending on what you are trying to model, i.e. a classification or regression problem – there are a number of options for the metrics to measure performance. In my instance, I was interested in the RMSE (Root Mean Squared Error) which in simple terms represents the sample standard deviation of the difference between prediction and observed values. It aggregates the magnitudes of error in prediction into a single measure of predictive power. The closer to 0, the better the predictive power – but it’s also good to note this is relative to what you are trying to measure.

Once the model has been run through, you can visualise the list of features and their contribution to the RMSE. At this point, it does not necessarily matter whether the feature contributes a positive or negative value to the RMSE, as long as the value is not 0. Any values of 0 indicate that they have zero contribution to feature importance, essentially whether they are part of the model or not, add nothing to it. You can then follow backward pruning techniques to remove these columns from the feature set. It is then worth running the model again, to check the feature importance as the removal of those features may impact other features. If more features then have a value of 0, you should remove those too, and repeat. You can then measure the impact of the changes using separate pipelines, and passing the output into the same evaluation model, and checking the ROC curve (described below). Even with the RMSE staying the same between the 2 pipelines, by removing features, you are able to build a model which is more likely to generalize be more effective in the real world when values change.

 

image

 

Picking the Best Model Type

There is no reason to believe that any particular machine learning model will have the best performance (although we always have favourites); a classification model type that works best for one set of features and labels in a dataset does not always work best for another. As part of modelling any dataset, testing and comparing multiple machine learning models is usually a good approach. Its also important to note that the performance achieved with any particular machine learning model can change after performing feature engineering, therefore it is best to run the selection after this stage. The following model evaluates logistic regression, boosted decision trees, neural networks and support vectors with the same dataset to find out which is best.

 

image

 

To understand the performance of a machine learning model, there are a number of techniques to use. The easiest way is to pass the output of each model into an Evaluate Model module, which accepts up to 2 datasets at a time (left and right inputs). After the experiment is run, you can visualise the output of the models using this module, and examine the ROC curve. The first scored dataset (blue) represents the original model (in this case a neural network), and the scored dataset to compare against (red) represents the second dataset (in this case a support vector machine). The higher and further to the left the curve, the better the performance of the model (in this case, the neutral network).

Scrolling down further, you can also use the Accuracy, Recall, and AUC performance metrics, which indicate the accuracy and area under the curve. The model with the higher metrics is performing better. In particular, the lower the recall metric, the higher the number of false negatives.

 

clip_image002

 

Parameter Sweeping

Once you’ve picked the ML model contributing to making your predictive power better, it will require a set of parameters. For instance, with decision trees, this is in the form of a leaf count to determine depth, or no. of trees to determine width, along with their samples per leaf, and the learning rate. By default there is always a set available, but these will always need tweaking to improve things further and generate a better RMSE.

This can be done by either sweeping a giant grid of parameters, or by a random sweep. The latter being a lot quicker to process at run time for obvious reasons Fortunately, the performance is not normally sensitive to a change in these values if you have done much of the previous analysis first. Parameter sweeping really starts to squeeze the best out of the model.

In Azure ML, this can be done via a tune model hyper parameter module. The same options are available to measure metrics as the feature selection module, so I was interested in the RMSE again. As part of tuning the parameters through this module, we will need to split the training data beforehand, this can be done 50:50. This is so that the parameters have a set of data to validate against. This is then kept separate to the scoring data set as usual which is another completely separate set of data. Once the model has run, we can again evaluate the best parameters, against the original model and evaluate the RMSE, as well as the Accuracy, Recall and AUC. This is very similar to the previous techniques of evaluation. Visualising the sweep results, will display the parameters used, and then these can be programmed back into the original ML model, while removing the tune hyper parameter module, to speed things up on future runs.

A process of nested cross validation can be used on top of this to build confidence that the correct parameters have been used and it wasn’t just luck that they ended up being better than another set.

 

image

 

Conclusion

Once you have been through this process, you will then want to run a process of cross-validation, which runs the data through multiple times (folds) where each time, different data is used for training, and scoring. You can then generate a mean and standard deviation for each fold and prove the model is consistent across the data set, and that it will not be skewed by any new data for future predictions. This will give you a good idea of whether the model will generalise well and be robust enough to move to production.

Of course, there are many more techniques to the ones listed here, but this should give you a good introduction to the ones to look for to deliver predictive power from your model.