Sacha Tomey

Sacha Tomey's Blog

The Azure Modern Data Warehouse: Unparalleled Performance

Today, 80% of organisations adopt cloud-first strategies to scale, reduce costs, capitalise on new capabilities including advanced analytics and AI and to remain competitive. Cloud-adoption is accelerating, and data exploitation is a key driver.

The central tenet to this enterprise-wide exploitation of data is the cloud-based Modern Data Warehouse. Legacy on-premises or appliance based EDWs, that were once the strategic asset for only the largest of enterprise organisations, not only limit performance, and flexibility, but are also harder to set up & scale securely.

The Modern Data Warehouse fundamentally changes the landscape for data analytics by making analytics available to everyone across organisations of all sizes, and not only the largest enterprise.

A modern data analytics platform enables you to bring together huge volumes of relational and non-relational, or structured and unstructured data into a single repository; the highly scalable and cost-effective Azure Data Lake. This provides access across the enterprise from basic information consumption through to innovation led data science.

Big data processing capability for data preparation, such as transformation and cleansing, can be performed as well as infusing Machine Learning and AI with the results made readily available for analysis through visual tools like Power BI.

Azure provides unparalleled performance at incredible value. To further support this claim, Microsoft have just announced the GigaOM TPC-DS Benchmark Results that further cements Azure SQL Data Warehouse as a leader for price/performance for both decision support benchmarks, having already attained best price/performance status for the TPC-H benchmark, announced back in Feb 2019.

TPC-DS @ 30TB
$ per Query per Hour

DS_thumb1

TPC-H @ 30TB
$ per Query per Hour

H_thumb58

Azure SQL Data Warehouse (Azure SQL DW) always delivered on performance when compared to alternatives, and now GigaOm found analytics on Azure is 12x faster and 73% cheaper when compared using the TPC-DS benchmark. Azure SQL DW has established itself as the alternative to on-premises data warehouse platforms and leader in Cloud Analytics.

Adatis have been at the cutting edge of Cloud Analytics Solutions since the introduction of the Azure SQL Data Warehouse PaaS offering back in 2015. In the last 18 months we have noticed the profile of Azure SQL DW rise sharply; with Azure SQL DW outperforming and taking over workloads from its closest competitors.

We specialise in all aspects of delivering the value of Cloud Analytics, AI and the Modern Data Warehouse, from strategic business value led engagements through technical design and implementation to on-going continuous improvement via fully managed DataOps practices.

Arch_thumb[7]

Adatis utilise Microsoft Azure technologies, in conjunction with first-party Spark based services, that securely integrate to provide enterprise-grade, cloud-scale analytics and insight for all and partner deeply with Microsoft to enable data driven transformation for our customers. We work to develop a modern data analytics strategy and ensure it is implemented and supported in the best way possible, aligning to your specific company’s goals and overriding strategy.

If you want to find out how Adatis can help you make sense of your data and learn more about the Modern Data Warehouse, please join us in London on 6th June for an exclusive workshop. We will guide you through the Microsoft landscape and showcase how we can help you get more value from your data, wherever you are on your data transformation journey.

Register here for our "Put your Data to Work" in-person event to be held in London on 6th June 2019

Additional Resources

Microsoft Azure Big Data and Analytics

Information on Azure SQL Data Warehouse

Try Azure SQL Data Warehouse for Free

#SimplyUnmatched, #analytics, #Azure, #AzureSQLDW, #MSPowerBI

Azure Data Factory v2 : ForEach Activity : Handling Null Items

The ForEach activity is a great addition to Azure Data Factory v2 (ADF v2) – however, you can encounter issues in some situations where you pass a null in it’s ‘Items’ setting for it to iterate. 

When you pass a Null, you receive the error:

Error
{ 
   "errorCode": "InvalidTemplate", 
   "message": "The function 'length' expects its parameter to be an array or a string. The provided value IS of type 'Null'.", 
   "failureType": "UserError", 
   "target": "ForEach"
}

This happens because the initialisation of the ForEach iterator checks the length of the string or array that is passed in on the ‘Items’ setting.  When a null is supplied (i.e. no items to create an array from), the length function fails.  I would like to see the ADF ForEach Activity check for null first and only check the length and continue with the iterator when it’s not null but it doesn’t, although I’m sure that will tighten up in future versions.

Arguably the correct way to handle this is to implement an IF condition operator activity within your pipeline that tests for Null and only execute the ForEach iterator activities when you can confirm the object you want to iterate is not null.  However, for me that slightly overcomplicates the pipeline as you end up with nested activities / or additional pipelines that make it difficult to maintain and really understand what’s happening in the pipeline.

There’s an alternative and I’d be interested in understanding if there are any better alternatives to achieve the same result. Please comment below.

Rather than passing the Null object, we can run an inline test within the Items attribute and pass either a valid object for iterating (when the object is not null) or pass it an empty array when the object is null.  It took me a while to work out how to create an empty array.  @array(‘’) returned an array with a length of 1 so the ForEach loop fired but then subsequently failed as there was nothing to grab from the array.  An empty string had the same effect too.  We need to generate an array (or a string) with a length of zero. 

The solution was to ‘take’ an item from array(‘’) – which item?  The item with index of 0.  This returned me a non null array with a length of 0 – the ForEach loop now didn’t fail, it now succeeded and better still, didn’t trigger the sub activities – the result I wanted.

Wrapping this into a coalesce provides a self contained defensive null handling pattern that should arguably always be used on the Items setting of the ForEach activity.

@coalesce(<##Your string or array to iterate that might be null##>,take(array(''),0))

I would love to hear better/alternative approaches that don’t rely on the IF conditional activity.

Migrating to Native Scoring with SQL Server 2017 PREDICT

Microsoft introduced native predictive model scoring with the release of SQL Server 2017.

The PREDICT function (Documentation) is now a native T-SQL function that eliminates having to score using R or Python through the sp_execute_external_script procedure. It's an alternative to sp_rxPredict. In both cases you do not need to install R but with PREDICT you do not need to enable SQLCLR either - it's truly native.

PREDICT should make predictions much faster as the process avoids having to marshal the data between SQL Server and Machine Learning Services (Previously R Services).

Migrating from the original sp_execute_external_script approach to the new native approach tripped me up so I thought I'd share a quick summary of what I have learned.

Stumble One:

Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80004001. 
Model type is unsupported.

Reason:

Not all models are supported. At the time of writing, only the following models are supported:

  • rxLinMod
  • rxLogit
  • rxBTrees
  • rxDtree
  • rxdForest

sp_rxPredict supports additional models including those available in the MicrosoftML package for R (I was using attempting to use rxFastTrees). I presume this limitation will reduce over time. The list of supported models is referenced in the PREDICT function (Documentation).

Stumble Two:

Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80070057. 
Model is corrupt or invalid.

Reason:

The serialisation of the model needs to be modified for use by PREDICT. Typically you might serialise your model in R like this:

model <- data.frame(model=as.raw(serialize(model, NULL)))

Instead you need to use the rxSerializeModel method:

model <- data.frame(rxSerializeModel(model, realtimeScoringOnly = TRUE))

There's a corresponding rxUnserializeModel method, so it's worth updating the serialisation across the board so models can be used interchangeably in the event all models are eventually supported.  I have been a bit legacy.

That's it.  Oh, apart from the fact PREDICT is supported in Azure SQL DB, despite the documentation saying the contrary.

Geographic Spatial Analysis with Azure Data Lake Analytics (ADLA)

Whilst working on an Azure Data Lake project, a requirement hit the backlog that could be easily solved with a Geographical Information System (GIS) or even SQL Server - Spatial data type support was introduced into SQL Server 2008.

However, Azure Data Lake Analytics (ADLA) does not natively support spatial data analytics so we'll have to extract the data into another service right? Wrong ? :)

Due to the extensibility of Azure Data Lake Analytics, we can enhance it to do practically anything. In fact, we can lean on existing components and enhance the service without having to develop the enhancement itself.

This blog is a quick run through demonstrating how to enhance ADLA such that it will support Spatial analytics and meet our project requirement.

Problem

For simplicity I've trivialised the problem. Here's the requirement:

Indicate which Bus Stops are within 1.5 km of Southwark Tube Station.

To support this requirement, we have two datasets:

  • A list of all the Bus Stops in London, including their Geo location (circa 20k records)
  • The Geo location record of Southwark Tube Station (a single record !)
    • In fact, the location of the tube station is pretty accurate and is geo located to the entrance pavement outside the tube station:

clip_image001_thumb8_thumb

This would be an easy problem for a GIS to solve. You would specify the central point i.e. our Southwark Tube station marker and draw a circle, or buffer, with a radius 1.5 km around it and select all bus stops that fall within or intersect with that circle. This spatial analysis is easy for these systems as that's essentially what they are built to do.

SQL Server 2008 introduced the Spatial Data Type, this allowed spatial style analysis to be performed on geo data using T-SQL in conjunction with the supplied Geometry and Geography data types. More info on those can be found here

So, how can we solve our problem in ADLA, without a GIS and without having to export the data to SQL Server??

Solution

You can register existing assemblies with ADLA. It so happens that the SQL Server Data Types and Spatial assemblies are nicely packaged up and can be used directly within ADLA itself - think about that, it's pretty awesome !

Caveat: At the time of writing we have no idea of the licence implications. It will be up to you to ensure you are not in breach :)

Those assemblies can be downloaded from here.  You only need to download and install the following file:

  • ENU\x64\SQLSysClrTypes.msi

This installs two key assemblies, which you'll need to grab and upload to your Data Lake Store:

  • C:\Program Files (x86)\Microsoft SQL Server\130\SDK\Assemblies\Microsoft.SqlServer.Types.dll
  • C:\Windows\System32\SqlServerSpatial130.dll

Once they have been uploaded to your Data Lake Store, you need to register those assemblies with ADLA.

DECLARE @ASSEMBLY_PATH string = "/5.UTILITY/USQL-Extend/SQL-Server/";
DECLARE @TYPES_ASM string = @ASSEMBLY_PATH+"Microsoft.SqlServer.Types.dll";
DECLARE @SPATIAL_ASM string = @ASSEMBLY_PATH+"SqlServerSpatial130.dll";

CREATE DATABASE IF NOT EXISTS SQLServerExtensions;
USE DATABASE SQLServerExtensions;

DROP ASSEMBLY IF EXISTS SqlSpatial;
CREATE ASSEMBLY SqlSpatial
FROM @TYPES_ASM
WITH ADDITIONAL_FILES =
     (
         @SPATIAL_ASM
     );

Following registration of the assemblies, we can see the registration loaded in the ADLA Catalog database we created:

image_thumb2_thumb[4]

We are now ready to use this U-SQL enhancement in our U-SQL Query - let's go right ahead and solve our problem in one U-SQL Script.

// Reference the assemblies we require in our script.
// System.Xml we get for free as a System Assembly so we didn't need to register that and our SQLServerExtensions.SqlSpatial assembly
REFERENCE SYSTEM ASSEMBLY [System.Xml]; 
REFERENCE ASSEMBLY SQLServerExtensions.SqlSpatial; 

// Once the appropriate assemblies are registered, we can alias them using the USING keyword.
USING Geometry = Microsoft.SqlServer.Types.SqlGeometry; 
USING Geography = Microsoft.SqlServer.Types.SqlGeography; 
USING SqlChars = System.Data.SqlTypes.SqlChars; 

// First create the centralised point. 
// In this case it's the pavement outside the entrance of Southwark Tube Station, London. 
// Format is Longitude, Latitude and then SRID. 
// NB: It's Longitude then Latitude, that's the opposite way to what you might expect.. 
DECLARE @southwarkTube Geography = Geography.Point(-0.104777,51.503829,4326); 

// Next we extract our entire London bus stop data set from the file. 
// There's about 20k of them. 
@busStopInput = 
	EXTRACT 
		[StopCode]	string, 
		[StopName]	string, 
		[Latitude]	double?, 
		[Longitude]	double? 
	FROM @"/1.RAW/OpenData/Transport/bus-stops-narrow-full-london.csv" 
	USING Extractors.Csv(skipFirstNRows:1,silent:true); 

// This is effectively the transform step and where the magic happens 
// Very similar syntax to what you would do in T-SQL. 
// We are returning all the bus stops that fall within 1500m of Southwark Tube 
// Essentially we return all stops that intersect with a 1500m buffer around the central tube point 
@closeBusStops= 
	SELECT 
		* 
	FROM 
		@busStopInput 
	WHERE 
		@southwarkTube.STBuffer(1500).STIntersects(Geography.Point((double)@busStopInput.Longitude,(double)@busStopInput.Latitude,4326)).ToString()=="True"; 

// The results are written out to a csv file. 
OUTPUT 
	@closeBusStops TO "/4.LABORATORY/Desks/Sach/spatial-closebusstops.csv" 
	USING Outputters.Csv(outputHeader: true); 

The query outputs a list of bus stops that are within the specified Spatial distance from Southwark Tube Station. If we have a look at all the bus stops (in red) and overlay all the 'close' bus stops (in green), we can see the results:

clip_image00112_thumb4_thumb

clip_image002_thumb1_thumb

Pretty neat.

Azure Data Lake Analytics does not natively support spatial data analytics but by simply utilising the assemblies that ship with SQL Server, we can extend the capability of U-SQL to provide that functionality or practically any functionality we desire.

Azure Data Lake Store : Authorization Failed - The user does not have the permission to perform this operation

Just a quick one as it took slightly longer to solve than it possibly should have done so this might help others with the same problem.

I was working fine with an Azure Data Lake Store for a few weeks until yesterday when I couldn’t access the files anymore. 

The Azure Portal reported the following:
Authorization Failed - The user does not have the permission to perform this operation.

image

image

It turns out I was accessing it from a new location and the ADLS Firewall needed updating to whitelist my IP Address….

2017-05-24_13-24-28

After adding my IP, the Data Lake Store is now accessible again.

Power BI Streaming Datasets–An Alternative PowerShell Push Script

I attended the London Power BI Meetup last night. Guest speaker was Peter Myers On the topic of "Delivering Real-Time Power BI Dashboards With Power BI." It was a great session.

Peter showed off three mechanisms for streaming data to a real time dashboard:

  • The Power BI Rest API
  • Azure Stream Analytics
  • Streaming Datasets

We've done a fair bit at Adatis with the first two and whilst I was aware of the August 2016 feature, Streaming Datasets I'd never got round to looking at them in depth. Now, having seen them in action I wish I had - they are much quicker to set up than the other two options and require little to no development effort to get going - pretty good for demo scenarios or when you want to get something streaming pretty quickly at low cost.

You can find out more about Streaming Datasets and how to set them up here: https://powerbi.microsoft.com/en-us/documentation/powerbi-service-real-time-streaming/

If you create a new Streaming Dataset using 'API' as the source, Power BI will provide you with an example PowerShell script to send a single row of data into the dataset.  To extend this, I've hacked together a PowerShell script and that loops and sends 'random' data to the dataset. If you create a Streaming Dataset that matches the schema below, the PowerShell script further below will work immediately (subject to you replacing the endpoint information). If you create a different target streaming dataset you can easily modify the PowerShell script to continually push data into that dataset too.

I’ve shared this here, mainly as a repository for me, when I need it, but hopefully to benefit others too.

Streaming Dataset Schema

clip_image001_thumb

Alternative PowerShell Script

Just remember to copy the Power BI end point to the relevant location in the script.

You can find the end point (or Push URL) for the Dataset by navigating to the API Info area within the Streaming Dataset management page within the Power BI Service:

SNAGHTML1104da5e_thumb

# Initialise Stream
$sleepDuration = 1  #PowerBI seldom updates realtime dashboards faster than once per second.
$eventsToSend = 500 #Change this to determine how many events are part of the stream
$endpoint = "[INSERT YOUR ENDPOINT HERE]"
    
# Initialise the Payload
$payload = @{EventDate = '' ; EventValue = 0; EventSource = ''}

# Initialise Event Sources
$eventSource = @('Source1', 'Source2', 'Source3')

# Iterate until $eventsToSend events have been sent
$index = 1 
do
{
    # Update payload
    $payload.EventDate = Get-Date -format s
    $source = Get-Random -Minimum 0 -Maximum 3 
    $payload.EventSource = $eventSource[$source]
    $value = Get-Random -Minimum 0.00 -Maximum 101.00 
    $payload.EventValue = $value

    # Send the event
    Invoke-RestMethod -Method Post -Uri "$endpoint" -Body (ConvertTo-Json @($payload))

    # Report what has been sent
    "`nEvent {0}" -f $index
    $payload

    # Sleep for a second
    Start-Sleep $sleepDuration

    # Ready for the next iteration
    $index++

} While ($index -le $eventsToSend)

# Finished
"`n{0} Events Sent" -f $eventsToSend

Power BI Embedded - What is it?

At the Microsoft Data Insight Summit on 22nd March 2016, Microsoft announced that Power BI had 5 million registered users.  This is a pretty impressive milestone considering its relative recency to the market and maturity.

However, Microsoft are far more ambitious than 5 million users - Microsoft won't be happy until they "bring big data analytics to a billion people."  So, how can they significantly increase their user base and market share against their competition?

One strategy is to continue broadening the offering with more features, innovations, connections in a hope to tempt more users to the platform through sheer breadth. (Arguably their strategy to date).  Another is to focus on depth and start to shore up, dare I say 'finish' some of the features, innovations and connectivity they have already released.  A third strategy, and by no means a cop-out, or poor relation, is to fall back on their tried and tested method of relying on their extensive and established community of developers, ISVs and Partners to drive adoption through extensibility, application and service development.  Power BI Embedded is at the heart of this third strategy.

Despite some frustrations, I am a fan of Power BI.  I love the journey it has taken and the direction it is going but when you start to use it in anger you do find (arguably several) areas where it falls short - these areas, I have no doubt will be addressed in time but as a Partner at the forefront that focuses on Microsoft Data Analytics at its core, we are a little bruised and battered from navigating our way around shortcomings, through pitfalls and the need to invent workarounds, constrained by the SaaS model.

One of these shortcomings and one of the most common requested features by our customers was to expose Power BI to external users such as customers and partners.  Whilst this was technically possible through various guises, licencing, costs and agreements were not clear therefore unforeseen costs and/or non-compliance always loomed.

For anyone, even remotely close to Power BI, the arrival of external access was an easy prediction. However, I will admit, I thought Microsoft would implement it differently.  As the Power BI Pro service is linked to Active Directory and SharePoint Online, I speculated that external access would be via SharePoint Online External Users.  Technically, this is a seemingly straightforward mechanism for exposing Power BI to external users but possibly complicates licencing as we are no longer limited to SharePoint content what with the mash-up capabilities Power BI affords.

Use Cases

Instead, Microsoft have delivered Power BI Embedded.  Power BI Embedded is specifically targeted at these external access use cases.  In fact, it's exclusively for external access - you are prohibited from developing applications and solutions that utilise Power BI Embedded for internal consumption.  Typical scenarios include the following:

  • Embed interactive reports (authored in Power BI Desktop) within your own customer facing applications
  • Use the Power BI Visualisation Framework to create custom visualisations thus extending Power BI
  • Provide up-to-date, real-time interactive dashboard and reporting to customers through the Power BI Embedded Service framework

Power BI Embedded is an Azure Service.  This is an important point.  It is not an additional, broadening feature of the Power BI Service as we know it - it's a distinct offering that falls under the Azure umbrella, decoupled from the Power BI Service.

Licencing and costs

This decoupling allows Microsoft to control and modify two key elements that made everything tricky for external access under the existing Power BI service.  The Authentication method is different (users do not need to be part of AD.  Your custom application utilises Application Authentication Tokens) and therefore the licencing model can be (and is) completely different.

Licencing is not on a per user per month basis. It's on a per render basis. What's a render? A render is a visual element that results in a query to the service. The point to stress here is "..that results in a query to the service".  Power BI Embedded utilises caching so if an embedded a report contains half a dozen data visualisations you will not automatically incur half a dozen renders each time the report is viewed. If the service can obtain the visualisation details from cache then you avoid incurring a render.  Following the initial render, a filter, for example, is likely to incur an additional render as that will, certainly initially, result an additional query to the service.

As a Power BI Embedded Service owner you are able to control the extent to which users can drive new queries to minimise and limit costs but it remains to be seen how much control this provides.  Accurate prediction of render cost is likely to be extremely difficult until your solution is active and monitored.

At the time of writing (April 2016), Power BI Embedded is in preview and free until 1st May 2016.  Beyond this, there are two pricing tiers.  A Freemium model, allowing up to 1,000 renders per month and a Standard model that is charged at £1.5273 per 1,000 renders.

Connectivity

During the preview, connectivity is limited.  There's no indication of what will become available as it moves to General Availability and beyond but as it stands there are two modes, Direct Query and Cached:

Direct Query mode supports connections to cloud services only including:

  • SQL Azure DW
  • SQL Azure
  • Spark on HD Insight

Cached mode

Datasets can be loaded into Power BI Embedded service but they become static datasets.  You cannot refresh cached data once it has been loaded into the service.  Anything that relies on either the Personal or Enterprise Gateway cannot be kept up-to-date for use with Power BI Embedded (yet?).  This rules out on premise and hybrid (for example utilising IaaS) scenarios.

There is no indication as to the maximum size of the cached model the service will support.  The assumption is this will match that of the Power BI Service, currently 250MB.

Miscellaneous

  • Report and Dashboard rendering is through an IFRAME.
  • Power BI Embedded APIs are based on REST
  • Rendering is [currently] limited to interactive reports therefore you cannot expose ad-hoc drag drop style reporting over your dataset using Power BI Embedded
  • The Power BI Embedded Service is currently only available in the South Central US Region

Power BI Embedded is a great strategy for customers, partners and of course Microsoft; it will likely drive additional adoption of the Power BI Service pushing Power BI on past the current 5 million user base. The Power BI Service will drive external access (and therefore Power BI Embedded) and Power BI Embedded will drive internal usage (The Power BI Service).  One billion users anyone ?!

Additional Resources

Power BI Embedded Overview

Get Started with Power BI Embedded

Power BI Embedded FAQ

Power BI Embedded API

Addressing the Travelling Salesman Problem with Azure ML and Power Map

I’ve recorded a short video (< 7 minutes) showing how Azure ML, Power BI and Power Map can address the Travelling Salesman Problem.

The Travelling Salesman Problem is firmly routed in academia and is computationally complex to solve with many of the solutions tackling it in complete isolation to the real world, therefore it’s computed based on straight line distance only. Here we’ve utilised actual driving distance, journey time and real world information such as traffic information to provide a slightly more pragmatic result..

The video can be found here, on YouTube.

clip_image002 

Video : Build an Automated Multilingual Product Dimension with Power BI Desktop

I’ve recorded a short video (< 10 minutes) showing off some of the data manipulation/transformation/integration capabilities of Power BI Desktop.  In this video I take some raw product data and a set of international languages and build a Product dimension that contains translations of each of the raw products for each of the supplied languages.

The resulting product dimension build routine (using just the Power BI Desktop GUI) will automatically (following a refresh) cope with either additional products or additional languages to the raw source files.

For the translation itself I utilise the Microsoft Translate service from the Azure Marketplace. No coding required.

The video can be found here, on YouTube.

image

Friday Fun: GeoFlow does the Great South Run

GeoFlow was released to public preview yesterday; a new 3D visualization tool for Excel which allow users to create, navigate and interact with time-sensitive data applied to a digital map.

Back in October last year, along with 25,000 other people, my good friend and colleague Tim Kent (@TimK_Adatis) and I ran the Great South Run; a 10 mile run around the City of Portsmouth on the south coast of England.  As it happened, we both wore GPS watches and using the data collected I've created a simple GeoFlow tour of the race.

Tim is Green - I am Red - who wins...  there's only one way to find out ......

Run Race