Sacha Tomey's Blog

Migrating to Native Scoring with SQL Server 2017 PREDICT

Microsoft introduced native predictive model scoring with the release of SQL Server 2017.

The PREDICT function (Documentation) is now a native T-SQL function that removes the need to score using R or Python through the sp_execute_external_script procedure. It's an alternative to sp_rxPredict. In both cases you do not need to install R, but with PREDICT you do not need to enable SQLCLR either - it's truly native.

PREDICT should make predictions much faster, as the process avoids having to marshal the data between SQL Server and Machine Learning Services (previously R Services).
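To give a flavour of what that looks like, here's a minimal native scoring sketch. The dbo.Models and dbo.NewData tables and the Score output column are hypothetical - substitute your own:

-- Minimal native scoring sketch; the table names and output column are hypothetical.
DECLARE @model varbinary(max) =
    (SELECT Model FROM dbo.Models WHERE ModelName = N'MyLinModModel');

SELECT d.*, p.*
FROM PREDICT(MODEL = @model, DATA = dbo.NewData AS d)
WITH (Score float) AS p;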

Migrating from the original sp_execute_external_script approach to the new native approach tripped me up, so I thought I'd share a quick summary of what I've learned.

Stumble One:

Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80004001. 
Model type is unsupported.

Reason:

Not all models are supported. At the time of writing, only the following models are supported:

  • rxLinMod
  • rxLogit
  • rxBTrees
  • rxDTree
  • rxDForest

sp_rxPredict supports additional models, including those available in the MicrosoftML package for R (I was attempting to use rxFastTrees). I presume this list will grow over time. The supported models are listed in the PREDICT function (Documentation).
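For reference, scoring through sp_rxPredict looks roughly like this. It needs the real-time scoring feature (and therefore SQLCLR) to be enabled, and the table and model names below are hypothetical:

-- Rough sketch of scoring via sp_rxPredict; table and model names are hypothetical.
-- Unlike PREDICT, this requires SQLCLR to be enabled, but supports more model types.
DECLARE @model varbinary(max) =
    (SELECT Model FROM dbo.Models WHERE ModelName = N'MyFastTreesModel');

EXEC sp_rxPredict
    @model = @model,
    @inputData = N'SELECT * FROM dbo.NewData';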

Stumble Two:

Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80070057. 
Model is corrupt or invalid.

Reason:

The serialisation of the model needs to be modified for use by PREDICT. Typically you might serialise your model in R like this:

model <- data.frame(model=as.raw(serialize(model, NULL)))

Instead you need to use the rxSerializeModel method:

model <- data.frame(rxSerializeModel(model, realtimeScoringOnly = TRUE))

There's a corresponding rxUnserializeModel method, so it's worth updating the serialisation across the board so models can be used interchangeably in the event all models are eventually supported. My existing code was a little legacy in that respect.
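Putting it together, you can train the model, serialise it with rxSerializeModel and persist the binary in a single round trip through sp_execute_external_script. This is only a rough sketch - the table, columns and formula are illustrative:

-- Rough sketch: train, serialise with rxSerializeModel and persist the model.
-- The table, column and formula names are illustrative.
DECLARE @model varbinary(max);

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        model <- rxLinMod(Score ~ Feature1 + Feature2, data = InputDataSet)
        serialised <- rxSerializeModel(model, realtimeScoringOnly = TRUE)
    ',
    @input_data_1 = N'SELECT Score, Feature1, Feature2 FROM dbo.TrainingData',
    @params = N'@serialised varbinary(max) OUTPUT',
    @serialised = @model OUTPUT;

INSERT INTO dbo.Models (ModelName, Model)
VALUES (N'MyLinModModel', @model);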

That's it.  Oh, apart from the fact PREDICT is supported in Azure SQL DB, despite the documentation saying otherwise.

Geographic Spatial Analysis with Azure Data Lake Analytics (ADLA)

Whilst working on an Azure Data Lake project, a requirement hit the backlog that could be easily solved with a Geographical Information System (GIS) or even SQL Server - spatial data type support was introduced in SQL Server 2008.

However, Azure Data Lake Analytics (ADLA) does not natively support spatial data analytics, so we'll have to extract the data into another service, right? Wrong! :)

Due to the extensibility of Azure Data Lake Analytics, we can enhance it to do practically anything. In fact, we can lean on existing components and enhance the service without having to develop the enhancement itself.

This blog is a quick run through demonstrating how to enhance ADLA such that it will support Spatial analytics and meet our project requirement.

Problem

For simplicity I've trivialised the problem. Here's the requirement:

Indicate which Bus Stops are within 1.5 km of Southwark Tube Station.

To support this requirement, we have two datasets:

  • A list of all the Bus Stops in London, including their Geo location (circa 20k records)
  • The Geo location record of Southwark Tube Station (a single record!)
    • In fact, the location of the tube station is pretty accurate and is geo-located to the entrance pavement outside the tube station:

[Image: map marker at the entrance pavement outside Southwark Tube Station]

This would be an easy problem for a GIS to solve. You would specify the central point i.e. our Southwark Tube station marker and draw a circle, or buffer, with a radius 1.5 km around it and select all bus stops that fall within or intersect with that circle. This spatial analysis is easy for these systems as that's essentially what they are built to do.

SQL Server 2008 introduced the spatial data types; these allow spatial-style analysis to be performed on geo data using T-SQL in conjunction with the supplied Geometry and Geography data types. More info on those can be found here.
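To illustrate, in T-SQL our requirement could be met with something along these lines (the dbo.BusStops table is hypothetical). Note that T-SQL's geography::Point takes latitude first, then longitude, then the SRID:

-- How the requirement might look in T-SQL; the dbo.BusStops table is hypothetical.
DECLARE @southwarkTube geography = geography::Point(51.503829, -0.104777, 4326);

SELECT StopCode, StopName
FROM dbo.BusStops
WHERE @southwarkTube.STBuffer(1500)
          .STIntersects(geography::Point(Latitude, Longitude, 4326)) = 1;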

So, how can we solve our problem in ADLA, without a GIS and without having to export the data to SQL Server?

Solution

You can register existing assemblies with ADLA. It so happens that the SQL Server data types and spatial assemblies are nicely packaged up and can be used directly within ADLA itself - think about that, it's pretty awesome!

Caveat: At the time of writing we have no idea of the licence implications. It will be up to you to ensure you are not in breach :)

Those assemblies can be downloaded from here.  You only need to download and install the following file:

  • ENU\x64\SQLSysClrTypes.msi

This installs two key assemblies, which you'll need to grab and upload to your Data Lake Store:

  • C:\Program Files (x86)\Microsoft SQL Server\130\SDK\Assemblies\Microsoft.SqlServer.Types.dll
  • C:\Windows\System32\SqlServerSpatial130.dll

Once they have been uploaded to your Data Lake Store, you need to register those assemblies with ADLA.

DECLARE @ASSEMBLY_PATH string = "/5.UTILITY/USQL-Extend/SQL-Server/";
DECLARE @TYPES_ASM string = @ASSEMBLY_PATH+"Microsoft.SqlServer.Types.dll";
DECLARE @SPATIAL_ASM string = @ASSEMBLY_PATH+"SqlServerSpatial130.dll";

CREATE DATABASE IF NOT EXISTS SQLServerExtensions;
USE DATABASE SQLServerExtensions;

DROP ASSEMBLY IF EXISTS SqlSpatial;
CREATE ASSEMBLY SqlSpatial
FROM @TYPES_ASM
WITH ADDITIONAL_FILES =
     (
         @SPATIAL_ASM
     );

Following registration, we can see the assemblies loaded in the ADLA catalog database we created:

[Image: the registered SqlSpatial assembly shown in the ADLA catalog]

We are now ready to use this U-SQL enhancement in our U-SQL Query - let's go right ahead and solve our problem in one U-SQL Script.

// Reference the assemblies we require in our script.
// System.Xml comes for free as a system assembly, so it didn't need registering;
// our SQLServerExtensions.SqlSpatial assembly was registered above.
REFERENCE SYSTEM ASSEMBLY [System.Xml]; 
REFERENCE ASSEMBLY SQLServerExtensions.SqlSpatial; 

// Once the appropriate assemblies are registered, we can alias them using the USING keyword.
USING Geometry = Microsoft.SqlServer.Types.SqlGeometry; 
USING Geography = Microsoft.SqlServer.Types.SqlGeography; 
USING SqlChars = System.Data.SqlTypes.SqlChars; 

// First create the centralised point. 
// In this case it's the pavement outside the entrance of Southwark Tube Station, London. 
// Format is Longitude, Latitude and then SRID. 
// NB: It's Longitude then Latitude - the opposite way round to what you might expect.
DECLARE @southwarkTube Geography = Geography.Point(-0.104777,51.503829,4326); 

// Next we extract our entire London bus stop data set from the file. 
// There's about 20k of them. 
@busStopInput = 
	EXTRACT 
		[StopCode]	string, 
		[StopName]	string, 
		[Latitude]	double?, 
		[Longitude]	double? 
	FROM @"/1.RAW/OpenData/Transport/bus-stops-narrow-full-london.csv" 
	USING Extractors.Csv(skipFirstNRows:1,silent:true); 

// This is effectively the transform step and where the magic happens 
// Very similar syntax to what you would do in T-SQL. 
// We are returning all the bus stops that fall within 1500m of Southwark Tube 
// Essentially we return all stops that intersect with a 1500m buffer around the central tube point 
@closeBusStops= 
	SELECT 
		* 
	FROM 
		@busStopInput 
	WHERE 
		@southwarkTube.STBuffer(1500).STIntersects(Geography.Point((double)@busStopInput.Longitude,(double)@busStopInput.Latitude,4326)).ToString()=="True"; 

// The results are written out to a csv file. 
OUTPUT 
	@closeBusStops TO "/4.LABORATORY/Desks/Sach/spatial-closebusstops.csv" 
	USING Outputters.Csv(outputHeader: true); 

The query outputs a list of bus stops that are within the specified spatial distance of Southwark Tube Station. If we have a look at all the bus stops (in red) and overlay all the 'close' bus stops (in green), we can see the results:

[Images: map of all London bus stops (red) overlaid with the 'close' bus stops (green)]

Pretty neat.

Azure Data Lake Analytics does not natively support spatial data analytics, but by simply utilising the assemblies that ship with SQL Server, we can extend the capability of U-SQL to provide that functionality - or practically any functionality we desire.

Azure Data Lake Store : Authorization Failed - The user does not have the permission to perform this operation

Just a quick one, as it took slightly longer to solve than it probably should have, so this might help others with the same problem.

I was working fine with an Azure Data Lake Store for a few weeks until yesterday when I couldn’t access the files anymore. 

The Azure Portal reported the following:
Authorization Failed - The user does not have the permission to perform this operation.


It turns out I was accessing it from a new location and the ADLS firewall needed updating to whitelist my IP address…

[Image: the ADLS firewall settings blade in the Azure Portal]

After adding my IP, the Data Lake Store is now accessible again.