Emma Stewart's Blog

Python in SQL Server 2017

One of the new features of SQL Server 2017 is the ability to execute Python scripts within SQL Server. For anyone who hasn’t heard of Python, it is one of the languages of choice for data analysis. It has a lot of libraries for data analysis and predictive modelling, offers power and flexibility for various machine learning tasks, and is also simpler to learn than many other languages.

The release of SQL Server 2016 saw the integration of the database engine with R Services, which runs the R data science language in-database. By extending this support to Python, Microsoft have renamed R Services to ‘Machine Learning Services’ to cover both R and Python.

The benefits of being able to run Python from SQL Server are that you can keep analytics close to the data (if your data is held within a SQL Server database) and reduce any unnecessary data movement. In a production environment you can simply execute your Python solution via a T-SQL Stored Procedure and you can also deploy the solution using the familiar development tool, Visual Studio.

Installation and Setup


When you install SQL Server 2017, make sure that on the Feature Selection page you select the following:

  • Database Engine Services
  • Machine Learning Services (In-Database)
  • Python

Please see here for detailed instructions on the setup. Make sure you download the latest version of SQL Server 2017 as there are errors within the pre-built Python packages in previous versions.

Once the installation is complete, you can now try out executing Python scripts from within Management Studio. Before we begin, we need to make sure we enable the execution of these scripts.

In order to see if we can run Python scripts, run the following:

EXEC sp_configure 'external scripts enabled'

If run_value = 1 that means we are allowed to run our Python scripts.

If it is 0, run the script below:

EXEC sp_configure 'external scripts enabled', 1
RECONFIGURE WITH OVERRIDE

Now, for the change to take effect, you need to restart the SQL Server service and you are good to go!

Executing Python scripts via T-SQL

The basic syntax for executing Python scripts is as follows:

    EXEC sp_execute_external_script  
    @language = N'language' ,   
    @script = N'script'  
    [ , @input_data_1 = N'input_data_1' ]   
    [ , @input_data_1_name = N'input_data_1_name' ]   
    [ , @output_data_1_name = N'output_data_1_name' ]  
    [ , @parallel = 0 | 1 ]  
    [ , @params = N'@parameter_name data_type [ OUT | OUTPUT ] [ ,...n ]' ]  
    [ , @parameter1 = 'value1' [ OUT | OUTPUT ] [ ,...n ] ]  
    [ WITH <execute_option> ]  

    <execute_option>::=  
    {  
          { RESULT SETS UNDEFINED }   
        | { RESULT SETS NONE }   
        | { RESULT SETS ( <result_sets_definition> ) }  
    }

The mandatory arguments to provide are @language and @script.

@language = Indicates the script language. Valid values are R and Python.

@script = This is the body of the Python script.

@input_data_1 = This is a T-SQL statement that reads some data from a table within the database.

@input_data_1_name = This is where you can name the variable used to represent the T-SQL query defined above. For Python scripts, the data must be tabular and arrives as a pandas data frame (named InputDataSet by default); for R it is slightly different.

@output_data_1_name = Specifies the name of the variable that contains the data to be returned to SQL Server upon completion of the stored procedure. For Python, the output must be a pandas data frame (named OutputDataSet by default).

By default, result sets that are returned by this stored procedure are output with unnamed columns. If you would like your result set to contain column names, you can add WITH RESULT SETS to the end of the stored procedure call. As well as specifying column names, you will also need to provide the data types. You will see the difference between including it and not in the examples shown below.

This system stored procedure can also be used to execute R scripts, simply by specifying R in the @language parameter.

Please see here for more information about this stored procedure.


N.B. Please be aware that formatting (indentation in particular) is very important in Python and should be one of the first things you check if you get an error during execution. There are various Python formatting sites online to help with this.

The examples below are to demonstrate how to use the syntax and can be classed as basic in the grand-scheme of what Python can do as a language.

Example 1

EXEC sp_execute_external_script  
@language =N'Python',
@script=N'OutputDataSet = InputDataSet',
@input_data_1 = N'SELECT 1 AS Test' 

In the above example, we simply returned the input value of the dataset. If you look at the output returned in SSMS, we receive the value 1 but with no column header.


If we add WITH RESULT SETS, we get the following:
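For reference, Example 1 with a result set definition might look like this (the column name and type simply mirror the input query):

```sql
EXEC sp_execute_external_script
@language = N'Python',
@script = N'OutputDataSet = InputDataSet',
@input_data_1 = N'SELECT 1 AS Test'
WITH RESULT SETS ((Test INT));
```

This time the single value comes back under the column header Test.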



Example 2

In this piece of code, we are looping through rows of a table (dbo.Test) and printing the value of each row.

EXEC sp_execute_external_script @language = N'Python',
@script = N'
for i in InputDataSet.Id:
    print(i)
',
@input_data_1 = N'SELECT Id FROM dbo.Test'


The output in SSMS is as follows:


Example 3

In this piece of code, it shows how you can use variables and print the value.

EXEC sp_execute_external_script @language = N'Python',
@script = N'
var1 = "Emma"
var2 = "Test"

print(var1 + var2)
'


There are lots of things we can do; however, these basic concepts can all be achieved using normal T-SQL, so there has been nothing new or exciting to see so far.

Example 4

A more interesting scenario, and one which is slightly harder to do using T-SQL, is using Python to perform descriptive statistics on data we pass into it.

For this, we need to import the pandas library to take advantage of it. The pandas library is a package which provides data structures designed to make working with relational data easy and intuitive. See here for more information.

EXEC sp_execute_external_script  
@language = N'Python',
@script = N'
import pandas as pd
from pandas import DataFrame

OutputDataSet = pd.DataFrame(InputDataSet.describe())
',
@input_data_1 = N'SELECT
  CAST(TotalProductCost AS FLOAT) AS TotalProductCost
, CAST(UnitPrice AS FLOAT) AS UnitPrice
, CAST(OrderQuantity AS FLOAT) AS OrderQuantity
FROM FactInternetSales'
WITH RESULT SETS ((
  TotalProductCost FLOAT
, UnitPrice FLOAT
, OrderQuantity FLOAT
));

By using ‘describe’ we can get all the usual statistical measures for the columns that we pass in.



The statistics are in the following order: Count, Mean, Standard Deviation, Min, 25% quartile, 50% quartile, 75% quartile and Max.

Now, a few words about the Python code used above:

Data Frame: A data frame is a data structure within Python which is like a table that we are used to within SQL Server. It contains a built-in function named “describe” which allows us to calculate the basic statistics of our dataset. We pass in the InputDataSet to the describe function and then this is converted to a data frame using the DataFrame function.

OutputDataSet: The resulting data frame is assigned to the result of the output stream and uses the default output name ‘OutputDataSet’

The example above is using data from FactInternetSales from the AdventureWorksDW. The fields needed to be converted to float as they have ‘money’ datatypes and that is not a supported datatype in Python.
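Outside SQL Server, the behaviour of describe can be checked in plain Python. A small sketch, with made-up sample values standing in for the InputDataSet that SQL Server would pass in:

```python
import pandas as pd

# A small stand-in for the InputDataSet that SQL Server would supply.
InputDataSet = pd.DataFrame({
    "TotalProductCost": [10.0, 20.0, 30.0, 40.0],
    "UnitPrice": [1.5, 2.5, 3.5, 4.5],
})

# describe() returns count, mean, std, min, 25%, 50%, 75% and max per column.
OutputDataSet = pd.DataFrame(InputDataSet.describe())

print(OutputDataSet.loc["mean", "TotalProductCost"])  # 25.0
print(list(OutputDataSet.index))
```

The row index of the resulting data frame is exactly the list of statistics shown in the screenshot above.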

Sentiment Analysis

Once you have got to grips with the basics, you can move onto what Python is great at – Machine Learning scenarios. One popular machine learning scenario is text analysis (or sentiment analysis). Sentiment analysis is analysing a piece of text to see if the sentiment is positive or negative. A good example of this would be applying it to tweets on Twitter to see if they are positive or negative. Using Python in SQL Server 2017 brings the added advantage that you can use pre-trained models out of the box to do your analysis.

In order to use pre-trained models, you need to add the models to the SQL Server instance where Machine Learning Services is installed (instructions are below):

1. Run the separate Windows-based installer for Machine Learning Server. Detailed instructions of what you need to install can be found here.




You should only need to tick the box for Pre-trained models as this is an update to what we already have.

2. To check that they have installed correctly, open the command prompt (Run as administrator) and navigate to

C:\Program Files\Microsoft SQL Server\140\Setup Bootstrap\SQL2017\x64\

and run the following

RSetup.exe /install /component MLM /version /language 1033 /destdir "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\library\MicrosoftML\mxLibs\x64"


Now you have everything set up you can begin using the pre-trained models. I will be using this and giving my thoughts in a future blog, however, in the meantime there is a Microsoft blog which provides step by step instructions on how to perform this analysis.


In summary, Microsoft have made it easy to run Python code from within SQL Server and have made it more accessible to people who are used to working within a SQL Server environment.

Different ways of performing file operations on Azure Data Lake

Recently at a client, we needed to come up with a few different ways to perform file management operations within their Data Lake – for example, moving files once processed and, in their case, renaming folders. We needed solutions different from what we currently used in order to keep within their desired architecture. So we started looking at calling the REST API using C# within an SSIS package. The other option I looked at was using Python. I will explain both methods below, but first there is some setup we need to do.


Aside from having an Azure Subscription and a Data Lake Store account, you will need an Azure Active Directory Application. For our case we needed a Web Application as we will be doing service-to-service authentication. This is where the application provides its own credentials to perform the operations, whereas with end-user authentication a user must log into your application using Azure AD.

Service-to-service authentication setup

Data Lake store uses Azure Active Directory (AAD) for authentication, and this results in our application being provided with an OAuth 2.0 token, which gets attached to each request made to the Azure Data Lake Store. To read more about how it works and how to create the app, get the relevant credentials and how to give it access to the Data Lake store, follow through the Microsoft tutorial here.

Make a note of the below as we will be needing them when we develop the solutions:

  • Tenant ID (also known as Directory ID)


  • Client Id (also known as the ApplicationID)



Within App registrations, if you look for your App under ‘All Apps’ and click on it you will be able to retrieve the Application Id.


  • Client Secret (also known as Authentication Key)

Within the App area used above, click on ‘Settings’ and then ‘Keys’. If you haven’t previously created one, you can create it there and you must remember to save it when it appears as you will not get another chance!


Client secret

  • Data Lake Name



Now we have everything set up and all the credentials we need, we can make a start constructing the requests. In order to test them out I used Postman which can be downloaded here.


Firstly, before we can begin any folder management operations, we need to authenticate against the Data Lake. To do this we need to perform a POST request in order to obtain an access token. This token will be used and passed to the other REST API calls we will make (e.g. deleting files) as we will always need to authenticate against the Data Lake.

To retrieve the access token, we need to pass through the TENANT ID, CLIENT ID and CLIENT SECRET and the request looks as follows:

curl -X POST https://login.microsoftonline.com/<TENANT-ID>/oauth2/token  \
  -F grant_type=client_credentials \
  -F resource=https://management.core.windows.net/ \
  -F client_id=<CLIENT-ID> \
  -F client_secret=<AUTH-KEY>

Within Postman, it looks like this:

Login in to Azure

1. Make sure the request type is set to POST

2. Make sure you have added your tenant id to the request

3. Fill out the body with your Client ID and Client Secret. (grant_type and resource are set as constant values as shown above).

4. Make a note of the Bearer access token as we will need it to perform any File Management operation on the Data Lake.
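As a sketch, the same POST can be issued from Python’s standard library instead of Postman. The tenant ID, client ID and secret below are placeholders for the values captured earlier:

```python
import json
import urllib.parse
import urllib.request

def build_token_request(tenant_id, client_id, client_secret):
    """Build the OAuth 2.0 client-credentials POST for Azure AD."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "resource": "https://management.core.windows.net/",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()
    return urllib.request.Request(url, data=body, method="POST")

def get_access_token(tenant_id, client_id, client_secret):
    """POST the request and pull the bearer token out of the JSON response."""
    req = build_token_request(tenant_id, client_id, client_secret)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]
```

The string returned by get_access_token is the bearer token that gets attached to every subsequent File Management call.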


Deleting a File

Now we have our access token, we can perform a deletion of a file with the following:

curl -i -X DELETE -H "Authorization: Bearer <REDACTED>" 'https://<yourstorename>.azuredatalakestore.net/webhdfs/v1/mytempdir/myinputfile1.txt?op=DELETE'


Within Postman, it looks like the following:

Deleting a file


1. This is a DELETE request, so the dropdown has been changed to reflect that.

2. Remember to add your data lake store name into the request; in my example it is called emma1

3. You can point to a particular file, or you can point it to a folder and add &recursive=true to the request and it will delete all the files within the folder, including the folder itself. I haven’t managed to find a way to delete just the contents of the folder and leave the folder as it is.

4. The access token is sent as a header called ‘Authorization’. Make sure to include ‘Bearer ‘ before your access token as highlighted above.


Once you have sent the request, you will receive some JSON in the output to show if the action has been successful (true).
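The same DELETE call can be sketched in Python on top of the token code above. The store name, path and token in the usage comment are the example values from this post:

```python
import urllib.request

def build_delete_request(store_name, path, access_token, recursive=False):
    """Build a WebHDFS DELETE for a file or folder in the Data Lake Store."""
    url = (f"https://{store_name}.azuredatalakestore.net/webhdfs/v1{path}"
           f"?op=DELETE&recursive={'true' if recursive else 'false'}")
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {access_token}"},
    )

# Usage sketch:
#   urllib.request.urlopen(
#       build_delete_request("emma1", "/mytempdir/myinputfile1.txt", token))
# returns JSON indicating whether the action was successful.
```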

You can perform many more file management operations using the REST API and the code can be found here.

Common Errors

1. The following error is caused by running the delete file request with an access token that has expired. To fix this issue, re-generate the access token and pass that through instead.



2. The error below is caused by the Application that we are using to access the Data Lake Store not having sufficient permissions to perform the operations. Make sure it has access to the relevant folder(s). Check Step 3 here to find out how to set the access.




Now we have managed to get them working manually within Postman, we need to consider how to call them in a production environment. The solution we implemented was an SSIS package (in keeping with their current architecture) with script tasks calling C#, which in turn calls the API. Before the file system operation is called, we run the authentication API call to obtain the latest access token and place the value in a variable to be used later on within the package, ensuring it is always current.


Using Python

From having a play around with Python to do similar file management operations, it seems rather limited in comparison and you can’t do as much. Nevertheless, I am sure more functionality will be added, and it is useful to know how it works.

Firstly, if you don’t already have Python, you can download the latest version from here. As an IDE, I have been using Visual Studio 2017 which now comes with Python Support, see here for further information.

In order for us to be able to perform operations on the Data Lake, we need to install three Azure modules. To install the modules, open up the command prompt and run the following:

pip install azure-mgmt-resource
pip install azure-mgmt-datalake-store
pip install azure-datalake-store

Now we need to create the Python app (I used Visual Studio) to do the folder management tasks. In order to reference the modules we have just installed, we need to import them so we can use them within our app; these imports are needed in every app that manipulates Data Lake folders. The code below shows how to do this. Save the application, but don’t run it yet!

## Use this for Azure AD authentication
from msrestazure.azure_active_directory import AADTokenCredentials

## Required for Azure Data Lake Store account management
from azure.mgmt.datalake.store import DataLakeStoreAccountManagementClient
from azure.mgmt.datalake.store.models import DataLakeStoreAccount

## Required for Azure Data Lake Store filesystem management
from azure.datalake.store import core, lib, multithread

## Common Azure imports
import adal
from azure.mgmt.resource.resources import ResourceManagementClient
from azure.mgmt.resource.resources.models import ResourceGroup

## Use these as needed for your application
import logging, getpass, pprint, uuid, time


Firstly, we need to authenticate with Azure AD. Again, as described above, there are two ways: end-user and service-to-service. We will be using service-to-service again in this example. To set this up, we run the following:

adlCreds = lib.auth(tenant_id = 'FILL-IN-HERE', client_secret = 'FILL-IN-HERE', client_id = 'FILL-IN-HERE', resource = 'https://datalake.azure.net/')

And fill in the TENANT ID, CLIENT SECRET and CLIENT ID that we captured earlier on.

Now we can authenticate against the Data Lake, we can now attempt to delete a file. We need to import some more modules, so add the script below to your application:


## Use this only for Azure AD service-to-service authentication
from azure.common.credentials import ServicePrincipalCredentials

## Required for Azure Data Lake Store filesystem management
from azure.datalake.store import core, lib, multithread


We now need to create a filesystem client:

## Declare variables
subscriptionId = 'FILL-IN-HERE'
adlsAccountName = 'FILL-IN-HERE'

## Create a filesystem client object
adlsFileSystemClient = core.AzureDLFileSystem(adlCreds, store_name=adlsAccountName)


We are now ready to perform some file management operations such as deleting a file:


## Delete a directory
adlsFileSystemClient.rm('/mysampledirectory', recursive=True)


Please see the script below for the full piece of code. You can find information on the other operations you can complete (e.g. creating directories) here.

## Use this for Azure AD authentication
from msrestazure.azure_active_directory import AADTokenCredentials

## Required for Azure Data Lake Store filesystem management
from azure.datalake.store import core, lib, multithread

## Common Azure imports
import adal
from azure.mgmt.resource.resources import ResourceManagementClient
from azure.mgmt.resource.resources.models import ResourceGroup

## Use these as needed for your application
import logging, getpass, pprint, uuid, time

## Service to service authentication with client secret for file system operations

adlCreds = lib.auth(tenant_id = 'XXX', client_secret = 'XXX', client_id = 'XXX', resource = 'https://datalake.azure.net/')

## Create filesystem client

## Declare variables
subscriptionId = 'XXX'
adlsAccountName = 'emma1'

## Create a filesystem client object
adlsFileSystemClient = core.AzureDLFileSystem(adlCreds, store_name=adlsAccountName)

## Create a directory
adlsFileSystemClient.mkdir('/mysampledirectory')

## Delete a directory
adlsFileSystemClient.rm('/mysampledirectory', recursive=True)


In summary, there are a few different ways in which you can handle your file management operations within Data Lake and the principles behind the methods are very similar. So, if one way doesn’t fit into your architecture, there is always an alternative.

The thing that stumps everyone new to Azure Data Factory

Recently I was playing around with Azure Data Factory and was making my way through an online tutorial when I came across a frustrating error message that was not very helpful and my data wasn’t moving anywhere! The aim of this exercise was to move some text files in Azure Data Lake Store over to a table in an Azure SQL Database. Sounds easy enough!

The data within the Data Lake Store was organised into a Year and Month hierarchy for the folders, and each day’s transactions were stored in a file which was named after the day, within the relevant month folder. The task then was to create a pipeline which copies the dataset in the Data Lake Store over to the dbo.Orders table in Azure SQL DB every day within the scheduled period (Q1 2016).

After creating all the json scripts and deploying them (with no errors), I clicked on the ‘Monitor and Manage’ tile to monitor the activities, check everything was working as it should be and monitor the progress. After waiting for at least 10 minutes, I started to get frustrated.


As you can see, all the Pipeline Activities for each slice have a Status of ‘Waiting: Dataset Dependencies’ and if you look at the Activity Window Explorer (see below), the activity window for each day shows a Status of Waiting (a solid orange square).



The files existed in the correct folders within my Data Lake Store and there were no errors during deployment so what was it waiting for?!

Well, it turns out we need to set an additional property on the input dataset to let Data Factory know that the data is being generated externally and not internally within Data Factory from another pipeline.




That property is "external": true at the bottom of the script.
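A minimal sketch of an input dataset definition with the property set (the dataset name, linked service name and folder path are illustrative, not from the tutorial):

```json
{
  "name": "InputDataset",
  "properties": {
    "type": "AzureDataLakeStore",
    "linkedServiceName": "AzureDataLakeStoreLinkedService",
    "typeProperties": {
      "folderPath": "orders/2016/01/"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    },
    "external": true
  }
}
```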

Below is an example of a data factory and the arrows point to the data sets that need to have this external property set to true. Using the diagram of the pipeline once deployed, it is easy to identify which datasets need to be external as they are not linked to anything upstream.


How to do row counts in Azure SQL Data Warehouse

Continuing on from my last couple of blog posts about working with the Azure Data Warehouse, here is another issue which has come up during development and is handy to know if you are going to be developing a solution!

Keeping track of how much data has been loaded plays a key part in a BI solution. It is important to know, for a given load, how many rows were inserted, updated or deleted. Traditionally, we were able to use the @@ROWCOUNT function.

@@ROWCOUNT returns the number of rows affected by the last statement.

Unfortunately, in Azure SQL Data Warehouse @@ROWCOUNT is not supported.

How does it work?

In the Microsoft Azure documentation, they do provide a workaround for this; please see here for more information and a list of other unsupported functions.

They suggest creating a stored procedure which will query the system tables sys.dm_pdw_exec_requests and sys.dm_pdw_request_steps in order to get the row count for the last SQL statement for the current session.

sys.dm_pdw_exec_requests holds information about all requests currently or recently active in SQL Data Warehouse. It lists one row per request/query. (A related view, sys.dm_pdw_sql_requests, holds information about all SQL Server query distributions as part of a SQL step in the query.)

sys.dm_pdw_request_steps holds information about all steps that are part of a given request or query in SQL Data Warehouse. It lists one row per query step.

This is an example of what the stored procedure would look like:
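A sketch of such a procedure, based on Microsoft’s documented workaround (the procedure name is illustrative):

```sql
CREATE PROCEDURE dbo.prc_RowCount @LabelContext VARCHAR(255)
AS
BEGIN
    -- Sum the row counts of the steps belonging to the labelled request
    -- issued on the current session.
    SELECT SUM(s.row_count) AS row_count
    FROM sys.dm_pdw_exec_requests AS r
    JOIN sys.dm_pdw_request_steps AS s
        ON r.request_id = s.request_id
    WHERE r.[label] = @LabelContext
      AND r.session_id = SESSION_ID()
      AND s.row_count > -1;
END;
```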



As you can see above, we pass through a ‘LabelContext’ parameter. A Label is a concept in Azure SQL Data Warehouse that allows us to give our queries a text label that is easy to understand, and we can then find that label in the DMVs. For example:
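A label is attached via the OPTION clause, for example (table name illustrative):

```sql
SELECT COUNT(*)
FROM dbo.FactInternetSales
OPTION (LABEL = 'Test Label');
```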


Here, we have given our query the label ‘Test Label’ and if we wanted to find information about this query in the DMVs we can search using this label like:
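A sketch of that DMV lookup:

```sql
SELECT request_id, [label], command
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Test Label';
```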



So, putting this into context: in the ETL we are calling stored procedures to load our data (for example between clean and warehouse). Therefore, within the stored procedure we have the query written to insert or update the data, and we would give this query a label. Then, within the same stored procedure, we would call the row count stored procedure, passing through the label as a parameter so we can retrieve the row count.


Be careful though!

On my current project we have come across times where we haven’t been able to get the row count back. This is because the sys.dm_pdw_exec_requests DMV we are querying is transient and only stores the last 10,000 queries executed. So when we were running the query above, our requests were no longer there and we were getting nothing back! The table holds data on all queries that go against the distribution nodes, including statistics gathering for each of the nodes. So, in order to limit the records in this table, keep the nesting level of queries as low as possible to avoid the table blowing up and not having the data you need in it!


Stay tuned for another blog about working with Azure Data Warehouse!

Statistics in Azure SQL Data Warehouse

Following on from my previous post about Update Queries in Azure SQL Data Warehouse, I thought I would put together a mini-series of blogs related to my ‘struggles’ working with the Azure SQL DW. Don’t get me wrong, it’s great, it just has some teething issues, for which there are workarounds!

This blog post is going to look at what Statistics in the database world are, the differences between them on-prem (SQL Server) and in the cloud (Azure SQL Data Warehouse) and also how to use them in Azure Data Warehouse.

What are statistics?

Statistics are great; they provide information about your data which in turn helps queries execute faster. The more information that is available about your data, the quicker your queries will run, as the optimiser can create the most optimal plan for the query.

Think of the statistics as you would the mathematical ones – they give us information regarding the distribution of values in a table, column(s) or indexes. The statistics are stored in a histogram which shows the distribution of values, range of values and selectivity of values. Statistics objects on multiple columns also store information regarding the correlation of values among the columns. They are most important for queries that have JOINs and GROUP BY, HAVING, and WHERE clauses.

In SQL Server, you can get information about the statistics by querying the catalog views sys.stats and sys.stats_columns. By default, SQL Server automatically creates statistics for each index, and single columns.
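For example, the statistics on a given table can be listed with something like the following (the table name is illustrative):

```sql
SELECT s.name AS stats_name,
       c.name AS column_name,
       s.auto_created
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
JOIN sys.columns AS c
    ON sc.object_id = c.object_id AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID('dbo.FactInternetSales');
```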

See here for more information.

How does it work in Azure SQL Data Warehouse?

In Azure SQL Data Warehouse, statistics have to be created manually. On previous SQL Server projects, creating and maintaining statistics wasn’t something that we had to incorporate into our design (and really think about!) however with SQL DW we need to make sure we think about how to include it in our process in order to make sure we take advantage of the benefits of working with Azure DW.

The major selling point of Azure SQL Data Warehouse is that it is capable of processing huge volumes of data, one of the specific performance optimisations that has been made is the distributed query optimiser. Using the information obtained from the statistics (information on data size and distribution), the service is able to optimize queries by assessing the cost of specific distributed query operations. Therefore, since the query optimiser is cost-based, SQL DW will always choose the plan with the lowest cost.

Statistics are important for minimising data movement within the warehouse, i.e. moving data between distributions to satisfy a query. If we don’t have statistics, Azure SQL Data Warehouse could end up performing data movement on the larger (perhaps fact) table instead of the smaller (dimension) table, as it wouldn’t know anything about the size of them and would just have to guess!

How do we implement statistics in Azure Data Warehouse?

Microsoft have actually provided the code for generating the statistics, so it’s just a case of deciding when in your process you want to create or maintain them.

In my current project, we have created a stored procedure which will create statistics and another that will update them if they already exist. Once data has been loaded into a table, we call the stored procedure and then the statistics will be created or updated (depending on what is needed).

See the documentation for more information and the code.
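As a sketch, the underlying statements look like this (the statistics and table names are illustrative):

```sql
-- Create single-column statistics after the initial load
CREATE STATISTICS stat_FactInternetSales_OrderQuantity
ON dbo.FactInternetSales (OrderQuantity);

-- Refresh them after subsequent loads
UPDATE STATISTICS dbo.FactInternetSales (stat_FactInternetSales_OrderQuantity);
```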

Tip: On my current project, we were getting errors when running normal stored procedures to load the data.

Error message:

‘Number of Requests per session had been reached’.

Upon investigation in the system tables, ‘Show Statistics’ was treated as a request which was also evaluated for each node, causing the number of requests to blow up. By increasing the data warehouse units (DWUs) and also the resource group allocation, this problem went away. So, take advantage of the extra power available to you!


There is a big list on the Microsoft Azure website of features not supported in Azure SQL Data Warehouse; take a look here. I will cover further issues in my next few blogs.

Update queries in Azure SQL Data Warehouse


I’ve recently started working on a project where we are working in the cloud with Azure SQL Data Warehouse:

“Azure SQL Data Warehouse is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational” 

For more information about Azure SQL Data Warehouse, see here.

Although we develop with the same T-SQL as we do using the on-prem version of SQL Server, I did come across a bit of a quirk when writing some update statements.

If we are using the on-prem version of SQL Server, when we need to update data, we would have a SQL query like:
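A sketch of such an update (the column names and values are illustrative):

```sql
UPDATE Sales.MyOrderDetails
SET    discount = 0.05
WHERE  orderid = 10248;
```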


That is a basic update to a specific row of data in the Sales.MyOrderDetails table, using a where clause to filter for the row.

Sometimes, it isn’t always as straight forward and we need to join to other tables, so that we can refer to attributes from those rows for filtering purposes. For example:
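A sketch of a join update of this kind, with illustrative table and column names:

```sql
UPDATE od
SET    od.discount = od.discount + 0.05
FROM   Sales.MyOrderDetails AS od
INNER JOIN Sales.MyOrders AS o
    ON od.orderid = o.orderid
WHERE  o.custid = 42;
```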




However, if we take this approach in SQL Data Warehouse, we get the following error.



SQL Data Warehouse doesn’t support ANSI joins in the FROM clause of an UPDATE statement (this is also the case for DELETE statements). There is a way around it, and it uses an implicit join.

Before we look at how the update query can be written, it is worth pointing out that inner joins can be written in a couple of different ways to the one we had above.

In an inner join, the ON and WHERE clause both perform the same filtering and they both return rows where the ON and WHERE predicate is true. Therefore, we could write an inner join as
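Using the same illustrative tables, the explicit form looks like:

```sql
SELECT o.orderid, o.custid, od.qty
FROM   Sales.MyOrders AS o
INNER JOIN Sales.MyOrderDetails AS od
    ON o.orderid = od.orderid;
```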


or implicitly like,
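The implicit form lists both tables in the FROM clause and moves the join predicate into the WHERE clause:

```sql
SELECT o.orderid, o.custid, od.qty
FROM   Sales.MyOrders AS o, Sales.MyOrderDetails AS od
WHERE  o.orderid = od.orderid;
```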


However, it is normally best to stick with the original example rather than the implicit version, as although it is still supported, it is an old deprecated syntax and not considered best practice.

So, in order to write an update query in SQL Data Warehouse that uses inner joins to filter the rows, the workaround is as follows:
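A sketch of the earlier join update rewritten with an implicit join so that SQL Data Warehouse accepts it (same illustrative names as before; note the updated table is not aliased in the FROM clause):

```sql
UPDATE Sales.MyOrderDetails
SET    discount = discount + 0.05
FROM   Sales.MyOrders AS o
WHERE  Sales.MyOrderDetails.orderid = o.orderid
  AND  o.custid = 42;
```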



In conclusion, most SQL statements in Azure SQL Data Warehouse are written in the same way as with the on-prem version of SQL Server; however, there are some cases where the syntax differs slightly, and I will be blogging more about these special cases as I come across them!