Adatis

Adatis BI Blogs

Python in SQL Server 2017

One of the new features of SQL Server 2017 was the ability to execute Python Scripts within SQL Server. For anyone who hasn’t heard of Python, it is the language of choice for data analysis. It has a lot of libraries for data analysis and predictive modelling, offers power and flexibility for various machine learning tasks and is also a much simpler language to learn than others. The release of SQL Server 2016, saw the integration of the database engine with R Services, a data science language. By extending this support to Python, Microsoft have renamed R Services to ‘Machine Learning Services’ to include both R and Python. The benefits of being able to run Python from SQL Server are that you can keep analytics close to the data (if your data is held within a SQL Server database) and reduce any unnecessary data movement. In a production environment you can simply execute your Python solution via a T-SQL Stored Procedure and you can also deploy the solution using the familiar development tool, Visual Studio. Installation and Setup   When you download SQL Server 2017, make sure, during the time of installation on Feature Selection, you select the following: Database Engine Services Machine Services (In-Database) Python Please see here for detailed instructions on the setup. Make sure you download the latest version of SQL Server 2017 as there are errors within the pre-built Python packages in previous versions. Once the installation is complete, you can now try out executing Python scripts from within Management Studio. Before we begin, we need to make sure we enable the execution of these scripts. In order to see if we can run Python scripts, run the following: EXEC sp_configure 'external scripts enabled'GO If run_value = 1 that means we are allowed to run our Python scripts. If it is 0, run the script below: sp_configure 'external scripts enabled' , 1RECONFIGURE WITH OVERRIDE;GO Now, for the change to take effect, you need to restart the SQL Server service and you are good to go! Executing Python scripts via T-SQL The basic syntax for executing Python scripts is as follows: sp_execute_external_script @language = N'language' , @script = N'script', @input_data_1 = ] 'input_data_1' [ , @input_data_1_name = ] N'input_data_1_name' ] [ , @output_data_1_name = 'output_data_1_name' ] [ , @parallel = 0 | 1 ] [ , @params = ] N'@parameter_name data_type [ OUT | OUTPUT ] [ ,...n ]' [ , @parameter1 = ] 'value1' [ OUT | OUTPUT ] [ ,...n ] [ WITH <execute_option> ] [;] <execute_option>::= { { RESULT SETS UNDEFINED } | { RESULT SETS NONE } | { RESULT SETS ( ) } } The mandatory arguments to provide are @language , @script . @language = Indicates the scripts language. Values are R or Python. @script = This is the body of the Python script @input_data_1 = This is a T-SQL statement that reads some data from a table within the database. @input_data_1_name = This is where you can name the variable used to represent the T-SQL query defined above. For executing Python scripts, the form of data here must be tabular, however for R it is slightly different. @output_data_1_name =  Specifies a name of the variable that contains the data to be returned to SQL Server upon completion of the stored procedure. For Python, the output must be a pandas dataframe. By default, result sets that are returned by this stored procedure are output with unnamed columns. If you would like your result set to contain column names, you can add WITH RESULT SETS to the end on the stored procedure. As well as specifying column names, you will also need to provide the data types. You will see the difference between including it and not in the examples shown below. This system stored procedure, can also be used to execute R scripts, simply specifying the language in the @script parameter. Please see here for more information about this stored procedure. Examples N.B Please be aware that formatting is very important and should be one of the first things you should check if you get an error during execution. There are various Python formatting sites online to help with this. The examples below are to demonstrate how to use the syntax and can be classed as basic in the grand-scheme of what Python can do as a language. Example 1 EXEC sp_execute_external_script @language =N'Python', @script=N'OutputDataSet = InputDataSet', @input_data_1 = N'SELECT 1 AS Test' In the above example, we simply printed the input value of the dataset. If you look at the output returned in SSMS, we receive the value 1 but with no column header. If we add WITH RESULT SETS, we get the following:   Example 2 In this piece of code, we are looping through rows of a table (dbo.Test) and printing the value of each row. EXEC sp_execute_external_script  @language =N'Python', @script= N' for i in InputDataSet.Id: print(i) ', @input_data_1 = N'SELECT Id FROM dbo.Test'   The output in SSMS is as follows: Example 3 In this piece of code, it shows how you can use variables and print the value. EXEC sp_execute_external_script  @language =N'Python', @script= N' var1 = "Emma" var2 = "Test" print (var1 + var2) ' There are lot of things we can do, however, we can achieve these basic concepts using normal T-SQL so there has been nothing new or exciting to see. Example 4 A more interesting scenario, which is slightly harder to do using T-SQL, is we can use Python to perform some descriptive statistics of data we pass into it. For this, we need to import the pandas library to take advantage of it. The pandas library is a package which provides data structures designed to make working with relational data easy and intuitive. See here for more information. EXEC sp_execute_external_script  @language =N'Python',@script= N'import pandas as pdfrom pandas import DataFrame OutputDataSet = pd.DataFrame(InputDataSet.describe()) ',@input_data_1 = N'SELECT   CAST(TotalProductCost AS float), CAST(UnitPrice AS Float), CAST(OrderQuantity AS FLOAT)FROM FactInternetSales'with result sets ((TotalProductCost float, UnitPrice Float, OrderQuantity FLOAT))   By using ‘describe’ we can get all the usual statistical measures for the columns that we pass in.   The statistics are in the following order: Count, Mean, Standard Deviation, Min, 25% quartile, 50% quartile, 75% quartile and Max. Now, a few words about the Python code used above: Data Frame: A data frame is a data structure within Python which is like a table that we are used to within SQL Server. It contains a built-in function named “describe” which allows us to calculate the basic statistics of our dataset. We pass in the InputDataSet to the describe function and then this is converted to a data frame using the DataFrame function. OutputDataSet: The resulting data frame is assigned to the result of the output stream and uses the default output name ‘OutputDataSet’ The example above is using data from FactInternetSales from the AdventureWorksDW. The fields needed to be converted to float as they have ‘money’ datatypes and that is not a supported datatype in Python. Sentiment Analysis Once you have got to grips with the basics, you can move onto what Python is great at – Machine Learning scenarios. One popular machine learning scenario is text analysis (or sentiment analysis). Sentiment analysis is analysing a piece of text to see if the sentiment is positive or negative. A good example of this would be applying it to tweets on Twitter to see if they are positive or negative. Using Python in SQL Server 2017 brings the added advantage that you can use pre-trained models out of the box to do your analysis. In order to use pre-trained models, you need to add the models to the SQL Server instance where Machine Learning Services is installed (instructions are below): 1. Run the separate Windows-based installer for Machine Learning Server. Detailed instructions of what you need to install can be found here.     You should only need to tick the box for Pre-trained models as this is an update to what we already have. 2. To check that they have installed correctly, open the command prompt (Run as administrator) and navigate to C:\Program Files\Microsoft SQL Server\140\Setup Bootstrap\SQL2017\x64\ and run the following RSetup.exe /install /component MLM /version 9.2.0.24 /language 1033 /destdir "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\library\MicrosoftML\mxLibs\x64"   Now you have everything set up you can begin using the pre-trained models. I will be using this and giving my thoughts in a future blog, however, in the meantime there is a Microsoft blog which provides step by step instructions on how to perform this analysis.   In summary, Microsoft have made it easy to integrate running Python code from within SQL Server and made it more accessible to people who are used to working within a SQL Server environment.

Migrating to Native Scoring with SQL Server 2017 PREDICT

Microsoft introduced native predictive model scoring with the release of SQL Server 2017. The PREDICT function (Documentation) is now a native T-SQL function that eliminates having to score using R or Python through the sp_execute_external_script procedure. It's an alternative to sp_rxPredict. In both cases you do not need to install R but with PREDICT you do not need to enable SQLCLR either - it's truly native. PREDICT should make predictions much faster as the process avoids having to marshal the data between SQL Server and Machine Learning Services (Previously R Services). Migrating from the original sp_execute_external_script approach to the new native approach tripped me up so I thought I'd share a quick summary of what I have learned. Stumble One: Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80004001. Model type is unsupported. Reason: Not all models are supported. At the time of writing, only the following models are supported: rxLinMod rxLogit rxBTrees rxDtree rxdForest sp_rxPredict supports additional models including those available in the MicrosoftML package for R (I was using attempting to use rxFastTrees). I presume this limitation will reduce over time. The list of supported models is referenced in the PREDICT function (Documentation). Stumble Two: Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80070057. Model is corrupt or invalid. Reason: The serialisation of the model needs to be modified for use by PREDICT. Typically you might serialise your model in R like this: model <- data.frame(model=as.raw(serialize(model, NULL))) Instead you need to use the rxSerializeModel method: model <- data.frame(rxSerializeModel(model, realtimeScoringOnly = TRUE)) There's a corresponding rxUnserializeModel method, so it's worth updating the serialisation across the board so models can be used interchangeably in the event all models are eventually supported.  I have been a bit legacy. That's it.  Oh, apart from the fact PREDICT is supported in Azure SQL DB, despite the documentation saying the contrary.