
Adatis BI Blogs

Python in SQL Server 2017

One of the new features of SQL Server 2017 is the ability to execute Python scripts within SQL Server. For anyone who hasn't heard of Python, it is one of the languages of choice for data analysis: it has a lot of libraries for data analysis and predictive modelling, offers power and flexibility for various machine learning tasks, and is simpler to learn than many other languages. The release of SQL Server 2016 saw the integration of the database engine with R Services, which brought the R data science language into the database. By extending this support to Python, Microsoft have renamed R Services to 'Machine Learning Services' to include both R and Python. The benefit of being able to run Python from SQL Server is that you keep the analytics close to the data (if your data is held within a SQL Server database) and reduce any unnecessary data movement. In a production environment you can simply execute your Python solution via a T-SQL stored procedure, and you can also deploy the solution using the familiar development tool, Visual Studio.

Installation and Setup

When you install SQL Server 2017, make sure that on the Feature Selection page you select the following:

Database Engine Services
Machine Learning Services (In-Database)
Python

Please see here for detailed instructions on the setup. Make sure you download the latest version of SQL Server 2017, as there are errors within the pre-built Python packages in previous versions. Once the installation is complete, you can try out executing Python scripts from within Management Studio. Before we begin, we need to make sure the execution of these scripts is enabled. To see whether we can run Python scripts, run the following:

EXEC sp_configure 'external scripts enabled'
GO

If run_value = 1, we are allowed to run Python scripts. If it is 0, run the script below:

sp_configure 'external scripts enabled', 1
RECONFIGURE WITH OVERRIDE;
GO

Now, for the change to take effect, you need to restart the SQL Server service and you are good to go!

Executing Python scripts via T-SQL

The basic syntax for executing Python scripts is as follows:

sp_execute_external_script
    @language = N'language',
    @script = N'script'
    [ , @input_data_1 = N'input_data_1' ]
    [ , @input_data_1_name = N'input_data_1_name' ]
    [ , @output_data_1_name = N'output_data_1_name' ]
    [ , @parallel = 0 | 1 ]
    [ , @params = N'@parameter_name data_type [ OUT | OUTPUT ] [ ,...n ]' ]
    [ , @parameter1 = 'value1' [ OUT | OUTPUT ] [ ,...n ] ]
[ WITH <execute_option> ]
[;]

<execute_option> ::=
{
    { RESULT SETS UNDEFINED }
  | { RESULT SETS NONE }
  | { RESULT SETS ( <result_sets_definition> ) }
}

The mandatory arguments to provide are @language and @script:

@language - indicates the script language. Values are R or Python.
@script - the body of the Python script.
@input_data_1 - a T-SQL statement that reads some data from a table within the database.
@input_data_1_name - names the variable used to represent the T-SQL query defined above. For Python scripts the data here must be tabular; for R it is slightly different.
@output_data_1_name - the name of the variable that contains the data to be returned to SQL Server upon completion of the stored procedure. For Python, the output must be a pandas data frame.

By default, result sets returned by this stored procedure are output with unnamed columns. If you would like your result set to contain column names, you can add WITH RESULT SETS to the end of the stored procedure call.
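To make the parameter list above concrete, here is a minimal sketch that names the input and output variables explicitly instead of relying on the defaults InputDataSet and OutputDataSet. It assumes the dbo.Test table (with an Id column) used in Example 2 below:

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'MyOutput = MyInput',
    @input_data_1 = N'SELECT Id FROM dbo.Test',
    @input_data_1_name = N'MyInput',
    @output_data_1_name = N'MyOutput'
WITH RESULT SETS ((Id int));

Inside the script, the query result arrives as a pandas data frame called MyInput, and whatever data frame ends up in MyOutput is returned to SQL Server.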
As well as specifying column names in WITH RESULT SETS, you also need to provide the data types; you will see the difference between including it and not in the examples shown below. This system stored procedure can also be used to execute R scripts, simply by specifying R in the @language parameter. Please see here for more information about this stored procedure.

Examples

N.B. Please be aware that formatting (especially indentation) is very important in Python and should be one of the first things you check if you get an error during execution. There are various Python formatting sites online to help with this. The examples below demonstrate how to use the syntax and are basic in the grand scheme of what Python can do as a language.

Example 1

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'OutputDataSet = InputDataSet',
    @input_data_1 = N'SELECT 1 AS Test'

In the above example, we simply return the input dataset. If you look at the output returned in SSMS, we receive the value 1 but with no column header. If we add WITH RESULT SETS ((Test int)), the output includes the Test column header.

Example 2

In this piece of code, we loop through the rows of a table (dbo.Test) and print the value of each row.

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
for i in InputDataSet.Id:
    print(i)
',
    @input_data_1 = N'SELECT Id FROM dbo.Test'

In SSMS, the printed values appear in the Messages tab.

Example 3

This piece of code shows how you can use variables and print their values.

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
var1 = "Emma"
var2 = "Test"
print(var1 + var2)
'

There are lots of things we can do; however, these basic concepts can all be achieved using normal T-SQL, so there has been nothing new or exciting to see yet.

Example 4

A more interesting scenario, which is slightly harder to do using T-SQL, is using Python to perform some descriptive statistics of data we pass into it. For this, we need to import the pandas library to take advantage of it. The pandas library is a package which provides data structures designed to make working with relational data easy and intuitive. See here for more information.

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import pandas as pd
from pandas import DataFrame
OutputDataSet = pd.DataFrame(InputDataSet.describe())
',
    @input_data_1 = N'SELECT
    CAST(TotalProductCost AS float) AS TotalProductCost,
    CAST(UnitPrice AS float) AS UnitPrice,
    CAST(OrderQuantity AS float) AS OrderQuantity
FROM FactInternetSales'
WITH RESULT SETS ((TotalProductCost float, UnitPrice float, OrderQuantity float))

By using describe we get all the usual statistical measures for the columns that we pass in. The statistics are returned in the following order: count, mean, standard deviation, min, 25th percentile, 50th percentile (median), 75th percentile and max.

Now, a few words about the Python code used above:

Data frame: a data frame is a data structure within Python which is like the tables we are used to within SQL Server. It has a built-in method named describe which allows us to calculate the basic statistics of our dataset. We call describe on the InputDataSet data frame and wrap the result in the DataFrame constructor so that a data frame is returned.

OutputDataSet: the resulting data frame is assigned to the output stream using the default output name OutputDataSet.

The example above uses data from FactInternetSales in AdventureWorksDW. The fields needed to be converted to float as they are of the money datatype, which is not a supported input type for Python scripts.
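None of the examples above use the @params argument from the syntax block, which lets you pass scalar values in and out of the script alongside the tabular data. The sketch below, reusing FactInternetSales from Example 4, returns the mean unit price through an OUTPUT parameter rather than a result set:

DECLARE @MeanPrice float;

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
# InputDataSet is the default pandas data frame for @input_data_1
mean_price = float(InputDataSet["UnitPrice"].mean())
',
    @input_data_1 = N'SELECT CAST(UnitPrice AS float) AS UnitPrice FROM FactInternetSales',
    @params = N'@mean_price float OUTPUT',
    @mean_price = @MeanPrice OUTPUT;

SELECT @MeanPrice AS MeanUnitPrice;

Each parameter declared in @params becomes a Python variable of the same name (minus the @); marking it OUTPUT sends its final value back to the calling T-SQL.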
Sentiment Analysis

Once you have got to grips with the basics, you can move on to what Python is great at: machine learning scenarios. One popular machine learning scenario is text analysis, or sentiment analysis - analysing a piece of text to see if the sentiment is positive or negative. A good example of this would be applying it to tweets on Twitter to see if they are positive or negative. Using Python in SQL Server 2017 brings the added advantage that you can use pre-trained models out of the box to do your analysis. In order to use pre-trained models, you need to add the models to the SQL Server instance where Machine Learning Services is installed, as follows:

1. Run the separate Windows-based installer for Machine Learning Server. Detailed instructions of what you need to install can be found here. You should only need to tick the box for pre-trained models, as this is an update to what we already have.

2. To check that they have installed correctly, open the command prompt (Run as administrator), navigate to C:\Program Files\Microsoft SQL Server\140\Setup Bootstrap\SQL2017\x64\ and run the following:

RSetup.exe /install /component MLM /version 9.2.0.24 /language 1033 /destdir "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\library\MicrosoftML\mxLibs\x64"

(A quick way to confirm from T-SQL that the Python ML packages are visible to the instance is sketched at the end of this post.)

Now you have everything set up, you can begin using the pre-trained models. I will be using this and giving my thoughts in a future blog; in the meantime there is a Microsoft blog which provides step-by-step instructions on how to perform this analysis.

In summary, Microsoft have made it easy to run Python code from within SQL Server, and in doing so have made it more accessible to people who are used to working within a SQL Server environment.
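As promised above, a minimal sketch of that T-SQL sanity check. It assumes the standard Machine Learning Services package names revoscalepy and microsoftml; if either import fails, the stored procedure raises an error and you know the installation did not complete:

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
# both packages ship with Machine Learning Services; an ImportError here
# means the in-database Python libraries are not installed correctly
import revoscalepy
import microsoftml
print("revoscalepy and microsoftml imported successfully")
';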

My Experience of the Microsoft Professional Program for Data Science

(Image 1 – Microsoft 2017 - https://academy.microsoft.com/en-us/professional-program/data-science)

In 2016 I was talking to Andrew Fryer (@DeepFat), Microsoft technical evangelist, after he attended the University of Dundee to present about Azure Machine Learning, about how Microsoft were piloting a degree course in data science. My interest was immediately piqued. Shortly after this, hints began to appear and the EdX page went live; soon afterwards, the degree was rebranded as the "Professional Program". I registered to be part of the pilot, however I was not accepted until the course went live in September 2016.

Prior to 2016 my background was in BI, predominantly Microsoft Kimball data warehousing using SQL Server. At the end of 2015 I enrolled on a Master's degree in Data Science through the University of Dundee. I did this with the intention of getting exposure to tools I had an interest in but little commercial experience of (R, machine learning and statistics). This course is ongoing and will finish in 2018 - I highly recommend it! I would argue that it is the best Data Science Master's degree course in the UK. So going into the MPP I had a decent idea of what to expect, plus a lot of experience with SQL, R and Power BI. Beyond that I had attended a few sessions at various conferences on Azure ML. When the syllabus for the MPP came out, it directly complemented my studies.

Link to program - https://academy.microsoft.com/en-us/professional-program/data-science
Link to Dundee Masters - https://www.dundee.ac.uk/study/pg/data-science/

Structure of the program

The program is divided into nine modules and a final project. All modules need to be completed, but there are different options you can take - you can customise the course to suit your interests. You can choose to pay for the course (which you will need to do if you intend to work towards the certification) or audit it for free. I will indicate which modules I took and why. Most modules recommend at least six weeks part-time to complete. I started the first module in the middle of September 2016 and completed the final project in the middle of January 2017, so the six-week estimate is quite high, especially if you already have a decent base knowledge of the concepts.

You can, if you wish, complete multiple modules at once. I am not sure I recommend this approach: to get the most out of the course, you should read around the subject as well as watching the videos. Each module has a start date and an end date between which you need to complete it; if you do not, you will need to do it all again. You can start a module in one period and wait until the next period for another module - you do not need to complete them all in three months. If you pay for a module but do not request your certificate before the course closes, you will need to take it again (top tip: as soon as you're happy with your score, request your certificate).
Module list

Module - Detail - Time taken:

Data Science Orientation - Data Science Orientation - 2-3 days
Query Relational Data - Querying Data with Transact-SQL - 1 day (exam only)
Analyze and Visualize Data - Analyzing and Visualizing Data with Excel, or Analyzing and Visualizing Data with Power BI - 2-4 days
Understand Statistics - Statistical Thinking for Data Science and Analytics - 7-9 days
Explore Data with Code - Introduction to R for Data Science, or Introduction to Python for Data Science - 7-9 days
Understand Core Data Science Concepts - Data Science Essentials - 7-9 days
Understand Machine Learning - Principles of Machine Learning - 2 weeks
Use Code to Manipulate and Model Data - Programming with R for Data Science, or Programming with Python for Data Science - R: 2-3 days; Python: 3 weeks
Develop Intelligent Solutions - Applied Machine Learning, or Implementing Predictive Solutions with Spark in HDInsight, or Developing Intelligent Applications - 2 weeks
Final Project - Data Science Challenge - 2 months*

* The times taken are based on the spare time I had; I completed each module between projects, in the evenings and at weekends. The final project itself can be completed in a few days, however you need to wait until it has closed to get your grade.

Structure of the modules

Each module is online. You log on to the EdX website and watch videos by leading experts. Either at the end of a video, after reading some text, or at the end of a section of the module, you are given a multiple-choice test. The multiple-choice questions are graded and form part of your overall score. The other main assessment method is labs, where you are required to complete a series of tasks and enter the results. Unlike certifications, you get to see what your score is as you progress through the module. The multiple-choice questions generally allow you two to three attempts at the answer; sometimes these are true/false with two attempts, which does undermine the integrity of the course.

There is normally a final section for which you are only given one chance to answer, and which holds a higher percentage towards your final mark. You need 70% to pass. Once you hit 70% you can claim your certificate - if you have chosen to pay for the module. Modules range from $20 to $100. For the most part I answered the questions fully and tried for the highest score possible. However, in all honesty, towards the end, once I hit around 80%, I started looking at a different module. If the module was really interesting I would persevere.

Modules

Data Science Orientation, Query Relational Data & Analyze and Visualize Data

These modules are very basic and really only skim the surface of the topics they describe. The first module is a gentle introduction to the main concepts you will learn throughout the program. The next module focuses on querying data with SQL. Regardless of your opinion of SQL, you must agree that SQL is the language of data. Having an understanding of the fundamentals of SQL is paramount, as almost every level of the Microsoft data science stack has integration with databases. If you're familiar with SQL (I already held an MCSE in SQL 2012) you can skip the main content of this module and just take the test at the end. For the next module you have the option of Excel or Power BI for visualisation. As I have experience with Power BI, I opted for that module. Once again this is a very basic introduction to Power BI, but it will get you familiar enough with the tool to do basic data exploration. Some parts of this course jarred with me.
Data visualisation is so important and a key skill for any data scientist. In the Power BI module one of the exercises was to create a 3D pie chart. Pie charts are not a good visualisation, as it is hard to differentiate between angles, and making one 3D only exacerbates the issue. I wish Microsoft had made reference to some of the great data viz experts when making this module. I cannot comment on the Excel version.

Understanding statistics

This module is different from its predecessors in that it is not run by Microsoft: it is a MOOC from Columbia University, which you might have completed before. It covers a lot of the basic and more advanced stats that you need to know for data science, in particular a solid grounding in probability and probability theory. In BI you become familiar with descriptive stats and measures of variance; however, I had not done a great deal of stats beyond this. I had been researching statistical methods for the MSc, but I had not done any real stats since A-level maths. This course was really interesting and I learnt a lot. I don't know if this is the best way to really learn stats, but it is a good primer on what you need to know. I found topping up my understanding with blogs, books and YouTube helped support this module.

Explore data with code

You have two options again for this module: R and Python. Which should you learn, I imagine you're asking? Well, the simple answer is both. Knowing either R or Python will get you so far; knowing both will make you a unicorn. Many ask why learn one language over the other - aside from the previous point. R is very easy to get into: it has a rich catalogue of libraries written by some of the smartest statistical minds, a simple interface, and is easy to install. Python is harder to learn in my opinion, as the language is massive! I found Python harder to work with, but it is much richer. I would recommend Python just for scikit-learn, the machine learning library. The Python module is extended to use Code Dojo (the great online tuition site). As you progress through the questions and examples, you have an IDE which checks your understanding and grades you as you go. I found this really helpful. This module is again a bit on the easier side; if you think the later Python module will be similar, you are in for a surprise! I did not take the R module as I was already using R in my day job.

Understand core data science concepts

Almost a redo of the first module and the understanding statistics module. Not a lot to say here, but the repetition helped me understand and remember the concepts: the more I had to think about the core concepts, the more they stuck. This module could have been removed with little to no impact on the course, but it helped solidify my knowledge.

Understanding Machine learning

As this is a Microsoft course, this module is all about Azure Machine Learning. If you have not used Azure ML before, it has a nice drag-and-drop interface which allows you to build quick, simple models and expose them as a web API (with a key) that you can then pass data to from any tool that can call a REST API. This module is half theory and half practical. There are a lot of labs, so you will need to take your time; if you skip ahead you will get the answers wrong and might not make it to 70%.

Using code to manipulate and model data

This section again has two options, R and Python. I knew quite a bit of R already, so I started with Python. I wanted to do both to see how you can do machine learning in each.
I was expecting a continuation of the Code Dojo format from the previous module; this was far from the case. Each of the modules up until this point had worked with you to find the right answer. This module will equip you with the basics, but expects you to find the correct function and answer yourself. Believe me when I say it was hard (with little prior experience of Python). The course will lead you towards the right resources, but you need to read the documentation to answer the questions. This was a great change of pace: having to search for the answers made me absorb more than just the quizzes did. This module was a struggle. Once I completed it, I did the same for R. On a difficulty scale, if the Python module was 100, R was only at 20. The disparity in difficulty is massive and frankly unfair. I was able to complete the R module very quickly, and I left feeling disappointed that it did not have the same complexity as the Python module.

Develop intelligent solutions

For this section you can pick one of three modules: machine learning, Spark or microservices. I went with Spark. Why? Because I had already worked with Spark and Hadoop as part of the MSc at Dundee. I knew how it worked and what it did from an open-source point of view, but not from a Microsoft HDInsight perspective. This module was tricky, but nothing compared to the Python module. I spent the best part of a week working on Spark, setting up HDInsight clusters and forgetting to tear them down (top tip! Don't leave an HDInsight cluster running - they are EXPENSIVE!). The last module is a machine learning project, so picking the "Applied Machine Learning" option might put you in a better place than your competition. I did not attempt either the machine learning or the microservices modules.

Final project

Here is where the fun begins. You're given a problem and a dataset. You need to clean, reduce, derive features from and process the dataset, then apply an ML technique to predict something - in my case, whether or not someone would default on a loan. You could use any technique you liked as long as the final result was in Azure ML. I was pretty happy with my model early on and made very few tweaks as the course progressed. Unlike the previous modules, where you can complete a module and get your score, your final score is only available once the module has ended. You build an ML experiment and test it against a private dataset. You can submit your experiment three times a day to be scored against the private data (maximum of 100 attempts). This gives you an indication of your score, but it is not your final score: that is calculated against a different dataset after the module has finished. Your top five scores are used to test against the private, closed data. If you have over-fitted your model, you might have a shock (as many did on the forums) when your score is marked.

I completed all the modules at the start of January and waited until February to get my final score. My highest-scoring answer, when used against the closed private dataset, did not get over the required 70% to pass. This was surprising but not all that unexpected: I had over-fitted the model. To counterbalance this, I had created five different experiments with five similar but different approaches. All scored similarly (~1-3% difference in accuracy). This was enough to see me past the required 70% and to obtain the MPP in Data Science. The private dataset has been published now.
In the coming weeks I will blog about the steps I took to predict whether someone would default on their loan.

I have been asked at different stages of the course, "would you recommend it?". It really depends on what you want out of the course! If you expect to be a data scientist after completing the MPP, then you might be in for a shock. To get the most out of the course you need to supplement it with wider reading and research; YouTube has many great videos and recorded lectures which will really help you process the content and see it taught from a different angle. If you're looking to get an understanding of the key techniques in data science (from a Microsoft point of view), then you should take this course. If you're doing a degree where you need to do research, many of the modules will really help and build upon what you already know.

I hope you have found this interesting and that it has helped you decide whether or not you want to invest the time and money (each module is not free). If you do decide, and you persevere, you too will be the owner of the MPP in Data Science.

Terry McCann - Adatis Data Science Consultant & Organiser of the Exeter Data Science User Group - you can find us on MeetUp.