Adatis BI Blogs

Adatibits - Scalable Deep Learning with Azure Batch AI

I recently had the pleasure of attending SQLBits; there were a number of great talks this year and I think overall it was a fantastic event. As part of our commitment to learning at Adatis, we were also challenged to present back to the team something of interest we learnt at the event. This then becomes our internal Adatibits event, where we get a sample of sessions from across the week. As such, I was pleased when I saw a talk by Ben Keen on the Friday bill of SQLBits that revolved around deep learning. Having just come through the Microsoft Data Science program, this fell in line with my interest and research in data science, and it focused on the bleeding edge of the subject area. It was also the talk I was probably looking forward to the most from the synopsis, and I thought it was very well delivered. Anyway, on to the blog. In the following paragraphs, I'll cover a cut-down version of the talk and also describe my experience of using the MNIST dataset on Azure Batch AI. Credit to Ben for a few of the images I've used, as they came off his slide deck.

What is Deep Learning?

So you've heard of machine learning (ML) and what it can do – deep learning is essentially a subset of ML which uses neural network architectures (similar to human brains). It can work with both supervised (labelled) and unsupervised data, although most of its value today comes from learning from labelled data. You can almost think of it as an evolution of ML. Its performance tends to improve the more data you can throw at it, where traditional ML algorithms tend to plateau. Deep learning also has the ability to perform automatic feature extraction from raw data (feature learning), whereas the traditional routes have features provided as part of the dataset.

How does it help us? What are its use cases?

Deep learning excels at finding patterns in unstructured data. The following examples would be very difficult to write a program to do, which is where the field of DL comes in. The use cases are usually split into four main areas – image, text, sound, and video.

Image – medical imaging to diagnose heart disease / cancers
Image – classification / segmentation of satellite images (NASA)
Image – restoring colour to black and white photos
Image – Pixel Super Resolution (generating higher-resolution images from lower ones)
Text – real-time text translation
Text – identifying fonts (Adobe DeepFont)
Sound – real-time foreign language translation
Sound – restoring sound to videos
Video – motion detection in self-driving cars (Tesla)
Video – beating video games (DeepMind beating Atari Breakout)
Video – redacting video footage from police body-cameras (NIJ)

Deep Neural Network Example

A simple example of how we can use deep learning is to understand the complexity around house prices. Taking an input layer of neurons for things such as Age, Sq. Footage, Bedrooms, and Location of a house, one could normally apply the traditional linear formula y = mx + c to apply weightings to the neurons and calculate the house price. This is a very simplistic view, though, and there are many more factors at play that can change the value. This often involves the different neurons interacting with one another.
For example, people would think a house with a large number of bedrooms is a good thing, and that this would raise the price of the house; but if all of those bedrooms were really small, then it wouldn't be a very attractive offering for anyone other than a developer (and even then they might baulk at the effort involved), therefore lowering the price. This is especially true with people wanting more open-plan houses nowadays. So while traditionally people might have been most interested in bedrooms, the main driver may have shifted in recent times to Sq. Footage.

Therefore a number of weights can be attributed to an intermediary layer called the hidden layer. The neural network architecture then uses this hidden layer to make a better prediction. It does this by starting off with completely arbitrary weights (which will make a poor prediction). The process then multiplies the inputs by these weights and applies an activation function to get to the hidden layer (a1, a2, a3). Each hidden neuron is then multiplied by another weight, and another activation function is applied, to generate the prediction. This value is then scored and evaluated via some form of loss function (such as Root Mean Squared Error). The weights are adjusted accordingly and the process is repeated. The weights are adjusted through a process called gradient descent. (A minimal numerical sketch of a single forward pass and loss calculation is included at the end of this post.)

To give you an idea of the scale of weight optimisation that's required, a 50x50 RGB image has 7,500 input neurons. Therefore we're going to need to scale out, as the training is very compute intensive. This is where Azure Batch AI comes in!

Azure Batch AI

Azure Batch AI is a managed service for training deep learning models in parallel at scale. It's currently in public preview, which I believe it entered around September 2017. It's built on top of Azure Batch and essentially sells the standard Azure story: it provides the infrastructure for data scientists so they don't need to worry about it and can get on with more practical work. You can take your ML tools and workbooks (CNTK, TensorFlow, Python, etc.) and provision a GPU cluster on demand to run them against. It's important to note that the provisioning is of GPUs, not CPUs – similar cores, less money, and less power consumed for this type of activity. Once trained, the service can provide access to the trained model via apps and data services.

As part of my interest in the subject, I then went and looked at using the service to train a model on the MNIST dataset. This is a collection of handwritten digits between 0-9 with over 60,000 examples. It's a great dataset to use to try out learning techniques and pattern recognition methods while spending minimal effort on pre-processing and formatting. This is not always easy with most images, as they contain a lot of noise and require time to convert into a format ready for training.

I then followed the process below within Azure Batch AI.

Step 1 – Created a storage account using the Azure CLI, along with a file share and directory.
# Login
az login -u <username> -p <password>

# Register resource providers
az provider register -n Microsoft.BatchAI
az provider register -n Microsoft.Batch

# Create Resource Group
az group create --name AzureBatchAIDemo --location uksouth

# Create storage account to host data/scripts
az storage account create --name azurebatchaidemostorage --sku Standard_LRS --resource-group AzureBatchAIDemo

# Create File Share
az storage share create --account-name azurebatchaidemostorage --name batchaiquickstart

# Create Directory
az storage directory create --share-name batchaiquickstart --name mnistcntksample --account-name azurebatchaidemostorage

Step 2 – Uploaded the training / test datasets and the Python training script.

# Upload train, test and script files
az storage file upload --share-name batchaiquickstart --source Train-28x28_cntk_text.txt --path mnistcntksample --account-name azurebatchaidemostorage
az storage file upload --share-name batchaiquickstart --source Test-28x28_cntk_text.txt --path mnistcntksample --account-name azurebatchaidemostorage
az storage file upload --share-name batchaiquickstart --source <training-script.py> --path mnistcntksample --account-name azurebatchaidemostorage

Step 3 – Provisioned a GPU cluster. The NC6 consists of 1 GPU (an NVIDIA K80), 6 vCPUs and 56GB memory, and costs roughly 80p/hour. This scales all the way up to an ND24, which has 4 GPUs and 448GB memory, for roughly £7.40/hour.

# Create GPU cluster (NC6 is an NVIDIA K80 GPU)
az batchai cluster create --name azurebatchaidemocluster --vm-size STANDARD_NC6 --image UbuntuLTS --min 1 --max 1 --storage-account-name azurebatchaidemostorage --afs-name batchaiquickstart --afs-mount-path azurefileshare --user-name <username> --password <password> --resource-group AzureBatchAIDemo --location westeurope

# Cluster status overview
az batchai cluster list -o table

Step 4 – Created a training job from a JSON template. This tells the cluster where to find the scripts and the data, how many nodes to use, what container to use, and where to store the trained model. The job can then be run.

# Create a training job from a JSON template
az batchai job create --name batchaidemo --cluster-name azurebatchaidemocluster --config batchaidemo.json --resource-group AzureBatchAIDemo --location westeurope

# Job status
az batchai job list -o table

The output can be seen in real time along with the epochs and metrics. An epoch is essentially one full training cycle over the dataset; running multiple epochs gives the optimiser repeated passes over the data, which helps the model generalise and fit real-world data better.

# Output metadata
az batchai job list-files --name batchaidemo --output-directory-id stdouterr --resource-group AzureBatchAIDemo

# Observe real-time output
az batchai job stream-file --job-name batchaidemo --output-directory-id stdouterr --name stderr.txt --resource-group AzureBatchAIDemo

The pipeline can also be seen in the Azure portal along with links to the output metadata.

Step 5 – Once the model has been trained, it can be extracted and the resources can be cleared down.

# Clean up
az batchai job delete --name batchaidemo
az batchai cluster delete --name azurebatchaidemocluster
az group delete --name AzureBatchAIDemo

Conclusion

By moving the compute into Azure and having the ability to scale, we can train and iterate on our problem much faster. This in turn means better hyperparameter tuning to generate better weightings, which means better models and better predictions.
As Steph Locke also alluded to in her data science talk, this means data scientists can do more work on the things they are good at, rather than waiting around for models to train before re-evaluating them. Deep learning is certainly an interesting space to be in currently!
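As promised above, here is a minimal numerical sketch in Python of a single forward pass and loss calculation for the house-price network described earlier. It is purely illustrative – the inputs, the weights and the "true" price are made-up, scaled toy values, and a real network would repeat this inside a training loop driven by gradient descent.

# A minimal sketch of the house-price example above: one forward pass through a
# single hidden layer followed by a loss calculation. All numbers (inputs,
# weights and the "true" price) are made-up, scaled toy values.
import numpy as np

def relu(z):
    # A common activation function: negative values become zero
    return np.maximum(0, z)

# Input neurons: Age, Sq. Footage, Bedrooms, Location (toy, scaled values)
x = np.array([0.3, 0.8, 0.6, 0.9])

# Completely arbitrary starting weights (4 inputs -> 3 hidden neurons -> 1 output)
W1 = np.array([[0.2, 0.1, 0.4, 0.3],
               [0.5, 0.1, 0.1, 0.2],
               [0.1, 0.3, 0.2, 0.4]])
W2 = np.array([[0.3, 0.5, 0.2]])

# Inputs multiplied by weights, then an activation, gives the hidden layer (a1, a2, a3)
hidden = relu(W1 @ x)

# Hidden layer multiplied by the next weights, plus an activation, gives the prediction
prediction = relu(W2 @ hidden)[0]

# Score the prediction with a loss function (root mean squared error on one example)
actual_price = 0.75  # scaled "true" price, also made up
loss = np.sqrt(np.mean((prediction - actual_price) ** 2))
print(f"prediction={prediction:.2f}, loss={loss:.2f}")

# Gradient descent would now adjust W1 and W2 to reduce this loss, and repeat.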

Loss Functions and Gradient Descent in Machine Learning

In an earlier blog I explained some of the basic building blocks of Neural Networks and Deep Learning (here). That was very high level and omitted a number of concepts which I wanted to explain but, for clarity, decided to leave until later. In this blog I will introduce Loss Functions and Gradient Descent, although there are still many more concepts which need to be explained.

Loss Functions are used to calculate the error between the known correct output and the actual output generated by a model. They are also often called Cost Functions.

Gradient Descent is an iterative optimisation method for finding the minimum of a function. On each iteration the parameters in a model are amended in the direction of the negative gradient of the output until the optimum parameters for the model are identified.

These are fundamental to understanding how models are trained and are common to both supervised machine learning and deep learning.

An Example

A worked example is probably the easiest way to illustrate how a loss function and gradient descent are used together to train a simple model. The model is kept simple so that the focus can be on the methods and not the model. Let's see how this works.

Imagine there are 5 observations of the weight vs price of a commodity. The objective is to train a model that can be used to predict the price for any weight of the commodity. The observations are shown below. I plot Weight vs Price on a graph and observe the following.

Modelling this as a linear problem, the equation of a line is of course y = Wx + b, where W is the slope of the line and b is the intercept of the line on the y axis. To make the problem simpler, please accept the assumption that b = 0; this is logically reasonable, as the price of zero grams of a commodity is reasonably zero, and this is supported by the data.

The method described here to train the model is to iteratively evaluate the model using a loss function and amend its parameters to reduce the error. So the next step is to have a guess at a value for W. It doesn't need to be a good guess; in machine learning, initial values are often randomly created, so they are very unlikely to be anywhere near accurate on a first iteration. The first guess is shown in red below.

Now it's necessary to evaluate how bad this model is, and this is where the loss function comes in. The loss function used here is called Mean Squared Error (MSE). For each observed point, the difference between the observed (actual) value and the estimated value is calculated (the error is represented by the green lines in the diagram below). The errors are squared, and then the average of the squared errors is taken to create a numerical representation of the error.

This error is plotted on a graph showing Error vs Slope (W). This error graph will be used in the Gradient Descent part of the method. Following this, a small change to the value of W is made and the error is re-evaluated. In the graph below, the original value for W is shown in blue and the new value in red. The error is once again plotted on the error graph. The error graph reveals that the error is smaller, and therefore the adjustment to the value of W was in the correct direction to reduce the error; in other words, the model has been improved by the change.

Small increments are made to the value of W to cause a reduction in the size of the error, i.e. to reduce the value of the loss function.
In other words, we want to descend the gradient of the curve until we find a minimum value for the loss function. Continuing the example, see below how we continue to zoom in on a solution after several iterations. At a certain point, continued changes in the same direction cause the model to become worse rather than better. At this point the optimal value for W can be identified: it's where the gradient of the error curve reaches zero, or in other words the value of W at the lowest point on the graph (indicating the minimum error).

Summarising this then: the loss function is used to evaluate the model on each training run, and the output of the loss function is used on each iteration to identify the direction in which to adjust the model parameters. The optimum parameters create the minimum error in the model.

Going forward, we need to apply these two principles to explain Backpropagation. Backpropagation is the method by which Neural Networks learn; it's the setting of all the Weights and Biases in the network to achieve the closest output possible to the desired output. That is for another blog, which I hope to bring to you soon.
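In the meantime, here is a minimal sketch in Python of the loop described above: gradient descent on the y = Wx model with a Mean Squared Error loss. The five weight/price observations are made-up stand-ins, since the original table isn't reproduced here, but any roughly linear data behaves the same way.

# Minimal gradient descent on y = W * x with a Mean Squared Error loss.
# The observations below are illustrative stand-ins for the 5 weight/price points.
weights = [100, 200, 300, 400, 500]   # weight of the commodity (grams)
prices  = [1.0, 2.1, 2.9, 4.2, 5.0]   # observed price

def mse(W):
    # Mean Squared Error of the model: price = W * weight
    errors = [(W * x - y) ** 2 for x, y in zip(weights, prices)]
    return sum(errors) / len(errors)

def mse_gradient(W):
    # Derivative of the MSE with respect to W
    grads = [2 * (W * x - y) * x for x, y in zip(weights, prices)]
    return sum(grads) / len(grads)

W = 0.05              # an arbitrary first guess at the slope
learning_rate = 1e-6  # size of each small increment to W

for step in range(10000):
    W -= learning_rate * mse_gradient(W)  # move down the error curve

print(f"Optimised W = {W:.5f}, MSE = {mse(W):.5f}")

Running this, W settles at the slope that minimises the loss, exactly as the error-curve diagrams above describe: each step moves W a little further down the gradient until the gradient is effectively zero.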

Migrating to Native Scoring with SQL Server 2017 PREDICT

Microsoft introduced native predictive model scoring with the release of SQL Server 2017. The PREDICT function (Documentation) is now a native T-SQL function that eliminates having to score using R or Python through the sp_execute_external_script procedure. It's an alternative to sp_rxPredict. In both cases you do not need to install R, but with PREDICT you do not need to enable SQLCLR either - it's truly native. PREDICT should make predictions much faster, as the process avoids having to marshal the data between SQL Server and Machine Learning Services (previously R Services).

Migrating from the original sp_execute_external_script approach to the new native approach tripped me up, so I thought I'd share a quick summary of what I have learned.

Stumble One: Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80004001. Model type is unsupported.

Reason: Not all models are supported. At the time of writing, only the following models are supported: rxLinMod, rxLogit, rxBTrees, rxDTree and rxDForest. sp_rxPredict supports additional models, including those available in the MicrosoftML package for R (I was attempting to use rxFastTrees). I presume this limitation will reduce over time. The list of supported models is referenced in the PREDICT function documentation.

Stumble Two: Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80070057. Model is corrupt or invalid.

Reason: The serialisation of the model needs to be modified for use by PREDICT. Typically you might serialise your model in R like this:

model <- data.frame(model=as.raw(serialize(model, NULL)))

Instead you need to use the rxSerializeModel method:

model <- data.frame(rxSerializeModel(model, realtimeScoringOnly = TRUE))

There's a corresponding rxUnserializeModel method, so it's worth updating the serialisation across the board so models can be used interchangeably in the event that all models are eventually supported. I had clearly been a bit legacy.

That's it. Oh, apart from the fact PREDICT is supported in Azure SQL DB, despite the documentation saying the contrary.

My Experience of the Microsoft Professional Program for Data Science

(Image 1 – Microsoft 2017)

In 2016 I was talking to Andrew Fryer (@DeepFat), Microsoft technical evangelist (after he attended Dundee University to present about Azure Machine Learning), about how Microsoft were piloting a degree course in data science. My interest was immediately piqued. Shortly after this, hints began to appear and the Edx page went live. Shortly after the Edx page went live, the degree was rebranded as the "Professional Program". I registered to be part of the pilot, however I was not accepted until the course went live in September 2016.

Prior to 2016 my background was in BI, predominantly in Microsoft Kimball data warehousing using SQL Server. At the end of 2015 I enrolled on a Master's degree in Data Science through the University of Dundee. I did this with the intention of getting exposure to tools I had an interest in but only limited commercial experience of (R, machine learning and statistics). This course is ongoing and will finish in 2018; I highly recommend it! I would argue that it is the best Data Science Master's degree course in the UK. So going in to the MPP I had a decent idea of what to expect, plus a lot of SQL experience, R and Power BI. Beyond that I had attended a few sessions at various conferences on Azure ML. When the syllabus for the MPP came out, it directly complemented my studies.

Link to program - Link to Dundee Masters -

Structure of the program

The program is divided up into 9 modules and a final project. All modules need to be completed, but there are different options you can take - you can customise the course to suit your interests. You can choose to pay for the course (which you will need to do if you intend to work towards the certification) or audit the course for free. I will indicate which modules I took and why. Most modules recommend at least 6 weeks part-time to complete. I started the first module in the middle of September 2016 and completed the final project in the middle of January 2017 – so the 6-week estimate is quite high, especially if you already have a decent base knowledge of the concepts.

You can, if you wish, complete multiple modules at once. I am not sure I recommend this approach, as to get the most out of the course you should read around the subject as well as watching the videos. Each module has a start date and an end date that you need to complete it between; if you do not, you will need to do it all again. You can start a module in one period and wait until the next period for another module - you do not need to complete them all in 3 months. If you pay for a module but do not request your certificate before the course closes, you will need to take it again (top tip: as soon as you're happy with your score, request your certificate).
Module list

Module – Detail – Time taken
Data Science Orientation – Data Science Orientation – 2-3 days
Query Relational Data – Querying Data with Transact-SQL – 1 day (exam only)
Analyze and Visualize Data – Analyzing and Visualizing Data with Excel, or Analyzing and Visualizing Data with Power BI – 2-4 days
Understand Statistics – Statistical Thinking for Data Science and Analytics – 7-9 days
Explore Data with Code – Introduction to R for Data Science, or Introduction to Python for Data Science – 7-9 days
Understand Core Data Science Concepts – Data Science Essentials – 7-9 days
Understand Machine Learning – Principles of Machine Learning – 2 weeks
Use Code to Manipulate and Model Data – Programming with R for Data Science, or Programming with Python for Data Science – R: 2-3 days, Python: 3 weeks
Develop Intelligent Solutions – Applied Machine Learning, Implementing Predictive Solutions with Spark in HDInsight, or Developing Intelligent Applications – 2 weeks
Final Project – Data Science Challenge – 2 months*

The times taken are based on the time I had spare. I completed each module between projects, in the evenings and at weekends. *The final project itself can be completed in a few days, however you need to wait until it has closed to get your grade.

Structure of the modules

Each module is online. You log on to the Edx website and watch videos by leading experts. Either at the end of a video, after reading some text, or at the end of a section of the module, you are given a multiple-choice test. The multiple-choice questions are graded and form part of your overall score. The other main assessment method is labs, where you are required to complete a series of tasks and enter the results. Unlike certifications, you get to see what your score is as you progress through the module. The multiple-choice questions generally allow you two to three attempts at the answer; sometimes these are true/false with two attempts, which does undermine the integrity of the course.

There is normally a final section which you're only given one chance to answer, and which holds a higher percentage towards your final mark. You need 70% to pass. Once you hit 70% you can claim your certificate - if you have chosen to pay for the module. Modules range from $20 to $100. For the most part I answered the questions fully and tried for the highest score possible. However, in all honesty, towards the end, once I hit around 80%, I started looking at a different module. If the module was really interesting I would persevere.

Modules

Data Science Orientation, Query Relational Data & Analyze and Visualize Data. These modules are very basic and really only skim the surface of the topics they describe. The first module is a gentle introduction to the main concepts you will learn throughout the program. The next module focuses on querying data with SQL. Regardless of your opinion of SQL, you must agree that SQL is the language of data. Having an understanding of the fundamentals of SQL is paramount, as almost every level of the Microsoft Data Science stack has integration with databases. If you're familiar with SQL (I already held an MCSE in SQL 2012) you can skip the main content of this module and just take the test at the end. For the next module you have an option of Excel or Power BI for visualisation. As I have experience with Power BI, I opted for that module. Once again this is a very basic introduction to Power BI. It will get you familiar enough with the tool that you can do basic data exploration. Some parts of this course jarred with me.
Data visualisation is so important and a key skill for any data scientist. In the Power BI module one of the exercises was to create a 3D pie chart. Pie charts are not a good visualisation, as it is hard to differentiate between angles, and making one 3D only escalates the issue. I wish Microsoft had made reference to some of the great data viz experts when making this module. I cannot comment on the Excel version.

Understand Statistics. This module is different from its predecessors in that it is not run by Microsoft. It is a MOOC from Columbia University, which you might have completed before. It covers a lot of the basic and more advanced stats that you need to know for data science, in particular a solid grounding in probability and probability theory. In BI you become familiar with descriptive stats and measures of variance, however I had not done a great deal of stats beyond this. I have been researching statistical methods for the MSc, but I had not done any real stats since A-Level maths. This course was really interesting and I learnt a lot. I don't know if this is the best way to really learn stats, but it is a good primer for what you need to know. I found topping up my understanding with blogs, books and YouTube helped support this module.

Explore Data with Code. You have two options again for this module: R and Python. Which should you learn, I imagine you're asking? Well, the simple answer is both. Knowing either R or Python will get you so far; knowing both will make you a unicorn. Many ask why learn one language over the other - aside from the previous point, R is very easy to get into: it has a rich catalogue of libraries written by some of the smartest statistical minds, a simple interface, and it is easy to install. Python is harder to learn in my opinion, as the language is massive! I found Python harder to work with, but it is much richer. I would recommend Python just for scikit-learn, the machine learning library. The Python module is extended to use Code Dojo (the great online tuition site). As you progress through the questions and examples, you have an IDE which checks your understanding and grades you as you go. I found this really helpful. This module is again a bit on the easier side. If you think the later Python module will be similar, you are in for a surprise! I did not take the R module as I was already using R in my day job.

Understand Core Data Science Concepts. Almost a redo of the first module and the understanding statistics module. Not a lot to say here, but repetition helped me understand and remember the concepts. The more I had to think about the core concepts, the more they stuck. This module could have been removed with little to no impact on the course, but it helped solidify my knowledge.

Understand Machine Learning. As this is a Microsoft course, this module is all about Azure Machine Learning. If you have not used Azure ML before, it has a nice drag-and-drop interface which allows you to build quick, simple models and expose them as a web API (with a key), to which you can then pass data from any tool that can call a REST endpoint. This module is half theory and half practical. There are a lot of labs, so you will need to take your time. If you skip ahead you will get the answers wrong and might not make it to 70%.

Use Code to Manipulate and Model Data. This section has two options again: R and Python. I know quite a bit of R already, so I started with Python. I wanted to do them both to see how you can do machine learning in each.
I was expecting a continuation of the Code Dojo format from the previous module; this was far from the case. Each of the modules up until this point had worked with you to find the right answer. This module will equip you with the basics, but expects you to find the correct function and answer yourself. Believe me when I say it was hard (with little prior experience of Python). The course will lead you towards the right resources, but you need to read the documentation to answer the questions. This was a great change of pace. Having to search for the answers made me absorb more than just the quizzes. This module was a struggle. Once I completed it, I did the same for R. On a difficulty scale, if the Python module was 100, R was only at 20. The disparity in difficulty is massive and frankly unfair. I was able to complete the R module very quickly. I left feeling disappointed that it did not have the same complexity that the Python module did.

Develop Intelligent Solutions. For this section you can pick one of three modules: machine learning, Spark or microservices. I went with Spark. Why? Because I had already worked with Spark and Hadoop as part of the MSc at Dundee. I knew how it worked and what it did from an open-source point of view, but not from a Microsoft HDInsight perspective. This module was tricky, but nothing compared to the Python module. I spent the best part of the week working on Spark, setting up HDInsight clusters and forgetting to tear them down (top tip! Don't leave an HDInsight cluster running - they are EXPENSIVE!). The last module is a machine learning project, so picking the "Applied Machine Learning" option might put you in a better place than your competition. I did not attempt either the Machine Learning or the Microservices modules.

Final project. Here is where the fun begins. You're given a problem and a dataset. You need to clean, reduce and derive features from the dataset, process it, and then apply an ML technique to predict something - in my case, whether or not someone will default on a loan. You could use any technique you liked as long as the final result was in Azure ML. I was pretty happy with my model early on and made very few tweaks as the course progressed. Unlike the previous modules, where you can complete a module and get your score, your final score is only available once the module has ended. You build an ML experiment and test it against a private dataset. You can submit your experiment 3 times a day to be scored against the private data (a maximum of 100 attempts). This gives you an indication of your score, but it is not your final score! Your score is calculated against a different dataset after the module has finished. Your top 5 scores are used to test against the private closed data. If you have over-fitted your model, you might have a shock (as many did on the forums) when your score is marked.

I completed all modules at the start of January and waited until February to get my final score. My highest-scoring answer, when used against the closed private dataset, did not get over the required 70% to pass. This was surprising, but not all that unexpected: I had over-fitted the model. To counterbalance this, I created 5 different experiments with 5 similar but different approaches. All scored similarly (~1-3% accuracy difference). This was enough to see me past the required 70% and to obtain the MPP in Data Science. The private dataset has been published now.
In the coming weeks I will blog about the steps I took to predict whether someone would default on their loan.

I have been asked at different stages of the course, "would you recommend it?". It really depends on what you want out of the course! If you expect to be a data scientist after completing the MPP, then you might be in for a shock. To get the most out of the course you need to supplement it with wider reading and research. YouTube has many great videos and recorded lectures which will really help you process the content and see it taught from a different angle. If you're looking to get an understanding of the key techniques in Data Science (from a Microsoft point of view), then you should take this course. If you're doing a degree where you need to do research, many of the modules will really help and build upon what you already know.

I hope you have found this interesting and that it has helped you decide whether or not you want to invest the time and money (each module is not free). If you do decide to, and you persevere, you too will be the owner of the MPP in Data Science (as seen below).

Terry McCann - Adatis Data Science Consultant & Organiser of the Exeter Data Science User Group - You can find us on MeetUp.

Azure ML Regression Example - Part 3 Deploying the Model

In the final blog of this series we will take the regression model we created earlier in the series and make it accessible so it can be consumed by other programs.

Making the experiment accessible to the outside world

The next part of the process is to make the whole experiment accessible to the world outside of Azure ML. To do so, you need to create a Web Service. This can be achieved by clicking the Set Up Web Service button next to Run and then selecting Predictive Web Service [Recommended]. The experiment will change in front of your eyes and you should be left with a canvas looking similar to the one displayed below. If you would like to get back to your training experiment at any time, you can do so by clicking Training experiment in the top right corner. You can then run the predictive web service again to update the predictive experiment.

Whilst in the predictive experiment window, run the experiment once again and then click Deploy Web Service. Having done this, you should be presented with the screen below. Select Excel 2013 or later in the same row as REQUEST/RESPONSE. Click the tick to download the Excel document, open it and click Enable Editing; you will see something like the image below. If you are using Excel 2010, feel free to follow the example - it will be fairly similar, but not identical.

Click Automobile Price Regression [Predictive Exp.] to begin. Click Use sample data to quickly construct a table with all the appropriate columns and a few examples. Feel free to alter the sample data to your heart's content. Once you're happy with your data, highlight it and select it as the Input range. Choose an empty cell as the Output. Click Predict. You should now see a replica table with a Scored Labels column displaying the estimated price for each row. Go ahead and rerun the experiment, putting in whatever attribute values you desire. This experiment will now always return a Scored Label for the price, based upon the training model.

What next?

This has just been a toe dip into the world of Azure ML. For more information on getting started with Azure ML, track down a copy of Microsoft Azure Essentials – Azure Machine Learning by Jeff Barnes; this is a great starting point. If you want to know what you can do with Azure ML and how to start using Azure ML within other programs, then check out my upcoming blog, which will show you how to integrate Azure ML straight into Visual Studio.
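If you would rather call the web service from code than from Excel, the same REQUEST/RESPONSE endpoint can be hit directly. The sketch below assumes the classic Azure ML request/response JSON shape shown on the web service's API help page; the endpoint URL, API key and column names are placeholders you would copy from your own deployed service, not real values.

# Minimal sketch of calling an Azure ML (classic) request/response web service.
# Replace the URL, API key and columns with the values from your own service's
# API help page - these are placeholders only.
import json
import urllib.request

url = "https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/execute?api-version=2.0&details=true"
api_key = "<your-api-key>"

body = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["make", "body-style", "horsepower"],  # illustrative subset of the automobile columns
            "Values": [["toyota", "sedan", "90"]]
        }
    },
    "GlobalParameters": {}
}

request = urllib.request.Request(
    url,
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": "Bearer " + api_key},
)

with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())
    # The Scored Labels column (the predicted price) comes back in the Results section
    print(json.dumps(result["Results"], indent=2))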

Using Azure Machine Learning and Power BI to Predict Sporting Behaviour

Can we predict people's sporting performance by knowing some details about them? If so, what is better at making these predictions: machines or humans? These questions seemed so interesting that we decided to answer them by creating a working IT solution to see how it would perform. This blog will provide an overview of the project, a simple analysis of the results, and details of the technologies we used to make it happen.

As it would be hard to cover all available sporting disciplines, we decided to focus on the one we love the most – cycling. Our aim was to predict the maximum distance a rider would ride within one minute from a standing start. Although "one minute" sounds insignificant, this is a really tough exercise, as we were simulating quite a tough track. We used the following equipment to perform the experiment:

- a bike with all necessary sensors to enable recording of speed, cadence, crank turns, wheel turns and distance
- a velodrome bike simulator
- a heart rate monitor in the form of a wrist band

Using this equipment allowed us to capture data about the ride in real time and display it, using streaming analytics and Power BI, in a live interactive dashboard as shown below (Picture 1).

Picture 1: Real-time Power BI dashboards showing: average heart rate (top row left); current speed in km/h (top row middle); average speed (top row right); current crank turns, wheel turns and cadence (bottom row left); average crank turns, wheel turns and cadence (bottom row right)

Sensors were used to capture information about how fast our rider was cycling, how many crank turns they made, what their heart rate was during the ride and, most importantly, how far they went within the time limit. Each rider had a chance to predict their own maximum distance before their ride. We also made a prediction based upon previous cyclists' results using machine learning algorithms. In order for the machine learning algorithms to make estimates about each of the riders, we had to capture some representative properties about each rider before the ride. All riders needed to categorise themselves for each of the properties listed below:

- age
- height
- weight
- gender
- smoking volume
- drinking volume
- cycling frequency

So, taking weight as an example, people were asked to allocate themselves to one of the available buckets: e.g. Bucket 1 - 50-59kg, Bucket 2 - 60-69kg, Bucket 3 - 70-79kg … Bucket N - above 100kg. Bucketing the properties helped us reduce the number of distinct values, which increased the probability that for a given rider we would find someone with similar characteristics who had already had a ride (a small illustrative sketch of this bucketing appears at the end of this post).

Obviously, to make the prediction work we had to have an initial sample. That's why we asked "Adatis people" to have a go on a Friday afternoon. In true competitive spirit, some of them even tried a few times a day! By the beginning of the SQLBits conference we had managed to save details of around 40 different rides.

In a nutshell, let me describe the process that we repeated for each rider. The first step was to capture the details of the volunteer using an ASP.NET web app, including the maximum distance they thought they would be able to reach (the human prediction). Next, behind the scenes, we provided their details to the machine learning algorithm, exposed as a web service, to get a predicted distance. We then turned on all the sensors and let the cyclist ride the bike. During the ride we captured all the data from the sensors and transferred it to the database through the Azure IoT stack.
After the ride finished we updated the distance reached by the rider. The more cyclists participated, the bigger the sample size we had for predicting the result for the next rider. Overall we captured 150 rides for 138 different riders. The initial sample size we used to make predictions was 40 riders, and it grew as more riders got involved. The table below (Table 1) contains basic statistics on the differences between the machine learning predictions and the human predictions.

Prediction Type – Avg. Difference – Std. Dev. of Difference – Max Difference – Min Difference
Azure Machine Learning – 119m – 87m – 360m – 2m
Humans – 114m – 89m – 381m – 0m

Table 1: Absolute difference between predicted and reached distance for a 1 minute ride (average distance reached: 725m)

From these numbers we can easily see that neither humans nor machine learning came close to the real results reached by the riders. The average difference over a 725m ride was 114m for humans with a standard deviation of 89 metres, and 119m for machine learning with a standard deviation of 87 metres. That means both of them were roughly equally inaccurate, although it is worth mentioning that there were individual cases where the prediction was very close, or even equal, to the distance reached.

In trying to determine the reason behind the inaccuracy of the ML predictions, I would say that the main one is that the sample size was not sufficient to make accurate predictions. Besides the small sample, there might be other reasons why the predictions were so inaccurate, such as:

- incorrect bucket sizes for rider properties
- too many properties to make a match
- lack of a strong enough correlation between the properties and the distance reached

It is also worth mentioning that some properties showed a high correlation with distance, such as the height of the rider, while others showed a low correlation, such as drinking volume. The best examples of high correlation can be seen in the charts attached below (Chart 1):

Chart 1: Correlation between distance reached in metres and height category of the rider

And even more significantly regarding fitness level (Chart 2):

Chart 2: Correlation between distance reached in metres and fitness category of the rider

On the other hand, some rider properties did not show the correlation that we would expect, e.g. age (Chart 3):

Chart 3: Correlation between distance reached in metres and age of the rider

Although there is no straightforward correlation, as previously stated, we can observe a general trend that we tend to perform better the closer we get to our round birthdays. We can observe peaks at the ages of 18, 29, 39 and 49. Is it perhaps because of the fear of getting to the next decade? I will leave this up to your interpretation…

If you are interested in a more technical explanation of how we designed and built our project, I would like to invite you to the second part of the blog, which will cover the top-level architecture of the project and also some deeper insights into the core technologies used, including Azure Stream Analytics, Azure Event Bus, Power BI web, ASP.NET MVC4 and SignalR.
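As a footnote, here is a minimal sketch of the kind of property bucketing described above, using weight as the example. The bucket boundaries and the helper function are illustrative assumptions rather than the exact ones used in the project.

# Illustrative sketch of bucketing a rider property (weight) before prediction.
# The boundaries below are examples only, not the ones used in the project.
def weight_bucket(weight_kg: float) -> int:
    """Return a bucket number for a rider's weight, e.g. 50-59kg -> 1, 60-69kg -> 2."""
    if weight_kg < 50:
        return 0          # below the first bucket
    if weight_kg >= 100:
        return 6          # the "above 100kg" bucket
    return (int(weight_kg) - 50) // 10 + 1

# Example: a 72kg rider falls into bucket 3 (70-79kg)
print(weight_bucket(72))  # prints 3

Reducing each continuous property to a small set of buckets like this is what made it plausible to find a previous rider with matching characteristics, even with only a few dozen rides in the sample.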