Adatis BI Blogs

Adatibits - Scalable Deep Learning with Azure Batch AI

I recently had the pleasure of attending SQLBits. There were a number of great talks this year, and I think overall it was a fantastic event. As part of our commitment to learning at Adatis, we were also challenged to present back to the team something of interest we learnt at the event. This then becomes our internal Adatibits event, where we get a sample of sessions from across the week. As such, I was pleased when I saw a talk by Ben Keen on the Friday bill of SQLBits that revolved around deep learning. Having just come through the Microsoft Data Science program, this fell in line with my interest and research in data science, and it focused on the bleeding edge of the subject area. It was also the talk I was probably looking forward to the most from the synopsis, and I thought it was very well delivered. Anyway, on to the blog. In the following paragraphs, I'll cover a cut-down version of the talk and also describe my experience of using the MNIST dataset on Azure Batch AI. Credit to Ben for a few of the images I've used, as they came off his slide deck.

What is Deep Learning?

So you've heard of machine learning (ML) and what it can do. Deep learning is essentially a subset of ML which uses neural network architectures (similar to human brains). It can work with both supervised (labelled) and unsupervised data, although its value today tends to come from learning from labelled data. You can almost think of it as an evolution of ML. Its performance tends to improve the more data you can throw at it, where traditional ML algorithms seem to plateau. Deep learning also has the ability to perform automatic feature extraction from raw data (feature learning), whereas the traditional routes have features provided as part of the dataset.

How does it help us? What are its use cases?

Deep learning excels at finding patterns in unstructured data. The following examples would be very difficult to write a program for, which is where the field of DL comes in. The use cases are usually split into 4 main areas: image, text, sound, and video.

- Image – medical imaging to diagnose heart disease / cancers
- Image – classification / segmentation of satellite images (NASA)
- Image – restoring colour to black and white photos
- Image – Pixel Super Resolution (generating higher-resolution images from lower ones)
- Text – real-time text translation
- Text – identifying fonts (Adobe DeepFont)
- Sound – real-time foreign language translation
- Sound – restoring sound to videos
- Video – motion detection in self-driving cars (Tesla)
- Video – beating video games (DeepMind beating Atari Breakout)
- Video – redacting video footage in police body-cameras (NIJ)

Deep Neural Network Example

A simple example of how we can use deep learning is to understand the complexity around house prices. Taking an input layer of neurons for things such as Age, Sq. Footage, Bedrooms, and Location of a house, one could normally apply the traditional linear formula y = mx + c to weight the neurons and calculate the house price. This is a very simplistic view, however, and there are many more factors at play that can change the value. This often involves the different neurons interacting with one another.
For example, people would think that a large number of bedrooms is a good thing which would raise the price of the house, but if all those bedrooms were really small, then it wouldn't be a very attractive offering for anyone other than a developer (and even then they might baulk at the effort involved), therefore lowering the price. This is especially true with people wanting more open-plan houses nowadays. So while traditionally people might have been most interested in the number of bedrooms, this may have shifted in recent times towards Sq. Footage as the main driver.

Therefore, a number of weights can be attributed to an intermediary layer called the hidden layer. The neural network architecture then uses this hidden layer to perform a better prediction. It does this by starting off with completely arbitrary weights (which will make a poor prediction). The process then multiplies the inputs by these weights and applies an activation function to get to the hidden layer (a1, a2, a3). Each of these neurons is then multiplied by another weight, and another activation function is applied, to generate the prediction. This value is then scored and evaluated via some form of loss function (e.g. Root Mean Squared Error). The weights are adjusted accordingly and the process is repeated, with the weights being adjusted through a process called gradient descent.
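Before moving on to the tooling, here is a minimal sketch of that training loop in Python using NumPy, for a network with one hidden layer of three neurons. Every number in it – the scaled house features, the observed prices, the starting weights and the learning rate – is made up purely for illustration, and the loss here is plain mean squared error rather than anything more sophisticated.

import numpy as np

# Made-up training data: [age, sq. footage, bedrooms, location score],
# all scaled to roughly 0-1 for illustration.
X = np.array([[0.2, 0.8, 0.6, 0.9],
              [0.7, 0.4, 0.4, 0.3],
              [0.1, 0.9, 0.8, 0.7]])
y = np.array([[0.85], [0.35], [0.80]])  # observed prices, also scaled

rng = np.random.default_rng(42)
W1 = rng.normal(size=(4, 3))  # input -> hidden (a1, a2, a3), arbitrary start
W2 = rng.normal(size=(3, 1))  # hidden -> prediction, arbitrary start

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.5
for epoch in range(5000):
    # Forward pass: inputs times weights, activation, then the output layer
    hidden = sigmoid(X @ W1)      # a1, a2, a3
    pred = sigmoid(hidden @ W2)   # predicted price

    # Loss: mean squared error between prediction and the known price
    loss = np.mean((pred - y) ** 2)

    # Backward pass: gradient of the loss with respect to each weight,
    # then a small step downhill - this is gradient descent
    grad_pred = 2 * (pred - y) / len(X)
    delta_out = grad_pred * pred * (1 - pred)
    W2_grad = hidden.T @ delta_out
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)
    W1_grad = X.T @ delta_hidden
    W2 -= learning_rate * W2_grad
    W1 -= learning_rate * W1_grad

print(f"final loss: {loss:.5f}")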
To give you an idea of the scale, and the weight optimisation that's required: a 50x50 RGB image has 7,500 input neurons. Therefore we're going to need to scale out, as the training is very compute intensive. This is where Azure Batch AI comes in!

Azure Batch AI

Azure Batch AI is a managed service for training deep learning models in parallel at scale. It's currently in public preview, which I believe it entered around September 2017. It's built on top of Azure Batch and essentially sells the standard Azure story: it provides the infrastructure for data scientists so they don't need to worry about it and can get on with more practical work. You can take your ML tools and workbooks (CNTK, TensorFlow, Python, etc.) and provision a GPU cluster on demand to run them against. It's important to note that the provisioning is of GPUs, not CPUs – similar cores, less money, and less power consumed for this type of activity. Once trained, the service can provide access to the trained model via apps and data services.

As part of my interest in the subject, I then went and looked at using the service to train a model on the MNIST dataset. This is a collection of handwritten digits between 0-9 with over 60,000 examples. It's a great dataset for trying out learning techniques and pattern recognition methods while spending minimal effort on pre-processing and formatting. This is not always easy with most images, as they contain a lot of noise and require time to convert into a format ready for training.

I then followed the process below within Azure Batch AI.

Created a Storage Account using the Azure CLI, along with a file share and directory.

# Login
az login -u <username> -p <password>

# Register resource providers
az provider register -n Microsoft.BatchAI
az provider register -n Microsoft.Batch

# Create Resource Group
az group create --name AzureBatchAIDemo --location uksouth

# Create storage account to host data/scripts
az storage account create --name azurebatchaidemostorage --sku Standard_LRS --resource-group AzureBatchAIDemo

# Create File Share
az storage share create --account-name azurebatchaidemostorage --name batchaiquickstart

# Create Directory
az storage directory create --share-name batchaiquickstart --name mnistcntksample --account-name azurebatchaidemostorage

Uploaded the training / test datasets, and the Python script.

# Upload train, test and script files
az storage file upload --share-name batchaiquickstart --source Train-28x28_cntk_text.txt --path mnistcntksample --account-name azurebatchaidemostorage
az storage file upload --share-name batchaiquickstart --source Test-28x28_cntk_text.txt --path mnistcntksample --account-name azurebatchaidemostorage
az storage file upload --share-name batchaiquickstart --source ConvNet_MNIST.py --path mnistcntksample --account-name azurebatchaidemostorage

Provisioned a GPU cluster. The NC6 consists of 1 GPU (an NVIDIA K80), with 6 vCPUs and 56GB memory, at roughly 80p/hour. This scales all the way up to the ND24, which has 4 GPUs and 448GB memory, for roughly £7.40/hour.

# Create GPU cluster (NC6 is a NVIDIA K80 GPU)
az batchai cluster create --name azurebatchaidemocluster --vm-size STANDARD_NC6 --image UbuntuLTS --min 1 --max 1 --storage-account-name azurebatchaidemostorage --afs-name batchaiquickstart --afs-mount-path azurefileshare --user-name <username> --password <password> --resource-group AzureBatchAIDemo --location westeurope

# Cluster status overview
az batchai cluster list -o table

Created a training job from a JSON template – this tells the cluster where to find the scripts and the data, how many nodes to use, what container to use, and where to store the trained model. This can then be run!

# Create a training job from a JSON template
az batchai job create --name batchaidemo --cluster-name azurebatchaidemocluster --config batchaidemo.json --resource-group AzureBatchAIDemo --location westeurope

# Job status
az batchai job list -o table
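The template itself isn't reproduced here, but for illustration a minimal batchaidemo.json along the lines of the Batch AI preview quickstart would look roughly like the below. The nodeCount property, the cntkSettings section, and the $AZ_BATCHAI_MOUNT_ROOT paths are assumptions based on that quickstart, and the preview schema may since have changed, so treat this as a sketch rather than a definitive reference.

{
    "properties": {
        "nodeCount": 1,
        "cntkSettings": {
            "pythonScriptFilePath": "$AZ_BATCHAI_MOUNT_ROOT/azurefileshare/mnistcntksample/ConvNet_MNIST.py"
        },
        "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/azurefileshare"
    }
}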
The output can be seen in real time, along with the epochs and metrics. An epoch is essentially a full training cycle over the dataset; by running multiple epochs the optimiser keeps refining the weights, which leads the model to generalise more and fit real world data better.

# Output metadata
az batchai job list-files --name batchaidemo --output-directory-id stdouterr --resource-group AzureBatchAIDemo

# Observe realtime output
az batchai job stream-file --job-name batchaidemo --output-directory-id stdouterr --name stderr.txt --resource-group AzureBatchAIDemo

The pipeline can also be seen in the Azure portal, along with links to the output metadata.

Once the model has been trained, it can be extracted and the resources can be cleared down.

# Clean Up
az batchai job delete --name batchaidemo
az batchai cluster delete --name azurebatchaidemocluster
az group delete --name AzureBatchAIDemo

Conclusion

By moving the compute into Azure, and having the ability to scale, we can train models much faster. This in turn allows more hyperparameter tuning to generate better weightings, which means better models and better predictions. As Steph Locke also alluded to in her data science talk, this means data scientists can spend more time on the things they are good at, rather than waiting around for models to train before re-evaluating. Deep learning is certainly an interesting space to be in currently!

Loss Functions and Gradient Descent in Machine Learning

In an earlier blog I explained some of the basic building blocks of Neural Networks and Deep Learning (here). This was very high level and omitted a number of concepts which I wanted to explain, but for clarity decided to leave until later. In this blog I will introduce Loss Functions and Gradient Descent, though there are still many more concepts which need to be explained.

- Loss Functions are used to calculate the error between the known correct output and the actual output generated by a model. They are also often called Cost Functions.
- Gradient Descent is an iterative optimisation method for finding the minimum of a function. On each iteration, the parameters in a model are amended in the direction of the negative gradient of the output until the optimum parameters for the model are identified.

These are fundamental to understanding training models and are common to both supervised machine learning and deep learning.

An Example

A worked example is probably the easiest way to illustrate how a loss function and gradient descent are used together to train a simple model. The model is kept simple so the focus can be on the methods and not the model. Let's see how this works.

Imagine there are 5 observations of the weight vs cost of a commodity. The objective is to train a model that can be used to predict the price for any weight of the commodity. Plotting the Weight vs the Price of the observations suggests a roughly linear relationship.

Modelling this as a linear problem, the equation of a line is of course y = Wx + b, where W is the slope of the line and b is the intercept of the line on the y axis. To make the problem simpler, please accept the assumption that b = 0. This is logically reasonable, as the price of zero grams of a commodity is reasonably zero, and this is supported by the data.

The method described here to train the model is to iteratively evaluate the model using a loss function and amend its parameters to reduce the error. So the next step is to have a guess at a value for W. It doesn't need to be a good guess; in machine learning, initial values are often randomly created, so they are very unlikely to be anywhere near accurate on a first iteration.

Now it's necessary to evaluate how bad this first model is, and this is where the loss function comes in. The loss function used here is called Mean Squared Error (MSE). For each observed point, the difference between the observed (actual) value and the estimated value is calculated (the vertical distance from each observed point to the line). The errors are squared, and then the average of the squared errors is taken to create a numerical representation of the error.

This error is plotted on a graph showing Error vs Slope (W). This error graph will be used in the Gradient Descent part of the method. Following this, a small change to the value of W is made, and the error is re-evaluated and plotted once again on the error graph. The error graph reveals that the error is smaller, and therefore the adjustment to the value of W was in the correct direction to reduce the error; in other words, the model has been improved by the change. Small increments are made to the value of W to cause a reduction in the size of the error, i.e. to reduce the value of the loss function. In other words, we want to descend the gradient of the curve until we find a minimum value for the loss function.

Continuing the example, over several iterations we continue to zoom in on a solution. At a certain point, continued changes in the same direction cause the model to become worse rather than improve. At this point the optimal value for W can be identified: it's where the gradient of the error curve reaches zero, or in other words the value of W at the lowest point on the graph (indicating the minimum error).

Summarising this then: the loss function is used to evaluate the model on each training run, and the output of the loss function is used on each iteration to identify the direction in which to adjust the model parameters. The optimum parameters create the minimum error in the model.
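To make the method concrete, here is a minimal sketch of the whole loop in Python. The five observations are invented for the purpose of the example (the original table was shown as an image in the blog), and the starting guess and learning rate are arbitrary.

# Gradient descent on y = Wx (with b assumed to be 0), using Mean Squared Error.
# The observations are made up for illustration - roughly price = 2.5 * weight.
observations = [(10, 25.0), (20, 52.0), (30, 74.0), (40, 101.0), (50, 123.0)]

def mse(W):
    """Mean Squared Error of the model y = W * x over all observations."""
    return sum((price - W * weight) ** 2
               for weight, price in observations) / len(observations)

W = 0.5                  # initial guess - deliberately poor
learning_rate = 0.0003   # size of each step down the error curve
for step in range(100):
    # Gradient of the MSE with respect to W: the mean of -2x(y - Wx)
    gradient = sum(-2 * weight * (price - W * weight)
                   for weight, price in observations) / len(observations)
    # Step in the direction of the negative gradient
    W -= learning_rate * gradient
    if step % 20 == 0:
        print(f"step {step:3d}  W = {W:.4f}  error = {mse(W):.3f}")

After enough iterations, W settles at the value where the gradient of the error curve is (close to) zero, which is exactly the minimum point described above.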
Going forward, we need to apply these two principles to explain Backpropagation. Backpropagation is the method by which Neural Networks learn: it is the setting of all the Weights and Biases in the network to achieve the closest output possible to the desired output. That is for another blog, which I hope to bring to you soon.

Introduction to Deep Learning, Neural Network Basics

My main focus is Business Intelligence and Data Warehousing, but recently I have been involved in the area of Machine Learning, and more specifically Deep Learning, to see if there are opportunities to use the technology from these fields to solve common BI problems. Deep Learning uses interconnected networks of neurons called Neural Networks. These are loosely based on, or perhaps it would be better to say inspired by, networks of neurons in mammalian brains. For a more detailed definition see: http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html

In recent years Neural Networks have been used to solve problems which conventional computer programs have struggled with. They have been particularly successful in areas where problems are highly intuitive to humans but incredibly complex to describe, e.g. understanding speech or controlling a moving vehicle.

Let's start then by looking at neurons and how they operate; they are, after all, the basic building block of a neural network. The simplest (and original) neuron is called the Perceptron; others which will be described here are the Sigmoid, Tanh and Rectified Linear Unit (ReLu). This is not an exhaustive list, but it's a good place to start.

Perceptron

Below is a representation of a perceptron. It has 3 inputs (X1, X2, X3) and one output; the inputs and the outputs are all binary values (0 or 1). It also has 3 Weights (W1, W2, W3) and a Bias represented by "b".

The output of the perceptron can be 1 or 0. If the sum of the Inputs (X) multiplied by the Weights (W) is greater than the Bias, then the Output will be 1; otherwise it will be 0. The Bias, then, is the threshold for the neuron to "fire", or in other words return an output of 1. If the bias is set to a high value then the neuron is resistant to firing, or it could be said to have a high threshold: it will need higher weighted inputs to fire. If the bias is set to a low number then the neuron has a low resistance to firing.

As an example then, consider the scenario where there are 3 inputs into the perceptron:

- W1 has a weighting of 0.7
- W2 has a weighting of 0.4
- W3 has a weighting of 0.2

Let's assume that for this example the bias is set at 0.8. Now let's work through some scenarios.

Scenario 1:

- X1 has a value 0
- X2 has a value 1
- X3 has a value 1

Evaluating this scenario:

Input = SUM((X1 * W1) + (X2 * W2) + (X3 * W3))
Input = SUM((0*0.7) + (1*0.4) + (1*0.2))
Input = 0.6

As 0.6 does not exceed 0.8, the output of the Perceptron will be 0.

Scenario 2:

- X1 has a value 1
- X2 has a value 0
- X3 has a value 1

Evaluating this scenario:

Input = SUM((X1 * W1) + (X2 * W2) + (X3 * W3))
Input = SUM((1*0.7) + (0*0.4) + (1*0.2))
Input = 0.9

As 0.9 does exceed 0.8, the output of the Perceptron will be 1.

This behaviour can be plotted as a step function: when the sum of the inputs multiplied by the weights exceeds the bias (here 0.8) the output is 1, but below the bias the output is zero.
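As a quick illustration, here is a minimal sketch of this perceptron in Python, using the weights, bias and inputs from the two scenarios above.

def perceptron(inputs, weights, bias):
    """Binary perceptron: fires (returns 1) when the weighted sum of the
    inputs exceeds the bias, otherwise returns 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > bias else 0

weights = [0.7, 0.4, 0.2]
bias = 0.8

print(perceptron([0, 1, 1], weights, bias))  # Scenario 1: 0.6 <= 0.8 -> 0
print(perceptron([1, 0, 1], weights, bias))  # Scenario 2: 0.9 >  0.8 -> 1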
Other Neurons

The neurons below all behave in a similar way to the Perceptron: all can have multiple inputs and all have a single output. The output is always dependent on the inputs, weights and bias of the neuron. Where they differ is in their activation function, i.e. how they respond to their inputs.

Sigmoid

The Sigmoid neuron has decimal inputs and outputs, which are numbers between 0 and 1. The shape of the activation function is smoother than the stepwise shape of the perceptron. Explaining this in a little more detail: when the sum of the Weights multiplied by the Inputs is much lower than the bias, the output is close to zero. As the bias is approached, the output begins to rise, until it is 0.5 at the point of the bias (still the value of 0.8 in our example), after which, as the sum of the weighted inputs increases, it continues upwards towards the value one. This subtle difference gives Sigmoid neurons a significant advantage over Perceptrons, as networks of Sigmoids are more stable in gradient-based learning; I'll explain this in another blog later.

Tanh

The Tanh neuron is similar to the Sigmoid neuron, except it's rescaled to have outputs in the range from -1 to 1. Choosing between a Sigmoid and a Tanh is sometimes a matter of trial and error; however, it is sometimes said that Tanh neurons learn better, as Sigmoids suffer from saturation more easily. Again, I would have to expand the scope of this blog quite a lot to explain this properly.

ReLu

The ReLu neuron (short for Rectified Linear Unit) returns 0 until the bias is reached, then increases in a linear fashion. ReLu neurons are favoured over Sigmoid neurons in feed forward neural networks (see below) because they are less susceptible to a problem known as Learning Slowdown.

A Simple Feed Forward Network

Neurons can be linked together in different ways; the simplest design to explain is the Feed Forward Network. In this design there are layers of neurons, with every neuron in a layer connected to every neuron in the next layer. Every neuron has an individual bias setting and every connection has an individual weight. This is called a feed forward network because the flow from inputs to outputs is unidirectional: there are no feedback loops, and the network is stateless, as the output is calculated from the input without affecting the network.

The network is trained by adjusting the Weights and Biases on each neuron until the desired outputs are produced for the provided inputs. This training is usually done working from the outputs back to the inputs, with a method called Back Propagation. Back Propagation is a whole subject in itself, which can be explored more here: https://pdfs.semanticscholar.org/4d3f/050801bd76ef10855ce115c31b301a83b405.pdf

A feed forward neural network has three kinds of layer: the input layer, the output layer, and the layer in the middle, which is called the hidden layer and which may itself be made up of several layers of neurons. A neural network is considered "deep" if there are 2 or more hidden layers in the network.

In summary then, this blog introduces a simple neural network and explains a little of how neurons work. There are quite a few more concepts which need to be introduced to get a full picture, but hopefully you found this interesting and informative, and I'll try to fill in some of the gaps in future blogs.
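To finish, here is a minimal sketch in Python tying these pieces together: the activation functions described above, and a single forward pass through a tiny feed forward network. The weights, biases and inputs are arbitrary values chosen purely for illustration.

import math

# The activation functions described in this post
def step(z):       # Perceptron-style: fires once the weighted sum exceeds the threshold
    return 1 if z > 0 else 0

def sigmoid(z):    # Smooth curve from 0 to 1
    return 1 / (1 + math.exp(-z))

def tanh(z):       # A sigmoid rescaled to the range -1 to 1
    return math.tanh(z)

def relu(z):       # 0 below the threshold, then increases linearly
    return max(0.0, z)

def layer(inputs, weights, biases, activation):
    """One fully connected layer: each neuron sums its weighted inputs,
    adds its own bias, and applies the activation function."""
    return [activation(sum(x * w for x, w in zip(inputs, ws)) + b)
            for ws, b in zip(weights, biases)]

# A tiny feed forward pass: 3 inputs -> 2 hidden ReLu neurons -> 1 Sigmoid output
inputs = [0.5, 0.1, 0.9]
hidden = layer(inputs,
               weights=[[0.2, -0.4, 0.6], [0.8, 0.3, -0.1]],
               biases=[0.0, -0.2],
               activation=relu)
output = layer(hidden, weights=[[1.5, -0.7]], biases=[0.1], activation=sigmoid)
print(output)  # a single value between 0 and 1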