Hugh Freestone's Blog

Loss Functions and Gradient Descent in Machine Learning

In an earlier blog I explained some of the basic building blocks of Neural Networks and Deep Learning (here). This was very high level and omitted a number of concepts which I wanted to explain but for clarity decided to leave until later. In this blog I will introduce Loss Functions and Gradient Descent, however there are still many more which need to be explained.

Loss Functions are used to calculate the error between the known correct output and the actual output generated by a model, Also often called Cost Functions

Gradient Descent is an iterative optimization method for finding the minimum of a function. On each iteration the parameters in a model are amended in the direction of the negative gradient of the output until the optimum parameters for the model are identified.

These are fundamental to understanding training models and are common to both supervised machine learning and deep learning.

An Example

A worked example is probably the easiest way to illustrate how a loss function and gradient descent are used together to train a simple model. The model is simple to allow the focus can be on the methods and not the model. Lets see how this works.

Imagine there are 5 observations of the weight vs cost of a commodity, The object is to train a model to allow it to be used to predict the price for any weight of the commodity.

The observations are:-


I plot on a graph the Weight vs the Price and observe the following


Modelling this as a linear problem, the equation of a line is of course

y = Wx + b

where W = slope of the line and b is the intersection of the line on the y axis.

To make the problem simpler please accept the assumption that b = 0, this is logically reasonable as the price for zero grams of a commodity is reasonably zero and this is supported by the data.

The method described here to train the model is to iteratively evaluate a model using a loss function and amend the parameters in a model to reduce the error in the model. So the next step is to have a guess at a value for W, if doesn’t need to be a good guess, in machine learning initial values are often randomly created so they are very unlikely to be anywhere near accurate on a first iteration. The first guess is shown in red below:


Now its necessary to evaluate how bad this model is, this is where the loss function comes in. The loss function used here is called Mean Squared Error (MSE). For each observed point the difference between the observed (actual) value and the estimated value is calculated (the error is represented by the green lines in the diagram below). The errors are squared and then the average of the squared observations is taken to create an numerical representation for the error.This error is plotted on a graph showing Error vs Slope (W) This Error graph will be used in the Gradient Descent part of the method.



Following this a small change to the value of W is made and the error is re-evaluated. In the graph below the original value for W is shown in blue the new value in red. The error is once again plotted on the error graph. The error graph reveals that the error is smaller and therefore the adjustment to the value of W was in the correct direction to reduce the error, in other words the model has been improved by the change.



Small increments are made to the value of W to cause a reduction in the size of the error, ie to reduce the value of the loss function. In other words we want to descent the gradient of the curve until we find a minimum value for the loss function.

Continuing the example, see below how we have continued to zoom in on a solution after several iterations.



At a certain point continued changes in the same direction cause the model to become worse rather than improve.



At this point the optimal value for W can be identified, its where the gradient of the error curve reaches zero or in other words the value of W pertaining to the lowest point on the graph (indicating the minimum error).

Summarising this then, the Loss function is used to evaluate the model on each training run and the output of the loss function is used on each iteration to identify the direction to adjust model parameters. The optimum parameters create the minimum error in the model.

Going forward we need to apply these 2 principals to explain Backpropagation, Backpropagation is the method by which Neural Networks learn, its the setting of all the Weights and Biases in the network to achieve the closest output possible to the desired output. That is for another blog which I hope to bring to you soon.

Introduction to Deep Learning, Neural Network Basics

My main focus is Business Intelligence and Data Warehousing but recently I have been involved in the area of Machine Learning and more specifically Deep Learning to see if there are opportunities to use the technology from these fields to solve common BI problems.

Deep Learning uses interconnected networks of neurons called Neural Networks. These are loosely based on or perhaps it would be better to say inspired by networks of neurons in mammalian brains. For a more detailed definition see:-

In recent years Neural Networks have been used to solve problems which conventional computer programs have struggled to do. They have been particularly successful in areas where problems are highly intuitive to humans but incredibly complex to describe, eg understanding speech or controlling a moving vehicle.

Let’s start then by looking at neurons and how they operate, they are after all the basic building block of a neural network. The simplest (and original) neuron is called the Perceptron, others which will be described here are the Sigmoid, Tanh and Rectified Linear Units (ReLu). This is not an exhaustive list but it’s a good place to start.


Below is a representation of a perceptron. It has 3 inputs (X1, X2, X3) and one output, the inputs and the outputs are all binary values (0 or 1). It also has 3 Weights (W1, W2, W3) and a Bias represented by “b”


The output of the perceptron can be 1 or 0. If the sum of the Inputs (X) multiplied by the Weights (W) is greater than the Bias then the Output will be 1, otherwise it will be 0.

Bias then is the threshold for the neuron to “fire” or in other words return an output of 1. If the bias is set to a high value then the neuron is resistant to firing, or it could be said to have a high threshold. It will need higher weighted inputs to fire. If the bias is set to a low number then the neuron has a low resistance to firing.

As an example then, consider the scenario where there are 3 inputs into the perceptron:-

W1 has a weighting 0.7

W2 has a weighting 0.4

W3 has a weighting 0.2

Let’s assume then that for this example the bias is set at 0.8. Now lets work through some scenarios:

Scenario 1:-

X1 has a value 0

X2 has a value 1

X3 has a value 1

Evaluating this scenario:

Input = SUM((X1 * W1) + (X2 * W2) + (X3 * W3))

Input = SUM((0*0.7) + (1*0.4) + (1*0.2))

Input = 0.6

As 0.6 does not exceed 0.8 the output of the Perceptron will be 0

Scenario 2:-

X1 has a value 1

X2 has a value 0

X3 has a value 1

Evaluating this scenario:

Input = SUM((X1 * W1) + (X2 * W2) + (X3 * W3))

Input = SUM((1*0.7) + (0*0.4) + (1*0.2))

Input = 0.9

As 0.9 does exceed 0.8 the output of the Perceptron will be 1

This behaviour can be plotted as shown below, when the sum of the inputs multiplied by the weights exceed the bias (here 8) the output is 1 but less than the bias the output is zero:-


Other Neurons

The neurons below all behave in a similar way to the Perceptron, all can have multiple inputs and all have a single output. The output is always dependent on the inputs, weights and bias of the neuron. How they differ is in their their activation function, ie how they respond to the inputs.  


The Sigmoid Neuron has decimal inputs and outputs which are numbers between 0 and 1. The shape of the activation function is smoother than the stepwise shape of the perceptron, as shown below.


Explaining this in a little more detail, when the sum of the Weights multiplied by the Inputs is much lower than the bias the output is close to zero. As the bias is approached the output begins to rise until it is 0.5 at the point of the bias (still here the value of 8) after which as the weights and inputs sum increases it continues upwards towards the value one. This subtle difference provides Sigmoid neurons a significant advantage over Perceptrons as networks of Sigmoids are more stable in gradient-based learning, I’ll explain this in another blog later.


The Tanh neuron is similar to the Sigmoid neuron except its rescaled to have outputs in the range from -1 to 1


Choosing between Sigmoid and a Tanh is sometimes a matter of trial and error, however it is sometimes said that Tanh learn better as Sigmoids suffer from saturation more easily, again I would have to expand the scope of this blog quite a lot to explain this though.


The ReLu neuron (short for Rectified Linear Unit) returns 0 until the Bias is reached then increase in a linear fashion as shown below.


ReLu neurons are favoured over Sigmoid neurons in feed forward neural networks (see below) because they are less susceptible to a problem known as Learning Slowdown.

A Simple Feed Forward Network

Neurons can be linked together in different ways, the simplest design to explain is the Feed Forward Network. In the diagram below there are 3 layers of neurons with every neuron in a layer connected to every neuron in the next layer. Every neuron has an individual bias setting and every connection has an individual weight.


This is called a feed forward network because the flow from inputs to outputs is unidirectional, there are no feedback loops here and the network is stateless as the output is calculated from the input without effecting the network.

The network is trained by adjusting the Weights and Biases on each neuron until the desired output is are produced for the provided inputs. This training is usually done working from the outputs to the inputs with a method called Back Propagation. Back Propagation is a whole subject which can be explored more here:-

There are 3 layers in a feed forward neural network, the input layer, the output layer and the layer in the middle which is called the hidden layer and which may itself be made up of several layers of neurons. In the diagram below the hidden layer has 3 layers of neurons. A neural network is considered “deep” if there are 2 or more hidden layers in the network.


In summary then this blog introduces a simple neural network and explains a little of how neurons work. There are quite a few more concepts which need to be introduced to get a full picture but hopefully you found this interesting and informative and I’ll try to fill in some of the gaps in future blogs.