
Hugh Freestone's Blog

Connecting Azure Databricks to Data Lake Store

Just a quick post here to help anyone who needs to integrate their Azure Databricks cluster with Data Lake Store. This is not hard to do, but there are a few steps, so it’s worth recording them here in a quick and easy-to-follow form.

This assumes you have created your Databricks cluster and have created a data lake store you want to integrate with. If you haven’t created your cluster yet, that’s described in a previous blog here, which you may find useful.

The objective here is to create a mount point, a folder in the lake accessible from Databricks, so we can read from and write to ADLS. Here this is done in notebooks in Databricks using Python, but if Scala is your thing then it’s just as easy. To create the mount point you need to run the following command:-


configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
            "dfs.adls.oauth2.client.id": "{YOUR SERVICE CLIENT ID}",
            "dfs.adls.oauth2.credential": "{YOUR SERVICE CREDENTIALS}",
            "dfs.adls.oauth2.refresh.url": "
https://login.microsoftonline.com/{YOUR DIRECTORY ID}/oauth2/token"}

dbutils.fs.mount(
   source = "adl://{YOUR DATA LAKE STORE ACCOUNT NAME}.azuredatalakestore.net{YOUR DIRECTORY NAME}",
   mount_point = "{mountPointPath}",
   extra_configs = configs)


So to do this we need to collect together the values to use for:

  • {YOUR SERVICE CLIENT ID}
  • {YOUR SERVICE CREDENTIALS}
  • {YOUR DIRECTORY ID}
  • {YOUR DATA LAKE STORE ACCOUNT NAME}
  • {YOUR DIRECTORY NAME}
  • {mountPointPath}

First the easy ones: my data lake store is called “pythonregression” and I want the folder I am going to use to be ‘/mnt/python’. These are just my choices.

I need the service client ID and credentials. For this I will create a new Application Registration by going to the Active Directory blade in the Azure portal and clicking on “New Application Registration”.

image

Fill in your chosen App name. Here I have used the name ‘MyNewApp’ (I know, not really original). Then press ‘Create’ to create the App registration.

image

This will only take a few seconds to create and you should then see your App registration in the list of available apps. Click on the App you have created to see the details which will look something like this:

image

Make a note of the Application ID GUID (partially redacted here); this is the SERVICE CLIENT ID you will need. Then from this screen click the “Settings” button and then the “Keys” link. We are going to create a key specifically for this purpose. Enter a Key Description, choose a Duration from the drop-down, and when you hit “Save” a key will be produced. Save this key; it’s the value you need for YOUR SERVICE CREDENTIALS, and as soon as you leave the blade it will disappear.

We now have everything we need except the DIRECTORY ID.

To get the DIRECTORY ID go back to the Active Directory blade and click on “Properties” as shown below:-

image


From here you can get the DIRECTORY ID

image


Ok, one last thing to do. You need to grant the “MyNewApp” App access to the Data Lake Store, otherwise you will get access forbidden messages when you try to access ADLS from Databricks. This can be done from Data Explorer in ADLS using the link highlighted below.

image


Now we have everything we need to mount the drive. In Databricks launch a workspace then create a new notebook (as described in my previous post).

Run the command we put together above in the Python notebook:

image


You can then create directories and files in the lake from within your Databricks notebook.

image
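As a rough sketch of the kind of commands involved (a minimal example assuming the /mnt/python mount point created above; the directory and file names are just placeholders):

# create a directory under the mount point
dbutils.fs.mkdirs("/mnt/python/newdir")

# write a small text file (the final True allows overwriting an existing file)
dbutils.fs.put("/mnt/python/newdir/example.txt", "some example content", True)

# list the contents of the mounted folder
display(dbutils.fs.ls("/mnt/python"))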

If you want to, you can unmount the drive using the following command:

image
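The unmount call is a one-liner; using the /mnt/python mount point from this example it looks like this:

# remove the mount point; the data in the lake itself is unaffected
dbutils.fs.unmount("/mnt/python")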


Something to note: if you Terminate the cluster (terminate meaning shut down), you can restart the cluster and the mounted folder will still be available to you; it doesn’t need to be remounted.

You can access the file system from Python as follows:

with open("/dbfs/mnt/python/newdir/iris_labels.txt", "w") as outfile:
    outfile.write("example content")   # write whatever content you need here

and then write to the file in ADLS as if it was a local file system.


Ok, that was more detailed than I intended when I started, but I hope that was interesting and helpful. Let me know if you have any questions on any of the above and enjoy the power of Databricks and ADLS combined.

Getting Started With Databricks on Azure

Databricks is a managed platform for running Apache Spark. Spark is a fast, general-purpose cluster computing system which provides high-level APIs in Java, Scala, Python and R. Spark programs have a driver program containing a SparkContext object, which co-ordinates processes running independently across worker nodes distributed in the cluster. Spark uses a cluster manager such as YARN to allocate resources to applications.

Databricks allows a user to access the functionality of Spark without any of the hassle of managing a Spark cluster. It also provides the convenience of being able to create and tear down clusters at will.

Starting Databricks

Creating a Databricks service in Azure is simple. In the Azure portal select “New” then “Data + Analytics” then “Azure Databricks”:

image

Enter a workspace name, select the subscription and resource group, and click create.

image

Wait a few minutes and the service will be ready for use. Clicking on the Databricks icon will take you to the Azure Databricks blade. Click “Launch Workspace” to access the Databricks portal, which is something of a different experience from other services in Azure; the screen looks like this:

image

Note the icons on the dark background on the left, which provide useful links for jumping around the portal.

Databricks comes with some excellent documentation, so take a moment to review the “Databricks Guide”. We are going to start with something simple, using a notebook. When you select Notebook you will be asked for a name and to choose a language; the options are Python, Scala, SQL and R. However, one of the features of notebooks in Databricks is that the language is only a default and can be overridden by specifying an alternative language within the notebook.
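For example, in a notebook whose default language is Python, a single cell can be run in another language by starting it with a language magic such as %sql, %scala or %r (a quick illustration; the table name below is just a placeholder):

%sql
-- this cell runs as SQL even though the notebook's default language is Python
SELECT COUNT(*) FROM my_table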

Having selected Notebook and provided a name and default language, you will be presented with an empty notebook. In the top left-hand corner you will see the word “Detached”. To use the notebook you will need to attach it to a Spark cluster or, if you haven’t done this before, create a cluster. The dropdown on “Detached” provides this option:-

image

This will take you to a page such as the one below. Clearly if you are just investigating you will want to minimise the cluster size for now.

image

Having created a cluster (which will take a few minutes) you can navigate back to the notebook, attach the notebook to the now running cluster, type a command and, using the small arrow on the right-hand side, execute the command to test everything is working.

image

What to do now, so many options. Well, let’s load some data and view it.

Databricks has its own file system, called the Databricks File System (DBFS), which will have been deployed for you. You can instead access Data Lake Store or Blob storage, but for now this will do.
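If you want a quick look around DBFS from a notebook, dbutils provides file system utilities; for example:

# list the root of the Databricks File System
display(dbutils.fs.ls("/"))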

Click on “Data” on the left-hand side then the “+” icon by Tables. This is a little counter-intuitive as it doesn’t look like it will lead to an upload option, but it does.

image

Browse to the file you want to upload, and the UI conveniently tells you where the file can be found in the file system.

image

Now the file can be accessed from the notebook. The syntax differs slightly depending on what language you choose; here, using Python, the data is read into a dataframe and then output to the screen.

image
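As a rough sketch, reading an uploaded CSV into a dataframe in a Python notebook looks something like this (the file name and path below are illustrative; use the path the upload UI showed you):

# read the uploaded file into a Spark dataframe
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/FileStore/tables/mydata.csv")

# display renders the dataframe as a table in the notebook
display(df)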

As the latest addition to the Azure analytics stable, Databricks comes with great promise. It’s notably well documented already despite only being in preview, and the UI is mainly intuitive even if it differs in style somewhat from other Azure analytics options. If you have used Jupyter notebooks before you will appreciate the notebooks interface as a great way to dive in and investigate data. Also, once it’s all deployed, its in-memory operation makes it feel fast compared to running small queries via, for instance, Hive queries on HDI clusters or U-SQL queries in Azure Data Lake Analytics.

If you have any questions or comments, let me know.

Loss Functions and Gradient Descent in Machine Learning

In an earlier blog I explained some of the basic building blocks of Neural Networks and Deep Learning (here). This was very high level and omitted a number of concepts which I wanted to explain but, for clarity, decided to leave until later. In this blog I will introduce Loss Functions and Gradient Descent; however, there are still many more concepts which will need to be explained later.

Loss Functions, also often called Cost Functions, are used to calculate the error between the known correct output and the actual output generated by a model.

Gradient Descent is an iterative optimization method for finding the minimum of a function. On each iteration the parameters in a model are adjusted in the direction of the negative gradient of the loss until the optimum parameters for the model are identified.

These are fundamental to understanding training models and are common to both supervised machine learning and deep learning.

An Example

A worked example is probably the easiest way to illustrate how a loss function and gradient descent are used together to train a simple model. The model is kept simple so that the focus can be on the methods and not the model. Let’s see how this works.

Imagine there are 5 observations of the weight vs price of a commodity. The objective is to train a model which can then be used to predict the price for any weight of the commodity.

The observations are:-

image

I plot the Weight vs the Price on a graph and observe the following:

image

Modelling this as a linear problem, the equation of a line is of course

y = Wx + b

where W is the slope of the line and b is the intercept of the line on the y axis.

To make the problem simpler, please accept the assumption that b = 0. This is logically reasonable, as the price of zero grams of a commodity should reasonably be zero, and this is supported by the data.

The method described here to train the model is to iteratively evaluate the model using a loss function and amend the parameters to reduce the error in the model. So the next step is to have a guess at a value for W; it doesn’t need to be a good guess. In machine learning initial values are often randomly created, so they are very unlikely to be anywhere near accurate on a first iteration. The first guess is shown in red below:

image

Now it’s necessary to evaluate how bad this model is; this is where the loss function comes in. The loss function used here is called Mean Squared Error (MSE). For each observed point the difference between the observed (actual) value and the estimated value is calculated (the error is represented by the green lines in the diagram below). The errors are squared and then the average of the squared errors is taken to create a numerical representation of the error. This error is plotted on a graph showing Error vs Slope (W). This Error graph will be used in the Gradient Descent part of the method.

image 

image
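A minimal sketch of the MSE calculation for the model y = W * x (the observations below are made up for illustration; the real ones are in the table above):

# observed weights (x) and prices (y) - illustrative values only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def mse(W):
    # mean of the squared differences between observed and estimated prices
    errors = [(y - W * x) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

print(mse(0.5))   # a poor guess for W gives a large error
print(mse(2.0))   # a guess near the true slope gives a much smaller error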

Following this, a small change to the value of W is made and the error is re-evaluated. In the graph below the original value for W is shown in blue and the new value in red. The error is once again plotted on the error graph. The error graph reveals that the error is smaller and therefore the adjustment to the value of W was in the correct direction to reduce the error; in other words, the model has been improved by the change.

image

image

Small increments are made to the value of W to cause a reduction in the size of the error, i.e. to reduce the value of the loss function. In other words, we want to descend the gradient of the curve until we find a minimum value for the loss function.

Continuing the example, see below how we have continued to zoom in on a solution after several iterations.

image

image

At a certain point continued changes in the same direction cause the model to become worse rather than improve.

image

image

At this point the optimal value for W can be identified; it’s where the gradient of the error curve reaches zero, or in other words the value of W corresponding to the lowest point on the graph (indicating the minimum error).

Summarising this then, the loss function is used to evaluate the model on each training run, and the output of the loss function is used on each iteration to identify the direction in which to adjust the model parameters. The optimum parameters create the minimum error in the model.
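Putting the two pieces together, here is a minimal gradient descent sketch for the y = W * x model (same illustrative data as above; the learning rate and number of iterations are arbitrary choices):

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

W = 0.0               # initial guess for the slope
learning_rate = 0.01

for step in range(200):
    # gradient of the MSE loss with respect to W
    grad = (-2.0 / len(xs)) * sum(x * (y - W * x) for x, y in zip(xs, ys))
    # step in the direction of the negative gradient
    W -= learning_rate * grad

print(W)   # settles close to the slope that minimises the loss (about 2 here)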

Going forward we need to apply these 2 principles to explain Backpropagation. Backpropagation is the method by which Neural Networks learn; it’s the setting of all the Weights and Biases in the network to achieve the closest output possible to the desired output. That is for another blog which I hope to bring to you soon.

Introduction to Deep Learning, Neural Network Basics

My main focus is Business Intelligence and Data Warehousing but recently I have been involved in the area of Machine Learning and more specifically Deep Learning to see if there are opportunities to use the technology from these fields to solve common BI problems.

Deep Learning uses interconnected networks of neurons called Neural Networks. These are loosely based on, or perhaps it would be better to say inspired by, networks of neurons in mammalian brains. For a more detailed definition see:- http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html

In recent years Neural Networks have been used to solve problems which conventional computer programs have struggled to do. They have been particularly successful in areas where problems are highly intuitive to humans but incredibly complex to describe, eg understanding speech or controlling a moving vehicle.

Let’s start then by looking at neurons and how they operate; they are, after all, the basic building block of a neural network. The simplest (and original) neuron is called the Perceptron; others which will be described here are the Sigmoid, Tanh and Rectified Linear Unit (ReLu). This is not an exhaustive list but it’s a good place to start.

Perceptron

Below is a representation of a perceptron. It has 3 inputs (X1, X2, X3) and one output; the inputs and the output are all binary values (0 or 1). It also has 3 Weights (W1, W2, W3) and a Bias represented by “b”.

image

The output of the perceptron can be 1 or 0. If the sum of the Inputs (X) multiplied by the Weights (W) is greater than the Bias then the Output will be 1, otherwise it will be 0.

Bias then is the threshold for the neuron to “fire” or in other words return an output of 1. If the bias is set to a high value then the neuron is resistant to firing, or it could be said to have a high threshold. It will need higher weighted inputs to fire. If the bias is set to a low number then the neuron has a low resistance to firing.

As an example then, consider the scenario where there are 3 inputs into the perceptron:-

W1 has a weighting 0.7

W2 has a weighting 0.4

W3 has a weighting 0.2

Let’s assume then that for this example the bias is set at 0.8. Now let’s work through some scenarios:

Scenario 1:-

X1 has a value 0

X2 has a value 1

X3 has a value 1

Evaluating this scenario:

Input = SUM((X1 * W1) + (X2 * W2) + (X3 * W3))

Input = SUM((0*0.7) + (1*0.4) + (1*0.2))

Input = 0.6

As 0.6 does not exceed 0.8 the output of the Perceptron will be 0

Scenario 2:-

X1 has a value 1

X2 has a value 0

X3 has a value 1

Evaluating this scenario:

Input = SUM((X1 * W1) + (X2 * W2) + (X3 * W3))

Input = SUM((1*0.7) + (0*0.4) + (1*0.2))

Input = 0.9

As 0.9 does exceed 0.8 the output of the Perceptron will be 1
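A minimal sketch of this perceptron in Python, using the weights, bias and scenarios above:

def perceptron(inputs, weights, bias):
    # output 1 if the weighted sum of the inputs exceeds the bias, otherwise 0
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > bias else 0

weights = [0.7, 0.4, 0.2]
bias = 0.8

print(perceptron([0, 1, 1], weights, bias))   # scenario 1: 0.6 does not exceed 0.8, output 0
print(perceptron([1, 0, 1], weights, bias))   # scenario 2: 0.9 exceeds 0.8, output 1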

This behaviour can be plotted as shown below: when the sum of the inputs multiplied by the weights exceeds the bias (here 8) the output is 1, but when it is less than the bias the output is zero:-

clip_image004


Other Neurons

The neurons below all behave in a similar way to the Perceptron: all can have multiple inputs and all have a single output. The output is always dependent on the inputs, weights and bias of the neuron. How they differ is in their activation function, i.e. how they respond to the inputs.

Sigmoid

The Sigmoid Neuron has decimal inputs and outputs which are numbers between 0 and 1. The shape of the activation function is smoother than the stepwise shape of the perceptron, as shown below.

clip_image006

Explaining this in a little more detail: when the sum of the Weights multiplied by the Inputs is much lower than the bias, the output is close to zero. As the bias is approached the output begins to rise, until it is 0.5 at the point of the bias (still here the value of 8), after which, as the sum of the weights and inputs increases, it continues upwards towards the value of one. This subtle difference gives Sigmoid neurons a significant advantage over Perceptrons, as networks of Sigmoids are more stable in gradient-based learning; I’ll explain this in another blog later.

Tanh

The Tanh neuron is similar to the Sigmoid neuron except that it’s rescaled to have outputs in the range from -1 to 1.

clip_image008

Choosing between a Sigmoid and a Tanh is sometimes a matter of trial and error; however, it is sometimes said that Tanh neurons learn better as Sigmoids suffer from saturation more easily. Again, I would have to expand the scope of this blog quite a lot to explain this though.

ReLu

The ReLu neuron (short for Rectified Linear Unit) returns 0 until the Bias is reached, then increases in a linear fashion as shown below.

clip_image010

ReLu neurons are favoured over Sigmoid neurons in feed forward neural networks (see below) because they are less susceptible to a problem known as Learning Slowdown.
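For reference, a quick sketch of the three activation functions described above (here z stands for the weighted input to the neuron, offset by the bias):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # output between 0 and 1

def tanh(z):
    return np.tanh(z)                  # output between -1 and 1

def relu(z):
    return np.maximum(0.0, z)          # 0 for negative inputs, linear above

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))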


A Simple Feed Forward Network

Neurons can be linked together in different ways; the simplest design to explain is the Feed Forward Network. In the diagram below there are 3 layers of neurons, with every neuron in a layer connected to every neuron in the next layer. Every neuron has an individual bias setting and every connection has an individual weight.

clip_image012

This is called a feed forward network because the flow from inputs to outputs is unidirectional; there are no feedback loops here, and the network is stateless as the output is calculated from the input without affecting the network.
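As a rough sketch, a forward pass through a single fully connected layer of Sigmoid neurons might look like this (random weights and biases stand in for a trained network; the bias is added here, following the usual convention):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(inputs, weights, biases):
    # one row of weights per neuron in the layer, one column per input
    z = np.dot(weights, inputs) + biases
    return sigmoid(z)

inputs  = np.array([0.5, 0.1, 0.9])    # example input values
weights = np.random.randn(4, 3)        # a layer of 4 neurons, each with 3 input weights
biases  = np.random.randn(4)           # one bias per neuron

hidden = layer_forward(inputs, weights, biases)   # these outputs feed the next layer
print(hidden)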

The network is trained by adjusting the Weights and Biases on each neuron until the desired outputs are produced for the provided inputs. This training is usually done working from the outputs to the inputs with a method called Back Propagation. Back Propagation is a whole subject which can be explored more here:- https://pdfs.semanticscholar.org/4d3f/050801bd76ef10855ce115c31b301a83b405.pdf

There are 3 layers in a feed forward neural network: the input layer, the output layer, and the layer in the middle which is called the hidden layer and which may itself be made up of several layers of neurons. In the diagram below the hidden layer has 3 layers of neurons. A neural network is considered “deep” if there are 2 or more hidden layers in the network.

clip_image014

In summary then, this blog introduces a simple neural network and explains a little of how neurons work. There are quite a few more concepts which need to be introduced to get the full picture, but hopefully you found this interesting and informative, and I’ll try to fill in some of the gaps in future blogs.

Introduction to TensorFlow

TensorFlow is an open source software library for Deep Learning that was released by Google in November 2015. It’s Google’s second-generation deep learning system, succeeding the DistBelief program.

Deep learning is a sub-category of machine learning. Deep learning uses layers of interconnected neurons to find patterns in raw data and create data representations from it. See my previous blog on neurons and networks here:- http://blogs.adatis.co.uk/hughfreestone/post/Introduction-to-Deep-Learning-Neural-Network-Basics

The networks automatically learn by adapting and correcting themselves, fitting patterns observed in the data. One of the key advantages over conventional machine learning is that they don’t require the domain expertise and manual feature engineering usually associated with machine learning.

Installation

TensorFlow can be installed on Mac OS X, Ubuntu or Windows computers by downloading precompiled executables, or on Mac OS X and Ubuntu by downloading the source code and compiling it locally. All can be found at:- https://www.tensorflow.org/install/

The compiled versions are provided either with CPU support only or with GPU support. I chose CPU support only initially; this is the simplest install path and therefore the fastest way to get up and running. It wasn’t long before I needed to upgrade to the GPU support version: training deep learning models is computationally expensive, and you either need to install the GPU support version or be a very patient person. For comparison I ran the same experiment on 2 machines:

Machine 1 - 4 core i7 2.8GHz CPU 16 GB RAM SSD no external graphics card

Machine 2 - 4 core i3 3.3GHz CPU 4 GB RAM HDD NVIDIA GTX760 graphics card. This is an entry level graphics card.

On the machine with the graphics card, running the experiment on the GPU took 426 seconds. On the machine with no GPU it took 3466 seconds.

Luckily upgrading from CPU to GPU version is simple so it’s not worth getting hung up on if you just want to dip in.

The TensorFlow libraries are best accessed through the Python APIs, but if you prefer you can access them through C, Java or Go.


Lets Get Started

TensorFlow represents machine learning algorithms as computational graphs. A computational graph is made up of a set of entities (commonly called nodes) that are connected via edges. To understand the graph, it’s sometimes helpful to think of the data as flowing from node to node via the edges, being operated on as it goes.

So let’s look at a couple of simple TensorFlow programs, then we can describe the components which are created and what they do. Below is a simple example which is taken from an interactive Python session:

>>> import tensorflow as tf

>>> hello = tf.constant('Hello, TensorFlow!')

>>> sess = tf.Session()

>>> print(sess.run(hello))

The first line imports TensorFlow.

The 2nd line builds the computational graph; the node called “hello” is defined in this step.

The 3rd line creates a TensorFlow session.

The final line executes the graph to calculate the value of the node “hello”, which is passed in as a parameter to the session’s run method.


Let’s now consider this other example:

>>> import tensorflow as tf

>>> a = tf.constant(3, tf.int64)

>>> b = tf.constant(4, tf.int64)

>>> c = tf.add(a,b)

>>> sess = tf.Session()

>>> result = sess.run(c)

>>> print(result)

7

In this example 3 nodes are created: a, b and c. This would be represented in the computational graph as below:-

image

In this graph node c is defined as being an Add operation of the values from nodes a and b. When the session’s run method is called asking for the computation of c, TensorFlow works backwards down the graph, computing the precedent nodes a and b to enable it to compute node c.

From this you can see why TensorFlow programs are often described as consisting of two sections:

  • Building a computational graph
  • Executing a computational graph

as the computational graph is constructed before being “run” in a session.


TensorFlow Principal Elements

We can break TensorFlow down into a set of its principal elements:

  • Operations
  • Tensors
  • Sessions

Operations

Operations are represented by nodes that handle the combination or transformation of data flowing through the graph. They can have zero or more inputs (an example of an operation with zero inputs is a constant), and they can produce zero or more outputs. A simple operation might be a mathematical function, as above, but an operation may instead represent control flow or file I/O.

Most operations are stateless: values are stored during the running of the graph and are then disposed of. Variables, however, are a special type of operation that maintains state. Under the covers, adding a variable to a graph adds 3 separate operations: the variable node, a constant to produce an initial value, and an initialiser operation that assigns the value.
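A small sketch of this using the TensorFlow 1.x API shown in this post (the variable name is arbitrary):

import tensorflow as tf

counter = tf.Variable(0, name="counter")      # the variable node, with an initial value of 0
increment = tf.assign(counter, counter + 1)   # an operation that updates the variable's state

init = tf.global_variables_initializer()      # the initialiser operation mentioned above

sess = tf.Session()
sess.run(init)
print(sess.run(increment))   # 1
print(sess.run(increment))   # 2 - the variable keeps its state between runs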

Tensors

From a mathematical perspective, tensors are multi-dimensional structures for data elements. A tensor may have any number of dimensions, and this number is called its rank: a scalar has a rank of zero, a vector a rank of one, a matrix a rank of two, and so on upwards without limit. A tensor also has a shape, which is a tuple describing the tensor’s size, i.e. the number of components in each direction. All the data in a tensor must be of the same data type.
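A few quick examples of rank and shape, using the same TensorFlow 1.x style as the earlier snippets:

import tensorflow as tf

scalar = tf.constant(3)                   # rank 0, shape ()
vector = tf.constant([1, 2, 3])           # rank 1, shape (3,)
matrix = tf.constant([[1, 2], [3, 4]])    # rank 2, shape (2, 2)

print(matrix.shape)               # (2, 2)

sess = tf.Session()
print(sess.run(tf.rank(matrix)))  # 2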

From TensorFlow’s perspective, tensors are objects used in a computational graph. The tensor doesn’t hold data itself; it’s a symbolic handle to the data flowing from operation to operation, so it can also be thought of as an edge in the computational graph, as it holds the output of one operation which forms the input of the next.

Sessions

A session is a special environment for the execution of operations and the evaluation of tensors; it is responsible for the allocation and management of resources. The session provides a run routine which takes as an input the nodes of the graph to be computed. TensorFlow works backwards from the requested node, calculating all preceding nodes which are required.

When run, the nodes are assigned to one or many physical execution units which can be executed on CPU and GPU devices that can be physically located on a single machine or distributed across multiple machines.


Summary

So that’s a start: TensorFlow is a software library designed as a framework for Deep Learning which uses a computational graph to process data in the form of Tensors.