Matt Willis' Blog

Databricks – Cluster Sizing


Setting up clusters in Databricks presents you with a raft of different options. Which cluster mode should I use? What driver type should I select? How many worker nodes do I need? In this blog I will try to answer those questions and give some insight into how to set up a cluster that meets your needs, allowing you to save money and achieve low running times. To do this I will first describe and explain the different options available, then we will run through some experiments, before finally drawing some conclusions to give you a deeper understanding of how to set up your cluster effectively.

Cluster Types

Databricks has two different types of clusters: Interactive and Job. You can see these when you navigate to the Clusters homepage, where all clusters are grouped under either Interactive or Job. Which one to use depends on your specific scenario.

Interactive clusters are used to analyse data with notebooks, giving you much more visibility and control. They should be used in the development phase of a project.

Job clusters are used to run automated workloads via the UI or API. Jobs can be used to schedule notebooks; for most projects it is recommended that they are used in production, and that a new cluster is created for each run of each job.

For the experiments in this blog we will use existing, predefined interactive clusters, so that we are fairly assessing the performance of each configuration rather than start-up time.

Cluster Modes


When creating a cluster, you will notice that there are two cluster modes. Standard is the default and can be used with Python, R, Scala and SQL. The other option is High Concurrency, which provides maximum resource utilisation, isolates each notebook by creating a new environment for it, and supports secure sharing between multiple concurrently active users. Sharing is accomplished by pre-empting tasks to enforce fair sharing between different users. Pre-emption can be tuned in a variety of ways. To enable it, you must be running Spark 2.2 or above and add the following settings to the Spark Config. It should be noted that High Concurrency does not support Scala.


Enabled – Self-explanatory, required to enable pre-emption.

Threshold – The fair share fraction guaranteed. 1.0 will aggressively attempt to guarantee perfect sharing, while 0.0 disables pre-emption. The default is 0.5, meaning at worst a user will get half of their fair share.

Timeout – The amount of time a user is starved before pre-emption starts. A lower value gives more interactive response times, at the expense of cluster efficiency. Recommended to be between 1 and 100 seconds.

Interval – How often the scheduler will check for pre-emption. This should be less than the timeout above.
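Put together, the four settings above translate into Spark Config entries. Below is a minimal sketch of what the Spark Config might look like with pre-emption enabled; the property names and values shown are illustrative and should be checked against the Databricks documentation for your runtime:

```ini
spark.databricks.preemption.enabled true
spark.databricks.preemption.threshold 0.5
spark.databricks.preemption.timeout 30s
spark.databricks.preemption.interval 5s
```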

Driver Node and Worker Nodes

Clusters have a single driver node and multiple worker nodes. The driver and worker nodes can have different instance types, but by default they are the same. The driver node runs the main function and executes various parallel operations on the worker nodes. The worker nodes read from and write to the data sources.

When creating a cluster, you can either specify an exact number of workers or specify a minimum and maximum range and allow the number of workers to scale automatically. When autoscaling is enabled, the total number of workers will sit between the min and max. If the cluster has pending tasks it scales up; once there are no pending tasks it scales back down again. This all happens whilst a load is running.
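The same choice appears if you create a cluster programmatically through the Databricks Clusters API: you supply either a fixed number of workers or an autoscale range. A hedged sketch of the relevant fragment of the request body is below (the cluster name and other field values are illustrative):

```json
{
  "cluster_name": "etl-cluster",
  "spark_version": "4.3.x-scala2.11",
  "node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  }
}
```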


If you’re going to be playing around with clusters, then it’s important you understand how the pricing works. Databricks uses something called a Databricks Unit (DBU), which is a unit of processing capability per hour, based upon different tiers; more information can be found here. You will be charged for your driver node and each worker node per hour.

You can find out much more about pricing Databricks clusters by going to my colleague’s blog, which can be found here.


For the experiments I wanted to use a medium and a large dataset to make it a fair test. I started with the People10M dataset, with the intention of this being the larger dataset. I created some basic ETL to put it through its paces, so we could effectively compare different configurations. The ETL does the following: read in the data, pivot on the decade of birth, convert the salary to GBP and calculate the average, grouped by gender. The People10M dataset wasn’t large enough for my liking; the ETL still ran in under 15 seconds. Therefore, I created a for loop to union the dataset to itself 4 times, taking us from 10 million rows to 160 million rows. The code used can be found below:

# Import relevant functions.

from pyspark.sql.functions import year, floor

# Read in the People10m table.

people = spark.sql("select * from clusters.people10m ORDER BY ssn")

# Explode the dataset: each union doubles the row count, taking us from 10 million to 160 million rows.

for i in xrange(0, 4):

     people = people.union(people)

# Get the decade from birthDate and convert the salary to GBP.

people = people.withColumn('decade', floor(year("birthDate") / 10) * 10).withColumn('salaryGBP', floor(people.salary.cast("float") * 0.753321205))

# Pivot on the decade of birth and sum the converted salary, grouped by gender.

people.groupBy("gender").pivot("decade").sum("salaryGBP").show()

To push it through its paces further and to test parallelism, I used threading to run the above ETL 5 times, bringing the running time to over 5 minutes. Perfect! The following code was used to carry out the orchestration:

from multiprocessing.pool import ThreadPool

# Run the ETL notebook 5 times in parallel.

pool = ThreadPool(10)

pool.map(

     lambda path: dbutils.notebook.run(

          path,

          timeout_seconds = 1200),

     ["/Users/ Sizing/PeopleETL160M"] * 5)



To test the different options available to us, I created 5 different cluster configurations. For each of them the Databricks runtime version was 4.3 (includes Apache Spark 2.3.1, Scala 2.11) with Python 2.

Default – This was the default cluster configuration at the time of writing, which is a worker type of Standard_DS3_v2 (14 GB memory, 4 cores), driver node the same as the workers and autoscaling enabled with a range of 2 to 8. Total available is 112 GB memory and 32 cores.

Auto scale (large range) – This is identical to the default but with an autoscaling range of 2 to 14, therefore the total available is 196 GB memory and 56 cores. I included this to try and understand just how effective the autoscaling is.

Static (few powerful workers) – The worker type is Standard_DS5_v2 (56 GB memory, 16 cores), driver node the same as the workers and just 2 worker nodes. Total available is 112 GB memory and 32 cores.

Static (many workers) – The same as the default, except there are 8 static workers. Total available is 112 GB memory and 32 cores, which is identical to the Static (few powerful workers) configuration above. This will allow us to understand whether a few powerful workers or many weaker workers is more effective.

High Concurrency – A cluster mode of ‘High Concurrency’ is selected, unlike all the others which are ‘Standard’. This results in a worker type of Standard_DS13_v2 (56 GB memory, 8 cores), driver node is the same as the workers and autoscaling enabled with a range of 2 to 8. Total available is 448 GB memory and 64 cores. This cluster also has all of the Spark Config attributes specified earlier in the blog. Here we are trying to understand when to use High Concurrency instead of Standard cluster mode.

The results can be seen below, measured in seconds, with a row for each configuration described above. I did three different runs and calculated the average and standard deviation; the rank is based upon the average. Run 1 was always done in the morning, Run 2 in the afternoon and Run 3 in the evening, to make the tests fair and reduce the effects of other clusters running at the same time.


Before we move on to the conclusions, I want to make one important point: different cluster configurations work better or worse depending on the dataset size. So don't discredit the smaller dataset; when you are working with smaller datasets you cannot simply apply what you know about the larger ones.

Comparing the default to the auto scale (large range) configuration shows that, with a large dataset, allowing for more worker nodes really does make a positive difference. With just 1 million rows the difference is negligible, but with 160 million rows it is on average 65% quicker.

Comparing the two static configurations, few powerful worker nodes versus many less powerful worker nodes, yielded some interesting results. Remember, both have identical memory and cores. With the small dataset, few powerful worker nodes resulted in quicker times, the quickest of all configurations in fact. With the larger dataset the opposite is true: having more, less powerful workers is quicker. Whilst this is a fair observation to make, it should be noted that the static configurations do have an advantage with these relatively short loading times, as autoscaling takes time.

The final observation I’d like to make is about the High Concurrency configuration: it is the only configuration to perform comparatively better on the larger dataset than the smaller one. It is the slowest with the smaller dataset by quite a significant margin, yet with the largest dataset it is the second quickest, only losing out, I suspect, to the autoscaling. High Concurrency isolates each notebook, thus enforcing true parallelism. Why it fares so much better on the large dataset requires further investigation and experiments, but it is certainly useful to know that with large datasets, where execution time is important, High Concurrency can make a positive impact.

To conclude, I’d like to point out that the default configuration is nearly the slowest for both dataset sizes, so it is worth spending time contemplating which cluster configuration suits your solution, because choosing the correct one will make runtimes significantly quicker.

An agile waterfall – An approach to transitioning from a waterfall to an agile methodology

Many large organisations are embarking on the journey from a waterfall methodology to an agile methodology. Sometimes this journey is underestimated. At Adatis this is a journey we have assisted clients with and in this blog I will share with you what we have learnt.

At Adatis we exclusively work in a scrum agile methodology; we work with agile clients, waterfall clients and, ever increasingly, clients in the process of transitioning from waterfall to agile. This transition can be difficult, leading to both methodologies being in practice at the same time, which can cause friction.

This blog will outline a hybrid approach, between waterfall and agile, which will assist in crossing that bridge between the two methodologies, making everyone’s life easier.

To better explain the technique, I have gone through an example using the scrum agile methodology. This blog assumes some basic understanding of the scrum methodology, if you do not have this, please refer to this website to get you up to speed.

How to navigate the agile waterfall


First things first: education. As is the case with most things in life, if you don’t understand something, it can be daunting! Often people concentrate on the buzzwords involved in agile and relate them back to waterfall, without really understanding the true definitions of the terms. Make sure everyone understands exactly what agile is, how it works and, crucially, why it works.

Sprint ahead

In true agile fashion, you should focus on one sprint at a time, but this is where the compromising starts and what defines this hybrid approach. As always, carry out detailed sprint planning for the current sprint. Everything else on your backlog should have high level, worst case estimates. Using those estimates, combined with your average capacity for a sprint and your prioritised backlog, you should be able to work out which story point fits into which sprint. This is not fixed and will change over time, but the key is that there is always a plan in place, right up to the deadline.


In the diagram above, we have the prioritised backlog on the right-hand side, with the estimate in days in brackets. Our average capacity is 12 days per sprint. We pick story points off the backlog in order, and try to squeeze them into each sprint. For example, the customer dimension takes 8 days to complete, so easily fits into sprint 1. We have 4 days left, our next highest priority is the product dimension, we squeeze in as much as we can into sprint 1. There will still be 1 day left of the product dimension, making it our first item in sprint 2. We continue to do this until all story points fit within sprints.
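The packing described above is just a greedy fill: take items off the prioritised backlog in order, and split any item that does not fit into the next sprint. A minimal sketch in Python, using the hypothetical backlog and estimates from the example (customer dimension 8 days, product dimension 5 days, and so on):

```python
# Greedy sprint packing: fill each sprint to capacity, splitting items
# across sprint boundaries when they do not fit.
backlog = [("Customer dimension", 8), ("Product dimension", 5),
           ("Sales fact", 10), ("Reporting", 7)]   # prioritised, estimates in days
capacity = 12                                      # average days per sprint

sprints = [[]]
remaining = capacity
for item, days in backlog:
    while days > 0:
        take = min(days, remaining)
        sprints[-1].append((item, take))
        days -= take
        remaining -= take
        if remaining == 0:        # sprint full, open the next one
            sprints.append([])
            remaining = capacity

for number, sprint in enumerate(sprints, start=1):
    print("Sprint", number, sprint)
```

As in the diagram, the customer dimension fills 8 of sprint 1's 12 days, and the product dimension is split, with its final day carried over as the first item of sprint 2.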

Once it’s decided which story point fits into which sprint, you must calculate a more detailed capacity, featuring people, so that the sprint duration can be decided. Taking the sprint duration, you can start to map out dates, which will put you on a path to start satisfying the waterfall methodology.


Contingency, deprioritisation and delaying deadlines

If from here to the end of the project every single sprint goal is met then the project will be a roaring success. However, this will not always be the case and steps should be taken to proactively deal with potential issues.

To allow for minor issues, contingency should be added. For major issues, expectations should be set and bad news delivered as soon as possible. If a sprint goal is not met, and therefore carries over to the next sprint, this has ramifications. Do not fall into the trap of thinking you can make that time back later in the project; all these little setbacks add up. You must face up to it now, and in this scenario there are two choices. Either you add a new sprint and reshuffle everything, thus extending the deadline; or you deprioritise a story point, and everyone accepts it will no longer be delivered as part of the final solution.

It is perhaps that last paragraph that is most important and most critical to this hybrid approach: mapping everything out in a structured waterfall plan built on top of the agile methodology. This allows you to utilise agile, staying flexible and dynamic, and to utilise waterfall, planning ahead and identifying and dealing with problems early on. Thus the two competing methodologies play to their strengths and work well together.

Using R Tools for Visual Studio (RTVS) with Azure Machine Learning

Azure Machine Learning

Whilst in R you can implement very complex Machine Learning algorithms, for anyone new to Machine Learning I personally believe Azure Machine Learning is a more suitable tool for being introduced to the concepts.

Please refer to this blog where I have described how to create the Azure Machine Learning web service I will be using in the next section of this blog. You can either use your own web service or follow my other blog, which has been especially written to allow you to follow along with this blog.

Coming back to RTVS we want to execute the web service we have created.

You need to add a settings JSON file. Add an empty JSON file titled settings.json to C:\Users\<your name>\Documents\.azureml. Handy tip: if you ever want a folder name beginning with a dot in Windows Explorer, you must also place a dot at the end of the name, which Windows Explorer will then remove. So, for example, if you want a folder called .azureml you must name it .azureml. in Windows Explorer.

Copy and paste the following code into the empty JSON file, making sure to enter your Workspace ID and Primary Authorization Token.


{

"id" : "<your Workspace ID>",

"authorization_token" : "<your Primary Authorization Token>",

"api_endpoint" : "",

"management_endpoint" : ""

}


You can get your Workspace ID by going to Settings > Name, and the Primary Authorization Token by going to Settings > Authorization Tokens. Once you’re happy, save and close the JSON file.

Head back into RTVS; we’re ready to get started. There are two ways to proceed: either I will take you through line by line, or you can use an R script containing a function, allowing you to take a shortcut. Whichever option you take, the result is the same.

Running the predictive experiment in R – Line by line

Copy and paste each line into the console.

Firstly, a bit of setup. Presuming you’ve installed the devtools package as described on the GitHub page for the download, load AzureML and connect to the workspace specified in settings.json. To do this use the code below:

## Load the AzureML package.

library(AzureML)

## Load the workspace settings using the settings.JSON file.

workSpace <- workspace()

Next we need to set the web service. This can be any web service created in Azure ML; for this blog we will use the web service created in my earlier blog. The code is as follows:

## Set the web service created in Azure ML.

automobileService <- services(workSpace, name = "Automobile Price Regression [Predictive Exp.]")

Next we need to define the correct endpoint, this can easily be achieved using:

## Set the endpoint from the web service.

automobileEndPoint <- endpoints(workSpace, automobileService)

Everything is set up and ready to go, except we need to define our test data. The test data must be in exactly the same format as the source data of your experiment: the same number of columns, with the same column names. Even include the column you are predicting, entering just a 0 or leaving it blank. Below is the test data I used:


This will need to be loaded into R and converted into a data frame. To do so use the code below, making sure the path points to your test data.

## Load and set the testing data frame.

automobileTestData <- data.frame(read.csv("E:\\OneDrive\\Data Science\\AutomobilePriceTestData.csv"))

Finally we are ready to do the prediction and see the result! The final line of code needed is:

## Send the test data to the web service and output the result.

consume(automobileEndPoint, automobileTestData)

Running the predictive experiment – Short cut

Below is the entire script; paste it into the top left R script window.

automobileRegression <- function(webService, testDataLocation) {

## Load the AzureML package.

library(AzureML)

## Load the workspace settings using the settings.JSON file.

amlWorkspace <- workspace()

## Set the web service created in Azure ML.

automobileService <- services(amlWorkspace, name = webService)

## Set the endpoint from the web service.

automobileEndPoint <- endpoints(amlWorkspace, automobileService)

## Load and set the testing data frame.

automobileTestData <- data.frame(read.csv(testDataLocation))

## Send the test data to the web service and output the result.

consume(automobileEndPoint, automobileTestData)

}
Run the script by highlighting the whole function and pressing Ctrl + Enter. Then run the function by typing the code below into the console:

automobileRegression("Automobile Price Regression [Predictive Exp.]","E:\\OneDrive\\Data Science\\AutomobilePriceTestData.csv")

Where the first parameter is the name of the Azure ML web service and the second is the path of the test data file.

The Result

Both methods should give you the same result: an output of a data frame displaying the test data with the predicted value:


Wahoo! There you have it, a predictive analytic regression Azure Machine Learning experiment running through Visual Studio… the possibilities are endless!

Introduction to R Tools for Visual Studio (RTVS)


This blog is not looking at one or two exciting technologies, but THREE! Namely Visual Studio, R and Azure Machine Learning. We will be looking at bringing them together in harmony using R Tools for Visual Studio (RTVS).


As this blog touches on a whole host of technologies, I won’t go into much detail on how to set each one up. Instead I will provide you with a flurry of links containing all the information you need.

Here comes the flurry…!

· Visual Studio 2015 with Update 1 – I hope anyone reading this is familiar with Visual Studio, but to piece all these technologies together version 2015 with Update 1 is required, look no further than here:

· R – Not sure exactly what version is needed but just go ahead and get the latest version you can, which can be found here:

· Azure Machine Learning – No installation required here, yay! But you will need to set up an account if you have not done so already, this can be done here

· R Tools for Visual Studio - More commonly known as RTVS. The name is fairly self-explanatory but it allows you to run R through Visual Studio. If you have used R and Visual Studio separately before it will feel strangely familiar. Everything you need to download, install and set up can be found here:

· Devtools Package - The final installation step is a simple one. Installing the correct R packages to allow you to interact with Azure ML. If you’ve used R to interact with Azure ML before you probably have already done this step, but for those who have not, all the information you will need to do so can be found here:

Introduction to RTVS

Once all the prerequisites have been installed it is time to move onto the fun stuff! Open up Visual Studio 2015 and add an R Project: File > Add > New Project and select R. You will be presented with the screen below, name the project AutomobileRegression and select OK.


Microsoft have done a fantastic job of realising that the settings and toolbars required for R are very different to those required elsewhere in Visual Studio, so they have split them out and made it very easy to switch between the two. To switch to the settings designed for R, go to R Tools > Data Science Settings; you’ll be presented with two pop-ups, select Yes on both to proceed. This will allow you to use all those nifty shortcuts you have learnt in RStudio. Any time you want to go back to the original settings you can do so by going to Tools > Import/Export Settings.

You should now be looking at a screen similar to the one below:


This should look very recognisable to anyone familiar with R:


For those not familiar, the top left window is the R script, this will be where you do your work and what you will run.

Bottom left is the console, this allows you to type in commands and see the output, from here you will run your R scripts and test various functions.

Top right is your environment, this shows all your current objects and allows you to interact with them. You can also change to History, which displays a history of the commands used so far in the console.

Finally the bottom right is where Visual Studio differs from RStudio a bit. The familiar Solution Explorer is visible within Visual Studio and serves its usual function. Visual Studio does contain R Plot and R Help though, which both also feature in RStudio. R Plot will display plots of graphs when appropriate. R Help provides more information on the different functions available within R.

Look for my next blog, which will go into more detail on how to use RTVS.

Azure ML Regression Example - Part 3 Deploying the Model

In the final blog of this series we will take the regression model we have created earlier in the series and make it accessible so it can be consumed by other programs.

Making the experiment accessible to the outside world

The next part of the process is to make the whole experiment accessible to the world outside of Azure ML. To do so you need to create a web service. This can be achieved by clicking the Set Up Web Service button next to Run and then selecting Predictive Web Service [Recommended]. The experiment will change in front of your eyes and you should be left with a canvas looking similar to the one displayed below.


If you would like to get back to your training experiment at any time you can do so by clicking Training experiment in the top right corner. You can then run the predictive web service again to update the predictive experiment.

Whilst in the predictive experiment window run the experiment once again and then click Deploy Web Service. Having done this, you should be displayed with the below screen:


Select Excel 2013 or later in the same row as REQUEST/RESPONSE. Click the tick to download the Excel document, open it and click Enable Editing; you will see something like the image below. If you are using Excel 2010 feel free to follow along, it will be fairly similar, but not identical.


Click Automobile Price Regression [Predictive Exp.] to begin. Click Use sample data to quickly construct a table with all the appropriate columns and a few examples. Feel free to alter the sample data to your heart’s content. Once you’re happy with your data, highlight it and select it as the Input range. Choose an empty cell as the Output. Click Predict. You should see something similar to below:


You should now be able to see a replica table with a Scored Labels column displaying the estimated price for each row.

Go ahead and rerun the experiment putting in whatever attribute values you desire. This experiment will now always return a Scored Label relating to the price based upon the training model.

What next?

This has just been a toe dip into the world of Azure ML. For more information on getting started with Azure ML, track down a copy of Microsoft Azure Essentials – Azure Machine Learning by Jeff Barnes; it is a great starting point.

If you want to know what you can do with Azure ML and how to start using Azure ML within other programs then check out my upcoming blog which will show you how to integrate Azure ML straight into Visual Studio.

Azure ML Regression Example - Part 2 Training the Model

This is where the real fun begins! In this blog we will get to the heart of machine learning and produce a regression model.

Training the model

We now need to split the data into training and testing sets. This is so we can train the algorithm using the training set and then test the accuracy of the prediction using the testing set. To do so, search for ‘split’ in the Search experiment items search bar. Drag the Split Data task onto the canvas. Under the properties is a property called Fraction of rows in the first output dataset; this lets you choose what percentage of rows is used for training and what percentage is held back to test the prediction accuracy. Let’s set it to 0.9, meaning 90% will be used for training and 10% for testing. Leave the other properties as they are. The properties window should look like the below image:


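If you want to reason about what the Split Data task is doing, the same 90/10 split can be sketched in plain Python. This is an illustration of the idea, not Azure ML's actual implementation, and the 100-row dataset is made up:

```python
import random

rows = list(range(100))        # stand-in for the rows of the automobile dataset
random.seed(42)                # a fixed seed, like the task's random seed property
random.shuffle(rows)

fraction = 0.9                 # Fraction of rows in the first output dataset
cut = int(len(rows) * fraction)
train, test = rows[:cut], rows[cut:]

print(len(train), len(test))   # 90 rows for training, 10 held back for testing
```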
Now let’s get to the very fundamental core of machine learning, the algorithm itself. For this we will use one of my personal favourites, a Boosted Decision Tree. Decision trees frequently produce highly accurate predictions and are great for discovering more about your data from the leaves of the tree. Go to the item toolbox, clear the search box, navigate to Machine Learning > Initialize Model > Regression and drag the Boosted Decision Tree Regression item onto the left side of the canvas. Change the properties to match the values below; these were selected after using a Sweep Parameters item to work out the optimal parameter settings.

Create trainer mode – Single Parameter

Maximum number of leaves per tree –

Minimum number of samples per leaf mode –

Learning rate –

Total number of trees constructed –

Random number seed –

Allow unknown categorical levels –
Drag on the Train Model item, which is located under Train on the item toolbox. Join up the appropriate output and input ports so your canvas looks like the image below.


Click on the Train Model item and select Launch column selector in the properties window. Here you are selecting the column you want to predict, so just select price.

Now we need to predict the results for the testing data. To do so, drag on a Score Model item (located under Score) and connect the Train Model and Split Data items to the input nodes of the Score Model. Once complete, hit Run to run the experiment; your canvas should be illuminated with green ticks like the image below.


Now let’s have a look and see if this algorithm has actually produced any decent results. Right click on the output node of the Score Model item and click Visualise. You should see something similar to the image below.


This table displays the values for every piece of test data. If you scroll all the way to the right you should see two columns: price and Scored Labels. Price is the actual price of the car; Scored Labels is the price the regression algorithm has predicted. The numbers are quite close, which is exactly the result we’re after. If you click on the Scored Labels column header you can conduct some further analysis: scrolling down and making sure ‘compare to’ is set to price, you can view a scatter plot of the two values. I have done so in the image above, and looking at the scatter plot you can see a strong positive correlation with only a few outliers.
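The scatter plot check can also be done numerically. Below is a small sketch that computes the Pearson correlation between actual and predicted prices; the five price pairs are made up for illustration, standing in for the real scored results:

```python
import math

# Hypothetical actual prices and the model's Scored Labels for five cars.
actual = [13950.0, 16500.0, 13495.0, 17450.0, 15250.0]
scored = [14100.0, 16200.0, 13900.0, 17000.0, 15600.0]

# Pearson correlation: covariance divided by the product of standard deviations.
mean_a = sum(actual) / len(actual)
mean_s = sum(scored) / len(scored)
cov = sum((a - mean_a) * (s - mean_s) for a, s in zip(actual, scored))
sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual))
sd_s = math.sqrt(sum((s - mean_s) ** 2 for s in scored))
r = cov / (sd_a * sd_s)

print(round(r, 3))   # a value close to 1.0 indicates a strong positive correlation
```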

Your Azure Machine Learning regression algorithm is now complete! In the next blog we will be deploying the model so we can use it outside of Azure Machine Learning and really put what we have created into practice.

Azure ML Regression Example - Part 1 Preparing the File

This blog series will give you a quick run through of Azure Machine Learning and by the end of it will have you publishing an analytical model which can be consumed by various external programs. This particular blog will focus on just preparing the file before we will look at training the model in the next blog.

Source file

The source file we will be using can be found here. The column names are fairly self-explanatory, but if you would like a little more information please see the dataset's documentation. Make sure you download the file to somewhere you can easily access later.

Please open the file, convert all column headers containing a hyphen to camel case, for example ‘normalized-losses’ to ‘normalizedLosses’, and save it.

Once you are logged into Microsoft Azure Machine Learning Studio you must add a dataset. To do so, go to Datasets > New > From Local File. Click Browse, navigate to the source file you downloaded earlier and select it. The rest of the fields should populate automatically and display values similar to the image below. When you are happy, click the tick to add the dataset.


Creating the experiment

Next, navigate to New > Experiment > Blank Experiment. You will be presented with a blank canvas. Rename it by overwriting ‘Experiment created on’ followed by today’s date with ‘Automobile Price Regression’. You should be looking at a screen similar to below:


Learning your way around

The far left blue strip allows you to carry out navigation within Azure Machine Learning Studio. It gives you access to Projects, Experiments, Web Services, Notebooks, Datasets, Trained Models and Settings. Projects and Notebooks are in preview so we won’t discuss these in this blog. Experiments lists all the experiments you have created. Web Services lists all the experiments you have published; more information on this will be provided later. Trained Models are the complete predictive models that have been trained already; again, more information will be provided later. Finally, Settings is fairly self-explanatory and allows you to view information such as various IDs, tokens, users and information about your workspace.

The white toolbox to the right of the navigation pane is the experiment item toolbox. This contains all the datasets and modules needed to create a predictive analytics model. The toolbox is searchable and the items can be dragged onto the canvas.

The canvas describes the experiment and shows all the datasets and modules used. The datasets and modules can be moved freely and lines are drawn between input and output ports to enforce the ordering.

The properties pane allows you to modify certain properties of a dataset or module by clicking on the item and modifying the chosen property in the pane.

Cleaning the source file

First up, expand Saved Datasets > My Datasets within the experiment item toolbox and drag your newly created dataset onto the canvas.

Next, expand Data Transformation > Manipulation and drag on Clean Missing Data. Connect the tasks by dragging the output port of the dataset to the input port of the Clean Missing Data task. Make sure the properties mirror the screenshot below. This will replace all missing values with the value ‘0’. The below image is what you should see when Clean Missing Data is selected:


Then drag the Project Columns task onto the canvas. Drag the leftmost output port of the Clean Missing Data task to the input port of the Project Columns task. Select Launch column selector, select All Columns under Begin With, make sure Exclude and column names are selected, and add the columns bore and stroke. This will remove the selected columns because they are not relevant when predicting the price, and would therefore have an adverse effect on the accuracy of the prediction. When you are happy it should look something like the screen below. Click the tick to confirm.
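To make the two data-preparation steps concrete, here is a tiny plain-Python sketch of what they do: substitute 0 for missing values, then exclude the bore and stroke columns. The two rows are made up for illustration; Azure ML of course does all of this for you.

```python
# Two hypothetical rows from the automobile file, with some values missing.
rows = [
    {"bore": "3.47", "stroke": "2.68", "horsepower": None, "price": "13495"},
    {"bore": None, "stroke": "3.40", "horsepower": "154", "price": None},
]

# Clean Missing Data: replace every missing value with '0'.
cleaned = [{k: ("0" if v is None else v) for k, v in row.items()} for row in rows]

# Project Columns: exclude bore and stroke from the dataset.
excluded = ("bore", "stroke")
projected = [{k: v for k, v in row.items() if k not in excluded} for row in cleaned]

print(projected[0])   # {'horsepower': '0', 'price': '13495'}
```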


The boring bit is now out of the way! In the next blog we will start the real machine learning by splitting the data into training and testing sets and then producing a model.