Adatis BI Blogs

Getting Started with PyTorch: A Deep Learning Tutorial

PyTorch is a deep learning framework created by the Artificial Intelligence Research Group at Facebook to build neural networks for machine learning projects. It isn’t brand new; PyTorch has been around since October 2016, almost exactly two years ago, but only now it is gaining the momentum it deserves. When used as an alternative to Keras, TensorFlow or NumPy, PyTorch shines in the following areas:It is very tightly integrated with native Python API’s which allows it to seamlessly interact with Python libraries such as NumPy, SciPy and Pandas.This also includes native debuggers so there is no need for a specialist debugger – Looking at you, TensorFlow…It supports both forward passes for prediction and back propagation using the Autograd library for training.And most importantly:PyTorch builds dynamic computation graphs which can run code immediately with no separated build and run phases. This makes the neural networks much easier to extend, debug and maintain as you can edit your neural network during runtime or build your graph one step at a time.** Note: If the basics of Deep Learning, Neural Networks and Back Propagation are alien to you or even if you fancy a little revision, I really recommend that you check out 3Blue1Brown’s awesome YouTube playlist covering the foundations of Neural Networks: **In this tutorial I will introduce a basic deep neural network in PyTorch and explain the important steps along the way. Now, let’s get started!InstallationThe simplest and recommended way to install PyTorch is through a package management tool like Conda or Pip but it can be installed directly from source using the instructions found at: using PipLinux:pip3 install torch torchvisionMac:pip3 install torch torchvision # MacOS Binaries dont support CUDA, install from source if CUDA is needed Windows:pip3 install pip3 install torchvisionInstall using CondaLinux:conda install pytorch torchvision -c pytorchMac:conda install pytorch torchvision -c pytorch # MacOS Binaries dont support CUDA, install from source if CUDA is needed Windows:conda install pytorch -c pytorch pip3 install torchvision PyTorch can be installed on Azure Databricks as a Databricks PyPI library and comes preinstalled and configured on Azure Data Science Virtual Machines. Dataset SelectionI originally found the dataset used in this tutorial in the UCI Machine Learning Repository. The dataset represents data on 11,000+ instances of phishing and non-phishing webpages which have 30 categorical attributes including PageRank, AbnormalURL, Google_Index and age_of_domain. The data also contained a label, 1 or -1 indicating if the webpage with a phishing webpage or not. The aim of this tutorial is to show how to use a deep neural network, so all data was cleansed before being split into two separate CSV files, train.csv and test.csv. Simple Neural Network DesignThe neural network I will build consists of:30 input nodes each representing a column of the dataset.These are passed through the first ReLU layer. Rectified Linear Units (ReLU) improve neural networks by speeding up training – all negative numbers are set to 0 and positive aren’t changed to speed up computations.Next is the first hidden layer consisting of 128 fully connected nodes.Then we have the second ReLU layer and second fully connected hidden layer which also consists of 128 nodes.The next layer is the Logarithmic Softmax layer. This layer integrates both the softmax and log functions to calculate probabilities of each output in the range of 0 and 1.Finally, the last layer comprises of just two nodes representing the two labels – phishing and non-phishing.This neural network is visualized below:Simple Neural Network BuildAs always, the first step is to import the libraries we’ll need.import numpy as np import torch import torch.nn as nn from torch.autograd import Variable import pandas as pd import The next step is to read in the CSVs that contain the data we need into a pandas dataframe.trainDataset = pd.read_csv("..\\train.csv", header=None) testDataset = pd.read_csv("..\\test.csv", header=None) Now we choose our hyperparameters. The input size will equal the number of attributes in our dataset, 30. The size of both hidden layers will be 128 nodes. The number of classes will be 2, non-phishing and phishing. The number of epochs will be 100. An epoch is both one forward pass through the network and one backward pass for training. We want to do these two steps 100 times. Our batch size will also be 100. This means we will complete each epoch with a batch of 100 rows of the dataset to speed up the training. The learning rate will be 0.001. The lower the learning rate, generally, the slower but more accurate the training is. The learning rate can be referred to as the speed of the gradient descent – Too big and you may overshoot the minimum, too small and it’ll take a long long time to converge. These parameters can be set using the following code: inputSize = len(trainDataset.columns) -1 hidden1Size = 128 hidden2Size = 128 numClasses = 2 numEpoch = 100 batchSize = 100 learningRate = 0.001 Data loaders are a really simple abstraction to the standard batch machine learning pipeline. Behind the scenes, the data loader will handle: Batching the data. Shuffling the data. Loading the data in parallel using multiprocessing workers. trainLoader =, batch_size=batchSize, shuffle=True) testLoader =, batch_size=batchSize, shuffle=False) Now it’s time to define our neural network. The best way to do this is to subclassing the nn.Module class. The new DeepNeuralNetwork class is made up of the seven layers discussed earlier. class DeepNeuralNetwork(nn.Module): def __init__(self, inputSize, hidden1Size, hidden2Size, numClasses): super(DeepNeuralNetwork, self).__init__() self.fc1 = nn.Linear(inputSize, hidden1Size) self.relu1 = nn.ReLU() self.fc2 = nn.Linear(hidden1Size, hidden2Size) self.relu2 = nn.ReLU() self.fc3 = nn.Linear(hidden1Size, numClasses) self.logsm1 = nn.LogSoftmax(dim=1) def forward(self, x): out = self.fc1(x) out = self.relu1(out) out = self.fc2(out) out = self.relu2(out) out = self.fc3(out) out = self.logsm1(out) return out dnn = DeepNeuralNetwork(inputSize, hidden1Size, hidden2Size, numClasses) We now need to define both the loss function and optimizer. The loss function will calculate how effective our current weights and biases are at producing an accurate classification and we will be using the NLL function or negative log-likelihood function. NLL is very often the loss function of choice alongside the softmax activation we have in our neural network. Additionally, we need an optimizer to take the results from the loss function and alter the weights and biases to move the accuracy in a positive direction. The optimizer we will want use is called Adam (Adaptive Moment Estimation) – a popular optimizer that will outperform (almost) every other optimization algorithm. lossFN = nn.NLLLoss() optimizer = torch.optim.Adam(dnn.parameters(), lr=learningRate) Time to train the network! We will loop through the training steps 100 times as we stated in our hyperparameters. These steps consist of: Looping through each batch of training data. Separating each batch of training data into two variables – one for the attributes, one for the class labels. Zero the gradient of the previous epoch. Completing a forward pass through the network with the batch of training attributes. Calculating the loss with respect to the class labels. And finally performing back propagation. for epoch in range(0, numEpoch): for i, data in enumerate(trainLoader,0): labels = Variable(data[:,-1]) data = Variable(data[:,0:30].float()) optimizer.zero_grad() outputs = dnn(data) loss = lossFN(outputs, labels.long()) loss.backward() optimizer.step() print('Epoch [%d/%d], Loss: %.4f' %(epoch+1, numEpoch, loss.item())) Finally, test our network using the following code: correct = 0 total = 0 for data in testLoader: labels = Variable(data[:,-1]) data = Variable(data[:,0:30].float()) outputs = dnn(data) _, predicted = torch.max(, 1) total += labels.size(0) correct += (predicted == labels.long()).sum() print('Accuracy of the network on the data: %d %%' % (100 * correct / total)) This tells us that our accuracy is at 95%. This is pretty good for a first try! We can now identify phishing websites with a high accuracy using only 30 features. Due to its Python integration and dynamic computational graphs, PyTorch is relatively easy to pick up making it a more approachable neural network framework than TensorFlow. However, PyTorch is a relatively new framework, so it only has a small community and limited resources hindering the ability to learn and debug. But as with any tech, it’s all a matter of personal preference.The dataset and code used in this tutorial have been uploaded to my GitHub account which can be found at:

Visualising Network Data in Power BI with Python Integration and NetworkX

The long awaited Python Integration in Power BI added earlier this month welcomes the opportunity for further customised reporting by exploiting the vast range of Python visualisation libraries.Among my favourite of these Python visualisation/ data science libraries is NetworkX, a powerful package designed to manipulate and study the structure and dynamics of complex networks. While NetworkX excels most at applying graph theory algorithms on network graphs in excess of 100 million edges, it also provides the capability to visualise these networks efficiently and, in my opinion, easier than the equivalent packages in R.In this article, I will explain how to visualise network data in Power BI utilising the new Python Integration and the NetworkX Python library.Getting StartedTo begin experimenting with NetworkX and Python in Power BI, there are several pre-requisites:Enable Python integration in the preview settings by going to File –> Options and Settings –> Options –> Preview features and enabling Python support.Ensure Python is installed and fully up-to-date.Install the following Python libraries:NetworkXNumPypandasMatplotlibLoading DataThe data I used was created to demonstrate this task in Power BI but there are many real-world network datasets to experiment with provided by Stanford Network Analysis Project. This small dummy dataset represents a co-purchasing network of books.The data I loaded into Power BI consisted of two separate CSVs. One, Books.csv, consisted of metadata pertaining to the top 40 bestselling books according to Wikipedia and their assigned IDs. The other, Relationship.csv, was an edgelist of the book IDs which is a popular method for storing/ delivering network data. The graph I wanted to create was an undirected, unweighted graph which I wanted to be able to cross-filter accurately. Because of this, I duplicated this edgelist and reversed the columns so the ToNodeId and FromNodeId were swapped. Adding this new edge list onto the end of the original edgelist has created a dataset with can be filtered on both columns later down the line. For directed graphs, this step is unnecessary and can be ignored.Once loaded into Power BI, I duplicated the Books table to create the following relationship diagram as it isn’t possible to replicate the relationship between FromNodeId to Book ID and ToNodeId to Book ID with only one Books table.From here I can build my network graph.Building the Network GraphFinally, we can begin the Python Integration!Select the Python visual from the visualizations pane and drag this onto the dashboard.Drag the Book Title columns of both Books and Books2 into Values.Power BI will create a data frame from these values. This can be seen in the top 4 lines in the Python script editor.The following Python code (also shown above) will create and draw a simple undirected and unweighted network graph with node labels from the data frame Power BI generated:import networkx as nx import matplotlib.pyplot as plt G = nx.from_pandas_edgelist(dataset, source="Book Title", target="Book Title.1") nx.draw(G, with_labels = True) ** NOTE: You may find that the code above will fail to work with large networks. This is because by default networkx will draw the graph according to the Fruchterman Reingold layout, which will position the nodes for the highest readability. This layout is unsuitable for networks larger than 1000 nodes due to the memory and run time required to run the algorithm. As an alternative, you can position the nodes in a circle or randomly by editing the linenx.draw(G, with_labels = True)tonx.draw(G, with_labels = True, pos=nx.circular_layout(G)) or nx.draw(G, with_labels = True, pos=nx.random_layout(G)) **This will produce the network graph below:You are also able to cross filter the network graph by selecting rows in the table on the right-hand side:ConclusionPython visuals are simple to produce and although the visual itself isn’t interactive, they will update with data refreshes and cross filtering, much like the R integration added 3 years ago. The introduction of Python in Power BI has opened doors for visualisation with libraries such as NetworkX, to visualise all BI networks from Airline Connection Flights and Co-Purchasing networks to Social Network Analysis.

Can the Custom Vision Service support the calculation of KPIs?

Let’s assume we have a company that distributes alcoholic drinks across the country. To determine their performance, they define a set of KPIs that will evaluate, between others, how many establishments (eg. pubs, bars, …) have their products exposed in the shelf. To achieve this goal, they have a set of sales reps that visit each establishment and take note of which products are exposed. One possible way to track the data is by accessing a mobile application, manually fill the form and upload the data, but, what if we could automate the process of identifying the products in a shelf by simply uploading a picture? To do that, we would need to apply a machine learning algorithm to classify the image and identify each product. To prove if the above scenario is achievable, I’ll demonstrate how to create a project using a tool called Custom Vision, a service that allow us to easily build a predictive model with just a few clicks and without the need of deep machine learning knowledge. What is the Custom Vision Service?Azure Custom Vision Service is a Microsoft Cognitive Services tool for tagging images using a custom computer vision model. Although very similar to Microsoft’s Computer Vision API, it has the advantage of fine-tuning a predictive model to a specific dataset, however, there are still a couple of minor disadvantages. For example, the service can only identify if an object is in an image and not where it stands within the image.Build the modelTo build a predictive model with this service we can either use the web interface or the REST API with support for C# and Python. The first step was to create a project by providing a name, a domain and a resource group. When selecting a domain, we can either choose a general domain optimized for a range of images or select a specific domain optimized for a certain scenario. In my case, I selected the Food domain, given I wanted to identify different kind of bottles.Below a description of each domain detailed by Microsoft. The next step was to upload and tag images. Here are a couple of considerations:- To start the prototype, we need at least two different tags and a couple of images, usually, a minimum of 30 per class- It is best practice to use a variety of good quality images (different angles, lights, background, size, …) to ensure a better differentiation and accurate results from the classifier. In my case, Google was the source of the images, which was a bit limited in some cases (surprisingly!!!)- It is best practice to include images that represent what the classifier will find in the real world, rather than images with neutral backgrounds- It is advised to avoid images with multiple entities. If we upload images with a bottle and a glass, because the classifier will learn the characteristics that the photos have in common, when comparing two images, the classifier might be comparing a bottle + cup with a single bottleI started by uploading 60 images for 2 classes, Baileys and Others. Once the images were uploaded, I trained my model and obtained the following results.The Precision and Recall indicators demonstrates how good the classifier is performing. Above, we can see that the analysis is done for the entire model and for each tag. It is important to refer that 100% precision is usually not achievable, however, having a model with 75%+ in Precision and Recall is an indication of an effective model.Precision – Indicates how likely the classifier is correctly classifying an image. Knowing that we had 60 images, having a precision of 84.3% means that roughly 51 of the images were correctly taggedRecall – From out of all the images that should have been classified correctly, how many did the classifier identified accurately. Having a precision of 34.5% means that only 20 images were correctly classifiedProbability Threshold – The slider, set by default at 90%, indicates what is the value used to calculate Precision and Recall. Let’s consider the following example. The probability that image A has a Baileys bottle is 94%. If the probability threshold is 90%, then Image A will be taken into consideration as a “correct prediction”.In the example below, we can see that I obtained better results when I changed the probability threshold to 75%. Based on this information I had two options, either correctly tag the wrong images or replace them with better ones.Fortunately, Custom Vision Service has a very handy functionality that, for each iteration, highlights which images confused the model according to the probability threshold value. In the image below, there are two images with a red square. If we hover the mouse on the image we can see the prediction for each tag. In this case, the prediction is below 90% for both tags, meaning the image was not considered in the Precision and Recall calculation.After a couple of iterations we finally obtained a model we could use to run a quick test. When having multiple iterations, we select the one with best results by selecting the option Make Default.To test the model, I selected two images from the internet showing individual brands. As we can see, the model correctly classified each image.Since I was happy with the results, I decided to increase the complexity of the model by creating a new tag and uploading a new set of images. After training the model, I noticed the results were not as good as before, since the new images were creating some confusion to the model. It took a couple of iterations until I got an acceptable model. Following the result of a quick test.I now had a model that could correctly classify 3 different entities, so I decided to increase the challenge and added an image with multiple entities. The result I obtained helped me understand the flaw in my model.The model identified the image should be classified as Others, however, although we have a Baileys and Hendricks bottle in the image, the probability for those two classes was too low. Here is why:- When uploading images to the project I only used 1 tag per image. Based on that, the model will always try to classify an image with a single tag. As soon as I added more tags per image, my predictions improved- All the examples used were showing a single entity, ie, only one type of bottle per image, except for the Others category. Example: I uploaded 30 images of Baileys bottles, 30 images of Hendricks bottles, and, for the category Others, 3 different types of bottles- My model was trained to identify bottles where the image didn’t have multiple entities. As referred above, the model should always be trained with images that represent what the classifier will predict- The number of images per class didn’t have enough variety. As any machine learning model, if we improve the number of examples, the model will perform betterFinal ConsiderationsWith just a couple of clicks and no deep machine learning knowledge I was able to create a predictive model that could accurately classify a set of images. With further iterations I could potentially have a model that could achieve the requirements of the scenario proposed at the beginning of this blog, however, would Custom Vision Service be the right tool? In my opinion the use of the Computer Vision API would be more adequate, but, this is definitely a service with massive potential.As always, if you have any queries or considerations do let me know.

Improving Machine Learning Models

As part of my MS Data Science Professional Program, a number of the topics recently have been based around getting the most out of an Azure ML model. As part of this blog, I will be looking at the techniques and ways in which you can model and improve a solution. While I was tackling this problem with Azure ML, these techniques apply to building better models through other languages/platforms such as Python or Scala.   Data Munging A process that undoubtedly every data scientist goes through with every DS problem they face, is that of data munging. This is a term that is being used more frequently to describe the process of transforming data from its raw state into another format, something more valuable for downstream analytics. Models running on data that is poor in quality, missing or duplicated will produce poor predictions. Therefore a very simple pre cursor to any problem is to explore the data and understand what is required to turn the data set into a form which is better for the latter stages of the model design. The following techniques are used as part of this stage: Removing Duplicate Rows Cleaning Missing Data (custom substitution of numerics, usually 1 or 0 or replacing with the mean, medium, mode across the dataset) Cleaning Missing Data (removing bad quality rows entirely or removing bad quality columns entirely – more extreme) Creating Categorical Features from String Features Normalising Numeric Features (ZScore or Min/Max) to bring everything on to the same scale. These steps are quite basic so I won’t go into detail here, but none the less, they should be considered at the start of a modelling a better solution to a problem.   Feature Selection One of the reoccurring principles that appears with machine learning is that of Ockham’s razor, which states that the best models are simple models that fit the data well; this is not an irrefutable principle of logic, but a preference for simplicity. Therefore there is a need of balance between accuracy and simplicity to limit the feature set which tends to lead to better predictions. Simpler models are also more interpretable to humans which also helps. While the data I was working with was limited to around 35 features, there are many data science problems which have thousands of features and so this technique is even more crucial. There are multiple methods to perform feature selection, of which a few will be covered here. The first method is greedy backward selection which starts with all the features and then finds the feature that hurts predictive power the least when removed, and you remove it. This is done iteratively until a point is met (which will be discussed later). Its known as greedy since it never looks back after removing the feature each time. An alternative method is greedy forward selection which is basically the inverse, starts with no features, and looks for the feature that by itself is the best model. This then carries on in a similar vein to the backward selection but adding features. The point at which you stop with forward selection is that of diminishing returns for your accuracy. Defining accuracy is important here, and this is where a formula called Adjusted R² comes in. R² is a measure of how well the model fits the data, with being closer to 1 than 0 being a better fit. The adjusted part adds a penalty for every term in the model, thus it measures on a scale the size and accuracy of a model. Therefore you need enough features for your R² to be large but not too many that it brings the Adjusted R² down. . Permutation Feature Importance Using the feature selection theory, and to prune the feature set down to those that are meaningful for prediction, you can use a module in Azure ML called Permutation Feature Importance. This essentially re-computes the model a number of times, leaving out each feature and looks at how much your metric changes because the feature was left out, and then ranks them in order of importance. Depending on what you are trying to model, i.e. a classification or regression problem – there are a number of options for the metrics to measure performance. In my instance, I was interested in the RMSE (Root Mean Squared Error) which in simple terms represents the sample standard deviation of the difference between prediction and observed values. It aggregates the magnitudes of error in prediction into a single measure of predictive power. The closer to 0, the better the predictive power – but it’s also good to note this is relative to what you are trying to measure. Once the model has been run through, you can visualise the list of features and their contribution to the RMSE. At this point, it does not necessarily matter whether the feature contributes a positive or negative value to the RMSE, as long as the value is not 0. Any values of 0 indicate that they have zero contribution to feature importance, essentially whether they are part of the model or not, add nothing to it. You can then follow backward pruning techniques to remove these columns from the feature set. It is then worth running the model again, to check the feature importance as the removal of those features may impact other features. If more features then have a value of 0, you should remove those too, and repeat. You can then measure the impact of the changes using separate pipelines, and passing the output into the same evaluation model, and checking the ROC curve (described below). Even with the RMSE staying the same between the 2 pipelines, by removing features, you are able to build a model which is more likely to generalize – be more effective in the real world when values change.     Picking the Best Model Type There is no reason to believe that any particular machine learning model will have the best performance (although we always have favourites); a classification model type that works best for one set of features and labels in a dataset does not always work best for another. As part of modelling any dataset, testing and comparing multiple machine learning models is usually a good approach. Its also important to note that the performance achieved with any particular machine learning model can change after performing feature engineering, therefore it is best to run the selection after this stage. The following model evaluates logistic regression, boosted decision trees, neural networks and support vectors with the same dataset to find out which is best.     To understand the performance of a machine learning model, there are a number of techniques to use. The easiest way is to pass the output of each model into an Evaluate Model module, which accepts up to 2 datasets at a time (left and right inputs). After the experiment is run, you can visualise the output of the models using this module, and examine the ROC curve. The first scored dataset (blue) represents the original model (in this case a neural network), and the scored dataset to compare against (red) represents the second dataset (in this case a support vector machine). The higher and further to the left the curve, the better the performance of the model (in this case, the neutral network). Scrolling down further, you can also use the Accuracy, Recall, and AUC performance metrics, which indicate the accuracy and area under the curve. The model with the higher metrics is performing better. In particular, the lower the recall metric, the higher the number of false negatives.     Parameter Sweeping Once you’ve picked the ML model contributing to making your predictive power better, it will require a set of parameters. For instance, with decision trees, this is in the form of a leaf count to determine depth, or no. of trees to determine width, along with their samples per leaf, and the learning rate. By default there is always a set available, but these will always need tweaking to improve things further and generate a better RMSE. This can be done by either sweeping a giant grid of parameters, or by a random sweep. The latter being a lot quicker to process at run time for obvious reasons Fortunately, the performance is not normally sensitive to a change in these values if you have done much of the previous analysis first. Parameter sweeping really starts to squeeze the best out of the model. In Azure ML, this can be done via a tune model hyper parameter module. The same options are available to measure metrics as the feature selection module, so I was interested in the RMSE again. As part of tuning the parameters through this module, we will need to split the training data beforehand, this can be done 50:50. This is so that the parameters have a set of data to validate against. This is then kept separate to the scoring data set as usual which is another completely separate set of data. Once the model has run, we can again evaluate the best parameters, against the original model and evaluate the RMSE, as well as the Accuracy, Recall and AUC. This is very similar to the previous techniques of evaluation. Visualising the sweep results, will display the parameters used, and then these can be programmed back into the original ML model, while removing the tune hyper parameter module, to speed things up on future runs. A process of nested cross validation can be used on top of this to build confidence that the correct parameters have been used and it wasn’t just luck that they ended up being better than another set.     Conclusion Once you have been through this process, you will then want to run a process of cross-validation, which runs the data through multiple times (folds) where each time, different data is used for training, and scoring. You can then generate a mean and standard deviation for each fold and prove the model is consistent across the data set, and that it will not be skewed by any new data for future predictions. This will give you a good idea of whether the model will generalise well and be robust enough to move to production. Of course, there are many more techniques to the ones listed here, but this should give you a good introduction to the ones to look for to deliver predictive power from your model.

Testing the Waters: An Overview of Data Science using Azure Data Lakes

Data Science can fit seamlessly within the ecosystem of the data lake, whether this is through HDInsight or the extensibility of Azure Data Lake Analytics and U-SQL. This blog will give a brief overview of what Data Science is; how to link Data Science toolkits to the Azure Data Lake; and best practices for managing the data output from experiments.   Data Science Data Science is the relatively new kid on the block. One way to consider data science is as an evolutionary step in interdisciplinary fields like business analysis that incorporate computer science, modelling, statistics, analytics, and mathematics.At its core, data science involves using automated methods to analyse massive amounts of data and to extract insight from them. Data science is helping to create new branches of science, and influencing areas of social science and the humanities. The trend is expected to accelerate in the coming years as data from mobile sensors, sophisticated instruments, the web, and more, grows.   Data Science In The Data Lake The nature of the Azure Data Lake Store lends itself to Data Science in that it can hold any data, which the data scientist will want to access, transform and analyse.   HDInsight contains many implementations for data science, such as Spark, R Server and others. Hooking HDInsight to Azure Data Lake Store is pretty simple and follows these steps: In the Azure Portal Marketplace, select HDInsight which will bring up a series of blades. In this blog, I will be using Spark as my cluster type on HDInsight. In the storage settings you can then link your HDInsight cluster to Azure Data Lake Store Confirm your configuration on the next blade and wait around 20 minutes for your cluster to deploy and you’re good to go!   With Azure Data Lake Analytics, you incorporate data science by extending the capabilities of U-SQL and you do this by installing a series of files. Open up your Azure Data Lake Analytics account and click on Sample Scripts at the top. This will bring forward the following blade From there you’ll want to click on the U-SQL Advanced Analytics tab, which will copy about 1.5GB of files to the default Azure Data Lake Store associated to your ADLA account. This will take about 3 minutes to complete. When it’s finished copying the files it will then call a job to register the extension, which can be found in the assemblies folder of the master database. More resources about extensibility of U-SQL can be found here: The Laboratory Within Azure Data Lake Store, Folder and File Management is incredibly important for a well running data lake. See my blogs on Storage and Best Practices and Shaping The Lake for more information on how to set up your Azure Data Lake Store. The Laboratory is an area to be exclusively used by a data scientist. It’s an area where they can persist the results of experiments and data sets without impacting the day-to-day operations within the data lake or other data scientists. The laboratory is organised in to two broad area: Desks and Exhibits. Desks contain personal workspaces, the contents of which can be organised however the person wishes. It can be as well organised, or disorganised, as the person themselves. The Exhibit contains data sources produced in the Laboratory which are ready to be consumed by other users or systems. Both of which are laid out below. As always, if you have any feedback or comments do let me know!

My Experience of the Microsoft Professional Program for Data Science

(Image 1 – Microsoft 2017 -   In 2016 I was talking to Andrew Fryer (@DeepFat)- Microsoft technical evangelist, (after he attended Dundee university to present about Azure Machine Learning), about how Microsoft were piloting a degree course in data science. My interest was immediately spiked. Shortly after this hints began appear and the Edx page went live. Shortly after the Edx page went live, the degree was rebranded as the "Professional Program". I registered to be part of the pilot, however was not accepted until the course went live in September 2016.   Prior to 2016 my background was in BI, predominately in Microsoft Kimball data warehousing using SQL Server. At the end of 2015 I enrolled on a Master's Degree in Data Science through the University of Dundee. I did this with the intention of getting exposure to tools I had an interest in, but had some/little commercial experience (R, Machine learning and statistics). This course is ongoing and will finish in 2018, I highly recommend it! I would argue that it is the best Data Science Master's degree course in the UK. So going in to the MPP I had a decent idea of what to expect, plus a lot of SQL experience, R and Power BI. Beyond that I had attended a few sessions at various conferences on Azure ML. When the syllabus for the MPP came out, it directly complemented my studies.   Link to program - Link to Dundee Masters -   Structure of the program The program is divided up in to 9 modules and a final project. All modules need to be completed but there are different options you can take - You can customise the course to suit your interests. You can choose to pay for the course (which you will need to do if you intend to work towards the certification) or audit the course for free.  I will indicate which modules I took and why. Most modules recommend at least 6 weeks part-time to complete. I started the first module in the middle of September 2016 and completed the final project middle of January 2017 – So the 6 week estimate is quite high, especially if you already have decent a base knowledge of the concepts already.   You can if you wish complete multiple modules at once. I am not sure I recommend this approach as to get the most out of the course, you should read around the subject as well as watching the videos. Each module has a start date and an end date that you need to complete it between. If you do not you will need to do it all again. You can start a module in one period and wait until the next for another module. You do not need to complete them all in 3 months. If you pay for the module but do not request your certificate before the course closes, you will need to take it again (top tip, as soon as you're happy with you score, request you certificate).   Module list Module Detail Time taken Data Science Orientation Data Science Orientation 2 - 3 days Query Relational Data Querying Data with Transact-SQL 1 day - Exam only Analyze and Visualize Data Analyzing and Visualizing Data with Excel  Analyzing and Visualizing Data with Power BI 2 - 4  days Understand Statistics Statistical Thinking for Data Science and Analytics 7 - 9 days Explore Data with Code Introduction to R for Data Science Introduction to Python for Data Science 7 - 9 days Understand Core Data Science Concepts Data Science Essentials 7 - 9 days Understand Machine Learning Principles of Machine Learning 2 weeks Use Code to Manipulate and Model Data  Programming with R for Data Science Programming with Python for Data Science R - 2 - 3 daysPython - 3 weeks Develop Intelligent Solutions   Applied Machine Learning  Implementing Predictive Solutions with Spark in HDInsight Developing Intelligent Applications 2 weeks Final Project Data Science Challenge 2 months*   The times taken are based on the time I had spare. I completed each module between projects, in the evening and at the weekend. This module can be completed in a few days, however you need to wait until it has completed to get you grade.   Structure of the modules Each modules is online. You log on to the Edx website and watch videos by leading experts. Either at the end of the video, after reading some text or at the end of a section of the modules you are given a multiple choice test. The multiple choice options are graded and form part of your overall score. The other main assessment method is labs, where you will be required to complete a series of tasks and enter the results. Unlike certifications, you get to see what your score is as you progress through the module. The multiple choice questions generally allow you to have two to three attempts at the answer, sometimes these are true/false with two attempts, which does undermine the integrity of the course.   There is normally a final section which you're only given one chance to answer, and holds a higher % towards your final mark. You need 70% to pass. Once you hit 70% you can claim your certificate - if you have chosen to pay for the module. Modules range from $20 to $100. For the most part I answered the questions fully and tried for the highest score possible. However, In all honestly towards the end, once I hit around 80%, I started looking at a different module. If the module was really interesting I would persevere.   Modules Data Science Orientation, Query Relational Data & Analyze and Visualize Data. These modules are very basic and really only skim the surface of all the topics they describe. The first module is a gentle introduction to the main concepts you will learn throughout the program. The next modules focused on querying data with SQL. Regardless of your opinion of SQL, you must agree that SQL the is language of data. Having an understanding of the fundamentals of SQL is paramount, as almost every level of the Microsoft Data Science stack has integration with databases. If you're familiar with SQL (I already held an MCSE in SQL 2012) you can skip the main content of this module and just take the test at the end. For the next you have an option of Excel or Power BI for visualisation. As I have experience with Power BI I opted for this module. Once again this is a very basic introduction to Power BI. It will get you familiar enough with the tool that you can do basic data exploration. Some parts of this course jarred with me. Data visualisation is so important and a key skill for any data scientist. In the Power BI module one of the exercises was to create a 3d pie chart. Pie charts are not a good visualisation as it is hard to differentiate between angles and making it 3d only escalates the issue. I wish Microsoft would have made reference to some of the great data viz experts when making this module - I cannot comment on the Excel version.   Understanding statistics. This module is different from its predecessors, in that it is not run by Microsoft. This is a MOOC from Columbia university, which you might have completed before. It covers a lot of the basic and more advanced stats that you need to know for data science. In particular a solid grounding in probability and probability theory. In BI you become familiar with descriptive stats and measures of variance, however I had not done a great deal of stats beyond this. I have researching statistical methods for the MSc, but I had not done any real stats since A-Level maths. This course was really interesting and I learnt a lot. I don’t know if this is the best way to really learn stats, but it is a good primer to what you need to know. I found topping up my understanding with blogs, books and YouTube helped support this module.   Explore data with code. You have two options again for this module, R and Python. Which should you learn I imagine you're asking, well the simple answer is both. Knowing either R or Python will get you so far, knowing both with make you a unicorn. Many ask why to learn one language over the other - aside from the previous point. R is very easy to get in to, it has a rich catalogue of libraries written by some of the smartest statistical minds. It has a simple interface and is easy to install. Python is harder to learn in my opinion as the language is massive! I found Python harder to work with, but it is much richer. I would recommend Python just for SciKitLearn the machine learning library. The python module is extended to use code dojo (the great online tuition site). As you progress through the questions and examples, you have an ide which will check you understanding and  will grade you as you go. I found this really helpful. This module is again a bit on the easier side. If you think the later Python module will be similar, you are in for a surprise! I did not take the R module as I was already using R in my day job.   Understand core data science concepts. Almost a redo of the first module and the understanding statistics module. Not a lot to say here, but repetition helped me understand and remember the concepts. The more I had to think about the core concepts the more they stuck. This module could have been removed with little to no impact on the course, but helped solidify my knowledge.   Understanding Machine learning. As this is a Microsoft course this module is all about Azure Machine Learning. If you have not used Azure ML before, it has a nice drag and drop interface which allows you to build quick simple models and create a web api key which you can then pass data to using any tool with a REST API. This module is half theory and half practical. There are a lot of labs, so you will need to take you time. If you skip ahead you will get the answers wrong and might not make it to 70%.   Using code to manipulate and model data. This section has two options again R and Python. I know quite a bit or R already so I started with Python. I wanted to do them both to see how you can do machine learning in both. I was expecting a continuation of the code dojo format from the previous module, this was far from the case. Each of the modules up until this point have worked with you to find the right answer. This module will equip you with the basics, but expect you to find the correct function and answer. Believe me when I say it was hard (with little prior experience of Python). The course will lead you to towards the right resources, but you need to read the documentation to answer the question. This was a great change of pace. Having to search for the answers made me absorb more than just the quizzes. This module was a struggle. Once I completed this I did the same for R. On a difficulty scale, if the Python module was 100, R was only at 20. The disparity in difficult is massive and frankly unfair. I was able to complete the R module very quickly. I left feeling disappointed that this did not have the same complexity that the Python module did.   Develop intelligent solutions. For this section you can pick one of three modules, Machine learning, Spark or micro services. I went with Spark. Why? Because I had already worked with Spark and Hadoop as part of the MSc at Dundee. I knew how it worked and what it did from an open source point of view, but not from a Microsoft HD-Insight perspective. This module was tricky but nothing compared to the Python module. I spent the best part of the week working on Spark, setting up HD-Insight clusters and forgetting to tear them down (top tip! Don’t leave a HD-Insight cluster running - They are EXPENSIVE!). The last module is a machine learning project, so picking the "Applied Machine Learning" option might put you in a better place than your competition. I did not attempt either the Machine Learning or the Micro-services modules.   Final project. Here is where the fun begins. You're given a problem and a dataset. You need to clean, reduce, derive features and process the dataset, then apply an ML technique to predict something. In my case it was whether or not someone will default on a loan. You could use any technique you liked as long as the final result was in Azure ML. I was pretty happy with my model early on and made very few tweaks as the course progressed. Unlike the previous modules where you can complete a module and get your score, your final score is only available once the module has ended. You will build an ML experiment and test against a private dataset. You can submit your experiment 3 times a day to be scored against the private data (maximus of 100 attempts). This will give you an indication of your score, but this is not your score! You score is calculated against a different dataset after the module has finished.  You top 5 scores will be used to test against the private closed data. If you have over-fitted you model, you might have a shock (as many did on the forums) when you score is marked.   I completed all modules at the start of January and waited until February to get my final score. My highest scoring answer, when used against the closed private dataset, did not get over the required 70% to pass. This was surprising but not all that unexpected. I had over-fitted the model. To counter balance this, I created 5 different experiments with 5 similar but different approaches. All score similar (~1-3% accuracy difference). This was enough to see me past the required 70% and to obtain the MPP in data science. The private dataset has been published now. In the coming weeks I will blog about the steps I took to predict if someone would default on their loan.   I have been asked at different stages of the course "would you recommend the course?". It really depends on what you want out of the course! If you expect to be a data scientist after completing the MPP, then you might be in for a shock. To get the most out of the course you need to supplement it with wider reading / research. YouTube has many great videos and recorded lectures which will really help process the content and see it taught from a different angle. If you're looking to get an understanding of the key techniques in  Data Science (from a Microsoft point-of-view) then you should take this course. If you're doing a degree where you need to do research, many of the modules will really help and build upon what you already know.   I hope you have found this interesting and that it has helped you decide whether or not you want to invest the time and money (each module is not free). If you do decide and you persevere you will too be the owner of the MPP in Data Science (as seen below).   Terry McCann - Adatis Data Science Consultant & Organiser of the Exeter Data Science User Group - You can find us on MeetUp.