Tristan Robinson

Tristan Robinson's Blog

Improving Machine Learning Models

As part of my MS Data Science Professional Program, a number of the recent topics have been based around getting the most out of an Azure ML model. In this blog, I will be looking at the techniques you can use to model and improve a solution. While I was tackling this problem with Azure ML, these techniques apply equally to building better models in other languages/platforms such as Python or Scala.


Data Munging

Data munging is a process that almost every data scientist goes through with every DS problem they face. The term is being used more frequently to describe the process of transforming data from its raw state into another format that is more valuable for downstream analytics. Models running on data that is poor in quality, missing or duplicated will produce poor predictions. A very simple precursor to any problem is therefore to explore the data and understand what is required to turn the data set into a form which is better suited to the later stages of the model design. The following techniques are used as part of this stage:

  • Removing Duplicate Rows
  • Cleaning Missing Data (custom substitution of numerics, usually 1 or 0, or replacing with the mean, median or mode across the dataset)
  • Cleaning Missing Data (removing bad quality rows entirely or removing bad quality columns entirely – more extreme)
  • Creating Categorical Features from String Features
  • Normalising Numeric Features (ZScore or Min/Max) to bring everything on to the same scale.

These steps are quite basic so I won’t go into detail here, but nonetheless they should be considered at the start of modelling a better solution to a problem.
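As a rough illustration of the steps listed above outside of Azure ML, the following is a minimal Python sketch using only the standard library. The function names and the toy rows are made up for illustration; they are not taken from any Azure ML module.

```python
from statistics import mean, pstdev

def remove_duplicates(rows):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def z_score(values):
    """Rescale to zero mean and unit (population) standard deviation.
    Assumes the values are not all identical."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale to the range 0..1. Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical toy data: (numeric feature, string feature)
rows = [(1.0, "a"), (1.0, "a"), (None, "b"), (3.0, "c")]
deduped = remove_duplicates(rows)
numeric = fill_missing_with_mean([r[0] for r in deduped])
scaled = min_max(numeric)
```

Whether to fill with the mean, median or mode, or to drop the row instead, depends on the data, so treat the choice of mean here as a placeholder.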


Feature Selection

One of the recurring principles in machine learning is Ockham’s razor, which states that the best models are simple models that fit the data well; this is not an irrefutable principle of logic, but a preference for simplicity. There is therefore a balance to strike between accuracy and simplicity, and limiting the feature set tends to lead to better predictions. Simpler models are also more interpretable to humans, which helps. While the data I was working with was limited to around 35 features, many data science problems have thousands of features, which makes this technique even more crucial.

There are multiple methods to perform feature selection, of which a few will be covered here. The first method is greedy backward selection, which starts with all the features, finds the feature that hurts predictive power the least when removed, and removes it. This is done iteratively until a stopping point is met (which will be discussed later). It’s known as greedy since it never looks back after removing a feature.

An alternative method is greedy forward selection, which is basically the inverse: it starts with no features and looks for the feature that by itself gives the best model. It then carries on in a similar vein to backward selection, but adding features rather than removing them. The point at which you stop with forward selection is that of diminishing returns for your accuracy.
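The forward variant can be sketched in a few lines of Python. This is only an illustration under assumptions: `score` stands in for whatever "train and evaluate a model on these features" means in your setting (higher is better), and `min_gain` is a hypothetical threshold for diminishing returns.

```python
def forward_select(candidates, score, min_gain=0.01):
    """Greedy forward selection.

    candidates: iterable of feature names
    score:      callable taking a list of features, returning a number (higher is better)
    min_gain:   stop once the best available addition improves the score by less than this
    """
    selected = []
    remaining = list(candidates)
    best_score = score(selected)
    while remaining:
        # Score the current set plus each remaining feature in turn
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial_scores, key=trial_scores.get)
        gain = trial_scores[best_feature] - best_score
        if gain < min_gain:
            break  # diminishing returns: stop adding features
        selected.append(best_feature)  # greedy: never revisited later
        remaining.remove(best_feature)
        best_score = trial_scores[best_feature]
    return selected, best_score

# Toy scoring function (made up): each feature adds a fixed amount of "accuracy"
toy_value = {"price": 0.50, "promotion": 0.30, "noise": 0.001}

def toy_score(features):
    return sum(toy_value[f] for f in features)

chosen, final_score = forward_select(toy_value, toy_score)
```

Backward selection follows the same shape, starting from the full list and removing the feature whose removal costs the least.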

Defining accuracy is important here, and this is where a formula called Adjusted R² comes in. R² is a measure of how well the model fits the data, with values closer to 1 indicating a better fit than values closer to 0. The adjusted part adds a penalty for every term in the model, so it weighs the size of a model against its accuracy. You therefore need enough features for your R² to be large, but not so many that they bring the Adjusted R² down.
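For reference, the usual textbook form of the adjustment is 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of rows and p the number of features. A small Python sketch of both measures follows; the function names are my own and the example numbers are made up.

```python
def r_squared(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalise R² for every feature in the model.
    Requires n_samples > n_features + 1."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same raw fit, different model sizes: the larger model is penalised more
small_model = adjusted_r_squared(0.90, n_samples=50, n_features=5)
large_model = adjusted_r_squared(0.90, n_samples=50, n_features=20)
```

With the same R² of 0.90 on 50 rows, the 5-feature model keeps an adjusted value of roughly 0.89 while the 20-feature model drops to roughly 0.83.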


Permutation Feature Importance

Using the feature selection theory, and to prune the feature set down to those that are meaningful for prediction, you can use a module in Azure ML called Permutation Feature Importance. This essentially re-scores the trained model a number of times, each time randomly shuffling the values of one feature, looks at how much your metric changes because that feature’s information was scrambled, and then ranks the features in order of importance. Depending on what you are trying to model, i.e. a classification or regression problem, there are a number of options for the metric used to measure performance. In my instance, I was interested in the RMSE (Root Mean Squared Error), which in simple terms represents the sample standard deviation of the differences between predicted and observed values. It aggregates the magnitudes of the errors in prediction into a single measure of predictive power. The closer to 0, the better the predictive power, but it’s also good to note this is relative to the scale of what you are trying to measure.
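To make the idea concrete, here is a minimal Python sketch of permutation importance scored by RMSE. It is not the Azure ML module's implementation; `predict` is assumed to be an already-trained model exposed as a function of one row, and the toy model and data are invented for the example.

```python
import math
import random

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def permutation_importance(predict, rows, targets, n_features, seed=0):
    """For each feature, shuffle its column and report how much the RMSE rises."""
    rng = random.Random(seed)
    baseline = rmse(targets, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild each row (a tuple) with only feature j replaced by a shuffled value
        shuffled_rows = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        permuted = rmse(targets, [predict(r) for r in shuffled_rows])
        importances.append(permuted - baseline)
    return importances

# Hypothetical trained model: uses feature 0 and ignores feature 1
def toy_model(row):
    return 2.0 * row[0]

rows = [(float(i), float(i % 3)) for i in range(20)]
targets = [2.0 * r[0] for r in rows]
scores = permutation_importance(toy_model, rows, targets, n_features=2)
```

In this toy case the ignored feature scores exactly 0, which is the signal described in the next paragraph for a column that can be pruned.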

Once the model has been run through, you can visualise the list of features and their contribution to the RMSE. At this point, it does not necessarily matter whether the feature contributes a positive or negative value to the RMSE, as long as the value is not 0. A value of 0 indicates zero contribution to feature importance: whether or not the feature is part of the model, it adds nothing. You can then follow backward pruning techniques to remove these columns from the feature set. It is then worth running the model again to check the feature importance, as the removal of those features may impact other features. If more features then have a value of 0, you should remove those too, and repeat. You can then measure the impact of the changes by using separate pipelines, passing their outputs into the same Evaluate Model module, and checking the ROC curve (described below). Even with the RMSE staying the same between the 2 pipelines, by removing features you are able to build a model which is more likely to generalise and be more effective in the real world when values change.




Picking the Best Model Type

There is no reason to believe that any particular machine learning model will have the best performance (although we always have favourites); a classification model type that works best for one set of features and labels in a dataset does not always work best for another. As part of modelling any dataset, testing and comparing multiple machine learning models is usually a good approach. It’s also important to note that the performance achieved with any particular machine learning model can change after performing feature engineering, so it is best to run the selection after this stage. The following model evaluates logistic regression, boosted decision trees, neural networks and support vector machines with the same dataset to find out which is best.




To understand the performance of a machine learning model, there are a number of techniques to use. The easiest way is to pass the output of each model into an Evaluate Model module, which accepts up to 2 datasets at a time (left and right inputs). After the experiment is run, you can visualise the output of the models using this module, and examine the ROC curve. The first scored dataset (blue) represents the original model (in this case a neural network), and the scored dataset to compare against (red) represents the second model (in this case a support vector machine). The higher and further to the left the curve, the better the performance of the model (in this case, the neural network).

Scrolling down further, you can also compare the Accuracy, Recall, and AUC (area under the ROC curve) performance metrics. The model with the higher metrics is performing better. In particular, the lower the recall metric, the higher the number of false negatives.
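If you want to reproduce these metrics outside the Evaluate Model module, the standard definitions can be sketched as below. This assumes binary labels coded as 1 (positive) and 0 (negative) and a 0.5 scoring threshold; the labels and scores are invented.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for binary labels coded as 1 and 0."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn

def accuracy(actual, predicted):
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + fp + tn + fn)

def recall(actual, predicted):
    """Share of true positives found; more false negatives means lower recall."""
    tp, _, _, fn = confusion_counts(actual, predicted)
    return tp / (tp + fn)

def auc(actual, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half), which equals the area under the ROC curve."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up labels and model scores
actual = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]
```

Here one positive is scored below the threshold, so there is one false negative and recall is 2 out of 3.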




Parameter Sweeping

Once you’ve picked the ML model that gives the best predictive power, it will require a set of parameters. For instance, with boosted decision trees these include the number of leaves per tree (which controls how complex each tree can be), the number of trees in the ensemble, the minimum samples per leaf, and the learning rate. By default there is always a set available, but these will usually need tweaking to improve things further and generate a better RMSE.

This can be done either by sweeping an entire grid of parameter combinations, or by a random sweep. The latter is a lot quicker to process at run time because it only evaluates a sample of the combinations. Fortunately, the performance is not normally sensitive to a change in these values if you have done much of the previous analysis first. Parameter sweeping really starts to squeeze the best out of the model.
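The difference between the two sweep styles can be sketched in Python as follows. The parameter names, candidate values and the `toy_rmse` function are all hypothetical stand-ins; in practice the evaluation step would train and validate a model for each setting.

```python
import itertools
import random

def grid_sweep(param_grid):
    """Every combination of the candidate values."""
    names = list(param_grid)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(param_grid[n] for n in names))]

def random_sweep(param_grid, n_runs, seed=0):
    """A fixed number of randomly drawn combinations (duplicates are possible)."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in param_grid.items()}
            for _ in range(n_runs)]

def best_setting(settings, evaluate):
    """Pick the setting with the lowest error (e.g. RMSE on validation data)."""
    return min(settings, key=evaluate)

# Hypothetical candidate values for a boosted decision tree
param_grid = {
    "leaves": [8, 32, 128],
    "trees": [20, 100, 500],
    "learning_rate": [0.025, 0.1, 0.4],
}

# Stand-in for "train with these parameters and return the validation RMSE"
def toy_rmse(s):
    return (abs(s["leaves"] - 32) / 32
            + abs(s["trees"] - 100) / 100
            + abs(s["learning_rate"] - 0.1))

full_grid = grid_sweep(param_grid)            # 3 * 3 * 3 = 27 model fits
sampled = random_sweep(param_grid, n_runs=5)  # only 5 model fits
best_from_grid = best_setting(full_grid, toy_rmse)
best_from_random = best_setting(sampled, toy_rmse)
```

The random sweep is not guaranteed to find the grid's best setting; it trades that guarantee for far fewer model fits.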

In Azure ML, this can be done via the Tune Model Hyperparameters module. The same metric options are available as in the Permutation Feature Importance module, so I was interested in the RMSE again. As part of tuning the parameters through this module, we need to split the training data beforehand; this can be done 50:50, so that the sweep has a set of data to validate against. This is kept separate from the scoring data set, which as usual is another completely separate set of data. Once the model has run, we can again evaluate the best parameters against the original model and compare the RMSE, as well as the Accuracy, Recall and AUC. This is very similar to the previous techniques of evaluation. Visualising the sweep results will display the parameters used, and these can then be programmed back into the original ML model, while removing the Tune Model Hyperparameters module, to speed things up on future runs.

A process of nested cross validation can be used on top of this to build confidence that the correct parameters have been used and it wasn’t just luck that they ended up being better than another set.





Once you have been through this process, you will then want to run a process of cross-validation, which runs the data through multiple times (folds), where each time different data is used for training and scoring. You can then compute the metric for each fold, along with the mean and standard deviation across the folds, to check that the model is consistent across the data set and is unlikely to be skewed by new data in future predictions. This will give you a good idea of whether the model will generalise well and be robust enough to move to production.
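A bare-bones version of k-fold cross-validation can be sketched in Python like this. The `fit` and `evaluate` callables are assumptions standing in for your own training and scoring steps, the rows are assumed to be shuffled already, and the toy baseline model simply predicts the training mean.

```python
from statistics import mean, stdev

def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validate(rows, targets, fit, evaluate, k=5):
    """Train on k-1 folds, score on the held-out fold, and summarise the k scores."""
    scores = []
    for train, test in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train], [targets[i] for i in train])
        scores.append(evaluate(model,
                               [rows[i] for i in test],
                               [targets[i] for i in test]))
    return mean(scores), stdev(scores)

# Toy baseline "model": always predict the mean of the training targets
def fit_mean(train_rows, train_targets):
    return mean(train_targets)

def mean_abs_error(model, test_rows, test_targets):
    return mean(abs(t - model) for t in test_targets)

rows = list(range(12))
targets = [float(i % 4) for i in rows]
cv_mean, cv_std = cross_validate(rows, targets, fit_mean, mean_abs_error, k=3)
```

A small standard deviation relative to the mean suggests the model behaves consistently regardless of which slice of data it was trained on.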

Of course, there are many more techniques than the ones listed here, but this should give you a good introduction to the ones to look for to deliver predictive power from your model.

Modelling Survey Style Data

In a recent project, one of the business areas which required modelling was that of survey data. This can often be quite tricky, because the data is not as quantitative in nature as that of other business areas such as sales.

How do you go about calculating measures against this type of data? In this blog I will go about explaining a relatively simple model you can use to achieve this goal.


Master Data Services

To aid with modelling I used Master Data Services (MDS) to help map the questions and responses from the surveys.

The idea behind using MDS is that regardless of the questions asked, whether it be in one format or another, English or Spanish, you can map them to a set of master questions. Usually these will align closely to your measures/KPIs and should be relatively finite in number so that the mapping process is feasible for a business user. In my business case, the master set of questions revolved around items such as quality, price, and promotion of products. For responses, I needed to map these to a flag which indicated we wanted the response to contribute towards a measure and ultimately a KPI.

I first created the following entities in MDS.

  • Survey (holds the survey name, a unique identifier for the survey, and in my case I also created a master survey lookup to group the surveys together)
  • Source Question (holds the distinct set of questions assigned to each survey, along with identifying codes, and question orders - each question would also be mapped to a master question)
  • Source Response (holds a set of response options for each question, along with identifying codes)
  • Master Question (holds the master set of questions and joins to the KPIs)
  • KPI (holds a list of KPIs that you need to address by aggregating response data)
  • Response of Interest (holds a list of responses that are regarded as positive / contributing towards the KPI when given as the answer to a question)
  • Response of Interest Mapping (allows the user to map the response options available on each question to a response of interest)

In terms of the response of interest, I was only interested in responses where the answer from the survey was “Yes” so this was all that was required here. However for more complex response options, the model can provide the scalability required. For instance, if you were looking for an answer between 7-10 and the survey had been answered with a 7, 8, 9, or 10 – each of these could be mapped to 7-10 without having to create responses of interest for all particular combinations. This scales well and can cover scenarios for instance where the answer should be between 700 to 1000 in the same way.
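To show how the mapping behaves, here is a small Python sketch of the same idea. The survey names, question codes and dictionary layout are invented for illustration; in the actual solution this mapping lives in the MDS entities described above.

```python
# Hypothetical mapping: (survey, question, source response) -> response of interest.
# "N/A" marks a response that has been reviewed and is deliberately not of interest.
response_of_interest_mapping = {
    ("Survey A", "Q1", "Yes"): "Yes",
    ("Survey A", "Q1", "No"): "N/A",
    ("Survey B", "Q7", "6"): "N/A",
    ("Survey B", "Q7", "7"): "7-10",
    ("Survey B", "Q7", "8"): "7-10",
    ("Survey B", "Q7", "9"): "7-10",
    ("Survey B", "Q7", "10"): "7-10",
}

def is_response_of_interest(survey, question, response):
    """True only when the response is mapped to something other than N/A.
    Unmapped responses (not yet reviewed) are treated as not of interest."""
    mapped = response_of_interest_mapping.get((survey, question, response))
    return mapped is not None and mapped != "N/A"

# Made-up survey answers, including one that has not been mapped yet
answers = [
    ("Survey A", "Q1", "Yes"),
    ("Survey A", "Q1", "No"),
    ("Survey B", "Q7", "9"),
    ("Survey B", "Q7", "3"),
]
count_of_interest = sum(is_response_of_interest(*a) for a in answers)
```

Four source responses collapse to a single "7-10" response of interest, so adding a wider range only means adding mapping rows rather than new responses of interest.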

I also created a Master Question and Response of Interest value for N/A. This way, only the blanks on the mapping entities required populating and the user was never unsure whether a blank represented a question/response that was not of interest, or something that required mapping still.

All the entities above apart from Master Question, KPI, and Response of Interest were populated automatically from ETL with a SQL script used to extract the contents of those entities from source. The other 3 entities were populated manually by a business user. I also locked the entities / columns that the user shouldn’t be changing by using read-only permissions.

Some examples of the manually populated tables can be seen below:




Data Warehouse

For modelling the tables in the data warehouse, I created a separate dimension for each of the Response, Question, Survey, and KPI entities, and a single Fact to capture the responses of interest against these dimensions.

The majority of dimension lookups were straightforward, as was the response of interest measure, which can be seen below:

SELECT
SU.Code AS SurveyId,
SR.Name AS ResponseName,
1 AS ResponseOfInterest
FROM mdm.PL_ResponseOfInterestMapping RM
INNER JOIN mdm.PL_SourceResponse SR
ON RM.SourceResponse_Id = SR.Id AND RM.Survey_Id = SR.Survey_Id
INNER JOIN mdm.PL_Survey SU
ON RM.Survey_Id = SU.Id
WHERE RM.ResponseOfInterest_Code IS NOT NULL AND RM.ResponseOfInterest_Name <> 'N/A'

During our ETL runs for the fact, we also checked for responses that had not yet been mapped and did not pull these through.

If you have a cube sitting on top of your DW, you can then write measures across the fact to count the number of responses of interest. An example can be seen here:

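Price Activation Standard:=
CALCULATE(
        -- Count fact rows for the "Price" sales driver (the CALCULATE/COUNTROWS wrapper is assumed)
        COUNTROWS(CALCULATETABLE('Outlet Survey','KPI'[Sales Driver] = "Price")),
        'Outlet Survey'[IsResponseOfInterest] = 1
)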

This was then checked against a Target fact table to calculate the compliance, and the KPI was essentially an aggregation of the compliance across geography.



Overall, the model has proved very popular with the business. It’s easy to understand and gives the business control over which responses to count towards the KPI, without having to hard code values into the ETL as had been seen in previous solutions. It can also be extended easily by adding new KPIs and mapping them to new master questions without having to change any ETL. From a development perspective, it also means that nothing should go into the DW as Unknown for a Dimension, since the SQL used to populate MDS can also be used for the DW and should therefore always match.

If you have any questions, please feel free to ask them in the comments.