
Piotr Starczynski's Blog

Azure Databricks - overwriting table that is also being read from

Problem Definition

Recently I ran into an interesting problem in Databricks with a non-Delta table. I tried to read data from a table (a table on top of files), transform it slightly, and write it back to the same location I had been reading from. Attempting to execute code like that fails with the exception: "org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from"

So let's try to answer the question: how do we write into a table (DataFrame) that we are reading from? This might be a common use case.

This problem is trivial, but it can be very confusing if we do not understand how queries are processed in Spark.

Let me explain it with an example.

Let's define a process that will allow us to:

  • Filter out objects that we have already processed.
  • Store the history of loaded objects for 3 days only (the history needs to take a minimum amount of space on the file system).
  • Today's load becomes part of the history for tomorrow's load.

Process for this task would look like this:

  1. Read CurrentLoad (the new objects that we want to load in).
  2. Read History limited to the last 3 days (objects that we have already processed).
  3. Create a new set called Unprocessed by removing existing History records from CurrentLoad (done via a left anti join).
  4. Process/transform Unprocessed (an optional step that may or may not use Unprocessed in further processing).
  5. Union only the last 3 days of History with Unprocessed (to maintain only the last 3 days of history and keep the set as small as possible).
  6. Overwrite History.

The above process is not the most efficient way of checking a new load against historic loads, but it is a good example of overwriting a DataFrame/table that we read from at the same time.

A diagram of this process looks like this:

image

So let's see what might go wrong in this process. We need to add the Unprocessed DataFrame (which contains a reference to the History DataFrame) to the existing History, and at the same time we need to remove the first day (to maintain only 3 days of history). This operation might not be as straightforward as it seems. The reason is that Databricks uses lazy evaluation by default, meaning that execution of the code does not happen immediately. The query is not evaluated until the last moment – that is, when we try to write the data down to the file system. Spark does this to optimize queries by creating an execution plan. The execution plan helps Spark evaluate queries in a way that gives the best performance. If we wait with processing the Unprocessed DataFrame, it will still hold references to two sets, History and CurrentLoad, while we try to write to History. This means that we are trying to perform a write operation on a table we are reading from.

There is a simple solution to this problem.

 

Solution

Materialize the DataFrame that you want to overwrite the data with (HistoryTemp), so that it clears out the dependencies on the tables we read from (History, CurrentLoad). The simplest way to do so is to write the DataFrame (HistoryTemp) to a temporary location on the file system and then re-read the data into a new DataFrame. This enables us to write data to the History location. This approach requires you to set a temp location for your temp data. Make sure that you overwrite your HistoryTemp each time. You can also perform additional clean-up at the end of your script to delete the temp data and metadata (table, files).


This process would look like this:
1. Calculate the new Unprocessed set (CurrentLoad left anti join History).
2. Union Unprocessed with the current History to get all the records that we want to save (HistoryTemp).
3. Write HistoryTemp (the overwriting set) to a temp location on the file system.
4. Re-read the data that we wrote out (HistoryTemp) into a new DataFrame.
5. Write the new DataFrame to your History location. Make sure that Unprocessed and HistoryTemp are not used further in the notebook; if you need to use them, perform the write operation at the end. A PySpark sketch of these steps is included in the Code section below.
image

 

Code

You can download the full notebook from the Reference section.

image

image
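Since the code above is embedded as screenshots, here is a minimal PySpark sketch of the same approach. The paths, the join key (objectId) and the loadDate column are assumptions made for illustration – adjust them to your own data; the downloadable notebook remains the reference version.

from datetime import date, timedelta
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

### Assumed locations – replace with your own storage paths
history_path = "dbfs:/data/history"
current_load_path = "dbfs:/data/currentload"
history_temp_path = "dbfs:/tmp/history_temp"

### 1. Read CurrentLoad and only the last 3 days of History
current_load = spark.read.parquet(current_load_path)
cutoff = (date.today() - timedelta(days=3)).isoformat()
history = spark.read.parquet(history_path).filter(F.col("loadDate") >= cutoff)

### 2. Unprocessed = CurrentLoad records that are not already in History (left anti join)
unprocessed = current_load.join(history, on="objectId", how="left_anti")

### 3. Union Unprocessed with the trimmed History
history_temp = history.unionByName(unprocessed)

### 4. Materialize HistoryTemp to a temporary location to break the lineage...
history_temp.write.mode("overwrite").parquet(history_temp_path)

### 5. ...re-read it, and only then overwrite the History location
spark.read.parquet(history_temp_path).write.mode("overwrite").parquet(history_path)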


What did not work for me

I have seen some suggestions on different forums that I tried and they DID NOT WORK for me – I was still receiving the same exception:

  • Creating a temp view on the data and caching it:
    Unprocessed.createOrReplaceTempView("unprocessedTemp")
    Unprocessed.cache()
  • Checkpointing the Unprocessed DataFrame with these commands:
    spark.sparkContext.setCheckpointDir("dbfs:/unprocessed/checkpointDirectory")
    Unprocessed.rdd.checkpoint()

 

Reference

  1. You can download the code to play around with from here:
  2. Users with similar problems were posting here

Azure Databricks Windowing Functions with Dataframes API

This blog is going to cover window functions in Databricks. I will describe the concept of window functions and how to use them with the DataFrame API syntax. If you have not used DataFrames yet, this is probably not the best place to start. Azure Databricks also supports Spark SQL syntax for these queries, but that is not going to be covered in this blog.

Introduction – What are Window Functions?

Databricks was designed to work with large data sets. Data distribution and parallelization of the work make queries run very fast. Those queries can be used not only to read data but also to transform it.

There are different kinds of transformations that you can apply to a DataFrame:

  • Operations on single rows that produce a single value output (built-in functions, UDFs, single-column transformations, e.g. casting a string, substring, if statements),
  • Exploding arrays of complex data – e.g. explode("col.name"),
  • Aggregate functions – grouping data with groupBy(…) and applying a function to the groups, e.g. max(), avg(), min() etc. (columns that are not in groupBy(…) or part of an aggregate function cannot exist in the result DataFrame),
  • Window functions – allow us to partition data by columns and then apply aggregate or rank functions within each partition separately – similar to grouping, but we can preserve columns that do not take part in partitioning. They also allow us to order elements within each partition.

The last one is the subject of this blog.

 
Problem definition

Let's say that we have to answer the questions listed below:

  • calculate running totals for the data set
  • rank data based on some columns
  • group rows and assign row numbers from 1…N within each group
  • calculate the difference between each record and aggregated values within the group it belongs to, e.g. the difference between the current value and the average of the partition/group it belongs to, while preserving all the columns from the set that we operate on

Some of these questions could be answered with a combination of groupBy(…) and join(…), and some could be answered using a looping mechanism. They could, but there is a better built-in mechanism that allows us to write these queries in a much neater way. This mechanism is called a window function. If you are familiar with window functions in T-SQL, this is not going to be a new concept for you, as it works in a very similar way.

 

How is it constructed?

There are three logical parts to a window function:

  1. Defining what function we want to use to calculate the value, e.g. max("column.name"), row_number() etc.
  2. Defining what window we want to use to calculate this value – this part might look different for different kinds of transformations, and some of its components are optional:
    1. what we are going to partition by (what columns we use for partitioning – grouping our data before the calculations are applied; the function is executed within each of these groups)
    2. what order is required in each partition before making the calculation – define the columns and the ordering direction before the function is applied. It has little influence on functions like max or min, but can have a great impact on functions like row_number, lag and lead
    3. what the scope/boundaries of the window are
  3. Using the function in the DataFrame.

# What does the window function syntax look like in the DataFrame API?

  1. Import the required namespaces and libraries – you need to import Window from the pyspark.sql.window namespace before you start using it:
    from pyspark.sql import *
    from pyspark.sql.window import Window
    from pyspark.sql.functions import *
  2. Define the window specification – one of the specifications below, depending on what type of window we want to define:
    ## if we do not want to specify boundaries, sliding windows etc.
    windowSpecification = Window.partitionBy(…).orderBy(…)

    ## if we specify window boundaries based on rows
    windowSpecification = Window.partitionBy(…).orderBy(…).rowsBetween(…)

    ## if we specify window boundaries based on column values
    windowSpecification = Window.partitionBy(…).orderBy(…).rangeBetween(…)

    Explanations:
    Window – the window function object (it always stays as Window – see the import statement).
    Partition columns are specified by putting the column names in quotes in partitionBy(), e.g. Window.partitionBy("col1","col2",…,"colN").
    Sorting columns are specified by putting the column names in quotes in orderBy(), e.g. Window.orderBy("col1","col2",…,"colN").
    The direction of the sort is specified by adding desc("col.name") or asc("col.name") in the orderBy clause, e.g. Window.orderBy(asc("col1"),desc("col2"),asc("colN")).
  3. Use the syntax below to apply the function over the specified window:
    max("Salary").over(windowSpecification)

    In a DataFrame we use it by creating a new column that holds the value. Please note that we need to create windowSpecification first.
  4. The full code can look like this:
    ### Import statements
    from pyspark.sql import *
    from pyspark.sql.window import Window
    from pyspark.sql.functions import *

    ### Define window specification
    windowSpecification = Window.partitionBy("column.to.partition.by").orderBy(desc("column.to.order.by"))

    ### Produce a new DataFrame with a new column that holds the row number within each partition
    dfResult = dfSourceData.withColumn("RowNumber", row_number().over(windowSpecification))
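To make this concrete, here is a small self-contained sketch that can be run end to end; the sample data, column names and values are invented for illustration and are not taken from the original notebook.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, sum as sum_, desc

spark = SparkSession.builder.getOrCreate()

### Made-up sample data: (Department, Employee, Salary)
df = spark.createDataFrame(
    [("IT", "Anna", 5000), ("IT", "Bob", 4000), ("IT", "Carl", 4500),
     ("HR", "Dave", 3000), ("HR", "Eve", 3500)],
    ["Department", "Employee", "Salary"])

### Partition by department, order by salary descending
windowSpecification = Window.partitionBy("Department").orderBy(desc("Salary"))

### Row number and running total within each department
dfResult = (df.withColumn("RowNumber", row_number().over(windowSpecification))
              .withColumn("RunningTotal", sum_("Salary").over(windowSpecification)))
dfResult.show()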

 

Functions that we can use with window functions

There are plenty of window functions that we can use. Some of them are described below:

 

  • Accessing absolute/relative records within the partition
    • lag("col.name") – returns the value of the specified column from the row preceding the current row. For the first row in the partition, null is returned as there is no previous row

    • lead("col.name") – returns the value of the specified column from the row following the current row. For the last row in the partition, null is returned as there is no next row

    • first("col.name") – gets the first value from the partition in the current order

    • last("col.name") – gets the last value from the partition in the current order

  • Ranking functions
    • row_number() – assigns a row number starting at 1 and increasing by 1 with each row in the partition
    • rank() – calculates the rank number
    • dense_rank() – calculates the dense rank
    • ntile(N) – splits the set into N roughly equal buckets (if the set cannot be divided equally, the last buckets will be uneven in size)
  • Aggregation
    • max("col.name") – calculates the max value of the chosen column within the specified partition. Remark: if we bound the window, e.g. unbounded preceding to current row, this might return different results depending on the ordering direction (asc, desc)
    • min("col.name") – calculates the min value of the chosen column within the specified partition. The same remark about bounded windows and ordering direction applies
    • sum("col.name") – calculates the sum of the chosen column. Can be used for running totals with a bounded window
    • avg("col.name") – calculates the average
    • count("col.name") – calculates the count of items in the partition
    • mean("col.name") – calculates the mean in the partition
  • Other
    • collect_list("col.name") – collects a list of the chosen column's values within the partition, in the order they reside, and returns it in the form of an array

    • collect_set("col.name") – collects the set of distinct values of the chosen column within the partition, without ordering, and returns it in the form of an array

  • Not supported
    • countDistinct("col.name")

 
Remarks

Some of the functions, such as aggregations, behave differently with an orderBy(…) clause than without it:

  • with orderBy, e.g. Window.partitionBy(…).orderBy(…) – the value is calculated in a running manner (for the first row, value = min(1st record); for the second row, value = min(1st & 2nd records); for the third row, value = min(1st, 2nd & 3rd records), and so on)
  • without orderBy, e.g. Window.partitionBy(…) – the value is calculated across the whole partition, so each record gets the same value, e.g. the average of the column over the partition.
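A quick sketch of this difference, reusing the sample df from the earlier example (the column names are assumptions):

from pyspark.sql.window import Window
from pyspark.sql.functions import avg

### With orderBy the default frame is "unbounded preceding to current row",
### so the average is calculated in a running manner
runningSpec = Window.partitionBy("Department").orderBy("Salary")

### Without orderBy every row gets the same value for the whole partition
overallSpec = Window.partitionBy("Department")

dfAverages = (df.withColumn("RunningAvg", avg("Salary").over(runningSpec))
                .withColumn("PartitionAvg", avg("Salary").over(overallSpec)))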

 

Unbounded Examples

Unbounded window functions are those that do not define a boundary in the window specification, that is, they have no rowsBetween() or rangeBetween() statement on the window object. Unbounded window functions execute the function across all records in each partition; in other words, they do not limit the records that the aggregate/rank function is executed on within each partition. They are useful for answering questions like the ones below:

image
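Since the example questions and results above are shown as screenshots, here is a small sketch of an unbounded window; it reuses the sample df defined earlier, and the column names are assumptions.

from pyspark.sql.window import Window
from pyspark.sql.functions import max as max_

### No rowsBetween/rangeBetween and no orderBy: the frame covers the whole
### partition, so every row gets the maximum salary of its department
unboundedSpec = Window.partitionBy("Department")
dfUnbounded = df.withColumn("MaxSalaryInDept", max_("Salary").over(unboundedSpec))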

 

Row Bounded Examples

Row-bounded window functions are those that define a boundary in the window specification by using the rowsBetween(…) statement on the window object. This allows us to limit the records that the function is executed on in each partition. The limit is enforced by specifying an offset as a number of rows; it does not use the values of the records. The frame can be dynamic relative to the current row, or we can bound it on one side (lower or upper) or on both sides relative to the current row. They are useful for answering questions like running totals, or when we want to make calculations on records relative to the current row.

image
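Again, as the examples are screenshots, here is a small row-bounded sketch reusing the sample df from earlier (column names are assumptions):

from pyspark.sql.window import Window
from pyspark.sql.functions import sum as sum_, avg

### Running total: every row from the start of the partition up to the current row
runningTotalSpec = (Window.partitionBy("Department")
                          .orderBy("Salary")
                          .rowsBetween(Window.unboundedPreceding, Window.currentRow))

### Moving average over the previous row, the current row and the next row
movingAvgSpec = (Window.partitionBy("Department")
                       .orderBy("Salary")
                       .rowsBetween(-1, 1))

dfRowBounded = (df.withColumn("RunningTotal", sum_("Salary").over(runningTotalSpec))
                  .withColumn("MovingAvg", avg("Salary").over(movingAvgSpec)))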

 

Range Bounded Examples

Range-bounded window functions define a boundary in the window specification by using the rangeBetween(…) statement on the window object. This allows us to limit the records that we execute our function on in each partition. The limit is enforced by specifying an offset based on the values of the ordering column, so it is dynamic – each window is calculated based on the values that determine its size. This kind of function is useful for answering questions where we need to define the size of the window based on the values of the column we order by.

image
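A short range-bounded sketch, again reusing the sample df and assumed column names:

from pyspark.sql.window import Window
from pyspark.sql.functions import count

### Range frame: for each row, include every row whose Salary is within
### 500 below to 500 above the current row's Salary (value-based, not row-based)
salaryRangeSpec = (Window.partitionBy("Department")
                         .orderBy("Salary")
                         .rangeBetween(-500, 500))

dfRangeBounded = df.withColumn("PeersWithSimilarSalary",
                               count("Salary").over(salaryRangeSpec))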

 

Summary

Window functions in Databricks can cover a very wide range of use cases. They provide a nice alternative to loops, so if you need to perform looping operations on your data, ask yourself whether it would not be easier to solve the problem with window functions. Be aware that window functions may cause a large amount of data shuffling, so use them with care.

 

References

You can find more information in the links below:

  1. Data Frames – explains what DataFrames are and how to use them
  2. Introduction to window functions – another good blog about window functions (Spark and SQL APIs)

 

Download

  1. Databricks Notebook – you can download all the examples used in this blog and play with them without the need to mount external storage. The data is generated by a script.

Parsing JSON with null-able properties in Logic Apps

Welcome to the next episode of Logic Apps for newbies!

JSON is nowadays one of the most common open-standard file formats used to transmit data in the IT industry. We used it with Logic Apps a lot in the data lake ingestion engine that we created for one of our clients. JSON is used nearly everywhere, and that's why it is so important to know how to deal with it.

Logic Apps provides very decent support for this kind of data, but some solutions are not so obvious, even for a person with good programming experience. If you are reading this blog, there is a chance that you know what I am talking about.

In this blog we are going to discuss:

  1. How to parse JSON in Logic Apps
    • Introduction to the Parse JSON component
    • Generating the JSON parsing schema
    • Logic App example
  2. What the problem is with JSON that contains nulls
    • Parsing a JSON property set to null against a string-typed property
  3. What the solution is for dealing with nulls
    • Modifying the existing schema to allow nulls
    • Execution that allows nulls to be parsed

# How to parse JSON in Logic Apps

To explain how to parse JSON in Logic Apps I am going to use an example JSON object with a single property, like the one below:

{

  "Name": "Piotr"

}

  

Introduction to Parse Component

So let's focus on the Parse JSON component, as it is going to do all the parsing for us.

This component is composed of two text boxes and a single action link:

  1. Content – this takes the JSON that we want to parse as an input (required).
  2. Schema – this defines how our JSON is constructed, so that it can be parsed in the correct way and all properties of the object are strongly typed for further references (required).
  3. "Use sample payload to generate the schema" – an action link placed below the schema box that can help us generate the schema for the input that we want to pass.

image

 

Generating Schema

By clicking the "Use sample payload to generate the schema" link, a little window like the one below pops up, and we can generate the schema based on the input JSON that we want to parse. Logic Apps generates that schema automatically.

This process is shown in the picture below:

Payload

If we hit Done, our component should look like the picture below.

image

The schema generated from our example object is shown in the picture above. Now our Parse JSON component expects the Content object to have a single property named "Name" of type "string", but we can define it to be any allowed type, e.g. integer, object, array etc.
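For an object like ours, the generated schema should look roughly like the snippet below (a sketch – the designer may emit slightly different metadata):

{
    "type": "object",
    "properties": {
        "Name": {
            "type": "string"
        }
    }
}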

 

Logic App Example

Our example Logic App to parse JSON is very simple. It uses two components and a single HTTP trigger (the trigger is not going to be covered in this blog).

Those two components are:

  1. Compose Example Object (Compose component) – the main purpose of this component is to define our JSON example so we can feed it to the Parse Example Object component that does the parsing. In a real-world application it would be an API call or a file read from some storage component that returns JSON data.
    In our example, as in the picture below, we insert our example JSON object from the text box above.
  2. Parse Example Object (Parse JSON component) – this is the object that we are interested in. It parses the JSON into a Logic App object (with properties that we can access in further actions).
Initial

 

If we run our example Logic App and pass in our example object, it finishes successfully, as in the picture below. We can see in the OUTPUTS section of the picture that we received an object with the property "Name" set to "Piotr", which means the parsing was correct.

Please do not confuse the OUTPUTS in the picture below with the JSON object that we passed in. This is the result of the parsing – the Logic App represents the parsed object in the form of a JSON object. In further actions we would be able to reference the "Name" property of the object we parsed.

ValidObjectSuccess_withMarking

Now that we understand how to parse JSON in Logic Apps, let's see what happens when we try to input objects that do not have a value set. It is a very common scenario that an API you use returns objects whose properties are null.

 

# Parsing an object that contains a null value

So let's say that for some calls our API returns an object like the one below, with one of the properties set to null.

{

  "Name": null

}

Let's see what happens if we try to pass an object like that into our Logic App. Our schema still expects a string for the property named "Name".

Null-Failure_Markings

Our Logic App fails, showing us the error message in OUTPUTS:

"Invalid type. Expected String but got Null."

So let's see what a quick solution for that would be.

 

# Solution

The solution for successful parsing is to modify the schema to allow nulls.

 

Modifying the existing schema

It can be done in one simple step, by changing the type of the property into an array of allowed types, as shown below:

Modified SchemaWithMarkings
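As the marked-up screenshot may not render, the modified schema should look roughly like the snippet below; changing "type" from a single value to an array of allowed types is the key part (a sketch of the idea, not the exact screenshot contents):

{
    "type": "object",
    "properties": {
        "Name": {
            "type": ["string", "null"]
        }
    }
}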

 

Execution that allows nulls to be parsed

If we do that, we will be able to parse both nulls and strings, as in the picture below:

NullSuccess_with Markings

We can see that this time the Logic App did not throw an exception and allows us to deal with the null value within our code.

Let's bear in mind that allowing nulls is not always the best option.

Default values could also be considered, but everything depends on the context your Logic App operates in.

Thanks for reading and I hope you enjoyed the blog post.

 

You can download a working example from the section below.

# Download

  1. You can download the working Logic App with the null example from here

Converting JSON with nested arrays into CSV in Azure Logic Apps by using Array Variable

Recently our company needed to rebuild its internal BI solution. The change was caused by an upgrade of some external APIs that our BI solution uses to collect timesheet data. In the new solution we use Azure Logic Apps to call the timesheet APIs, which return all the required timesheet data in the form of JSON objects. One of the problems we encountered during the work was how to convert the JSON response objects that the API returns into delimited flat files. The solution that we found is trivial, but not immediately obvious for a person with little experience in Azure Logic Apps, so that's why I decided to share it with you.

This blog is going to explain how we can easily convert complex JSON objects into a flat CSV table using Azure Logic Apps. It is not going to cover how to use Logic Apps from scratch. I will assume that you have basic knowledge of how the service works and can create simple logic flows yourself. I am going to focus more on which components we should link together, and how to solve the problem of flattening JSON objects, rather than on the technical details of how particular components work.

If you think you need some reading first, please visit the Azure Logic Apps documentation website.

To start with, we need an example of the JSON object that we will try to convert to a CSV table.

So let's dive in!

The picture below shows the JSON object that we want to convert to a CSV table. The snippet shows only the first element, but in reality this JSON array is composed of multiple objects like this one. You can find the download link for the full object in the Download section at the bottom of this blog.

image

We can decompose the object from the picture above as shown below:

Please note that everything in red is only for annotation.

This object is constructed of multiple levels (a sketch of the assumed structure follows the list):

  • (Level 0) The root of the object is an array of Project objects (each Project object in the array has two properties):

    • (Level 1 – project-level properties) Project – the property that describes the project name
    • (Level 1 – project-level properties) AssignedEmployees – the array of employees assigned to the project (each employee in the AssignedEmployees array also contains two properties):
      • (Level 2 – employee-level properties) EmployeeName – contains the name of the employee
      • (Level 2 – employee-level properties) Rates – the array that stores information about the rates for that particular employee on that project (each rate in the Rates array contains two properties):
        • (Level 3 – rate-level properties) RateAmount
        • (Level 3 – rate-level properties) RateEffectiveDate
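Since the original snippet is a screenshot, here is a minimal sketch of the assumed shape; the property names come from the annotations above, while the values are invented:

[
  {
    "Project": "Project A",
    "AssignedEmployees": [
      {
        "EmployeeName": "Piotr",
        "Rates": [
          { "RateAmount": 100, "RateEffectiveDate": "2018-01-01" },
          { "RateAmount": 120, "RateEffectiveDate": "2018-06-01" }
        ]
      }
    ]
  }
]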

Let's focus for a moment on the Logic App side and present the components that we would use for the CSV conversion.

Logic Apps has a Create CSV Table action component that can produce a CSV table based on a JSON array of objects (see the screenshot below).

image

Looking at the text box above, if we pass a JSON array reference into the "From" field of the component, the output from that component will be a CSV table, but there are some serious limitations.

This component expects a JSON array of simple objects. This means that each object in the array should be composed of only a single level of properties, as in the screenshot below:

JSON_FlattenArrayWithAnnotations

Complex objects are not supported at present, and if we provide an array of objects like the one in our example at the start, everything below level 1, i.e. the employee and rate information, will be treated as a single string, as in the Outputs of the Create CSV Table component in the picture below:

Try of inserting complex object_Colors

The solution is to "flatten" our complex object (multi-level) into flat objects (single level) and pass them into the Create CSV Table component.

The rest of the blog focuses on how to convert the complex object into simple JSON so we can leverage the Create CSV Table component to create the CSV.

Solution

As we have just discussed, in order to convert the JSON object into a CSV we need to flatten it first.

In order to flatten this object, we use a combination of a few Azure Logic App components.

Our solution is composed of the components listed below:

  1. HTTP trigger ("When a HTTP request is received") – this runs when the HTTP URI is called (it is not relevant to the solution; we can use any type of trigger).
    image
  2. Compose action component – this creates an example array of JSON Project objects. In a real-world example, this would come as part of an HTTP request, a Service Bus message, or any other way we can read JSON data in Logic Apps. The picture below shows only part of the whole object. You can view the whole object by downloading the example file in the Download section at the bottom of this blog.
    image
  3. Parse JSON action component – this component parses the JSON array and translates it into Logic App objects, so we can access the objects and properties within our Logic App flow. To generate the validation and parsing schema, we use the option "Use sample payload to generate schema" (at the bottom of the component) and paste in an example of the JSON that our application will take. In our case this is going to be the example from the component in the picture above.
    1.Parse JSON Array of Projects

  4. Initialize Variable action component – this component is going to be our temporary storage for the newly created flattened objects. We name it RatesArray and it needs to be of type Array. We do not need to set any initial value, as we are going to populate this array in further steps. Please do not confuse RatesArray with the array of rates from the initial JSON – it is a variable to hold the newly created objects.

    2.Initialize Rates Array
  5. Nested For each loop components (for each Project, for each Employee, for each Rate) in order to iterate through all levels of our object.
    3.Iteration_throug all levels
  6. Compose action component and Append to array variable component – at the lowest level (inside the Rate loop) we create a new flattened Rate JSON object with a Compose component (this object contains properties from all levels – project level, employee level and rate level). A sketch of such a flattened object is shown after this list.
    Next we use the Append to array variable component and paste in the newly created Rate object, as in the picture below.


    4. Create new flatten array object and add it to rates array
  7. Create CSV Table component – this component performs all the heavy lifting for us. It takes the already populated array of flattened Rate objects as its From parameter and converts it into a CSV table. Note that RatesArray contains the already flattened rate objects; each property of those objects produces one column in the CSV table.

    5. Pass object into CSV Table conversion

    We can use +New step, as in the picture above, to create additional steps if we would like, for example to save the created CSV into Blob Storage or a data lake.


    The full Logic App will look like the picture below.
    image

    Now that we have everything for the Logic App in place, let's see the result of executing it.
     
    SuccessfullConversion

    As we can see in the Output section, the conversion resulted in a CSV table being created successfully for our example array of complex Project objects. Please note the input section of the picture above: it takes the flattened input objects that we created in the RatesArray variable.
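For reference, a flattened Rate object appended to RatesArray would look roughly like the sketch below (the values are invented; inside the Compose component each value would come from the current item of the corresponding loop):

{
  "Project": "Project A",
  "EmployeeName": "Piotr",
  "RateAmount": 100,
  "RateEffectiveDate": "2018-01-01"
}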

Conclusion

Logic Apps provides a simple way of converting an array of JSON objects into a CSV table. The advantage of using Logic Apps for this type of conversion is that there are in-built action components to perform the task. Moreover, we have different types of triggers available that we can configure this web service with, and we can choose from a variety of connectors for our input and output data.

There are certain constraints and drawbacks to consider before we use this in our workflow.

This method is not the most efficient, as we nest one "for each" inside another "for each", and the more nesting, the worse the performance. In addition, Logic Apps has execution time limits, e.g. the HTTP request trigger that we used is limited to 120 seconds of execution, therefore we shouldn't use this method to convert JSON arrays composed of millions of objects.

If we need to convert very large files or files with a very complicated structure, it would probably be more suitable to use another type of service, e.g. Azure Functions, and perform all the heavy lifting in C#.


Example Code

If you want to see the above solution in action you will need to:

  1. Get an Azure subscription (a free trial if you do not own one)
  2. Create an empty Logic App (blank template) with a name of your choice
  3. Download the working Azure Logic App JSON code from the Download section
  4. Open the code view and replace the empty template with the copied code
  5. Run the trigger manually, by calling the Logic App HTTP trigger address in a browser, or by sending a Postman HTTP GET request to the Logic App HTTP address specific to your application. The request does not require any body parameters or additional headers.


Download

  1. Click here to download the Logic App JSON code
  2. Click here to download the complex JSON example
  3. Click here to download the flattened complex JSON example

Using Azure Machine Learning and Power BI to Predict Sporting Behaviour

Can we predict people's sporting performance by knowing some details about them?

If so, what is better at making these predictions: machines or humans?

These questions seemed so interesting that we decided to answer them by creating a working IT solution and seeing how it would perform.

This blog provides an overview of the project, a simple analysis of the results, and details of the technologies we used to make it happen.

As it would be hard to check all available sport disciplines, we decided to focus on the one we love the most – cycling. Our aim was to predict the maximum distance a rider would ride within one minute from a standing start. Although "one minute" sounds insignificant, this is a really tough exercise, as we were simulating quite a tough track.

We used the following equipment to perform the experiment:

  • a bike with all the necessary sensors to record speed, cadence, crank turns, wheel turns and distance
  • a velodrome bike simulator
  • a heart rate monitor in the form of a wrist band

    Using this equipment allowed us to capture data about the ride in real time and display it, using Stream Analytics and Power BI, in a live interactive dashboard as shown below (Picture 1):

     

    image

    Picture 1: Real-time Power BI dashboards showing: average heart rate (top row, left); current speed in km/h (top row, middle); average speed (top row, right); current crank turns, wheel turns and cadence (bottom row, left); average crank turns, wheel turns and cadence (bottom row, right)


    Sensors were used to capture information about how fast our rider was cycling, how many crank turns they made, what their heart rate was during the ride and, most importantly, how far they went within the time limit.

    Each rider had a chance to predict their maximum distance before their ride. We also made a prediction based upon previous cyclists' results using machine learning algorithms.

    In order for the machine learning algorithms to make estimates about each of the riders, we had to capture some representative properties about each rider before the ride. All riders needed to categorise themselves for each of the properties listed below:

  • age
  • height
  • weight
  • gender
  • smoking volume
  • drinking volume
  • cycling frequency

     

    So taking weight as an example, people were asked to allocate themselves to one of the available buckets:

    e.g.

  • Bucket 1 – 50-59kg
  • Bucket 2 – 60-69kg
  • Bucket 3 – 70-79kg
  • Bucket N – above 100kg

    Bucketing the properties helped us reduce the number of distinct values, which increased the probability that for a given rider we would find someone with similar characteristics who had already ridden.

    Obviously, to make the prediction work we had to have an initial sample. That's why we asked "Adatis people" to have a go on a Friday afternoon. In true competitive spirit, some of them even tried a few times a day! By the beginning of the SQLBits conference we had managed to save details of around 40 different rides.

    In a nutshell let me describe the process that we repeated for each rider.

    The first step was to capture the volunteer's details using an ASP.NET web app, including the maximum distance they thought they would be able to reach (the human prediction). Next, behind the scenes, we provided their details to the machine learning algorithm, exposed as a web service, to get a predicted distance. We then turned on all the sensors and let the cyclist ride the bike. During the ride we captured all the data from the sensors and transferred it to the database through the Azure IoT stack. After the ride finished, we updated the distance reached by the rider. The more cyclists participated, the bigger the sample size we had to predict the result for the next rider.

    Overall we captured 150 rides for 138 different riders. The initial sample size we used to make predictions was 40 riders, and it grew as more riders got involved.

    The table below (Table 1) contains basic statistics on the differences between the machine learning predictions and the human predictions.

     

    Prediction Type        | Avg. Difference | Std. Dev. of Difference | Max Difference | Min Difference
    Azure Machine Learning | 119m            | 87m                     | 360m           | 2m
    Humans                 | 114m            | 89m                     | 381m           | 0m

    Table 1: Absolute difference between predicted and reached distance for the 1-minute ride (average distance reached: 725m)

     

    From these numbers we can easily see that neither humans nor machine learning came close to the real results reached by the riders. Over an average ride of 725m, the average difference was 114m for humans with a standard deviation of 89 meters, and 119m for machine learning with a standard deviation of 87 meters. That means both of them were equally inaccurate, although it is worth mentioning that there were single cases when the prediction was very close or even equal to the distance reached. In trying to determine the reason behind the miscalculations in the ML prediction, I would say that the main one is that the sample size was not sufficient to make accurate predictions. Besides the small sample, there might be other reasons why the predictions were so inaccurate, such as:

  • incorrect bucket sizes for the rider properties
  • too many properties to match on
  • lack of a strong enough correlation between the properties and the distance reached

     

    It is also worth mentioning that some properties showed a high correlation with the distance reached, like the height of the rider, while others showed a low correlation, like drinking volume.

    The best examples of high correlation can be seen in the charts attached below (Chart 1):

     

    image

    Chart 1: Correlation between distance reached in meters and the height category of the rider


    And an even more significant one regarding fitness level (Chart 2):

     

    image

    Chart 2: Correlation between distance reached in meters and fitness category of the rider

     

    On the other hand, some rider properties did not show the correlation that we would expect, e.g. age (Chart 3).

     

    image

    Chart 3: Correlation between distance reached in meters and age of the rider

     

    Although there is no straightforward correlation, as previously stated, we can observe a general trend: we tend to perform better the closer we get to our round birthdays. We can observe peaks at the ages of 18, 29, 39 and 49. Is it perhaps because of the fear of reaching the next decade?

    I will leave this up to your interpretation…

    If you are interested in a more technical explanation of how we designed and built our project, I would like to invite you to the second part of this blog, which will cover the top-level architecture of the project and also some deeper insights into the core technologies used, including Azure Stream Analytics, Azure Event Bus, Power BI web, ASP.NET MVC4 and SignalR.