
Ust Oldfield's Blog

Forecasting – Principles and Methods

What comes to mind when you hear words like forecasting and forecasts?

Invariably, you’ll think of weather forecasts. But forecasts are much more than that.

Forecasting is the process of making predictions of the future based on past and present data and analysis of trends. It’s a process that has existed for millennia, though often with dubious methodologies… Instead of looking into a crystal ball to predict the future, we are going to employ the power of statistics!

Why use forecasting?

Nearly every function within a company needs an estimate of what the future will look like as a foundation to create and update plans. For example:

• Marketers need forecasts to estimate the expected demand that will be generated from their activities.

• Salespeople use forecasts to understand what volumes are expected in each period and to evaluate the sales pipeline activity to make sure that it supports those expectations.

• Managers in the supply chain use forecasts to make timely purchasing requests, develop production plans, evaluate capacity needs, and develop logistical plans.

• Finance professionals use forecasts to prepare and understand financial plans. They also use them to report on earnings expectations.

What can be forecast?

The predictability of an event or a quantity depends on the following conditions:

  1. How well we understand the factors that contribute to it;

    • For example, with energy consumption, a change in temperature has a clear impact on the amount of energy we use to heat our homes.

  2. How much data is available;

    • Generally, the more data you have to hand, the more accurate your forecasts will be.

  3. Whether the forecasts can affect the thing we are trying to forecast.

    • This is the principle of self-fulfilling forecasts. For example, if you publicly forecast a share price, buyers will adjust their behaviour in order to achieve the forecast price.

Often in forecasting, a key step is knowing when something can be forecast accurately, and when forecasts will be no better than tossing a coin. Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again.

Forecasting methods

1. Average approach

    a. The prediction for all future values is the mean of the past values.

2. Naïve approach

    a. The prediction for all future values is the last observed value.

    b. The approach is cost-effective and provides a benchmark for more sophisticated models.

3. Drift method

    a. A variation on the naïve method whereby the forecast increases or decreases by the drift, i.e. the average change seen across the historical data.

4. Seasonal naïve

    a. Accounts for the seasonality seen in the data and sets each prediction to the last observed value from the same season.

5. Autoregressive integrated moving average (ARIMA)

    a. Fitted to the data either to gain a better understanding of it or to predict future points in the series.

6. Exponential smoothing

    a. Applies exponentially decreasing weights to older observations in order to smooth out noise.

    b. Three types:

        i. Simple exponential: models the level.

        ii. Double exponential: models the level and trend.

        iii. Triple exponential: models the level, trend and seasonal components.

7. Probabilistic

    a. Assigns a probability to each of a variety of different outcomes; the complete set of probabilities represents a probability forecast.

    b. The most common example is weather forecasting, but the approach is also used for energy consumption, sports betting and population forecasts.

8. Qualitative approaches

    a. Typically involve asking people, usually experts, what they think is going to happen.

        i. Can be interpreted as a crowd-sourced forecast.
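
To make these methods concrete, here is a minimal R sketch, assuming the forecast package is installed and using the built-in AirPassengers monthly series as example data; the follow-up post will cover forecasting in R in more depth.

library(forecast)                              # assumed installed: install.packages("forecast")

y <- AirPassengers                             # built-in monthly airline passenger numbers, 1949-1960
h <- 12                                        # forecast horizon: 12 months ahead

fc_mean     <- meanf(y, h = h)                 # 1. average approach
fc_naive    <- naive(y, h = h)                 # 2. naive approach
fc_drift    <- rwf(y, h = h, drift = TRUE)     # 3. drift method
fc_seasonal <- snaive(y, h = h)                # 4. seasonal naive
fc_arima    <- forecast(auto.arima(y), h = h)  # 5. ARIMA, with the order chosen automatically
fc_ses      <- ses(y, h = h)                   # 6i.  simple exponential smoothing (level)
fc_holt     <- holt(y, h = h)                  # 6ii. double exponential smoothing (level and trend)
fc_hw       <- hw(y, h = h)                    # 6iii. triple exponential smoothing (level, trend and season)

plot(fc_hw)                                    # visualise one of the forecasts with its prediction intervals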

 

Forecast Planning

Forecasting is a common statistical task which helps to inform decisions about scheduling of staff, production, etc., and provides a guide for strategic planning. Forecasting is often used in conjunction with planning and setting goals.

Forecasting is the process of making predictions of the future based on past and present data and analysis of trends.

Goals are what you would like to happen. You should plan how you are going to achieve them and forecast to see if they are realistic.

Planning is a response to forecasts and goals. It involves determining an appropriate response to make forecasts match goals.

The way these three activities are applied determines how far into the future you need to forecast.

Short-term forecasts:

Needed for activities such as the scheduling of staff, production and transportation. Demand forecasts are often required as well.

Medium-term forecasts:

Needed to plan future resource requirements, such as hiring staff and ordering materials and parts.

Long-term forecasts:

Used in strategic planning.

Forecasting is a really powerful tool to add to your toolkit: it helps you better understand the future and accommodate what it might bring, so that you can grow your business in the best way possible. I will be publishing a follow-up blog on how to perform forecasting in R in the near future.

SQL 2016 CTP3 R Setup and Review

The release of SQL Server 2016 CTP3 is a big one, as it’s the first CTP to incorporate in-database support for R. This means that you can call R scripts and functions from within a SQL query without any of the workarounds previously demonstrated.

To get yourself up and running on CTP3 you’ll need to follow these steps:

1) Download, install and configure CTP3: https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-2016

    a. Make sure that Advanced Analytics Extensions is installed.

    b. Verify that the service for Advanced Analytics Extensions starts automatically, using the MSSQLLaunchpad account.

    c. On the Database Engine Configuration page, for Authentication mode, select Mixed Mode (SQL Server authentication and Windows authentication).

        i. For CTP3, a SQL login is required when using the RevoScaleR functions from a remote workstation to run R scripts in the context of the server.

2) Download, install and configure Revolution R Open: https://www.microsoft.com/en-us/download/details.aspx?id=49525

3) Download, install and configure Revolution R Enterprise: https://www.microsoft.com/en-us/download/details.aspx?id=49505

4) Enable external scripts:

exec sp_configure 'external scripts enabled', 1;
RECONFIGURE;

5) Run the post-installation script(s)

    a. Open an instance of Command Prompt with administrator privileges and run the following:

    b. C:\Program Files\RRO\RRO-3.2.2-for-RRE-7.5.0\R-3.2.2\library\RevoScaleR\rxLibs\x64\RegisterRExt.exe /install

        i. More info can be found here: https://msdn.microsoft.com/en-us/library/mt590536.aspx

Now you should be up and running and ready to fire up a new instance of SQL Server Management Studio to begin working with R. For the rest of the demo I will be using the latest version of Adventure Works for SQL 2016 CTP3; please restore the backed-up databases to continue. It can be found here: https://www.microsoft.com/en-us/download/confirmation.aspx?id=49502

To start off we’re going to do something really simple: return the mean of UnitPrice from Sales.SalesOrderDetail as a result set, just to demonstrate the functionality and the ease with which you can write and execute R scripts within SQL.

The script I’m going to use is:

USE AdventureWorks2016CTP3;

EXECUTE sp_execute_external_script
     @language = N'R'                                                      -- the external language to use
    ,@script = N'OutputDataSet <- data.frame(mean(InputDataSet[,1]))'      -- the R script to run
    ,@input_data_1 = N'SELECT UnitPrice FROM [Sales].[SalesOrderDetail]'   -- T-SQL query that supplies the input data
WITH RESULT SETS ((col int not null));

 

What it does is call a stored procedure with three input parameters: @language, @script and @input_data_1.

The only valid input for @language is 'R'.

@script is the external language script specified as a literal or variable input. Datatype of @script is NVARCHAR(MAX).

@input_data_1 specifies the input data for the external script in the form of a T-SQL query. Datatype of @input_data_1 is NVARCHAR(MAX).

The _1 suffix suggests that, in future, it may be possible to pass in more than one set of input data.

The syntax for sp_execute_external_script can be much more complex than demonstrated and more information can be found here: https://msdn.microsoft.com/en-us/library/mt604368.aspx

So, we’ve run our script. SQL has invoked the R script, which takes the mean of the first column and converts it to a data frame. A data frame is similar to a table and, at the moment, the output data has to be in a data frame structure in order for it to be passed back to SQL. The end result is that the mean of the UnitPrice has been returned.
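
To see what the embedded script does on its own, here is a small stand-alone R sketch with made-up input data; inside SQL Server, InputDataSet is populated from the @input_data_1 query and OutputDataSet is handed back as the result set.

# Made-up stand-in for the rows returned by the @input_data_1 query
InputDataSet <- data.frame(UnitPrice = c(10.50, 24.99, 7.25))

# The line passed to @script: take the mean of the first column and
# wrap it in a data frame so it can be returned to SQL as a result set
OutputDataSet <- data.frame(mean(InputDataSet[, 1]))
OutputDataSet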

Query and result should look something like this:

[Screenshot: the query and its single-row result]

R is a really powerful tool as it is, and incorporating in-database support for it within SQL Server 2016 provides a convenient, usable and even more powerful toolkit with which to process and analyse data. The possibilities for applying it to your data are vast, ranging from simple data exploration, as demonstrated above, to predictive modelling.

Be on the lookout for more demonstrations of R functionality within SQL Server 2016.

Connecting SQL Server to R

In this post I’m going to use R to retrieve some data from SQL Server.

In the absence of SQL Server 2016 and its soon-to-be-incorporated R functionality, using R in conjunction with SQL Server requires a few workarounds to produce the desired outcome.

R is a package-based platform and does not inherently communicate with other platforms unless the relevant package is installed and loaded. There are quite a few packages that allow R to talk to SQL Server; I prefer the RODBC package, as it is simpler to use than the alternatives, and will be using it for this example.


Step 1: Create a connection

As RODBC requires a connection to your SQL Server database, you’ll need to open the ODBC Data Source Administrator on the machine you’re using. Under the User DSN tab (though you could use the System DSN), click Add to create a new connection.

Select SQL Server Native Client 11.0 and click Finish.

The wizard will then ask for a name and description for the data source, and for the SQL Server you want to connect to; fill these in appropriately.

The next screen asks for the necessary security information, but as I’m using a local instance I will stick with Windows authentication.

The next screen is where you choose the database you want to connect to. By default you will just connect to the server, but if you wish to save the results from your analysis in R back to SQL Server and to the correct database it is important that you select the desired database. For this example I’m connecting to Adventure Works.


The next screen contains general configuration properties, which can be changed to suit your needs.

Click Finish, test the connection, and you’re all done for this step!

Step 2: Connecting R

For this next part we’ll be using an R client. My preferred client is R Studio as it provides a clean interface.

Now that you’ve created your connection you’ll want to use it within R. After firing up an instance of R with the RODBC package installed, you can load the package with the following syntax:

library(RODBC)

To bring the connection through to R you’ll need to assign it to a variable with the help of the odbcConnect function.

The format for invoking the function is as follows:

connectionstring <- odbcConnect("some dsn", uid = "user", pwd = "****")

• connectionstring is the variable assigned to store the connection

• odbcConnect is the function

• "some dsn" is the name of your DSN connection

• uid and pwd are the User ID and password for the server, if needed

For our example using AdventureWorks on a local machine the syntax is as follows:

AdventureWorks <- odbcConnect("AdventureWorks")

In order to see which objects are in your database you should run the following syntax:

sqlTables(AdventureWorks)

This returns a data frame listing the tables and views in the database, along with their schema and type.

You can then begin to use your data from SQL Server in R by using the sqlQuery function to extract data.

Employee <- sqlQuery(AdventureWorks, "SELECT * FROM HumanResources.Employee")

The purpose of the sqlQuery function is to retrieve a specific set of data, potentially from multiple tables. If you just wish to return the contents of a single table, it is easier to use the sqlFetch function.

Employee <- sqlFetch(AdventureWorks,"HumanResources.Employee")

sqlFetch returns the contents of the specified table and stores it in the assigned variable.
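
If you also want to push results back to SQL Server, and to tidy up afterwards, RODBC provides the sqlSave and odbcClose functions. A minimal sketch, reusing the AdventureWorks connection and Employee data frame from above (the target table name here is purely illustrative):

# Write the data frame back to SQL Server as a new table (the name is illustrative)
sqlSave(AdventureWorks, Employee, tablename = "EmployeeCopy", rownames = FALSE)

# Close the connection once you have finished with it
odbcClose(AdventureWorks)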

 

Connecting R to SQL Server is relatively easy and allows you to unleash the power of R on your data without employing expensive add-ons or waiting for a future SQL Server 2016 CTP to be released.

Currency Conversion (Triangulation Arbitrage)

Some systems use a single base currency into which all local currencies are converted, e.g. USD; this is something I noticed recently when working with IBM Cognos Controller. But what if you want or need to rebase into another currency while still retaining the original base?

This doesn’t appear to be easy to achieve within Cognos Controller itself, but it is achievable within SQL and a wider SSIS framework.

The basis for rebasing exchange rates is the technique of triangulation arbitrage. The technique is most often used within the banking system; in principle, you deal with two known exchange rates and one unknown. More information can be found here: http://en.wikipedia.org/wiki/Triangular_arbitrage

For example, suppose you have an exchange rate base of USD, know that the GBP/USD exchange rate is 1.54631 and the EUR/USD exchange rate is 1.11470, and wish to rebase from USD to EUR. You would begin by finding the inverse of the GBP/USD rate (1/1.54631 = 0.6467) and multiplying it by the EUR/USD rate (0.6467*1.11470 = 0.72087649), which produces the EUR/GBP rate; taking the inverse of that (1/0.72087649 = 1.3872) produces the GBP/EUR rate. To find the USD/EUR exchange rate, you simply take the inverse of the EUR/USD rate (1/1.11470 = 0.8971).
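
The arithmetic is easy to sanity-check outside SQL; here is a small R sketch using the quotes from the example above:

# Quick check of the triangulation arithmetic, using the rates quoted above
gbp_usd <- 1.54631            # GBP/USD: US dollars per pound
eur_usd <- 1.11470            # EUR/USD: US dollars per euro

usd_gbp <- 1 / gbp_usd        # ~0.6467, pounds per dollar
eur_gbp <- usd_gbp * eur_usd  # ~0.7209, pounds per euro (EUR/GBP)
gbp_eur <- 1 / eur_gbp        # ~1.3872, euros per pound (GBP/EUR)
usd_eur <- 1 / eur_usd        # ~0.8971, euros per dollar (USD/EUR)

c(EUR_GBP = eur_gbp, GBP_EUR = gbp_eur, USD_EUR = usd_eur)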

That might sound simple, but most exchange rates are held in a table across a range of dates, which complicates the calculation somewhat. I’ve used CTEs because I find they make the script neater and easier to debug. Below is an example of the triangulation using the Sales.CurrencyRate table in the AdventureWorks2012 database.

 

USE AdventureWorks2012
;

WITH
LocalExchange AS (    
SELECT
     ToCurrencyCode                                                                                    AS CurrencyCode
    ,CASE WHEN ToCurrencyCode <> 'EUR' THEN (1 / AverageRate) END                                    AS LocalCurrencyConversion
    ,CurrencyRateDate
    FROM [Sales].[CurrencyRate]
),
EuroExchange AS (    
SELECT
    ToCurrencyCode                                                                                    AS CurrencyCode
    ,CASE WHEN(ToCurrencyCode = 'EUR')THEN AverageRate END                                            AS EuroConversion
    ,CurrencyRateDate
    FROM [Sales].[CurrencyRate]
    WHERE CASE
        WHEN(ToCurrencyCode = 'EUR')  THEN AverageRate END IS NOT NULL
)
SELECT DISTINCT
    --Keys
    C.CurrencyRateID
    ,C.CurrencyRateDate   
    ,CASE
        WHEN C.ToCurrencyCode = 'EUR' THEN 'USD'
        WHEN C.AverageRate = 1 THEN 'EUR'
        ELSE C.ToCurrencyCode
     END                                                                                                AS CurrencyCode
    ,1/(CASE
        WHEN C.AverageRate = 1 THEN 1 ELSE
            (1 / (CASE
            WHEN C.ToCurrencyCode <> 'EUR' THEN (L.LocalCurrencyConversion * E.EuroConversion)
            ELSE C.AverageRate
            END))   
        END)                                                                                                AS ImplicitEuroExchangeRate               
    --Measures                                   
    --,C.ExchangeRate                                                                                        AS ExchangeRate                                                           
FROM
    [Sales].[CurrencyRate] AS C
INNER JOIN LocalExchange AS L
    ON L.CurrencyRateDate = C.CurrencyRateDate
    AND L.CurrencyCode = C.ToCurrencyCode
INNER JOIN EuroExchange AS E
    ON E.CurrencyRateDate = C.CurrencyRateDate
WHERE
    CASE
        WHEN c.ToCurrencyCode <> 'EUR' THEN (E.EuroConversion * L.LocalCurrencyConversion)
        ELSE 1   
    END    IS NOT NULL
ORDER BY C.CurrencyRateID

 

As always, if you have any feedback and can suggest a simpler way of performing the triangulation I would love to hear it.

How to get an age from two dates in SQL

For a project that I’ve been working on recently, I was tasked with finding the age of a product at a certain point in time, in years and months. What should have been a relatively simple task proved to be a bit more difficult, as the requirement was that the age be returned in two separate columns: Years and Months.

I approached it by using DATEDIFF as a starting point as that’s what I would want to emulate in order to produce my age.

DATEDIFF(DAY,@StartDate,@EndDate)

I used DAY as the interval as I thought it would be easier to calculate the age in a more granular format, if required.

I then started breaking down the output, as working in days is a bit unmanageable. I divided by the average number of days in a month (30.42) and then by the number of months in a year (12):

DATEDIFF(DAY,@StartDate,@EndDate)/30.42/12

Then I hit a stumbling block. How was I to split out the month from the year and the year from the month and have them displayed as whole numbers?

To get the year I used the following query:

ROUND((DATEDIFF(DAY,@StartDate,@EndDate)/30.42/12),1)-ROUND((DATEDIFF(DAY,@StartDate,@EndDate)/30.42/12),1)%1

To get the month I used the following query, not as long as the query to produce the year but still similar:

ROUND((DATEDIFF(DAY,@StartDate,@EndDate)/30.42/12),1)%1 * 12

The multiplication of the month by 12 is crucial, as it’s the only way to turn a fraction of a year into a number of months.

I then wrapped the overall query up within a CTE, producing the following script:

DECLARE @StartDate AS DATE
DECLARE @EndDate AS DATE

SET @StartDate = '1969-07-16'
SET @EndDate = GETDATE()
;

WITH Age AS (
SELECT
    -- Months: the fractional part of the (rounded) age in years, multiplied by 12
    CONVERT(INT, (ROUND(((DATEDIFF(DAY, @StartDate, @EndDate) / 30.42) / 12), 1)) % 1 * 12)
        AS AgeMonth,
    -- Years: the rounded age in years minus its fractional part
    CONVERT(INT, (ROUND(((DATEDIFF(DAY, @StartDate, @EndDate) / 30.42) / 12), 1)) - (ROUND(((DATEDIFF(DAY, @StartDate, @EndDate) / 30.42) / 12), 1)) % 1)
        AS AgeYear
)
SELECT AgeYear, AgeMonth
FROM Age
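
If you want to sanity-check the arithmetic outside SQL, here is a small R sketch of the same calculation, using the example start date from above; like the original, the 30.42-day month makes it an approximation:

# Mirror the T-SQL age calculation above (approximate, like the original)
start_date <- as.Date("1969-07-16")
end_date   <- Sys.Date()

days_between <- as.numeric(end_date - start_date)  # equivalent of DATEDIFF(DAY, ...)
age_in_years <- days_between / 30.42 / 12          # age as a decimal number of years

rounded   <- round(age_in_years, 1)
age_year  <- trunc(rounded)                        # whole years, as in ROUND(...) - ROUND(...) % 1
age_month <- trunc((rounded %% 1) * 12)            # leftover months, as in CONVERT(INT, ROUND(...) % 1 * 12)

c(AgeYear = age_year, AgeMonth = age_month)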

Adatis Hackathon Jan 2015 – Predictive Analytics and Azure ML

Back in January, Adatis hosted an internal hackathon in order to explore the various features of the Azure platform. We were split into three teams: STEAM (using Stream Analytics), DREAM (using Azure ML) and GLEAM (using Power BI).

I was looking after the DREAM team, which was responsible for predicting when a thermostat system would break down and, therefore, need to be replaced. We used the Azure Machine Learning (ML) platform to achieve this.

We had quite a substantial amount of preparation work to do before we could even begin using Azure ML. We began by creating a database and tables within the Azure platform so that we could connect the data to the ML system.

Once connected we were up and running, trying different statistical techniques in order to produce our desired outcome. Our objective was to predict at what age a thermostat system would break down, so that it could be replaced before it failed. Our initial hypothesis was that, as the age of the system increased, the likelihood of the system failing would increase. Our first test used linear regression, and this proved to be inaccurate, with an accuracy rating of 0.04%. We then used a Two-Class Boosted Decision Tree model, which proved to be the complete opposite, with an accuracy rating of 100%. How did two models produce such vastly different results?

So we had to delve into the data itself and look, row by row, at what it was telling us. The long and short of it was that, because we were using dummy data, we were unable to predict an outcome accurately: there were no real relationships between the different data points.

To snatch victory from the jaws of defeat we used the remainder of the exercise to give an overview of ML and its capabilities to members of the team.

Our impression was that Azure ML is a very powerful platform and easy to get to grips with, thanks to a drag-and-drop interface that will be familiar to many of us from working with SSIS. However, despite the easy-to-use interface, it does require at least a working knowledge of statistics and statistical techniques, otherwise you could end up wandering blindly through a forest. When things go wrong, it’s not intuitive to fix; it is a bit of a black box in this regard, but with the ability to run R scripts there are ways around this.

As it’s still in preview I expect that it can only get better and more intuitive. But a word of warning to would-be users: brush up on your stats beforehand.