Ust

Ust Oldfield's Blog

R - What Is It?

I have blogged often on the subject of R, but have not previously addressed what R is and why you should use it. In the blog post, I will set out what R is, why you should use R and how you can learn R.

What Is R?

R is a powerful tool for statistical programming statistics and graphics. There are lots of software available which can do all of these things: spreadsheet applications like Excel; point-and-click applications like SPSS; data mining applications like SSAS; and so on. But what sets R apart from applications like those listed?

R is a free and open source application. Because it is free you don't have to worry about subscription fees, usage caps or licence managers. Just as importantly, R is open. You can inspect the source code and tinker with it as much as you want.

Leading academics and researchers use R to develop latest methods in statistics, machine learning and predictive modelling. These methods are stored in packages which can be accessed by anyone for free! There are thousands of packages available to download and use.

R is an interactive language. In R you do analysis by writing functions and scripts, not by pointing and clicking. As an interactive language (as opposed to a data-in-data-out black box), R promotes experimentation and exploration, which improves data analysis and sometimes leads to discoveries that would not have been made otherwise. Scripts document all your work, from data access to reporting, which can be re-run at any time.This makes it easier to update results when the data changes. Scripts also make it easy to automate a sequence of taks that can be integrated into other processes, such as an ETL.

One of the design principles of R was that visualtision of data through charts and graphs is an essential part of data analysis. As a result, it has excellent tools for creating graphics, from staples like bar charts to brand new graphics of your own design.

With R you are not restricted to choosing a rigid set of routines and procedures. YOu can use code and packages contributed by others in the community, or extend R with your own functions and packages. R is also excellent for mash-ups with other applications. For example you can build it into your SSIS routine, or take advantage of the new integration in SQL Server 2016.

We've covered, briefly, what R is. But why do you want to use it?

Why Use R?

There is a vibrant community built around R. With thousands of contributors and millions of users around the worldif you have a question about R chances are someone has answered it, or can answer it.

It is quickly becoming an integral part of the Microsoft BI stack. Since Microsoft's acquisition of Revolution Analytics, R has been featuring in the more recent releases in the Microsoft BI world. From Power BI to SQL Server, Visual Studio to Azure ML; R is becoming an integral component of the BI stack.

Where can you learn R?

There are various books and online courses on R which can be used to quickly skill you up in this powerful language. These are some which I recommend:

EdX: Introduction to R for Data Science

Book: R Cookbook by Paul Teetor

Book: R Programming for Data Science by Roger Peng

Here, at Adatis, we run internal R training courses which mean that all of our employees have the opportunity to learn from internal subject matter experts and improve their knowledge and skills.

You can also try out the R tutor from Revolution Analytics, which is a package built for R.









Forecasting: R and SQL Server 2016

Forecasting is pretty crucial to any business as nearly every function within a company needs an estimate of what the future will look like as a foundation to create and update plans. In a previous blog I addressed the principles and methods of forecasting and in this blog I will go through how you can forecast in R and SQL using the Autoregressive Integrated Moving Average (ARIMA) technique.

I will approach this in two parts. Part 1 will be demonstrating the forecast in an R platform; and in Part 2 I will be demonstrating the forecast using the R integration in SQL Server in conjunction with SSRS to visualise the results.

Part 1: The Pure R Approach

As mentioned, this will be creating a forecast using R and taking data from a SQL database. You will need to be able to connect SQL to R in order to do follow this section. You’ll also need to install the forecast CRAN package. As always we’ll be using AdventureWorks for the provision of data.

The R script is as follows:

library(RODBC)

library(forecast)

SQLConnection <- odbcConnect("AdventureWorks")

 

Data <- sqlQuery(SQLConnection, "WITH CTE AS(SELECT YearDate as[Year], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]  FROM (

SELECT * FROM (SELECT --DISTINCT

             SUM(salesamount) as monthsales

                                ,MONTH([OrderDate]) as monthdate

                                ,YEAR(OrderDate) as  yeardate

                   FROM [AdventureworksDW2016CTP3].[dbo].[FactInternetSales]

                   group by MONTH(orderdate), year(orderdate)) t )

AS Table1

PIVOT(SUM(MonthSales) FOR Monthdate IN([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12])) AS PVT

)

SELECT [Year], cast([1] as numeric(28,8)) as Jan, cast([2] as numeric(28,8)) as Feb, cast([3] as numeric(28,8)) as Mar, cast([4] as numeric(28,8)) as Apr, cast([5] as numeric(28,8)) as May,

cast([6] as numeric(28,8)) as Jun, cast([7] as numeric(28,8)) as Jul, cast([8] as numeric(28,8)) as Aug, cast([9] as numeric(28,8)) as Sep, cast([10] as numeric(28,8)) as Oct,

cast([11] as numeric(28,8)) as Nov, cast([12] as numeric(28,8)) as [Dec] FROM CTE

ORDER BY [Year] ASC")

 

month <- c(1,2,3,4,5,6,7,8,9,10,11,12)

mn<- month.abb[month]

 

DataMonth <- Data[ ,mn]

 

DataTS<- ts(c(t(DataMonth)),start = c(2010,12), end = c(2014,1),frequency=12)

ForecastArima <- forecast(auto.arima(DataTS))

plot(ForecastArima)

ForecastArima

 

The numerical output will look like this:


With the graphical output should look something like this:


Part 2: Forecasting in SQL

Forecasting in SQL follows a pretty similar approach to the pure R approach in producing the numerical output. But that’s where the similarity ends. As we’ll want to visualise the output we’ll need to do some manipulation on the result from the R script we will have to wrap the initial query in a stored procedure which can be called by the stored procedure which will populate the graph in SSRS.

The first step is to write the initial stored procedure which will contain the R script

CREATE PROCEDURE dbo.spForecastSalesPreparation AS

BEGIN

EXECUTE sp_execute_external_script

      @language = N'R'

     ,@script = N'library(forecast);

                                Data<- InputDataSet

                                month<- c(1,2,3,4,5,6,7,8,9,10,11,12)

                                mn<-month.abb[month]

                                DataMonth <- Data[ ,mn]

                                DataTS <- ts(c(t(DataMonth)), start = c(2010,12), end = c(2014,1), frequency =12)

                                ForecastArima <- forecast(auto.arima(DataTS))

                                ForecastArimadf <- data.frame(ForecastArima)'

     ,@input_data_1 = N'WITH CTE AS(SELECT YearDate as[Year],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12]  FROM (

                                                                      SELECT * FROM (SELECT

                                                                       SUM(salesamount) as monthsales

                                                                       ,MONTH([OrderDate]) as monthdate

                                                                       ,YEAR(OrderDate) as  yeardate

                                                                       FROM [AdventureworksDW2016CTP3].[dbo].[FactInternetSales]

                                                                       group by MONTH(orderdate), year(orderdate)) t )

                                                          AS Table1

                                                          PIVOT(SUM(MonthSales) FOR Monthdate IN([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12])) AS PVT)

                                      SELECT [Year],

                                      cast([2] as numeric(28,8)) as Feb,

                                      cast([3] as numeric(28,8)) as Mar,

                                      cast([1] as numeric(28,8)) as Jan,

                                      cast([4] as numeric(28,8)) as Apr,

                                      cast([5] as numeric(28,8)) as May,

                                      cast([6] as numeric(28,8)) as Jun,

                                      cast([7] as numeric(28,8)) as Jul,

                                      cast([8] as numeric(28,8)) as Aug,

                                      cast([9] as numeric(28,8)) as Sep,

                                      cast([10] as numeric(28,8)) as Oct,

                                      cast([11] as numeric(28,8)) as Nov,

                                      cast([12] as numeric(28,8)) as [Dec] FROM CTE

                                      ORDER BY [Year] ASC'

     ,@output_data_1_name = N'ForecastArimadf'

    WITH RESULT SETS (("Point.Forecast" numeric(28,2) NOT NULL,

                                      "Lo.80" NUMERIC(28,2) NOT NULL,

                                      "Hi.80" NUMERIC(28,2) NOT NULL,

                                      "Lo.95" NUMERIC(28,2) NOT NULL,

                                      "Hi.95" NUMERIC(28,2) NOT NULL));

END;

 

The R script that you invoke is very similar to the one in the pure R approach, as is the SQL query. But the key element is the WITH RESULT SETS command, which returns the results of the query in a tabular format. Note, it is important to specify the data types for each of the columns you wish to return.

Because of the desire to visualise the results from R we’ll need to join the output from R with the input to R and keep it in a structure that makes sense, both as a table and as a graph. The stored procedure that I’ve created is as follows:

CREATE PROCEDURE dbo.spForecastSales AS

BEGIN

DECLARE @ForecastResults TABLE (Id INT IDENTITY(1,1),

                                                   Value NUMERIC(28,2),

                                                   Lo80 NUMERIC(28,2),

                                                   Hi80 NUMERIC(28,2),

                                                   Lo95 NUMERIC(28,2),

                                                  Hi95 NUMERIC(28,2))

 

DECLARE @FinalResults TABLE (Id INT IDENTITY(1,1),

                                            ResultType VARCHAR(10),

                                            YearDate INT,

                                            MonthDate INT,

                                            Value NUMERIC(28,2),

                                            Lo80 NUMERIC(28,2),

                                            Hi80 NUMERIC(28,2),

                                            Lo95 NUMERIC(28,2),

                                            Hi95 NUMERIC(28,2))

 

 

-- insert the actual sales values

INSERT INTO @FinalResults (ResultType,

                                            YearDate,

                                            MonthDate,

                                            Value)

SELECT       'Actual'

             ,YEAR(OrderDate)   as  YearDate

             ,MONTH(OrderDate)  as      MonthDate

             ,SUM(SalesAmount)  as    Value

FROM [dbo].[FactInternetSales]

GROUP BY MONTH(OrderDate), YEAR(OrderDate)

ORDER BY YearDate ASC, MonthDate ASC

                                     

-- insert the forecast sales values

INSERT INTO @ForecastResults

EXECUTE dbo.spForecastSalesPreparation

 

DECLARE @MaxId INT,

             @LastDate DATE;

 

SET @MaxId = (SELECT MAX(Id) FROM @FinalResults)

 

SELECT @LastDate = DATEFROMPARTS(YearDate, MonthDate, 1)

FROM @FinalResults

WHERE Id = @MaxId

 

INSERT INTO @FinalResults (ResultType,

                                         YearDate,

                                         MonthDate,

                                         Value,

                                         Lo80,

                                         Hi80,

                                         Lo95,

                                         Hi95)

SELECT 'Forecast',

         YearDate = DATEPART(year, DATEADD(month, Id, @LastDate)),

         MonthDate = DATEPART(Month, DATEADD(month, Id, @LastDate)),

         Value,

         Lo80,

         Hi80,

         Lo95,

         Hi95

FROM @ForecastResults

ORDER BY Id ASC

 

SELECT  Id,

             ResultType,

             DATEFROMPARTs(YearDate, MonthDate, 1) AS YearDate2,

             Value,

             Lo80,

             Hi80,

             Lo95,

             Hi95

FROM @FinalResults

 

END;

 

Following the normal steps of creating a report in SSRS using a stored procedure as a dataset your graph might look something like this:


So, we’ve shown two similar yet different approaches to forecasting using the ARIMA method and utilising R and SQL. Hopefully this demonstrates how quick and easy it is to establish a rudimentary forecasting practice wherever you are.

Utilising R in SSRS

With the release of SQL Server 2016 CTP3 the inclusion of R has been a powerful and welcome addition. As R can now be directly queried from inside SQL Server Management Studio (SSMS) one can now use the data outputs from R in SQL Server Reporting Services (SSRS) thereby utilising the power of R in the convenience of SSRS.

This blog will guide you through the process of creating a simple report in SSRS using data from R.

As demonstrated in a previous blog, it is very easy to begin using R within SQL Server and this is no different.

First of you will need your SQL R Script, for which I’m producing a simple K Means cluster of employees in the Adventure Work Data Warehouse. Then you will want to wrap that query inside a stored procedure.

CREATE PROCEDURE dbo.spKMeansEmployee

AS

BEGIN

      EXECUTE sp_execute_external_script

                   @language = N'R',

                   @script = N'ClusterCount <- 4;

                                      df <- data.frame(InputDataSet);

                                      ClusterFeatures <- data.frame(df$BaseRate, df$VacationHours, df$SickLeaveHours, df$SalaryFlag);

                                      ClusterResult <- kmeans(ClusterFeatures, centers = ClusterCount, iter.max = 10)$cluster;

                                      OutputDataSet <- data.frame(df, ClusterResult);',

                   @input_data_1 = N'SELECT

                                                   EmployeeKey,

                                                   BaseRate,

                                                   VacationHours,

                                                   SickLeaveHours,

                                                   CAST(SalariedFlag AS VARCHAR(1)) AS SalaryFlag

                                            FROM dbo.DimEmployee;'

      WITH RESULT SETS (( EmployeeKey INT NOT NULL,

                                      BaseRate MONEY NOT NULL,

                                      VacationHours INT NOT NULL,

                                      SickLeaveHours INT NOT NULL,

                                      SalaryFlag VARCHAR(1) NOT NULL,

                                      ClusterResult INT NOT NULL

                                ));

END

 

The next step is to create a new report in Visual Studio and add a new Data Source.


Then create a dataset.


And link that dataset to a new report.


Then build the report how you want, using that dataset. This is the quick output I’ve opted for as you can quickly analyse employees based on the cluster they are in.


As you can see, and hopefully reproduce, it’s a very quick and relatively easy process that allows you to make business decisions by utilising the combined powerful capabilities of R and SQL.

Connecting SQL Server to R

In this post I’m going to use R to retrieve some data from SQL Server.

In order to use R in conjunction with SQL Server, but in the absence of SQL Server 2016 and its soon to be incorporated R functionality, it is necessary to use a few workarounds in order to produce the desired outcome.

R is a package based platform and does not inherently communicate with other platforms unless the relevant package is installed and called. There are quite a few packages that can be used for R to talk to SQL Server, however I prefer to use the RODBC package as it is simpler to use than other packages available, and will be using it for this example.

CONNECTING SQL SERVER TO R

Step 1: Create a connection

As RODBC requires a connection to your SQL Server database you’ll have to open up the ODBC Data Source Administrator instance on the machine you’re using. Under the User DSN tab (though you could use the System DSN) click Add to create a new connection.

Select SQL Server Native Client 11.0 and Finish.

clip_image002

It will then open up the following screen and fill in appropriately

clip_image004

It will then open up with the necessary security information, but as I’m using a local version I will persist with Windows authentication.

The next screen is where you choose the database you want to connect to. By default you will just connect to the server, but if you wish to save the results from your analysis in R back to SQL Server and to the correct database it is important that you select the desired database. For this example I’m connecting to Adventure Works.

clip_image006

The next screen is general configuration properties and can be changed for the user’s needs

clip_image008

Click finish, test the connection and you’re all done for this step!

Step 2: Connecting R

For this next part we’ll be using an R client. My preferred client is R Studio as it provides a clean interface.

Now that you’ve created your connection you’ll want to use it within R. After firing up an instance of R with the RODBC package installed you will want to invoke it with the following syntax:

library(RODBC)

To bring the connection through to R you’ll need to assign a variable to it with the help of the odbcConnect function.

The format for invoking the function is as follows:

connectionstring <- odbcConnect("some dsn", uid = "user", pwd = "****")

connectionstring is the variable assigned to store the connection

odbcConnect is the function

“some dsn” is the name of your DSN connection

uid  and pwd are the User ID and password for the server, if needed

For our example using AdventureWorks on a local machine the syntax is as follows:

AdventureWorks <- odbcConnect ("AdventureWorks")<?xml:namespace prefix = "o" />

In order to see which objects are in your database you should run the following syntax:

sqlTables(AdventureWorks)

Which produces an output similar to this:

clip_image010

You can then begin to use your data from SQL Server in R by using the sqlQuery function to extract data.

Employee <- sqlQuery(AdventureWorks,

"SELECT * FROM HumanResources.Employee")

The purpose of the sqlQuery function is to be able to get a specific set of data, potentially from multiple tables. If you just wish to return the contents of one table it would be better to use the sqlFetch function.

Employee <- sqlFetch(AdventureWorks,"HumanResources.Employee")

sqlFetch returns the contents of the specified table and stores it in the assigned variable.

 

Connecting R to SQL Server is relatively easy and allows you to unleash the power of R on your data without employing expensive add-ons or waiting for a future SQL Server 2016 CTP to be released.