Adatis

Adatis BI Blogs

Scraping Website Data using Power Query

In Power BI, and more specifically Power Query, there is a simple, out-of-the-box Web connector that enables you to scrape data from a website.  This blog will show you how to achieve this in both Power BI Desktop and Excel.

Power BI Desktop

1.      Open a Power BI Desktop.  Select ‘Get Data’.


clip_image002

2.      In the pop up window, select ‘Other’ > ‘Web’ and click ‘Connect’.

clip_image004

3.      Type in the website you would like to scrape.  For this example, I have used the BBC Sport website – looking at the ‘Premier League’ football league table.  Click ‘OK’ to continue.


clip_image005

4.      Power Query will try to obtain the data from the website, which may take a couple of minutes.  As long as you are connected to the Internet, the query will complete and look like the below:

clip_image007

5.      I have ticked the second option; however, you could select all three datasets if preferred.  When happy with your selections, click ‘Load’.

6.      You have now imported the scraped data, but to view it in Power Query, click ‘Edit Queries’ from the top menu pane.

clip_image008

7.      The Query Editor will open in a separate window, with the scraped data showing in its raw state.  You can now modify/join/clean to your heart’s content!


clip_image010

Excel

1.      Firstly, you must have Excel 2013 or later and the Power Query add-in for Excel installed to be able to scrape a website using the below method.

2.      Open Excel and start a new workbook.  Then, click ‘Power Query’ from the top menu pane and select the ‘From Web’ button.


clip_image012

3.      Enter the URL (demonstrated in step 3 of the Power BI Desktop example) and click ‘OK’.

4.      Choose your desired data set or alternatively, pick more than one by ticking the ‘Select Multiple Items’ option.  Click ‘Load’ to download the data.

clip_image014

5.      When loaded, you can preview the data by hovering over it – found under the ‘Workbook Queries’ menu pane.

clip_image016

6.      To open the data in Power Query, double click the workbook query.  A new window will open and the data available to edit – see below:


clip_image018

Limitations

Whilst playing with the web scraping feature, I have found that some websites work better with Power Query than others.  At the time of writing, I cannot find any official Microsoft documentation on which elements of a website work and, more importantly, which don’t.  However, web data that is wrapped in an HTML table or div tags seems to work better with Power Query.

What’s Next?

As I have demonstrated, scraping websites using Power Query is very easy.  This feature does not require any M or coding expertise, as Power Query works its magic under the hood.  This is a great alternative to C#, JavaScript, etc. for scraping a web API.

This article shows you how to bring the data into Power Query but in reality, it will need to be cleaned and transformed to be useful.  Excel Power Query Web Scrape - AFL Fixtures by Matt Allington and Using Power Query M Language for Scraping Any Website Data by Chris Webb are two very useful blogs demonstrating how this can be achieved.  If you are already a very capable M developer or have used Power Query in the past, you will know how powerful, yet simple this is.

Future Blogs

Look out for more blogs on M and Power Query in the future.  The next instalment will be on how we can write native M to scrape websites, focusing specifically on web data that cannot be accessed through the user interface.

SQL - Using The MERGE Statement To Apply Type 2 SCD Logic

Introduced in SQL Server 2008, the MERGE statement is a useful way of performing inserts, updates and deletes inside one SQL statement.

In the example below I have two tables: one containing historical data using Type 2 SCD (slowly changing dimensions) called DimBrand, and another containing just the latest dimension data called LatestDimBrand. Using the MERGE statement I will insert new records from LatestDimBrand into DimBrand, archive (apply an end date to) any DimBrand records which do not appear in the latest data, and finally enter a new record and archive the old record for any brands which may have changed.

DimBrand (the target of our inserts, updates and deletes) and LatestDimBrand (the source for inserts, updates and deletes):

image     image
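
To make the example runnable, here is a rough sketch of how the two temp tables could be set up – the table and column names come from the MERGE statement below, but the data types and sample rows are my own assumptions:

CREATE TABLE #DimBrand (            -- target: Type 2 dimension holding history
    [BrandCode] VARCHAR(10) NOT NULL,
    [BrandName] VARCHAR(50) NOT NULL,
    [StartDate] DATETIME    NOT NULL,
    [EndDate]   DATETIME    NULL    -- NULL = current record
);

CREATE TABLE #LatestDimBrand (      -- source: latest snapshot of the brand data
    [BrandCode] VARCHAR(10) NOT NULL,
    [BrandName] VARCHAR(50) NOT NULL,
    [StartDate] DATETIME    NOT NULL,
    [EndDate]   DATETIME    NULL
);

-- Illustrative data only: 'Googles' no longer exists in the latest feed,
-- 'MikeRowSoft' has been rebranded and 'NewBrand' is entirely new
INSERT #DimBrand ([BrandCode],[BrandName],[StartDate])
VALUES ('BR01','Adatis','2015-01-01'),
       ('BR02','Googles','2015-01-01'),
       ('BR03','MikeRowSoft','2015-01-01');

INSERT #LatestDimBrand ([BrandCode],[BrandName],[StartDate])
VALUES ('BR01','Adatis',GETDATE()),
       ('BR03','MickRowSoft',GETDATE()),
       ('BR04','NewBrand',GETDATE());

Running the statement below against this data end-dates ‘Googles’ and the old ‘MikeRowSoft’ row, re-inserts BR03 as ‘MickRowSoft’ via the outer INSERT, and adds ‘NewBrand’ through the NOT MATCHED BY TARGET branch.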

The MERGE code in its entirety:

INSERT #DimBrand ([BrandCode],[BrandName],[StartDate])
SELECT [BrandCode],[BrandName],GETDATE()
FROM (
    MERGE #DimBrand AS Target
    USING (
        SELECT [BrandCode],[BrandName],[StartDate],[EndDate]
        FROM   #LatestDimBrand
    ) AS Source
    ON (Target.[BrandCode] = Source.[BrandCode])
    -------------------------------
    WHEN MATCHED AND Target.[BrandName] <> Source.[BrandName]
        THEN UPDATE SET Target.[EndDate] = GETDATE()
    -------------------------------
    WHEN NOT MATCHED BY TARGET
        THEN INSERT ([BrandCode]
                    ,[BrandName]
                    ,[StartDate]
                    ,[EndDate])
             VALUES (Source.[BrandCode]
                    ,Source.[BrandName]
                    ,Source.[StartDate]
                    ,Source.[EndDate])
    -------------------------------
    WHEN NOT MATCHED BY SOURCE
        THEN UPDATE SET Target.[EndDate] = GETDATE()
    -------------------------------
    OUTPUT $Action, Source.*
) AS i([Action],[BrandCode],[BrandName],[StartDate],[EndDate])
-------------------------------
WHERE [Action] = 'UPDATE'
AND   BrandCode IS NOT NULL

 

The INSERT statement, although it appears at the top of the SQL, is the last thing to be executed and uses the result set of the MERGE statement exposed through the OUTPUT clause further down.

image

Next is the start of our MERGE statement. Here we specify a Target and a Source table, and also which columns the two tables should match on – in this case only the BrandCode.

image

The next step is to specify and handle any matches. Here we are looking for Brand Codes which appear in both tables, but with differing Brand Names. If a match is found, the row in the target table is given an end date.  As with the following “NOT MATCHED” sections, inserts, updates or deletes can be applied to either table here. This handles the first part of our slowly changing dimension requirement.

image

Where there is no match between the Target table and the Source table, the relevant records from the Source table are inserted into the Target table.

image

The penultimate part of the merge statement takes any Target records which do not appear in our “Latest” Source table and sets an EndDate for them as they are no longer a valid record.

image

The OUTPUT clause populates the outer INSERT statement: the Source rows involved in the UPDATE actions above are passed out to the outer FROM clause. We have also selected $Action so that we can filter out the rows INSERTED under the NOT MATCHED BY TARGET clause.

image

The result of all the above is that the brand “Googles” is archived, along with “MikeRowSoft”, which, as you can see from the BrandCode, has been rebranded as “MickRowSoft”.

image

Forecasting: R and SQL Server 2016

Forecasting is pretty crucial to any business as nearly every function within a company needs an estimate of what the future will look like as a foundation to create and update plans. In a previous blog I addressed the principles and methods of forecasting and in this blog I will go through how you can forecast in R and SQL using the Autoregressive Integrated Moving Average (ARIMA) technique.

I will approach this in two parts. Part 1 demonstrates the forecast in a pure R platform; in Part 2 I will demonstrate the forecast using the R integration in SQL Server, in conjunction with SSRS to visualise the results.

Part 1: The Pure R Approach

As mentioned, this will create a forecast using R, taking data from a SQL Server database. You will need to be able to connect to SQL Server from R in order to follow this section, and you’ll also need to install the forecast CRAN package. As always, we’ll be using AdventureWorks for the provision of data.

The R script is as follows:

library(RODBC)
library(forecast)

SQLConnection <- odbcConnect("AdventureWorks")

Data <- sqlQuery(SQLConnection, "WITH CTE AS (
    SELECT YearDate AS [Year], [1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12]
    FROM (
        SELECT * FROM (
            SELECT SUM(SalesAmount)   AS MonthSales
                  ,MONTH([OrderDate]) AS MonthDate
                  ,YEAR(OrderDate)    AS YearDate
            FROM   [AdventureworksDW2016CTP3].[dbo].[FactInternetSales]
            GROUP BY MONTH(OrderDate), YEAR(OrderDate)
        ) t
    ) AS Table1
    PIVOT (SUM(MonthSales) FOR MonthDate IN ([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12])) AS PVT
)
SELECT [Year],
       CAST([1]  AS numeric(28,8)) AS Jan, CAST([2]  AS numeric(28,8)) AS Feb,
       CAST([3]  AS numeric(28,8)) AS Mar, CAST([4]  AS numeric(28,8)) AS Apr,
       CAST([5]  AS numeric(28,8)) AS May, CAST([6]  AS numeric(28,8)) AS Jun,
       CAST([7]  AS numeric(28,8)) AS Jul, CAST([8]  AS numeric(28,8)) AS Aug,
       CAST([9]  AS numeric(28,8)) AS Sep, CAST([10] AS numeric(28,8)) AS Oct,
       CAST([11] AS numeric(28,8)) AS Nov, CAST([12] AS numeric(28,8)) AS [Dec]
FROM CTE
ORDER BY [Year] ASC")

month <- c(1,2,3,4,5,6,7,8,9,10,11,12)
mn    <- month.abb[month]

DataMonth <- Data[ ,mn]

DataTS <- ts(c(t(DataMonth)), start = c(2010,12), end = c(2014,1), frequency = 12)
ForecastArima <- forecast(auto.arima(DataTS))
plot(ForecastArima)
ForecastArima

 

The numerical output will look like this:

clip_image002

And the graphical output should look something like this:

clip_image004

Part 2: Forecasting in SQL

Forecasting in SQL follows a pretty similar approach to the pure R approach in producing the numerical output, but that’s where the similarity ends. As we’ll want to visualise the output, we’ll need to do some manipulation on the result from the R script: we have to wrap the initial query in a stored procedure, which can then be called by the stored procedure that populates the graph in SSRS.

The first step is to write the initial stored procedure, which will contain the R script:

CREATE PROCEDURE dbo.spForecastSalesPreparation AS
BEGIN
    EXECUTE sp_execute_external_script
         @language = N'R'
        ,@script = N'library(forecast);
            Data <- InputDataSet
            month <- c(1,2,3,4,5,6,7,8,9,10,11,12)
            mn <- month.abb[month]
            DataMonth <- Data[ ,mn]
            DataTS <- ts(c(t(DataMonth)), start = c(2010,12), end = c(2014,1), frequency = 12)
            ForecastArima <- forecast(auto.arima(DataTS))
            ForecastArimadf <- data.frame(ForecastArima)'
        ,@input_data_1 = N'WITH CTE AS (
            SELECT YearDate AS [Year], [1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12]
            FROM (
                SELECT * FROM (
                    SELECT SUM(SalesAmount)   AS MonthSales
                          ,MONTH([OrderDate]) AS MonthDate
                          ,YEAR(OrderDate)    AS YearDate
                    FROM   [AdventureworksDW2016CTP3].[dbo].[FactInternetSales]
                    GROUP BY MONTH(OrderDate), YEAR(OrderDate)
                ) t
            ) AS Table1
            PIVOT (SUM(MonthSales) FOR MonthDate IN ([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12])) AS PVT
            )
            SELECT [Year],
                   CAST([1]  AS numeric(28,8)) AS Jan,
                   CAST([2]  AS numeric(28,8)) AS Feb,
                   CAST([3]  AS numeric(28,8)) AS Mar,
                   CAST([4]  AS numeric(28,8)) AS Apr,
                   CAST([5]  AS numeric(28,8)) AS May,
                   CAST([6]  AS numeric(28,8)) AS Jun,
                   CAST([7]  AS numeric(28,8)) AS Jul,
                   CAST([8]  AS numeric(28,8)) AS Aug,
                   CAST([9]  AS numeric(28,8)) AS Sep,
                   CAST([10] AS numeric(28,8)) AS Oct,
                   CAST([11] AS numeric(28,8)) AS Nov,
                   CAST([12] AS numeric(28,8)) AS [Dec]
            FROM CTE
            ORDER BY [Year] ASC'
        ,@output_data_1_name = N'ForecastArimadf'
    WITH RESULT SETS (("Point.Forecast" NUMERIC(28,2) NOT NULL,
                       "Lo.80" NUMERIC(28,2) NOT NULL,
                       "Hi.80" NUMERIC(28,2) NOT NULL,
                       "Lo.95" NUMERIC(28,2) NOT NULL,
                       "Hi.95" NUMERIC(28,2) NOT NULL));
END;

 

The R script that you invoke is very similar to the one in the pure R approach, as is the SQL query. But the key element is the WITH RESULT SETS command, which returns the results of the query in a tabular format. Note, it is important to specify the data types for each of the columns you wish to return.
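
A quick way to check the shape of that result set is to execute the procedure on its own. This is just a usage sketch; it assumes the procedure was created in the same database as the AdventureWorks data:

-- Returns five numeric columns: Point.Forecast, Lo.80, Hi.80, Lo.95, Hi.95
EXEC dbo.spForecastSalesPreparation;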

Because of the desire to visualise the results from R we’ll need to join the output from R with the input to R and keep it in a structure that makes sense, both as a table and as a graph. The stored procedure that I’ve created is as follows:

CREATE PROCEDURE dbo.spForecastSales AS
BEGIN
    DECLARE @ForecastResults TABLE (Id INT IDENTITY(1,1),
                                    Value NUMERIC(28,2),
                                    Lo80 NUMERIC(28,2),
                                    Hi80 NUMERIC(28,2),
                                    Lo95 NUMERIC(28,2),
                                    Hi95 NUMERIC(28,2))

    DECLARE @FinalResults TABLE (Id INT IDENTITY(1,1),
                                 ResultType VARCHAR(10),
                                 YearDate INT,
                                 MonthDate INT,
                                 Value NUMERIC(28,2),
                                 Lo80 NUMERIC(28,2),
                                 Hi80 NUMERIC(28,2),
                                 Lo95 NUMERIC(28,2),
                                 Hi95 NUMERIC(28,2))

    -- insert the actual sales values
    INSERT INTO @FinalResults (ResultType, YearDate, MonthDate, Value)
    SELECT  'Actual'
           ,YEAR(OrderDate)  AS YearDate
           ,MONTH(OrderDate) AS MonthDate
           ,SUM(SalesAmount) AS Value
    FROM [dbo].[FactInternetSales]
    GROUP BY MONTH(OrderDate), YEAR(OrderDate)
    ORDER BY YearDate ASC, MonthDate ASC

    -- insert the forecast sales values
    INSERT INTO @ForecastResults
    EXECUTE dbo.spForecastSalesPreparation

    DECLARE @MaxId INT,
            @LastDate DATE;

    SET @MaxId = (SELECT MAX(Id) FROM @FinalResults)

    SELECT @LastDate = DATEFROMPARTS(YearDate, MonthDate, 1)
    FROM @FinalResults
    WHERE Id = @MaxId

    INSERT INTO @FinalResults (ResultType, YearDate, MonthDate, Value, Lo80, Hi80, Lo95, Hi95)
    SELECT 'Forecast',
           YearDate  = DATEPART(year, DATEADD(month, Id, @LastDate)),
           MonthDate = DATEPART(month, DATEADD(month, Id, @LastDate)),
           Value,
           Lo80,
           Hi80,
           Lo95,
           Hi95
    FROM @ForecastResults
    ORDER BY Id ASC

    SELECT  Id,
            ResultType,
            DATEFROMPARTS(YearDate, MonthDate, 1) AS YearDate2,
            Value,
            Lo80,
            Hi80,
            Lo95,
            Hi95
    FROM @FinalResults

END;
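
The wrapper procedure can be sanity-checked in the same way, and this call is also all the SSRS dataset needs (again just a usage sketch):

-- Returns: Id, ResultType ('Actual' or 'Forecast'), YearDate2, Value, Lo80, Hi80, Lo95, Hi95
EXEC dbo.spForecastSales;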

 

Following the normal steps of creating a report in SSRS using a stored procedure as a dataset your graph might look something like this:

clip_image005

So, we’ve shown two similar yet different approaches to forecasting using the ARIMA method and utilising R and SQL. Hopefully this demonstrates how quick and easy it is to establish a rudimentary forecasting practice wherever you are.

Deploying to SQL SERVER 2016 from source control

This is a quick blog to help others avoid what I went through while trying to deploy a DB from TFS to SQL Server 2016 with no success.

Apart from the usual option of backup/restore there was no other viable way of deploying through source control. Visual Studio would throw an “unable to connect to target server” error despite the connection being successfully tested.



In mid-December 2015, Microsoft released an update to Visual Studio 2015 Data Services (link) which made it possible to deploy to SQL Server 2016.



Utilising R in SSRS

With the release of SQL Server 2016 CTP3, the inclusion of R is a powerful and welcome addition. As R can now be queried directly from inside SQL Server Management Studio (SSMS), one can use the data outputs from R in SQL Server Reporting Services (SSRS), thereby utilising the power of R with the convenience of SSRS.

This blog will guide you through the process of creating a simple report in SSRS using data from R.

As demonstrated in a previous blog, it is very easy to begin using R within SQL Server and this is no different.

First of all you will need your SQL R script, for which I’m producing a simple k-means cluster of employees in the AdventureWorks data warehouse. Then you will want to wrap that query inside a stored procedure.

CREATE PROCEDURE dbo.spKMeansEmployee
AS
BEGIN
    EXECUTE sp_execute_external_script
        @language = N'R',
        @script = N'ClusterCount <- 4;
            df <- data.frame(InputDataSet);
            ClusterFeatures <- data.frame(df$BaseRate, df$VacationHours, df$SickLeaveHours, df$SalaryFlag);
            ClusterResult <- kmeans(ClusterFeatures, centers = ClusterCount, iter.max = 10)$cluster;
            OutputDataSet <- data.frame(df, ClusterResult);',
        @input_data_1 = N'SELECT
                EmployeeKey,
                BaseRate,
                VacationHours,
                SickLeaveHours,
                CAST(SalariedFlag AS VARCHAR(1)) AS SalaryFlag
            FROM dbo.DimEmployee;'
    WITH RESULT SETS (( EmployeeKey INT NOT NULL,
                        BaseRate MONEY NOT NULL,
                        VacationHours INT NOT NULL,
                        SickLeaveHours INT NOT NULL,
                        SalaryFlag VARCHAR(1) NOT NULL,
                        ClusterResult INT NOT NULL ));
END
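
Before wiring this into a report, it is worth executing the procedure directly to check the output; the SSRS dataset created below uses exactly the same call (a minimal usage sketch):

-- One row per employee, with the cluster (1 to 4) it has been assigned to
EXEC dbo.spKMeansEmployee;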

 

The next step is to create a new report in Visual Studio and add a new Data Source.

clip_image001

Then create a dataset.

clip_image003

And link that dataset to a new report.

clip_image005

Then build the report how you want, using that dataset. This is the quick output I’ve opted for as you can quickly analyse employees based on the cluster they are in.

clip_image006

As you can see, and hopefully reproduce, it’s a very quick and relatively easy process that allows you to make business decisions by utilising the combined powerful capabilities of R and SQL.

Master Data Services Many to Many Relationships

Many to Many Relationships (M2M) are possible in the current version of Master Data Services, but they come with a few workarounds. Thankfully, many to many hierarchies are one of the many new features that can be found in Master Data Services 2016. This post will contrast the M2M approach in the current version of MDS, before showing how it’s greatly improved in the current MDS 2016 CTP.

Customer Example – MDS 2014

The simple example that I’m going to show is Customers and Bank Accounts. That is, a customer can have more than one bank account, but a bank account can also have more than one customer, e.g. a joint account or a business account.

First, to show the MDS 2014 approach, there are a few entities to look at, namely Customer and Account:

image

image

Then there is another entity that relates to both Customer and Account, called CustomerAccount, which essentially acts as a bridge table, allowing the M2M relationship to naturally exist in the data:

image
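
For anyone consuming this data downstream (for example via subscription views), the bridge entity resolves the many-to-many exactly as a bridge table would in SQL. The sketch below is purely illustrative – the view and column names are assumptions based on the entity names above, not the actual subscription views MDS generates:

SELECT  c.Code AS CustomerCode,
        c.Name AS CustomerName,
        a.Code AS AccountCode,
        a.Name AS AccountName
FROM    mdm.CustomerAccount AS ca              -- bridge: one row per customer/account pairing
INNER JOIN mdm.Customer AS c ON c.Code = ca.Customer_Code
INNER JOIN mdm.Account  AS a ON a.Code = ca.Account_Code;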

Moving away from the entities, I’ve created two derived hierarchies so that we can look at the data by account or by customer. So here we can see, for example, that Bob is keeping his own personal account, as well as the account with his wife…hmmm:

image

Or looking at this another way, we can see the inverted version of the hierarchy, where we see the Customers as the top level, with some accounts belonging to more than one customer:

image

As you may have spotted, there’s an issue with this approach. Whilst we do see the name of the Customer in the first hierarchy, it’s pointing at a member in the CustomerAccount entity, which is why the code is different each time. Enter SQL Server 2016…

Master Data Services 2016

MDS 2016 does a far better job at visualising the M2M relationship. You still need the link entity, but it will allow you to show the actual Customers and Accounts with their correct codes.

I’ve used the exact same entity setup as MDS 2014, so let’s begin by looking at how we can build a hierarchy that shows customers by accounts. When creating a new Derived Hierarchy, we can see that the initial available levels list is just the same as it would be in MDS 2014:

image

The first step is to drag the Customer entity over to the Current levels section. Once we perform this action, the Available Entities pane looks very different in SQL Server 2016 than it would in the current version. Now MDS is allowing us to pick the Account entity, but via another entity – hence it says “mapped via” on the left hand side:

image

Once we’ve finished the hierarchy build and explore the hierarchy, we can see the same data as before, but this time it’s the actual, correct customer codes that are displayed:

image

The derived hierarchy by Customer also displays the correct customer and account codes:

image

Summary

A derived hierarchy in Master Data Services 2016 now allows the correct display of many to many relationships between two entities. This is one of many new features in MDS 2016, offering quite a number of improvements over the SQL Server 2014 version.

“Master” Master Data: Entity Sync in Master Data Services 2016

When I heard that “entities could be shared” between different models in Master Data Services 2016, my curiosity was piqued.  There is a nice video of this feature and hierarchy management here. The concept of sharing entities is called Entity Sync in Master Data Services 2016.

You may ask why entities need to be shared between models at all. This is something I have thought about a lot when deciding how to split subject areas into different models in Master Data Services.  The conclusion I have come to is that it is better to keep subject areas separated rather than having one very large model in Master Data Services with many entities.  Keeping the models separated makes them much easier to manage when working in teams, and for deployment, versioning, etc.

However, there are some use cases where having a single entity with the same definition and data might be needed in different models. 

For example…

  • Different departments in an organisation may have their own data stewards maintaining their own master data in different models, and those models need to be secured so that departments cannot see or modify each other’s data.  A department entity may then be needed in each of those models
  • Some master data may change very infrequently, but needs to be used in multiple models, e.g. currency, location, title.  In the ‘old days’ of Master Data Services 2014 and prior, one would have to create replica entities and make sure that data is loaded into each of the entities in each of the models.

    I have always thought it would be nice to have “Master” master data in Master Data Services.  This is where Entity Sync in Master Data Services 2016 comes in…

    Setting up a new Entity Sync Relationship

    Using the ChartOfAccounts, Customer and Product sample Master Data Services models I created a new entity called Disney Characters in the ChartOfAccounts model and want to sync this to the other two models i.e. Customer and Product.

    image

    What is nice with Master Data Services 2016 is that one can add descriptions to the Models, Entities and Attributes, which is a nice feature to ‘document as you develop’.

    image

    To set up Entity Sync, go to System Administration, and select ‘Entity Sync’ from the ‘Manage’ menu

    image

    Create a new Entity Sync and select the Source Model, Entity and Version.  Then select the Target Model and Version.  You can even change the name of the target entity to something different.  I’m thinking ‘role playing’ entities.  You can set a schedule for when the entity gets synced, or ‘Sync On Demand’. So the below screenshot shows the setup of the Entity Sync for the Disney Characters entity from ChartOfAccounts model to the Customer model and renaming it along the way to Cartoon Characters.

    image

    The entity will not sync from the Source Entity (ChartOfAccounts) to the Target Entity (Customer) until the sync is run.  You can’t seem to select more than one Entity Sync to sync at a time through the Master Data Services web UI.

    image

    On going to the Customer model and viewing the synced Cartoon Characters entity, all the metadata (including all those descriptions) and data appears.  However it is not possible to add or modify members.  And we cannot add any new attributes to the synced entity.  This is what we want, right?  “Master” master data with a single source in the source entity and model.

    image

    In the management of entities it is also indicated that this entity is a ‘Sync Target’. Nice.

    image

    Now that the entity is synced I can use it like I would any other entity in the Customer model.  I can add domain based dependencies and use it in other entities in the model, and I can apply security onto it.  I can make changes to the source entity and re-sync and the new metadata and data will be propagated to the target entities.

    Deleting Entity Sync Relationship

    But what if you delete the entity sync relationship? Will the target table get deleted too?  So trying this I removed the Entity Sync for the Disney Characters entity from ChartOfAccounts model to Cartoon Characters entity in the Customer model.

    Nope. The entity is still there, but now you can edit it (I’ve added ‘Road Runner’) and carry on modelling it like you would any other entity in the model.  Although I did note that it lost its setting of ‘Create Code Values Automatically’.  Not sure if this is a feature?

    image

    I can turn the entity sync relationship back on but specify ‘Existing Entity’. 

    image

    But this threw an error; it seemed to have a problem with the new member ‘Road Runner’ that I had created.

    image

    I deleted ‘Road Runner’ and set up the Entity Sync relationship with no problem.  I guess one could also drop and recreate the entity using the Entity Sync.

    So what happens if I try to synchronise an entity that has dependencies on other domain entities?  Trying this out with the SubClass entity from the ChartOfAccounts model gives the following error: “Domain based attribute(s) in the source entity will not be synced”.

    image

    This is what the SubClass source entity looks like with a domain based attribute to the Class entity in the ChartOfAccounts model.

    image

    This is now what the SubClass target entity looks like in the Customer model.  It has lost all of its domain based attributes!

    image

    So I thought I would first synchronise the Class domain entity, then synchronise the SubClass entity. This gives the same “Domain based attribute(s) in the source entity will not be synced” message as before, and the domain based attributes of SubClass are still not synced.

    Deployment Considerations

    With the current SQL Server 2016 CTP 3.2, it does not appear that Entity Sync relationships are packaged up when creating packages, either from the Master Data Services web UI or using the MDSModelDeploy command line utility.  Although perhaps that is intentional, as one would have to consider the order of deployment of models given the dependencies that the Entity Sync relationships create.  They are very easy to set up though, and there is also a new metadata table in the Master Data Services database called [mdm].[tblSyncRelationship], along with corresponding stored procedures that operate on this table.

    image

    So I guess one could populate the required metadata in [mdm].[tblSyncRelationship] and regenerate the Entity Sync relationships as part of a deployment process.
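
    If you want to have a look at this yourself, a quick sketch – only [mdm].[tblSyncRelationship] is named in this post, so the second query simply discovers whichever procedures reference it rather than assuming their names:

    -- Inspect the Entity Sync relationships created through the web UI
    SELECT * FROM [mdm].[tblSyncRelationship];

    -- Find the stored procedures that operate on this table
    SELECT name
    FROM   sys.procedures
    WHERE  OBJECT_DEFINITION(object_id) LIKE '%tblSyncRelationship%';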

    In Conclusion

    The introduction of the Entity Sync functionality in Master Data Services for SQL Server 2016 is a welcome new feature and opens up all sorts of possibilities for creating “Master” master data.  The new Master Data Services web UI for managing these is great – so long as the entities you want to sync to other models do not contain any domain based attributes.

    I will definitely be using it, as:

    • It will definitely speed up development – goodbye to all those SSIS packages loading the same data into the same entity in different models!

    • It adds value to the end user, as everything will be in perfect sync.

    Generating Master Data Services Subscription Views

    Creating subscription views through the Master Data Services web UI is easy, if a little slow at times when the Master Data Services installation is on a busy or slow server.  Generally one does not have that many models and entities for which subscription views need to be created.  But what if you do?

    I am working for a client where they are planning a large Master Data Services implementation with possibly hundreds of entities. Reverse engineering the entities from a SQL Server relational engine to Master Data Services can be done using the MDSModelImport utility on Codeplex.  Theoretically this utility could be extended to create subscription views too.  However there is an easier way that could be used in a couple of other scenarios e.g.

    • Regenerating subscription views when you have updated entities after the subscription view on the entity has been created
    • Some of my colleagues reported that subscription views had ‘disappeared’ when deploying models to a new Master Data Services server. A bug?
    • Regenerating subscription views for a different version of the model

    So I started doing some digging about in the Master Data Services database and came across a nicely named stored procedure called

    [mdm].[udpCreateSubscriptionViews] 

    The content of this stored procedure is very useful and has a wealth of information on what the parameters are.  I put in the required parameters for a leaf level subscription view on one of my entities.  It created the view, but the view did not show up in the Master Data Services web UI. In addition on querying the view no data was returned, even though it had created all the expected attribute columns. Hmm?

    So, using the above stored procedure as a starting point, and trawling through a number of stored procedures I came upon:

    • the table [mdm].[tblSubscriptionView]. Of course!
    • and the stored procedure [mdm].[udpCreateAllViews]

    Using information that I gleaned from above mentioned digging, I inserted the required metadata into the table [mdm].[tblSubscriptionView], and then ran the stored procedure [mdm].[udpCreateAllViews]. Voilà! 

    Using the sample ‘Customer’ model provided with Master Data Services, I wanted to create leaf format subscription views for all thirteen entities.  So I used the below code to populate the metadata in the [mdm].[tblSubscriptionView] table to generate Leaf format views.  Be sure to create a subscription view name that is unique – I like ModelName_EntityName to keep all the views from a model together.  Generating other kinds of subscription views would use the same method; only the metadata tags would have to change.  See the contents of the stored procedure [mdm].[udpCreateSubscriptionViews].

    In SQL Server 2012 or SQL Server 2014

    image

    In SQL Server 2016 (a few minor changes)

    image

    Then execute the stored procedure [mdm].[udpCreateAllViews] to update all the views.  This stored procedure will use whatever data is in the table [mdm].[tblSubscriptionView] to create any missing subscription views and to update any that might require updating.  So remember, this is a very useful stored procedure in the Master Data Services database.
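
    As a minimal sketch of that two-step process (the INSERT itself is left as a placeholder, because the columns and metadata tags to supply are best checked against [mdm].[udpCreateSubscriptionViews] in your own installation):

    -- 1) Populate [mdm].[tblSubscriptionView] with one row per required subscription view
    --    (see the contents of [mdm].[udpCreateSubscriptionViews] for the metadata to supply)

    -- 2) Create any missing views and refresh any that need updating
    EXEC [mdm].[udpCreateAllViews];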

    image

    This is what the data looks like in the [mdm].[tblSubscriptionView] table

    image

    The subscription views appear in the Master Data Services web UI

    In SQL Server 2014

    Capture

    In SQL Server 2016

    image

    And data comes back from all the views.

    image

    In Summary…

    So whether this is for ten views or hundreds, the process would be the same and just as quick!

    It is worth remembering the stored procedure [mdm].[udpCreateAllViews] as it might be best practice to run this as part of the Master Data Services model deployment process to make sure that all subscription views are up to date.

    Also if any have ‘disappeared’ but the relevant metadata exists in the table [mdm].[tblSubscriptionView] then running this stored procedure will regenerate them.


    Next on the agenda is to start digging through these views…the possibilities are endless.

    image

    Sunburst Charts in SQL Server Reporting Services 2016

    The Sunburst is the second of the new chart types in Reporting Services 2016.  It was made available by Microsoft in CTP 2.3.  The aim of this blog is to describe what a Sunburst chart does and provide a guide on how to create one.

    What are Sunburst Charts?

    Like Tree Maps, Sunburst charts are an ideal candidate for displaying hierarchical data.  They work particularly well with ragged hierarchies and clearly indicate where relationships exist.  A ragged hierarchy is one where some levels do not automatically roll up to their parent, e.g. an organisational hierarchy: the CEO may have two direct reports – one who has a subordinate under them (the COO) and another who is not responsible for anyone (a secretary).

    clip_image002

    This is how a ragged hierarchy looks in a Sunburst Chart:

    clip_image004

    Source: Koen Verbeeck (www.mssqltips.com)

    The Sunburst is just like a Donut chart but with multiple layers.  You can see there is an extra ‘ray’ coming out from Feb, as this is the only month that contains weekly data.  The hierarchy may in fact contain a weekly leaf member throughout but because Feb is the only month with values at the week level, none of the other members are visible.

     The whole purpose of this chart type is to help a user visualise how an outer ring (child member) contributes to an inner ring (parent member).  If you have a hierarchy that does not aggregate naturally, I would advise using other SSRS visualisations. 

    Creating a Sunburst Chart

    I will now show you how to create a simple Sunburst chart in SSRS 2016.  The functionality is available through both Report Builder and Visual Studio.  For this demo, I will use Report Builder.

    1.      Navigate to Report Builder through your configured SSRS web service and choose Blank Report.

    2.      Set up your own relevant Data Source, dimension Datasets and report Parameters.  I suggest using a simple fact table and dimension from AdventureWorksDW2016.  A copy of this database can be found here.

    3.      Select Insert > Chart > Insert Chart from the Menu Pane.

     

     

    clip_image005 

    4.      Select the Sunburst chart and click OK. 

    clip_image006

    5.      Create the Sunburst Dataset, which is derived from the fact table.  The SQL for my chart is as follows:

    SELECT  DC.DateKey,
            CalendarDate,
            'Week ' + CAST(CalendarWeekNumber AS VARCHAR(2)),
            CalendarMonthName,
            CalendarQuarterName,
            CalendarYearName,
            COUNT(*) AS No_Of_Games
    FROM    [Warehouse].[DimDateCalendar] DC
    INNER JOIN [Warehouse].[FactResults] FR
            ON FR.DateKey = DC.DateKey
    INNER JOIN Warehouse.DimLeague DL
            ON DL.LeagueKey = FR.LeagueKey
    WHERE   DL.LeagueName = 'Premier League'
    GROUP BY DC.DateKey,
             CalendarDate,
             'Week ' + CAST(CalendarWeekNumber AS VARCHAR(2)),
             CalendarMonthName,
             CalendarQuarterName,
             CalendarYearName

     

    The query is bringing back the number of Premier League football matches played on a given day.  It only contains data from 2010 to 2015.

     

    image 

    6.      Now let’s bring in just one level of the hierarchy and a Sum of the No_Of_Games.

    clip_image008

     

    7.      If we preview the chart, it still looks very incomplete. 

    clip_image002[6]

    8.      If we add more Category Groups, additional rings will be added to the outside of the chart, eventually building up a number of layers.

    clip_image012

    9.      The chart now looks like a Sunburst but it is very difficult to make sense of what is going on.  Therefore, we will add some colours, labels and groupings to the data.

    10.   Right Click on the ‘No_Of_Games’ Value property and ensure Show Data Labels is ticked.


    clip_image013 

    11.   The Sunburst can be partitioned (by colour) depending on the desired attributes.  For simplicity, we will use ‘CalendarYearName’. 

    clip_image015



    12.   Add any additional chart headers, logos, etc to your report and Save the report.  Click Run to view it.


    clip_image016 

    13.   That is it - the report is now ready.  You can clearly see a pattern in the data and each year represents a colour. In addition, it is easy to see what proportion of quarters, months and weeks make up a given year.

    Conclusion

    The Sunburst is a very visual chart type and is not to be used to interrogate numbers at a granular level.  It is far better for spotting anomalies within a hierarchy or highlighting any bias towards a certain segment.

    One major drawback I found was that it is very difficult to view or report on accurate information, especially when the data becomes more granular.  This is the age-old issue with Pie Charts, which is why the BI community frowns upon that chart type.  In the example above, the outermost ring is very busy and virtually impossible to analyse.  I would definitely advise using Tree Maps for more detailed analysis.

    There are also a couple of known limitations in CTP 2.3 for Sunburst charts.  Tooltips only show in the outer ring when an expression is defined, and you cannot control the different slices.  On the other hand, Pie Charts enable a user to set a PieStartAngle, as well as a variety of other property options.  I am sure a lot of the missing functionality will be brought in – either in newer CTP releases or when SQL Server 2016 is fully available.

    For more interactive and aesthetically pleasing Sunburst charts, I would advise using either Power BI or D3 components (for web developers).  Not only are they more configurable, but you can offer some really cool features like removing ‘rays’, drill through and chart zoom.  A pre-configured D3 example can be found here.  I had a quick look through the Sunburst chart properties in SSRS 2016 and cannot see such options.

    References/Future Reading

    For more information on Sunburst charts, I recommend the below resources/blogs:

    ·        Adventure Works DW 2016 CTP 3.0 database download - https://www.microsoft.com/en-us/download/details.aspx?id=49502

    ·        Koen Verbeeck - https://www.mssqltips.com/sqlservertip/4030/how-to-create-a-sunburst-graph-in-sql-server-reporting-services-2016/

    ·        Sorna Kumars Muthuraj (MVP) Web Blog  - https://www.youtube.com/watch?v=LKSecJ4_lzE

    ·        D3 Chart Type Gallery - https://github.com/mbostock/d3/wiki/Gallery

    Tree Maps in SQL Server Reporting Services 2016

    The current instalment in my SSRS 2016 blog series focuses on one of the two new chart types – Tree Maps.  They were made available in Reporting Services 2016 CTP 2.3.

    What are Tree Maps?

    Tree Maps are designed to analyse quantitative measures that fit into a hierarchy.  To put it simply, this enables a user to visualise how an attribute is performing against its parent category and, in turn, how that category performs against other categories.  The size of each rectangle (in the example below) is proportionate to the numeric quantity.

    clip_image002

    Source: MSDN (Microsoft)

    Creating a Tree Map

    I will now show you how to create a simple Tree Map chart in SSRS 2016.  The functionality is available through both Report Builder and Visual Studio.  For this demo, I will use Report Builder.

    1.      Navigate to Report Builder through your configured SSRS web service and choose Blank Report.

    2.      Set up your own relevant Data Source, dimension Datasets and report Parameters.  I suggest using a simple fact table and dimension from AdventureWorksDW2016.  A copy of this database can be found here.

    3.      Select Insert > Chart > Insert Chart from the Menu Pane.

     

     

    clip_image003

    4.      Create the Tree Map Dataset, which is derived from the fact table.  The SQL for my chart is as follows:

    SELECT  DT.Country,
            DL.LeagueName,
            DT.TeamName,
            SUM(LT.TotalPoints) AS Points
    FROM    [Warehouse].[VwFactLeagueTable] AS LT
    INNER JOIN [Warehouse].[DimTeam] AS DT
            ON DT.TeamKey = LT.TeamKey
    INNER JOIN [Warehouse].[DimLeague] AS DL
            ON DL.LeagueKey = LT.LeagueKey
    GROUP BY DT.Country,
             DL.LeagueName,
             DT.TeamName

     

    The query is bringing back the total number of points each English football team has accumulated since the 2010/11 season.

     

    image

    5.      Create Values, Category Groups and Series Groups (as shown below).  These work in exactly the same way as in other versions of SSRS.


    clip_image005

    6.      Right Click on ‘Sum(Points)’ and select Series Properties.


    clip_image006

    7.      In the Series Data tab, select the Tooltip Expression.

    clip_image008

    8.      In the screenshot below, I have concatenated the team name field and the value together, as the Tree Map will truncate descriptions that do not fit in a section.  The logic could be different for your own Tree Map.  Click OK to confirm changes.

    clip_image010

    9.      Add any additional chart headers, logos, etc to your report and Save the report.  Click Run to view it.


    clip_image012

    10.   That is it - the report is now ready.  As you can see, ‘Man United’ are not visible until you hover over the applicable section.

    Conclusion

    Designing and implementing a Tree Map is very easy in SSRS 2016.  Although this is a new chart type, you can set up a simple example in minutes.  A BI analyst, report user or a company CEO can use Tree Maps to quickly spot well or badly performing products/features/departments.  As this chart is hierarchical, you are able to analyse different levels of data in one visualisation.

    Once we have grasped the concept, we can begin to create far more sophisticated Tree Maps.  Although I will not be touching on them in this blog, there is plenty of material out there already that goes into greater detail. 

    References/Future Reading

    For more information on Tree Maps, I recommend the below resources/blogs:

    ·        Adventure Works DW 2016 CTP 3.0 database download - https://www.microsoft.com/en-us/download/details.aspx?id=49502

    ·        MSDN Blog - https://msdn.microsoft.com/en-us/library/mt416753.aspx

    ·        Sorna Kumars Muthuraj (MVP) Web Blog  - https://www.youtube.com/watch?v=mj2eqelJt7E

    ·        Using Tree Maps in Excel (Microsoft) - https://blogs.office.com/2015/08/11/breaking-down-hierarchical-data-with-treemap-and-sunburst-charts/

    My next blog will focus on another new chart type – Sunbursts.   

    Power BI Dynamically Filtering Dates

    One problem which forced me to pause for thought recently was how to dynamically filter data in Power BI, e.g. filtering data by a date range to only show sales for the last 7 days.

    I believe the two simplest options are the Q&A natural query language and Power BI Desktop’s built-in Date/Time filter options found in the Query Editor, both of which are documented below.

    Option 1 – Using the Power BI web app

    Using the Q&A text box of your dashboard you can simply ask, in “natural language”, to see your data from the last 7 days, as shown in the screenshot below.  This can then be pinned to the dashboard if it is a question you are going to ask on a regular basis.

    This does take a little bit of tweaking to get the desired results as you are relying on the Power BI engine to understand your query.

    image

    Fortunately, should Power BI not be able to find useful answers to your question, it does give you some suggestions, as shown in the screenshot below.

    image

    Option 2 – Using the Power BI Desktop app

    When you import data using Power BI Desktop you are greeted with the Query Editor; from here you can click the drop-down arrow for the date column in question. Next select Date/Time Filters and you will be exposed to a number of filtering options unavailable in the report or dashboard tabs. From here I have selected “In the Previous”.

    image

    Once you have selected your filter (e.g. In the Previous), all you need to do is set how many days you want to count back.  In this instance the result of this filter means that I am only going to see data in my reports where the Created Date is in the previous 7 days.

    image

    The Microsoft Power BI team have already included a few useful built in filters, such as the Year To Date filter shown in the screenshot below.

    image

    Customise Parameters Pane in SQL Server Reporting Services 2016

    I will be writing a series of blogs that specifically focus on SQL Server Reporting Services (SSRS) - illustrating new features, new chart types, enhancements and general improvements. 

    In this article, I will be looking at the new and improved Parameter Pane that was made available in CTP 2.4.

    What is a Parameter Pane?

    In short, this is a ‘drop down’ field in SSRS.  Any attribute that can be sliced, diced or filtered is a strong parameter candidate for a dashboard report.  The screenshot below illustrates how the Parameter Pane looks in the current version of SSRS.

    clip_image002[6]

    Current Limitations

    Although still a very powerful feature in SSRS, the current nagging issue is that developers have very limited control of where a parameter can be placed in a report.  This is always at the top of the page and reliant on parameter ordering in Designer.

    It looks like Microsoft have finally listened to the SQL community and rectified this.  Although it has taken what feels like forever, it is good news for any budding Reporting Services developer out there.

    New to 2016 CTP 2.4

    I will now demonstrate how easy it is to customise your parameters in SSRS 2016.  As this is only available in CTP 2.4 and above, you will need to use Report Builder to make use of this functionality.

    1.      Navigate to Report Builder through your configured SSRS web service and choose Blank Report.

    clip_image004[6]


    2.      On the View tab, select the Parameters checkbox to display the parameters pane.  For simplicity, tick all of the checkboxes available.

    clip_image006[6]

    3.      You will now notice a pane appears at the top of the design surface and looks like the below.


    clip_image008[6]

    4.      Assuming you have a data source and data sets configured, to add a parameter to the pane, carry out the following:

    a.      Right click an empty cell in the parameters pane, and then click Add Parameter.

    clip_image009[6]

    b.      There are other ways to add a Parameter – click here to see these.

    5.      To move a parameter to a new location in the parameters pane, drag the parameter to a different cell in the pane.  In my example, I have placed Country in the centre of the screen, but State and City left and right respectively.

    clip_image011[6]

    6.      To access the properties for a parameter, there is now an alternative option.

    a.      Right click the parameter in the parameters pane, and select Parameter Properties.

     

    clip_image012[6]

     

    7.      To add new columns and rows to the pane, or delete existing rows and columns, right click anywhere in the parameters pane and then click a command on the menu that displays.  These options are shown in the above image but after adding 3 new columns and rows, the Parameter Pane now looks like the following:


    clip_image014[6]

     

    The white grids clearly indicate where the new columns and rows are positioned.

    8.      To delete a parameter from the pane and from the report, you can use the traditional methods, as well as:

    a.      Right click the parameter in the parameters pane, and then click Delete.

    clip_image016[6]

    image

    9.      Now Run the report to visualise how the drop downs will show in a report.

    clip_image017[6]

    10.   That is it.  You can now view the parameters.


    clip_image019[6]

    Conclusion

    Although this is a step in the right direction, the parameters are still limited to the top of an SSRS report.  However, you can develop custom workarounds, such as Excel-style Slicers - suggested by Simon Sabin in his blog. 

    It is surprising that Microsoft have not come up with a fully configurable Parameter Pane as the current workarounds are not perfect.  I am hoping further enhancements are added before the full version of SQL Server 2016 is released. 

    References/Future Reading

    For more information on Report Parameters, I recommend the below resources/blogs:

    ·        Koen Verbeeck’s SQL Server Blog - https://www.mssqltips.com/sqlservertip/4088/customize-the-sql-server-reporting-services-2016-parameter-pane/

    ·        MSDN Blog (Customize the Parameters Pane) - https://msdn.microsoft.com/en-GB/library/mt574039.aspx

    ·        MSDN Blog (Change Parameter ordering) - https://msdn.microsoft.com/en-GB/library/dd255258.aspx

    ·        Simulating Slicers in SSRS Reports - http://sqljason.com/2012/07/simulating-slicers-in-ssrs-reports.html

     

    Look out for future blogs on SSRS 2016; one particular interest is Report Builder, which Microsoft have apparently greatly improved. I have yet to fully dissect all of the new features, so we will see whether the changes are significant enough to improve on the poor adoption seen previously.

    SQL Server backup to Azure Storage with SSIS

    This is the final part of a three-blog series detailing some of the basics regarding SQL Server backups to Azure Blob storage. This blog shows how backups to Azure Storage (and their management) can be part of your ETL process using SSIS.

    The SSIS backup database task has gained the dialog that allows you to back up a database to a URL. This is simply leveraging the same backup to URL functionality present in SSMS. This way you can implement your backups as part of your ETL process, perhaps running at the end of a load.

    The guide below is for a Full Backup of a single database. It also contains a PowerShell script that will clean up old backup files to stop the container getting too full. It does this using the last modified date property of the blobs and a provided threshold value. The database being backed up in this example is a very small database used for testing out programming concepts such as test driven development.

    image

    Create a SQL Server Credential

    You need to add a credential into SQL Server for access to Azure Storage. Use SQL Server Management Studio to do this with the following code, substituting your details.

    CREATE CREDENTIAL [https://adatissqlbackupstorage.blob.core.windows.net]
    WITH IDENTITY = '<storage account name>',
    SECRET = '<Access Key>'
    The backup task will need this credential in order to create the backup. You could be more specific and set up tighter access to a single container using a SAS (Shared Access Signature), but here I’m keeping it simple and just using the storage account.
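
    Under the covers, the SSIS task issues the same backup-to-URL T-SQL that you could run yourself in SSMS. A minimal sketch using the credential above – the database name, container name and backup file name are placeholders:

    BACKUP DATABASE [MyTestDatabase]
    TO URL = 'https://adatissqlbackupstorage.blob.core.windows.net/<container name>/MyTestDatabase.bak'
    WITH CREDENTIAL = 'https://adatissqlbackupstorage.blob.core.windows.net',
         COMPRESSION, STATS = 10;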

    When you open up the Backup Database Task, in the ‘Back up to‘ dropdown, select ‘URL’.  Also choose which database(s) you want to back up.

    image

    In the destination window select the credential that you created and then enter the name of the storage container you are backing up into. The URL prefix should be in the form of:

    https://<storage account name>.blob.core.windows.net/<container name>

    image

    PowerShell maintenance script

    Because there is no way of maintaining backups stored in the cloud via SSIS natively, we have to create some scripts that run using PowerShell to clean up backups that have reached a certain age.

    This is the editor for the execute process task. Here you want to edit the following properties:

    image

    Executable - This is the property where you set the path to the executable, i.e. Azure PowerShell.

    Arguments - This is the property where you set the path to your script and also provide any variables.

    Standard Input Variable - This is where you can input the actual variables.

    In my example I have tried to “variablize” as much as possible:

    image

    The script below accepts two parameters:

    • AzureStorageAccountName – Name of the storage account

    • AzureContainerName – Name of the Container that the backup blobs reside in

In order to simplify the process of signing into an Azure subscription you can create something called an Azure Publish Settings File, downloaded using the following command in PowerShell and then imported with Import-AzurePublishSettingsFile:

    Get-AzurePublishSettingsFile

    You may find that you are having problems with your scripts trying to use different subscriptions if you have imported more than one of these files. The easy solution is to just delete all the files in the folder C:\Users\<USER>\AppData\Roaming\Windows Azure Powershell. Once this is done you can re-run the above statement to get the file for the Azure subscription you want.

The example uses a threshold of only 24 hours, which I used mainly for testing. You will probably want to set this to something longer, perhaps a week or a month, or whatever period of time you want to retain your backups for.

    clip_image002[5]
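In case the screenshot is hard to read, here is a minimal text sketch of the same idea, written against the classic Azure PowerShell storage cmdlets. The 24 hour threshold and the account key lookup are assumptions based on the description above, so the exact code in the screenshot may differ.

    param(
        [Parameter(Mandatory = $true)] [string] $AzureStorageAccountName,
        [Parameter(Mandatory = $true)] [string] $AzureContainerName
    )

    # Assumes the subscription has already been imported via Get-AzurePublishSettingsFile /
    # Import-AzurePublishSettingsFile, as described above
    $retentionHours = 24    # 24 hours for testing; increase for real retention
    $cutoff = (Get-Date).ToUniversalTime().AddHours(-$retentionHours)

    # Build a storage context from the account key
    $storageKey = (Get-AzureStorageKey -StorageAccountName $AzureStorageAccountName).Primary
    $context    = New-AzureStorageContext -StorageAccountName $AzureStorageAccountName -StorageAccountKey $storageKey

    # Delete any backup blob whose last modified date is older than the threshold
    Get-AzureStorageBlob -Container $AzureContainerName -Context $context |
        Where-Object { $_.LastModified.UtcDateTime -lt $cutoff } |
        ForEach-Object { Remove-AzureStorageBlob -Blob $_.Name -Container $AzureContainerName -Context $context }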

Save your script with the extension ‘.ps1’, the standard extension for PowerShell scripts.

There you have it. A very basic script that will stop you hoarding too many backup blobs.

    SPSS Tutorial - Getting Started with SPSS

In my previous role as a Consumer Data Analyst, I often found myself teaching “a bit of SPSS” to people who had no previous exposure to it but needed some basic knowledge to be able to support me and my team in our work.

In my current position my work is still very much SPSS based, but in an environment full of BI professionals who are all very savvy in all sorts of database-related magic like SQL, SSIS and other acronyms I can’t even guess at, yet are still quite suspicious of SPSS. We are trying to close that knowledge gap by shedding some light on what SPSS is and how we use it.

The material in these blog posts can be used as additional material for classroom-based training, or it could help anyone who wants to do some self-learning on SPSS.

     

    1st Class: What is SPSS?

Below is a quick introduction to what SPSS really is, along with some useful tips on how to get started with the software.

SPSS is a powerful statistical software package that enables quick data manipulation, data presentation and complicated statistical calculations on big sets of data. “SPSS” stands for “Statistical Package for the Social Sciences”. It was launched in 1968 and purchased by IBM in 2009. You can download a trial version (accessible for a couple of weeks) from IBM’s homepage: https://www-01.ibm.com/marketing/iwm/iwm/web/preLogin.do?source=SWG-STATS-DESKTOP_TRIAL

In my examples I will concentrate on standard Consumer Insight data retrieved from online surveys. In this case a “big set of data” means thousands of rows (usually in the range of 2,000-40,000) and anything from a few hundred to a few thousand columns (typically 500-5,000).

     

To really get started, this is what an SPSS file looks like when you open it:

    image

    Every row is a Respondent and every column is an answer that the respondent gave to a particular question (or part of a question) in the online survey.

     

You’ll be able to see more details of your data (more than just numeric codes) by clicking on the “Value labels” icon in the toolbar at the top:

    image

     

    If you hover over the column headers it will give you the variable label:

    image

     

    Even more details can be accessed by clicking on “Variable View” on the bottom left corner of the same window:

    image

    (Hints & tips: you can achieve the same by double clicking on the Variable Name (top, grey cell in each column and swap back by double clicking on the first grey cell for a certain row of the Variable View:

    image

     

The most helpful items in “Variable View” are the following columns:

-          Label: in a clean, well-structured file this tells you what the exact text of the question was and/or shows the answer options for multi-choice questions (e.g. Q: Which of these games have you played during the last month? A1: Halo, A2: Fallout, A3: Medal of Honor, etc.; the respondent can select more than one answer option.)

-          Values: this shows all answer options for single-choice questions (e.g. Q: How much do you like playing this game? A1: Love it, A2: Sort of like it, A3: It’s OK, A4: I don’t like it, A5: I hate it; the respondent can only select one answer option.)

-          Measure: some statistical functions can only be performed on certain types of variable (e.g. you can only run “Means” on SCALE variables). For example, you could phrase the above question as “Please tell us on a 10 point scale how much you like playing this game?”; in this case you can run standard statistics, such as means, on the variable. The same wouldn’t be accurate on the original 5 point nominal scale, as the distance between the individual values is subjective.
Because of this it is good to know where to check the correct “variable level” (another name for “Measure”, commonly used in SPSS syntax).

     

With all this information you are now in a good position to start using the data in your SPSS file.

    SQL Server Managed Backup to Windows Azure

    This Blog is part 2 of a 3 blog series concerning backing up your SQL Server databases to Azure Blob storage. This part details the basic setup of the SQL Server Managed Backups feature.

When automating backups yourself, you need to schedule jobs as part of a strategy and write custom scripts to maintain them, since there is currently no built-in way to delete old backups. The SQL Server Managed Backup service, by contrast, allows you to specify a retention period of between 1 and 30 days. Other than the storage location there is nothing else to configure, at least when getting started, although there are some advanced options available including encryption and a custom backup schedule (2016).

The frequency and the type of the backups are, by default, determined by the workload. A full backup takes place if the managed backup service is enabled for the first time, log growth is equal to or larger than 1GB, a week has passed since the last full backup, or the log chain is broken. Transaction log backups happen if the log space used is 5MB or larger, if 2 hours have passed since the last log backup, or any time the transaction log backup falls behind the full database backup.

With the custom backup schedule you can specify your own frequency. Full backups can be set to daily or weekly, while log backups can be scheduled in minutes or hours. Weekly backups can be taken on the day of your choice and the start time for any backup is also configurable. Finally, you can set a maximum amount of time for a backup to complete in.

With SQL Server 2014 the maximum size allowed for a backup is 1TB because the storage used (page blobs) has this limitation. 2016 uses backup to block blobs, which have a maximum size of 200GB but, through striping, allow backups of up to 12TB. Block blobs also cost less to store data in than page blobs. If using 2014, databases must be in the full or bulk-logged recovery model, as simple is not supported; the simple recovery model is, however, supported in 2016. System databases are not supported in SQL Server 2014 but they are in SQL Server 2016.

    Create an Azure Storage Account

    The first prerequisite is to create an Azure Storage Account. You need this regardless of whether you choose to use managed backups or whether you use your own strategy using SSIS or scripting.

    clip_image002

    Create a Blob Container

After your storage account has been created you should then create a blob container. A container is a means of grouping sets of blobs, and blobs have to live in a container. There is no limit to the number of blobs you can have inside a container and no limit to the number of containers in a storage account, so you can go to town organising your blobs and containers.

    clip_image004

In the Azure Portal, if you select the storage account you just created you can create a container by navigating via Blobs > Containers. Here you can add the container. You must give it a name and an access type; the choices are Private, Public Blob and Public Container.

    · Private

o No anonymous access; a shared access signature (or the account key) is required to get in.

    · Public Blob

    o Container access is restricted and blobs within it won’t be listed if anonymous access is attempted. However blobs themselves can be anonymously accessed via a URL directly.

    · Public Container

    o Container and all the blobs in it are open to anonymous access. All blobs within the container can be listed and viewed by anyone.

    The access policy can be changed after the container is created. I created this one with the Private setting.
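If you would rather script this than click through the portal, a rough equivalent using the classic Azure PowerShell cmdlets might look like the following. The account and container names are taken from the examples in this post and the location is an assumption; ‘Off’ corresponds to the Private access type.

    # Classic (ASM) cmdlets; assumes a publish settings file has already been imported
    New-AzureStorageAccount -StorageAccountName "adatissqlbackupstorage" -Location "West Europe"

    # Build a context from the account key and create a private container
    $key     = (Get-AzureStorageKey -StorageAccountName "adatissqlbackupstorage").Primary
    $context = New-AzureStorageContext -StorageAccountName "adatissqlbackupstorage" -StorageAccountKey $key
    New-AzureStorageContainer -Name "managed-backups-container" -Permission Off -Context $context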

    Create a Shared Access Signature

Next, if you wish to allow access to your private blob container, you need a Shared Access Signature (SAS). You can’t do this via the portal, so you have two options: programmatically through PowerShell or by using Azure Storage Explorer. Below you can see that I am using Azure Storage Explorer. Add your storage account and navigate to your containers, then highlight your container and click security. Here you can generate the SAS. Give it full permissions so that the SQL credential you will create afterwards is able to carry out the backups.

    clip_image007

    You can also specify an expiry on the SAS. The SAS will be provided together with a URL so separate out the URL for the container from the SAS token.
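If you go down the PowerShell route instead of Azure Storage Explorer, a minimal sketch with the classic storage cmdlets is below. The permissions and one-year expiry are illustrative; remember to keep the container URL and the SAS token separate, as noted above.

    # Build a context for the storage account (assumes the subscription is already imported)
    $accountName = "adatissqlbackupstorage"
    $accountKey  = (Get-AzureStorageKey -StorageAccountName $accountName).Primary
    $context     = New-AzureStorageContext -StorageAccountName $accountName -StorageAccountKey $accountKey

    # Generate a SAS token for the container with full (read/write/delete/list) permissions
    $sasToken = New-AzureStorageContainerSASToken -Name "managed-backups-container" `
                    -Permission rwdl `
                    -ExpiryTime (Get-Date).AddYears(1) `
                    -Context $context

    # The container URL becomes the credential name, the SAS token becomes the SECRET
    "https://$accountName.blob.core.windows.net/managed-backups-container"
    $sasToken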

    Create a SQL Server Credential

Now you need to add a credential into SQL Server for access to the blob container. Use SQL Server Management Studio to do this with the following code, substituting your details.

    2016

CREATE CREDENTIAL [https://adatissqlbackupstorage.blob.core.windows.net/managed-backups-container]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '<SAS TOKEN STRING HERE>'

    2014

CREATE CREDENTIAL [https://adatissqlbackupstorage.blob.core.windows.net]
WITH IDENTITY = 'adatissqlbackupstorage',
SECRET = '<ACCESS KEY STRING HERE>'

With 2014, managed backup creates the container used to store the backups itself. It is created with the naming format ‘machine name-instance name’.

    Create the Managed Backup

    Note: Ensure that the SQL Server Agent is running

    Note: The following code is for SQL Server 2014. The code is different for 2016.

Use the following code, substituting your own values, to set up the managed backup. If you wish to apply this to all databases (effectively the instance) then use * or NULL for the database name parameter.

USE msdb;
EXEC smart_admin.sp_set_db_backup
     @database_name = 'DOJO'
    ,@retention_days = 5
    ,@credential_name = [https://adatissqlbackupstorage.blob.core.windows.net]
    ,@encryption_algorithm = NO_ENCRYPTION
    ,@enable_backup = 1;

    This code specifies not to use any encryption, just to make this simple.

    Once created, you can monitor the health status of the job and also configure email notifications for errors and warnings. Check out the following link for more information:

    https://msdn.microsoft.com/en-us/library/dn449498%28v=sql.120%29.aspx

    SQL Server Backup and Restore with Microsoft Azure Blob Storage Service

    This Blog is part 1 of a 3 blog series concerning backing up your SQL Server databases to Azure Cloud Storage. This first blog will provide a summary of the options currently available to you.

The reasons to leverage Azure for your SQL Server backups are obvious. It is a far better alternative to tape backups: it eliminates the need for offsite transportation and makes high availability of the backups easier to obtain. In addition, there is no hardware management overhead. Other things that should be factored in are the cost (you only pay for what you use) and the increased accessibility of the backups themselves.

When 2016 is released, there will be 3 supported methods of storing SQL Server backups in Azure Blob storage. The first two below can be used from on-premises SQL Server instances, but the third requires the instance to be running in an Azure VM:

    Backup to URL

    Done in the same way as backups have been done before to disk (or tape). This became available in SQL Server 2012 SP1 CU2. From 2016 onwards you will be able to use block blobs instead of page blobs. This method can be leveraged using the SSIS backup task.

    SQL Server Managed Backups

    A way to let SQL Server manage backups, negating the need to form your own strategies. This feature also allows you to set retention periods and works at a database level or at instance level.

    Snapshot Backups for Database Files

Coming with SQL Server 2016, this provides near-instantaneous backups and restores with point-in-time restore. The database files themselves need to be stored directly in Azure Storage, and the instance needs to be running in an Azure VM.

    Here are a few positives and negatives about each option:

    · Backup to URL

    o + Can be done during an SSIS load

    o + Can be executed whenever you wish

    o - Currently no way to maintain the backups from within SSIS natively

    · SQL Server Managed Backups

    o + Fully managed by SQL Server

    o + Can have custom schedules

    o + Maintains backups (retention period)

    o - Always invoked as part of a schedule or under certain conditions

    · Snapshot Backups

    o + Very fast

    o - Requires database files to be situated in Azure Blobs

    o - Requires instance to be on an Azure VM – not an on-prem solution

I will explore the first two options in parts 2 and 3. However, I will not go into more detail about database file snapshot backups, as I will stick to on-premises solutions.

    Costs

    Costs in this section are correct at the time of writing (2015/10/28). For updated costs check the links.

    Data Transfer

    The following information has been lifted directly with minor alterations from: https://azure.microsoft.com/en-gb/pricing/details/data-transfers/

    Inbound data transfers

    There is no cost for transferring data into Azure.

    Outbound data transfers

Outbound data transfers            Zone 1*             Zone 2*             Zone 3*
First 5 GB/month                   Free                Free                Free
5 GB - 10 TB/month                 £0.0532 per GB      £0.0844 per GB      £0.1106 per GB
Next 40 TB (10-50 TB)/month        £0.0508 per GB      £0.0825 per GB      £0.107 per GB
Next 100 TB (50-150 TB)/month      £0.0428 per GB      £0.0795 per GB      £0.1039 per GB
Next 350 TB (150-500 TB)/month     £0.0306 per GB      £0.0734 per GB      £0.0978 per GB
Over 500 TB/month                  Contact Microsoft   Contact Microsoft   Contact Microsoft

    A sub-region is the lowest level geo-location that you may select to deploy your applications and associated data. For data transfers (except CDN), the following regions correspond to Zone 1, Zone 2 and Zone 3.

    Zone 1: US West, US East, US North Central, US South Central, US East 2, US Central, Europe West, Europe North

    Zone 2: Asia Pacific East, Asia Pacific Southeast, Japan East, Japan West, Australia East, Australia Southeast

    Zone 3: Brazil South

As an example, say you have a Blob Storage account in West Europe (Zone 1). In it you have a backup file for a database that is approximately 9GB in size. You begin a restore of that data and pull it from Azure to a local disk.

The first 5GB of this file are free. The remaining 4GB falls into the £0.0532 per GB tariff, so the total cost of transferring the file out of Azure would amount to £0.2128, assuming you only ever did this once per month.

    Standard Storage

    Block blobs

    There is a charge of £0.0022 per 100,000 transactions for Block Blobs. Transactions include both read and write operations to storage.

Storage Capacity                          LRS                 ZRS                 GRS                 RA-GRS
First 1 TB/month                          £0.0147 per GB      £0.0184 per GB      £0.0294 per GB      £0.0373 per GB
Next 49 TB (1 to 50 TB)/month             £0.0145 per GB      £0.0181 per GB      £0.0289 per GB      £0.0366 per GB
Next 450 TB (50 to 500 TB)/month          £0.0142 per GB      £0.0178 per GB      £0.0284 per GB      £0.036 per GB
Next 500 TB (500 to 1,000 TB)/month       £0.014 per GB       £0.0175 per GB      £0.0279 per GB      £0.0354 per GB
Next 4,000 TB (1,000 to 5,000 TB)/month   £0.0137 per GB      £0.0172 per GB      £0.0274 per GB      £0.0348 per GB
Over 5,000 TB/month                       Contact Microsoft   Contact Microsoft   Contact Microsoft   Contact Microsoft


    Page Blobs and Disks

Storage Capacity                          LRS                 GRS                 RA-GRS
First 1 TB/month                          £0.0306 per GB      £0.0581 per GB      £0.0734 per GB
Next 49 TB (1 to 50 TB)/month             £0.0306 per GB      £0.0489 per GB      £0.0611 per GB
Next 450 TB (50 to 500 TB)/month          £0.0306 per GB      £0.0428 per GB      £0.055 per GB
Next 500 TB (500 to 1,000 TB)/month       £0.0306 per GB      £0.0398 per GB      £0.0489 per GB
Next 4,000 TB (1,000 to 5,000 TB)/month   £0.0275 per GB      £0.0367 per GB      £0.0459 per GB
Over 5,000 TB/month                       Contact Microsoft   Contact Microsoft   Contact Microsoft

     

This concludes the first overview blog. Hopefully this has given you a rough idea of the options available to you for backing up to Azure Storage. The next two blogs, coming shortly, will focus on managed backups and SSIS backups to URL in more detail, showing how to configure and use them.

    Automation Variables In Azure

If you need to share values between Azure runbooks or parameterise them, Automation Variables may be just the thing you need.

Essentially these are exactly what you might expect: variables which you define and can use within Azure. The variables can be updated via PowerShell scripts, manually via the Azure portal and, of course, by runbooks.

    The first thing to do is to define the variable. To do this go to the Azure portal and go into the Automation section. Following this navigate to the assets section, click new and select ‘Add Setting’

    image

    Select ‘Add Variable’

    image

     

    Fill in the relevant details for the variable. Note the variable type can be changed programmatically later if this is required.

    image

Set a default value for the variable if required, and if the variable contains sensitive information you can also choose for it to be encrypted.

    image

To access the variable from a PowerShell script, the syntax used is like this:

    Get-AutomationVariable -Name 'VariableName'

    So for example, in order to access the value of a variable called “ETLProcessStatus” and assign this to a variable named $CurrentRunStatus you would do the following:

    $CurrentRunStatus = Get-AutomationVariable -Name "ETLProcessStatus"
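Updating a variable from inside a runbook uses the matching Set cmdlet. A quick sketch, reusing the “ETLProcessStatus” example above (the value written is just an illustration):

    # Write a new value to the shared automation variable from within a runbook
    Set-AutomationVariable -Name "ETLProcessStatus" -Value "Load Complete"

    # Reading it back now returns the updated value
    $CurrentRunStatus = Get-AutomationVariable -Name "ETLProcessStatus"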

    In case you forget the syntax for retrieving or updating the value of the variables there is a nice feature where it will write the basic syntax out for you.

    Click insert at the bottom of the ribbon and then ‘Setting’

    image

    Following this you can select from the following options and template commands will be inserted into the script.

    image

    In a blog which will be coming shortly, I will detail how I use this variable and some other techniques in order to assist in running a scalable cloud ETL solution.

     

    Addressing the Travelling Salesman Problem with Azure ML and Power Map

    I’ve recorded a short video (< 7 minutes) showing how Azure ML, Power BI and Power Map can address the Travelling Salesman Problem.

The Travelling Salesman Problem is firmly rooted in academia and is computationally complex to solve, with many solutions tackling it in complete isolation from the real world and therefore computing it based on straight-line distance only. Here we’ve utilised actual driving distance, journey time and real-world information such as traffic data to provide a slightly more pragmatic result.

    The video can be found here, on YouTube.

    clip_image002

    Getting Started with Azure Data Lake Analytics &amp; U-SQL

    Data Lake Analytics is the querying engine that sits on top of the Azure Data Lake (ADL) Storage layer. If you have not yet got to grips with ADL Store, I’d suggest you go through my quick introduction here.

Azure Data Lake Analytics has developed from internal languages used within Microsoft, namely ‘SCOPE’. It runs on Apache YARN which, in turn, is a re-architecture of the original Apache MapReduce framework. For a little light reading around the history of ADL, I’d suggest looking at the ADL Analytics Overview here.

    The new language introduced by ADL Analytics, mysteriously named U-SQL, brings .NET functionality and data types to a SQL syntax. You declare variables as strings, not varchars, but frame your code in SELECT, FROM and WHERE clauses. The extensibility of the code is huge as a result – you can easily write your own C# methods and call them within your select statements. It’s this unification of SQL and .NET that supposedly gives U-SQL its name. The familiarity of code on both sides should open this up to Database and Application developers alike.

    Setting up your Data Lake Analytics Account

    Assuming you have already set up an ADL Store, setting up your own Analytics service is just as easy.

    First, go through New > Data & Analytics > Data Lake Analytics:

    clip_image002

    You’ll get the usual new item configuration screen, simply pop in your details and link it to your ADL Store account.

    image

    A few minutes later, you’ll be ready to go with your Analytics service.

    To start with, we’ll write a very basic U-SQL Job through the Azure Preview Portal. This way, you can start performing large transformations on your files without the need to download any extensions, updates etc. You don’t even need a copy of Visual Studio! However, as you formalise your system and begin to rely on it as your primary datasource, you’ll definitely want to be keeping your code as source-controlled solutions and making use of the various capacity management tools Microsoft have recently released for managing your Data Lake projects.

    Download Sample Data

    Back in the Azure Preview Portal, when we open up our new ADL Analytics account we see the familiar overview blade:

    image

    There’s a decent overview of the U-SQL language here, along with several sample jobs provided through the “Explore Sample Jobs” link on the overview blade. If you follow the samples link, you’ll see a couple of options on a new blade.

    clip_image006

    For now, click the “Copy Sample Data” button at the top. This will populate your data lake store with the sample files used by the provided examples. I’ll walk through some more advanced examples over the next few posts, but let’s simply try and access some data first. The first example uses SearchLog.tsv found in /Samples/Data/ after installing the samples.

U-SQL works by defining rowset variables and passing them between various functions. Your first rowset may be data extracted from your sample text file; this rowset is then passed to an output, which writes it to an aggregate table or another file.

    Your first U-SQL Job

    Simply click on the “New Job” icon on the ADL Analytics Overview blade to start writing your very first job.

    clip_image008

    Admittedly, this throws you in the deep end. You’re faced with a blinking cursor on line one of your script, but I’ll talk you through the structure of the first example query.

    The ADL Store can contain SQL tables, as well as unstructured objects, and the syntax used varies depending on what you’re using. Tables are accessed using the traditional SELECT clause whereas for files we use EXTRACT. I’m assuming most readers will be familiar with a select statement, so let’s get an initial Extract working.

    We start by defining our rowset variable, let’s call it @searchlog for now. There’s no need to declare this formally, we can just go ahead and assign the results of a query to it.

    The basic structure of this query would be:

    @searchlog = 
        EXTRACT <column1> <datatype>
        FROM <sourcelocation>
        USING <extraction method>;

    The major assumption is that we will be defining schema on query – the flat files do not contain their own schema information and so we define it when writing the query.

So, to bring back some data from the “SearchLog.tsv” sample file, we need to give each column a name and a data type. It appears that we need to define the whole file for now, although it seems that support for querying across variable structures is on its way; it doesn’t seem to be documented just yet.

    Defining each column, we build up the EXTRACT statement to:

    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string

    Remember, we’re using C# datatypes so we don’t need to worry about lengths of strings etc.

    Next, we define the filename. In the first example, we can use a reference to a specific file – this can be the fixed URL to the file specifically, or a relative reference within the Store itself. Our FROM statement for the SearchLog file is therefore:

    FROM @"/Samples/Data/SearchLog.tsv"

    Finally, we need to tell the query how to understand the particular file we’re attempting to extract data from. There are many extraction interfaces defined by default, for many of the most common flat files, so don’t worry if you prefer CSVs to TSVs, or even if you prefer to define your own delimiters.

In this case, as we’re using a TSV, we use the inbuilt Extractors.Tsv() function.

    Putting this all together gives us the example query:

    @searchlog = 
        EXTRACT UserId          int, 
                Start           DateTime, 
                Region          string, 
                Query           string, 
                Duration        int, 
                Urls            string, 
                ClickedUrls     string
        FROM @"/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();

    This will leave us with a rowset variable that has been populated with the columns defined from the TSV file. In SQL parlance, this is like defining a table variable and throwing it away at the end of the query.

In order to view our results, we need to output our rowset somewhere; this is where the OUTPUT clause comes into play.

    A full OUTPUT statement requires:

    OUTPUT 
        <rowset variable>
    TO 
        <location>
    USING 
        <output method>

We know our rowset variable: that’s the @searchlog we’ve just extracted data into. We can define a new file for our location; this simply needs to be a relative path and the name of the file to be created.

    Finally, as with Extractors, we need to instruct the query what function to use to output the data if we’re pushing it to a flat file. Once again, many providers are included as standard, but let’s stick with TSV for simplicity.

    Our output statement therefore looks like:

    OUTPUT @searchlog 
        TO @"/Samples/Output/SearchLog_output.tsv"
        USING Outputters.Tsv();

    Putting this together our full U-SQL script is:

    @searchlog = 
        EXTRACT UserId          int, 
                Start           DateTime, 
                Region          string, 
                Query           string, 
                Duration        int, 
                Urls            string, 
                ClickedUrls     string
        FROM @"/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();
     
    OUTPUT @searchlog 
        TO @"/Samples/Output/SearchLog_output.tsv"
        USING Outputters.Tsv();

    Now this isn’t terribly exciting. This will simply take the contents of the sample file and dump it into a new file with exactly the same structure.

    For now, click the Submit Job button to execute your job.

    clip_image010

One key point here is that it will take at least a minute or two for the job to run, even if you have very little data in your files. The whole system is optimised for massive scale, not lightning-fast micro-transactions. The real gains come from running queries across hundreds or thousands of files at once, scaling to run across many terabytes of data efficiently.

    Hit refresh on the job status blade and you’ll eventually see the results of your job, hopefully succeeded.

    image

    You’ll then be able to navigate to the output file and view your results.

That’s your very first foray into writing U-SQL, and not a terribly exciting example. However, we can write additional queries in between the EXTRACT and OUTPUT steps that add calculations and aggregations, join to additional rowsets and even apply any C# libraries that we associate. This in itself is nothing new, especially if you’re familiar with Pig; however, this can all be scaled to massive levels using a simple slider and a pay-as-you-go charge rate. We’ll come to these more advanced examples in future posts.


    Getting Started with Azure Data Lake Store

    You may have heard about Microsoft’s new Azure Products – Azure Data Lake Storage and Data Lake Analytics. If not, take a look at the official blurb here!

    What is it?

Essentially it’s Hadoop-made-easy: the storage engine is a WebHDFS layer with a management overlay that takes care of scaling and cluster distribution so you don’t have to. For many Hadoop veterans this will feel as uncomfortable as a grizzled DBA allowing Azure SQL DB to determine the best backup schedule on their behalf. But for everyone else, this removes a significant technical barrier to using truly scalable, distributed storage. What’s more, your store can be accessed by any Hadoop-compatible technologies, so you can integrate with your existing architecture as a hassle-free migration path.

Data Lake Analytics provides the parallel querying engine. This comes courtesy of a new language, U-SQL, running on Apache YARN (which is essentially a version 2.0 of the popular Apache MapReduce framework). Once again, the big difference is that all of the cluster management is done for you, and on a pay-as-you-use model.

For a more detailed overview, take a look at Microsoft’s own overview page. Their architectural vision has the Data Lake as the central hub for all data-based operations, and they headline with a diagram making this very clear:

    clip_image002

The goal is for the Data Lake Storage to be a dumping ground for any and all of your data, be it structured, semi-structured or entirely unstructured. The competitive price of Azure storage means they’re now pushing a “store first, think of questions to ask later” methodology – initial indications are that 1GB of data will cost just £0.05 per month. This makes it an attractive offer for most archiving scenarios where the data still needs to be accessible in some form. I’ve a few thoughts around where this is heading and what it means to the Data Warehouse, but first I’ll run through how to get your Data Lake set up – you can figure out what you’re going to do with it later!

    Creating your Azure Data Lake Store

    First you’ll need to set up your storage account – you can do this through any valid Azure subscription. At the time of writing, the Azure Data Lake elements are in public preview, so you’ll need to sign up and be accepted first.

    To create your storage account, simply go to New > Data + Storage > Data Lake Store:

    clip_image004

    You’ll be presented with the usual configuration screen:

    clip_image005

    Again, at the time of writing this service is only available in the East US 2 location. A few minutes later, your Data Lake Store will be ready to go!

    You can now begin loading files into the store, ready for use. You have several options on how to get your data into the store, but for now we’ll go with the most basic.

    Uploading your first file using Data Explorer

    Open up the resource blade for your DLS and you’ll see a standard overview panel for your new object.

    clip_image006

    I’ll let you explore the settings and monitoring options yourself. What we’re interested in is the Data Explorer. Click that button (or there’s another entry point via settings) and you’ll be presented with the new Data Explorer blade.

    clip_image008

    Here you can easily see all of the objects that have been uploaded to the store so far. Let’s add a file by clicking the upload button.

    You’ll see a basic upload screen:

    clip_image010

    And if we select a csv file (or one of many supported file types that takes your fancy), we can click start upload to send the file across.

    You’ll receive a portal notification when the upload has completed, and the data explorer view will refresh to show us our new contents:

    clip_image012

    Clicking on the file will then open it up in the file preview blade, so you can explore the various objects contained in the store:

    image

And that’s it – your file is now available through a variety of methods. You can treat your Azure Data Lake Store like any other WebHDFS layer, or you can spin up an Azure Data Lake Analytics account to write U-SQL against your stored file.
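As a small taster of the programmatic route (which I’ll cover properly in future posts), the Azure Data Lake Store PowerShell cmdlets from the AzureRM module can push a file in and list what’s there. The account and file names below are purely illustrative:

    # Sign in, upload a local file to the store, then list the folder contents
    Login-AzureRmAccount
    Import-AzureRmDataLakeStoreItem -AccountName "myadlstore" -Path "C:\Data\SampleData.csv" -Destination "/Samples/SampleData.csv"
    Get-AzureRmDataLakeStoreChildItem -AccountName "myadlstore" -Path "/Samples"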

    A minor word of warning here – in my initial attempts of using the store, I found it to have difficulty recognising column titles within flat files, I’d receive parsing errors where it was trying to convert titles into numerical fields etc. This is most likely because I’m new to U-SQL as a language and the documentation is very much a work in progress still. I’m assuming this limitation will be easy to work around but have, for now, removed column titles from my sample sheets.

    Over the next few weeks, I’ll be exploring other methods of getting data into your Data Lake Store, whether it’s programmatically through C# or Powershell, streamed through Event Hubs and Azure Streaming Analytics or even more traditionally, uploading to relational database tables through SSIS, all of which can then be accessed via a single interface.

    I will also be looking at U-SQL in particular, and how it can be leveraged in a new-world Warehousing Solution.