Adatis

Adatis BI Blogs

SQL PASS Summit 2014 – Kick Off

Day 1 has kicked off in Seattle, a remarkable city. Having arrived a week early, I’ve had plenty of time to check out the sights and the food, and have enjoyed it immensely - a brilliant venue for PASS!

There were a few announcements at this morning’s keynote, mostly regarding Azure. Azure SQL Databases are gaining larger index handling, parallel queries, extended events and in-memory columnstore for data marts. Joseph Sirosh gave a talk about Machine Learning and information management, showing a cool example of Azure Stream Analytics in which Microsoft Kinect sensor information of customer interactions in a shop was uploaded straight into Power Map! I am looking forward to hearing more on Machine Learning. There are also a handful of great improvements for Power BI. I am most looking forward to the new Live Ops Dashboards and drill-through actions! Combo Reports also look promising…

Moving on to the first session, I chose to attend ‘What’s New in Microsoft Power Query for Excel’. As it turns out, there’s not a massive amount of new stuff – some new data sources and a tick box when you do a merge to remove name prefixes. However, one of these new sources is the long-awaited Analysis Services data source, and the new ODBC data source is a great addition too. There was a mention of the possibility of a decoupled M-query SSIS component, but we probably won’t hear anything until later in 2015, unfortunately. I would say this was not a level 300 session - more like 100/200.

The second session was ‘SQL Server 2014 Always On (High Availability and Disaster Recovery)’: a nice overview of what was introduced in 2012 and the newer 2014 features, including an increased maximum number of secondary replicas, increased availability of readable secondaries and the Add Azure Replica Wizard. Despite not being a DBA, and it being a level 300 session, I found it easy to follow and absorb. I suspect, however, that many of the DBAs in the room did not take away much they didn’t already know.

Niko Neugebauer gave a fantastic, in-depth session on ‘ETL Patterns with Clustered Columnstore Indexes’. It was a fast-moving talk, despite the time spent waiting for some executions, and it demanded your full attention - definitely worthy of its 400 level. It left me a little tired for Marco Russo’s ‘DAX Patterns’ session, which showed some examples of workarounds for common tabular weaknesses, such as distinct counts for Type 2 slowly changing dimensions and cumulative quantities.

Overall it was a strong day. I am looking forward to tomorrow. More to follow…

SQL PASS Summit – Day 1

Having spent several days getting to know and enjoy Seattle, the conference is now underway. Yesterday there were some introductory meetings where we got to meet some of the other attendees and get a feel for the environment; today everything was in full swing.

The morning started with the opening keynote, presented to a huge audience. There were demonstrations of some really exciting new features – in one we saw Kinect being used with Power Map to track customer movements and gauge customer interest in different parts of a store. We saw some great-looking Power BI dashboarding functionality, with the ability to drill through into detailed reports. We also saw further enhancements to SQL Server’s Azure and on-premises integration, including new ‘stretch’ functionality which will allow users to seamlessly stretch their data into the cloud, keeping the most frequently queried records on-premises and the rest in the cloud. We also saw a columnstore index being created on an in-memory table!

Miguel Lopes gave a talk on the new features in Power Query, where we saw the ability to query using ODBC and support for Analysis Services connections. On the whole, though, while I think the ODBC support will be particularly useful in some situations, much of this talk was an overview of Power Query as a whole rather than a demonstration of new functionality. The future integration of SSIS and Power Query was mentioned, however no dates have been set and we are told that this may (or may not) be available at some point in 2015.

Jen Stirrup gave an interesting overview of some of the features available in R; the session was so popular that many people had to sit at the front on the floor! Niko Neugebauer’s contagious enthusiasm when presenting his session on ETL Patterns with Clustered Columnstore Indexes was great to see, and I picked up a few tricks here that will come in handy when working in 2014 environments. I also very much enjoyed John Welch’s session on Continuous Delivery for Data Warehouses and Marts. This is something I have been involved with a lot recently, and it was very interesting to see his methods of achieving it and to discover that in many cases we were both doing things in the same way :)

Overall the day has been very interesting. We have seen some exciting new features announced and some significant enhancements to the Power BI product range. It has seemed to me for some time that the lack of dashboarding functionality in Power View was holding it back, and I think many people will be very pleased with this new functionality and the further enhancements to the Azure service.

Azure Machine Learning Example - Insurance

Predicting data is nothing new in the Microsoft BI world, given the data mining capabilities in Analysis Services, but there is now another option – namely Azure ML, Microsoft’s cloud-based predictive analytics solution. Like many people, I’ve been keen to see what Azure ML can do. Luckily, you can still sign up for a completely free trial for Azure, meaning you can be up and running with Azure ML in a matter of minutes.

For this blog post I’ve selected an insurance use case, as I’ve been working in the insurance industry quite a bit over the past couple of years. My reason for using Azure ML is that I would like to see if it’s possible to predict the retention rate, i.e. out of all the policies that were due for renewal, what percentage actually renewed? Therefore I’m going to try to predict whether a car policy holder will renew, based on a dataset of previous policies that shows whether or not each policy was renewed.

Loading Data

I didn’t want to use real data for this example, so I’ve got some made-up data, which I’ve saved as a simple CSV. Thankfully this can be easily uploaded to Azure using ML Studio, which is the main development tool for Azure ML.

Experiments

Having uploaded my test dataset, I’m ready to build a predictive model. Azure ML allows me to do this by creating an experiment, which is to ML Studio what a project is to Visual Studio. Within ML Studio, the starting point is to pick a dataset. Data can be consumed from a variety of sources, such as SQL Azure, Azure Blob Storage and HTTP, to name a few. In this case I’m going to use my CSV file, which is available under Saved Datasets, along with a host of other ‘modules’ that perform various tasks within ML. The modules can be dragged and dropped onto the canvas, so the starting point is to drag the insurance dataset onto the canvas.

Once I have a dataset on the canvas, I can right-click on the saved dataset and choose Visualize, which will profile my data for me, showing statistics such as the minimum, maximum and standard deviation for each column. There are a myriad of factors that could potentially affect the retention rate, such as renewal price, number of other policies held (e.g. perhaps the customer has both life and car insurance with the same company), how long the policy holder has been a customer, as well as make/model, insurance add-ons (legal cover, hire car cover), age and other demographics. I’ve not got all of these factors in my dataset, but I have a fair few, just to show an example.

Once I’ve got my insurance dataset, the next task is to add a Split module, which will allow me to split the rows into two buckets, in a 75%/25% split. Firstly, a training set is needed, which will be fed into the actual model, allowing it to learn from the past data. The other bucket will be the test set, which will be used to test how the trained model performs. The two dots at the bottom of the Split module are its two outputs.

The next task is to drag a Train Model onto the canvas and then connect the training set to the second input of the Train Model. For the first input, the Train Model expects an untrained learner, so I’ve given it an untrained Two-Class Neural Network. The second input expected by the Train Model is a dataset, which is the training set that comes from the Split.
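As an aside, the Split module is essentially performing a classic train/test partition. If you wanted to do something comparable outside ML Studio, a rough T-SQL sketch might look like the following (dbo.PolicyHistory and its columns are made-up names, purely for illustration):

-- A minimal sketch of a 75%/25% random split, analogous to the Split module.
-- dbo.PolicyHistory and its columns are hypothetical, for illustration only.
SELECT
    PolicyId,
    RenewalPrice,
    OtherPoliciesHeld,
    YearsAsCustomer,
    Renewed,
    CASE WHEN ABS(CHECKSUM(NEWID())) % 100 < 75
         THEN 'Training'
         ELSE 'Test'
    END AS DataSplit
INTO dbo.PolicyHistorySplit
FROM dbo.PolicyHistory;

Rows tagged ‘Training’ would play the role of the Split module’s first output, and rows tagged ‘Test’ the second.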
The experiment now has the following shape: the red icon on the Train Model indicates that no column is selected, so I’ve picked the column “Renewed” (Yes/No) from the column selector to indicate that this is the outcome we wish to predict.

In addition, we need to add a Score Model, which will allow us to score the now-trained model against the test dataset. The Score Model therefore accepts two inputs, namely a Trained Model and the test dataset. If I now run the experiment, we can click on the Visualize option on the output of the Score Model. This shows a comparison of the predicted outcome against the actual outcome in the test dataset, on a row-by-row basis.

Finally, we can add an Evaluate Model, which will measure the performance of the model using a specific set of metrics. Again we can choose the Visualize option, which will give us metrics such as True/False Positives/Negatives and AUC.

What I think is great about Azure ML is that it’s very easy to compare the performance of the Neural Network against other models. I can therefore add in a Two-Class Boosted Decision Tree, another Train Model and another Score Model, and quickly wire them up to the relevant inputs. The final step is to set the Evaluate Model to accept a second input, which is the output of the second Score Model. It all ends up looking like this: as you can see, we now have two inputs into the Evaluate Model, which allows us to compare the performance of the two chosen models. In this example, when choosing Visualize on the Evaluate Model, I can see that the Two-Class Boosted Decision Tree gave the best overall results.

The final result is that we have a model capable of predicting whether a policy holder is likely to renew or not. Of course this is just a quick example, but it hopefully gives you an idea of what Azure ML is capable of.
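Coming back to the retention rate that motivated the experiment, the metric itself is straightforward to calculate once renewal outcomes are available in a table. A minimal T-SQL sketch, assuming a hypothetical dbo.PolicyRenewals table holding one row per policy due for renewal, with a Renewed flag:

-- Retention rate: of all policies due for renewal, the percentage that renewed.
-- dbo.PolicyRenewals and its columns are hypothetical, for illustration only.
SELECT
    100.0 * SUM(CASE WHEN Renewed = 'Yes' THEN 1 ELSE 0 END) / COUNT(*)
        AS RetentionRatePercent
FROM dbo.PolicyRenewals;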

Analytics Platform System Polybase Example

To me, one of the most exciting parts of the Microsoft Analytics Platform System (APS) is Polybase, which in a nutshell allows you to access data residing in Hadoop or Windows Azure Blob Storage, all via T-SQL in PDW. What this means is that data can be transparently queried by a user or developer, in real time, regardless of whether the data lives in PDW or Hadoop/Azure. James Roland-Jones gives a thorough overview of all things Polybase here. What I’m going to do in this post is show an example of how existing data within Hadoop can be combined with data that resides in an APS PDW region.

Polybase Example - Setup

There are two key tasks to complete before we’re able to start querying data. We need to set up a data source, so that Polybase knows where to get the data from, and we need to describe the file format of the external file that we’re going to read. The data source here specifies that we’re going to use the Hadoop nodes that are co-located with the PDW nodes, but this could equally be a non-appliance Hadoop cluster:

CREATE EXTERNAL DATA SOURCE HadoopRegion_DataSource
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://H12345-C-HHN01',
    --Optionally specify a job tracker location to enable predicate
    --push down to Hadoop
    JOB_TRACKER_LOCATION = 'H12345-C-HHN01:50300'
);

The next task is to set up a file format. In this case we’re defining that the file is pipe-delimited, although Polybase can be used with other formats (e.g. RCFile):

CREATE EXTERNAL FILE FORMAT HadoopRegion_DelimitedPipe_NoCompression
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        Field_terminator = '|'
    )
);

Accessing External Data

Before we access the data in Hadoop, I want to show the file for this example. In this case I’ve got a customer dimension within PDW, but I have customer survey data within Hadoop that I wish to combine with my data in the warehouse. The survey data has been uploaded, in this case to the Hadoop region within APS, via the portal. The following shows a sample of the Hadoop file; note that it’s pipe-delimited, with columns for Date, Survey Id, Product, Customer Id and Survey Score.

Now I want to access the Hadoop data and combine it with other data in the warehouse. To do this we need to use the CREATE EXTERNAL TABLE command:

CREATE EXTERNAL TABLE [dbo].[HDFS_CustomerSurvey]
(
    DateId int NOT NULL,
    SurveyResponseId int NOT NULL,
    ProductCategoryName varchar(100) NULL,
    CustomerId varchar(50) NULL,
    SurveyResult int NULL
)
WITH (
    LOCATION = '/user/data/survey/',
    DATA_SOURCE = HadoopRegion_DataSource,
    FILE_FORMAT = HadoopRegion_DelimitedPipe_NoCompression
);

This results in an external table being available within the PDW region. Now if we want to query the Hadoop data, it’s just a simple case of T-SQL. Here I’m joining the Hadoop survey data to the customer dimension:

SELECT
    CS.CustomerId,
    C.FirstName,
    C.LastName,
    C.EmailAddress,
    C.Gender,
    CS.DateId AS SurveyDate,
    CS.ProductCategoryName AS SurveyProduct,
    CS.SurveyResult
FROM dbo.HDFS_CustomerSurvey CS
INNER JOIN dbo.DimCustomer C ON C.CustomerAlternateKey = CS.CustomerId

This gives the expected combined results. So that’s it, just a simple T-SQL query in the end. Polybase has taken away the complexity and allowed us to integrate different data sources using a widely used standard query language.
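As a further illustration of the same idea, once the external table exists it behaves like any other table in a query, so aggregating across the two sources is just as straightforward. A quick sketch of my own, reusing the tables and columns from the example above:

-- Average survey score by product category and customer gender,
-- combining the Hadoop-backed external table with the PDW customer dimension.
SELECT
    CS.ProductCategoryName,
    C.Gender,
    COUNT(*) AS SurveyCount,
    AVG(CAST(CS.SurveyResult AS decimal(5,2))) AS AvgSurveyScore
FROM dbo.HDFS_CustomerSurvey CS
INNER JOIN dbo.DimCustomer C
    ON C.CustomerAlternateKey = CS.CustomerId
GROUP BY
    CS.ProductCategoryName,
    C.Gender;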