Adatis BI Blogs

Getting Started with Azure Data Lake Store

You may have heard about Microsoft’s new Azure Products – Azure Data Lake Storage and Data Lake Analytics. If not, take a look at the official blurb here! What is it? Essentially it’s hadoop-made-easy – the storage engine is a WebHDFS layer with a management overlay that takes care of scaling and cluster distribution so you don’t have to. For many hadoop veterans this will feel as uncomfortable as a grizzled DBA allowing Azure SQL DB to determine the best backup schedule on their behalf. But for everyone else, this removes a significant technical barrier to using truly scalable, distributed storage. What’s more, your store can be accessed by any hadoop-compatible technologies, so you can integrate with your existing architecture as a hassle-free migration path. Data Lake Analytics provides the parallel querying engine, comes courtesy of a new language U-SQL, based on Apache YARN (which is essentially a version 2.0 of their populate Apache MapReduce). Once again, the big difference being that all of the cluster management is being done for you – and on a pay-as-you-use model. For a more detailed overview, take Microsoft’s own overview page. Their architectural vision has the Data Lake as the central hub for all data-based operations, they headline with a diagram making this very clear: The goal is for the Data Lake Storage to be a dumping ground for any and all of your data, be it structured, semi-structured or entirely unstructured. The competitive price of Azure storage means they’re now pushing a “store first, think of questions to ask later” methodology – initial indications are that 1Gb of data will cost just £0.05 per month. This makes it an attractive offer for most archiving scenarios where the data still needs to be accessible in some form. I’ve a few thoughts around where this is heading and what it means to the Data Warehouse, but first I’ll run through how to get your Data Lake set up – you can figure out what you’re going to do with it later! Creating your Azure Data Lake Store First you’ll need to set up your storage account – you can do this through any valid Azure subscription. At the time of writing, the Azure Data Lake elements are in public preview, so you’ll need to sign up and be accepted first. To create your storage account, simply go to New > Data + Storage > Data Lake Store: You’ll be presented with the usual configuration screen: Again, at the time of writing this service is only available in the East US 2 location. A few minutes later, your Data Lake Store will be ready to go! You can now begin loading files into the store, ready for use. You have several options on how to get your data into the store, but for now we’ll go with the most basic. Uploading your first file using Data Explorer Open up the resource blade for your DLS and you’ll see a standard overview panel for your new object. I’ll let you explore the settings and monitoring options yourself. What we’re interested in is the Data Explorer. Click that button (or there’s another entry point via settings) and you’ll be presented with the new Data Explorer blade. Here you can easily see all of the objects that have been uploaded to the store so far. Let’s add a file by clicking the upload button. You’ll see a basic upload screen: And if we select a csv file (or one of many supported file types that takes your fancy), we can click start upload to send the file across. You’ll receive a portal notification when the upload has completed, and the data explorer view will refresh to show us our new contents: Clicking on the file will then open it up in the file preview blade, so you can explore the various objects contained in the store: And that’s it – your file is now available through a variety of methods. You can treat your Azure Data Lake Store like any other WebHDFS layer, or you can spin up an Azure Data Lake Storage account to write U-SQL against your stored file. A minor word of warning here – in my initial attempts of using the store, I found it to have difficulty recognising column titles within flat files, I’d receive parsing errors where it was trying to convert titles into numerical fields etc. This is most likely because I’m new to U-SQL as a language and the documentation is very much a work in progress still. I’m assuming this limitation will be easy to work around but have, for now, removed column titles from my sample sheets. Over the next few weeks, I’ll be exploring other methods of getting data into your Data Lake Store, whether it’s programmatically through C# or Powershell, streamed through Event Hubs and Azure Streaming Analytics or even more traditionally, uploading to relational database tables through SSIS, all of which can then be accessed via a single interface. I will also be looking at U-SQL in particular, and how it can be leveraged in a new-world Warehousing Solution.

SQL PASS Summit – Day 2

This morning started with a talk from Rimma Nehme a key Microsoft architect. The talk was informative and clearly Azure and the cloud are something which Microsoft are very keen, it was well thought through and there was a fantastic analogy between pizza and different flavours of Azure which is not what I had expected. Adam Machanic did a presentation covering different methods which could be used in order to force the query optimiser into choosing more efficient query plans. This was not for the feint hearted and included some ingenious methods which he had developed which could be used to ‘fool’ the optimiser into selecting different query plans. The methods tried to push towards parallel queries and ensuring that the right workload was assigned to each of the threads so they finished at roughly the same time. The only drawback to this was that some of the examples produced code which was quite obfuscated and not something which I thought could be used except in the case that there was an exceptional issue which necessitated their use. Davide Mauri’s session on Unit Testing and Agile development was very interesting, he showed demonstrations of NUnit, BI.Quality and NBI. These are tools which allow users to create unit tests when following an agile approach, whilst we also use similar methods at Adatis to provide unit testing functionality the demonstrations were very effective and I will be doing some further research into these products to realise their full potential. Connor Cunningham also presented a fantastic talk titled ‘Managing 1M + db’s – How big data is used to run SQL Azure’. This was a very insightful talk which detailed the issues that had arisen when running the SQL Azure service and the methods used in order to identify, quickly resolve and identify the root cause of the problems. Central to this was both the collection of telemetry data from many sourced which could be used to perform diagnostics and the methods used in order to identify problems. Here we saw a very interesting usage of Azure Machine Learning, which had been setup to trigger Alerts on the occurrence of unusual behaviour. Connor is a fantastic speaker and clearly one of the most knowledgeable people on the SQL engine, I really enjoyed this talk. I’m off to the evening party now at the EMP museum, looking forward to having a few beers and meeting some new people :)

SQL PASS Summit 2014 – Kick Off

Day 1 has kicked off in Seattle, a remarkable city. Having arrived a week early, I’ve had plenty of time to check out the sights and the food and have enjoyed it immensely - a brilliant venue for PASS! There were a few announcements at this morning’s Keynote, mostly regarding Azure. Azure SQL Databases are gaining larger index handling, parallel queries, extended events and in-memory columnstore for data marts. Joseph Sirosh gave a talk about Machine Learning & Information management showing a cool example of Azure Stream Analytics using Microsoft Kinect sensor information of customer interactions in a shop being uploaded straight into Power Map! I am looking forward to hearing more on Machine Learning. There are also a handful of great improvements for Power BI. I am most looking forward to the new Live Ops Dashboards and drill-through actions! Combo Reports also look promising… Moving onto the first session, I chose to attend ‘What’s new in Microsoft Power Query for Excel’. As it turns out there’s not a massive amount of new stuff – some new data sources and a tick box when you do a merge to remove name prefixes. However one of these new sources is the long-awaited Analysis Services data source. The new ODBC Data Source is a great addition also. There was a mention regarding the possibility of a decoupled M-query SSIS component! We probably won’t hear of anything until later in 2015, unfortunately. I would say this was not a level 300 session, more like 100/200. The second session was ‘SQL Server 2014 Always On (High Availability and Disaster Recovery)’: a nice overview of what was brought in in 2012 and the newer 2014 features – these including an increased maximum number of secondary replicas, increased availability of readable secondary’s and the Add Azure Replica Wizard. Despite not being a DBA and it being a level 300 session, I found it easy to follow and absorbable. I feel many of the DBAs in the room may not have taken away any information they would have not already have known, however. Niko Neugebauer gave a fantastic, in depth session on ‘ETL Patterns with Clustered Columnstore Indexes’. It was a fast moving talk, despite the time spent waiting for some executions. It demanded your full attention! Definitely worthy of its 400 level. It left me a little tired for Marco Russo’s ‘DAX Patterns’ session which showed some examples of workarounds for common tabular weaknesses like distinct counts for Type 2 slowly changing dimensions and cumulative quantities. Overall it was a strong day. I am looking forward to tomorrow. More to follow…

SQL PASS Summit – Day 1

Having spent several days enjoying Seattle and getting to know and enjoy the city the conference is now underway. Yesterday there were some introductory meetings where we got to meet some of the other attendees and get a feel for the environment, today everything was in full swing. The morning started with the opening keynote presented to a huge audience. There were demonstrations of some really exciting new features – in one we observed Kinect being used with Power Map in order to track customer movements to observe customer interest in different parts of the store. We saw some great looking Power BI dashboarding functionality with the ability to drillthrough into detailed reports. As well as this we saw some further enhancements SQL Server Azure and on-premise integration including a new stretch functionality which will allow users to seamlessly ‘stretch’ their data into the cloud, keeping the most frequently queried records on premise and the other ones in the cloud. We also saw a Columnstore index being created on an in memory table! Miguel Lopes gave a talk on the new features in Power Query where we saw the ability to query using ODBC and support for Analysis services connections, on the whole though whilst I think the ODBC will be particularly useful for some situations, much of this talk was giving an overview of Power query as a whole rather than demonstrating new functionality. The integration of SSIS and power query in future was mentioned, however no dates have been set for this and we are told that this may (or may not) be available at some point in 2015. Jen Stirrup gave an interesting overview of some of the features available in R, the session was so popular that many people had to sit round the front on the floor! Niko Neugebauer’s contagious enthusiasm when presenting his session on ETL Patterns with Clustered Columnstore indexes was great to see and I picked up a few tricks here that were quite handy when working in 2014 environments. I also very much enjoyed John Welch’s session in Continuous Delivery for Data Warehouses and Marts, this is something I myself have been involved with a lot recently and it was very interesting to see his methods of achieving this and also to discover that in many cases we were both doing things in the same way :) Overall the day has been very interesting, we have seen some exciting new features announced today and some significant enhancements to the Power BI product range, it seemed to me for some time that the lack of dashboarding functionality in Power View was holding it back and I think many people will be very pleased with this new functionality and the further enhancements to the Azure service.

Azure Machine Learning Example - Insurance

Predicting data is nothing new in the Microsoft BI world, with the data mining capabilities in Analysis Services, but there is now another option – namely Azure ML, Microsoft’s cloud based predictive analytics solution. Like many people, I’ve been keen to see what Azure ML can do. Luckily, you can still sign up for a completely free trial for Azure, meaning you can be up and running with Azure ML in a matter of minutes. For this blog post I’ve selected an insurance use case scenario, as I’ve been working in the insurance industry quite a bit over the past couple of years. My reason for using Azure ML is that I would like to see if it’s possible to predict the retention rate, i.e out of all our policies that were due for renewal, what percentage of those policies actually renewed? Therefore I’m going to try and predict whether a car policy holder will renew based on a dataset comprising of previous policies that show whether or not the policy was renewed. Loading Data I didn’t want to use real data for this example, so I’ve got some made up data, which I’ve saved as a simple CSV. Thankfully this can be easily uploaded to Azure using ML Studio, which is the main development tool for Azure ML: Experiments Having uploaded my test dataset, I’m ready to build a predictive model. Azure ML allows me to do this by creating an experiment, which is to ML Studio what a project is to Visual Studio. Within ML Studio, the starting point is to pick a dataset. Data can be consumed from a variety of sources, such as SQL Azure, Azure Blob Storage and http to name a few. But in this case I’m going to use my csv file, which is available under Saved Datasets, along with a host of other ‘modules’ that perform various tasks within ML. The modules can be dragged and dropped onto the canvas, so the starting point is to drag the insurance dataset onto the canvas: Once I have a dataset on the canvas, I can right click on the saved dataset and choose Visualize, which will profile my data for me, showing me statistics such as the minimum, maximum and standard deviation for each column: There are a myriad of factors that could potentially affect the retention rate, such as renewal price, number of other policies held (e.g. perhaps they have both life and car insurance with the insurance company), how long the policy holder has been a customer, as well as make/model, insurance add-ons (legal cover, hire car cover) age and other demographics to name a few. I’ve not got all of these factors in my dataset, but I have a fair few, just to show an example. Once I’ve got my insurance dataset, the next task is to add a Split module, which will allow me to split the rows into two buckets, in a 75%/25% split. Firstly, a training set is needed, which will be fed into the actual model, allowing it to learn based on the past data. The other bucket will be the test set in this case, which will be used to test how the trained model performs. Here you can see the two dots at the bottom of the Split module are the two outputs: The next task is to drag a Train Model onto the canvas and then connect the training set to the second input of the Train Model. For the first input, the Train model is expecting an input of untrained learner, so I’ve given it the input of an untrained Two-Class Neural Network. The second input expected by the Train Model is a dataset, which is the training set that comes from the Split. Therefore we now have the following shape to the experiment: The red icon on the train model indicates that there is no column selected, so therefore I’ve picked the column “Renewed” (Yes/No) from the column selector to indicate that this is the outcome that we wish to predict. In addition, we need to add a Score Model, which will allow us to score the now trained model against the test dataset. Therefore the Score Model accepts two inputs, namely a Trained Model and the test dataset: If I now run the experiment then we can click on the Visualize option on the output of the Score Model. This will show a comparison of the predicted outcome against the outcome on the test dataset, on a row by row basis: Finally we can add in an Evaluate Model, which will measure the performance of the model using a specific set of metrics. Again we can choose the Visualize option, which will give us metrics such as True/False Positive/Negative and Cumulative AUC:   What I think’s great about Azure ML is that its very easy to compare the performance of the Neural Network to other models. Therefore I can add in a Boosted Decision Tree, another Train Model and also another Score Model: These can be quickly wired up to the relevant inputs. The final step is then to then set the Evaluate Model to accept a second input, which is the output of the second Score model. It all ends up looking like this: As you can see, we now have two inputs into the Evaluate Model, which will allow us to compare the performance of the two models that have been chosen. In this example, when choosing Visualize on the Evaluate model, I can see that the Two Class Boosted Decision Tree gave the best overall results: The final result is that we have a model that is capable of predicting whether a policy holder is likely to renew or not. Of course this is just a quick example but it hopefully gives you an idea of what Azure ML is capable of.

Analytics Platform System Polybase Example

To me one of the most exciting parts of the Microsoft Analytics Platform System (APS) is Polybase, which in a nutshell allows you to access data residing in Hadoop or Windows Azure Blob Storage, all via T-SQL in PDW. What this means is that data can be transparently queried by a user or developer, in real time, regardless of whether the data lives in PDW or Hadoop/Azure. James Roland-Jones gives a thorough overview of all things Polybase here. What I’m going to do in this post is to show an example of how existing data within Hadoop can be combined with data that resides in an APS PDW region. Polybase Example - Setup There are two key tasks to complete before we’re able to start querying data. We need to setup a data source, so that Polybase knows where to get the data from, plus we need to describe the file format of the external file that we’re going to read. The data source here specifies that we’re going to use the Hadoop nodes that are co-located with the PDW nodes, but this could be a non-appliance Hadoop cluster: CREATE EXTERNAL DATA SOURCE HadoopRegion_DataSource WITH ( TYPE = HADOOP, LOCATION = 'hdfs://hdfs://H12345-C-HHN01', --Optional specify a tracker location to enable predicate --push down to Hadoop JOB_TRACKER_LOCATION = 'H12345-C-HHN01:50300' ); The next task is to setup a file format. In this case we’re defining that the file is pipe delimited, although we can use Polybase with other formats (e.g. RCFile): CREATE EXTERNAL FILE FORMAT HadoopRegion_DelimitedPipe_NoCompression WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS ( Field_terminator = '|') ); Accessing External Data Before we access data in Hadoop, I want to show the file for this example. In this case I’ve got a customer dimension within PDW, but I have customer survey data within Hadoop that i wish to combine with my data in the warehouse. The survey data has been uploaded in this case to the Hadoop region within APS, via the portal: The following shows a sample of the Hadoop file, note that its pipe delimited, with columns for Date, Survey Id, Product, Customer Id and Survey Score: Now I want to access the Hadoop data and combine it with other data in the warehouse. To do this we need to use the CREATE EXTERNAL TABLE command. CREATE EXTERNAL TABLE [dbo].[HDFS_CustomerSurvey] ( DateId int NOT NULL, SurveyResponseId int NOT NULL, ProductCategoryName varchar(100) NULL, CustomerId varchar(50) NULL, SurveyResult INT NULL ) WITH ( LOCATION = '/user/data/survey/', DATA_SOURCE = HadoopRegion_DataSource, FILE_FORMAT = HadoopRegion_DelimitedPipe_NoCompression ) ; This results in an external table being available within the PDW region, as shown below: Now if we want to query the Hadoop data it’s just a simple case of T-SQL. Here I’m joining Hadoop survey data to the customer dimension: SELECT CS.CustomerId, C.FirstName, C.LastName, C.EmailAddress, C.Gender, CS.DateId AS SurveyDate, CS.ProductCategoryName AS SurveyProduct, CS.SurveyResult FROM dbo.HDFS_CustomerSurvey CS INNER JOIN dbo.DimCustomer C ON C.CustomerAlternateKey = CS.CustomerId Which gives the following results: So that’s it, just a simple T-SQL query in the end. Polybase has taken away the complexity and allowed us to integrate different data sources using a widely used standard query language.