Adatis BI Blogs

How to do row counts in Azure SQL Data Warehouse

Continuing on from my last couple of blog post about working with the Azure Data Warehouse, here is another issue which has came up during development and is handy to know if you are going to be developing a solution! Keeping track of how much data has been loaded plays a key part in a BI Solution. It is important to know for a given load for example, how many rows were inserted, updated or deleted. Traditionally, we were able to use the @@ROWCOUNT function @@ROWCOUNT returns the number of rows affected by the last statement. Unfortunately, in Azure SQL Data Warehouse @@ROWCOUNT is not supported. How does it work? In the Microsoft Azure documentation,they do provide a workaround for this, please see here  for more information and a list of other unsupported functions. They suggest creating a stored procedure which will query the system tables sys.dm_pdw_exec_requests and sys.dm_pdw_request_steps in order to get the row count for the last SQL statement for the current session. sys.dm_pdw_exec_requests holds information about all requests currently or recently active in SQL Data Warehouse. It lists one row per request/query. holds information about all SQL Server query distributions as part of a SQL step in the query. sys.dm_pdw_request_steps holds information about all steps that are part of a given request or query in SQL Data Warehouse. It lists one row per query step. This is an example of what the stored procedure would look like:   As you can see above, we pass through a ‘LabelContext’ parameter. A Label is a concept in Azure SQL Data Warehouse that allows us to provide our queries with a text label that is easy to understand and we can find that label in the DMVs. For example: Here, we have given our query the label ‘Test Label’ and if we wanted to find information about this query in the DMVs we can search using this label like:   So, putting this into context, in the ETL we are calling stored procedures to load our data (for example between clean and warehouse). Therefore, within the stored procedure we have the query written to insert or update the data and we would give this query a label. Then, within the same stored procedure, we would call the Row Count stored procedure, passing through the Label as parameter so we can retrieve the row count.   Be careful though! On my current project we have come across times where we haven’t been able to get the row count back. This is because the sys.dm_pdw_exec_requests DMV we are querying is transient and only stores  the last 10,000 queries executed. So when we were running the query above, our requests were no longer there and we were getting nothing back! The table holds data on all queries that go against the distribution nodes and statistics gathering for each of the nodes. So in order to try and limit the records in this table, keep the nesting level of queries as low as possible to avoid the table blowing up and not having the data you need in it!   Stay tuned for another blog about working with Azure Data Warehouse!

Statistics in Azure SQL Data Warehouse

Following on from my previous post about Update Queries in Azure SQL Data Warehouse, I thought I would put together a mini-series of blogs related to my ‘struggles’ working with the Azure SQL DW. Don’t get me wrong, its great, just has some teething issues of which there are work-arounds! This blog post is going to look at what Statistics in the database world are, the differences between them on-prem (SQL Server) and in the cloud (Azure SQL Data Warehouse) and also how to use them in Azure Data Warehouse. What are statistics? Statistics are great, they provide information about your data which in turn helps queries execute faster, The more information that is available about your data, the quicker your queries will run as it will create the most optimal plan for the query.  Think of the statistics as you would the mathematical ones- they give us information regarding the distribution of values in a table, column(s) or indexes. The statistics are stored in a histogram which shows the distribution of values, range of values and selectivity of values. Statistics objects on multiple columns store information regarding correlation of values among the columns. They are most important with queries that have JOINS and GROUP BY, HAVING, and WHERE clauses. In SQL Server, you can get information about the statistics by querying the catalog views sys.stats and sys.stats_columns. By default, SQL Server automatically creates statistics for each index, and single columns. See here for more information. How does it work in Azure SQL Data Warehouse? In Azure SQL Data Warehouse, statistics have to be created manually. On previous SQL Server projects, creating and maintaining statistics wasn’t something that we had to incorporate into our design (and really think about!) however with SQL DW we need to make sure we think about how to include it in our process in order to make sure we take advantage of the benefits of working with Azure DW. The major selling point of Azure SQL Data Warehouse is that it is capable of processing huge volumes of data, one of the specific performance optimisations that has been made is the distributed query optimiser. Using the information obtained from the statistics (information on data size and distribution), the service is able to optimize queries by assessing the cost of specific distributed query operations. Therefore, since the query optimiser is cost-based, SQL DW will always choose the plan with the lowest cost. Statistics are important for minimising data movement within the warehouse i.e. moving data from distributions to satisfy a query. If we don’t have statistics, azure data warehouse could end up performing data movement on the larger (perhaps fact) table instead of the smaller (dimension) table as it wouldn’t know any information about the size of them and would just have to guess! How do we implement statistics in Azure Data Warehouse? Microsoft have actually provided the code of how to generate the statistics so its just a case of deciding when in your process you want to create them or maintain. In my current project, we have created a stored procedure which will create statistics and another that will update them if they already exists. Once data has ben loaded into a table, we call the stored procedure and then the statistics will be created or updated (depending on what is needed). See the documentation for more information and the code. Tip: On my current project, we were getting errors when running normal stored procedures to load the data. Error message: ‘Number of Requests per session had been reached’. Upon investigation in the system tables,’Show Statistics’ was treated as a request which was also evaluated for each node causing the number of requests to blow up. By increasing the data warehouse units (DWUs) and also the resource group allocation this problem went away. So, take advantage of the extra power available to you!   There is a big list on the Microsoft Azure website of features not supported in Azure SQL Data Warehouse, take a look ‘here’. I will cover further issues in my next few blogs

SQL PASS Summit–Day 2

Day 2, Thursday, started off with a keynote from David DeWitt on cloud data warehousing, scalable storage and scalable compute. This set my theme for the majority of the day – which turned out to be big data tech.   My first session was with James Rowland-Jones and Kevin Ngo on sizing Azure SQL Data Warehouse for proposals – essentially answering “how much is this going to cost me?”. There are various factors to consider, which I will blog on separately. I’ve already briefly fed back to members of the team and they’re excited to know what I learnt in more detail.   My second session was about best practices for Big BI which, unfortunately, ended up being a sales pitch and I came away having felt that I’ve didn’t learn anything. There’s a lot of promise for BI in the big data space, so watch this space as we explore Azure SQL Data Warehouse, Azure Data Lake (Store and Analytics), and other big data technology for BI.   The third session was with Michael Rys on Tuning and Optimising U-SQL Queries for Maximum Performance. It was a full on session, learnt loads and took loads of notes. I need time to digest this information as Michael covered off a very complex topic, very quickly. I will, however, be blogging on it in due course.   After an intense third session, I chose a less intense session for the last session of the day: a Q&A with the SQL Engineering team. This was a great opportunity to learn from other users how they’re using SQL. Most users who asked questions were wanting to know about indexing, backups and High Availability.   Tonight – packing, and networking before the last day of PASS tomorrow!

SQL PASS Summit–Day 1

Day 1, Wednesday, technically started on Tuesday with a newbies speed networking event in which we had to rotate through a crowd of 10 other people - introducing ourselves and asking questions about our professional lives. This was awkward to begin with but, as the evening wore on, introducing ourselves to strangers became a lot easier and more normal. We then moved on to the Welcome Reception and then a #SQLKaraoke event. Great opportunities to meet new people from different areas of the world and parts of the community. Wednesday morning proper, began with a keynote from Joseph Sirosh. This keynote from Joseph essentially set the tone and theme for a large part of the conference sessions - Azure, Big Data and the Cortana Intelligence Suite. The first session I attended was on Design Patterns for Azure SQL Database (for which a separate blog will be forthcoming). The next session I attended was about incorporating Azure Data Lake Analytics into a BI environment (again, another blog is in the pipeline). My final session of the day was Going Under the Hood with Azure Data Lake. This was the most insightful session of the day, which has subsequently sparked my brain into Data Lake mode (expect many blogs on this), and went through how Azure Data Lake works as well as how the U-SQL language works and resources are allocated. Tonight - more networking. So far, the community has been so welcoming and I’m very much looking forward to tomorrow where I’ll be learning about Big Data solutions and best practices. I’m also looking forward to sharing all my experiences and learning's with my colleagues and wider SQL Community.

Stretch Databases: The Basics

What Is It? A stretch database is a database where the rarely used portions of the data are stored in the cloud. Offered by SQL Server 2016 alongside Azure, it will allow seamless migration of data between the cloud and your local server databases. Consider data as being warm or cold; warm data is accessed often, cold is not. So for all intents and purposes, cold data is wasting space on the local server most of the time it’s there. The feature was created with historical data and other similar types of data in mind. Data that would not be accessed regularly, but still needs to be available on occasion. A stretch enabled database can have tables that query as if they are local, yet have some of their rows stored in Azure blob storage. The only noticeable difference by the user is that some queries take longer to process. There is a fair amount of control over data migration. Functions can be created to stretch specific rows, based custom filter functions. Migration can be monitored and troubleshooted, or simple paused and resumed at will. Once the data is in the cloud it is backed up automatically, with Azure holding storage snapshots from the last 7 days for easy recovery by the user. Any backups of the stretch enabled database actually only takes the local data – migrated data can only be restored using the Azure storage snapshots.   Requirements And Prerequisites The main requirement is having an Azure account with a billable subscription already set up. Blob storage is set up during the stretch database setup, so nothing further is needed on the Azure side. The Stretch Database Advisor, included in the separately downloaded Data Migration Assistant, can indicate which tables are eligible for stretching to Azure (I’m planning on going through this in a future blog post). They do not mention what incompatibilities the other tables have, however. Any table that is not shown as compatible must be manually examined, to see what incompatibilities there are. I have included a full list of incompatible features, column types, and data types below.   Limitations And Incompatibilities As of writing this, there are still a large number of index, constraint, and data type limitations on tables that use stretch databases according to the MSDN website. Data management operations are also limited – you cannot update or delete rows marked for migration, or that have already been migrated. These rows must be migrated back to the local database in order to be altered. Table Properties · Memory optimised tables cannot become stretch enabled · Replicated tables cannot become stretch enabled. · Tables using change tracking and change data capture cannot become stretch enabled. Column And Data Types · A table cannot be stretch enabled if it has more than 1,023 columns. · FileTables and Filestream data are incompatible. · Column set and computed columns are incompatible. · Text, ntext, image, timestamp, sql_variant, XML and CLR data types are all incompatible. Constraints · Uniqueness is not enforced on migrated Unique and Primary Key constraints. · Default and Check constraints are incompatible. · Foreign Key constraints cannot reference a stretch enabled table Indexes · Indexed views are incompatible. · Fulltext, XML, and Spatial indexes are incompatible. · A table cannot be stretch enabled if it has more than 998 indexes. · Filtered indexes do not include migrated data on stretch enabled tabled.   Further Reading The MSDN website has the full description of stretch databases SQL Data Partners have also done an interview with Anthony van Gemert, one of the product managers for the stretch feature

Update queries in Azure SQL Data Warehouse

  I’ve recently started working on a project where we working in the cloud with Azure SQL Data Warehouse: “Azure SQL Data Warehouse is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational”  For more information about Azure SQL Data Warehouse, see here. Although we develop with the same T-SQL as we do using the on-prem version of SQL Server, I did come across a bit of a quirk when writing some update statements. If we are using the on-prem version of SQL Server, when we need to update data, we would have a SQL query like:   That is a basic update to a specific row of data in the Sales.MyOrderDetails table, using a where clause to filter for the row. Sometimes, it isn’t always as straight forward and we need to join to other tables, so that we can refer to attributes from those rows for filtering purposes. For example:     However, if we take this approach in SQL Data Warehouse, we get the following error.   SQL Data Warehouse doesn't support ANSI joins in the FROM clause of an UPDATE statement (it is also the case for DELETE statements too). There is a way round it and it uses an implicit join. Before we look at how the update query can be written, it is a good place to point out that Inner joins can be written in a couple of different ways to what we had above. In an inner join, the ON and WHERE clause both perform the same filtering and they both return rows where the ON and WHERE predicate is true. Therefore, we could write an inner join as or implicitly like, However, it is normally best to stick with the original example rather than the implicit version as although it is still supported, it is an old deprecated syntax and not considered best practise. So, in order to write an update query in SQL Data Warehouse that uses inner joins to filter the rows, the workaround is as follows:   In conclusion, most SQL statements written in Azure SQL Data Warehouse are written in the same way we would with he on-prem  version of SQL Server, however, there are some cases where the syntax differs slightly and I will be blogging more about these special cases as I come across them!

The Cloud BI Transition: Warehouse of the Future Talk

While we wait for the SQLBits XV videos to be published, I actually gave the same talk back in March for the London PASS Chapter MeetUp. If you were unfortunate enough to miss both talks, or happened to steal a quick nap mid-way through, the recording can be found here. I wrote up a quick accompanying blog earlier this week, which includes a link to the slides and some additional thoughts. As usual, if anyone has any questions following the talk, I’d love to hear from you in the comments or directly on twitter via @MrSiWhiteley.

Getting Started with Azure Data Lake Store

You may have heard about Microsoft’s new Azure Products – Azure Data Lake Storage and Data Lake Analytics. If not, take a look at the official blurb here! What is it? Essentially it’s hadoop-made-easy – the storage engine is a WebHDFS layer with a management overlay that takes care of scaling and cluster distribution so you don’t have to. For many hadoop veterans this will feel as uncomfortable as a grizzled DBA allowing Azure SQL DB to determine the best backup schedule on their behalf. But for everyone else, this removes a significant technical barrier to using truly scalable, distributed storage. What’s more, your store can be accessed by any hadoop-compatible technologies, so you can integrate with your existing architecture as a hassle-free migration path. Data Lake Analytics provides the parallel querying engine, comes courtesy of a new language U-SQL, based on Apache YARN (which is essentially a version 2.0 of their populate Apache MapReduce). Once again, the big difference being that all of the cluster management is being done for you – and on a pay-as-you-use model. For a more detailed overview, take Microsoft’s own overview page. Their architectural vision has the Data Lake as the central hub for all data-based operations, they headline with a diagram making this very clear: The goal is for the Data Lake Storage to be a dumping ground for any and all of your data, be it structured, semi-structured or entirely unstructured. The competitive price of Azure storage means they’re now pushing a “store first, think of questions to ask later” methodology – initial indications are that 1Gb of data will cost just £0.05 per month. This makes it an attractive offer for most archiving scenarios where the data still needs to be accessible in some form. I’ve a few thoughts around where this is heading and what it means to the Data Warehouse, but first I’ll run through how to get your Data Lake set up – you can figure out what you’re going to do with it later! Creating your Azure Data Lake Store First you’ll need to set up your storage account – you can do this through any valid Azure subscription. At the time of writing, the Azure Data Lake elements are in public preview, so you’ll need to sign up and be accepted first. To create your storage account, simply go to New > Data + Storage > Data Lake Store: You’ll be presented with the usual configuration screen: Again, at the time of writing this service is only available in the East US 2 location. A few minutes later, your Data Lake Store will be ready to go! You can now begin loading files into the store, ready for use. You have several options on how to get your data into the store, but for now we’ll go with the most basic. Uploading your first file using Data Explorer Open up the resource blade for your DLS and you’ll see a standard overview panel for your new object. I’ll let you explore the settings and monitoring options yourself. What we’re interested in is the Data Explorer. Click that button (or there’s another entry point via settings) and you’ll be presented with the new Data Explorer blade. Here you can easily see all of the objects that have been uploaded to the store so far. Let’s add a file by clicking the upload button. You’ll see a basic upload screen: And if we select a csv file (or one of many supported file types that takes your fancy), we can click start upload to send the file across. You’ll receive a portal notification when the upload has completed, and the data explorer view will refresh to show us our new contents: Clicking on the file will then open it up in the file preview blade, so you can explore the various objects contained in the store: And that’s it – your file is now available through a variety of methods. You can treat your Azure Data Lake Store like any other WebHDFS layer, or you can spin up an Azure Data Lake Storage account to write U-SQL against your stored file. A minor word of warning here – in my initial attempts of using the store, I found it to have difficulty recognising column titles within flat files, I’d receive parsing errors where it was trying to convert titles into numerical fields etc. This is most likely because I’m new to U-SQL as a language and the documentation is very much a work in progress still. I’m assuming this limitation will be easy to work around but have, for now, removed column titles from my sample sheets. Over the next few weeks, I’ll be exploring other methods of getting data into your Data Lake Store, whether it’s programmatically through C# or Powershell, streamed through Event Hubs and Azure Streaming Analytics or even more traditionally, uploading to relational database tables through SSIS, all of which can then be accessed via a single interface. I will also be looking at U-SQL in particular, and how it can be leveraged in a new-world Warehousing Solution.

SQL PASS Summit – Day 2

This morning started with a talk from Rimma Nehme a key Microsoft architect. The talk was informative and clearly Azure and the cloud are something which Microsoft are very keen, it was well thought through and there was a fantastic analogy between pizza and different flavours of Azure which is not what I had expected. Adam Machanic did a presentation covering different methods which could be used in order to force the query optimiser into choosing more efficient query plans. This was not for the feint hearted and included some ingenious methods which he had developed which could be used to ‘fool’ the optimiser into selecting different query plans. The methods tried to push towards parallel queries and ensuring that the right workload was assigned to each of the threads so they finished at roughly the same time. The only drawback to this was that some of the examples produced code which was quite obfuscated and not something which I thought could be used except in the case that there was an exceptional issue which necessitated their use. Davide Mauri’s session on Unit Testing and Agile development was very interesting, he showed demonstrations of NUnit, BI.Quality and NBI. These are tools which allow users to create unit tests when following an agile approach, whilst we also use similar methods at Adatis to provide unit testing functionality the demonstrations were very effective and I will be doing some further research into these products to realise their full potential. Connor Cunningham also presented a fantastic talk titled ‘Managing 1M + db’s – How big data is used to run SQL Azure’. This was a very insightful talk which detailed the issues that had arisen when running the SQL Azure service and the methods used in order to identify, quickly resolve and identify the root cause of the problems. Central to this was both the collection of telemetry data from many sourced which could be used to perform diagnostics and the methods used in order to identify problems. Here we saw a very interesting usage of Azure Machine Learning, which had been setup to trigger Alerts on the occurrence of unusual behaviour. Connor is a fantastic speaker and clearly one of the most knowledgeable people on the SQL engine, I really enjoyed this talk. I’m off to the evening party now at the EMP museum, looking forward to having a few beers and meeting some new people :)

SQL PASS Summit 2014 – Kick Off

Day 1 has kicked off in Seattle, a remarkable city. Having arrived a week early, I’ve had plenty of time to check out the sights and the food and have enjoyed it immensely - a brilliant venue for PASS! There were a few announcements at this morning’s Keynote, mostly regarding Azure. Azure SQL Databases are gaining larger index handling, parallel queries, extended events and in-memory columnstore for data marts. Joseph Sirosh gave a talk about Machine Learning & Information management showing a cool example of Azure Stream Analytics using Microsoft Kinect sensor information of customer interactions in a shop being uploaded straight into Power Map! I am looking forward to hearing more on Machine Learning. There are also a handful of great improvements for Power BI. I am most looking forward to the new Live Ops Dashboards and drill-through actions! Combo Reports also look promising… Moving onto the first session, I chose to attend ‘What’s new in Microsoft Power Query for Excel’. As it turns out there’s not a massive amount of new stuff – some new data sources and a tick box when you do a merge to remove name prefixes. However one of these new sources is the long-awaited Analysis Services data source. The new ODBC Data Source is a great addition also. There was a mention regarding the possibility of a decoupled M-query SSIS component! We probably won’t hear of anything until later in 2015, unfortunately. I would say this was not a level 300 session, more like 100/200. The second session was ‘SQL Server 2014 Always On (High Availability and Disaster Recovery)’: a nice overview of what was brought in in 2012 and the newer 2014 features – these including an increased maximum number of secondary replicas, increased availability of readable secondary’s and the Add Azure Replica Wizard. Despite not being a DBA and it being a level 300 session, I found it easy to follow and absorbable. I feel many of the DBAs in the room may not have taken away any information they would have not already have known, however. Niko Neugebauer gave a fantastic, in depth session on ‘ETL Patterns with Clustered Columnstore Indexes’. It was a fast moving talk, despite the time spent waiting for some executions. It demanded your full attention! Definitely worthy of its 400 level. It left me a little tired for Marco Russo’s ‘DAX Patterns’ session which showed some examples of workarounds for common tabular weaknesses like distinct counts for Type 2 slowly changing dimensions and cumulative quantities. Overall it was a strong day. I am looking forward to tomorrow. More to follow…

SQL PASS Summit – Day 1

Having spent several days enjoying Seattle and getting to know and enjoy the city the conference is now underway. Yesterday there were some introductory meetings where we got to meet some of the other attendees and get a feel for the environment, today everything was in full swing. The morning started with the opening keynote presented to a huge audience. There were demonstrations of some really exciting new features – in one we observed Kinect being used with Power Map in order to track customer movements to observe customer interest in different parts of the store. We saw some great looking Power BI dashboarding functionality with the ability to drillthrough into detailed reports. As well as this we saw some further enhancements SQL Server Azure and on-premise integration including a new stretch functionality which will allow users to seamlessly ‘stretch’ their data into the cloud, keeping the most frequently queried records on premise and the other ones in the cloud. We also saw a Columnstore index being created on an in memory table! Miguel Lopes gave a talk on the new features in Power Query where we saw the ability to query using ODBC and support for Analysis services connections, on the whole though whilst I think the ODBC will be particularly useful for some situations, much of this talk was giving an overview of Power query as a whole rather than demonstrating new functionality. The integration of SSIS and power query in future was mentioned, however no dates have been set for this and we are told that this may (or may not) be available at some point in 2015. Jen Stirrup gave an interesting overview of some of the features available in R, the session was so popular that many people had to sit round the front on the floor! Niko Neugebauer’s contagious enthusiasm when presenting his session on ETL Patterns with Clustered Columnstore indexes was great to see and I picked up a few tricks here that were quite handy when working in 2014 environments. I also very much enjoyed John Welch’s session in Continuous Delivery for Data Warehouses and Marts, this is something I myself have been involved with a lot recently and it was very interesting to see his methods of achieving this and also to discover that in many cases we were both doing things in the same way :) Overall the day has been very interesting, we have seen some exciting new features announced today and some significant enhancements to the Power BI product range, it seemed to me for some time that the lack of dashboarding functionality in Power View was holding it back and I think many people will be very pleased with this new functionality and the further enhancements to the Azure service.

Azure Machine Learning Example - Insurance

Predicting data is nothing new in the Microsoft BI world, with the data mining capabilities in Analysis Services, but there is now another option – namely Azure ML, Microsoft’s cloud based predictive analytics solution. Like many people, I’ve been keen to see what Azure ML can do. Luckily, you can still sign up for a completely free trial for Azure, meaning you can be up and running with Azure ML in a matter of minutes. For this blog post I’ve selected an insurance use case scenario, as I’ve been working in the insurance industry quite a bit over the past couple of years. My reason for using Azure ML is that I would like to see if it’s possible to predict the retention rate, i.e out of all our policies that were due for renewal, what percentage of those policies actually renewed? Therefore I’m going to try and predict whether a car policy holder will renew based on a dataset comprising of previous policies that show whether or not the policy was renewed. Loading Data I didn’t want to use real data for this example, so I’ve got some made up data, which I’ve saved as a simple CSV. Thankfully this can be easily uploaded to Azure using ML Studio, which is the main development tool for Azure ML: Experiments Having uploaded my test dataset, I’m ready to build a predictive model. Azure ML allows me to do this by creating an experiment, which is to ML Studio what a project is to Visual Studio. Within ML Studio, the starting point is to pick a dataset. Data can be consumed from a variety of sources, such as SQL Azure, Azure Blob Storage and http to name a few. But in this case I’m going to use my csv file, which is available under Saved Datasets, along with a host of other ‘modules’ that perform various tasks within ML. The modules can be dragged and dropped onto the canvas, so the starting point is to drag the insurance dataset onto the canvas: Once I have a dataset on the canvas, I can right click on the saved dataset and choose Visualize, which will profile my data for me, showing me statistics such as the minimum, maximum and standard deviation for each column: There are a myriad of factors that could potentially affect the retention rate, such as renewal price, number of other policies held (e.g. perhaps they have both life and car insurance with the insurance company), how long the policy holder has been a customer, as well as make/model, insurance add-ons (legal cover, hire car cover) age and other demographics to name a few. I’ve not got all of these factors in my dataset, but I have a fair few, just to show an example. Once I’ve got my insurance dataset, the next task is to add a Split module, which will allow me to split the rows into two buckets, in a 75%/25% split. Firstly, a training set is needed, which will be fed into the actual model, allowing it to learn based on the past data. The other bucket will be the test set in this case, which will be used to test how the trained model performs. Here you can see the two dots at the bottom of the Split module are the two outputs: The next task is to drag a Train Model onto the canvas and then connect the training set to the second input of the Train Model. For the first input, the Train model is expecting an input of untrained learner, so I’ve given it the input of an untrained Two-Class Neural Network. The second input expected by the Train Model is a dataset, which is the training set that comes from the Split. Therefore we now have the following shape to the experiment: The red icon on the train model indicates that there is no column selected, so therefore I’ve picked the column “Renewed” (Yes/No) from the column selector to indicate that this is the outcome that we wish to predict. In addition, we need to add a Score Model, which will allow us to score the now trained model against the test dataset. Therefore the Score Model accepts two inputs, namely a Trained Model and the test dataset: If I now run the experiment then we can click on the Visualize option on the output of the Score Model. This will show a comparison of the predicted outcome against the outcome on the test dataset, on a row by row basis: Finally we can add in an Evaluate Model, which will measure the performance of the model using a specific set of metrics. Again we can choose the Visualize option, which will give us metrics such as True/False Positive/Negative and Cumulative AUC:   What I think’s great about Azure ML is that its very easy to compare the performance of the Neural Network to other models. Therefore I can add in a Boosted Decision Tree, another Train Model and also another Score Model: These can be quickly wired up to the relevant inputs. The final step is then to then set the Evaluate Model to accept a second input, which is the output of the second Score model. It all ends up looking like this: As you can see, we now have two inputs into the Evaluate Model, which will allow us to compare the performance of the two models that have been chosen. In this example, when choosing Visualize on the Evaluate model, I can see that the Two Class Boosted Decision Tree gave the best overall results: The final result is that we have a model that is capable of predicting whether a policy holder is likely to renew or not. Of course this is just a quick example but it hopefully gives you an idea of what Azure ML is capable of.

Analytics Platform System Polybase Example

To me one of the most exciting parts of the Microsoft Analytics Platform System (APS) is Polybase, which in a nutshell allows you to access data residing in Hadoop or Windows Azure Blob Storage, all via T-SQL in PDW. What this means is that data can be transparently queried by a user or developer, in real time, regardless of whether the data lives in PDW or Hadoop/Azure. James Roland-Jones gives a thorough overview of all things Polybase here. What I’m going to do in this post is to show an example of how existing data within Hadoop can be combined with data that resides in an APS PDW region. Polybase Example - Setup There are two key tasks to complete before we’re able to start querying data. We need to setup a data source, so that Polybase knows where to get the data from, plus we need to describe the file format of the external file that we’re going to read. The data source here specifies that we’re going to use the Hadoop nodes that are co-located with the PDW nodes, but this could be a non-appliance Hadoop cluster: CREATE EXTERNAL DATA SOURCE HadoopRegion_DataSource WITH ( TYPE = HADOOP, LOCATION = 'hdfs://hdfs://H12345-C-HHN01', --Optional specify a tracker location to enable predicate --push down to Hadoop JOB_TRACKER_LOCATION = 'H12345-C-HHN01:50300' ); The next task is to setup a file format. In this case we’re defining that the file is pipe delimited, although we can use Polybase with other formats (e.g. RCFile): CREATE EXTERNAL FILE FORMAT HadoopRegion_DelimitedPipe_NoCompression WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS ( Field_terminator = '|') ); Accessing External Data Before we access data in Hadoop, I want to show the file for this example. In this case I’ve got a customer dimension within PDW, but I have customer survey data within Hadoop that i wish to combine with my data in the warehouse. The survey data has been uploaded in this case to the Hadoop region within APS, via the portal: The following shows a sample of the Hadoop file, note that its pipe delimited, with columns for Date, Survey Id, Product, Customer Id and Survey Score: Now I want to access the Hadoop data and combine it with other data in the warehouse. To do this we need to use the CREATE EXTERNAL TABLE command. CREATE EXTERNAL TABLE [dbo].[HDFS_CustomerSurvey] ( DateId int NOT NULL, SurveyResponseId int NOT NULL, ProductCategoryName varchar(100) NULL, CustomerId varchar(50) NULL, SurveyResult INT NULL ) WITH ( LOCATION = '/user/data/survey/', DATA_SOURCE = HadoopRegion_DataSource, FILE_FORMAT = HadoopRegion_DelimitedPipe_NoCompression ) ; This results in an external table being available within the PDW region, as shown below: Now if we want to query the Hadoop data it’s just a simple case of T-SQL. Here I’m joining Hadoop survey data to the customer dimension: SELECT CS.CustomerId, C.FirstName, C.LastName, C.EmailAddress, C.Gender, CS.DateId AS SurveyDate, CS.ProductCategoryName AS SurveyProduct, CS.SurveyResult FROM dbo.HDFS_CustomerSurvey CS INNER JOIN dbo.DimCustomer C ON C.CustomerAlternateKey = CS.CustomerId Which gives the following results: So that’s it, just a simple T-SQL query in the end. Polybase has taken away the complexity and allowed us to integrate different data sources using a widely used standard query language.