Simon Whiteley's Blog

A Guide to Azure SQL DataWarehouse

So you've heard the hype - the Azure SQL DW is going to solve all of your problems in one fell swoop… Right? Well… maybe. The system itself is a mix of technologies designed for low-concurrency analytics across huge amounts of relational data. In short, it's a cloud-scalable, T-SQL-based MPP platform, with all the benefits and restrictions that performing everything in parallel brings. If your problem can be solved by performing lots of calculations over small portions of your data before aggregating the results into a whole, this is the technology for you.

However, before you jump right in, be aware that SQLDW is a very different beast to other SQL tools. There are specific concepts you need to be familiar with before building your system, otherwise you're not going to see the promised performance gains and will likely lose faith very quickly!

I'm in the process of documenting these concepts, plus there is a wealth of information available on the Azure SQLDW site. For the next few months, I'll be running through the blog topics below and updating the links accordingly. If there are topics you'd like me to add to the list, please get in touch!

Azure SQLDW Core Concepts:

- What is it?

- How Does Scaling Work?

- Distributions

- Polybase

- Polybase Limitations

- Polybase Design Patterns

- CTAS

- Resource Classes

- Partitioning

Designing ETL (or ELT) In Azure SQLDW

- Row counts

- Statistics

- Surrogate Keys

Performance Tuning Azure SQLDW

- Plan Exploration in SQLDW

- Data Movement Types

- Minimising Data Movement

Managing Azure SQLDW

- Backup & Restore

- Monitoring Distributions

- System Monitoring

- Job Orchestration

- Scaling and Management

- Performance Tuning Queries

Azure SQLDW Architecture

- Presentation Layers

- Data Lake Integrations

The Cloud BI Transition: Business Intelligence Key Players and their changing roles

I’ve previously talked about how Azure is changing traditional BI approaches, and the various architectural frameworks we’re going to see coming out of it. But what about the existing job roles within BI Teams and related structures? How are they going to change with the new ways of working that are emerging?

It’s a question we need to start considering now as cloud architectures become more commonplace and businesses are more willing to trust their data to cloud providers. From my recent work in piecing together new architectures and frameworks, I’ve been fortunate enough to have these conversations with people who are currently evaluating their position and their future training needs. From these conversations, and my own research, I’ve put some thoughts together around the “traditional” BI roles and how they’re changing:

BI Architect – In many ways, this is the role that changes the most. We now have a wealth of options for our data, each with their own specialist uses. The BI Architect needs to be familiar with both new functionality and old, able to identify the most relevant technology choice for each. Whereas we would previously be able to host the majority of our different systems within a single, cover-all SQL Server architecture, we will now find certain structures to be more aligned with the performance profiles of specific Azure components and less so with others. Business Users expecting lightning-fast dashboard performance would not benefit from a Data Lake system, whereas a Data Scientist would be unable to work if confined to a pre-defined data model. A small, lightweight data mart of a few gigabytes would likely perform worse in a full Azure DW system, given the data would be split across so many distributions that the overhead of aggregating the results would outweigh the parallelism gains – in this case we would introduce an Azure SQL DB or Azure SQL VM to cater for these smaller marts.

Infrastructure Specialist – Gone are the days of the infrastructure specialist needing to know the install parameters of SQL Server and the best disk configurations to use in different scenarios. We’re now focusing on security models, network architecture, and automation and scaling management. The PaaS and IaaS systems differ greatly in their approach to security, with IaaS requiring traditional networking, setting up VLANs/security layers, and PaaS components each having their own firewall layers with individual exception management. The infrastructure specialist should also be advising on/designing the Azure subscription setup itself, connections to other subscriptions, and perhaps managing ExpressRoute and gateway connections back to the on-premises systems. There is also the consideration of whether to extend Active Directory into the Azure domain, making the Azure estate more an extension of the company’s internal network.

Data Modeller – The end role here doesn’t change dramatically; many previous design principles still hold in the new approaches we’ve outlined. However, there are some additional performance considerations that they will need to build into their designs. Azure DataWarehouse, for example, fundamentally relies on minimising the data movement that occurs when querying data. A snow-flaked model, or a model with several very high cardinality dimensions, might find that performance degrades significantly, even though it may have been the most performant design in a traditional multidimensional cube.

BI Developer – This role still calls for many of the traditional skills: very strong SQL, an understanding of data movement & transformation technologies and excellent data visualisation. However, the traditional “stack” skills of SSIS/SSAS/SSRS are extended and augmented with the additional tools at our disposal – components such as Stream Analytics, Data Factory, Event Hubs and IoT sensor arrays could all easily fall into the domain of BI yet require radically different skillsets. Exposure to C# and the .NET framework becomes more valuable in extending systems beyond the basic BI stack. The management of code is essential as these environments grow – being able to rapidly deploy systems and being confident in the development process is vital in order to get the most out of cloud technologies.

Data Scientist – For the first time, users with advanced analytical skills have a place in the architecture to allow for experimentation, ad-hoc analysis and integration with statistical tools. This free-form analysis outside of strict development protocols accelerates the business’s access to the insight and understanding locked within their collected data. Many of these insights will mature into key measures for the business and can be built into the more stable, curated data models.

Data Steward – If anything, the importance of a nominated data steward grows as we introduce systems designed for ad-hoc analysis. Without governance and controls around how data is stored within a data lake, it can quickly become a dumping ground for anything and everything. Some critics see data lakes as a dystopian future with uncontrolled “swamps” of data that grow meaningless over time. Whilst “store everything” is a fundamental tenet of the data lake mentality, everything stored should be carefully catalogued and annotated for maximum usefulness – the importance of this should not be underestimated. Our steward should embody the meticulous collector, not the disorganised hoarder.

Database Administrator – The introduction of PaaS services as our main components changes this role dramatically, but the DBA remains a core member of the business intelligence team. The common DBA tasks of security management, capacity and growth planning, performance optimisation and system monitoring are all very much a part of day-to-day life. Certain tasks such as backup & recovery are taken away as services Microsoft provides, but the additional skills needed to manage performance on these new technologies are critical. Data Lakes produce large amounts of output as a by-product of running jobs and queries; these need to be cleaned up and maintained over time. Access levels to different areas of the lakes will be a growing concern as our lake models mature. Finally, the tuning of high-performance queries, whether in U-SQL or Azure DW, now requires a whole new set of skills to analyse and optimise. PowerShell, a traditional tool of the DBA, becomes hugely powerful within Azure as it is the key to managing system automation – scaling systems up, down, on and off requires a reasonable grasp of PowerShell if you want to get the most out of your systems.

BI Analyst – Somewhere in between our Data Scientists, Developers and Consumers, we have our BI Analysts. Whereas previously they may have been expert cube users, building reports and dashboards for end users to consume using the BI systems provisioned, they now have far more autonomy. PowerBI and other reporting technologies, whilst being touted as the silver-bullet for all self-service reporting needs, can deliver far more when in the hands of an experienced reporting analyst. Essentially, the analyst still acts as the champion of these tools, pushing data exploitation and exposure, except they now have the ability to deliver far more powerful systems than before.

Data Consumer – The business user is, by far, the biggest beneficiary of these tools and systems. With a flexible, scalable architecture defined, as well as different streams of data management and exploitation, there will be many benefits for those who need to gain insight and understanding from the company’s various data sources. The data models supporting self-service tools will benefit from faster performance and can include data sets that were previously size-prohibitive, giving the data consumers instant access to wider models. If the datasets are too complex or new to be exposed, they can contact specialists such as the BI Analyst or Data Scientist to investigate the data on their behalf. These manual-analysis tasks, because the architecture is built and designed to support them, should be more maintainable for a BI team to provide as an ongoing service.

Conclusions

In many ways, things aren’t changing that much – we still need this mix of people in our BI team in order to succeed. But as always, technology is moving on and people must be willing to move along with it. Many of the design patterns and techniques we’ve developed over the past years may no longer apply as new technology emerges to disrupt the status quo. These new technologies and approaches bring with them new challenges, but the lessons of the past are essential in making the most of the technology of the future.

The people who will be thriving in this new environment are those who are willing to challenge previous assumptions, and those who can see the new opportunities presented by the changing landscapes – both in terms of delivering value quicker and more efficiently, and finding value where previously there was none.

Getting Started with Azure Data Lake Analytics & U-SQL

Data Lake Analytics is the querying engine that sits on top of the Azure Data Lake (ADL) Storage layer. If you have not yet got to grips with ADL Store, I’d suggest you go through my quick introduction here.

Azure’s Data Lake Analytics has developed from internal languages used within Microsoft – namely ‘SCOPE’. It is built on Apache YARN which, in turn, grew out of the original Apache MapReduce framework. For a little light reading around the history of ADL, I’d suggest looking at the ADL Analytics Overview here.

The new language introduced by ADL Analytics, mysteriously named U-SQL, brings .NET functionality and data types to a SQL syntax. You declare variables as strings, not varchars, but frame your code in SELECT, FROM and WHERE clauses. The extensibility of the code is huge as a result – you can easily write your own C# methods and call them within your select statements. It’s this unification of SQL and .NET that supposedly gives U-SQL its name. The familiarity of code on both sides should open this up to Database and Application developers alike.
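
To make that concrete, here’s a minimal sketch of the kind of thing this allows. It assumes the @searchlog rowset extracted from the SearchLog.tsv sample in the walkthrough below, and the alias names and threshold are purely illustrative:

@tidied = 
    SELECT UserId, 
           Region.ToUpper() AS RegionCode,                       // plain C# string method, called inline
           (Duration >= 1000 ? "Long" : "Short") AS DurationBand // C# conditional expression
    FROM @searchlog;

Anything you can express as a C# expression over the column values can sit in the SELECT list like this, which is where the .NET side of U-SQL starts to pay off.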

Setting up your Data Lake Analytics Account

Assuming you have already set up an ADL Store, setting up your own Analytics service is just as easy.

First, go through New > Data & Analytics > Data Lake Analytics:


You’ll get the usual new item configuration screen, simply pop in your details and link it to your ADL Store account.


A few minutes later, you’ll be ready to go with your Analytics service.

To start with, we’ll write a very basic U-SQL Job through the Azure Preview Portal. This way, you can start performing large transformations on your files without the need to download any extensions, updates etc. You don’t even need a copy of Visual Studio! However, as you formalise your system and begin to rely on it as your primary data source, you’ll definitely want to keep your code in source-controlled solutions and make use of the various capacity management tools Microsoft have recently released for managing your Data Lake projects.

Download Sample Data

Back in the Azure Preview Portal, when we open up our new ADL Analytics account we see the familiar overview blade:


There’s a decent overview of the U-SQL language here, along with several sample jobs provided through the “Explore Sample Jobs” link on the overview blade. If you follow the samples link, you’ll see a couple of options on a new blade.


For now, click the “Copy Sample Data” button at the top. This will populate your data lake store with the sample files used by the provided examples. I’ll walk through some more advanced examples over the next few posts, but let’s simply try and access some data first. The first example uses SearchLog.tsv found in /Samples/Data/ after installing the samples.

U-SQL works by defining rowset variables and passing them between various steps. Your first rowset might be the data extracted from your sample text file; this rowset is then passed to an output, which writes it to an aggregate table or another file.

Your first U-SQL Job

Simply click on the “New Job” icon on the ADL Analytics Overview blade to start writing your very first job.


Admittedly, this throws you in at the deep end. You’re faced with a blinking cursor on line one of your script, but I’ll talk you through the structure of the first example query.

The ADL Store can contain SQL tables, as well as unstructured objects, and the syntax used varies depending on what you’re using. Tables are accessed using the traditional SELECT clause whereas for files we use EXTRACT. I’m assuming most readers will be familiar with a select statement, so let’s get an initial Extract working.
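
For completeness, the table side looks like an ordinary query. Here’s a hedged one-liner assuming a U-SQL database and table already existed (MyDb.dbo.SearchLogTable is purely illustrative; this post doesn’t create one):

@fromTable = 
    SELECT UserId, Region, Query   // illustrative column names
    FROM MyDb.dbo.SearchLogTable;

Files are the more interesting case here, so back to the EXTRACT.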

We start by defining our rowset variable, let’s call it @searchlog for now. There’s no need to declare this formally, we can just go ahead and assign the results of a query to it.

The basic structure of this query would be:

@searchlog = 
    EXTRACT <column1> <datatype>
    FROM <sourcelocation>
    USING <extraction method>;

The major assumption is that we will be defining schema on query – the flat files do not contain their own schema information and so we define it when writing the query.

So, to bring back some data from the “SearchLog.tsv” sample file, we need to give each column a name and a data type. It appears that we need to define the whole file for now, although it seems that support for querying across variable structures is on its way – it doesn’t seem to be documented just yet.

Defining each column, we build up the EXTRACT statement to:

EXTRACT UserId          int, 
        Start           DateTime, 
        Region          string, 
        Query           string, 
        Duration        int, 
        Urls            string, 
        ClickedUrls     string

Remember, we’re using C# datatypes so we don’t need to worry about lengths of strings etc.

Next, we define the filename. In the first example, we can use a reference to a specific file – this can be the fixed URL to the file specifically, or a relative reference within the Store itself. Our FROM statement for the SearchLog file is therefore:

FROM @"/Samples/Data/SearchLog.tsv"

Finally, we need to tell the query how to understand the particular file we’re attempting to extract data from. There are many extraction interfaces defined by default, for many of the most common flat files, so don’t worry if you prefer CSVs to TSVs, or even if you prefer to define your own delimiters.

In this case, as we’re using a TSV, we use the inbuilt Extractors.Tsv() function.
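
For reference, if the same data arrived as a CSV or with a custom delimiter, only the USING clause would need to change. A couple of hedged examples (worth checking the parameter names against the current documentation):

USING Extractors.Csv();                   // comma-separated files
USING Extractors.Text(delimiter: '|');    // generic text extractor with an explicit delimiter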

Putting this all together gives us the example query:

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

This will leave us with a rowset variable that has been populated with the columns defined from the TSV file. In SQL parlance, this is like defining a table variable and throwing it away at the end of the query.

In order to view our results, we need to output our resultset somewhere; this is where the OUTPUT clause comes into play.

A full OUTPUT statement requires:

OUTPUT 
    <rowset variable>
TO 
    <location>
USING 
    <output method>

We know our rowset variable, that’s the @searchlog we’ve just extracted data into. We can define a new file for our location, this simply needs to be a relative path and the name of the file to be created.

Finally, as with Extractors, we need to instruct the query what function to use to output the data if we’re pushing it to a flat file. Once again, many providers are included as standard, but let’s stick with TSV for simplicity.

Our output statement therefore looks like:

OUTPUT @searchlog 
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();
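
As an aside, the other built-in outputters follow the same pattern. To my understanding, something like the following would produce a comma-separated file with a header row instead (the file name is illustrative, and the parameter name is worth verifying against the current documentation):

OUTPUT @searchlog 
    TO @"/Samples/Output/SearchLog_output.csv"
    USING Outputters.Csv(outputHeader: true);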

Putting this together our full U-SQL script is:

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();
 
OUTPUT @searchlog 
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();

Now this isn’t terribly exciting. This will simply take the contents of the sample file and dump it into a new file with exactly the same structure.

For now, click the Submit Job button to execute your job.


One key point here is that it will take at least a minute or two for the job to run, even if you have very little data in your files. The whole system is optimised for massive scale, not lightning-fast micro-transactions. The real gains come from running queries across hundreds or thousands of files at once, scaling to run across many terabytes of data efficiently.

Hit refresh on the job status blade and you’ll eventually see the results of your job, hopefully succeeded.


You’ll then be able to navigate to the output file and view your results.

That’s your very first foray into writing U-SQL, and not a terribly exciting example. However, we can write additional queries in between the EXTRACT and OUTPUT steps that add calculations and aggregations, join to additional rowsets and even apply any C# libraries that we associate. This in itself is nothing new, especially if you’re familiar with Pig; however, this can all be scaled to massive levels using a simple slider and a pay-as-you-go charge rate. We’ll come to these more advanced examples in future posts.
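
In the meantime, as a flavour of what that looks like, here’s a hedged sketch of the sort of step you could slot in between the EXTRACT and OUTPUT above. It is an illustrative aggregation over the @searchlog rowset, not something from the sample set:

@regionStats = 
    SELECT Region, 
           COUNT(*) AS Searches,           // number of searches per region
           SUM(Duration) AS TotalDuration  // total search duration per region
    FROM @searchlog
    GROUP BY Region;

OUTPUT @regionStats 
    TO @"/Samples/Output/RegionStats.tsv"
    USING Outputters.Tsv();

The same submit-and-wait pattern applies; the only difference is that the output file now contains one aggregated row per region rather than a copy of the input.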