
Azure SQLDW–How Does Scaling Work?

We've discussed what Azure SQLDW actually is, so we know it works on scalable MPP technology. But what does that actually mean?

If you're happy with the idea that MPP aligns compute with storage then read on. If not, have a quick skim of my blog here.

Ok, so we're using MPP, we have storage and we apply compute to it. In SQLDW, the magic number is 60. There are always sixty distributions. Whenever you write a record to a table, it will always be written to one of those 60 distributions.

You can think of those distributions however you like - as individual buckets of data, as isolated SQL databases, as magic data-shelves in the sky - as long as you know that there are always sixty, regardless of scale, tier, price, country or any other factor. That's a major difference from the APS/PDW systems that SQLDW evolved from - they would change the number of distributions as you scaled. However, in order to get the dynamic scaling that cloud technologies rely on, SQLDW couldn't have data being redistributed every time you scale up or down. So distributions are fixed, permanently, at 60.
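If it helps to make that concrete, here's a purely conceptual sketch in PowerShell. The real distribution hash is internal to SQLDW and nothing like this, but for a hash-distributed table the principle is the same: a row's distribution-key value maps deterministically to one of the 60 buckets.

$distributionCount = 60   # fixed in SQLDW, regardless of scale

function Get-ConceptualDistribution {
    param([string]$DistributionKey)
    # NOT the real SQLDW hash function - purely an illustration
    $hash = $DistributionKey.GetHashCode() -band 0x7FFFFFFF
    $hash % $distributionCount
}

Get-ConceptualDistribution -DistributionKey "Customer42"   # same key, same bucket, every time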

You scale the system by adding/removing compute to work against these 60 distributions.

IMPORTANT: When you scale SQLDW, you will kill all running transactions. Scaling effectively restarts the server.

How to Scale SQLDW

There are several different ways of scaling SQLDW:

  • Through the Azure Portal
  • Via T-SQL
  • Via PowerShell

The portal is the only method that informs you of the cost when you scale. Be aware that you are charged per hour - so if you turn it up for 5 minutes then back down again, you'll still be charged for the full hour.

Firstly, the portal. This is the easiest method when you're first getting started; however, it's entirely manual and can't be automated.

To do so, navigate to your SQLDW blade and click the "Scale" button. When you select your chosen DWU level, it will tell you how much that amount of compute will cost per hour.

[Screenshot: the Scale blade in the Azure portal]

Next, there's the SQL route. Really easy - simply issue an ALTER DATABASE statement against your database and it'll update the "Service Objective", which for SQLDW is the DWU level.

Note that this will NOT work if you run it against the SQLDW directly. Like other Platform-as-a-Service databases (such as Azure SQL DB), the server it is allocated to has a master database, used to control database-level metadata and security settings.

You need to connect directly to the master database to issue a T-SQL scaling command – this isn’t always possible with production systems but it’s worth bearing in mind as a potential method.

ALTER DATABASE [DBNAME] MODIFY (SERVICE_OBJECTIVE = 'DW1000');

Finally, there's PowerShell - either via the Azure PowerShell IDE or through an Azure Automation account, which essentially provides a Platform-as-a-Service PowerShell node.

There are a couple of cmdlets available - these are pre-built PowerShell functions you can pass parameters to in order to perform standard tasks.

There are three standard cmdlets I would use. The first is:


# Scale the warehouse to the requested DWU level
Set-AzureRmSqlDatabase -ResourceGroupName "ResourceGroup" `
    -ServerName "Server" `
    -DatabaseName "Database" `
    -RequestedServiceObjectiveName "DW1000"

This is the same cmdlet used to change the service objective of Azure SQLDBs. Simply provide your resource group, server and database names, plus the DWU level you want to move to.
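If you're scripting this, it's worth confirming the change has taken effect. A minimal sketch using Get-AzureRmSqlDatabase - the resource names here are placeholders:

$db = Get-AzureRmSqlDatabase -ResourceGroupName "ResourceGroup" `
          -ServerName "Server" -DatabaseName "Database"

# The service objective the warehouse is currently running at (e.g. "DW1000")
$db.CurrentServiceObjectiveName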

The next two are interesting - you can entirely pause the compute associated with your SQLDW. This can save a lot of money out of hours, perhaps by turning things off in the evenings and at weekends.


Suspend-AzureRmSqlDatabase -ResourceGroupName "ResourceGroup" `
    -ServerName "Server" `
    -DatabaseName "Database"

Resume-AzureRmSqlDatabase -ResourceGroupName "ResourceGroup" `
    -ServerName "Server" `
    -DatabaseName "Database"
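Putting those together, here's a minimal sketch of an out-of-hours pause - again the resource names are placeholders, and you'd typically run something like this on a schedule from an Azure Automation account:

$params = @{
    ResourceGroupName = "ResourceGroup"
    ServerName        = "Server"
    DatabaseName      = "Database"
}

# Pause only if the warehouse is currently online
if ((Get-AzureRmSqlDatabase @params).Status -eq "Online") {
    Suspend-AzureRmSqlDatabase @params
}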

In each of these methods, the scaling will take anywhere from 30 seconds to 5 minutes. If you have long-running transactions open when you issue the scaling command, these will need to be rolled back first - in those scenarios, the scaling action could take a long time!

What happens when you scale SQLDW?

When we change the number of DWUs (that's data warehouse units, similar to the SQLDB concept of DTUs) - we're actually just changing the number of distributions each compute node has to look after.

At 100 DWUs, a single compute node manages all storage:

[Diagram: 100 DWUs - one compute node managing all 60 distributions]

At 200 DWUs, each compute node looks after only 30 distributions - already a dramatic difference in performance.

[Diagram: 200 DWUs - two compute nodes, 30 distributions each]

Again, at 300 DWUs, we add another compute node and associate it with its share of the storage. No data moves, but each compute node now has less work to do, so queries complete faster.

[Diagram: 300 DWUs - three compute nodes, 20 distributions each]

That's why we can keep scaling all the way to 6000 DWUs - at that point we are associating each distribution directly with a compute node - there's a one-to-one mapping. We can't scale any further than that as we would begin sharing storage between compute, and that's the SMP problem we're trying to avoid!
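A quick way to see that arithmetic - this sketch assumes the DWU levels discussed here, where the number of compute nodes is simply the DWU divided by 100:

# Distributions per compute node at each scale - nodes = DWU / 100 (a simplification)
$totalDistributions = 60
foreach ($dwu in 100, 200, 300, 600, 1000, 3000, 6000) {
    $nodes = $dwu / 100
    "{0,4} DWU: {1,2} compute node(s), {2,2} distributions each" -f $dwu, $nodes, ($totalDistributions / $nodes)
}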

It might help to think of this in MPP terms - our 300 DWU example would look like this:

[Diagram: the 300 DWU example drawn as an MPP system]

Hopefully that helps you understand what's happening when you hit that "Scale" button - we'll be talking about distributing data next. If you haven't distributed your data properly, all the scaling in the world won't help you very much!

A Guide to Azure SQL DataWarehouse

So you've heard the hype - the Azure SQL DW is going to solve all of your problems in one fell swoop… Right? Well… maybe. The system itself is a mix of technologies designed for low-concurrency analytics across huge amounts of relational data. In short, it's a cloud-scalable, T-SQL-based MPP platform, with all the benefits and restrictions that performing everything in parallel brings. If your problem can be solved by performing lots of calculations over small subsets of your data before aggregating the results into a whole, this is the technology for you.

However, before you jump right in, be aware that SQLDW is a very different beast to other SQL tools. There are specific concepts you need to be familiar with before building your system, otherwise you're not going to see the promised performance gains and will likely lose faith very quickly!

I'm in the process of documenting these concepts, plus there is a wealth of information available on the Azure SQLDW site. For the next few months, I'll be running through the blog topics below and updating the links accordingly. If there are topics you'd like me to add to the list, please get in touch!

Azure SQLDW Core Concepts:

- What is it?

- How Does Scaling Work?

- Distributions

- Polybase

- Polybase Limitations

- Polybase Design Patterns

- CTAS

- Resource Classes

- Partitioning

Designing ETL (or ELT) In Azure SQLDW

- Row counts

- Statistics

- Surrogate Keys

Performance Tuning Azure SQLDW

- Plan Exploration in SQLDW

- Data Movement Types

- Minimising Data Movement

Managing Azure SQLDW

- Backup & Restore

- Monitoring Distributions

- System Monitoring

- Job Orchestration

- Scaling and Management

- Performance Tuning Queries

Azure SQLDW Architecture

- Presentation Layers

- Data Lake Integrations

The Cloud BI Transition: ETL & Data Movement in the Cloud

The cloud warehouse is a different beast to the on-premise systems most of us have been using for the past decade. These times, they are a-changing. I've talked about the key concepts that we, at Adatis, see as core to this new world here, but how does this work in practice? To start us off, the core of any BI Developer's skillset - ETL.

The main two players are Azure Data Warehouse and Azure Data Lake. I recently gave a talk at SQLBits to introduce some key concepts, explain how I see them fitting into the new world and, hopefully, challenge some existing preconceptions - slides can be found here. I'll pop the video link up when it's published.

For those in a hurry, I’ve summarised the main thrust of my talk below.

The Ways of the Past

For many years, the fastest way to transform data whilst moving it between two places was to extract it, transform it in memory via SSIS, then load it into its final structures - hence ETL.

[Diagram: the traditional Extract > Transform > Load flow]

However, this was eventually rivalled by massive parallelism: whilst one database engine may be slower to work through a load than memory-based operations, ten engines might be much faster. But the Hadoop systems that provided this weren't very accessible to the bulk of BI developers - they were seen as requiring different skillsets and not being particularly standard.

The management of these clusters wasn’t always achievable even with developer acceptance, certainly not in the on-premise world. This is why the latest Azure technologies are looking to make such a difference.

The Scalable Warehouse

With Azure Data Warehouse, you can perform queries over large, distributed tables - both reading and writing with massive levels of parallelism, depending on your service tier. You can also scale the service dynamically: if your day-to-day usage requires 100 DWUs, you'll be seeing 8 reader threads and 60 writer threads. Even at this level you may find performance similar to SSIS.

But these are PaaS components and so are incredibly flexible and scalable. We can scale our ADW instance up to 400 DWUs just for the duration of our data load (bear in mind, you pay by the hour for the maximum level used in that time). This gives us 32 readers and 60 writers - at this level we'll be far outstripping the performance of a powerful SSIS box, and it will cost us just a couple of pounds per hour. This scaling can be done via a number of methods - a couple of lines of T-SQL, a simple PowerShell script or even through the UI.
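As a rough sketch of that load-window pattern, using the same cmdlet as earlier (the resource names are placeholders, and the DWU levels are just the ones from this example):

# Scale up for the load window...
Set-AzureRmSqlDatabase -ResourceGroupName "ResourceGroup" -ServerName "Server" `
    -DatabaseName "Database" -RequestedServiceObjectiveName "DW400"

# ... run the data load here ...

# ...then scale back down once it completes
Set-AzureRmSqlDatabase -ResourceGroupName "ResourceGroup" -ServerName "Server" `
    -DatabaseName "Database" -RequestedServiceObjectiveName "DW100"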

Once we've scaled, transformations are quite likely to be performed faster directly by our data warehouse engine than by a data movement tool. This gives us Extract > Load > Transform, a common technique among users of MPP systems - and with Azure DW coming from APS/PDW, this makes sense.

[Diagram: the Extract > Load > Transform flow]

Targeted Compute

With Azure Data Lake Analytics, you can spin up the equivalent of 1000 cheap servers (essentially managed Hadoop instances) for the lifetime of your query. You'll need to write U-SQL to do this - a new language combining C# and SQL syntax - but it's not a huge leap for an experienced SQL developer with just a hint of programming skills. The ability to write a query directly in the browser that can utilise massive amounts of parallelism removes a lot of barriers - you don't even need Visual Studio installed to take advantage of ADL.

Of course, if you do build your queries in Visual Studio, there is a whole tooling set to help you predict compute utilisation, spot bottlenecks and plan capacity. Not to mention the benefits that source controlling your code gives you!

[Screenshot: Azure Data Lake Analytics tooling in Visual Studio]

So Which Do We Use?

This gives us two options for our transformations: in the Data Lake, or within the Warehouse itself.

My assumption is that we're going to see TELT: some upfront, targeted transformation, then an extract/load into our final technology, then some further, general transformation.

Transform (targeted) > Extract > Load > Transform (general)

Whereas transformation used to be the bottleneck, it’s now the data movement itself that’s slowing things down. If we have just one very large job, we can perform this in isolation within the lake before moving anything and not have to scale our warehouse. If we have a lot of transformations to do on a variety of tables, we can scale up the warehouse for the duration. Maybe we do both to limit how much we scale the ADW. Perhaps we perform the transformation as increments throughout the day, either as smaller data lake jobs, data factory movements or even stream transformations. Once again, that’s what a distributed cloud architecture gives us – flexibility and choice.

The actual answer depends heavily upon the data scenario that you’re attempting to work with – the speed of incoming data, the types of data involved, the volume and complexity of the data. The new components do not provide you with a one-size-fits-all answer, they do, however, provide you with the tools to overcome almost any data challenges you may be facing.

Further Reading

Azure DataWarehouse Loading Reference Guide

Azure Data Warehouse Documentation

Azure Data Lake Analytics Documentation