Simon Whiteley's Blog

Azure SQLDW–Distribution

We've spoken about the MPP nature of SQLDW, and how it scales by changing the number of compute nodes associated with a fixed number of storage distributions. But the way that you organise data across these distributions can drastically change how the system performs - if it's done badly, one of your compute nodes will be doing all the work and your performance will be very poor.

For now, there are two main methods of distribution. We can focus on spreading data evenly across all distributions to ensure we get predictable performance or we can distribute on a specific column to optimise for specific query performance. These distribution types are known as ROUND ROBIN and HASH distributions respectively.

Round Robin Distribution

Each record that is inserted goes onto the next available distribution. This guarantees that you will have a smooth, even distribution of data, but it means you have no way of telling which data is on which distribution. This isn't always a problem!

If I want to count records grouped by a particular field, I can do this on a round-robin table. Each distribution will run the query in parallel and return its grouped results. The results can simply be added together as a second part of the query, and adding together 60 smaller datasets shouldn't be a large overhead. For this kind of single-table aggregation, round-robin distribution is perfectly adequate!
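
As a rough sketch of that scenario (the table and column names here are made up for illustration), a round-robin table and a single-table aggregation could look like this:

CREATE TABLE dbo.FactSales
(
    SaleId      INT           NOT NULL,
    CustomerKey VARCHAR(10)   NOT NULL,
    SaleAmount  DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,   -- spread records evenly, no particular key
    CLUSTERED COLUMNSTORE INDEX
);

-- Each distribution counts its own records in parallel,
-- then the 60 partial results are added together.
SELECT CustomerKey, COUNT(*) AS SalesCount
FROM dbo.FactSales
GROUP BY CustomerKey;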

However, issues arise when we have multiple tables in our query and need to join them. Let's take a very simple join between a fact table and a dimension. I've shown 6 distributions for simplicity, but this would be happening across all 60.

[Image: fact and dimension records spread evenly across six round-robin distributions]

In this case, both tables have been distributed using ROUND ROBIN. There is an even number of records on each distribution; however, there is no consistency in where records are stored. There are multiple fact records for "A1", but they are held on different distributions.

If I run a simple select statement joining the two tables, each distribution needs to be able to run the query in isolation. Taking the first distribution, we have records with foreign keys for A1 & A2 on the fact, and only records for A1 on the dimension. But more than that - the server doesn't KNOW which records it has.

In order to execute this query, something called data movement happens - essentially, a full copy of the dimension table is written to every other distribution. At this point, each distribution knows it has all the records needed in order to execute the query.

[Image: the dimension table copied to every distribution as part of data movement]

This happens at query runtime. So when your user is trying to get the results of a select statement, they have to wait while the system reorganises data. This is done for the specific query and dropped after execution - so it'll happen every time the query is run.

Basically - data movement is bad. It's not disastrous, and it's unavoidable in many queries, but it's something you should have in mind whenever you design your distributions.

HASH Distribution

This is the alternative - designing our distributions around a specific key. When a record is inserted, a hash is created from the value of the chosen column, and the record is allocated to a distribution based on that hash. Then, if another record comes in with the same value for that column, it will be co-located on that same distribution. This is true across tables - so if two tables are distributed on the same column, the related records will be stored on the same distribution.
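
To make that concrete, here's a rough sketch of the same hypothetical fact and dimension from earlier, this time both distributed on the foreign key so related records land on the same distribution:

CREATE TABLE dbo.FactSales
(
    SaleId      INT           NOT NULL,
    CustomerKey VARCHAR(10)   NOT NULL,
    SaleAmount  DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),   -- co-locate rows by the join key
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE dbo.DimCustomer
(
    CustomerKey  VARCHAR(10)  NOT NULL,
    CustomerName VARCHAR(100) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),   -- same key as the fact table
    CLUSTERED COLUMNSTORE INDEX
);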

Let's take a look at our fact and dimensions, assuming they were distributed on the foreign key:

[Image: fact and dimension tables hash-distributed on the foreign key, with related records co-located]

You can immediately see two things. Firstly, all records relating to A1 are on the same distribution as the A1 dimension record, and the same is true for B1, B2 etc. Secondly, there is obvious skew - the distribution holding the A1 records has more records than any other! This means it will be slower to finish its query than the others.

This is a tiny example - two extra records make very little difference, but when we're dealing with hundreds of millions of records, this skew can become significant. SQLDW can only return results once all distributions have finished executing, meaning the whole query is only as fast as your slowest distribution.
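
If you want to check how bad the skew actually is, SQLDW inherits DBCC PDW_SHOWSPACEUSED from PDW, which reports row counts per distribution - a quick sketch against the hypothetical fact table from earlier:

-- One result row per distribution, including how many table rows each holds;
-- a big gap between the largest and smallest counts means skew.
DBCC PDW_SHOWSPACEUSED("dbo.FactSales");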

However, when we execute our query this time, let's look at our A1 distribution:

[Image: the A1 distribution holding every fact and dimension record it needs to answer the join locally]

We now know which records this distribution holds for both tables, and we know that it can execute the query without needing data from other distributions. The query can therefore execute immediately, with no data movement steps taking place - we have effectively eliminated a whole step from the query.
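
As a sketch (same hypothetical tables as above), a query like this aligns both the join and the grouping with the distribution key, so each distribution can answer its share entirely on its own:

SELECT
    f.CustomerKey,
    COUNT(*)          AS SalesCount,
    SUM(f.SaleAmount) AS TotalSales
FROM dbo.FactSales f
JOIN dbo.DimCustomer d
    ON f.CustomerKey = d.CustomerKey   -- both tables hashed on CustomerKey
GROUP BY f.CustomerKey;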

Replicated Distribution

In the APS/PDW world, there is a third distribution type - REPLICATED. This creates a full copy of the table on every distribution, eliminating the need for many types of data movement. At the time of writing, this is being worked on for SQLDW but has not yet been released, even in preview.

Picking a Distribution key

Based on that knowledge, we’re after the following factors for a good distribution:

  1. Distinct Values - We need at least 60 unique values - but more than this, we want LOTS more; at least 600 would be preferable. If we have fewer than 60 values, we would not be using all distributions and would essentially be wasting potential performance!
  2. Even Distribution - We want the values to be evenly spread. If one key value has 10 records against it but another has 1,000, that's going to cause a lot of discrepancy in distribution performance (there's a quick profiling sketch after this list).
  3. Commonly Used in Table Joins - The biggest cause of data movement is table joins. Look at the queries your users are going to be issuing commonly; the keys used in those joins are candidates for distribution.
  4. Not Commonly Filtered - Finally, if users commonly use your distribution key as a search predicate, this is going to drastically reduce any parallelism. If it happens now and then, it's not terrible - but if it affects the bulk of queries then it's going to be a real performance drain.
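
To check a candidate against the first two criteria, a simple profiling query does the job - a rough sketch, using the hypothetical fact table and column from earlier:

SELECT
    COUNT(*)          AS DistinctValues,   -- want comfortably more than 60
    MAX(RowsPerValue) AS BiggestValue,     -- compare these two to spot skew
    MIN(RowsPerValue) AS SmallestValue
FROM
(
    SELECT CustomerKey, COUNT(*) AS RowsPerValue
    FROM dbo.FactSales
    GROUP BY CustomerKey
) AS v;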

That's probably the hardest part of designing an Azure SQL DataWarehouse - a lot of the performance and design comes down to knowing how your users will query the system, and this is rarely something we know in advance! Luckily, it's fairly easy to change distributions - we'll come on to that in future posts.

Azure SQLDW–How Does Scaling Work?

We've discussed what Azure SQLDW actually is, so we know it works on scalable MPP technology. But what does that actually mean?

If you're happy with the idea that MPP aligns compute with storage then read on. If not, have a quick skim of my blog here.

Ok, so we're using MPP, we have storage and we apply compute to it. In SQLDW, the magic number is 60. There are always sixty distributions. Whenever you write a record to a table, it will always be written to one of those 60 distributions.

You can think of those distributions however you like - as individual buckets of data, as isolated SQL databases, as magic data-shelves in the sky - as long as you know that there are always sixty, regardless of scale, tier, price, country or any other factor. That's a major difference from the APS/PDW systems that SQLDW evolved from - they would change the number of distributions as you scaled. However, in order to get the dynamic scaling that cloud technologies rely on, they couldn't have data being redistributed every time you scale up or down. So distributions are fixed, permanently, at 60.
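
If you want to see that for yourself, the PDW catalogue views are exposed in SQLDW - assuming sys.pdw_distributions is available on your instance, this should return 60 whatever you're paying:

-- One row per distribution; the count stays at 60 regardless of DWU level.
SELECT COUNT(*) AS DistributionCount
FROM sys.pdw_distributions;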

You scale the system by adding/removing compute to work against these 60 distributions.

IMPORTANT: When you scale SQLDW, you will kill all running transactions. Scaling effectively restarts the server.

How to Scale SQLDW

There are several different ways of scaling SQLDW:

  • Through the Azure Portal
  • Via T-SQL
  • Via PowerShell

The portal is the only one that informs you of the cost when you scale. You should be aware that you will be charged per hour - so if you turn it up for 5 minutes then back down, you'll still be charged for the remainder of the hour.

Firstly, the portal. This is the easiest option when you're first getting started; however, it cannot be automated and is very manual.

To do so, navigate to your SQLDW blade and click the "Scale" button. When you select your chosen DWU level, it will tell you how much that amount of compute will cost per hour.

[Image: the Scale blade in the Azure Portal, showing the DWU slider and hourly cost]

Next, there's the SQL route. Really easy - simply issue an ALTER DATABASE command against your database and it'll update the "Service Objective", which for SQLDW is the DWU level.

Note that this will NOT work if you run it against the SQLDW database directly. Like other Platform-as-a-Service databases (such as Azure SQL DB), the server it is allocated to will have a master database, used to control database-level metadata and security settings.

You need to connect directly to the master database to issue a T-SQL scaling command – this isn’t always possible with production systems but it’s worth bearing in mind as a potential method.

ALTER DATABASE [DBNAME] MODIFY (SERVICE_OBJECTIVE = 'DW1000');
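
Once that's run, you can confirm the current DWU level with a standard metadata function - assuming DATABASEPROPERTYEX reports the service objective here as it does for Azure SQL DB, and reusing the DBNAME placeholder from above:

-- Returns the current service objective, e.g. 'DW1000'.
SELECT DATABASEPROPERTYEX('DBNAME', 'ServiceObjective') AS CurrentServiceObjective;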

Finally there's PowerShell - either via the Azure PowerShell ISE or through an Azure Automation Account, which essentially provides a Platform-as-a-Service PowerShell node.

There are a couple of cmdlets available - these are pre-built PowerShell functions you can pass parameters to in order to perform standard tasks.

There are three standard cmdlets I would use:


Set-AzureRmSqlDatabase -DatabaseName "Database" `
    -ServerName "Server" `
    -ResourceGroupName "ResourceGroup" `
    -RequestedServiceObjectiveName "DW1000"

This is the same cmdlet used to change the service objective of Azure SQLDBs. Simply provide your database, server and resource group, along with the required DWU level to move to.

The next two are interesting - you can entirely pause the compute associated to your SQLDW. This allows you to save a lot of money in out-of-hours times, maybe turning things off in the evenings and weekends.


Suspend-AzureRmSqlDatabase `
    -ResourceGroupName "ResourceGroup" `
    -ServerName "Server" `
    -DatabaseName "Database"

Resume-AzureRmSqlDatabase `
    -ResourceGroupName "ResourceGroup" `
    -ServerName "Server" `
    -DatabaseName "Database"

In each of these methods, the scaling will take anywhere from 30 seconds to 5 minutes. If you have long-running transactions when you issue the scaling command, it will need to roll these back - in these scenarios the scaling action could take a long time!

What happens when you scale SQLDW?

When we change the number of DWUs (that's data warehouse units, similar to the SQLDB concept of DTUs) - we're actually just changing the number of distributions each compute node has to look after.

At 100 DWUs, a single compute node manages all storage:

[Image: at 100 DWUs, a single compute node managing all 60 distributions]

At 200 DWUs, there are two compute nodes and each only has to manage 30 distributions. This is already a drastic difference in performance.

[Image: at 200 DWUs, two compute nodes with 30 distributions each]

Again, at 300 DWUs, we add another compute node and associate it with storage. No data is moving, but we now have less work for each compute node to do, so each one completes faster.

[Image: at 300 DWUs, three compute nodes with 20 distributions each]

That's why we can keep scaling all the way to 6000 DWUs - at that point we are associating each distribution directly with a compute node - there's a one-to-one mapping. We can't scale any further than that as we would begin sharing storage between compute, and that's the SMP problem we're trying to avoid!
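
The arithmetic is simply 60 divided by the number of compute nodes. If you want to check how many nodes you currently have, the PDW DMVs can tell you - a sketch, assuming sys.dm_pdw_nodes is exposed on your instance:

-- Count the compute nodes at the current DWU level;
-- distributions per node = 60 / compute node count.
SELECT COUNT(*) AS ComputeNodes
FROM sys.dm_pdw_nodes
WHERE type = 'COMPUTE';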

It might help to think of this in MPP terms - our 300 DWU example would look like this:

[Image: the 300 DWU example drawn as an MPP diagram - three compute nodes, each aligned to its own set of distributions]

Hopefully that helps you understand what's happening when you hit that "Scale" button - we'll be talking about distributing data next. If you haven't distributed your data properly, all the scaling in the world won't help you very much!

Azure SQLDW–What is it?

Azure SQL DataWarehouse has had a lot of attention recently as a super-scalable SQL Server in Azure, but there's not much understanding of what it actually is. I could tell you it's the cloud version of PDW/APS, although that requires a fair bit of background knowledge. Technically it's a cloud-scalable T-SQL-based MPP platform… does that clear things up? No? Well let's break that down - first let's look at what MPP means…

MPP

I like to play this game at the start of Azure SQLDW sessions. I ask who has used an MPP system before, a couple of hands go up. I ask who has used an SMP system, a few hesitant hands are raised. I ask who has used a SQL Server before and suddenly we have a room full of waving hands.

Traditional SQL Server IS an SMP system. SMP stands for Symmetric Multi-Processing - so without really knowing it, we're using SMP systems all the time. I'm hoping I can stop playing the wavy-hands game soon when everyone either knows this or has used SQLDW…

What SMP means is that you have one data store that is shared between all CPUs and a service bus to manage interactions. If you have a particular record in a table, each of the CPUs can access that at any time. If they all try to access it at once, then we need locking and queue management in the middle. Sure, you can be clever with your FileGroup/File structures and have parallel I/O reads, but all that storage is still shared across the CPUs.

[Image: an SMP architecture - multiple CPUs sharing a single data store through a service bus]

Now we can add more and more compute to this solution, but there comes a point where the service bus is working too hard to try and balance everything out. Essentially you get diminishing returns on your hardware investment as it has to manage more complex interactions with your storage.

MPP, or Massively Parallel Processing, on the other hand, separates storage and aligns it to specific compute. What this means is each CPU can only access a certain set of records - but we know that there is no competition from other CPUs to access that storage. The idea being that each compute node does its job in isolation, then we can add the various results together to get our answer.

[Image: an MPP architecture - each compute node aligned to its own storage bucket]

That's an important point to remember - a single record is only held on ONE of the storage buckets (In most scenarios currently… we'll get onto that in later posts)

The way we scale this is to add more compute nodes and associated buckets. For example, if we had 4TB of data, each of the storage buckets in the above example would be looking after 1.33TB of data. If we add a further compute & storage combination and spread our data out, then each would be holding 1TB of data. We can keep going - at 8 compute & storage combinations, each of our nodes will only be querying 500GB of data and we will see performance increase as a result!

Our scaled-up MPP would look something like this:

[Image: the scaled-up MPP system - more compute nodes, each holding a smaller share of the same data]

The overall storage volume is the same, but it's spread across more compute. So each compute node finishes its job quicker, meaning we get our results faster!

Cloud Scalable

That's the MPP part - so what about Cloud Scalable? Well we know that we can make it go faster by adding more compute nodes and spreading the data out across those nodes. But we can take this a step further in Azure - the compute nodes can be allocated to storage rather than redistributing data, so we can add and remove compute nodes dynamically.

Essentially, you take the compute to the storage - it's a fundamental approach in big data to get real scalability. Moving data is expensive and slow, moving compute is quick, efficient and cheap if you have the right hardware setup. This also means that you can take compute away entirely when you're not using it - you'll still pay for the storage, but it's a lot cheaper than having a server you're not using!

Want your server to go faster for the next hour? Double the compute nodes! Not using it overnight? Turn it off! Doing a big data load overnight? Turn the server back on at a high rate, blast through the processing, then pause the compute again!

[Image: compute being scaled up, down and paused against the same fixed storage]

T-SQL Based

Finally - yes, it is indeed a SQL Server. However, it uses a specific version of the SQL Server engine, evolved from SQL Server PDW edition. This means that a lot of your favourite functionality is still there and works fine, however there are certain feature limitations due to the system's parallel nature. They're slowly going through and adding these features, but we won't ever achieve full engine parity!

So - CTEs, window functions, aggregations, stored procs, views - these are all present and accounted for!

Identity columns, @@Rowcount, table-valued functions are a few which are not.
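
To give a flavour of the familiar bits, here's a small hypothetical query (made-up table and column names) using a CTE, a window function and an aggregation together - all of which behave as you'd expect:

-- Top 3 sales per customer, summed - CTE + window function + aggregation.
WITH RankedSales AS
(
    SELECT
        CustomerKey,
        SaleAmount,
        ROW_NUMBER() OVER (PARTITION BY CustomerKey ORDER BY SaleAmount DESC) AS SaleRank
    FROM dbo.FactSales
)
SELECT CustomerKey, SUM(SaleAmount) AS Top3Total
FROM RankedSales
WHERE SaleRank <= 3
GROUP BY CustomerKey;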

It's a very familiar environment to anyone who develops in T-SQL, but it's not exactly the same.

There we go. So now we can happily say that Azure SQL DataWarehouse is a cloud-scalable T-SQL-based MPP platform and not be left reeling.

A Guide to Azure SQL DataWarehouse

So you've heard the hype - Azure SQL DW is going to solve all of your problems in one fell swoop… Right? Well… maybe. The system itself is a mix of technologies designed for low-concurrency analytics across huge amounts of relational data. In short, it's a cloud-scalable T-SQL-based MPP platform, with all the benefits and restrictions that performing everything in parallel brings. If your problem can be solved by performing lots of calculations over small chunks of your data before aggregating the results into a whole, this is the technology for you.

However, before you jump right in, be aware that SQLDW is a very different beast to other SQL tools. There are specific concepts you need to be familiar with before building your system, otherwise you're not going to see the promised performance gains and will likely lose faith very quickly!

I'm in the process of documenting these concepts, plus there is a wealth of information available on the Azure SQLDW site. For the next few months, I'll be running through the blog topics below and updating the links accordingly. If there are topics you'd like me to add to the list, please get in touch!

Azure SQLDW Core Concepts:

- What is it?

- How Does Scaling Work?

- Distributions

- Polybase

- CTAS

- Resource Classes

- Partitioning

Designing ETL (or ELT) In Azure SQLDW

- Row counts

- Statistics

- Surrogate Keys

Performance Tuning Azure SQLDW

- Plan Exploration in SQLDW

- Data Movement Types

- Minimising Data Movement

Managing Azure SQLDW

- Backup & Restore

- Monitoring Distributions

- System Monitoring

- Job Orchestration

- Scaling and Management

- Performance Tuning Queries

Azure SQLDW Architecture

- Presentation Layers

- Data Lake Integrations