
Simon Whiteley's Blog

Azure SQLDW-CTAS Statements

So here we are - you've designed your target distribution, you're scaling your warehouse appropriately and you've set up Polybase to access raw data outside of the system. You are now ready to bring data into the warehouse itself so we can start working on it.

Firstly, you should be aware that we're building using ELT, not ETL - ie: we design for the warehouse itself to do the work, as it is the most powerful processing option available to us. When I first introduced Polybase, I described how it is the only method of loading data into SQLDW that does not get bottlenecked through the control node. Well, the same limitations will be present if we try to transform our data using traditional ETL tools. If we used an SSIS data flow, for example, this would extract the data, pull it across to the SSIS server, perform the transformation and then attempt to reinsert it. Not only are we introducing network latency, but we are also ignoring the power of the MPP engine AND encountering the throughput limits of the control node.

clip_image001

In order to utilise SQLDW effectively, we write SQL for our transformations, rather than relying on external tools such as SSIS. This ensures the work is being done by our compute nodes and, therefore, can be scaled up to increase performance.

General best practice, therefore, would be to write stored procedures for each of the data movements we want to occur. This allows us to add in auditing, logging etc. But what we're interested in here is the core data movement itself.

Writing a traditional INSERT statement isn't the fastest way to get data into a table. There is a special syntax which creates a new table and inserts into it, and which is automatically configured for optimal bulk loading: the CTAS, or "Create Table as Select" statement.

CREATE TABLE [dbo].[MyTable]
WITH
(
HEAP,
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT *
FROM [dbo].[MySourceTable]

 

As it is the only minimally-logged, bulk insert method within SQLDW, it's the fastest way to get data into a table. If the table already exists and you want to wipe it and fill it with data, the fastest way to do so is to drop the table and recreate it using a CTAS. Performing a truncate/insert combination would be slower, as each row insert causes a write to the transaction log. SQLDW performs all standard actions as a transaction by default, so if you have a huge insert statement and something goes wrong, be prepared for a lengthy rollback process.
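As a rough sketch of that drop-and-recreate pattern (using the placeholder table names from the example above):

IF OBJECT_ID('dbo.MyTable') IS NOT NULL
    DROP TABLE [dbo].[MyTable];

CREATE TABLE [dbo].[MyTable]
WITH
(
    HEAP,
    DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT *
FROM [dbo].[MySourceTable];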

The table created by a CTAS statement infers its schema from the query, so you need to be explicit with ISNULLs, CASTs and anything else needed to point the query in the right direction. This is where the bulk of the data processing happens - the SELECT part of the CTAS both defines the destination table and carries out the data transformations, all as a single step.

If you've ever used "SELECT field1, field2 INTO [dbo].[NewTable] FROM [dbo].[SourceTable]" then you'll have a pretty good idea of how it works. But if you ran the two side by side, the CTAS would perform faster as, again, it is treated differently by the SQLDW engine.

What does this mean to our traditional ETL pattern? If you're following the standard approach of having intermediary tables between data processing steps (ie: Staged data, cleaned data, transformed data and so on), these transient tables can actually be deleted between loads. They won't exist in your database project. They won't exist until your load process has been run.

A CTAS to 'Clean' our fact table would be:

BEGIN TRY
    --Primary CTAS to insert data from external table into strongly typed Clean table
    CREATE TABLE Clean.MyTable
    WITH(
        HEAP,
        DISTRIBUTION = ROUND_ROBIN)
    AS
    SELECT
        ISNULL(CAST(Column_0 as INT),-1)  TableId, 
        CAST(Column_1 as VARCHAR(50))     SomeTextField, 
        CAST(Column_2 as DECIMAL(12,5))   SomeNumber
    FROM SRC.MySourceTable
    OPTION (LABEL = 'Clean.LoadMyTable.CTAS');
END TRY
BEGIN CATCH
    --CTAS failed - capture the error and rethrow so the calling process can mark the load as failed
    DECLARE @ErrorMsg NVARCHAR(2048);
    SET @ErrorMsg = 'Error loading table "MyTable" during Cleaning CTAS: ' + ERROR_MESSAGE();
    RAISERROR (@ErrorMsg, 16, 1);
END CATCH

By including the CAST and ISNULL statements, I'm controlling the schema of the created table. The resulting table would have the following definition:

CREATE TABLE [Clean].[MyTable]
(
    [TableId]       INT             NOT NULL,
    [SomeTextField] VARCHAR(50)     NULL,
    [SomeNumber]    DECIMAL(12,5)   NULL
)
WITH
(
    HEAP,
    DISTRIBUTION = ROUND_ROBIN
);

By using CTAS wherever possible, you will maintain the speed of your data loads and avoid any issues around long-running transactions. An efficient SQLDW loading process consists of nothing but CTAS statements.

For example - there is no such thing as a MERGE statement within SQLDW. In order to merge the results of one table into another, you can write a CTAS that selects a union of the two tables and inserts the result into a new table. You then drop the old table and rename your newly created table in its stead.
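As a rough sketch of that pattern (table names and the de-duplication logic are placeholders - the exact SELECT depends on how you want incoming rows to win over existing ones):

CREATE TABLE [dbo].[MyTable_New]
WITH
(
    HEAP,
    DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT * FROM [dbo].[MyTable]
WHERE TableId NOT IN (SELECT TableId FROM [dbo].[MyTable_Updates])   --keep rows that aren't being replaced
UNION ALL
SELECT * FROM [dbo].[MyTable_Updates];                               --add the new/updated rows

DROP TABLE [dbo].[MyTable];
RENAME OBJECT [dbo].[MyTable_New] TO MyTable;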

There are many similar patterns that will help you around any data movement problem – we’ll discuss them in later posts in this series.

For more information about the CTAS statement, see the Microsoft Docs.

Azure SQLDW-Polybase Design Patterns

In my last posts, I've spoken about using Polybase as a basic data loading mechanism for Azure SQLDW, some of the ways to maximise performance and some of the limitations that exist within the current implementation. But we haven't gone into specific implementation designs… until now.

The key ideas we will discuss are:

  • File Loading structures
  • Fast-Moving Entities

Data Loading

As discussed in the Polybase Limitations post - it is difficult to get file lineage from records extracted from a Polybase external table. Therefore, when I am using Polybase to load a strictly governed warehouse, I want to be able to quickly switch the context of the external table.

Let's assume we have vertically partitioned files within an Azure Data Lake structure - files are loaded into separate date folders. If multiple files are loaded on the same day, they will be placed in the same "day" folder.

clip_image001

One of Polybase's major benefits is that it can loop through all subfolders within the root folder. The first and most obvious use case is to point Polybase at the root folder. This brings back all records from files within any folder - essentially we can load all records that exist anywhere within that whole structure.

clip_image002

This is useful as a validation technique, using the full history of our data loads to check against what has currently been loaded through to the warehouse.

However - as there is no push-down predicate support in Polybase, we don't want to be using this for regular loads otherwise each query will be reading ALL files, then discarding anything except the latest. Our loads would slow down over time.

The next progression is to separate files that have not yet been loaded into a loading or "active" folder. This means you can have a single definition for your external table and only query the files you are interested in. However, it does mean you have to manage the archiving of these files once the load has finished successfully - it's another step in the process.

clip_image003

The process I have settled on recently is a little more complicated but much more flexible. As a part of my usual metadata structures, I keep a table of files that require loading into the warehouse. This allows for tracking lineage, progress, history etc and is something I include in most systems regardless.

A simplified version would look something like this:

image

Given this exists, we can use it to programmatically recreate the External Table, changing the [LOCATION] setting to point to the relevant file to be loaded using our [Source Path] column. Our pattern now looks like this:

clip_image004

We can then use a WHILE loop to run through the files to be loaded, bringing the rows from each into the warehouse. This has the added advantage of also allowing us to track exactly which file has been loaded from each query.

This isn't the most efficient use of Polybase - we gain much more efficiency by loading lots of files at once. However - we can be running data loads from many different file types at once, using the available external readers from our SQLDW compute nodes as effectively as possible.

This technique has added benefits however - if we had to perform a historical load, or do a one-off checksum, we can use this same technique to re-point the loading External Table at a mid-level folder.

We do this by simply adding a dummy record pointing at the upper path:

image

If we put this through our usual process, the External Table would be recreated at load and pointed to the 2017 folder:

clip_image005

This provides a nice, easy historical loading mechanism, as well as a flexible daily load process. It does, however, require writing a little Dynamic SQL to do the external table recreation.
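As a rough sketch of that dynamic SQL (the object names, data source and file format are placeholder assumptions, and in practice @SourcePath would come from the [Source Path] column of the metadata table):

DECLARE @SourcePath VARCHAR(255);
DECLARE @Sql NVARCHAR(MAX);

SET @SourcePath = '/MyEntity/2017/03/17/';    --value taken from the file metadata table

IF EXISTS (SELECT 1 FROM sys.external_tables WHERE object_id = OBJECT_ID('SRC.MySourceTable'))
    DROP EXTERNAL TABLE SRC.MySourceTable;

SET @Sql = '
CREATE EXTERNAL TABLE SRC.MySourceTable
(
    Column_0 VARCHAR(20),
    Column_1 VARCHAR(50),
    Column_2 VARCHAR(50)
)
WITH
(
    LOCATION    = ''' + @SourcePath + ''',
    DATA_SOURCE = MyDataLake,     --existing external data source
    FILE_FORMAT = CsvFormat       --existing external file format
);';

EXEC (@Sql);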

Fast-Moving Entities

That's enough about data loading for now, there's another major use case for Polybase that we haven't yet discussed. Many data processing solutions have a huge, unwieldy overnight batch job that performs aggregates, lookups, analytics and various other calculations.

However, it is often the case that this is not timely enough for many business requirements. This is where Polybase can help.

If we have an External Table over the newest files, this will read these new records at query time. We can write a view that combines the External Table with our batch-produced table. This will obviously go a little slower than usual, given it has to read data from flat files each time, however the results returned will be up to date.
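A minimal sketch of such a view, with placeholder table and column names:

CREATE VIEW dbo.MyEntity_Current
AS
SELECT SomeKey, SomeValue
FROM Warehouse.MyEntity       --batch-produced table
UNION ALL
SELECT SomeKey, SomeValue
FROM SRC.MyEntity_Today;      --external table over the newest files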

This isn't always going to be possible - sometimes the transformations and calculations applied to the source files are too complex to be replicated on the fly. But it is worth bearing in mind as a simple design pattern for providing near real-time reports without over-engineering your solution.

clip_image006

If you're familiar with the Lambda Architecture (one of the key movements within traditional Big Data architectures) - this is akin to using the Azure SQL DataWarehouse as the "Serving Layer" component, unifying the batch and speed processes.

As usual, if you have any questions, please get in touch!

Azure SQLDW-Polybase Limitations

If it wasn't clear from my other blogs, I LOVE Polybase. It's a super powerful tool that seamlessly integrates flat-file parallelism with relational querying. However, there are some ways in which the implementation really has to improve. These are true as of June 2017 - given how fast things change, I'm hoping they won't be limitations for too much longer…

Push-down predicates

This one is a biggie - if you're querying over a whole range of flat files that are organised into [YEAR]/[MONTH] folders, for example, you should be able to write a query like the following:

SELECT * FROM dbo.MyExternalTable WHERE [YEAR] > 2016

Ideally, this filter would be pushed down to the Polybase engine, telling it to ignore any files that have been vertically partitioned outside of our chosen range. Instead, all files are read and returned to the SQL Server engine and the filtering is done in-memory on the returned dataset. This is obviously hugely inefficient in some cases - especially when you're using Polybase as a data loading mechanism. This feature is available in Hive tables and you can do it in U-SQL - hopefully it's only a matter of time before a similar feature is implemented here.

File Metadata

As a BI Developer, I love metadata - I want to know where a record came from and which flat file carried it into the Lake. The ability to query thousands of files at once, recursively, is insanely useful - it gives you the ability to keep delta files as an immutable raw layer that you can query as a single entity. However, there is currently no ability to include the filename within the columns returned by the external table.

Again, this is something that you can do in equivalent technologies, and it seems mad that it's not included by default. When you look at the DMVs underneath the external readers, you can see, in detail, the file that each thread is currently reading - the engine is very aware of the file it's pulling data from.

Just query the sys.dm_exec_external_work table whilst running a query over a large set of files in an external table and you'll see what I mean.
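Something as simple as this, run while a large external query is executing, will show the reader threads and the files they are working through:

SELECT *
FROM sys.dm_exec_external_work;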

Rejection Handling

This one is even more frustrating - Polybase has a really useful feature where you can tell it to reject rows that fail to be parsed into the data types you've specified.

The most common usage for this is around file headers, discussed in the next limitation. However - the reject functionality has two flaws:

Firstly - you can't see the rows that failed. Such a simple thing, but if you're using Polybase as a data loading tool for an analytics warehouse, you can't accept that some rows simply won't make it through and you'll never know which rows failed.

This is a tricky limitation, however - I have no idea how they would model this effectively within SQLDW. If we were in the standard SQL engine, I'd like to see External Tables have an option to automatically create a shadow "External Error Rows" table. Any row that fails parsing would get dumped into that table as a JSON record with a query date and object_id.

I've actually raised the idea on connect here: http://bit.ly/2rWT35D

There is also the issue of speed - I've seen real performance problems when rejection percentages are turned on; the continuous checking against the tolerance seems to be a big drain on execution.

Header Row Skipping

Related to the above issue is header row skipping. Such a common thing to do, available in U-SQL, Hive and ETL tools going back to the dawn of time… but Polybase cannot currently skip header rows. Ideally you would have the option of skipping the first x rows of a file, for situations where there is a metadata/header section at the top.

Currently, you either have to accept your fate and perform some pre-processing on files to strip out any header data, maybe by using U-SQL, or accept the performance overhead of having a reject rows tolerance set within your external tables.
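For what it's worth, a minimal sketch of that second workaround - the object names, data source and file format are placeholders, and REJECT_VALUE needs to be sized to the number of header rows you expect across the files being read:

CREATE EXTERNAL TABLE SRC.MyCsvTable
(
    Column_0 VARCHAR(20),
    Column_1 VARCHAR(50),
    Column_2 VARCHAR(50)
)
WITH
(
    LOCATION     = '/MyEntity/2017/06/',
    DATA_SOURCE  = MyDataLake,
    FILE_FORMAT  = CsvFormat,
    REJECT_TYPE  = VALUE,
    REJECT_VALUE = 1      --tolerate the single header row that fails to parse
);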

Conclusions

None of the above are deal-breakers for me - the sheer speed and opportunities opened up by being able to query over a huge flat-file store, whilst still within your relational engine, more than make up for the limitations. However, many of these limitations don't exist within similar tools - this at least gives me hope that they can be fixed, and we should be seeing improvements over time.

Unfortunately, for the time being the focus seems to be on increasing Polybase compatibility - adding further data stores and additional security protocols rather than improving the functionality of the tool itself.

In the next post, I want to share some design patterns about Polybase use cases – dynamically switching external locations, wider architectures and more.

Azure SQLDW-Polybase

The last time I gave a SQLDW conference talk, one of the audience members commented that I probably like Polybase a little bit too much. But I disagree, I love it just as much as it deserves!

In a nutshell:

“Polybase is by far the fastest way to import/export data from SQLDW. If you're loading any reasonably sized data volume, you should be using Polybase”

That's not a quote - I just thought it would be more impactful looking like one!

For those of a traditional "Big Data" background, Polybase is essentially the same as an external Hive table, embedded into a traditional relational database.

For everyone else - Polybase is a special kind of SQL table. It looks like a table, it has columns and you can run normal T-SQL on it. However, instead of holding any data locally, your query is converted into map-reduce jobs against flat files - even those in sub-folders. For me, this is mind-bogglingly cool - you can query thousands of files at once by typing "select * from dbo.myexternaltable".
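To make that concrete, here's a rough sketch of the objects involved - the data source, credential, file format and table names are all assumptions for illustration, and the credential is presumed to have been created beforehand:

CREATE EXTERNAL DATA SOURCE MyDataLake
WITH
(
    TYPE = HADOOP,
    LOCATION = 'adl://mylake.azuredatalakestore.net',
    CREDENTIAL = MyLakeCredential     --database-scoped credential created beforehand
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

CREATE EXTERNAL TABLE dbo.MyExternalTable
(
    Column_0 VARCHAR(20),
    Column_1 VARCHAR(50)
)
WITH
(
    LOCATION    = '/MyEntity/',    --root folder; sub-folders are read too
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = CsvFormat
);

SELECT * FROM dbo.MyExternalTable;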

Why Use It?

Polybase has some limitations, it's true. I've included a section below discussing these. However, the other data import options are more limited.

SQLDW has two different types of server within its architecture - a single Control node that orchestrates and aggregates, and a variable number of Compute nodes linked to your current DWU level.

Any data loading method that 'pushes' data to the SQLDW will be handled by the control node - this includes all current ETL tools. An SSIS data flow, the staple diet of MS BI developers for many years now, will encounter a row throughput bottleneck when inserting rows because of this.

Whether you're using BCP, SSIS or a high-performance ETL solution - you're going to hit this bottleneck:

clip_image001

Because of this - it doesn't matter how you scale the warehouse. No matter how many compute nodes you have, the initial data throughput won't go any faster.

What we want is a data import method that scales - one that has a predictable increase in performance as we add compute nodes (or DWUs). This is where Polybase comes in.

clip_image002

Each compute node has a Hadoop File System (HDFS) bridge - essentially a map-reduce engine - included. This gives it external reader threads: for each compute node, you can be reading 8 different flat files. Raising the DWU level adds additional external readers, so import speed increases. This is the ONLY scalable data loading approach.

Polybase Performance Patterns

We're settled on Polybase then - but how do you get the most out of Polybase?

To properly optimise Polybase, you should consider how it accesses files. Whether you're using Blob Storage (accessed via WASB) or Data Lake Store (via WebHDFS), both systems split your files up and store them across a range of disks. This means that we can be reading separate parts of a file in parallel, without causing any kind of I/O contention.

A standard, uncompressed file will be split into its component parts - essentially each 512MB chunk is a separate extent to be accessed.

clip_image003

Whether this is a single file split into parallel chunks, or separate files, the Polybase table pulls the results of each thread together into a single data set that is returned by the query.

Another option is to compress the file - traditionally this speeds things up as we're reducing the amount of data retrieved from disk, which is usually our largest bottleneck. However, Polybase cannot multi-thread a compressed file, so this reduces it to a single thread, which is inevitably slower.

clip_image004

If you're really trying to optimise the performance, then you can get the best of both worlds. You can predict the number of threads that will read the file (ie: the number of 512MB chunks within the uncompressed file). If you then split the file into that many separate flat files and compress each file individually, you maintain the level of parallelism whilst simultaneously reducing the I/O impact. For example, a 4GB uncompressed file would be read as eight 512MB chunks - splitting it into eight files and compressing each one keeps eight reader threads busy while still shrinking the data read from disk.

clip_image005

 

Conclusions

Hopefully this has helped you to understand the part that Polybase plays in the SQLDW architecture. For further information, see the Azure CAT team's guide to SQLDW loading patterns & best practices - http://bit.ly/2tezjxP

In the next post, we'll discuss some of the limitations with the current Polybase implementation and ways that I'd like to see it improved.

Azure SQLDW–Distribution

We've spoken about the MPP nature of SQLDW, and how it scales by changing the number of compute nodes associated to a fixed number of storage distributions. But the way that you organise data across these distributions can drastically change how the system performs - if it's done badly, one of your compute nodes will be doing all the work and your performance will be very poor.

For now, there are two main methods of distribution. We can focus on spreading data evenly across all distributions to ensure we get predictable performance or we can distribute on a specific column to optimise for specific query performance. These distribution types are known as ROUND ROBIN and HASH distributions respectively.

Round Robin Distribution

Each record that is inserted goes onto the next available distribution. This guarantees that you will have a smooth, even distribution of data, but it means you have no way of telling which data is on which distribution. This isn't always a problem!

If I wanted to perform a count of records, grouped by a particular field, I can perform this on a round-robin table. Each distribution will run the query in parallel and return its grouped results. The results can simply be added together as a second part of the query, and adding together 60 smaller datasets shouldn't be a large overhead. For this kind of single-table aggregation, round-robin distribution is perfectly adequate!
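For instance, a query like this (table and column names are just placeholders) aggregates happily on each distribution before the results are combined:

SELECT SomeField, COUNT(*) AS RecordCount
FROM dbo.MyRoundRobinTable
GROUP BY SomeField;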

However, the issues arise when we have multiple tables in our query and we need to join them. Let's take a very simple join between a fact table and a dimension. I've shown 6 distributions for simplicity, but this would be happening across all 60.

clip_image001[4]

In this case, both tables have been distributed using ROUND ROBIN. There are an even number of records on each distribution, however there is no consistency where records are stored. There are multiple fact records for "A1" but they are held on different distributions.

If I run a simple select statement joining the two tables, each distribution needs to be able to run the query in isolation. Taking the first distribution, we have records with foreign keys for A1 & A2 on the fact, and only records for A1 on the dimension. But more than that - the server doesn't KNOW which records it has.

In order to execute this query, a thing called data movement happens - essentially a full copy of the dimension table is sent to all other distributions. At this point, each distribution knows it has all the records it needs to execute the query.

clip_image002[4]

This happens at query runtime. So when your user is trying to get the results of a select statement, they have to wait while the system reorganises data. This is done for the specific query and dropped after execution - so it'll happen every time the query is run.

Basically - data movement is bad. It's not disastrous, it's unavoidable in many queries, but it's something you should have in mind whenever designing your distributions.

HASH Distribution

This is the alternative - designing our distributions around a specific key. When a record is inserted, a hash is created from the value of the chosen column, and the record is allocated to a distribution accordingly. Then, if another record comes in with the same value for that column, it will be co-located in that same distribution. This is true across tables - so if two tables are distributed on the same column, the related records will be stored on the same distribution.
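A minimal sketch of what that looks like in the table definitions (names and types are illustrative only):

CREATE TABLE dbo.FactSales
(
    CustomerKey INT NOT NULL,
    SalesAmount DECIMAL(12,5) NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT NOT NULL,
    CustomerName VARCHAR(100) NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);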

Let's take a look at our fact and dimensions, assuming they were distributed on the foreign key:

clip_image003[4]

You can immediately see two things - firstly, all records relating to A1 are on the same distribution as the A1 dimension records, and the same goes for B1, B2 etc. Secondly, there is obvious skew - there are more A1 records than on any other distribution! This means that the distribution holding the A1 records will be slower to finish its query than the others.

This is a tiny example - two extra records makes very little difference, but when we're dealing with hundreds of millions of records, this skew can become significant. SQLDW can only return queries once all distributions have finished executing, meaning the whole query is only as fast as your slowest distribution.
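One quick way to check how badly a table is skewed (assuming it has already been loaded, and using a placeholder table name) is to look at the rows held on each distribution:

DBCC PDW_SHOWSPACEUSED("dbo.FactSales");    --returns rows and space used per distribution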

However, when we execute our query this time, let's look at our A1 distribution:

clip_image004[4]

We now know which records this distribution holds for both tables, and we know that this distribution can execute the query without needing data from other distributions. The query can therefore execute immediately, with no data movement steps taking place - we have effectively eliminated a whole step from the query.

Replicated Distribution

In APS/PDW world, there is a third distribution type - REPLICATED. This would create a full copy of the table on all distributions, eliminating the need for many types of data movement. At the time of writing, this is being worked on but has not yet been released, even within preview.

Picking a Distribution key

Based on that knowledge, we’re after the following factors for a good distribution:

  1. Distinct Values - We need at least 60 unique values – but more than this, we want LOTS more. More than 600 at least would be preferable. If we have fewer than 60 values, we would not be using all distributions and would essentially be wasting potential performance!
  2. Even Distribution - We want the values to be evenly spread. If one key value has only 10 records but another has 1,000, that’s going to cause a lot of discrepancy in distribution performance.
  3. Commonly Used in Table Joins - Next, the biggest reason for data movement is joins. Look at the queries your users are going to be issuing most commonly; the keys used in those joins are good candidates for distribution.
  4. Not Commonly Filtered - Finally, if users commonly use your distribution key as a search predicate, this is going to drastically reduce any parallelism. If it happens now and then, it’s not terrible – but if it affects the bulk of queries then it’s going to be a real performance drain.

That's probably the hardest part of designing an Azure SQL Datawarehouse - a lot of the performance and design comes down to knowing how your users will query the system, and this is rarely something we know in advance! Luckily, it's fairly easy to change distributions, we'll come on to that in future posts.

Azure SQLDW–How Does Scaling Work?

We've discussed what Azure SQLDW actually is, so we know it works on scalable MPP technology. But what does that actually mean?

If you're happy with the idea that MPP aligns compute with storage then read on. If not, have a quick skim of my blog here.

Ok, so we're using MPP, we have storage and we apply compute to it. In SQLDW, the magic number is 60. There are always sixty distributions. Whenever you write a record to a table, it will always be written to one of those 60 distributions.

You can think of those distributions however you like - as individual buckets of data, as isolated SQL databases, as magic data-shelves in the sky - as long as you know that there are always sixty, regardless of scale, tier, price, country or any other factor. That's a major difference from the APS/PDW systems that SQLDW evolved from - they would change the number of distributions as you scaled. However, in order to get the dynamic scaling that cloud technologies rely on, they couldn't have data being redistributed every time you scale up or down. So distributions are fixed, permanently, at 60.

You scale the system by adding/removing compute to work against these 60 distributions.

IMPORTANT: When you scale SQLDW, you will kill all running transactions. Scaling effectively restarts the server

How to Scale SQLDW

There are several different ways of scaling SQLDW, these are:

  • Through the Azure Portal
  • Via T-SQL
  • Via PowerShell

The portal is the only one that informs you of the cost when you scale. You should be aware that you will be charged per hour - so if you turn it up for 5 minutes then back down, you'll still be charged for the remainder of the hour.

Firstly, the portal. This is the easiest when you're first getting started, however it cannot be automated and is very manual.

To do so, navigate to your SQLDW blade, click on the "Scale" button. When you select your chosen DWU level, it will tell you how much this amount of compute will cost per hour.

clip_image001

Next, there’s the SQL route. Really easy - simply issue an ALTER DATABASE command against your database and it’ll update the “Service Objective”, which for SQLDW is the DWU.

Note that this will NOT work if you run it against the SQLDW directly. Like other Platform-as-a-Service databases (aka: Azure SQL DB), the server it is allocated to will have a master database, used to control database-level metadata and security settings.

You need to connect directly to the master database to issue a T-SQL scaling command – this isn’t always possible with production systems but it’s worth bearing in mind as a potential method.

ALTER DATABASE [DBNAME] MODIFY (SERVICE_OBJECTIVE = 'DW1000');

Finally there’s PowerShell – either via the Azure Powershell IDE or through an Azure Automation Account – which essentially provides a Platform-as-a-Service Powershell node.

There are a couple of cmdlets available – these are pre-built Powershell functions you can pass parameters to in order to perform standard tasks.

There are three standard cmdlets I would use, these are:


Set-AzureRmSqlDatabase -ResourceGroupName "ResourceGroup" `
    -ServerName "Server" `
    -DatabaseName "Database" `
    -RequestedServiceObjectiveName "DW1000"

This is the same as the cmdlet used to change the service objective of Azure SQLDBs. Simply select your database & server and provide the required DWU level to move to.

The next two are interesting - you can entirely pause the compute associated to your SQLDW. This allows you to save a lot of money in out-of-hours times, maybe turning things off in the evenings and weekends.


Suspend-AzureRmSqlDatabase `
    -ResourceGroupName "ResourceGroup" `
    -ServerName "Server" `
    -DatabaseName "Database"

Resume-AzureRmSqlDatabase `
    -ResourceGroupName "ResourceGroup" `
    -ServerName "Server" `
    -DatabaseName "Database"

In each of these methods, the scaling will take anywhere from 30 seconds to 5 minutes. If you have long running transactions when you issue the scaling command, it will need to roll these back, in these scenarios the scaling action could take a long time!

What happens when you scale SQLDW?

When we change the number of DWUs (that's data warehouse units, similar to the SQLDB concept of DTUs) - we're actually just changing the number of distributions each compute node has to look after.

At 100 DWUs, a single compute node manages all storage:

clip_image002

At 200 DWUs, the number of distributions is only 30 per compute node. This is already a drastic difference in performance.

clip_image003

Again, at 300 DWUs, we add another compute node and associate it to storage. No data is moving but we now have less work for each compute node to do, therefore it completes faster.

clip_image004

That's why we can keep scaling all the way to 6000 DWUs - at that point we are associating each distribution directly with a compute node - there's a one-to-one mapping. We can't scale any further than that as we would begin sharing storage between compute, and that's the SMP problem we're trying to avoid!

It might help to think of this in MPP terms - our 300 DWU example would look like this:

clip_image005

Hopefully that helps you understand what's happening when you hit that "Scale" button - we'll be talking about distributing data next. If you haven't distributed your data properly, all the scaling in the world won't help you very much!

Azure SQLDW–What is it?

Azure SQL DataWarehouse has had a lot of attention recently as a super-scalable SQL Server in Azure, but there's not too much understanding about what it actually is. I could tell you it's the cloud version of PDW/APS, although that requires a fair bit of background knowledge. Technically it's a cloud-scalable T-SQL-based MPP platform… does that clear things up? No? Well, let's break that down - first let's look at what MPP means…

MPP

I like to play this game at the start of Azure SQLDW sessions. I ask who has used an MPP system before, a couple of hands go up. I ask who has used an SMP system, a few hesitant hands are raised. I ask who has used a SQL Server before and suddenly we have a room full of waving hands.

Traditional SQL Server IS an SMP system. SMP stands for Symmetric Multi-Processing - so without really knowing it, we're using SMP systems all the time. I'm hoping I can stop playing the wavy-hands game soon when everyone either knows this or has used SQLDW…

What SMP means is that you have one data store that is shared between all CPUs and a service bus to manage interactions. If you have a particular record in a table, each of the CPUs can access that at any time. If they all try to access it at once, then we need locking and queue management in the middle. Sure, you can be clever with your FileGroup/File structures and have parallel I/O reads, but all that storage is still shared across the CPUs.

clip_image001

Now we can add more and more compute to this solution, but there comes a point where the service bus is working too hard to try and balance everything out. Essentially you get diminishing returns on your hardware investment as it has to manage more complex interactions with your storage.

MPP, or Massively Parallel Processing, on the other hand, separates storage and aligns it to specific compute. What this means is each CPU can only access a certain set of records - but we know that there is no competition from other CPUs to access that storage. The idea being that each compute node does its job in isolation, then we can add the various results together to get our answer.

clip_image002

That's an important point to remember - a single record is only held on ONE of the storage buckets (In most scenarios currently… we'll get onto that in later posts)

The way we scale this is to add more compute nodes and associated buckets. For example, if we had 4TB of data, each of our storage buckets in the above example would be looking after 1.33TB of data. If we add a further compute & storage combination and spread our data out, then each would be holding 1TB of data. We can keep going - at 8 compute & storage combinations, each of our nodes will only be querying 500GB of data and we will see performance increase as a result!

Our scaled-up MPP would look something like this:

clip_image003

The overall storage volume is the same, however it's spread across more compute. So each compute finishes the job quicker, meaning we get our results faster!

Cloud Scalable

That's the MPP part - so what about Cloud Scalable? Well we know that we can make it go faster by adding more compute nodes and spreading the data out across those nodes. But we can take this a step further in Azure - the compute nodes can be allocated to storage rather than redistributing data, so we can add and remove compute nodes dynamically.

Essentially, you take the compute to the storage - it's a fundamental approach in big data to get real scalability. Moving data is expensive and slow, moving compute is quick, efficient and cheap if you have the right hardware setup. This also means that you can take compute away entirely when you're not using it - you'll still pay for the storage, but it's a lot cheaper than having a server you're not using!

Want your server to go faster for the next hour? Double the compute nodes! Not using it overnight? Turn it off! Doing a big data load overnight? Turn the server back on at a high rate, blast through the processing, then pause the compute again!

clip_image004

T-SQL Based

Finally - yes, it is indeed a SQL Server. However, it uses a specific version of the SQL Server engine, evolved from SQL Server PDW edition. This means that a lot of your favourite functionality is still there and works fine, however there are certain feature limitations due to the system's parallel nature. They're slowly going through and adding these features, but we won't ever achieve full engine parity!

So - CTEs, window functions, aggregations, stored procs, views - these are all present and accounted for!

Identity columns, @@Rowcount, table-valued functions are a few which are not.

It's a very familiar environment to anyone who develops in T-SQL, but it's not exactly the same.

There we go. So now we can happily say that Azure SQL DataWarehouse is a cloud-scalable T-SQL-based MPP platform and not be left reeling.

A Guide to Azure SQL DataWarehouse

So you've heard the hype - Azure SQLDW is going to solve all of your problems in one fell swoop… Right? Well… maybe. The system itself is a mix of technologies designed for low-concurrency analytics across huge amounts of relational data. In short, it's a cloud-scalable T-SQL-based MPP platform, with all the benefits and restrictions that performing everything in parallel brings. If your problem can be solved by performing lots of calculations over small subsets of your data before aggregating the results into a whole, this is the technology for you.

However, before you jump right in, be aware that SQLDW is a very different beast to other SQL tools. There are specific concepts you need to be familiar with before building your system, otherwise you're not going to see the promised performance gains and will likely lose faith very quickly!

I'm in the process of documenting these concepts, plus there is a wealth of information available on the Azure SQLDW site. For the next few months, I'll be running through the blog topics below and updating the links accordingly. If there are topics you'd like me to add to the list, please get in touch!

Azure SQLDW Core Concepts:

- What is it?

- How Does Scaling Work?

- Distributions

- Polybase

- Polybase Limitations

- Polybase Design Patterns

- CTAS

- Resource Classes

- Partitioning

Designing ETL (or ELT) In Azure SQLDW

- Row counts

- Statistics

- Surrogate Keys

Performance Tuning Azure SQLDW

- Plan Exploration in SQLDW

- Data Movement Types

- Minimising Data Movement

Managing Azure SQLDW

- Backup & Restore

- Monitoring Distributions

- System Monitoring

- Job Orchestration

- Scaling and Management

- Performance Tuning Queries

Azure SQLDW Architecture

- Presentation Layers

- Data Lake Integrations

A Roundup of MS Data AMP 2017

There was a bit of buzz last night as Microsoft held the first "Data Amp" online conference. It was essentially a chance to make a bit of marketing buzz around some new products, and there was a whole lot of hype associated. But what actually happened?

Well we learned a lot about the next version of SQL Server, several Azure components moved from preview into General Availability and, amidst the noise, some cool new features added to existing ones.

So, to help cut through the hype, here are my top 3 highlights from MS Data Amp:

1. SQL Server 2017

It's official, SQL Server vNext is coming this year and will be called SQL Server 2017. CTP2.0 is now available and is deemed production ready so you can grab a copy and start preparing migrations & new applications.

But should you adopt it early? Is there anything radically different with this version?

Well actually yes, there's quite a lot in there of interest to a BI Developer, Data Scientist or anyone with analytical querying requirements.

We already knew about a few features, some small functions such as STRING_AGG() that will make life a whole lot easier, some huge, such as On-Premise PowerBI Hosting.

The major announcements last night were:

a) PYTHON in SQL Server: Hot on the heels of 2016's R integration, SQL Server 2017 will now support Python in a similar way. This provides a few more integration options and some benefits in terms of performance and optimisation, but mostly, it provides choice and compatibility. You no longer have to choose R if you're going SQL Server, you can port your existing Python across directly.

Will it change the world? Probably not - the majority of data science is currently performed on dedicated platforms, and the ease at which these can be created and torn down in Azure means there is little barrier to entry to continuing to do so. It does, however, make it really easy to try out, and super easy to integrate into existing applications and workflows. I'd disagree with it being called "Built-in AI in SQL Server", but it's certainly a step in that direction.

b) SQL Graph: You can now have Graph Databases inside SQL Server. Essentially you mark tables as edges and nodes, then use specific SQL syntax to query them. If you've got existing Neo4J databases that you would prefer to integrate directly into SQL Server, this is for you. Your edges and nodes also act like normal tables, so you can still query them in the traditional relational way.

Personally, I'm interested in using it as a reporting layer - putting some D3 visualisations over the top and natively displaying clustering, force-directed graphs and the like without and need for transformations, all from your existing SQL Server.

Adatis's very own Terry McCann has put together his introduction to SQL Graph

2. Azure Analysis Services is now Generally Available

This is fantastic news for us, as it is the missing piece that connects together the various parts of the modern Azure warehouse. By having a Platform-as-a-Service semantic layer, we can replace any remaining SQL VMs in our architecture. We can now process data using Data Lake and SQL Datawarehouse, model it in Azure Analysis Services for highly concurrent querying, then expose it via PowerBI, all without a virtual machine in sight.

As a service, it is a little costly - it's currently priced around the same as the SQL VM it is replacing, but it only performs one of the functions. But the benefits of not managing infrastructure, and the ability to pause & resume without re-processing, are going to be very useful going forward.

3. Cognitive APIs now Generally Available

They've been in preview for a while, but we can now use the Face API, Computer Vision API and Content Moderator service in production systems. This allows us to do things like checking for the same person in a bunch of different photos, detecting similarities between images and tagging them with their contents. This is pretty cool, although it can be onerous to do en masse… until now.

The BIG news for me here, is that the Cognitive API technology has been pushed into Data Lake Analytics. This means you can use native functions inside U-SQL to harness these APIs.

What does this mean? Well we can now push petabytes of image files through these APIs in a hugely parallel way, all without having to provision any infrastructure - you could write the code to do this from the browser on your phone as you travel home! This is an immensely powerful, scalable tool that's now available right out of the box! Want to auto-moderate several million image files at once? You can! Want to pull out any image files that contain something that looks like a car - without waiting hours for the result? You can!!

Now, there are other features and announcements that came out of last night, but these are far and away the three that I'm excited about.

What about you?

PowerBI Optimisation P3– Extracting and Source Controlling PowerBI Data Models

Source Control – once seen as “something proper developers do” – has been an integral part of the way business intelligence developers work for a long time now. The very idea of building a report, data model or database without applying some kind of source control actually pains me slightly.

However, there has been a push for “self-serve” reporting tools to strip out anything that looks remotely like a technical barrier for business users - and this includes the ability to properly track changes to code.

We find ourselves in a very familiar situation – versions of PowerBI desktop files are controlled by including version numbers in file names. I’ve seen several copies of “Finance Dashboard v1.2.pbix”. This is obviously dangerous – who’s to say that someone didn’t open up the file, edit it and forget to increment the name. Once a file has been shared, there’s no controlling what changes happen at that point. If this happened to an SSIS package, for example, we would still be able to perform a code comparison. This would highlight differences between the two packages so we could accurately see what caused the changes. This is not currently possible with PBIX files in their entirety.

We can, however, compare the data model behind the file. This allows us to check for changes in business logic, amendments to DAX calculations, additions of new fields etc. If the performance of two PBIX files differs drastically even though they were meant to be the same “version”, then this is a reasonable starting point!

Extracting the Data Model from a PowerBI PBIX File

Firstly, we need to extract the JSON that describes the Tabular Model embedded in the file (technically, this is TMSL, the Tabular Model Scripting Language, but it’s still JSON…)

We can do this by connecting to the model via SSMS. I’ve talked about the steps required to do this here.

So, assuming you have found your temporary SSAS port and connected via SSMS, you should see something like this:

image

As we would with any other Tabular model, you can right-click and script out the database as so:

image

If we do this to a new query window, you’ll see the various JSON objects that describe your PowerBI model:

image

This script contains the details for all tables, attributes, DAX measures etc required for your data model.

Comparing PowerBI Data Models

What if someone has been using a specific version of my PowerBI desktop file, but they’ve modified it and it has stopped working? For a Tabular model, I’d compare the model definition to source control which will automatically highlight any changes.

Now that we can script out our PowerBI model, we can apply the same principles. Say, for example, I make a couple of changes to my sample PowerBI report and want to figure out how it has changed compared to a baseline script I exported previously.

The easiest option is to use a tool like Textpad – here you can compare two text documents and it will highlight any differences it finds between the two. For example, I changed the name of a table and removed a column, the text comparison highlights this change as below:

image

I can now be confident that if someone sends me a PBIX file, I can check to see if there are any data model changes without having to manually eyeball the two side by side. This alone is a huge leap forward in manageability of models.

The next step would be to add this file to an actual Source Control provider, such as Visual Studio Team Services. This tool is free for the first 5 users and can be used with Visual Studio 2015 Community Edition – which is also free!

Essentially you would add this exported script to your source control directory each time you updated the model. By checking in your new model, you can compare previous versions, much like with the TextPad editor above.

Final Thoughts

In the end, this isn’t real, true source control. If you make a mistake, you can only view what the previous configuration was - you cannot roll back code directly into your PowerBI model. It is, however, a step towards managing PowerBI with a bit more discipline and rigour.

I don’t see this as a huge drawback as rumours on the wind are hinting at larger steps in this direction coming with future releases. Let’s hope we’re not having to work around these problems for much longer!