Simon Whiteley's Blog

Azure SQLDW–Distribution

We've spoken about the MPP nature of SQLDW, and how it scales by changing the number of compute nodes associated to a fixed number of storage distributions. But the way that you organise data across these distributions can drastically change how the system performs - if it's done badly, one of your compute nodes will be doing all the work and your performance will be very poor.

For now, there are two main methods of distribution. We can focus on spreading data evenly across all distributions to ensure we get predictable performance or we can distribute on a specific column to optimise for specific query performance. These distribution types are known as ROUND ROBIN and HASH distributions respectively.

Round Robin Distribution

Each record that is inserted goes onto the next available distribution. This guarantees that you will have a smooth, even distribution of data, but it means you have no way of telling which data is on which distribution. This isn't always a problem!
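
For illustration, this is roughly how a round-robin table is declared in SQLDW - the table and column names here are invented purely for the example:

CREATE TABLE dbo.FactSales
(
    SaleId      BIGINT        NOT NULL,
    CustomerKey INT           NOT NULL,
    SaleAmount  DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,        -- each insert simply goes to the next available distribution
    CLUSTERED COLUMNSTORE INDEX
);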

If I wanted to perform a count of records, grouped by a particular field, I can perform this on a round-robin table. Each distribution will run the query in parallel and return its grouped results. The results can simply be added together as a second part of the query, and adding together 60 smaller datasets shouldn't be a large overhead. For this kind of single-table aggregation, round-robin distribution is perfectly adequate!

However, issues arise when we have multiple tables in our query and need to join them. Let's take a very simple join between a fact table and a dimension. I've shown 6 distributions for simplicity, but this would be happening across all 60.

[Image: fact and dimension records spread evenly across six distributions using round robin]

In this case, both tables have been distributed using ROUND ROBIN. There is an even number of records on each distribution, however there is no consistency over where records are stored. There are multiple fact records for "A1", but they are held on different distributions.

If I run a simple select statement joining the two tables, each distribution needs to be able to run the query in isolation. Taking the first distribution, we have records with foreign keys for A1 & A2 on the fact, and only records for A1 on the dimension. But more than that - the server doesn't KNOW which records it has.

In order to execute this query, a step called data movement happens, which essentially copies the full dimension table to every other distribution. At this point, each distribution knows it has all the records needed in order to execute the query.

[Image: the dimension table being copied to every distribution by a data movement operation]

This happens at query runtime. So when your user is trying to get the results of a select statement, they have to wait while the system reorganises data. This is done for the specific query and dropped after execution - so it'll happen every time the query is run.

Basically - data movement is bad. It's not disastrous, it's unavoidable in many queries, but it's something you should have in mind whenever designing your distributions.

HASH Distribution

This is the alternative - designing our distributions around a specific key. When a record is inserted, a hash is created from the value in the chosen distribution column, and the record is allocated to a distribution accordingly. If another record then comes in with the same value for that column, it will be co-located on that same distribution. This is true across tables - so if two tables are distributed on the same column, the related records will be stored on the same distribution.
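
To make that concrete, here's the same illustrative table declared with a hash distribution - again, the names are just examples:

CREATE TABLE dbo.FactSales
(
    SaleId      BIGINT        NOT NULL,
    CustomerKey INT           NOT NULL,
    SaleAmount  DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),  -- rows sharing a CustomerKey land on the same distribution
    CLUSTERED COLUMNSTORE INDEX
);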

Let's take a look at our fact and dimensions, assuming they were distributed on the foreign key:

[Image: fact and dimension records co-located by hash-distributing both tables on the foreign key]

You can immediately see two things - firstly, all records relating to A1 are on the same distribution as the A1 dimension records, and the same goes for B1, B2 etc. Secondly, there is obvious skew - there are more A1 records than on any other distribution! This means that the distribution holding the A1 records will be slower to finish its query than the others.

This is a tiny example - two extra records make very little difference, but when we're dealing with hundreds of millions of records, this skew can become significant. SQLDW can only return queries once all distributions have finished executing, meaning the whole query is only as fast as your slowest distribution.

However, when we execute our query this time, let's look at our A1 distribution:

[Image: the A1 distribution holding every fact and dimension record needed to satisfy the join locally]

We now know which records this distribution holds for both tables, and we know that this distribution can execute the query without needing data from any other distribution. The query can therefore execute immediately with no data movement steps taking place - we have effectively eliminated a whole step from the query.

Replicated Distribution

In APS/PDW world, there is a third distribution type - REPLICATED. This would create a full copy of the table on all distributions, eliminating the need for many types of data movement. At the time of writing, this is being worked on but has not yet been released, even within preview.

Picking a Distribution key

Based on that knowledge, we're after the following factors for a good distribution key:

  1. Distinct Values - We need at least 60 unique values – but more than this, we want LOTS more; at least 600 would be preferable. If we have fewer than 60 values, we would not be using all distributions and would essentially be wasting potential performance!
  2. Even Distribution - We want the values to be evenly spread. If one of our distribution keys has 10 records against it, but another has 1000, that's going to cause a lot of discrepancy in distribution performance (see the quick check sketched after this list).
  3. Commonly Used in Table Joins - Next, the biggest reason for data movement to happen is where joins are occurring. Look at the queries your users are going to be issuing commonly; the keys used in those joins are going to be candidates for distribution.
  4. Not Commonly Filtered - Finally, if users commonly use your distribution key as a search predicate, this is going to drastically reduce any parallelism. If it happens now and then, it's not terrible – but if it affects the bulk of queries then it's going to be a real performance drain.
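
As a rough starting point, you can sanity-check a candidate key with a couple of simple queries - the table and column names below are just placeholders for your own fact table and candidate column:

-- How many distinct values does the candidate key have? We want well over 60.
SELECT COUNT(DISTINCT CustomerKey) AS DistinctValues
FROM dbo.FactSales;

-- How evenly are rows spread across those values? Big outliers suggest skew.
SELECT CustomerKey, COUNT(*) AS RowsPerValue
FROM dbo.FactSales
GROUP BY CustomerKey
ORDER BY RowsPerValue DESC;

-- Once the table is actually hash distributed, this shows the rows held per distribution.
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');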

That's probably the hardest part of designing an Azure SQL Datawarehouse - a lot of the performance and design comes down to knowing how your users will query the system, and this is rarely something we know in advance! Luckily, it's fairly easy to change distributions - we'll come on to that in future posts.

Azure SQLDW–How Does Scaling Work?

We've discussed what Azure SQLDW actually is, so we know it works on scalable MPP technology. But what does that actually mean?

If you're happy with the idea that MPP aligns compute with storage then read on. If not, have a quick skim of my blog here.

Ok, so we're using MPP, we have storage and we apply compute to it. In SQLDW, the magic number is 60. There are always sixty distributions. Whenever you write a record to a table, it will always be written to one of those 60 distributions.

You can think of those distributions however you like - as individual buckets of data, as isolated SQL databases, as magic data-shelves in the sky - as long as you know that there are always sixty, regardless of scale, tier, price, country or any other factor. That's a major difference from the APS/PDW systems that SQLDW evolved from - they would change the number of distributions as you scaled. However, in order to get the dynamic scaling that cloud technologies rely on, they couldn't have data being redistributed every time you scale up or down. So distributions are fixed, permanently, at 60.

You scale the system by adding/removing compute to work against these 60 distributions.
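
If you want to see this for yourself, the system views expose both the fixed distributions and the compute nodes currently attached - a quick sketch (the filter assumes the standard node type naming):

-- Always returns 60 rows, whatever the DWU level
SELECT COUNT(*) AS DistributionCount
FROM sys.pdw_distributions;

-- The compute nodes change as you scale up and down
SELECT pdw_node_id, type, name
FROM sys.dm_pdw_nodes
WHERE type = 'COMPUTE';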

IMPORTANT: When you scale SQLDW, you will kill all running transactions. Scaling effectively restarts the server.

How to Scale SQLDW

There are several different ways of scaling SQLDW:

  • Through the Azure Portal
  • Via T-SQL
  • Via PowerShell

The portal is the only one that informs you of the cost when you scale. You should be aware that you will be charged per hour - so if you turn it up for 5 minutes then back down, you'll still be charged for the remainder of the hour.

Firstly, the portal. This is the easiest when you're first getting started, however it cannot be automated and is very manual.

To do so, navigate to your SQLDW blade, click on the "Scale" button. When you select your chosen DWU level, it will tell you how much this amount of compute will cost per hour.

[Image: the Scale blade in the Azure Portal showing the DWU slider and hourly cost]

Next, there's the SQL route. Really easy - simply issue an ALTER DATABASE command against your database and it'll update the "Service Objective", which for SQLDW is the DWU.

Note that this will NOT work if you run it against the SQLDW directly. Like other Platform-as-a-Service databases (such as Azure SQL DB), the server it is allocated to will have a master database, used to control database-level metadata and security settings.

You need to connect directly to the master database to issue a T-SQL scaling command – this isn’t always possible with production systems but it’s worth bearing in mind as a potential method.

ALTER DATABASE [DBNAME] MODIFY (SERVICE_OBJECTIVE = 'DW1000');
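
While you're connected to master, you can also check the current service objective of each database - a quick sketch using the standard catalog views:

SELECT  db.name,
        ds.service_objective
FROM    sys.database_service_objectives AS ds
JOIN    sys.databases AS db
        ON db.database_id = ds.database_id;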

Finally there’s PowerShell – either via the Azure Powershell IDE or through an Azure Automation Account – which essentially provides a Platform-as-a-Service Powershell node.

There are a couple of cmdlets available – these are pre-built Powershell functions you can pass parameters to in order to perform standard tasks.

There are three standard cmdlets I would use:


Set-AzureRmSqlDatabase -ResourceGroupName "ResourceGroup" `
    -ServerName "Server" `
    -DatabaseName "Database" `
    -RequestedServiceObjectiveName "DW1000"

This is the same as the cmdlet used to change the service objective of Azure SQLDBs. Simply select your database & server and provide the required DWU level to move to.

The next two are interesting - you can entirely pause the compute associated to your SQLDW. This allows you to save a lot of money in out-of-hours times, maybe turning things off in the evenings and weekends.


Suspend-AzureRmSqlDatabase -ResourceGroupName "ResourceGroup" `
    -ServerName "Server" `
    -DatabaseName "Database"

Resume-AzureRmSqlDatabase -ResourceGroupName "ResourceGroup" `
    -ServerName "Server" `
    -DatabaseName "Database"

In each of these methods, the scaling will take anywhere from 30 seconds to 5 minutes. If you have long-running transactions when you issue the scaling command, it will need to roll these back; in these scenarios the scaling action could take a long time!
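
If you want to keep an eye on a scaling operation while it runs, you can poll the operation status DMV from the master database - something along these lines:

SELECT  major_resource_id,
        operation,
        state_desc,
        percent_complete,
        start_time
FROM    sys.dm_operation_status
WHERE   resource_type_desc = 'Database'
ORDER BY start_time DESC;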

What happens when you scale SQLDW?

When we change the number of DWUs (that's data warehouse units, similar to the SQLDB concept of DTUs) - we're actually just changing the number of distributions each compute node has to look after.

At 100 DWUs, a single compute node manages all storage:

[Image: 100 DWUs - a single compute node managing all 60 distributions]

At 200 DWUs, we have two compute nodes, so each one only has to manage 30 distributions. This is already a drastic difference in performance.

[Image: 200 DWUs - two compute nodes, each managing 30 distributions]

Again, at 300 DWUs, we add another compute node and associate it to storage. No data is moving but we now have less work for each compute node to do, therefore it completes faster.

[Image: 300 DWUs - three compute nodes, each managing 20 distributions]

That's why we can keep scaling all the way to 6000 DWUs - at that point we are associating each distribution directly with a compute node - there's a one-to-one mapping. We can't scale any further than that as we would begin sharing storage between compute, and that's the SMP problem we're trying to avoid!

It might help to think of this in MPP terms - our 300 DWU example would look like this:

[Image: the 300 DWU example drawn as an MPP diagram, with compute nodes aligned to their distributions]

Hopefully that helps understand what's happening when you hit that "Scale" button - we'll be talking about distributing data next. If you haven't distributed your data properly, all the scaling won't help you very much!

Azure SQLDW–What is it?

Azure SQL DataWarehouse has had a lot of attention recently as a super-scalable SQL Server in Azure, but there's not too much understanding about what it actually is. I could tell you it's the cloud version of PDW/APS, although that requires a fair bit of background knowledge. Technically it's a cloud-scalable T-SQL-based MPP platform… does that clear things up? No? Well let's break that down - first let's look at what MPP means…

MPP

I like to play this game at the start of Azure SQLDW sessions. I ask who has used an MPP system before, a couple of hands go up. I ask who has used an SMP system, a few hesitant hands are raised. I ask who has used a SQL Server before and suddenly we have a room full of waving hands.

Traditional SQL Server IS an SMP system. SMP stands for Symmetric Multi-Processing - so without really knowing it, we're using SMP systems all the time. I'm hoping I can stop playing the wavy-hands game soon when everyone either knows this or has used SQLDW…

What SMP means is that you have one data store that is shared between all CPUs and a service bus to manage interactions. If you have a particular record in a table, each of the CPUs can access that at any time. If they all try to access it at once, then we need locking and queue management in the middle. Sure, you can be clever with your FileGroup/File structures and have parallel I/O reads, but all that storage is still shared across the CPUs.

[Image: SMP architecture - several CPUs sharing one storage pool via a service bus]

Now we can add more and more compute to this solution, but there comes a point where the service bus is working too hard to try and balance everything out. Essentially you get diminishing returns on your hardware investment as it has to manage more complex interactions with your storage.

MPP, or Massively Parallel Processing, on the other hand, separates storage and aligns it to specific compute. What this means is each CPU can only access a certain set of records - but we know that there is no competition from other CPUs to access that storage. The idea being that each compute node does its job in isolation, then we can add the various results together to get our answer.

[Image: MPP architecture - each compute node aligned to its own bucket of storage]

That's an important point to remember - a single record is only held on ONE of the storage buckets (In most scenarios currently… we'll get onto that in later posts)

The way we scale this is to add more compute nodes and associated buckets. For example, if we had 4Tb of data, each of our storage buckets in the above example would be looking after 1.33Tb of data. If we add a further compute & storage combination and spread our data out, then each would be holding 1Tb of data. We can keep going - at 8 compute & storage combinations, each of our nodes will only be querying 500Gb of data and we will see performance increase as a result!

Our scaled-up MPP would look something like this:

[Image: the scaled-out MPP system with additional compute and storage nodes]

The overall storage volume is the same, however it's spread across more compute. So each compute finishes the job quicker, meaning we get our results faster!

Cloud Scalable

That's the MPP part - so what about Cloud Scalable? Well we know that we can make it go faster by adding more compute nodes and spreading the data out across those nodes. But we can take this a step further in Azure - the compute nodes can be allocated to storage rather than redistributing data, so we can add and remove compute nodes dynamically.

Essentially, you take the compute to the storage - it's a fundamental approach in big data to get real scalability. Moving data is expensive and slow, moving compute is quick, efficient and cheap if you have the right hardware setup. This also means that you can take compute away entirely when you're not using it - you'll still pay for the storage, but it's a lot cheaper than having a server you're not using!

Want your server to go faster for the next hour? Double the compute nodes! Not using it overnight? Turn it off! Doing a big data load over night? Turn the server back on at a high rate, blast through the processing then pause the compute again!

[Image: compute being scaled up, down or paused entirely against the same underlying storage]

T-SQL Based

Finally - yes, it is indeed a SQL Server. However, it uses a specific version of the SQL Server engine, evolved from SQL Server PDW edition. This means that a lot of your favourite functionality is still there and works fine, however there are certain feature limitations due to the system's parallel nature. They're slowly going through and adding these features, but we won't ever achieve full engine parity!

So - CTEs, window functions, aggregations, stored procs, views - these are all present and accounted for!

Identity columns, @@Rowcount, table-valued functions are a few which are not.
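
As an example of working around one of those gaps, a common pattern for the missing identity support is to generate surrogate keys with ROW_NUMBER() inside a CTAS - the staging table and columns here are purely illustrative:

CREATE TABLE dbo.DimCustomer
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS CustomerKey,  -- stands in for IDENTITY
        src.CustomerCode,
        src.CustomerName
FROM    stg.Customer AS src;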

It's a very familiar environment to anyone who develops in T-SQL, but it's not exactly the same.

There we go. So now we can happily say that Azure SQL DataWarehouse is a cloud-scalable T-SQL-based MPP platform and not be left reeling.

A Guide to Azure SQL DataWarehouse

So you've heard the hype - the Azure SQL DW is going to solve all of your problems in one fell swoop… Right? Well… maybe. The system itself is a mix of technologies designed for low concurrency analytics across huge amounts of relational data. In short, it's a cloud-scalable T-SQL-based MPP platform, with all the benefits and restrictions that performing everything in parallel brings. If your problem can be solved by performing lots of calculations over small subsets of your data before aggregating the results into a whole, this is the technology for you.

However, before you jump right in, be aware that SQLDW is a very different beast to other SQL tools. There are specific concepts you need to be familiar with before building your system, otherwise you're not going to see the promised performance gains and will likely lose faith very quickly!

I'm in the process of documenting these concepts, plus there is a wealth of information available on the Azure SQLDW site. For the next few months, I'll be running through the blog topics below and updating the links accordingly. If there are topics you'd like me to add to the list, please get in touch!

Azure SQLDW Core Concepts:

- What is it?

- How Does Scaling Work?

- Distributions

- Polybase

- CTAS

- Resource Classes

- Partitioning

Designing ETL (or ELT) In Azure SQLDW

- Row counts

- Statistics

- Surrogate Keys

Performance Tuning Azure SQLDW

- Plan Exploration in SQLDW

- Data Movement Types

- Minimising Data Movement

Managing Azure SQLDW

- Backup & Restore

- Monitoring Distributions

- System Monitoring

- Job Orchestration

- Scaling and Management

- Performance Tuning Queries

Azure SQLDW Architecture

- Presentation Layers

- Data Lake Integrations

A Roundup of MS Data AMP 2017

There was a bit of buzz last night as Microsoft held the first "Data Amp" online conference. It was essentially a chance to make a bit of marketing buzz around some new products, and there was a whole lot of hype associated. But what actually happened?

Well we learned a lot about the next version of SQL Server, several Azure components moved from preview into General Availability and, amidst the noise, some cool new features were added to existing services.

So, to help cut through the hype, here are my top 3 highlights from MS Data Amp:

1. SQL Server 2017

It's official, SQL Server vNext is coming this year and will be called SQL Server 2017. CTP2.0 is now available and is deemed production ready so you can grab a copy and start preparing migrations & new applications.

But should you adopt it early? Is there anything radically different with this version?

Well actually yes, there's quite a lot in there of interest to a BI Developer, Data Scientist or anyone with analytical querying requirements.

We already knew about a few features - some small, such as STRING_AGG(), which will make life a whole lot easier, and some huge, such as On-Premise PowerBI Hosting.

The major announcements last night were:

a) PYTHON in SQL Server: Hot on the heels of 2016's R integration, SQL Server 2017 will now support Python in a similar way. This provides a few more integration options and some benefits in terms of performance and optimisation, but mostly, it provides choice and compatibility. You no longer have to choose R if you're going SQL Server; you can port your existing Python across directly.
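
For the curious, the integration works the same way as the R version - you wrap the script in sp_execute_external_script (with 'external scripts enabled' switched on via sp_configure). The table and columns below are invented for the example:

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
# InputDataSet arrives as a pandas DataFrame; whatever is assigned to OutputDataSet is returned
OutputDataSet = InputDataSet.groupby("Category", as_index=False).sum()
',
    @input_data_1 = N'SELECT Category, SalesAmount FROM dbo.Sales'
WITH RESULT SETS ((Category NVARCHAR(50), SalesAmount DECIMAL(18,2)));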

Will it change the world? Probably not - the majority of data science is currently performed on dedicated platforms, and the ease at which these can be created and torn down in Azure means there is little barrier to entry to continuing to do so. It does, however, make it really easy to try out, and super easy to integrate into existing applications and workflows. I'd disagree with it being called "Built-in AI in SQL Server", but it's certainly a step in that direction.

b) SQL Graph: You can now have Graph Databases inside SQL Server. Essentially you mark tables as edges and nodes, then use specific SQL syntax to query them. If you've got existing Neo4J databases that you would prefer to integrate directly into SQL Server, this is for you. Your edges and nodes also act like normal tables, so you can still query them in the traditional relational way.
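
The syntax is pleasingly simple - here's a minimal sketch with made-up node and edge tables:

CREATE TABLE dbo.Person
(
    PersonId   INT PRIMARY KEY,
    PersonName NVARCHAR(100)
) AS NODE;

CREATE TABLE dbo.Knows AS EDGE;

-- Find everyone that Alice knows
SELECT  p2.PersonName
FROM    dbo.Person AS p1,
        dbo.Knows  AS k,
        dbo.Person AS p2
WHERE   MATCH(p1-(k)->p2)
  AND   p1.PersonName = 'Alice';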

Personally, I'm interested in using it as a reporting layer - putting some D3 visualisations over the top and natively displaying clustering, force-directed graphs and the like without any need for transformations, all from your existing SQL Server.

Adatis's very own Terry McCann has put together his introduction to SQL Graph.

2. Azure Analysis Services is now Generally Available

This is fantastic news for us, as it is the missing piece that connects together the various parts of the modern Azure warehouse. By having a Platform-as-a-Service semantic layer, we can replace any remaining SQL VMs in our architecture. We can now process data using Data Lake and SQL Datawarehouse, model it in Azure Analysis Services for highly concurrent querying, then expose it via PowerBI, all without a virtual machine in sight.

As a service, it is a little costly - it's currently priced around the same as the SQL VM it is replacing, but it only performs one of the functions. But the benefits of not managing infrastructure, and the ability to pause & resume without re-processing, are going to be very useful going forward.

3. Cognitive APIs now Generally Available

They've been in preview for a while, but we can now use the Face API, Computer Vision API and Content Moderator service in production systems. This allows us to do things like checking for the same person in a bunch of different photos, detecting similarities between images and tagging them with certain contents. This is pretty cool, although it can be onerous to do en masse… until now.

The BIG news for me here, is that the Cognitive API technology has been pushed into Data Lake Analytics. This means you can use native functions inside U-SQL to harness these APIs.

What does this mean? Well we can now push petabytes of image files through these APIs in a hugely parallel way, all without having to provision any infrastructure - you could write the code to do this from the browser on your phone as you travel home! This is an immensely powerful, scalable tool that's now available right out of the box! Want to auto-moderate several million image files at once? You can! Want to pull out any image files that contain something that looks like a car - without waiting hours for the result? You can!!

Now, there are other features and announcements that came out of last night, but these are far and away the three that I'm excited about.

What about you?

PowerBI Optimisation P3– Extracting and Source Controlling PowerBI Data Models

Source Control – once seen as “something proper developers do” – has been an integral part of the way business intelligence developers work for a long time now. The very idea of building a report, data model or database without applying some kind of source control actually pains me slightly.

However, there has been a push for "Self-Serve" reporting tools to strip out anything that looks remotely like a technical barrier for business users - this includes the ability to properly track changes to code.

We find ourselves in a very familiar situation – versions of PowerBI desktop files are controlled by including version numbers in file names. I've seen several copies of "Finance Dashboard v1.2.pbix". This is obviously dangerous – who's to say that someone didn't open up the file, edit it and forget to increment the name? Once a file has been shared, there's no controlling what changes happen at that point. If this happened to an SSIS package, for example, we would still be able to perform a code comparison. This would highlight differences between the two packages so we could accurately see what caused the changes. This is not currently possible with PBIX files in their entirety.

We can, however, compare the data model behind the file. This allows us to check for changes in business logic, amendments to DAX calculations, additions of new fields etc. If the performance of two PBIX files differs drastically even though they were meant to be the same "version", then this is a reasonable starting point!

Extracting the Data Model from a PowerBI PBIX File

Firstly, we need to extract the JSON that describes the Tabular model embedded in the file (technically, this is TMSL, the Tabular Model Scripting Language, but it's still JSON…)

We can do this by connecting to the model via SSMS. I’ve talked about the steps required to do this here.

So, assuming you have found your temporary SSAS port and connected via SSMS, you should see something like this:

[Image: the PowerBI model shown in the SSMS Object Explorer]

As with any other Tabular model, you can right-click and script out the database like so:

[Image: right-clicking the database in SSMS and scripting it out to a new query window]

If we do this to a new query window, you’ll see the various JSON objects that describe your PowerBI model:

[Image: the scripted TMSL/JSON objects describing the PowerBI model]

This script contains the details for all tables, attributes, DAX measures etc required for your data model.

Comparing PowerBI Data Models

What if someone has been using a specific version of my PowerBI desktop file, but they’ve modified it and it has stopped working? For a Tabular model, I’d compare the model definition to source control which will automatically highlight any changes.

Now that we can script out our PowerBI model, we can apply the same principles. Say, for example, I make a couple of changes to my sample PowerBI report and want to figure out how it has changed compared to a baseline script I exported previously.

The easiest option is to use a tool like TextPad – here you can compare two text documents and it will highlight any differences it finds between the two. For example, I changed the name of a table and removed a column; the text comparison highlights these changes as below:

[Image: a text comparison highlighting the renamed table and removed column]

I can now be confident that if someone sends me a PBIX file, I can check to see if there are any data model changes without having to manually eyeball the two side by side. This alone is a huge leap forward in manageability of models.

The next step would be to add this file to an actual Source Control provider, such as Visual Studio Team Services. This tool is free for the first 5 users and can be used with Visual Studio 2015 Community Edition – which is also free!

Essentially you would add this exported script to your source control directory each time you updated the model. By checking in your new model, you can compare previous versions, much like with the TextPad editor above.

Final Thoughts

In the end, this isn't real, true Source Control. If you make a mistake, you can only view what the previous configuration was; you cannot roll back code directly into your PowerBI model. It is, however, a step towards managing PowerBI with a bit more discipline and rigour.

I don’t see this as a huge drawback as rumours on the wind are hinting at larger steps in this direction coming with future releases. Let’s hope we’re not having to work around these problems for much longer!

 

 

PowerBI Optimisation P2–What’s using all my memory?

If you're a regular user of PowerBI, you're probably aware of the size limitations around datasets and it's very likely you've hit them more than once whilst writing reports on top of large datasets. It's difficult to see where size savings can be made directly through PowerBI, but we can use traditional tabular optimisation techniques to help us!

For those not in the know, a single dataset can be up to 1Gb in size, with Excel files limited to 250Mb. Each user also has a storage limit as follows:

  • Free users have a maximum 1 GB data capacity.
  • Power BI Pro users have a maximum 10 GB data capacity.
  • Pro users can create groups, with a maximum 10 GB data capacity each.

For more information about the limits themselves and how to view your current usage, there's a PowerBI blog about it here: https://powerbi.microsoft.com/en-us/documentation/powerbi-admin-manage-your-data-storage-in-power-bi/

But what if you're hitting that 1Gb data limit? There's very little within PowerBI itself to help you understand which tables are the largest, where you could make some savings, or generally anything about your model itself. The answer is to connect to the model via SSMS and take advantage of the Tabular system views, as described here.

What determines Tabular model size?

It's worth discussing this briefly before going into the details. Put very simply, the XVelocity engine used by the tabular model will hold more data if there are more unique values for a column. The key to avoiding large models is, therefore, to avoid columns with huge numbers of distinct values. Text fields will generally be pretty bad for this, although there are common design patterns to avoid the worst offenders.

A simple example is to look at a DateTime column – this combination of date and time means that each minute of each day is a unique value. Even if we ignore seconds, we're adding 1,440 new, distinct values for every day within the system.

If we split this into two fields, a date and a time, this problem goes away. Each new date adds just a single record, whilst we will never see any new hour and minute combinations, so that's a controllable field.
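
If you're shaping the data in SQL before it reaches PowerBI, that split might look something like this (the column and table names are just an example):

SELECT  CAST(OrderTimestamp AS date)    AS OrderDate,  -- one new distinct value per day
        CAST(OrderTimestamp AS time(0)) AS OrderTime   -- a bounded set of values, reused every day
FROM    dbo.Sales;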

There are a few techniques to avoid these problems if you find them, I’d advise heading over to Russo & Ferrari for some general tips here and some more detailed techniques here.

Accessing Memory Usage Data

So - following the above instructions, connect to your data model and open a new DMX query:

[Image: opening a new DMX query against the connected model]

Here you can use SQL syntax to query several DMVs behind the model - not all of them will be relevant in the cut-down tabular instance that PowerBI uses but there is one in particular that will help us manage our model size - DISCOVER_OBJECT_MEMORY_USAGE.

Admittedly, on its own this is pretty incomprehensible. We can filter down the results slightly into something that makes a little sense, but you'll generally get a big list of model entities with numbers against them – OK as a starter but not great as an actual model optimisation tool.
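
As a rough starting point, the kind of filtering I mean looks something like the below (the SSAS DMVs only accept a restricted SQL dialect, so keep it simple):

SELECT  [OBJECT_PARENT_PATH],
        [OBJECT_ID],
        [OBJECT_MEMORY_NONSHRINKABLE]
FROM    $SYSTEM.DISCOVER_OBJECT_MEMORY_USAGE
WHERE   [OBJECT_MEMORY_NONSHRINKABLE] > 0
ORDER BY [OBJECT_MEMORY_NONSHRINKABLE] DESC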

[Image: the filtered DISCOVER_OBJECT_MEMORY_USAGE results - a long list of model entities and memory figures]

Stopping here we would at least have a hit-list of the worst-offending columns and we can use this to start tackling our model. But there are much better ways to approach this problem!

Tabular Memory Reports

There are several free tools made available within the SSAS community for people to analyse their current SSAS memory usage. These tools simply query this same data but apply a bit of data modelling and make the data much more accessible.

For straight tabular, I would tend to use Kasper de Jonge's old Excel spreadsheet, which pulls in data quite reliably; however, there is an updated PowerBI model found here.

However, this doesn’t play nicely with the PowerBI flavour of tabular just yet, so I would advise using the SQLBI.com Vertipaq Analyser.

Following their instructions and pointing it at my temporary tabular instance, we can refresh successfully and use their categorisations to explore the model. I’ve added some conditional formatting to help see where the issues are.

I can see, for example, which of the tables in my model are the worst offenders, and what’s causing it:

[Image: the Vertipaq Analyser table breakdown with conditional formatting applied]

Interestingly the Customer dimension is pretty huge in my example. It has a lot less data than my fact but the dictionaries required are pretty hefty. Dictionaries are built using string lookups and are heavily affected by high volumes of unique values – so I can presume I’ve got some pretty big text strings in this dimension.

Looking at the Column breakdown, I can see where the offenders are:

[Image: the Vertipaq Analyser column breakdown]

This tells a slightly different story – my main offenders are from one of the hidden date dimension tables (a sign that relying on PowerBI's inbuilt date functionality can be a memory drain) and the Sales Order Number – a unique identifier for my fact, which is obviously going to have a large number of distinct values.

The other columns we can do more about – email address is the next offender. We can assume that each of the 18,000 customers will have a unique email address. However, it's very rare that we would want to do analysis on the email address specifically, so this is a good candidate to remove from the model. At the very least, we could consider keeping only the domain, which will yield far fewer unique values.

 

Hopefully the above will help you move forward in reducing your PowerBI data model size – I’ll be posting about Performance Analysis & Source Control over the next couple of days.

PowerBI Optimisation 1 – Connecting Via Management Studio

I recently gave a talk to the London PowerBI UserGroup and I kicked things off with a confession - "I don't do much report building in PowerBI". Perhaps an odd way to qualify myself to speak to that particular audience. But I am, however, a cloud solution architect - I spend my time designing large scalable cloud systems to process vast amounts of data and PowerBI is a common tool used on top of these systems.

Why then, do we accept the lack of controls available within PowerBI? Given any other end-user system I'd want to know about performance bottlenecks, about data model efficiency and, more than anything, I'd want it in source control.

First and foremost, the talk is available here.

The key to it all is realising that PowerBI Desktop, when running, starts a SQL Server Analysis Services process in the background. It doesn't just use the same engine as Tabular, it literally runs Tabular in the background without telling you.

Open up a PowerBI Desktop file and, after you've seen the "initialising model…" window, you'll see this process in the background - one for each PBID session.

[Image: the background Analysis Services process started by PowerBI Desktop]

So - if the model is using Tabular in the background, we must be able to actually connect to the model!

First - Find your Temporary SSAS Port

There are two straightforward ways we can achieve this:

1. By far the easiest, is to open up DaxStudio if you have it installed.

When you open DaxStudio, it gives you a Connect window, which lists all of the PowerBI processes you have running in the background, as well as any Tabular services:

[Image: the DaxStudio connect window listing the running PowerBI models and Tabular servers]

When you connect to a PBI file here, you'll see the Port listed

[Image: the connection details in DaxStudio showing the port number]

In this case, my port is 5524 - be aware that this will change every time you open PowerBI Desktop, so you can't hardcode anything looking for your "powerbi port".

2. Alternatively, you can find the "msmdsrv.port.txt" file related to your specific instance.

Take a look in your user appdata folder, you should find a Microsoft/Power BI Desktop/ folder with some analysis services details:

C:\Users\<YourUser>\AppData\Local\Microsoft\Power BI Desktop\AnalysisServicesWorkspaces\

You'll see an instance for each of your PBI Desktop instances, I've only got one at the moment:

[Image: the AnalysisServicesWorkspaces folder containing one workspace instance]

Inside this folder, in another folder called "Data", you'll find the file we're looking for:

[Image: the Data folder containing the msmdsrv.port.txt file]

Opening this file, we see:

[Image: the contents of msmdsrv.port.txt showing the port number]

Pretty straightforward, and no DAX required. Obviously if you have multiple instances, you'll need to figure out which of these relates to the instance you're after.

Connect via SSMS

Now that we know our port, we can simply open up Management Studio, connect to Analysis Services and enter "localhost:" followed by the port number from earlier.

[Image: the SSMS connection dialogue with localhost and the port number as the server name]

 

Once connected, you'll see a model connection - each PBIX file will have a GUID for this instance, but you can drill down and see the objects underneath, exactly as you would with a Tabular model:

[Image: the PBIX model objects shown under a GUID-named database in SSMS]

You can now write queries, browse the model and basically treat it as a Tabular instance. The Database itself will use a generated GUID, and several internal tables will do the same - you can see above that a hidden data table has been created for every datekey included in my model.
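
For example, a quick DMV query against the model (PowerBI models are compatibility level 1200+, so the TMSCHEMA views should be available) lists the tables it contains, including those hidden date tables:

SELECT  [Name],
        [IsHidden]
FROM    $SYSTEM.TMSCHEMA_TABLES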

We'll discuss the applications of this in my next post - namely how this unlocks performance tuning, monitoring and source control.

The Cloud BI Transition: Warehouse of the Future Talk

While we wait for the SQLBits XV videos to be published, I actually gave the same talk back in March for the London PASS Chapter MeetUp. If you were unfortunate enough to miss both talks, or happened to steal a quick nap mid-way through, the recording can be found here.

I wrote up a quick accompanying blog earlier this week, which includes a link to the slides and some additional thoughts.

As usual, if anyone has any questions following the talk, I’d love to hear from you in the comments or directly on twitter via @MrSiWhiteley.

The Cloud BI Transition: ETL & Data Movement in the Cloud

The cloud warehouse is a different beast to the on-premise systems most of us have been using for the past decade. These times, they are a changing. I’ve talked about the key concepts that we, at Adatis, see as core to this new world here, but how does this work in practice? To start us off, the core of any BI Developer’s skillset – ETL.

The main two players are Azure DataWarehouse and Azure Data Lake. I recently gave a talk to SQLBits to introduce some key concepts, explain how I see them fitting in to the new world and, hopefully, challenge some existing preconceptions - slides can be found here. I’ll pop the video link up when published.

For those in a hurry, I’ve summarised the main thrust of my talk below.

The Ways of the Past

For many years, the fastest way to transform data whilst moving it between two places was to extract, transform in memory via SSIS, then load into the final structures - hence ETL.

[Image: the traditional ETL pattern, transforming in memory with SSIS]

However, this was eventually rivalled by massive parallelism - whilst one database engine may be slower to work through a load than memory-based operations, 10 might be much faster. But the Hadoop systems that provided this were not very accessible to the bulk of BI developers, seen as requiring different skillsets and not being particularly standard.

The management of these clusters wasn’t always achievable even with developer acceptance, certainly not in the on-premise world. This is why the latest Azure technologies are looking to make such a difference.

The Scalable Warehouse

With Azure Data Warehouse, you can perform queries over large, distributed tables – both reading and writing using massive levels of parallelism, depending on your service tier. You can also scale your service dynamically, so if your day-to-day usage required 100 DWUs, you’d be seeing 8 reader threads and 60 writer threads. Even at this level you may find performance to be similar to SSIS.

But these are PaaS components and so are incredibly flexible and scalable. We can scale our ADW instance up to 400 DWUs just for the duration of our data load (bear in mind, you pay by the hour for the maximum level used in that time). This gives us 32 readers and 60 writers – at this level we’ll be far outstripping the performance of a powerful SSIS box and this will cost us just a couple of pounds per hour. This scaling can be done via a number of methods – a couple of lines of T-SQL, a simple powershell script or even through the UI.

Once we've scaled, transformations are quite likely to be performed faster directly by our data warehouse engine than by a data movement tool. This gives us Extract > Load > Transform, which is a common technique among users of MPP systems – with Azure DW coming from APS/PDW, this makes sense.
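
In practice, an ELT-style transformation inside the warehouse usually ends up as a CTAS statement - a rough sketch, with entirely made-up staging tables:

CREATE TABLE dbo.FactSales
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT  s.CustomerKey,
        s.SaleDate,
        s.SaleAmount * x.ExchangeRate AS SaleAmountGBP
FROM    stg.Sales AS s
JOIN    stg.ExchangeRate AS x
        ON x.CurrencyCode = s.CurrencyCode;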

[Image: the ELT pattern, with transformation performed inside Azure DW after loading]

Targeted Compute

With Azure Data Lake Analytics, you can spin up the equivalent of 1000 cheap servers (essentially managed Hadoop instances) for the lifetime of your query. You’ll need to write U-SQL to do this, a new language combining C# and SQL syntax, but it’s not a huge leap for an experienced SQL developer with just a hint of programming skills. But the ability to write a query directly in the browser that can utilise massive amounts of parallelism removes a lot of barriers – you don’t even need visual studio installed to take advantage of ADL.

Of course, if you do build your queries in Visual Studio, there is a whole tooling set released to allow you to predict compute utilisation, spot bottlenecks and plan capacity. Not to mention the benefits that source controlling your code gives you!

[Image: the Azure Data Lake Analytics tooling in Visual Studio for planning compute utilisation]

So Which Do We Use?

This gives us two options for our transformation, the Data Lake and within the Warehouse itself.

My assumption is that we’re going to see TELT, some upfront targeted transformation, then an extract/load into our final technology, then some further transformation.

Transform (targeted) > Extract > Load > Transform (general)

Whereas transformation used to be the bottleneck, it’s now the data movement itself that’s slowing things down. If we have just one very large job, we can perform this in isolation within the lake before moving anything and not have to scale our warehouse. If we have a lot of transformations to do on a variety of tables, we can scale up the warehouse for the duration. Maybe we do both to limit how much we scale the ADW. Perhaps we perform the transformation as increments throughout the day, either as smaller data lake jobs, data factory movements or even stream transformations. Once again, that’s what a distributed cloud architecture gives us – flexibility and choice.

The actual answer depends heavily upon the data scenario that you’re attempting to work with – the speed of incoming data, the types of data involved, the volume and complexity of the data. The new components do not provide you with a one-size-fits-all answer, they do, however, provide you with the tools to overcome almost any data challenges you may be facing.

Further Reading

Azure DataWarehouse Loading Reference Guide

Azure Data Warehouse Documentation

Azure Data Lake Analytics Documentation