EmmaStewart

Emma Stewart's Blog

How to do row counts in Azure SQL Data Warehouse

Continuing on from my last couple of blog post about working with the Azure Data Warehouse, here is another issue which has came up during development and is handy to know if you are going to be developing a solution! Smile

Keeping track of how much data has been loaded plays a key part in a BI Solution. It is important to know for a given load for example, how many rows were inserted, updated or deleted. Traditionally, we were able to use the @@ROWCOUNT function

@@ROWCOUNT returns the number of rows affected by the last statement.

Unfortunately, in Azure SQL Data Warehouse @@ROWCOUNT is not supported.

How does it work?

In the Microsoft Azure documentation,they do provide a workaround for this, please see here  for more information and a list of other unsupported functions.

They suggest creating a stored procedure which will query the system tables sys.dm_pdw_exec_requests and sys.dm_pdw_request_steps in order to get the row count for the last SQL statement for the current session.

sys.dm_pdw_exec_requests holds information about all requests currently or recently active in SQL Data Warehouse. It lists one row per request/query. holds information about all SQL Server query distributions as part of a SQL step in the query.

sys.dm_pdw_request_steps holds information about all steps that are part of a given request or query in SQL Data Warehouse. It lists one row per query step.

This is an example of what the stored procedure would look like:

Row_Count

 

As you can see above, we pass through a ‘LabelContext’ parameter. A Label is a concept in Azure SQL Data Warehouse that allows us to provide our queries with a text label that is easy to understand and we can find that label in the DMVs. For example:

label

Here, we have given our query the label ‘Test Label’ and if we wanted to find information about this query in the DMVs we can search using this label like:

label2

 

So, putting this into context, in the ETL we are calling stored procedures to load our data (for example between clean and warehouse). Therefore, within the stored procedure we have the query written to insert or update the data and we would give this query a label. Then, within the same stored procedure, we would call the Row Count stored procedure, passing through the Label as parameter so we can retrieve the row count.

 

Be careful though!

On my current project we have come across times where we haven’t been able to get the row count back. This is because the sys.dm_pdw_exec_requests DMV we are querying is transient and only stores  the last 10,000 queries executed. So when we were running the query above, our requests were no longer there and we were getting nothing back! The table holds data on all queries that go against the distribution nodes and statistics gathering for each of the nodes. So in order to try and limit the records in this table, keep the nesting level of queries as low as possible to avoid the table blowing up and not having the data you need in it!

 

Stay tuned for another blog about working with Azure Data Warehouse!

Statistics in Azure SQL Data Warehouse

Following on from my previous post about Update Queries in Azure SQL Data Warehouse, I thought I would put together a mini-series of blogs related to my ‘struggles’ working with the Azure SQL DW. Don’t get me wrong, its great, just has some teething issues of which there are work-arounds! Winking smile

This blog post is going to look at what Statistics in the database world are, the differences between them on-prem (SQL Server) and in the cloud (Azure SQL Data Warehouse) and also how to use them in Azure Data Warehouse.

What are statistics?

Statistics are great, they provide information about your data which in turn helps queries execute faster, The more information that is available about your data, the quicker your queries will run as it will create the most optimal plan for the query. 

Think of the statistics as you would the mathematical ones- they give us information regarding the distribution of values in a table, column(s) or indexes. The statistics are stored in a histogram which shows the distribution of values, range of values and selectivity of values. Statistics objects on multiple columns store information regarding correlation of values among the columns. They are most important with queries that have JOINS and GROUP BY, HAVING, and WHERE clauses.

In SQL Server, you can get information about the statistics by querying the catalog views sys.stats and sys.stats_columns. By default, SQL Server automatically creates statistics for each index, and single columns.

See here for more information.

How does it work in Azure SQL Data Warehouse?

In Azure SQL Data Warehouse, statistics have to be created manually. On previous SQL Server projects, creating and maintaining statistics wasn’t something that we had to incorporate into our design (and really think about!) however with SQL DW we need to make sure we think about how to include it in our process in order to make sure we take advantage of the benefits of working with Azure DW.

The major selling point of Azure SQL Data Warehouse is that it is capable of processing huge volumes of data, one of the specific performance optimisations that has been made is the distributed query optimiser. Using the information obtained from the statistics (information on data size and distribution), the service is able to optimize queries by assessing the cost of specific distributed query operations. Therefore, since the query optimiser is cost-based, SQL DW will always choose the plan with the lowest cost.

Statistics are important for minimising data movement within the warehouse i.e. moving data from distributions to satisfy a query. If we don’t have statistics, azure data warehouse could end up performing data movement on the larger (perhaps fact) table instead of the smaller (dimension) table as it wouldn’t know any information about the size of them and would just have to guess!

How do we implement statistics in Azure Data Warehouse?

Microsoft have actually provided the code of how to generate the statistics so its just a case of deciding when in your process you want to create them or maintain.

In my current project, we have created a stored procedure which will create statistics and another that will update them if they already exists. Once data has ben loaded into a table, we call the stored procedure and then the statistics will be created or updated (depending on what is needed).

See the documentation for more information and the code.

Tip: On my current project, we were getting errors when running normal stored procedures to load the data.

Error message:

Number of Requests per session had been reached’.

Upon investigation in the system tables,’Show Statistics’ was treated as a request which was also evaluated for each node causing the number of requests to blow up. By increasing the data warehouse units (DWUs) and also the resource group allocation this problem went away. So, take advantage of the extra power available to you!

 

There is a big list on the Microsoft Azure website of features not supported in Azure SQL Data Warehouse, take a look ‘here’. I will cover further issues in my next few blogs Smile

Update queries in Azure SQL Data Warehouse

 

I’ve recently started working on a project where we working in the cloud with Azure SQL Data Warehouse:

“Azure SQL Data Warehouse is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational” 

For more information about Azure SQL Data Warehouse, see here.

Although we develop with the same T-SQL as we do using the on-prem version of SQL Server, I did come across a bit of a quirk when writing some update statements.

If we are using the on-prem version of SQL Server, when we need to update data, we would have a SQL query like:

 clip_image002

That is a basic update to a specific row of data in the Sales.MyOrderDetails table, using a where clause to filter for the row.

Sometimes, it isn’t always as straight forward and we need to join to other tables, so that we can refer to attributes from those rows for filtering purposes. For example:

 

clip_image004

 

However, if we take this approach in SQL Data Warehouse, we get the following error.

Errormessage_updates_sqldw

 

SQL Data Warehouse doesn't support ANSI joins in the FROM clause of an UPDATE statement (it is also the case for DELETE statements too). There is a way round it and it uses an implicit join.

Before we look at how the update query can be written, it is a good place to point out that Inner joins can be written in a couple of different ways to what we had above.

In an inner join, the ON and WHERE clause both perform the same filtering and they both return rows where the ON and WHERE predicate is true. Therefore, we could write an inner join as

clip_image008

or implicitly like,

clip_image010

However, it is normally best to stick with the original example rather than the implicit version as although it is still supported, it is an old deprecated syntax and not considered best practise.

So, in order to write an update query in SQL Data Warehouse that uses inner joins to filter the rows, the workaround is as follows:

clip_image012

 

In conclusion, most SQL statements written in Azure SQL Data Warehouse are written in the same way we would with he on-prem  version of SQL Server, however, there are some cases where the syntax differs slightly and I will be blogging more about these special cases as I come across them!

Smile