Callum

Callum Green's Blog

Joins vs CROSS APPLY

Every professional who has worked with SQL Server will have used a JOIN operator.  However, the same cannot be said of the APPLY operator.  Although APPLY was introduced back in SQL Server 2005, there are still a number of developers who have barely seen the syntax, let alone utilised its capabilities.

This blog focuses solely on INNER JOIN vs CROSS APPLY and when it is more beneficial to use the latter.  I will provide additional links at the end of the post, which cover other features of APPLY, such as its performance benefits over PIVOT.

This blog has been written with the assumption that the reader already knows about Joins.  If you would like to refresh your memory, click here.

What is APPLY?


The original purpose of the APPLY operator was to use it with table-valued functions (TVFs).

The MSDN definition is:

“The APPLY operator is similar to the JOIN operator, but the difference is that the right-hand side operator of APPLY can reference columns from the left-hand side”.

 

In simple terms, a join relies on self-sufficient sets of data, i.e. the sets should not depend on each other.  On the other hand, CROSS APPLY is based on one predefined set and can be used with another, separately created set.  A worked example should help illustrate this difference.

Example Problem

Scenario:    Find the last 3 orders placed by each customer.
Database:    AdventureWorksDW2014 – download the free Microsoft database here.
Tables:      dbo.FactInternetSales
             dbo.DimCustomer

1. Create a function that accepts a CustomerKey and the number of orders to return (Top N, e.g. the last 3).

 

USE [AdventureWorksDW2014]
GO

CREATE FUNCTION dbo.GetTopOrders (@CustomerKey INT, @N INT) RETURNS TABLE
AS
RETURN
    -- Return the customer's last @N orders, placed on different days
    SELECT DISTINCT TOP(@N)
           SalesOrderNumber,
           OrderDate,
           CustomerKey
    FROM   [dbo].[FactInternetSales]
    WHERE  CustomerKey = @CustomerKey
    ORDER BY OrderDate DESC, SalesOrderNumber DESC
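Before joining the function to anything, it can be sanity-checked against a single customer (CustomerKey 11000 is just an arbitrary key from DimCustomer):

SELECT SalesOrderNumber, OrderDate, CustomerKey
FROM   dbo.GetTopOrders(11000, 3);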

2. Using the traditional CROSS JOIN approach.  The C.CustomerKey reference (in the CROSS JOIN and the function call) is intended to dynamically pass through every customer, with the '3' being the number of orders to show per customer.

SELECT C.CustomerKey,
       C.FirstName,
       C.LastName,
       C.EmailAddress,
       O.SalesOrderNumber,
       O.OrderDate
FROM   dbo.DimCustomer C
CROSS JOIN
       dbo.GetTopOrders(C.CustomerKey, 3) AS O

 

[Screenshot: error – the multi-part identifier 'C.CustomerKey' could not be bound]

This fails because C.CustomerKey references the left set (DimCustomer) but does not exist in the right set (the function).  Joins require pre-defined sets on both sides, which is not how this query works.

3. Using CROSS APPLY.  This operator evaluates the left set first, then runs the right set once for each row of the left set, passing each CustomerKey into the function as a correlated parameter.

SELECT C.CustomerKey,
       C.FirstName,
       C.LastName,
       C.EmailAddress,
       O.SalesOrderNumber,
       O.OrderDate
FROM   dbo.DimCustomer C
CROSS APPLY
       dbo.GetTopOrders(C.CustomerKey, 3) AS O

 

[Screenshot: query results]

-- 26760 rows affected

The only change to the code is swapping CROSS JOIN for CROSS APPLY.  Although the sets are evaluated separately, you can still return data from both – just like a join.

4. CROSS APPLY behaves like an INNER JOIN here: a row from the left set is only returned if the right set produces a match, so customers without orders disappear from the results.  A way to ensure all records are returned is to use OUTER APPLY.

For the purpose of this article, I have added a new customer (not shown below) to the dbo.DimCustomer table, who has yet to place an order.  We can now run the original CROSS APPLY query, but as an OUTER APPLY instead.

SELECT C.CustomerKey,
       C.FirstName,
       C.LastName,
       C.EmailAddress,
       O.SalesOrderNumber,
       O.OrderDate
FROM   dbo.DimCustomer C
OUTER APPLY
       dbo.GetTopOrders(C.CustomerKey, 3) AS O
WHERE  O.SalesOrderNumber IS NULL

 

[Screenshot: query result – the new customer, with NULL order columns]

The additional WHERE clause isolates the non-matching record; without the filter, the OUTER APPLY returns 26,761 rows in total (one more than the CROSS APPLY).

Conclusion

As demonstrated, CROSS APPLY is a very useful operator when referencing a table-valued function or filtering on a subset of data.  The key benefit is the ability to use one defined set with another, separately created set.  Unlike joins, you can define a dynamic subset of data to match against each row of the outer query.
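Note that the function is a convenience rather than a requirement.  CROSS APPLY also accepts an inline correlated derived table, so the same query can be written without creating a function at all – a minimal sketch:

SELECT C.CustomerKey,
       C.FirstName,
       C.LastName,
       O.SalesOrderNumber,
       O.OrderDate
FROM   dbo.DimCustomer C
CROSS APPLY
       (
           -- The subquery references C.CustomerKey directly, which a join cannot do
           SELECT TOP(3) F.SalesOrderNumber, F.OrderDate
           FROM   dbo.FactInternetSales F
           WHERE  F.CustomerKey = C.CustomerKey
           ORDER BY F.OrderDate DESC, F.SalesOrderNumber DESC
       ) AS O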

APPLY is very underrated and underused within the SQL Server community.  Look out for future blogs, where I will demonstrate the other uses of the APPLY operators and how they can produce more efficient queries.

References


For further reading on the keywords in this article, try the recommended links below:

1.      Boost your T-SQL with CROSS APPLY - http://www.microsoftvirtualacademy.com/training-courses/boost-your-t-sql-with-the-apply-operator

2.      CROSS APPLY in SQL Server - http://www.microsoftvirtualacademy.com/training-courses/boost-your-t-sql-with-the-apply-operator

3.      Using APPLY – https://technet.microsoft.com/en-us/library/ms175156(v=sql.105).aspx

4.      INNER JOIN VS CROSS APPLY - http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/

 

 

Adatis Coding Dojo – Session No.2

The Challenge

Write a program that generates all two-word anagrams of the string "documenting", in SQL Server.
e.g. “Documenting” = “Document Gin”.

Introduction

This week's challenge was... well, a challenge. In our first session everyone attempted the problem via paired programming; for this dojo we decided to attempt it as one larger group of 5 people. This technique encourages an open forum for developers to discuss ideas, with one person at the computer coding. The assigned coder must swap with another participant every 5–10 minutes.

Preparation

Before beginning the dojo, I asked one of the senior consultants to attempt the problem and provide his code and the concepts behind his thinking. The idea is that one challenge can be solved in a number of ways, with 5 people working collaboratively thinking in a completely different way to one individual coder. We provided a file containing 10,000 words, which would become the master list for the anagram solution. If you would like to try this yourself, the text file can be downloaded from here.

Senior Developer’s Solution

Most importantly, the senior developer DID NOT use the TDD methodology. It took around 2 hours for the developer to implement a fully functioning stored procedure. Here is how it was achieved:

1. A stored procedure, encapsulating a long T-SQL script. The solution was focused on string manipulation and recursive querying.

2. A while loop turns the word into a pivoted array of individual letters, which can then be compared against the words in the word-list table.

3. A recursive CTE returns words consisting only of letters in that array.

4. XML PATH is used to create a CHECKSUM string of letters and counts.

5. All word combinations of the correct length are returned and checked against the checksum to validate letter counts.

The solution was built to be flexible from the start – it returned anagrams with ‘n’ number of words rather than the proposed 2. It would also work with any provided starting word and special characters.

Code Dojo Solution

The first task was to brainstorm ideas on how to conquer the challenge, starting with whether it could even be achieved in SQL Server! The coding team weighed up string manipulation, recursive CTEs, a cursor and mathematical calculations, amongst other SQL Server functions. The general consensus was to avoid recursion (where possible) to ensure faster query results, and to use mathematical algorithms to compare against the data in the imported word-list table.

The development team used TDD to implement the mathematical approach to solving the problem. Each code enhancement contained a new automated test script that would capture any failures and enable a simple rollback to the last working codebase. This proved to be a successful technique, as ideas were changing constantly throughout development.

Actual Dojo Solution

After around 90 minutes of SQL coding and TDD, the team managed to come up with a working solution. This is how they achieved it (a sketch of the letter-value function follows the list):

1. A SQL Function to work out a unique value for a character (letter).

a. Each letter of the word 'Documenting' has a case-sensitive ASCII value, e.g. 'D' is 68 and 'd' is 100.

b. All letters were converted to uppercase, to ensure a case-insensitive approach.

c. Apply the POWER SQL function to a given letter, giving it a unique number that cannot be produced by any other letter.

2. A simple stored procedure looks up the existing words in the full word-list table, referencing the function to compare letter values.

a. Find all word combinations and their total POWER value.

b. Compare the total against the value of the hardcoded word 'Documenting'.

c. Return any records where two words' values add up to the total POWER value of 'Documenting'.

d. Parameterise the hard-coded word in the procedure, so that any word can be compared at run time.

3. The TDD approach helped facilitate the iterative, code review process.
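The dojo code itself was not published with this post, but a minimal sketch of the letter-value function described in step 1 might look like the following (the function name is hypothetical, and this is a reconstruction rather than the team's actual code):

-- Hypothetical reconstruction: maps each letter onto a distinct power of two
CREATE FUNCTION dbo.GetLetterValue (@Letter CHAR(1))
RETURNS BIGINT
AS
BEGIN
    -- Uppercase first, so 'd' and 'D' score identically (step 1b),
    -- then shift A-Z onto powers of two (step 1c)
    RETURN POWER(CAST(2 AS BIGINT), ASCII(UPPER(@Letter)) - ASCII('A'));
END

Summing these values gives every word a single comparable number, which is how the procedure in step 2 can look for pairs of words whose combined value matches that of 'documenting'.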

Conclusion

Although both the Senior Developer's and the Dojo team's solutions met the minimum expectations for the challenge, there were a number of pros and cons to each.

The Senior Developer thought ahead and made improvements beyond the requirements, such as dynamically handling anagrams of more than 2 words. He also demonstrated some of the more powerful, underused functionality within SQL Server. However, it takes around 2 minutes to execute one anagram and the code itself is not the most efficient. By not using TDD in his approach, he over-complicated the solution and did not encounter performance bottlenecks until the end of the build.

On the other hand, the Dojo team fully practised TDD. This was reflected in the much smaller codebase and, most importantly, the far quicker execution of the anagram procedure (23 seconds). Their solution is limited in that it is very rigid and cannot handle anagrams of more than two words. It also cannot handle special characters, whereas the Senior Developer's solution can. However, these were not requirements of the solution – would a client accept a slower, more complicated product that handles requirements they do not need?

Overall, both solutions work adequately, but it is clear that when confronting a technical challenge, running a dojo and practising TDD can bring more efficient results. Had we added further requirements to the challenge, maybe the dojo team would have found another, even more efficient solution.

References

General

Good Dojo Practices - http://codingdojo.org/
TDD - http://code.tutsplus.com/tutorials/the-newbies-guide-to-test-driven-development--net-13835

SQL Functions

POWER – https://msdn.microsoft.com/en-us/library/ms174276.aspx
ASCII – https://msdn.microsoft.com/en-us/library/ms177545.aspx
XML PATH – http://blogs.msdn.com/b/mind_talks/archive/2012/01/18/xml-path-for-sql-server.aspx
CHECKSUM - https://msdn.microsoft.com/en-us/library/ms189788.aspx

Dojo Code

For access to the Senior Developer's and the Dojo team's SQL solutions, please leave a comment and I will get back to you directly.  It would be interesting to see if anyone else has a better technique (both in terms of code length and performance), and I welcome any suggestions.

New Features in SQL Server 2014

There are a number of new features available in SQL Server 2014, but this article goes into more detail on the less publicised ones.  The well-publicised In-Memory OLTP engine and the ability to update a columnstore index are probably the most used features and provide lots of performance benefits; however, there are plenty of other blogs that give a breakdown of their advantages, disadvantages, useful techniques, etc.

SQL Server 2014 did not go through as drastic a makeover as SQL Server 2012.  The environment has been further stabilised and extended to give users more functionality and a better handle on standard processes.  The below additions to SQL Server 2014 will be discussed:

- Database Backups
- Resource Governor for I/O
- SSD Buffer Pool Extension
- Incremental statistics

NOTE: This article DOES NOT explain how to implement each feature.  The 'Further Reading' section provides links with more detailed information on each subject.

Database Backups

  • Encryption.
    • SQL Server now supports built-in backup encryption – no need for third-party tools (see the sketch after this list).
    • Can use either a certificate or an asymmetric key to back up a database.
    • Supported encryption algorithms:
      • AES 128-bit.
      • AES 192-bit.
      • AES 256-bit.
      • Triple DES (3DES).
  • Managed backup to Azure Blob Storage.
    • Automatically backs up a database (or an instance) based on:
      • Defined recovery intervals.
      • Workload patterns.
    • When the system determines that data has changed enough, a backup is taken on Azure.
    • Azure backup integration is fully built into SQL Management Studio (SSMS).
    • Managed backups are ONLY compatible with blob storage and are dependent on Internet access to take and restore backups.
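To give a flavour of the encryption syntax, the sketch below backs up a database with AES-256 encryption.  It assumes a database master key and a certificate (named BackupCert here) have already been created, and the file path is purely illustrative:

-- Assumes a master key and the BackupCert certificate already exist
BACKUP DATABASE [AdventureWorksDW2014]
TO DISK = N'C:\Backups\AdventureWorksDW2014.bak'
WITH ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = BackupCert),
     COMPRESSION;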

Resource Governor for I/O

Disk I/O (Input/Output) can cause a lot of problems in a database system, especially when running large or problematic queries.  SQL Server 2014 (more specifically, Resource Governor) enables more control over I/O; a syntax sketch follows the list:

  • Queries are put into their own resource pool, limiting I/O per volume.
  • Minimum/Maximum reads or writes (per second) in a disk volume are controlled by:
    • MIN_IOPS_PER_VOLUME
      • Reserves a minimum number of I/O transactions per second.
    • MAX_IOPS_PER_VOLUME
      • Sets a maximum and, more importantly, prevents the hard disk from being monopolised.
      • Large queries will still run, while allowing other processes to function in parallel.
  • Good I/O control reserves Input/Output Operations Per Second (IOPS), allowing administrators to investigate overloaded or poorly performing disks.
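As a rough illustration (the pool name and numbers are invented), a resource pool can reserve and cap its I/O per volume like this:

-- Reserve 50 and cap 200 I/O operations per second, per disk volume
CREATE RESOURCE POOL ReportingPool
WITH (MIN_IOPS_PER_VOLUME = 50, MAX_IOPS_PER_VOLUME = 200);

ALTER RESOURCE GOVERNOR RECONFIGURE;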

SSD Buffer Pool Extension

A buffer pool extension in SQL Server 2014 is similar to defining a separate page file for Windows.  As data pages move into memory, the buffer pool begins to fill up.  Once it is full, less frequently used pages are spilled to disk.  When they are needed again, the pages are swapped with another page in the buffer pool and allocated back to memory.

  • Buffer pool extension allows a user to define a Solid State Drive (SSD) as a buffer file location.
    • SSD is much faster than a standard spinning disk, therefore increasing performance considerably.
    • Buffer pool extension file:
      • Up to 32 times the size of physical memory.
      • Increased random I/O throughput.
      • Increased transaction throughput.
      • Caching that can take advantage of low-cost memory drives.
    • No risk of data loss. Buffer Pool Extension (BPE) only deals with clean pages.
    • The SQL syntax is very simple to use and maintain, as the sketch below shows.
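A sketch of that syntax (the file path and size are placeholders for your own SSD volume):

-- Enable the buffer pool extension on an SSD volume
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
    (FILENAME = N'E:\SSDCACHE\ExtensionFile.BPE', SIZE = 64 GB);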

Incremental statistics

Typically, statistics (stats) have been pretty costly to maintain in SQL Server: whenever stats need to be rebuilt, the whole table requires updating.  For example, a table with 100 million records but only 5 million updated rows would still need a statistics update across the whole 100 million rows.

Incremental statistics can alleviate the above example, as well as offering some other advantages (a syntax sketch follows the list):

  • Update changed rows and merge them with existing rows.
    • Can result in drastically improved query performance.
    • Used in conjunction with table partitioning.
    • The final statistics object is updated – not re-created.
  • The percentage of data changes needed to trigger an automatic update of statistics now works at the partition level.
    • The roughly 20% change threshold now applies per partition, rather than across the whole table.
  • The default setting for incremental statistics is OFF.
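A sketch of the syntax, assuming a partitioned fact table (the statistics name and partition numbers are illustrative):

-- Create a statistics object marked as incremental
CREATE STATISTICS Stat_OrderDate
ON dbo.FactInternetSales (OrderDateKey)
WITH INCREMENTAL = ON;

-- Later, refresh only the partitions that have changed
UPDATE STATISTICS dbo.FactInternetSales (Stat_OrderDate)
WITH RESAMPLE ON PARTITIONS (3, 4);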

Conclusion

There are many other new and improved features in SQL Server 2014 that have not been discussed in this blog.  The features covered here are the ones that could help increase SQL Server performance, efficiency and reliability.  You should think about these elements before building a data warehouse or an Analysis Services cube, as it could save a lot of time and money further down the line.

I would encourage all interested parties to read up on the other exciting additions to SQL Server 2014.  Simply typing any of the below keywords into Google will bring up a number of related articles.

- Updateable Columnstore Indexes
- In-Memory OLTP Engine
- AlwaysOn Enhancements
- Power BI for Office 365 Integration
- Power View for Multidimensional Models
- Delayed durability

For any questions related to this blog, please contact Callum Green at callum.green@adatis.co.uk.

Further Reading

For further information on the new features (and associated functionality), navigate to the below website links:

  1. Database Backups
    Encryption Types - http://msdn.microsoft.com/en-GB/library/ms345262.aspx
    Managed Azure backups - http://msdn.microsoft.com/en-gb/library/dn449491.aspx
  2. Resource Governor for I/O
    Configuration and SQL Server 2014 implementation examples - http://www.databasejournal.com/features/mssql/restricting-io-using-sql-server-2014-resource-governor.html
  3. SSD Buffer Pool
    Benefits, features and SQL Syntax for Buffer Pool Extension - http://blogs.technet.com/b/dataplatforminsider/archive/2013/07/25/buffer-pool-extension-to-ssds-in-sql-server-2014.aspx
  4. Incremental Statistics
    Improvements made to statistics and using alongside table partitioning - http://sqlperformance.com/2014/02/sql-statistics/2014-incremental-statistics
    Scenarios when Incremental Statistics are not available or possible (half way down the page)  - http://msdn.microsoft.com/en-us/library/ms188038.aspx 

SQL Server NOT IN Clause - Avoid like the….

Background

Up until recently, I was one of the many SQL Server developers with the bad habit that is the NOT IN clause.  It is an easy way of finding data in one table that does not exist in another.  For this purpose, I thought using NOT IN would help me conceptualise a query result, as well as make the code easier for someone else to read.  In fact, although the performance (within an execution plan) is OK, the overall query can pull back incorrect results.

The Problem

The NOT IN clause is problematic in one, but VERY IMPORTANT, way: it cannot cope with NULLs in the comparison table.  Please see the example below.

Create the two tables for the NOT IN example:

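The original post showed this step as a screenshot; the following is a minimal reconstruction (the first table's name, the column definitions and the players other than Andy Cole are assumptions inferred from the prose – note the NULL Position, which is what breaks NOT IN):

CREATE TABLE dbo.FootyPlayers
(
    PlayerName VARCHAR(50),
    Position   VARCHAR(20)
);

CREATE TABLE dbo.NewFootyPlayers
(
    PlayerName VARCHAR(50),
    Position   VARCHAR(20)
);

-- The comparison table contains a NULL Position
INSERT INTO dbo.FootyPlayers (PlayerName, Position)
VALUES ('Peter Schmeichel', 'Goalkeeper'),
       ('Eric Cantona', NULL);

INSERT INTO dbo.NewFootyPlayers (PlayerName, Position)
VALUES ('Peter Schmeichel', 'Goalkeeper'),
       ('Andy Cole', 'Striker');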

Query results for both tables:

[Screenshot: the rows in both tables]

The NOT IN query:

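Reconstructed from the screenshot, with the same assumed names:

SELECT PlayerName, Position
FROM   dbo.NewFootyPlayers
WHERE  Position NOT IN (SELECT Position FROM dbo.FootyPlayers);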

As you can see, 0 records were returned.  We would expect the record for Andy Cole (the Striker) in the NewFootyPlayers table to be returned, but the NOT IN clause cannot evaluate the comparisons against the NULL.

NOTE: Adding a 'WHERE Position IS NOT NULL' filter inside the NOT IN subquery would also give the expected result, but a lot of people will forget to add it and then spend a substantial amount of time wondering why certain records are missing from their result set.
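For completeness, that filtered version would look like this (same assumed names):

SELECT PlayerName, Position
FROM   dbo.NewFootyPlayers
WHERE  Position NOT IN (SELECT Position
                        FROM   dbo.FootyPlayers
                        WHERE  Position IS NOT NULL);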

The Solution(s)

There are a number of clauses, or pieces of SQL syntax, that can be used instead of NOT IN.  Although most do not have any major performance benefits, they actually return what is expected.  The three examples below all return the one expected record:

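Reconstructed with the same assumed names, the three alternatives are NOT EXISTS, a LEFT JOIN filtered to the non-matching rows, and EXCEPT:

-- 1. NOT EXISTS (NULL-safe)
SELECT N.PlayerName, N.Position
FROM   dbo.NewFootyPlayers N
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.FootyPlayers F
                   WHERE  F.Position = N.Position);

-- 2. LEFT JOIN, keeping only the rows with no match
SELECT N.PlayerName, N.Position
FROM   dbo.NewFootyPlayers N
LEFT JOIN dbo.FootyPlayers F
       ON F.Position = N.Position
WHERE  F.Position IS NULL;

-- 3. EXCEPT (treats NULLs as equal, so they cannot poison the comparison;
--    note it returns only the compared column)
SELECT Position FROM dbo.NewFootyPlayers
EXCEPT
SELECT Position FROM dbo.FootyPlayers;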

All three return the below result, which is what we expected in the first place:

[Screenshot: the expected record – Andy Cole, Striker]

Recommended Solution

Whilst none of the solutions above causes major performance problems, one method is better than the others.  If we are working with hundreds of millions of records in both tables, NOT EXISTS is the most efficient query.  Its performance is similar to NOT IN and EXCEPT, and it produces an identical plan, but it is not prone to the potential issues caused by NULLs or duplicates.

I would be interested to see if anyone else has performance-tested each query type and whether there are better alternatives to NOT EXISTS.  One thing I am certain of, however, is that no one should have to use the NOT IN clause.

TF204017 The operation cannot be completed because the user does not have one or more required permissions (Use) for workspace

The titled error message (above) often occurs when a number of people work on a particular server using the same credentials. This article explains the different workspace permissions that can be assigned in Visual Studio, as well as a way of eradicating the error message.

Please Note:

This post is only relevant to TFS 2010 and beyond. Any previous version will only allow a workspace to use the 'private workspace' setting.

Background

When creating a TFS workspace and linking it to source control, the associated solutions (and files) are saved to a local location. By default, everything saved here is given the 'private workspace' permission. As a result, only the associated user (that is logged on to the server) can create, edit or delete parts of the solution.

In most cases, you would want TFS to apply the 'private workspace' permission. This ensures individuals have their own copy of locally developed work and prevents the risk of anything being accidentally overwritten or deleted. However, there may be a need to share a workspace across multiple users, where the solution files are stored on a shared drive. Furthermore, some servers use generic login credentials, meaning more than one person connects to TFS as the same user. Switching the TFS workspace to a more 'public' setting enables everyone to share the same solution files.

Workspace Permissions

There are 3 types of workspace in TFS 2010 (and beyond):

1. Private

a. The default setting and the most commonly used. Only the logged-in user is able to create, edit or delete the local files. In effect, only the owner has permissions.

2. Public-limited

a. Additional users are granted the Read and Use permissions on the workspace.

3. Public

a. Every valid user of the team project collection has the same permissions as the owner: Read, Use, CheckIn, Administer.

Edit TFS Workspace Settings

The below instructions illustrate how to change a workspace permission in Visual Studio 2010. The same logic applies to Visual Studio 2012 and 2013.

1. From the File menu, click Source Control, and then click Workspaces.

2. In the Manage Workspaces dialog box, under the Name column, highlight the workspace that you want to modify, and then click Edit.

3. In the Edit Workspace dialog box, click Advanced.

4. Change to the desired permission by using the 'Permissions' drop-down.

Fig 1.0 – Edit Workspace Example

[Screenshot: Edit Workspace dialog]

5. Click OK to confirm.

Test Workspace Permissions

You'll need to log onto the machine that has the public workspace. After starting Visual Studio and connecting to the server that hosts the public workspace, you'll be able to see the workspace in the appropriate drop-down combo boxes in the Source Control Explorer and Pending Changes tool windows.

1. From the Team Explorer home page (Keyboard: Ctrl + 0, H), choose Source Control Explorer.

2. Alternatively, from the menu bar choose View, Other Windows, Source Control Explorer.

[Screenshot: View > Other Windows > Source Control Explorer menu path]

3. Choose the recently changed public workspace and begin to use it.

Summary

Although the original TFS error message was the reason for my findings, this also highlights that there are now different workspace permissions to fit your needs. For any user of TFS and source control, articles like these are priceless when learning more about workspaces.

Visit the websites below for a more in depth look into workspaces:

- http://social.msdn.microsoft.com/Forums/vstudio/en-US/21440b8d-67c4-4aa5-8a1b-3c747ebbb440/tf204017-the-operation-cannot-be-completed-because-the-user-does-not-have-one-or-more-required?forum=tfsgeneral

- http://blogs.msdn.com/b/phkelley/archive/2010/02/04/improvements-to-workspaces-in-tfs-2010.aspx

- http://msdn.microsoft.com/en-us/library/ms245466(v=vs.100).aspx

External Connection to a Windows Azure VM SQL Server

I was recently tasked with connecting to a SQL Server database sitting on an external Windows Azure VM.  Simply typing in the SQL Server name and providing appropriate SQL Server authentication is not enough to connect; the following error is displayed:

[Screenshot: SSMS connection error message]

An Azure virtual machine works differently to a physical server, and cmd commands such as ping do not work.  As a result, it was difficult to test connectivity to the DNS name or the IP address. After scouring numerous forums and trying a number of 'fixes', I still could not connect to the VM's SQL Server from my local machine.

I passed the problem to Sacha and luckily, he managed to work out where I was going wrong. The below steps MUST be adhered to for the connection to work:

1. Within Windows Azure, navigate to 'ENDPOINTS' and select 'ADD' at the bottom of the page.

[Screenshot: the Azure ENDPOINTS page]

2. You will need to open up a port in order to allow access to the instance of SQL Server. I'm going to stick with the default port of 1433 and create a new endpoint here. Make sure both check boxes (circled below) are NOT ticked.

[Screenshot: Add Endpoint dialog, with both check boxes unticked]

3. Next, we need to configure the firewall on the server itself. Open an RDP session and log into the Azure VM, then open Control Panel > Windows Firewall and configure a new inbound rule for port 1433:

[Screenshot: New Inbound Rule wizard – rule type 'Port']

4. Select a rule type of 'Port', as shown above. The default port for SQL Server is 1433, but this can differ depending on how the SQL Server properties have been set up. Simply type 'tcp setting for SQL Server' into a search engine to find out where to view or change the port number.

After clicking next, the options are as follows:

· Select TCP and specify port 1433.

· Next, select 'Allow the connection'.

· Then, select when you want the rule applied – typically 'Public'. This will enable external connections to the port.

· Lastly, assign the rule a name (e.g. SQL Port).

5. The next step is to verify that SQL Server is enabled to use the TCP protocol (in SQL Server Configuration Manager). Ensure that 'Named Pipes' is set to DISABLED.

[Screenshot: SQL Server Network Configuration – protocol settings]

6. Check that SQL Server is configured to use mixed mode authentication. Windows Azure does not support Active Directory authentication at this time, so SQL authentication will be required. Right-click on the server (within SSMS), toggle to the 'Security' page and verify that the 'SQL Server and Windows Authentication mode' radio button is selected.

[Screenshot: Server Properties – Security page]

7. Create a test login (sysadmin for simplicity) and ensure the user is set up with SQL Server authentication and a meaningful password (a T-SQL sketch follows).

[Screenshot: new login with SQL Server authentication]
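In T-SQL, the test login might be created like this (the login name and password are placeholders; ALTER SERVER ROLE requires SQL Server 2012 or later – older versions use sp_addsrvrolemember):

-- Placeholder credentials; sysadmin is used purely to simplify connectivity testing
CREATE LOGIN AzureTestLogin WITH PASSWORD = N'Str0ng!Passw0rd1';
ALTER SERVER ROLE sysadmin ADD MEMBER AzureTestLogin;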

8. Restart the SQL Server service (either through SSMS or SQL Server Configuration Manager); in the below example, SSMS was used. It is important that the service is restarted, otherwise any changes to the TCP or named pipes settings will not have been committed.

[Screenshot: restarting the SQL Server service from SSMS]

9. The final step is to log in from your local SSMS. Specify the full server name (taken from Azure) and the new user credentials you created.

[Screenshot: the SSMS Connect to Server dialog]

10. Connection should now be established. If you are still seeing the error message shown at the beginning of the blog, please go through the above steps again or speak to Sacha or me.

If you have any questions or require further information, do not hesitate to contact me.

Setting up a VM in Azure & Compatibility with SQL Server 2008 R2

I recently set up an internal Virtual Machine (VM) for a dev environment, using Windows Azure.  For simplicity, I set the server name and the administrator account username to exactly the same value, e.g. 'ServerApp1'.  The installation of the VM worked perfectly.

I only experienced an issue when trying to install SQL Server 2008 R2 Enterprise.  The Database Engine, Analysis and Reporting Services would not install, and the error message (Username '' could not be found) was not very helpful!  However, I googled the message and found a SQL forum where someone suggested that SQL Server does not install properly when the administrator username and VM machine name are the same.

As a result, I had to go into Control Panel and manually change the username of the admin account.  Unfortunately, once a username and VM name are specified in Azure, they cannot be changed through the portal – or at least I could not find a way.  I would strongly recommend that when setting up a VM, you always ensure the admin username differs from the machine name; I changed mine to 'ServerApp1Admin'.  Afterwards, I re-installed SQL Server 2008 R2 and all components installed correctly.

I am not 100% sure whether this is an isolated problem with SQL Server 2008 R2, but it would be safest to assume that all versions behave the same way.