Yes it’s true, despite what I have been told for years I can officially confirm it, size does matter! I am of course talking about sizing of columns using the variable length data types in sql server, what else could I mean!?
Back in the early days ..
So this came to me after watching Klaus Aschenbrenner speaking at SQLBits 8, his session was on the internals of the storage engine in sql server and how it all worked. As he explained, this area of sql server can be a bit of a black box and I guess for a lot of people this is fine. After all, I certainty don't know, or need to know, the inner workings of a lots of the technology and gadgets I use on a day to day basis (I don’t know or care how my TV works for example, I just like watching the Simpsons on it ).
Anyway I digress, after Klaus’s session I remembered back to when I first started using sql server (around 11 years ago, which actually makes me feel really old!) and I asked a few of the guys I worked with at the time “If these varchar columns only use space for the data they store then why not just declare them as all varchar(8000)?”, they all looked around at each other and after a minute or two came up with something about using the database design to restrict what the user can store in the table by limiting the column widths.
While this made some sense I was still left thinking “well why do they let you set a size at all, just call it varchar without setting a size, make it accept 8000 characters (4000 for nvarchar) and let the DB handle it”. This to me,with the knowledge I had at the time made much more sense, because the sql server storage engine was a black box to me and obviously the guys I was working with at the time as well.
Back to the future
Obviously time passes and we all learn and improve our skills and knowledge. So as I mentioned earlier listening to Klaus reminded me of this unanswered question from the past and I kind of wished I could go back in time and answer myself ..(?) by saying …
“Size matters! and here’s why…”
So at a basic level SQL Server stores all its data in 8KB chunks it calls “pages”. There are several types of pages but the ones we are looking at are the three different types of data pages, IN_ROW, ROW_OVERFLOW and LOB_DATA.
Pages are then logically lumped into 64KB groups of 8 (8 x 8KB) which is referred to as an “extent”.
There are two kinds of extents, “Uniform” and “Mixed”. I’m not going to go into extents too much (there are a bunch of great resources out there if you want to look into this, I may even do another blog post on the subject) but basically a uniform extent is one where all its 8 pages hold data for the same object (i.e. table or index etc) and a mixed extent as you may have guessed is one that has pages that hold data for multiple objects (example below)
So building on this, lets start with a table (heap or b-tree, the principles are the same), when you first create this table sql server automatically creates a partition for you (tables and indexes can be spread across multiple partitions but I’m not going to go into detail about partitions in this post). Each partition then has something called an “allocation unit” for each type of “page” (See page types above) it is using (I emphasise using as this is the important bit), which tells SQL server which extents are being used by / belong to this object . Example as below:
Page types, what they can store and variable vs fixed width
Now if you remember back to a moment ago when I mentioned the three different types of data pages, lets look at what each is used for.
IN_ROW (Data in the image above) pages can store data from both fixed width data types (CHAR, INT, DATETIME etc etc) and variable width data types (VARCHAR, VARBINARY etc).
Fixed width data types, unlike variable width data types, can only ever be stored on IN_ROW type pages, which enforces a row size limit for fixed width data type columns to around 8053 bytes (a page being 8060 with roughly a 7 byte overhead, I say roughly as this can increase). Rows can be larger than this using variable width data type columns as this limit only applies to the fixed with types
LOB_DATA pages are for storing data from large data types such as VARCHAR(MAX), text, image and xml for example.
ROW_OVERFLOW pages are used to store data from variable width data types, but only when the potential maximum width of a row exceeds the 8053 (ish) byte limit. So to clarify, "potential maximum width” means the sum width of all columns using both fixed width and variable width data types, taking the declared length of variable with columns as its width (so a varchar(50) would count as a column 50 bytes wide)
So you might have spotted the crux of this post by now? Basically put, if your potential maximum row size is going to exceed the 8053 (ish) byte limit then SQL server moves the variable width data type data into a ROW_OVERFLOW type page and you now need to maintain two types of pages per row. This means two allocation units per partition and you now have to pull two pages into memory to read each row as all I/O in SQL Server is done at the page level, you’re basically doubling your I/O. Now think how that would scale over so millions or hundreds of millions of rows…
So if you don’t really need to store those 8000 characters in your varchar column then don't declare it that size as you may be giving yourself a massive I/O hit.
So there you are, size does matter! Hopefully this has helped and now when you’re designing your tables you wont be so fast and loose with the size of your variable width columns
Thanks for reading folks!
(Images from MSDN and general Kudos to Klaus Aschenbrenner)