One of the new SQL Server 2012 data warehouse features is the Columnstore index. It stores data by columns instead of by rows, similar to a column-oriented DBMS like the Vertica Analytic Database and claims to increase query performance by hundreds to thousands of times.
The issue with indexes in a data warehouse environment is the number and broad range of questions that the warehouse may have to answer meaning you either have to introduce a large number of large indexes (that in many cases results in a larger set of indexes than actual data), plump for a costly spindle-rich hardware infrastructure, or you opt for a balanced hardware and software solution such as a Microsoft SQL Server 2008 R2 Fast Track Data Warehouse or a HP Business Data Warehouse Appliance where the approach is ‘index-light’ and you rely on the combination of high throughput and performance power to reduce the dependency on the traditional index.
The Columnstore index is different in that, when applied correctly, a broad range of questions can benefit from a single Columnstore index, the index is compressed (using the same Vertipaq technology that PowerPivot and Tabular based Analysis Services share) reducing the effort required on the expensive and slow disk subsystem and increasing the effort of the fast and lower cost memory/processor combination.
In order to test the claims of the Columnstore index I’ve performed some testing on a Hyper-V instance of SQL Server 2012 “Denali” CTP3 using a blown up version of the AdventureWorksDWDenali sample database. I’ve increased the FactResellerSales table from approximately 61,000 records to approximately 15.5 million records and removed all existing indexes to give me a simple, but reasonably large ‘heap’.
With a clear cache, run the following simple aggregation:
,SUM(SalesAmount) AS SalesAmount
Table 'FactResellerSales'. Scan count 5, logical reads 457665, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 7641 ms, elapsed time = 43718 ms
Before jumping straight in with a columnstore index, let’s review performance using a traditional index. I tried a variety of combinations, the fastest I could get this query to go was to simply add the following:
CREATE NONCLUSTERED INDEX [IX_SalesTerritoryKey] ON [dbo].[FactResellerSales]
INCLUDE ([SalesAmount]) WITH
PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE
) ON [PRIMARY]
Notice I have compressed the index using page compression, this reduced the number of pages my data consumed significantly. The IO stats when I re-ran the same query (on a clear cache) looked like this:
Table 'FactResellerSales'. Scan count 5, logical reads 26928, physical reads 0, read-ahead reads 26816, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 6170 ms, elapsed time = 5201 ms.
Much better! Approximately 6% of the original logical reads were required, resulting in a query response time of just over 5 seconds. Remember though, this new index will really only answer this specific question. If we change the query, performance is likely to fall off the cliff and revert back to the table scan.
Incidentally, adopting an index-light ([no index]) approach and simply compressing (and reloading to remove fragmentation) the underlying table itself, performance was only nominally slower than the indexed table with the added advantage of being able to perform for a large number of different queries. (Effectively speeding up the table scan. Partitioning the table can help with this approach too.)
Okay, time to bring out the columnstore. The recommendation is to add all columns into the columnstore index (Columnstore indexes do not support ‘include’ columns), practically there may be a few cases where you do exclude some columns. Meta data, or system columns that are unlikely to be used in true analysis are good candidates to leave out of the columnstore. However, in this instance, I am including all columns:
CREATE NONCLUSTERED COLUMNSTORE INDEX [IX_Columnstore] ON [dbo].[FactResellerSales]
)WITH (DROP_EXISTING = OFF) ON [PRIMARY]
Now when I run the query on a clear cache:
Table 'FactResellerSales_V2'. Scan count 4, logical reads 2207, physical reads 18, read-ahead reads 3988, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 235 ms, elapsed time = 327 ms.
I think the figures speak for themselves ! Sub-second response and because all columns are part of the index, a broad range of questions can be satisfied by this single index.
The traditional (compressed) non-clustered index takes up around 208 MB whereas the Columnstore Index comes in a little less at 194 MB so speed and storage efficiency, further compounded when you take into account the potential additional indexes the warehouse may require.
So, the downsides? Columnstore indexes render the table read-only. In order to to update the table you either need to drop and re-create the index or employ a partition switching approach. The other notable disadvantage, consistently witnessed during my tests, is the columnstore index takes longer to build. The traditional non-clustered index took approximately 21 seconds to build whereas the columnstore took approximately 1 minute 49 seconds. Remember though, you only need one columnstore index to satisfy many queries so that’s potentially not a fair comparison.
If you don’t notice a huge difference between a table scan and a Columnstore Index Scan, check the Actual Execution Mode of the Columnstore Index Scan. This should be set to Batch, not Row.
If the Actual Execution Mode is reporting Row then your query cannot run in parallel:
- Ensure, if running via Hyper-V, you have assigned more than one processor to the image.
- Ensure the Server Property ‘Max Degee of Parallelism’ is not set to 1.
Summary In summary, for warehousing workloads, a columnstore index is a great addition to the database engine with significant performance improvements even on reasonably small data sets. It will re-define the ‘index-light’ approach that the SQL Server Fast Track Data Warehouse methodology champions and help simplify warehouse based performance tuning activities. Will it work in every scenario? I very much doubt it, but it’s a good place to start until we get to experience it live in the field.