Adatis

Adatis BI Blogs

SQL Server 2012 : Columnstore Index in action

One of the new SQL Server 2012 data warehouse features is the Columnstore index. It stores data by columns instead of by rows, similar to a column-oriented DBMS like the Vertica Analytic Database and claims to increase query performance by hundreds to thousands of times. The issue with indexes in a data warehouse environment is the number and broad range of questions that the warehouse may have to answer meaning you either have to introduce a large number of large indexes (that in many cases results in a larger set of indexes than actual data), plump for a costly spindle-rich hardware infrastructure, or you opt for a balanced hardware and software solution such as a Microsoft SQL Server 2008 R2 Fast Track Data Warehouse or a HP Business Data Warehouse Appliance where the approach is ‘index-light’ and you rely on the combination of high throughput and performance power to reduce the dependency on the traditional index. The Columnstore index is different in that, when applied correctly, a broad range of questions can benefit from a single Columnstore index, the index is compressed (using the same Vertipaq technology that PowerPivot and Tabular based Analysis Services share) reducing the effort required on the expensive and slow disk subsystem and increasing the effort of the fast and lower cost memory/processor combination. In order to test the claims of the Columnstore index I’ve performed some testing on a Hyper-V instance of SQL Server 2012 “Denali” CTP3 using a blown up version of the AdventureWorksDWDenali sample database. I’ve increased the FactResellerSales table from approximately 61,000 records to approximately 15.5 million records and removed all existing indexes to give me a simple, but reasonably large ‘heap’. Heap With a clear cache, run the following simple aggregation: SELECT     SalesTerritoryKey     ,SUM(SalesAmount) AS SalesAmount FROM    [AdventureWorksDWDenali].[dbo].[FactResellerSales] GROUP BY     SalesTerritoryKey ORDER BY      SalesTerritoryKey Table 'FactResellerSales'. Scan count 5, logical reads 457665, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 7641 ms, elapsed time = 43718 ms Non-Clustered Index Before jumping straight in with a columnstore index, let’s review performance using a traditional index. I tried a variety of combinations, the fastest I could get this query to go was to simply add the following: CREATE NONCLUSTERED INDEX [IX_SalesTerritoryKey] ON [dbo].[FactResellerSales] (    [SalesTerritoryKey] ASC ) INCLUDE ([SalesAmount]) WITH (     PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,     DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,     ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE ) ON [PRIMARY] GO Notice I have compressed the index using page compression, this reduced the number of pages my data consumed significantly. The IO stats when I re-ran the same query (on a clear cache) looked like this: Table 'FactResellerSales'. Scan count 5, logical reads 26928, physical reads 0, read-ahead reads 26816, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 6170 ms, elapsed time = 5201 ms. Much better! Approximately 6% of the original logical reads were required, resulting in a query response time of just over 5 seconds. Remember though, this new index will really only answer this specific question. If we change the query, performance is likely to fall off the cliff and revert back to the table scan. Incidentally, adopting an index-light ([no index]) approach and simply compressing (and reloading to remove fragmentation) the underlying table itself, performance was only nominally slower than the indexed table with the added advantage of being able to perform for a large number of different queries. (Effectively speeding up the table scan. Partitioning the table can help with this approach too.) Columnstore Index Okay, time to bring out the columnstore. The recommendation is to add all columns into the columnstore index (Columnstore indexes do not support ‘include’ columns), practically there may be a few cases where you do exclude some columns. Meta data, or system columns that are unlikely to be used in true analysis are good candidates to leave out of the columnstore. However, in this instance, I am including all columns: CREATE NONCLUSTERED COLUMNSTORE INDEX [IX_Columnstore] ON [dbo].[FactResellerSales] (     [ProductKey],     [OrderDateKey],     [DueDateKey],     [ShipDateKey],     [ResellerKey],     [EmployeeKey],     [PromotionKey],     [CurrencyKey],     [SalesTerritoryKey],     [SalesOrderNumber],     [SalesOrderLineNumber],     [RevisionNumber],     [OrderQuantity],     [UnitPrice],     [ExtendedAmount],     [UnitPriceDiscountPct],     [DiscountAmount],     [ProductStandardCost],     [TotalProductCost],     [SalesAmount],     [TaxAmt],     [Freight],     [CarrierTrackingNumber],     [CustomerPONumber],     [OrderDate],     [DueDate],     [ShipDate] )WITH (DROP_EXISTING = OFF) ON [PRIMARY] Now when I run the query on a clear cache: Table 'FactResellerSales_V2'. Scan count 4, logical reads 2207, physical reads 18, read-ahead reads 3988, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 235 ms, elapsed time = 327 ms. I think the figures speak for themselves ! Sub-second response and because all columns are part of the index, a broad range of questions can be satisfied by this single index. Storage The traditional (compressed) non-clustered index takes up around 208 MB whereas the Columnstore Index comes in a little less at 194 MB so speed and storage efficiency, further compounded when you take into account the potential additional indexes the warehouse may require. So, the downsides? Columnstore indexes render the table read-only. In order to to update the table you either need to drop and re-create the index or employ a partition switching approach. The other notable disadvantage, consistently witnessed during my tests, is the columnstore index takes longer to build. The traditional non-clustered index took approximately 21 seconds to build whereas the columnstore took approximately 1 minute 49 seconds. Remember though, you only need one columnstore index to satisfy many queries so that’s potentially not a fair comparison. Troubleshooting If you don’t notice a huge difference between a table scan and a Columnstore Index Scan, check the Actual Execution Mode of the Columnstore Index Scan. This should be set to Batch, not Row. If the Actual Execution Mode is reporting Row then your query cannot run in parallel: - Ensure, if running via Hyper-V, you have assigned more than one processor to the image. - Ensure the Server Property ‘Max Degee of Parallelism’ is not set to 1. Summary In summary, for warehousing workloads, a columnstore index is a great addition to the database engine with significant performance improvements even on reasonably small data sets. It will re-define the ‘index-light’ approach that the SQL Server Fast Track Data Warehouse methodology champions and help simplify warehouse based performance tuning activities. Will it work in every scenario? I very much doubt it, but it’s a good place to start until we get to experience it live in the field.

HP Business Decision Appliance–PowerPivot in a box!

Further to my blog post about SQL server appliances Microsoft and HP also offer the very exciting Business Decision Appliance.  This contains a preconfigured environment with: Microsoft Windows Server 2008 R2 Enterprise Edition Microsoft SQL Server 2008 R2 Enterprise Edition with PowerPivot integration for SharePoint Microsoft SharePoint 2010 Enterprise Edition Prerequisites for SharePoint and PowerPivot Appliance Administration Console Appliance-specific SharePoint Home Page Up to 80 Concurrent Users The aim of this appliance is to provide a safe and scalable environment for business users to quickly put a PowerPivot environment in place.  This is isolated from existing systems and can be implemented by a business department with very limited IT involvement.  Its isolated nature is important as many organisations don’t run SharePoint 2010 throughout their enterprise.  PowerPivot authors also need Excel 2010 but once reports are built and deployed they can be shared throughout the business through SharePoint using any web browser.  Anyone who has already tried to set up an integrated SharePoint and PowerPivot environment will know that the installation is not simple. This appliance takes that pain away with a one click installation from first start up that can have you up and running in under an hour. The key point is that business users love PowerPivot and its ability to quickly use their data to answer any question.  They can respond dynamically, collaborate and share insights throughout the organisation.  Importantly PowerPivot allows the creation of reports that look great and are very fast.  All this is done from within the familiar Excel interface and requiring little training to get started.  Any team of business analysts would likely have a massive boost in productivity from the installation of the BDA in their department.  IT can then monitor those reports that are heavily used and decide whether they need making more robust through transition to enterprise software such as Analysis Services.  PowerPivot use is at its easiest in an environment where there is a clean data warehouse but where the business aren't happy with the speed or responsiveness to change of their current front end.  In an environment with more disparate and dirty data the end users need to be more SQL and data modelling skilled, but the payback can be even greater. I see massive opportunity in this appliance for any team of data analysts to be able to deliver massive value to their business right now.  More Info on the BDA