Adatis BI Blogs

SQL Server Delete does not release file space allocation

This one’s a short blog post, and by no means new, but it may save you some head scratching in the future. Commonly in data warehousing projects you will be processing data through tables whilst staging and cleaning source data. These tables are often heaps (no clustered index) that hold no persistent data; rather they are a transient set of tables that get deleted from when your warehouse loads each day. What you may not know is that when a heap is deleted from (as opposed to truncated), the space that was used by that data is not automatically de-allocated from the table, and when new data is added the table space is increased, even though the table may be “empty”.

So what is actually going on? When deleting from a heap, the database engine locks either the page or the row, as opposed to the whole table when deleting on a clustered index. During the delete on the heap the page is emptied, but the engine does not de-allocate it. This behaviour only occurs on a heap, and can leave you with very large tables that contain no data. On my recent project it caused the stage and clean processes to slow down gradually over time, because the database engine was scanning huge tables with no data in them.

So, what can we do to stop this behaviour? There are three options (as documented by Microsoft here (see Locking Behaviour)):

1. When deleting from the table, obtain an exclusive lock on the whole table by using a TABLOCK hint as follows:

   DELETE FROM [TABLE] WITH (TABLOCK) WHERE [COLUMN] = 'CRITERIA'

2. If possible, truncate the whole table; this will de-allocate all pages.

3. Finally, my least favourite: add a clustered index to the table.

RANKX–Analysis Services 2012 Tabular

RANKX was a new function made available in DAX at the launch of SQL Server 2012; ranking is a widely used function in most BI projects, and it was very much anticipated. Having worked extensively with Tabular on a recent project, we had the need to rank over large sets of data, with models based on billions of rows. Whilst we were extremely impressed with the performance of Tabular, we found something not quite right with RANKX. Suppose we had the following very simple format of data, and wanted to perform a simple rank over it:

CountryKey  Date        Units  SalesValue
53          24/01/2012     62      81.290619
156         03/01/2012   3575    5186.565208
48          07/05/2012    321     362.836524
157         16/04/2012      7       8.005637
134         19/08/2012     20      44.12
232         29/07/2012     40      49.313156
178         05/03/2012     47      61.25
63          26/03/2012     51      73.369509
153         17/06/2012   3061    6026.4
31          08/07/2012   3622    5005.038257
140         13/02/2012   9870    7713.968564
14          27/05/2012  29220   46728.26302
98          20/08/2012    297     377.653459
131         30/07/2012    172     182.591931
48          27/03/2012    348     411.801792
157         06/03/2012      3       8.01422

In order to rank the above data by country over all time, we would probably write a simple DAX query as follows:

DEFINE
  MEASURE FactSales[UnitCount] = SUM(FactSales[Units])
  MEASURE FactSales[Rank] = RANKX(ALL(FactSales[CountryKey]), FactSales[UnitCount])
EVALUATE
  SUMMARIZE(
    FactSales
   ,FactSales[CountryKey]
   ,"ProductSalesUnits", FactSales[UnitCount]
   ,"Rank", FactSales[Rank]
  )
ORDER BY
  FactSales[UnitCount] DESC

You can see from the query above that, to avoid any doubt, we rank over a single table only, with no joins to other tables at all. This works fine, resulting in the data below.
FactSales[CountryKey]  [ProductSalesUnits]  [Rank]
227                    91818113             1
226                    21487836             2
39                     9033389              3
14                     7597054              4
82                     4656585              5
109                    3678302              6
75                     3371681              7
140                    2511239              8
107                    1850520              9
208                    1443241              10
31                     1092047              11
153                    997860               12
201                    914827               13
59                     870366               14
163                    840927               15

Now assume we rank over a lot more data and wish to apply a simple CALCULATETABLE in order to filter some data out. We may write a DAX query as follows:

DEFINE
  MEASURE FactSales[UnitCount] = SUM(FactSales[Units])
  MEASURE FactSales[Rank] = RANKX(ALL(FactSales[CountryKey]), FactSales[UnitCount])
EVALUATE
  CALCULATETABLE (
    SUMMARIZE(
      FactSales
     ,FactSales[CountryKey]
     ,"ProductSalesUnits", FactSales[UnitCount]
     ,"Rank", FactSales[Rank]
    )
   ,DATESBETWEEN('Date'[CalDate], DATE(2012,1,1), DATE(2012,8,31))
  )
ORDER BY
  FactSales[UnitCount] DESC

In this instance we are joining to a Date dimension, but that is it. The above query yields the result below:

FactSales[CountryKey]  [ProductSalesUnits]  [Rank]
227                    83147875             1
226                    19446567             2
39                     8137335              3
14                     6769390              4
82                     4100168              5
109                    3321496              6
75                     3010890              7
140                    2165554              8
107                    1655472              9
208                    1274740              10
31                     987122               11
153                    908938               12
201                    853322               13
59                     771677               14

That works; we get the results we would expect, within seconds. Now, instead of ranking over an integer field, let’s apply the RANKX function to a real number.
In this example we get a worrying result set, using DAX as follows (the measure definitions in the original post had some copy-and-paste slips; they are corrected here to rank on the new float measure, which is what the narrative describes):

DEFINE
  MEASURE FactSales[SalesValue] = SUM(FactSales[Sales Value])
  MEASURE FactSales[Rank] = RANKX(ALL(FactSales[CountryKey]), FactSales[SalesValue])
EVALUATE
  CALCULATETABLE (
    SUMMARIZE(
      FactSales
     ,FactSales[CountryKey]
     ,"ProductSalesValue", FactSales[SalesValue]
     ,"Rank", FactSales[Rank]
    )
   ,DATESBETWEEN('Date'[CalDate], DATE(2012,1,1), DATE(2012,8,31))
  )
ORDER BY
  FactSales[SalesValue] DESC

Which outputs:

FactSales[CountryKey]  [ProductSalesValue]  [Rank]
227                    84074007.25          1
226                    29928143.25          3
14                     10859628.74          4
39                     8451588.111          4
109                    7964922.769          6
82                     6254219.85           6
75                     4730390.37           7
107                    2466064.97           9
208                    1904009.18           10
140                    1862708.961          11
153                    1311217.35           11
22                     1207366.72           13
59                     1182179.95           15

Now let’s be clear: all we have done is change the measure from an integer to a float; the rest of the data is the same. You will notice that there are ties in the data that should not be there. Having scratched our heads for hours, rebuilt the model, rewritten the DAX, and had a number of colleagues check it over, we found that when no CALCULATETABLE is applied we get the correct answer again, as follows:

DEFINE
  MEASURE FactSales[SalesValue] = SUM(FactSales[Sales Value])
  MEASURE FactSales[Rank] = RANKX(ALL(FactSales[CountryKey]), FactSales[SalesValue])
EVALUATE
  SUMMARIZE(
    FactSales
   ,FactSales[CountryKey]
   ,"ProductSalesValue", FactSales[SalesValue]
   ,"Rank", FactSales[Rank]
  )
ORDER BY
  FactSales[SalesValue] DESC

FactSales[CountryKey]  [ProductSalesValue]  [Rank]
227                    92885561.31          1
226                    33237033.3           2
14                     12253005.68          3
39                     9414266.358          4
109                    8928147.606          5
82                     7120811.54           6
75                     5296490.13           7
107                    2756500.54           8
208                    2146627.18           9
140                    2115750.609          10
153                    1479858.53           11
22                     1339255.82           12
59                     1324799.98           13
105                    1320651.83           14
31                     1277065.779          15

So what we have learnt here is that RANKX seems to give the wrong answer, but only when ranking over real numbers, and only when we filter the data set in some
way using CALCULATETABLE. Not being able to find a clear reason for this behaviour, we eventually gave in and raised it with Microsoft. Having spent a week or so working it through with the great support team at Microsoft, it seems that this is a current “feature” of RANKX. It is believed to be a floating-point arithmetic issue driven by how floating-point numbers are stored, as documented here:

http://support.microsoft.com/kb/78113/EN-US

There is also some further information documented here:

http://technet.microsoft.com/en-us/library/gg492146.aspx
http://www.microsoft.com/en-us/download/details.aspx?id=4106

I have been told by Microsoft that this is the same as it is in Excel and Analysis Services; however, this still does not explain why the CALCULATETABLE makes a difference, and I am still working with Microsoft to get to the bottom of it. However, if precision beyond 15 significant figures within the rank is not important to you (up to 99 billion with 2DP), this issue is very easily worked around.
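The floating-point effect behind this can be reproduced outside of DAX. The sketch below is plain Python, not the RANKX internals; it simply shows that accumulating the same float values in a different order (which is what a filter that changes the rows scanned can cause) produces results that differ in the last bits, which is enough to disturb an equality comparison during ranking:

```python
# The same set of addends, accumulated in two different orders — analogous
# to the engine aggregating rows in a different sequence once a filter
# changes which data is scanned.
a = 0.1 + 0.2 + 0.3   # left to right
b = 0.3 + 0.2 + 0.1   # same values, different order

print(a == b)   # False
print(a, b)     # 0.6000000000000001 0.6

# A rank over such values sees two "different" numbers where mathematically
# there is only one. Rounding before comparing removes the noise:
print(round(a, 2) == round(b, 2))   # True
```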
Take the following DAX query (again with the measure names normalised to the Sales Value measure used above):

DEFINE
  MEASURE FactSales[SalesValue] = SUM(FactSales[Sales Value])
  MEASURE FactSales[Rank] = RANKX(ALL(FactSales[CountryKey]), ROUND(FactSales[SalesValue], 2))
EVALUATE
  CALCULATETABLE (
    SUMMARIZE(
      FactSales
     ,FactSales[CountryKey]
     ,"ProductSalesValue", FactSales[SalesValue]
     ,"Rank", FactSales[Rank]
    )
   ,DATESBETWEEN('Date'[CalDate], DATE(2012,1,1), DATE(2012,8,31))
  )
ORDER BY
  FactSales[SalesValue] DESC

Which now correctly outputs:

FactSales[CountryKey]  [ProductSalesValue]  [Rank]
227                    84074007.25          1
226                    29928143.25          2
14                     10859628.74          3
39                     8451588.111          4
109                    7964922.769          5
82                     6254219.85           6
75                     4730390.37           7
107                    2466064.97           8
208                    1904009.18           9
140                    1862708.961          10
153                    1311217.35           11
22                     1207366.72           12
105                    1196551.61           13
59                     1182179.95           14
31                     1132926.109          15

You will notice that all I have done is ROUND the measure inside the RANKX function to two decimal places. This stops the calculation engine from ranking over an approximated value and forces it to rank on a decimal with fewer than 15 significant figures, which in turn fixes the ranking. Further to this, rounding the values before they are loaded into the model will also work; just be sure to stay below 15 significant figures if possible. So, in summary, it seems that the calculation engine does not handle the float data type as we might expect when using the RANKX function. This is easily rectified by losing some precision in the rank measure you define. We have been informed that using the Currency data type will also fix the problem. I hope this stops others going through the same pain. As and when I get any further information on this issue I will post it up.
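The workaround generalises beyond DAX. Here is a minimal Python sketch (the country totals are made up for illustration; the rank function mimics RANKX-style ranking, 1 + the count of strictly greater values) showing how float noise produces the missing-rank anomaly and how rounding fixes it:

```python
# Hypothetical per-country totals where A and B differ only by floating-point
# noise far below the data's real precision.
totals = {
    "A": 0.1 + 0.2 + 0.3,   # 0.6000000000000001
    "B": 0.3 + 0.2 + 0.1,   # 0.6
    "C": 0.5,
}

def rank(values):
    # RANKX-style rank: 1 + number of values strictly greater than this one.
    return {k: 1 + sum(o > v for o in values.values()) for k, v in values.items()}

print(rank(totals))
# A and B get different ranks even though their totals are "equal".

rounded = {k: round(v, 2) for k, v in totals.items()}
print(rank(rounded))
# After rounding, A and B tie as expected (and the next rank is skipped).
```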

SQL Server 2012 Running Totals

SQL Server has sometimes come under fire in the Oracle vs SQL Server debate because of the lack of some of the more advanced (and less used) functionality outlined in the ANSI SQL standards. An example of this is window functions, which became part of the standard in the SQL:2003 revision. A window function is an aggregate function that can be applied to a subset of a full set of data. This can be achieved in earlier versions of SQL Server (2008 R2 and its predecessors), but not using window functions, and therefore with a performance implication. There are three or four approaches to achieving a running total in SQL Server prior to the 2012 version, none of them particularly elegant. In the following example I will outline how to perform a running total using the AdventureWorks2008 sample data. I will first show one of the methods (probably the most common) we used for running totals prior to SQL Server 2012, and then show the same running total using a SQL Server 2012 window function. To set the scene, we are looking for a running total of line items for a given order; this is the kind of query you might write to generate an invoice with a running total on it.

To begin with, let’s look at the more traditional query:

SELECT
    A.SalesOrderID,
    A.SalesOrderDetailID,
    A.LineTotal,
    SUM(B.LineTotal) AS OrderRunningTotal
FROM Sales.SalesOrderDetail AS A
    CROSS JOIN Sales.SalesOrderDetail AS B
WHERE
    B.SalesOrderDetailID <= A.SalesOrderDetailID
    AND A.SalesOrderID = B.SalesOrderID
GROUP BY
    A.SalesOrderID,
    A.SalesOrderDetailID,
    A.LineTotal
ORDER BY
    A.SalesOrderID,
    A.SalesOrderDetailID,
    A.LineTotal

This works by self-joining each row to every row up to that point in the same order, so it is fine for any data that can be ordered easily; as running-total data usually can be, this method usually suffices.
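To make the mechanics of the self-join concrete, here is a small Python sketch (illustrative rows, not the AdventureWorks data) that computes a running total the same quadratic way: for each line, re-sum every line in the same order up to and including it:

```python
# Each (order_id, detail_id, line_total) tuple is analogous to a row of
# Sales.SalesOrderDetail.
rows = [
    (1, 1, 10.0), (1, 2, 5.0), (1, 3, 2.5),
    (2, 1, 7.0), (2, 2, 3.0),
]

# The self-join approach: for every row A, sum all rows B in the same order
# with B.detail_id <= A.detail_id — quadratic in the lines per order.
result = [
    (o, d, t, sum(bt for (bo, bd, bt) in rows if bo == o and bd <= d))
    for (o, d, t) in rows
]

for r in result:
    print(r)
# (1, 1, 10.0, 10.0)
# (1, 2, 5.0, 15.0)
# (1, 3, 2.5, 17.5)
# (2, 1, 7.0, 7.0)
# (2, 2, 3.0, 10.0)
```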
The following query uses the new window function syntax in SQL Server 2012:

SELECT
    SalesOrderID,
    SalesOrderDetailID,
    LineTotal,
    SUM(LineTotal) OVER
        (PARTITION BY SalesOrderID
         ORDER BY SalesOrderDetailID) AS OrderRunningTotal
FROM
    Sales.SalesOrderDetail
ORDER BY
    SalesOrderID,
    SalesOrderDetailID,
    LineTotal

You will instantly notice that the second query is far more elegant and simpler to understand. In short, we are telling SQL Server to sum the LineTotal over an ordered partition of the data. Both queries return the same result; however, it becomes interesting when we compare the two execution plans. The plan for the second query is smaller and simpler. The other very significant point is that the query cost (relative to the batch) is a whopping 97% for the old method of running totals, meaning that the new window functions are far more efficient. In summary, I believe this is an example of SQL Server becoming a firm competitor to some of its perceived rivals: window functions used to be one of the few valid points that could be raised in the SQL Server vs Oracle argument, but not any more!
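The reason the windowed query is cheaper is that it can be answered in a single ordered pass per partition rather than a self-join. A Python sketch of that one-pass shape (illustrative rows of (order_id, detail_id, line_total), not AdventureWorks), mirroring SUM(...) OVER (PARTITION BY ... ORDER BY ...):

```python
from itertools import accumulate, groupby
from operator import itemgetter

rows = [
    (1, 1, 10.0), (1, 2, 5.0), (1, 3, 2.5),
    (2, 1, 7.0), (2, 2, 3.0),
]

# Partition by order_id, order by detail_id, then keep a cumulative sum:
# one linear pass per partition instead of a quadratic self-join.
result = []
for order_id, group in groupby(sorted(rows), key=itemgetter(0)):
    group = list(group)
    running = accumulate(t for (_, _, t) in group)
    for (o, d, t), rt in zip(group, running):
        result.append((o, d, t, rt))

for r in result:
    print(r)
# (1, 1, 10.0, 10.0)
# (1, 2, 5.0, 15.0)
# (1, 3, 2.5, 17.5)
# (2, 1, 7.0, 7.0)
# (2, 2, 3.0, 10.0)
```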