
Adatis BI Blogs

The Azure Modern Data Warehouse: Unparalleled Performance

Today, 80% of organisations adopt cloud-first strategies to scale, reduce costs, capitalise on new capabilities such as advanced analytics and AI, and remain competitive. Cloud adoption is accelerating, and data exploitation is a key driver. The central tenet of this enterprise-wide exploitation of data is the cloud-based Modern Data Warehouse. Legacy on-premises or appliance-based EDWs, once the strategic asset of only the largest enterprise organisations, not only limit performance and flexibility, but are also harder to set up and scale securely. The Modern Data Warehouse fundamentally changes the landscape for data analytics by making analytics available to everyone across organisations of all sizes, not only the largest enterprises.

A modern data analytics platform enables you to bring together huge volumes of relational and non-relational, or structured and unstructured, data into a single repository: the highly scalable and cost-effective Azure Data Lake. This provides access across the enterprise, from basic information consumption through to innovation-led data science. Big data processing for data preparation, such as transformation and cleansing, can be performed alongside Machine Learning and AI, with the results made readily available for analysis through visual tools like Power BI.

Azure provides unparalleled performance at incredible value. To further support this claim, Microsoft have just announced the GigaOm TPC-DS benchmark results, which further cement Azure SQL Data Warehouse as the price/performance leader for both decision support benchmarks, having already attained best price/performance status for the TPC-H benchmark announced back in February 2019.

[Chart: TPC-DS @ 30TB, $ per Query per Hour]
[Chart: TPC-H @ 30TB, $ per Query per Hour]

Azure SQL Data Warehouse (Azure SQL DW) has always delivered on performance when compared to alternatives, and now GigaOm has found that analytics on Azure is 12x faster and 73% cheaper when compared using the TPC-DS benchmark. Azure SQL DW has established itself as the alternative to on-premises data warehouse platforms and a leader in cloud analytics.

Adatis have been at the cutting edge of cloud analytics solutions since the introduction of the Azure SQL Data Warehouse PaaS offering back in 2015. In the last 18 months we have noticed the profile of Azure SQL DW rise sharply, with Azure SQL DW outperforming and taking over workloads from its closest competitors. We specialise in all aspects of delivering the value of cloud analytics, AI and the Modern Data Warehouse, from strategic, business-value-led engagements through technical design and implementation to ongoing continuous improvement via fully managed DataOps practices. Adatis utilise Microsoft Azure technologies, in conjunction with first-party Spark-based services that securely integrate to provide enterprise-grade, cloud-scale analytics and insight for all, and partner deeply with Microsoft to enable data-driven transformation for our customers. We work with you to develop a modern data analytics strategy and ensure it is implemented and supported in the best way possible, aligning to your company's specific goals and overarching strategy.

If you want to find out how Adatis can help you make sense of your data and learn more about the Modern Data Warehouse, please join us in London on 6th June for an exclusive workshop.
We will guide you through the Microsoft landscape and showcase how we can help you get more value from your data, wherever you are on your data transformation journey. Register here for our "Put your Data to Work" in-person event, to be held in London on 6th June 2019.

Additional Resources:
- Microsoft Azure Big Data and Analytics
- Information on Azure SQL Data Warehouse
- Try Azure SQL Data Warehouse for Free

#SimplyUnmatched, #analytics, #Azure, #AzureSQLDW, #MSPowerBI

Geographic Spatial Analysis with Azure Data Lake Analytics (ADLA)

Whilst working on an Azure Data Lake project, a requirement hit the backlog that could easily be solved with a Geographical Information System (GIS) or even SQL Server - spatial data type support was introduced into SQL Server 2008. However, Azure Data Lake Analytics (ADLA) does not natively support spatial data analytics, so we'll have to extract the data into another service, right? Wrong :)

Due to the extensibility of Azure Data Lake Analytics, we can enhance it to do practically anything. In fact, we can lean on existing components and enhance the service without having to develop the enhancement itself. This blog is a quick run-through demonstrating how to enhance ADLA so that it supports spatial analytics and meets our project requirement.

Problem

For simplicity I've trivialised the problem. Here's the requirement: indicate which bus stops are within 1.5 km of Southwark Tube Station. To support this requirement, we have two datasets:
- A list of all the bus stops in London, including their geo location (circa 20k records)
- The geo location record of Southwark Tube Station (a single record!)

In fact, the location of the tube station is pretty accurate and is geo-located to the entrance pavement outside the tube station. This would be an easy problem for a GIS to solve. You would specify the central point, i.e. our Southwark Tube Station marker, draw a circle (or buffer) with a radius of 1.5 km around it, and select all bus stops that fall within or intersect with that circle. This spatial analysis is easy for these systems as that's essentially what they are built to do. SQL Server 2008 introduced the spatial data type; this allowed spatial-style analysis to be performed on geo data using T-SQL in conjunction with the supplied Geometry and Geography data types. More info on those can be found here. So, how can we solve our problem in ADLA, without a GIS and without having to export the data to SQL Server?

Solution

You can register existing assemblies with ADLA. It so happens that the SQL Server Data Types and Spatial assemblies are nicely packaged up and can be used directly within ADLA itself - think about that, it's pretty awesome!

Caveat: at the time of writing we have no idea of the licence implications. It will be up to you to ensure you are not in breach :)

Those assemblies can be downloaded from here. You only need to download and install the following file: ENU\x64\SQLSysClrTypes.msi. This installs two key assemblies, which you'll need to grab and upload to your Data Lake Store:

C:\Program Files (x86)\Microsoft SQL Server\130\SDK\Assemblies\Microsoft.SqlServer.Types.dll
C:\Windows\System32\SqlServerSpatial130.dll

Once they have been uploaded to your Data Lake Store, you need to register those assemblies with ADLA:

DECLARE @ASSEMBLY_PATH string = "/5.UTILITY/USQL-Extend/SQL-Server/";
DECLARE @TYPES_ASM string = @ASSEMBLY_PATH+"Microsoft.SqlServer.Types.dll";
DECLARE @SPATIAL_ASM string = @ASSEMBLY_PATH+"SqlServerSpatial130.dll";

CREATE DATABASE IF NOT EXISTS SQLServerExtensions;
USE DATABASE SQLServerExtensions;

DROP ASSEMBLY IF EXISTS SqlSpatial;
CREATE ASSEMBLY SqlSpatial
FROM @TYPES_ASM
WITH ADDITIONAL_FILES = ( @SPATIAL_ASM );

Following registration of the assemblies, we can see the registration loaded in the ADLA catalog database we created. We are now ready to use this U-SQL enhancement in our U-SQL query - let's go right ahead and solve our problem in one U-SQL script.
// Reference the assemblies we require in our script.
// System.Xml we get for free as a system assembly, so we didn't need to register that;
// we also reference our SQLServerExtensions.SqlSpatial assembly.
REFERENCE SYSTEM ASSEMBLY [System.Xml];
REFERENCE ASSEMBLY SQLServerExtensions.SqlSpatial;

// Once the appropriate assemblies are registered, we can alias them using the USING keyword.
USING Geometry = Microsoft.SqlServer.Types.SqlGeometry;
USING Geography = Microsoft.SqlServer.Types.SqlGeography;
USING SqlChars = System.Data.SqlTypes.SqlChars;

// First create the centralised point.
// In this case it's the pavement outside the entrance of Southwark Tube Station, London.
// Format is Longitude, Latitude and then SRID.
// NB: It's Longitude then Latitude, that's the opposite way round to what you might expect.
DECLARE @southwarkTube Geography = Geography.Point(-0.104777,51.503829,4326);

// Next we extract our entire London bus stop data set from the file.
// There's about 20k of them.
@busStopInput =
    EXTRACT [StopCode] string,
            [StopName] string,
            [Latitude] double?,
            [Longitude] double?
    FROM @"/1.RAW/OpenData/Transport/bus-stops-narrow-full-london.csv"
    USING Extractors.Csv(skipFirstNRows:1,silent:true);

// This is effectively the transform step and where the magic happens.
// Very similar syntax to what you would do in T-SQL.
// We are returning all the bus stops that fall within 1500m of Southwark Tube.
// Essentially we return all stops that intersect with a 1500m buffer around the central tube point.
@closeBusStops =
    SELECT *
    FROM @busStopInput
    WHERE @southwarkTube.STBuffer(1500).STIntersects(Geography.Point((double)@busStopInput.Longitude,(double)@busStopInput.Latitude,4326)).ToString()=="True";

// The results are written out to a csv file.
OUTPUT @closeBusStops
TO "/4.LABORATORY/Desks/Sach/spatial-closebusstops.csv"
USING Outputters.Csv(outputHeader: true);

The query outputs a list of bus stops that are within the specified spatial distance of Southwark Tube Station. If we have a look at all the bus stops (in red) and overlay all the 'close' bus stops (in green), we can see the results. Pretty neat.

Azure Data Lake Analytics does not natively support spatial data analytics, but by simply utilising the assemblies that ship with SQL Server, we can extend the capability of U-SQL to provide that functionality, or practically any functionality we desire.
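As an aside, the same check can be expressed without the explicit buffer. The sketch below is not from the original script; it assumes the same registered assemblies, USING aliases, @southwarkTube point and @busStopInput rowset as above, and the rowset name @closeBusStopsByDistance is just made up for illustration. It uses SqlGeography.STDistance to keep stops within 1,500 metres of the tube station:

// Alternative (hypothetical) predicate: compare distances directly rather than
// intersecting with a 1500m buffer. Assumes @southwarkTube and @busStopInput
// are declared exactly as in the script above.
@closeBusStopsByDistance =
    SELECT *
    FROM @busStopInput
    WHERE @southwarkTube.STDistance(Geography.Point((double)Longitude, (double)Latitude, 4326)).Value <= 1500;

The rowset can then be written out with Outputters.Csv exactly as in the script above; both predicates should pick out the same set of stops, the distance form simply avoids building the buffer geography.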

SQL Server 2012: Columnstore Index in action

One of the new SQL Server 2012 data warehouse features is the Columnstore index. It stores data by columns instead of by rows, similar to a column-oriented DBMS like the Vertica Analytic Database, and claims to increase query performance by hundreds to thousands of times.

The issue with indexes in a data warehouse environment is the number and broad range of questions that the warehouse may have to answer. This means you either have to introduce a large number of large indexes (which in many cases results in a larger set of indexes than actual data), plump for a costly, spindle-rich hardware infrastructure, or opt for a balanced hardware and software solution such as a Microsoft SQL Server 2008 R2 Fast Track Data Warehouse or an HP Business Data Warehouse Appliance, where the approach is ‘index-light’ and you rely on the combination of high throughput and processing power to reduce the dependency on the traditional index. The Columnstore index is different: when applied correctly, a broad range of questions can benefit from a single Columnstore index, and the index is compressed (using the same VertiPaq technology that PowerPivot and Tabular-based Analysis Services share), reducing the effort required of the expensive and slow disk subsystem and shifting it onto the fast and lower-cost memory/processor combination.

In order to test the claims of the Columnstore index I’ve performed some testing on a Hyper-V instance of SQL Server 2012 “Denali” CTP3 using a blown-up version of the AdventureWorksDWDenali sample database. I’ve increased the FactResellerSales table from approximately 61,000 records to approximately 15.5 million records and removed all existing indexes to give me a simple, but reasonably large, ‘heap’.

Heap

With a clear cache, run the following simple aggregation:

SELECT
    SalesTerritoryKey,
    SUM(SalesAmount) AS SalesAmount
FROM [AdventureWorksDWDenali].[dbo].[FactResellerSales]
GROUP BY SalesTerritoryKey
ORDER BY SalesTerritoryKey

Table 'FactResellerSales'. Scan count 5, logical reads 457665, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 7641 ms, elapsed time = 43718 ms.

Non-Clustered Index

Before jumping straight in with a columnstore index, let’s review performance using a traditional index. I tried a variety of combinations; the fastest I could get this query to go was to simply add the following:

CREATE NONCLUSTERED INDEX [IX_SalesTerritoryKey] ON [dbo].[FactResellerSales]
(
    [SalesTerritoryKey] ASC
)
INCLUDE ([SalesAmount])
WITH (
    PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
    DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
    ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE
) ON [PRIMARY]
GO

Notice I have compressed the index using page compression; this reduced the number of pages my data consumed significantly. The IO stats when I re-ran the same query (on a clear cache) looked like this:

Table 'FactResellerSales'. Scan count 5, logical reads 26928, physical reads 0, read-ahead reads 26816, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 6170 ms, elapsed time = 5201 ms.

Much better! Approximately 6% of the original logical reads were required, resulting in a query response time of just over 5 seconds. Remember though, this new index will really only answer this specific question.
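To illustrate (this query wasn't part of the original tests), an aggregation over a different column, say ProductKey, gets no help from IX_SalesTerritoryKey, because ProductKey is neither the index key nor an included column:

-- Hypothetical query: IX_SalesTerritoryKey does not cover ProductKey,
-- so SQL Server is back to scanning the heap to answer it.
SELECT
    ProductKey,
    SUM(SalesAmount) AS SalesAmount
FROM [AdventureWorksDWDenali].[dbo].[FactResellerSales]
GROUP BY ProductKey
ORDER BY ProductKey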
Change the query along those lines and performance is likely to fall off the cliff and revert back to the table scan. Incidentally, adopting an index-light (no index) approach and simply compressing (and reloading, to remove fragmentation) the underlying table itself, performance was only nominally slower than the indexed table, with the added advantage of being able to perform well for a large number of different queries. (Effectively speeding up the table scan. Partitioning the table can help with this approach too.)

Columnstore Index

Okay, time to bring out the columnstore. The recommendation is to add all columns into the columnstore index (columnstore indexes do not support ‘include’ columns); practically, there may be a few cases where you do exclude some columns. Metadata or system columns that are unlikely to be used in true analysis are good candidates to leave out of the columnstore. However, in this instance, I am including all columns:

CREATE NONCLUSTERED COLUMNSTORE INDEX [IX_Columnstore] ON [dbo].[FactResellerSales]
(
    [ProductKey],
    [OrderDateKey],
    [DueDateKey],
    [ShipDateKey],
    [ResellerKey],
    [EmployeeKey],
    [PromotionKey],
    [CurrencyKey],
    [SalesTerritoryKey],
    [SalesOrderNumber],
    [SalesOrderLineNumber],
    [RevisionNumber],
    [OrderQuantity],
    [UnitPrice],
    [ExtendedAmount],
    [UnitPriceDiscountPct],
    [DiscountAmount],
    [ProductStandardCost],
    [TotalProductCost],
    [SalesAmount],
    [TaxAmt],
    [Freight],
    [CarrierTrackingNumber],
    [CustomerPONumber],
    [OrderDate],
    [DueDate],
    [ShipDate]
) WITH (DROP_EXISTING = OFF) ON [PRIMARY]

Now when I run the query on a clear cache:

Table 'FactResellerSales_V2'. Scan count 4, logical reads 2207, physical reads 18, read-ahead reads 3988, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 235 ms, elapsed time = 327 ms.

I think the figures speak for themselves! Sub-second response, and because all columns are part of the index, a broad range of questions can be satisfied by this single index.

Storage

The traditional (compressed) non-clustered index takes up around 208 MB whereas the Columnstore index comes in a little less at 194 MB, so we gain both speed and storage efficiency - a saving further compounded when you take into account the potential additional indexes the warehouse may otherwise require.

So, the downsides? Columnstore indexes render the table read-only. In order to update the table you either need to drop and re-create the index or employ a partition-switching approach. The other notable disadvantage, consistently witnessed during my tests, is that the columnstore index takes longer to build. The traditional non-clustered index took approximately 21 seconds to build whereas the columnstore took approximately 1 minute 49 seconds. Remember though, you only need one columnstore index to satisfy many queries, so that’s potentially not a fair comparison.

Troubleshooting

If you don’t notice a huge difference between a table scan and a Columnstore Index Scan, check the Actual Execution Mode of the Columnstore Index Scan. This should be set to Batch, not Row. If the Actual Execution Mode is reporting Row then your query cannot run in parallel:
- Ensure, if running via Hyper-V, you have assigned more than one processor to the image.
- Ensure the server property ‘Max Degree of Parallelism’ is not set to 1.
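A quick way to check the second point is via sp_configure. This snippet isn't from the original post; it's a minimal check, assuming you have permission to view advanced server options:

-- Show advanced options so 'max degree of parallelism' is visible, then inspect it.
-- A run_value of 1 prevents parallel plans (and therefore batch mode);
-- 0 means SQL Server can use all available schedulers.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism';

If it is set to 1, setting it back to 0 (the default) with sp_configure and RECONFIGURE allows parallel, batch-mode execution.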
Summary

In summary, for warehousing workloads, a columnstore index is a great addition to the database engine, with significant performance improvements even on reasonably small data sets. It will redefine the ‘index-light’ approach that the SQL Server Fast Track Data Warehouse methodology champions and help simplify warehouse-based performance tuning activities. Will it work in every scenario? I very much doubt it, but it’s a good place to start until we get to experience it live in the field.

SQL Server 2012 Licensing

Today saw the announcement of how SQL Server 2012 will be carved up and licensed, and it's changed quite a bit. There are three key changes:
1) There's a new Business Intelligence Edition that sits between Standard and Enterprise.
2) No more processor licensing. There's a move to core-based licensing instead (with a minimum cost of 4 cores per server).
3) Enterprise is only available on the core licensing model (unless upgrading through Software Assurance *).

Enterprise, as you would expect, has all the functionality SQL Server 2012 has to offer. The Business Intelligence edition strips away:
- Advanced Security (advanced auditing, transparent data encryption)
- Data Warehousing (ColumnStore, compression, partitioning)
and provides a cut-down, basic (as opposed to advanced) level of High Availability (AlwaysOn). In addition, the Standard Edition removes:
- Enterprise data management (Data Quality Services, Master Data Services)
- Self-Service Business Intelligence (Power View, PowerPivot for SharePoint)
- Corporate Business Intelligence (Semantic model, advanced analytics)

If you are utilising 4-core processors, licence costs for Standard ($1,793 per core, or $898 per server + $209 per CAL) and Enterprise ($6,874 per core) remain similar(ish). However, you will be stung if you have more cores. The Business Intelligence edition is only available via a Server + CAL licence model, and it's apparent that Microsoft are placing a big bet on MDS/DQS, Power View, PowerPivot for SharePoint and BISM, as the licence for the Business Intelligence edition is $8,592 per server, plus $209 per CAL - that's nearly 10x more per server than Standard Edition!

For the complete low-down check out these links:
Editions Overview: http://www.microsoft.com/sqlserver/en/us/future-editions/sql2012-editions.aspx
Licensing Overview: http://www.microsoft.com/sqlserver/en/us/future-editions/sql2012-licensing.aspx
Licence Detail (including costs): http://download.microsoft.com/download/D/A/D/DADBE8BD-D5C7-4417-9527-5E9A717D8E84/SQLServer2012_Licensing_Datasheet_Nov2011.docx

* If you are currently running Enterprise as a Server + CAL and you upgrade to SQL 2012 through Software Assurance, you can keep the Server + CAL model, providing you don’t exceed 20 cores.