
Sacha Tomey's Blog

The Azure Modern Data Warehouse: Unparalleled Performance

Today, 80% of organisations adopt cloud-first strategies to scale, reduce costs, capitalise on new capabilities such as advanced analytics and AI, and remain competitive. Cloud adoption is accelerating, and data exploitation is a key driver.

The central tenet of this enterprise-wide exploitation of data is the cloud-based Modern Data Warehouse. Legacy on-premises or appliance-based EDWs, once the strategic asset of only the largest enterprises, not only limit performance and flexibility but are also harder to set up and scale securely.

The Modern Data Warehouse fundamentally changes the landscape for data analytics by making analytics available to everyone across organisations of all sizes, not only the largest enterprises.

A modern data analytics platform enables you to bring together huge volumes of relational and non-relational, structured and unstructured data into a single repository: the highly scalable and cost-effective Azure Data Lake. This provides access across the enterprise, from basic information consumption through to innovation-led data science.

Big data processing for data preparation, such as transformation and cleansing, can be performed alongside Machine Learning and AI, with the results made readily available for analysis through visual tools like Power BI.

Azure provides unparalleled performance at incredible value. To further support this claim, Microsoft has just announced the GigaOm TPC-DS benchmark results, which further cement Azure SQL Data Warehouse's position as the price/performance leader for both decision-support benchmarks, having already attained best price/performance status for the TPC-H benchmark, announced back in February 2019.

[Chart: TPC-DS @ 30TB, $ per Query per Hour]

[Chart: TPC-H @ 30TB, $ per Query per Hour]

Azure SQL Data Warehouse (Azure SQL DW) has always delivered on performance when compared to the alternatives, and now GigaOm has found that analytics on Azure is 12x faster and 73% cheaper when compared using the TPC-DS benchmark. Azure SQL DW has established itself as the alternative to on-premises data warehouse platforms and a leader in cloud analytics.

Adatis have been at the cutting edge of cloud analytics solutions since the introduction of the Azure SQL Data Warehouse PaaS offering back in 2015. In the last 18 months we have noticed the profile of Azure SQL DW rise sharply, with it outperforming and taking over workloads from its closest competitors.

We specialise in all aspects of delivering the value of cloud analytics, AI and the Modern Data Warehouse, from strategic, business-value-led engagements through technical design and implementation to ongoing continuous improvement via fully managed DataOps practices.

[Diagram: Modern Data Warehouse architecture]

Adatis utilise Microsoft Azure technologies, in conjunction with first-party Spark-based services that securely integrate to provide enterprise-grade, cloud-scale analytics and insight for all, and partner deeply with Microsoft to enable data-driven transformation for our customers. We work with you to develop a modern data analytics strategy and ensure it is implemented and supported in the best way possible, aligned to your company's goals and overarching strategy.

If you want to find out how Adatis can help you make sense of your data and learn more about the Modern Data Warehouse, please join us in London on 6th June for an exclusive workshop. We will guide you through the Microsoft landscape and showcase how we can help you get more value from your data, wherever you are on your data transformation journey.

Register here for our "Put your Data to Work" in-person event to be held in London on 6th June 2019

Additional Resources

Microsoft Azure Big Data and Analytics

Information on Azure SQL Data Warehouse

Try Azure SQL Data Warehouse for Free

#SimplyUnmatched, #analytics, #Azure, #AzureSQLDW, #MSPowerBI

MDX Stored Procedures (Part 2)

What seems like a lifetime ago I posted about MDX Stored Procedures and how we've built a helper class that provides a wrapper to an MDX query.

The helper class has been stable for quite some time now and a few people have been asking me when I'm going to release it.  Well, today is the day.

You can download a Visual Studio 2005 C# Quick Start project from here.

The idea behind the class is that text files containing parameterised MDX queries can be executed against a cube with just a couple of lines of code.

Here are some notes to complement the quick start to get you going:

Step 1:  Create an instance of the helper class

MDXHelper helper = new MDXHelper
    ("localhost",
    "Adventure Works DW Standard Edition",
    "Adventure Works",
    MDXHelper.MDXResourceFormat.RelativeURL,
    "MDX/Product Category List.txt");

There are a couple of overloads on the constructor.  The one above specifies all the options.  The first three parameters set the Analysis Services server, the database and the cube.  The last two parameters specify the format of the MDX query resource.  In this case it's a relative URL, meaning the final parameter must be a relative URL pointing to a text file containing the desired query.  The alternative to RelativeURL is Text, which means the last parameter must be the query itself rather than a path to a file.
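For completeness, here's a sketch of the Text resource format, where the final argument is the MDX itself (the query shown is just an illustration, not part of the quick start):

// Using the Text resource format: the last argument is the MDX query,
// not a path to a file.
MDXHelper inlineHelper = new MDXHelper(
    "localhost",
    "Adventure Works DW Standard Edition",
    "Adventure Works",
    MDXHelper.MDXResourceFormat.Text,
    "select [Measures].[Internet Sales Amount] on columns " +
    "from [Adventure Works]");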

Step 2:  Add the parameters to the parameter collection.

helper.Parameters.Add(
    "@ProductCategory",
    CategoryList.SelectedValue);

Parameters are simply name/value pairs and are held internally as a generic Dictionary&lt;string, object&gt;.  Parameters are optional, but you will get an error if you invoke a parameterised query without defining a value for every parameter.
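For illustration, "MDX/Product Category List.txt" might contain something like the following (a hypothetical file, not necessarily the one shipped with the quick start). The @ProductCategory token is replaced verbatim with the parameter value, e.g. a member reference such as [Product].[Category].&[1], before execution:

select
    non empty [Product].[Product].members on rows,
    [Measures].[Internet Sales Amount] on columns
from [Adventure Works]
where (@ProductCategory)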

Step 3: Execute the Query

DataTable categoryList = helper.ExecuteDataTable();

A DataTable or a CellSet can be returned; again, various overloads exist to provide additional flexibility. For example, you can set the resource format and query path on ExecuteDataTable itself:

DataTable result = helper.ExecuteDataTable(MDXHelper.MDXResourceFormat.RelativeURL, "MDX/Internet Sales by Country.txt");
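From there it's standard ADO.NET. For example, in an ASP.NET page you might bind the results to a grid (CategoryGrid is an assumed control name, not part of the quick start):

// Bind the query results to a GridView on the page.
DataTable categoryList = helper.ExecuteDataTable();
CategoryGrid.DataSource = categoryList;
CategoryGrid.DataBind();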

Additional Notes

The query definitions are cached in the ASP.NET application cache, so if you make any changes to the query definition files you'll need to recycle the application pool.

The helper class will only work for web projects.  There is no support for WinForms/smart client deployments.

If you are returning a data table, you can specify some additional options:

  • Untyped is the fastest and the default, returning a plain data table based on a flattened CellSet object.
  • Typed brings back a data table with the schema column data types set to match the CellSet object it was derived from.  This can be slow for large datasets.
  • UntypedFormatted brings back a plain data table but applies the formatting (e.g. currency formats) held in the cube.
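The exact member used to select between these options isn't shown above, so treat the following as a hypothetical sketch (the enum and overload names are illustrative only):

// Hypothetical: return a data table using the formatting held in the cube.
// MDXHelper.DataTableType is an illustrative name, not confirmed above.
DataTable formatted = helper.ExecuteDataTable(MDXHelper.DataTableType.UntypedFormatted);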

There are some additional properties containing metadata and debugging information that have proved useful.  ExecutedMdxQuery contains the actual MDX that was executed after the parameter replacements have occurred.  DatabaseServerVersion, CubeDescription, CubeLastProcess and CubeLastSchemaUpdate are pulled from the server when a query is executed.

You can use parameters to define any aspect of the query.  For example, you could use parameters to swap rows and columns, define measures, pretty much anything you can achieve with search and replace.  On the one hand this provides no end of flexibility; on the other it provides diddly squat compile-time debugging!
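As a sketch of that flexibility (the parameter names and file here are illustrative, not part of the quick start), the query file itself could parameterise the axes:

// Hypothetical query file "MDX/Generic Query.txt":
//     select
//         non empty {@RowAxis} on rows,
//         {@ColumnAxis} on columns
//     from [Adventure Works]
helper.Parameters.Add("@RowAxis", "[Product].[Category].members");
helper.Parameters.Add("@ColumnAxis", "[Measures].[Internet Sales Amount]");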

The same database connection is maintained between query execution calls.

All feedback gratefully received.

Basket Analysis using Analysis Services 2005

My job is pretty varied these days :o)

Today I've been looking at satisfying some basket analysis type requirements where, for example, we need to determine what other products customers have bought in addition to a known product.

We could have used data mining for this but decided not to as, in this case, it literally was a few mouse clicks away from our existing AS2005 cube.

The implementation was surprisingly straightforward and the query results (admittedly on a subset of the full dataset) very impressive.

In an attempt to outline the implementation steps, I will add some basket analysis to the Adventure Works sample database to demonstrate how easy it was!

Requirement

List all products (and associated profit margin) that have also been purchased by customers who have bought a road bike from the 550-W range.

Approach

Rather than using data mining to handle queries of this type, I want to extend my current cube.  In order for the performance to be acceptable, we will adopt a variation on the many-to-many distinct count solution discussed in numerous blogs, white papers and articles.

The result of this approach is a new measure group, which we'll call 'Customer Cross Purchase', and a new reference dimension based on product for our 'known product', which we'll call 'Product Filter'.

[Diagram: relational schema for Customer Cross Purchase]

From a relational perspective, the product filter dimension provides a lookup for customers (in Customer Cross Purchase) that have purchased that product.

This in turn provides the ability to locate all orders (Internet Sales Fact) for that subset of customers.  Once we know all the orders for that subset of customers, we can simply list the distinct products (from the Product dimension) that make up those orders.
Implementation

The measure group contains a distinct set of customers and products that will act as the junction table of our many-to-many relationship.  It is a view over the main fact table and the customer dimension.

CREATE VIEW [dbo].[vCustomerCrossPurchase]
AS
SELECT DISTINCT
    f.CustomerKey, f.ProductKey
FROM dbo.FactInternetSales AS f
INNER JOIN dbo.DimCustomer AS c ON f.CustomerKey = c.CustomerKey

Next, the fact table [view] is added to the Data Source View, ensuring the relationships to the customer and product dimension are set up.

[Screenshot: updated Data Source View]

With the DSV updated, the cube itself can be extended.  A new measure group is created, together with a Product Filter reference dimension.  The dimension usage looks like the diagram below; this ensures the appropriate relationships exist, as outlined above.

[Screenshot: dimension usage]

The new measure group is mapped to the Product Filter and Customer dimensions, as per our DSV.  Note that this is not done automatically, as the real (non-referenced) Product dimension is selected instead.

To complete the picture, the Customer Cross Purchase measure group is used to create a many-to-many relationship between the Product Filter and the main Internet Sales measure group.

Testing

Once deployed and processed, we can test out our modifications to check for some reasonable results.

The following MDX query returns a list of all products that have been bought by customers buying a road bike from the 550-W range in Reading, England.

select
    non empty [Product].[Product].members on rows,
    [Measures].[Internet Gross Profit] on columns
from
    [adventure works]
where
(
    [Product Filter].[Product Model Categories].[Model].&[Road-550-W],
    [Customer].[City].&[Reading]&[ENG]
)

The query is simple: it lists products on the rows and profit on the columns, while the 'where' clause slices by Reading, England and employs the new Product Filter dimension.  The Product Filter dimension has the effect of slicing the main fact table to customers that have bought a bike from the Road 550-W range.

[Screenshot: query results]

So we can see that, apart from the road bikes, a few other accessories have been purchased too.  A quick couple of queries confirm the results.
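For example, a query along these lines (a sketch against the AdventureWorksDW relational source; filtering on ModelName is one way to identify the range) lists the customers who bought a Road-550-W bike:

-- Customers who have bought a road bike from the 550-W range.
SELECT DISTINCT f.CustomerKey
FROM dbo.FactInternetSales AS f
INNER JOIN dbo.DimProduct AS p ON f.ProductKey = p.ProductKey
WHERE p.ModelName = 'Road-550-W'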

[Screenshot: customers who bought a Road-550-W bike]

Three customers (above) have bought a road bike from the 550-W range, and the other products these customers have bought (below) match our results!

[Screenshot: other products bought by those customers]

Analysis Services XMLA to clear query cache

To get a realistic measurement of how your query tuning improves query response time, you should start from a common point.  To help with this, clear the query cache before running queries after making improvements (unless you are cache warming, of course!).

Here is some XMLA script to clear the cache for a specified database:

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
   <ClearCache>
      <Object>
         <DatabaseID>database id</DatabaseID>
         <CubeID>cube id</CubeID>
      </Object>
   </ClearCache>
</Batch>
 
Replace database id and cube id with the details for your cube.  Notice these are IDs, not names; look in the Analysis Services properties window to find them.
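If you find yourself doing this often, one option is to execute the XMLA from code.  Here's a minimal sketch using ADOMD.NET, which accepts XMLA as command text (the connection string and IDs below are examples only and must match your own server and cube):

using Microsoft.AnalysisServices.AdomdClient;

// Clears the query cache by executing the ClearCache XMLA batch.
// The catalog, database ID and cube ID are illustrative examples.
using (AdomdConnection connection = new AdomdConnection(
    "Data Source=localhost;Catalog=Adventure Works DW"))
{
    connection.Open();
    AdomdCommand command = connection.CreateCommand();
    command.CommandText =
        @"<Batch xmlns=""http://schemas.microsoft.com/analysisservices/2003/engine"">
             <ClearCache>
                <Object>
                   <DatabaseID>Adventure Works DW</DatabaseID>
                   <CubeID>Adventure Works</CubeID>
                </Object>
             </ClearCache>
          </Batch>";
    command.ExecuteNonQuery();
}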