David Stelfox's Blog

Snowflake – Part 1: Introduction


Microsoft have a well-established and successful architecture for modern data warehousing which has been implemented by Adatis at multiple clients. This architecture could look like the following:


[Diagram: an example Microsoft modern data warehousing architecture in Azure]


Azure SQL Data Warehouse is Microsoft’s cloud-based data warehousing offering. It provides an MPP architecture where compute can be scaled elastically and quickly with limited downtime, integrates easily with other Azure services such as Azure Databricks and Azure Analysis Services, and can be secured using Azure Active Directory.

 

While Azure SQL Data Warehouse will allow you to ingest large datasets and provide very fast querying times, there are times when it might not exactly fit your requirements. One example is that Azure SQL Data Warehouse requires downtime to scale. Due to its architecture this is unavoidable; however, some companies want 100% uptime even while scaling. This is where new technologies such as Snowflake come in. Snowflake sells itself as a new approach to data warehousing, promising to address some of the issues other cloud data warehouse solutions have, especially performance issues around concurrency and downtime to scale. This blog will introduce Snowflake’s architecture and how its Azure implementation can fit into a larger Microsoft workflow. Later blogs in the series will look at setting up a Snowflake instance and comparing Snowflake’s performance to Azure SQL Data Warehouse.

 

Snowflake’s architecture consists of three components:

[Diagram: Snowflake’s three-layer architecture – database storage, query processing and cloud services]


The Database storage layer is where data is stored when loaded into Snowflake. Snowflake manages all aspects of how the data is stored, including organisation, file sizes, compression and metadata. Data is stored in a columnar format, seemingly similar to a clustered columnstore index in Azure SQL Data Warehouse, and is replicated across multiple regions to ensure high availability. The data stored in this layer is only accessible through SQL queries in Snowflake; there is no direct access to the underlying storage as there would be with a SQL Server database.
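
As a quick illustration of that SQL-only interaction, everything from table creation to querying goes through statements like the following (a minimal sketch – the table and column names are purely illustrative, not from the original post):

-- Snowflake handles file organisation, compression and metadata behind the scenes;
-- the user only ever works through SQL statements like these.
CREATE OR REPLACE TABLE sales (
    product_name     STRING,
    product_category STRING,
    country          STRING,
    sales_amount     NUMBER(18,2)
);

-- Query the columnar storage with standard SQL
SELECT country, SUM(sales_amount) AS total_sales
FROM sales
GROUP BY country;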

 

The Query processing layer is a series of “Virtual warehouses”, which are independent MPP clusters. This means you can provision multiple clusters without any performance impact on the others, and there is a high level of resilience: while a virtual warehouse only exists in a single region, if that region becomes unavailable Snowflake will automatically re-provision it in another. Another feature of the Query processing layer is that every query processed automatically caches data, which the query optimiser in the Cloud services layer can use for subsequent queries – greatly reducing query times in some cases.
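
Virtual warehouses are created and managed with plain SQL. As a rough sketch (the warehouse names and sizes here are illustrative), separate warehouses for separate workloads might be provisioned like this:

-- One warehouse per workload, each an independent compute cluster
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WITH WAREHOUSE_SIZE = 'MEDIUM'
       AUTO_SUSPEND   = 300      -- pause after 5 minutes of inactivity
       AUTO_RESUME    = TRUE;    -- wake up automatically when a query arrives

CREATE WAREHOUSE IF NOT EXISTS loading_wh
  WITH WAREHOUSE_SIZE = 'SMALL'
       AUTO_SUSPEND   = 60
       AUTO_RESUME    = TRUE;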

 

Finally, the Cloud services layer is the coordination layer for Snowflake, handling everything from authentication to query optimisation to infrastructure management. This means much of the administration associated with cloud data warehousing is removed from the user and handled internally by Snowflake.

 

By completely separating storage and compute, Snowflake can take advantage of the performance benefits of existing data architectures (such as shared-disk and shared-nothing*) while eliminating some of their issues, such as problems with concurrency and with data movement and reorganisation between disks. On concurrency, many existing architectures struggle with multiple connections all competing for the same resources; Snowflake manages this by allowing virtual warehouses to be scaled instantly to make more resources available. On data movement and reorganisation, a shared-nothing architecture relies on the correct type of distribution across clusters to minimise data movement at query time; if this is misjudged it can lead to much slower query times. With a single data store, however, no data movement is required in Snowflake.
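
To give a feel for what scaling “instantly” looks like in practice, resizing a virtual warehouse is a single SQL statement (the warehouse name below is illustrative, and the multi-cluster settings assume the appropriate Snowflake edition):

-- Resize an existing warehouse to add compute for a heavier workload
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Or let Snowflake add clusters automatically as concurrency grows
ALTER WAREHOUSE analytics_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3;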

 

Conceptually this is all well and good, but as a Microsoft shop, how could Snowflake fit with the work that Adatis does? Previously Snowflake was deployed using AWS components, but it was recently announced that it would be made available on Azure. This means using Blob storage for storage and Azure Compute (read: VMs) for compute. By having this separation, it is more straightforward to see how it is possible to scale seamlessly. Need more compute? Add a more powerful VM to your cluster. Need concurrency between an analytical workload, a data ingestion workload and a data science workload? Split these workloads between different compute clusters (virtual warehouses). In addition, Snowflake can utilise Azure Active Directory for enterprise-grade security, use Azure Data Lake Store as a source, and, via Azure Data Factory, use a custom activity to connect to an instance with a .NET connection and export ARM templates so it can be integrated into a DevOps pipeline.
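
As a rough illustration of the storage/compute split on Azure, data sitting in Blob storage can be loaded into Snowflake through an external stage. The account, container, SAS token and table names below are placeholders, not details from the post:

-- Point Snowflake at an Azure Blob storage container
CREATE OR REPLACE STAGE azure_sales_stage
  URL = 'azure://<storage-account>.blob.core.windows.net/<container>/sales/'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>');

-- Load the staged files into an existing table using any running virtual warehouse
COPY INTO sales
  FROM @azure_sales_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);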

 

As we have seen, Snowflake on Azure is a solution to some of the issues which exist with other cloud data warehouses. However, there are some potential red flags, including a perceived lack of control over settings and configuration, and a slightly opaque costing model. In the next blog in this series, we will look at what you need to do to provision a Snowflake instance, as well as explore these red flags further.

 

 

* Shared-disk architecture is where all data is stored on a single disk accessible by all compute nodes – a traditional on-premises data warehouse. Shared-nothing architecture is where data is distributed across multiple nodes – e.g. Azure SQL Data Warehouse.

 


 

 

SSRS and DAX

Tools like Power BI have changed reporting, allowing power users to leverage tabular cubes to present information more quickly and without the (perceived) need for developers. However, experience tells us many users still want data in tables with a myriad of formatting and display rules. Power BI is not quite there yet in terms of providing all this functionality in the way that SSRS does. For me, SSRS's great value and, at the same time, its curse is the sheer amount of customisation a developer can do. I have found that almost anything a business user demands in terms of formatting and display is possible.

But if you have invested your time and money in a tabular SSAS model which plays nicely with Power BI, and your users want SSRS reports, how do you get to your data? Using DAX, of course. Using EVALUATE, SUMMARIZECOLUMNS and SELECTCOLUMNS you can return data from a tabular model in a tabular format, ready to be read as a dataset in SSRS.

If you had the following data in your tabular model:

ProductName       | ProductCategory | Country        | Sales
------------------|-----------------|----------------|--------
Surly Steamroller | Complete bike   | France         | £46,575
Genesis Day One   | Frame           | France         | £47,987
Genesis Day One   | Frame           | France         | £47,987
Genesis Vapour 20 | Complete bike   | France         | £24,867
Genesis Vapour 20 | Complete bike   | United Kingdom | £21,856
Genesis Day One   | Complete bike   | United Kingdom | £47,875
Surly Steamroller | Complete bike   | United Kingdom | £27,969
Surly Steamroller | Complete bike   | United States  | £46,575
Genesis Day One   | Frame           | United States  | £47,987
Genesis Day One   | Complete bike   | United States  | £47,987


A pattern for returning data for SSRS could be:

DEFINE
    // @country is supplied by the SSRS report parameter
    VAR FilterCountry = @country
EVALUATE
SELECTCOLUMNS (
    SUMMARIZECOLUMNS (
        'Product'[ProductName],
        'ProductCategory'[ProductCategory],
        'Geography'[Country],
        FILTER (
            VALUES ( 'Geography'[Country] ),
            (
                OR (
                    ( FilterCountry = "All" ),
                    PATHCONTAINS ( FilterCountry, 'Geography'[Country] )
                )
            )
        ),
        "Sales", [Sales]
    ),
    "Product Name", [ProductName],
    "Product Category", [ProductCategory],
    "Country", [Country],
    "Sales", [Sales],
    "Estimated VAT", [Sales] * 0.2
)

To step through this…

  • The DEFINE part declares a variable which takes its value from an SSRS parameter.
  • SUMMARIZECOLUMNS returns a cross join of all the dimension columns entered; by passing in a measure or fact column as the last part of the block (here [Sales]) it returns only the combinations of dimension members where there is a value for that measure or fact item – particularly useful if you are dealing with a date table stretching into the future.
  • FILTER filters the SUMMARIZECOLUMNS output based on the SSRS parameter value. By adding the OR and PATHCONTAINS sections you can handle multi-value parameters in SSRS. This is a good blog post explaining this approach.
  • SELECTCOLUMNS allows you both to provide friendly names for any dimensions (this is better handled in the tabular model, but it often isn't) and to perform calculations within your DAX query.
  • If your SSRS parameter was set to France and United Kingdom, the results would be:

Product Name      | Product Category | Country        | Sales   | Estimated VAT
------------------|------------------|----------------|---------|--------------
Surly Steamroller | Complete bike    | France         | £46,575 | £9,315.00
Genesis Day One   | Frame            | France         | £47,987 | £9,597.40
Genesis Day One   | Frame            | France         | £47,987 | £9,597.40
Genesis Vapour 20 | Complete bike    | France         | £24,867 | £4,973.40
Genesis Vapour 20 | Complete bike    | United Kingdom | £21,856 | £4,371.20
Genesis Day One   | Complete bike    | United Kingdom | £47,875 | £9,575.00
Surly Steamroller | Complete bike    | United Kingdom | £27,969 | £5,593.80


If you set up a connection to your tabular cube in SSRS, you can paste in your DAX code, configure your parameters and you are good to go!
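
One small detail worth calling out: for a multi-value country parameter, the @country query parameter needs to arrive as a single pipe-delimited string so that PATHCONTAINS can search it. A common way to do this (the parameter name here is illustrative) is an expression on the dataset parameter such as:

=JOIN(Parameters!Country.Value, "|")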

Displaying vector data on maps in Power BI

Something I have enjoyed doing in the past is making maps, especially ones showing how various things move around London. One thing I did was take some open data on usage of London Santander bikes, pass it through the Google Waypoints API and visualise it using Mapbox:

[Image: Santander bike rides across London visualised in Mapbox]

Which got me thinking: is it possible to do something similar using Power BI? Microsoft introduced the geography data type with SQL Server 2008, and more mapping providers are making versions of their tools available in Power BI, including Esri and Mapbox. From a business perspective, IoT devices are enabling the collection of greater volumes of geographical data for companies to use in their analyses.

Using the GeoJSON files I had previously produced for the project above, I set about importing them into a SQL DB instance in Azure as a geography data type. This involved reading the file into a variable and then iterating over the index position of each individual ride.

-- Load the GeoJSON file into a SQL variable
DECLARE @geojson VARCHAR(MAX) = (
    SELECT BulkColumn
    FROM OPENROWSET (BULK 'C:\path\to\my\file', SINGLE_CLOB) AS j
);

-- Build a LINESTRING for one ride; the index in $.features[0] was incremented
-- in turn to iterate over each ride in the file
SELECT
    geography::STGeomFromText(
        'LINESTRING(' + STRING_AGG(CAST(long + ' ' + lat AS VARCHAR(MAX)), ',') + ')',
        4326
    )
FROM
(
    SELECT *
    FROM OPENJSON(@geojson, '$.features[0].geometry.coordinates')
    WITH
    (
        long VARCHAR(100) '$[0]',
        lat  VARCHAR(100) '$[1]'
    )
) a

This results in a row per ride and visualises pretty well in SSMS. If you are familiar with the geography of London you can make out the River Thames toward the centre of the image and Regent's Park towards the top left:

[Image: the rides rendered in the SSMS spatial results viewer]

This could be overlaid on a shapefile of London or a map from another provider such as Google Maps or Mapbox.

However, when you try to load the dataset into Power BI, you find that Power BI does not natively support geography data types. There is an idea you can vote on here to get them supported: https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/12257955-support-sql-server-geometry-geography-data-types-i


After a bit of research it looked like using R within Power BI could be an option. The thinking behind this was that, rather than converting the longitudes and latitudes to a geography data type as above, I wrote them to a table with an Id to associate each ride and let the ggmap library in R join the dots.
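
The post doesn't show that load step, but a minimal T-SQL sketch of the idea (the table and column names are assumptions, following the pattern of the earlier query) could look like this:

-- Shred every ride in the GeoJSON file into one row per coordinate,
-- keeping a LineId (the ride) and the point order so ggmap can join the dots
DECLARE @geojson VARCHAR(MAX) = (
    SELECT BulkColumn
    FROM OPENROWSET (BULK 'C:\path\to\my\file', SINGLE_CLOB) AS j
);

SELECT
    CAST(f.[key] AS INT) AS LineId,      -- index of the feature = one ride
    CAST(p.[key] AS INT) AS PointOrder,  -- position of the point within the ride
    JSON_VALUE(p.[value], '$[0]') AS long,
    JSON_VALUE(p.[value], '$[1]') AS lat
INTO dbo.RidePoints
FROM OPENJSON(@geojson, '$.features') AS f
CROSS APPLY OPENJSON(f.[value], '$.geometry.coordinates') AS p;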

Once you have your data loaded into Power BI, you only need a couple of lines of R code to render the map:

# ggmap renders the base map; 'dataset' is the data frame Power BI passes to the R visual
library(ggmap)

qmap("london", zoom = 12, maptype = "toner-lite") +
    geom_path(aes(x = long, y = lat, group = LineId),
              size = .7, data = dataset, colour = "#FF69B4", lineend = "round")

An important thing to note is that this R library uses the Google Maps API, which could raise some privacy concerns. Also, before showing this to users you would want to assign a Google Maps API token in Power BI, as without one you are using a generally available connection, which means your maps will often fail to load.

The resulting map displays the paths – although the sort order of the rows gets mixed up somewhere in Power BI for some paths, hence the horizontal lines.

[Image: the resulting Power BI map showing the ride paths]


And it turns out that the Power BI service does not currently support the use of certain R libraries which make client-server queries over the internet, meaning you cannot publish this map.

So, as a proof of concept, it is possible to display vector paths on a map in Power BI Desktop; however, you will need to wait until Microsoft supports geography data types in Power BI, or allows R libraries which call an external service, before taking it any further.

These links provided the basis for this PoC and would be a good starting point if you wanted to take things further: