DavidStelfox

David Stelfox's Blog

Snowflake – Part 1: Introduction


Microsoft have a well-established and successful architecture for modern data warehousing which has been implemented by Adatis at multiple clients. This architecture could look like the following:


clip_image002[5]


Azure SQL Data Warehouse is Microsoft’s cloud-base data warehousing offering providing an MPP architecture where compute can be scaled elastically quickly with limited downtime, easily integrated with other Azure services such as Azure Databricks and Azure Analysis Services and can be secured using Azure Analysis Services.

 

While Azure SQL Data Warehouse will allow you to ingest large datasets and provide very fast querying times there are time when it might not exactly fit with your requirements. One example is that Azure SQL Data Warehouse requires downtime to scale. Due to its architecture this is unavoidable however some companies want 100% uptime even when you are scaling. This is where new technologies such as Snowflake come in. Snowflake sells itself as a new approach to data warehousing, promising to address some of the issues other cloud data warehouse solutions have, especially performance issues around concurrency and downtime to scale. This blog will introduce Snowflake’s architecture and how its Azure implementation can fit into a larger Microsoft workflow. Later blogs in the series will look at setting up a Snowflake instance and comparing Snowflake’s performance to Azure SQL Data Warehouse.

 

Snowflake’s architecture consists of three components:

clip_image004[4]


The Database storage layer is where data is stored when loaded into Snowflake, which manages all aspects of how it is stored including organisation, file sizes, compressions and metadata. Data is stored in a columnar format, seemingly like a clustered columnstore index in an Azure SQL Data Warehouse and replicated across multiple regions to ensure high availability. The data stored in this layer is only accessible through SQL queries in Snowflake; there is no direct access to the underlying database as there would be in a SQL Server database.

 

The Query processing layer is a series of “Virtual warehouses”, which are independent MPP clusters. This means you can provision multiple clusters without any performance impact on another clusters and there is a high level of resilience as while virtual warehouses only exist in a single region, if that region is unavailable, Snowflake will automatically re-provision the Virtual Warehouse in another. Another feature of the Query processing layer is, for every query processed it will automatically cache data, which the query optimiser in the Cloud services layer can use for subsequent queries – greatly reducing query times in some cases.

 

Finally. the Cloud services layer is the coordination layer for Snowflake handling everything from authentication to query optimisation to infrastructure management. This means much of the administration associated with cloud data warehousing is removed from the user and handled internally by Snowflake.

 

By completely separating storage and compute, Snowflake can take advantage of the performance benefits of existing data architectures (such as shared-disk and shared-nothing*) while eliminating some of the issues with these architectures such as problems with concurrency and data movement and reorganisation between disks. With concurrency, many existing architectures struggle with multiple connections all competing for the same resources, Snowflake manages this by allowing scaling of virtual warehouses instantly to make more resources available. For data movement and reorganisation, a shared-nothing architecture relies on the correct type of distribution across clusters to minimise data movement at query time, if this is misjudged it can lead to much slower queries times. However, with a single data store no data movement is required in Snowflake.

 

Conceptually this is all well and good but as a Microsoft shop, how could Snowflake fit with the work that Adatis does. Previously Snowflake has been deployed using AWS components, but it was recently announced that it would be made available on Azure.  This means using blob store for storage and Azure Compute (read: VMs) for compute. By having this separation, it is more straightforward to see how it is possible to scale seamlessly. Need more compute? Add a more powerful VM to your cluster. Need concurrency between an analytical workload, a data ingestion workload and a data science workload? Split these workloads between different compute clusters (Virtual Warehouses). In addition, Snowflake can utilise Azure Active Directory for enterprise-grade security, use Azure Data Lake Store as a source, and via Azure Data Factory use a custom activity to connect to an instance with a .net connection and export ARM templates so it can be integrated into a DevOps pipeline. 

 

As we have seen Snowflake on Azure is a solution to some of the issues which exist with other cloud data warehouses. However, there are some potential red flags including a perceived lack of control over settings and configuration, and a slightly opaque costing model. In the next blog in this series, we will look at what you need to do to provision a Snowflake instance as well as explore these red flags more.

 

 

* Shared-disk architecture is where all data is stored on a single disk which is accessible by all compute nodes – a traditional on-premise data warehouse. Shared-nothing architecture is where data is distributed across multiple nodes – Azure SQL Data Warehouse.

 


 

 

Loading