In this blog I will provide an overview of the process for loading data into Snowflake. It will explain what staging the data means in the context of Snowflake, before illustrating the process of how data can be loaded into the Snowflake database from an Azure container. The goal of this blog is to help you understand from a conceptual point of view how data is loaded into Snowflake.
Before being loaded into a Snowflake table, the data can be optionally staged, which is essentially just a pointer to a location where the files are stored. There are different types of stages including:
- User stages, which each user will have by default
- Table stages, which each table will have by default
- Internal named stages, meaning staged within Snowflake
Internal named stages are the best option for regular data loads, if you are thinking along the lines of your standard daily ETL process. One benefit of these is the flexibility in that they are database objects, so you can grant privileges to roles to access these objects as you would expect. Alternatively, there are external stages, such as Azure Blob storage.
Let's have a high level look at how you would load data into Snowflake.
1. Source data is stored in files in an Azure container (Blob Storage, Azure Data Lake Store Gen1/Gen2).
2. Optionally, data can be staged, either internally within a Snowflake table, or externally, in an Azure container. If you choose to stage the data internally, you will need to use the PUT command to load the data into stage if it is stored on a local machine. This cannot be done from the Web UI for Snowflake, it needs to be done using SnowSQL, the CLI client for Snowflake. If you want to stage the data externally, such as in Blob Storage for example, you cannot use the PUT command, and may need to manually upload the data, use PowerShell, or Azure Data Factory for example, to load the data into the desired external stage location.
3. Use the 'COPY INTO' command (shown below) while a virtual warehouse is running with reference to a stage to load the data, or if the data is not staged, it can be loaded directly from the location the data is stored in, such as referencing a SAS token to load straight from Blob Storage.
However, there is a second way that you can load data into Snowflake, which is called Snowpipe, allowing data to be continuously loaded in Snowflake once a file becomes available in stage. Essentially, you can call a public REST endpoint with a list of filenames and a pipe name, and if any of these files are in stage, they are added to the queue to be loaded. Discussing Snowpipe in-depth is a whole other blog in itself, but it's worth noting that it is an option if you need data to be loaded into Snowflake more often than your daily ETL can provide.
Hopefully that should give a very brief overview of the concept behind loading data into Snowflake, however, there are a couple of features that are worth mentioning here.
- Firstly, Snowflake can load semi-structured data, think JSON or Parquet for example, which get loaded into a single VARIANT column, or you can transform the data into separate columns using the COPY INTO command.
- Secondly, if using an internal stage, unencrypted data will automatically be encrypted. Encrypted data can also be stored in an internal stage, providing the key used to encrypt the data is supplied to Snowflake.
Thanks for reading!