Ust

Ust Oldfield's Blog

Integration Testing a Data Platform with Pester

In a previous post, I gave an overview to integration tests and documenting integration points. In this post, I will give a practical example of developing and performing integration tests with the Pester framework for PowerShell. With a data platform, especially one hosted in Azure, it’s important to test that the Azure resources in your environment have been deployed and configured correctly. After we’ve done this, we can test the integration points on the platform, confident that all the components have been deployed.

The code for performing integration tests is written in PowerShell using the Pester Framework. The tests are run through Azure DevOps pipelines and are designed to test documented integration points. The PowerShell scripts, which contain the mechanism for executing tests, rely upon receiving the actual test definitions from a metadata database.

Database Structure

The metadata database should contain a schema called Test, which is a container for all the database objects for running tests using Pester. These objects are:

  • Test.TestCategory - contains what category of test is to be run e.g. Integration Tests
  • Test.TestType - contains the type of tests that need to be run and are associated with a particular type of functionality. In the Pester Framework, Test Type maps to the Describe function.
  • Test.Test - contains the individual tests to be run, with reference to the test type and environment. In the Pester Framework, Test maps to the Context function.
  • Test.Assert contains the individual asserts to be executed against the output from the test run, with reference to the Test and type of assert. In the Pester Framework, Test maps to the It function.

How you design the tables, is up to you, but I suggest that the schema looks similar to the above.

Test Environment Setup

Before we begin testing all the integration points, we need to be confident that the environment, for which the platform is deployed to, has been created and configured correctly. If it hasn’t, there’s no point in progressing with the actual integration tests as they would fail. For this, we have an initial script to perform these checks.

The script executes a stored procedure called Test.ObtainTests, which returns the list of tests to be run. Within a Pester Describe block, the tests are executed. The tests use the Get-AzureRmResource cmdlet and asserts that the name of the deployed resource matches that of the expected resource, as defined in the TestObject.

If any of the tests fail in this phase, no further testing should take place.

Integration Tests

We’re confident that the environment has been created and configured correctly, so now we’re ready to run the integration tests according to the documented integration points. For this example, we’ll be putting in some data into the RAW layer of the data lake, running it through the various layers until it ends up in the CURATED layer and can be read by Azure SQL DW. Because the majority of the processing is orchestrated using Azure Data Factory V2 (ADF), and the majority of the integration points are within ADF, we only really need to ensure that the pipeline(s) run successfully and some valid data appears in the CURATED layer for SQL DW to consume via PolyBase.

Because we’re also deploying some data, we’ve got elements of setup and teardown in the script. Setup and teardown in the metadata database, so that ADF knows what to process. Setup and teardown in the data lake, so that there is data to process

Tying it all together

We’ve got our scripts, but how does it get invoked? This is where the InvokePester script comes in. For anyone not familiar with Pester, this is effectively the orchestrator for your testing scripts.

If you deploy the tests to Azure DevOps as part of a release pipeline, you’ll see a similar output to the image below:

image

This should give you enough to start using Pester for testing your own Azure data platform implementations.

Whitelisting Azure IP addresses for SQL Server

In a recent blog post, I wrote about whitelisting Azure Data Centre IP addresses for Key Vault. Key Vault’s firewall uses CIDR notation for IP ranges, which is exactly what is contained within the list of IP addresses supplied by Microsoft. However, there are some resources, like Azure SQL Server, which only accept IP ranges. Therefore, we need a way of converting CIDR to an IP range.

Handily, there’s a PowerShell script which exists to provide that conversion – called ipcalc.ps1. When you download it, make sure it’s in the same working folder as the script you’re going to use to create the new firewall rules.

From there, we can make slight amends to the script we had in the previous post and produce the following script:


If you need to assign the IP ranges to other resources you can substitute the New-AzSqlServerFirewallRule with the appropriate cmdlet and parameters


Whitelisting Azure IP addresses for Key Vault

A colleague came to me with an interesting request:

We want to put Key Vault behind a firewall, but when we do that it means that Azure Data Factory can no longer access the secrets. Is there a way to whitelist the IP addresses for a given Azure Data Centre?

The short answer is: Yes.

By default, the following option is enabled on Azure Key Vault under the Firewalls and virtual networks blade.

image

For most users, having unrestricted access from external networks to a resource that holds secrets, certificates and other sensitive information is a big red flag.

If we choose to only allow access from Selected Networks we get the following options opening up for us:

image

Note that trusted Microsoft services is not an extensive list and does not include Azure Data Factory.

image

Therefore we need to whitelist a series of IP Addresses in the firewall rules. The list of IP Addresses are published by Microsoft and are updated on a weekly basis. The IP addresses are published in an XML document, which isn’t always the best format when one needs to update firewalls in Azure.

Shredding XML

To update the Firewall in Azure, we’re going to use PowerShell to shred the XML and extract the IP ranges for a given region. Then, we’re going to use the updated Azure PowerShell module to register the IP ranges against the Key Vault.

Using the last command, we can check that the IP ranges have been registered successfully. You should see something like:

image

There we have it, explicit IP whitelisting of Azure Data Centres so we can lock down Azure resources, only opening up access when we need to.

Update

Key Vault is currently limited to 127 firewall rules. If you are adding a region with more than 127 IP ranges, you might have an issue…

Unit Testing Overview

In a previous post, I touched on the point of testing and briefly talked about unit testing. In this post, I will be going into more detail about what unit testing is and why it’s important to do it.

In the previous post, I said that Unit Tests are:

low level tests, meaning that they are close to the source of the product. They should be written with the aim of testing individual methods and functions for a given code base, using a unit test framework to support the authoring and execution of a test. As a developer, you would typically author the unit tests in a development tool like Visual Studio; you’d run them locally to ensure that the tests pass; and then they would be executed on a regular basis as a task in a Build Pipeline within Azure DevOps. Unit Tests are cheap to automate and should be quick to run.

To expand on this, unit tests are written by a developer to apply to a unit of code. But what do we mean by “unit of code”? A unit of code is the smallest testable part of a solution – verifying that the individual part or component of a solution works as intended, independently from other parts. A unit could be a C# method; a PowerShell function; a T-SQL Stored Proc, and many others. Like most forms of testing, unit tests follow a pattern of:

  • Initialise system under test
  • Call method under test
  • Assert expected outcome against result of method

A best practice would be to write the unit test before the writing any code, but if you’ve not got to that level of maturity with your test approach - writing tests after code is still good practice.

How do you write a good unit test?

Keep it simple

  • A unit test shouldn’t replicate the code it is intended to test.
  • You’ll be writing lots of them, so make them quick and easy to write.

Readable

  • By keeping it simple, the test should also be readable. Making it easy to know what method is being tested and the expected behaviour of the method.
  • By making it readable, you can easily address any failures that may surface.

Reliable and Repeatable

  • Unit tests should only fail if there are bugs in the system, not because there are bugs in the tests. Keeping it simple and readable will avoid that issue.
  • Unit tests need to be run many times, sometimes multiple times throughout the course of a day, so they need to be executed quickly in a repeatable manner. Keeping it simple helps achieve this aim.

How do you write a unit test?

We’ve got an understanding of what a unit test is, but how do we write one? For this example, we’ll be writing our code and tests using C#.

Our application is a very simple calculator, which adds two numbers together.

Calculator

Simply, to add a new Unit Test, we can right-click on the method and select Create Unit Tests. Because we’ve not built any unit tests before, we can use it to create a new unit test project using a framework of choice. If we already had a unit test project, we could add the new test to the existing project.

createUnitTest

Using this method, it creates a skeleton of a unit test from which we can amend for our needs.

unitTestNew

As you can see, this doesn’t contain what we need, so we amend the test so that it reflects our requirements, as in the below.

unitTestAmended

To run a Unit Test, you can either right-click on the test method and click on Run Test(s) or open up the Test Explorer window, navigate to the desired test and click on Run Selected Tests.

Unit Tests in Azure DevOps

We’ve written our unit tests and have run them locally, but how do we make it repeatable? We utilise the power of Azure DevOps to have repeatable tests run against a changing code base as part of the Build or Continuous Integration process.

image

The process is:

  1. Install NuGet on the Build Agent
  2. Restore any packages from NuGet that your application requires
  3. Build solution
  4. Run tests
  5. Publish tests
  6. Copy successfully built and tested artifacts to a staging directory
  7. Publish those artifacts

Using Azure DevOps, or another CI tool, we can rely upon our tests in a repeatable manner.

Additional Reading

There’s a good post by Sergey Kolodiy on the importance of writing good code and how unit testing encourages good behaviour.

My colleague Jon has also written a post on the subject: Setup Unit Testing with NUnit and NBi.


Azure Active Directory Authentication and Azure Data Catalog

In a previous post I introduced Azure Data Catalog. Because it’s great for data discovery and for data asset management, it makes sense to automate, as much as possible, the process of registering new data assets, and allowing users to discover data in a more natural, perhaps conversational, way. In order to automate the registration of data assets or to allow discovery through other tools, it’s necessary to look at how Azure Data Catalog authenticates users using Azure Active Directory (AAD). This post is going to explore some of the options the Azure Data Catalog uses for authentication and a walkthrough of a code example to make authentication work without user input.

Azure Active Directory Authentication

If you have interacted with Azure Data Catalog before, you will find that there are two ways of doing so. First, there’s the web application that allows you to conduct data discovery and data asset management. Then there’s the native application that sits on your local machine that can be used for registering data assets. These use different methods of authenticating using Azure Active Directory. The first one uses Web Browser to Web Application authentication. The second uses Native Application to Web API authentication.

Web Browser to Web Application

What is involved with Web Browser to Web Application authentication? Simply put, the web application directs the user’s browser to get them to sign-in AAD. AAD then returns a token which authenticates the user to use the web application. In practice, it’s a bit more complex, so here’s a diagram to help explain it.

image

In a bit more detail, the process it follows is:

1) A user visits the application and needs to sign in, they are redirected via a sign-in request to the authentication endpoint in AAD.

2) The user signs in on the sign-in page.

3) If authentication is successful, AAD creates an authentication token and returns a sign-in response to the application’s Reply URL that was configured in the Azure Portal. The returned token includes claims about the user and AAD that are required by the application to validate the token.

4) The application validates the token by using a public signing key and issuer information available at the federation metadata document for Azure AD. After the application validates the token, Azure AD starts a new session with the user. This session allows the user to access the application until it expires.

This method of authentication is used by Azure Data Catalog when discovering data through the browser.

Native Application to Web API

What’s the process of Native Application to Web API authentication? Simply put, the application will ask you to sign-in to AAD, so that it can acquire a token in order to access resources from the Web API. In practice, it’s a bit more complex, so here’s a diagram to help explain it.

image

In a bit more detail, the process it follows is:

1) The native application makes a request to the authorisation endpoint in AAD, but using a browser pop-up. This request includes the Application ID and redirect URI of the native application (see the following article for native applications and registering them in Azure) and the Application ID URI of the Web API. The user is then requested to sign-in.

2) AAD authenticates the user. AAD then issues an authorisation code response back to the application’s redirect URI.

3) The Application then stops the browser activity and extracts the authorisation code from the response. Using the authorisation code, the Application then requests an access token from AAD. It also uses details about the native application and the desired resource (Web API).

4) The authorisation code and details are checked by AAD, which then returns an access token and a refresh token.

5) The Application then uses the access token to add to the authorisation header in its request to the Web API. Which returns the requested resource, based on successful authentication.

6) When the access token expires, the refresh token is used to acquire a new access token without requiring the user to sign-in again.

This method of authentication is used by Azure Data Catalog when registering data assets via the desktop application.

Automated Interaction with Azure Data Catalog

In both of the examples above, they require the user to interact in order to provide sign-in credentials. This is not ideal if we want to automate the registration of data assets or conduct data discovery outside of the browser. Therefore we’ll need to use a different method of authentication. This is the Server Application to Web API authentication method. Simply, it assumes that the server has already required a user to login and therefore has the user’s credentials. It then uses those credentials to request the access and refresh tokens from AAD.

image

In a bit more detail, the process it follows is:

1) The Server Application makes a request to AAD’s Token Endpoint, bypassing the Authentication Endpoint, providing the credential, Application ID and Application URI.

2) AAD authenticates the application and returns an access token that can be used to call the Web API.

3) The Application uses the access token to add to the authorisation header in its request to the Web API. Which returns the requested resource, based on successful authentication.

This method is what we’re going to use to automate our interaction with Azure Data Catalog.

From an authentication aspect, the code for Server Application to Web API is simple and this example will take us to the point of returning that token, from which we can then use to request resources from the Azure Data Catalog API. The full code can be found my GitHub repo.

We are going to use the Client Id and Secret from an application we’ve registered in AAD (full process can be found in this Microsoft article on Integrating Applications with AAD).

private static string clientId = "ApplicationId";

private static string secret = "ApplicationKey";

Then, we’re going to make sure we’re connecting to the correct AAD instance

private static string authorityUri = string.Format("https://login.windows.net/{0}", tenantId);

So we can create an authorisation context

AuthenticationContext authContext = new AuthenticationContext(authorityUri);

In order to acquire a token

authResult = await authContext.AcquireTokenAsync(resourceUri, new ClientCredential(clientId, secret));

Which can then be used in an authorisation header in requests to the Azure Data Catalog API. In the next related post, we’ll explore how to make a call to the API using this authentication method.

Tabular Automation and NuGet

In a recent blog post, I wrote about processing an Azure Analysis Services tabular model using Azure Functions. In it, there’s a lengthy process of downloading some DLLs and uploading them to the Azure Function. Handily, the Analysis Services team at Microsoft have released the Analysis Services NuGet package, which means that the necessary DLLs can be automatically installed to an Azure Function without much hassle. This blog is going to go through the steps of adding the NuGet package to your Azure Function.

Add a new file to your function called project.json

image

Input the following code in the newly created file

{
   "frameworks": {
     "net46":{
       "dependencies": {
         "Microsoft.AnalysisServices.retail.amd64": "15.0.2"
       }
     }
    }
}

Then save the Azure Function to proceed with the NuGet restore and compile your function. You should see the following logs in your log window.

image

That is the entire process. Much easier than documented previously!

Introduction to Azure Data Catalog

With the rise of self-service business intelligence tools, like Power BI, and an increased engagement with data in the workplace, people’s expectations of where they can find expert information about data has changed. Where previously there would an expert that people would have to book time with in order to understand data, now people expect to get quick and detailed information about the data assets that an enterprise holds and maintains without going through a single contact. With Azure Data Catalog, data consumers can quickly discover data assets and gain knowledge about the data from documentation, tags and glossary terms from the subject matter experts. This post aims to give a brief introduction to Azure Data Catalog and what it can broadly be used for.

What is Azure Data Catalog?

Azure Data Catalog is a fully managed Azure service which is an enterprise-wide metadata catalogue that enables data discovery. With Azure Data Catalog, you register; discover; annotate; and, for some sources, connect to data assets. Azure Data Catalog is designed to manage disparate information about data; to make it easy to find data assets, understand them, and connect to them. Any user (analyst, data scientist, or developer) can discover, understand, and consume data sources. Azure Data Catalog is a one-stop central shop for all users to contribute their knowledge and build a community and culture of data.

What can Azure Data Catalog be used for?

As mentioned in the earlier headings, Azure Data Catalog can be used for data asset management; data governance; and data discovery. For data asset management, this means knowing what data is available and where; for data governance teams, this means answering questions like: where is my customer data? or what does this data model look like?; for data discovery, this means knowing which data is suitable for particular reports and who you can go to if you have any questions. There are some common scenarios for using Azure Data Catalog that Microsoft has put together, and it’s well worth reading to get a fuller understanding of what Azure Data Catalog can be used for.




Process an Azure Analysis Services Tabular Model from an Azure Function

A couple of weeks ago I wrote a blog post detailing how to process a tabular model from a stored procedure. The challenge there was to have a way of processing a tabular model on demand from a web application. The challenge still exists, but how do you achieve the same in Azure using Platform as a Service (PaaS) objects which do not have the benefit of a full server and items, such as msdb and SQL Agent, to aid in the process?

In this post, I’m going to show you how to process a tabular model only using Azure PaaS offerings. Not only am I going to show you how to do process a tabular model on-demand, but also how to process a tabular model on a schedule. This post has taken inspiration and part of the code base from the a Microsoft blog: Automating Azure Analysis Services processing with Azure Functions.

Azure Functions

Before we begin properly, it’s worth spending some time introducing Azure Functions. According to Microsoft, Azure Functions are:

…a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP. Pay only for the time your code runs and trust Azure to scale as needed. Azure Functions lets you develop server less applications on Microsoft Azure.

They are super useful for extending the capabilities of any solution and not just limited to what we’re going to cover here.

On-Demand Refresh

The use of Azure Functions creates a trigger for the on-demand refresh of a tabular model from the web application or web hook, this is to make sure that selected elements of data in a tabular model, for example hot partitions, are always up to date. The following describes the process that Azure Functions will be involved in this scenario:

image

The steps that are needed to create an Azure Function for On-Demand Refresh are as follow:

1) Create an Azure Function App

Navigate to the Azure Portal and create a Function App (the name changes quite a bit, so don’t be concerned if it’s not exactly displayed as it is below)

image

image

2) Create a new Function

After you’ve created the Function App, we need to add a new Webhook + API function, which we’ll use as the basis for our on-demand refresh. Click on the + button next to Functions, select Webhook + API, choose C# as your language and click Create this function.

image

3) Configure the Function

Download the latest client libraries for Analysis Services. This needs to be done to your local machine so you can then copy these files to your Azure Function App. After you’ve downloaded the client libraries, the DLLs can be found in C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies. The two files you need are:

C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies\Microsoft.AnalysisServices.Core.DLL
C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies\Microsoft.AnalysisServices.Tabular.DLL

The documentation references the 130 assemblies, which is not correct and will not work. You need the assemblies in 140.

In order to add these assemblies to your function, you’ll need to add a folder called “bin”. To do this, select your Function App, click Platform features, and under Development Tools, click Advanced tools (Kudu).

image

In Kudu, click Debug console and select Cmd. Navigate to the site\wwwroot\OnDemandHttpTrigger folder and add the “bin” folder here by clicking the + button.

image

Once you’ve added the “bin” folder, go back over to the Azure portal and select your function OnDemandHttpTrigger.

On the right under View files, navigate to the bin folder. Then click the Upload button to add the two previously mentioned DLLs to the bin folder.

image

You should see the two DLLs in your bin folder now.

image

4) Add Azure Analysis Services Connection String to the Function App

This step is going to add the connection string to the Azure Analysis Services (AAS) service to the entire Function App, not just individual functions.

Click the name of your Function App, then select Platform features. Select Application settings under General Settings.

image

Now we need to add our connection string under the Connection strings section. You’ll need your AAS server name and a user ID and password that has access to the AAS database.

You can find your AAS server name by navigating to your AAS database in the Azure portal and copying the value constructed after clicking Show server connection strings:

image

Your connection string should look like this:

Provider=MSOLAP;Data Source=<your aas server>; Initial Catalog=<aas database name>;User ID=<your username>;Password=<your password>

Back in the screen for the Function App, fill in the Name textbox with a name for your connection string and paste your connection string in the Value text box:

image

Click Save at the top to save these settings for your Functions.

5) Time for Code

Our Function App has been configured, now we need to add code to the function. The function comes with a working function, for which to test out the functionality, but we don’t need everything that is on offer.

image

We’re going to programmatically process the tabular model. In doing so, we’ll leverage Analysis Services Management Objects (AMO). If you’re new to AMO, start here.

Paste in the following code (all code referenced can also be download from my GitHub Repo):

#r "Microsoft.AnalysisServices.Tabular.DLL"

#r "Microsoft.AnalysisServices.Core.DLL"

#r "System.Configuration"

using System;

using System.Configuration;

using Microsoft.AnalysisServices.Tabular;

using System.Net;

public static async Task<HttpResponseMessage> Run(HttpRequestMessage req, TraceWriter log)

{

log.Info("C# HTTP trigger function processed a request.");

// parse query parameter

string status = req.GetQueryNameValuePairs()

.FirstOrDefault(q => string.Compare(q.Key, "status", true) == 0)

.Value;

if (status == null)

{

// Get request body

dynamic data = await req.Content.ReadAsAsync<object>();

status = data?.status;

}

if (status == "execute")

{log.Info($"C# trigger function started at: {DateTime.Now}"); 

try

            {

Microsoft.AnalysisServices.Tabular.Server asSrv = new Microsoft.AnalysisServices.Tabular.Server();

var connStr = ConfigurationManager.ConnectionStrings["AASTabular"].ConnectionString;

asSrv.Connect(connStr);

Database db = asSrv.Databases["azureadventureworks"];

Model m = db.Model;

//db.Model.RequestRefresh(RefreshType.Full);     // Mark the model for refresh

m.RequestRefresh(RefreshType.Full);     // Mark the model for refresh

//m.Tables["Date"].RequestRefresh(RefreshType.Full);     // Mark only one table for refresh

db.Model.SaveChanges();     //commit  which will execute the refresh

asSrv.Disconnect();

            }

catch (Exception e)

            {

log.Info($"C# trigger function exception: {e.ToString()}");

            }

log.Info($"C# trigger function finished at: {DateTime.Now}");

}

return status == "execute"

?req.CreateResponse(HttpStatusCode.OK, "Successfully Processed Tabular Model ")

:req.CreateResponse(HttpStatusCode.BadRequest, "Please pass a status on the query string or in the request body");

}

Click the Save button at the top.

6) Test, Test, Test

Click the Run button at the top to test the function

image

The function can also be tested in a web browser, and be called by a Web App using the POST HTTP method.

image

Now we have a fully functioning method of refreshing an Azure Analysis Services tabular model on-demand.

Scheduled Refresh

The use of Azure Functions creates a trigger for the scheduled refresh of a tabular model, this is to make sure that the entire tabular model has the latest data and is always up to date. The following describes the process that Azure Functions will be involved in this scenario:

image

The steps that are needed to create an Azure Function for Scheduled Refresh are as follow:

1) Create a Function

We’ve created our Function App, and now we need to add a new Timer Trigger function, which we’ll use as the basis for our scheduled refresh. Click on the + button next to Functions, select Timer, choose C# as your language and click Create this function.

2) Configure the Timer

What use is a timer without a schedule? To give the timer a schedule, click Integrate, set the schedule and click Save.

image

The Schedule text box expects a CRON expression to define the days and times that the function should execute. Click the little Documentation button on the screen above to read about CRON expressions. The schedule I’ve set is to run once everyday at 09:30AM.

3) Configure the Function

See step 3 of the On-Demand Function for detailed steps. You’ll need to create the bin folder and upload the DLLs to the bin folder.

4) Time for Code

We’ve configured our function, so now it’s time to add the code. The code base is much simpler than the On-Demand code base, mainly because it’s doing fewer tasks. But the AMO section is exactly the same. Paste in the following code:

#r "Microsoft.AnalysisServices.Tabular.DLL"

#r "Microsoft.AnalysisServices.Core.DLL"

#r "System.Configuration"

using System;

using System.Configuration;

using Microsoft.AnalysisServices.Tabular;

public static void Run(TimerInfo myTimer, TraceWriter log)

{

log.Info($"C# Timer trigger function started at: {DateTime.Now}"); 

try

            {

Microsoft.AnalysisServices.Tabular.Server asSrv = new Microsoft.AnalysisServices.Tabular.Server();

var connStr = ConfigurationManager.ConnectionStrings["AASTabular"].ConnectionString;

asSrv.Connect(connStr);

Database db = asSrv.Databases["azureadventureworks"];

Model m = db.Model;

//db.Model.RequestRefresh(RefreshType.Full);     // Mark the model for refresh

m.RequestRefresh(RefreshType.Full);     // Mark the model for refresh

//m.Tables["Date"].RequestRefresh(RefreshType.Full);     // Mark only one table for refresh

db.Model.SaveChanges();     //commit  which will execute the refresh

asSrv.Disconnect();

            }

catch (Exception e)

            {

log.Info($"C# Timer trigger function exception: {e.ToString()}");

            }

log.Info($"C# Timer trigger function finished at: {DateTime.Now}");

}

Click the save button at the top.

5) Test, Test, Test

Click the Run button at the top to test the function

image


Now we have a fully functioning method of refreshing an Azure Analysis Services tabular model on-demand.

Conclusion

I have shown you how simple it is to invoke two methods of refreshing a tabular model using Azure Functions: an On-Demand refresh and a refresh by Schedule. I hope that you take inspiration and use these methods in your use of both Azure Analysis Services and Azure Functions.

Archiving the Data Lake

In a blog introducing the Data Lake Framework, keen readers will be aware that in the diagram there’s a box titled “ARCHIVE” but it has not been brought up since. The reason why the Archive layer in the data lake has not been discussed is because we’ve been waiting for the Archive Tier in Blob Storage.

To remind readers of the framework and where the archive layer sits in it, here it is again with the archive layer highlighted.

image

The Archive Blob

The Archive access tier in blob storage was made generally available today (13th December 2017) and with it comes the final piece in the puzzle to archiving data from the data lake.

Where Hot and Cool access tiers can be applied at a storage account level, the Archive access tier can only be applied to a blob storage container. To understand why the Archive access tier can only be applied to a container, you need to understand the features of the Archive access tier. It is intended for data that has no or low SLAs for availability within an organisation and the data is stored offline (Hot and Cool access tiers are online). Therefore, it can take up to 15 hours for data to be made online and available. Brining Archive data online is a process called rehydration (fitting for the data lake). If you have lots of blob containers in a storage account, you can archive them and rehydrate them as required, rather than having to rehydrate the entire storage account.

Archive Pattern

An intended use for the Archive access tier is to store raw data that must be preserved, even after it has been fully processed, and does not need to be accessed within 180 days.

Data gets loaded into the RAW area of the data lake, is fully processed through to CURATED, and a copy of the raw data is archived off to a blob container with a Cool access tier applied to it. When the archive cycle comes about, a new Cool access tiered blob container is created and the now old container has its access tier changed to Archive.

For example, our Archive cycle is monthly and we have a Cool access tiered blob container in our storage account called “December 2017”. When data has finished being processed in the Azure Data Lake, the Raw data is archived to this blob container. January comes around, we create a new blob container called “January 2018” with Cool access tier settings and change the access tier of “December 2017” from Cool to Archive.

This data has now been formally achieved and is only available for disaster recovery, auditing or compliance purposes. 




LETS Process Data–Modern ETL for the Data Lake

At the PASS Summit this year, I attended a session by Michael Rys. In this session he introduced the concept of LETS as an approach to process data in the data lake. If you are familiar with data lake, then you will be familiar of having to apply a schema to the data held within. The LETS approach is purpose design for schematization.

Where ETL stands for Extract, Transform, Load or ELT stands for Extract, Load, Transform – LETS stands for Load, Extract, Transform, Store.

Data are Loaded into the data lake

Data are Extracted and schematized

Data are Transformed in rowsets

Data are Stored in a location, such as the Catalog in Azure Data Lake Analytics, Azure Data Warehouse, Azure Analysis Services, for analysis purposes.

image

I really like this approach as it makes sense for how data are handled in the data lake. It’s something that I will be advocating and using, and I hope you do too!