Ust

Ust Oldfield's Blog

Regression Testing a Data Platform with Pester

In a previous post, I gave an overview to regression tests. In this post, I will give a practical example of developing and performing regression tests with the Pester framework for PowerShell. The code for performing regression tests is written in PowerShell using the Pester Framework. The tests are run through Azure DevOps pipelines and are designed to test regression scenarios. The PowerShell scripts, which contain the mechanism for executing tests, rely upon receiving the actual test definitions from a metadata database. The structure of the metadata database will be exactly the same as laid out in the Integration Test post.

Regression Tests

The regression tests will need to be designed so that existing functionality isn’t regressed by any changes made to the code. In an analytics system, the functionality is typically going to be aligned to the target schema that’s used for reporting and analysis. If we change the cleaning transformation logic in the source tables which make up our customer dimension, we’ll want to ensure that the customer dimension itself doesn’t change expected outcomes, for example row counts or a specific value.

For this example, we’ll be putting in some data into the data lake, run it through the various layers until it ends up in the CURATED layer. Because the majority of the processing is orchestrated using Azure Data Factory V2 (ADF), we only really need to ensure that the pipeline(s) run successfully and some valid data appears in all the layers of the lake, as well as logged into the metadata database.

Because we’re deploying some data, we’ve got elements of setup and teardown in the script. Setup and teardown in the metadata database, so that ADF knows what to process. Setup and teardown in the data lake, so that there is data to process.

This should give you enough to start using Pester for testing your own Azure data platform implementations.

Regression Testing Overview

In a previous post, I touched on the point of testing and briefly touched on regression testing. In this post, I will be going into more detail about what regression testing is and why it’s important to do it.

In the previous post, I said that Regression Tests are intended to:

verify that newly developed code into a deployed product does not regress expected results. We’ll still need to go through the process of unit testing and integration testing; but do we want to go through the rigmarole of manual testing to check if a change has changed more than what it was meant to? That’s something that we would like to avoid, so we have regression testing to alleviate that need. Like integration tests, they do need multiple parts of the product available so would need to be executed as part of a Release Pipeline in Azure DevOps. Regression testing is expensive to automate and maintain; and slow to run – but that doesn’t mean that they should be avoided. They add a layer of confidence to a newly changed code base which is about to be deployed. However, because we are testing targeted elements, perhaps the entire solution at once, we don’t want to run all regressions tests all the time because they would take a very long time to complete.

Regression Techniques

There are a variety of methods and techniques that can be used in the design and execution of regression tests. These are:

  • Retest All
  • Test Selection
  • Test Case Prioritisation

Retest All executes all the documented test cases to check the integrity of the solution. This is the most expensive technique for regression testing as it runs all the test cases, however, it does ensure that there are no errors in the modified code that could be released into Production.

Test Selection executes a defined selection of documented test cases to check the integrity of a section of the solution. Less expensive than the Retest All technique, but does introduce an element of risk as the test coverage does not cover the entire solution.

Test Case Prioritisation executes tests in priority order, executing higher priority tests over lower priority tests.

Regression Testing

Regression tests are executed for the various functional slices that exist in a solution. Like most forms of testing, regression tests follow a pattern of:

  • Initialise system under test
  • Call functionality under test
  • Assert expected outcome against result of method

Generally, regression tests will be executed after deployment as they often require the infrastructure to exist. Generally, regression tests are dependent on data, which must be created at the time of setting up the tests. Most of the time, you can automate regression tests using a unit test framework such as Pester or NUnit.

Some best practices

To get you going I’m going to set out some best practices that you should aim to follow:

  • Adopt a hybrid technique of mixing and matching regression techniques to use what’s best for you at the time
  • Create the data needed for the tests before execution, as part of the test setup
  • Multiple asserts per test. You might have dependencies on external resources that you’d like to keep open or you want a fast running set of tests. Multiple asserts help with all of these
  • Choose unit tests over regression tests when feasible
  • Choose integration tests over regression tests when feasible

Integration Testing a Data Platform with Pester

In a previous post, I gave an overview to integration tests and documenting integration points. In this post, I will give a practical example of developing and performing integration tests with the Pester framework for PowerShell. With a data platform, especially one hosted in Azure, it’s important to test that the Azure resources in your environment have been deployed and configured correctly. After we’ve done this, we can test the integration points on the platform, confident that all the components have been deployed.

The code for performing integration tests is written in PowerShell using the Pester Framework. The tests are run through Azure DevOps pipelines and are designed to test documented integration points. The PowerShell scripts, which contain the mechanism for executing tests, rely upon receiving the actual test definitions from a metadata database.

Database Structure

The metadata database should contain a schema called Test, which is a container for all the database objects for running tests using Pester. These objects are:

  • Test.TestCategory - contains what category of test is to be run e.g. Integration Tests
  • Test.TestType - contains the type of tests that need to be run and are associated with a particular type of functionality. In the Pester Framework, Test Type maps to the Describe function.
  • Test.Test - contains the individual tests to be run, with reference to the test type and environment. In the Pester Framework, Test maps to the Context function.
  • Test.Assert contains the individual asserts to be executed against the output from the test run, with reference to the Test and type of assert. In the Pester Framework, Test maps to the It function.

How you design the tables, is up to you, but I suggest that the schema looks similar to the above.

Test Environment Setup

Before we begin testing all the integration points, we need to be confident that the environment, for which the platform is deployed to, has been created and configured correctly. If it hasn’t, there’s no point in progressing with the actual integration tests as they would fail. For this, we have an initial script to perform these checks.

The script executes a stored procedure called Test.ObtainTests, which returns the list of tests to be run. Within a Pester Describe block, the tests are executed. The tests use the Get-AzureRmResource cmdlet and asserts that the name of the deployed resource matches that of the expected resource, as defined in the TestObject.

If any of the tests fail in this phase, no further testing should take place.

Integration Tests

We’re confident that the environment has been created and configured correctly, so now we’re ready to run the integration tests according to the documented integration points. For this example, we’ll be putting in some data into the RAW layer of the data lake, running it through the various layers until it ends up in the CURATED layer and can be read by Azure SQL DW. Because the majority of the processing is orchestrated using Azure Data Factory V2 (ADF), and the majority of the integration points are within ADF, we only really need to ensure that the pipeline(s) run successfully and some valid data appears in the CURATED layer for SQL DW to consume via PolyBase.

Because we’re also deploying some data, we’ve got elements of setup and teardown in the script. Setup and teardown in the metadata database, so that ADF knows what to process. Setup and teardown in the data lake, so that there is data to process

Tying it all together

We’ve got our scripts, but how does it get invoked? This is where the InvokePester script comes in. For anyone not familiar with Pester, this is effectively the orchestrator for your testing scripts.

If you deploy the tests to Azure DevOps as part of a release pipeline, you’ll see a similar output to the image below:

image

This should give you enough to start using Pester for testing your own Azure data platform implementations.

Integration Testing Overview

In a previous post, I touched on the point of testing and briefly talked about integration testing. In this post, I will be going into more detail about what integration testing is and why it’s important to do it.

In the previous post, I said that Integration Tests are:

intended to verify that the units of code and the services used in a product work together. As a result, they are more expensive to automate and maintain than unit tests; and can take considerably longer to run. Whilst unit tests can be run without dependencies of other parts of the product being available, integration tests often require multiple parts of the product – including infrastructure – to be up and running so that the integrations between units and services can be tested. Because integration tests might require infrastructure to be available, and certainly multiple parts of the product available, integration tests are best run as part of a Release Pipeline in Azure DevOps.

To expand on this, integration tests are written for each integration point for a solution. But what do we mean by “integration point”? An integration point is typically where two or more units of code interact with each other, or two or more services interact with each other – verifying that the individual parts or components of a solution works as intended together with other parts. How do we define an integration point?

Integration Points

We define an integration point by whiteboarding each component of our solution with the aim to document how they interact with each other. We can highlight the integration point by drawing a circle around it.

Consider the following architecture:

image

It’s a fairly typical modern data warehouse solution. We’re ingesting data from a variety of sources and storing it in a data lake. We’re then transforming and processing that data into our warehouse schema before presenting it in a data warehouse; processing it in an analysis services model so that it can be reported on. That’s the architecture, but the components used might be very different and interact differently with the architecture.

For the ingestion, our integration points are going to be between the following components:

sourceToRawIntegration

For the transformation piece, our integration points are going to look like:

rawToCuratedIntegration

Finally, for processing our data into the semantic model, the integration points look like:

curatedToSemanticIntegration

As you can see, the integration points do not align perfectly with the architecture – bear in mind that every solution is different, so your integration points will definitely look different even if the broad architecture is the same.

Integration Testing

We’ve documented our integration points and now we need to write some integration tests. Integration tests are executed for the various integration points that exist in a solution. Like most forms of testing, integration tests follow a pattern of:

  • Initialise system under test
  • Call functionality under test
  • Assert expected outcome against result of method

Generally, integration tests will be executed after deployment as they often require the infrastructure to exist. Most of the time, integration tests should not be dependent on data. However, if data does need to exist, this must be created at the time of setting up the tests. Most of the time, you can automate integration tests using a unit test framework such as Pester or NUnit.

Some Best Practices

To get you going, I’m going to set out some best practices that you should aim to follow:

  • Only create integration tests you need
  • Don’t depend on data being available. If you have tests that depend on data – create that data before execution, as part of the test setup
  • Multiple asserts per test. You might have dependencies on external resources that you’d like to keep open or you want a fast running set of tests. Multiple asserts help with all of these.
  • Choose unit tests over integration tests when feasible. Don’t duplicate effort.

Further reading

My colleague Ben has written an excellent blog on SQL Integration Testing using NUnit.

I’ll add another post soon about how to do Integration Testing using Pester.


Archiving the Data Lake

In a blog introducing the Data Lake Framework, keen readers will be aware that in the diagram there’s a box titled “ARCHIVE” but it has not been brought up since. The reason why the Archive layer in the data lake has not been discussed is because we’ve been waiting for the Archive Tier in Blob Storage.

To remind readers of the framework and where the archive layer sits in it, here it is again with the archive layer highlighted.

image

The Archive Blob

The Archive access tier in blob storage was made generally available today (13th December 2017) and with it comes the final piece in the puzzle to archiving data from the data lake.

Where Hot and Cool access tiers can be applied at a storage account level, the Archive access tier can only be applied to a blob storage container. To understand why the Archive access tier can only be applied to a container, you need to understand the features of the Archive access tier. It is intended for data that has no or low SLAs for availability within an organisation and the data is stored offline (Hot and Cool access tiers are online). Therefore, it can take up to 15 hours for data to be made online and available. Brining Archive data online is a process called rehydration (fitting for the data lake). If you have lots of blob containers in a storage account, you can archive them and rehydrate them as required, rather than having to rehydrate the entire storage account.

Archive Pattern

An intended use for the Archive access tier is to store raw data that must be preserved, even after it has been fully processed, and does not need to be accessed within 180 days.

Data gets loaded into the RAW area of the data lake, is fully processed through to CURATED, and a copy of the raw data is archived off to a blob container with a Cool access tier applied to it. When the archive cycle comes about, a new Cool access tiered blob container is created and the now old container has its access tier changed to Archive.

For example, our Archive cycle is monthly and we have a Cool access tiered blob container in our storage account called “December 2017”. When data has finished being processed in the Azure Data Lake, the Raw data is archived to this blob container. January comes around, we create a new blob container called “January 2018” with Cool access tier settings and change the access tier of “December 2017” from Cool to Archive.

This data has now been formally achieved and is only available for disaster recovery, auditing or compliance purposes. 




LETS Process Data–Modern ETL for the Data Lake

At the PASS Summit this year, I attended a session by Michael Rys. In this session he introduced the concept of LETS as an approach to process data in the data lake. If you are familiar with data lake, then you will be familiar of having to apply a schema to the data held within. The LETS approach is purpose design for schematization.

Where ETL stands for Extract, Transform, Load or ELT stands for Extract, Load, Transform – LETS stands for Load, Extract, Transform, Store.

Data are Loaded into the data lake

Data are Extracted and schematized

Data are Transformed in rowsets

Data are Stored in a location, such as the Catalog in Azure Data Lake Analytics, Azure Data Warehouse, Azure Analysis Services, for analysis purposes.

image

I really like this approach as it makes sense for how data are handled in the data lake. It’s something that I will be advocating and using, and I hope you do too!



Azure Data Lake Store–Storage and Best Practices

The Azure Data Lake Store is an integral component for creating a data lake in Azure as it is where data is physically stored in many implementations of a data lake. Under the hood, the Azure Data Lake Store is the Web implementation of the Hadoop Distributed File System (HDFS). Meaning that files are split up and distributed across an array of cheap storage.

 

What this blog will go into is the physical storage of files in the Azure Data Lake Store and then best practices, which will utilise the framework.

 

Azure Data Lake Store File Storage

As mentioned, the Azure Data Lake Store is the Web implementation of HDFS. Each file you place into the store is split into 250MB chunks called extents. This enables parallel read and write. For availability and reliability, extents are replicated into three copies. As files are split into extents, bigger files have more opportunities for parallelism than smaller files. If you have a file smaller than 250MB it is going to be allocated to one extent and one vertex (which is the work load presented to the Azure Data Lake Analytics), whereas a larger file will be split up across many extents and can be accessed by many vertexes.

 

The format of the file has a huge implication for the storage and parallelisation. Splittable formats – files which are row oriented, such as CSV – are parallelizable as data does not span extents. Non-splittable formats, however, – files what are not row oriented and data is often delivered in blocks, such as XML or JSON – cannot be parallelized as data spans extents and can only be processed by a single vertex.

 

In addition to the storage of unstructured data, Azure Data Lake Store also stores structured data in the form of row-oriented, distributed clustered index storage, which can also be partitioned. The data itself is held within the “Catalog” folder of the data lake store, but the metadata is contained in the data lake analytics. For many, working with the structured data in the data lake is very similar to working with SQL databases.

 

Azure Data Lake Store Best Practices

The best practices generally involve the framework as outlined in the following blog: http://blogs.adatis.co.uk/ustoldfield/post/Shaping-The-Lake-Data-Lake-Framework

The framework allows you to manage and maintain your data lake. So, when setting up your Azure Data Lake Store you will want to initially create the following folders in your Root

image

Raw is where data is landed in directly from source and the underlying structure will be organised ultimately by Source.

image

Source is categorised by Source Type, which reflects the ultimate source of data and the level of trust one should associate with the data.

Within the Source Type, data is further organised by Source System.

image

Within the Source System, the folders are organised by Entity and, if possible, further partitioned using the standard Azure Data Factory Partitioning Pattern of Year > Month > Day etc., as this will allow you to achieve partition elimination using file sets.

image

The folder structure of Enriched and Curated is organised by Destination Data Model. Within each Destination Data Model folder is structured by Destination Entity. Enriched or Curated can either be in the folder structure and / or within the Database.  

image

Shaping The Lake: Data Lake Framework

The Azure Data Lake has just gone into general availability and the management of Azure Data Lake Store, in particular, can seem daunting especially when dealing with big data. In this blog, I will take you through the risks and challenges of working with data lakes and big data. Then I will take you through a framework we’ve created to help best manage these risks and challenges.

If you need a refresh as to what a data lake is and how to create your first Azure Data Lake Store and your first Azure Data Lake Analytics job, please feel free to follow the links.

Risks and Challenges of Big Data and Data Lake

The challenges posed by big data are as follow:

  • Volume – is the sheer amount of data becoming unmanageable?
  • Variety – Structured tables? Semi-structured JSON? Completely unstructured text dumps? We can normally manage with systems that contain just one of these, but if we’re dealing with a huge mix, it gets very tricky
  • Velocity – How fast is the data coming in? And how fast do we need to get it to the people who need it?
  • Veracity - How do we maintain accuracy, veracity, when the data is of varying volumes, the sources and structures are different and the speed in which they arrive in the Lake are of differing velocities?

Managing all four simultaneously is where the challenges begin.

It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft’s sale pitch says exactly this – “Storage is cheap, Store everything!!”. We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it’s no use at all and will confuse anyone trying to make sense of the data. This will essentially create a data swamp, which no one will want to go into. Bad data & poorly managed files erode trust in the lake as a source of information. Dumping is bad.

There is also data drowning – as the volume of the data tends towards the massive and the velocity only increases over time we are going to see more and more information available via the lake. When it gets to that point, if the lake is not well managed, then users are going to struggle to find what they’re after. The data may all be entirely relevant and accurate, but if users cannot find what they need then there is no value in the lake itself. Essentially, data drowning is when the amount of data is so vast you lose the ability to find what’s in there.

If you ignore these challenges, treat the lake like a dumping ground, you will have contaminated your lake and it will no longer be fit for purpose.

If no one uses the Data Lake, it’s a pointless endeavour and not worth maintaining.

Everyone needs to be working together to ensure the lake stays clean, managed and good for a data dive!

Those are the risks and challenges we face with Azure Data Lake. But how do we manage it?

The Framework

Data Lake

 

We’ve carved the lake up into different sections. The key point is that the lake contains all sorts of different data – some that’s sanitised and ready to consume by the business user, some that’s indecipherable raw data that needs careful analysis before it is of use. By ensuring data are carefully managed you can instantly understand the level of preparation that data has undergone.

Data flows from left to right – the further left areas represent where data has been input directly from source systems. The horizontal sections describe the level of preparation – Manual, Stream and Batch.

Manual – aka, the laboratory. Here data is manually prepared with ad-hoc scripts.

Stream – The data here flows in semi-real time, coming from event hubs and being landed after processing through stream-specific tools such as Streaming Analytics. Once landed, there is no further data processing – the lake is essentially a batch processing tool.

Batch – This is more traditional data processing, the kind of “ETL” seen by many BI developers. We have a landing area for our raw data, a transitional area where data is cleaned, validated, enriched and augmented with additional sources and calculation, before finally being placed in a curated area where it is ready for consumption by the business.

We’re taking the blank-canvas of the Data Lake Store and applying a folder structure, a file management process and a curation process over the top.

The folder structure itself can be as detailed as you like, we follow a specific structure ourselves:

Data Lake 2

 

The Raw data area, the landing place for any files entering the lake, has sub-folders for each source of data. This allows for the easy browsing of the data sources within the Lake and ensures we are not receiving the same data twice, even if we use it within different systems.

The Enriched and Curated layers however, have a specific purpose in mind. We don’t take data and enrich/clean/process it without a business driver, it’s not something we do for fun. We can therefore assign a project or system name to it, at this point it is organised into these end-systems. This means we can view the same structure within Enriched as within Curated.

Essentially Raw data is categorised by Source whilst Enriched and Curated data is categorised by Destination.

There’s nothing complicated about the Framework we’ve created or the processes we’ve ascribed to it, but it’s incredibly important that everyone is educated on the intent of it and the general purpose of the data lake. If one user doesn’t follow process when adding data, or an ETL developer doesn’t clean up test files, the system starts to fall apart and we succumb to the challenges we discussed at the start.

To summarise, structure in your Azure Data Lake Store is key to maintaining order:

• You need to enforce and maintain folder structure.

• Remember that structure is necessary whether using unstructured data or tables & SQL

• Bear in mind that schema on read applies temporary structure – but if you don’t know what you’re looking at, this is going to be very hard to do!

Introduction to Data Lakes

Data Lakes are the new hot topic in the big data and BI communities. Data Lakes have been around for a few years now, but have only gained popular notice within the last year. In this blog I will take you through the concept of a Data Lake, so that you can begin your own voyage on the lakes.

What is a Data Lake?

Before we can answer this question, it's worth reflecting on a concept which most of us know and love - Data Warehouses. A Data Warehouse is a form of data architecture. The core principal of a Data Warehouse isn't the database, it's the data architecture which the database and tools implement. Conceptually, the condensed and isolated features of a Data Warehouse are around:

1.     Data acquisition

2.     Data management

3.     Data delivery / access

A Data Lake is similar to a Data Warehouse in these regards. It is an architecture. The technology which underpins a Data Lake enables the architecture of the lake to flow and develop. Conceptually, the architecture of a Data Lake wants to acquire data, it needs careful, yet agile management, and the results of any exploration of the data should be made accessible. The two architectures can be used together, but conceptually the similarities end here.

 

Conceptually, Data Lakes and Data Warehouses are broadly similar yet the approaches are vastly different. So let's leave Data Warehousing here and dive deeper into Data Lakes.

 

Fundamentally, a Data Lake is just not a repository. It is a series of containers which capture, manage and explore any form of raw data at scale, enabled by low cost technologies, from which multiple downstream applications can access valuable insight which was previously inaccessible.

 

How Do Data Lakes Work?

 

Conceptually, a Data Lake is similar to a real lake - water flows in, fills up the reservoir and flows out again. The incoming flow represents multiple raw data formats, ranging from emails, sensor data, spreadsheets, relational data, social media content, etc. The reservoir represents the store of the raw data, where analytics can be run on all or some of the data. The outflow is the analysed data, which is made accessible to users.

 

To break it down, most Data Lake architectures come as two parts. Firstly, there is a large distributed storage engine with very few rules/limitations. This provides a repository for data of any size and shape. It can hold a mixture of relational data structures, semi-structured flat files and completely unstructured data dumps. The fundamental point is that it can store any type of data you may need to analyse. The data is spread across a distributed array of cheap storage that can be accessed independently.

 

There is then a scalable compute layer, designed to take a traditional SQL-style query and break it into small parts that can then be run massively in parallel because of the distributed nature of the disks.

 

In essence – we are overcoming the limitations of traditional querying by:

·       Separating compute so it can scale independently

·       Parallelizing storage to reduce impact of I/O bottlenecks

 

 

There are various technologies and design patterns which form the basis of Data Lakes. In terms of technologies these include:

·        Azure Data Lake

·        Cassandra

·        Hadoop

·        S3

·        Teradata

With regards to design patterns, these will be explored in due course. However, before we get there, there are some challenges which you must be made aware of. These challenges are:

1.     Data dumping - It's very easy to treat a data lake as a dumping ground for anything and everything. This will essentially create a data swamp, which no one will want to go into.

2.     Data drowning - the volume of the data could be massive and the velocity very fast. There is a real risk of drowning by not fully knowing what data you have in your lake.

These challenges require good design and governance, which will be covered off in the near future.

 

Hopefully this has given you a brief, yet comprehensive high-level overview of what data lakes are. We will be focusing on Azure Data Lake, which is a management implementation of the Hadoop architectures. Further reading on Azure Data Lake can be found below.

 

Further Reading

 

In order to know more about Data Lakes the following resources are invaluable.

Getting Started With Azure Data Lake Store

Getting Started With Azure Data Lake Analytics and U-SQL

Azure Data Lake Overview