Adatis

Adatis BI Blogs

Regression Testing a Data Platform with Pester

In a previous post, I gave an overview to regression tests. In this post, I will give a practical example of developing and performing regression tests with the Pester framework for PowerShell. The code for performing regression tests is written in PowerShell using the Pester Framework. The tests are run through Azure DevOps pipelines and are designed to test regression scenarios. The PowerShell scripts, which contain the mechanism for executing tests, rely upon receiving the actual test definitions from a metadata database. The structure of the metadata database will be exactly the same as laid out in the Integration Test post.Regression TestsThe regression tests will need to be designed so that existing functionality isn’t regressed by any changes made to the code. In an analytics system, the functionality is typically going to be aligned to the target schema that’s used for reporting and analysis. If we change the cleaning transformation logic in the source tables which make up our customer dimension, we’ll want to ensure that the customer dimension itself doesn’t change expected outcomes, for example row counts or a specific value. For this example, we’ll be putting in some data into the data lake, run it through the various layers until it ends up in the CURATED layer. Because the majority of the processing is orchestrated using Azure Data Factory V2 (ADF), we only really need to ensure that the pipeline(s) run successfully and some valid data appears in all the layers of the lake, as well as logged into the metadata database.Because we’re deploying some data, we’ve got elements of setup and teardown in the script. Setup and teardown in the metadata database, so that ADF knows what to process. Setup and teardown in the data lake, so that there is data to process.This should give you enough to start using Pester for testing your own Azure data platform implementations.

Regression Testing Overview

In a previous post, I touched on the point of testing and briefly touched on regression testing. In this post, I will be going into more detail about what regression testing is and why it’s important to do it.In the previous post, I said that Regression Tests are intended to:verify that newly developed code into a deployed product does not regress expected results. We’ll still need to go through the process of unit testing and integration testing; but do we want to go through the rigmarole of manual testing to check if a change has changed more than what it was meant to? That’s something that we would like to avoid, so we have regression testing to alleviate that need. Like integration tests, they do need multiple parts of the product available so would need to be executed as part of a Release Pipeline in Azure DevOps. Regression testing is expensive to automate and maintain; and slow to run – but that doesn’t mean that they should be avoided. They add a layer of confidence to a newly changed code base which is about to be deployed. However, because we are testing targeted elements, perhaps the entire solution at once, we don’t want to run all regressions tests all the time because they would take a very long time to complete.Regression TechniquesThere are a variety of methods and techniques that can be used in the design and execution of regression tests. These are:Retest AllTest SelectionTest Case PrioritisationRetest All executes all the documented test cases to check the integrity of the solution. This is the most expensive technique for regression testing as it runs all the test cases, however, it does ensure that there are no errors in the modified code that could be released into Production.Test Selection executes a defined selection of documented test cases to check the integrity of a section of the solution. Less expensive than the Retest All technique, but does introduce an element of risk as the test coverage does not cover the entire solution. Test Case Prioritisation executes tests in priority order, executing higher priority tests over lower priority tests. Regression TestingRegression tests are executed for the various functional slices that exist in a solution. Like most forms of testing, regression tests follow a pattern of:Initialise system under testCall functionality under testAssert expected outcome against result of methodGenerally, regression tests will be executed after deployment as they often require the infrastructure to exist. Generally, regression tests are dependent on data, which must be created at the time of setting up the tests. Most of the time, you can automate regression tests using a unit test framework such as Pester or NUnit.Some best practicesTo get you going I’m going to set out some best practices that you should aim to follow:Adopt a hybrid technique of mixing and matching regression techniques to use what’s best for you at the timeCreate the data needed for the tests before execution, as part of the test setupMultiple asserts per test. You might have dependencies on external resources that you’d like to keep open or you want a fast running set of tests. Multiple asserts help with all of theseChoose unit tests over regression tests when feasibleChoose integration tests over regression tests when feasible

Integration Testing a Data Platform with Pester

In a previous post, I gave an overview to integration tests and documenting integration points. In this post, I will give a practical example of developing and performing integration tests with the Pester framework for PowerShell. With a data platform, especially one hosted in Azure, it’s important to test that the Azure resources in your environment have been deployed and configured correctly. After we’ve done this, we can test the integration points on the platform, confident that all the components have been deployed. The code for performing integration tests is written in PowerShell using the Pester Framework. The tests are run through Azure DevOps pipelines and are designed to test documented integration points. The PowerShell scripts, which contain the mechanism for executing tests, rely upon receiving the actual test definitions from a metadata database. Database StructureThe metadata database should contain a schema called Test, which is a container for all the database objects for running tests using Pester. These objects are:Test.TestCategory - contains what category of test is to be run e.g. Integration TestsTest.TestType - contains the type of tests that need to be run and are associated with a particular type of functionality. In the Pester Framework, Test Type maps to the Describe function. Test.Test - contains the individual tests to be run, with reference to the test type and environment. In the Pester Framework, Test maps to the Context function.Test.Assert contains the individual asserts to be executed against the output from the test run, with reference to the Test and type of assert. In the Pester Framework, Test maps to the It function.How you design the tables, is up to you, but I suggest that the schema looks similar to the above. Test Environment SetupBefore we begin testing all the integration points, we need to be confident that the environment, for which the platform is deployed to, has been created and configured correctly. If it hasn’t, there’s no point in progressing with the actual integration tests as they would fail. For this, we have an initial script to perform these checks. The script executes a stored procedure called Test.ObtainTests, which returns the list of tests to be run. Within a Pester Describe block, the tests are executed. The tests use the Get-AzureRmResource cmdlet and asserts that the name of the deployed resource matches that of the expected resource, as defined in the TestObject.If any of the tests fail in this phase, no further testing should take place. Integration TestsWe’re confident that the environment has been created and configured correctly, so now we’re ready to run the integration tests according to the documented integration points. For this example, we’ll be putting in some data into the RAW layer of the data lake, running it through the various layers until it ends up in the CURATED layer and can be read by Azure SQL DW. Because the majority of the processing is orchestrated using Azure Data Factory V2 (ADF), and the majority of the integration points are within ADF, we only really need to ensure that the pipeline(s) run successfully and some valid data appears in the CURATED layer for SQL DW to consume via PolyBase.Because we’re also deploying some data, we’ve got elements of setup and teardown in the script. Setup and teardown in the metadata database, so that ADF knows what to process. Setup and teardown in the data lake, so that there is data to processTying it all togetherWe’ve got our scripts, but how does it get invoked? This is where the InvokePester script comes in. For anyone not familiar with Pester, this is effectively the orchestrator for your testing scripts. If you deploy the tests to Azure DevOps as part of a release pipeline, you’ll see a similar output to the image below:This should give you enough to start using Pester for testing your own Azure data platform implementations.

Whitelisting Azure IP addresses for SQL Server

In a recent blog post, I wrote about whitelisting Azure Data Centre IP addresses for Key Vault. Key Vault’s firewall uses CIDR notation for IP ranges, which is exactly what is contained within the list of IP addresses supplied by Microsoft. However, there are some resources, like Azure SQL Server, which only accept IP ranges. Therefore, we need a way of converting CIDR to an IP range. Handily, there’s a PowerShell script which exists to provide that conversion – called ipcalc.ps1. When you download it, make sure it’s in the same working folder as the script you’re going to use to create the new firewall rules.From there, we can make slight amends to the script we had in the previous post and produce the following script:If you need to assign the IP ranges to other resources you can substitute the New-AzSqlServerFirewallRule with the appropriate cmdlet and parameters

Whitelisting Azure IP addresses for Key Vault

A colleague came to me with an interesting request: We want to put Key Vault behind a firewall, but when we do that it means that Azure Data Factory can no longer access the secrets. Is there a way to whitelist the IP addresses for a given Azure Data Centre?The short answer is: Yes. By default, the following option is enabled on Azure Key Vault under the Firewalls and virtual networks blade.For most users, having unrestricted access from external networks to a resource that holds secrets, certificates and other sensitive information is a big red flag. If we choose to only allow access from Selected Networks we get the following options opening up for us:Note that trusted Microsoft services is not an extensive list and does not include Azure Data Factory. Therefore we need to whitelist a series of IP Addresses in the firewall rules. The list of IP Addresses are published by Microsoft and are updated on a weekly basis. The IP addresses are published in an XML document, which isn’t always the best format when one needs to update firewalls in Azure. Shredding XMLTo update the Firewall in Azure, we’re going to use PowerShell to shred the XML and extract the IP ranges for a given region. Then, we’re going to use the updated Azure PowerShell module to register the IP ranges against the Key Vault. Using the last command, we can check that the IP ranges have been registered successfully. You should see something like:There we have it, explicit IP whitelisting of Azure Data Centres so we can lock down Azure resources, only opening up access when we need to. UpdateKey Vault is currently limited to 127 firewall rules. If you are adding a region with more than 127 IP ranges, you might have an issue…

Integration Testing Overview

In a previous post, I touched on the point of testing and briefly talked about integration testing. In this post, I will be going into more detail about what integration testing is and why it’s important to do it.In the previous post, I said that Integration Tests are:intended to verify that the units of code and the services used in a product work together. As a result, they are more expensive to automate and maintain than unit tests; and can take considerably longer to run. Whilst unit tests can be run without dependencies of other parts of the product being available, integration tests often require multiple parts of the product – including infrastructure – to be up and running so that the integrations between units and services can be tested. Because integration tests might require infrastructure to be available, and certainly multiple parts of the product available, integration tests are best run as part of a Release Pipeline in Azure DevOps.To expand on this, integration tests are written for each integration point for a solution. But what do we mean by “integration point”? An integration point is typically where two or more units of code interact with each other, or two or more services interact with each other – verifying that the individual parts or components of a solution works as intended together with other parts. How do we define an integration point?Integration PointsWe define an integration point by whiteboarding each component of our solution with the aim to document how they interact with each other. We can highlight the integration point by drawing a circle around it. Consider the following architecture:It’s a fairly typical modern data warehouse solution. We’re ingesting data from a variety of sources and storing it in a data lake. We’re then transforming and processing that data into our warehouse schema before presenting it in a data warehouse; processing it in an analysis services model so that it can be reported on. That’s the architecture, but the components used might be very different and interact differently with the architecture. For the ingestion, our integration points are going to be between the following components:For the transformation piece, our integration points are going to look like:Finally, for processing our data into the semantic model, the integration points look like:As you can see, the integration points do not align perfectly with the architecture – bear in mind that every solution is different, so your integration points will definitely look different even if the broad architecture is the same.Integration TestingWe’ve documented our integration points and now we need to write some integration tests. Integration tests are executed for the various integration points that exist in a solution. Like most forms of testing, integration tests follow a pattern of:Initialise system under testCall functionality under testAssert expected outcome against result of methodGenerally, integration tests will be executed after deployment as they often require the infrastructure to exist. Most of the time, integration tests should not be dependent on data. However, if data does need to exist, this must be created at the time of setting up the tests. Most of the time, you can automate integration tests using a unit test framework such as Pester or NUnit. Some Best PracticesTo get you going, I’m going to set out some best practices that you should aim to follow:Only create integration tests you needDon’t depend on data being available. If you have tests that depend on data – create that data before execution, as part of the test setupMultiple asserts per test. You might have dependencies on external resources that you’d like to keep open or you want a fast running set of tests. Multiple asserts help with all of these. Choose unit tests over integration tests when feasible. Don’t duplicate effort.Further readingMy colleague Ben has written an excellent blog on SQL Integration Testing using NUnit.I’ll add another post soon about how to do Integration Testing using Pester.

Data Source Permissions and On-Premises Data Gateway: SQL Server and Analysis Services

In Microsoft’s documentation surrounding the On-Premises Data Gateway, the advice on permissions for the account used to authenticate the Data Source in the Power BI Service can be concerning for most, especially DBAs. In the Analysis Services section of the documentation, the advice is:The Windows account you enter must have Server Administrator permissions for the instance you are connecting to. If this account’s password is set to expire, users could get a connection error if the password isn’t updated for the data source.Server Administrator permissions…? What happened to the principle of least-privilege? In a practical sense, the On-Premises Data Gateway has to deal with two very different implementations of Analysis Services: Multidimensional and Tabular. Each are setup and configured differently from the other, and the nature of their security models are also different. As a one size fits all approach, it works. As we will soon see, the permissions do not have to be set as Server Admin The SQL section of the documentation, on the other hand, doesn’t actually specify what permissions are required for the Data Source to be established in the Power BI Service. PermissionsExactly what permissions are required for these common data sources, I hear you ask. As data sources are established at a database level, so too are the permissions set.Data SourceMinimum Permissions LevelSQL Server Databasedb_datareaderSSAS Tabular DatabaseProcess database and ReadSSAS Multidimensional DatabaseFull control (Administrator) Principle of least-permissions is now restored. Though there still are the curious incidents of Analysis Services data sources requiring permissions in addition to read. I am unsure, I have my suspicions, and have tried to find out. If you know, please leave a comment below!