Adatis BI Blogs

Testing: What’s the point?

I’m almost certain that every developer has asked themselves this question at least once throughout their careers. You’ve developed your solution, it works fine on your machine and now the deployment into production is being held up because someone mentions the need to do testing. What’s the point of testing? Ultimately, to provide assurance about the quality of a product.With testing, there are two approaches:ManualAutomatedManual testing is what most developers complain about: it’s expensive to setup; laborious to execute; time consuming to repeat; and prone to human error. Manual testing typically takes the form of User Acceptance Tests – and sometimes can be the only tests that are conducted on a product. How confident are we that the product is of high quality if we only do manual testing? Not very. Automated testing is what every developer should be doing: they’re executed by a machine; they’re repeatable; they’re more robust and reliable than manual testing. However, like manual testing, the quality of the test is dependent on how well the test scripts have been written and the test scripts can vary hugely in complexity. The tests could vary from very simple build verification tests through to complex regression tests. Types of TestingAt its most simplest, testing can be build verification and at its most complex, testing can be user acceptance testing. But to get a true feel of how complex they are and how often you should use them, we should refer to a testing tree. As we can see, the wider the segment the more frequently we should employ it and, as we work our way up the pyramid, the more complex the type of testing becomes. For the remainder of this blog post, I’m going to briefly expand on the following types of tests:Build VerificationUnitIntegrationRegressionBuild Verification TestsA build verification test is using a tool like MS Build to answer the question: does my code compile? If it does compile, the test has passed. If it doesn’t compile, then the test has failed. This can be used in the local development environment, through Visual Studio or it can be conducted as a task in a Build Pipeline within Azure DevOps. These types of tests are extremely cheap to automate and maintain; and very quick to run. Unit TestsUnit tests are low level tests, meaning that they are close to the source of the product. They should be written with the aim of testing individual methods and functions for a given code base, using a unit test framework to support the authoring and execution of a test. As a developer, you would typically author the unit tests in a development tool like Visual Studio; you’d run them locally to ensure that the tests pass; and then they would be executed on a regular basis as a task in a Build Pipeline within Azure DevOps. Unit Tests are cheap to automate and should be quick to run.Integration TestsWe know that individual units of code work, due to unit tests, but how can we be sure that those units work together? Integration tests are intended to verify that the units of code and the services used in a product work together. As a result, they are more expensive to automate and maintain than unit tests; and can take considerably longer to run. Whilst unit tests can be run without dependencies of other parts of the product being available, integration tests often require multiple parts of the product – including infrastructure – to be up and running so that the integrations between units and services can be tested. Because integration tests might require infrastructure to be available, and certainly multiple parts of the product available, integration tests are best run as part of a Release Pipeline in Azure DevOps.Regression TestsWe’ve verified that individual elements of the product work; and we’ve verified that the individual elements of the product work together; what happens if we change elements of the product? This is where regression testing comes in – to verify that newly developed code into a deployed product does not regress expected results. We’ll still need to go through the process of unit testing and integration testing; but do we want to go through the rigmarole of manual testing to check if a change has changed more than what it was meant to? That’s something that we would like to avoid, so we have regression testing to alleviate that need. Like integration tests, they do need multiple parts of the product available so would need to be executed as part of a Release Pipeline in Azure DevOps. Regression testing is expensive to automate and maintain; and slow to run – but that doesn’t mean that they should be avoided. The add a layer of confidence to a newly changed code base which is about to be deployed. However, because we are testing targeted elements, perhaps the entire solution at once, we don’t want to run all regressions tests all the time because they would take a very long time to complete. SummaryWe know why we’re do testing; we are aware of some high-level approaches; and we’ve gone through some types of automated tests in brief detail. This post is first in a series on testing, future posts will include:Unit TestingIntegration TestingRegression TestingAs always, do let me know if you have any feedback or questions in the comments section.

Azure Data Factory Custom Activity Development–Part 4: Testing ADF Pipelines

This is the fourth post in a series on Azure Data Factory Custom Activity Development.Pipeline ExecutionExecuting Pipeline Activities basically requires resetting the status of prior Data Slices so that they can re-execute. We can do this in the “Manage and Monitor“ Data Factory dashboard provided by Azure. We can also do this relatively simply using PowerShell. What is missing is really being able to do this from the comfort of your own Visual Studio environment.When writing Unit and Integration Tests it is most beneficial to be able to do this within the IDE, using frameworks such as VS Test or NUnit to assist with our test logic and execution. This follows proven practice with other types of development and allows easy automation of testing for CI/CD activities. So having made the case, how can we easily go about running our Pipeline Data Slices? Well the ADF API isn’t that intimidating, and although there are some caveats to be aware of, it is relatively easy to go about creating a class or two to assist with the task of executing ADF Activities from DotNet test code. We are of course working on the assumption that we are doing our testing in a Data Factory environment that is indeed purposed for testing and can therefore update the Pipeline accordingly.PipelineExecutorThe first task we need is one that will execute our Pipeline Activity Data Slices. It will need to have access to various properties of the Pipeline itself. It will also need various methods and properties to conduct the desired Data Slice operations. Hence our class with a relatively fitting name of PipelineExecutor. The PipelineGetResponse class within the Microsoft.Azure.DataFactory.Models namespace allows us to request a Pipeline object’s details from our Data Factory. We’ll create a private field for this and retrieve this when constructing our PipelineExecutor class.public class PipelineExecutor { #region Private Fields private string dataFactoryName; private DataFactoryManagementClient dfClient; private DateTime maxDateTime = DateTime.MaxValue; private DateTime minDateTime = DateTime.MinValue; private PipelineGetResponse pipelineGR; private string pipelineName; private string resourceGroupName; #endregion Private Fields #region Public Constructors /// <summary> /// Initializes a new instance of the <see cref="pipelineexecutor"> class. /// /// The adf application identifier. /// The adf application secret. /// The domain. /// The login windows prefix. /// The management windows DNS prefix. /// The subscription identifier. /// Name of the resource group. /// Name of the data factory. /// Name of the pipeline. public PipelineExecutor(string adfApplicationId, string adfApplicationSecret, string domain, string loginWindowsPrefix, string managementWindowsDnsPrefix, string subscriptionId, string resourceGroupName, string dataFactoryName, string pipelineName) { this.resourceGroupName = resourceGroupName; this.dataFactoryName = dataFactoryName; this.pipelineName = pipelineName; ClientCredentialProvider ccp = new ClientCredentialProvider(adfApplicationId, adfApplicationSecret); dfClient = new DataFactoryManagementClient(ccp.GetTokenCloudCredentials(domain, loginWindowsPrefix, managementWindowsDnsPrefix, subscriptionId)); LoadPipeline(); } #endregion Public Constructors /// /// Loads the pipeline using a get request. /// /// public void LoadPipeline() { this.pipelineGR = dfClient.Pipelines.Get(resourceGroupName, dataFactoryName, pipelineName); if (this.pipelineGR == null) throw new Exception(string.Format("Pipeline {0} not found in Data Factory {1} within Resource Group {2}", pipelineName, dataFactoryName, resourceGroupName)); }The PipelineGetResponse.Pipeline property is the actual ADF Pipeline object, so we can apply the Decorator pattern to simply reference this and any other relevant properties within our code whenever required. For example, we create a simple readwrite property End, which will be used for getting and setting the Pipeline End, as below:/// /// Gets or sets the pipeline end. /// /// /// The end. /// public DateTime? End { get { return this.pipelineGR.Pipeline.Properties.End; } set { if (this.pipelineGR.Pipeline.Properties.End != value) { this.pipelineGR.Pipeline.Properties.End = value; IsDirty = true; } } }Notice that we are setting an IsDirty property when updating the property. This can then be used to determine whether we need to save any changes to the Pipeline back to the Data Factory when we are actually ready to run our Data Slice(s).We have properties for the major Pipeline attributes, such as a List of Activities and Data Sets, whether the Pipeline is Paused and various others. These will be used within a number of helper methods that allow for easier running of our Activity Data SlicesAs previously mentioned the method for executing an Activity Data Slice is to set the status accordingly. We also need to check that the Pipeline is not currently paused, and if it is, resume it. For these purposes there are a number of methods that do similar things with regard to Data Slices, such as setting the status of the last Data Slice prior to a specified datetime, for all the Data Sets within the Pipeline, as below:/// /// Sets the state of the pipeline activities last data slice prior to the date specified. Updates the pipeline start and end If data slices fall outside of this range. /// /// The date prior to which the last data slice will be updated. /// State of the slice. /// Type of the update. public void SetPipelineActivitiesPriorDataSliceState(DateTime priorTo, string sliceState = "Waiting", string updateType = "UpstreamInPipeline") { Dictionary<string, dataslice> datasetDataSlices = GetPipelineActivitiesLastDataSlice(priorTo); //the earliest start and latest end for the data slices to be reset. DateTime earliestStart = datasetDataSlices.OrderBy(ds => ds.Value.Start).First().Value.Start; DateTime latestEnd = datasetDataSlices.OrderByDescending(ds => ds.Value.End).First().Value.End; //If the pipeline start and end values do not cover this timespan, the pipeline schedule needs to be updated to use the expanded time period, else //the dataslice updates will fail. if (this.Start > earliestStart) { this.Start = earliestStart; } if (this.End < latestEnd) { this.End = latestEnd; } SavePipeline(); foreach (string datasetName in datasetDataSlices.Keys) { DataSlice ds = datasetDataSlices[datasetName]; SetDataSetDataSliceState(datasetName, ds.Start, ds.End, sliceState, updateType); } } /// /// Saves the pipeline. /// public void SavePipeline() { //update the pipeline with the amended properties if (IsDirty) dfClient.Pipelines.CreateOrUpdate(resourceGroupName, dataFactoryName, new PipelineCreateOrUpdateParameters() { Pipeline = this.pipelineGR.Pipeline }); }Pipeline Data Slice Range Issues…The first gotcha you may encounter when resetting Activity Data Slices is that of possibly exceeding the Pipeline Start and/or End times. If you try and run a Pipeline in this state you will get an exception. Hence the need to check our earliest and latest Data Slice range over all our activities and amend the Start and End properties for our Pipeline (wrapped up inside our PipelineExecutor.Start, End properties). Nothing too taxing here and we’re back in the game.For completeness I’ll include the SetDataSetDataSliceState method called above so you can see what is required to actually set the Data Slice statuses using the Data Factory API, via the Microsoft.Azure.Management.DataFactories.DataFactoryManagementClient class./// /// Sets the state of all data set data slices that fall within the date range specified. /// /// <param name=" datasetname"="">Name of the dataset. /// The slice start. /// The slice end. /// State of the slice. /// Type of the update. /// /// public void SetDataSetDataSliceState(string datasetName, DateTime sliceStart, DateTime sliceEnd, string sliceState = "Waiting", string updateType = "UpstreamInPipeline") { DataSliceState sliceStatusResult; if (!Enum.TryParse<DataSliceState&lgt;(sliceState, out sliceStatusResult)) throw new ArgumentException(string.Format("The value {0} for sliceStatus is invalid. Valid values are {1}.", sliceState, string.Join(", ", Enum.GetNames(typeof(DataSliceState))))); DataSliceUpdateType updateTypeResult; if (!Enum.TryParse<DataSliceUpdateType>(updateType, out updateTypeResult)) throw new ArgumentException(string.Format("The value {0} for sliceStatus is invalid. Valid values are {1}.", sliceState, string.Join(", ", Enum.GetNames(typeof(DataSliceUpdateType))))); DataSliceSetStatusParameters dsssParams = new DataSliceSetStatusParameters() { DataSliceRangeStartTime = sliceStart.ConvertToISO8601DateTimeString(), DataSliceRangeEndTime = sliceEnd.ConvertToISO8601DateTimeString(), SliceState = sliceState, UpdateType = updateType }; dfClient.DataSlices.SetStatus(resourceGroupName, dataFactoryName, datasetName, dsssParams); }You’ll see there are a couple of annoyances we need to code for, such as having to parse the sliceState and updateType parameters against some Enums of valid values that we’ve had to create to ensure only permitted values for these, and having to call ConvertToISO8601DateTimeString() on our slice start and end times. No biggie though.Now that we have a class that gives us some ease of running our Pipeline Slices in a flexible manner (I’ve left out other methods for brevity) we can move on to using these within the various test methods of the Test Project classes we will be using.A simple Base Testing ClassIn the case our the project in question, we will be writing a lot of similar Tests that will simply check row counts on Hive destination objects once a Data Factory Pipeline has executed. To make writing these easier, and remembering our DRY principle mentioned back in Part 2 of this series, we can encapsulate the functionality to do this in a base class method, and derive our row counting Test classes from this.public class TestBase { #region Private Fields protected string adfApplicationId = Properties.Settings.Default.ADFApplicationId; protected string adlsAccountName = Properties.Settings.Default.ADLSAccountName; protected string adlsRootDirPath = Properties.Settings.Default.ADLSRootDirPath; protected string domain = Properties.Settings.Default.Domain; protected string loginWindowsPrefix = Properties.Settings.Default.LoginWindowsDnsPrefix; protected string managementWindowsDnsPrefix = Properties.Settings.Default.ManagementWindowsDnsPrefix; protected string clusterName = Properties.Settings.Default.HDIClusterName; protected string clusterUserName = Properties.Settings.Default.HDIClusterUserName; protected string outputSubDir = Properties.Settings.Default.HDIClusterJobOutputSubDir; protected string dataFactory = Properties.Settings.Default.DataFactory; protected string resourceGroup = Properties.Settings.Default.ResourceGroup; protected string subscriptionId = Properties.Settings.Default.SubscriptionId; protected string tenantId = Properties.Settings.Default.TenantId; #endregion Private Fields #region Public Methods //[TestMethod] public void ExecuteRowCountTest(string pipelineName, long expected, string databaseName, string objectName) { string rowCountStatement = "select count(*) as CountAll from {0}.{1};"; //check for sql injection if (databaseName.Contains(";") | objectName.Contains(";")) throw new ArgumentException(string.Format("The parameters submitted contain potentially malicious values. databaseName : {0}, objectName{1}. This may be an attempt at sql injection", databaseName, objectName)); PipelineExecutor plr = new PipelineExecutor(adfApplicationId, adfApplicationSecret, domain, loginWindowsPrefix, managementWindowsDnsPrefix, subscriptionId, resourceGroup, dataFactory, pipelineName); plr.SetPipelineActivitiesPriorDataSliceState(DateTime.Now); plr.AwaitPipelineCompletion().Wait(); string commandText = string.Format(rowCountStatement, databaseName, objectName); IStorageAccess storageAccess = new HiveDataLakeStoreStorageAccess(tenantId, adfApplicationId, adfApplicationSecret, adlsAccountName, adlsRootDirPath); HiveDataLakeStoreJobExecutor executor = new HiveDataLakeStoreJobExecutor(clusterName, clusterUserName, clusterPassword, outputSubDir, storageAccess); long actual = Convert.ToInt64(executor.ExecuteScalar(commandText)); Assert.AreEqual(expected, actual); } #endregion Public Methods }Our test project classes methods for destination row counts can then be simplified to something such as that below:[TestClass] public class MyPipelineName : TestBase { #region Private Fields private string pipelineName = "MyPipelineName"; #endregion Private Fields #region Public Methods //todo centralise the storage of these [TestMethod] public void DestinationRowCountTest() { string databaseName = "UniversalStaging"; string objectName = "extBrandPlanning"; long expected = 10; base.ExecuteRowCountTest(pipelineName, expected, databaseName, objectName); } #endregion Public Methods } Going Further For tests involving dependent objects we can use mocking frameworks such as NSubstitute, JustMock or Moq to create these, and define expected behaviours for the mocked objects, thereby making the writing and asserting of our test conditions all very much in line with proven Test Driven Development (TDD) practices.I won’t go into this here as there are plenty of well grounded resources out there on these subjects. Up Next… In the final instalment of the series we get remove some of the referenced library limitations inherent in the ADF execution environment. Y’all come back soon now for Part 5: Using Cross Application Domains in ADF…