Ben Jarvis' Blog

Building REST APIs on Top of Azure Cosmos DB Using Azure Functions

Azure Functions are fast becoming one of my favourite pieces of technology. They are a perfect fit for an event driven architecture, and their ease of development and deployment, ability to scale on demand and consumption based pricing make them suitable for a number of use cases. Azure Cosmos DB is also an extremely versatile data store that is very easy to get up and running, making it a perfect companion for Azure Functions.

One use case that I haven’t really explored before is using Azure Functions to create REST APIs. The HTTP trigger allows us to trigger a function through an HTTP request, so wouldn’t it be brilliant if we could build a REST API using Azure Functions where we are only charged for each individual request against the API, rather than deploying to an App Service that has an ongoing monthly cost associated with it regardless of usage?

Within Azure Functions, and even ASP.NET Web API, there is always a certain amount of boilerplate code that usually ends up being repeated for each and every method; for example, a typical HTTP triggered Azure Function looks like the below:
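
The sample below is an illustrative sketch rather than the exact code from the original solution (the process-flavoured names are assumptions); it is essentially the standard HTTP trigger template: read the request, deserialize the body, validate it and shape a response before any real logic gets a look in.

using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;

public static class CreateProcessFunction
{
    [FunctionName("CreateProcess")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = "process")] HttpRequest req,
        ILogger log)
    {
        log.LogInformation("CreateProcess received a request");

        // Boilerplate: pull the parameter from the query string or the request body
        string processName = req.Query["processName"];

        string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
        dynamic data = JsonConvert.DeserializeObject(requestBody);
        processName = processName ?? data?.processName;

        // Boilerplate: basic validation before we get anywhere near the real logic
        if (string.IsNullOrEmpty(processName))
            return new BadRequestObjectResult("Please pass a processName in the request body");

        // ... the actual work of creating the process record would go here ...

        return new OkObjectResult($"Process {processName} created");
    }
}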

Wouldn’t it be great if there were a way to remove this boilerplate and declare our APIs using a simple fluent, code-based API? This is where Function Monkey (https://github.com/JamesRandall/FunctionMonkey) from James Randall (https://www.azurefromthetrenches.com/) comes in. We’ll also be using Cosmonaut (https://github.com/Elfocrash/Cosmonaut) from Nick Chapsas (http://chapsas.com/) to simplify our data access against Cosmos DB.

In this example I’d like to create a REST API to perform CRUD operations against a Cosmos DB collection called “process” that will be responsible for keeping track of when a process begins and completes. Our API will need to expose the following endpoints:

  • POST: /process
  • PUT: /process/{id}/succeeded
  • PUT: /process/{id}/failed
  • GET: /process/{id}
  • GET: /process?status={status}

The implementation of the API is described below and all code can be found on my GitHub at https://github.com/benjarvis18/AzureFunctionsRESTApi.

What is Function Monkey?

Function Monkey provides a fluent API around Azure Functions to allow us to easily define functions without being tightly coupled to the Azure Functions runtime itself. The framework allows us to implement a commanding pattern where we define commands (or queries) with handlers and then have a mediator dispatch the commands to their handlers as they are received. The beauty of this is that our handlers, where our logic sits, are decoupled from the underlying trigger, meaning we could easily configure our application so a handler is triggered by a queue rather than an HTTP request, or even take our code out of Azure Functions and put it into a full-blown ASP.NET Web API with minimal changes.

Azure Functions doesn’t currently have dependency injection built in natively (there are some workarounds available) however, Function Monkey has this functionality baked in using the ASP.NET Core DI library meaning we can easily make our code testable. It’s also easy to configure validation using the built-in integration with the popular FluentValidation library.

The Code

Commands

Commands provide the input to our functions so they are a logical place to begin.

Each command implements the ICommand marker interface and defines a set of properties that will be passed in to the functions and provided to our command handlers. They are POCOs so there is no complication within them; the object I have defined as my CreateProcessCommand is given as an example below:
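
A sketch of what such a command might look like; the namespace the ICommand interface lives in and the ProcessName property are assumptions made for the example rather than the exact code in the repo.

using AzureFromTheTrenches.Commanding.Abstractions;

// A plain POCO command - Function Monkey's mediator will populate these
// properties from the HTTP request and hand the command to its handler
public class CreateProcessCommand : ICommand
{
    public string ProcessName { get; set; }
}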

Handlers

Our command handlers are where the magic happens! Their responsibility is to take the command that is passed into the function and process it.

Each handler implements the ICommandHandler<T> interface with T being the type of the command that the handler is handling. An example of the handler for the CreateProcessCommand is shown below:
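
The sketch below is reconstructed from the description that follows rather than copied from the repo; the Process entity and its properties are assumptions.

using System;
using System.Threading.Tasks;
using AzureFromTheTrenches.Commanding.Abstractions;
using Cosmonaut;

// Illustrative entity - the real Process document will have whatever
// properties the API needs to track a running process
public class Process
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string Status { get; set; }
    public DateTime StartedAt { get; set; }
    public DateTime? CompletedAt { get; set; }
}

public class CreateProcessCommandHandler : ICommandHandler<CreateProcessCommand>
{
    private readonly ICosmosStore<Process> _processStore;

    // Cosmonaut's ICosmosStore<Process> is resolved by the DI container
    public CreateProcessCommandHandler(ICosmosStore<Process> processStore)
    {
        _processStore = processStore;
    }

    public async Task ExecuteAsync(CreateProcessCommand command)
    {
        // Create our Process entity and add it to the Cosmos DB collection
        var process = new Process
        {
            Id = Guid.NewGuid().ToString(),
            Name = command.ProcessName,
            Status = "in-progress",
            StartedAt = DateTime.UtcNow
        };

        await _processStore.AddAsync(process);
    }
}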

The handler supports dependency injection so in our case the constructor accepts an instance of ICosmosStore<Process> which is our Cosmos DB “repository” that is provided by Cosmonaut to allow us to perform operations against Cosmos DB.

The command itself is processed in the ExecuteAsync method; in this case we are creating an instance of our Process entity and adding it to Cosmos DB using the cosmos store provided by Cosmonaut.

Bringing It All Together - Configuration

All of the configuration of Function Monkey is done through a class that implements the IFunctionAppConfiguration interface; this class is used by the Function Monkey compiler to generate the required C# assembly that will contain our functions. My FunctionAppConfiguration class is below:
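
A sketch along these lines (the command names for the succeeded, failed and status endpoints are illustrative, and the exact namespaces and fluent method names may vary slightly between Function Monkey versions):

using System.Net.Http;
using FunctionMonkey.Abstractions;
using FunctionMonkey.Abstractions.Builders;

public class FunctionAppConfiguration : IFunctionAppConfiguration
{
    public void Build(IFunctionHostBuilder builder)
    {
        builder
            .Setup((serviceCollection, commandRegistry) =>
            {
                // Register dependencies (e.g. the Cosmonaut ICosmosStore<Process>)
                // and discover all of the command handlers in this assembly
                commandRegistry.Discover<FunctionAppConfiguration>();
            })
            .OpenApiEndpoint(openApi => openApi
                .Title("Process API")
                .Version("v1")
                .UserInterface())
            .Functions(functions => functions
                .HttpRoute("/api/v1/process", route => route
                    .HttpFunction<CreateProcessCommand>(HttpMethod.Post)
                    .HttpFunction<MarkProcessSucceededCommand>("/{id}/succeeded", HttpMethod.Put)
                    .HttpFunction<MarkProcessFailedCommand>("/{id}/failed", HttpMethod.Put)
                    .HttpFunction<GetProcessByIdQuery>("/{id}", HttpMethod.Get)
                    .HttpFunction<GetProcessesByStatusQuery>(HttpMethod.Get)));
    }
}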

The IFunctionHostBuilder that is passed into our configuration class exposes a fluent API that allows us to define our functions.

The first step is to call the Setup method which allows us to configure our dependencies for dependency injection using the provided service collection and register our command handlers; in my case I am using the Discover method provided by the command registry to locate and register any command handlers in my assembly.

The next method (OpenApiEndpoint) configures the Open API definition for our API and configures the Swagger user interface that will allow us to explore and use our API later on.

The final step is to define the configuration for our functions. This blog post is only covering HTTP functions however, other triggers are described in the documentation at https://functionmonkey.azurefromthetrenches.com.

The API we are building in this example only covers a single resource so we are defining a single route called “/api/v1/process”; within this route we can define multiple functions. Each function is linked to a command and we can specify the HTTP method the function will represent (GET, POST, PUT etc) and a custom route for the function. We can also specify route parameters, which are useful when we want to pass through an ID or other parameter (see the function definition for GetProcessByIdQuery as an example).

As you can see, the configuration is extremely simple with minimal code and no complicated JSON configuration files to deal with. The fluent API makes things clear and easy to understand.

Observations

One thing I came across that’s worth noting is that the class that implements IFunctionAppConfiguration isn’t suitable for reading any configuration that needs to be resolved at runtime, as the class is called at build time.

As an example, when configuring Cosmonaut I originally called Environment.GetEnvironmentVariable directly within my FunctionAppConfiguration class to get the values for the Cosmos DB database name, endpoint and auth key. However, when I tried to debug the application the Function Monkey compiler failed because GetEnvironmentVariable returned null. This isn’t the behaviour I wanted; I wanted the values to be retrieved at runtime, similar to if they were defined in the ASP.NET Startup class.

To get around this I created a static CosmosStoreFactory class with a method to retrieve an instance of CosmosStore, using Environment.GetEnvironmentVariable to get the configuration values.
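
A sketch of the factory; the setting names and the Cosmonaut CosmosStoreSettings usage here are illustrative assumptions:

using System;
using Cosmonaut;

public static class CosmosStoreFactory
{
    public static ICosmosStore<Process> GetCosmosStore()
    {
        // These settings come from local.settings.json locally and from the
        // function app's application settings once deployed, so they are only
        // available when the app is actually running
        var databaseName = Environment.GetEnvironmentVariable("CosmosDatabaseName");
        var endpoint = Environment.GetEnvironmentVariable("CosmosEndpoint");
        var authKey = Environment.GetEnvironmentVariable("CosmosAuthKey");

        return new CosmosStore<Process>(new CosmosStoreSettings(databaseName, endpoint, authKey));
    }
}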

I then registered CosmosStore in the DI container like below, using CosmosStoreFactory to instantiate it:
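
Something like the below, inside the Setup call of the Build method shown earlier (the generic registration is an assumption; the important part is that the factory delegate defers the GetCosmosStore call):

.Setup((serviceCollection, commandRegistry) =>
{
    // The factory delegate isn't invoked until the store is first resolved,
    // which happens at runtime once the function app's settings are available
    serviceCollection.AddSingleton<ICosmosStore<Process>>(provider => CosmosStoreFactory.GetCosmosStore());

    commandRegistry.Discover<FunctionAppConfiguration>();
})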

GetCosmosStore is only called at runtime so the function configuration has been loaded at that point meaning the correct connection details are retrieved.

Running Our Application

To run my functions locally I can simply hit F5 and Function Monkey will build my functions and the functions emulator will start.

We configured an Open API endpoint for our function app so if I navigate to http://localhost:7071/openapi/ I get access to Swagger UI where I can see the endpoints that have been defined and also try calling some of the endpoints.

image

Now, if I want to try out the POST method I can select it and click “Try it Out”, from there I can fill in the details for my request like below:

image

Once I click the “Execute” button a request will be fired off to the server and I can see that a 200 (OK) HTTP response is returned meaning the operation was successful:

image

Now, if I go to the document explorer in my Cosmos DB emulator at https://localhost:8081/_explorer/index.html I can see that the following document has been created:

We can also call our API outside of Swagger UI e.g. if I go to Chrome and navigate to http://localhost:7071/api/v1/process?status=in-progress I get the following response:

As you can see, the code involved in creating our API is extremely minimal when compared to creating an API in Azure Functions without Function Monkey. The local development / debugging experience is great and it would be extremely easy to get this solution deployed to Azure, with the only difference being that we would need to pass an authentication key when calling our API endpoints.

Conclusion

To conclude, Azure Functions are an excellent choice when developing a REST API if you want a service that is easy to scale and cheap to run. Combined with Cosmos DB as a data store it’s extremely easy to have something simple working in a CRUD scenario within a matter of hours. Function Monkey makes the whole process painless and allows you to decouple your commands and handlers from the underlying triggers making them easy to test and giving more flexibility if commands need to be processed differently as the application grows.

As always, if there are any questions then please comment below or contact me on Twitter via @BenJarvisBI.

SQL Server Database Integration Testing Using NUnit

On a current project I have a set of Azure Functions that process files and write various pieces of audit information to an Azure SQL Database to be used for monitoring and further processing. The C# application implements the repository pattern to encapsulate the data access logic and uses Dapper (https://github.com/StackExchange/Dapper) as a micro ORM to simplify the code and execute the stored procedures that perform the CRUD operations against the database.

I wanted a way to automate the integration testing of my repositories and stored procedures so I developed the solution described below using NUnit as the test framework and SQL Server LocalDB as the database to run my tests against.

I had the following requirements for my solution which NUnit has been able to satisfy:

  • Quick – tests should run quickly and not require massive amounts of set up / tear down
  • Independent – all tests should be independent from one another and responsible for their own set up / tear down
  • Simple – the test code should be simple to understand and easy to work with when writing new tests.
  • Work Everywhere – the tests should be able to work anywhere and not require huge dependencies like a full SQL Server instance, they should be able to work with SQL LocalDB

The full solution can be found on my GitHub at https://github.com/benjarvis18/NUnitSQLIntegrationTesting.

Solution Layout

The solution has the following projects:

  • NUnitSQLIntegrationTesting.Database – SSDT database project that includes tables, stored procedures and reference data.
  • NUnitSQLIntegrationTesting.Core – class library that includes the repositories that are used by the C# code to perform data access.
  • NUnitSQLIntegrationTesting.IntegrationTests – NUnit test project

Repository

The test database has two tables: dbo.Customer and dbo.CustomerType. CustomerType is a lookup table so the data is static and maintained within our SSDT project; Customer is the table that we want to perform read and write operations against from within our C# application, so the NUnitSQLIntegrationTesting.Core project has a CustomerRepository class within it to call our stored procedures and map any returned objects to our domain model.

The code in our repository class is included below:
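
A sketch of the repository is below; the stored procedure names, parameter names and the GetConnection method on ISqlServerConnectionFactory are assumptions based on the descriptions that follow, and only two of the four methods are shown.

using System.Data;
using System.Linq;
using System.Threading.Tasks;
using Dapper;

public class CustomerRepository
{
    private readonly ISqlServerConnectionFactory _connectionFactory;

    public CustomerRepository(ISqlServerConnectionFactory connectionFactory)
    {
        _connectionFactory = connectionFactory;
    }

    public async Task<Customer> GetCustomerAsync(int customerId)
    {
        using (var connection = _connectionFactory.GetConnection())
        {
            // Dapper multi-mapping populates the nested CustomerType object
            var customers = await connection.QueryAsync<Customer, CustomerType, Customer>(
                "dbo.GetCustomer",
                (customer, customerType) =>
                {
                    customer.CustomerType = customerType;
                    return customer;
                },
                new { CustomerId = customerId },
                splitOn: "CustomerTypeId",
                commandType: CommandType.StoredProcedure);

            return customers.SingleOrDefault();
        }
    }

    public async Task SetCustomerEmailAsync(int customerId, string emailAddress)
    {
        using (var connection = _connectionFactory.GetConnection())
        {
            await connection.ExecuteAsync(
                "dbo.SetCustomerEmail",
                new { CustomerId = customerId, EmailAddress = emailAddress },
                commandType: CommandType.StoredProcedure);
        }
    }

    // CreateCustomerAsync and GetCustomersByTypeAsync follow the same pattern
}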

As above, the repository has the following methods:

  • CreateCustomerAsync – inserts a new row into dbo.Customer using the supplied Customer object
  • GetCustomerAsync – returns a customer object for the supplied CustomerId
  • SetCustomerEmailAsync – updates the EmailAddress column in dbo.Customer for the supplied CustomerId
  • GetCustomersByTypeAsync – returns a collection of customer objects that have the specified CustomerTypeId

The code in the repository is pretty simple with the only complication being some additional code required to map the nested CustomerType object when retrieving a Customer object.

One key element of our repository implementation is that all SQL connections are created using the ISqlServerConnectionFactory implementation that is injected into our repository. In our test scripts we are able to mock this interface so it instead provides a connection to our test database.

Test Infrastructure

The following elements can be found in the NUnitSQLIntegrationTesting.IntegrationTests project.

Helpers/TestDatabase.cs

TestDatabase represents an instance of our test database and provides methods for creating and dropping the database.

The initialise method (below) makes use of the APIs provided in the Microsoft.SqlServer.DacFx package to take the dacpac for the database that sits within our solution and deploy it to our SQL LocalDB instance.
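
In outline it looks something like the below; the dacpac path, connection string and database name are illustrative rather than the exact values in the repo:

using Microsoft.SqlServer.Dac;

public class TestDatabase
{
    private readonly string _databaseName = "NUnitSQLIntegrationTesting_Test";

    public void Initialise()
    {
        // Load the dacpac built by the SSDT project in the solution
        var dacPackage = DacPackage.Load(
            @"..\..\..\NUnitSQLIntegrationTesting.Database\bin\Debug\NUnitSQLIntegrationTesting.Database.dacpac");

        // Deploy it to SQL LocalDB, creating or upgrading the test database
        var dacServices = new DacServices(@"Data Source=(localdb)\MSSQLLocalDB;Integrated Security=True");
        dacServices.Deploy(dacPackage, _databaseName, upgradeExisting: true);
    }

    // Drop, RunSetUpScript, RunCleanUpScript and GetSqlConnection are described below
}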

We then have some other methods that are explained below:

  • Drop – drops the database
  • RunSetUpScript – executes the provided SQL against the test database (used for setting up data to be used within a test).
  • RunCleanUpScript – executes a script to delete test data from each table to clean up the database in preparation for running the next test. Currently this is hardcoded but the DatabaseDeleter script from Jimmy Bogard (https://gist.github.com/jbogard/5805783) could easily be adapted to work here.
  • GetSqlConnection – creates a new SQL connection for use in our test scripts.

Data/DatabaseTestConfig.cs

This class is implemented as an NUnit SetUpFixture with a SetUp method and a TearDown method, each being decorated with the NUnit OneTimeSetUp and OneTimeTearDown attributes respectively. This means that the SetUp method is called once at the beginning of a test run and the TearDown method is called once at the end of a test run.

This class provides a singleton instance of our TestDatabase meaning each test is connecting to the same database which prevents us having to deploy a new instance of the test database for each test. If we did deploy an instance of our test database for each test the run time of our tests would increase significantly.
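
Something along these lines (a sketch; the TestDatabase member names match the descriptions above):

using NUnit.Framework;

[SetUpFixture]
public class DatabaseTestConfig
{
    // Singleton instance of the test database shared by every test in the run
    public static TestDatabase TestDatabase { get; private set; }

    [OneTimeSetUp]
    public void SetUp()
    {
        // Deploy a single instance of the test database at the start of the run
        TestDatabase = new TestDatabase();
        TestDatabase.Initialise();
    }

    [OneTimeTearDown]
    public void TearDown()
    {
        // Drop the database once all of the tests have finished
        TestDatabase.Drop();
    }
}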

Helpers/Assert.cs

This class contains our SQL specific assertions and is responsible for comparing our actual and expected results.

An example of one of the assertions is below; this method takes an expected result as a dynamic object and retrieves the actual result by querying the SQL database.
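
A sketch of the idea (the method signature is an assumption based on how the assertion is used later in the post):

using System.Data.SqlClient;
using Dapper;
using Dynamitey;

namespace NUnitSQLIntegrationTesting.IntegrationTests.Helpers
{
    public static class Assert
    {
        public static void SqlResultMatches(string sql, dynamic expected)
        {
            using (SqlConnection connection = DatabaseTestConfig.TestDatabase.GetSqlConnection())
            {
                // Query the test database for the actual result (a single dynamic row)
                var actual = connection.QuerySingle(sql);

                // Compare each property on the expected object against the actual row
                foreach (string propertyName in Dynamic.GetMemberNames(expected))
                {
                    var expectedValue = Dynamic.InvokeGet(expected, propertyName);
                    var actualValue = Dynamic.InvokeGet(actual, propertyName);

                    NUnit.Framework.Assert.AreEqual(expectedValue, actualValue,
                        $"Value of '{propertyName}' did not match the expected value");
                }
            }
        }
    }
}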

The method uses a library called Dynamitey (https://github.com/ekonbenefits/dynamitey) to retrieve the list of properties from the objects and then iterates through each property comparing the actual value with the expected value.

Data/DatabaseTestBase.cs

This class is the base class that all of our integration tests inherit from.
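
A sketch of the base class; the member names are based on the description below rather than the exact code in the repo:

using System.Threading;
using AutoFixture;
using AutoFixture.AutoMoq;
using NUnit.Framework;

public abstract class DatabaseTestBase
{
    // Shared across every test class so that only one test touches the database at a time
    private static readonly SemaphoreSlim Semaphore = new SemaphoreSlim(1, 1);

    [SetUp]
    public void SetUp()
    {
        // Block until no other test is running
        Semaphore.Wait();
    }

    [TearDown]
    public void TearDown()
    {
        // Remove any data the test left behind, then let the next test run
        DatabaseTestConfig.TestDatabase.RunCleanUpScript();
        Semaphore.Release();
    }

    protected IFixture GetFixture()
    {
        // AutoFixture plus AutoMoq gives us a fixture that can build our repositories
        // with mocked dependencies such as ISqlServerConnectionFactory
        return new Fixture().Customize(new AutoMoqCustomization());
    }
}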

This class again provides SetUp and TearDown methods that are decorated with NUnit SetUp and TearDown attributes however, the difference this time is that these methods will be called at the beginning and end of each test.

One of my requirements was that each test should be able to run independently of the others with no dependencies between tests; to allow this I need to run each test in serial so only a single test is using my database at any one time. To implement this I have used a SemaphoreSlim (see https://docs.microsoft.com/en-us/dotnet/standard/threading/semaphore-and-semaphoreslim) with a maximum request count of 1. The SetUp method calls the Wait method on the semaphore, which blocks the thread if another thread is already executing and only allows the thread to continue once the TearDown method has been called by the other thread to release the semaphore. Each test class shares the same instance of the semaphore meaning our tests run in serial; this negates the need for using ordered tests or another method that adds further complication. The TearDown method also executes any clean up scripts against our test database to ensure no test data is left behind.

This class also provides a method called GetFixture which makes use of a library called AutoFixture and its integration with Moq (see this article from The TrainLine https://engineering.thetrainline.com/an-introduction-to-xunit-moq-and-autofixture-995315f656f for more information on how this is useful) to provide a fixture that can be used to get an instance of our repository in our tests.

Tests

Now we’ve run through the infrastructure required by our tests we can look at the actual tests themselves.

The CustomerRepositoryTests class found in the Data folder of our NUnitSQLIntegrationTesting.IntegrationTests project contains the tests for our CustomerRepository.

I won’t look at each test in this post as they all follow the same format however, the SetCustomerEmailAddress_WhenGivenAValidCustomerIdAndEmail_DatabaseRowIsUpdated test provides a good example as it includes some set up.
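
The sketch below captures the structure described in the following paragraphs; the SQL column names in the set up script and the mock configuration are assumptions rather than the exact test code.

[Test]
public async Task SetCustomerEmailAddress_WhenGivenAValidCustomerIdAndEmail_DatabaseRowIsUpdated()
{
    // Arrange - build the repository via AutoFixture, with the mocked connection
    // factory handing out connections to the test database
    var fixture = GetFixture();
    fixture.Freeze<Mock<ISqlServerConnectionFactory>>()
        .Setup(f => f.GetConnection())
        .Returns(() => DatabaseTestConfig.TestDatabase.GetSqlConnection());

    var repository = fixture.Create<CustomerRepository>();

    // Two rows: one we expect to be updated and one that should be left alone
    DatabaseTestConfig.TestDatabase.RunSetUpScript(@"
        INSERT INTO dbo.Customer (CustomerId, CustomerTypeId, EmailAddress) VALUES (1, 1, 'original.email@testing.com');
        INSERT INTO dbo.Customer (CustomerId, CustomerTypeId, EmailAddress) VALUES (2, 1, 'another.email@testing.com');");

    // Act
    await repository.SetCustomerEmailAsync(1, "test.new.email@testing.com");

    // Assert - the row holding the new email address should be customer 1
    Helpers.Assert.SqlResultMatches(
        "SELECT CustomerId FROM dbo.Customer WHERE EmailAddress = 'test.new.email@testing.com'",
        new { CustomerId = 1 });
}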

As above, we first call GetFixture to get our AutoFixture Fixture that we then use to create a new instance of our CustomerRepository, injecting our mocked instance of the ISqlServerConnectionFactory along the way.

We then execute the RunSetUpScript method to generate some test data in our database. I’m adding two rows in this case: one that I want to be updated and another that I don’t, so I can confirm that my stored procedure is only updating the row I want it to. Once the database is set up we can call the SetCustomerEmailAsync method on our repository to set the email for customer id 1 to test.new.email@testing.com.

At this point the database row should be updated so we now call Helpers.Assert.SqlResultMatches to verify that the customer id for the customer with the email address test.new.email@testing.com is 1.

Our stored procedure is included below so the code should do what we expect and set the email address to test.new.email@testing.com for customer id 1.

As below, when we run our tests in Visual Studio they all pass:

image

Now, if we change our stored procedure slightly to update the email address where the customer id is the one passed in as a parameter + 1, and run our test again we get a different result:

image

If we look at the details we can see the expected customer id was 1 but what was actually returned is 2:

image

Now we can easily see if a change we’ve made to our stored procedures or our repository code has caused any regression in functionality.

Conclusion

As seen above, it is extremely easy to use NUnit to integration test C# repositories and SQL code. The solution now gives us peace of mind when making changes to the repository code or SQL stored procedures and means that any bugs can be picked up at the development stage rather than further along the SDLC or after the code is deployed to production.

In this example I am testing C# repositories that are calling the database however, it would be extremely easy to apply this to a typical BI scenario where you need to test stored procedures on their own.

As always, if there are any questions please comment below or contact me on Twitter via @BenJarvisBI.

Cosmos DB Bulk Executor Library – Performance Comparison

One of the many exciting announcements made at MSBuild recently was the release of the new Cosmos DB Bulk Executor library that offers massive performance improvements when loading large amounts of data to Cosmos DB (see https://docs.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview for details on how it works). A project I’ve worked on involved copying large amounts of data to Cosmos DB using ADF and we observed that the current Cosmos DB connector doesn’t always make full use of the provisioned RU/s, so I am keen to see what the new library can offer and whether our clients can take advantage of these improvements.

In this post I will be doing a comparison between the performance of the Cosmos DB connector in ADF V1, ADF V2 and an app written in C# using the Bulk Executor library. As mentioned in the Microsoft announcement, the new library has already been integrated into a new version of the Cosmos DB connector for ADF V2 so the tests using ADF V2 are also using the Bulk Executor library.

All tests involved copying 1 million rows of data from an Azure SQL DB to a single Cosmos DB collection; the scaling settings used for the resources involved are:

  • Cosmos DB Collection – partitioned by id, 50,000 RU/s
  • Azure SQL DB – S2 (50 DTUs)

Each test was repeated 3 times to enable an average time taken to be calculated.

The document inserted into the database looked like the below:

Test 1 – ADF V1

To set up the ADF V1 test I used the Copy Data Wizard to generate a pipeline that would copy data from my Azure SQL DB to Cosmos DB. I then executed the pipeline 3 times, recreating the Cosmos DB collection each time.

Each test behaved in a similar way, the process slowly ramped up to 100,000 requests per minute and sustained that throughput until completion. The tests only consumed around 3,000 RU/s out of the 10,000 RU/s provisioned to each partition in the collection (each collection was created with 5 partitions).

image

The results of the test were:

  • Run 1 – 677 seconds
  • Run 2 – 641 seconds
  • Run 3 – 513 seconds
  • Average – 610 seconds

The performance increased after each run with run 3 taking 513 seconds, nearly 3 minutes quicker than the first test. I can’t explain the differences in time taken however, it seems that ADF progressively ramped up the throughput quicker after each run so it may be down to scaling within the ADF service itself.

Test 2 – ADF V2

To set up the ADF V2 test I again used the Copy Data Wizard to generate the pipeline. I then executed the pipeline 3 times, recreating the Cosmos DB collection each time.

ADF V2 represents a massive improvement over V1, cutting the average copy time by roughly 75% (159 seconds versus 610 seconds). The connector used more than the provisioned throughput meaning some throttling occurred however, this means that the collection could have been scaled even further to obtain higher performance.

Interestingly the ADF V2 requests didn’t appear on the number of requests metric shown in the Azure Portal so I’m unable to look at how many requests ADF V2 was able to sustain. I’m unsure of the reason for this however, it could be something like ADF using the direct connection mode to Cosmos DB rather than connecting through the gateway meaning the requests aren’t counted.

The results of the test were:

  • Run 1 – 163 seconds
  • Run 2 – 158 seconds
  • Run 3 – 156 seconds
  • Average – 159 seconds

The performance of ADF V2 was more consistent than V1 and remained reasonably steady across all tests.

Test 3 – C# w/ Bulk Executor Library

To set up the C# test I wrote a quick C# console application that uses the Bulk Executor library. The application was running on my local machine rather than within Azure so there will obviously be a performance hit from the additional network latency. The source code for the application can be found at https://github.com/benjarvis18/CosmosDBBulkExecutorTest.
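
A simplified sketch of how the library is used is below; it isn’t the exact code from the repo and the reading and shaping of the source documents is omitted.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.CosmosDB.BulkExecutor;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public class BulkImporter
{
    public async Task ImportAsync(DocumentClient client, DocumentCollection collection, IEnumerable<object> documents)
    {
        // Hand throttling retries over to the bulk executor rather than the client
        client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 0;
        client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 0;

        var bulkExecutor = new BulkExecutor(client, collection);
        await bulkExecutor.InitializeAsync();

        // Import the whole batch; the library works to saturate the provisioned RU/s
        var response = await bulkExecutor.BulkImportAsync(
            documents,
            enableUpsert: true,
            disableAutomaticIdGeneration: true);

        Console.WriteLine($"Imported {response.NumberOfDocumentsImported} documents " +
                          $"in {response.TotalTimeTaken.TotalSeconds:N0} seconds, " +
                          $"consuming {response.TotalRequestUnitsConsumed:N0} RUs");
    }
}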

The results of the test were:

  • Run 1 – 240 seconds
  • Run 2 – 356 seconds
  • Run 3 – 352 seconds
  • Average – 316 seconds

The performance of the C# application is less consistent however, this is probably due to the network from my local machine to Azure. My application is also not very scalable as it is loading the whole dataset into memory rather than streaming it, as would be required with a larger dataset. The actual code itself is probably also not as optimised as it could be.

Overall however, my C# application still completed the copy in roughly half the time taken by ADF V1.

Comparison

The final results of the tests are below:

image

As above, ADF V2 is the clear winner in the test, cutting the copy time by roughly 75% when compared to ADF V1. This represents a huge performance gain and could provide some significant cost savings for users that are loading large amounts of data into Cosmos DB. My C# application completed in roughly half the time taken by ADF V1, running outside of Azure without any optimisation, so the performance benefits of the new library are significant.

The new library is an exciting development for Cosmos DB and allows us to fully utilise the capabilities it has to offer when dealing with large amounts of data. I look forward to making use of these benefits in projects, especially the significant improvements in ADF V2!

As always, if you have any questions or comments please let me know.

ADF V2 Issue With File Extension After Decompressing Files

On a current client project we are taking files from an on-prem file server and uploading them to Azure Blob Storage using ADF V2. The files are compressed on-prem using GZip compression and need to be decompressed before they are placed in blob storage where some other processes will pick them up.

ADF V2 natively supports decompression of files as documented at https://docs.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support. With this functionality ADF should change the extension of the file when it is decompressed so 1234_567.csv.gz would become 1234_567.csv however, I’ve noticed that this doesn’t happen in all cases.

In our particular case the file names and extensions of the source files are all uppercase and when ADF uploads them it doesn’t alter the file extension e.g. if I upload 1234_567.CSV.GZ I get 1234_567.CSV.GZ in blob storage rather than 1234_567.CSV.

If I upload 1234_567.csv.gz the functionality works correctly and I get 1234_567.csv in blob storage. This means that the file extension replacement is case sensitive when it should be case insensitive.

This bug isn’t a major issue for us as the file is decompressed and we can change the extension when we process the file further however, it’s something that stumped me for a while.

I’ve raised a bug at https://feedback.azure.com/forums/270578-data-factory/suggestions/34012312--bug-file-name-isn-t-changed-when-decompressing-f to get this fixed so please vote and I’ll update the post once the issue has been resolved.

Uploading Files from On-Prem File System to Azure Blob Storage Using ADF V2

Introduction

Recently we had a client requirement whereby we needed to upload some files from an on-prem file server to Azure Blob Storage so they could be processed further. The file system connector in ADF V2 with a self-hosted integration runtime was the perfect solution for this, so this blog post will discuss the process of getting a basic setup working to test the functionality.

Test Data

The first task is to generate some data to upload. I did this by executing the below PowerShell script to create 100 files, with each being 10 MB in size.

$numberOfFiles = 100
$directory = "C:\Temp\Files"
$fileSize = 1000000 * 10 # 10 MB

Remove-Item $directory -Recurse
New-Item $directory -ItemType Directory

for ($i = 1; $i -le $numberOfFiles; $i++) 
{
    fsutil file createnew "$directory\Test_$i.txt" $fileSize
}

Now we’ve got some files to upload it’s time to create our resources in Azure.

Azure Configuration

The following PowerShell script uses the Azure RM PowerShell cmdlets to create our resources. This script will provision the following resources in our Azure subscription:

  • Resource Group
  • ADF V2 Instance
  • Storage Account

$location = "West Europe"
$resouceGroupName = "ADFV2FileUploadResourceGroup"
$dataFactoryName = "ADFV2FileUpload"
$storageAccountName = "adfv2fileupload"

# Login to Azure
$credential = Get-Credential
Login-AzureRmAccount -Credential $credential

# Create Resource Group
New-AzureRmResourceGroup -Name $resouceGroupName -Location $location

# Create ADF V2
$setAzureRmDataFactoryV2Splat = @{
    Name = $dataFactoryName
    ResourceGroupName = $resouceGroupName
    Location = $location
}
Set-AzureRmDataFactoryV2 @setAzureRmDataFactoryV2Splat 

# Create Storage Account
$newAzureRmStorageAccountSplat = @{
    Kind = "Storage"
    Name = $storageAccountName
    ResourceGroupName = $resouceGroupName
    Location = $location
    SkuName = "Standard_LRS"
}
New-AzureRmStorageAccount @newAzureRmStorageAccountSplat

Our Azure Portal should now show the following:

image

Now we’ve got our resources provisioned it’s time to configure ADF to access our on-prem data.

ADF Configuration

Self-Hosted Integration Runtime

In ADF V2 the integration runtime is responsible for providing the compute infrastructure that carries out data movement between data stores. A self-hosted integration runtime is an on-premises version of the integration runtime that is able to perform copy activities to and from on-premises data stores.

When we configure a self-hosted integration runtime, the data factory service that sits in Azure orchestrates the nodes that make up the integration runtime through the use of Azure Service Bus. This means our nodes that are hosted on-prem perform all of our data movement and connect to our on-premises data sources while being triggered by our data factory pipelines that are hosted in the cloud. A self-hosted integration runtime can have multiple nodes associated with it, which not only caters for high availability but also gives an additional performance benefit as ADF will use all of the available nodes to perform processing.

To create our self-hosted integration runtime we need to use the following PowerShell script:

$credential = Get-Credential
Login-AzureRmAccount -Credential $credential

$resouceGroupName = "ADFV2FileUploadResourceGroup"
$dataFactoryName = "ADFV2FileUpload"
$integrationRuntimeName = "SelfHostedIntegrationRuntime"

# Create integration runtime
$setAzureRmDataFactoryV2IntegrationRuntimeSplat = @{
    Type = 'SelfHosted'
    Description = "Integration Runtime to Access On-Premise Data"
    DataFactoryName = $dataFactoryName
    ResourceGroupName = $resouceGroupName
    Name = $integrationRuntimeName
}
Set-AzureRmDataFactoryV2IntegrationRuntime @setAzureRmDataFactoryV2IntegrationRuntimeSplat

# Get the key that is required for our on-prem app
$getAzureRmDataFactoryV2IntegrationRuntimeKeySplat = @{
    DataFactoryName = $dataFactoryName
    ResourceGroupName = $resouceGroupName
    Name = $integrationRuntimeName
}
Get-AzureRmDataFactoryV2IntegrationRuntimeKey @getAzureRmDataFactoryV2IntegrationRuntimeKeySplat

The above script creates the integration runtime then retrieves the authentication key that is required by the application that runs on our integration runtime node.

The next step is to configure our nodes, to do this we need to download the integration runtime at https://www.microsoft.com/en-gb/download/details.aspx?id=39717.

Once we run the downloaded executable the installer will install the required application, which will then open on the following screen asking for our authentication key:

image

We can enter the authentication key retrieved in the previous step into the box and click register to register our integration runtime with the data factory service. Please note that if your organisation has a proxy you will need to enter the details to ensure the integration runtime has the connectivity required.

After clicking register we receive the following prompt, click finish on this and allow the process to complete.

image

Once the process is complete our integration runtime is ready to go.

More information on creating the self-hosted integration runtime can be found at https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime.

ADF Pipeline

Now we’ve got our self-hosted integration runtime set up we can begin configuring our linked services, datasets and pipelines in ADF. We’ll do this using the new ADF visual tools.

Linked Services

The first step is to create our linked services. To do this we open up the visual tools, go to the author tab and select connections; we can then create a new linked service to connect to Azure Blob Storage:

image

Next we need to create a linked service for our on-prem file share.

First create a new linked service and select the file system connector, we can then fill in the relevant details.

image

There are a couple of things to note with the above configuration. The first is that we have selected our self-hosted integration runtime for the “Connect via integration runtime” option. We have also specified an account that has access to the folder we would like to pick files up from; in my case I have configured a local account however, if you were accessing data from a file share within your domain you would supply some domain credentials.

Datasets

For our pipeline we will need to create two datasets, one for our file system that will be our source and another for blob storage that will be our sink.

Firstly, we will create our file system dataset. To do this we will create a new dataset and select the file system connector, we will then configure our connection tab, like below:

image

As above, we have selected our linked service we created earlier, added a filter to select all files and opted to perform a binary copy of the files in the directory.

Next, we need to create a dataset for blob storage. To do this we will create a new dataset using the Azure Blob Storage connector, we will then configure our connection tab, like below:

image

This time we have entered the container we’d like to copy our files to as the directory name in our file path, we have also opted for a binary copy again.

Pipeline

The final step is to create our pipeline that will tie everything together and perform our copy.

First, create a new pipeline with a single copy activity. The source settings for the copy activity are as follows:

image

The settings for the sink are:

image

In this situation we have set the copy behaviour to “Preserve Hierarchy”, which will preserve the folder structure from the source folder when copying the files to blob storage.

Now we’ve got everything set up we can select “Publish All” to publish our changes to the ADF service.

Testing

We’ve completed all of the necessary configuration so we can now trigger our pipeline to test that it works. In the pipeline editor click the trigger button to trigger off the copy. We can then go to the monitor tab to monitor the pipeline as it runs:

image

The pipeline will take a couple of minutes to complete and on completion we can see that the files we selected to upload are now in our blob container:

image

Conclusion

To conclude, ADF V2 presents a viable solution when there is a need to copy files from an on-prem file system to Azure. The solution is relatively easy to set up and gives a good level of functionality and performance out of the box.

When looking to deploy the solution to production there are some more considerations such as high availability of the self-hosted integration runtime nodes however, the documentation given on MSDN helps give a good understanding of how to set this up by adding multiple nodes to the integration runtime.

Adatis SQLBits Photo Booth

This year at SQLBits Adatis decided to create two technical demos to fit in with the magic theme: the Adatis Sorting Hat and the Adatis Photo Booth.

My blog post will look at the Adatis Photo Booth which was designed to allow us to print a wizarding student ID for participants; I’ll give an overview of the technology stack that was used and then do some analysis of the data we collected using Power BI.

image

The Requirement

The first place to start when designing a technical solution is the business requirements. In our case the requirements were:

  • The user should have an app running on a tablet that will ask them for some basic details such as name, email address and twitter account before allowing them to take a selfie to be printed on their card.
  • The image of the user should be uploaded to the Azure Cognitive Services Face API to capture their emotions.
  • The final data to be placed on the card should be added to a database, in the required format, for our card printing software to pick up.
  • The solution should require the smallest amount of coding possible.
  • The solution should be fully automated and require the least amount of interaction from the Adati on the stand.

The Technology

To satisfy our requirements we chose to use Power Apps for the front-end and Logic Apps for the backend making our solution completely serverless.

A brief overview of the steps in the process and the tech used at each stage is below:

Architecture Diagram

At a high level the technology being used is:

  • Power Apps – allows us to build a cross-platform, front-end application without writing any code.
  • Azure SQL Database – allows us to store the data captured from the application. Power Apps natively integrates with Azure SQL DB.
  • Azure Blob Storage – cheap to use and easy to integrate with our solution. Power Apps doesn’t natively integrate with Blob storage so we quickly created a custom API using the instructions at https://powerapps.microsoft.com/en-us/blog/custom-api-for-image-upload/ to upload our images.
  • Azure Cognitive Services – a set of APIs that allowed us to very easily add some intelligence to our application. The face API was easy to integrate and we were using machine learning to map the emotions of the face within minutes.

    The Technical Challenges

    While building the solution we came across a couple of limitations that are useful to raise, these are:

    • There is currently no integration between Power Apps and blob storage meaning you need to roll your own API to allow the images from the camera control to be uploaded. The process is simple and the blog post at https://powerapps.microsoft.com/en-us/blog/custom-api-for-image-upload/ gives you details of what needs to be done.
    • The cognitive services connector for Logic Apps doesn’t currently return the emotion data returned by the API. To get around this we made a web request within the logic app and parsed the JSON returned to extract the attributes we needed.

    Apart from the API to upload the images no coding was required to get this solution to function making Power Apps and Logic Apps a very good choice when you need to create a solution that needs to be delivered quickly.

    The Data

    Now we’ve looked at the technical solution we can use Power BI to look at some of the data we collected at SQL Bits.

    To analyse the data I connected my Power BI desktop to the Azure SQL Database we were using, I then created some measures and visuals against the data to give the below result:

    image

    The visuals we have are:

    • Top Left – tree map that looks at the count of students by their primary emotion (the emotion that scored highest)
    • Top Middle – % of happy students (where primary emotion is happiness) and % of unhappy students (where primary emotion is anger, sadness or disgust) over each 3-hour period.
    • Top Right – total number of students we produced cards for, overall % of happy and % unhappy students.
    • Bottom – primary emotion logged over time

    As you can see, we did quite well with over 46% of our students having a primary emotion of happy, with only 13% being unhappy. It seems that over Thursday and Friday we had some unhappy students but on Saturday everybody was happy!

    Conclusion

    To conclude, we created our tech demos for SQLBits so everybody can have some fun on the stands however, as you’ve seen, the demos are backed by the cloud technology that we use all day, every day. Power Apps and Logic Apps are an excellent set of tools and allow you to build applications quickly and easily while connecting various services together; all while being part of the Azure PaaS and SaaS offering without the headache of provisioning servers. The Cognitive Services APIs also present a great offering and allow you to quickly and easily add some intelligence to your application without needing to roll your own solution.

    As always, if you have any questions or comments then please let me know.

    Analysis of Spatial Data Using Cosmos DB

    Introduction

    Recently, while researching Cosmos DB, I came across the in-built capabilities for managing spatial data.

    Cosmos DB is Microsoft’s globally distributed, multi-model database. It has the capability to store various types of data, such as document, graph and key-value and can elastically scale to cope with varying needs. The piece of Cosmos DB that this post will be discussing is the spatial capabilities of the document model.

    The problem I have chosen to solve using the spatial functionality is working out which airfields are within the range of an aircraft when given its true airspeed and fuel endurance in hours; with the range being calculated by multiplying the airspeed by the endurance.

    The Data

    The first step was to find a data set containing a list of all of the world’s airfields; this was found on GitHub at https://github.com/mwgg/Airports. The data set contains the details we need, which are:

    • ICAO code – this is the unique identifier for an airport
    • Airport Name
    • Latitude and Longitude of the Airport

    The next step was to create a Cosmos DB account in the Azure Portal and write a C# app to do some transformations on the data and load the documents into our Cosmos DB collection.

    I first created a C# object that matched the structure of the JSON data:

    using Newtonsoft.Json;
    
    namespace LoadAirportData
    {
        public class Airport
        {
            [JsonProperty("id")]
            public string Id => this.Icao;
    
            [JsonProperty("icao")]
            public string Icao { get; set; }
    
            [JsonProperty("iata")]
            public string Iata { get; set; }
    
            [JsonProperty("name")]
            public string Name { get; set; }
    
            [JsonProperty("city")]
            public string City { get; set; }
    
            [JsonProperty("state")]
            public string State { get; set; }
    
            [JsonProperty("country")]
            public string Country { get; set; }
    
            [JsonProperty("elevation")]
            public int Elevation { get; set; }
    
            [JsonProperty("lat")]
            public double Latitude { get; set; }
    
            [JsonProperty("lon")]
            public double Longitude { get; set; }
    
            [JsonProperty("tz")]
            public string TimeZone { get; set; }
        }
    }
    

    I then created a C# object that matched the structure of the document I wanted to insert into Cosmos DB. The “Point” data type is used to serialize the latitude and longitude into the GeoJSON structure that Cosmos DB supports:

    using Microsoft.Azure.Documents.Spatial;
    using Newtonsoft.Json;
    
    namespace LoadAirportData
    {
        public class AirportDocument
        {
            public AirportDocument(Airport airport)
            {
                Id = airport.Icao;
                Iata = airport.Iata;
                Name = airport.Name;
                Location = new Point(airport.Longitude, airport.Latitude);
            }
    
            [JsonProperty("id")]
            public string Id { get; set; }
    
            [JsonProperty("iata")]
            public string Iata { get; set; }
    
            [JsonProperty("name")]
            public string Name { get; set; }
            
            [JsonProperty("location")]
            public Point Location { get; set; }       
        }
    }
    

    Finally I created a method that dropped the Cosmos DB database, recreated the database and the document collection then loaded the documents into the collection:

    using Microsoft.Azure.Documents.Client;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using Microsoft.Azure.Documents;
    using Newtonsoft.Json;
    using System.Net;
    using System.Collections.Concurrent;
    
    namespace LoadAirportData
    {
        public class CosmosDbImporter
        {
            private const string ENDPOINT_URL = "<YOUR ENDPOINT>";
            private const string PRIMARY_KEY = "<YOUR KEY>";
    
            private const string DATABASE_NAME = "airports";
            private const string COLLECTION_NAME = "airportData";
    
            private const string IMPORT_PATH = @"C:\Temp\Airports.json";
    
            public async Task ImportData()
            {
                var documentClient = new DocumentClient(new Uri(ENDPOINT_URL), PRIMARY_KEY);
    
                // Delete the database if it exists
                try
                {
                    await documentClient.DeleteDatabaseAsync(UriFactory.CreateDatabaseUri(DATABASE_NAME));
                }
                catch (DocumentClientException ex)
                {
                    if (ex.StatusCode != HttpStatusCode.NotFound)
                        throw;
                }
    
                // Create the Database
                await documentClient.CreateDatabaseIfNotExistsAsync(new Database() { Id = DATABASE_NAME });
    
                // Create the collection and switch on spatial indexing
                DocumentCollection collection = new DocumentCollection() { Id = COLLECTION_NAME };
                collection.IndexingPolicy = new IndexingPolicy(new SpatialIndex(DataType.Point));
    
                await documentClient.CreateDocumentCollectionIfNotExistsAsync(UriFactory.CreateDatabaseUri(DATABASE_NAME), collection);
    
                // Read the file and deserialize to our Airport object
                var data = System.IO.File.ReadAllText(IMPORT_PATH);
                var airports = JsonConvert.DeserializeObject<Dictionary<string, Airport>>(data);
    
                // Upload documents to CosmosDB            
                await Task.WhenAll(
                    from partition in Partitioner.Create(airports.Values).GetPartitions(50)
                    select Task.Run(async delegate
                    {
                        using (partition)
                        {
                            while (partition.MoveNext())
                            {
                                Console.WriteLine($"Processing {partition.Current.Icao}");
    
                                var airportDocument = new AirportDocument(partition.Current);
                                await documentClient.CreateDocumentAsync(UriFactory.CreateDocumentCollectionUri(DATABASE_NAME, COLLECTION_NAME), airportDocument);
                            }
                        }
                    }));
            }
        }
    }
    

    One thing to note is that the above code enables spatial indexing when creating the collection; without this enabled, performance is extremely poor when performing spatial queries.

    The beauty of Cosmos DB is that it is able to elastically scale to the performance level specified by the user through the number of RUs (request units) that are allocated to the collection. While loading the data into Cosmos DB I wanted to scale up my database to take advantage of the multithreading in my C# app and speed up my processing, so I went into the Azure Portal and adjusted the number of RUs allocated to the collection; the change was almost instant and my process immediately sped up. Once I had finished importing the data I was able to scale my database back down so I’m no longer paying for any unused capacity.

    image

    Querying

    Now the data is in Cosmos DB we can begin to play around with some spatial queries.

    To query airfields within a certain distance of a specified point I can run the following query which returns all of the airfields within 25km of Blackbushe airport. As you can see, the SQL syntax for querying Cosmos DB is very similar to T-SQL meaning it’s very easy to re-use your SQL Server skills:

    SELECT  airports.id AS ICAO,
            airports.name AS Name,
            ST_DISTANCE(airports.location, {"type": "Point", "coordinates": [-0.8475000262,51.3238983154]}) AS Distance
    FROM    airports 
    WHERE   ST_DISTANCE(airports.location, {"type": "Point", "coordinates": [-0.8475000262,51.3238983154]}) < 25000

    The above query returns the following results, which are the 7 airfields that are within 25km of Blackbushe:

    [
        {
            "ICAO": "EGHL",
            "Name": "Lasham Airport",
            "Distance": 19964.7890768588
        },
        {
            "ICAO": "EGVO",
            "Name": "RAF Odiham",
            "Distance": 11985.957064869535
        },
        {
            "ICAO": "EGTF",
            "Name": "Fairoaks Airport",
            "Distance": 20229.321025944442
        },
        {
            "ICAO": "EGLF",
            "Name": "Farnborough Airport",
            "Distance": 7286.035340157135
        },
        {
            "ICAO": "EGLK",
            "Name": "Blackbushe Airport",
            "Distance": 0
        },
        {
            "ICAO": "EGLM",
            "Name": "White Waltham Airfield",
            "Distance": 20312.693531316185
        },
        {
            "ICAO": "EGLP",
            "Name": "Brimpton Airfield",
            "Distance": 23311.94703537874
        }
    ]

    The App

    The next step is to create an application that uses the functionality to find the airfields within the range of an aircraft. To do this I created a basic ASP.NET MVC application that has a form with the following fields:

    image

    When the form is submitted the following C# code is executed:

    [HttpPost]
    public async Task<ActionResult> Index(AirportFinderModel model)
    {
    	var documentClient = new DocumentClient(new Uri(ENDPOINT_URL), PRIMARY_KEY);
    
    	var baseAirfield = await documentClient.ReadDocumentAsync<AirportDocument>(UriFactory.CreateDocumentUri(DATABASE_NAME, COLLECTION_NAME, model.BaseAirfield));
    
    	var availableDistanceMeters = (model.EnduranceHours * model.TrueAirspeed) * 1852;
    
    	var result =
    		documentClient
    		.CreateDocumentQuery<AirportDocument>(UriFactory.CreateDocumentCollectionUri(DATABASE_NAME, COLLECTION_NAME))
    		.Where(a => a.Location.Distance(baseAirfield.Document.Location) <= availableDistanceMeters)
    		.ToList();
    
    	return View("Result", new AirportFinderResultModel()
    	{
    		MapPoints = JsonConvert.SerializeObject(result.Select(a => new MapPoint()
    		{
    			Title = a.Id,
    			Description = a.Name,
    			Longitude = a.Location.Position.Longitude,
    			Latitude = a.Location.Position.Latitude
    		}))
    	});
    }

    The above code connects to Cosmos DB and retrieves the details for the base airfield that was specified. It then calculates the range of the aircraft in meters by multiplying the endurance (in hours) by the true airspeed in knots (nautical miles per hour) and then multiplying that by 1852 (the number of meters in a nautical mile). A LINQ query is then run against Cosmos DB using the built-in spatial functions to find airfields within the specified distance. The result is then converted into a JSON array that can be understood by the Google Maps API that is being used on the client side.

    The client side uses the Google Maps API to plot the airfields on a map, giving us a view like the one below when given a base airfield of Blackbushe (EGLK), a true airspeed of 100kts and an endurance of 4.5 hours:

    image

    The current functionality of the app is extremely basic but could easily be expanded to make the range calculation more accurate by looking at wind and other factors that can affect range. This could be done by creating a polygon representing our range and then using the ST_WITHIN function to find all airfields within that polygon. The functionality could also be enhanced to take into account other attributes of the airfield by deciding if an airfield is suitable based on runway length and other facilities.

    Conclusion

    As you can see, using Cosmos DB it is extremely easy to analyse spatial data and the elastic scaling capability allows you to easily increase capacity of your application to cope with increased usage and large amounts of data. The spatial capabilities of Cosmos DB can make it a viable alternative to other databases such as SQL Server that aren’t always as easy to scale on demand and don’t have the flexibility that a document model can give.

    As always, if you have any questions or comments then let me know.