Ben Jarvis' Blog

ADF V2 Issue With File Extension After Decompressing Files

On a current client project we are taking files from an on-prem file server and uploading them to Azure Blob Storage using ADF V2. The files are compressed on-prem using GZip compression and need to be decompressed before they are placed in blob storage where some other processes will pick them up.

ADF V2 natively supports decompression of files as documented at With this functionality ADF should change the extension of the file when it is decompressed so 1234_567.csv.gz would become 1234_567.csv however, I’ve noticed that this doesn’t happen in all cases.

In our particular case the file names and extensions of the source files are all uppercase and when ADF uploads them it doesn’t alter the file extension e.g. if I upload 1234_567.CSV.GZ I get 1234_567.CSV.GZ in blob storage rather than 1234_567.CSV.

If I upload 1234_567.csv.gz the functionality works correctly and I get 1234_567.csv in blob storage.This means that the file extension replace is case sensitive when it should be case insensitive.

This bug isn’t a major issue for us as the file is decompressed and we can change the extension when we process the file further however, it’s something that stumped me for a while.

I’ve raised a bug at to get this fixed so please vote and I’ll update the post once the issue has been resolved.

Uploading Files from On-Prem File System to Azure Blob Storage Using ADF V2


Recently we had a client requirement whereby we needed to upload some files from an on-prem file server to Azure Blob Storage so they could be processed further. The file system connector in ADF V2 with a self-hosted integration runtime was the perfect solution for this so this blog post will discuss the process for getting a basic set up working to test the functionality.

Test Data

The first task is to generate some data to upload. I did this by executing the below PowerShell script to create 100 files, with each being 10 MB in size.

$numberOfFiles = 100
$directory = "C:\Temp\Files"
$fileSize = 1000000 * 10 # 10 MB

Remove-Item $directory -Recurse
New-Item $directory -ItemType Directory

for ($i = 0; $i -lt $numberOfFiles + 1; $i++) 
    fsutil file createnew "$directory\Test_$i.txt" $fileSize

Now we’ve got some files to upload it’s time to create our resources in Azure.

Azure Configuration

The following PowerShell script uses the Azure RM PowerShell cmdlets to create our resources. This script will provision the following resources in our Azure subscription:

  • Resource Group
  • ADF V2 Instance
  • Storage Account

$location = "West Europe"
$resouceGroupName = "ADFV2FileUploadResourceGroup"
$dataFactoryName = "ADFV2FileUpload"
$storageAccountName = "adfv2fileupload"

# Login to Azure
$credential = Get-Credential
Login-AzureRmAccount -Credential $credential

# Create Resource Group
New-AzureRmResourceGroup -Name $resouceGroupName -Location $location

# Create ADF V2
$setAzureRmDataFactoryV2Splat = @{
    Name = $dataFactoryName
    ResourceGroupName = $resouceGroupName
    Location = $location
Set-AzureRmDataFactoryV2 @setAzureRmDataFactoryV2Splat 

# Create Storage Account
$newAzureRmStorageAccountSplat = @{
    Kind = "Storage"
    Name = $storageAccountName
    ResourceGroupName = $resouceGroupName
    Location = $location
    SkuName = "Standard_LRS"
New-AzureRmStorageAccount @newAzureRmStorageAccountSplat

Our Azure Portal should now show the following:


Now we’ve got our resources provisioned it’s time to configure ADF to access our on-prem data.

ADF Configuration

Self-Hosted Integration Runtime

In ADF V2 the integration runtime is responsible for providing the compute infrastructure that carries out data movement between data stores. A self-hosted integration runtime is an on-premise version of the integration runtime that is able to perform copy activities to and from on-premise data stores.

When we configure a self-hosted integration runtime the data factory service, that sits in Azure, will orchestrate the nodes that make up the integration runtime through the use of Azure Service Bus meaning our nodes that are hosted on-prem are performing all of our data movement and connecting to our on-premises data sources while being triggered by our data factory pipelines that are hosted in the cloud. A self-hosted integration runtime can have multiple nodes associated with it, which not only caters for high availability but also gives an additional performance benefit as ADF will use all of the available nodes to perform processing.

To create our self-hosted integration runtime we need to use the following PowerShell script:

$credential = Get-Credential
Login-AzureRmAccount -Credential $credential

$resouceGroupName = "ADFV2FileUploadResourceGroup"
$dataFactoryName = "ADFV2FileUpload"
$integrationRuntimeName = "SelfHostedIntegrationRuntime"

# Create integration runtime
$setAzureRmDataFactoryV2IntegrationRuntimeSplat = @{
    Type = 'SelfHosted'
    Description = "Integration Runtime to Access On-Premise Data"
    DataFactoryName = $dataFactoryName
    ResourceGroupName = $resouceGroupName
    Name = $integrationRuntimeName
Set-AzureRmDataFactoryV2IntegrationRuntime @setAzureRmDataFactoryV2IntegrationRuntimeSplat

# Get the key that is required for our on-prem app
$getAzureRmDataFactoryV2IntegrationRuntimeKeySplat = @{
    DataFactoryName = $dataFactoryName
    ResourceGroupName = $resouceGroupName
    Name = $integrationRuntimeName
Get-AzureRmDataFactoryV2IntegrationRuntimeKey @getAzureRmDataFactoryV2IntegrationRuntimeKeySplat

The above script creates the integration runtime then retrieves the authentication key that is required by the application that runs on our integration runtime node.

The next step is to configure our nodes, to do this we need to download the integration runtime at

Once we run the downloaded executable the installer will run and install the required application, the application will then open on the following screen asking for our authentication key:


We can enter the authentication key retrieved in the previous step into the box and click register to register our integration runtime with the data factory service. Please note that if your organisation has a proxy you will need to enter the details to ensure the integration runtime has the connectivity required.

After clicking register we receive the following prompt, click finish on this and allow the process to complete.


Once the process is complete our integration runtime is ready to go.

More information on creating the self-hosted integration runtime can be found at

ADF Pipeline

Now we’ve got our self-hosted integration runtime set up we can begin configuring our linked services, datasets and pipelines in ADF. We’ll do this using the new ADF visual tools.

Linked Services

The first step is to create our linked services. To do this we open up the visual tools, go to the author tab and select connections; we can then create a new linked service to connect to Azure Blob Storage:


Next we need to create a linked service for our on-prem file share.

First create a new linked service and select the file system connector, we can then fill in the relevant details.


There are a couple of things to note with the above configuration, the first is we have selected our self-hosted integration runtime for the “Connect via integration runtime” option. We have also specified an account that has access to the folder we would like to pick files up from, in my case I have configured a local account however, if you were accessing data from a file share within your domain you would supply some domain credentials.


For our pipeline we will need to create two datasets, one for our file system that will be our source and another for blob storage that will be our sink.

Firstly, we will create our file system dataset. To do this we will create a new dataset and select the file system connector, we will then configure our connection tab, like below:


As above, we have selected our linked service we created earlier, added a filter to select all files and opted to perform a binary copy of the files in the directory.

Next, we need to create a dataset for blob storage. To do this we will create a new dataset using the Azure Blob Storage connector, we will then configure our connection tab, like below:


This time we have entered the container we’d like to copy our files to as the directory name in our file path, we have also opted for a binary copy again.


The final step is to create our pipeline that will tie everything together and perform our copy.

First, create a new pipeline with a single copy activity. The source settings for the copy activity are as follows:


The settings for the sink are:


In this situation we have set the copy behaviour to “Preserve Hierarchy”, which will preserve the folder structure from the source folder when copying the files to blob storage.

Now we’ve got everything set up we can select “Publish All” to publish our changes to the ADF service.


We’ve completed all of the necessary configuration so we can now trigger our pipeline to test that it works. In the pipeline editor click the trigger button to trigger off the copy. We can then go to the monitor tab to monitor the pipeline as it runs:


The pipeline will take a couple of minutes to complete and on completion we can see that the files we selected to upload are now in our blob container:



To conclude, ADF V2 presents a viable solution when there is a need to copy files from an on-prem file system to Azure. The solution is relatively easy to set up and gives a good level of functionality and performance out of the box.

When looking to deploy the solution to production there are some more considerations such as high availability of the self-hosted integration runtime nodes however, the documentation given on MSDN helps give a good understanding of how to set this up by adding multiple nodes to the integration runtime.

Adatis SQLBits Photo Booth

This year at SQLBits Adatis decided to create two technical demos to fit in with the magic theme, these were the Adatis Sorting Hat and the Adatis Photo Booth.

My blog post will look at the Adatis Photo Booth which was designed to allow us to print a wizarding student ID for participants; I’ll give an overview of the technology stack that was used and then do some analysis of the data we collected using Power BI.


The Requirement

The first place to start when designing a technical solution are the business requirements. In our case the requirements were:

  • The user should have an app running on a tablet that will ask them for some basic details such as name, email address and twitter account before allowing them to take a selfie to be printed on their card.
  • The image of the user should be uploaded to the Azure Cognitive Services Face API to capture their emotions.
  • The final data to be placed on the card should be added to a database, in the required format, for our card printing software to pick up.
  • The solution should require the smallest amount of coding possible.
  • The solution should be fully automated and require the least amount of interaction from the Adati on the stand.

The Technology

To satisfy our requirements we chose to use Power Apps for the front-end and Logic Apps for the backend making our solution completely serverless.

A brief overview of the steps in the process and the tech used at each stage is below:

Architecture Diagram

At a high level the technology being used is:

  • Power Apps – allows us to build a cross-platform, front-end application without writing any code.
  • Azure SQL Database – allows us to store the data captured from the application. Power Apps natively integrates with Azure SQL DB.
  • Azure Blob Storage – cheap to use and easy to integrate with our solution. Power Apps doesn’t natively integrate with Blob storage so we quickly created a custom API using the instructions at to upload our images.
  • Azure Cognitive Services – a set of APIs that allowed us to very easily add some intelligence to our application. The face API was easy to integrate and we were using machine learning to map the emotions of the face within minutes.

    The Technical Challenges

    While building the solution we came across a couple of limitations that are useful to raise, these are:

    • There is currently no integration between Power Apps and blob storage meaning you need to roll your own API to allow the images from the camera control to be uploaded. The process is simple and the blog post at gives you details of what needs to be done.
    • The cognitive services connector for Logic Apps doesn’t currently return the emotion data returned by the API. To get around this we made a web request within the logic app and parsed the JSON returned to extract the attributes we needed.

    Apart from the API to upload the images no coding was required to get this solution to function making Power Apps and Logic Apps a very good choice when you need to create a solution that needs to be delivered quickly.

    The Data

    Now we’ve looked at the technical solution we can use Power BI to look at some of the data we collected at SQL Bits.

    To analyse the data I connected my Power BI desktop to the Azure SQL Database we were using, I then created some measures and visuals against the data to give the below result:


    The visuals we have are:

    • Top Left - tree map that look at the count of students by their primary emotion (the emotion that scored highest)
    • Top Middle – % of happy (where primary emotion is happiness) and % of unhappy (where primary emotion is anger, sadness or disgust) over each 3 hour period.
    • Top Right – total number of students we produced cards for, overall % of happy and % unhappy students.
    • Bottom – primary emotion logged over time

    As you can see, we did quite well with over 46% of our students having a primary emotion of happy, with only 13% being unhappy. It seems that over Thursday and Friday we had some unhappy students but on Saturday everybody was happy!


    To conclude, we created our tech demos for SQLBits so everybody can have some fun on the stands however, as you’ve seen, the demos are backed by the cloud technology that we use all day, everyday. Power Apps and Logic Apps are an excellent set of tools and allow you to build applications quickly and easily while connecting various services together; all while being part of the Azure PaaS and SaaS offering without the headache of provisioning servers. The Cognitive Services APIs also present a great offering and allow you to quickly and easily add some intelligence to your application without needing to roll your own solution.

    As always, if you have any questions or comments then please let me know.

    Analysis of Spatial Data Using Cosmos DB


    Recently, while researching Cosmos DB, I came across the in-built capabilities for managing spatial data.

    Cosmos DB is Microsoft’s globally distributed, multi-model database. It has the capability to store various types of data, such as document, graph and key-value and can elastically scale to cope with varying needs. The piece of Cosmos DB that this post will be discussing is the spatial capabilities of the document model.

    The problem I have chosen to solve using the spatial functionality is working out which airfields are within the range of an aircraft when given its true airspeed and fuel endurance in hours; with the range being calculated by multiplying the airspeed by the endurance.

    The Data

    The first step was to find a data set containing a list of all of the world’s airfields, this was found on GitHub at The data set contains the details we need, which are:

    • ICAO code – this is the unique identifier for an airport
    • Airport Name
    • Latitude and Longitude of the Airport

    The next step was to create a Cosmos DB account in the Azure Portal and write a C# app to do some transformations on the data and load the documents into our Cosmos DB collection.

    I first created a C# object that matched the structure of the JSON data:

    using Newtonsoft.Json;
    namespace LoadAirportData
        public class Airport
            public string Id => this.Icao;
            public string Icao { get; set; }
            public string Iata { get; set; }
            public string Name { get; set; }
            public string City { get; set; }
            public string State { get; set; }
            public string Country { get; set; }
            public int Elevation { get; set; }
            public double Latitude { get; set; }
            public double Longitude { get; set; }
            public string TimeZone { get; set; }

    I then created a C# object that matched the structure of the document I wanted to insert into Cosmos DB. The “Point” data type is used to serialize the latitude and longitude into the GeoJSON structure that Cosmos DB supports:

    using Microsoft.Azure.Documents.Spatial;
    using Newtonsoft.Json;
    namespace LoadAirportData
        public class AirportDocument
            public AirportDocument(Airport airport)
                Id = airport.Icao;
                Iata = airport.Iata;
                Name = airport.Name;
                Location = new Point(airport.Longitude, airport.Latitude);
            public string Id { get; set; }
            public string Iata { get; set; }
            public string Name { get; set; }
            public Point Location { get; set; }       

    Finally I created a method that dropped the Cosmos DB database, recreated the database and the document collection then loaded the documents into the collection:

    using Microsoft.Azure.Documents.Client;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using Microsoft.Azure.Documents;
    using Newtonsoft.Json;
    using System.Net;
    using System.Collections.Concurrent;
    namespace LoadAirportData
        public class CosmosDbImporter
            private const string ENDPOINT_URL = "<YOUR ENDPOINT>";
            private const string PRIMARY_KEY = "<YOUR KEY>";
            private const string DATABASE_NAME = "airports";
            private const string COLLECTION_NAME = "airportData";
            private const string IMPORT_PATH = @"C:\Temp\Airports.json";
            public async Task ImportData()
                var documentClient = new DocumentClient(new Uri(ENDPOINT_URL), PRIMARY_KEY);
                // Delete the database if it exists
                    await documentClient.DeleteDatabaseAsync(UriFactory.CreateDatabaseUri(DATABASE_NAME));
                catch (DocumentClientException ex)
                    if (ex.StatusCode != HttpStatusCode.NotFound)
                // Create the Database
                await documentClient.CreateDatabaseIfNotExistsAsync(new Database() { Id = DATABASE_NAME });
                // Create the collection and switch on spatial indexing
                DocumentCollection collection = new DocumentCollection() { Id = COLLECTION_NAME };
                collection.IndexingPolicy = new IndexingPolicy(new SpatialIndex(DataType.Point));
                await documentClient.CreateDocumentCollectionIfNotExistsAsync(UriFactory.CreateDatabaseUri(DATABASE_NAME), collection);
                // Read the file and deserialize to our Airport object
                var data = System.IO.File.ReadAllText(IMPORT_PATH);
                var airports = JsonConvert.DeserializeObject<Dictionary<string, Airport>>(data);
                // Upload documents to CosmosDB            
                await Task.WhenAll(
                    from partition in Partitioner.Create(airports.Values).GetPartitions(50)
                    select Task.Run(async delegate
                        using (partition)
                            while (partition.MoveNext())
                                Console.WriteLine($"Processing {partition.Current.Icao}");
                                var airportDocument = new AirportDocument(partition.Current);
                                await documentClient.CreateDocumentAsync(UriFactory.CreateDocumentCollectionUri(DATABASE_NAME, COLLECTION_NAME), airportDocument);

    One thing to note is the above code enables spatial indexing when creating the collection, without this enabled performance is extremely poor when performing spatial queries.

    The beauty of Cosmos DB is that it is able to elastically scale to the performance level specified by the user through the number of RUs (request units) that are allocated to the collection. While loading the data into Cosmos DB I wanted to scale up my database to take advantage of the multithreading in my C# app and speed up my processing so I just went in to the Azure Portal and adjusted the number of RUs allocated to the collection, the change was almost instant and my process instantly sped up. Once I had finished importing the data I was able to scale my database back down so I’m no longer paying for any unused capacity.



    Now the data is in Cosmos DB we can begin to play around with some spatial queries.

    To query airfields within a certain distance of a specified point I can run the following query which returns all of the airfields within 25km of Blackbushe airport. As you can see, the SQL syntax for querying Cosmos DB is very similar to T-SQL meaning it’s very easy to re-use your SQL Server skills:

   AS Name,
            ST_DISTANCE(airports.location, {"type": "Point", "coordinates": [-0.8475000262,51.3238983154]}) AS Distance
    FROM    airports 
    WHERE   ST_DISTANCE(airports.location, {"type": "Point", "coordinates": [-0.8475000262,51.3238983154]}) < 25000

    The above query returns the following results, which are the 7 airfields that are within 25km of Blackbushe:

            "ICAO": "EGHL",
            "Name": "Lasham Airport",
            "Distance": 19964.7890768588
            "ICAO": "EGVO",
            "Name": "RAF Odiham",
            "Distance": 11985.957064869535
            "ICAO": "EGTF",
            "Name": "Fairoaks Airport",
            "Distance": 20229.321025944442
            "ICAO": "EGLF",
            "Name": "Farnborough Airport",
            "Distance": 7286.035340157135
            "ICAO": "EGLK",
            "Name": "Blackbushe Airport",
            "Distance": 0
            "ICAO": "EGLM",
            "Name": "White Waltham Airfield",
            "Distance": 20312.693531316185
            "ICAO": "EGLP",
            "Name": "Brimpton Airfield",
            "Distance": 23311.94703537874

    The App

    The next step is to create an application that uses the functionality to find the airfields within the range of an aircraft. To do this I created a basic ASP.NET MVC application that has a form with the following fields:


    When the form is submitted the following C# code is executed:

    public async Task Index(AirportFinderModel model)
    	var documentClient = new DocumentClient(new Uri(ENDPOINT_URL), PRIMARY_KEY);
    	var baseAirfield = await documentClient.ReadDocumentAsync(UriFactory.CreateDocumentUri(DATABASE_NAME, COLLECTION_NAME, model.BaseAirfield));
    	var availableDistanceMeters = (model.EnduranceHours * model.TrueAirspeed) * 1852;
    	var result =
    		.CreateDocumentQuery(UriFactory.CreateDocumentCollectionUri(DATABASE_NAME, COLLECTION_NAME))
    		.Where(a => a.Location.Distance(baseAirfield.Document.Location) <= availableDistanceMeters)
    	return View("Result", new AirportFinderResultModel()
    		MapPoints = JsonConvert.SerializeObject(result.Select(a => new MapPoint()
    			Title = a.Id,
    			Description = a.Name,
    			Longitude = a.Location.Position.Longitude,
    			Latitude = a.Location.Position.Latitude

    The above code connects to Cosmos DB and retrieves the details for the base airfield that was specified, it then calculates the range of the aircraft in meters by multiplying the endurance (in hours) by the true airspeed in knots (nautical miles per hour) and then multiplying that my 1852 (number of meters in a nautical mile). A Linq query is then run against Cosmos DB using the built-in spatial functions to find airfields within the specified distance. The result is then converted into a JSON array that can be understood by the Google Maps API that is being used on the client side.

    The client side uses the Google Maps API to plot the airfields on a map, giving us a view like the one below when given a base airfield of Blackbushe (EGLK), a true airspeed of 100kts and an endurance of 4.5 hours:


    The current functionality of the app is extremely basic but could easily be expanded to make the range calculation more accurate by looking at wind and other factors that can affect range. This could be done by creating a polygon representing our range and then using the ST_WITHIN function to find all airfields within that polygon. The functionality could also be enhanced to take into account other attributes of the airfield by deciding if an airfield is suitable based on runway length and other facilities.


    As you can see, using Cosmos DB it is extremely easy to analyse spatial data and the elastic scaling capability allows you to easily increase capacity of your application to cope with increased usage and large amounts of data. The spatial capabilities of Cosmos DB can make it a viable alternative to other databases such as SQL Server that aren’t always as easy to scale on demand and don’t have the flexibility that a document model can give.

    As always, if you have any questions or comments then let me know.