BenJarvis

Ben Jarvis' Blog

Uploading Files from On-Prem File System to Azure Blob Storage Using ADF V2

Introduction

Recently we had a client requirement whereby we needed to upload some files from an on-prem file server to Azure Blob Storage so they could be processed further. The file system connector in ADF V2 with a self-hosted integration runtime was the perfect solution for this so this blog post will discuss the process for getting a basic set up working to test the functionality.

Test Data

The first task is to generate some data to upload. I did this by executing the below PowerShell script to create 100 files, with each being 10 MB in size.

$numberOfFiles = 100
$directory = "C:\Temp\Files"
$fileSize = 1000000 * 10 # 10 MB

Remove-Item $directory -Recurse
New-Item $directory -ItemType Directory

for ($i = 0; $i -lt $numberOfFiles + 1; $i++) 
{
    fsutil file createnew "$directory\Test_$i.txt" $fileSize
}

Now we’ve got some files to upload it’s time to create our resources in Azure.

Azure Configuration

The following PowerShell script uses the Azure RM PowerShell cmdlets to create our resources. This script will provision the following resources in our Azure subscription:

  • Resource Group
  • ADF V2 Instance
  • Storage Account

$location = "West Europe"
$resouceGroupName = "ADFV2FileUploadResourceGroup"
$dataFactoryName = "ADFV2FileUpload"
$storageAccountName = "adfv2fileupload"

# Login to Azure
$credential = Get-Credential
Login-AzureRmAccount -Credential $credential

# Create Resource Group
New-AzureRmResourceGroup -Name $resouceGroupName -Location $location

# Create ADF V2
$setAzureRmDataFactoryV2Splat = @{
    Name = $dataFactoryName
    ResourceGroupName = $resouceGroupName
    Location = $location
}
Set-AzureRmDataFactoryV2 @setAzureRmDataFactoryV2Splat 

# Create Storage Account
$newAzureRmStorageAccountSplat = @{
    Kind = "Storage"
    Name = $storageAccountName
    ResourceGroupName = $resouceGroupName
    Location = $location
    SkuName = "Standard_LRS"
}
New-AzureRmStorageAccount @newAzureRmStorageAccountSplat

Our Azure Portal should now show the following:

image

Now we’ve got our resources provisioned it’s time to configure ADF to access our on-prem data.

ADF Configuration

Self-Hosted Integration Runtime

In ADF V2 the integration runtime is responsible for providing the compute infrastructure that carries out data movement between data stores. A self-hosted integration runtime is an on-premise version of the integration runtime that is able to perform copy activities to and from on-premise data stores.

When we configure a self-hosted integration runtime the data factory service, that sits in Azure, will orchestrate the nodes that make up the integration runtime through the use of Azure Service Bus meaning our nodes that are hosted on-prem are performing all of our data movement and connecting to our on-premises data sources while being triggered by our data factory pipelines that are hosted in the cloud. A self-hosted integration runtime can have multiple nodes associated with it, which not only caters for high availability but also gives an additional performance benefit as ADF will use all of the available nodes to perform processing.

To create our self-hosted integration runtime we need to use the following PowerShell script:

$credential = Get-Credential
Login-AzureRmAccount -Credential $credential

$resouceGroupName = "ADFV2FileUploadResourceGroup"
$dataFactoryName = "ADFV2FileUpload"
$integrationRuntimeName = "SelfHostedIntegrationRuntime"

# Create integration runtime
$setAzureRmDataFactoryV2IntegrationRuntimeSplat = @{
    Type = 'SelfHosted'
    Description = "Integration Runtime to Access On-Premise Data"
    DataFactoryName = $dataFactoryName
    ResourceGroupName = $resouceGroupName
    Name = $integrationRuntimeName
}
Set-AzureRmDataFactoryV2IntegrationRuntime @setAzureRmDataFactoryV2IntegrationRuntimeSplat

# Get the key that is required for our on-prem app
$getAzureRmDataFactoryV2IntegrationRuntimeKeySplat = @{
    DataFactoryName = $dataFactoryName
    ResourceGroupName = $resouceGroupName
    Name = $integrationRuntimeName
}
Get-AzureRmDataFactoryV2IntegrationRuntimeKey @getAzureRmDataFactoryV2IntegrationRuntimeKeySplat

The above script creates the integration runtime then retrieves the authentication key that is required by the application that runs on our integration runtime node.

The next step is to configure our nodes, to do this we need to download the integration runtime at https://www.microsoft.com/en-gb/download/details.aspx?id=39717.

Once we run the downloaded executable the installer will run and install the required application, the application will then open on the following screen asking for our authentication key:

image

We can enter the authentication key retrieved in the previous step into the box and click register to register our integration runtime with the data factory service. Please note that if your organisation has a proxy you will need to enter the details to ensure the integration runtime has the connectivity required.

After clicking register we receive the following prompt, click finish on this and allow the process to complete.

image

Once the process is complete our integration runtime is ready to go.

More information on creating the self-hosted integration runtime can be found at https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime.

ADF Pipeline

Now we’ve got our self-hosted integration runtime set up we can begin configuring our linked services, datasets and pipelines in ADF. We’ll do this using the new ADF visual tools.

Linked Services

The first step is to create our linked services. To do this we open up the visual tools, go to the author tab and select connections; we can then create a new linked service to connect to Azure Blob Storage:

image

Next we need to create a linked service for our on-prem file share.

First create a new linked service and select the file system connector, we can then fill in the relevant details.

image

There are a couple of things to note with the above configuration, the first is we have selected our self-hosted integration runtime for the “Connect via integration runtime” option. We have also specified an account that has access to the folder we would like to pick files up from, in my case I have configured a local account however, if you were accessing data from a file share within your domain you would supply some domain credentials.

Datasets

For our pipeline we will need to create two datasets, one for our file system that will be our source and another for blob storage that will be our sink.

Firstly, we will create our file system dataset. To do this we will create a new dataset and select the file system connector, we will then configure our connection tab, like below:

image

As above, we have selected our linked service we created earlier, added a filter to select all files and opted to perform a binary copy of the files in the directory.

Next, we need to create a dataset for blob storage. To do this we will create a new dataset using the Azure Blob Storage connector, we will then configure our connection tab, like below:

image

This time we have entered the container we’d like to copy our files to as the directory name in our file path, we have also opted for a binary copy again.

Pipeline

The final step is to create our pipeline that will tie everything together and perform our copy.

First, create a new pipeline with a single copy activity. The source settings for the copy activity are as follows:

image

The settings for the sink are:

image

In this situation we have set the copy behaviour to “Preserve Hierarchy”, which will preserve the folder structure from the source folder when copying the files to blob storage.

Now we’ve got everything set up we can select “Publish All” to publish our changes to the ADF service.

Testing

We’ve completed all of the necessary configuration so we can now trigger our pipeline to test that it works. In the pipeline editor click the trigger button to trigger off the copy. We can then go to the monitor tab to monitor the pipeline as it runs:

image

The pipeline will take a couple of minutes to complete and on completion we can see that the files we selected to upload are now in our blob container:

image

Conclusion

To conclude, ADF V2 presents a viable solution when there is a need to copy files from an on-prem file system to Azure. The solution is relatively easy to set up and gives a good level of functionality and performance out of the box.

When looking to deploy the solution to production there are some more considerations such as high availability of the self-hosted integration runtime nodes however, the documentation given on MSDN helps give a good understanding of how to set this up by adding multiple nodes to the integration runtime.

Loading