Tristan Robinson's Blog

Converting Data Factory Azure Function Activities from Web to Native

Microsoft have recently added support for calling Azure Functions natively within Data Factory. Prior to this, you only had one option if you wanted to leverage the serverless compute – which was through a web activity. While this certainly did a job, it wasn’t ideal as it exposed the function/host key in code, along with the HTTP message used to authenticate with the functions. As such, I expect quite a few people’s pipelines will be in a similar state to my own. I have recently been through the process of upgrading them in line with the new native ADF functionality, and as part of this I thought it would be worth sharing the process I went through so you don’t suffer the same pains!



Linked Service

The first step in this process is to add a new linked service into your data factory. When you try to do this, you’ll probably end up doing what always seems to happen when first searching for the Azure Function connector and type it into the search prompt in front of you. This is filtered to data connectors by default, so it won’t appear – you’ll need to switch the tab at the top to Compute, at which point you’ll see it. The next step is to add the URL of your function app endpoint, which will be in the format https://functionappname.azurewebsites.net. You’ll then need to supply a key to the function app for authentication purposes. I strongly advise you to set up an Azure Key Vault, store the key in there, and then access it from ADF via a secret. This is simple to do and promotes best practice from a security perspective, as well as being useful for CI/CD purposes. The key can be either a function key (specific to a single function) or a host key (access to all functions). Most of the tutorials on the web, as well as the official MS content, currently reference the function key, so I was slightly worried at first that you could only call functions via the function key. This is not the case – the host key can be used too. If you’re not too familiar with functions, these keys can be found within the application settings of your function app.

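For reference, the resulting linked service definition looks roughly like the JSON below when the key is pulled from Key Vault – the linked service and secret names used here are just placeholders.

{
    "name": "AzureFunctionLinkedService",
    "properties": {
        "type": "AzureFunction",
        "typeProperties": {
            "functionAppUrl": "https://functionappname.azurewebsites.net",
            "functionKey": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVaultLinkedService",
                    "type": "LinkedServiceReference"
                },
                "secretName": "FunctionHostKey"
            }
        }
    }
}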


Pipelines

The next step in the process is replacing those web activities with native Azure Function activities. While this is a fairly simple process, you’ll no longer be calling a GET on an HTTP endpoint directly via a URL string and passing in parameters. Instead, you’ll need to specify the function you want to call and pass in the parameters via a JSON body as part of a POST call to the endpoint. For any functions not requiring parameters, you’ll need to use a GET rather than a POST with an empty body, as ADF has issues validating a body without any content.
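For a parameterless function, the activity definition ends up looking something like the JSON below – the activity, function, and linked service names are purely illustrative.

{
    "name": "Call Parameterless Function",
    "type": "AzureFunctionActivity",
    "linkedServiceName": {
        "referenceName": "AzureFunctionLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "functionName": "MyParameterlessFunction",
        "method": "GET"
    }
}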

To pass the parameters into a body, you’ll need to wrap them in JSON with the following format. I’ve provided examples of strings, integers, and expressions but hopefully you get the point here.

[image: example JSON body showing string, integer, and expression parameters]
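As a rough sketch of that format (the parameter names and values below are made up rather than taken from the screenshot), the body could look something like this, using ADF’s string interpolation for the expression:

{
    "fileName": "SomeFile.csv",
    "batchId": 100,
    "runDate": "@{formatDateTime(pipeline().TriggerTime, 'yyyy-MM-dd')}"
}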


Function Code

As part of this process, you’ll also need to modify the C# code within your function apps. Prior to this upgrade, you may have been handling your functions through the HttpResponseMessage object (see below for an example – I’ve highlighted in red the code snippets that relate to the subject matter). The web activity was happy to accept a response from this class in Data Factory!

Now that we are using the native function activity, Data Factory will no longer accept an HTTP message as a response, and instead wants a JSON object (JObject). As a result, our HttpResponseMessage needs to be replaced by an IActionResult object (see below for an example of the changes). The change is relatively trivial – you only need to decide on the type of result object to create instead of just creating a response.

To use this object, you’ll need to import the Microsoft.AspNetCore.Mvc libraries through NuGet. At this point, I ran into an issue since the dependencies required by Microsoft.AspNetCore.Mvc conflicted with my existing libraries – to the extent that separate NuGet packages wanted my Newtonsoft.Json library to be both equal to version 9.0.1 and greater than version 11.0.1. To solve this issue, I imported separate parts of the main library instead – in this instance Microsoft.AspNetCore.Mvc.Abstractions and Microsoft.AspNetCore.Mvc.Core. These two libraries did not have the same constraints as the full framework library.
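In the .csproj this ends up as something like the package references below – the versions shown are purely illustrative, so use whatever resolves cleanly against your other dependencies.

<ItemGroup>
  <!-- Only the parts of MVC needed for IActionResult, avoiding the Newtonsoft.Json version conflict -->
  <PackageReference Include="Microsoft.AspNetCore.Mvc.Abstractions" Version="2.2.0" />
  <PackageReference Include="Microsoft.AspNetCore.Mvc.Core" Version="2.2.0" />
</ItemGroup>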

[image: original function code using HttpResponseMessage]

[image: updated function code using IActionResult]
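As a rough sketch of the shape of the updated function – the function name, logging, and returned values here are illustrative rather than the exact code in the screenshots above:

using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json.Linq;

public static class ExampleFunction
{
    [FunctionName("ExampleFunction")]
    public static IActionResult Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", "post", Route = null)] HttpRequest req,
        ILogger log)
    {
        log.LogInformation("ExampleFunction triggered from Data Factory.");

        // ADF expects a JSON object back, so build a JObject and return it
        // via an IActionResult rather than an HttpResponseMessage.
        var result = new JObject
        {
            ["Status"] = "Succeeded"
        };

        return new OkObjectResult(result);
    }
}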


While the above code changes allowed me to send/receive a JObject, in most scenarios I also needed to POST parameters into the function. In this case, I also needed to modify the code to read them (see the highlighted code block below). The content is now treated as a stream, which then needs to be deserialised from JSON into an object. The parameters can then be read by treating the object as a list.

[image: code reading the request body stream and deserialising the JSON parameters]
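A sketch of what that read might look like inside a function shaped like the earlier example – this assumes the signature has been changed to async Task<IActionResult>, and the parameter names are illustrative:

// Requires: using System.IO; using System.Threading.Tasks;
// Read the request content as a stream, then deserialise the JSON into an object.
string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
JObject parameters = JObject.Parse(requestBody);

// Pick out the individual parameters sent from the Data Factory activity body.
string fileName = (string)parameters["fileName"];
int batchId = (int)parameters["batchId"];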


Errors

If you’ve found this blog as part of a Google search, you’ll more likely than not have hit the following generic error: { "errorCode": "3600", "message": "Error calling the endpoint.", "failureType": "UserError", "target": "Activate New Parent Pipeline" }. This is not a particularly helpful error, as it basically just tells you that ADF did not receive a correctly formatted response. More often than not this will be due to an issue in your function code rather than something in ADF – I spent a while thinking it was an authentication error when it was not! You may receive a slightly more verbose error if you check the function app logs through something such as Application Insights – again, I would strongly recommend you set this up as part of your Azure Functions service, as it’s not ideal to rely on ADF to debug this. Alternatively, I would also suggest setting up Postman to debug the issues locally, to have a bit more control over the variables at runtime through the Locals output.



Conclusion

Hopefully you will find this guide useful if you’re about to go through a similar process yourself. It might also be useful for anyone setting up functions in ADF for the first time, as it covers most of what is required for the two services to talk to one another.

Using Azure Automation to Archive SFTP Files Orchestrated through ADF

I recently ran into a problem which required me to work with an SFTP site as part of our ETL process in Azure. Using the traditional BI stack, FTP tasks would be natively supported, but this is no longer the case when working with Azure – it now requires a bit of plumbing. This blog will take a look at the ways in which you can interact with FTP/SFTP sites as part of your pipelines in ADF, and the issues I ran into. Specifically, the task I was trying to solve was to archive files (move them to a separate folder) that had passed into our staging layer, so they would not be picked up on the next process run. The blog will also cover passing parameters into a webhook, as I needed the solution to be flexible enough to handle different SFTP sources.

Like many similar tasks in Azure, you can go about this in a number of ways (Logic Apps, Functions, PowerShell runbooks, etc.), but it’s about finding the solution that fits your architecture and that you are comfortable developing and maintaining. I initially looked at using Logic Apps, since quite a bit of the work is done for you – the connectors are already set up, along with the framework to achieve some simple tasks – which sounded perfect for the job. Unfortunately, Logic Apps is not designed for any kind of heavy lifting. You can move files using the platform, but you are capped at a mere 50MB when trying to pick up and put down a file. Rumours were that Microsoft would extend this to 1GB at some point during 2018, but this has not been the case so far. There was also no task to simply change the remote path of a file as far as I could tell – it looks like people are up-voting it here.

Next I looked at using Azure Functions, since I would do something similar through SSIS if this were a traditional stack problem. This involves writing a bit of C# to talk to the SFTP site and perform the tasks for you. While this is very possible, it didn’t fit into our existing architecture, and I also felt it was easier to build and maintain something using PowerShell – a language which I’m enjoying coding in more and more.

Lastly I looked at using Azure Automation and doing the tasks through PowerShell. There are a number of modules you can use to achieve the goal; I used Posh-SSH. I also looked at WinSCP but found the Posh-SSH cmdlets more flexible.

The first thing I did was work out how to pass parameters from a webhook into the script. This was so that the script could be used across multiple SFTP folders. In ADF it was as simple as defining the headers/body to pass with the POST call to the webhook.

 

[image: ADF web activity configured with headers and a JSON body for the webhook call]
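The body itself is just a small JSON document sent with a Content-Type header of application/json – something along the lines of the sketch below, where FolderName matches the property the runbook reads later on (the value is illustrative):

{
    "FolderName": "Customer"
}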

 

Within the PS script I then defined the following region to receive the parameters. The parameters are passed in as a single object, which is then deconstructed into the various header/body elements. The body is deconstructed further from JSON into the variable I wanted. I believe it is also mandatory to name the object received by the script $WebhookData, as this is not configurable externally.

# Get parameter value
Param
([object]$WebhookData)

#region Verify Webhook
if ($WebhookData){

    # Collect properties of the webhook data object
    $WebhookName    = $WebhookData.WebhookName
    $WebhookHeaders = $WebhookData.RequestHeader
    $WebhookBody    = $WebhookData.RequestBody

    # Convert the JSON request body into an object so individual values can be read
    $Parameters = (ConvertFrom-Json -InputObject $WebhookBody)
    Write-Output "WebhookBody: $($Parameters)"
    Write-Output -InputObject ('Runbook started from webhook {0}.' -f $WebhookName)

    # Extract variables
    $folderName = $Parameters.FolderName
}
else
{
    Write-Error -Message 'Runbook was not started from Webhook' -ErrorAction Stop
}
#endregion

 

Once this is defined within the script, you can set up the webhook from the Automation end and paste the URL it generates back into ADF. It’s also at this point that the webhook configuration allows you to define a parameter as part of the webhook. Without the block of code above, this will not be an option – the parameter cannot be created first, as I originally assumed.

 

[image: Azure Automation webhook configuration with the parameter defined]
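As a side note, once Automation has generated the webhook URL, the runbook can be tested outside of ADF with a quick PowerShell call – a sketch along these lines, with the URL being a placeholder for the one Automation gives you:

# Post the same JSON body to the webhook that ADF would send
$webhookUrl = '<webhook URL generated by Azure Automation>'
$body = @{ FolderName = 'Customer' } | ConvertTo-Json

Invoke-RestMethod -Method Post -Uri $webhookUrl -Body $body -ContentType 'application/json'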

 

Once the parameter passing has been dealt with, the code block for connecting to the SFTP site and performing the task is straightforward.

# Get variable values
$userName = Get-AutomationVariable -Name 'SftpUserName'
$userPassword = Get-AutomationVariable -Name 'SftpPassword'
$hostName = Get-AutomationVariable -Name 'SftpHost'
$port = Get-AutomationVariable -Name 'SftpPort'

# Create PS credential from username/password
$userPassword = ConvertTo-SecureString -String $userPassword -AsPlainText -Force
$userCredential = New-Object -TypeName System.Management.Automation.PSCredential ($userName, $userPassword)

# Create new SFTP session
$session = New-SFTPSession -ComputerName $hostName -Credential $userCredential -Port $port -AcceptKey 

# Retrieve the child items on the remote path and rename any .zip/.txt file by appending .bk
$remotePath = "/Data/" + $folderName + "/"
Get-SFTPChildItem $session $remotePath |
    ForEach-Object {
        if ($_.FullName -like '*.zip' -Or $_.FullName -like '*.txt') {
            $NewName = $_.Name + ".bk"
            Rename-SFTPFile $session -Path $_.FullName -NewName $NewName
            Write-Output "$($_.FullName) has been renamed to $($NewName)"
        }
    }

 

The first task is to extract the variables from the Automation resource. These should not be hard-coded into the script, but stored as variables external to the runbook. A PSCredential is then constructed from the username and password in the format above.

Finally the task to archive the files is performed. This involves creating a new SFTP session and iterating over the child items within the remote path. For each file it finds, it will then perform the archiving process as long as the file has the correct extension.

While I originally wanted to move the files between folders, I realised this was not possible with the module I was using. I’m sure with a bit more research something would be available out there to achieve this, specifically changing the path of a remote file.

Therefore I was limited to changing the name of a remote file by adding .bk to the end of the filename. ADF would then only pick up files ending in .zip the next time round.

 

Conclusion

I’m hoping this will be useful for anyone in the same situation – I’ve used the webhook parameters multiple times now. While I didn’t achieve my original goal, a small workaround still satisfied my original requirement of keeping the files out of future process runs. There are also some other code snippets that might be a good reference for future implementations here too, specifically the PS credential creation. Hope it helps!