Zach Stagers

Naming downloaded files with Azure Data Lake Store File System Task

I was recently working on a hybrid project where we download files from an Azure Data Lake Store and transform the data with SSIS. I was stunned to find that there’s no native ability to name the file you download from the lake! Even more frustrating, the downloaded file was inconsistently named Data.<GUID>, so the SSIS File System Task couldn’t be used to rename it either, since the source file name isn’t known in advance. PowerShell to the rescue…

By using an Execute Process Task to call the following PowerShell script, we were able to overcome this challenge.

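param([string] $NewFileName, [string] $LocalFolder, [string] $FileNameFilter)

# Pick the newest file matching the filter that was written within the last 15 seconds.
$file = Get-ChildItem -Path $LocalFolder -Filter $FileNameFilter |
    Where-Object { $_.LastWriteTime -gt (Get-Date).AddSeconds(-15) } |
    Sort-Object LastWriteTime |
    Select-Object -Last 1

# -Force overwrites an existing file with the same name.
Move-Item -Path $file.FullName -Destination (Join-Path $LocalFolder $NewFileName) -Force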

The script accepts three parameters:

  • $NewFileName – The new name for the file, including the file extension.
  • $LocalFolder – The local folder in which the file resides.
  • $FileNameFilter – A mask used to search for the downloaded file. In this case, we used Data.* where * is a wildcard for the GUID.

Get-ChildItem is used to obtain the details of the latest file written to our $LocalFolder within the last 15 seconds. The 15-second window adds an element of safety, minimizing the risk of the script being run outside of the SSIS process and renaming files it shouldn’t.
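To see how the filter and the time window combine, here is a small sketch that can be run against a throwaway folder. The temp folder, the generated file names, and the explicit Sort-Object step are assumptions made for illustration.

```powershell
# Hypothetical sandbox: a temp folder holding one stale and one fresh "Data.<GUID>" file.
$sandbox = Join-Path ([IO.Path]::GetTempPath()) ("adls-demo-" + [Guid]::NewGuid())
New-Item -ItemType Directory -Path $sandbox | Out-Null

$stale = New-Item -ItemType File -Path (Join-Path $sandbox ("Data." + [Guid]::NewGuid()))
$fresh = New-Item -ItemType File -Path (Join-Path $sandbox ("Data." + [Guid]::NewGuid()))

# Pretend the stale file was downloaded ten minutes ago.
$stale.LastWriteTime = (Get-Date).AddMinutes(-10)

# Matching files written in the last 15 seconds, sorted by write time
# so that -Last 1 returns the newest of them.
$picked = Get-ChildItem -Path $sandbox -Filter 'Data.*' |
    Where-Object { $_.LastWriteTime -gt (Get-Date).AddSeconds(-15) } |
    Sort-Object LastWriteTime |
    Select-Object -Last 1

Write-Host "Picked:" $picked.Name

# Clean up the sandbox.
Remove-Item -Path $sandbox -Recurse -Force
```

Only the fresh file falls inside the window, so the stale one is never a candidate for renaming.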

Move-Item is used instead of Rename-Item, as in our case we wanted to overwrite the file if it already existed.
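The difference between the two cmdlets can be checked with a quick experiment. This is a sketch in a temporary folder with made-up file names (Data.download and Sales.csv), not part of the SSIS process itself.

```powershell
# Hypothetical sandbox with a downloaded file and an existing target of the same name.
$sandbox = Join-Path ([IO.Path]::GetTempPath()) ("adls-move-" + [Guid]::NewGuid())
New-Item -ItemType Directory -Path $sandbox | Out-Null
$source = Join-Path $sandbox 'Data.download'
$target = Join-Path $sandbox 'Sales.csv'
Set-Content -Path $source -Value 'new'
Set-Content -Path $target -Value 'old'

# Rename-Item does not replace an existing file, so this attempt errors out.
$renameFailed = $false
try {
    Rename-Item -Path $source -NewName 'Sales.csv' -ErrorAction Stop
} catch {
    $renameFailed = $true
}

# Move-Item with -Force replaces the existing target.
Move-Item -Path $source -Destination $target -Force
$finalContent = Get-Content -Path $target

Write-Host "Rename failed: $renameFailed; target now contains: $finalContent"

# Clean up the sandbox.
Remove-Item -Path $sandbox -Recurse -Force
```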

If you have multiple packages using this script, which are called in parallel by a master package, I would highly recommend adding a completion constraint between all of the Execute Package Tasks to ensure that no file is accidentally renamed by another package running at the same time. If removing parallelism isn’t an option for performance reasons, you could set up a different local folder per package.
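For reference, the command line that the Execute Process Task ends up running looks roughly like the sketch below. The script path, staging folder, and target file name are hypothetical placeholders; in the task itself, powershell.exe would go in the Executable property and the rest in Arguments. Giving each package its own -LocalFolder value is one way to apply the folder-per-package suggestion.

```powershell
# Hypothetical paths and names; substitute your own.
powershell.exe -ExecutionPolicy Bypass -File "C:\Scripts\Rename-DownloadedFile.ps1" `
    -NewFileName "Sales.csv" `
    -LocalFolder "C:\Staging\Sales" `
    -FileNameFilter "Data.*"
```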

Deploying Multiple U-SQL Procedures with PowerShell


If you’d like to deploy multiple U-SQL procedures without having to open each one in Visual Studio and submit the job to Data Lake Analytics manually, here’s a PowerShell script which you can point at a folder containing your .usql files. It loops through them and submits each one for you.

This method relies on the Login-AzureRmAccount command with a service principal.

The Script

$azureAccountName = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
$azurePassword = ConvertTo-SecureString "xxxxxxxxxxxxxxxxxxxx/xxxxxxxxxxxxxxx/xxxxxx=" -AsPlainText -Force
$psCred = New-Object System.Management.Automation.PSCredential($azureAccountName, $azurePassword)
$psTenantID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
$adla = "zachstagersadla"
$fileLocation = "C:\SourceControl\USQL.Project\*.usql"

Login-AzureRmAccount -ServicePrincipal -TenantID $psTenantID -Credential $psCred | Out-Null

# Submit every .usql file in the folder as a Data Lake Analytics job.
ForEach ($file in Get-ChildItem -Path $fileLocation)
{
    $scriptContents = [IO.File]::ReadAllText($file.FullName)

    Submit-AzureRmDataLakeAnalyticsJob `
        -AccountName $adla `
        -Name $file.Name `
        -Script $scriptContents | Out-Null

    Write-Host "`n" $file.Name "submitted."
}

Parameter Configuration

Where these various GUIDs can be found within the Azure portal is liable to change, but the paths below were correct at the time of writing.

$azureAccountName – This is the Application Id of your Enterprise Application, which can be found by navigating to Azure Active Directory > Enterprise Applications > All Applications > select your application > Properties > Application Id.

$azurePassword – This is the secret key of your Enterprise Application, which would have been generated during application registration. If you’ve created your application but have not generated a secret key, you can do so by navigating to: Enterprise Applications > New Application > Application You’re Developing > OK, take me to App Registration > change the drop down from ‘My Apps’ to ‘All Apps’ (may not be required) > select your application > Settings > Keys > fill in the details and click Save. Note the important message that the key can only be copied before you navigate away from the page!

$psTenantID – From within the Azure Portal, go to Azure Active Directory, open the Properties blade, and copy the ‘Directory ID’.

$adla – The name of the data lake analytics resource you’re deploying to.

$fileLocation – The location on your local machine which contains the U-SQL scripts you wish to deploy. Note the “\*.usql” on the end in the example; this is a wildcard search for all files ending in .usql.
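Before submitting anything, it can be worth listing what the wildcard will actually match. The sketch below does this against a throwaway temp folder with made-up file names; in practice you would point $fileLocation at your own project folder instead.

```powershell
# Hypothetical project folder containing two U-SQL scripts and one unrelated file.
$project = Join-Path ([IO.Path]::GetTempPath()) ("usql-demo-" + [Guid]::NewGuid())
New-Item -ItemType Directory -Path $project | Out-Null
'CreateSalesProc.usql', 'CreateStockProc.usql', 'README.txt' | ForEach-Object {
    New-Item -ItemType File -Path (Join-Path $project $_) | Out-Null
}

# Same wildcard pattern as the deployment script.
$fileLocation = Join-Path $project '*.usql'
$matched = Get-ChildItem -Path $fileLocation | Sort-Object Name

# Dry run: report what would be submitted without calling Azure.
$matched | ForEach-Object { Write-Host "Would submit:" $_.Name }

# Clean up the sandbox.
Remove-Item -Path $project -Recurse -Force
```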

Permissions

The service principal needs to be given the following permissions to successfully execute against the lake:

· Owner of the Data Lake Analytics resource you’re deploying to. This is configured via the Access Control blade.

· Owner of the Data Lake Store resource associated with the Analytics resource you’re deploying to. This is configured via the Access Control blade.

· Read, Write, and Execute permissions against the sub-folders within the Data Lake Store. Ensure you select ‘This folder and all children’ and ‘An access permission entry and a default permission entry’. This is configured by entering the Data Lake Store, selecting Data Explorer, then Access.