Zach Stagers

Introduction to Azure Notebooks

Azure Notebooks is Microsoft's Azure Platform as a Service (PaaS) offering of Jupyter Notebooks. In this post I’d like to introduce you to the Azure version and some of its benefits over the traditional version.

A brief overview of Jupyter Notebooks

By downloading and installing Jupyter, you’re able to create documents with a mixture of markdown text, code, and visualizations from within your web browser. This isn’t just any document though – it allows for code modification, execution, and live visualization generation.

It’s a fantastic tool for documentation, training, learning, and interactive report generation and exploration.

For more information on Jupyter, check out Nigel Meakins’ Introduction to Jupyter Notebooks post.

Benefits of Azure Notebooks

No installation, no maintenance

As with any PaaS solution, Azure Notebooks makes it far quicker and easier to get up and running, as there’s no download or installation required. Microsoft handles all the maintenance for you too!

Easier sharing

Just click your library, and hit Share to be presented with a popup of sharing options:

[Image: library sharing options pop-up]

At the moment you can share via a direct URL, social media (Twitter, Facebook, and Google+), embed code, or by emailing directly from the pop-up.

Other Features of Azure Notebooks

Slides

This is an excellent tool for presenting your work directly from within your notebook – meaning you can modify and execute code from your slideshow. This allows you to better adapt your presentation to your audience, helping you explain or answer questions with additional examples without having to swap to another application.

To set up your presentation, open your notebook, then from the View menu go to Cell Toolbar and select Slideshow.

This gives you the Slide Type option on each Cell of your notebook. Once you’ve configured each of your cells, you can select Enter/Exit RISE Slideshow to enter presentation mode and see how it looks.

[Image: the Slide Type cell toolbar and the Enter/Exit RISE Slideshow button]

Notebook Cloning

Companies and universities are turning their books and other content into Azure Notebooks, making them publicly available for anyone to clone to their own libraries to play around with and learn from.

Here are a few to get you started:

https://notebooks.azure.com/Microsoft/libraries

https://notebooks.azure.com/jakevdp/libraries/PythonDataScienceHandbook

It’s also smart to clone your own notebook if you plan on tinkering around and don’t quite know where you’ll end up, essentially backing up and preserving an original copy.

Public and private notebooks

Not ready to share your work? Keep your notebook private until you are.

If you’ve already shared your notebook, but later want to lock it down to make changes, go into the settings of your library and make it private again. It couldn’t be simpler.

Limitations

This product is still in preview, and I’ve no doubt it will grow in capability, but at the time of writing the following limitations are in place:

  • Jupyter supports over 40 languages; at the time of writing Azure Notebooks supports three: Python (2 and 3), R, and F#.
  • 4GB memory usage limit.
  • The service restrictions documentation mentions that Microsoft reserves the right to remove your data after 60 days of inactivity.
  • I’ve read elsewhere online that there’s a 1GB storage limit, but I haven’t been able to find this detailed in Microsoft’s documentation.

Get Started

To start producing your own notebook, head to https://notebooks.azure.com, sign in with a Microsoft account, and create a library.

Add a notebook file to your library by pressing New, selecting the type of notebook you’d like to create (Python/R/F#), giving it a name, and pressing New again to create it. The same New button can be used to add other files to your library, such as CSVs you can reference in your notebook, either from your own computer or from the web.

Click on your notebook file to start editing, and once in the editor, I’d recommend taking the self-paced user interface tour available from the Help menu if you haven’t worked with Jupyter before.

Deploying Multiple U-SQL Procedures with PowerShell


If you’d like to deploy multiple U-SQL procedures without having to open each one in Visual Studio and submit the job to Data Lake Analytics manually, here’s a PowerShell script which you can point at a folder containing your .usql files; it will loop through them and submit them for you.

This method relies on the Login-AzureRmAccount command with a service principal, which you can learn more about here.

The Script

$azureAccountName = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
$azurePassword = ConvertTo-SecureString "xxxxxxxxxxxxxxxxxxxx/xxxxxxxxxxxxxxx/xxxxxx=" -AsPlainText -Force
$psCred = New-Object System.Management.Automation.PSCredential($azureAccountName, $azurePassword)
$psTenantID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
$adla = "zachstagersadla"
$fileLocation = "C:\SourceControl\USQL.Project\*.usql"

Login-AzureRmAccount -ServicePrincipal -TenantID $psTenantID -Credential $psCred | Out-Null

ForEach ($file in Get-ChildItem -Path $fileLocation)
	{
		$scriptContents = [IO.File]::ReadAllText($file.FullName)
		Submit-AzureRmDataLakeAnalyticsJob `
			-AccountName $adla `
			-Name $file.Name `
			-Script $scriptContents | Out-Null;
		
		Write-Host "`n" $file.Name "submitted."
	} 

Parameter Configuration

Where these various GUIDs can be found within the Azure portal is liable to change, but the paths below were correct at the time of writing.

$azureAccountName – This is the Application Id of your Enterprise Application, which can be found by navigating to Azure Active Directory > Enterprise Applications > All Applications > Selecting your application > Properties > Application Id.

$azurePassword – This is the secret key of your Enterprise Application, which would have been generated during application registration. If you’ve created your application but have not generated a secret key, you can do so by navigating to: Enterprise Applications > New Application > Application You’re Developing > OK, take me to App Registration > change the drop-down from ‘My Apps’ to ‘All Apps’ (may not be required) > select your application > Settings > Keys > fill in the details and click Save. Note the important message about the key only being available to copy before you navigate away from the page!

$psTenantID – From within the Azure Portal, go to Azure Active Directory, open the Properties blade, and copy the ‘Directory ID’.

$adla – The name of the data lake analytics resource you’re deploying to.

$fileLocation – The file location on your local machine which contains the U-SQL scripts you wish to deploy. Note the “\*.usql” on the end of the example; this is a wildcard search for all files ending in .usql.
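If you prefer, once you’ve signed in interactively you can also pull the Application Id and Tenant Id out with PowerShell rather than clicking through the portal. A quick sketch, where the application display name is a placeholder:

Login-AzureRmAccount | Out-Null

# Application Id of the app registration (swap in your own display name)
(Get-AzureRmADApplication -DisplayNameStartWith "MyDeploymentApp").ApplicationId

# Tenant (Directory) Id - one value per subscription you have access to
(Get-AzureRmSubscription).TenantId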

Permissions

The service principal needs to be given the following permissions to successfully execute against the lake (a scripted example of granting these follows the list):

  • Owner of the Data Lake Analytics resource you’re deploying to. This is configured via the Access Control blade.
  • Owner of the Data Lake Store resource associated with the Analytics resource you’re deploying to. This is configured via the Access Control blade.
  • Read, Write, and Execute permissions on the sub-folders within the Data Lake Store. Ensure you select ‘This folder and all children’ and ‘An access permission entry and a default permission entry’. This is configured by entering the Data Lake Store, selecting Data Explorer, then Access.
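If you’d rather not click through the portal, something like the following should grant the above with the AzureRM modules. This is a rough sketch only – the object ID, resource group, account names, and folder path are placeholders, and the role assignments can equally be done through the Access Control blade as described above:

# Owner on the Data Lake Analytics and Data Lake Store resources (placeholder names)
$spObjectId    = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"   # Object Id of the service principal
$resourceGroup = "MyResourceGroup"
$adlaName      = "zachstagersadla"
$adlsName      = "zachstagersadls"

New-AzureRmRoleAssignment -ObjectId $spObjectId -RoleDefinitionName "Owner" `
	-ResourceGroupName $resourceGroup -ResourceName $adlaName `
	-ResourceType "Microsoft.DataLakeAnalytics/accounts"

New-AzureRmRoleAssignment -ObjectId $spObjectId -RoleDefinitionName "Owner" `
	-ResourceGroupName $resourceGroup -ResourceName $adlsName `
	-ResourceType "Microsoft.DataLakeStore/accounts"

# Read, write, and execute on a folder, as both an access entry and a default entry.
# Repeat (or loop) for each existing sub-folder - the default entry only covers newly created children.
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $adlsName -Path "/MyFolder" `
	-AceType User -Id $spObjectId -Permissions All

Set-AzureRmDataLakeStoreItemAclEntry -AccountName $adlsName -Path "/MyFolder" `
	-AceType User -Id $spObjectId -Permissions All -Default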

Bulk updating SSIS package protection level with DTUTIL

I was recently working with an SSIS project with well over 100 packages. It had been developed with the ‘Encrypt Sensitive with User Key’ protection level at both package and project level, using the project deployment model, and utilizing components from the Azure Feature Pack.

The project was to be deployed to a remote server in another domain, by a different user to the one who developed the packages, which caused an error with the ‘Encrypt Sensitive with User Key’ setting upon deployment.

As a side note, regarding package protection level and the Azure Feature Pack, ‘Don’t Save Sensitive’ cannot be used as you’ll receive an error stating: “Error when copying file from Azure Data Lake Store to local drive using SSIS. [Azure Data Lake Store File System Task] Error: There is an internal error to refresh token. Value cannot be null. Parameter name: input”. There seems to be an issue with the components not accepting values from an environment variable if ‘Don’t Save Sensitive’ is used.

With the project deployment model, the project and all packages within it need to have the same protection level, but there’s no easy way from within Visual Studio to apply the update to all packages, and I didn’t want to have to open 100 packages individually to change the setting manually! SQL and DTUTIL to the rescue…

Step One

Check the entire project and all packages out of source control, making sure you have the latest version. Close Visual Studio.

Step Two

Change the protection level at the project level by right-clicking the project and selecting Properties. A dialogue box should open; on the Project tab you should see a Security section, under which you can change the protection level.

Step Three

Using SQL against the SSISDB (assuming the project is deployed locally), you can easily produce a DTUTIL command string to change the encryption setting for each of the packages you need to update:

USE SSISDB

DECLARE @FolderPath VARCHAR(1000) = 'C:\SourceControl\FolderX'
DECLARE @DtutilString VARCHAR(1000) = 
	'dtutil.exe /file "'+ @FolderPath +'\XXX" /encrypt file;"'+ @FolderPath +'\XXX";2;Password1! /q'

SELECT DISTINCT
       REPLACE(@DtutilString, 'XXX', [name])
FROM internal.packages
WHERE project_id = 2

This query will produce a string like the one below for each package in your project:

dtutil.exe /file "C:\SourceControl\FolderX\Package1.dtsx" /encrypt file;"C:\SourceControl\FolderX\Package1.dtsx";2;Password1! /q

This executes DTUTIL for the file specified, encrypting the package using ‘Encrypt with password’ (2) and the password Password1! in this case. The /q tells the command to execute “quietly” – meaning you won’t be prompted with an “Are you sure?” message each time. More information about DTUTIL can be found here.

If the project isn’t deployed, the same can be achieved through PowerShell against the folder the packages live in.
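Something along these lines should do the trick – a minimal sketch that builds the same DTUTIL commands from the .dtsx files on disk and writes them out as a batch file for Step Four (the folder and output paths are placeholders):

$folderPath = "C:\SourceControl\FolderX"
$batchFile  = "C:\SourceControl\EncryptPackages.bat"

# One dtutil command per package, matching the string produced by the SQL above
Get-ChildItem -Path "$folderPath\*.dtsx" |
	ForEach-Object { 'dtutil.exe /file "' + $_.FullName + '" /encrypt file;"' + $_.FullName + '";2;Password1! /q' } |
	Set-Content -Path $batchFile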

Step Four

Copy the list of commands generated in Step Three, open Notepad, and paste them in. Save the file as a batch file (.bat).

Run the batch file as an administrator (right-click the file and select Run as administrator).

A command prompt window should open, and you should see it stream through your packages with a success message for each.

Step Five

The encryption setting for each package is also stored in the project file, and needs to be changed there too.

Find the project file for the project you’re working with (.dtproj), right-click it and select Open with, then Notepad or your preferred text editor.

Within the SSIS:PackageMetaData node for each package, there’s a nested <SSIS:Property SSIS:Name="ProtectionLevel"> element containing the integer value for its corresponding protection level.

Run a find and replace for this full string, swapping only the integer value to the desired protection level. In this example we’re going from ‘Don’t Save Sensitive’ (0) to ‘Encrypt with password’ (2):

[Image: find and replace of the ProtectionLevel value in the .dtproj file]

Save and close the project file.
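If you’d prefer to script this step as well, the same find and replace can be done with a few lines of PowerShell. A sketch only – the project file path is a placeholder, and the property string may differ slightly depending on your version of Visual Studio:

$projectFile = "C:\SourceControl\MyProject.dtproj"
$oldValue = '<SSIS:Property SSIS:Name="ProtectionLevel">0</SSIS:Property>'
$newValue = '<SSIS:Property SSIS:Name="ProtectionLevel">2</SSIS:Property>'

# Swap every protection level value from 0 to 2 and save the file back
$content = (Get-Content -Path $projectFile -Raw) -replace [regex]::Escape($oldValue), $newValue
Set-Content -Path $projectFile -Value $content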

Step Six

Re-open Visual Studio and your project, and spot check a few packages to ensure the correct protection level is now selected. Build your project, and check it back in to source control once you’re satisfied.

SSIS Azure Data Lake Store Destination Mapping Problem

The Problem

My current project involves Azure’s Data Lake Store and Analytics. We’re using the SSIS Azure Feature Pack’s Azure Data Lake Store Destination to move data from our client’s on-premises system into the Lake, then using U-SQL to generate a delta file which goes on to be loaded into the warehouse. U-SQL is a “schema-on-read” language, which means you need a consistent and predictable file format to be able to define the schema as you pull data out.

We ran into an issue with this schema-on-read approach, but once you understand it, it’s simple to rectify. The Data Lake Store Destination task does not use the column ordering shown in the destination mapping. Instead, it appears to rely on an underlying column identifier. This means that if you apply any conversion to a column in the data flow, that column will automatically be placed at the end of the file – taking away the predictability of the file format, and potentially making your schema inconsistent if you have historic data in the Lake.

An Example

Create a simple package which pulls data from a flat file and moves it into the Lake.

[Image: data flow with a Flat File Source and an Azure Data Lake Store Destination]


Mappings of the Destination are as follows:

[Image: Azure Data Lake Store Destination column mappings]


Running the package, and viewing the file in the Lake gives us the following (as we’d expect, based on the mappings):

[Image: file contents in the Lake, with columns in the mapped order]


Now add a conversion task – the one in my package just converts Col2 to a DT_I4 – update the mappings in the destination, and run the package.

[Image: data flow with a Data Conversion task added]


Open the file up in the Lake again, and you’ll find that Col2 is now at the end and contains the name of the input column, not the destination column:

[Image: file contents in the Lake, with Col2 moved to the end under the input column name]

The Fix

As mentioned in “The Problem” section above, the fix is extremely simple – just handle it in your U-SQL by re-ordering the columns appropriately during extraction! This article is more about giving a heads up and highlighting the problem than about a mind-blowing solution.