
Ust Oldfield's Blog

Azure Active Directory Authentication and Azure Data Catalog

In a previous post I introduced Azure Data Catalog. Because it’s great for data discovery and data asset management, it makes sense to automate, as much as possible, the process of registering new data assets, and to allow users to discover data in a more natural, perhaps conversational, way. In order to automate the registration of data assets, or to allow discovery through other tools, it’s necessary to look at how Azure Data Catalog authenticates users with Azure Active Directory (AAD). This post explores some of the options Azure Data Catalog uses for authentication and walks through a code example that makes authentication work without user input.

Azure Active Directory Authentication

If you have interacted with Azure Data Catalog before, you will know that there are two ways of doing so. First, there’s the web application that allows you to conduct data discovery and data asset management. Then there’s the native application that sits on your local machine and can be used for registering data assets. These use different methods of authenticating with Azure Active Directory: the web application uses Web Browser to Web Application authentication, while the native application uses Native Application to Web API authentication.

Web Browser to Web Application

What is involved with Web Browser to Web Application authentication? Simply put, the web application directs the user’s browser to sign in to AAD. AAD then returns a token which authenticates the user to use the web application. In practice, it’s a bit more complex, so here’s a diagram to help explain it.

image

In a bit more detail, the process it follows is:

1) A user visits the application and needs to sign in. They are redirected via a sign-in request to the authentication endpoint in AAD.

2) The user signs in on the sign-in page.

3) If authentication is successful, AAD creates an authentication token and returns a sign-in response to the application’s Reply URL that was configured in the Azure Portal. The returned token includes claims about the user and AAD that are required by the application to validate the token.

4) The application validates the token by using a public signing key and issuer information available at the federation metadata document for Azure AD. After the application validates the token, Azure AD starts a new session with the user. This session allows the user to access the application until it expires.

This method of authentication is used by Azure Data Catalog when discovering data through the browser.
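For orientation, this is the flow that the standard OWIN OpenID Connect middleware implements for an ASP.NET web application. The sketch below is a minimal, illustrative Startup configuration and is not taken from Azure Data Catalog itself; the client ID, tenant and redirect URI are placeholders.

using Microsoft.Owin.Security.Cookies;
using Microsoft.Owin.Security.OpenIdConnect;
using Owin;

public class Startup
{
    public void Configuration(IAppBuilder app)
    {
        // Sign the user in with a cookie once AAD has returned a validated token
        app.SetDefaultSignInAsAuthenticationType(CookieAuthenticationDefaults.AuthenticationType);
        app.UseCookieAuthentication(new CookieAuthenticationOptions());

        app.UseOpenIdConnectAuthentication(new OpenIdConnectAuthenticationOptions
        {
            ClientId = "ApplicationId",                        // from the app registration (placeholder)
            Authority = "https://login.windows.net/TenantId",  // the AAD tenant (placeholder)
            RedirectUri = "https://localhost:44300/"           // must match the Reply URL configured in the portal
        });
    }
}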

Native Application to Web API

What’s the process of Native Application to Web API authentication? Simply put, the application will ask you to sign-in to AAD, so that it can acquire a token in order to access resources from the Web API. In practice, it’s a bit more complex, so here’s a diagram to help explain it.

image

In a bit more detail, the process it follows is:

1) The native application makes a request to the authorisation endpoint in AAD, using a browser pop-up. This request includes the Application ID and redirect URI of the native application (see Microsoft’s article on registering native applications in Azure) and the Application ID URI of the Web API. The user is then prompted to sign in.

2) AAD authenticates the user. AAD then issues an authorisation code response back to the application’s redirect URI.

3) The Application then stops the browser activity and extracts the authorisation code from the response. Using the authorisation code, the Application then requests an access token from AAD, supplying details about the native application and the desired resource (the Web API).

4) The authorisation code and details are checked by AAD, which then returns an access token and a refresh token.

5) The Application then adds the access token to the authorisation header in its request to the Web API, which returns the requested resource, based on successful authentication.

6) When the access token expires, the refresh token is used to acquire a new access token without requiring the user to sign-in again.

This method of authentication is used by Azure Data Catalog when registering data assets via the desktop application.
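For context, this flow is typically handled by ADAL (the Azure Active Directory Authentication Library). The sketch below shows a minimal interactive token acquisition; the client ID, redirect URI and resource URI are placeholders for your own native app registration, not the values used by the Data Catalog desktop tool itself.

using System;
using System.Threading.Tasks;
using Microsoft.IdentityModel.Clients.ActiveDirectory; // ADAL

public static class NativeClientAuth
{
    public static async Task<AuthenticationResult> SignInAsync()
    {
        // Placeholder values from a native app registration in AAD
        string resourceUri = "https://api.azuredatacatalog.com";  // the Web API being called (assumed)
        string clientId = "NativeApplicationId";
        Uri redirectUri = new Uri("https://login.live.com/oauth20_desktop.srf");

        var authContext = new AuthenticationContext("https://login.windows.net/common");

        // Steps 1-4: pops up the AAD sign-in dialog, then exchanges the authorisation code for tokens.
        // ADAL caches the refresh token, so step 6 is handled for you on subsequent calls.
        return await authContext.AcquireTokenAsync(
            resourceUri, clientId, redirectUri, new PlatformParameters(PromptBehavior.Auto));
    }
}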

Automated Interaction with Azure Data Catalog

In both of the examples above, the user has to interact in order to provide sign-in credentials. This is not ideal if we want to automate the registration of data assets or conduct data discovery outside of the browser, so we’ll need a different method of authentication: Server Application to Web API. Simply put, the server application authenticates using its own credentials (its Application ID and key) rather than prompting a user to sign in, and it uses those credentials to request an access token from AAD.

image

In a bit more detail, the process it follows is:

1) The Server Application makes a request to AAD’s Token Endpoint, bypassing the Authentication Endpoint, providing the credential, Application ID and Application URI.

2) AAD authenticates the application and returns an access token that can be used to call the Web API.

3) The Application adds the access token to the authorisation header in its request to the Web API, which returns the requested resource, based on successful authentication.

This method is what we’re going to use to automate our interaction with Azure Data Catalog.

From an authentication aspect, the code for Server Application to Web API is simple, and this example will take us to the point of returning a token, which we can then use to request resources from the Azure Data Catalog API. The full code can be found in my GitHub repo.

We are going to use the Client Id and Secret from an application we’ve registered in AAD (full process can be found in this Microsoft article on Integrating Applications with AAD).

private static string clientId = "ApplicationId";

private static string secret = "ApplicationKey";

Then, we’re going to make sure we’re connecting to the correct AAD instance

private static string authorityUri = string.Format("https://login.windows.net/{0}", tenantId);

So we can create an authorisation context

AuthenticationContext authContext = new AuthenticationContext(authorityUri);

In order to acquire a token

authResult = await authContext.AcquireTokenAsync(resourceUri, new ClientCredential(clientId, secret));

This token can then be used in an authorisation header in requests to the Azure Data Catalog API. In the next related post, we’ll explore how to make a call to the API using this authentication method.
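In the meantime, putting those fragments together, a minimal end-to-end sketch looks something like the code below. The tenant ID and the Data Catalog resource URI are my own placeholders and assumptions (check the resource URI against the current documentation); the rest follows the snippets above.

using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Microsoft.IdentityModel.Clients.ActiveDirectory; // ADAL

public static class DataCatalogAuth
{
    private static string clientId = "ApplicationId";
    private static string secret = "ApplicationKey";
    private static string tenantId = "TenantId";
    private static string authorityUri = string.Format("https://login.windows.net/{0}", tenantId);
    private static string resourceUri = "https://api.azuredatacatalog.com"; // assumed Data Catalog resource URI

    public static async Task<HttpClient> GetAuthenticatedClientAsync()
    {
        // Acquire a token for the application itself (no user interaction required)
        AuthenticationContext authContext = new AuthenticationContext(authorityUri);
        AuthenticationResult authResult = await authContext.AcquireTokenAsync(
            resourceUri, new ClientCredential(clientId, secret));

        // Attach the token to the authorisation header for calls to the Data Catalog API
        var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", authResult.AccessToken);
        return client;
    }
}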

Tabular Automation and NuGet

In a recent blog post, I wrote about processing an Azure Analysis Services tabular model using Azure Functions. In it, there’s a lengthy process of downloading some DLLs and uploading them to the Azure Function. Handily, the Analysis Services team at Microsoft have released the Analysis Services NuGet package, which means that the necessary DLLs can be automatically installed to an Azure Function without much hassle. This blog is going to go through the steps of adding the NuGet package to your Azure Function.

Add a new file to your function called project.json

image

Input the following code in the newly created file

{
  "frameworks": {
    "net46": {
      "dependencies": {
        "Microsoft.AnalysisServices.retail.amd64": "15.0.2"
      }
    }
  }
}

Then save the Azure Function to proceed with the NuGet restore and compile your function. You should see the following logs in your log window.

image

That is the entire process. Much easier than documented previously!
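Once the package has been restored, the assemblies it brings in should be usable directly from your run.csx; as far as I’m aware, no #r directives pointing at manually uploaded DLLs are needed, you simply add the using statement, for example:

using Microsoft.AnalysisServices.Tabular;   // resolved from the restored NuGet package rather than an uploaded DLL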

Introduction to Azure Data Catalog

With the rise of self-service business intelligence tools, like Power BI, and an increased engagement with data in the workplace, people’s expectations of where they can find expert information about data have changed. Where previously there would be an expert that people would have to book time with in order to understand the data, now people expect to get quick and detailed information about the data assets an enterprise holds and maintains, without going through a single contact. With Azure Data Catalog, data consumers can quickly discover data assets and gain knowledge about the data from documentation, tags and glossary terms provided by the subject matter experts. This post aims to give a brief introduction to Azure Data Catalog and what it can broadly be used for.

What is Azure Data Catalog?

Azure Data Catalog is a fully managed Azure service which acts as an enterprise-wide metadata catalogue that enables data discovery. With Azure Data Catalog, you register, discover and annotate data assets and, for some sources, connect to them. Azure Data Catalog is designed to manage disparate information about data; to make it easy to find data assets, understand them, and connect to them. Any user (analyst, data scientist, or developer) can discover, understand, and consume data sources. Azure Data Catalog is a one-stop shop for all users to contribute their knowledge and build a community and culture of data.

What can Azure Data Catalog be used for?

As mentioned above, Azure Data Catalog can be used for data asset management, data governance and data discovery. For data asset management, this means knowing what data is available and where; for data governance teams, this means answering questions like: where is my customer data? or what does this data model look like?; for data discovery, this means knowing which data is suitable for particular reports and who you can go to if you have any questions. Microsoft has put together some common scenarios for using Azure Data Catalog, and they’re well worth reading to get a fuller understanding of what Azure Data Catalog can be used for.




Process an Azure Analysis Services Tabular Model from an Azure Function

A couple of weeks ago I wrote a blog post detailing how to process a tabular model from a stored procedure. The challenge there was to have a way of processing a tabular model on demand from a web application. The challenge still exists, but how do you achieve the same in Azure using Platform as a Service (PaaS) offerings, which do not have the benefit of a full server and components, such as msdb and SQL Agent, to aid in the process?

In this post, I’m going to show you how to process a tabular model using only Azure PaaS offerings. Not only am I going to show you how to process a tabular model on demand, but also how to process a tabular model on a schedule. This post has taken inspiration, and part of its code base, from a Microsoft blog: Automating Azure Analysis Services processing with Azure Functions.

Azure Functions

Before we begin properly, it’s worth spending some time introducing Azure Functions. According to Microsoft, Azure Functions are:

…a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP. Pay only for the time your code runs and trust Azure to scale as needed. Azure Functions lets you develop serverless applications on Microsoft Azure.

They are super useful for extending the capabilities of any solution and not just limited to what we’re going to cover here.

On-Demand Refresh

Azure Functions can provide the trigger for the on-demand refresh of a tabular model from a web application or webhook; this is to make sure that selected elements of data in a tabular model, for example hot partitions, are always up to date. The following describes the process that Azure Functions are involved in for this scenario:

image

The steps needed to create an Azure Function for on-demand refresh are as follows:

1) Create an Azure Function App

Navigate to the Azure Portal and create a Function App (the name changes quite a bit, so don’t be concerned if it’s not exactly displayed as it is below)

image

image

2) Create a new Function

After you’ve created the Function App, we need to add a new Webhook + API function, which we’ll use as the basis for our on-demand refresh. Click on the + button next to Functions, select Webhook + API, choose C# as your language and click Create this function.

image

3) Configure the Function

Download the latest client libraries for Analysis Services. This needs to be done on your local machine so you can then copy the files to your Azure Function App. After you’ve downloaded the client libraries, the DLLs can be found in C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies. The two files you need are:

C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies\Microsoft.AnalysisServices.Core.DLL
C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies\Microsoft.AnalysisServices.Tabular.DLL

The documentation references the 130 assemblies, which is not correct and will not work. You need the assemblies in 140.

In order to add these assemblies to your function, you’ll need to add a folder called “bin”. To do this, select your Function App, click Platform features, and under Development Tools, click Advanced tools (Kudu).

image

In Kudu, click Debug console and select Cmd. Navigate to the site\wwwroot\OnDemandHttpTrigger folder and add the “bin” folder here by clicking the + button.

image

Once you’ve added the “bin” folder, go back over to the Azure portal and select your function OnDemandHttpTrigger.

On the right under View files, navigate to the bin folder. Then click the Upload button to add the two previously mentioned DLLs to the bin folder.

image

You should see the two DLLs in your bin folder now.

image

4) Add Azure Analysis Services Connection String to the Function App

This step is going to add the connection string to the Azure Analysis Services (AAS) service to the entire Function App, not just individual functions.

Click the name of your Function App, then select Platform features. Select Application settings under General Settings.

image

Now we need to add our connection string under the Connection strings section. You’ll need your AAS server name and a user ID and password that has access to the AAS database.

You can find your AAS server name by navigating to your AAS database in the Azure portal and copying the value constructed after clicking Show server connection strings:

image

Your connection string should look like this:

Provider=MSOLAP;Data Source=<your aas server>; Initial Catalog=<aas database name>;User ID=<your username>;Password=<your password>

Back in the screen for the Function App, fill in the Name textbox with a name for your connection string and paste your connection string in the Value text box:

image

Click Save at the top to save these settings for your Functions.
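One thing to note: the Name you give the connection string here is the key the function code uses to look it up at runtime. The code later in this post assumes it is called AASTabular, i.e.:

var connStr = ConfigurationManager.ConnectionStrings["AASTabular"].ConnectionString;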

5) Time for Code

Our Function App has been configured; now we need to add code to the function. The function comes with working sample code to test out the functionality, but we don’t need everything that is on offer.

image

We’re going to programmatically process the tabular model. In doing so, we’ll leverage Analysis Services Management Objects (AMO). If you’re new to AMO, start here.

Paste in the following code (all code referenced can also be downloaded from my GitHub repo):

#r "Microsoft.AnalysisServices.Tabular.DLL"

#r "Microsoft.AnalysisServices.Core.DLL"

#r "System.Configuration"

using System;

using System.Configuration;

using Microsoft.AnalysisServices.Tabular;

using System.Net;

public static async Task<HttpResponseMessage> Run(HttpRequestMessage req, TraceWriter log)

{

log.Info("C# HTTP trigger function processed a request.");

// parse query parameter

string status = req.GetQueryNameValuePairs()

.FirstOrDefault(q => string.Compare(q.Key, "status", true) == 0)

.Value;

if (status == null)

{

// Get request body

dynamic data = await req.Content.ReadAsAsync<object>();

status = data?.status;

}

if (status == "execute")

{log.Info($"C# trigger function started at: {DateTime.Now}"); 

try

            {

Microsoft.AnalysisServices.Tabular.Server asSrv = new Microsoft.AnalysisServices.Tabular.Server();

var connStr = ConfigurationManager.ConnectionStrings["AASTabular"].ConnectionString;

asSrv.Connect(connStr);

Database db = asSrv.Databases["azureadventureworks"];

Model m = db.Model;

//db.Model.RequestRefresh(RefreshType.Full);     // Mark the model for refresh

m.RequestRefresh(RefreshType.Full);     // Mark the model for refresh

//m.Tables["Date"].RequestRefresh(RefreshType.Full);     // Mark only one table for refresh

db.Model.SaveChanges();     //commit  which will execute the refresh

asSrv.Disconnect();

            }

catch (Exception e)

            {

log.Info($"C# trigger function exception: {e.ToString()}");

            }

log.Info($"C# trigger function finished at: {DateTime.Now}");

}

return status == "execute"

?req.CreateResponse(HttpStatusCode.OK, "Successfully Processed Tabular Model ")

:req.CreateResponse(HttpStatusCode.BadRequest, "Please pass a status on the query string or in the request body");

}

Click the Save button at the top.

6) Test, Test, Test

Click the Run button at the top to test the function.

image

The function can also be tested in a web browser, and can be called by a web app using the POST HTTP method.

image
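To illustrate the webhook side, here’s a rough sketch of how a web app might call the function. The function URL and key are placeholders (copy the real one from Get function URL in the portal), and the JSON body matches the status parameter the function checks for.

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class RefreshTrigger
{
    static async Task Main()
    {
        // Placeholder URL: copy the real one (including the function key) from "Get function URL" in the portal
        var functionUrl = "https://<yourfunctionapp>.azurewebsites.net/api/OnDemandHttpTrigger?code=<functionkey>";

        using (var client = new HttpClient())
        {
            // The function reads "status" from the query string or the request body
            var body = new StringContent("{\"status\":\"execute\"}", Encoding.UTF8, "application/json");
            HttpResponseMessage response = await client.PostAsync(functionUrl, body);
            Console.WriteLine($"{(int)response.StatusCode}: {await response.Content.ReadAsStringAsync()}");
        }
    }
}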

Now we have a fully functioning method of refreshing an Azure Analysis Services tabular model on-demand.

Scheduled Refresh

Azure Functions can also provide the trigger for the scheduled refresh of a tabular model; this is to make sure that the entire tabular model has the latest data and is always up to date. The following describes the process that Azure Functions are involved in for this scenario:

image

The steps needed to create an Azure Function for scheduled refresh are as follows:

1) Create a Function

We’ve created our Function App, and now we need to add a new Timer Trigger function, which we’ll use as the basis for our scheduled refresh. Click on the + button next to Functions, select Timer, choose C# as your language and click Create this function.

2) Configure the Timer

What use is a timer without a schedule? To give the timer a schedule, click Integrate, set the schedule and click Save.

image

The Schedule text box expects a CRON expression to define the days and times that the function should execute. Click the little Documentation button on the screen above to read about CRON expressions. The schedule I’ve set is to run once every day at 09:30.
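For reference, the timer trigger uses a six-field CRON expression with a seconds field (second, minute, hour, day, month, day-of-week), so a daily 09:30 schedule would look like this:

0 30 9 * * *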

3) Configure the Function

See step 3 of the On-Demand Function for detailed steps. You’ll need to create the bin folder and upload the DLLs to the bin folder.

4) Time for Code

We’ve configured our function, so now it’s time to add the code. The code base is much simpler than the On-Demand code base, mainly because it’s doing fewer tasks. But the AMO section is exactly the same. Paste in the following code:

#r "Microsoft.AnalysisServices.Tabular.DLL"

#r "Microsoft.AnalysisServices.Core.DLL"

#r "System.Configuration"

using System;

using System.Configuration;

using Microsoft.AnalysisServices.Tabular;

public static void Run(TimerInfo myTimer, TraceWriter log)

{

log.Info($"C# Timer trigger function started at: {DateTime.Now}"); 

try

            {

Microsoft.AnalysisServices.Tabular.Server asSrv = new Microsoft.AnalysisServices.Tabular.Server();

var connStr = ConfigurationManager.ConnectionStrings["AASTabular"].ConnectionString;

asSrv.Connect(connStr);

Database db = asSrv.Databases["azureadventureworks"];

Model m = db.Model;

//db.Model.RequestRefresh(RefreshType.Full);     // Mark the model for refresh

m.RequestRefresh(RefreshType.Full);     // Mark the model for refresh

//m.Tables["Date"].RequestRefresh(RefreshType.Full);     // Mark only one table for refresh

db.Model.SaveChanges();     //commit  which will execute the refresh

asSrv.Disconnect();

            }

catch (Exception e)

            {

log.Info($"C# Timer trigger function exception: {e.ToString()}");

            }

log.Info($"C# Timer trigger function finished at: {DateTime.Now}");

}

Click the Save button at the top.

5) Test, Test, Test

Click the Run button at the top to test the function.

image


Now we have a fully functioning method of refreshing an Azure Analysis Services tabular model on a schedule.

Conclusion

I have shown you how simple it is to invoke two methods of refreshing a tabular model using Azure Functions: an On-Demand refresh and a refresh by Schedule. I hope that you take inspiration and use these methods in your use of both Azure Analysis Services and Azure Functions.

Archiving the Data Lake

In the blog introducing the Data Lake Framework, keen readers will be aware that the diagram includes a box titled “ARCHIVE”, but it has not been brought up since. The reason the Archive layer in the data lake has not been discussed is that we’ve been waiting for the Archive tier in Blob Storage.

To remind readers of the framework and where the archive layer sits in it, here it is again with the archive layer highlighted.

image

The Archive Blob

The Archive access tier in blob storage was made generally available today (13th December 2017) and with it comes the final piece in the puzzle to archiving data from the data lake.

Where the Hot and Cool access tiers can be applied at the storage account level, the Archive access tier can only be applied to a blob storage container. To understand why, you need to understand the features of the Archive access tier. It is intended for data that has no or low SLAs for availability within an organisation, and the data is stored offline (the Hot and Cool access tiers are online); therefore, it can take up to 15 hours for the data to be brought online and made available. Bringing Archive data online is a process called rehydration (fitting for the data lake). If you have lots of blob containers in a storage account, you can archive them and rehydrate them as required, rather than having to rehydrate the entire storage account.

Archive Pattern

An intended use for the Archive access tier is to store raw data that must be preserved, even after it has been fully processed, and does not need to be accessed within 180 days.

Data gets loaded into the RAW area of the data lake, is fully processed through to CURATED, and a copy of the raw data is archived off to a blob container with a Cool access tier applied to it. When the archive cycle comes about, a new Cool access tiered blob container is created and the now old container has its access tier changed to Archive.

For example, say our archive cycle is monthly and we have a Cool access tiered blob container in our storage account called “December 2017”. When data has finished being processed in Azure Data Lake, the raw data is archived to this blob container. When January comes around, we create a new blob container called “January 2018” with the Cool access tier and change the access tier of “December 2017” from Cool to Archive.
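As a rough illustration of that monthly cycle, the sketch below uses the Azure Storage .NET SDK (WindowsAzure.Storage 8.5 or later, which exposes blob-level tiering); the connection string and container names are placeholders, and in the SDK the Archive tier is applied to the blobs within the old container.

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class ArchiveCycle
{
    static void Main()
    {
        // Placeholder connection string for the archive storage account
        var account = CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>");
        var client = account.CreateCloudBlobClient();

        // Create the new month's container for incoming archived raw data
        client.GetContainerReference("january2018").CreateIfNotExists();

        // Move last month's raw data to the Archive access tier
        var lastMonth = client.GetContainerReference("december2017");
        foreach (var item in lastMonth.ListBlobs(useFlatBlobListing: true))
        {
            if (item is CloudBlockBlob blob)
            {
                blob.SetStandardBlobTier(StandardBlobTier.Archive);
            }
        }
    }
}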

This data has now been formally archived and is only available for disaster recovery, auditing or compliance purposes.




LETS Process Data – Modern ETL for the Data Lake

At the PASS Summit this year, I attended a session by Michael Rys in which he introduced the concept of LETS as an approach to processing data in the data lake. If you are familiar with the data lake, you will be familiar with having to apply a schema to the data held within it. The LETS approach is purpose-designed for schematization.

Where ETL stands for Extract, Transform, Load and ELT stands for Extract, Load, Transform, LETS stands for Load, Extract, Transform, Store.

Data are Loaded into the data lake

Data are Extracted and schematized

Data are Transformed in rowsets

Data are Stored in a location, such as the Catalog in Azure Data Lake Analytics, Azure SQL Data Warehouse or Azure Analysis Services, for analysis purposes.

image

I really like this approach as it makes sense for how data are handled in the data lake. It’s something that I will be advocating and using, and I hope you do too!



Automating The Deployment of Azure Data Factory Custom Activities

Custom Activities in Azure Data Factory (ADF) are a great way to extend the capabilities of ADF by utilising C# functionality within an ADF pipeline. Custom Activities are useful if you need to move data to or from a data store that ADF does not support, or to transform or process data in a way that isn’t supported by Data Factory.

Deploying Custom Activities to ADF is a manual process, which requires many steps. Microsoft’s documentation lists them as:

  • Compile the project. Click Build from the menu and click Build Solution.
  • Launch Windows Explorer, and navigate to bin\debug or bin\release folder depending on the type of build.
  • Create a zip file MyDotNetActivity.zip that contains all the binaries in the \bin\Debug folder. Include the MyDotNetActivity.pdb file so that you get additional details such as line number in the source code that caused the issue if there was a failure.
  • Create a blob container named customactivitycontainer if it does not already exist
  • Upload MyDotNetActivity.zip as a blob to the customactivitycontainer in a general purpose Azure blob storage that is referred to by AzureStorageLinkedService.

The number of steps means that it can take some time to deploy Custom Activities and, because it is a manual process, it can contain errors such as missing files or uploading to the wrong storage account.

To avoid the errors and delays caused by manual deployment, we want to automate as much as possible. Thanks to PowerShell, it’s possible to automate the entire deployment.

The script to do this is as follows:

Login-AzureRmAccount

# Parameters
$SourceCodePath = "C:\PathToCustomActivitiesProject\"
$ProjectFile ="CustomActivities.csproj"
$Configuration = "Debug"
#Azure parameters
$StorageAccountName = "storageaccountname"
$ResourceGroupName = "resourcegroupname"
$ContainerName = "blobcontainername"


# Local Variables
$MsBuild = "C:\Program Files (x86)\MSBuild\14.0\Bin\MSBuild.exe";           
$SlnFilePath = $SourceCodePath + $ProjectFile;           
                            
# Prepare the Args for the actual build           
$BuildArgs = @{           
     FilePath = $MsBuild           
     ArgumentList = $SlnFilePath, "/t:rebuild", ("/p:Configuration=" + $Configuration), "/v:minimal"           
     Wait = $true           
     }         

# Start the build           
Start-Process @BuildArgs
# initiate a sleep to avoid zipping up a half built project
Sleep 5

# create zip file

$zipfilename = ($ProjectFile -replace ".csproj", "") + ".zip"

$source = $SourceCodePath + "bin\" + $Configuration
$destination = $SourceCodePath + $zipfilename
if(Test-path $destination) {Remove-item $destination}
Add-Type -assembly "system.io.compression.filesystem"
[io.compression.zipfile]::CreateFromDirectory($Source, $destination)

#create storage account if not exists
$storageAccount = Get-AzureRmStorageAccount -ErrorAction Stop | where-object {$_.StorageAccountName -eq $StorageAccountName}      
if  ( !$storageAccount ) {
     $StorageLocation = (Get-AzureRmResourceGroup -ResourceGroupName $ResourceGroupName).Location
     $StorageType = "Standard_LRS"
     New-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName  -Name $StorageAccountName -Location $StorageLocation -Type $StorageType
}

#create container if not exists
$ContainerObject = Get-AzureStorageContainer -ErrorAction Stop | where-object {$_.Name -eq $ContainerName}
if (!$ContainerObject){
$storagekey = Get-AzureRmStorageAccountKey -ResourceGroupName $ResourceGroupName -Name $StorageAccountName
$context = New-AzureStorageContext -StorageAccountName $StorageAccountName -StorageAccountKey $storagekey.Key1 -Protocol Http
New-AzureStorageContainer -Name $ContainerName -Permission Blob -Context $context
}

# upload to blob
#set default context
Set-AzureRmCurrentStorageAccount -StorageAccountName $StorageAccountName -ResourceGroupName  $ResourceGroupName
Get-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName
# Upload file
Set-AzureStorageBlobContent -Container $ContainerName -File $destination


By removing the manual steps in building, zipping and deploying ADF Custom Activities, you remove the risk of something going wrong and you add the reassurance that you have a consistent method of deployment which will hopefully speed up your overall development and deployments.

As always, if you have any questions or comments, do let me know.

Instant Bot: deploying a Bot in minutes with Azure Bot Service

I had been playing around with the Bot Framework for a while but hadn’t really got anywhere, largely due to not having enough time to create something worthwhile, when I came across the Azure Bot Service whilst trawling through the documentation of the Bot Framework. The Azure Bot Service is currently in preview and allowed me to quickly author and deploy a basic bot for the purpose of this post and walkthrough.

Creating the Bot

Like most services in Azure, creating the bot is easy and requires the following input

image

Configuring the Bot

Setting up the Bot is a bit more involved. Once the bot has been created, you’ll be presented with the following screen

image

The App ID and Password are auto-generated by Microsoft, but you will need to make note of the password and store it securely as it is only displayed once in the app registration process.

Next you want to select the language in which the bot is developed and deployed. You have the choice of C# or NodeJS; I opted for C# as it’s the language I am most familiar with.

image

Choose your template, accept the T’s & C’s and your bot is ready to be deployed!

Deploying the Bot

The Bot has now been created and configured, and its source code is displayed; this can be further tweaked in the browser or in Visual Studio. You can also embed your Bot in a number of existing apps, websites and services.

image


Chatting with the Bot

The basic bot isn’t the most stimulating of conversational partners but it is satisfying to see your creation talk back, even if it repeats what you have just told it.

image

The Bot Framework opens up many possibilities to make the services you offer engaging in a conversational way. The Azure Bot Service makes the Bot Framework that much more accessible to quickly deploy bots and have them out there, engaging with users.


Data Flow Pt 2: Vertexes In Azure Data Lake

Following on from my previous post on Job Execution in the Azure Data Lake, this post will explore the concept of Extents and how they are utilised by Vertexes in the actual Job processing in Azure.

The U-SQL script that has been authored, compiled and deployed is the logical plan of how the author intends to transform input data into output data. This creates a total amount of work – essentially the amount of data it has to process – which is decomposed into a set of vertexes. Each vertex will process a subset of data, or extents (see Azure Data Lake Storage for more information) and represent a fraction of the total.

Vertexes are displayed, or grouped, in a super vertex, also known as a stage. The vertexes in each stage perform the same operation on a different part of the same data. The number of vertexes in a stage indicates the maximum theoretical parallelisation of that stage. The containers requested for the job will be allocated to complete each vertex. Say there is a 10GB file: it will be split into 40 extents and allocated to at least 10 vertexes. If you want to process all of the file in parallel, then requesting 10 containers will allow for concurrent parallelism.
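To put some rough numbers on that (assuming extents of up to roughly 250MB each): 10GB ÷ 250MB ≈ 40 extents, so with 10 vertexes each vertex reads around 4 extents (about 1GB), and reserving 10 containers (ADLAUs) lets all 10 vertexes run at the same time.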

All this is visualised as the job graph.

image

If you have multiple transformations going on in your U-SQL script, this will create multiple stages, and the output of one vertex becomes the input of another vertex in a dependent stage. As a result, dependent stages can begin processing even if preceding stages haven’t completed processing.

image

If a dependency does exist, it will create a critical path, which is the dependency chain of vertexes that keeps the job running to the very end, because the vertex at the bottom depends on the output of the vertex at the top. This can be seen in the Vertex Execution View in the Visual Studio Azure Data Lake Tools. It’s useful for optimising the job, by re-positioning or re-writing elements of your script, by checking which vertex takes the longest.


It may not be possible to use all the reserved parallelism during a stage if there are fewer vertexes than Azure Data Lake Analytics Units (ADLAUs) available.

image

For example, if I have 10 ADLAUs, that’s great for an early stage as all the ADLAUs will be allocated a vertex. But in later stages, more and more ADLAUs will be idle, so by the last stage only 2 of the 14 are utilised. Unfortunately it is not currently possible to dynamically de-allocate ADLAUs during a job.


In conclusion, by understanding how vertexes interact with extents and influence parallelism, you should be able to provision adequate resources for your jobs, as well as know where to start investigating any long-running queries you might have, using the Vertex Execution View. In the next post of the series I will be looking into resource costs and optimising the spend and time taken to process data.




Connecting Power BI to Hive

On a recent project I was tasked with importing data into Power BI from a Hive table. For those of you who are new to Azure or Big Data, Hive is a data warehousing infrastructure for Hadoop which sits in the HDInsight stack on Azure. The primary purpose of Hive is to provide data summarisation, query and analysis for big data sets. In this blog I’m going to take you through the steps, and note any gotchas, so that you can connect to Hive using Power BI.

Connecting to Hive

As Hive is part of the Azure HDInsight stack, it would be tempting to select the HDInsight or Hadoop connector when you’re getting data. However, note the HDFS in brackets beside the Azure HDInsight and Hadoop File options: this means that you’ll be connecting to the underlying data store, which can be Azure Data Lake Store or Azure Blob Storage, both of which use HDFS architectures.

image

But this doesn’t help when you want to access a Hive table. In order to access a Hive table, you will first need to install the Microsoft Hive ODBC driver. Once you’ve downloaded and installed the driver, you’ll be able to make your connection to Hive using the ODBC connector in Power BI.

image

You will need to input a connection string to connect, even though it says optional. The format of the connection string is as follows:

Driver={Microsoft Hive ODBC Driver};Host=hdinsightclustername.azurehdinsight.net;Port=443;Schema=default; RowsFetchedPerBlock=10000; HiveServerType=2; AuthMech=6; DefaultStringColumnLength=200;

On the next screen you’ll be asked to enter a username and password. The credentials used here are not the ones you use to access Azure, but the credentials you created when you set up the HDInsight cluster and use to log in to the cluster.

Click connect and you’ll be able to pull through the tables you need into Power BI. Or, if you want to be selective in what is returned, you can write a HiveQL query in the ODBC dialog. It’s also worth noting that at the moment it’s only possible to do an import of Hive Data in Power BI and not perform Direct Query, so if your data set is huge you’ll want to summarise the data or be really selective in what is returned first.