Ust

Ust Oldfield's Blog

Archiving the Data Lake

In a blog introducing the Data Lake Framework, keen readers will be aware that in the diagram there’s a box titled “ARCHIVE” but it has not been brought up since. The reason why the Archive layer in the data lake has not been discussed is because we’ve been waiting for the Archive Tier in Blob Storage.

To remind readers of the framework and where the archive layer sits in it, here it is again with the archive layer highlighted.

image

The Archive Blob

The Archive access tier in blob storage was made generally available today (13th December 2017) and with it comes the final piece in the puzzle to archiving data from the data lake.

Where Hot and Cool access tiers can be applied at a storage account level, the Archive access tier can only be applied to a blob storage container. To understand why the Archive access tier can only be applied to a container, you need to understand the features of the Archive access tier. It is intended for data that has no or low SLAs for availability within an organisation and the data is stored offline (Hot and Cool access tiers are online). Therefore, it can take up to 15 hours for data to be made online and available. Brining Archive data online is a process called rehydration (fitting for the data lake). If you have lots of blob containers in a storage account, you can archive them and rehydrate them as required, rather than having to rehydrate the entire storage account.

Archive Pattern

An intended use for the Archive access tier is to store raw data that must be preserved, even after it has been fully processed, and does not need to be accessed within 180 days.

Data gets loaded into the RAW area of the data lake, is fully processed through to CURATED, and a copy of the raw data is archived off to a blob container with a Cool access tier applied to it. When the archive cycle comes about, a new Cool access tiered blob container is created and the now old container has its access tier changed to Archive.

For example, our Archive cycle is monthly and we have a Cool access tiered blob container in our storage account called “December 2017”. When data has finished being processed in the Azure Data Lake, the Raw data is archived to this blob container. January comes around, we create a new blob container called “January 2018” with Cool access tier settings and change the access tier of “December 2017” from Cool to Archive.

This data has now been formally achieved and is only available for disaster recovery, auditing or compliance purposes. 




One-way Domain Trust and Power BI

I ran into a problem setting up on-premises data gateways on a client recently, whereby they had two domains but with a one-way trust. The result was that when authenticating within the Power BI Service to retrieve data from on-premises data sources in the untrusted domain it would throw an error.

At this point it is worth spending some time explaining the architecture.

The Architecture

The architecture might be familiar to many who use Power BI and the on-premises data gateway, with a little caveat. Domain 1 is the main domain. Domain 2 is the secondary domain and trusts Domain 1. Domain 1, on the other hand, doesn’t trust Domain 2.

image

A user in Domain 1 can access data sources in both Domain 1 and Domain 2. They can create their Power BI reports with a live connection or direct query and publish them to the Power BI Service. In order to use the reports in the service, on-premises data gateways need to be established to provide a messaging service between on-premises and the cloud. In this example, each domain has a domain controller, a tabular server and an on-premises data gateway for each tabular server.

The Problem

When a user logged-on to the Power BI Service tries to access data from Domain 2, their credentials are passed down to the on-premises data gateway, checked against the domain controller in Domain 2 and returns an error to the Power BI Service.

What I think happens is that the user (User.One@Domain1.com) will have their credentials passed down through the on-premises data gateway to the domain controller in Domain 2. Either the domain controller will not be able to find the user, it is the untrusted domain, and will not be able to pass the short name (DOMAIN1\USERONE) to the tabular server, or it tries to check with the domain controller in Domain 1 and encounters the dreaded Kerberos and cannot perform a double hop to return the short name. Either way, the result is the same in that the short name cannot be passed to the tabular server.

image

The Solution

As you can imagine, there are a few solutions to the problem.

  • If it is a Kerberos related issue, then Kerberos will have to be configured separately
  • Make Domain 2 a trusted domain
  • User mapping in Power BI Service

This latter approach is the one I opted for because it was guaranteed to work and would not change the current domain and network configuration.

In the gateways settings in the Power BI Service, I went to the Users tab under my data source and clicked on Map user names. In there I mapped users in Domain 1 to users in Domain 2.

image

If you have a large number of users, individual mapping might not be preferable or feasible, which is why you can replace the Domain names in part of the user string, as in example 3. This, however, does rely upon users in Domain 1 having an equivalent account in Domain 2. This is not always the case, for which the wildcard to service account would work, as shown in example 4.







LETS Process Data–Modern ETL for the Data Lake

At the PASS Summit this year, I attended a session by Michael Rys. In this session he introduced the concept of LETS as an approach to process data in the data lake. If you are familiar with data lake, then you will be familiar of having to apply a schema to the data held within. The LETS approach is purpose design for schematization.

Where ETL stands for Extract, Transform, Load or ELT stands for Extract, Load, Transform – LETS stands for Load, Extract, Transform, Store.

Data are Loaded into the data lake

Data are Extracted and schematized

Data are Transformed in rowsets

Data are Stored in a location, such as the Catalog in Azure Data Lake Analytics, Azure Data Warehouse, Azure Analysis Services, for analysis purposes.

image

I really like this approach as it makes sense for how data are handled in the data lake. It’s something that I will be advocating and using, and I hope you do too!



Geek in a Lift

This is probably my first proper non-technical post and I’m posting it because life at Adatis isn’t always glamourous and filled with exciting tech. Sometimes you get stuck in a lift in Coventry. That’s what happened to me.

The Context

There’s not an awful lot of context to this…. I’ve been up in Coventry all week on a client’s site, and staying at one of the Premier Inn’s in the city. For most of the week, it has been a pleasant stay, getting to know Coventry better (the city has some incredible restaurants for very little money) but otherwise uneventful. That is, until last night…

The Elephant In The Room

Well, said elephant is probably me as I caused a lift, which can hold up to 8 people, in a Premier Inn to breakdown. My first thought was “FFS”. After getting hold of the lift company, my first action was to tell my girlfriend (no sympathy there), so my second port of call were my work colleagues. This is where the fun begins…

“I’m stuck in the lift” – and the reply is not sympathy but elevator related puns, or encouragement to crack out some code, or general enjoyment at my plight.

WhatsApp Image 2017-10-05 at 18.33.59

Not impressed

After half an hour in the lift, hunger set in. And so the pictures of food flooded in to make the hunger worse. Elevator puns were still coming in from the late joiners.

Despair set in.

WhatsApp Image 2017-10-05 at 18.36.55

With the encouragement of the guys at Adatis, I was tweeting my ordeal using the #GeekInALift hashtag on both Twitter and Instagram.

Over an hour and a half after I got in the lift, I was rescued by the engineer.

WhatsApp Image 2017-10-05 at 19.32.26

FREEDOM

Lessons Learnt

The obvious lesson that I learnt was that I shouldn’t be so reliant upon machinery and technology. Sometimes they break.

The most rewarding lesson was that I can always count on everyone at Adatis to lift my spirits and keep me sane when the current situation is bleak. It’s something that I have taken for granted in the past, but no longer. It genuinely makes me proud to be an Adati.





Automating The Deployment of Azure Data Factory Custom Activities

Custom Activities in Azure Data Factory (ADF) are a great way to extend the capabilities of ADF by utilising C# functionality. Custom Activities are useful if you need to move data to/from a data store that ADF does not support, or to transform/process data in a way that isn't supported by Data Factory, as it can be used within an ADF pipeline.

Deploying Custom Activities to ADF is a manual process, which requires many steps. Microsoft’s documentation lists them as:

  • Compile the project. Click Build from the menu and click Build Solution.
  • Launch Windows Explorer, and navigate to bin\debug or bin\release folder depending on the type of build.
  • Create a zip file MyDotNetActivity.zip that contains all the binaries in the \bin\Debug folder. Include the MyDotNetActivity.pdb file so that you get additional details such as line number in the source code that caused the issue if there was a failure.
  • Create a blob container named customactivitycontainer if it does not already exist
  • Upload MyDotNetActivity.zip as a blob to the customactivitycontainer in a general purpose Azure blob storage that is referred to by AzureStorageLinkedService.

The number of steps means that it can take some time to deploy Custom Activities and, because it is a manual process, can contain errors such as missing files or uploading to the wrong storage account.

To avoid that errors and delays caused by a manual deployment, we want to automate as much as possible. Thanks to PowerShell, it’s possible to automate the entire deployment steps.

The script to do this is as follows:

Login-AzureRmAccount

# Parameters
$SourceCodePath = "C:\PathToCustomActivitiesProject\"
$ProjectFile ="CustomActivities.csproj"
$Configuration = "Debug"
#Azure parameters
$StorageAccountName = "storageaccountname"
$ResourceGroupName = "resourcegroupname"
$ContainerName = "blobcontainername"


# Local Variables
$MsBuild = "C:\Program Files (x86)\MSBuild\14.0\Bin\MSBuild.exe";           
$SlnFilePath = $SourceCodePath + $ProjectFile;           
                            
# Prepare the Args for the actual build           
$BuildArgs = @{           
     FilePath = $MsBuild           
     ArgumentList = $SlnFilePath, "/t:rebuild", ("/p:Configuration=" + $Configuration), "/v:minimal"           
     Wait = $true           
     }         

# Start the build           
Start-Process @BuildArgs
# initiate a sleep to avoid zipping up a half built project
Sleep 5

# create zip file

$zipfilename = ($ProjectFile -replace ".csproj", "") + ".zip"

$source = $SourceCodePath + "bin\" + $Configuration
$destination = $SourceCodePath + $zipfilename
if(Test-path $destination) {Remove-item $destination}
Add-Type -assembly "system.io.compression.filesystem"
[io.compression.zipfile]::CreateFromDirectory($Source, $destination)

#create storage account if not exists
$storageAccount = Get-AzureRmStorageAccount -ErrorAction Stop | where-object {$_.StorageAccountName -eq $StorageAccountName}      
if  ( !$storageAccount ) {
     $StorageLocation = (Get-AzureRmResourceGroup -ResourceGroupName $ResourceGroupName).Location
     $StorageType = "Standard_LRS"
     New-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName  -Name $StorageAccountName -Location $StorageLocation -Type $StorageType
}

#create container if not exists
$ContainerObject = Get-AzureStorageContainer -ErrorAction Stop | where-object {$_.Name -eq $ContainerName}
if (!$ContainerObject){
$storagekey = Get-AzureRmStorageAccountKey -ResourceGroupName $ResourceGroupName -Name $StorageAccountName
$context = New-AzureStorageContext -StorageAccountName $StorageAccountName -StorageAccountKey $storagekey.Key1 -Protocol Http
New-AzureStorageContainer -Name $ContainerName -Permission Blob -Context $context
}

# upload to blob
#set default context
Set-AzureRmCurrentStorageAccount -StorageAccountName $StorageAccountName -ResourceGroupName  $ResourceGroupName
Get-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName
# Upload file
Set-AzureStorageBlobContent –Container $ContainerName -File $destination


By removing the manual steps in building, zipping and deploying ADF Custom Activities, you remove the risk of something going wrong and you add the reassurance that you have a consistent method of deployment which will hopefully speed up your overall development and deployments.

As always, if you have any questions or comments, do let me know.

Instant Bot: deploying a Bot in minutes with Azure Bot Service

I had been playing around with the Bot Framework for a while but hadn’t really got anywhere, largely due to having enough time to create something worthwhile, when I came across the Azure Bot Service whilst I was trawling through the documentation of the Bot Framework. The Azure Bot Service is currently in preview and allowed me to quickly author and deploy a basic bot for the purpose of this post and walkthrough.

Creating the Bot

Like most services in Azure, creating the bot is easy and requires the following input

image

Configuring the Bot

Setting up the Bot is a bit more involved. Once the bot has been created, you’ll be presented with the following screen

image

The App ID and Password are auto-generated by Microsoft, but you will need to make note of the password and store it securely as it is only displayed once in the app registration process.

Next you want to select the language in which the bot is developed and deployed. You have the choice of C# or NodeJS. I opted for C# as it’s a language I am most familiar with.

image

Choose your template, accept the T’s & C’s and your bot is ready to be deployed!

Deploying the Bot

The Bot has been created and configured, displaying its source code which can be further tweaked in the browser or Visual Studio. You can also embed your Bot in a number of existing apps, websites and services.

image


Chatting with the Bot

The basic bot isn’t the most stimulating of conversational partners but it is satisfying to see your creation talk back, even if it repeats what you have just told it.

image

The Bot Framework opens up many possibilities to make the services you offer engaging in a conversational way. The Azure Bot Service makes the Bot Framework that much more accessible to quickly deploy bots and have them out there, engaging with users.


Data Flow Pt 2: Vertexes In Azure Data Lake

Following on from my previous post on Job Execution in the Azure Data Lake, this post will explore the concept of Extents and how they are utilised by Vertexes in the actual Job processing in Azure.

The U-SQL script that has been authored, compiled and deployed is the logical plan of how the author intends to transform input data into output data. This creates a total amount of work – essentially the amount of data it has to process – which is decomposed into a set of vertexes. Each vertex will process a subset of data, or extents (see Azure Data Lake Storage for more information) and represent a fraction of the total.

Vertexes are displayed, or grouped, in a super vertex, also known as a stage. Vertexes in each stage are doing the same operation on a different part of the same data. The number of vertexes in a stage indicates the maximum theoretical parallelisation of that stage. The containers requested for the job will be allocated to complete each vertex. Say there is a 10GB file. This file will be split into 40 Extents and allocated to at least 10 Vertexes. If one wants to process all of the file in parallel then requesting 10 containers will allow for concurrent parallelism.

All this is visualised as the job graph.

image

If you have multiple transformations going on in your USQL script this will create multiple stages, and the output of one vertex becomes the input of another vertex in a dependent stage. As a result, dependent stages can begin processing even if preceding stages haven’t completed processing.

image

If dependency does exist, it will create a critical path – which is the dependency chain of vertexes which keep the job running to the very end because the vertex on the bottom depends on the output of the vertex on the top. This can be seen in the Vertex Execution View in the Visual Studio Azure Data Lake Tools view. It’s useful for optimising job, by re-positioning or re-writing elements of your script, by checking which vertex takes the longest.


It may not be possible to use all the reserved parallelism during a stage if there are fewer vertexes than Azure Data Lake Analytics Units (ADLAUs) available.

image

For example, if I have 10 ADLAUs – it’s great for an early stage as all ADLAUs will be allocated a vertex. But with later stages, more and more ADLAUs will be idle. So by the last stage only 2 of the 14 are utilised. Unfortunately it is not currently possible to dynamically de-allocate ADLAUs during a job.


In conclusion, understanding how vertexes interact with extents and can influence parallelism, you should be able to provision adequate resources for your jobs as well as know where to start the investigation for any long running queries you might have with the Vertex Execution View. In the next post of the series I will be looking into resource costs and optimising spend and time taken to process data.




Connecting Power BI to Hive

On a recent project I was tasked with importing data into Power BI from a Hive table. For those of you who are new to Azure or Big Data, Hive is a data warehousing infrastructure for Hadoop which sits in the HDInsight stack on Azure. The primary purpose of Hive is to provide data summarisation, query and analysis for big data sets. In this blog I’m going to take you through the steps and note any Gotchas so that you can connect to Hive using Power BI.

Connecting to Hive

As Hive is part of the Azure HDInsight stack it would be tempting to select the HDInsight or Hadoop connector when you’re getting data. However, note HDFS in brackets beside the Azure HDInsight and Hadoop File options as this means that you’ll be connecting to the underlying data store, which can be Azure Data Lake Store or Azure Blob Storage – both of which use HDFS architectures.

image

But this doesn’t help when you want to access a Hive table. In order to access a Hive table you will first of all need to install the Hive ODBC driver from Microsoft. Once you’ve downloaded and installed the driver you’ll be able to make your connection to Hive using the ODBC connector in PowerBI.

image

You will need to input a connection string to connect even though it says optional. The format of the connection string is as follows:

Driver={Microsoft Hive ODBC Driver};Host=hdinsightclustername.azurehdinsight.net;Port=443;Schema=default; RowsFetchedPerBlock=10000; HiveServerType=2; AuthMech=6; DefaultStringColumnLength=200;

One the next screen you’ll be asked to enter a username and password. The credentials used here are not what you use to access Azure but the credentials you created when you set up the HDInsight cluster and use to login to the cluster.

Click connect and you’ll be able to pull through the tables you need into Power BI. Or, if you want to be selective in what is returned, you can write a HiveQL query in the ODBC dialog. It’s also worth noting that at the moment it’s only possible to do an import of Hive Data in Power BI and not perform Direct Query, so if your data set is huge you’ll want to summarise the data or be really selective in what is returned first.

Data Flow: Job Execution in the Azure Data Lake

In this blog we will deep dive into the job execution of the Azure Data Lake Analytics (ADLA). If you’ve used ADLA you will have come across the following image:

Picture1

This is the graphical representation of the job execution. But what is is doing at each stage?

Preparing

The job execution begins with the authoring of the U-SQL script. The U-SQL script itself is the logical plan of how you intend to transform input data into output data. The script gets compiled, which will translate your U-SQL script primarily into C#, as well as a few other items such as XML– which will contain metadata information and the job graph. Once compiled it will create an initial plan – which will then be optimised thus producing an optimised plan. If it fails at this stage it will mostly be down to your U-SQL script failing to compile successfully. Fortunately, the error codes at this stage are fairly helpful and will indicate the line in question.

Queued

Once compiled and optimised the job progresses to the queued stage. There are a few things which can cause the a job to queue and these are:

  • Running Jobs
    • The default setting per subscription is 3. So if you or someone else in your subscription is already running a series of jobs, your job will be queued until at least one of them completes
  • Higher priority jobs
    • As the queue is ordered by job priority, with loser numbers having a higher priority, other jobs can jump to the top of the queue thus forcing your job to queue for longer
  • Lack of resources
    • Even if there are no running jobs and there are no jobs ahead of yours in the queue, your job will continue to queue if there are not enough Azure Data Lake Analytic Units (ADLAUs) to start the job.

 

Running

Once the job has finished queuing it will run. The Job Manager allocates vertexes, which are collections of data to be processed, to the ADLAUs and uses YARN to orchestrate it. The vertexes will then execute. At this stage, if there is a vertex failure you can download the vertex locally and debug in visual studio.

When the vertexes have successfully completed you will be able to consume the end product – either locally, if you have run the job locally, or in the Azure Data Lake Store.

Conclusion

Job Execution

The above process can be visualised in this graph. Knowing how ADLA executes your job is the first of many steps to being able to write performant U-SQL and debug effectively. Be on the look out for more blogs on the technical deep dive into ADLA!

Testing the Waters: An Overview of Data Science using Azure Data Lakes

Data Science can fit seamlessly within the ecosystem of the data lake, whether this is through HDInsight or the extensibility of Azure Data Lake Analytics and U-SQL. This blog will give a brief overview of what Data Science is; how to link Data Science toolkits to the Azure Data Lake; and best practices for managing the data output from experiments.

 

Data Science

Data Science is the relatively new kid on the block. One way to consider data science is as an evolutionary step in interdisciplinary fields like business analysis that incorporate computer science, modelling, statistics, analytics, and mathematics.
At its core, data science involves using automated methods to analyse massive amounts of data and to extract insight from them. Data science is helping to create new branches of science, and influencing areas of social science and the humanities. The trend is expected to accelerate in the coming years as data from mobile sensors, sophisticated instruments, the web, and more, grows.

 

Data Science In The Data Lake

The nature of the Azure Data Lake Store lends itself to Data Science in that it can hold any data, which the data scientist will want to access, transform and analyse.

 

HDInsight contains many implementations for data science, such as Spark, R Server and others. Hooking HDInsight to Azure Data Lake Store is pretty simple and follows these steps:

In the Azure Portal Marketplace, select HDInsight which will bring up a series of blades. In this blog, I will be using Spark as my cluster type on HDInsight.

image

In the storage settings you can then link your HDInsight cluster to Azure Data Lake Store

image

Confirm your configuration on the next blade and wait around 20 minutes for your cluster to deploy and you’re good to go!

 

With Azure Data Lake Analytics, you incorporate data science by extending the capabilities of U-SQL and you do this by installing a series of files.

Open up your Azure Data Lake Analytics account and click on Sample Scripts at the top. This will bring forward the following blade

image

From there you’ll want to click on the U-SQL Advanced Analytics tab, which will copy about 1.5GB of files to the default Azure Data Lake Store associated to your ADLA account. This will take about 3 minutes to complete.

When it’s finished copying the files it will then call a job to register the extension, which can be found in the assemblies folder of the master database.

image

More resources about extensibility of U-SQL can be found here:

https://blogs.msdn.microsoft.com/azuredatalake/2017/03/10/using-custom-python-libraries-with-u-sql/

https://blogs.msdn.microsoft.com/azuredatalake/2016/11/22/u-sql-advanced-analytics-introducing-python-extensions-for-u-sql/

The Laboratory

Within Azure Data Lake Store, Folder and File Management is incredibly important for a well running data lake. See my blogs on Storage and Best Practices and Shaping The Lake for more information on how to set up your Azure Data Lake Store.

The Laboratory is an area to be exclusively used by a data scientist. It’s an area where they can persist the results of experiments and data sets without impacting the day-to-day operations within the data lake or other data scientists. The laboratory is organised in to two broad area: Desks and Exhibits.

Desks contain personal workspaces, the contents of which can be organised however the person wishes. It can be as well organised, or disorganised, as the person themselves. The Exhibit contains data sources produced in the Laboratory which are ready to be consumed by other users or systems. Both of which are laid out below.

image

As always, if you have any feedback or comments do let me know!