
Adatis BI Blogs

Getting started with Azure Data Factory Mapping Data Flows

Azure Data Factory V2 is the go-to service for moving large amounts of data within the Azure platform and, until relatively recently, was focussed predominantly on control flow rather than data flow. Those familiar with SQL Server Integration Services will be aware of the difference but, to clarify, control flow is used to orchestrate activities within a data pipeline whilst data flow is used to physically move and transform the data. In May 2019, the Azure Data Factory team released the Azure Data Factory Mapping Data Flows (MDF) functionality into public preview, which effectively moves ADF from being an orchestration tool to a fully fledged ETL tool. This new feature allows developers to configure data transformation logic using a no-code, drag-and-drop approach and implements many of the transformation concepts that existed in SSIS. In fact, there is an excellent comparison between SSIS, SQL and ADF MDF written by Kamil Nowinski which highlights how the three technologies match up: https://sqlplayer.net/2018/12/azure-data-factory-v2-and-its-available-components-in-data-flows/. Points to note are that ADF does not have event handlers in the way SSIS does, and it has no equivalent to the script component within Data Flows. It does have a Custom activity which allows you to write C# within the control flow, but currently you cannot write any custom code within a Mapping Data Flow.

Prior to the release of MDF, Data Factory could move data from one place to another with its Copy activity, however it could not modify the file in any meaningful way and required external services such as SQL Data Warehouse or Data Lake Analytics (RIP) to fill this gap. The Copy activity performs the data movement using the Azure Integration Runtime, which provides the compute power needed to complete the operation, so does that mean that Mapping Data Flows run on the same compute resource? No. In fact, your graphical data flow is converted into Scala and then compiled into a JAR library to be executed on a Databricks (Spark) cluster which is deployed and managed by Microsoft solely for running your data flow activities. This does mean that any pipelines that utilise MDF have a slight overhead to allow for the cluster to start up and configure, however Microsoft are looking at ways to reduce this. There is also no need for an ETL developer building MDFs to know anything about Databricks, Scala or Spark clusters – although it will certainly help!

So, it's in public preview, so let's get using it! This blog will walk through the process of creating a basic cleaning data flow that will populate a SQL Database table with the values from a delimited text file. To begin, we need some Data Factory objects: anyone familiar with Data Factory will understand we need Linked Services and Datasets to connect to our data and create schemas, and Pipelines to give structure to the activities and create our control flow. In this example I have a small comma separated file that has some data quality issues, such as:

- Leading and trailing white space
- Weird column names
- Non-standardised NULL values
- Weak data types

The data flow we will create will address all of these issues before depositing the file into a SQL Database. First, we should create a linked service that connects to the blob store or data lake account where our dirty file is stored.
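This is configured through the ADF authoring UI, but for reference, here is a minimal sketch of what the underlying Azure Blob Storage linked service JSON might look like. The name, account and key are placeholders only, and in practice the key is better kept in Azure Key Vault:

{
    "name": "LS_BlobStorage_DirtyFiles",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=<storage account>;AccountKey=<account key>"
            }
        }
    }
}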
We then need a dataset that sits on top of the linked service, which allows us to locate and read the file using specified parameters such as file path, column delimiter, row delimiter etc. Additionally, in this dataset we can import the schema of the file so that we have some column names available. Next, we can create a linked service that connects to our output SQL Database, and a dataset that points to the correct table, again importing the schema. After creating these objects we also need a pipeline to orchestrate the process, and this will ultimately call our fancy new data flow. You only need a single activity in this pipeline, which is found under the "Move & Transform" heading and is called "Data Flow (Preview)". At this point you should have something similar to the following, with the exception of the data flow object under the Data Flows (Preview) tab.

NOTE: I have not gone into much detail about creating these objects as they will be familiar to most ADF developers. For pointers on how to create the above, follow this link: https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-portal

Now we can begin creating our Data Flow. Firstly, we will add the source and the sink to our data flow; this gives us our bookends as a clear starting point, and we can then play about with the middle bit afterwards. Also, at this point, toggle the switch at the top of the Data Factory UI named "Data Flow Debug". This will start a cluster up for you and avoid you having to wait later on.

Hover over the Data Flows (Preview) header, click on the ellipsis and choose "Add Data Flow". This will take you into the Mapping Data Flow UI where you can begin creating your Data Flows. Remember to set a sensible name for your data flow in the General tab of the data flow UI.

Click "Add Source" and we can begin configuring the source activity of our data flow. Provide a clear name that identifies the source; this source is fed from a dataset, so the names should closely match. Choose the dataset that is linked to your blob account and uncheck "Allow schema drift". This is a useful option that allows constantly changing source files to be handled, however we will cover that in a later blog. On the "Source Options" tab you can choose to add actions that occur after completion, such as deleting or moving source files. On the "Projection" tab you can tailor your schema, which is defined in the source dataset. My preference is to leave these all as strings to avoid any early typing errors, as we will address data types later in the flow. Finally, on the Optimize, Inspect and Data Preview tabs, all defaults can remain the same.

Now click on the tiny + icon in the bottom right corner of the source activity and choose "Sink" from the drop-down list. Here we can configure the output of our flow, which in this case will be a SQL Database. Specify the name, again relating it to your sink dataset, and choose the SQL Database dataset created earlier. On the "Settings" tab you can choose which methods can be used by ADF when working with your table; these can be any combination of insertion, deletion, upserting or updating. You can also define actions to occur in the database before loading the data, such as recreating the entire table or truncating the existing table before loading into it. Finally, on the "Mapping" tab you can map columns from source to sink. Be aware that any columns that are not strings in your database will not be able to be mapped until the data typing has occurred.
Now that we have a basic copy structure that does nothing clever yet but does connect our dirty source file up to our sink dataset, we can begin doing the actual transformation.

The first transformation will be to trim white space from columns. Click on the + icon and choose "Derived Column". Within the "Derived Column Settings" tab you should add each of the columns in your source dataset and then enter the following expression for each one in the expression editor: trim({column name}). This expression will remove any whitespace from the column's value, ensuring the database receives a clean value.

Now we will standardise any NULL values and also transform any columns into their correct data types. To do this, click the + icon again and choose "Derived Column" again. Similar to the step above, add an entry in the "Derived Column Settings" tab for each column, with the following expression: replace(replace({column name}, ' ', ''), 'Unknown', ''). This expression will replace any empty values with NULL, and any values of 'Unknown' will also be replaced with NULL, so that we have some standardisation before loading into the database. Any NULL values already present will be untouched.

In my dataset I need to change one column from its originating string data type into an int so that it can be dropped into the DB. Rather than making this change in place, it is best to create a new column, so that you have the original column and a new column with the correct type. Whilst still in the expression editor, hover over any column name in the "OUTPUT SCHEMA" window that is visible on the left hand side and choose the + icon. This will allow you to add a new column to your data flow, and you can use any of the conversion functions (toInteger, toLong, toString, toDate, toBoolean etc.) to coerce the value into its correct type.

At this point you should have four steps that resemble the below screenshot. Once your Data Flow Debug session is online you can debug the data flow and hopefully see the nice clean values pass through into the database. Throughout this process I recommend taking a peek at the Inspect and Data Preview tabs. The Inspect tab gives a bit more information about what steps are taking place on the data in that activity, and the Data Preview tab will show you how the data will look, although the debug session needs to be active for this to work. Finally, the Optimize tab allows you to set the partitioning of the data using techniques such as round robin, hash and range distribution, although these are out of the scope of this blog.

Hopefully this brief tutorial has been helpful and has allowed you to gain some early knowledge of Data Factory Mapping Data Flows, meaning that you can go on to create flows that are tailored to your needs and cover a wider variety of scenarios. Any questions, thoughts or feedback, please catch me on twitter @MattTheHow.

The Azure Modern Data Warehouse: Unparalleled Performance

Today, 80% of organisations adopt cloud-first strategies to scale, reduce costs, capitalise on new capabilities including advanced analytics and AI, and to remain competitive. Cloud adoption is accelerating, and data exploitation is a key driver. The central tenet of this enterprise-wide exploitation of data is the cloud-based Modern Data Warehouse. Legacy on-premises or appliance-based EDWs, which were once the strategic asset of only the largest enterprise organisations, not only limit performance and flexibility, but are also harder to set up and scale securely. The Modern Data Warehouse fundamentally changes the landscape for data analytics by making analytics available to everyone across organisations of all sizes, not only the largest enterprises.

A modern data analytics platform enables you to bring together huge volumes of relational and non-relational, or structured and unstructured, data into a single repository: the highly scalable and cost-effective Azure Data Lake. This provides access across the enterprise, from basic information consumption through to innovation-led data science. Big data processing capability for data preparation, such as transformation and cleansing, can be performed, as well as infusing Machine Learning and AI, with the results made readily available for analysis through visual tools like Power BI.

Azure provides unparalleled performance at incredible value. To further support this claim, Microsoft have just announced the GigaOm TPC-DS benchmark results, which further cement Azure SQL Data Warehouse as a leader for price/performance across both decision support benchmarks, having already attained best price/performance status for the TPC-H benchmark, announced back in February 2019.

[Chart: GigaOm price/performance results – $ per query per hour, TPC-DS @ 30TB and TPC-H @ 30TB]

Azure SQL Data Warehouse (Azure SQL DW) has always delivered on performance when compared to alternatives, and now GigaOm has found analytics on Azure is 12x faster and 73% cheaper when compared using the TPC-DS benchmark. Azure SQL DW has established itself as the alternative to on-premises data warehouse platforms and a leader in cloud analytics.

Adatis have been at the cutting edge of cloud analytics solutions since the introduction of the Azure SQL Data Warehouse PaaS offering back in 2015. In the last 18 months we have noticed the profile of Azure SQL DW rise sharply, with Azure SQL DW outperforming and taking over workloads from its closest competitors. We specialise in all aspects of delivering the value of cloud analytics, AI and the Modern Data Warehouse, from strategic business-value-led engagements through technical design and implementation to ongoing continuous improvement via fully managed DataOps practices. Adatis utilise Microsoft Azure technologies, in conjunction with first-party Spark based services, that securely integrate to provide enterprise-grade, cloud-scale analytics and insight for all, and partner deeply with Microsoft to enable data-driven transformation for our customers. We work to develop a modern data analytics strategy and ensure it is implemented and supported in the best way possible, aligning to your company's specific goals and overriding strategy.

If you want to find out how Adatis can help you make sense of your data and learn more about the Modern Data Warehouse, please join us in London on 6th June for an exclusive workshop.
We will guide you through the Microsoft landscape and showcase how we can help you get more value from your data, wherever you are on your data transformation journey. Register here for our "Put your Data to Work" in-person event, to be held in London on 6th June 2019.

Additional Resources:
- Microsoft Azure Big Data and Analytics
- Information on Azure SQL Data Warehouse
- Try Azure SQL Data Warehouse for Free

#SimplyUnmatched, #analytics, #Azure, #AzureSQLDW, #MSPowerBI

Converting Data Factory Azure Function Activities from Web to Native

Microsoft have recently added support to call Azure Functions natively within Data Factory. Prior to this, you only had one option if you wanted to leverage the serverless compute, which was through a Web activity. While this certainly did a job, it wasn't ideal as it exposed the function/host key within code, along with the HTTP message used to authenticate with the functions. As such, I expect quite a few people's pipelines will be in a similar state to my own. I have recently been through the process of upgrading them in line with the new native ADF functionality and, as part of this, I thought it would be worth my time to share the process I went through so you don't suffer the same pains!

Linked Service

The first step in this process is to add a new linked service into your data factory. When you try to do this, you'll probably end up doing what always seems to happen when first searching for an Azure Function connector and type into the search prompt in front of you. This is filtered to data connectors by default, so the connector won't appear; you'll need to switch the tab at the top to Compute, at which point you'll see it. The next step is to add the URL to your functions endpoint, which will be in the format https://functionappname.azurewebsites.net. You'll then need to supply a key to the function app for authentication purposes. I strongly advise you to set up an Azure Key Vault and store the key in there; in ADF you can then access the key via a secret. This is simple to do and promotes best practice from a security perspective, as well as being useful for CI/CD purposes. This key can be either a function key (specific to a single function) or a host key (access to all functions). Most of the tutorials on the web, as well as the official MS content, currently reference the function key, so I was slightly worried at first that you could only call functions via the function key. This is not the case, and the host key can be used too. If you're not too familiar with functions, this can be found within the application settings of your function app.

Pipelines

The next step in the process is replacing those Web activities with native Azure Function activities. While this is a fairly simple process, you'll no longer be calling a GET on an HTTP webhook directly via a URL string and passing in parameters. Instead, you'll need to specify the function you need to call and pass in the parameters via a JSON body as part of a POST call to the endpoint. For any functions not requiring parameters, you'll need to use a GET rather than a POST with an empty body, as ADF has issues validating a body without any content. To pass the parameters into a body, you'll need to wrap them in JSON with the following format. I've provided examples of strings, integers and expressions, but hopefully you get the point here.

Function Code

As part of this process, you'll also need to modify the C# code within your function apps. Prior to this upgrade, you may have been handling your functions through the HttpResponseMessage object (see below for an example – I've highlighted in red the code snippets that relate to the subject matter). Your Web activity is happy to accept a response from this class in Data Factory! Now that we are using the native functions, Data Factory will no longer accept an HTTP message as a response, and instead wants a JSON object (JObject). As a result, our HttpResponseMessage needs to be replaced by an IActionResult object (see below for an example of the changes).
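The original snippets were screenshots, so here is a minimal sketch of the kind of change involved, written as an illustration rather than taken from the post; the function name, parameter and return payload are hypothetical:

using System.IO;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Newtonsoft.Json.Linq;

public static class ArchiveFiles
{
    // Before: returning an HttpResponseMessage, which the Web activity accepted happily.
    [FunctionName("ArchiveFilesOld")]
    public static Task<HttpResponseMessage> RunOld(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestMessage req)
    {
        var response = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("{\"Status\":\"Done\"}", Encoding.UTF8, "application/json")
        };
        return Task.FromResult(response);
    }

    // After: returning an IActionResult wrapping a JObject, which the native
    // Azure Function activity expects as its response.
    [FunctionName("ArchiveFiles")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
    {
        // The body arrives as a stream and is deserialised into a JObject,
        // from which the individual parameters can be read.
        string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
        JObject parameters = JObject.Parse(requestBody);
        string folderName = (string)parameters["FolderName"]; // hypothetical parameter

        return new OkObjectResult(new JObject { ["Status"] = "Done" });
    }
}

OkObjectResult comes from the Microsoft.AspNetCore.Mvc packages discussed below.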
The change is relatively trivial, and you only need to decide on the type of object to create instead of just creating a response. To use this object, you'll need to import the Microsoft.AspNetCore.Mvc libraries through NuGet. At this point, I ran into an issue since the dependencies required by Microsoft.AspNetCore.Mvc conflicted with my existing libraries – to such an extent that separate NuGet packages wanted my Newtonsoft.Json library to be both equal to version 9.0.1 and greater than version 11.0.1. To solve this issue, I imported separate parts of the main library – in this instance Microsoft.AspNetCore.Mvc.Abstractions and Microsoft.AspNetCore.Mvc.Core. These two libraries did not have the same constraints as the entire framework library.

While the above code changes allowed me to send/receive a JObject, in most scenarios I also needed to POST parameters into the function. In this case, I also needed to modify the code to read these (see the highlighted code block below). The content is now treated as a stream, which then needs to be deserialised from JSON into an object. The parameters can then be read by treating the object as a list.

Errors

If you've found this blog as part of a Google search, you'll more likely than not have hit the following generic error: { "errorCode": "3600", "message": "Error calling the endpoint.", "failureType": "UserError", "target": "Activate New Parent Pipeline" }. This is not a particularly helpful error, as it basically tells you that ADF was not able to receive a correctly formatted response. More often than not, this will be due to an issue in your function code rather than something in ADF – I spent a while thinking it was an authentication error when it was not! You may receive a slightly more verbose error if you check the function app logs through something such as Application Insights – again, I would strongly recommend you set this up as part of your Azure Functions service, as it's not ideal to rely on ADF to debug this. Alternatively, I would also suggest setting up Postman to debug the issues locally and have a bit more control over the variables at runtime through the Locals output.

Conclusion

Hopefully you will find this guide useful if you're about to go through a similar process to myself. It might also be useful for anyone setting up functions in ADF for the first time, as it covers most of the content required for the services to talk to one another.
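For reference, here is a hedged sketch of what the pipeline side might look like: a native Azure Function activity POSTing a JSON body that mixes static values and an expression. The activity, function and parameter names are placeholders:

{
    "name": "Call Archive Function",
    "type": "AzureFunctionActivity",
    "linkedServiceName": {
        "referenceName": "AzureFunctionLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "functionName": "ArchiveFiles",
        "method": "POST",
        "body": {
            "FolderName": "Sales",
            "RetentionDays": 30,
            "RunDate": "@{formatDateTime(utcnow(), 'yyyy-MM-dd')}"
        }
    }
}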

Azure Data Factory v2 : ForEach Activity : Handling Null Items

The ForEach activity is a great addition to Azure Data Factory v2 (ADF v2) – however, you can encounter issues in some situations where you pass a null in its 'Items' setting for it to iterate. When you pass a null, you receive the error:

{ "errorCode": "InvalidTemplate", "message": "The function 'length' expects its parameter to be an array or a string. The provided value is of type 'Null'.", "failureType": "UserError", "target": "ForEach" }

This happens because the initialisation of the ForEach iterator checks the length of the string or array that is passed in on the 'Items' setting. When a null is supplied (i.e. no items to create an array from), the length function fails. I would like to see the ADF ForEach activity check for null first and only check the length and continue with the iterator when it's not null, but it doesn't, although I'm sure that will tighten up in future versions.

Arguably the correct way to handle this is to implement an If Condition activity within your pipeline that tests for null and only executes the ForEach iterator activities when you can confirm the object you want to iterate is not null. However, for me that slightly overcomplicates the pipeline, as you end up with nested activities and/or additional pipelines that make it difficult to maintain and really understand what's happening. There's an alternative, and I'd be interested in understanding if there are any better ways to achieve the same result – please comment below.

Rather than passing the null object, we can run an inline test within the Items attribute and pass either a valid object for iterating (when the object is not null) or an empty array (when the object is null). It took me a while to work out how to create an empty array. @array('') returned an array with a length of 1, so the ForEach loop fired but then subsequently failed as there was nothing to grab from the array. An empty string had the same effect too. We need to generate an array (or a string) with a length of zero. The solution was to 'take' zero items from array('') using take(array(''),0). This returned a non-null array with a length of 0 – the ForEach loop now didn't fail, it succeeded and, better still, didn't trigger the sub activities – the result I wanted.

Wrapping this into a coalesce provides a self-contained defensive null-handling pattern that should arguably always be used on the Items setting of the ForEach activity:

@coalesce(<##Your string or array to iterate that might be null##>,take(array(''),0))

I would love to hear better/alternative approaches that don't rely on the If Condition activity.
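To make the pattern concrete, here is a minimal sketch of how it might sit on a ForEach definition. The upstream activity name ('GetFileList') and the inner Wait activity are hypothetical stand-ins for whatever you actually iterate and execute:

{
    "name": "ForEachFile",
    "type": "ForEach",
    "dependsOn": [
        { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
        "items": {
            "value": "@coalesce(activity('GetFileList').output.childItems, take(array(''), 0))",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "DoSomethingPerFile",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 1 }
            }
        ]
    }
}

When GetFileList returns nothing, the coalesce hands the ForEach a zero-length array, so the iterator succeeds without triggering the inner activities.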

Using Azure Automation to Archive SFTP Files Orchestrated through ADF

I recently ran into a problem which required me to work with an SFTP site as part of our ETL process in Azure. Using the traditional BI stack, FTP tasks would be natively supported, but this is no longer the case when working with Azure – it now requires a bit of plumbing. This blog will take a look at the ways in which you can interact with FTP/SFTP sites as part of your pipelines in ADF and the issues I ran into. Specifically, the task I was trying to solve was to archive files (move them to a separate folder) that had passed into our staging layer, so they would not be picked up on the next process run. The blog will also cover passing parameters into a webhook, as I also needed the solution to be flexible enough to handle different SFTP sources.

Like many similar tasks in Azure, you can go about this in a number of ways (Logic Apps, Functions, PowerShell runbooks, etc.) but it's about finding the solution that fits with your architecture and that you are comfortable developing and maintaining. I initially looked at using Logic Apps, since quite a bit of the work is done for you: they have the connectors already set up and the framework to achieve some simple tasks, which sounded perfect for the job. Unfortunately, Logic Apps is not designed for any type of heavy lifting. You can move files using the platform but are capped at a mere 50MB when trying to pick up and put down a file across the platform. Rumours were that Microsoft would extend this to 1GB at some point during 2018, but this has not been the case so far. There was also no task to simply change the remote path location as far as I could tell – it looks like people are up-voting it here.

Next I looked at using Azure Functions, since I would do something similar through SSIS if this was a traditional stack problem. This involves writing a bit of C# to talk to the SFTP site and do the tasks for you. While this is very possible, it didn't fit into our existing architecture, and I also felt it was easier to build something to maintain using PowerShell – a language which I'm enjoying coding in more and more.

Lastly I looked at using Azure Automation and doing the tasks through PowerShell. There are a number of modules which you can use to achieve the goal, for which I used Posh-SSH. I also looked at WinSCP but found the Posh-SSH cmdlets more flexible.

The first thing I did was work out how to pass parameters from a webhook into the script. This was so that the script could be used across multiple SFTP folders. In ADF it was as simple as defining the headers/body to pass with the POST call to the webhook.

Within the PS script I then defined the following region to receive the parameters. The parameters are passed in as an object which is then deconstructed into the various header/body elements. The body is also deconstructed further from JSON into the variable I wanted. I believe it is also mandatory to name the object received by the script $WebhookData, as this is not configurable externally.

# Get parameter value
Param ([object]$WebhookData)

#region Verify Webhook
if ($WebHookData){

    # Collect properties of WebhookData
    $WebhookName = $WebHookData.WebhookName
    $WebhookHeaders = $WebHookData.RequestHeader
    $WebhookBody = $WebHookData.RequestBody

    # Collect individual headers if required. Input converted from JSON.
    $Input = (ConvertFrom-Json -InputObject $WebhookBody)
    Write-Output "WebhookBody: $($Input)"
    Write-Output -InputObject ('Runbook started from webhook {0}.' -f $WebhookName)

    # Extract variables
    $folderName = $Input.FolderName
}
else {
    Write-Error -Message 'Runbook was not started from Webhook' -ErrorAction stop
}
#endregion

Once this is defined within the script, you can set up the webhook from the Automation end and paste the URL it generates back into ADF. It is also at this point that the webhook configuration allows you to define a parameter as part of the webhook. Without the body of code above, this will not be an option, and it cannot be created first as I originally assumed.

Once the parameter passing has been dealt with, the code block for connecting to the SFTP site and performing the task is straightforward.

# Get variable values
$userName = Get-AutomationVariable -Name 'SftpUserName'
$userPassword = Get-AutomationVariable -Name 'SftpPassword'
$hostName = Get-AutomationVariable -Name 'SftpHost'
$port = Get-AutomationVariable -Name 'SftpPort'

# Create PS credential from username/password
$userPassword = ConvertTo-SecureString -String $userPassword -AsPlainText -Force
$userCredential = New-Object -TypeName System.Management.Automation.PSCredential ($userName, $userPassword)

# Create new SFTP session
$session = New-SFTPSession -ComputerName $hostName -Credential $userCredential -Port $port -AcceptKey

# Retrieve child objects on remote path and rename every zip/txt to .bk
$remotePath = "/Data/" + $folderName + "/"
Get-SFTPChildItem $session $remotePath | ForEach-Object {
    if ($_.Fullname -like '*.zip' -Or $_.Fullname -like "*.txt") {
        $NewName = $_.Name + ".bk"
        Rename-SFTPFile $session -Path $_.FullName -NewName $NewName
        Write-Output "$($_.FullName) has been renamed to $($NewName)"
    }
}

The first task is to extract the variables from the Automation resource. These should not be hard-coded into the script, but stored as variables external to the runbook. Then a PSCredential requires configuration in the format above. Finally, the task to archive the files is performed. This involves creating a new SFTP session and iterating over the child items within the remote path. For each file it finds, it will then perform the archiving process as long as it has the correct extension.

While I originally wanted to move the files between folders, I realised this was not possible with the modules I was using. I'm sure with a bit more research something would be available out there to achieve this, specifically changing the path of a remote file. Therefore I was limited to changing the name of a remote file by adding a .bk to the end of the filename. ADF would then only pass over files ending in .zip the next time round.

Conclusion

I'm hoping this will be useful for anyone in the same situation. I've specifically used the webhook parameters multiple times now. While I didn't achieve my original goal, with a small workaround I still satisfied my original requirement to archive the files from future process runs. There are also some other code snippets here that might be good reference for future implementations, specifically thinking around the PSCredential creation. Hope it helps!
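For completeness, the ADF side of this is just a Web activity POSTing a small JSON body to the webhook URL. A hedged sketch of how that might look (the activity name, webhook URL and body values are placeholders):

{
    "name": "Archive SFTP Files",
    "type": "WebActivity",
    "typeProperties": {
        "url": "https://<automation-webhook-url>",
        "method": "POST",
        "headers": { "Content-Type": "application/json" },
        "body": "{\"FolderName\":\"Sales\"}"
    }
}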

The thing that stumps everyone new to Azure Data Factory

Recently I was playing around with Azure Data Factory and was making my way through an online tutorial when I came across a frustrating error message that was not very helpful – and my data wasn't moving anywhere! The aim of this exercise was to move some text files in Azure Data Lake Store over to a table in an Azure SQL Database. Sounds easy enough!

The data within the Data Lake Store was organised into a Year and Month hierarchy for the folders, and each day's transactions were stored in a file named after the day, within the relevant month folder. The task was to create a pipeline which copies the dataset in the Data Lake Store over to the dbo.Orders table in Azure SQL DB every day within the scheduled period (Q1 2016).

After creating all the JSON scripts and deploying them (with no errors), I clicked on the 'Monitor and Manage' tile to monitor the activities, check everything was working as it should be and monitor the progress. After waiting for at least 10 minutes, I started to get frustrated. As you can see, all the pipeline activities for each slice have a status of 'Waiting: Dataset dependencies' and, if you look at the Activity Window Explorer (see below), the activity window for each day shows a status of Waiting (a solid orange square).

The files existed in the correct folders within my Data Lake Store and there were no errors during deployment, so what was it waiting for?! Well, it turns out we need to set an additional property on the input dataset to let Data Factory know that the data is being generated externally and not internally within Data Factory by another pipeline. That property is "external": true at the bottom of the script. Below is an example of a data factory where the arrows point to the datasets that need to have this external property set to true. Using the diagram of the pipeline once deployed, it is easy to identify which datasets need to be external, as they are not linked to anything upstream.
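As the screenshots don't translate to text, here is a minimal sketch of an ADF v1 input dataset with the property in place. The names, path and schedule are illustrative; the important line is "external": true:

{
    "name": "InputDataset-DataLakeOrders",
    "properties": {
        "type": "AzureDataLakeStore",
        "linkedServiceName": "AzureDataLakeStoreLinkedService",
        "typeProperties": {
            "folderPath": "orders/2016/01/",
            "format": { "type": "TextFormat", "columnDelimiter": "," }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}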

Using Lookup, Execute Pipeline and For Each Activity in Azure Data Factory V2

In my previous blog I looked at how we can utilise pipeline parameters to variablise certain aspects of a dataset and avoid duplication of work. In this blog I will take that one step further and use parameters to feed into a ForEach activity that can iterate over a data object and perform an action for each item. This blog assumes some prior knowledge of Azure Data Factory, and it won't hurt to read my previous blog.

Previously I showed how to use a parameter file to copy a single table from Azure SQL DB down into a blob. Now let's use the ForEach activity to fetch every table in the database from a single pipeline run. The benefit of doing this is that we don't have to create any more linked services or datasets; we are only going to create one more pipeline that will contain the loop. The big picture here looks like the below.

The first activity to note is the Lookup activity. This can go off and fetch a value from either SQL or JSON based sources and then incorporate that value into activities further down the chain. Here we are using SQL, and you can see that we have supplied a SQL query that will fetch the schema and table names of our database. One "gotcha" is that even though you supply a SQL query, you still need to provide a dummy table name in the SQL dataset. It will use the query above at run time but won't pass deployment without a table name. Also note that at this point, we do nothing with the returned value.

Next, we have the Execute Pipeline activity, which can accept input parameters and pass those down into the executed pipelines (or child pipelines as per the diagram). Within the type properties we can specify the parameters we want to pass in. The names here need to match whatever parameters we specify in the child pipeline, but for the "value" we can make use of the new expression language to get hold of the output of the previous Lookup activity. We then reference the pipeline we want to execute, and specify that we need to wait for it to complete before continuing with our parent pipeline. Finally, we use the "dependsOn" attribute to ensure that our Execute Pipeline activity occurs AFTER our lookup has completed successfully.

At this point we have told the child pipeline which tables to copy and then told it to start. Our child pipeline now just needs to iterate over that list and produce our output files. To do this it only needs one activity, which is the ForEach. The ForEach really has two components: the outer configurables (such as the items to iterate over) and the inner activity to perform on each item. The outer section looks like this:

Here we can configure the "isSequential" property which, when set to "false", allows Data Factory to parallelise the inner activity; otherwise it will run each iteration one after another. The other property is "items", which is what the activity will iterate through. Because we fed the table list into the "tableList" parameter from the Execute Pipeline activity, we can specify that as our list of items.

Now for the inner activity: whilst this is a fairly chunky bit of JSON, those familiar with the copy activity in ADF V1 will probably feel pretty comfortable with it. The key difference is that we are again making use of expressions and parameters to make our template generic. You can see in the "output" attribute that we are dynamically specifying the output blob name by using the schema and table name properties gleaned from our input data set.
Also, in the source attribute we dynamically build our SQL query to select all the data from the table that we are currently on, using the @item() property. This method of combining text and parameters is called string interpolation and allows us to easily mix static and dynamic content without the need for additional functions or syntax.

That's it! By making use of only a few extra activities we can very easily do tasks that would have taken much longer in previous versions of ADF. You can find the full collection of JSON objects using this link: http://bit.ly/2zwZFLB. Watch this space for the next blog, which will look at custom logging of Data Factory activities!
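Since the screenshots of the JSON don't carry over here, below is a hedged sketch of how the child pipeline's ForEach and inner copy activity might hang together. It assumes the lookup query returned TABLE_SCHEMA and TABLE_NAME columns (as an INFORMATION_SCHEMA query typically would), and the dataset names are placeholders; the link above has the full working version:

{
    "name": "CopyEveryTable",
    "properties": {
        "parameters": {
            "tableList": { "type": "Array" }
        },
        "activities": [
            {
                "name": "IterateTables",
                "type": "ForEach",
                "typeProperties": {
                    "isSequential": false,
                    "items": {
                        "value": "@pipeline().parameters.tableList",
                        "type": "Expression"
                    },
                    "activities": [
                        {
                            "name": "CopyTableToBlob",
                            "type": "Copy",
                            "inputs": [
                                { "referenceName": "AzureSqlTableDataset", "type": "DatasetReference" }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "BlobOutputDataset",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "fileName": "@concat(item().TABLE_SCHEMA, '_', item().TABLE_NAME, '.txt')"
                                    }
                                }
                            ],
                            "typeProperties": {
                                "source": {
                                    "type": "SqlSource",
                                    "sqlReaderQuery": "SELECT * FROM [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]"
                                },
                                "sink": { "type": "BlobSink" }
                            }
                        }
                    ]
                }
            }
        ]
    }
}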

Pipeline Parameters in Azure Data Factory V2

The second release of Azure Data Factory (ADF) includes several new features that vastly improve the quality of the service. One of these is the ability to pass parameters down the pipeline into datasets. In this blog I will show how we can use parameters to manipulate a generic pipeline structure in order to copy a SQL table into a blob. Whilst this is a new feature for ADF, this blog assumes some prior knowledge of ADF (V1 or V2).

Creating the Data Factory V2

One other major difference with ADF V2 is that we no longer need to chain datasets and pipelines to create a working solution. Whilst the concept of dependency still exists, it is no longer needed to run a pipeline; we can now just run them ad hoc. This is important for this demo because we want to run the job once, check it succeeds and then move on to the next table, instead of managing any data slices. You can create a new V2 data factory either from the portal or using this command in PowerShell:

$df = Set-AzureRmDataFactoryV2 -ResourceGroupName <resource group> -Location <location> -Name <data factory name>

If you are familiar with the PowerShell cmdlets for ADF V1 then you can make use of nearly all of them in V2 by appending "V2" to the end of the cmdlet name.

ADF V2 Object Templates

Now that we have a Data Factory to work with, we can start deploying objects. In order to make use of parameters we need to firstly specify how we will receive the value of each parameter. Currently there are two ways:

1. Via a parameter file specified when you invoke the pipeline
2. Using a Lookup activity to obtain a value and pass that into a parameter

This blog will focus on the simpler method using a parameter file; later blogs will demonstrate the use of the Lookup activity. When we invoke our pipeline, I will show how to reference that file, but for now we know we are working with a single parameter called "tableName".

Now we can move on to our pipeline definition, which is where the parameter values are initially received. To do this we need to add an attribute called "parameters" to the definition file that will contain all the parameters used within the pipeline. See below. The same concept needs to be carried through to the dataset that we want to feed the parameters into: within the dataset definition we need to have the parameter attribute specified in the same way as in the pipeline.

Now that we have declared the parameters on the necessary objects, I can show you how to pass data into a parameter. As mentioned before, the pipeline parameter will be populated by the parameter file, however the dataset parameter will need to be populated from within the pipeline. Instead of simply referring to a dataset by name as in ADF V1, we now need the ability to supply more data, and so the "inputs" and "outputs" section of our pipeline now looks like the below. Firstly, we declare the reference type, hence "DatasetReference". We then give the reference name (this could be parameterised if needed). Finally, for each parameter in our dataset (in this case there is only one, called "tableName") we supply the corresponding value from the pipeline's parameter set. We can get the value of the pipeline parameter using the "@pipeline().parameters.<parameter name>" syntax.

At this point we have received the values from a file and passed them through the pipeline into a dataset, and now it is time to use that value to manipulate the behaviour of the dataset.
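Pulling those pieces together, here is a hedged sketch of the pipeline side: the parameter declaration and a copy activity whose output dataset reference passes the value on. The activity and dataset names are placeholders, and the full JSON is linked at the end of the post:

{
    "name": "CopyTableToBlob",
    "properties": {
        "parameters": {
            "tableName": { "type": "String" }
        },
        "activities": [
            {
                "name": "CopySqlToBlob",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "AzureSqlTableDataset", "type": "DatasetReference" }
                ],
                "outputs": [
                    {
                        "referenceName": "BlobOutputDataset",
                        "type": "DatasetReference",
                        "parameters": {
                            "tableName": "@pipeline().parameters.tableName"
                        }
                    }
                ],
                "typeProperties": {
                    "source": { "type": "SqlSource" },
                    "sink": { "type": "BlobSink" }
                }
            }
        ]
    }
}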
Because we are using the parameter to define our file name, we can use its value as part of the "fileName" attribute, see below. Now we have the ability to input a table name, and our pipeline will fetch that table from the database and copy it into a blob. Perhaps a diagram to help provide the big picture:

Now we have a complete working pipeline that is totally generic, meaning we can change the parameters we feed in but should never have to change the JSON definition files. A pipeline such as this could have many uses, but in the next blog I will show how we can use a ForEach loop (another ADF V2 feature) to copy every table from a database, still only using a single pipeline and some parameters.

P.S. Use this link to see the entire JSON script used to create all the objects required for this blog: http://bit.ly/2zwZFLB
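And the dataset side of the same sketch: the blob dataset declares its own "tableName" parameter and uses it to build the file name. Again, names and paths are placeholders and the linked JSON above is the authoritative version:

{
    "name": "BlobOutputDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "BlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "tableName": { "type": "String" }
        },
        "typeProperties": {
            "folderPath": "output",
            "fileName": {
                "value": "@concat(dataset().tableName, '.txt')",
                "type": "Expression"
            },
            "format": { "type": "TextFormat", "columnDelimiter": "," }
        }
    }
}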

Automating The Deployment of Azure Data Factory Custom Activities

Custom Activities in Azure Data Factory (ADF) are a great way to extend the capabilities of ADF by utilising C# functionality. Custom Activities are useful if you need to move data to/from a data store that ADF does not support, or to transform/process data in a way that isn't supported by Data Factory, as they can be used within an ADF pipeline.

Deploying Custom Activities to ADF is a manual process which requires many steps. Microsoft's documentation lists them as:

1. Compile the project. Click Build from the menu and click Build Solution.
2. Launch Windows Explorer, and navigate to the bin\debug or bin\release folder depending on the type of build.
3. Create a zip file MyDotNetActivity.zip that contains all the binaries in the \bin\Debug folder. Include the MyDotNetActivity.pdb file so that you get additional details, such as the line number in the source code that caused the issue, if there is a failure.
4. Create a blob container named customactivitycontainer if it does not already exist.
5. Upload MyDotNetActivity.zip as a blob to the customactivitycontainer in a general purpose Azure blob storage account that is referred to by AzureStorageLinkedService.

The number of steps means that it can take some time to deploy Custom Activities and, because it is a manual process, can contain errors such as missing files or uploading to the wrong storage account. To avoid the errors and delays caused by a manual deployment, we want to automate as much as possible. Thanks to PowerShell, it's possible to automate the entire set of deployment steps. The script to do this is as follows:

Login-AzureRmAccount

# Parameters
$SourceCodePath = "C:\PathToCustomActivitiesProject\"
$ProjectFile = "CustomActivities.csproj"
$Configuration = "Debug"

# Azure parameters
$StorageAccountName = "storageaccountname"
$ResourceGroupName = "resourcegroupname"
$ContainerName = "blobcontainername"

# Local Variables
$MsBuild = "C:\Program Files (x86)\MSBuild\14.0\Bin\MSBuild.exe"
$SlnFilePath = $SourceCodePath + $ProjectFile

# Prepare the Args for the actual build
$BuildArgs = @{
    FilePath = $MsBuild
    ArgumentList = $SlnFilePath, "/t:rebuild", ("/p:Configuration=" + $Configuration), "/v:minimal"
    Wait = $true
}

# Start the build
Start-Process @BuildArgs

# Initiate a sleep to avoid zipping up a half built project
Sleep 5

# Create zip file
$zipfilename = ($ProjectFile -replace ".csproj", "") + ".zip"
$source = $SourceCodePath + "bin\" + $Configuration
$destination = $SourceCodePath + $zipfilename
if (Test-Path $destination) {Remove-Item $destination}
Add-Type -assembly "system.io.compression.filesystem"
[io.compression.zipfile]::CreateFromDirectory($source, $destination)

# Create storage account if not exists
$storageAccount = Get-AzureRmStorageAccount -ErrorAction Stop | Where-Object {$_.StorageAccountName -eq $StorageAccountName}
if (!$storageAccount) {
    $StorageLocation = (Get-AzureRmResourceGroup -ResourceGroupName $ResourceGroupName).Location
    $StorageType = "Standard_LRS"
    New-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName -Location $StorageLocation -Type $StorageType
}

# Create container if not exists
$ContainerObject = Get-AzureStorageContainer -ErrorAction Stop | Where-Object {$_.Name -eq $ContainerName}
if (!$ContainerObject) {
    $storagekey = Get-AzureRmStorageAccountKey -ResourceGroupName $ResourceGroupName -Name $StorageAccountName
    $context = New-AzureStorageContext -StorageAccountName $StorageAccountName -StorageAccountKey $storagekey.Key1 -Protocol Http
    New-AzureStorageContainer -Name $ContainerName -Permission Blob -Context $context
}

# Upload to blob
# Set default context
Set-AzureRmCurrentStorageAccount -StorageAccountName $StorageAccountName -ResourceGroupName $ResourceGroupName
Get-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName

# Upload file
Set-AzureStorageBlobContent -Container $ContainerName -File $destination

By removing the manual steps in building, zipping and deploying ADF Custom Activities, you remove the risk of something going wrong and gain the reassurance that you have a consistent method of deployment, which will hopefully speed up your overall development and deployments.

As always, if you have any questions or comments, do let me know.
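As a reminder of where the uploaded package ends up being used, here is a hedged sketch of an ADF v1 pipeline whose DotNetActivity references the zip via AzureStorageLinkedService. The assembly, class, dataset and linked service names are illustrative only:

{
    "name": "RunCustomActivityPipeline",
    "properties": {
        "activities": [
            {
                "name": "MyDotNetActivity",
                "type": "DotNetActivity",
                "inputs": [ { "name": "InputDataset" } ],
                "outputs": [ { "name": "OutputDataset" } ],
                "linkedServiceName": "AzureBatchLinkedService",
                "typeProperties": {
                    "assemblyName": "MyDotNetActivity.dll",
                    "entryPoint": "MyDotNetActivityNS.MyDotNetActivity",
                    "packageLinkedService": "AzureStorageLinkedService",
                    "packageFile": "customactivitycontainer/MyDotNetActivity.zip"
                },
                "policy": {
                    "timeout": "01:00:00",
                    "retry": 3
                }
            }
        ],
        "start": "2017-01-01T00:00:00Z",
        "end": "2017-01-02T00:00:00Z"
    }
}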