Terry McCann

Terry McCann's Blog

How to prepare for 70-766 - Perform Big Data Engineering on Microsoft Cloud Services


There is a new exam currently in beta titled "Perform Big Data Engineering on Microsoft Cloud Services (beta)". With all new exams there is little content on how to revise for the exam beyond the exams summary. This exam however, is what Adatis specialises in! Microsoft may call this "Big Data Engineering" we call it "Modern Data Analytics" and we have a few blogs on the subject. You can sign up to the exam here: https://www.microsoft.com/en-us/learning/exam-70-776.aspx

Below you will find links to blog posts by Adatis consultants on topics related to all the key objectives of this exam. I will endeavour to keep this up-to-date with new content added by the team. Good luck with the exam.

Design and Implement Complex Event Processing By Using Azure Stream Analytics (15-20%)
Streaming data is vital to achieving real-time analytics. The following blogs posts focus on this and offer an introduction and walkthrough for getting started with Stream Analytics. When talking about a wider Lambda approach to Big Data, streaming enables rapid processing via a “Speed” layer.   


Design and Implement Analytics by Using Azure Data Lake (25-30%)
Azure Data Lake Store and Analytics are a vital component of the “Modern Data Analytics”. Data which is too large for traditional single server processing needs distributed parallel computation. Rather than pulling data and processing ADLA pushes the processing to the data. Understanding how to process large volumes of data is one part of the “Batch” layer in Lambda


Design and Implement Azure SQL Data Warehouse Solutions (15-20%)
Either as an alternative or in accompaniment to Data Lake is Azure SQL Data Warehouse. If Data Lake is batch across many files, Azure SQLDW is parallel batch over many databases. The key to both services is processing at the storage and not at the compute. The following is an on-going blog series covering the basics all the way to a  deep-dive.  


Design and Implement Cloud-Based Integration by using Azure Data Factory (15-20%)
If you’re looking for a paas solution to move data in Azure, there is only really one option. Azure Data Factory. The following blogs will get you up-to-speed with ADF.


Manage and Maintain Azure SQL Data Warehouse, Azure Data Lake, Azure Data Factory, and Azure Stream Analytics (20-25%)
Know each of the parts is only half the battle, you need to know how, when and why to use each part. What are the best practices?


Azure Data Factory, using the Copy Data task to migrate data from on premise SQL Server to Blob storage.


    In the first blog in this series I talked about working through a use case. Over the next 3 blogs we will look at 3 different methods for migrating data to Azure Blob storage.

    1. Using the Azure Data Factory Copy Data Wizard. (on table)
    2. Using BIML and SSIS (entire database - SSIS)
    3. Using Azure Data Factory and PowerShell (entire database - ADF)

    The reason I have included the latter 2 versions is because if you just want to load an entire database in the blob storage it can be quicker to use one of these methods as a one off or on a scheduled basis. Hand writing all the JSON required for each move table from on premise to blob storage is very time consuming. Depending on whether you need to do a one off upload or something on a schedule options 2 and 3 might help.

    Our original use case from an introduction to Azure Data Factory:

    Let's imagine a process where you have an on premise SQL server box, you want to move multiple tables to blob storage, from there you then want to issue a stored procedure which will consume that data in to an Azure SQL data warehouse via PolyBase - As illustrated in the image below.


    Linked services:
    On-premise SQL database
    Azure BLOB storage
    Azure SQL data warehouse

    Table in On-premise SQL database
    The blob container
    The stored procedure

    Pipeline to move data from SQL database to blob storage
    Pipeline to issue stored procedure

    In the blog we will tackle the first part:


    Copying data:
    We are going to start looking in a bit more in detail at the Azure Data Factories (ADF) copy data task (CD). CD is still in preview (at the time of writing [01/2007]). Prior to the inclusion of the copy data wizard, you had to manually configure ADF artefacts and write most of the JSON for linked services, datasets and pipeline by hand. The copy data task is a wizard for generating a data movement activity pipeline, complete with datasets and linked services.

    To get started connect to azure and navigate to your existing Azure data factory (if you do not have an existing ADF you can follow how to create one here http://blogs.adatis.co.uk/Terry%20McCann/post/Setting-up-your-first-Azure-Data-Factory).


    To begin setting up a copy data pipeline click on the "copy data (PREVIEW)" option in the ADF panel in Azure. Once you have selected "Copy data" you will be taken to the new ADF interface. Datafactory.azure.com enables the creation and monitoring of ADF pipelines.

    The general process for creating an ADF pipeline process (prior to the copy data task) was

    1. Create an ADF
    2. Create a linked service/s
    3. Create a gateway as needed
    4. Create you input and output datasets
    5. Create a pipeline
    6. Monitor the pipeline

    When using the ADF copy data the process is slightly flipped

    1. Create an ADF
    2. Configure the properties for the pipeline
    3. Create a gateway
    4. Configure the linked service/s
    5. Configure the datasets
    6. Deploy all configurations
    7. Monitor the pipeline.

    The main difference here is that you do not deploy anything until it has all been configured, you have the added advantage that it is somewhat cumbersome to do this manually. At present, the copy data task is very narrow in its functionality. If your intention is to build a more advanced pipeline will either need to generate a move task and tweak it or create it all manually.

    Copy data has many short comings, for our example the most prevalent is that a movement to blob storage only has the option to sink data to a folder and not multiple folders. Option 3 in our list of methods for migration aims to get around this limitation using PowerShell

    Configure Pipeline properties:
    Once you have selected "copy data" you will be launched in to datafactory.azure.com, the new fresher looking environment. The copy data task is a 4 stop process which will guide you through the creation of a data movement pipeline (I want to highlight that this is only used for data movement and not transformation). This is a great way to get you started with ADF without having to understand the json or trickier elements such as data slices and scheduling, although we will touch on scheduling as it is quite tricky.


    (image 1 - Configure Properties)

    The first screen is you will see are the properties of the pipeline you're creating. It is here you will configure the frequency and schedule of the pipeline. A pipeline is a group of logically related activities.

    1. Task name - This is important and will be used as a prefix for the names of datasets and data stores.
    2. Task description -
    3. Task schedule - See below for a more in depth analysis.
    4. Start time - This date is in UTC / End time - This data is also in UTC

    For quick conversions to your time zone, I recommend worldtimebuddy (http://www.worldtimebuddy.com/)

    More on schedules:
    The Microsoft page about scheduling is incredibly deep and takes a lot of effort to fully digest and understand. I will attempt to impart my understanding of pipeline scheduling in to a brief list of key points. You can read more here https://docs.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution I would recommend that you do read this site as there is a lot of good examples. The key points from this document are:

    • A schedule can be one-off or between dates
    • It can run on a scheduled basis Minute, hourly, daily or weekly. Every 15 minutes is the minimum.
      This forms what is known as a tumbling window. Microsoft defines a tumbling window as "A series of fixed-size, non-overlapping, contiguous time intervals". Tumbling windows are also known as activity windows.
    • A pipeline schedule's interval needs to be the same as a datasets availability - However, it does not need to run at the same time.

    For our example we will use frequency of "Daily" and an interval of "1", this will run our pipeline every day. To write this in JSON and not through the wizard you would use the following JSON as part of your pipeline.

    "scheduler": {
    "frequency": "Daily",
    "interval": 1

    To create a pipeline which will run undefinably you can set the end date time to "12/31/2099 12:00am" which while this is not infinite, the date will outlive ADF. Start date time will default to the time you have created the pipeline (n.b. these dates are expressed in US format MM/DD/YYYY).

    Creating linked services (Source data store):
    The next screen is the configuration of the linked sources. ADF is still new and the terminology is somewhat confusing. Depending on tool you're using and sometimes the screen you're looking at, ADF will mix up the names for key parts, anywhere you see the term "data store" assume it is referring to a linked service.


    For our example we will use the SQL Server option (bottom right of the above image).


    (Image - SQL Server linked service configuration screen)

    You should now be able to configure the connection details (via a gateway) to you SQL server database.

    1. Connection Name - You can refer to the Adatis naming standard as a reference
      LS - Linked service
      MSQL - Microsoft SQL Server
      Person - Table being exported.
    2. Gateway - Select an existing or create a new gateway (see below)
    3. Server name - For my example I am using my local server with the default instance "." local would also work. If you're connecting to a named instance this will need to be server\InstanceName
    4. Database Name - Which database you want to connect to
    5. Credential encryption - You have the option to save a credential in Azure or use authentication through the browser.
      For simplicity I am using the latter. For production, use the former.
    6. Authentication type - How to connect to SQL Server, Windows or SQL login.
    7. User name
    8. Password


    Creating and configuring a gateway:
    In our example we will be connecting to a local version of SQL Server, to connect and read data we will need to create an ADF gateway connection and also install our gateway on the server which has our database (or at least connection to that database). You have a few options to create the gateway, but before you can configure any of these you will need to download and install the gateway. You can find the latest version of the gateway here https://www.microsoft.com/en-gb/download/details.aspx?id=39717. Once installed the gateway will be waiting for an access key.


    (image - Microsoft gateway - Awaiting gateway key)

    We have 3 options to create an ADF gateway and obtain the key the gateway is expecting.

    1. Copy Data configuration page (click create gateway)
      This will build the gateway and add the name to your pipeline. You will need to take the access key it generates and add that to you installed gateway.
    2. Add via author and deploy
      Navigate to author and deploy on the main panel of ADF in Azure.
      Click on "...More" and select New Data gateway, configure and deploy. This will return a key. Add the key to the gateway on your server.
    3. Via PowerShell
      Open a new PowerShell prompt and connect to Azure (Login-AzureRmAccount)
      Replace the following with your Azure details - $ResourceGroup, $DataFactoryName and $Gateway
      New-AzureRmDataFactoryGateway -ResourceGroupName $ResourceGroup -Name $Gateway -DataFactoryName $DataFactoryName -Description $Gateway
      This will return a key. Add the key to the gateway on your server.



    (image - A registered gateway)


    (Image - main screen on a registered gateway)

    Configuring linked services:
    Select next to choose which table/s you wish to move.


    (Image - ADF Copy - Select tables)

    You can select one or more tables here. For our example we will be consuming the data using PolyBase. We want our data to sink to its own container in Azure. As such we cannot move multiple tables at once (at the time of writing this is limited to one sink container).


    (Image - ADF data filter)

    You will next be asked how you want to filter the data. Each time our data runs we are looking to move the whole table. If we were looking to do incremental loads, we could select a column which indicates which rows to import each hour. For our example select Filter: None. Next


    (image - ADF destination Source)

    Configuring the destination source:
    On the next screen you will see the list of available sinks (where you can insert data). You will notice the list of sinks is far smaller than the list of sources - at present not all sources can be sinks.

    For our example select Azure Blob storage


    1. Connection Name - Based on Adatis ADF naming standards http://blogs.adatis.co.uk/Terry%20McCann/post/Azure-Data-Factory-Suggested-naming-conventions-and-best-practices
      1. LS_ Linked Service
      2. ABLB_ Blob storage
      3. Person - blob container data will sink to
    2. Account selection method - Manual/Azure list
    3. Azure subscription - Select you subscription
    4. Storage account name - Select your storage account


    (Image - Selecting a blob container)

    Select a folder for the file to sink to. I created a folder ahead of time called person.


    (Image - ADF file format configuration screen)

    Customise you output settings. For now we will just select the defaults to create a CSV.


    Select finish to build your pipeline.


    (image - ADF deployment completed screen)

    As long as everything has worked you will see the screen above. Congratulations your pipeline has been deployed. To monitor the pipeline and see what has been created select the link "click here to monitor your pipeline". You will be taken to a different screen in the ADF portal. We will have more on how to monitor ADF shortly.


    Image - (ADF pipeline in datafactory.azure.com)

    You can check data has moved successfully using Azure storage explorer. ASE is a great utility for browsing files in blob storage. You can download ASE here http://storageexplorer.com/


    (image - Storage explorer)

    I can see that my file is there and is populated as intended. Once a further 24 hours has passed this file will be over written.

    So we have seen what we can do with the Copy data task in Azure. While it is fantastic at basic data movement functions, copy data does not offer much beyond that. I have listed the following pains and shortfalls which exit in ADF copy data at present .

    Limitations of the copy data wizard:
    There are quite a few limitations, some of these are nice to have, others are show stoppers.

    • The CD action is limited to only a subset of the pipeline activities. As the name suggests you can only copy data, or move data. There is no transformation wizard.
    • The menus are very temperamental and regularly do not work
    • You cannot name a dataset - InputDataset-8tl was created in my example. This is not helpful
    • The name of the pipeline is also not helpful.
    • You cannot chain multiple activities together
    • Each pipeline needs to created separately.
    • You can only sink datasets to one blob container.

    Now that we have our data in blob storage we can begin to look at the rest of our solution, where we will create an Azure SQL Data Warehouse, with external PolyBase tables. We will use stored procedures to persist the external tables in to ASDW.

    In the next blog we will look at moving an entire database to Azure blob storage using SSIS and BIML.



What is Azure Data Factory?


(Image 1 - Azure Data Factory – Transformation Pipeline)


Azure data factory (ADF) is a big data processing platform from Microsoft on the Azure platform. For database developers, the obvious comparison is with Microsoft's SQL Server integration services (SSIS). SSIS is an ETL tool (extract data, transform it and load), ADF is not an ETL tool. ADF is more akin to ELT frameworks (Extract-Load-Transform), while the terms are similar, the process is very different. For those who have worked in the data warehouse arena for a number of years, you will be very familiar with ETL and switching to ELT can feel somewhat alien and cumbersome. As such, making obvious comparisons to SSIS will only hinder your understanding of the core concepts of ADF. ADF is not a cloud version of SSIS as many would have you believe, it is both more and less.

On the topic of ETL, Dan Linstedt (The father of Data Vault), published an interesting article "ETL is Dead! Long Live ELT" (http://danlinstedt.com/allposts/datavaultcat/etl-is-dead-long-live-etl/). In this, Linstedt argues that ETL is now defunct for a number of reasons, the primary being big data. To enable fast processing of big data, we have seen a shift to parallel processing with tools such has Hadoop's HDFS, Cassandra, Spark and on the Microsoft side, Data Lake and Azure Data warehouse. By having our data spread across many nodes we can push the processing to the data via a MapReduce job. ADF is designed to be a big data processing pipeline, so it makes sense to process where the data lives rather than extracting all our data, transforming it and then loading it in one processes. We instead ingest the data and load it in to a parallel storage engine (HDFS) and then transform. If this is the first time you're reading about ELT, I recommend both Linstedt's article where Linstedt states "ETL truly is DEAD. Metadata & Lineage are NOT dead, ELT is the future" and also James Serra's blog “The difference between ETL and ELT” (http://www.jamesserra.com/archive/2012/01/difference-between-etl-and-elt/).

ADF is often described as an "Orchestration" tool (https://azure.microsoft.com/en-gb/services/data-factory/) and this is because of its two functions. Being ELT and not ETL, the two main functions might not surprise you. They are moving data and transforming it. When I say transforming data you might assume ADF does the transformation, however the only action ADF actually preforms is to issue a command for a transformation to happen (Issue an EXEC for a stored procedure or a MapReduce job for example). It is this issuing of commands which makes the tool an orchestration and not a multipurpose ETL tool such as SSIS.

The data processing landscape is change, while ADF is still a relatively new tool, a lot of development is ongoing and improvements are being made. ADF is available through the Azure Portal and using the ADF add-in for visual studio (https://azure.microsoft.com/en-us/blog/azure-data-factory-visual-studio-extension-for-authoring-pipelines/).

Now we know what ADF does, let look at how it does it. ADF has 3 main components, linked services, datasets and pipelines.

Core Concepts:


(Image 2 - High level ADF)

The diagram above is showing how ADF all slots together. If we follow this starting with a linked service we can explore this further. A linked service is our connection to some form of resource. This could be an on premise SQL database, a blob storage container, an azure SQL data warehouse or something else. It is here that we want something to happen. That could be extracting data, sinking data (an ADF term for storing data) or transforming data using stored procedures, MapReduce etc. A dataset is on which part of a linked service you want an activity to be performed (a table, view, stored procedure). A dataset is not a database, but rather a table or a container in blob storage. A linked service can have one or many datasets. An activity is performed by a pipeline which will perform an action/actions on a dataset/datasets.


Linked services

A linked services is best thought of as a logical data connection, allowing ADF to connect to an external resource. A linked services acts as a representation of a data store or a compute resource.
A data store is used as part of a data movement activity. A compute resource is executed as part data transformation exercise.
Click here for more information on the creation and different sources you can use for linked services.


A dataset represents a data structure in ADF, each belonging to a linked service. A linked service might be an Azure Blob storage account or an Azure SQL database, the dataset is a representation of the data, a folder of files in blob storage or a table in an Azure database, not the database itself.
For a full list of sources and sinks you see the list of links below.


Pipelines offer the power for ADF. There are 2 types of pipeline which support our notion on ELT.

Data movement pipeline - The Data movement pipeline does what its name implies. It moves data from one dataset to another. When referring to Azure documentation these two datasets will be referred to as the source and the sink. In a data movement activity, data can be moved from any source to any sink. For example, you can move data from Azure Blob storage in to a SQL database or from HDFS to an Azure SQL Data warehouse. This part handles the E and the L of our ETL. Data is extracted and loaded. The only difference to SSIS, is that at this point there is no T. No transformation has happened at this point.

Data transformation pipeline - The data transformation activities again do as the name implies - Transform. As ADF is an orchestration tool, a transformation could be a MapReduce job, a streaming job, a stored proc or a data lake U-SQL query. To fully use ADF you will need to know how you're storing data (ADFS/Data lake) and then what language you want to use to process the data. Knowing how ADF functions is only half the battle.

In relationship terms, you can think of the interconnection of linked services, pipelines and datasets as follows: A listed service has one or more datasets (A SQL database with many tables), a pipeline performs an action on one or more datasets.

An Example:

Let's imagine a process where you have an on premise SQL server box, you want to move multiple tables to blob storage, from there you then want to issue a stored procedure which will consume that data in to an Azure SQL data warehouse via PolyBase - As illustrated in the image below.


(Image 3 - Example Process)

To complete this we would need to create the following artefacts in ADF:

Linked services:
On-premise SQL database
Azure BLOB storage
Azure SQL data warehouse

Table in On-premise SQL database
The blob container
The stored procedure

Pipeline to move data from SQL database to blob storage
Pipeline to issue stored procedure

In later blogs we will look at creating this example in full, however if you want to get started now, the links at the bottom of the page will guide you through setting up ADF.

How to get started:

There are 3 ways to develop for ADF.

  1. Azure portal - Semi-rich IDE. All in the Azure portal.
  2. Visual studio with ADF add-in - Richer development environment. Allows for source control of code through TFS and automated deployment of changes. This is required as the Azure portal can sometime get caught in a syntax loop where nothing will deploy. VS will remove and deploy artefacts which makes this a lot easier. You will sometime have to delete your ADF and start again. If your code is in VS all you need to do is deploy.
  3. PowerShell - Anything you can in the portal you can do in PowerShell.

My development preference is 2 and 3. I use Visual studio for all my code which is source controlled with TFS, then I use PowerShell to build my ADF environment.

Thanks for reading this introduction to ADF. I hope you have found it somewhat insightful. We have a whole series of blogs looking in depth at different aspects and design patterns for loading a data warehouse with Azure Data Factory. You can read more in our next blog looking at setting up your first Azure Data Factory.


Learning path:  https://azure.microsoft.com/en-gb/documentation/learning-paths/data-factory/

Documentation:  https://docs.microsoft.com/en-us/azure/data-factory/

Introduction to ADF [YouTube]: https://www.youtube.com/watch?v=_hOYAT69yIw

ADF overview [Channel9] https://channel9.msdn.com/Blogs/Azure/Azure-Data-Factory-Overview