Terry McCann

Terry McCann's Blog

How to prepare for 70-776 - Perform Big Data Engineering on Microsoft Cloud Services


There is a new exam currently in beta titled "Perform Big Data Engineering on Microsoft Cloud Services (beta)". As with all new exams, there is little content on how to revise beyond the exam summary. This exam, however, covers what Adatis specialises in! Microsoft may call it "Big Data Engineering"; we call it "Modern Data Analytics", and we have a few blogs on the subject. You can sign up to the exam here: https://www.microsoft.com/en-us/learning/exam-70-776.aspx

Below you will find links to blog posts by Adatis consultants on topics related to all the key objectives of this exam. I will endeavour to keep this up-to-date with new content added by the team. Good luck with the exam.

Design and Implement Complex Event Processing By Using Azure Stream Analytics (15-20%)
Streaming data is vital to achieving real-time analytics. The following blog posts focus on this and offer an introduction and walkthrough for getting started with Stream Analytics. In a wider Lambda approach to Big Data, streaming enables rapid processing via a "Speed" layer.

http://blogs.adatis.co.uk/simonwhiteley/post/Adatis-Hackathon-Jan-2015-Streaming-Analytics-First-Thoughts
http://blogs.adatis.co.uk/Jose%20Mendes/post/IoT-Hub-Device-Explorer-Stream-Analytics-Visual-Studio-2015-and-Power-BI
http://blogs.adatis.co.uk/sachatomey/post/2017/01/19/Power-BI-Streaming-Datasets-An-Alternative-PowerShell-Push-Script
http://blogs.adatis.co.uk/Jose%20Mendes/post/Data-Data-Revolution

Design and Implement Analytics by Using Azure Data Lake (25-30%)
Azure Data Lake Store and Analytics are vital components of "Modern Data Analytics". Data which is too large for traditional single-server processing needs distributed parallel computation. Rather than pulling data out to process it, ADLA pushes the processing to the data. Understanding how to process large volumes of data is one part of the "Batch" layer in Lambda.

http://blogs.adatis.co.uk/ustoldfield/post/data-lakes
http://blogs.adatis.co.uk/ustoldfield/post/Data-Flow-Job-Execution-in-the-Azure-Data-Lake
http://blogs.adatis.co.uk/ustoldfield/post/Data-Flow-Pt-2-Vertexes-In-Azure-Data-Lake

Design and Implement Azure SQL Data Warehouse Solutions (15-20%)
Either as an alternative or an accompaniment to Data Lake is Azure SQL Data Warehouse. If Data Lake is batch processing across many files, Azure SQLDW is parallel batch processing across many databases. The key to both services is processing at the storage rather than pulling data to the compute. The following is an ongoing blog series covering the basics all the way to a deep-dive.

http://blogs.adatis.co.uk/simonwhiteley/post/A-Guide-to-Azure-SQL-DataWarehouse
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-What-is-it
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-How-Does-Scaling-Work
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-Distribution
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-Polybase
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-Polybase-Design-Patterns
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-Polybase-Limitations
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-CTAS-Statements

Design and Implement Cloud-Based Integration by using Azure Data Factory (15-20%)
If you're looking for a PaaS solution to move data in Azure, there is really only one option: Azure Data Factory. The following blogs will get you up to speed with ADF.

http://blogs.adatis.co.uk/terrymccann/post/Getting-started-with-Azure-Data-Factory
http://blogs.adatis.co.uk/terrymccann/post/Setting-up-your-first-Azure-Data-Factory
http://blogs.adatis.co.uk/terrymccann/post/Azure-Data-Factory-using-the-Copy-Data-task-to-migrate-data-from-on-premise-SQL-Server-to-Blob-storage

Manage and Maintain Azure SQL Data Warehouse, Azure Data Lake, Azure Data Factory, and Azure Stream Analytics (20-25%)
Knowing each of the parts is only half the battle; you also need to know how, when and why to use each one. What are the best practices?

http://blogs.adatis.co.uk/ustoldfield/post/Deploying-a-Hybrid-Cloud
http://blogs.adatis.co.uk/terrymccann/post/Azure-Data-Factory-Suggested-naming-conventions-and-best-practices
http://blogs.adatis.co.uk/ustoldfield/post/Azure-Data-Lake-Store-Storage-and-Best-Practices
http://blogs.adatis.co.uk/ustoldfield/post/Shaping-The-Lake-Data-Lake-Framework

SQL Server 2017 Graph data processing. An introduction.


On the 19th of April 2017 at the Microsoft Data Amp event, Microsoft announced SQL Server 2017 and a few new advanced analytics features (slide above). You can watch the full event here: https://www.microsoft.com/en-us/sql-server/data-amp. One of the announcements related to new support for graph objects. There have been rumblings over the past few years of Microsoft working on a graph engine; Project Trinity appeared to be just that: https://www.microsoft.com/en-us/research/project/trinity/. Trinity never really made much of an impact, possibly due to other vendors having easier-to-use products. Hopefully Microsoft is changing this by introducing graph querying directly into the engine for SQL Server 2017. In this blog post I will look at SQL Graph, the upcoming graph processing engine in SQL Server 2017.

What is a Graph Database?
Relational databases are fantastic at answering many types of queries. There are developers who can write a query to answer any question you can ask of your data. But a relational database is not always the best place to ask those questions.

The rise of NoSQL databases in the early 2000s is in part due to this. Developers felt they could make a better solution built to solve a particular problem or support a particular process, rather than hack a generalised system (a relational database). One of those problems was social graphs. In a social graph you want to analyse deep hierarchical data and traverse up and down hierarchies. Imagine you are a developer at LinkedIn. You want to suggest to a new member who they should connect with. A simple choice is to show them everyone who is connected to someone they already know - think of the six degrees of separation to Kevin Bacon.

Imagine we have the following social graph. Simon is new to LinkedIn. He has recently signed up and has made connections with Emma and Sacha. We want to suggest connections to Simon based on the people Sacha and Emma know. In the diagram below you can see our social graph; we want to recommend the people in light blue. We can do this by moving one level up in our hierarchy away from Simon and collecting the returned results.


Each of the elements in the diagram above has a name. The circles are referred to as nodes. The lines connecting them are relationships, or edges. You can think of a node as an entity (a thing: a person, an account, a movie, an actor etc.). The edge is what connects them; it is what gives them context. A person knows a person, a movie is directed by a director: Node->Relationship->Node, or Node->Edge->Node. There can be multiple nodes and multiple edges. Simon might know Emma and work with Sacha ("know" and "work" are our edges/relationships). With this set-up we can then ask questions such as "Of the people who know Simon, who do they know?". This would return our people coloured blue. A query that is relatively trivial in SQL at this level becomes incredibly complex when you add "now show me who each of the blue nodes knows" and so forth, moving further and further up the nodes.

This query pattern is perfect for creating a recommendation engine: Simon bought a book on Lambda architecture, so what did people who bought the same book also buy? That query would be very hard (possible, yes, but complicated) to write in SQL. Graph databases were built to solve this problem. There are a few big players already in the graph world, Neo4j being the biggest. Neo4j has its own SQL-like syntax for writing these matching processes called Cypher Query Language (CQL). CQL is very mature and well equipped to answer complex graph queries.

A quick intro to SQL Graph
SQL Graph is a similar concept to what is described above, but built in to the core SQL Server engine. This means two new table types, NODE and EDGE, and a few new T-SQL functions, in particular MATCH(). SQL Graph at the time of writing is only available in SQL Server 2017 CTP 2.0. You can read more and download CTP 2.0 here: https://blogs.technet.microsoft.com/dataplatforminsider/2017/04/19/sql-server-2017-community-technology-preview-2-0-now-available/. Once CTP 2.0 is installed there is nothing else you need to do to enable the new graph syntax and storage.

There is an example you can download from Microsoft which is a similar set-up to the example in the image above. However, I have used some real data shredded from IMDb, the internet movie database. This data is available to download from Kaggle: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

This data contains 5000 movies with the following information:

  1. The name of the movie
  2. The 3 main actors
  3. The director
  4. The genres

New syntax - https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-sql-graph

I have loaded this into a SQL Server 2017 CTP 2.0 instance. Graph queries will not simply work on your existing tables; you need to transform the data into NODE and EDGE tables.

NODE – A representation of an entity or object

-- Create NODE tables
DROP TABLE IF EXISTS Genre;
CREATE TABLE Genre (GenreId INTEGER PRIMARY KEY, Genre VARCHAR(100)) AS NODE;

When data is inserted into a NODE table a unique $node_id is generated for each row. This is a JSON string containing metadata about that node.
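
As a minimal sketch (the genre values are illustrative rather than taken from the Kaggle file), rows are inserted like any other table and the generated pseudo-column can be queried directly:

INSERT INTO Genre (GenreId, Genre)
VALUES (1, 'Action'), (2, 'Crime'), (3, 'Drama');

-- Each row now carries a system-generated $node_id pseudo-column
SELECT $node_id, GenreId, Genre
FROM Genre;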

EDGE – A representation of a relationship between two NODES

-- Create EDGE tables. Actor ActedIn Movie
DROP TABLE IF EXISTS ActedIn;
CREATE TABLE ActedIn AS EDGE;

Using CREATE TABLE ... AS EDGE will build the required structure for your EDGE. When you create an EDGE row, you are creating a relationship between two $node_id values - that could be a movie and an actor. An EDGE row therefore needs to be inserted with the two corresponding $node_id values.
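
To give a flavour of how the two $node_id values come together, here is a hedged sketch that assumes Actor and Movie NODE tables have already been created and loaded from the IMDb file (the film title is illustrative):

-- Link an actor to a movie through the ActedIn EDGE table
INSERT INTO ActedIn
VALUES (
    (SELECT $node_id FROM Actor WHERE Actor = '50 Cent'),
    (SELECT $node_id FROM Movie WHERE Movie = 'Get Rich or Die Tryin''')
);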

MATCH – How to query your graph objects

SELECT Movie.Movie
FROM Actor Actor1, ActedIn, Movie
WHERE MATCH(Actor1-(ActedIn)->Movie)
AND Actor1.Actor = '50 Cent'

MATCH uses a pattern-based query structure. In the example above this is NODE-(EDGE)->NODE. Executing this query will return all the films 50 Cent has acted in (more complicated examples are shown in the video).
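
Where the pattern syntax starts to pay off is when you chain hops together. A sketch of a two-hop query - actors who have appeared in a film alongside 50 Cent:

SELECT DISTINCT Actor2.Actor
FROM Actor Actor1, ActedIn AI1, Movie, ActedIn AI2, Actor Actor2
WHERE MATCH(Actor1-(AI1)->Movie<-(AI2)-Actor2)
AND Actor1.Actor = '50 Cent'
AND Actor2.Actor <> '50 Cent'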

In the following video I will take you through the process of creating a graph with SQL Server 2017.

 

Final thoughts
SQL Server 2017 CTP 2.0 is classed as production ready, so SQL Graph is ready to be used in a production environment. The level of functionality listed in the documentation is quite lacking when compared with competitors' tools. Cypher for Neo4j has a fantastic pattern-matching function which allows you to hop n relationships; this is not currently available in SQL Graph, nor is the ability to find any node connected to the current node. SQL Graph currently feels like a version 0.1. It is almost there, but not quite. I look forward to seeing how graph develops over the coming months and hope to see the missing functionality appear. You can read more about the limitations here: https://blogs.technet.microsoft.com/dataplatforminsider/2017/04/20/graph-data-processing-with-sql-server-2017/

Links
A SQL Server Graph overview - https://docs.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-overview
Sample from Microsoft - https://docs.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-sample
Limitations - https://blogs.technet.microsoft.com/dataplatforminsider/2017/04/20/graph-data-processing-with-sql-server-2017/

Azure Data Factory, using the Copy Data task to migrate data from on premise SQL Server to Blob storage.

 

    In the first blog in this series I talked about working through a use case. Over the next 3 blogs we will look at 3 different methods for migrating data to Azure Blob storage.

    1. Using the Azure Data Factory Copy Data Wizard (one table)
    2. Using BIML and SSIS (entire database - SSIS)
    3. Using Azure Data Factory and PowerShell (entire database - ADF)

    The reason I have included the latter two versions is that if you just want to load an entire database into blob storage, it can be quicker to use one of these methods as a one-off or on a scheduled basis. Hand-writing all the JSON required to move each table from on-premise to blob storage is very time-consuming. Depending on whether you need a one-off upload or something on a schedule, options 2 and 3 might help.

    Our original use case from an introduction to Azure Data Factory:

    Let's imagine a process where you have an on-premise SQL Server box and you want to move multiple tables to blob storage; from there you then want to issue a stored procedure which will consume that data into an Azure SQL Data Warehouse via PolyBase - as illustrated in the image below.


    Linked services:
    On-premise SQL database
    Azure BLOB storage
    Azure SQL data warehouse

    Datasets:
    Table in On-premise SQL database
    The blob container
    The stored procedure

    Pipelines:
    Pipeline to move data from SQL database to blob storage
    Pipeline to issue stored procedure

    In the blog we will tackle the first part:


    Copying data:
    We are going to start looking in a bit more detail at the Azure Data Factory (ADF) Copy Data task (CD). CD is still in preview (at the time of writing [01/2017]). Prior to the inclusion of the copy data wizard, you had to manually configure ADF artefacts and write most of the JSON for linked services, datasets and pipelines by hand. The copy data task is a wizard for generating a data movement activity pipeline, complete with datasets and linked services.

    To get started connect to azure and navigate to your existing Azure data factory (if you do not have an existing ADF you can follow how to create one here http://blogs.adatis.co.uk/Terry%20McCann/post/Setting-up-your-first-Azure-Data-Factory).


    To begin setting up a copy data pipeline click on the "copy data (PREVIEW)" option in the ADF panel in Azure. Once you have selected "Copy data" you will be taken to the new ADF interface. Datafactory.azure.com enables the creation and monitoring of ADF pipelines.

    The general process for creating an ADF pipeline process (prior to the copy data task) was

    1. Create an ADF
    2. Create a linked service/s
    3. Create a gateway as needed
    4. Create your input and output datasets
    5. Create a pipeline
    6. Monitor the pipeline

    When using the ADF copy data the process is slightly flipped

    1. Create an ADF
    2. Configure the properties for the pipeline
    3. Create a gateway
    4. Configure the linked service/s
    5. Configure the datasets
    6. Deploy all configurations
    7. Monitor the pipeline.

    The main difference here is that you do not deploy anything until it has all been configured - an advantage, given that doing this manually is somewhat cumbersome. At present, the copy data task is very narrow in its functionality. If your intention is to build a more advanced pipeline, you will either need to generate a move task and tweak it, or create it all manually.

    Copy data has many shortcomings; for our example the most prevalent is that a movement to blob storage can only sink data to a single folder, not multiple folders. Option 3 in our list of migration methods aims to get around this limitation using PowerShell.

    Configure Pipeline properties:
    Once you have selected "copy data" you will be launched in to datafactory.azure.com, the new, fresher-looking environment. The copy data task is a 4-step process which will guide you through the creation of a data movement pipeline (I want to highlight that this is only used for data movement and not transformation). This is a great way to get started with ADF without having to understand the JSON or trickier elements such as data slices and scheduling, although we will touch on scheduling as it is quite tricky.

    (Image 1 - Configure Properties)

    The first screen you will see shows the properties of the pipeline you're creating. It is here you will configure the frequency and schedule of the pipeline. A pipeline is a group of logically related activities.

    1. Task name - This is important and will be used as a prefix for the names of datasets and data stores.
    2. Task description -
    3. Task schedule - See below for a more in depth analysis.
    4. Start time - This date is in UTC / End time - This date is also in UTC

    For quick conversions to your time zone, I recommend worldtimebuddy (http://www.worldtimebuddy.com/)

    More on schedules:
    The Microsoft page about scheduling is incredibly deep and takes a lot of effort to fully digest and understand. I will attempt to distil my understanding of pipeline scheduling into a brief list of key points. You can read more here: https://docs.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution - I would recommend that you do read this page as it has a lot of good examples. The key points from this document are:

    • A schedule can be one-off or between dates
    • It can run on a scheduled basis: minute, hourly, daily or weekly; every 15 minutes is the minimum.
      This forms what is known as a tumbling window. Microsoft defines a tumbling window as "a series of fixed-size, non-overlapping, contiguous time intervals". Tumbling windows are also known as activity windows.
    • A pipeline schedule's interval needs to be the same as the dataset's availability - however, it does not need to run at the same time.

    For our example we will use a frequency of "Daily" and an interval of "1"; this will run our pipeline every day. To write this in JSON rather than through the wizard, you would use the following as part of your pipeline.

    "scheduler": {
    "frequency": "Daily",
    "interval": 1
    }
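
    The output dataset that the pipeline writes to needs a matching availability block with the same interval. A minimal sketch of that section of the dataset JSON (note that the raw JSON schema uses "Day" where the wizard displays "Daily"):

    "availability": {
        "frequency": "Day",
        "interval": 1
    }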

    To create a pipeline which will run indefinitely you can set the end date time to "12/31/2099 12:00am", which, while not infinite, will outlive ADF. The start date time will default to the time you created the pipeline (n.b. these dates are expressed in US format MM/DD/YYYY).

    Creating linked services (Source data store):
    The next screen is the configuration of the linked sources. ADF is still new and the terminology is somewhat confusing. Depending on the tool you're using, and sometimes the screen you're looking at, ADF will mix up the names for key parts; anywhere you see the term "data store", assume it is referring to a linked service.


    For our example we will use the SQL Server option (bottom right of the above image).

    (Image - SQL Server linked service configuration screen)

    You should now be able to configure the connection details (via a gateway) to your SQL Server database.

    1. Connection Name - You can refer to the Adatis naming standard as a reference
      LS - Linked service
      MSQL - Microsoft SQL Server
      Person - Table being exported.
    2. Gateway - Select an existing or create a new gateway (see below)
    3. Server name - For my example I am using my local server with the default instance "." local would also work. If you're connecting to a named instance this will need to be server\InstanceName
    4. Database Name - Which database you want to connect to
    5. Credential encryption - You have the option to save a credential in Azure or use authentication through the browser.
      For simplicity I am using the latter. For production, use the former.
    6. Authentication type - How to connect to SQL Server, Windows or SQL login.
    7. User name
    8. Password

 

    Creating and configuring a gateway:
    In our example we will be connecting to a local instance of SQL Server; to connect and read data we will need to create an ADF gateway connection and also install the gateway on the server which hosts our database (or at least has a connection to that database). You have a few options to create the gateway, but before you can configure any of these you will need to download and install the gateway. You can find the latest version of the gateway here: https://www.microsoft.com/en-gb/download/details.aspx?id=39717. Once installed, the gateway will be waiting for an access key.

    (Image - Microsoft gateway - awaiting gateway key)

    We have 3 options to create an ADF gateway and obtain the key the gateway is expecting.

    1. Copy Data configuration page (click create gateway)
      This will build the gateway and add the name to your pipeline. You will need to take the access key it generates and add it to your installed gateway.
    2. Add via author and deploy
      Navigate to author and deploy on the main panel of ADF in Azure.
      Click on "...More" and select New Data gateway, configure and deploy. This will return a key. Add the key to the gateway on your server.
    3. Via PowerShell
      Open a new PowerShell prompt and connect to Azure (Login-AzureRmAccount)
      Replace the following with your Azure details - $ResourceGroup, $DataFactoryName and $Gateway
      New-AzureRmDataFactoryGateway -ResourceGroupName $ResourceGroup -Name $Gateway -DataFactoryName $DataFactoryName -Description $Gateway
      This will return a key. Add the key to the gateway on your server.
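
    Pulling option 3 together, a rough sketch (the resource group, data factory and gateway names are placeholders for your own values):

    # Log in to Azure and create the gateway registration in ADF
    Login-AzureRmAccount
    $gateway = New-AzureRmDataFactoryGateway -ResourceGroupName "MyResourceGroup" -Name "MyOnPremGateway" -DataFactoryName "MyDataFactory" -Description "My on-premise gateway"
    # The returned object includes the registration key to paste into the gateway application on the server
    $gateway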

 

    (Image - A registered gateway)

    (Image - Main screen on a registered gateway)

    Configuring linked services:
    Select next to choose which table/s you wish to move.

    (Image - ADF Copy - Select tables)

    You can select one or more tables here. For our example we will be consuming the data using PolyBase. We want our data to sink to its own container in Azure. As such we cannot move multiple tables at once (at the time of writing this is limited to one sink container).

    (Image - ADF data filter)

    You will next be asked how you want to filter the data. Each time our pipeline runs we are looking to move the whole table. If we were looking to do incremental loads, we could select a column which indicates which rows to import each hour. For our example select Filter: None, then Next.

    (Image - ADF destination source)

    Configuring the destination source:
    On the next screen you will see the list of available sinks (where you can insert data). You will notice the list of sinks is far smaller than the list of sources - at present not all sources can be sinks.

    For our example select Azure Blob storage


    1. Connection Name - Based on Adatis ADF naming standards http://blogs.adatis.co.uk/Terry%20McCann/post/Azure-Data-Factory-Suggested-naming-conventions-and-best-practices
      1. LS_ Linked Service
      2. ABLB_ Blob storage
      3. Person - blob container data will sink to
    2. Account selection method - Manual/Azure list
    3. Azure subscription - Select your subscription
    4. Storage account name - Select your storage account

    (Image - Selecting a blob container)

    Select a folder for the file to sink to. I created a folder ahead of time called person.

    (Image - ADF file format configuration screen)

    Customise your output settings. For now we will just select the defaults to create a CSV.
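
    For reference, the blob dataset the wizard generates looks roughly like the trimmed sketch below (I have used the naming conventions from this series rather than the auto-generated names, and omitted the column structure):

    {
        "name": "DS_ABLB_Person",
        "properties": {
            "type": "AzureBlob",
            "linkedServiceName": "LS_ABLB_Person",
            "typeProperties": {
                "folderPath": "person",
                "format": { "type": "TextFormat", "columnDelimiter": "," }
            },
            "availability": { "frequency": "Day", "interval": 1 }
        }
    }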


    Select finish to build your pipeline.

    (Image - ADF deployment completed screen)

    As long as everything has worked you will see the screen above. Congratulations your pipeline has been deployed. To monitor the pipeline and see what has been created select the link "click here to monitor your pipeline". You will be taken to a different screen in the ADF portal. We will have more on how to monitor ADF shortly.

    (Image - ADF pipeline in datafactory.azure.com)

    You can check data has moved successfully using Azure storage explorer. ASE is a great utility for browsing files in blob storage. You can download ASE here http://storageexplorer.com/

    (Image - Storage explorer)

    I can see that my file is there and is populated as intended. Once a further 24 hours has passed this file will be overwritten.

    So we have seen what we can do with the copy data task in Azure. While it is fantastic at basic data movement, copy data does not offer much beyond that. I have listed the following pains and shortfalls which exist in ADF copy data at present.

    Limitations of the copy data wizard:
    There are quite a few limitations, some of these are nice to have, others are show stoppers.

    • The CD action is limited to only a subset of the pipeline activities. As the name suggests you can only copy data, or move data. There is no transformation wizard.
    • The menus are very temperamental and regularly do not work
    • You cannot name a dataset - InputDataset-8tl was created in my example. This is not helpful
    • The name of the pipeline is also not helpful.
    • You cannot chain multiple activities together
    • Each pipeline needs to be created separately.
    • You can only sink datasets to one blob container.

    Now that we have our data in blob storage we can begin to look at the rest of our solution, where we will create an Azure SQL Data Warehouse, with external PolyBase tables. We will use stored procedures to persist the external tables in to ASDW.

    In the next blog we will look at moving an entire database to Azure blob storage using SSIS and BIML.

    Links
    https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-scheduling-and-execution

    http://blogs.adatis.co.uk/Terry%20McCann/post/Azure-Data-Factory-Suggested-naming-conventions-and-best-practices

My Experience of the Microsoft Professional Program for Data Science

(Image 1 – Microsoft 2017 - https://academy.microsoft.com/en-us/professional-program/data-science)

 

In 2016 I was talking to Andrew Fryer (@DeepFat), Microsoft technical evangelist (after he attended Dundee University to present on Azure Machine Learning), about how Microsoft were piloting a degree course in data science. My interest was immediately piqued. Shortly after this, hints began to appear and the edX page went live. Shortly after the edX page went live, the degree was rebranded as the "Professional Program". I registered to be part of the pilot, however I was not accepted until the course went live in September 2016.

 

Prior to 2016 my background was in BI, predominantly Microsoft Kimball data warehousing using SQL Server. At the end of 2015 I enrolled on a Master's degree in Data Science through the University of Dundee. I did this with the intention of getting exposure to tools I had an interest in but little commercial experience of (R, machine learning and statistics). This course is ongoing and will finish in 2018; I highly recommend it! I would argue that it is the best data science Master's degree course in the UK. So going in to the MPP I had a decent idea of what to expect, plus a lot of SQL experience, R and Power BI. Beyond that I had attended a few sessions at various conferences on Azure ML. When the syllabus for the MPP came out, it directly complemented my studies.

 

Link to program - https://academy.microsoft.com/en-us/professional-program/data-science

Link to Dundee Masters - https://www.dundee.ac.uk/study/pg/data-science/

 

Structure of the program

The program is divided up in to 9 modules and a final project. All modules need to be completed, but there are different options you can take - you can customise the course to suit your interests. You can choose to pay for the course (which you will need to do if you intend to work towards the certification) or audit the course for free. I will indicate which modules I took and why. Most modules recommend at least 6 weeks part-time to complete. I started the first module in the middle of September 2016 and completed the final project in the middle of January 2017 - so the 6-week estimate is quite high, especially if you already have a decent base knowledge of the concepts.

 

You can, if you wish, complete multiple modules at once. I am not sure I recommend this approach, as to get the most out of the course you should read around the subject as well as watching the videos. Each module has a start date and an end date that you need to complete it between; if you do not, you will need to do it all again. You can start a module in one period and wait until the next for another module - you do not need to complete them all in 3 months. If you pay for a module but do not request your certificate before the course closes, you will need to take it again (top tip: as soon as you're happy with your score, request your certificate).

 

Module list

Module | Detail | Time taken
--- | --- | ---
Data Science Orientation | Data Science Orientation | 2 - 3 days
Query Relational Data | Querying Data with Transact-SQL | 1 day - exam only
Analyze and Visualize Data | Analyzing and Visualizing Data with Excel / Analyzing and Visualizing Data with Power BI (taken) | 2 - 4 days
Understand Statistics | Statistical Thinking for Data Science and Analytics | 7 - 9 days
Explore Data with Code | Introduction to R for Data Science / Introduction to Python for Data Science (taken) | 7 - 9 days
Understand Core Data Science Concepts | Data Science Essentials | 7 - 9 days
Understand Machine Learning | Principles of Machine Learning | 2 weeks
Use Code to Manipulate and Model Data | Programming with R for Data Science (taken) / Programming with Python for Data Science (taken) | R: 2 - 3 days / Python: 3 weeks
Develop Intelligent Solutions | Applied Machine Learning / Implementing Predictive Solutions with Spark in HDInsight (taken) / Developing Intelligent Applications | 2 weeks
Final Project | Data Science Challenge | 2 months*

(Where a module offers alternative courses, "(taken)" marks the option I chose.)

The times taken are based on the time I had spare. I completed each module between projects, in the evening and at the weekend.

*The final project itself can be completed in a few days; however, you need to wait until the challenge has closed to get your grade.

 

Structure of the modules

Each module is online. You log on to the edX website and watch videos by leading experts. Either at the end of the video, after reading some text, or at the end of a section of the module, you are given a multiple-choice test. The multiple-choice questions are graded and form part of your overall score. The other main assessment method is labs, where you will be required to complete a series of tasks and enter the results. Unlike certifications, you get to see what your score is as you progress through the module. The multiple-choice questions generally allow you two to three attempts at the answer; sometimes these are true/false with two attempts, which does undermine the integrity of the course.

 

There is normally a final section which you're only given one chance to answer, and which holds a higher percentage of your final mark. You need 70% to pass. Once you hit 70% you can claim your certificate - if you have chosen to pay for the module. Modules range from $20 to $100. For the most part I answered the questions fully and tried for the highest score possible. However, in all honesty, towards the end, once I hit around 80%, I started looking at a different module. If the module was really interesting I would persevere.

 

Modules

Data Science Orientation, Query Relational Data & Analyze and Visualize Data. These modules are very basic and really only skim the surface of the topics they describe. The first module is a gentle introduction to the main concepts you will learn throughout the program. The next module focuses on querying data with SQL. Regardless of your opinion of SQL, you must agree that SQL is the language of data. Having an understanding of the fundamentals of SQL is paramount, as almost every level of the Microsoft data science stack has integration with databases. If you're familiar with SQL (I already held an MCSE in SQL 2012) you can skip the main content of this module and just take the test at the end. For the next module you have the option of Excel or Power BI for visualisation. As I have experience with Power BI, I opted for this module. Once again this is a very basic introduction to Power BI. It will get you familiar enough with the tool that you can do basic data exploration. Some parts of this course jarred with me. Data visualisation is so important and a key skill for any data scientist. In the Power BI module one of the exercises was to create a 3D pie chart. Pie charts are not a good visualisation, as it is hard to differentiate between angles, and making it 3D only escalates the issue. I wish Microsoft had made reference to some of the great data viz experts when making this module - I cannot comment on the Excel version.

 

Understand Statistics. This module is different from its predecessors in that it is not run by Microsoft. It is a MOOC from Columbia University, which you might have completed before. It covers a lot of the basic and more advanced stats that you need to know for data science, in particular a solid grounding in probability and probability theory. In BI you become familiar with descriptive stats and measures of variance; however, I had not done a great deal of stats beyond this. I had researched statistical methods for the MSc, but I had not done any real stats since A-Level maths. This course was really interesting and I learnt a lot. I don't know if this is the best way to really learn stats, but it is a good primer for what you need to know. I found topping up my understanding with blogs, books and YouTube helped support this module.

 

Explore Data with Code. You have two options again for this module: R and Python. Which should you learn, I imagine you're asking? Well, the simple answer is both. Knowing either R or Python will get you so far; knowing both will make you a unicorn. Many ask why learn one language over the other - aside from the previous point. R is very easy to get in to: it has a rich catalogue of libraries written by some of the smartest statistical minds, a simple interface and is easy to install. Python is harder to learn in my opinion, as the language is massive! I found Python harder to work with, but it is much richer. I would recommend Python just for scikit-learn, the machine learning library. The Python module is extended to use Code Dojo (the great online tuition site). As you progress through the questions and examples, you have an IDE which will check your understanding and grade you as you go. I found this really helpful. This module is again a bit on the easier side. If you think the later Python module will be similar, you are in for a surprise! I did not take the R module as I was already using R in my day job.

 

Understand core data science concepts. Almost a redo of the first module and the understanding statistics module. Not a lot to say here, but repetition helped me understand and remember the concepts. The more I had to think about the core concepts the more they stuck. This module could have been removed with little to no impact on the course, but helped solidify my knowledge.

 

Understand Machine Learning. As this is a Microsoft course, this module is all about Azure Machine Learning. If you have not used Azure ML before, it has a nice drag-and-drop interface which allows you to build quick, simple models and create a web API key which you can then pass data to using any tool with a REST API. This module is half theory and half practical. There are a lot of labs, so you will need to take your time. If you skip ahead you will get the answers wrong and might not make it to 70%.

 

Use Code to Manipulate and Model Data. This section has two options again, R and Python. I know quite a bit of R already, so I started with Python. I wanted to do them both to see how you can do machine learning in each. I was expecting a continuation of the Code Dojo format from the previous module; this was far from the case. Each of the modules up until this point has worked with you to find the right answer. This module will equip you with the basics but expects you to find the correct function and answer. Believe me when I say it was hard (with little prior experience of Python). The course will lead you towards the right resources, but you need to read the documentation to answer the questions. This was a great change of pace. Having to search for the answers made me absorb more than just the quizzes. This module was a struggle. Once I completed it I did the same for R. On a difficulty scale, if the Python module was 100, R was only at 20. The disparity in difficulty is massive and frankly unfair. I was able to complete the R module very quickly. I left feeling disappointed that it did not have the same complexity that the Python module did.

 

Develop intelligent solutions. For this section you can pick one of three modules, Machine learning, Spark or micro services. I went with Spark. Why? Because I had already worked with Spark and Hadoop as part of the MSc at Dundee. I knew how it worked and what it did from an open source point of view, but not from a Microsoft HD-Insight perspective. This module was tricky but nothing compared to the Python module. I spent the best part of the week working on Spark, setting up HD-Insight clusters and forgetting to tear them down (top tip! Don’t leave a HD-Insight cluster running - They are EXPENSIVE!). The last module is a machine learning project, so picking the "Applied Machine Learning" option might put you in a better place than your competition. I did not attempt either the Machine Learning or the Micro-services modules.

 

Final project. Here is where the fun begins. You're given a problem and a dataset. You need to clean, reduce, derive features from and process the dataset, then apply an ML technique to predict something - in my case, whether or not someone will default on a loan. You could use any technique you liked as long as the final result was in Azure ML. I was pretty happy with my model early on and made very few tweaks as the course progressed. Unlike the previous modules, where you can complete a module and get your score, your final score is only available once the module has ended. You will build an ML experiment and test against a private dataset. You can submit your experiment 3 times a day to be scored against the private data (maximum of 100 attempts). This will give you an indication of your score, but this is not your score! Your score is calculated against a different dataset after the module has finished. Your top 5 scores will be used to test against the private closed data. If you have over-fitted your model, you might have a shock (as many did on the forums) when your score is marked.

 

I completed all modules at the start of January and waited until February to get my final score. My highest-scoring answer, when used against the closed private dataset, did not get over the required 70% to pass. This was surprising but not all that unexpected - I had over-fitted the model. To counterbalance this, I created 5 different experiments with 5 similar but different approaches. All scored similarly (~1-3% accuracy difference). This was enough to see me past the required 70% and to obtain the MPP in Data Science. The private dataset has been published now. In the coming weeks I will blog about the steps I took to predict whether someone would default on their loan.

 

I have been asked at different stages of the course "would you recommend it?". It really depends on what you want out of the course! If you expect to be a data scientist after completing the MPP, then you might be in for a shock. To get the most out of the course you need to supplement it with wider reading and research. YouTube has many great videos and recorded lectures which will really help you process the content and see it taught from a different angle. If you're looking to get an understanding of the key techniques in data science (from a Microsoft point of view) then you should take this course. If you're doing a degree where you need to do research, many of the modules will really help and build upon what you already know.

 

I hope you have found this interesting and that it has helped you decide whether or not you want to invest the time and money (each module is not free). If you do decide to and you persevere, you too will be the owner of the MPP in Data Science (as seen below).

 

Terry McCann - Adatis Data Science Consultant & Organiser of the Exeter Data Science User Group - You can find us on MeetUp.

 

 


Azure Data Factory - Suggested naming conventions and best practices.


 

Years ago I stumbled on a great blog about SSIS best practices and naming conventions. In the years that followed I could refer back to it at the start of every SSIS project. Written by Jamie Thomson, this has become the standard, and although there have been variants, Jamie's still remains very popular (Jamie Thomson, Link).

 

I want to propose a list of best practices and naming conventions for the various elements of ADF, based on my experience with ADF. This list may change over time as I discover what works and what does not. At present the standard names generated by ADF are generic and not very helpful (LinkedServiceAzureSQLDataWarehouse1 etc.). In light of this I propose the following best practices and standards for naming ADF elements.

 

Best practices

 

  1. Use a standard naming convention. For a large ADF project you can end up with 300-400 JSON objects. It is tricky to navigate that many objects, and deploying from Visual Studio makes this even harder. Having a standard helps you quickly locate which JSON object needs to be changed.
  2. When you have a solution with a lot of JSON objects (a few hundred or more), publishing from Visual Studio becomes an issue. A bottleneck begins to appear, brought on by each element needing to validate against its peers. When you have hundreds of elements, deploying new/updated pipelines adds a significant overhead. In my recent example I had 222 JSON objects and 74 pipelines; when initially deploying, each pipeline (after solution validation) took ~10 seconds. Once published and updated, this increased to ~40 seconds per pipeline.

 

74 x 10 seconds is approximately 12:20 minutes (call it 13 minutes in total)

74 x 40 seconds is approximately 49:20 minutes (I gave up after 40 minutes)

 

This can be resolved by adding a step into your automation which removes your ADF instance before publishing a large number of changes (see point 3 below - don't neglect PowerShell).
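
As a sketch of that automation step (the names and location are placeholders; -Force suppresses the confirmation prompt):

# Drop the existing factory, and everything deployed to it, before a large publish
Remove-AzureRmDataFactory -ResourceGroupName "MyResourceGroup" -Name "MyDataFactory" -Force

# Recreate it empty, then redeploy the solution from Visual Studio or PowerShell
New-AzureRmDataFactory -ResourceGroupName "MyResourceGroup" -Name "MyDataFactory" -Location "NorthEurope"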

 

  3. Don't neglect PowerShell. PowerShell is a great way to automate the development of JSON objects - especially when you need to create a few hundred! I have a blog coming in the next few weeks detailing how to do this.
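
To give a flavour of that approach, here is a minimal sketch which assumes a dataset template file containing a @@TABLE@@ token (an illustration only, not the full method from the forthcoming post):

# Stamp out one dataset JSON file per table from a template
$tables = "Person", "Account", "Sales"
$template = Get-Content -Raw ".\DS_ASQL_Template.json"
foreach ($table in $tables) {
    ($template -replace "@@TABLE@@", $table) | Set-Content ".\DS_ASQL_$table.json"
}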

 

* More best practices will be added.

 

Naming conventions

 

There are a few standard naming conventions which apply to all elements in Azure Data factory.

 

  1. Data factory names must start with a letter or a number, and can contain only letters, numbers, and the dash (-) character.
  2. Maximum number of characters in a table name: 260.
  3. Object names (linked services, datasets and pipelines) must start with a letter, a number, or an underscore (_).
  4. The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\"

(Microsoft, 2017 [https://docs.microsoft.com/en-us/azure/data-factory/data-factory-naming-rules])

 

Linked services & Datasets.

A linked service connects data from a source to a destination (sink), and it stands to reason that the same applies to a dataset. Rather than having two separate lists, the table below has columns for both linked services and datasets.

 

Type | Linked service type | Abbreviation | Linked service prefix | Dataset prefix | Full example
--- | --- | --- | --- | --- | ---
Azure | Azure Blob storage | ABLB_ | LS_ABLB_ | DS_ABLB_ | LS_ABLB_Example
 | Azure Data Lake Store | ADLS_ | LS_ADLS_ | DS_ADLS_ | LS_ADLS_Example
 | Azure SQL Database | ASQL_ | LS_ASQL_ | DS_ASQL_ | LS_ASQL_Example
 | Azure SQL Data Warehouse | ASDW_ | LS_ASDW_ | DS_ASDW_ | LS_ASDW_Example
 | Azure Table storage | ATBL_ | LS_ATBL_ | DS_ATBL_ | LS_ATBL_Example
 | Azure DocumentDB | ADOC_ | LS_ADOC_ | DS_ADOC_ | LS_ADOC_Example
 | Azure Search Index | ASER_ | LS_ASER_ | DS_ASER_ | LS_ASER_Example
Databases | SQL Server* | MSQL_ | LS_SQL_ | DS_SQL_ | LS_SQL_Example
 | Oracle* | ORAC_ | LS_ORAC_ | DS_ORAC_ | LS_ORAC_Example
 | MySQL* | MYSQ_ | LS_MYSQ_ | DS_MYSQ_ | LS_MYSQ_Example
 | DB2* | DB2_ | LS_DB2_ | DS_DB2_ | LS_DB2_Example
 | Teradata* | TDAT_ | LS_TDAT_ | DS_TDAT_ | LS_TDAT_Example
 | PostgreSQL* | POST_ | LS_POST_ | DS_POST_ | LS_POST_Example
 | Sybase* | SYBA_ | LS_SYBA_ | DS_SYBA_ | LS_SYBA_Example
 | Cassandra* | CASS_ | LS_CASS_ | DS_CASS_ | LS_CASS_Example
 | MongoDB* | MONG_ | LS_MONG_ | DS_MONG_ | LS_MONG_Example
 | Amazon Redshift | ARED_ | LS_ARED_ | DS_ARED_ | LS_ARED_Example
File | File System* | FILE_ | LS_FILE_ | DS_FILE_ | LS_FILE_Example
 | HDFS* | HDFS_ | LS_HDFS_ | DS_HDFS_ | LS_HDFS_Example
 | Amazon S3 | AMS3_ | LS_AMS3_ | DS_AMS3_ | LS_AMS3_Example
 | FTP | FTP_ | LS_FTP_ | DS_FTP_ | LS_FTP_Example
Others | Salesforce | SAFC_ | LS_SAFC_ | DS_SAFC_ | LS_SAFC_Example
 | Generic ODBC* | ODBC_ | LS_ODBC_ | DS_ODBC_ | LS_ODBC_Example
 | Generic OData | ODAT_ | LS_ODAT_ | DS_ODAT_ | LS_ODAT_Example
 | Web Table (table from HTML) | WEBT_ | LS_WEBT_ | DS_WEBT_ | LS_WEBT_Example
 | GE Historian* | GEHI_ | LS_GEHI_ | DS_GEHI_ | LS_GEHI_Example

 

Pipelines

Pipelines are slightly different in that there are two different types. The copy data activity and the data transformation activity.

 

Type | Prefix | Action | Example
--- | --- | --- | ---
Data movement activity | PL_DATA_ | NA | PL_DATA_DS_SQL_Person_To_DS_ABLB_Person
Data transformation pipeline | PL_TRAN_ | SPRC - Stored Procedure | PL_TRAN_SPRC_CleanDimAccount
 | PL_TRAN_ | DNET - Script | PL_TRAN_DNET_AggregateSales
 | PL_TRAN_ | ADLK - Azure Data Lake | PL_TRAN_ADLK_AggregateSales
 | PL_TRAN_ | HIVE - Hive | PL_TRAN_HIVE_AggregateSales
 | PL_TRAN_ | PIG - Pig | PL_TRAN_PIG_AggregateSales
 | PL_TRAN_ | MAPR - MapReduce | PL_TRAN_MAPR_AggregateSales
 | PL_TRAN_ | HADP - Hadoop Stream | PL_TRAN_HADP_StreamData
 | PL_TRAN_ | AML - Azure Machine Learning | PL_TRAN_AML_CalculateMonthlyChurn

 

If you have any recommendations, please leave a comment or send me a tweet @SQLShark

Setting up your first Azure Data Factory.

Continuing on from the previous blog (http://blogs.adatis.co.uk/terrymccann/post/Getting-started-with-Azure-Data-Factory), we will look at setting up our first Azure Data Factory. We will do this in 3 different ways, using 3 different tools.

The three methods we will look at (in order) are:

1. Using the Azure portal
2. Using Visual studio with the add-in for ADF
3. Using PowerShell

Option 1 - Using Azure Portal.

This is the easiest to set up, however not the easiest to work with once set up - I will go into the reasons why in later blogs. To begin setting up our new instance of ADF, head over to the Azure portal and log in. You will find the Azure portal at https://portal.azure.com/. Please note that Azure is not free; you will either need to set up a free 1-month trial (https://azure.microsoft.com/en-gb/free/), already have an existing MSDN subscription, or unfortunately spend some money.

Once logged in you can find the option to add an ADF instance on the left of your screen. Click the green plus and search for "Data Factory". The icon below should appear, click on this to begin configuring your new instance of ADF.


If you're new to Azure I suggest you read Microsoft guide to Azure, to familiarise yourself with terms such as "resource group" (You will find a good introduction here: https://docs.microsoft.com/en-us/azure/fundamentals-introduction-to-azure)


Once you have clicked to create a new ADF instance you will see the following blade (a blade is a sliding window in Azure). Give your ADF a name. The name of your ADF needs to be unique across Azure, so "MyDataFactory" will not cut it unfortunately. We will have a naming guide which should help you decide how to name your first ADF instance. I will be naming my instance AdatisADFExample. In the image above, the numbers relate to the following elements.

  1. Give your instance a name (as explained above)
  2. Select your subscription (if you have multiple, select the one you're allowed to use; I generally select the one with the most money remaining).
  3. Choose to create a new resource group or use an existing. I would recommend that you create a new resource group to hold all of the artefacts required for this demo (ADF, ASDW, Blob storage)
  4. Select a location. ADF is not available in that many locations at present; you can pick East US, West US or North Europe. As I am in the UK I will select North Europe.

Finally, select pin to dashboard and then select "Create". Congratulations! You have successfully created your first instance of Azure Data Factory. Feel free to skip to the next section if you do not wish to learn how to complete the same operation in Visual Studio and in PowerShell.

Option 2 - Using Visual Studio

Before starting you will need to have Visual Studio 2015 installed, I am currently using the community edition which is free and does everything I need it to (available at https://www.visualstudio.com/downloads/ [Accessed 22/01/2017]), you will also need the ADF add-in for visual studio (https://marketplace.visualstudio.com/items?itemName=AzureDataFactory.MicrosoftAzureDataFactoryToolsforVisualStudio2015#review-details). If you have problems installing the ADF add-in, check the installation guide as it does require the Azure SDK.

Once you have everything installed open Visual Studio and select a new project. You should see Data Factory on the left hand side if it has installed correctly. Once selected you should see the screen below. You have 2 options, create an empty ADF project or use a template.


If you want to see how a project (other than this one) is created, select the template project and you will be able to pick one of eight sample solutions.


For the moment select an Empty Data Factory project. Give it a name and a location and hit OK. The visual studio project for ADF allows you to create everything you need and also to publish to Azure at the click of a button. If the solution explorer has not opened, you can enable this under the view menu. You should now be able to see the new ADF project in the solution explorer.


To use Visual studio to publish you will need at least one linked service, dataset or pipeline. For our example we will create a linked service to an Azure SQL Data Warehouse. Right click on the linked service in the solution explorer and select Add new item.

This does not necessarily need to exist at this point.


From here you will be asked to select which type of linked service you want to create. We will be connecting to an Azure SQL Data Warehouse; however, as you can see, there are a lot of options to pick from, ranging from an on-premise SQL database to Amazon Redshift. This list is frequently being updated and added to. For our example, select the Azure SQL Data Warehouse linked service. Give the linked service a name (n.b. this is a pain to change later, so making it something meaningful will help).


Once you have created a new linked service you will be presented with the JSON above. The only difference between each linked service is the structure and syntax of the JSON. $schema relates to what can and cannot be selected as part of that linked service; you can think of it as an XML schema, and it comes in handy for IntelliSense. You will want to change the name to something more indicative of what you're doing. Configure the connection string and we are ready to continue.
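
For reference, the generated file ends up looking roughly like this once the name and connection string have been set (the $schema URL and the placeholder values below are illustrative):

{
    "$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.LinkedService.json",
    "name": "LS_ASDW_Example",
    "properties": {
        "type": "AzureSqlDW",
        "typeProperties": {
            "connectionString": "Data Source=tcp:<server>.database.windows.net,1433;Initial Catalog=<database>;User ID=<user>@<server>;Password=<password>;"
        }
    }
}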

Now that we have something to publish, we will first create an instance of ADF to publish to. Right click on the parent node in the solution (as below) and select "Publish".


You will then see the following screen. You might need to change your account to the account you log in to Azure with if you do not see this screen.


You now have the option to either publish to an existing ADF instance or create a new one. You will notice that Visual Studio has read my previously created ADF instance from Azure (1).

Select "Create New Data Factory".

  1. Give it a name. For this instance I will call mine AdatisADFExample2
  2. You have the option to give your instance tags. If you intend to create a large number of ADF instances, a tag or two will help you later identify what each is doing.
  3. Give your ADF instance a description.
  4. Select which subscription to use
  5. Select a resource group - You cannot select to create a new one here, this needs to be done in the Azure portal, you can only select to use an existing RG
  6. Select a region.

Once populated, select next. If you have created additional elements such as linked services, datasets or pipelines, these would appear on the next screen. Select Next and Next again. This will begin the process of creating your ADF instance. Unlike with our example in Azure, this will not automatically be pinned to your dashboard; to do this you will need to find the item in Azure and manually pin it. Save and close your Visual Studio solution.

Option 3 - Using PowerShell

The last option is possibly the easiest and like option 1 does not require anything to deploy. This is my personal preference. To create a new ADF instance all you need to do is open your PowerShell IDE and run the following PowerShell:

Import-Module AzureRM
Import-Module Azure

Login-AzureRmAccount

# Create a new ADF
New-AzureRmDataFactory -ResourceGroupName "blbstorage" -Name "AdatisADFExample3" -Location "NorthEurope"

The AzureRM and Azure cmdlets are required and can be installed by running Install-Module AzureRM and Install-Module Azure. Login-AzureRmAccount will prompt for your Azure login details. Then New-AzureRmDataFactory will create a new ADF instance. For this example, I have named my instance AdatisADFExample3.
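
To confirm the factory was created (or to pull back its details later) you can use the corresponding Get cmdlet:

# List the new data factory back from its resource group
Get-AzureRmDataFactory -ResourceGroupName "blbstorage" -Name "AdatisADFExample3"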

To remove any of the created instances, you can also do this with PowerShell.

# Delete datafactory
Remove-AzureRmDataFactory -Name "AdatisADFExample3" -ResourceGroupName "blbstorage"

When I am developing, I will use PowerShell to build the new ADF instance, then once it has been created use Visual Studio to develop linked services, datasets and pipelines. Sometimes Azure can get caught in a loop when nothing will validate or deploy. When this happens use PowerShell to remove the ADF instance (and all created artefacts) then use Visual Studio to redeploy the entire solution. If you have made any changes in Azure directly, these will be lost.

So what have we created?

If you click on your new ADF instance you will see the following screen.


Core actions:

1. Author and deploy

A&D is the original way to create the main artefacts in ADF. As above you can also use Visual Studio and PowerShell.


New data store - Create a new linked service
New compute - Create a new compute resource, such as an HDInsight cluster
New dataset - A new dataset in a linked service
New pipeline - Create a new pipeline
New data gateway - Create a gateway (installed locally on a server)

Authoring and deploying via this environment is somewhat cumbersome. Everything you want to do in ADF needs to be written in JSON. A&D will auto generate most of what you need however the route to starting is not the easiest.

2. Copy data

Copy data (at the time of writing is still in preview) is a GUI on top of the Author and deploy function. It is a wizard based walk through which takes you through a copy task. Essentially this is the creation of a data movement activity pipeline, however Azure will build the linked services, datasets and pipelines as required. We will look at this for our movement of data from on premise SQL to Azure Blob.

3. Monitor and manage

Monitor pipelines through a graphical interface. This will take you to the new monitor dashboard. We will have a blog around monitoring pipelines later in this series.

4. Sample pipelines

Sample pipelines will create all the artefacts to get you started with an example ADF pipeline. This is great for beginners. There is also a load of templates in visual studio. You will need a preconfigured blob storage account and depending on your chosen example some additional Azure services.

5. Diagram

The somewhat defunct view. This has largely been replaced by Monitor and manage.

clip_image012

6. Metrics and operations

Operational metrics. This gives you a high-level view of what has succeeded and what has failed.

In this blog we looked at setting up an Azure Data Factory instance in the Azure portal, in Visual Studio and with PowerShell. My personal preference is PowerShell combined with Visual Studio. Datasets, pipelines and linked services (7, 8 & 9) are what we will begin looking at in our next blog - Copying data from on-premises SQL Server to Azure Blob storage with Azure Data Factory.

What is Azure Data Factory?

clip_image001

(Image 1 - Azure Data Factory – Transformation Pipeline)

Overview:

Azure Data Factory (ADF) is a big data processing platform from Microsoft on the Azure platform. For database developers, the obvious comparison is with Microsoft's SQL Server Integration Services (SSIS). SSIS is an ETL tool (extract, transform and load); ADF is not an ETL tool. ADF is more akin to ELT frameworks (extract, load, transform); while the terms are similar, the process is very different. For those who have worked in the data warehouse arena for a number of years, you will be very familiar with ETL, and switching to ELT can feel somewhat alien and cumbersome. As such, making obvious comparisons to SSIS will only hinder your understanding of the core concepts of ADF. ADF is not a cloud version of SSIS as many would have you believe; it is both more and less.

On the topic of ETL, Dan Linstedt (the father of Data Vault) published an interesting article, "ETL is Dead! Long Live ELT" (http://danlinstedt.com/allposts/datavaultcat/etl-is-dead-long-live-etl/). In it, Linstedt argues that ETL is now defunct for a number of reasons, the primary one being big data. To enable fast processing of big data, we have seen a shift to parallel processing with tools such as Hadoop's HDFS, Cassandra, Spark and, on the Microsoft side, Data Lake and Azure SQL Data Warehouse. By having our data spread across many nodes we can push the processing to the data via a MapReduce job. ADF is designed to be a big data processing pipeline, so it makes sense to process where the data lives rather than extracting all our data, transforming it and then loading it in one process. Instead we ingest the data, load it into a parallel storage engine (HDFS) and then transform. If this is the first time you're reading about ELT, I recommend both Linstedt's article, where he states "ETL truly is DEAD. Metadata & Lineage are NOT dead, ELT is the future", and James Serra's blog "The difference between ETL and ELT" (http://www.jamesserra.com/archive/2012/01/difference-between-etl-and-elt/).

ADF is often described as an "orchestration" tool (https://azure.microsoft.com/en-gb/services/data-factory/) and this is because of its two functions. Being ELT and not ETL, the two main functions might not surprise you: moving data and transforming it. When I say transforming data, you might assume ADF does the transformation itself; however, the only action ADF actually performs is to issue a command for a transformation to happen (issuing an EXEC for a stored procedure or kicking off a MapReduce job, for example). It is this issuing of commands which makes the tool an orchestrator and not a multipurpose ETL tool such as SSIS.

The data processing landscape is changing. While ADF is still a relatively new tool, a lot of development is ongoing and improvements are being made. ADF is available through the Azure portal and via the ADF add-in for Visual Studio (https://azure.microsoft.com/en-us/blog/azure-data-factory-visual-studio-extension-for-authoring-pipelines/).

Now we know what ADF does, let's look at how it does it. ADF has three main components: linked services, datasets and pipelines.

Core Concepts:

clip_image002

(Image 2 - High level ADF)

The diagram above shows how ADF slots together. If we follow it, starting with a linked service, we can explore this further. A linked service is our connection to some form of resource. This could be an on-premises SQL database, a blob storage container, an Azure SQL Data Warehouse or something else. It is here that we want something to happen: that could be extracting data, sinking data (an ADF term for storing data) or transforming data using stored procedures, MapReduce and so on. A dataset identifies which part of a linked service we want an activity to be performed against (a table, view or stored procedure); a dataset is not a database, but rather a table or a container in blob storage. A linked service can have one or many datasets. An activity is performed by a pipeline, which will perform an action (or actions) on a dataset (or datasets).


Linked services

A linked service is best thought of as a logical data connection, allowing ADF to connect to an external resource. A linked service acts as a representation of either a data store or a compute resource.
A data store is used as part of a data movement activity; a compute resource is executed as part of a data transformation activity.
See the documentation links at the bottom of the page for more information on creating linked services and the different sources you can use.
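
To illustrate the compute side, the sketch below is roughly what an on-demand HDInsight compute linked service looks like. The cluster size, time to live and names are placeholder values, and it assumes the blob storage linked service from earlier already exists:

# A compute linked service: an on-demand HDInsight cluster (placeholder values)
$computeJson = @'
{
  "name": "OnDemandHDInsightLinkedService",
  "properties": {
    "type": "HDInsightOnDemand",
    "typeProperties": {
      "clusterSize": 4,
      "timeToLive": "00:30:00",
      "linkedServiceName": "AdatisBlobLinkedService"
    }
  }
}
'@

$computeJson | Out-File "C:\ADF\OnDemandHDInsightLinkedService.json"
New-AzureRmDataFactoryLinkedService -ResourceGroupName "blbstorage" -DataFactoryName "AdatisADFExample3" -File "C:\ADF\OnDemandHDInsightLinkedService.json"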

Datasets

A dataset represents a data structure in ADF, and each dataset belongs to a linked service. The linked service might be an Azure Blob storage account or an Azure SQL database; the dataset is a representation of the data itself (a folder of files in blob storage or a table in an Azure database), not the database itself.
For a full list of sources and sinks, see the links at the bottom of the page.
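
As a sketch, a blob dataset sitting on top of the storage linked service above might look like the following; the folder path, format and schedule are placeholders:

# A dataset pointing at a folder in the blob storage linked service (placeholder values)
$datasetJson = @'
{
  "name": "BlobOutputDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AdatisBlobLinkedService",
    "typeProperties": {
      "folderPath": "adfexample/output/",
      "format": { "type": "TextFormat", "columnDelimiter": "," }
    },
    "availability": { "frequency": "Hour", "interval": 1 }
  }
}
'@

$datasetJson | Out-File "C:\ADF\BlobOutputDataset.json"
New-AzureRmDataFactoryDataset -ResourceGroupName "blbstorage" -DataFactoryName "AdatisADFExample3" -File "C:\ADF\BlobOutputDataset.json"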

Pipelines

Pipelines are where the power of ADF lies. There are two types of pipeline, which support our notion of ELT.

Data movement pipeline - A data movement pipeline does what its name implies: it moves data from one dataset to another. In the Azure documentation these two datasets are referred to as the source and the sink. In a data movement activity, data can be moved from any supported source to any supported sink; for example, you can move data from Azure Blob storage into a SQL database, or from HDFS into an Azure SQL Data Warehouse. This part handles the E and the L of our ELT: data is extracted and loaded. The only difference from SSIS is that at this point there is no T; no transformation has happened yet.
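
A rough sketch of a data movement pipeline with a single copy activity is below. It assumes a SQL input dataset and the blob dataset above already exist; the dataset names, query and active period are placeholders:

# A data movement pipeline: one copy activity from a SQL source to a blob sink (placeholder values)
$copyPipelineJson = @'
{
  "name": "CopySqlToBlobPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyMyTableToBlob",
        "type": "Copy",
        "inputs":  [ { "name": "SqlInputDataset" } ],
        "outputs": [ { "name": "BlobOutputDataset" } ],
        "typeProperties": {
          "source": { "type": "SqlSource", "sqlReaderQuery": "SELECT * FROM dbo.MyTable" },
          "sink": { "type": "BlobSink" }
        }
      }
    ],
    "start": "2017-01-01T00:00:00Z",
    "end": "2017-01-08T00:00:00Z"
  }
}
'@

$copyPipelineJson | Out-File "C:\ADF\CopySqlToBlobPipeline.json"
New-AzureRmDataFactoryPipeline -ResourceGroupName "blbstorage" -DataFactoryName "AdatisADFExample3" -File "C:\ADF\CopySqlToBlobPipeline.json"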

Data transformation pipeline - The data transformation activities again do as the name implies: transform. As ADF is an orchestration tool, a transformation could be a MapReduce job, a streaming job, a stored procedure or a Data Lake U-SQL query. To use ADF fully you will need to know how you're storing the data (HDFS/Data Lake) and what language you want to use to process it. Knowing how ADF functions is only half the battle.
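
And a sketch of a transformation pipeline, where all ADF does is issue the EXEC for a stored procedure. The procedure name, the dummy output dataset (which ADF uses for scheduling and to identify the target database) and the dates are placeholders:

# A data transformation pipeline: ADF issues an EXEC for a stored procedure (placeholder values)
$sprocPipelineJson = @'
{
  "name": "LoadWarehousePipeline",
  "properties": {
    "activities": [
      {
        "name": "RunLoadProc",
        "type": "SqlServerStoredProcedure",
        "outputs": [ { "name": "WarehouseDummyDataset" } ],
        "typeProperties": {
          "storedProcedureName": "dbo.usp_LoadFromBlob"
        }
      }
    ],
    "start": "2017-01-01T00:00:00Z",
    "end": "2017-01-08T00:00:00Z"
  }
}
'@

$sprocPipelineJson | Out-File "C:\ADF\LoadWarehousePipeline.json"
New-AzureRmDataFactoryPipeline -ResourceGroupName "blbstorage" -DataFactoryName "AdatisADFExample3" -File "C:\ADF\LoadWarehousePipeline.json"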

In relationship terms, you can think of the interconnection of linked services, pipelines and datasets as follows: a linked service has one or more datasets (a SQL database with many tables, for example), and a pipeline performs an action on one or more datasets.

An Example:

Let's imagine a process where you have an on-premises SQL Server box. You want to move multiple tables to blob storage, and from there you then want to issue a stored procedure which will consume that data into an Azure SQL Data Warehouse via PolyBase, as illustrated in the image below.

clip_image003

(Image 3 - Example Process)

To complete this we would need to create the following artefacts in ADF (a PowerShell deployment sketch follows these lists):

Linked services:
On-premises SQL database
Azure BLOB storage
Azure SQL data warehouse

Datasets:
Table in the on-premises SQL database
The blob container
The stored procedure

Pipelines:
Pipeline to move data from SQL database to blob storage
Pipeline to issue stored procedure
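
Purely as a sketch of how those artefacts might be deployed once their JSON definitions exist (all file names here are hypothetical):

# Deploy the example artefacts from their JSON definitions (file names are hypothetical)
$rg  = "blbstorage"
$adf = "AdatisADFExample3"

# Linked services: on-premises SQL Server, blob storage, Azure SQL Data Warehouse
"OnPremSql.json", "BlobStorage.json", "SqlDw.json" | ForEach-Object { New-AzureRmDataFactoryLinkedService -ResourceGroupName $rg -DataFactoryName $adf -File $_ }

# Datasets: the source table, the blob container and the stored procedure's output
"SqlTable.json", "BlobContainer.json", "SprocOutput.json" | ForEach-Object { New-AzureRmDataFactoryDataset -ResourceGroupName $rg -DataFactoryName $adf -File $_ }

# Pipelines: copy the data to blob storage, then issue the stored procedure
"CopyToBlob.json", "RunSproc.json" | ForEach-Object { New-AzureRmDataFactoryPipeline -ResourceGroupName $rg -DataFactoryName $adf -File $_ }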

In later blogs we will look at creating this example in full; however, if you want to get started now, the links at the bottom of the page will guide you through setting up ADF.

How to get started:

There are 3 ways to develop for ADF.

  1. Azure portal - A semi-rich IDE, all in the Azure portal.
  2. Visual Studio with the ADF add-in - A richer development environment. It allows for source control of code through TFS and automated deployment of changes. This matters because the Azure portal can sometimes get caught in a syntax loop where nothing will deploy; you will sometimes have to delete your ADF instance and start again, and Visual Studio makes removing and redeploying the artefacts a lot easier. If your code is in VS, all you need to do is redeploy.
  3. PowerShell - Anything you can do in the portal you can do in PowerShell.

My development preference is a combination of 2 and 3: I use Visual Studio for all my code, which is source controlled with TFS, and then I use PowerShell to build my ADF environment.

Thanks for reading this introduction to ADF. I hope you have found it somewhat insightful. We have a whole series of blogs looking in depth at different aspects and design patterns for loading a data warehouse with Azure Data Factory. You can read more in our next blog looking at setting up your first Azure Data Factory.

Links:

Learning path:  https://azure.microsoft.com/en-gb/documentation/learning-paths/data-factory/

Documentation:  https://docs.microsoft.com/en-us/azure/data-factory/

Introduction to ADF [YouTube]: https://www.youtube.com/watch?v=_hOYAT69yIw

ADF overview [Channel9]: https://channel9.msdn.com/Blogs/Azure/Azure-Data-Factory-Overview