
Adatis BI Blogs

Migrating to Native Scoring with SQL Server 2017 PREDICT

Microsoft introduced native predictive model scoring with the release of SQL Server 2017. The PREDICT function (Documentation) is now a native T-SQL function that eliminates having to score using R or Python through the sp_execute_external_script procedure. It's an alternative to sp_rxPredict. In both cases you do not need to install R, but with PREDICT you do not need to enable SQLCLR either - it's truly native. PREDICT should make predictions much faster, as the process avoids having to marshal the data between SQL Server and Machine Learning Services (previously R Services). Migrating from the original sp_execute_external_script approach to the new native approach tripped me up, so I thought I'd share a quick summary of what I have learned.

Stumble One: Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80004001. Model type is unsupported.

Reason: Not all models are supported. At the time of writing, only the following models are supported: rxLinMod, rxLogit, rxBTrees, rxDTree and rxDForest. sp_rxPredict supports additional models, including those available in the MicrosoftML package for R (I was attempting to use rxFastTrees). I presume this limitation will reduce over time. The list of supported models is referenced in the PREDICT function (Documentation).

Stumble Two: Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80070057. Model is corrupt or invalid.

Reason: The serialisation of the model needs to be modified for use by PREDICT. Typically you might serialise your model in R like this:

model <- data.frame(model=as.raw(serialize(model, NULL)))

Instead you need to use the rxSerializeModel method:

model <- data.frame(rxSerializeModel(model, realtimeScoringOnly = TRUE))

There's a corresponding rxUnserializeModel method, so it's worth updating the serialisation across the board so models can be used interchangeably in the event all models are eventually supported - my serialisation had been a bit legacy.

That's it. Oh, apart from the fact PREDICT is supported in Azure SQL DB, despite the documentation saying the contrary.
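For completeness, a native scoring call looks roughly like the sketch below. This is only an illustration: the model storage table, input table and output column are hypothetical, and the WITH clause must match the output schema of your particular model.

-- Retrieve a model previously saved with rxSerializeModel (hypothetical storage table).
DECLARE @model VARBINARY(MAX) =
    (SELECT model_blob FROM dbo.Models WHERE model_name = 'MyRealtimeModel');

-- Score new rows natively, with no call-out to R or Python.
SELECT d.*, p.*
FROM PREDICT(MODEL = @model, DATA = dbo.NewData AS d)
WITH (Score FLOAT) AS p;   -- output column name and type depend on the model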

Automating The Deployment of Azure Data Factory Custom Activities

Custom Activities in Azure Data Factory (ADF) are a great way to extend the capabilities of ADF by utilising C# functionality. Custom Activities are useful if you need to move data to/from a data store that ADF does not support, or to transform/process data in a way that isn't supported by Data Factory, as they can be used within an ADF pipeline.

Deploying Custom Activities to ADF is a manual process which requires many steps. Microsoft's documentation lists them as:

Compile the project. Click Build from the menu and click Build Solution.
Launch Windows Explorer, and navigate to the bin\debug or bin\release folder, depending on the type of build.
Create a zip file MyDotNetActivity.zip that contains all the binaries in the \bin\Debug folder. Include the MyDotNetActivity.pdb file so that you get additional details, such as the line number in the source code that caused the issue, if there is a failure.
Create a blob container named customactivitycontainer if it does not already exist.
Upload MyDotNetActivity.zip as a blob to the customactivitycontainer in a general purpose Azure blob storage account that is referred to by AzureStorageLinkedService.

The number of steps means that it can take some time to deploy Custom Activities and, because it is a manual process, it can contain errors such as missing files or uploading to the wrong storage account. To avoid the errors and delays caused by a manual deployment, we want to automate as much as possible. Thanks to PowerShell, it's possible to automate the entire deployment. The script to do this is as follows:

Login-AzureRmAccount

# Parameters
$SourceCodePath = "C:\PathToCustomActivitiesProject\"
$ProjectFile = "CustomActivities.csproj"
$Configuration = "Debug"

# Azure parameters
$StorageAccountName = "storageaccountname"
$ResourceGroupName = "resourcegroupname"
$ContainerName = "blobcontainername"

# Local Variables
$MsBuild = "C:\Program Files (x86)\MSBuild\14.0\Bin\MSBuild.exe";
$SlnFilePath = $SourceCodePath + $ProjectFile;

# Prepare the Args for the actual build
$BuildArgs = @{
    FilePath = $MsBuild
    ArgumentList = $SlnFilePath, "/t:rebuild", ("/p:Configuration=" + $Configuration), "/v:minimal"
    Wait = $true
}

# Start the build
Start-Process @BuildArgs

# initiate a sleep to avoid zipping up a half built project
Sleep 5

# create zip file
$zipfilename = ($ProjectFile -replace ".csproj", "") + ".zip"
$source = $SourceCodePath + "bin\" + $Configuration
$destination = $SourceCodePath + $zipfilename
if(Test-Path $destination) {Remove-Item $destination}
Add-Type -assembly "system.io.compression.filesystem"
[io.compression.zipfile]::CreateFromDirectory($source, $destination)

# create storage account if not exists
$storageAccount = Get-AzureRmStorageAccount -ErrorAction Stop | Where-Object {$_.StorageAccountName -eq $StorageAccountName}
if (!$storageAccount) {
    $StorageLocation = (Get-AzureRmResourceGroup -ResourceGroupName $ResourceGroupName).Location
    $StorageType = "Standard_LRS"
    New-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName -Location $StorageLocation -Type $StorageType
}

# create container if not exists
$ContainerObject = Get-AzureStorageContainer -ErrorAction Stop | Where-Object {$_.Name -eq $ContainerName}
if (!$ContainerObject){
    $storagekey = Get-AzureRmStorageAccountKey -ResourceGroupName $ResourceGroupName -Name $StorageAccountName
    $context = New-AzureStorageContext -StorageAccountName $StorageAccountName -StorageAccountKey $storagekey.Key1 -Protocol Http
    New-AzureStorageContainer -Name $ContainerName -Permission Blob -Context $context
}

# upload to blob
# set default context
Set-AzureRmCurrentStorageAccount -StorageAccountName $StorageAccountName -ResourceGroupName $ResourceGroupName
Get-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName

# Upload file
Set-AzureStorageBlobContent -Container $ContainerName -File $destination

By removing the manual steps in building, zipping and deploying ADF Custom Activities, you remove the risk of something going wrong and you add the reassurance that you have a consistent method of deployment, which will hopefully speed up your overall development and deployments.

As always, if you have any questions or comments, do let me know.

How to prepare for 70-776 - Perform Big Data Engineering on Microsoft Cloud Services

There is a new exam currently in beta titled "Perform Big Data Engineering on Microsoft Cloud Services (beta)". As with all new exams, there is little content on how to revise for it beyond the exam's summary. This exam, however, covers what Adatis specialises in! Microsoft may call this "Big Data Engineering"; we call it "Modern Data Analytics", and we have a few blogs on the subject. You can sign up to the exam here: https://www.microsoft.com/en-us/learning/exam-70-776.aspx

Below you will find links to blog posts by Adatis consultants on topics related to all the key objectives of this exam. I will endeavour to keep this up-to-date with new content added by the team. Good luck with the exam.

Design and Implement Complex Event Processing By Using Azure Stream Analytics (15-20%)
Streaming data is vital to achieving real-time analytics. The following blog posts focus on this and offer an introduction and walkthrough for getting started with Stream Analytics. When talking about a wider Lambda approach to Big Data, streaming enables rapid processing via a "Speed" layer.

http://blogs.adatis.co.uk/simonwhiteley/post/Adatis-Hackathon-Jan-2015-Streaming-Analytics-First-Thoughts
http://blogs.adatis.co.uk/Jose%20Mendes/post/IoT-Hub-Device-Explorer-Stream-Analytics-Visual-Studio-2015-and-Power-BI
http://blogs.adatis.co.uk/sachatomey/post/2017/01/19/Power-BI-Streaming-Datasets-An-Alternative-PowerShell-Push-Script
http://blogs.adatis.co.uk/Jose%20Mendes/post/Data-Data-Revolution

Design and Implement Analytics by Using Azure Data Lake (25-30%)
Azure Data Lake Store and Analytics are a vital component of "Modern Data Analytics". Data which is too large for traditional single-server processing needs distributed parallel computation. Rather than pulling data out for processing, ADLA pushes the processing to the data. Understanding how to process large volumes of data is one part of the "Batch" layer in Lambda.

http://blogs.adatis.co.uk/ustoldfield/post/data-lakes
http://blogs.adatis.co.uk/ustoldfield/post/Data-Flow-Job-Execution-in-the-Azure-Data-Lake
http://blogs.adatis.co.uk/ustoldfield/post/Data-Flow-Pt-2-Vertexes-In-Azure-Data-Lake

Design and Implement Azure SQL Data Warehouse Solutions (15-20%)
Either as an alternative or as an accompaniment to Data Lake is Azure SQL Data Warehouse. If Data Lake is batch across many files, Azure SQLDW is parallel batch over many databases. The key to both services is processing at the storage and not at the compute. The following is an ongoing blog series covering the basics all the way to a deep-dive.

http://blogs.adatis.co.uk/simonwhiteley/post/A-Guide-to-Azure-SQL-DataWarehouse
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-What-is-it
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-How-Does-Scaling-Work
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-Distribution
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-Polybase
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-Polybase-Design-Patterns
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-Polybase-Limitations
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-CTAS-Statements

Design and Implement Cloud-Based Integration by using Azure Data Factory (15-20%)
If you're looking for a PaaS solution to move data in Azure, there is only really one option: Azure Data Factory. The following blogs will get you up to speed with ADF.

http://blogs.adatis.co.uk/terrymccann/post/Getting-started-with-Azure-Data-Factory
http://blogs.adatis.co.uk/terrymccann/post/Setting-up-your-first-Azure-Data-Factory
http://blogs.adatis.co.uk/terrymccann/post/Azure-Data-Factory-using-the-Copy-Data-task-to-migrate-data-from-on-premise-SQL-Server-to-Blob-storage

Manage and Maintain Azure SQL Data Warehouse, Azure Data Lake, Azure Data Factory, and Azure Stream Analytics (20-25%)
Knowing each of the parts is only half the battle; you need to know how, when and why to use each part. What are the best practices?

http://blogs.adatis.co.uk/ustoldfield/post/Deploying-a-Hybrid-Cloud
http://blogs.adatis.co.uk/terrymccann/post/Azure-Data-Factory-Suggested-naming-conventions-and-best-practices
http://blogs.adatis.co.uk/ustoldfield/post/Azure-Data-Lake-Store-Storage-and-Best-Practices
http://blogs.adatis.co.uk/ustoldfield/post/Shaping-The-Lake-Data-Lake-Framework

Instant Bot: deploying a Bot in minutes with Azure Bot Service

I had been playing around with the Bot Framework for a while but hadn't really got anywhere, largely due to not having enough time to create something worthwhile, when I came across the Azure Bot Service whilst trawling through the documentation of the Bot Framework. The Azure Bot Service is currently in preview and allowed me to quickly author and deploy a basic bot for the purpose of this post and walkthrough.

Creating the Bot
Like most services in Azure, creating the bot is easy and requires only a few inputs.

Configuring the Bot
Setting up the Bot is a bit more involved. Once the bot has been created, you'll be presented with a configuration screen. The App ID and Password are auto-generated by Microsoft, but you will need to make a note of the password and store it securely, as it is only displayed once in the app registration process. Next you want to select the language in which the bot is developed and deployed. You have the choice of C# or NodeJS; I opted for C# as it's a language I am most familiar with. Choose your template, accept the T's & C's and your bot is ready to be deployed!

Deploying the Bot
The Bot has now been created and configured, displaying its source code, which can be further tweaked in the browser or Visual Studio. You can also embed your Bot in a number of existing apps, websites and services.

Chatting with the Bot
The basic bot isn't the most stimulating of conversational partners, but it is satisfying to see your creation talk back, even if it only repeats what you have just told it. The Bot Framework opens up many possibilities to make the services you offer engaging in a conversational way, and the Azure Bot Service makes the Bot Framework that much more accessible, so you can quickly deploy bots and have them out there, engaging with users.

Geographic Spatial Analysis with Azure Data Lake Analytics (ADLA)

Whilst working on an Azure Data Lake project, a requirement hit the backlog that could be easily solved with a Geographical Information System (GIS) or even SQL Server - spatial data type support was introduced into SQL Server 2008. However, Azure Data Lake Analytics (ADLA) does not natively support spatial data analytics, so we'll have to extract the data into another service, right? Wrong! :) Due to the extensibility of Azure Data Lake Analytics, we can enhance it to do practically anything. In fact, we can lean on existing components and enhance the service without having to develop the enhancement itself. This blog is a quick run through demonstrating how to enhance ADLA such that it will support spatial analytics and meet our project requirement.

Problem
For simplicity I've trivialised the problem. Here's the requirement: indicate which Bus Stops are within 1.5 km of Southwark Tube Station. To support this requirement, we have two datasets:

A list of all the Bus Stops in London, including their Geo location (circa 20k records)
The Geo location record of Southwark Tube Station (a single record!)

In fact, the location of the tube station is pretty accurate and is geo located to the entrance pavement outside the tube station. This would be an easy problem for a GIS to solve. You would specify the central point, i.e. our Southwark Tube Station marker, draw a circle, or buffer, with a radius of 1.5 km around it, and select all bus stops that fall within or intersect with that circle. This spatial analysis is easy for these systems as that's essentially what they are built to do. SQL Server 2008 introduced the Spatial Data Type; this allowed spatial style analysis to be performed on geo data using T-SQL in conjunction with the supplied Geometry and Geography data types. More info on those can be found here. So, how can we solve our problem in ADLA, without a GIS and without having to export the data to SQL Server?

Solution
You can register existing assemblies with ADLA. It so happens that the SQL Server Data Types and Spatial assemblies are nicely packaged up and can be used directly within ADLA itself - think about that, it's pretty awesome! Caveat: at the time of writing we have no idea of the licence implications. It will be up to you to ensure you are not in breach :) Those assemblies can be downloaded from here. You only need to download and install the following file: ENU\x64\SQLSysClrTypes.msi. This installs two key assemblies, which you'll need to grab and upload to your Data Lake Store:

C:\Program Files (x86)\Microsoft SQL Server\130\SDK\Assemblies\Microsoft.SqlServer.Types.dll
C:\Windows\System32\SqlServerSpatial130.dll

Once they have been uploaded to your Data Lake Store, you need to register those assemblies with ADLA:

DECLARE @ASSEMBLY_PATH string = "/5.UTILITY/USQL-Extend/SQL-Server/";
DECLARE @TYPES_ASM string = @ASSEMBLY_PATH+"Microsoft.SqlServer.Types.dll";
DECLARE @SPATIAL_ASM string = @ASSEMBLY_PATH+"SqlServerSpatial130.dll";

CREATE DATABASE IF NOT EXISTS SQLServerExtensions;
USE DATABASE SQLServerExtensions;

DROP ASSEMBLY IF EXISTS SqlSpatial;
CREATE ASSEMBLY SqlSpatial
FROM @TYPES_ASM
WITH ADDITIONAL_FILES = ( @SPATIAL_ASM );

Following registration of the assemblies, we can see the registration loaded in the ADLA Catalog database we created. We are now ready to use this U-SQL enhancement in our U-SQL query - let's go right ahead and solve our problem in one U-SQL script:

// Reference the assemblies we require in our script.
// System.Xml we get for free as a System Assembly, so we didn't need to register that,
// unlike our SQLServerExtensions.SqlSpatial assembly.
REFERENCE SYSTEM ASSEMBLY [System.Xml];
REFERENCE ASSEMBLY SQLServerExtensions.SqlSpatial;

// Once the appropriate assemblies are registered, we can alias them using the USING keyword.
USING Geometry = Microsoft.SqlServer.Types.SqlGeometry;
USING Geography = Microsoft.SqlServer.Types.SqlGeography;
USING SqlChars = System.Data.SqlTypes.SqlChars;

// First create the centralised point.
// In this case it's the pavement outside the entrance of Southwark Tube Station, London.
// Format is Longitude, Latitude and then SRID.
// NB: It's Longitude then Latitude - the opposite way round to what you might expect.
DECLARE @southwarkTube Geography = Geography.Point(-0.104777,51.503829,4326);

// Next we extract our entire London bus stop data set from the file.
// There's about 20k of them.
@busStopInput =
    EXTRACT [StopCode] string,
            [StopName] string,
            [Latitude] double?,
            [Longitude] double?
    FROM @"/1.RAW/OpenData/Transport/bus-stops-narrow-full-london.csv"
    USING Extractors.Csv(skipFirstNRows:1,silent:true);

// This is effectively the transform step and where the magic happens.
// Very similar syntax to what you would do in T-SQL.
// We are returning all the bus stops that fall within 1500m of Southwark Tube.
// Essentially we return all stops that intersect with a 1500m buffer around the central tube point.
@closeBusStops =
    SELECT *
    FROM @busStopInput
    WHERE @southwarkTube.STBuffer(1500).STIntersects(Geography.Point((double)@busStopInput.Longitude,(double)@busStopInput.Latitude,4326)).ToString()=="True";

// The results are written out to a csv file.
OUTPUT @closeBusStops
TO "/4.LABORATORY/Desks/Sach/spatial-closebusstops.csv"
USING Outputters.Csv(outputHeader: true);

The query outputs a list of bus stops that are within the specified spatial distance from Southwark Tube Station. If we have a look at all the bus stops (in red) and overlay all the 'close' bus stops (in green), we can see the results. Pretty neat.

Azure Data Lake Analytics does not natively support spatial data analytics but, by simply utilising the assemblies that ship with SQL Server, we can extend the capability of U-SQL to provide that functionality, or practically any functionality we desire.

Data Flow Pt 2: Vertexes In Azure Data Lake

Following on from my previous post on Job Execution in the Azure Data Lake, this post will explore the concept of extents and how they are utilised by vertexes in the actual job processing in Azure.

The U-SQL script that has been authored, compiled and deployed is the logical plan of how the author intends to transform input data into output data. This creates a total amount of work - essentially the amount of data it has to process - which is decomposed into a set of vertexes. Each vertex will process a subset of data, or extents (see Azure Data Lake Storage for more information), and represents a fraction of the total. Vertexes are displayed, or grouped, in a super vertex, also known as a stage. Vertexes in each stage are doing the same operation on a different part of the same data. The number of vertexes in a stage indicates the maximum theoretical parallelisation of that stage. The containers requested for the job will be allocated to complete each vertex.

Say there is a 10GB file. This file will be split into 40 extents and allocated to at least 10 vertexes. If one wants to process all of the file in parallel, then requesting 10 containers will allow for concurrent parallelism. All this is visualised as the job graph.

If you have multiple transformations going on in your U-SQL script, this will create multiple stages, and the output of one vertex becomes the input of another vertex in a dependent stage. As a result, dependent stages can begin processing even if preceding stages haven't completed processing. If dependency does exist, it will create a critical path - the dependency chain of vertexes which keeps the job running to the very end, because the vertex at the bottom depends on the output of the vertex at the top. This can be seen in the Vertex Execution View in the Visual Studio Azure Data Lake Tools. It's useful for optimising a job, by re-positioning or re-writing elements of your script, by checking which vertex takes the longest.

It may not be possible to use all the reserved parallelism during a stage if there are fewer vertexes than Azure Data Lake Analytics Units (ADLAUs) available. For example, if I have 10 ADLAUs, that's great for an early stage, as all ADLAUs will be allocated a vertex. But with later stages, more and more ADLAUs will be idle, so by the last stage only a couple may be utilised. Unfortunately it is not currently possible to dynamically de-allocate ADLAUs during a job.

In conclusion, by understanding how vertexes interact with extents and can influence parallelism, you should be able to provision adequate resources for your jobs, as well as know where to start the investigation for any long running queries you might have with the Vertex Execution View. In the next post of the series I will be looking into resource costs and optimising spend and time taken to process data.

Azure SQLDW-CTAS Statements

So here we are - you've designed your target distribution, you're scaling your warehouse appropriately and you've set up Polybase to access raw data outside of the system. You are now ready to bring data into the warehouse itself so we can start working on it.

Firstly, you should be aware that we're building using ELT, not ETL, i.e. we design to allow the warehouse itself to do the work, as it is the most powerful processing option available to us. When first introducing Polybase, I described how it is the only method of loading data into SQLDW that does not get bottlenecked through the control node. Well, the same limitations will be present if we try and transform our data using traditional ETL tools. If we used an SSIS data flow, for example, this would extract data out, pull it across to the SSIS server, perform the transformation and then attempt to reinsert the data. Not only are we introducing network latency, but we are also ignoring the power of the MPP engine AND encountering the throughput limits of the control node. In order to utilise SQLDW effectively, we write SQL for our transformations, rather than relying on external tools such as SSIS. This ensures the work is being done by our compute nodes and, therefore, can be scaled up to increase performance.

General best practice, therefore, would be to write stored procedures for each of the data movements we want to occur. This allows us to add in auditing, logging etc. But what we're interested in here is the core data movement itself. Writing a traditional INSERT statement isn't the fastest way to get data into a table. There is a special syntax which creates a new table and inserts into it, and which is automatically configured for optimal bulk loading: the CTAS, or "Create Table As Select", statement.

CREATE TABLE [dbo].[MyTable]
WITH
(
    HEAP,
    DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT * FROM [dbo].[MySourceTable];

As it is the only minimally-logged, bulk insert method within SQLDW, it's the fastest way to get data into a table. If the table already exists and you want to wipe it and fill it with data, the fastest way to do so is to drop the table and recreate it using a CTAS. Performing a truncate/insert combination would be slower, as each row insert causes a write to the transaction log. SQLDW performs all standard actions as a transaction by default, so if you have a huge insert statement and something goes wrong, be prepared for a lengthy rollback process.

The table created by a CTAS statement infers its schema from the query, so you need to be explicit with ISNULLs, CASTs and anything needed to point the query in the right direction. This is where the bulk of the data processing happens - the SELECT part of the CTAS both defines the destination table and carries out the data transformations, all as a single step. If you've ever used "SELECT field, field2 INTO [dbo].[NewTable] FROM [dbo].[SourceTable]" then you'll have a pretty good idea of how it'll work. But if you ran the two side by side, CTAS would perform faster as, again, it is specifically treated differently by the SQLDW engine.

What does this mean for our traditional ETL pattern? If you're following the standard approach of having intermediary tables between data processing steps (i.e. staged data, cleaned data, transformed data and so on), these transient tables can actually be deleted between loads. They won't exist in your database project. They won't exist until your load process has been run.

A CTAS to 'Clean' our fact table would be:

DECLARE @ErrorMsg VARCHAR(250);

BEGIN TRY
    --Primary CTAS to insert data from external table into strongly typed Clean table
    CREATE TABLE Clean.MyTable
    WITH
    (
        HEAP,
        DISTRIBUTION = ROUND_ROBIN
    )
    AS
    SELECT
        ISNULL(CAST(Column_0 as INT),-1)  TableId,
        CAST(Column_1 as VARCHAR(50))     SomeTextField,
        CAST(Column_2 as DECIMAL(12,5))   SomeNumber
    FROM SRC.MySourceTable
    OPTION (LABEL = 'Clean.LoadMyTable.CTAS');
END TRY
BEGIN CATCH
    --CTAS Failed, mark process as failed and throw error
    SET @ErrorMsg = 'Error loading table "MyTable" during Cleaning CTAS: ' + ERROR_MESSAGE()
    RAISERROR (@ErrorMsg, 16, 1)
END CATCH

By including the CAST and ISNULL statements, I'm controlling the schema of the created table. The resulting table would have the following definition:

CREATE TABLE [Clean].[MyTable]
(
    [TableId] INT NOT NULL,
    [SomeTextField] VARCHAR(50) NULL,
    [SomeNumber] NUMERIC(12,5) NULL
)
WITH
(
    HEAP,
    DISTRIBUTION = ROUND_ROBIN
);

By using CTAS wherever possible, you will maintain the speed of your data loads and avoid any issues around long-running transactions. An efficient SQLDW loading process contains ONLY CTAS functions. For example, there is no such thing as a MERGE statement within SQLDW. In order to merge the results of one table into another, you can write a CTAS that selects a union of the two tables and inserts them into a new table. You then drop the old table and rename your newly created table in its stead, as sketched below. There are many similar patterns that will help you around any data movement problem - we'll discuss them in later posts in this series. For more information about the CTAS statement, see the Microsoft Docs.
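As a rough sketch of that drop-and-rename pattern (the table and column names below are hypothetical, not taken from the post), the "merge" could look something like this:

-- Build a new table containing the merged result of staging and target (hypothetical names).
CREATE TABLE [dbo].[DimCustomer_New]
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([CustomerId])
)
AS
SELECT [CustomerId], [CustomerName]
FROM   [Stage].[DimCustomer]                      -- new and changed rows from staging
UNION ALL
SELECT t.[CustomerId], t.[CustomerName]
FROM   [dbo].[DimCustomer] AS t                   -- existing rows not being replaced
WHERE  NOT EXISTS (SELECT 1 FROM [Stage].[DimCustomer] AS s
                   WHERE s.[CustomerId] = t.[CustomerId]);

-- Swap the new table in: drop the old one and rename the freshly created table in its place.
DROP TABLE [dbo].[DimCustomer];
RENAME OBJECT [dbo].[DimCustomer_New] TO [DimCustomer];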

Connecting Power BI to Hive

On a recent project I was tasked with importing data into Power BI from a Hive table. For those of you who are new to Azure or Big Data, Hive is a data warehousing infrastructure for Hadoop which sits in the HDInsight stack on Azure. The primary purpose of Hive is to provide data summarisation, query and analysis for big data sets. In this blog I'm going to take you through the steps, and note any gotchas, so that you can connect to Hive using Power BI.

Connecting to Hive
As Hive is part of the Azure HDInsight stack, it would be tempting to select the HDInsight or Hadoop connector when you're getting data. However, note the HDFS in brackets beside the Azure HDInsight and Hadoop File options: this means that you'll be connecting to the underlying data store, which can be Azure Data Lake Store or Azure Blob Storage, both of which use HDFS architectures. But this doesn't help when you want to access a Hive table. In order to access a Hive table you will first of all need to install the Hive ODBC driver from Microsoft. Once you've downloaded and installed the driver, you'll be able to make your connection to Hive using the ODBC connector in Power BI.

You will need to input a connection string to connect, even though it is marked as optional. The format of the connection string is as follows:

Driver={Microsoft Hive ODBC Driver};Host=hdinsightclustername.azurehdinsight.net;Port=443;Schema=default;RowsFetchedPerBlock=10000;HiveServerType=2;AuthMech=6;DefaultStringColumnLength=200;

On the next screen you'll be asked to enter a username and password. The credentials used here are not what you use to access Azure, but the credentials you created when you set up the HDInsight cluster and use to log in to the cluster. Click connect and you'll be able to pull through the tables you need into Power BI. Or, if you want to be selective in what is returned, you can write a HiveQL query in the ODBC dialog. It's also worth noting that at the moment it's only possible to do an import of Hive data in Power BI and not perform DirectQuery, so if your data set is huge you'll want to summarise the data or be really selective in what is returned first.
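As a rough illustration of that last point, a HiveQL query entered in the ODBC dialog might look like the sketch below. This assumes the hivesampletable sample table that ships with HDInsight clusters; substitute your own table and columns.

-- Return a small aggregate rather than importing the whole table into Power BI.
SELECT devicemake, COUNT(*) AS device_count
FROM   hivesampletable
GROUP  BY devicemake
LIMIT  10;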

Azure Data Lake Store : Authorization Failed - The user does not have the permission to perform this operation

Just a quick one, as it took slightly longer to solve than it possibly should have done, so this might help others with the same problem. I was working fine with an Azure Data Lake Store for a few weeks until yesterday, when I couldn't access the files anymore. The Azure Portal reported the following:

Authorization Failed - The user does not have the permission to perform this operation.

It turns out I was accessing it from a new location and the ADLS firewall needed updating to whitelist my IP address. After adding my IP, the Data Lake Store is now accessible again.

Data Flow: Job Execution in the Azure Data Lake

In this blog we will deep dive into the job execution of Azure Data Lake Analytics (ADLA). If you've used ADLA you will have come across the graphical representation of the job execution. But what is it doing at each stage?

Preparing
The job execution begins with the authoring of the U-SQL script. The U-SQL script itself is the logical plan of how you intend to transform input data into output data. The script gets compiled, which will translate your U-SQL script primarily into C#, as well as a few other items such as XML, which will contain metadata information and the job graph. Once compiled it will create an initial plan, which will then be optimised, thus producing an optimised plan. If it fails at this stage it will mostly be down to your U-SQL script failing to compile successfully. Fortunately, the error codes at this stage are fairly helpful and will indicate the line in question.

Queued
Once compiled and optimised, the job progresses to the queued stage. There are a few things which can cause a job to queue:

Running jobs - the default setting per subscription is 3, so if you or someone else in your subscription is already running a series of jobs, your job will be queued until at least one of them completes.
Higher priority jobs - as the queue is ordered by job priority, with lower numbers having a higher priority, other jobs can jump to the top of the queue, thus forcing your job to queue for longer.
Lack of resources - even if there are no running jobs and there are no jobs ahead of yours in the queue, your job will continue to queue if there are not enough Azure Data Lake Analytics Units (ADLAUs) to start the job.

Running
Once the job has finished queuing it will run. The Job Manager allocates vertexes, which are collections of data to be processed, to the ADLAUs and uses YARN to orchestrate it. The vertexes will then execute. At this stage, if there is a vertex failure you can download the vertex locally and debug in Visual Studio. When the vertexes have successfully completed, you will be able to consume the end product - either locally, if you have run the job locally, or in the Azure Data Lake Store.

Conclusion
The above process is what you see visualised in the job graph. Knowing how ADLA executes your job is the first of many steps to being able to write performant U-SQL and debug effectively. Be on the lookout for more blogs on the technical deep dive into ADLA!

Object detection with Data Lake Analytics

In this blog I'm going to show one of the advantages of linking Data Lake Analytics with Machine Learning. We'll be uploading a series of images to the Data Lake, and we will then run a U-SQL script that will detect objects in the images and create relative tags in a text file.

First of all you need an instance of Data Lake Store and one of Data Lake Analytics. Once these are up and running, we need to enable Python/R/Cognitive in the Data Lake Analytics instance (here is a blog to help you out on this). First things first, we need to put our images in the Data Lake Store; following Azure Data Lake best practices I put the images in my Laboratory subfolder. Once our images are in place we need to create a script: in your Data Lake Analytics instance, click on New Job. This will open a new blade with an empty script; let's give our new job a name, "ImageTagging". In order to use image tagging we need to import the relevant assemblies:

REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageTagging;

Next we need to extract information (location, filename etc.) on the image file(s) we want to analyse; in this case we'll process all images in the specified folder.

@images =
    EXTRACT FileName string,
            ImgData byte[]
    FROM @"/Laboratory/Desks/CSbrescia/ImageTagging/{FileName:*}.jpg"
    USING new Cognition.Vision.ImageExtractor();

The following step is where the magic happens: the script analyses all the images located in the folder indicated before, detects all objects present in each image and creates tags. The resulting rowset contains:

The image name
The number of tagged objects detected
A string with all the tags

@TaggedObjects =
    PROCESS @images
    PRODUCE FileName,
            NumObjects int,
            Tags string
    READONLY FileName
    USING new Cognition.Vision.ImageTagger();

Now we can write the rowset with all the tags to an output file:

OUTPUT @TaggedObjects
TO "/Laboratory/Desks/CSbrescia/ImageTagging/ImageTags.tsv"
USING Outputters.Tsv();

In conclusion, we have created a pretty handy tool for automatic image tagging using Data Lake, with very little knowledge required of the background processes involved. Note that there seems to be an image size limit; I had to resize all images to about 500 KB.

A Guide to Azure SQL DataWarehouse

So you've heard the hype - Azure SQL DW is going to solve all of your problems in one fell swoop… Right? Well… maybe. The system itself is a mix of technologies designed for low-concurrency analytics across huge amounts of relational data. In short, it's a cloud-scalable, T-SQL-based MPP platform, with all the benefits and restrictions that performing everything in parallel brings. If your problem can be solved by performing lots of calculations over small subsets of your data before aggregating the results into a whole, this is the technology for you.

However, before you jump right in, be aware that SQLDW is a very different beast to other SQL tools. There are specific concepts you need to be familiar with before building your system, otherwise you're not going to see the promised performance gains and will likely lose faith very quickly! I'm in the process of documenting these concepts, plus there is a wealth of information available on the Azure SQLDW site. For the next few months, I'll be running through the blog topics below and updating the links accordingly. If there are topics you'd like me to add to the list, please get in touch!

Azure SQLDW Core Concepts: What is it? - How Does Scaling Work? - Distributions - Polybase - Polybase Limitations - Polybase Design Patterns - CTAS - Resource Classes - Partitioning

Designing ETL (or ELT) In Azure SQLDW: Row counts - Statistics - Surrogate Keys

Performance Tuning Azure SQLDW: Plan Exploration in SQLDW - Data Movement Types - Minimising Data Movement

Managing Azure SQLDW: Backup & Restore - Monitoring Distributions - System Monitoring - Job Orchestration - Scaling and Management - Performance Tuning Queries

Azure SQLDW Architecture: Presentation Layers - Data Lake Integrations

Testing the Waters: An Overview of Data Science using Azure Data Lakes

Data Science can fit seamlessly within the ecosystem of the data lake, whether this is through HDInsight or the extensibility of Azure Data Lake Analytics and U-SQL. This blog will give a brief overview of what Data Science is; how to link Data Science toolkits to the Azure Data Lake; and best practices for managing the data output from experiments.

Data Science
Data Science is the relatively new kid on the block. One way to consider data science is as an evolutionary step in interdisciplinary fields like business analysis that incorporate computer science, modelling, statistics, analytics, and mathematics. At its core, data science involves using automated methods to analyse massive amounts of data and to extract insight from them. Data science is helping to create new branches of science, and influencing areas of social science and the humanities. The trend is expected to accelerate in the coming years as data from mobile sensors, sophisticated instruments, the web, and more, grows.

Data Science In The Data Lake
The nature of the Azure Data Lake Store lends itself to Data Science, in that it can hold any data which the data scientist will want to access, transform and analyse. HDInsight contains many implementations for data science, such as Spark, R Server and others. Hooking HDInsight up to Azure Data Lake Store is pretty simple and follows these steps:

In the Azure Portal Marketplace, select HDInsight, which will bring up a series of blades. In this blog, I will be using Spark as my cluster type on HDInsight.
In the storage settings you can then link your HDInsight cluster to Azure Data Lake Store.
Confirm your configuration on the next blade and wait around 20 minutes for your cluster to deploy, and you're good to go!

With Azure Data Lake Analytics, you incorporate data science by extending the capabilities of U-SQL, and you do this by installing a series of files. Open up your Azure Data Lake Analytics account and click on Sample Scripts at the top. From there you'll want to click on the U-SQL Advanced Analytics tab, which will copy about 1.5GB of files to the default Azure Data Lake Store associated with your ADLA account. This will take about 3 minutes to complete. When it's finished copying the files it will then call a job to register the extension, which can be found in the assemblies folder of the master database. More resources about the extensibility of U-SQL can be found here:

https://blogs.msdn.microsoft.com/azuredatalake/2017/03/10/using-custom-python-libraries-with-u-sql/
https://blogs.msdn.microsoft.com/azuredatalake/2016/11/22/u-sql-advanced-analytics-introducing-python-extensions-for-u-sql/

The Laboratory
Within Azure Data Lake Store, folder and file management is incredibly important for a well-running data lake. See my blogs on Storage and Best Practices and Shaping The Lake for more information on how to set up your Azure Data Lake Store. The Laboratory is an area to be used exclusively by data scientists. It's an area where they can persist the results of experiments and data sets without impacting the day-to-day operations within the data lake or other data scientists. The Laboratory is organised into two broad areas: Desks and Exhibits. Desks contain personal workspaces, the contents of which can be organised however the person wishes; it can be as well organised, or disorganised, as the person themselves. The Exhibit contains data sources produced in the Laboratory which are ready to be consumed by other users or systems.

As always, if you have any feedback or comments do let me know!

Azure Data Lake Store–Storage and Best Practices

The Azure Data Lake Store is an integral component for creating a data lake in Azure, as it is where data is physically stored in many implementations of a data lake. Under the hood, the Azure Data Lake Store is the web implementation of the Hadoop Distributed File System (HDFS), meaning that files are split up and distributed across an array of cheap storage. This blog will go into the physical storage of files in the Azure Data Lake Store and then best practices, which will utilise the framework.

Azure Data Lake Store File Storage
As mentioned, the Azure Data Lake Store is the web implementation of HDFS. Each file you place into the store is split into 250MB chunks called extents. This enables parallel read and write. For availability and reliability, extents are replicated into three copies. As files are split into extents, bigger files have more opportunities for parallelism than smaller files. If you have a file smaller than 250MB, it is going to be allocated to one extent and one vertex (which is the workload presented to Azure Data Lake Analytics), whereas a larger file will be split up across many extents and can be accessed by many vertexes.

The format of the file has a huge implication for the storage and parallelisation. Splittable formats - files which are row oriented, such as CSV - are parallelisable, as data does not span extents. Non-splittable formats, however - files which are not row oriented and where data is often delivered in blocks, such as XML or JSON - cannot be parallelised, as data spans extents and can only be processed by a single vertex.

In addition to the storage of unstructured data, Azure Data Lake Store also stores structured data in the form of row-oriented, distributed clustered index storage, which can also be partitioned. The data itself is held within the "Catalog" folder of the data lake store, but the metadata is contained in the data lake analytics. For many, working with the structured data in the data lake is very similar to working with SQL databases.

Azure Data Lake Store Best Practices
The best practices generally involve the framework as outlined in the following blog: http://blogs.adatis.co.uk/ustoldfield/post/Shaping-The-Lake-Data-Lake-Framework. The framework allows you to manage and maintain your data lake. So, when setting up your Azure Data Lake Store, you will want to initially create the following folders in your root.

Raw is where data is landed directly from source, and the underlying structure is organised ultimately by source. It is categorised by Source Type, which reflects the ultimate source of the data and the level of trust one should associate with it. Within the Source Type, data is further organised by Source System. Within the Source System, the folders are organised by Entity and, if possible, further partitioned using the standard Azure Data Factory partitioning pattern of Year > Month > Day etc., as this will allow you to achieve partition elimination using file sets.

The folder structure of Enriched and Curated is organised by Destination Data Model. Each Destination Data Model folder is structured by Destination Entity. Enriched or Curated can either be in the folder structure and/or within the database.

Azure Data Factory, using the Copy Data task to migrate data from on premise SQL Server to Blob storage.

This is the third blog in an ongoing series on Azure Data Factory. I recommend you start with the following blogs before continuing:

Introduction to Azure Data Factory
Setting up your first Azure Data Factory

In the first blog in this series I talked about working through a use case. Over the next 3 blogs we will look at 3 different methods for migrating data to Azure Blob storage:

Using the Azure Data Factory Copy Data wizard (one table)
Using BIML and SSIS (entire database - SSIS)
Using Azure Data Factory and PowerShell (entire database - ADF)

The reason I have included the latter 2 versions is because, if you just want to load an entire database into blob storage, it can be quicker to use one of these methods as a one-off or on a scheduled basis. Hand-writing all the JSON required to move each table from on premise to blob storage is very time consuming. Depending on whether you need to do a one-off upload or something on a schedule, options 2 and 3 might help.

Our original use case from An Introduction to Azure Data Factory: let's imagine a process where you have an on premise SQL Server box and you want to move multiple tables to blob storage; from there you then want to issue a stored procedure which will consume that data into an Azure SQL Data Warehouse via PolyBase - as illustrated in the image below.

Linked services:
On-premise SQL database
Azure Blob storage
Azure SQL Data Warehouse

Datasets:
Table in on-premise SQL database
The blob container
The stored procedure

Pipelines:
Pipeline to move data from SQL database to blob storage
Pipeline to issue stored procedure

In this blog we will tackle the first part: copying data. We are going to start looking in a bit more detail at the Azure Data Factory (ADF) Copy Data task (CD). CD is still in preview (at the time of writing [01/2017]). Prior to the inclusion of the Copy Data wizard, you had to manually configure ADF artefacts and write most of the JSON for linked services, datasets and pipelines by hand. The Copy Data task is a wizard for generating a data movement activity pipeline, complete with datasets and linked services.

To get started, connect to Azure and navigate to your existing Azure Data Factory (if you do not have an existing ADF you can follow how to create one here: http://blogs.adatis.co.uk/Terry%20McCann/post/Setting-up-your-first-Azure-Data-Factory). To begin setting up a copy data pipeline, click on the "Copy data (PREVIEW)" option in the ADF panel in Azure. Once you have selected "Copy data" you will be taken to the new ADF interface. Datafactory.azure.com enables the creation and monitoring of ADF pipelines.

The general process for creating an ADF pipeline (prior to the Copy Data task) was:

Create an ADF
Create a linked service/s
Create a gateway as needed
Create your input and output datasets
Create a pipeline
Monitor the pipeline

When using the ADF Copy Data task the process is slightly flipped:

Create an ADF
Configure the properties for the pipeline
Create a gateway
Configure the linked service/s
Configure the datasets
Deploy all configurations
Monitor the pipeline

The main difference here is that you do not deploy anything until it has all been configured - an advantage, given how cumbersome it is to do all this manually. At present, the Copy Data task is very narrow in its functionality. If your intention is to build a more advanced pipeline, you will either need to generate a move task and tweak it, or create it all manually.
Copy Data has many shortcomings; for our example the most prevalent is that a movement to blob storage only has the option to sink data to a single folder and not multiple folders. Option 3 in our list of methods for migration aims to get around this limitation using PowerShell.

Configure pipeline properties:
Once you have selected "Copy data" you will be launched into datafactory.azure.com, the new, fresher-looking environment. The Copy Data task is a 4 step process which will guide you through the creation of a data movement pipeline (I want to highlight that this is only used for data movement and not transformation). This is a great way to get started with ADF without having to understand the JSON or trickier elements such as data slices and scheduling, although we will touch on scheduling as it is quite tricky.

(image 1 - Configure Properties)

The first screen you will see shows the properties of the pipeline you're creating. It is here you will configure the frequency and schedule of the pipeline. A pipeline is a group of logically related activities.

Task name - this is important and will be used as a prefix for the names of datasets and data stores.
Task description -
Task schedule - see below for a more in-depth analysis.
Start time - this date is in UTC.
End time - this date is also in UTC.

For quick conversions to your time zone, I recommend World Time Buddy (http://www.worldtimebuddy.com/).

More on schedules:
The Microsoft page about scheduling is incredibly deep and takes a lot of effort to fully digest and understand. I will attempt to distil my understanding of pipeline scheduling into a brief list of key points. You can read more here: https://docs.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution - I would recommend that you do read this page as there are a lot of good examples. The key points from this document are:

A schedule can be one-off or between dates.
It can run on a scheduled basis: minute, hourly, daily or weekly. Every 15 minutes is the minimum.
This forms what is known as a tumbling window. Microsoft defines a tumbling window as "a series of fixed-size, non-overlapping, contiguous time intervals". Tumbling windows are also known as activity windows.
A pipeline schedule's interval needs to be the same as a dataset's availability - however, it does not need to run at the same time.

For our example we will use a frequency of "Daily" and an interval of "1", which will run our pipeline every day. To write this in JSON and not through the wizard, you would use the following as part of your pipeline:

"scheduler": {"frequency": "Daily","interval": 1}

To create a pipeline which will run indefinitely, you can set the end date time to "12/31/2099 12:00am"; while this is not infinite, the date will outlive ADF. Start date time will default to the time you have created the pipeline (n.b. these dates are expressed in US format MM/DD/YYYY).

Creating linked services (source data store):
The next screen is the configuration of the linked sources. ADF is still new and the terminology is somewhat confusing. Depending on the tool you're using, and sometimes the screen you're looking at, ADF will mix up the names for key parts; anywhere you see the term "data store", assume it is referring to a linked service. For our example we will use the SQL Server option (bottom right of the above image).

(Image - SQL Server linked service configuration screen)

You should now be able to configure the connection details (via a gateway) to your SQL Server database.
Connection Name - you can refer to the Adatis naming standard as a reference: LS - linked service, MSQL - Microsoft SQL Server, Person - the table being exported.
Gateway - select an existing gateway or create a new one (see below).
Server name - for my example I am using my local server with the default instance "." ("local" would also work). If you're connecting to a named instance this will need to be server\InstanceName.
Database Name - which database you want to connect to.
Credential encryption - you have the option to save a credential in Azure or use authentication through the browser. For simplicity I am using the latter. For production, use the former.
Authentication type - how to connect to SQL Server: Windows or SQL login.
User name
Password

Creating and configuring a gateway:
In our example we will be connecting to a local version of SQL Server; to connect and read data we will need to create an ADF gateway connection and also install the gateway on the server which hosts our database (or at least has a connection to that database). You have a few options to create the gateway, but before you can configure any of these you will need to download and install the gateway. You can find the latest version of the gateway here: https://www.microsoft.com/en-gb/download/details.aspx?id=39717. Once installed, the gateway will be waiting for an access key.

(image - Microsoft gateway - Awaiting gateway key)

We have 3 options to create an ADF gateway and obtain the key the gateway is expecting:

Copy Data configuration page (click create gateway) - this will build the gateway and add the name to your pipeline. You will need to take the access key it generates and add that to the gateway you installed.
Add via Author and Deploy - navigate to Author and Deploy on the main panel of ADF in Azure, click on "...More" and select New data gateway, configure and deploy. This will return a key. Add the key to the gateway on your server.
Via PowerShell - open a new PowerShell prompt and connect to Azure (Login-AzureRmAccount), replace the following with your Azure details - $ResourceGroup, $DataFactoryName and $Gateway - and run:

New-AzureRmDataFactoryGateway -ResourceGroupName $ResourceGroup -Name $Gateway -DataFactoryName $DataFactoryName -Description $Gateway

This will return a key. Add the key to the gateway on your server.

(image - A registered gateway)
(Image - main screen on a registered gateway)

Configuring the datasets:
Select next to choose which table/s you wish to move.

(Image - ADF Copy - Select tables)

You can select one or more tables here. For our example we will be consuming the data using PolyBase. We want our data to sink to its own container in Azure; as such we cannot move multiple tables at once (at the time of writing this is limited to one sink container).

(Image - ADF data filter)

You will next be asked how you want to filter the data. Each time our pipeline runs we are looking to move the whole table. If we were looking to do incremental loads, we could select a column which indicates which rows to import each hour. For our example select Filter: None, then Next.

(image - ADF destination Source)

Configuring the destination data store:
On the next screen you will see the list of available sinks (where you can insert data). You will notice the list of sinks is far smaller than the list of sources - at present not all sources can be sinks.
For our example select Azure Blob storage.

Connection Name - based on the Adatis ADF naming standards (http://blogs.adatis.co.uk/Terry%20McCann/post/Azure-Data-Factory-Suggested-naming-conventions-and-best-practices): LS_ - linked service, ABLB_ - blob storage, Person - the blob container data will sink to.
Account selection method - Manual/Azure list.
Azure subscription - select your subscription.
Storage account name - select your storage account.

(Image - Selecting a blob container)

Select a folder for the file to sink to. I created a folder ahead of time called person.

(Image - ADF file format configuration screen)

Customise your output settings. For now we will just select the defaults to create a CSV. Select finish to build your pipeline.

(image - ADF deployment completed screen)

As long as everything has worked you will see the screen above. Congratulations, your pipeline has been deployed. To monitor the pipeline and see what has been created, select the link "click here to monitor your pipeline". You will be taken to a different screen in the ADF portal. We will have more on how to monitor ADF shortly.

(Image - ADF pipeline in datafactory.azure.com)

You can check the data has moved successfully using Azure Storage Explorer. ASE is a great utility for browsing files in blob storage. You can download ASE here: http://storageexplorer.com/

(image - Storage explorer)

I can see that my file is there and is populated as intended. Once a further 24 hours has passed this file will be overwritten. So we have seen what we can do with the Copy Data task in Azure. While it is fantastic at basic data movement functions, Copy Data does not offer much beyond that. I have listed the following pains and shortfalls which exist in ADF Copy Data at present.

Limitations of the Copy Data wizard:
There are quite a few limitations; some of these are nice-to-haves, others are show stoppers.

The CD action is limited to only a subset of the pipeline activities. As the name suggests, you can only copy or move data; there is no transformation wizard.
The menus are very temperamental and regularly do not work.
You cannot name a dataset - InputDataset-8tl was created in my example, which is not helpful. The name of the pipeline is also not helpful.
You cannot chain multiple activities together; each pipeline needs to be created separately.
You can only sink datasets to one blob container.

Now that we have our data in blob storage we can begin to look at the rest of our solution, where we will create an Azure SQL Data Warehouse with external PolyBase tables. We will use stored procedures to persist the external tables into ASDW. A rough sketch of what that will look like follows the links below. In the next blog we will look at moving an entire database to Azure blob storage using SSIS and BIML.

Links
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-scheduling-and-execution
http://blogs.adatis.co.uk/Terry%20McCann/post/Azure-Data-Factory-Suggested-naming-conventions-and-best-practices
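To give a flavour of the PolyBase consumption step mentioned above, here is a sketch only: the storage account, credential, columns and object names are hypothetical, and the full walkthrough belongs to a later post in this series.

-- Assumes a database master key already exists in the Azure SQL Data Warehouse.
-- Credential, data source and file format over the blob container the pipeline sinks to.
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';

CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH ( TYPE = HADOOP,
       LOCATION = 'wasbs://person@storageaccountname.blob.core.windows.net',
       CREDENTIAL = BlobStorageCredential );

CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = ',' ) );   -- header handling and types are illustrative

-- External table over the CSV file(s) produced by the Copy Data pipeline.
CREATE EXTERNAL TABLE dbo.PersonExternal
(
    PersonId  INT,
    FirstName VARCHAR(50),
    LastName  VARCHAR(50)
)
WITH ( LOCATION = '/', DATA_SOURCE = AzureBlobStore, FILE_FORMAT = CsvFileFormat );

-- Persist into the warehouse with a CTAS (typically wrapped in a stored procedure).
CREATE TABLE dbo.Person
WITH ( HEAP, DISTRIBUTION = ROUND_ROBIN )
AS
SELECT * FROM dbo.PersonExternal;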

My Experience of the Microsoft Professional Program for Data Science

(Image 1 - Microsoft 2017 - https://academy.microsoft.com/en-us/professional-program/data-science)

In 2016 I was talking to Andrew Fryer (@DeepFat), Microsoft technical evangelist (after he attended Dundee University to present about Azure Machine Learning), about how Microsoft were piloting a degree course in data science. My interest was immediately piqued. Shortly after this, hints began to appear and the Edx page went live. Shortly after the Edx page went live, the degree was rebranded as the "Professional Program". I registered to be part of the pilot; however, I was not accepted until the course went live in September 2016.

Prior to 2016 my background was in BI, predominantly in Microsoft Kimball data warehousing using SQL Server. At the end of 2015 I enrolled on a Master's Degree in Data Science through the University of Dundee. I did this with the intention of getting exposure to tools I had an interest in but little commercial experience with (R, machine learning and statistics). This course is ongoing and will finish in 2018 - I highly recommend it! I would argue that it is the best Data Science Master's degree course in the UK. So going in to the MPP I had a decent idea of what to expect, plus a lot of SQL experience, R and Power BI. Beyond that I had attended a few sessions at various conferences on Azure ML. When the syllabus for the MPP came out, it directly complemented my studies.

Link to program - https://academy.microsoft.com/en-us/professional-program/data-science
Link to Dundee Masters - https://www.dundee.ac.uk/study/pg/data-science/

Structure of the program
The program is divided up into 9 modules and a final project. All modules need to be completed, but there are different options you can take - you can customise the course to suit your interests. You can choose to pay for the course (which you will need to do if you intend to work towards the certification) or audit the course for free. I will indicate which modules I took and why. Most modules recommend at least 6 weeks part-time to complete. I started the first module in the middle of September 2016 and completed the final project in the middle of January 2017 - so the 6 week estimate is quite high, especially if you already have a decent base knowledge of the concepts.

You can, if you wish, complete multiple modules at once. I am not sure I recommend this approach as, to get the most out of the course, you should read around the subject as well as watching the videos. Each module has a start date and an end date that you need to complete it between; if you do not, you will need to do it all again. You can start a module in one period and wait until the next for another module - you do not need to complete them all in 3 months. If you pay for the module but do not request your certificate before the course closes, you will need to take it again (top tip: as soon as you're happy with your score, request your certificate).
Module list (Module - Course option(s) - Time taken):
- Data Science Orientation - Data Science Orientation - 2-3 days
- Query Relational Data - Querying Data with Transact-SQL - 1 day (exam only)
- Analyze and Visualize Data - Analyzing and Visualizing Data with Excel, or Analyzing and Visualizing Data with Power BI - 2-4 days
- Understand Statistics - Statistical Thinking for Data Science and Analytics - 7-9 days
- Explore Data with Code - Introduction to R for Data Science, or Introduction to Python for Data Science - 7-9 days
- Understand Core Data Science Concepts - Data Science Essentials - 7-9 days
- Understand Machine Learning - Principles of Machine Learning - 2 weeks
- Use Code to Manipulate and Model Data - Programming with R for Data Science, or Programming with Python for Data Science - R: 2-3 days; Python: 3 weeks
- Develop Intelligent Solutions - Applied Machine Learning, Implementing Predictive Solutions with Spark in HDInsight, or Developing Intelligent Applications - 2 weeks
- Final Project - Data Science Challenge - 2 months*

The times taken are based on the time I had spare; I completed each module between projects, in the evenings and at the weekend. *The final project module can be completed in a few days; however, you need to wait until it has closed to get your grade.

Structure of the modules
Each module is online. You log on to the Edx website and watch videos by leading experts. Either at the end of a video, after reading some text, or at the end of a section of the module, you are given a multiple choice test. The multiple choice questions are graded and form part of your overall score. The other main assessment method is labs, where you are required to complete a series of tasks and enter the results. Unlike certifications, you get to see what your score is as you progress through the module. The multiple choice questions generally allow you two to three attempts at the answer; sometimes these are true/false with two attempts, which does undermine the integrity of the course.

There is normally a final section for which you're only given one chance to answer, and it holds a higher percentage towards your final mark. You need 70% to pass. Once you hit 70% you can claim your certificate - if you have chosen to pay for the module. Modules range from $20 to $100. For the most part I answered the questions fully and tried for the highest score possible. However, in all honesty, towards the end, once I hit around 80%, I started looking at a different module. If the module was really interesting I would persevere.

Modules
Data Science Orientation, Query Relational Data & Analyze and Visualize Data. These modules are very basic and really only skim the surface of the topics they describe. The first module is a gentle introduction to the main concepts you will learn throughout the program. The next module focuses on querying data with SQL. Regardless of your opinion of SQL, you must agree that SQL is the language of data. Having an understanding of the fundamentals of SQL is paramount, as almost every level of the Microsoft Data Science stack has integration with databases. If you're familiar with SQL (I already held an MCSE in SQL 2012) you can skip the main content of this module and just take the test at the end. For the next module you have an option of Excel or Power BI for visualisation. As I have experience with Power BI I opted for this module. Once again this is a very basic introduction to Power BI. It will get you familiar enough with the tool that you can do basic data exploration. Some parts of this course jarred with me.
Data visualisation is so important and a key skill for any data scientist. In the Power BI module one of the exercises was to create a 3D pie chart. Pie charts are not a good visualisation, as it is hard to differentiate between angles, and making it 3D only exacerbates the issue. I wish Microsoft had made reference to some of the great data viz experts when making this module - I cannot comment on the Excel version.

Understand Statistics. This module is different from its predecessors, in that it is not run by Microsoft. It is a MOOC from Columbia University, which you might have completed before. It covers a lot of the basic and more advanced stats that you need to know for data science, in particular a solid grounding in probability and probability theory. In BI you become familiar with descriptive stats and measures of variance; however, I had not done a great deal of stats beyond this. I had been researching statistical methods for the MSc, but I had not done any real stats since A-Level maths. This course was really interesting and I learnt a lot. I don't know if this is the best way to really learn stats, but it is a good primer for what you need to know. I found topping up my understanding with blogs, books and YouTube helped support this module.

Explore Data with Code. You have two options again for this module, R and Python. Which should you learn, I imagine you're asking? Well, the simple answer is both. Knowing either R or Python will get you so far; knowing both will make you a unicorn. Many ask why they should learn one language over the other - aside from the previous point: R is very easy to get into, it has a rich catalogue of libraries written by some of the smartest statistical minds, and it has a simple interface and is easy to install. Python is harder to learn in my opinion, as the language is massive! I found Python harder to work with, but it is much richer. I would recommend Python just for scikit-learn, the machine learning library. The Python module is extended to use Code Dojo (the great online tuition site). As you progress through the questions and examples, you have an IDE which will check your understanding and grade you as you go. I found this really helpful. This module is again a bit on the easier side. If you think the later Python module will be similar, you are in for a surprise! I did not take the R module as I was already using R in my day job.

Understand Core Data Science Concepts. Almost a redo of the first module and the Understand Statistics module. Not a lot to say here, but repetition helped me understand and remember the concepts. The more I had to think about the core concepts, the more they stuck. This module could have been removed with little to no impact on the course, but it helped solidify my knowledge.

Understand Machine Learning. As this is a Microsoft course, this module is all about Azure Machine Learning. If you have not used Azure ML before, it has a nice drag and drop interface which allows you to build quick, simple models and create a web API and key which you can then pass data to from any tool that supports REST. This module is half theory and half practical. There are a lot of labs, so you will need to take your time. If you skip ahead you will get the answers wrong and might not make it to 70%.

Use Code to Manipulate and Model Data. This section has two options, again R and Python. I knew quite a bit of R already, so I started with Python. I wanted to do them both to see how you can do machine learning in each.
I was expecting a continuation of the Code Dojo format from the previous module; this was far from the case. Each of the modules up until this point has worked with you to find the right answer. This module will equip you with the basics, but expects you to find the correct function and answer yourself. Believe me when I say it was hard (with little prior experience of Python). The course will lead you towards the right resources, but you need to read the documentation to answer the question. This was a great change of pace. Having to search for the answers made me absorb more than just the quizzes. This module was a struggle. Once I completed it I did the same for R. On a difficulty scale, if the Python module was 100, R was only at 20. The disparity in difficulty is massive and frankly unfair. I was able to complete the R module very quickly. I left feeling disappointed that it did not have the same complexity that the Python module did.

Develop Intelligent Solutions. For this section you can pick one of three modules: machine learning, Spark or microservices. I went with Spark. Why? Because I had already worked with Spark and Hadoop as part of the MSc at Dundee. I knew how it worked and what it did from an open source point of view, but not from a Microsoft HDInsight perspective. This module was tricky, but nothing compared to the Python module. I spent the best part of a week working on Spark, setting up HDInsight clusters and forgetting to tear them down (top tip! Don't leave an HDInsight cluster running - they are EXPENSIVE!). The last module is a machine learning project, so picking the "Applied Machine Learning" option might put you in a better place than your competition. I did not attempt either the Machine Learning or the Microservices modules.

Final Project. Here is where the fun begins. You're given a problem and a dataset. You need to clean, reduce, derive features from and process the dataset, then apply an ML technique to predict something - in my case, whether or not someone would default on a loan. You could use any technique you liked as long as the final result was in Azure ML. I was pretty happy with my model early on and made very few tweaks as the course progressed. Unlike the previous modules, where you can complete a module and get your score, your final score is only available once the module has ended. You will build an ML experiment and test against a private dataset. You can submit your experiment 3 times a day to be scored against the private data (a maximum of 100 attempts). This will give you an indication of your score, but this is not your score! Your score is calculated against a different dataset after the module has finished. Your top 5 scores will be used to test against the private, closed data. If you have over-fitted your model, you might have a shock (as many did on the forums) when your score is marked.

I completed all the modules at the start of January and waited until February to get my final score. My highest-scoring answer, when used against the closed private dataset, did not get over the required 70% to pass. This was surprising, but not all that unexpected: I had over-fitted the model. To counterbalance this, I had created 5 different experiments with 5 similar but different approaches. All scored similarly (~1-3% accuracy difference). This was enough to see me past the required 70% and to obtain the MPP in Data Science. The private dataset has been published now.
In the coming weeks I will blog about the steps I took to predict whether someone would default on their loan.

I have been asked at different stages of the course "would you recommend it?". It really depends on what you want out of the course! If you expect to be a data scientist after completing the MPP, then you might be in for a shock. To get the most out of the course you need to supplement it with wider reading and research. YouTube has many great videos and recorded lectures which will really help you process the content and see it taught from a different angle. If you're looking to get an understanding of the key techniques in data science (from a Microsoft point of view) then you should take this course. If you're doing a degree where you need to do research, many of the modules will really help and build upon what you already know.

I hope you have found this interesting and that it has helped you decide whether or not you want to invest the time and money (each module is not free). If you do decide to, and you persevere, you too will be the owner of the MPP in Data Science (as seen below).

Terry McCann - Adatis Data Science Consultant & Organiser of the Exeter Data Science User Group - you can find us on MeetUp.

What is Azure Data Factory?

(Image 1 - Azure Data Factory – Transformation Pipeline)

Overview:
Azure Data Factory (ADF) is a big data processing platform from Microsoft on the Azure platform. For database developers, the obvious comparison is with Microsoft's SQL Server Integration Services (SSIS). SSIS is an ETL tool (Extract, Transform, Load); ADF is not an ETL tool. ADF is more akin to ELT frameworks (Extract, Load, Transform) - while the terms are similar, the process is very different. For those who have worked in the data warehouse arena for a number of years, you will be very familiar with ETL, and switching to ELT can feel somewhat alien and cumbersome. As such, making obvious comparisons to SSIS will only hinder your understanding of the core concepts of ADF. ADF is not a cloud version of SSIS as many would have you believe; it is both more and less.

On the topic of ETL, Dan Linstedt (the father of Data Vault) published an interesting article, "ETL is Dead! Long Live ELT" (http://danlinstedt.com/allposts/datavaultcat/etl-is-dead-long-live-etl/). In this, Linstedt argues that ETL is now defunct for a number of reasons, the primary one being big data. To enable fast processing of big data, we have seen a shift to parallel processing with tools such as Hadoop's HDFS, Cassandra, Spark and, on the Microsoft side, Data Lake and Azure SQL Data Warehouse. By having our data spread across many nodes we can push the processing to the data via a MapReduce job. ADF is designed to be a big data processing pipeline, so it makes sense to process where the data lives rather than extracting all our data, transforming it and then loading it in one process. We instead ingest the data, load it in to a parallel storage engine (HDFS) and then transform. If this is the first time you're reading about ELT, I recommend both Linstedt's article, where Linstedt states "ETL truly is DEAD. Metadata & Lineage are NOT dead, ELT is the future", and also James Serra's blog "The difference between ETL and ELT" (http://www.jamesserra.com/archive/2012/01/difference-between-etl-and-elt/).

ADF is often described as an "orchestration" tool (https://azure.microsoft.com/en-gb/services/data-factory/) and this is because of its two functions. Being ELT and not ETL, the two main functions might not surprise you: they are moving data and transforming it. When I say transforming data you might assume ADF does the transformation; however, the only action ADF actually performs is to issue a command for a transformation to happen (an EXEC of a stored procedure or a MapReduce job, for example). It is this issuing of commands which makes the tool an orchestrator and not a multipurpose ETL tool such as SSIS. The data processing landscape is changing; while ADF is still a relatively new tool, a lot of development is ongoing and improvements are being made. ADF is available through the Azure Portal and via the ADF add-in for Visual Studio (https://azure.microsoft.com/en-us/blog/azure-data-factory-visual-studio-extension-for-authoring-pipelines/).

Now we know what ADF does, let's look at how it does it. ADF has 3 main components: linked services, datasets and pipelines.

Core Concepts:
(Image 2 - High level ADF)
The diagram above shows how ADF all slots together. If we follow this, starting with a linked service, we can explore it further. A linked service is our connection to some form of resource. This could be an on-premise SQL database, a blob storage container, an Azure SQL Data Warehouse or something else.
It is here that we want something to happen. That could be extracting data, sinking data (an ADF term for storing data) or transforming data using stored procedures, MapReduce and so on. A dataset identifies which part of a linked service you want an activity to be performed on (a table, view or stored procedure). A dataset is not a database, but rather a table or a container in blob storage. A linked service can have one or many datasets. An activity is performed by a pipeline, which will perform an action/actions on a dataset/datasets.

Linked services
A linked service is best thought of as a logical data connection, allowing ADF to connect to an external resource. A linked service acts as a representation of a data store or a compute resource. A data store is used as part of a data movement activity; a compute resource is executed as part of a data transformation exercise. Click here for more information on the creation and different sources you can use for linked services.

Datasets
A dataset represents a data structure in ADF, each belonging to a linked service. A linked service might be an Azure Blob storage account or an Azure SQL database; the dataset is a representation of the data - a folder of files in blob storage or a table in an Azure database, not the database itself. For a full list of sources and sinks, see the list of links below.

Pipelines
Pipelines provide the power of ADF. There are 2 types of pipeline, which support our notion of ELT.

Data movement pipeline - The data movement pipeline does what its name implies: it moves data from one dataset to another. In the Azure documentation these two datasets are referred to as the source and the sink. In a data movement activity, data can be moved from any source to any sink; for example, you can move data from Azure Blob storage into a SQL database, or from HDFS to an Azure SQL Data Warehouse. This part handles the E and the L of our ETL: data is extracted and loaded. The only difference to SSIS is that at this point there is no T - no transformation has happened yet.

Data transformation pipeline - The data transformation activities again do as the name implies: transform. As ADF is an orchestration tool, a transformation could be a MapReduce job, a streaming job, a stored procedure or a Data Lake U-SQL query. To fully use ADF you will need to know how you're storing data (HDFS/Data Lake) and then what language you want to use to process the data. Knowing how ADF functions is only half the battle.

In relationship terms, you can think of the interconnection of linked services, pipelines and datasets as follows: a linked service has one or more datasets (a SQL database with many tables), and a pipeline performs an action on one or more datasets.

An Example:
Let's imagine a process where you have an on-premise SQL Server box and you want to move multiple tables to blob storage, from where you then want to issue a stored procedure which will consume that data into an Azure SQL Data Warehouse via PolyBase - as illustrated in the image below.
(Image 3 - Example Process)
To complete this we would need to create the following artefacts in ADF:
Linked services:
- On-premise SQL database
- Azure Blob storage
- Azure SQL Data Warehouse
Datasets:
- The table in the on-premise SQL database
- The blob container
- The stored procedure
Pipelines:
- A pipeline to move data from the SQL database to blob storage
- A pipeline to issue the stored procedure
In later blogs we will look at creating this example in full; however, if you want to get started now, the links at the bottom of the page will guide you through setting up ADF.

How to get started:
There are 3 ways to develop for ADF.
1. Azure portal - a semi-rich IDE, all in the Azure portal.
2. Visual Studio with the ADF add-in - a richer development environment. It allows for source control of code through TFS and automated deployment of changes. This is valuable because the Azure portal can sometimes get caught in a syntax loop where nothing will deploy; VS will remove and redeploy artefacts, which makes this a lot easier. You will sometimes have to delete your ADF and start again - if your code is in VS, all you need to do is redeploy.
3. PowerShell - anything you can do in the portal you can do in PowerShell.
My development preference is options 2 and 3. I use Visual Studio for all my code, which is source controlled with TFS, then I use PowerShell to build my ADF environment (a minimal sketch of this follows after the links below).

Thanks for reading this introduction to ADF. I hope you have found it somewhat insightful. We have a whole series of blogs looking in depth at different aspects and design patterns for loading a data warehouse with Azure Data Factory. You can read more in our next blog looking at setting up your first Azure Data Factory.

Links:
Learning path: https://azure.microsoft.com/en-gb/documentation/learning-paths/data-factory/
Documentation: https://docs.microsoft.com/en-us/azure/data-factory/
Introduction to ADF [YouTube]: https://www.youtube.com/watch?v=_hOYAT69yIw
ADF overview [Channel9]: https://channel9.msdn.com/Blogs/Azure/Azure-Data-Factory-Overview
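To give a flavour of the PowerShell route, here is a minimal sketch using the AzureRM.DataFactories cmdlets. It assumes you have already authored JSON definition files for the linked services, datasets and pipeline (for example in Visual Studio); the resource group, data factory name and file names below are hypothetical placeholders rather than the artefacts from the example above.

# Log in and create (or reuse) a data factory - all names are hypothetical placeholders
Login-AzureRmAccount
New-AzureRmDataFactory -ResourceGroupName "resourcegroupname" -Name "adatis-adf" -Location "North Europe"

# Deploy each artefact from its JSON definition file
New-AzureRmDataFactoryLinkedService -ResourceGroupName "resourcegroupname" -DataFactoryName "adatis-adf" -File ".\LS_ABLB_Person.json"
New-AzureRmDataFactoryDataset -ResourceGroupName "resourcegroupname" -DataFactoryName "adatis-adf" -File ".\DS_ABLB_Person.json"
New-AzureRmDataFactoryPipeline -ResourceGroupName "resourcegroupname" -DataFactoryName "adatis-adf" -File ".\PL_CopyPersonToBlob.json"

Because the whole deployment is driven from files on disk, the same script can be re-run against a fresh data factory whenever the portal gets itself into a state, which is exactly the benefit described in option 2 and 3 above.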

Introduction to Azure Key Vault for SQL Server

What is it
Azure Key Vault is a feature available in Microsoft Azure which can be used to manage and store keys that are used to encrypt your data. A big benefit of using Azure Key Vault is that the process of managing and maintaining your keys is completely streamlined: keys can be created very quickly for development and testing purposes and can then be seamlessly migrated to production keys, where permissions can be granted or revoked as necessary. A variety of keys and secrets can be held in Azure Key Vault, including authentication keys, storage account keys, data encryption keys, .PFX files and passwords. Stored keys are protected by Hardware Security Modules (HSMs); keys can be imported into or generated in these HSMs, which are FIPS 140-2 Level 2 validated.

Uses and limitations for SQL Server
Key Vaults can be created and used by anyone with an Azure subscription. They can be useful to Azure developers and security admins, but also to administrators who manage other Azure services for an organisation, who can then be responsible for the management and maintenance of keys or secrets and can provide users with URIs which can be applied directly to their applications.

Azure Key Vault can be integrated with SQL Server as an Extensible Key Management (EKM) provider to protect SQL Server encryption keys. This is particularly useful when using Always Encrypted, which is available with SQL Server 2016: with Always Encrypted, SQL Server does not hold the keys used to decrypt the data stored in encrypted columns, making Azure Key Vault a perfect fit as a centralised key store for this functionality. SQL Server has a variety of encryption methods, including Transparent Data Encryption (TDE), Column Level Encryption (CLE) and Backup Encryption; these encryption methods implement a traditional key hierarchy whereby the data is encrypted using a symmetric data encryption key (DEK), which is further protected by encrypting it with a hierarchy of keys stored in SQL Server. By instead using the Azure Key Vault EKM provider architecture, SQL Server can protect the data encryption keys using an asymmetric key stored externally to SQL Server, which in turn adds an additional security layer and separates the management of keys from the management of data. This functionality can be adopted by both cloud-based SQL Server instances on Azure virtual machines and on-premises SQL Server instances.

In order to implement Azure Key Vault to protect your SQL Server encryption keys you will use the SQL Server Connector; this acts as a bridge between SQL Server and Azure Key Vault. The SQL Server Connector needs to be registered with the relevant SQL Server instance, which allows Azure Key Vault to be used as a cryptographic provider; next, the configuration and permissions are set up from within Azure and the appropriate credentials are created from within SQL Server. Finally, an asymmetric key is created in Azure Key Vault which can be used to protect database encryption keys on the SQL Server instance. The SQL Server Connector is available as a download from Microsoft here and requires Windows Server 2012 or Windows Server 2012 R2 as your operating system. It currently supports the Enterprise 64-bit versions of SQL Server 2016, 2014, 2012 SP2, 2012 SP1 CU6 and 2008 R2 CU8; for earlier versions of SQL Server 2008 and 2012 there is a patch available which is linked from the aforementioned Microsoft download page.
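As a rough illustration of the Azure-side configuration described above, here is a minimal PowerShell sketch using the AzureRM.KeyVault cmdlets; the vault name, key name, resource group and the Azure AD application placeholder are all hypothetical, and the full, tested walkthrough will follow in the later post mentioned under Further Reading.

# Create a Premium-tier vault (Premium is required for HSM-protected keys) - names are hypothetical
New-AzureRmKeyVault -VaultName "AdatisSqlVault" -ResourceGroupName "resourcegroupname" -Location "North Europe" -Sku "Premium"

# Grant the Azure AD application registered for the SQL Server Connector access to keys
# (the connector typically needs get, list, wrapKey and unwrapKey permissions)
Set-AzureRmKeyVaultAccessPolicy -VaultName "AdatisSqlVault" -ServicePrincipalName "<application id from Azure AD>" -PermissionsToKeys get,list,wrapKey,unwrapKey

# Create an HSM-protected asymmetric key for SQL Server to use via the EKM provider
Add-AzureKeyVaultKey -VaultName "AdatisSqlVault" -Name "SqlServerEkmKey" -Destination "HSM"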
For more information on the SQL Server Connector and other software requirements, please see the Details and System Requirements, again on the aforementioned Microsoft download page.

Potential limitations and issues to keep in mind include the following:
· Azure Key Vault, like most cloud services, is a paid-for service and, although there is no upfront cost or termination fee, there is a price for both key storage and operations. There are two available service tiers, but only Premium offers HSM-protected keys, which are used with SQL Server implementations. The prices as of November 2016 are:
     - £0.6109 per version of a key per month and £0.0183/10,000 operations for HSM-protected keys.
     - £0.0183/10,000 operations for secrets and software-protected keys.
     - £1.8327 per renewal request and £0.0183/10,000 operations for certificate operations.
· Authentication to Azure Key Vault requires Azure Active Directory; this is, however, included with Azure subscriptions.
· Key Vault permissions assign a principal access to all secrets and keys within a vault, which is something to keep in mind when assigning management and maintenance of the key vault. Best practice would be to have a separate Azure Key Vault for each application instance storing secrets or keys.
· Where you have applications using Azure Key Vaults split across separate regions, it is best practice to create separate key vaults in each of the relevant regions that will be used by those applications, for both performance and security considerations.
· There are transaction limits applied, which allow the following maximum number of transactions in 10 seconds, per vault, per region:
     - 5 for 'HSM - CREATE KEY' transaction types.
     - 1000 for 'HSM - other transaction' types.
     - 10 for 'Soft-key CREATE KEY' transaction types.
     - 1500 for 'Soft-key other transaction' types.
     - 2000 for 'All secrets, vault related transaction' types.

Further Reading
I will be posting a future blog showing an in-depth, real-life application of Azure Key Vault with a SQL Server instance using the SQL Server Connector and the steps taken to create it, including initial set-up, testing and deployment strategies. I will also discuss the permissions and configurations that can be used and the roles that can be assigned for the management and maintenance of the implementation. For more detailed information see the following links:
· Microsoft Azure – Key Vault
· Microsoft Documents – What is Azure Key Vault
· Microsoft – SQL Server Connector for Microsoft Key Vault

Shaping The Lake: Data Lake Framework

The Azure Data Lake has just gone into general availability, and the management of Azure Data Lake Store, in particular, can seem daunting, especially when dealing with big data. In this blog, I will take you through the risks and challenges of working with data lakes and big data, and then through a framework we've created to help manage those risks and challenges. If you need a refresher as to what a data lake is and how to create your first Azure Data Lake Store and your first Azure Data Lake Analytics job, please feel free to follow the links.

Risks and Challenges of Big Data and Data Lake
The challenges posed by big data are as follows:
Volume – is the sheer amount of data becoming unmanageable?
Variety – structured tables? Semi-structured JSON? Completely unstructured text dumps? We can normally manage systems that contain just one of these, but if we're dealing with a huge mix, it gets very tricky.
Velocity – how fast is the data coming in? And how fast do we need to get it to the people who need it?
Veracity – how do we maintain accuracy when the data arrives in varying volumes, the sources and structures are different, and the speeds at which data lands in the lake differ?
Managing all four simultaneously is where the challenges begin.

It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft's sales pitch says exactly this – "Storage is cheap, store everything!!". We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it's no use at all and will confuse anyone trying to make sense of it. This will essentially create a data swamp, which no one will want to go into. Bad data and poorly managed files erode trust in the lake as a source of information. Dumping is bad.

There is also data drowning – as the volume of data tends towards the massive and the velocity only increases over time, we are going to see more and more information available via the lake. When it gets to that point, if the lake is not well managed, then users are going to struggle to find what they're after. The data may all be entirely relevant and accurate, but if users cannot find what they need then there is no value in the lake itself. Essentially, data drowning is when the amount of data is so vast you lose the ability to find what's in there.

If you ignore these challenges and treat the lake like a dumping ground, you will have contaminated your lake and it will no longer be fit for purpose. If no one uses the Data Lake, it's a pointless endeavour and not worth maintaining. Everyone needs to be working together to ensure the lake stays clean, managed and good for a data dive!

Those are the risks and challenges we face with Azure Data Lake. But how do we manage them?

The Framework
We've carved the lake up into different sections. The key point is that the lake contains all sorts of different data – some that's sanitised and ready to be consumed by the business user, some that's indecipherable raw data that needs careful analysis before it is of use. By ensuring data is carefully managed, you can instantly understand the level of preparation that data has undergone. Data flows from left to right – the further left areas represent where data has been input directly from source systems. The horizontal sections describe the level of preparation – Manual, Stream and Batch.
Manual – aka the laboratory. Here data is manually prepared with ad-hoc scripts.
Stream – the data here flows in semi-real time, coming from event hubs and being landed after processing through stream-specific tools such as Stream Analytics. Once landed, there is no further data processing – the lake is essentially a batch processing tool.
Batch – this is more traditional data processing, the kind of "ETL" seen by many BI developers. We have a landing area for our raw data, a transitional area where data is cleaned, validated, enriched and augmented with additional sources and calculations, before finally being placed in a curated area where it is ready for consumption by the business.

We're taking the blank canvas of the Data Lake Store and applying a folder structure, a file management process and a curation process over the top. The folder structure itself can be as detailed as you like; we follow a specific structure ourselves:

The Raw data area, the landing place for any files entering the lake, has sub-folders for each source of data. This allows for easy browsing of the data sources within the lake and ensures we are not receiving the same data twice, even if we use it within different systems. The Enriched and Curated layers, however, have a specific purpose in mind. We don't take data and enrich/clean/process it without a business driver; it's not something we do for fun. We can therefore assign a project or system name to it, and at this point it is organised into these end-systems. This means we can view the same structure within Enriched as within Curated. Essentially, Raw data is categorised by source, whilst Enriched and Curated data is categorised by destination.

There's nothing complicated about the framework we've created or the processes we've ascribed to it, but it's incredibly important that everyone is educated on the intent of it and the general purpose of the data lake. If one user doesn't follow process when adding data, or an ETL developer doesn't clean up test files, the system starts to fall apart and we succumb to the challenges we discussed at the start.

To summarise, structure in your Azure Data Lake Store is key to maintaining order:
• You need to enforce and maintain folder structure.
• Remember that structure is necessary whether you are using unstructured data or tables & SQL.
• Bear in mind that schema-on-read applies temporary structure – but if you don't know what you're looking at, this is going to be very hard to do!
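One way to make the folder structure enforceable rather than aspirational is to script it. The following is an illustration only, a minimal sketch using the AzureRM.DataLakeStore cmdlets; the account name and the source/project folder names are hypothetical placeholders, not a prescribed layout.

# Hypothetical Data Lake Store account name
$adlsAccount = "adatisdatalake"

# Raw is organised by source system; Enriched and Curated by destination system
$folders = @(
    "/Raw/SourceSystemA",
    "/Raw/SourceSystemB",
    "/Enriched/ProjectX",
    "/Curated/ProjectX",
    "/Laboratory",
    "/Stream"
)

foreach ($folder in $folders) {
    # -Folder creates a directory rather than an empty file
    New-AzureRmDataLakeStoreItem -Account $adlsAccount -Path $folder -Folder
}

Keeping this script in source control gives you a single, reviewable definition of the lake's layout, which helps with the education point above: anyone adding a new source or project adds its folder here rather than inventing their own structure.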

SQL PASS Summit–Day 3 and Reflections

Apologies for the delay in getting this blog out to you all. When PASS finished on the Friday we had to rush over to the airport to get our flight back to the UK. When I landed on Saturday I was suffering from jet lag and only now am I in a fit state to blog again.

I got the impression from the schedule that Day 3 of PASS was going to be a wind-down day, as very few of the sessions seemed as intense as the previous days' sessions. My first session of the day, despite it being the last day of PASS, was early - earlier than any of the keynotes, but worth getting up for: a Chalk Talk with the Data Warehouse Fast Track team. This also included the Azure SQL Data Warehouse team, and the conversation was much more focused on the Azure side of data warehousing, with lots of discussion around PolyBase and patterns for getting data from on-prem to cloud using PolyBase. In terms of patterns, it was reassuring to learn that the approach Adatis has adopted is spot on. Simon Whiteley is the man to see about that. His blog is here: http://blogs.adatis.co.uk/simonwhiteley/

On the Fast Track theme, my next session explored the SQL Server Fast Track Data Warehouse, which was interesting to know about, especially the various testing that these pre-configured servers go through. At some point next year, Microsoft will be releasing the Fast Track Testing Programme to the community, so that everyone will be able to test their hardware to the same exacting standards and know what their maximum throughput, IO demand and so on is, in order to properly gauge hardware performance. After this session I got talking to a few people about data warehousing. The conversation was so engrossing that I missed the session I was due to attend. Luckily, most of the sessions at PASS are recorded, so I will have to chase up that session and others when they get released.

My final session of the day was a deep dive into SSIS 2016. It wasn't so much a deep dive and more a run-down of upcoming features. The one I'm most excited about is the Azure Data Lake Store connector, which will be released once Azure Data Lake goes into general availability - which I've been told is soon…

Now that I've had a week to digest and reflect on the SQL PASS Summit, my findings are thus: SQL PASS Summit is invaluable. It provides an opportunity to learn so much from so many people, and not just from the presenters. There are so many people from all over the SQL community, with different experiences of SQL, different experiences of data and different experiences of life, that you can't not learn something. PASS provides an easy environment to share ideas among peers and learn new technologies, new ways of working and new tricks. I've already started sharing some of my learnings with colleagues and I can't wait to share them with everyone else too!