
Adatis BI Blogs

Introduction to Azure Data Catalog

With the rise of self-service business intelligence tools, like Power BI, and an increased engagement with data in the workplace, people’s expectations of where they can find expert information about data have changed. Where previously there would be an expert that people would have to book time with in order to understand data, now people expect to get quick and detailed information about the data assets that an enterprise holds and maintains without going through a single contact. With Azure Data Catalog, data consumers can quickly discover data assets and gain knowledge about the data from documentation, tags and glossary terms provided by the subject matter experts. This post aims to give a brief introduction to Azure Data Catalog and what it can broadly be used for.

What is Azure Data Catalog?

Azure Data Catalog is a fully managed Azure service providing an enterprise-wide metadata catalogue that enables data discovery. With Azure Data Catalog, you register, discover, annotate and, for some sources, connect to data assets. Azure Data Catalog is designed to manage disparate information about data: to make it easy to find data assets, understand them, and connect to them. Any user (analyst, data scientist, or developer) can discover, understand, and consume data sources. Azure Data Catalog is a one-stop central shop for all users to contribute their knowledge and build a community and culture of data.

What can Azure Data Catalog be used for?

As mentioned above, Azure Data Catalog can be used for data asset management, data governance and data discovery. For data asset management, this means knowing what data is available and where; for data governance teams, this means answering questions like: where is my customer data? or what does this data model look like?; for data discovery, this means knowing which data is suitable for particular reports and who you can go to if you have any questions. There are some common scenarios for using Azure Data Catalog that Microsoft has put together, and they are well worth reading to get a fuller understanding of what Azure Data Catalog can be used for.

Process an Azure Analysis Services Tabular Model from an Azure Function

A couple of weeks ago I wrote a blog post detailing how to process a tabular model from a stored procedure. The challenge there was to have a way of processing a tabular model on demand from a web application. The challenge still exists, but how do you achieve the same in Azure using Platform as a Service (PaaS) offerings, which do not have the benefit of a full server and items such as msdb and SQL Agent to aid in the process?

In this post, I’m going to show you how to process a tabular model using only Azure PaaS offerings. Not only am I going to show you how to process a tabular model on demand, but also how to process a tabular model on a schedule. This post has taken inspiration, and part of the code base, from a Microsoft blog: Automating Azure Analysis Services processing with Azure Functions.

Azure Functions

Before we begin properly, it’s worth spending some time introducing Azure Functions. According to Microsoft, Azure Functions are:

…a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP. Pay only for the time your code runs and trust Azure to scale as needed. Azure Functions lets you develop serverless applications on Microsoft Azure.

They are super useful for extending the capabilities of any solution, and are not just limited to what we’re going to cover here.

On-Demand Refresh

An Azure Function provides the trigger for the on-demand refresh of a tabular model from the web application or webhook; this is to make sure that selected elements of data in a tabular model, for example hot partitions, are always up to date. The steps needed to create an Azure Function for an on-demand refresh are as follows:

1) Create an Azure Function App

Navigate to the Azure Portal and create a Function App (the name changes quite a bit, so don’t be concerned if it’s not displayed exactly as shown below).

2) Create a new Function

After you’ve created the Function App, we need to add a new Webhook + API function, which we’ll use as the basis for our on-demand refresh. Click on the + button next to Functions, select Webhook + API, choose C# as your language and click Create this function.

3) Configure the Function

Download the latest client libraries for Analysis Services. This needs to be done on your local machine so you can then copy the files to your Azure Function App. After you’ve downloaded the client libraries, the DLLs can be found in C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies. The two files you need are:

C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies\Microsoft.AnalysisServices.Core.DLL
C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies\Microsoft.AnalysisServices.Tabular.DLL

The documentation references the 130 assemblies, which is not correct and will not work; you need the assemblies in 140.

In order to add these assemblies to your function, you’ll need to add a folder called “bin”. To do this, select your Function App, click Platform features, and under Development Tools, click Advanced tools (Kudu).

In Kudu, click Debug console and select Cmd.
Navigate to the site\wwwroot\OnDemandHttpTrigger folder and add the “bin” folder here by clicking the + button. Once you’ve added the “bin” folder, go back over to the Azure portal and select your OnDemandHttpTrigger function. On the right, under View files, navigate to the bin folder, then click the Upload button to add the two previously mentioned DLLs to the bin folder. You should now see the two DLLs in your bin folder.

4) Add the Azure Analysis Services Connection String to the Function App

This step adds the connection string for the Azure Analysis Services (AAS) instance to the entire Function App, not just individual functions. Click the name of your Function App, then select Platform features. Select Application settings under General Settings. Now we need to add our connection string under the Connection strings section. You’ll need your AAS server name and a user ID and password that has access to the AAS database. You can find your AAS server name by navigating to your AAS database in the Azure portal and copying the value shown after clicking Show server connection strings. Your connection string should look like this:

Provider=MSOLAP;Data Source=<your aas server>;Initial Catalog=<aas database name>;User ID=<your username>;Password=<your password>

Back in the screen for the Function App, fill in the Name text box with a name for your connection string and paste your connection string in the Value text box. Click Save at the top to save these settings for your functions.

5) Time for Code

Our Function App has been configured; now we need to add code to the function. The function comes with some working sample code for testing out the functionality, but we don’t need everything that is on offer. We’re going to programmatically process the tabular model. In doing so, we’ll leverage Analysis Services Management Objects (AMO).
If you’re new to AMO, start here. Paste in the following code (all code referenced can also be downloaded from my GitHub repo):

#r "Microsoft.AnalysisServices.Tabular.DLL"
#r "Microsoft.AnalysisServices.Core.DLL"
#r "System.Configuration"

using System;
using System.Configuration;
using Microsoft.AnalysisServices.Tabular;
using System.Net;

public static async Task<HttpResponseMessage> Run(HttpRequestMessage req, TraceWriter log)
{
    log.Info("C# HTTP trigger function processed a request.");

    // Parse the query parameter
    string status = req.GetQueryNameValuePairs()
        .FirstOrDefault(q => string.Compare(q.Key, "status", true) == 0)
        .Value;

    if (status == null)
    {
        // Get the request body
        dynamic data = await req.Content.ReadAsAsync<object>();
        status = data?.status;
    }

    if (status == "execute")
    {
        log.Info($"C# trigger function started at: {DateTime.Now}");
        try
        {
            Microsoft.AnalysisServices.Tabular.Server asSrv = new Microsoft.AnalysisServices.Tabular.Server();
            var connStr = ConfigurationManager.ConnectionStrings["AASTabular"].ConnectionString;
            asSrv.Connect(connStr);
            Database db = asSrv.Databases["azureadventureworks"];
            Model m = db.Model;
            //db.Model.RequestRefresh(RefreshType.Full);          // Mark the whole model for refresh
            m.RequestRefresh(RefreshType.Full);                   // Mark the model for refresh
            //m.Tables["Date"].RequestRefresh(RefreshType.Full);  // Mark only one table for refresh
            db.Model.SaveChanges();                               // Commit, which will execute the refresh
            asSrv.Disconnect();
        }
        catch (Exception e)
        {
            log.Info($"C# trigger function exception: {e.ToString()}");
        }
        log.Info($"C# trigger function finished at: {DateTime.Now}");
    }

    return status == "execute"
        ? req.CreateResponse(HttpStatusCode.OK, "Successfully Processed Tabular Model ")
        : req.CreateResponse(HttpStatusCode.BadRequest, "Please pass a status on the query string or in the request body");
}

Click the Save button at the top.

6) Test, Test, Test

Click the Run button at the top to test the function. The function can also be tested in a web browser, and called by a web app using the POST HTTP method (a minimal sketch of doing so from C# follows below). Now we have a fully functioning method of refreshing an Azure Analysis Services tabular model on demand.
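For completeness, here is a minimal sketch of how a web application might call the on-demand function over HTTP. The class name, function URL and key below are placeholders rather than anything from the walkthrough above; copy the real URL from "Get function URL" in the portal.

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class TabularRefreshClient
{
    // Hypothetical function URL and key - take the real values from your own Function App.
    private const string FunctionUrl =
        "https://<your-function-app>.azurewebsites.net/api/OnDemandHttpTrigger?code=<your-function-key>";

    public static async Task RequestRefreshAsync()
    {
        using (var client = new HttpClient())
        {
            // The function reads "status" from the query string or the request body;
            // here it is sent in the body as JSON.
            var body = new StringContent("{\"status\":\"execute\"}", Encoding.UTF8, "application/json");
            HttpResponseMessage response = await client.PostAsync(FunctionUrl, body);
            Console.WriteLine($"{(int)response.StatusCode}: {await response.Content.ReadAsStringAsync()}");
        }
    }
}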
Scheduled Refresh

An Azure Function also provides the trigger for the scheduled refresh of a tabular model; this is to make sure that the entire tabular model has the latest data and is always up to date. The steps needed to create an Azure Function for a scheduled refresh are as follows:

1) Create a Function

We’ve created our Function App, and now we need to add a new Timer Trigger function, which we’ll use as the basis for our scheduled refresh. Click on the + button next to Functions, select Timer, choose C# as your language and click Create this function.

2) Configure the Timer

What use is a timer without a schedule? To give the timer a schedule, click Integrate, set the schedule and click Save. The Schedule text box expects a CRON expression to define the days and times that the function should execute. Click the little Documentation button on that screen to read about CRON expressions. The schedule I’ve set is to run once every day at 09:30AM.

3) Configure the Function

See step 3 of the on-demand function for detailed steps. You’ll need to create the bin folder and upload the DLLs to it.

4) Time for Code

We’ve configured our function, so now it’s time to add the code. The code base is much simpler than the on-demand code base, mainly because it’s doing fewer tasks. But the AMO section is exactly the same. Paste in the following code:

#r "Microsoft.AnalysisServices.Tabular.DLL"
#r "Microsoft.AnalysisServices.Core.DLL"
#r "System.Configuration"

using System;
using System.Configuration;
using Microsoft.AnalysisServices.Tabular;

public static void Run(TimerInfo myTimer, TraceWriter log)
{
    log.Info($"C# Timer trigger function started at: {DateTime.Now}");
    try
    {
        Microsoft.AnalysisServices.Tabular.Server asSrv = new Microsoft.AnalysisServices.Tabular.Server();
        var connStr = ConfigurationManager.ConnectionStrings["AASTabular"].ConnectionString;
        asSrv.Connect(connStr);
        Database db = asSrv.Databases["azureadventureworks"];
        Model m = db.Model;
        //db.Model.RequestRefresh(RefreshType.Full);          // Mark the whole model for refresh
        m.RequestRefresh(RefreshType.Full);                   // Mark the model for refresh
        //m.Tables["Date"].RequestRefresh(RefreshType.Full);  // Mark only one table for refresh
        db.Model.SaveChanges();                               // Commit, which will execute the refresh
        asSrv.Disconnect();
    }
    catch (Exception e)
    {
        log.Info($"C# Timer trigger function exception: {e.ToString()}");
    }
    log.Info($"C# Timer trigger function finished at: {DateTime.Now}");
}

Click the Save button at the top.

5) Test, Test, Test

Click the Run button at the top to test the function. Now we have a fully functioning method of refreshing an Azure Analysis Services tabular model on a schedule.

Conclusion

I have shown you how simple it is to invoke two methods of refreshing a tabular model using Azure Functions: an on-demand refresh and a refresh by schedule. I hope that you take inspiration and use these methods in your own use of both Azure Analysis Services and Azure Functions.

Extraction and Analysis of GeoSpatial data with Azure Data Lake Analytics

I recently had to implement a solution to prove it was possible to integrate a shape file (.SHP) in Azure Data Lake Store (ADLS) for later geographic spatial analysis using Azure Data Lake Analytics (ADLA). A shape file is a data set used by a geographic analysis application that stores a collection of geographic features, such as streets or zip code boundaries, in the form of points, lines or area features.

As you have already figured, storing a shape file in ADLS is not a difficult goal to achieve; however, how can you possibly use ADLA to obtain the geographic data from the file? In this blog I’ll explain how we can extract the data to a supported format, such as CSV, and use it to run geographic spatial analysis in ADLA, with the support of the spatial data types introduced in SQL Server 2008 (details here).

As always, whenever we face a limitation of ADLA, C# is our best friend. In order to read the content of a shape file, we need to start by adding a geospatial assembly to our solution, which, in my case, was the “Catfood” ESRI Shapefile Reader (details here).

The shape file used in this example contains a list of parks in London. The following code demonstrates how to extract the metadata and the geographic shapes to a CSV file. The only shapes extracted are polygons, although it is possible to add more if needed.

public static void CreateWorkForThreads()
{
    // Create a new dataset and store the data in a table
    DataSet ds = CreateNewDataSet();
    DataTable dt = ds.Tables[0];

    int i;
    int count = 0;

    // Parse the shapefile and select the columns we are interested in
    using (Shapefile shapefile = new Shapefile(@"path\file.shp"))
    {
        foreach (Shape shape in shapefile)
        {
            string[] metadataNames = shape.GetMetadataNames();
            string geometry = "";
            int countParts = 0;
            int countShape = 0;

            DataRow dr = dt.NewRow();

            // Extract the metadata. The first iteration will extract the name of the columns
            if (metadataNames != null)
            {
                foreach (string metadataName in metadataNames)
                {
                    if (count == 0)
                        dr[metadataName] = metadataName;
                    else
                        dr[metadataName] = shape.GetMetadata(metadataName);
                }
            }

            // Shape is not part of the metadata, so manually define the name of the column
            if (count == 0)
            {
                dr["shape"] = "shape";
            }
            else
            {
                // Cast the shape based on its type
                switch (shape.Type)
                {
                    case ShapeType.Point:
                        // A point is just a single x/y point
                        ShapePoint shapePoint = shape as ShapePoint;
                        MessageBox.Show("Point (" + shapePoint.Point.X.ToString() + ", " + shapePoint.Point.Y.ToString() + ")");
                        break;

                    case ShapeType.Polygon:
                        // A polygon contains one or more parts - each part is a list of points which
                        // are clockwise for boundaries and anti-clockwise for holes
                        // see http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf
                        ShapePolygon shapePolygon = shape as ShapePolygon;
                        foreach (PointD[] part in shapePolygon.Parts)
                        {
                            countShape = 0;

                            if (countParts == 0)
                                geometry = "(";
                            else
                                geometry = geometry + " | (";

                            foreach (PointD point in part)
                            {
                                if (part.Length - 1 != countShape)
                                    geometry = geometry + point.X + " " + point.Y + " |";
                                else
                                    geometry = geometry + point.X + " " + point.Y + " )";

                                countShape++;
                            }

                            countParts++;
                        }
                        break;

                    default:
                        break;
                }

                // Build our polygon, e.g.
                // POLYGON((-122.358 47.653, -122.348 47.649| -122.348 47.658, -122.358 47.658, -122.358 47.653))
                dr["shape"] = "POLYGON(" + geometry + ")";
            }

            dt.Rows.Add(dr);
            count++;
        }
    }

    // Extract the data to a csv file
    using (System.IO.StreamWriter sw = new System.IO.StreamWriter(@"path\filename.csv"))
    {
        foreach (DataRow row in dt.Rows)
        {
            object[] array = row.ItemArray;

            for (i = 0; i < array.Length - 1; i++)
            {
                sw.Write(array[i].ToString() + ",");
            }

            sw.WriteLine(array[i].ToString());
        }
    }
}

public static DataSet CreateNewDataSet()
{
    DataSet dsTemp = new DataSet();
    DataTable dtTemp = new DataTable("londonparks");
    dtTemp.Columns.Add("id", typeof(string));
    dtTemp.Columns.Add("parkname", typeof(string));
    dtTemp.Columns.Add("street", typeof(string));
    dtTemp.Columns.Add("postcode", typeof(string));
    dtTemp.Columns.Add("shape", typeof(string));
    dsTemp.Tables.Add(dtTemp);
    return dsTemp;
}

Now that we have a valid file that can be processed by ADLA, we can upload it to ADLS and start performing geospatial analysis. To do so, I simply followed the logic described in Sacha’s blog (here).

The following U-SQL takes into consideration a dataset that contains details of the trajectory of a courier, tracked on a daily basis. With the following code, we identify whether a courier drove by a park by using the Intersect function. Because we have to cross two datasets, a C# function was created to help evaluate multiple events.

// Reference the assemblies we require in our script.
REFERENCE SYSTEM ASSEMBLY [System.Xml];
REFERENCE ASSEMBLY [SQLServerExtensions].[SqlSpatial];
REFERENCE ASSEMBLY [USQL.Core];

// Once the appropriate assemblies are registered, we can alias them using the USING keyword.
USING Geometry = Microsoft.SqlServer.Types.SqlGeometry;
USING Geography = Microsoft.SqlServer.Types.SqlGeography;
USING SqlChars = System.Data.SqlTypes.SqlChars;
USING [USQL].[Core].[Utilities];

// Extract the list of parks
@parks =
    EXTRACT [ID] string,
            [PARKNAME] string,
            [STREET] string,
            [POSTCODE] string,
            [SHAPE] string
    FROM "RAW/Parks.csv"
    USING Extractors.Text(delimiter : ',', silent : false, quoting : true, skipFirstNRows : 1);

// Extract data from the file containing the courier trajectory
@trajectories =
    EXTRACT GPSDateTimeUTC DateTime,
            ReceivedDatetimeUTC DateTime,
            VehicleKey string,
            Altitude int,
            Longitude double,
            Latitude double,
            Distance decimal,
            VehicleSpeedMph decimal
    FROM "CURATED/Trajectory/Trajectory.TXT"
    USING Extractors.Text(delimiter : '|', silent : false, quoting : true, skipFirstNRows : 1);

// Get the list of vehicles that drove by the park
@vehicleIntersection =
    SELECT DISTINCT a. *,
           "1" AS VehicleIntersected
    FROM @trajectories AS a
         CROSS JOIN @parks AS b
    WHERE Utilities.Intersect(b.[SHAPE], a.[Longitude], a.[Latitude]).ToString() == "True";

// Get the list of vehicles that didn't drive by the park
@vehicleWithoutIntersection =
    SELECT a. *,
           "0" AS VehicleIntersected
    FROM @trajectories AS a
         LEFT JOIN @vehicleIntersection AS b
             ON b.VehicleKey == a.VehicleKey AND b.GPSDateTimeUTC == a.GPSDateTimeUTC
    WHERE b.VehicleKey IS NULL;

// Union both datasets to get the complete set of data
@finalData =
    SELECT * FROM @vehicleIntersection
    UNION ALL
    SELECT * FROM @vehicleWithoutIntersection;

// Export the results to a csv file
OUTPUT @finalData
TO "LABORATORY/GeoSpatialIntersection.csv"
USING Outputters.Text(outputHeader : true, delimiter : ',', quoting : true);

And here is the C# function.
It accepts three parameters and calculates the intersection of a point with a shape.

public static string Intersect(string shape, double longitude, double latitude)
{
    // Because we had a csv file, the coordinates in the polygon were separated by |
    // It is important to use the .MakeValid() method to validate any invalid shape
    // In case the dataset had multipolygon shapes, without MakeValid() the function would throw an error
    var g = Geography.STGeomFromText(
                new SqlChars(
                    shape.Replace('|', ',')), 4326).MakeValid();
    var h = Geography.Point(longitude, latitude, 4326);

    return g.STIntersects(h).ToString();
}

As always, if you have any questions or comments, do let me know.

Different ways of performing file operations on Azure Data Lake

Recently at a client, we needed to come up with a few different ways that we can perform file management operations within their Data Lake - for example moving files once processed and, in their case, renaming folders etc. We needed to come up with different solutions to what we currently used in order to keep within their desired architecture. So we started looking at using the REST API and calling that using C# within an SSIS package. The other option I looked at was using Python. I will explain more about both methods below, but first there is some set-up we need to do.

Pre-Requisites

Aside from having an Azure Subscription and a Data Lake Store account, you will need an Azure Active Directory Application. For our case we needed a Web Application as we will be doing service-to-service authentication. This is where the application provides its own credentials to perform the operations, whereas with end-user authentication, a user must log into your application using Azure AD.

Service-to-service authentication setup

Data Lake Store uses Azure Active Directory (AAD) for authentication, and this results in our application being provided with an OAuth 2.0 token, which gets attached to each request made to the Azure Data Lake Store. To read more about how it works, how to create the app, how to get the relevant credentials and how to give it access to the Data Lake Store, follow the Microsoft tutorial here. Make a note of the below as we will be needing them when we develop the solutions:

Tenant ID (also known as Directory ID)
Client ID (also known as the Application ID). Within App registrations, if you look for your app under ‘All Apps’ and click on it you will be able to retrieve the Application ID.
Client Secret (also known as Authentication Key). Within the app area used above, click on Settings and then ‘Keys’. If you haven’t previously created one, you can create it there and you must remember to save it when it appears as you will not get another chance!
Data Lake Name

Using the REST API

Now we have everything set up and all the credentials we need, we can make a start constructing the requests. In order to test them out I used Postman, which can be downloaded here.

Authentication

Firstly, before we can begin any folder management operations, we need to authenticate against the Data Lake. To do this we need to perform a POST request in order to obtain an access token. This token will be used and passed to the other REST API calls we will make (e.g. deleting files) as we will always need to authenticate against the Data Lake. To retrieve the access token, we need to pass through the TENANT ID, CLIENT ID and CLIENT SECRET, and the request looks as follows:

curl -X POST https://login.microsoftonline.com/<TENANT-ID>/oauth2/token \
  -F grant_type=client_credentials \
  -F resource=https://management.core.windows.net/ \
  -F client_id=<CLIENT-ID> \
  -F client_secret=<AUTH-KEY>

Within Postman, it looks like this:

1. Make sure the request type is set to POST.
2. Make sure you have added your tenant ID to the request.
3. Fill out the body with your Client ID and Client Secret (grant_type and resource are set as constant values as shown above).
4. Make a note of the Bearer access token as we will need it to perform any file management operation on the Data Lake.
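Since the plan further down is to call these same requests from C# script tasks in SSIS, here is a minimal sketch of requesting the access token from C#. The class and method names are illustrative only; the endpoint and form fields mirror the curl request above.

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

public static class DataLakeAuth
{
    // Pass in the Tenant ID, Client ID and Client Secret noted earlier.
    public static async Task<string> GetAccessTokenAsync(string tenantId, string clientId, string clientSecret)
    {
        using (var client = new HttpClient())
        {
            // Same fields as the curl request: the client credentials flow against Azure AD.
            var form = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["grant_type"] = "client_credentials",
                ["resource"] = "https://management.core.windows.net/",
                ["client_id"] = clientId,
                ["client_secret"] = clientSecret
            });

            var response = await client.PostAsync($"https://login.microsoftonline.com/{tenantId}/oauth2/token", form);
            response.EnsureSuccessStatusCode();

            // The JSON response contains an "access_token" property; in a real package you would
            // parse the body with a JSON library rather than returning the raw string.
            return await response.Content.ReadAsStringAsync();
        }
    }
}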
Deleting a File

Now we have our access token, we can perform a deletion of a file with the following:

curl -i -X DELETE -H "Authorization: Bearer <REDACTED>" 'https://<yourstorename>.azuredatalakestore.net/webhdfs/v1/mytempdir/myinputfile1.txt?op=DELETE'

Within Postman, it looks like the following:

1. This is a DELETE request and I have therefore changed the dropdown to reflect that.
2. Remember to add your Data Lake Store name into the request; in my example it is called emma1.
3. You can point to a particular file, or you can point it to a folder and add &recursive=true to the request and it will delete all the files within the folder, including the folder itself. I haven’t managed to find a way to delete just the contents of the folder and leave the folder as is.
4. The access token is sent as a header called ‘Authorization’. Make sure to include ‘Bearer ’ before your access token as highlighted above.

Once you have sent the request, you will receive some JSON in the output to show whether the action has been successful (true). You can perform many more file management operations using the REST API and the code can be found here.

Common Errors

1. The following error is caused by running the delete file request when passing through an access token that has expired. To fix this issue, re-generate the access token and pass that through instead.
2. The error below is caused by the application that we are using to access the Data Lake Store not having sufficient permissions to perform the operations. Make sure it has access to the relevant folder(s). Check step 3 here to find out how to set the access.

Summary

Now we have managed to get them working manually within Postman, we need to consider how to call them in a production environment. The solution we implemented was an SSIS package (in keeping with their current architecture) with script tasks calling C#, which in turn calls the API. Before the file system operation is called, we will run the authentication API call to obtain the latest access token, and place the value in a variable to be used later on within the package to ensure it is the latest.

Using Python

From having a play around with Python to do similar file management operations, it seems rather limited in comparison and you can’t do as much. Nevertheless, I am sure more functionality will be added and it is useful to know how it works. Firstly, if you don’t already have Python, you can download the latest version from here. As an IDE, I have been using Visual Studio 2017, which now comes with Python support; see here for further information.

In order for us to be able to perform operations on the Data Lake, we need to install three Azure modules. To install the modules, open up the command prompt and run the following:

pip install azure-mgmt-resource
pip install azure-mgmt-datalake-store
pip install azure-datalake-store

Now we need to create the Python app (I used Visual Studio) to do the folder management tasks. In order to reference the modules we have just installed, we need to import the relevant modules so we can use them within our app. Each time we create an app related to Data Lake folder manipulations, we need to add them in. The code below shows how to do this. Save the application, but don’t run it yet!
## Use this for Azure AD authentication
from msrestazure.azure_active_directory import AADTokenCredentials

## Required for Azure Data Lake Store account management
from azure.mgmt.datalake.store import DataLakeStoreAccountManagementClient
from azure.mgmt.datalake.store.models import DataLakeStoreAccount

## Required for Azure Data Lake Store filesystem management
from azure.datalake.store import core, lib, multithread

## Common Azure imports
import adal
from azure.mgmt.resource.resources import ResourceManagementClient
from azure.mgmt.resource.resources.models import ResourceGroup

## Use these as needed for your application
import logging, getpass, pprint, uuid, time

Firstly, we need to authenticate with Azure AD. Again, as described above there are two ways: end-user and service-to-service. We will be using service-to-service again in this example. To set this up, we run the following:

adlCreds = lib.auth(tenant_id = 'FILL-IN-HERE', client_secret = 'FILL-IN-HERE', client_id = 'FILL-IN-HERE', resource = 'https://datalake.azure.net/')

And fill in the TENANT ID, CLIENT SECRET and CLIENT ID that we captured earlier on. Now that we can authenticate against the Data Lake, we can attempt to delete a file. We need to import some more modules, so add the script below to your application:

## Use this only for Azure AD service-to-service authentication
from azure.common.credentials import ServicePrincipalCredentials

## Required for Azure Data Lake Store filesystem management
from azure.datalake.store import core, lib, multithread

We now need to create a filesystem client:

## Declare variables
subscriptionId = 'FILL-IN-HERE'
adlsAccountName = 'FILL-IN-HERE'

## Create a filesystem client object
adlsFileSystemClient = core.AzureDLFileSystem(adlCreds, store_name=adlsAccountName)

We are now ready to perform some file management operations, such as deleting a file:

## Delete a directory
adlsFileSystemClient.rm('/mysampledirectory', recursive=True)

Please see the script below for the full piece of code. You can find information on the other operations you can complete (e.g. creating directories) here.

## Use this for Azure AD authentication
from msrestazure.azure_active_directory import AADTokenCredentials

## Required for Azure Data Lake Store filesystem management
from azure.datalake.store import core, lib, multithread

## Common Azure imports
import adal
from azure.mgmt.resource.resources import ResourceManagementClient
from azure.mgmt.resource.resources.models import ResourceGroup

## Use these as needed for your application
import logging, getpass, pprint, uuid, time

## Service-to-service authentication with client secret for file system operations
adlCreds = lib.auth(tenant_id = 'XXX', client_secret = 'XXX', client_id = 'XXX', resource = 'https://datalake.azure.net/')

## Create filesystem client
## Declare variables
subscriptionId = 'XXX'
adlsAccountName = 'emma1'

## Create a filesystem client object
adlsFileSystemClient = core.AzureDLFileSystem(adlCreds, store_name=adlsAccountName)

## Create a directory
#adlsFileSystemClient.mkdir('/mysampledirectory')

## Delete a directory
adlsFileSystemClient.rm('/mysampledirectory', recursive=True)

Summary

In summary, there are a few different ways in which you can handle your file management operations within Data Lake, and the principles behind the methods are very similar. So, if one way doesn’t fit into your architecture, there is always an alternative.

Deploying Scheduled Triggers In Data Factory v2 With PowerShell

So in ADFv2, scheduled triggers have been overhauled to become their own deployable components, able to kick off multiple parameterised pipelines together in each trigger. This new version of the trigger seems far more modular and flexible than the functionality in v1; however, there are a couple of catches and easy-to-miss steps during implementation if you choose to use PowerShell rather than the brand new visual authoring tool, so I’ll break down the process below.

First, you need to build your trigger, as a separate JSON object to the pipeline. No additional code is needed in the pipeline itself. There is a more complete example in the documentation Microsoft has provided, but on top of that here is a working example I’ve written below:

{
    "name": "ST_OrchestratorTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2017-12-20T04:00:00-05:00",
                "endTime": "2099-11-02T22:00:00-08:00",
                "schedule": {
                    "hours": [ 4 ],
                    "minutes": [ 0 ]
                }
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "PL_MasterOrchestration"
                },
                "parameters": {}
            }
        ]
    }
}

As you might guess, this trigger runs the PL_MasterOrchestration pipeline (with no input parameters) at 4am each day.

A caveat on the recurrence: it seems that when running, the trigger doesn’t strictly adhere to the “startTime” after the first time it runs, so despite starting at 4am, the subsequent trigger could be minutes or possibly hours off on the next day. To ensure it always triggers at exactly 4am, you need to add the “schedule” recurrence attribute as shown above.

Deployment of the trigger (as of writing this blog) is still only done through PowerShell scripts, using the cmdlet below:

Set-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -DefinitionFile $TriggerFile `
    -Name $TriggerName `
    -Force;

Each of these parameters is exactly as you would expect from the v1 parameters for setting other objects. -Force shouldn’t strictly be necessary, but sometimes is worthwhile putting in depending on the exact circumstances; -DefinitionFile is asking for the .JSON file itself, etc.

So, at this point you’ve created the scheduler and deployed it to your data factory. Scheduled triggers in v2 need to be started once they are deployed. You can do this with the following:

Start-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $TriggerName

Just another quick note here: you will need to stop the trigger again if you want to redeploy it. Trying to overwrite a trigger while it’s running will just result in an error message (a sketch of the full stop/redeploy/start cycle is included at the end of this post).

As well as finding the necessary cmdlets two-thirds of the way down the Microsoft document linked above, there is a list of all relevant cmdlets found here.

Hope this helps!

The documentation again:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-schedule-trigger
https://docs.microsoft.com/en-us/powershell/module/azurerm.datafactoryv2/start-azurermdatafactoryv2trigger?view=azurermps-4.4.1
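As promised above, here is a minimal sketch of that redeployment cycle, assuming the same $ResourceGroupName, $DataFactoryName, $TriggerFile and $TriggerName variables used earlier:

# Stop the running trigger before overwriting it
Stop-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $TriggerName

# Redeploy the updated trigger definition
Set-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -DefinitionFile $TriggerFile -Name $TriggerName -Force

# Start it again so the schedule resumes
Start-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $TriggerName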

The Road to Microsoft MVP (Most Valuable Professional)

It is with great pleasure that I can share with you all that I am now a Microsoft MVP (Most Valuable Professional). The MVP programme (https://mvp.microsoft.com/) is designed to recognise those professionals who give their time to support technical communities and Microsoft. This could be through blogging, speaking at conferences, providing feedback, helping others and contributing code to projects. There are many award categories, ranging from PowerPoint to Xbox, each with its own contribution areas. I was awarded MVP for the Data Platform, which covers SQL Server, Data Lake, HDInsight and many of the topics I am particularly interested in. I am exceptionally pleased to be amongst the 19 Data Platform MVPs in the UK and 1 of the 418 Data Platform MVPs worldwide.

In a recent blog, Kevin Kline discussed what it takes to be an MVP; in his opinion it all boils down to being passionate about what you do and sharing that passion with others (https://blogs.sentryone.com/kevinkline/how-can-i-become-a-microsoft-mvp/). I could not agree more! I talk at events because I want to help people. I still get so much out of attending sessions and I want to make sure that I give back everything I took from the community. Kevin's blog gives a fantastic overview of what it takes to become an MVP and is well worth a read.

When I first started attending user groups and conferences I began to become aware of the term MVP. At that point there was only a handful of MVPs in the UK (back then you were a SQL Server MVP). Their sessions at conferences were always held in high regard, and rightly so; these men and women were at the top of their game. I looked up to these professionals and would always learn something from reading their blogs and books and seeing them talk. When I started talking at user groups and SQL Saturdays I always wanted to become an MVP, but it was never my reason for presenting; it is now a very happy by-product of giving back to a technical community.

MVP is awarded based on your contributions in the last 12 months, so what have I done in the last 12 months?

Blogged (http://blogs.adatis.co.uk/terrymccann/)
Delivered 25 technical sessions at user groups and conferences
Spoke at the PASS Summit in Seattle
Organised a data science meetup

It has been a real blast over the last 12 months; I could not have spoken at so many events without the support of Adatis - so thanks. I want to also thank everyone who took the time to nominate me, and Microsoft for their help and support. (Some of the MVPs who nominated me - at a recent MVP event)

What will I be doing over the next 12 months? A lot of the same, although I would like to get a few more blogs written this year. If you want to see me talk, I will be at SQLBits in London in February delivering a pre-conference training day and a general session:

A Data Engineer’s Guide to Azure SQL Data Warehouse
Enhancing relational models with graph in SQL Server 2017

Beyond that, keep an eye out at your local user group or SQL Saturday. See you soon.

Archiving the Data Lake

In a blog introducing the Data Lake Framework, keen readers will be aware that in the diagram there’s a box titled “ARCHIVE” which has not been brought up since. The reason why the Archive layer in the data lake has not been discussed is because we’ve been waiting for the Archive tier in Blob Storage. To remind readers of the framework and where the archive layer sits in it, here it is again with the archive layer highlighted.

The Archive Blob

The Archive access tier in blob storage was made generally available today (13th December 2017) and with it comes the final piece in the puzzle to archiving data from the data lake. Where Hot and Cool access tiers can be applied at a storage account level, the Archive access tier can only be applied to a blob storage container. To understand why the Archive access tier can only be applied to a container, you need to understand the features of the Archive access tier. It is intended for data that has no or low SLAs for availability within an organisation, and the data is stored offline (Hot and Cool access tiers are online). Therefore, it can take up to 15 hours for data to be brought online and made available. Bringing Archive data online is a process called rehydration (fitting for the data lake). If you have lots of blob containers in a storage account, you can archive them and rehydrate them as required, rather than having to rehydrate the entire storage account.

Archive Pattern

An intended use for the Archive access tier is to store raw data that must be preserved, even after it has been fully processed, and does not need to be accessed within 180 days. Data gets loaded into the RAW area of the data lake, is fully processed through to CURATED, and a copy of the raw data is archived off to a blob container with a Cool access tier applied to it. When the archive cycle comes around, a new Cool access tiered blob container is created and the now old container has its access tier changed to Archive. For example, our archive cycle is monthly and we have a Cool access tiered blob container in our storage account called “December 2017”. When data has finished being processed in the Azure Data Lake, the raw data is archived to this blob container. January comes around, so we create a new blob container called “January 2018” with the Cool access tier and change the access tier of “December 2017” from Cool to Archive. This data has now been formally archived and is only available for disaster recovery, auditing or compliance purposes.

The thing that stumps everyone new to Azure Data Factory

Recently I was playing around with Azure Data Factory and was making my way through an online tutorial when I came across a frustrating error message that was not very helpful and my data wasn’t moving anywhere! The aim of this exercise was to move some text files in Azure Data Lake Store over to a table in an Azure SQL Database. Sounds easy enough!

The data within the Data Lake Store was organised into a Year and Month hierarchy for the folders, and each day’s transactions were stored in a file which was named after the day within the relevant month folder. The task then was to create a pipeline which copies the dataset in the Data Lake Store over to the dbo.Orders table in Azure SQL DB every day within the scheduled period (Q1 2016).

After creating all the JSON scripts and deploying them (with no errors), I clicked on the ‘Monitor and Manage’ tile to monitor the activities, check everything was working as it should be and monitor the progress. After waiting for at least 10 minutes, I started to get frustrated. As you can see, all the pipeline activities for each slice have a status of ‘Waiting: Dataset Dependencies’, and if you look at the Activity Window Explorer (see below), the activity window for each day shows a status of Waiting (a solid orange square).

The files existed in the correct folders within my Data Lake Store and there were no errors during deployment, so what was it waiting for?! Well, it turns out we need to set an additional property on the input dataset to let Data Factory know that the data is being generated externally and not internally within Data Factory from another pipeline.

That property is "external": true at the bottom of the script. Below is an example of a data factory, and the arrows point to the datasets that need to have this external property set to true. Using the diagram of the pipeline once deployed, it is easy to identify which datasets need to be external, as they are not linked to anything upstream.
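For illustration, here is a minimal sketch of where that flag sits in an ADF v1 input dataset definition. The dataset name, linked service and folder path are placeholders rather than the tutorial's actual values; the point is simply that "external": true lives inside "properties":

{
    "name": "InputOrdersDataset",
    "properties": {
        "type": "AzureDataLakeStore",
        "linkedServiceName": "AzureDataLakeStoreLinkedService",
        "typeProperties": {
            "folderPath": "orders/",
            "format": { "type": "TextFormat" }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true
    }
}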

LETS Process Data–Modern ETL for the Data Lake

At the PASS Summit this year, I attended a session by Michael Rys. In this session he introduced the concept of LETS as an approach to processing data in the data lake. If you are familiar with the data lake, then you will be familiar with having to apply a schema to the data held within. The LETS approach is purpose-designed for schematisation. Where ETL stands for Extract, Transform, Load and ELT stands for Extract, Load, Transform, LETS stands for Load, Extract, Transform, Store:

Data are Loaded into the data lake
Data are Extracted and schematised
Data are Transformed in rowsets
Data are Stored in a location, such as the Catalog in Azure Data Lake Analytics, Azure SQL Data Warehouse or Azure Analysis Services, for analysis purposes.

I really like this approach as it makes sense for how data are handled in the data lake. It’s something that I will be advocating and using, and I hope you do too!
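To make the pattern concrete, here is a minimal U-SQL sketch of the Extract, Transform and Store steps over data already loaded into the lake. The file path, column names, database and table names are illustrative only:

// Extract: apply a schema to the raw file at read time
@sales =
    EXTRACT CustomerId string,
            SaleDate DateTime,
            Amount decimal
    FROM "/RAW/Sales/sales.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

// Transform: shape the rowset
@salesByCustomer =
    SELECT CustomerId,
           SUM(Amount) AS TotalAmount
    FROM @sales
    GROUP BY CustomerId;

// Store: persist the result in the ADLA catalog for later analysis
CREATE DATABASE IF NOT EXISTS Curated;
USE DATABASE Curated;

DROP TABLE IF EXISTS dbo.SalesByCustomer;

CREATE TABLE dbo.SalesByCustomer
(
    INDEX idx_customer CLUSTERED (CustomerId ASC)
    DISTRIBUTED BY HASH (CustomerId)
) AS
SELECT *
FROM @salesByCustomer;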

Migrating to Native Scoring with SQL Server 2017 PREDICT

Microsoft introduced native predictive model scoring with the release of SQL Server 2017. The PREDICT function (documentation) is now a native T-SQL function that eliminates having to score using R or Python through the sp_execute_external_script procedure. It's an alternative to sp_rxPredict. In both cases you do not need to install R, but with PREDICT you do not need to enable SQLCLR either - it's truly native. PREDICT should make predictions much faster, as the process avoids having to marshal the data between SQL Server and Machine Learning Services (previously R Services).

Migrating from the original sp_execute_external_script approach to the new native approach tripped me up, so I thought I'd share a quick summary of what I have learned.

Stumble One:

Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80004001. Model type is unsupported.

Reason: Not all models are supported. At the time of writing, only the following models are supported:

rxLinMod
rxLogit
rxBTrees
rxDtree
rxdForest

sp_rxPredict supports additional models, including those available in the MicrosoftML package for R (I was attempting to use rxFastTrees). I presume this limitation will reduce over time. The list of supported models is referenced in the PREDICT function documentation.

Stumble Two:

Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80070057. Model is corrupt or invalid.

Reason: The serialisation of the model needs to be modified for use by PREDICT. Typically you might serialise your model in R like this:

model <- data.frame(model=as.raw(serialize(model, NULL)))

Instead you need to use the rxSerializeModel method:

model <- data.frame(rxSerializeModel(model, realtimeScoringOnly = TRUE))

There's a corresponding rxUnserializeModel method, so it's worth updating the serialisation across the board so models can be used interchangeably in the event all models are eventually supported. I have been a bit legacy.

That's it. Oh, apart from the fact PREDICT is supported in Azure SQL DB, despite the documentation saying the contrary.
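For context, here is a minimal sketch of what a native scoring call looks like once a supported model has been serialised with rxSerializeModel and stored in a table. The table and column names are placeholders, and the WITH clause must match the outputs of your particular model:

DECLARE @model varbinary(max) =
    (SELECT model FROM dbo.Models WHERE model_name = 'MyLinMod');

SELECT d.*, p.Score
FROM PREDICT(MODEL = @model, DATA = dbo.NewData AS d)
WITH (Score float) AS p;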

Automating The Deployment of Azure Data Factory Custom Activities

Custom Activities in Azure Data Factory (ADF) are a great way to extend the capabilities of ADF by utilising C# functionality. Custom Activities are useful if you need to move data to/from a data store that ADF does not support, or to transform/process data in a way that isn't supported by Data Factory, as they can be used within an ADF pipeline.

Deploying Custom Activities to ADF is a manual process, which requires many steps. Microsoft’s documentation lists them as:

Compile the project. Click Build from the menu and click Build Solution.
Launch Windows Explorer, and navigate to the bin\debug or bin\release folder depending on the type of build.
Create a zip file MyDotNetActivity.zip that contains all the binaries in the \bin\Debug folder. Include the MyDotNetActivity.pdb file so that you get additional details, such as the line number in the source code that caused the issue, if there is a failure.
Create a blob container named customactivitycontainer if it does not already exist.
Upload MyDotNetActivity.zip as a blob to the customactivitycontainer in a general purpose Azure blob storage account that is referred to by AzureStorageLinkedService.

The number of steps means that it can take some time to deploy Custom Activities and, because it is a manual process, it can contain errors such as missing files or uploading to the wrong storage account. To avoid the errors and delays caused by a manual deployment, we want to automate as much as possible. Thanks to PowerShell, it’s possible to automate the entire set of deployment steps. The script to do this is as follows:

Login-AzureRmAccount

# Parameters
$SourceCodePath = "C:\PathToCustomActivitiesProject\"
$ProjectFile = "CustomActivities.csproj"
$Configuration = "Debug"

# Azure parameters
$StorageAccountName = "storageaccountname"
$ResourceGroupName = "resourcegroupname"
$ContainerName = "blobcontainername"

# Local Variables
$MsBuild = "C:\Program Files (x86)\MSBuild\14.0\Bin\MSBuild.exe";
$SlnFilePath = $SourceCodePath + $ProjectFile;

# Prepare the Args for the actual build
$BuildArgs = @{
    FilePath = $MsBuild
    ArgumentList = $SlnFilePath, "/t:rebuild", ("/p:Configuration=" + $Configuration), "/v:minimal"
    Wait = $true
}

# Start the build
Start-Process @BuildArgs

# Initiate a sleep to avoid zipping up a half-built project
Sleep 5

# Create zip file
$zipfilename = ($ProjectFile -replace ".csproj", "") + ".zip"
$source = $SourceCodePath + "bin\" + $Configuration
$destination = $SourceCodePath + $zipfilename
if(Test-Path $destination) {Remove-Item $destination}
Add-Type -assembly "system.io.compression.filesystem"
[io.compression.zipfile]::CreateFromDirectory($source, $destination)

# Create storage account if not exists
$storageAccount = Get-AzureRmStorageAccount -ErrorAction Stop | Where-Object {$_.StorageAccountName -eq $StorageAccountName}
if ( !$storageAccount ) {
    $StorageLocation = (Get-AzureRmResourceGroup -ResourceGroupName $ResourceGroupName).Location
    $StorageType = "Standard_LRS"
    New-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName -Location $StorageLocation -Type $StorageType
}

# Create container if not exists
$ContainerObject = Get-AzureStorageContainer -ErrorAction Stop | Where-Object {$_.Name -eq $ContainerName}
if (!$ContainerObject){
    $storagekey = Get-AzureRmStorageAccountKey -ResourceGroupName $ResourceGroupName -Name $StorageAccountName
    $context = New-AzureStorageContext -StorageAccountName $StorageAccountName -StorageAccountKey $storagekey.Key1 -Protocol Http
    New-AzureStorageContainer -Name $ContainerName -Permission Blob -Context $context
}

# Upload to blob
# Set default context
Set-AzureRmCurrentStorageAccount -StorageAccountName $StorageAccountName -ResourceGroupName $ResourceGroupName
Get-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName

# Upload file
Set-AzureStorageBlobContent -Container $ContainerName -File $destination

By removing the manual steps in building, zipping and deploying ADF Custom Activities, you remove the risk of something going wrong and you add the reassurance that you have a consistent method of deployment, which will hopefully speed up your overall development and deployments.

As always, if you have any questions or comments, do let me know.
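As a footnote to the script above, once the zip is sitting in blob storage, the pipeline's custom activity points at it through its type properties. Below is a rough sketch of the relevant activity fragment; the assembly, entry point and linked service names are placeholders and will differ in your own project.

{
    "type": "DotNetActivity",
    "typeProperties": {
        "assemblyName": "CustomActivities.dll",
        "entryPoint": "CustomActivitiesNS.MyCustomActivity",
        "packageLinkedService": "AzureStorageLinkedService",
        "packageFile": "blobcontainername/CustomActivities.zip"
    },
    "linkedServiceName": "AzureBatchLinkedService"
}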

How to prepare for 70-776 - Perform Big Data Engineering on Microsoft Cloud Services

There is a new exam currently in beta titled "Perform Big Data Engineering on Microsoft Cloud Services (beta)". As with all new exams, there is little content on how to revise for the exam beyond the exam summary. This exam, however, is what Adatis specialises in! Microsoft may call this "Big Data Engineering"; we call it "Modern Data Analytics" and we have a few blogs on the subject. You can sign up to the exam here: https://www.microsoft.com/en-us/learning/exam-70-776.aspx

Below you will find links to blog posts by Adatis consultants on topics related to all the key objectives of this exam. I will endeavour to keep this up to date with new content added by the team. Good luck with the exam.

Design and Implement Complex Event Processing By Using Azure Stream Analytics (15-20%)

Streaming data is vital to achieving real-time analytics. The following blog posts focus on this and offer an introduction and walkthrough for getting started with Stream Analytics. When talking about a wider Lambda approach to Big Data, streaming enables rapid processing via a "Speed" layer.

http://blogs.adatis.co.uk/simonwhiteley/post/Adatis-Hackathon-Jan-2015-Streaming-Analytics-First-Thoughts
http://blogs.adatis.co.uk/Jose%20Mendes/post/IoT-Hub-Device-Explorer-Stream-Analytics-Visual-Studio-2015-and-Power-BI
http://blogs.adatis.co.uk/sachatomey/post/2017/01/19/Power-BI-Streaming-Datasets-An-Alternative-PowerShell-Push-Script
http://blogs.adatis.co.uk/Jose%20Mendes/post/Data-Data-Revolution

Design and Implement Analytics by Using Azure Data Lake (25-30%)

Azure Data Lake Store and Analytics are a vital component of "Modern Data Analytics". Data which is too large for traditional single-server processing needs distributed parallel computation. Rather than pulling data to the processing, ADLA pushes the processing to the data. Understanding how to process large volumes of data is one part of the "Batch" layer in Lambda.

http://blogs.adatis.co.uk/ustoldfield/post/data-lakes
http://blogs.adatis.co.uk/ustoldfield/post/Data-Flow-Job-Execution-in-the-Azure-Data-Lake
http://blogs.adatis.co.uk/ustoldfield/post/Data-Flow-Pt-2-Vertexes-In-Azure-Data-Lake

Design and Implement Azure SQL Data Warehouse Solutions (15-20%)

Either as an alternative or in accompaniment to Data Lake is Azure SQL Data Warehouse. If Data Lake is batch across many files, Azure SQLDW is parallel batch over many databases. The key to both services is processing at the storage and not at the compute. The following is an ongoing blog series covering the basics all the way to a deep dive.

http://blogs.adatis.co.uk/simonwhiteley/post/A-Guide-to-Azure-SQL-DataWarehouse
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-What-is-it
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-How-Does-Scaling-Work
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-Distribution
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-Polybase
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-Polybase-Design-Patterns
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-Polybase-Limitations
http://blogs.adatis.co.uk/simonwhiteley/post/Azure-SQLDW-CTAS-Statements

Design and Implement Cloud-Based Integration by using Azure Data Factory (15-20%)

If you’re looking for a PaaS solution to move data in Azure, there is only really one option: Azure Data Factory. The following blogs will get you up to speed with ADF.
http://blogs.adatis.co.uk/terrymccann/post/Getting-started-with-Azure-Data-Factory
http://blogs.adatis.co.uk/terrymccann/post/Setting-up-your-first-Azure-Data-Factory
http://blogs.adatis.co.uk/terrymccann/post/Azure-Data-Factory-using-the-Copy-Data-task-to-migrate-data-from-on-premise-SQL-Server-to-Blob-storage

Manage and Maintain Azure SQL Data Warehouse, Azure Data Lake, Azure Data Factory, and Azure Stream Analytics (20-25%)

Knowing each of the parts is only half the battle; you need to know how, when and why to use each part. What are the best practices?

http://blogs.adatis.co.uk/ustoldfield/post/Deploying-a-Hybrid-Cloud
http://blogs.adatis.co.uk/terrymccann/post/Azure-Data-Factory-Suggested-naming-conventions-and-best-practices
http://blogs.adatis.co.uk/ustoldfield/post/Azure-Data-Lake-Store-Storage-and-Best-Practices
http://blogs.adatis.co.uk/ustoldfield/post/Shaping-The-Lake-Data-Lake-Framework

Instant Bot: deploying a Bot in minutes with Azure Bot Service

I had been playing around with the Bot Framework for a while but hadn’t really got anywhere, largely due to not having enough time to create something worthwhile, when I came across the Azure Bot Service whilst I was trawling through the documentation of the Bot Framework. The Azure Bot Service is currently in preview and allowed me to quickly author and deploy a basic bot for the purpose of this post and walkthrough.

Creating the Bot

Like most services in Azure, creating the bot is easy and requires the following input.

Configuring the Bot

Setting up the bot is a bit more involved. Once the bot has been created, you’ll be presented with the following screen. The App ID and Password are auto-generated by Microsoft, but you will need to make note of the password and store it securely as it is only displayed once in the app registration process. Next you want to select the language in which the bot is developed and deployed. You have the choice of C# or NodeJS; I opted for C# as it’s the language I am most familiar with. Choose your template, accept the T’s & C’s and your bot is ready to be deployed!

Deploying the Bot

The bot has been created and configured, displaying its source code, which can be further tweaked in the browser or Visual Studio. You can also embed your bot in a number of existing apps, websites and services.

Chatting with the Bot

The basic bot isn’t the most stimulating of conversational partners, but it is satisfying to see your creation talk back, even if it repeats what you have just told it. The Bot Framework opens up many possibilities to make the services you offer engaging in a conversational way. The Azure Bot Service makes the Bot Framework that much more accessible, letting you quickly deploy bots and have them out there, engaging with users.

Geographic Spatial Analysis with Azure Data Lake Analytics (ADLA)

Whilst working on an Azure Data Lake project, a requirement hit the backlog that could be easily solved with a Geographical Information System (GIS) or even SQL Server - spatial data type support was introduced into SQL Server 2008. However, Azure Data Lake Analytics (ADLA) does not natively support spatial data analytics, so we'll have to extract the data into another service, right? Wrong :)

Due to the extensibility of Azure Data Lake Analytics, we can enhance it to do practically anything. In fact, we can lean on existing components and enhance the service without having to develop the enhancement itself. This blog is a quick run-through demonstrating how to enhance ADLA such that it will support spatial analytics and meet our project requirement.

Problem

For simplicity I've trivialised the problem. Here's the requirement:

Indicate which Bus Stops are within 1.5 km of Southwark Tube Station.

To support this requirement, we have two datasets:

A list of all the Bus Stops in London, including their Geo location (circa 20k records)
The Geo location record of Southwark Tube Station (a single record!)

In fact, the location of the tube station is pretty accurate and is geo-located to the entrance pavement outside the tube station.

This would be an easy problem for a GIS to solve. You would specify the central point, i.e. our Southwark Tube Station marker, draw a circle, or buffer, with a radius of 1.5 km around it, and select all bus stops that fall within or intersect with that circle. This spatial analysis is easy for these systems as that's essentially what they are built to do. SQL Server 2008 introduced the Spatial data type, which allowed spatial-style analysis to be performed on geo data using T-SQL in conjunction with the supplied Geometry and Geography data types. More info on those can be found here.

So, how can we solve our problem in ADLA, without a GIS and without having to export the data to SQL Server?

Solution

You can register existing assemblies with ADLA. It so happens that the SQL Server Data Types and Spatial assemblies are nicely packaged up and can be used directly within ADLA itself - think about that, it's pretty awesome!

Caveat: At the time of writing we have no idea of the licence implications. It will be up to you to ensure you are not in breach :)

Those assemblies can be downloaded from here. You only need to download and install the following file:

ENU\x64\SQLSysClrTypes.msi

This installs two key assemblies, which you'll need to grab and upload to your Data Lake Store:

C:\Program Files (x86)\Microsoft SQL Server\130\SDK\Assemblies\Microsoft.SqlServer.Types.dll
C:\Windows\System32\SqlServerSpatial130.dll

Once they have been uploaded to your Data Lake Store, you need to register those assemblies with ADLA:

DECLARE @ASSEMBLY_PATH string = "/5.UTILITY/USQL-Extend/SQL-Server/";
DECLARE @TYPES_ASM string = @ASSEMBLY_PATH+"Microsoft.SqlServer.Types.dll";
DECLARE @SPATIAL_ASM string = @ASSEMBLY_PATH+"SqlServerSpatial130.dll";

CREATE DATABASE IF NOT EXISTS SQLServerExtensions;
USE DATABASE SQLServerExtensions;

DROP ASSEMBLY IF EXISTS SqlSpatial;
CREATE ASSEMBLY SqlSpatial
FROM @TYPES_ASM
WITH ADDITIONAL_FILES =
(
    @SPATIAL_ASM
);

Following registration of the assemblies, we can see the registration loaded in the ADLA catalog database we created. We are now ready to use this U-SQL enhancement in our U-SQL query - let's go right ahead and solve our problem in one U-SQL script.

// Reference the assemblies we require in our script.
// System.Xml we get for free as a System Assembly so we didn't need to register that and our SQLServerExtensions.SqlSpatial assembly REFERENCE SYSTEM ASSEMBLY [System.Xml]; REFERENCE ASSEMBLY SQLServerExtensions.SqlSpatial; // Once the appropriate assemblies are registered, we can alias them using the USING keyword. USING Geometry = Microsoft.SqlServer.Types.SqlGeometry; USING Geography = Microsoft.SqlServer.Types.SqlGeography; USING SqlChars = System.Data.SqlTypes.SqlChars; // First create the centralised point. // In this case it's the pavement outside the entrance of Southwark Tube Station, London. // Format is Longitude, Latitude and then SRID. // NB: It's Longitude then Latitude, that's the opposite way to what you might expect.. DECLARE @southwarkTube Geography = Geography.Point(-0.104777,51.503829,4326); // Next we extract our entire London bus stop data set from the file. // There's about 20k of them. @busStopInput = EXTRACT [StopCode] string, [StopName] string, [Latitude] double?, [Longitude] double? FROM @"/1.RAW/OpenData/Transport/bus-stops-narrow-full-london.csv" USING Extractors.Csv(skipFirstNRows:1,silent:true); // This is effectively the transform step and where the magic happens // Very similar syntax to what you would do in T-SQL. // We are returning all the bus stops that fall within 1500m of Southwark Tube // Essentially we return all stops that intersect with a 1500m buffer around the central tube point @closeBusStops= SELECT * FROM @busStopInput WHERE @southwarkTube.STBuffer(1500).STIntersects(Geography.Point((double)@busStopInput.Longitude,(double)@busStopInput.Latitude,4326)).ToString()=="True"; // The results are written out to a csv file. OUTPUT @closeBusStops TO "/4.LABORATORY/Desks/Sach/spatial-closebusstops.csv" USING Outputters.Csv(outputHeader: true); The query outputs a list of bus stops that are within the specified Spatial distance from Southwark Tube Station. If we have a look at all the bus stops (in red) and overlay all the 'close' bus stops (in green), we can see the results: Pretty neat. Azure Data Lake Analytics does not natively support spatial data analytics but by simply utilising the assemblies that ship with SQL Server, we can extend the capability of U-SQL to provide that functionality or practically any functionality we desire.
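Incidentally, if the same two datasets were sitting in SQL Server, the equivalent buffer-and-intersect test is only a few lines of T-SQL. Here is a minimal sketch, assuming a hypothetical dbo.BusStops table with Latitude and Longitude columns (note that the T-SQL geography::Point static method is documented as taking latitude first, then longitude, then the SRID):

DECLARE @southwarkTube geography = geography::Point(51.503829, -0.104777, 4326);

SELECT b.StopCode,
       b.StopName
FROM dbo.BusStops AS b
WHERE @southwarkTube.STBuffer(1500).STIntersects(geography::Point(b.Latitude, b.Longitude, 4326)) = 1;

As with the U-SQL version, STBuffer(1500) is expressed in metres, since that is the unit of measure associated with SRID 4326 for geography instances.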

Data Flow Pt 2: Vertexes In Azure Data Lake

Following on from my previous post on Job Execution in the Azure Data Lake, this post will explore the concept of extents and how they are utilised by vertexes during job processing in Azure.

The U-SQL script that has been authored, compiled and deployed is the logical plan of how the author intends to transform input data into output data. This creates a total amount of work – essentially the amount of data it has to process – which is decomposed into a set of vertexes. Each vertex processes a subset of the data, or extents (see Azure Data Lake Storage for more information), and represents a fraction of the total work. Vertexes are displayed, or grouped, in a super vertex, also known as a stage. The vertexes in each stage perform the same operation on a different part of the same data, so the number of vertexes in a stage indicates the maximum theoretical parallelisation of that stage. The containers requested for the job are allocated to complete each vertex. Say there is a 10GB file: it will be split into 40 extents and allocated to at least 10 vertexes. If you want to process all of the file in parallel, requesting 10 containers will allow for full concurrent parallelism. All of this is visualised as the job graph.

If you have multiple transformations going on in your U-SQL script, this will create multiple stages, and the output of one vertex becomes the input of another vertex in a dependent stage. As a result, dependent stages can begin processing even if preceding stages haven't completed processing. Where dependencies exist, they create a critical path – the dependency chain of vertexes that keeps the job running to the very end, because the vertex at the bottom depends on the output of the vertex at the top. This can be seen in the Vertex Execution View in the Visual Studio Azure Data Lake Tools. It's useful for optimising a job: by checking which vertex takes the longest, you can decide which elements of your script to re-position or re-write.

It may not be possible to use all the reserved parallelism during a stage if there are fewer vertexes than Azure Data Lake Analytics Units (ADLAUs) available. For example, with 10 ADLAUs reserved, an early, wide stage can allocate a vertex to every ADLAU, but in later, narrower stages more and more ADLAUs will sit idle – by the last stage only a couple may be utilised. Unfortunately it is not currently possible to dynamically de-allocate ADLAUs during a job.

In conclusion, by understanding how vertexes interact with extents and how that influences parallelism, you should be able to provision adequate resources for your jobs, as well as know where to start investigating any long-running queries using the Vertex Execution View. In the next post of the series I will be looking into resource costs and optimising the spend and time taken to process data.

Azure SQLDW-CTAS Statements

So here we are - you've designed your target distribution, you're scaling your warehouse appropriately and you've set up Polybase to access raw data outside of the system. You are now ready to bring data into the warehouse itself so we can start working on it.

Firstly, you should be aware that we're building using ELT, not ETL, i.e. we design to let the warehouse itself do the work, as it is the most powerful processing option available to us. When first introducing Polybase, I described how it is the only method of loading data into SQLDW that does not get bottlenecked through the control node. The same limitations will be present if we try to transform our data using traditional ETL tools. If we used an SSIS data flow, for example, it would extract the data, pull it across to the SSIS server, perform the transformation and then attempt to reinsert the data. Not only are we introducing network latency, but we are also ignoring the power of the MPP engine AND encountering the throughput limits of the control node. In order to utilise SQLDW effectively, we write SQL for our transformations rather than relying on external tools such as SSIS. This ensures the work is being done by our compute nodes and, therefore, can be scaled up to increase performance.

General best practice, therefore, would be to write stored procedures for each of the data movements we want to occur. This allows us to add in auditing, logging and so on, but what we're interested in here is the core data movement itself. Writing a traditional INSERT statement isn't the fastest way to get data into a table. There is a special syntax which creates a new table and inserts into it, and which is automatically configured for optimal bulk loading: the CTAS, or "Create Table As Select", statement.

CREATE TABLE [dbo].[MyTable]
WITH ( HEAP, DISTRIBUTION = ROUND_ROBIN )
AS
SELECT * FROM [dbo].[MySourceTable];

As it is the only minimally-logged, bulk insert method within SQLDW, it's the fastest way to get data into a table. If the table already exists and you want to wipe it and fill it with data, the fastest way to do so is to drop the table and recreate it using a CTAS. Performing a truncate/insert combination would be slower, as each row insert causes a write to the transaction log. SQLDW performs all standard actions as a transaction by default, so if you have a huge insert statement and something goes wrong, be prepared for a lengthy rollback process.

The table created by a CTAS statement infers its schema from the query, so you need to be explicit with ISNULLs, CASTs and anything else needed to point the query in the right direction. This is where the bulk of the data processing happens - the SELECT part of the CTAS both defines the destination table and carries out the data transformations, all as a single step. If you've ever used "SELECT field, field2 INTO [dbo].[NewTable] FROM [dbo].[SourceTable]" then you'll have a pretty good idea of how it works, but if you ran the two side by side, CTAS would perform faster as, again, it is treated differently by the SQLDW engine.

What does this mean for our traditional ETL pattern? If you're following the standard approach of having intermediary tables between data processing steps (i.e. staged data, cleaned data, transformed data and so on), these transient tables can actually be deleted between loads. They won't exist in your database project, and they won't exist until your load process has been run.
A CTAS to 'Clean' our fact table would be:

BEGIN TRY
    --Primary CTAS to insert data from external table into strongly typed Clean table
    CREATE TABLE Clean.MyTable
    WITH ( HEAP, DISTRIBUTION = ROUND_ROBIN )
    AS
    SELECT ISNULL(CAST(Column_0 AS INT),-1)  TableId,
           CAST(Column_1 AS VARCHAR(50))     SomeTextField,
           CAST(Column_2 AS DECIMAL(12,5))   SomeNumber
    FROM SRC.MySourceTable
    OPTION (LABEL = 'Clean.LoadMyTable.CTAS');
END TRY
BEGIN CATCH
    --CTAS failed, mark process as failed and throw error
    SET @ErrorMsg = 'Error loading table "MyTable" during Cleaning CTAS: ' + ERROR_MESSAGE()
    RAISERROR (@ErrorMsg, 16, 1)
END CATCH

By including the CAST and ISNULL statements, I'm controlling the schema of the created table. The resulting table would have the following definition:

CREATE TABLE [Clean].[MyTable]
(
    [TableId]       INT NOT NULL,
    [SomeTextField] VARCHAR(50) NULL,
    [SomeNumber]    NUMERIC(12,5) NULL
)
WITH ( HEAP, DISTRIBUTION = ROUND_ROBIN );

By using CTAS wherever possible, you will maintain the speed of your data loads and avoid any issues around long-running transactions. An efficient SQLDW loading process contains ONLY CTAS statements. For example, there is no such thing as a MERGE statement within SQLDW. In order to merge the results of one table into another, you can write a CTAS that selects a union of the two tables and inserts the result into a new table; you then drop the old table and rename your newly created table in its stead – see the sketch below. There are many similar patterns that will help you around any data movement problem – we'll discuss them in later posts in this series. For more information about the CTAS statement, see the Microsoft Docs.
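To make that drop-and-rename pattern a little more concrete, here is a sketch. The table and column names are invented for illustration, and the rule for deciding which rows "win" (staged rows replace matching existing rows) is just an example – adapt it to your own load logic:

-- Build the merged result as a brand new table using CTAS.
CREATE TABLE [dbo].[FactSales_New]
WITH ( HEAP, DISTRIBUTION = ROUND_ROBIN )
AS
SELECT s.[SaleId], s.[SaleAmount]
FROM [dbo].[StagedSales] AS s
UNION ALL
SELECT f.[SaleId], f.[SaleAmount]
FROM [dbo].[FactSales] AS f
WHERE NOT EXISTS (SELECT 1 FROM [dbo].[StagedSales] AS s WHERE s.[SaleId] = f.[SaleId]);

-- Swap the new table into place and drop the old one.
RENAME OBJECT [dbo].[FactSales] TO [FactSales_Old];
RENAME OBJECT [dbo].[FactSales_New] TO [FactSales];
DROP TABLE [dbo].[FactSales_Old];

The heavy lifting all happens inside the CTAS; the RENAME OBJECT statements at the end simply swap the tables over once the new data is in place.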

Connecting Power BI to Hive

On a recent project I was tasked with importing data into Power BI from a Hive table. For those of you who are new to Azure or Big Data, Hive is a data warehousing infrastructure for Hadoop which sits in the HDInsight stack on Azure. The primary purpose of Hive is to provide data summarisation, query and analysis for big data sets. In this blog I'm going to take you through the steps, and note any gotchas, so that you can connect to Hive using Power BI.

Connecting to Hive

As Hive is part of the Azure HDInsight stack, it would be tempting to select the HDInsight or Hadoop connector when you're getting data. However, note the HDFS in brackets beside the Azure HDInsight and Hadoop File options: this means that you'll be connecting to the underlying data store, which can be Azure Data Lake Store or Azure Blob Storage – both of which use HDFS architectures. That doesn't help when you want to access a Hive table. In order to access a Hive table you will first of all need to install the Hive ODBC driver from Microsoft. Once you've downloaded and installed the driver, you'll be able to make your connection to Hive using the ODBC connector in Power BI.

You will need to input a connection string to connect, even though it says optional. The format of the connection string is as follows:

Driver={Microsoft Hive ODBC Driver};Host=hdinsightclustername.azurehdinsight.net;Port=443;Schema=default; RowsFetchedPerBlock=10000; HiveServerType=2; AuthMech=6; DefaultStringColumnLength=200;

On the next screen you'll be asked to enter a username and password. The credentials used here are not what you use to access Azure, but the credentials you created when you set up the HDInsight cluster and use to log in to the cluster. Click connect and you'll be able to pull through the tables you need into Power BI. Or, if you want to be selective in what is returned, you can write a HiveQL query in the ODBC dialog, as in the example below. It's also worth noting that at the moment it's only possible to import Hive data into Power BI and not use DirectQuery, so if your data set is huge you'll want to summarise the data or be really selective in what is returned first.
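For instance, a simple HiveQL statement typed into that optional SQL box might look like the following; the database, table and column names here are made up purely for illustration, so substitute your own:

SELECT tripid,
       pickup_datetime,
       fare_amount
FROM default.taxi_trips
WHERE trip_year = 2017
LIMIT 100000;

Restricting the columns and adding a LIMIT like this keeps the import to a manageable size, which matters given that the connector currently only supports import mode.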

Azure Data Lake Store : Authorization Failed - The user does not have the permission to perform this operation

Just a quick one, as it took slightly longer to solve than it possibly should have done, so this might help others with the same problem. I had been working fine with an Azure Data Lake Store for a few weeks until yesterday, when I couldn't access the files anymore. The Azure Portal reported the following:

Authorization Failed - The user does not have the permission to perform this operation.

It turns out I was accessing it from a new location and the ADLS firewall needed updating to whitelist my IP address. After adding my IP, the Data Lake Store is now accessible again.

Data Flow: Job Execution in the Azure Data Lake

In this blog we will deep dive into job execution in Azure Data Lake Analytics (ADLA). If you've used ADLA you will have come across the following image: this is the graphical representation of the job execution. But what is it doing at each stage?

Preparing

The job execution begins with the authoring of the U-SQL script. The U-SQL script itself is the logical plan of how you intend to transform input data into output data. The script gets compiled, which translates your U-SQL script primarily into C#, as well as a few other items such as XML, which contains metadata information and the job graph. Once compiled, an initial plan is created, which is then optimised to produce an optimised plan. If the job fails at this stage it will mostly be down to your U-SQL script failing to compile successfully. Fortunately, the error codes at this stage are fairly helpful and will indicate the line in question.

Queued

Once compiled and optimised, the job progresses to the queued stage. There are a few things which can cause a job to queue:

Running jobs – the default setting per subscription is 3, so if you or someone else in your subscription is already running a series of jobs, your job will be queued until at least one of them completes.
Higher priority jobs – as the queue is ordered by job priority, with lower numbers having a higher priority, other jobs can jump to the top of the queue, forcing your job to queue for longer.
Lack of resources – even if there are no running jobs and there are no jobs ahead of yours in the queue, your job will continue to queue if there are not enough Azure Data Lake Analytics Units (ADLAUs) to start the job.

Running

Once the job has finished queuing it will run. The Job Manager allocates vertexes – the units of work, each processing a subset of the data – to the ADLAUs and uses YARN to orchestrate them. The vertexes will then execute. At this stage, if there is a vertex failure you can download the vertex locally and debug it in Visual Studio. When the vertexes have successfully completed, you will be able to consume the end product – either locally, if you have run the job locally, or in the Azure Data Lake Store.

Conclusion

The above process can be visualised in this graph. Knowing how ADLA executes your job is the first of many steps to being able to write performant U-SQL and debug effectively. Be on the lookout for more blogs on the technical deep dive into ADLA!

Object detection with Data Lake Analytics

In this blog I'm going to show one of the advantages of linking Data Lake Analytics with machine learning. We'll be uploading a series of images to the Data Lake, and we will then run a U-SQL script that detects objects in the images and writes the corresponding tags to a text file.

First of all you need an instance of Data Lake Store and one of Data Lake Analytics. Once these are up and running, you need to enable Python/R/Cognitive in your Data Lake Analytics instance (here is a blog to help you out on this). First things first, we need to put some images in our Data Lake Store; following Azure Data Lake best practices, I put the images in my Laboratory subfolder.

Once our images are in place we need to create a script. In your Data Lake Analytics instance click on New Job; this will open a new blade with an empty script. Let's give our new job the name "ImageTagging". In order to use image tagging we need to reference the relevant assemblies:

REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageTagging;

Next we need to extract information (location, filename etc.) on the image file(s) we want to analyse; in this case we'll process all images in the specified folder.

@images =
    EXTRACT FileName string,
            ImgData byte[]
    FROM @"/Laboratory/Desks/CSbrescia/ImageTagging/{FileName:*}.jpg"
    USING new Cognition.Vision.ImageExtractor();

The following step is where the magic happens: the script analyses all the images located in the folder indicated above, detects the objects present in each image and creates tags. The resulting rowset has three columns: the image name, the number of tagged objects detected, and a string containing all the tags.

@TaggedObjects =
    PROCESS @images
    PRODUCE FileName,
            NumObjects int,
            Tags string
    READONLY FileName
    USING new Cognition.Vision.ImageTagger();

Now we can write the rowset with all the tags to an output file:

OUTPUT @TaggedObjects
TO "/Laboratory/Desks/CSbrescia/ImageTagging/ImageTags.tsv"
USING Outputters.Tsv();

Here are the images I used in this example, and here is the list of objects detected. In conclusion, we have created a pretty handy tool for automatic image tagging using Data Lake, with very little knowledge required of the background processes involved. Note that there seems to be an image size limit: I had to resize all images to about 500 KB.
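As a small, hypothetical extension (the "dog" tag and the output path below are made up for illustration), the tagged rowset can be filtered like any other U-SQL rowset before it is written out:

// Keep only the images whose tag string mentions a dog.
@dogImages =
    SELECT FileName,
           NumObjects,
           Tags
    FROM @TaggedObjects
    WHERE Tags.Contains("dog");

OUTPUT @dogImages
TO "/Laboratory/Desks/CSbrescia/ImageTagging/DogImages.tsv"
USING Outputters.Tsv(outputHeader: true);

Because the Tags column is just a string, ordinary C# string methods can be used in the predicate, which is often enough for simple downstream filtering of the tagging results.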