Adatis BI Blogs

How to support your organisation with Azure Cosmos DB Graph (Gremlin)?

Let me start this blog with two questions. 1) How can Joanna, from the HR department, query the mentor/mentee structure of the organisation? 2) How can Robin, from the resourcing department, find the most suitable and available consultant to start a new project next week? I’m sure at this point you are thinking that to solve both problems they could simply query the HR and the Resource Planning systems, and you are right, but what if they could get the answer to both questions from a single system? In this blog I’ll demonstrate how to achieve such a requirement using Azure Cosmos DB Graph (Gremlin).

Graph Theory

In graph theory, a graph is an ordered pair comprising a set of vertices and edges. A vertex is the fundamental unit of which graphs are formed and is usually represented by a circle with a label. An edge is represented by a line or arrow extending from one vertex to another. Graphs can be used to model many types of relations and processes in many different areas. For example, we can represent the link structure of a website as a graph, with the web pages as vertices and the links from one page to another as edges.

Azure Cosmos DB

Azure Cosmos DB is a globally distributed multi-model database with the capacity to store various types of data, such as document, graph and key-value. It provides all the necessary tools to elastically scale throughput and storage across any number of Azure's geographic regions (further details here). Azure Cosmos DB supports a variety of popular APIs to access and query the data, such as SQL, MongoDB, Cassandra, Graph (Gremlin) and Table API. In this instance, I will focus on the Graph (Gremlin) API.

Gremlin

Gremlin is the graph traversal language of Apache TinkerPop, an open source graph computing framework. Gremlin allows users to write complex queries to traverse their graphs by using a composed sequence of steps, with each step performing an operation on the data stream (further details here).
There are four fundamental types of step:

- transform: transform the objects in the stream
- filter: remove objects from the stream
- sideEffect: pass the object, but yield some side effect
- branch: decide which step to take

Scenario

The image on the left is an example of the mentor/mentee structure. If we convert it to a graph (image on the right), we have the people represented as vertices and the mentor relationships as edges.

Now, let’s create the database and the graph. When creating the Azure Cosmos DB account, we need to ensure we select the Gremlin (graph) API. To populate and query our graph, we have three options: the Azure Portal, the Gremlin Console or the Guided Gremlin Tour. The last two tools can be downloaded from the Quick Start section once we create the sample container, or directly downloaded here and here. In the Azure Portal, CRUD operations can be performed via the Data Explorer UI and Gremlin queries. The results can be visualised in a graph or in GraphSON format, the Gremlin standard format to represent vertices, edges and properties using JSON.

In the proposed scenario, we will create 7 vertices and 6 edges. The vertices will be labelled as person and will have 3 properties: firstName, position and inBench (which indicates whether a consultant is allocated to a project or not). The edges, labelled as mentor, will define the relation between consultants.

Create a vertex:

g.addV('person').property('firstName', 'Tim').property('position', 'Director').property('inBench', '1')

Create an edge:

g.V().hasLabel('person').has('firstName', 'Tim').addE('mentor').to(g.V().hasLabel('person').has('firstName', 'Neil'))

Below is how the graph looks when all vertices and edges are created. Joanna now has all that is needed to obtain valuable information.
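As a side note, when loading many people at once, the creation queries above can be generated programmatically rather than typed by hand. A minimal Python sketch (the helper names and sample data are mine, not part of the service):

```python
def add_person_query(first_name, position, in_bench):
    """Build a Gremlin addV query string for a person vertex."""
    return (
        f"g.addV('person')"
        f".property('firstName', '{first_name}')"
        f".property('position', '{position}')"
        f".property('inBench', '{in_bench}')"
    )

def add_mentor_query(mentor, mentee):
    """Build a Gremlin addE query linking a mentor to a mentee."""
    return (
        f"g.V().hasLabel('person').has('firstName', '{mentor}')"
        f".addE('mentor')"
        f".to(g.V().hasLabel('person').has('firstName', '{mentee}'))"
    )

# Illustrative sample data only
people = [("Tim", "Director", "1"), ("Neil", "Principal Consultant", "0")]
queries = [add_person_query(*p) for p in people] + [add_mentor_query("Tim", "Neil")]
for q in queries:
    print(q)
```

Each generated string can then be submitted through any of the three tools mentioned above.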
For example, she can traverse the graph and obtain the list of Principal and Senior Consultants that are mentored by Tim.

g.V().hasLabel('person').has('firstName', 'Tim').outE('mentor').inV().hasLabel('person').has('position', within('Principal Consultant','Senior Consultant')).group().by('firstName').by(values('position'))

[{"Neil": "Principal Consultant", "Scott": "Senior Consultant"}]

To support Robin, we will add two new types of vertices, Blog and Project. The Blog vertex will be related to the Person vertex and will indicate who wrote blogs and which technology was covered. The Project vertex will be related to the Person vertex and will indicate who worked on a certain project. The Project vertex will have the client, the name of the project, the duration and the technology used as properties.

If Robin needs to find a consultant with experience in Azure Cosmos DB, he can query the graph and verify who either wrote a blog or worked on a previous project with that technology. On top of that, he can filter the results by indicating he is only interested in consultants that are currently on the bench.

g.V().hasLabel('person').where(outE('worked').or().outE('wrote').has('tech', 'Azure CosmosDB')).has('inBench', '1')

Conclusions

Hopefully, using a very simple scenario, I managed to demonstrate the potential of Azure Cosmos DB Graph to build complex queries, implement powerful graph traversal logic and help the business quickly obtain insights from very complex models. As always, if you have any questions or comments do let me know.
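To make the traversal logic behind Robin's query more concrete, here is a hedged Python sketch that mimics it over a tiny in-memory graph. The data is made up purely for illustration; the real query runs inside Cosmos DB:

```python
# Vertices: id -> (label, properties); edges: (out_id, edge_label, in_id, properties)
vertices = {
    "tim":   ("person",  {"firstName": "Tim",   "inBench": "0"}),
    "neil":  ("person",  {"firstName": "Neil",  "inBench": "1"}),
    "scott": ("person",  {"firstName": "Scott", "inBench": "1"}),
    "b1":    ("blog",    {"title": "Graph intro"}),
    "p1":    ("project", {"name": "Warehouse build"}),
}
edges = [
    ("neil",  "wrote",  "b1", {"tech": "Azure CosmosDB"}),
    ("scott", "worked", "p1", {"tech": "SQL Server"}),
]

def in_bench_with_tech(tech):
    """Mimic the Gremlin query: person vertices with an outgoing 'wrote' or
    'worked' edge tagged with the given tech, filtered to those on the bench."""
    hits = {out for (out, lbl, _in, props) in edges
            if lbl in ("wrote", "worked") and props.get("tech") == tech}
    return sorted(vertices[v][1]["firstName"] for v in hits
                  if vertices[v][1].get("inBench") == "1")

print(in_bench_with_tech("Azure CosmosDB"))  # ['Neil']
```

The `where(outE(...).or().outE(...))` step corresponds to the edge-label filter, and the trailing `has('inBench', '1')` to the final bench filter.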

Azure Active Directory Authentication and Azure Data Catalog

In a previous post I introduced Azure Data Catalog. Because it’s great for data discovery and for data asset management, it makes sense to automate, as much as possible, the process of registering new data assets, and to allow users to discover data in a more natural, perhaps conversational, way. In order to automate the registration of data assets or to allow discovery through other tools, it’s necessary to look at how Azure Data Catalog authenticates users using Azure Active Directory (AAD). This post is going to explore some of the options Azure Data Catalog uses for authentication, with a walkthrough of a code example to make authentication work without user input.

Azure Active Directory Authentication

If you have interacted with Azure Data Catalog before, you will find that there are two ways of doing so. First, there’s the web application that allows you to conduct data discovery and data asset management. Then there’s the native application that sits on your local machine that can be used for registering data assets. These use different methods of authenticating with Azure Active Directory. The first uses Web Browser to Web Application authentication. The second uses Native Application to Web API authentication.

Web Browser to Web Application

What is involved in Web Browser to Web Application authentication? Simply put, the web application directs the user’s browser to sign in to AAD. AAD then returns a token which authenticates the user to use the web application. In practice, it’s a bit more complex, so here’s a diagram to help explain it. In a bit more detail, the process it follows is:

1) A user visits the application and needs to sign in; they are redirected via a sign-in request to the authentication endpoint in AAD.
2) The user signs in on the sign-in page.
3) If authentication is successful, AAD creates an authentication token and returns a sign-in response to the application’s Reply URL that was configured in the Azure Portal.
The returned token includes claims about the user and AAD that are required by the application to validate the token.
4) The application validates the token by using a public signing key and issuer information available at the federation metadata document for Azure AD. After the application validates the token, Azure AD starts a new session with the user. This session allows the user to access the application until it expires.

This method of authentication is used by Azure Data Catalog when discovering data through the browser.

Native Application to Web API

What’s the process of Native Application to Web API authentication? Simply put, the application will ask you to sign in to AAD so that it can acquire a token to access resources from the Web API. In practice, it’s a bit more complex, so here’s a diagram to help explain it. In a bit more detail, the process it follows is:

1) The native application makes a request to the authorisation endpoint in AAD, using a browser pop-up. This request includes the Application ID and redirect URI of the native application (see the following article for native applications and registering them in Azure) and the Application ID URI of the Web API. The user is then requested to sign in.
2) AAD authenticates the user. AAD then issues an authorisation code response back to the application’s redirect URI.
3) The application then stops the browser activity and extracts the authorisation code from the response. Using the authorisation code, the application then requests an access token from AAD, providing details about the native application and the desired resource (Web API).
4) The authorisation code and details are checked by AAD, which then returns an access token and a refresh token.
5) The application then adds the access token to the authorisation header in its request to the Web API, which returns the requested resource, based on successful authentication.
6) When the access token expires, the refresh token is used to acquire a new access token without requiring the user to sign in again.

This method of authentication is used by Azure Data Catalog when registering data assets via the desktop application.

Automated Interaction with Azure Data Catalog

Both of the examples above require the user to interact in order to provide sign-in credentials. This is not ideal if we want to automate the registration of data assets or conduct data discovery outside of the browser. Therefore we’ll need to use a different method of authentication: the Server Application to Web API method. Simply, it assumes that the server has already required a user to log in and therefore has the user’s credentials. It then uses those credentials to request the access and refresh tokens from AAD. In a bit more detail, the process it follows is:

1) The server application makes a request to AAD’s token endpoint, bypassing the authentication endpoint, providing the credential, Application ID and Application URI.
2) AAD authenticates the application and returns an access token that can be used to call the Web API.
3) The application adds the access token to the authorisation header in its request to the Web API, which returns the requested resource, based on successful authentication.

This method is what we’re going to use to automate our interaction with Azure Data Catalog. From an authentication aspect, the code for Server Application to Web API is simple, and this example will take us to the point of returning a token, which we can then use to request resources from the Azure Data Catalog API.
The full code can be found in my GitHub repo. We are going to use the Client Id and Secret from an application we’ve registered in AAD (the full process can be found in this Microsoft article on Integrating Applications with AAD).

private static string clientId = "ApplicationId";
private static string secret = "ApplicationKey";

Then, we’re going to make sure we’re connecting to the correct AAD instance:

private static string authorityUri = string.Format("{0}", tenantId);

So we can create an authentication context:

AuthenticationContext authContext = new AuthenticationContext(authorityUri);

In order to acquire a token:

authResult = await authContext.AcquireTokenAsync(resourceUri, new ClientCredential(clientId, secret));

The token can then be used in an authorisation header in requests to the Azure Data Catalog API. In the next related post, we’ll explore how to make a call to the API using this authentication method.
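For comparison, the same client-credentials exchange is just a form POST to the tenant's token endpoint; the ADAL library above wraps it for you. A minimal Python sketch of building that request (the tenant, IDs and resource URI shown are placeholders, not real values):

```python
from urllib.parse import urlencode

def build_token_request(tenant_id, client_id, secret, resource):
    """Build the URL and form body for an AAD client-credentials token request."""
    url = f"https://login.windows.net/{tenant_id}/oauth2/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": secret,
        "resource": resource,
    })
    return url, body

def auth_header(access_token):
    """Build the authorisation header used when calling the Data Catalog API."""
    return {"Authorization": f"Bearer {access_token}"}

url, body = build_token_request("mytenant", "ApplicationId", "ApplicationKey",
                                "https://api.azuredatacatalog.com")
print(url)
print(auth_header("eyJ0eXAi..."))
```

POSTing that body to the URL returns a JSON payload whose access_token field goes into the header shown above.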

Can the Custom Vision Service support the calculation of KPIs?

Let’s assume we have a company that distributes alcoholic drinks across the country. To determine their performance, they define a set of KPIs that will evaluate, among others, how many establishments (e.g. pubs, bars, …) have their products exposed on the shelf. To achieve this goal, they have a set of sales reps that visit each establishment and take note of which products are exposed. One possible way to track the data is by accessing a mobile application, manually filling in the form and uploading the data, but what if we could automate the process of identifying the products on a shelf by simply uploading a picture? To do that, we would need to apply a machine learning algorithm to classify the image and identify each product. To prove the above scenario is achievable, I’ll demonstrate how to create a project using a tool called Custom Vision, a service that allows us to easily build a predictive model with just a few clicks and without the need for deep machine learning knowledge.

What is the Custom Vision Service?

Azure Custom Vision Service is a Microsoft Cognitive Services tool for tagging images using a custom computer vision model. Although very similar to Microsoft’s Computer Vision API, it has the advantage of fine-tuning a predictive model to a specific dataset; however, there are still a couple of minor disadvantages. For example, the service can only identify if an object is in an image, not where it stands within the image.

Build the model

To build a predictive model with this service we can either use the web interface or the REST API, with support for C# and Python. The first step was to create a project by providing a name, a domain and a resource group. When selecting a domain, we can either choose a general domain optimized for a range of images or select a specific domain optimized for a certain scenario.
In my case, I selected the Food domain, given I wanted to identify different kinds of bottles. Below is a description of each domain as detailed by Microsoft.

The next step was to upload and tag images. Here are a couple of considerations:

- To start the prototype, we need at least two different tags and a couple of images, usually a minimum of 30 per class
- It is best practice to use a variety of good quality images (different angles, lights, backgrounds, sizes, …) to ensure better differentiation and accurate results from the classifier. In my case, Google was the source of the images, which was a bit limited in some cases (surprisingly!!!)
- It is best practice to include images that represent what the classifier will find in the real world, rather than images with neutral backgrounds
- It is advised to avoid images with multiple entities. If we upload images with a bottle and a glass, because the classifier will learn the characteristics that the photos have in common, when comparing two images the classifier might be comparing a bottle + glass with a single bottle

I started by uploading 60 images for 2 classes, Baileys and Others. Once the images were uploaded, I trained my model and obtained the following results.

The Precision and Recall indicators demonstrate how well the classifier is performing. Above, we can see that the analysis is done for the entire model and for each tag. It is important to note that 100% precision is usually not achievable; however, having a model with 75%+ in Precision and Recall is an indication of an effective model.

Precision – Indicates how likely the classifier is to correctly classify an image. Knowing that we had 60 images, a precision of 84.3% means that roughly 51 of the images were correctly tagged.

Recall – Out of all the images that should have been classified correctly, how many did the classifier identify accurately?
A recall of 34.5% means that only 20 images were correctly classified.

Probability Threshold – The slider, set by default at 90%, indicates the value used to calculate Precision and Recall. Let’s consider the following example. The probability that image A has a Baileys bottle is 94%. If the probability threshold is 90%, then image A will be counted as a “correct prediction”. In the example below, we can see that I obtained better results when I changed the probability threshold to 75%.

Based on this information I had two options: either correctly tag the wrongly tagged images or replace them with better ones. Fortunately, Custom Vision Service has a very handy functionality that, for each iteration, highlights which images confused the model according to the probability threshold value. In the image below, there are two images with a red square. If we hover the mouse over an image we can see the prediction for each tag. In this case, the prediction is below 90% for both tags, meaning the image was not considered in the Precision and Recall calculation.

After a couple of iterations we finally obtained a model we could use to run a quick test. When having multiple iterations, we select the one with the best results by selecting the option Make Default. To test the model, I selected two images from the internet showing individual brands. As we can see, the model correctly classified each image.

Since I was happy with the results, I decided to increase the complexity of the model by creating a new tag and uploading a new set of images. After training the model, I noticed the results were not as good as before, since the new images were creating some confusion for the model. It took a couple of iterations until I got an acceptable model. Following is the result of a quick test.

I now had a model that could correctly classify 3 different entities, so I decided to increase the challenge and added an image with multiple entities.
The result I obtained helped me understand the flaw in my model. The model identified that the image should be classified as Others; however, although we have a Baileys and a Hendrick’s bottle in the image, the probability for those two classes was too low. Here is why:

- When uploading images to the project I only used 1 tag per image. Based on that, the model will always try to classify an image with a single tag. As soon as I added more tags per image, my predictions improved
- All the examples used showed a single entity, i.e. only one type of bottle per image, except for the Others category. Example: I uploaded 30 images of Baileys bottles, 30 images of Hendrick’s bottles, and, for the category Others, 3 different types of bottles
- My model was trained to identify bottles in images that didn’t have multiple entities. As referred to above, the model should always be trained with images that represent what the classifier will predict
- The number of images per class didn’t have enough variety. As with any machine learning model, if we increase the number and variety of examples, the model will perform better

Final Considerations

With just a couple of clicks and no deep machine learning knowledge I was able to create a predictive model that could accurately classify a set of images. With further iterations I could potentially have a model that could achieve the requirements of the scenario proposed at the beginning of this blog; however, would Custom Vision Service be the right tool? In my opinion the use of the Computer Vision API would be more adequate, but this is definitely a service with massive potential.

As always, if you have any queries or considerations do let me know.
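As an aside, the threshold-dependent precision and recall discussed above can be reproduced with a few lines of code, which helps to see why moving the slider changes both numbers. A hedged Python sketch over made-up predictions (the figures here are illustrative, not the post's actual results):

```python
def precision_recall(predictions, threshold):
    """predictions: list of (predicted_tag, probability, actual_tag) triples.
    Only predictions at or above the threshold count towards precision;
    recall is measured against all images."""
    scored = [(pred, actual) for pred, prob, actual in predictions
              if prob >= threshold]
    tp = sum(1 for pred, actual in scored if pred == actual)
    precision = tp / len(scored) if scored else 0.0
    recall = tp / len(predictions) if predictions else 0.0
    return precision, recall

# Four illustrative predictions: (predicted tag, probability, actual tag)
preds = [
    ("Baileys", 0.94, "Baileys"),
    ("Baileys", 0.80, "Baileys"),
    ("Others",  0.95, "Baileys"),
    ("Others",  0.97, "Others"),
]
print(precision_recall(preds, 0.90))  # precision 2/3, recall 1/2
print(precision_recall(preds, 0.75))  # precision 3/4, recall 3/4
```

Lowering the threshold admits the 80%-confidence correct prediction, which in this toy example improves both metrics, mirroring the behaviour observed in the post.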

Azure Event Grid in a Modern Data Warehouse Architecture

In this blog I’ll give a light introduction to Azure Event Grid and demonstrate how it is possible to integrate the service into a modern data warehouse architecture.

Azure Event Grid is a fully managed event routing service that went into general availability on 30th January 2018. With this service, we can subscribe to any event happening across our Azure resources and take advantage of serverless platforms like Azure Functions and Logic Apps to easily create serverless workflows. It has built-in publish support for events from services like Blob Storage and Resource Groups, and supports custom webhooks that can publish events to Azure and third-party services.

Following is a list of key terms from this service.

Events – In this context, an event is a message that contains data describing what happened in the service, e.g. a new file was uploaded to a container in a Blob Storage. The event will contain information about the file, such as its name.

Event Publishers – The source of the events published to the Event Grid. Following is a list of current and planned event publishers.

Available:
- Azure Subscriptions (management operations)
- Custom Topics
- Event Hubs
- IoT Hub
- Resource Groups (management operations)
- Storage Blob
- Storage General-purpose v2 (GPv2)

Planned:
- Azure Automation
- Azure Active Directory
- API Management
- Logic Apps
- IoT Hub
- Service Bus
- Azure Data Lake Store
- Cosmos DB

Topics – The endpoint where publishers send events.

Event Subscriptions – Receive specific events from the Topic and send them to the Event Handlers.

Event Handlers – The receiver of the events subscribed to by the event subscriptions.
Following is a list of current and planned event handlers.

Available:
- Azure Automation
- Azure Functions
- Event Hubs
- Logic Apps
- Microsoft Flow
- WebHooks

Planned:
- Fabric Controller
- Service Bus
- Event Hubs
- Azure Data Factory
- Storage Queues

Azure Event Grid can be used like any other message queue service; however, the service stands out when integrated in an event-based architecture. Let’s consider that we have an application that uploads csv files to a Blob Storage several times a day. As soon as a set of files is available, we want to move them to an Azure Data Lake Store (ADLS) to apply data transformations using Azure Data Lake Analytics (ADLA). Azure Data Factory (ADF) will be used to orchestrate the data movement. Finally, we need to extract complete data sets from the ADLS using the PolyBase features in Azure SQL DW and present them as tables to Azure Analysis Services (AAS). A tabular model hosted in AAS will populate a set of Power BI reports.

In the diagram below we can see that we can subscribe to the Blob Storage events using an Event Grid subscription and trigger a loading process using a Logic App as soon as a file is uploaded to the Blob Storage. Following, I’ll detail how we can implement the sections surrounded by the red square.

Blob Storage

Blob storage events are only available in Blob Storage and StorageV2 (general purpose v2) accounts. In this example, I created a Blob Storage account. Once the storage account was deployed, I created the container to which the application was uploading the files.

Logic App

In Logic Apps we subscribed to the events from the Blob Storage and implemented logic to validate whether we had all the required files to start a new data load. If true, we called an Azure Function that triggered an ADF pipeline. An Azure Function was required because, as of the time of writing, there wasn’t a Logic App connector for ADF.
The ADF pipeline then executed a couple of U-SQL stored procedures that applied data transformations to the ingested data and created our dimension and fact files in the ADLS.

The following screens demonstrate how to create the Logic App, the Azure Event Grid trigger and an overview of the workflow with all the requested steps. When adding the Azure Event Grid trigger we are prompted to sign in. We should use the account with our subscription. Once we hit save, we can go back to the blob storage account and find that a new event grid subscription was created.

Final Considerations

Azure Event Grid uses a pay-per-event pricing model, meaning we only pay for what we use. The first 100,000 operations per month are free and, beyond that, the cost is £0.448 per million operations.

As always, if you have any queries or considerations do let me know.
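To ground the Logic App validation step described above, here is a hedged Python sketch of inspecting a Blob Storage event payload. The shape follows the Event Grid event schema for BlobCreated events, but the values and the required-files check are purely illustrative:

```python
import json

# Illustrative Event Grid payload for a BlobCreated event (values made up)
payload = json.loads("""
[{
  "topic": "/subscriptions/xxx/resourceGroups/rg/providers/Microsoft.Storage/storageAccounts/acct",
  "subject": "/blobServices/default/containers/landing/blobs/sales.csv",
  "eventType": "Microsoft.Storage.BlobCreated",
  "data": {"url": "https://acct.blob.core.windows.net/landing/sales.csv"}
}]
""")

REQUIRED = {"sales.csv", "products.csv"}  # files needed before a load starts

def created_blobs(events):
    """Return the blob names carried by BlobCreated events."""
    return {e["subject"].rsplit("/blobs/", 1)[-1] for e in events
            if e["eventType"] == "Microsoft.Storage.BlobCreated"}

seen = created_blobs(payload)
ready = REQUIRED <= seen  # in practice, accumulated across events before triggering ADF
print(seen, ready)
```

Only when every required file has been seen would the workflow call the Azure Function that triggers the ADF pipeline.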

Tabular Automation and NuGet

In a recent blog post, I wrote about processing an Azure Analysis Services tabular model using Azure Functions. In it, there’s a lengthy process of downloading some DLLs and uploading them to the Azure Function. Handily, the Analysis Services team at Microsoft have released the Analysis Services NuGet package, which means that the necessary DLLs can be automatically installed to an Azure Function without much hassle. This blog is going to go through the steps of adding the NuGet package to your Azure Function.

Add a new file to your function called project.json. Input the following code in the newly created file:

{
  "frameworks": {
    "net46": {
      "dependencies": {
        "Microsoft.AnalysisServices.retail.amd64": "15.0.2"
      }
    }
  }
}

Then save the Azure Function to proceed with the NuGet restore and compile your function. You should see the following logs in your log window.

That is the entire process. Much easier than documented previously!

Introduction to Azure Data Catalog

With the rise of self-service business intelligence tools, like Power BI, and an increased engagement with data in the workplace, people’s expectations of where they can find expert information about data have changed. Where previously there would be an expert that people would have to book time with in order to understand data, now people expect to get quick and detailed information about the data assets that an enterprise holds and maintains without going through a single contact. With Azure Data Catalog, data consumers can quickly discover data assets and gain knowledge about the data from documentation, tags and glossary terms provided by the subject matter experts. This post aims to give a brief introduction to Azure Data Catalog and what it can broadly be used for.

What is Azure Data Catalog?

Azure Data Catalog is a fully managed Azure service: an enterprise-wide metadata catalogue that enables data discovery. With Azure Data Catalog, you register; discover; annotate; and, for some sources, connect to data assets. Azure Data Catalog is designed to manage disparate information about data; to make it easy to find data assets, understand them, and connect to them. Any user (analyst, data scientist, or developer) can discover, understand, and consume data sources. Azure Data Catalog is a one-stop central shop for all users to contribute their knowledge and build a community and culture of data.

What can Azure Data Catalog be used for?

As mentioned above, Azure Data Catalog can be used for data asset management, data governance and data discovery. For data asset management, this means knowing what data is available and where; for data governance teams, this means answering questions like “where is my customer data?” or “what does this data model look like?”; for data discovery, this means knowing which data is suitable for particular reports and who you can go to if you have any questions.
There are some common scenarios for using Azure Data Catalog that Microsoft has put together, and they’re well worth reading to get a fuller understanding of what Azure Data Catalog can be used for.

Process an Azure Analysis Services Tabular Model from an Azure Function

A couple of weeks ago I wrote a blog post detailing how to process a tabular model from a stored procedure. The challenge there was to have a way of processing a tabular model on demand from a web application. The challenge still exists, but how do you achieve the same in Azure using Platform as a Service (PaaS) objects, which do not have the benefit of a full server and items, such as msdb and SQL Agent, to aid in the process?

In this post, I’m going to show you how to process a tabular model using only Azure PaaS offerings. Not only am I going to show you how to process a tabular model on demand, but also how to process a tabular model on a schedule. This post has taken inspiration and part of its code base from a Microsoft blog: Automating Azure Analysis Services processing with Azure Functions.

Azure Functions

Before we begin properly, it’s worth spending some time introducing Azure Functions. According to Microsoft, Azure Functions are:

…a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP. Pay only for the time your code runs and trust Azure to scale as needed. Azure Functions lets you develop serverless applications on Microsoft Azure.

They are super useful for extending the capabilities of any solution and are not just limited to what we’re going to cover here.

On-Demand Refresh

The use of Azure Functions creates a trigger for the on-demand refresh of a tabular model from the web application or web hook. This is to make sure that selected elements of data in a tabular model, for example hot partitions, are always up to date.
The following describes the process that Azure Functions will be involved in for this scenario. The steps needed to create an Azure Function for on-demand refresh are as follows:

1) Create an Azure Function App

Navigate to the Azure Portal and create a Function App (the name changes quite a bit, so don’t be concerned if it’s not exactly displayed as it is below).

2) Create a new Function

After you’ve created the Function App, we need to add a new Webhook + API function, which we’ll use as the basis for our on-demand refresh. Click on the + button next to Functions, select Webhook + API, choose C# as your language and click Create this function.

3) Configure the Function

Download the latest client libraries for Analysis Services. This needs to be done on your local machine so you can then copy the files to your Azure Function App. After you’ve downloaded the client libraries, the DLLs can be found in C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies. The two files you need are:

C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies\Microsoft.AnalysisServices.Core.DLL
C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies\Microsoft.AnalysisServices.Tabular.DLL

The documentation references the 130 assemblies, which is not correct and will not work. You need the assemblies in 140.

In order to add these assemblies to your function, you’ll need to add a folder called “bin”. To do this, select your Function App, click Platform features, and under Development Tools, click Advanced tools (Kudu). In Kudu, click Debug console and select Cmd. Navigate to the site\wwwroot\OnDemandHttpTrigger folder and add the “bin” folder here by clicking the + button. Once you’ve added the “bin” folder, go back over to the Azure portal and select your function OnDemandHttpTrigger. On the right, under View files, navigate to the bin folder.
Then click the Upload button to add the two previously mentioned DLLs to the bin folder. You should see the two DLLs in your bin folder now.

4) Add the Azure Analysis Services Connection String to the Function App

This step is going to add the connection string for the Azure Analysis Services (AAS) instance to the entire Function App, not just individual functions. Click the name of your Function App, then select Platform features. Select Application settings under General Settings. Now we need to add our connection string under the Connection strings section. You’ll need your AAS server name and a user ID and password that has access to the AAS database. You can find your AAS server name by navigating to your AAS database in the Azure portal and copying the value shown after clicking Show server connection strings. Your connection string should look like this:

Provider=MSOLAP;Data Source=<your aas server>;Initial Catalog=<aas database name>;User ID=<your username>;Password=<your password>

Back in the screen for the Function App, fill in the Name textbox with a name for your connection string and paste your connection string in the Value text box. Click Save at the top to save these settings for your Functions.

5) Time for Code

Our Function App has been configured; now we need to add code to the function. The function comes with a working sample, with which to test out the functionality, but we don’t need everything that is on offer. We’re going to programmatically process the tabular model. In doing so, we’ll leverage Analysis Services Management Objects (AMO).
If you're new to AMO, start here. Paste in the following code (all code referenced can also be downloaded from my GitHub Repo):

#r "Microsoft.AnalysisServices.Tabular.DLL"
#r "Microsoft.AnalysisServices.Core.DLL"
#r "System.Configuration"

using System;
using System.Configuration;
using Microsoft.AnalysisServices.Tabular;
using System.Net;

public static async Task<HttpResponseMessage> Run(HttpRequestMessage req, TraceWriter log)
{
    log.Info("C# HTTP trigger function processed a request.");

    // Parse query parameter
    string status = req.GetQueryNameValuePairs()
        .FirstOrDefault(q => string.Compare(q.Key, "status", true) == 0)
        .Value;

    if (status == null)
    {
        // Get request body
        dynamic data = await req.Content.ReadAsAsync<object>();
        status = data?.status;
    }

    if (status == "execute")
    {
        log.Info($"C# trigger function started at: {DateTime.Now}");
        try
        {
            Microsoft.AnalysisServices.Tabular.Server asSrv = new Microsoft.AnalysisServices.Tabular.Server();
            var connStr = ConfigurationManager.ConnectionStrings["AASTabular"].ConnectionString;
            asSrv.Connect(connStr);
            Database db = asSrv.Databases["azureadventureworks"];
            Model m = db.Model;
            //db.Model.RequestRefresh(RefreshType.Full);         // Mark the model for refresh
            m.RequestRefresh(RefreshType.Full);                  // Mark the model for refresh
            //m.Tables["Date"].RequestRefresh(RefreshType.Full); // Mark only one table for refresh
            db.Model.SaveChanges();                              // Commit, which will execute the refresh
            asSrv.Disconnect();
        }
        catch (Exception e)
        {
            log.Info($"C# trigger function exception: {e.ToString()}");
        }
        log.Info($"C# trigger function finished at: {DateTime.Now}");
    }

    return status == "execute"
        ? req.CreateResponse(HttpStatusCode.OK, "Successfully Processed Tabular Model")
        : req.CreateResponse(HttpStatusCode.BadRequest, "Please pass a status on the query string or in the request body");
}

Click the Save button at the top.

6) Test, Test, Test
Click the Run button at the top to test the function. The function can also be tested in a web browser, and be called by a Web App using the POST HTTP method.

Now we have a fully functioning method of refreshing an Azure Analysis Services tabular model on-demand.

Scheduled Refresh
The use of Azure Functions creates a trigger for the scheduled refresh of a tabular model, ensuring the entire tabular model has the latest data and is always up to date. The following describes the process that Azure Functions will be involved in this scenario:

The steps needed to create an Azure Function for scheduled refresh are as follows:

1) Create a Function
We've created our Function App, and now we need to add a new Timer Trigger function, which we'll use as the basis for our scheduled refresh. Click on the + button next to Functions, select Timer, choose C# as your language and click Create this function.

2) Configure the Timer
What use is a timer without a schedule? To give the timer a schedule, click Integrate, set the schedule and click Save. The Schedule text box expects a CRON expression to define the days and times that the function should execute. Click the little Documentation button on the screen above to read about CRON expressions. The schedule I've set is to run once every day at 09:30AM.

3) Configure the Function
See step 3 of the on-demand function for detailed steps. You'll need to create the bin folder and upload the DLLs to it.

4) Time for Code
We've configured our function, so now it's time to add the code. The code base is much simpler than the on-demand one, mainly because it's doing fewer tasks, but the AMO section is exactly the same.
Paste in the following code:

#r "Microsoft.AnalysisServices.Tabular.DLL"
#r "Microsoft.AnalysisServices.Core.DLL"
#r "System.Configuration"

using System;
using System.Configuration;
using Microsoft.AnalysisServices.Tabular;

public static void Run(TimerInfo myTimer, TraceWriter log)
{
    log.Info($"C# Timer trigger function started at: {DateTime.Now}");
    try
    {
        Microsoft.AnalysisServices.Tabular.Server asSrv = new Microsoft.AnalysisServices.Tabular.Server();
        var connStr = ConfigurationManager.ConnectionStrings["AASTabular"].ConnectionString;
        asSrv.Connect(connStr);
        Database db = asSrv.Databases["azureadventureworks"];
        Model m = db.Model;
        //db.Model.RequestRefresh(RefreshType.Full);         // Mark the model for refresh
        m.RequestRefresh(RefreshType.Full);                  // Mark the model for refresh
        //m.Tables["Date"].RequestRefresh(RefreshType.Full); // Mark only one table for refresh
        db.Model.SaveChanges();                              // Commit, which will execute the refresh
        asSrv.Disconnect();
    }
    catch (Exception e)
    {
        log.Info($"C# Timer trigger function exception: {e.ToString()}");
    }
    log.Info($"C# Timer trigger function finished at: {DateTime.Now}");
}

Click the Save button at the top.

5) Test, Test, Test
Click the Run button at the top to test the function. Now we have a fully functioning method of refreshing an Azure Analysis Services tabular model on a schedule.

Conclusion
I have shown you how simple it is to invoke two methods of refreshing a tabular model using Azure Functions: an on-demand refresh and a refresh by schedule. I hope that you take inspiration and use these methods in your own use of both Azure Analysis Services and Azure Functions.
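As a footnote, the on-demand function can also be invoked from outside the portal over HTTP. The sketch below builds (but does not send) such a request; the function URL and key are hypothetical placeholders, and the body shape simply mirrors the status parameter the function expects.

```python
import json
import urllib.request

# Sketch: build a POST request to the on-demand refresh function.
# The function URL and key below are placeholders, not real values.
def build_refresh_request(function_url, function_key):
    body = json.dumps({"status": "execute"}).encode("utf-8")
    return urllib.request.Request(
        url=f"{function_url}?code={function_key}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_refresh_request(
    "https://myfunctionapp.azurewebsites.net/api/OnDemandHttpTrigger",
    "<function-key>")
# urllib.request.urlopen(req) would then trigger the refresh.
```

Passing status=execute on the query string instead of the body would work equally well, since the function checks both.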

Extraction and Analysis of GeoSpatial data with Azure Data Lake Analytics

I recently had to implement a solution to prove it was possible to integrate a shape file (.SHP) in Azure Data Lake Store (ADLS) for post geographic spatial analysis using Azure Data Lake Analytics (ADLA). A shape file is a data set used by a geographic analysis application that stores a collection of geographic features, such as streets or zip code boundaries, in the form of points, lines or area features.

As you may already have figured, storing a shape file in ADLS is not a difficult goal to achieve; however, how can you possibly use ADLA to obtain the geographic data from the file? In this blog I'll explain how we can extract the data to a supported format, such as CSV, and use it to run geographic spatial analysis in ADLA, with the support of the spatial data types introduced in SQL Server 2008 (details here).

As always, whenever we face a limitation of ADLA, C# is our best friend. In order to read the content of a shape file, we need to start by adding a geospatial assembly to our solution, which, in my case, was the "Catfood" ESRI Shapefile Reader (details here).

The shape file used in this example contains a list of parks in London. The following code demonstrates how to extract the metadata and the geographic shapes to a CSV file. The only shapes extracted are polygons, although it is possible to add more if needed.

public static void CreateWorkForThreads()
{
    // Create a new dataset and store the data in a table
    DataSet ds = CreateNewDataSet();
    DataTable dt = ds.Tables[0];

    int i;
    int count = 0;

    // Parse the shapefile and select the columns we are interested in
    using (Shapefile shapefile = new Shapefile(@"path\file.shp"))
    {
        foreach (Shape shape in shapefile)
        {
            string[] metadataNames = shape.GetMetadataNames();
            string geometry = "";
            int countParts = 0;
            int countShape = 0;

            DataRow dr = dt.NewRow();

            // Extract the metadata. The first iteration will extract the name of the columns
            if (metadataNames != null)
            {
                foreach (string metadataName in metadataNames)
                {
                    if (count == 0)
                        dr[metadataName] = metadataName;
                    else
                        dr[metadataName] = shape.GetMetadata(metadataName);
                }
            }

            // Shape is not part of the metadata, so manually define the name of the column
            if (count == 0)
            {
                dr["shape"] = "shape";
            }
            else
            {
                // Cast the shape based on its type
                switch (shape.Type)
                {
                    case ShapeType.Point:
                        // A point is just a single x/y point
                        ShapePoint shapePoint = shape as ShapePoint;
                        MessageBox.Show("Point (" + shapePoint.Point.X.ToString() + ", " + shapePoint.Point.Y.ToString() + ")");
                        break;

                    case ShapeType.Polygon:
                        // A polygon contains one or more parts - each part is a list of points which
                        // are clockwise for boundaries and anti-clockwise for holes
                        ShapePolygon shapePolygon = shape as ShapePolygon;
                        foreach (PointD[] part in shapePolygon.Parts)
                        {
                            countShape = 0;
                            if (countParts == 0)
                                geometry = "(";
                            else
                                geometry = geometry + " | (";

                            foreach (PointD point in part)
                            {
                                if (part.Length - 1 != countShape)
                                    geometry = geometry + point.X + " " + point.Y + " |";
                                else
                                    geometry = geometry + point.X + " " + point.Y + " )";
                                countShape++;
                            }
                            countParts++;
                        }
                        break;

                    default:
                        break;
                }

                // Build our polygon.
                // E.g. POLYGON((-122.358 47.653, -122.348 47.649| -122.348 47.658, -122.358 47.658, -122.358 47.653))
                dr["shape"] = "POLYGON(" + geometry + ")";
            }

            dt.Rows.Add(dr);
            count++;
        }
    }

    // Extract the data to a csv file
    using (System.IO.StreamWriter sw = new System.IO.StreamWriter(@"path\filename.csv"))
    {
        foreach (DataRow row in dt.Rows)
        {
            object[] array = row.ItemArray;
            for (i = 0; i < array.Length - 1; i++)
            {
                sw.Write(array[i].ToString() + ",");
            }
            sw.WriteLine(array[i].ToString());
        }
    }
}

public static DataSet CreateNewDataSet()
{
    DataSet dsTemp = new DataSet();
    DataTable dtTemp = new DataTable("londonparks");
    dtTemp.Columns.Add("id", typeof(string));
    dtTemp.Columns.Add("parkname", typeof(string));
    dtTemp.Columns.Add("street", typeof(string));
    dtTemp.Columns.Add("postcode", typeof(string));
    dtTemp.Columns.Add("shape", typeof(string));
    dsTemp.Tables.Add(dtTemp);
    return dsTemp;
}

Now that we have a valid file that can be processed by ADLA, we can upload it to ADLS and start performing geospatial analysis. To do so, I simply followed the logic described in Sacha's blog (here).

The following U-SQL takes into consideration a dataset that contains details of the trajectory of a courier, tracked on a daily basis. With the following code, we identify if a courier drove by a park by using the Intersect function. Because we have to cross two datasets, a C# function was created to help the evaluation of multiple events.

// Reference the assemblies we require in our script.
REFERENCE SYSTEM ASSEMBLY [System.Xml];
REFERENCE ASSEMBLY [SQLServerExtensions].[SqlSpatial];
REFERENCE ASSEMBLY [USQL.Core];

// Once the appropriate assemblies are registered, we can alias them using the USING keyword.
USING Geometry = Microsoft.SqlServer.Types.SqlGeometry;
USING Geography = Microsoft.SqlServer.Types.SqlGeography;
USING SqlChars = System.Data.SqlTypes.SqlChars;
USING [USQL].[Core].[Utilities];

// Extract the list of parks
@parks =
    EXTRACT [ID] string,
            [PARKNAME] string,
            [STREET] string,
            [POSTCODE] string,
            [SHAPE] string
    FROM "RAW/Parks.csv"
    USING Extractors.Text(delimiter : ',', silent : false, quoting : true, skipFirstNRows : 1);

// Extract data from the file containing the courier trajectory
@trajectories =
    EXTRACT GPSDateTimeUTC DateTime,
            ReceivedDatetimeUTC DateTime,
            VehicleKey string,
            Altitude int,
            Longitude double,
            Latitude double,
            Distance decimal,
            VehicleSpeedMph decimal
    FROM "CURATED/Trajectory/Trajectory.TXT"
    USING Extractors.Text(delimiter : '|', silent : false, quoting : true, skipFirstNRows : 1);

// Get the list of vehicles that drove by the park
@vehicleIntersection =
    SELECT DISTINCT a.*,
           "1" AS VehicleIntersected
    FROM @trajectories AS a
    CROSS JOIN @parks AS b
    WHERE Utilities.Intersect(b.[SHAPE], a.[Longitude], a.[Latitude]).ToString() == "True";

// Get the list of vehicles that didn't drive by the park
@vehicleWithoutIntersection =
    SELECT a.*,
           "0" AS VehicleIntersected
    FROM @trajectories AS a
    LEFT JOIN @vehicleIntersection AS b
        ON b.VehicleKey == a.VehicleKey
        AND b.GPSDateTimeUTC == a.GPSDateTimeUTC
    WHERE b.VehicleKey IS NULL;

// Union both datasets to get the complete set of data
@finalData =
    SELECT * FROM @vehicleIntersection
    UNION ALL
    SELECT * FROM @vehicleWithoutIntersection;

// Export the results to a csv file
OUTPUT @finalData
TO "LABORATORY/GeoSpatialIntersection.csv"
USING Outputters.Text(outputHeader : true, delimiter : ',', quoting : true);

And here is the C# function.
It accepts three parameters and calculates the intersection of a point with a shape.

public static string Intersect(string shape, double longitude, double latitude)
{
    // Because we had a csv file, the coordinates in the polygon were separated by |
    // It is important to use the .MakeValid() method to validate any invalid shape
    // In case the dataset had multi-polygon shapes, without MakeValid() the function would throw an error
    var g = Geography.STGeomFromText(
        new SqlChars(
            shape.Replace('|', ',')), 4326).MakeValid();
    var h = Geography.Point(longitude, latitude, 4326);
    return g.STIntersects(h).ToString();
}

As always, if you have any questions or comments, do let me know.
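For readers curious what the intersection test is doing conceptually: for a point and a polygon it boils down to a point-in-polygon check. The sketch below is a minimal pure-Python ray-casting version for illustration only; the U-SQL function above relies on SQL Server's SqlGeography.STIntersects rather than this code.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting point-in-polygon test (illustrative only).
    polygon is a list of (x, y) vertices; returns True when the
    point falls inside the boundary."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# A 4x4 square "park" around the origin, as (x, y) pairs
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
```

An odd number of edge crossings means the point is inside; the real spatial types additionally handle holes, multi-part shapes and geodesic coordinates, which is why MakeValid() and STIntersects are the right tools in production.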

Different ways of performing file operations on Azure Data Lake

Recently at a client, we needed to come up with a few different ways to perform file management operations within their Data Lake – for example, moving files once processed and, in their case, renaming folders etc. We needed to come up with different solutions to what we currently used in order to keep within their desired architecture. So we started looking at using the REST API and calling it using C# within an SSIS package. The other option I looked at was using Python. I will explain more about both methods below, but first there is some set up we need to do.

Pre-Requisites
Aside from having an Azure Subscription and a Data Lake Store account, you will need an Azure Active Directory Application. For our case we needed a Web Application, as we will be doing service-to-service authentication. This is where the application provides its own credentials to perform the operations, whereas with end-user authentication, a user must log into your application using Azure AD.

Service-to-service authentication setup
Data Lake Store uses Azure Active Directory (AAD) for authentication, and this results in our application being provided with an OAuth 2.0 token, which gets attached to each request made to the Azure Data Lake Store. To read more about how it works, how to create the app, get the relevant credentials and give it access to the Data Lake Store, follow through the Microsoft tutorial here. Make a note of the below, as we will need them when we develop the solutions:

Tenant ID (also known as Directory ID)

Client ID (also known as the Application ID). Within App registrations, if you look for your app under 'All Apps' and click on it, you will be able to retrieve the Application ID.

Client Secret (also known as Authentication Key). Within the app area used above, click on Settings and then 'Keys'.
If you haven't previously created one, you can create it there, and you must remember to save it when it appears, as you will not get another chance!

Data Lake Name

Using the REST API
Now we have everything set up and all the credentials we need, we can make a start constructing the requests. In order to test them out I used Postman, which can be downloaded here.

Authentication
Firstly, before we can begin any folder management operations, we need to authenticate against the Data Lake. To do this we need to perform a POST request in order to obtain an access token. This token will be used and passed to the other REST API calls we will make (e.g. deleting files), as we will always need to authenticate against the Data Lake. To retrieve the access token, we need to pass through the TENANT ID, CLIENT ID and CLIENT SECRET, and the request looks as follows:

curl -X POST https://login.microsoftonline.com/<TENANT-ID>/oauth2/token \
  -F grant_type=client_credentials \
  -F resource=https://datalake.azure.net/ \
  -F client_id=<CLIENT-ID> \
  -F client_secret=<AUTH-KEY>

Within Postman, it looks like this:
1. Make sure the request type is set to POST.
2. Make sure you have added your tenant id to the request.
3. Fill out the body with your Client ID and Client Secret (grant_type and resource are set as constant values as shown above).
4. Make a note of the Bearer access token, as we will need it to perform any file management operation on the Data Lake.

Deleting a File
Now we have our access token, we can perform a deletion of a file with the following:

curl -i -X DELETE -H "Authorization: Bearer <REDACTED>" 'https://<yourstorename>.azuredatalakestore.net/webhdfs/v1/<PATH-TO-FILE>?op=DELETE'

Within Postman, it looks like the following:
1. This is a DELETE request, so the request type dropdown is changed to reflect that.
2. Remember to add your Data Lake Store name into the request; in my example it is called emma1.
3. You can point to a particular file, or you can point to a folder and add &recursive=true to the request, and it will delete all the files within the folder, including the folder itself. I haven't managed to find a way to delete just the contents of the folder while leaving the folder as is.
4. The access token is sent as a header called 'Authorization'. Make sure to include 'Bearer ' before your access token, as highlighted above.

Once you have sent the request, you will receive some JSON in the output to show whether the action has been successful (true). You can perform many more file management operations using the REST API and the code can be found here:

Common Errors
1. The following error is caused by running the delete file request while passing through an access token that has expired. To fix this issue, re-generate the access token and pass that through instead.
2. The error below is caused by the application that we are using to access the Data Lake Store not having sufficient permissions to perform the operations. Make sure it has access to the relevant folder(s). Check step 3 here to find out how to set the access.

Summary
Now we have managed to get them working manually within Postman, we need to consider how to call them in a production environment. The solution we implemented was an SSIS package (in keeping with their current architecture) with script tasks calling C#, which in turn calls the API. Before the file system operation is called, we run the authentication API call to obtain the latest access token, and place the value in a variable to be used later on within the package to ensure it is the latest.

Using Python
From having a play around with Python to do similar file management operations, it seems rather limited in comparison and you can't do as much. Nevertheless, I am sure more functionality will be added, and it is useful to know how it works.
Firstly, if you don't already have Python, you can download the latest version from here. As an IDE, I have been using Visual Studio 2017, which now comes with Python support; see here for further information.

In order for us to be able to perform operations on the Data Lake, we need to install three Azure modules. To install the modules, open up the command prompt and run the following:

pip install azure-mgmt-resource
pip install azure-mgmt-datalake-store
pip install azure-datalake-store

Now we need to create the Python app (I used Visual Studio) to do the folder management tasks. In order to reference the modules we have just installed, we need to import them so we can use them within our app. Each time we create an app related to Data Lake folder manipulation, we need to add them in. The code below shows how to do this. Save the application, but don't run it yet!

## Use this for Azure AD authentication
from msrestazure.azure_active_directory import AADTokenCredentials

## Required for Azure Data Lake Store account management
from azure.mgmt.datalake.store import DataLakeStoreAccountManagementClient
from azure.mgmt.datalake.store.models import DataLakeStoreAccount

## Required for Azure Data Lake Store filesystem management
from azure.datalake.store import core, lib, multithread

## Common Azure imports
import adal
from azure.mgmt.resource.resources import ResourceManagementClient
from azure.mgmt.resource.resources.models import ResourceGroup

## Use these as needed for your application
import logging, getpass, pprint, uuid, time

Firstly, we need to authenticate with Azure AD. Again, as described above, there are two ways: end-user and service-to-service. We will be using service-to-service again in this example. To set this up, we run the following:

adlCreds = lib.auth(tenant_id = 'FILL-IN-HERE', client_secret = 'FILL-IN-HERE', client_id = 'FILL-IN-HERE', resource = 'https://datalake.azure.net/')

And fill in the TENANT ID, CLIENT SECRET and CLIENT ID that we captured earlier on.
Now that we can authenticate against the Data Lake, we can attempt to delete a file. We need to import some more modules, so add the script below to your application:

## Use this only for Azure AD service-to-service authentication
from azure.common.credentials import ServicePrincipalCredentials

## Required for Azure Data Lake Store filesystem management
from azure.datalake.store import core, lib, multithread

We now need to create a filesystem client:

## Declare variables
subscriptionId = 'FILL-IN-HERE'
adlsAccountName = 'FILL-IN-HERE'

## Create a filesystem client object
adlsFileSystemClient = core.AzureDLFileSystem(adlCreds, store_name=adlsAccountName)

We are now ready to perform some file management operations, such as deleting a file:

## Delete a directory
adlsFileSystemClient.rm('/mysampledirectory', recursive=True)

Please see the script below for the full piece of code. You can find information on the other operations you can complete (e.g. creating directories) here.

## Use this for Azure AD authentication
from msrestazure.azure_active_directory import AADTokenCredentials

## Required for Azure Data Lake Store filesystem management
from azure.datalake.store import core, lib, multithread

## Common Azure imports
import adal
from azure.mgmt.resource.resources import ResourceManagementClient
from azure.mgmt.resource.resources.models import ResourceGroup

## Use these as needed for your application
import logging, getpass, pprint, uuid, time

## Service to service authentication with client secret for file system operations
adlCreds = lib.auth(tenant_id = 'XXX', client_secret = 'XXX', client_id = 'XXX', resource = 'https://datalake.azure.net/')

## Create filesystem client
## Declare variables
subscriptionId = 'XXX'
adlsAccountName = 'emma1'

## Create a filesystem client object
adlsFileSystemClient = core.AzureDLFileSystem(adlCreds, store_name=adlsAccountName)

## Create a directory
#adlsFileSystemClient.mkdir('/mysampledirectory')

## Delete a directory
adlsFileSystemClient.rm('/mysampledirectory', recursive=True)

Summary
In summary, there are a few different ways in which you can handle your file management operations within Data Lake, and the principles behind the methods are very similar. So, if one way doesn't fit into your architecture, there is always an alternative.
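To tie the REST walkthrough back to code, the sketch below builds the two requests (token, then authenticated DELETE) without sending them. The endpoint patterns are assumptions based on the AAD client-credentials flow and the ADLS WebHDFS-style API; all tenant, store and credential values are placeholders.

```python
import urllib.parse
import urllib.request

# Sketch of the two REST calls described above, built but not sent.
# Tenant, store name, path and credentials are all placeholders.
def build_token_request(tenant_id, client_id, client_secret):
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "resource": "https://datalake.azure.net/",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("utf-8")
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    return urllib.request.Request(url, data=body, method="POST")

def build_delete_request(store_name, path, access_token, recursive=False):
    url = (f"https://{store_name}.azuredatalakestore.net/webhdfs/v1"
           f"/{path}?op=DELETE&recursive={str(recursive).lower()}")
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        method="DELETE")

token_req = build_token_request("<tenant>", "<client-id>", "<secret>")
delete_req = build_delete_request("emma1", "mytempdir/file.txt", "<token>")
```

In the SSIS solution described above, the token call runs first and its Bearer token is stored in a package variable before any delete request is issued.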

Deploying Scheduled Triggers In Data Factory v2 With PowerShell

So in ADFv2, scheduled triggers have been overhauled to become their own deployable components, able to kick off multiple parameterised pipelines together in each trigger. This new version of the trigger seems far more modular and flexible than the functionality in v1; however, there are a couple of catches and easy-to-miss steps during implementation if you choose to use PowerShell rather than the brand new visual authoring tool, so I'll break down the process below.

First, you need to build your trigger as a separate JSON object to the pipeline. No additional code is needed in the pipeline itself. There is a more complete example in the documentation Microsoft has provided, but on top of that here is a working example I've written below:

{
    "name": "ST_OrchestratorTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2017-12-20T04:00:00-05:00",
                "endTime": "2099-11-02T22:00:00-08:00",
                "schedule": {
                    "hours": [ 4 ],
                    "minutes": [ 0 ]
                }
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "PL_MasterOrchestration"
                },
                "parameters": {}
            }
        ]
    }
}

As you might guess, this trigger runs the PL_MasterOrchestration pipeline (with no input parameters) at 4am each day.

A caveat on the recurrence: it seems that when running, the trigger doesn't strictly adhere to the "startTime" after the first time it runs, so despite starting at 4am, the subsequent trigger could be minutes or possibly hours off on the next day.
To ensure it always triggers at exactly 4am, you need to add the "schedule" recurrence attribute as shown above.

Deployment of the trigger (as of writing this blog) is still only done through PowerShell scripts, using the cmdlet below:

Set-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -DefinitionFile $TriggerFile `
    -Name $TriggerName `
    -Force;

Each of these parameters is exactly as you would expect from the v1 parameters for setting other objects. -Force shouldn't strictly be necessary, but sometimes it is worthwhile putting in depending on the exact circumstances; -DefinitionFile is asking for the .JSON file itself, etc.

So, at this point you've created the scheduler and deployed it to your data factory. Scheduled triggers in v2 need to be started once they are deployed. You can do this with the following:

Start-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $TriggerName

Just another quick note here: you will need to stop the trigger again if you want to redeploy it. Trying to overwrite a trigger while it's running will just result in an error message. As well as finding the necessary cmdlets two-thirds of the way down the Microsoft document linked above, there is a list of all relevant cmdlets found here.

Hope this helps!
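One way to catch schedule mistakes before running Set-AzureRmDataFactoryV2Trigger is to sanity-check the JSON definition file first. The sketch below is a minimal, unofficial check; the field names follow the ScheduleTrigger example above, and the helper itself is illustrative rather than part of any Azure tooling.

```python
import json

# Sketch: validate a ScheduleTrigger definition before deployment.
# Field names follow the trigger JSON shown in the post.
def check_trigger(definition):
    props = definition["properties"]
    assert props["type"] == "ScheduleTrigger", "not a schedule trigger"
    recurrence = props["typeProperties"]["recurrence"]
    assert recurrence["frequency"] in ("Minute", "Hour", "Day", "Week", "Month")
    # Without an explicit schedule block the firing time can drift
    # from startTime, so insist on one for fixed-time triggers.
    assert "schedule" in recurrence, "add a schedule block to pin the time"
    assert props["pipelines"], "trigger references no pipelines"
    return True

trigger = json.loads("""{
  "name": "ST_OrchestratorTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day", "interval": 1,
        "startTime": "2017-12-20T04:00:00-05:00",
        "schedule": {"hours": [4], "minutes": [0]}
      }
    },
    "pipelines": [{
      "pipelineReference": {
        "type": "PipelineReference",
        "referenceName": "PL_MasterOrchestration"
      },
      "parameters": {}
    }]
  }
}""")
```

Running check_trigger over each definition file in a deployment script surfaces a missing "schedule" block (the drift caveat above) before the trigger is ever started.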

The Road to Microsoft MVP (Most Valuable Professional)

It is with great pleasure that I can share with you all that I am now a Microsoft MVP (Most Valuable Professional). The MVP programme is designed to recognise those professionals who give their time to support technical communities and Microsoft. This could be through blogging, speaking at conferences, providing feedback, helping others and contributing code to projects. There are many award categories, ranging from PowerPoint to Xbox, each with its own contribution areas. I was awarded MVP for the Data Platform, which covers SQL Server, Data Lake, HDInsight and many of the topics I am particularly interested in. I am exceptionally pleased to be amongst the 19 Data Platform MVPs in the UK and 1 of the 418 Data Platform MVPs worldwide.

In a recent blog, Kevin Kline discussed what it takes to be an MVP; in his opinion, it all boils down to being passionate about what you do and sharing that passion with others. I could not agree more! I talk at events because I want to help people. I still get so much out of attending sessions, and I want to make sure that I give back everything I took from the community. Kevin's blog gives a fantastic overview of what it takes to become an MVP and is well worth a read.

When I first started attending user groups and conferences, I began to become aware of the term MVP. At that point there was only a handful of MVPs in the UK (at that point you were a SQL Server MVP). Their sessions at conferences were always held in high regard, and rightly so; these men and women were at the top of their game. I looked up to these professionals and would always learn something from reading their blogs and books and seeing them talk. When I started talking at user groups and SQL Saturdays, I always wanted to become an MVP, but it was never my reason for presenting; it is now a very happy by-product of giving back to a technical community.
MVP is awarded based on your contributions in the last 12 months, so what have I done in the last 12 months?

Blogged
Delivered 25 technical sessions at user groups and conferences
Spoke at the PASS Summit in Seattle
Organised a data science meetup

It has been a real blast over the last 12 months. I could not have spoken at so many events without the support of Adatis - so thanks. I want to also thank everyone who took the time to nominate me, and Microsoft for their help and support. (Some of the MVPs who nominated me, at a recent MVP event.)

What will I be doing over the next 12 months? A lot of the same, although I would like to get a few more blogs written this year. If you want to see me talk, I will be at SQLBits in London in February, delivering a pre-conference training day and a general session:

A Data Engineer's Guide to Azure SQL Data Warehouse
Enhancing relational models with graph in SQL Server 2017

Beyond that, keep an eye out at your local user group or SQL Saturday. See you soon.

Archiving the Data Lake

In a blog introducing the Data Lake Framework, keen readers will be aware that the diagram contains a box titled "ARCHIVE", but it has not been brought up since. The reason the Archive layer in the data lake has not been discussed is because we've been waiting for the Archive tier in Blob Storage. To remind readers of the framework and where the archive layer sits in it, here it is again with the archive layer highlighted.

The Archive Blob
The Archive access tier in blob storage was made generally available today (13th December 2017), and with it comes the final piece in the puzzle to archiving data from the data lake. Where Hot and Cool access tiers can be applied at a storage account level, the Archive access tier can only be applied to a blob storage container. To understand why, you need to understand the features of the Archive access tier. It is intended for data that has no or low SLAs for availability within an organisation, and the data is stored offline (Hot and Cool access tiers are online). Therefore, it can take up to 15 hours for data to be brought online and made available. Bringing Archive data online is a process called rehydration (fitting for the data lake). If you have lots of blob containers in a storage account, you can archive them and rehydrate them as required, rather than having to rehydrate the entire storage account.

Archive Pattern
An intended use for the Archive access tier is to store raw data that must be preserved, even after it has been fully processed, and does not need to be accessed within 180 days. Data gets loaded into the RAW area of the data lake, is fully processed through to CURATED, and a copy of the raw data is archived off to a blob container with a Cool access tier applied to it. When the archive cycle comes around, a new Cool access tiered blob container is created and the now old container has its access tier changed to Archive.
For example, our archive cycle is monthly and we have a Cool access tiered blob container in our storage account called "December 2017". When data has finished being processed in the Azure Data Lake, the raw data is archived to this blob container. When January comes around, we create a new blob container called "January 2018" with the Cool access tier, and change the access tier of "December 2017" from Cool to Archive. This data has now been formally archived and is only available for disaster recovery, auditing or compliance purposes.
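The monthly rotation described above can be sketched as a simple date calculation. The container naming follows the "Month Year" convention from the example; the helper itself is hypothetical, and the actual container creation and tier change would be done through the storage API or portal.

```python
from datetime import date, timedelta

# Sketch of the monthly archive rotation described above.
# Names follow the "Month Year" convention from the example.
def archive_rotation(today):
    """Return (new_cool_container, container_to_flip_to_archive)."""
    new_container = today.strftime("%B %Y")
    # Last day of the previous month gives us the outgoing container
    last_of_prev = today.replace(day=1) - timedelta(days=1)
    old_container = last_of_prev.strftime("%B %Y")
    return new_container, old_container

new, old = archive_rotation(date(2018, 1, 1))
# new == "January 2018", old == "December 2017"
```

On each cycle the first name is created with the Cool tier and the second has its access tier changed from Cool to Archive.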

The thing that stumps everyone new to Azure Data Factory

Recently I was playing around with Azure Data Factory and was making my way through an online tutorial when I came across a frustrating error message that was not very helpful, and my data wasn't moving anywhere! The aim of this exercise was to move some text files in Azure Data Lake Store over to a table in an Azure SQL Database. Sounds easy enough!

The data within the Data Lake Store was organised into a Year and Month hierarchy for the folders, and each day's transactions were stored in a file named after the day, within the relevant month folder. The task then was to create a pipeline which copies the dataset in the Data Lake Store over to the dbo.Orders table in Azure SQL DB every day within the scheduled period (Q1 2016).

After creating all the JSON scripts and deploying them (with no errors), I clicked on the 'Monitor and Manage' tile to monitor the activities, check everything was working as it should be and monitor the progress. After waiting for at least 10 minutes, I started to get frustrated. As you can see, all the pipeline activities for each slice have a status of 'Waiting: Dataset dependencies', and if you look at the Activity Window Explorer (see below), the activity window for each day shows a status of Waiting (a solid orange square).

The files existed in the correct folders within my Data Lake Store and there were no errors during deployment, so what was it waiting for?! Well, it turns out we need to set an additional property on the input dataset to let Data Factory know that the data is being generated externally and not internally within Data Factory from another pipeline. That property is "external": true at the bottom of the script. Below is an example of a data factory, and the arrows point to the datasets that need to have this external property set to true. Using the diagram of the pipeline once deployed, it is easy to identify which datasets need to be external, as they are not linked to anything upstream.
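To make the fix concrete, here is a stripped-down input dataset definition expressed as a Python dict for brevity. Only the relevant properties are shown; the dataset name and linked service are illustrative placeholders, and the key line is "external": true.

```python
# Illustrative ADFv1-style input dataset, trimmed to the relevant
# properties. The key setting is "external": True - it tells Data
# Factory the data is produced outside of any pipeline.
input_dataset = {
    "name": "InputDataset",
    "properties": {
        "type": "AzureDataLakeStore",
        "linkedServiceName": "AzureDataLakeStoreLinkedService",
        "availability": {"frequency": "Day", "interval": 1},
        "external": True,
    },
}

def is_externally_generated(dataset):
    """True when the dataset is flagged as externally produced."""
    return bool(dataset["properties"].get("external", False))
```

Any dataset with no upstream activity producing it should pass this check; otherwise its slices will sit at 'Waiting: Dataset dependencies' forever.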

LETS Process Data–Modern ETL for the Data Lake

At the PASS Summit this year, I attended a session by Michael Rys. In this session he introduced the concept of LETS as an approach to processing data in the data lake. If you are familiar with the data lake, then you will be familiar with having to apply a schema to the data held within. The LETS approach is purpose-designed for schematisation. Where ETL stands for Extract, Transform, Load and ELT stands for Extract, Load, Transform, LETS stands for Load, Extract, Transform, Store:

Data are Loaded into the data lake
Data are Extracted and schematised
Data are Transformed into rowsets
Data are Stored in a location, such as the Catalog in Azure Data Lake Analytics, Azure SQL Data Warehouse or Azure Analysis Services, for analysis purposes.

I really like this approach as it makes sense for how data are handled in the data lake. It’s something that I will be advocating and using, and I hope you do too!
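Purely as an illustration, the four LETS steps can be sketched over a toy in-memory CSV. The file contents and the "catalog" dict are made-up stand-ins for the lake store and the ADLA Catalog, not real ADLA constructs:

```python
import csv
import io

# Load: the raw file lands in the lake exactly as produced, schema-free.
raw = "id,amount\n1,10\n2,25\n"

# Extract: a schema is applied on read, not on write.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: work happens on the schematised rowsets.
rowset = [{"id": int(r["id"]), "amount": int(r["amount"]) * 2}
          for r in rows]

# Store: the result is persisted somewhere queryable for analysis.
catalog = {"dbo.Orders": rowset}
print(catalog["dbo.Orders"])  # -> [{'id': 1, 'amount': 20}, {'id': 2, 'amount': 50}]
```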

Migrating to Native Scoring with SQL Server 2017 PREDICT

Microsoft introduced native predictive model scoring with the release of SQL Server 2017. The PREDICT function (Documentation) is now a native T-SQL function that eliminates having to score using R or Python through the sp_execute_external_script procedure. It's an alternative to sp_rxPredict. In both cases you do not need to install R, but with PREDICT you do not need to enable SQLCLR either - it's truly native. PREDICT should make predictions much faster, as the process avoids having to marshal the data between SQL Server and Machine Learning Services (previously R Services).

Migrating from the original sp_execute_external_script approach to the new native approach tripped me up, so I thought I'd share a quick summary of what I have learned.

Stumble One: Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80004001. Model type is unsupported.

Reason: Not all models are supported. At the time of writing, only the following models are supported: rxLinMod, rxLogit, rxBTrees, rxDtree and rxDForest. sp_rxPredict supports additional models, including those available in the MicrosoftML package for R (I was attempting to use rxFastTrees). I presume this limitation will reduce over time. The list of supported models is referenced in the PREDICT function documentation.

Stumble Two: Error occurred during execution of the builtin function 'PREDICT' with HRESULT 0x80070057. Model is corrupt or invalid.

Reason: The serialisation of the model needs to be modified for use by PREDICT. Typically you might serialise your model in R like this:

model <- data.frame(model=as.raw(serialize(model, NULL)))

Instead you need to use the rxSerializeModel method:

model <- data.frame(rxSerializeModel(model, realtimeScoringOnly = TRUE))

There's a corresponding rxUnserializeModel method, so it's worth updating the serialisation across the board so models can be used interchangeably in the event that all models are eventually supported. That's it.
Oh, apart from the fact that PREDICT is supported in Azure SQL DB, despite the documentation stating the contrary.

Automating The Deployment of Azure Data Factory Custom Activities

Custom Activities in Azure Data Factory (ADF) are a great way to extend the capabilities of ADF by utilising C# functionality. Custom Activities are useful if you need to move data to/from a data store that ADF does not support, or to transform/process data in a way that isn't supported by Data Factory, as they can be used within an ADF pipeline.

Deploying Custom Activities to ADF is a manual process which requires many steps. Microsoft’s documentation lists them as:

Compile the project. Click Build from the menu and click Build Solution.
Launch Windows Explorer, and navigate to the bin\debug or bin\release folder, depending on the type of build.
Create a zip file that contains all the binaries in the \bin\Debug folder. Include the MyDotNetActivity.pdb file so that you get additional details, such as the line number in the source code that caused the issue, if there was a failure.
Create a blob container named customactivitycontainer if it does not already exist.
Upload the zip file as a blob to the customactivitycontainer in the general purpose Azure blob storage account that is referred to by AzureStorageLinkedService.

The number of steps means that it can take some time to deploy Custom Activities and, because it is a manual process, it can contain errors such as missing files or uploading to the wrong storage account. To avoid the errors and delays caused by a manual deployment, we want to automate as much as possible. Thanks to PowerShell, it’s possible to automate the entire deployment.
The script to do this is as follows (note the Add-Type line: the assembly name was missing from the original listing, and System.IO.Compression.FileSystem is the assembly that provides ZipFile.CreateFromDirectory):

Login-AzureRmAccount

# Parameters
$SourceCodePath = "C:\PathToCustomActivitiesProject\"
$ProjectFile = "CustomActivities.csproj"
$Configuration = "Debug"

# Azure parameters
$StorageAccountName = "storageaccountname"
$ResourceGroupName = "resourcegroupname"
$ContainerName = "blobcontainername"

# Local variables
$MsBuild = "C:\Program Files (x86)\MSBuild\14.0\Bin\MSBuild.exe"
$SlnFilePath = $SourceCodePath + $ProjectFile

# Prepare the args for the actual build
$BuildArgs = @{
    FilePath = $MsBuild
    ArgumentList = $SlnFilePath, "/t:rebuild", ("/p:Configuration=" + $Configuration), "/v:minimal"
    Wait = $true
}

# Start the build
Start-Process @BuildArgs

# Initiate a sleep to avoid zipping up a half-built project
Sleep 5

# Create the zip file
$zipfilename = ($ProjectFile -replace ".csproj", "") + ".zip"
$source = $SourceCodePath + "bin\" + $Configuration
$destination = $SourceCodePath + $zipfilename
if (Test-Path $destination) {Remove-Item $destination}
Add-Type -AssemblyName "System.IO.Compression.FileSystem"
[IO.Compression.ZipFile]::CreateFromDirectory($source, $destination)

# Create the storage account if it does not exist
$storageAccount = Get-AzureRmStorageAccount -ErrorAction Stop | Where-Object {$_.StorageAccountName -eq $StorageAccountName}
if (!$storageAccount) {
    $StorageLocation = (Get-AzureRmResourceGroup -ResourceGroupName $ResourceGroupName).Location
    $StorageType = "Standard_LRS"
    New-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName -Location $StorageLocation -Type $StorageType
}

# Create the container if it does not exist
$ContainerObject = Get-AzureStorageContainer -ErrorAction Stop | Where-Object {$_.Name -eq $ContainerName}
if (!$ContainerObject) {
    $storagekey = Get-AzureRmStorageAccountKey -ResourceGroupName $ResourceGroupName -Name $StorageAccountName
    $context = New-AzureStorageContext -StorageAccountName $StorageAccountName -StorageAccountKey $storagekey.Key1 -Protocol Http
    New-AzureStorageContainer -Name $ContainerName -Permission Blob -Context $context
}

# Upload to blob
# Set the default context
Set-AzureRmCurrentStorageAccount -StorageAccountName $StorageAccountName -ResourceGroupName $ResourceGroupName
Get-AzureRmStorageAccount -ResourceGroupName $ResourceGroupName -Name $StorageAccountName

# Upload the file
Set-AzureStorageBlobContent -Container $ContainerName -File $destination

By removing the manual steps of building, zipping and deploying ADF Custom Activities, you remove the risk of something going wrong, and you add the reassurance that you have a consistent method of deployment, which will hopefully speed up your overall development and deployments. As always, if you have any questions or comments, do let me know.

How to prepare for 70-766 - Perform Big Data Engineering on Microsoft Cloud Services

There is a new exam, currently in beta, titled "Perform Big Data Engineering on Microsoft Cloud Services (beta)". As with all new exams, there is little content on how to revise for it beyond the exam summary. This exam, however, is what Adatis specialises in! Microsoft may call this "Big Data Engineering"; we call it "Modern Data Analytics", and we have a few blogs on the subject. You can sign up to the exam here: Below you will find links to blog posts by Adatis consultants on topics related to all the key objectives of this exam. I will endeavour to keep this up-to-date with new content added by the team. Good luck with the exam.

Design and Implement Complex Event Processing By Using Azure Stream Analytics (15-20%)
Streaming data is vital to achieving real-time analytics. The following blog posts focus on this and offer an introduction and walkthrough for getting started with Stream Analytics. When talking about a wider Lambda approach to Big Data, streaming enables rapid processing via a “Speed” layer.

Design and Implement Analytics by Using Azure Data Lake (25-30%)
Azure Data Lake Store and Analytics are a vital component of “Modern Data Analytics”. Data which is too large for traditional single-server processing needs distributed parallel computation. Rather than pulling the data to the processing, ADLA pushes the processing to the data. Understanding how to process large volumes of data is one part of the “Batch” layer in Lambda.

Design and Implement Azure SQL Data Warehouse Solutions (15-20%)
Either as an alternative or in accompaniment to Data Lake is Azure SQL Data Warehouse. If Data Lake is batch across many files, Azure SQL DW is parallel batch over many databases. The key to both services is processing at the storage and not at the compute. The following is an ongoing blog series covering the basics all the way to a deep dive.
Design and Implement Cloud-Based Integration by using Azure Data Factory (15-20%)
If you’re looking for a PaaS solution to move data in Azure, there is only really one option: Azure Data Factory. The following blogs will get you up to speed with ADF.

Manage and Maintain Azure SQL Data Warehouse, Azure Data Lake, Azure Data Factory, and Azure Stream Analytics (20-25%)
Knowing each of the parts is only half the battle; you need to know how, when and why to use each part. What are the best practices?

Instant Bot: deploying a Bot in minutes with Azure Bot Service

I had been playing around with the Bot Framework for a while but hadn’t really got anywhere, largely due to not having enough time to create something worthwhile, when I came across the Azure Bot Service whilst trawling through the documentation of the Bot Framework. The Azure Bot Service is currently in preview, and allowed me to quickly author and deploy a basic bot for the purpose of this post and walkthrough.

Creating the Bot
Like most services in Azure, creating the bot is easy and requires the following input.

Configuring the Bot
Setting up the bot is a bit more involved. Once the bot has been created, you’ll be presented with the following screen. The App ID and Password are auto-generated by Microsoft, but you will need to make a note of the password and store it securely, as it is only displayed once in the app registration process. Next you want to select the language in which the bot is developed and deployed. You have the choice of C# or NodeJS. I opted for C# as it’s the language I am most familiar with. Choose your template, accept the T’s & C’s, and your bot is ready to be deployed!

Deploying the Bot
The bot has been created and configured, displaying its source code, which can be further tweaked in the browser or in Visual Studio. You can also embed your bot in a number of existing apps, websites and services.

Chatting with the Bot
The basic bot isn’t the most stimulating of conversational partners, but it is satisfying to see your creation talk back, even if it only repeats what you have just told it. The Bot Framework opens up many possibilities to make the services you offer engaging in a conversational way, and the Azure Bot Service makes the Bot Framework that much more accessible, letting you quickly deploy bots and have them out there, engaging with users.

Geographic Spatial Analysis with Azure Data Lake Analytics (ADLA)

Whilst working on an Azure Data Lake project, a requirement hit the backlog that could be easily solved with a Geographical Information System (GIS) or even SQL Server - spatial data type support was introduced into SQL Server 2008. However, Azure Data Lake Analytics (ADLA) does not natively support spatial data analytics, so we'll have to extract the data into another service, right? Wrong! :) Due to the extensibility of Azure Data Lake Analytics, we can enhance it to do practically anything. In fact, we can lean on existing components and enhance the service without having to develop the enhancement itself. This blog is a quick run-through demonstrating how to enhance ADLA so that it supports spatial analytics and meets our project requirement.

Problem

For simplicity I've trivialised the problem. Here's the requirement: indicate which bus stops are within 1.5 km of Southwark Tube Station. To support this requirement, we have two datasets:

A list of all the bus stops in London, including their geo location (circa 20k records)
The geo location record of Southwark Tube Station (a single record!)

In fact, the location of the tube station is pretty accurate and is geo-located to the entrance pavement outside the tube station. This would be an easy problem for a GIS to solve. You would specify the central point, i.e. our Southwark Tube Station marker, draw a circle, or buffer, with a radius of 1.5 km around it, and select all bus stops that fall within or intersect with that circle. This spatial analysis is easy for these systems, as that's essentially what they are built to do. SQL Server 2008 introduced the spatial data type; this allowed spatial-style analysis to be performed on geo data using T-SQL in conjunction with the supplied Geometry and Geography data types. More info on those can be found here. So, how can we solve our problem in ADLA, without a GIS and without having to export the data to SQL Server?
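For point data, the buffer-and-intersect test described above amounts to a great-circle distance check. As a rough illustration outside of ADLA, here is a Python sketch using the haversine formula; the Southwark coordinates are the ones from the post, but the bus stops are made-up records, not rows from the real 20k-row dataset:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))  # mean Earth radius ~6,371 km

# Southwark Tube Station (longitude, latitude) as given in the post.
southwark = (-0.104777, 51.503829)

# Hypothetical bus stops: Stop A is a few streets away, Stop B is ~4 km west.
bus_stops = [
    {"StopName": "Stop A", "Longitude": -0.106, "Latitude": 51.504},
    {"StopName": "Stop B", "Longitude": -0.160, "Latitude": 51.520},
]

# Keep every stop within 1500 m of the central point - the same filter the
# STBuffer(1500)/STIntersects combination expresses for point geometries.
close = [s for s in bus_stops
         if haversine_m(*southwark, s["Longitude"], s["Latitude"]) <= 1500]
print([s["StopName"] for s in close])  # -> ['Stop A']
```

The SQL Server spatial types do this (and much more, e.g. for polygons and linestrings) on the correct ellipsoidal model, which is why registering those assemblies is still the better answer at scale.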
Solution

You can register existing assemblies with ADLA. It so happens that the SQL Server data types and spatial assemblies are nicely packaged up and can be used directly within ADLA itself - think about that, it's pretty awesome! Caveat: at the time of writing we have no idea of the licence implications; it will be up to you to ensure you are not in breach :) Those assemblies can be downloaded from here. You only need to download and install the following file: ENU\x64\SQLSysClrTypes.msi. This installs two key assemblies, which you'll need to grab and upload to your Data Lake Store:

C:\Program Files (x86)\Microsoft SQL Server\130\SDK\Assemblies\Microsoft.SqlServer.Types.dll
C:\Windows\System32\SqlServerSpatial130.dll

Once they have been uploaded to your Data Lake Store, you need to register those assemblies with ADLA:

DECLARE @ASSEMBLY_PATH string = "/5.UTILITY/USQL-Extend/SQL-Server/";
DECLARE @TYPES_ASM string = @ASSEMBLY_PATH + "Microsoft.SqlServer.Types.dll";
DECLARE @SPATIAL_ASM string = @ASSEMBLY_PATH + "SqlServerSpatial130.dll";

CREATE DATABASE IF NOT EXISTS SQLServerExtensions;
USE DATABASE SQLServerExtensions;

DROP ASSEMBLY IF EXISTS SqlSpatial;
CREATE ASSEMBLY SqlSpatial
FROM @TYPES_ASM
WITH ADDITIONAL_FILES =
(
    @SPATIAL_ASM
);

Following registration of the assemblies, we can see the registration loaded in the ADLA catalog database we created. We are now ready to use this U-SQL enhancement in our U-SQL query - let's go right ahead and solve our problem in one U-SQL script.

// Reference the assemblies we require in our script.
// System.Xml we get for free as a system assembly, so we didn't need to register that,
// unlike our SQLServerExtensions.SqlSpatial assembly.
REFERENCE SYSTEM ASSEMBLY [System.Xml];
REFERENCE ASSEMBLY SQLServerExtensions.SqlSpatial;

// Once the appropriate assemblies are registered, we can alias them using the USING keyword.
USING Geometry = Microsoft.SqlServer.Types.SqlGeometry;
USING Geography = Microsoft.SqlServer.Types.SqlGeography;
USING SqlChars = System.Data.SqlTypes.SqlChars;

// First create the centralised point.
// In this case it's the pavement outside the entrance of Southwark Tube Station, London.
// Format is Longitude, Latitude and then SRID.
// NB: It's Longitude then Latitude - the opposite way round to what you might expect.
DECLARE @southwarkTube Geography = Geography.Point(-0.104777, 51.503829, 4326);

// Next we extract our entire London bus stop dataset from the file.
// There are about 20k of them.
@busStopInput =
    EXTRACT [StopCode] string,
            [StopName] string,
            [Latitude] double?,
            [Longitude] double?
    FROM @"/1.RAW/OpenData/Transport/bus-stops-narrow-full-london.csv"
    USING Extractors.Csv(skipFirstNRows: 1, silent: true);

// This is effectively the transform step, and where the magic happens.
// Very similar syntax to what you would do in T-SQL.
// We are returning all the bus stops that fall within 1500m of Southwark Tube -
// essentially all stops that intersect with a 1500m buffer around the central tube point.
@closeBusStops =
    SELECT *
    FROM @busStopInput
    WHERE @southwarkTube.STBuffer(1500).STIntersects(Geography.Point((double)@busStopInput.Longitude, (double)@busStopInput.Latitude, 4326)).ToString() == "True";

// The results are written out to a csv file.
OUTPUT @closeBusStops
TO "/4.LABORATORY/Desks/Sach/spatial-closebusstops.csv"
USING Outputters.Csv(outputHeader: true);

The query outputs a list of bus stops that are within the specified distance of Southwark Tube Station. If we have a look at all the bus stops (in red) and overlay all the 'close' bus stops (in green), we can see the results. Pretty neat. Azure Data Lake Analytics does not natively support spatial data analytics, but by simply utilising the assemblies that ship with SQL Server, we can extend the capability of U-SQL to provide that functionality - or practically any functionality we desire.

Data Flow Pt 2: Vertexes In Azure Data Lake

Following on from my previous post on Job Execution in the Azure Data Lake, this post will explore the concept of extents and how they are utilised by vertexes in the actual job processing in Azure. The U-SQL script that has been authored, compiled and deployed is the logical plan of how the author intends to transform input data into output data. This creates a total amount of work - essentially the amount of data it has to process - which is decomposed into a set of vertexes. Each vertex will process a subset of data, or extents (see Azure Data Lake Storage for more information), and represents a fraction of the total.

Vertexes are displayed, or grouped, in a super vertex, also known as a stage. Vertexes in each stage are doing the same operation on a different part of the same data. The number of vertexes in a stage indicates the maximum theoretical parallelisation of that stage. The containers requested for the job will be allocated to complete each vertex. Say there is a 10GB file: this file will be split into 40 extents and allocated to at least 10 vertexes. If one wants to process all of the file in parallel, then requesting 10 containers will allow for concurrent parallelism. All this is visualised as the job graph.

If you have multiple transformations going on in your U-SQL script, this will create multiple stages, and the output of one vertex becomes the input of another vertex in a dependent stage. As a result, dependent stages can begin processing even if preceding stages haven't completed processing. If a dependency does exist, it will create a critical path - the dependency chain of vertexes which keeps the job running to the very end, because the vertex at the bottom depends on the output of the vertex at the top. This can be seen in the Vertex Execution View in the Visual Studio Azure Data Lake Tools. It's useful for optimising a job: by checking which vertex takes the longest, you can see where to re-position or re-write elements of your script.
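The 10GB example above can be sketched as simple arithmetic. The 250MB extent size and the ~1GB-of-data-per-vertex figure are assumptions for illustration (the post's 10GB-to-40-extents example implies 250MB extents, taking 1GB as 1,000MB), not documented constants:

```python
from math import ceil

EXTENT_MB = 250        # assumed ADLS extent size
VERTEX_DATA_MB = 1000  # assumed data handled per vertex (~1GB)

def plan(file_mb):
    """Rough decomposition of a file into extents and vertexes."""
    extents = ceil(file_mb / EXTENT_MB)
    vertexes = ceil(file_mb / VERTEX_DATA_MB)
    return extents, vertexes

# A 10GB (10,000MB) file, as in the post's example.
extents, vertexes = plan(10_000)
print(extents, vertexes)  # -> 40 10
```

With 10 vertexes in the stage, requesting 10 containers (ADLAUs) gives full concurrent parallelism; requesting more buys nothing for that stage.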
It may not be possible to use all the reserved parallelism during a stage if there are fewer vertexes than Azure Data Lake Analytics Units (ADLAUs) available. For example, if I have 10 ADLAUs, that's great for an early stage, as all ADLAUs will be allocated a vertex; but with later stages, more and more ADLAUs will be idle, so by the last stage only 2 of the 14 are utilised. Unfortunately it is not currently possible to dynamically de-allocate ADLAUs during a job.

In conclusion, by understanding how vertexes interact with extents and influence parallelism, you should be able to provision adequate resources for your jobs, as well as know where to start the investigation for any long-running queries you might have, using the Vertex Execution View. In the next post of the series I will be looking into resource costs and optimising spend and the time taken to process data.