Adatis BI Blogs

How to support your organisation with Azure Cosmos DB Graph (Gremlin)?

Let me start this blog with two questions. 1) How can Joanna, from the HR department, query the mentor/mentee structure of the organisation? 2) How can Robin, from the resourcing department, find the most suitable and available consultant to start a new project next week? I’m sure at this point you are thinking that to solve both problems they could simply query the HR and the Resource Planning systems, and you are right, but what if they could get the answer to both questions from a single system? In this blog I’ll demonstrate how to meet this requirement using Azure Cosmos DB Graph (Gremlin).

Graph Theory

In graph theory, a graph is an ordered pair comprising a set of vertices and a set of edges. A vertex is the fundamental unit of which graphs are formed and is usually represented by a circle with a label. An edge is represented by a line or arrow extending from one vertex to another.

Graphs can be used to model many types of relations and processes in many different areas. For example, we can represent the link structure of a website as a graph, with the web pages as vertices and the links from one page to another as edges.

Azure Cosmos DB

Azure Cosmos DB is a globally distributed, multi-model database with the capacity to store various types of data, such as document, relational, graph and key-value. It provides all the necessary tools to elastically scale throughput and storage across any number of Azure's geographic regions (further details here). Azure Cosmos DB supports a variety of popular APIs to access and query the data, such as SQL, MongoDB, Cassandra, Graph (Gremlin) and Table API. In this instance, I will focus on the Graph (Gremlin) API.

Gremlin

Gremlin is the graph traversal language of Apache TinkerPop, an open source graph computing framework. Gremlin allows users to write complex queries to traverse their graphs by using a composed sequence of steps, with each step performing an operation on the data stream (further details here). There are 4 fundamental steps:

· transform: transform the objects in the stream
· filter: remove objects from the stream
· sideEffect: pass the object, but yield some side effect
· branch: decide which step to take

Scenario

The image on the left is an example of the mentor/mentee structure. If we convert it to a graph (image on the right), we have the people represented as vertices and the mentor relationships as edges.

Now, let’s create the database and the graph. When creating the Azure Cosmos DB account, we need to ensure we select the Gremlin (graph) API. To populate and query our graph, we have three options: the Azure Portal, the Gremlin Console or the Guided Gremlin Tour. The last two tools can be downloaded from the Quick Start section once we create the sample container, or downloaded directly here and here.

In the Azure Portal, CRUD operations can be performed via the Data Explorer UI and Gremlin queries. The results can be visualised as a graph or in GraphJSON format, the Gremlin standard format for representing vertices, edges and properties using JSON.

In the proposed scenario, we will create 7 vertices and 6 edges. The vertices will be labelled as person and will have 3 properties: firstName, position and inBench (which indicates whether or not a consultant is allocated to a project). The edges, labelled as mentor, will define the relationships between consultants.
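All of the Gremlin statements that follow can be run directly from the Data Explorer or the Gremlin Console. If you would rather submit them from code, a minimal sketch using the open source Gremlin.Net driver could look like the one below; the driver choice and the connection details (account, database, graph and key) are placeholders of mine rather than something prescribed in this post.

using System;
using System.Threading.Tasks;
using Gremlin.Net.Driver;
using Gremlin.Net.Structure.IO.GraphSON;

public static class GremlinExample
{
    public static async Task RunAsync()
    {
        // Placeholder connection details for a Cosmos DB Gremlin (graph) account
        var server = new GremlinServer(
            "<your-account>.gremlin.cosmosdb.azure.com", 443, enableSsl: true,
            username: "/dbs/<your-database>/colls/<your-graph>",
            password: "<your-primary-key>");

        // Cosmos DB expects the GraphSON2 wire format
        using (var client = new GremlinClient(server, new GraphSON2Reader(), new GraphSON2Writer(), GremlinClient.GraphSON2MimeType))
        {
            // Any traversal in this post can be submitted as a string in the same way
            var results = await client.SubmitAsync<dynamic>("g.V().hasLabel('person').count()");
            foreach (var result in results)
                Console.WriteLine(result);
        }
    }
}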
Create a vertex

g.addV('person').property('firstName', 'Tim').property('position', 'Director').property('inBench', '1')

Create an edge

g.V().hasLabel('person').has('firstName', 'Tim').addE('mentor').to(g.V().hasLabel('person').has('firstName', 'Neil'))

Below is how the graph looks when all the vertices and edges are created.

Joanna now has everything she needs to obtain valuable information. For example, she can traverse the graph and obtain the list of Principal and Senior Consultants that are mentored by Tim.

g.V().hasLabel('person').has('firstName', 'Tim').outE('mentor').inV().hasLabel('person').has('position', within('Principal Consultant','Senior Consultant')).group().by('firstName').by(values('position'))

[{
  "Neil": "Principal Consultant",
  "Scott": "Senior Consultant"
}]

To support Robin, we will add two new types of vertices, Blog and Project. The Blog vertex will be related to the Person vertex and will indicate who wrote blogs and which technology was covered. The Project vertex will be related to the Person vertex and will indicate who worked on a certain project. The Project vertex will have the client, the name of the project, the duration and the technology used as properties.

If Robin needs to find a consultant with experience in Azure Cosmos DB, he can query the graph and verify who either wrote a blog or worked on a previous project with that technology. On top of that, he can filter the results by indicating he is only interested in consultants that are currently on the bench.

g.V().hasLabel('person').where(outE('worked').or().outE('wrote').has('tech', 'Azure CosmosDB')).has('inBench', '1')

Conclusions

Hopefully, using a very simple scenario, I managed to demonstrate the potential of Azure Cosmos DB Graph to build complex queries, implement powerful graph traversal logic and help the business quickly obtain insights from very complex models.

As always, if you have any questions or comments do let me know.

Analysis of Spatial Data Using Cosmos DB

Introduction

Recently, while researching Cosmos DB, I came across its in-built capabilities for managing spatial data.

Cosmos DB is Microsoft’s globally distributed, multi-model database. It has the capability to store various types of data, such as document, graph and key-value, and can elastically scale to cope with varying needs. The piece of Cosmos DB that this post will be discussing is the spatial capability of the document model.

The problem I have chosen to solve using the spatial functionality is working out which airfields are within the range of an aircraft when given its true airspeed and fuel endurance in hours, with the range being calculated by multiplying the airspeed by the endurance.

The Data

The first step was to find a data set containing a list of all of the world’s airfields; this was found on GitHub at https://github.com/mwgg/Airports. The data set contains the details we need, which are:

· ICAO code – this is the unique identifier for an airport
· Airport name
· Latitude and longitude of the airport

The next step was to create a Cosmos DB account in the Azure Portal and write a C# app to do some transformations on the data and load the documents into our Cosmos DB collection.

I first created a C# object that matched the structure of the JSON data:

using Newtonsoft.Json;

namespace LoadAirportData
{
    public class Airport
    {
        [JsonProperty("id")]
        public string Id => this.Icao;
        [JsonProperty("icao")]
        public string Icao { get; set; }
        [JsonProperty("iata")]
        public string Iata { get; set; }
        [JsonProperty("name")]
        public string Name { get; set; }
        [JsonProperty("city")]
        public string City { get; set; }
        [JsonProperty("state")]
        public string State { get; set; }
        [JsonProperty("country")]
        public string Country { get; set; }
        [JsonProperty("elevation")]
        public int Elevation { get; set; }
        [JsonProperty("lat")]
        public double Latitude { get; set; }
        [JsonProperty("lon")]
        public double Longitude { get; set; }
        [JsonProperty("tz")]
        public string TimeZone { get; set; }
    }
}

I then created a C# object that matched the structure of the document I wanted to insert into Cosmos DB.
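To make the target shape concrete, here is roughly what one of those documents looks like once stored. The values are illustrative, taken from the Blackbushe Airport example used later in this post, and the location property follows the GeoJSON Point convention that Cosmos DB's spatial functions expect.

{
  "id": "EGLK",
  "iata": "BBS",
  "name": "Blackbushe Airport",
  "location": {
    "type": "Point",
    "coordinates": [-0.8475000262, 51.3238983154]
  }
}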
The "Point" data type is used to serialize the latitude and longitude into the GeoJSON structure that Cosmos DB supports:

using Microsoft.Azure.Documents.Spatial;
using Newtonsoft.Json;

namespace LoadAirportData
{
    public class AirportDocument
    {
        public AirportDocument(Airport airport)
        {
            Id = airport.Icao;
            Iata = airport.Iata;
            Name = airport.Name;
            Location = new Point(airport.Longitude, airport.Latitude);
        }

        [JsonProperty("id")]
        public string Id { get; set; }
        [JsonProperty("iata")]
        public string Iata { get; set; }
        [JsonProperty("name")]
        public string Name { get; set; }
        [JsonProperty("location")]
        public Point Location { get; set; }
    }
}

Finally, I created a method that dropped the Cosmos DB database if it already existed, recreated the database and the document collection, and then loaded the documents into the collection:

using Microsoft.Azure.Documents.Client;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Newtonsoft.Json;
using System.Net;
using System.Collections.Concurrent;

namespace LoadAirportData
{
    public class CosmosDbImporter
    {
        private const string ENDPOINT_URL = "<YOUR ENDPOINT>";
        private const string PRIMARY_KEY = "<YOUR KEY>";
        private const string DATABASE_NAME = "airports";
        private const string COLLECTION_NAME = "airportData";
        private const string IMPORT_PATH = @"C:\Temp\Airports.json";

        public async Task ImportData()
        {
            var documentClient = new DocumentClient(new Uri(ENDPOINT_URL), PRIMARY_KEY);

            // Delete the database if it exists
            try
            {
                await documentClient.DeleteDatabaseAsync(UriFactory.CreateDatabaseUri(DATABASE_NAME));
            }
            catch (DocumentClientException ex)
            {
                if (ex.StatusCode != HttpStatusCode.NotFound)
                    throw;
            }

            // Create the database
            await documentClient.CreateDatabaseIfNotExistsAsync(new Database() { Id = DATABASE_NAME });

            // Create the collection and switch on spatial indexing
            DocumentCollection collection = new DocumentCollection() { Id = COLLECTION_NAME };
            collection.IndexingPolicy = new IndexingPolicy(new SpatialIndex(DataType.Point));
            await documentClient.CreateDocumentCollectionIfNotExistsAsync(UriFactory.CreateDatabaseUri(DATABASE_NAME), collection);

            // Read the file and deserialize to our Airport object
            var data = System.IO.File.ReadAllText(IMPORT_PATH);
            var airports = JsonConvert.DeserializeObject<Dictionary<string, Airport>>(data);

            // Upload documents to Cosmos DB in parallel batches
            await Task.WhenAll(
                from partition in Partitioner.Create(airports.Values).GetPartitions(50)
                select Task.Run(async delegate
                {
                    using (partition)
                    {
                        while (partition.MoveNext())
                        {
                            Console.WriteLine($"Processing {partition.Current.Icao}");
                            var airportDocument = new AirportDocument(partition.Current);
                            await documentClient.CreateDocumentAsync(UriFactory.CreateDocumentCollectionUri(DATABASE_NAME, COLLECTION_NAME), airportDocument);
                        }
                    }
                }));
        }
    }
}

One thing to note is that the above code enables spatial indexing when creating the collection; without this enabled, performance is extremely poor when performing spatial queries.

The beauty of Cosmos DB is that it is able to elastically scale to the performance level specified by the user through the number of RUs (request units) allocated to the collection. While loading the data into Cosmos DB I wanted to scale up my database to take advantage of the multithreading in my C# app and speed up my processing, so I just went into the Azure Portal and adjusted the number of RUs allocated to the collection; the change was almost instant and my process immediately sped up.
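The same adjustment can also be made from code rather than the portal. The sketch below shows one way the collection's offer (its provisioned throughput) could be replaced using the same .NET SDK as the importer above; the helper class is my own addition rather than part of the original post.

using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

namespace LoadAirportData
{
    public static class ThroughputHelper
    {
        // Replaces the offer (RU/s setting) attached to a collection, e.g. 10000 RU/s while importing
        public static async Task ScaleCollectionAsync(DocumentClient documentClient,
            string databaseName, string collectionName, int requestUnits)
        {
            // Read the collection so we can locate the offer attached to it
            var collection = await documentClient.ReadDocumentCollectionAsync(
                UriFactory.CreateDocumentCollectionUri(databaseName, collectionName));

            // Find the current offer for the collection...
            Offer offer = documentClient.CreateOfferQuery()
                .Where(o => o.ResourceLink == collection.Resource.SelfLink)
                .AsEnumerable()
                .Single();

            // ...and replace it with a new one at the requested throughput
            await documentClient.ReplaceOfferAsync(new OfferV2(offer, requestUnits));
        }
    }
}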
Once I had finished importing the data I was able to scale my database back down so I’m no longer paying for any unused capacity.

Querying

Now the data is in Cosmos DB, we can begin to play around with some spatial queries.

To query airfields within a certain distance of a specified point, I can run the following query, which returns all of the airfields within 25 km of Blackbushe Airport. As you can see, the SQL syntax for querying Cosmos DB is very similar to T-SQL, meaning it’s very easy to re-use your SQL Server skills:

SELECT airports.id AS ICAO,
       airports.name AS Name,
       ST_DISTANCE(airports.location, {"type": "Point", "coordinates": [-0.8475000262, 51.3238983154]}) AS Distance
FROM airports
WHERE ST_DISTANCE(airports.location, {"type": "Point", "coordinates": [-0.8475000262, 51.3238983154]}) < 25000

The above query returns the following results, which are the 7 airfields that are within 25 km of Blackbushe:

[
  { "ICAO": "EGHL", "Name": "Lasham Airport", "Distance": 19964.7890768588 },
  { "ICAO": "EGVO", "Name": "RAF Odiham", "Distance": 11985.957064869535 },
  { "ICAO": "EGTF", "Name": "Fairoaks Airport", "Distance": 20229.321025944442 },
  { "ICAO": "EGLF", "Name": "Farnborough Airport", "Distance": 7286.035340157135 },
  { "ICAO": "EGLK", "Name": "Blackbushe Airport", "Distance": 0 },
  { "ICAO": "EGLM", "Name": "White Waltham Airfield", "Distance": 20312.693531316185 },
  { "ICAO": "EGLP", "Name": "Brimpton Airfield", "Distance": 23311.94703537874 }
]

The App

The next step is to create an application that uses this functionality to find the airfields within the range of an aircraft. To do this I created a basic ASP.NET MVC application that has a form with the following fields: base airfield (ICAO code), true airspeed (knots) and endurance (hours). When the form is submitted, the following C# code is executed:

[HttpPost]
public async Task<ActionResult> Index(AirportFinderModel model)
{
    var documentClient = new DocumentClient(new Uri(ENDPOINT_URL), PRIMARY_KEY);

    var baseAirfield = await documentClient.ReadDocumentAsync<AirportDocument>(
        UriFactory.CreateDocumentUri(DATABASE_NAME, COLLECTION_NAME, model.BaseAirfield));

    var availableDistanceMeters = (model.EnduranceHours * model.TrueAirspeed) * 1852;

    var result = documentClient
        .CreateDocumentQuery<AirportDocument>(UriFactory.CreateDocumentCollectionUri(DATABASE_NAME, COLLECTION_NAME))
        .Where(a => a.Location.Distance(baseAirfield.Document.Location) <= availableDistanceMeters)
        .ToList();

    return View("Result", new AirportFinderResultModel()
    {
        MapPoints = JsonConvert.SerializeObject(result.Select(a => new MapPoint()
        {
            Title = a.Id,
            Description = a.Name,
            Longitude = a.Location.Position.Longitude,
            Latitude = a.Location.Position.Latitude
        }))
    });
}

The above code connects to Cosmos DB and retrieves the details for the base airfield that was specified. It then calculates the range of the aircraft in meters by multiplying the endurance (in hours) by the true airspeed in knots (nautical miles per hour) and then multiplying that by 1,852 (the number of meters in a nautical mile); for example, 100 kts for 4.5 hours gives 450 nautical miles, or 833,400 meters. A LINQ query is then run against Cosmos DB using the built-in spatial functions to find the airfields within the specified distance.
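The original post does not show the view model classes referenced above (AirportFinderModel, MapPoint and AirportFinderResultModel). The sketch below is a reconstruction of roughly what they would need to look like, inferred from how they are used in the controller; the namespace and exact property types are my assumptions.

namespace AirportFinder.Models
{
    // Input captured by the search form (inferred from the controller action above)
    public class AirportFinderModel
    {
        public string BaseAirfield { get; set; }   // ICAO code of the starting airfield, e.g. "EGLK"
        public double TrueAirspeed { get; set; }   // knots (nautical miles per hour)
        public double EnduranceHours { get; set; } // fuel endurance in hours
    }

    // A single pin for the Google Map on the result page
    public class MapPoint
    {
        public string Title { get; set; }
        public string Description { get; set; }
        public double Longitude { get; set; }
        public double Latitude { get; set; }
    }

    // Passed to the Result view; MapPoints holds the serialized JSON array of MapPoint objects
    public class AirportFinderResultModel
    {
        public string MapPoints { get; set; }
    }
}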
The result of the query is then converted into a JSON array that can be understood by the Google Maps API being used on the client side. The client side uses the Google Maps API to plot the airfields on a map, giving us a view like the one below when given a base airfield of Blackbushe (EGLK), a true airspeed of 100 kts and an endurance of 4.5 hours:

The current functionality of the app is extremely basic, but it could easily be expanded to make the range calculation more accurate by looking at wind and other factors that can affect range. This could be done by creating a polygon representing our range and then using the ST_WITHIN function to find all airfields within that polygon. The functionality could also be enhanced to take into account other attributes of an airfield, deciding whether it is suitable based on runway length and other facilities.

Conclusion

As you can see, using Cosmos DB it is extremely easy to analyse spatial data, and the elastic scaling capability allows you to easily increase the capacity of your application to cope with increased usage and large amounts of data. The spatial capabilities of Cosmos DB can make it a viable alternative to other databases, such as SQL Server, that aren’t always as easy to scale on demand and don’t have the flexibility that a document model can give.

As always, if you have any questions or comments then let me know.