Adatis BI Blogs

Extraction and Analysis of GeoSpatial data with Azure Data Lake Analytics

I recently had to implement a solution to prove it was possible to integrate a shape file (.SHP) in Azure Data Lake Store (ADLS) for subsequent geospatial analysis using Azure Data Lake Analytics (ADLA). A shape file is a data set used by geographic analysis applications that stores a collection of geographic features, such as streets or zip code boundaries, in the form of points, lines or area features.

As you may have already figured, storing a shape file in ADLS is not difficult. But how can you use ADLA to obtain the geographic data from the file? In this blog I'll explain how we can extract the data to a supported format, such as CSV, and use it to run geospatial analysis in ADLA, with the support of the spatial data types introduced in SQL Server 2008 (details here).

As always, whenever we face a limitation of ADLA, C# is our best friend. In order to read the content of a shape file, we need to start by adding a geospatial assembly to our solution, which, in my case, was the "Catfood" ESRI Shapefile Reader (details here).

The shape file used in this example contains a list of parks in London. The following code demonstrates how to extract the metadata and the geographic shapes to a CSV file. The only shapes extracted are polygons, although it is possible to add more if needed.

    using System.Data;
    using System.Windows.Forms;   // only needed for the MessageBox call below
    using Catfood.Shapefile;

    public static void CreateWorkForThreads()
    {
        // Create a new dataset and store the data in a table
        DataSet ds = CreateNewDataSet();
        DataTable dt = ds.Tables[0];
        int i;
        int count = 0;

        // Parse the shapefile and select the columns we are interested in
        using (Shapefile shapefile = new Shapefile(@"path\file.shp"))
        {
            foreach (Shape shape in shapefile)
            {
                string[] metadataNames = shape.GetMetadataNames();
                string geometry = "";
                int countParts = 0;
                int countShape = 0;
                DataRow dr = dt.NewRow();

                // Extract the metadata. The first iteration will extract the name of the columns.
                // Note: the first shape is consumed to build the header row, so its own values are not written.
                if (metadataNames != null)
                {
                    foreach (string metadataName in metadataNames)
                    {
                        if (count == 0)
                            dr[metadataName] = metadataName;
                        else
                            dr[metadataName] = shape.GetMetadata(metadataName);
                    }
                }

                // Shape is not part of the metadata, so manually defining the name of the column
                if (count == 0)
                {
                    dr["shape"] = "shape";
                }
                else
                {
                    // Cast shape based on the type
                    switch (shape.Type)
                    {
                        case ShapeType.Point:
                            // A point is just a single x/y point
                            ShapePoint shapePoint = shape as ShapePoint;
                            MessageBox.Show("Point (" + shapePoint.Point.X.ToString() + ", " + shapePoint.Point.Y.ToString() + ")");
                            break;

                        case ShapeType.Polygon:
                            // A polygon contains one or more parts - each part is a list of points which
                            // are clockwise for boundaries and anti-clockwise for holes
                            ShapePolygon shapePolygon = shape as ShapePolygon;
                            foreach (PointD[] part in shapePolygon.Parts)
                            {
                                countShape = 0;

                                if (countParts == 0)
                                    geometry = "(";
                                else
                                    geometry = geometry + " | (";

                                foreach (PointD point in part)
                                {
                                    // Points are separated by | instead of , so they survive the CSV format
                                    if (part.Length - 1 != countShape)
                                        geometry = geometry + point.X + " " + point.Y + " |";
                                    else
                                        geometry = geometry + point.X + " " + point.Y + " )";

                                    countShape++;
                                }

                                countParts++;
                            }
                            break;

                        default:
                            break;
                    }

                    // Build our Polygon.
                    // E.g. POLYGON((-122.358 47.653 |-122.348 47.649 |-122.348 47.658 |-122.358 47.658 |-122.358 47.653 ))
                    dr["shape"] = "POLYGON(" + geometry + ")";
                }

                dt.Rows.Add(dr);
                count++;
            }
        }

        // Extract the data to a csv file
        using (System.IO.StreamWriter sw = new System.IO.StreamWriter(@"path\filename.csv"))
        {
            foreach (DataRow row in dt.Rows)
            {
                object[] array = row.ItemArray;
                for (i = 0; i < array.Length - 1; i++)
                {
                    sw.Write(array[i].ToString() + ",");
                }
                sw.WriteLine(array[i].ToString());
            }
        }
    }

    public static DataSet CreateNewDataSet()
    {
        DataSet dsTemp = new DataSet();
        DataTable dtTemp = new DataTable("londonparks");
        dtTemp.Columns.Add("id", typeof(string));
        dtTemp.Columns.Add("parkname", typeof(string));
        dtTemp.Columns.Add("street", typeof(string));
        dtTemp.Columns.Add("postcode", typeof(string));
        dtTemp.Columns.Add("shape", typeof(string));
        dsTemp.Tables.Add(dtTemp);
        return dsTemp;
    }

Now that we have a valid file that can be processed by ADLA, we can upload it to ADLS and start performing geospatial analysis. To do so, I simply followed the logic described in Sacha's blog (here).

The following U-SQL works with a dataset that contains details of the trajectory of a courier, tracked on a daily basis. With this code, we identify whether a courier drove by a park by using the Intersect function. Because we have to cross two datasets, a C# function was created to help evaluate multiple events.

    // Reference the assemblies we require in our script.
    REFERENCE SYSTEM ASSEMBLY [System.Xml];
    REFERENCE ASSEMBLY [SQLServerExtensions].[SqlSpatial];
    REFERENCE ASSEMBLY [USQL.Core];

    // Once the appropriate assemblies are registered, we can alias them using the USING keyword.
    USING Geometry = Microsoft.SqlServer.Types.SqlGeometry;
    USING Geography = Microsoft.SqlServer.Types.SqlGeography;
    USING SqlChars = System.Data.SqlTypes.SqlChars;
    USING [USQL].[Core].[Utilities];

    // Extract the list of parks
    @parks =
        EXTRACT [ID] string,
                [PARKNAME] string,
                [STREET] string,
                [POSTCODE] string,
                [SHAPE] string
        FROM "RAW/Parks.csv"
        USING Extractors.Text(delimiter : ',', silent : false, quoting : true, skipFirstNRows : 1);

    // Extract data from the file containing the courier trajectory
    @trajectories =
        EXTRACT GPSDateTimeUTC DateTime,
                ReceivedDatetimeUTC DateTime,
                VehicleKey string,
                Altitude int,
                Longitude double,
                Latitude double,
                Distance decimal,
                VehicleSpeedMph decimal
        FROM "CURATED/Trajectory/Trajectory.TXT"
        USING Extractors.Text(delimiter : '|', silent : false, quoting : true, skipFirstNRows : 1);

    // Get the list of vehicles that drove by the park.
    @vehicleIntersection =
        SELECT DISTINCT a.*,
                        "1" AS VehicleIntersected
        FROM @trajectories AS a
             CROSS JOIN
                 @parks AS b
        WHERE Utilities.Intersect(b.[SHAPE], a.[Longitude], a.[Latitude]).ToString() == "True";

    // Get the list of vehicles that didn't drive by the park.
    @vehicleWithoutIntersection =
        SELECT a.*,
               "0" AS VehicleIntersected
        FROM @trajectories AS a
             LEFT JOIN
                 @vehicleIntersection AS b
             ON b.VehicleKey == a.VehicleKey
                AND b.GPSDateTimeUTC == a.GPSDateTimeUTC
        WHERE b.VehicleKey IS NULL;

    // Union both datasets to get the complete set of data
    @finalData =
        SELECT *
        FROM @vehicleIntersection
        UNION ALL
        SELECT *
        FROM @vehicleWithoutIntersection;

    // Export the results to a csv file
    OUTPUT @finalData
    TO "LABORATORY/GeoSpatialIntersection.csv"
    USING Outputters.Text(outputHeader : true, delimiter : ',', quoting : true);

And here is the C# function. It accepts three parameters and calculates the intersection of a point with a shape.

    public static string Intersect(string shape, double longitude, double latitude)
    {
        // Because we had a csv file, the coordinates in the polygon were separated by |
        // It is important to use the .MakeValid() method to repair any invalid shape
        // If the dataset had multipolygon shapes, without MakeValid() the function would throw an error
        var g = Geography.STGeomFromText(new SqlChars(shape.Replace('|', ',')), 4326).MakeValid();

        // SqlGeography.Point expects latitude first, then longitude (the opposite order to WKT)
        var h = Geography.Point(latitude, longitude, 4326);

        return g.STIntersects(h).ToString();
    }

As always, if you have any questions or comments, do let me know.
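If you want to check an extracted row locally before uploading it to ADLS, the pipe-separated encoding and the intersection test can be approximated in a few lines of Python. This is only a minimal sketch and not part of the ADLA solution: the function names are hypothetical, the sample square and hole are made-up test data, and the point-in-polygon test is a simple planar ray-casting check, not the geodesic calculation that SqlGeography performs.

```python
import re


def parts_to_pipe_polygon(parts):
    """Encode polygon parts the same way the C# extractor does:
    points inside a part are separated by ' |' and parts by ' | '."""
    rings = ["(" + " |".join(f"{x} {y}" for x, y in part) + " )" for part in parts]
    return "POLYGON(" + " | ".join(rings) + ")"


def pipe_polygon_to_wkt(text):
    """Undo the CSV-safe encoding, mirroring shape.Replace('|', ',')."""
    return text.replace("|", ",")


def parse_rings(wkt):
    """Return every ring of a POLYGON as a list of (x, y) float tuples."""
    rings = []
    for ring_text in re.findall(r"\(([^()]+)\)", wkt):
        ring = [tuple(float(v) for v in pair.split()) for pair in ring_text.split(",")]
        rings.append(ring)
    return rings


def point_in_ring(x, y, ring):
    """Planar ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_in_polygon(x, y, rings):
    """Even-odd rule: inside the boundary ring but not inside a hole."""
    return sum(point_in_ring(x, y, ring) for ring in rings) % 2 == 1


# Made-up test data: a 4x4 square with a 1x1 hole.
sample_parts = [
    [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)],
    [(1, 1), (1, 2), (2, 2), (2, 1), (1, 1)],
]
encoded = parts_to_pipe_polygon(sample_parts)
sample_rings = parse_rings(pipe_polygon_to_wkt(encoded))
```

Because the encoded string contains no commas, it can sit in a single CSV column without quoting; replacing the pipes restores text that a WKT parser accepts.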

Geographic Spatial Analysis with Azure Data Lake Analytics (ADLA)

Whilst working on an Azure Data Lake project, a requirement hit the backlog that could be easily solved with a Geographical Information System (GIS) or even SQL Server, since spatial data type support was introduced in SQL Server 2008. However, Azure Data Lake Analytics (ADLA) does not natively support spatial data analytics, so we'll have to extract the data into another service, right? Wrong :)

Due to the extensibility of Azure Data Lake Analytics, we can enhance it to do practically anything. In fact, we can lean on existing components and enhance the service without having to develop the enhancement itself. This blog is a quick run through demonstrating how to enhance ADLA so that it supports spatial analytics and meets our project requirement.

Problem

For simplicity I've trivialised the problem. Here's the requirement: indicate which bus stops are within 1.5 km of Southwark Tube Station.

To support this requirement, we have two datasets:

- A list of all the bus stops in London, including their geo location (circa 20k records)
- The geo location record of Southwark Tube Station (a single record!)

In fact, the location of the tube station is pretty accurate and is geo located to the entrance pavement outside the tube station.

This would be an easy problem for a GIS to solve. You would specify the central point, i.e. our Southwark Tube Station marker, draw a circle, or buffer, with a radius of 1.5 km around it, and select all bus stops that fall within or intersect with that circle. This spatial analysis is easy for these systems as that's essentially what they are built to do.

SQL Server 2008 introduced the spatial data types, which allow spatial style analysis to be performed on geo data using T-SQL in conjunction with the supplied Geometry and Geography data types. More info on those can be found here.

So, how can we solve our problem in ADLA, without a GIS and without having to export the data to SQL Server?

Solution

You can register existing assemblies with ADLA. It so happens that the SQL Server Data Types and Spatial assemblies are nicely packaged up and can be used directly within ADLA itself. Think about that, it's pretty awesome!

Caveat: At the time of writing we have no idea of the licence implications. It will be up to you to ensure you are not in breach :)

Those assemblies can be downloaded from here. You only need to download and install the following file:

    ENU\x64\SQLSysClrTypes.msi

This installs two key assemblies, which you'll need to grab and upload to your Data Lake Store:

    C:\Program Files (x86)\Microsoft SQL Server\130\SDK\Assemblies\Microsoft.SqlServer.Types.dll
    C:\Windows\System32\SqlServerSpatial130.dll

Once they have been uploaded to your Data Lake Store, you need to register those assemblies with ADLA.

    DECLARE @ASSEMBLY_PATH string = "/5.UTILITY/USQL-Extend/SQL-Server/";
    DECLARE @TYPES_ASM string = @ASSEMBLY_PATH + "Microsoft.SqlServer.Types.dll";
    DECLARE @SPATIAL_ASM string = @ASSEMBLY_PATH + "SqlServerSpatial130.dll";

    CREATE DATABASE IF NOT EXISTS SQLServerExtensions;
    USE DATABASE SQLServerExtensions;

    DROP ASSEMBLY IF EXISTS SqlSpatial;
    CREATE ASSEMBLY SqlSpatial
    FROM @TYPES_ASM
    WITH ADDITIONAL_FILES =
         (
             @SPATIAL_ASM
         );

Following registration of the assemblies, we can see the registration loaded in the ADLA Catalog database we created.

We are now ready to use this U-SQL enhancement in our U-SQL query. Let's go right ahead and solve our problem in one U-SQL script.

    // Reference the assemblies we require in our script.
    // System.Xml we get for free as a System Assembly, so we only needed to register
    // our SQLServerExtensions.SqlSpatial assembly.
    REFERENCE SYSTEM ASSEMBLY [System.Xml];
    REFERENCE ASSEMBLY SQLServerExtensions.SqlSpatial;

    // Once the appropriate assemblies are registered, we can alias them using the USING keyword.
    USING Geometry = Microsoft.SqlServer.Types.SqlGeometry;
    USING Geography = Microsoft.SqlServer.Types.SqlGeography;
    USING SqlChars = System.Data.SqlTypes.SqlChars;

    // First create the centralised point.
    // In this case it's the pavement outside the entrance of Southwark Tube Station, London.
    // Geography.Point takes Latitude, Longitude and then SRID.
    // NB: That is the opposite order to WKT, which is Longitude then Latitude.
    DECLARE @southwarkTube Geography = Geography.Point(51.503829, -0.104777, 4326);

    // Next we extract our entire London bus stop data set from the file.
    // There's about 20k of them.
    @busStopInput =
        EXTRACT [StopCode] string,
                [StopName] string,
                [Latitude] double?,
                [Longitude] double?
        FROM @"/1.RAW/OpenData/Transport/bus-stops-narrow-full-london.csv"
        USING Extractors.Csv(skipFirstNRows : 1, silent : true);

    // This is effectively the transform step and where the magic happens.
    // Very similar syntax to what you would do in T-SQL.
    // We are returning all the bus stops that fall within 1500m of Southwark Tube.
    // Essentially we return all stops that intersect with a 1500m buffer around the central tube point.
    @closeBusStops =
        SELECT *
        FROM @busStopInput
        WHERE @southwarkTube.STBuffer(1500).STIntersects(Geography.Point((double) @busStopInput.Latitude, (double) @busStopInput.Longitude, 4326)).ToString() == "True";

    // The results are written out to a csv file.
    OUTPUT @closeBusStops
    TO "/4.LABORATORY/Desks/Sach/spatial-closebusstops.csv"
    USING Outputters.Csv(outputHeader : true);

The query outputs a list of bus stops that are within the specified distance of Southwark Tube Station. If we plot all the bus stops (in red) and overlay the 'close' bus stops (in green), we can see the results.

Pretty neat. Azure Data Lake Analytics does not natively support spatial data analytics, but by simply utilising the assemblies that ship with SQL Server, we can extend the capability of U-SQL to provide that functionality, or practically any functionality we desire.
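To sanity-check the output of the U-SQL script outside ADLA, the same "within 1500 m" rule can be approximated with a great-circle distance. The Python sketch below is an assumption-laden stand-in, not the method used in the script: it uses the haversine formula on a sphere (SQL Server's geography type works on the WGS 84 ellipsoid, so stops very close to the 1500 m boundary may be classified differently), and the two sample stops are hypothetical coordinates chosen only for illustration. The station coordinates are the ones from the script (latitude 51.503829, longitude -0.104777).

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)

# Station coordinates taken from the post, as (latitude, longitude).
SOUTHWARK_TUBE = (51.503829, -0.104777)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def stops_within(stops, centre, radius_m=1500):
    """Keep the stops whose distance from the centre is at most radius_m."""
    lat0, lon0 = centre
    return [s for s in stops if haversine_m(lat0, lon0, s["lat"], s["lon"]) <= radius_m]


# Hypothetical sample rows; real rows would come from the bus stop CSV.
sample_stops = [
    {"code": "NEAR", "lat": 51.5031, "lon": -0.1132},  # a few hundred metres west
    {"code": "FAR", "lat": 51.5074, "lon": -0.2278},   # several kilometres west
]

close_stops = stops_within(sample_stops, SOUTHWARK_TUBE)
```

Note that the arguments are passed latitude first, which matches the order Geography.Point expects.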

Azure Data Lake Store : Authorization Failed - The user does not have the permission to perform this operation

Just a quick one, as it took slightly longer to solve than it possibly should have done, so this might help others with the same problem. I was working fine with an Azure Data Lake Store for a few weeks until yesterday, when I couldn't access the files anymore. The Azure Portal reported the following: "Authorization Failed - The user does not have the permission to perform this operation". It turns out I was accessing it from a new location and the ADLS Firewall needed updating to whitelist my IP address. After adding my IP, the Data Lake Store is now accessible again.
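If you hit the same error, a quick way to confirm the diagnosis is to compare your current public IP with the start/end ranges listed in the firewall settings. The Python sketch below is purely illustrative: the function name is hypothetical, and the rule ranges and addresses are reserved documentation addresses, not real firewall rules.

```python
import ipaddress


def ip_allowed(client_ip, rules):
    """Return True if client_ip falls inside any (start_ip, end_ip) range."""
    ip = ipaddress.ip_address(client_ip)
    return any(
        ipaddress.ip_address(start) <= ip <= ipaddress.ip_address(end)
        for start, end in rules
    )


# Made-up rules, copied by hand from an imaginary firewall blade.
sample_rules = [
    ("203.0.113.10", "203.0.113.20"),    # office range
    ("198.51.100.7", "198.51.100.7"),    # single home address
]

office_ok = ip_allowed("203.0.113.15", sample_rules)
new_location_ok = ip_allowed("192.0.2.44", sample_rules)
```

If the check comes back False for your current address, adding it to the firewall rules is the fix described above.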