
Adatis BI Blogs

A quick look at U-SQL Database Projects

U-SQL Database Projects were released today (see here for the announcement from Microsoft), and they promise easier development and deployment. I've blogged in the past about using PowerShell to deploy U-SQL scripts, and I've worked with the older "U-SQL Project" project type quite a bit, so I wanted to try the new project out as quickly as I could to see what it has to offer. Once you've updated to the latest version of ADL Tools, you'll be able to select the new project type.

Developing

The project contains a number of database object templates, allowing you to get off the mark quickly with development, and it allows all of the usual Visual Studio organization with folders, plus IntelliSense in the code editor. Object templates available:

Here's the template of a Procedure Script item:

Source Control

The project integrates much more nicely with TFS than the older "U-SQL Project" does. It gives you the icons (padlock, check mark, etc.) in Solution Explorer, so it actually looks like it's under source control! Something I'd really hoped had been fixed, but hasn't: when copying and renaming an existing item, the rename isn't recognized, and you have to undo the checkout of the non-existent object (the copy, before it was renamed). Something that has been fixed is that with the older project style you wouldn't be able to edit the "Procedure2.usql" item from my previous example until you'd checked it in - now you can!

Deployment

Where you're deploying to is controlled by the Cloud Explorer in Visual Studio and which Data Lakes you have access to. By right-clicking the project and selecting Deploy, you are presented with a simple dialogue box where you select which Data Lake Analytics account to deploy to, making it nice and easy to deploy to development vs UAT, etc. The Database Name specified in this textbox is the name of the database your project will be deployed to. Unfortunately, it is just a free-form textbox, so be careful of any typos!

PII Anonymisation and Self-Joins in U-SQL

It's been a while since I wrote my last blog, so I decided to share one of the latest challenges I faced in a project.

Requirements:
- Anonymise any Personally Identifiable Information (PII) data stored in an Azure Data Lake Store (ADLS);
- Anonymise any PII data from customers identified in a configuration file.

Details:
- All PII data should be anonymised as soon as the files land in the DEV or UAT ADLS;
- The PII data landing in the PROD ADLS should only be anonymised if identified in the configuration file.

Goal: Create a single U-SQL pattern to achieve the requirements.

Step 1
Reference the assemblies and declare the variables. Pay special attention to the variable @Environment. This will be dynamically populated by the Azure Data Factory (ADF) pipeline activity and identifies the environment in which the U-SQL is executed (see the footnote at the end of this post for supplying the value at submission time when testing outside ADF).

REFERENCE ASSEMBLY [USQL.Core];
USING [USQL].[Core].[Utilities];
USING [USQL].[Core].[Anonymisation];

//Set variables
DECLARE @month string = DateTime.UtcNow.ToString("MM");
DECLARE @day string = DateTime.UtcNow.ToString("dd");
DECLARE @year string = DateTime.UtcNow.ToString("yyyy");
DECLARE @schemaVersion int = 1;
DECLARE @Environment string = "DEV";
DECLARE @inputLocation string = "RAW/Sensitive/" + @schemaVersion + "/" + @year + "/" + @month + "/" + @day + "/{*}.csv";
DECLARE @outputLocation string = "RAW/Anonymized/" + @schemaVersion + "/" + @year + "/" + @month + "/" + @day + "/Customers.csv";
DECLARE @configLocation string = "RAW/Config/Configuration.csv";

Step 2
Extract the data from the source and configuration files. The configuration file only includes an ID that identifies a customer.

//Extract data from source file
@ExtractSourceData =
    EXTRACT [CustomerId] string,
            [FirstName] string,
            [LastName] string,
            [EmailAddress] string,
            [HomeTel] string,
            [MobileNumber] string,
            [Address] string,
            [PostalCode] string
    FROM @inputLocation
    USING Extractors.Text(delimiter : '|', silent : false, quoting : true, skipFirstNRows : 1);

//Extract data from the configuration file
@ExtractConfigurationData =
    EXTRACT [Id] string
    FROM @configLocation
    USING Extractors.Text(silent : true, quoting : true, skipFirstNRows : 1);

Step 3
Create two rowsets, one with the distinct list of CustomerId from the source file and the other with the distinct list of Id from the configuration file.

//Obtain a list of distinct CustomerId from source file
@SelectSourceData =
    SELECT DISTINCT [CustomerId]
    FROM @ExtractSourceData;

//Obtain a list of distinct Ids from configuration file
@SelectConfigurationData =
    SELECT DISTINCT [Id]
    FROM @ExtractConfigurationData;

Step 4
This is one of the most important details in this script. U-SQL does not support self-joins, which is what we would need to ensure we anonymise all data when in a non-production environment. To overcome this limitation, we create a new rowset that unions the IDs from the source and configuration files.

//Create a new rowset to use in place of a self-join
@UnionIds =
    SELECT [CustomerId], "" AS [Id]
    FROM @SelectSourceData
    UNION
    SELECT "" AS [CustomerId], [Id]
    FROM @SelectConfigurationData;

Step 5
In this step, we identify which records should and shouldn't be anonymised. If you remember from the requirements, if the data is in a non-production environment we have to anonymise all PII data; however, if we are in production, we should only anonymise the records identified in the configuration file. This could easily be achieved with a self-join but, because that isn't supported by U-SQL, we use the rowset from the previous step.
//Identify records to be anonymised
@FlagAnonymiseRecords =
    SELECT DISTINCT A.[CustomerId],
                    [FirstName],
                    [LastName],
                    [EmailAddress],
                    [HomeTel],
                    [MobileNumber],
                    [Address],
                    [PostalCode]
    FROM @ExtractSourceData AS A
    JOIN @UnionIds AS B
        ON A.[CustomerId] == (@Environment == "PROD" ? B.[Id] : B.[CustomerId]);

//Identify records that shouldn't be anonymised.
//ANTISEMIJOIN works as a SQL NOT IN
@FlagDoNotAnonymiseRecords =
    SELECT DISTINCT [CustomerId],
                    [FirstName],
                    [LastName],
                    [EmailAddress],
                    [HomeTel],
                    [MobileNumber],
                    [Address],
                    [PostalCode]
    FROM @ExtractSourceData AS A
    ANTISEMIJOIN
    (
        SELECT DISTINCT [CustomerId]
        FROM @FlagAnonymiseRecords
    ) AS B
        ON A.[CustomerId] == B.[CustomerId];

Step 6
Now that we have identified the records that should be anonymised, we can start applying the correct mask. This is achieved by using different classes created in an assembly that is registered in Azure Data Lake Analytics (ADLA).

//Anonymise data
@AnonymiseData =
    SELECT [CustomerId],
           Utilities.ReturnLenght([FirstName]) == "0" ? [FirstName] : Anonymisation.AnonymiseForename([CustomerId], [FirstName]) AS [FirstName],
           Utilities.ReturnLenght([LastName]) == "0" ? [LastName] : Anonymisation.AnonymiseSurname([CustomerId], [LastName]) AS [LastName],
           Utilities.ReturnLenght([EmailAddress]) == "0" ? [EmailAddress] : Anonymisation.AnonymiseEmail([EmailAddress]) AS [EmailAddress],
           Utilities.ReturnLenght([HomeTel]) == "0" ? [HomeTel] : Anonymisation.AnonymiseNumbers([HomeTel]) AS [HomeTel],
           Utilities.ReturnLenght([MobileNumber]) == "0" ? [MobileNumber] : Anonymisation.AnonymiseNumbers([MobileNumber]) AS [MobileNumber],
           [Address], //Address is carried through here; apply the appropriate anonymisation helper if required
           Utilities.ReturnLenght([PostalCode]) == "0" ? [PostalCode] : Anonymisation.AnonymisePostalCode([PostalCode]) AS [PostalCode]
    FROM @FlagAnonymiseRecords;

Step 7
The last step in this process is to union the anonymised and non-anonymised rowsets and output the file to the ADLS.

//Union anonymised and non-anonymised data
@FullData =
    SELECT [CustomerId],
           [FirstName],
           [LastName],
           [EmailAddress],
           [HomeTel],
           [MobileNumber],
           [Address],
           [PostalCode]
    FROM @AnonymiseData
    UNION
    SELECT [CustomerId],
           [FirstName],
           [LastName],
           [EmailAddress],
           [HomeTel],
           [MobileNumber],
           [Address],
           [PostalCode]
    FROM @FlagDoNotAnonymiseRecords;

//Select data for output
@Output =
    SELECT [CustomerId],
           [FirstName],
           [LastName],
           [EmailAddress],
           [HomeTel],
           [MobileNumber],
           [Address],
           [PostalCode]
    FROM @FullData;

//Output data to destination
OUTPUT @Output
TO @outputLocation
USING Outputters.Text(outputHeader : true, delimiter : '|', quoting : true);

As always, if you have any questions or comments do let me know.
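A footnote on Step 1: when testing this pattern outside of ADF, the @Environment value can be supplied at submission time rather than edited in the script. The sketch below is a minimal, hypothetical example (the ADLA account name and script path are placeholders, and cmdlet and parameter names assume a recent AzureRM.DataLakeAnalytics module); it also assumes the script's declaration is changed to DECLARE EXTERNAL @Environment so the injected value takes precedence.

# Minimal sketch: submit the anonymisation script with an environment value injected at the top,
# mirroring the way the ADF pipeline activity would supply it. Assumes an existing Azure login.
$adlaAccount = "myadlaaccount"                 # placeholder ADLA account name
$scriptPath  = "C:\Scripts\AnonymisePII.usql"  # placeholder local copy of the script above

# Prepend a DECLARE for the environment. The script itself should use
# DECLARE EXTERNAL @Environment string = "DEV"; so that this injected value wins.
$scriptContents = 'DECLARE @Environment string = "UAT";' + "`r`n" +
                  [IO.File]::ReadAllText($scriptPath)

Submit-AzureRmDataLakeAnalyticsJob `
    -Account $adlaAccount `
    -Name "AnonymisePII-UAT" `
    -Script $scriptContents | Out-Null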

Deploying Multiple U-SQL Procedures with PowerShell

If you'd like to deploy multiple U-SQL procedures without having to open each one in Visual Studio and submit the job to Data Lake Analytics manually, here's a PowerShell script which you can point at a folder location containing your .usql files to loop through them and submit them for you. This method relies on the Login-AzureRmAccount command with a service principal, which you can learn more about here.

The Script

$azureAccountName = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
$azurePassword = ConvertTo-SecureString "xxxxxxxxxxxxxxxxxxxx/xxxxxxxxxxxxxxx/xxxxxx=" -AsPlainText -Force
$psCred = New-Object System.Management.Automation.PSCredential($azureAccountName, $azurePassword)
$psTenantID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
$adla = "zachstagersadla"
$fileLocation = "C:\SourceControl\USQL.Project\*.usql"

Login-AzureRmAccount -ServicePrincipal -TenantID $psTenantID -Credential $psCred | Out-Null

ForEach ($file in Get-ChildItem -Path $fileLocation)
{
    $scriptContents = [IO.File]::ReadAllText($file.FullName)

    Submit-AzureRmDataLakeAnalyticsJob `
        -AccountName $adla `
        -Name $file.Name `
        -Script $scriptContents | Out-Null;

    Write-Host "`n" $file.Name "submitted."
}

Parameter Configuration

Where these various GUIDs can be found within the Azure portal is liable to change, but at the time of writing I have provided a path to follow to find each of them.

$azureAccountName – This is the Application Id of your Enterprise Application, which can be found by navigating to Azure Active Directory > Enterprise Applications > All Applications > Selecting your application > Properties > Application Id.

$azurePassword – This is the secret key of your Enterprise Application, and would have been generated during application registration. If you've created your application but have not generated a secret key, you can do so by navigating to: Enterprise Applications > New Application > Application You're Developing > OK, take me to App Registration > Change the drop down from 'My Apps' to 'All Apps' (may not be required) > Select your application > Settings > Keys > Fill in the details and click Save. Note the important message about the key only being available to copy before navigating away from the page!

$psTenantID – From within the Azure Portal, go to Azure Active Directory, open the Properties blade, and copy the 'Directory ID'.

$adla – The name of the Data Lake Analytics resource you're deploying to.

$fileLocation – The file location on your local machine which contains the U-SQL scripts you wish to deploy. Note the "\*.usql" on the end in the example; this is a wildcard search for all files ending in .usql.

Permissions

The service principal needs to be given the following permissions to successfully execute against the lake:

· Owner of the Data Lake Analytics resource you're deploying to. This is configured via the Access Control blade.
· Owner of the Data Lake Store resource associated to the Analytics resource you're deploying to. This is configured via the Access Control blade.
· Read, Write, and Execute permissions against the sub-folders within the Data Lake Store. Ensure you select 'This folder and all children' and 'An access permission entry and a default permission entry'. This is configured by entering the Data Lake Store, selecting Data Explorer, then Access.
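The script above fires and forgets each job. If you would rather have the deployment wait for each job and surface failures, the inside of the ForEach loop could look something like the rough sketch below; cmdlet and parameter names assume a recent AzureRM.DataLakeAnalytics module, so treat it as a starting point rather than a drop-in.

# Sketch: submit the script, block until the job finishes, then check the outcome.
# Assumes $adla, $file and $scriptContents are set as in the loop above.
$job = Submit-AzureRmDataLakeAnalyticsJob `
    -Account $adla `
    -Name $file.Name `
    -Script $scriptContents

# Block until the job reaches a terminal state, then re-read it to get the final result.
Wait-AzureRmDataLakeAnalyticsJob -Account $adla -JobId $job.JobId | Out-Null
$finished = Get-AzureRmDataLakeAnalyticsJob -Account $adla -JobId $job.JobId

if ($finished.Result -ne "Succeeded")
{
    Write-Host "`n" $file.Name "failed with result:" $finished.Result
}
else
{
    Write-Host "`n" $file.Name "deployed successfully."
}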

Extraction and Analysis of GeoSpatial data with Azure Data Lake Analytics

I recently had to implement a solution to prove it was possible to integrate a shape file (.SHP) into Azure Data Lake Store (ADLS) for subsequent geospatial analysis using Azure Data Lake Analytics (ADLA). A shape file is a data set used by a geographic analysis application that stores a collection of geographic features, such as streets or zip code boundaries, in the form of points, lines or area features.

As you may have already figured, storing a shape file in ADLS is not a difficult goal to achieve; however, how can you possibly use ADLA to obtain the geographic data from the file? In this blog I'll explain how we can extract the data to a supported format, such as CSV, and use it to run geospatial analysis in ADLA, with the support of the spatial data types introduced in SQL Server 2008 (details here).

As always, whenever we face a limitation of ADLA, C# is our best friend. In order to read the content of a shape file, we need to start by adding a geospatial assembly to our solution, which, in my case, was the "Catfood" ESRI Shapefile Reader (details here).

The shape file used in this example contains a list of parks in London. The following code demonstrates how to extract the metadata and the geographic shapes to a CSV file. The only shapes extracted are polygons, although it is possible to add more if needed.

// Namespaces used by this excerpt (Catfood.Shapefile is the "Catfood" ESRI Shapefile Reader)
using System.Data;
using System.Windows.Forms;
using Catfood.Shapefile;

public static void CreateWorkForThreads()
{
    //Create a new dataset and store the data in a table
    DataSet ds = CreateNewDataSet();
    DataTable dt = ds.Tables[0];

    int i;
    int count = 0;

    // Parse the shapefile and select the columns we are interested in
    using (Shapefile shapefile = new Shapefile(@"path\file.shp"))
    {
        foreach (Shape shape in shapefile)
        {
            string[] metadataNames = shape.GetMetadataNames();
            string geometry = "";
            int countParts = 0;
            int countShape = 0;

            DataRow dr = dt.NewRow();

            //Extract the metadata. The first iteration will extract the name of the columns
            if (metadataNames != null)
            {
                foreach (string metadataName in metadataNames)
                {
                    if (count == 0)
                        dr[metadataName] = metadataName;
                    else
                        dr[metadataName] = shape.GetMetadata(metadataName);
                }
            }

            //Shape is not part of the metadata, so manually defining the name of the column
            if (count == 0)
            {
                dr["shape"] = "shape";
            }
            else
            {
                // cast shape based on the type
                switch (shape.Type)
                {
                    case ShapeType.Point:
                        // a point is just a single x/y point
                        ShapePoint shapePoint = shape as ShapePoint;
                        MessageBox.Show("Point (" + shapePoint.Point.X.ToString() + ", " + shapePoint.Point.Y.ToString() + ")");
                        break;

                    case ShapeType.Polygon:
                        // a polygon contains one or more parts - each part is a list of points which
                        // are clockwise for boundaries and anti-clockwise for holes
                        // see http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf
                        ShapePolygon shapePolygon = shape as ShapePolygon;
                        foreach (PointD[] part in shapePolygon.Parts)
                        {
                            countShape = 0;

                            if (countParts == 0)
                                geometry = "(";
                            else
                                geometry = geometry + " | (";

                            foreach (PointD point in part)
                            {
                                if (part.Length - 1 != countShape)
                                    geometry = geometry + point.X + " " + point.Y + " |";
                                else
                                    geometry = geometry + point.X + " " + point.Y + " )";

                                countShape++;
                            }

                            countParts++;
                        }
                        break;

                    default:
                        break;
                }

                //Build our Polygon.
                //Eg. POLYGON((-122.358 47.653, -122.348 47.649| -122.348 47.658, -122.358 47.658, -122.358 47.653))
                dr["shape"] = "POLYGON(" + geometry + ")";
            }

            dt.Rows.Add(dr);
            count++;
        }
    }

    //Extract the data to a csv file
    using (System.IO.StreamWriter sw = new System.IO.StreamWriter(@"path\filename.csv"))
    {
        foreach (DataRow row in dt.Rows)
        {
            object[] array = row.ItemArray;

            for (i = 0; i < array.Length - 1; i++)
            {
                sw.Write(array[i].ToString() + ",");
            }

            sw.WriteLine(array[i].ToString());
        }
    }
}

public static DataSet CreateNewDataSet()
{
    DataSet dsTemp = new DataSet();
    DataTable dtTemp = new DataTable("londonparks");
    dtTemp.Columns.Add("id", typeof(string));
    dtTemp.Columns.Add("parkname", typeof(string));
    dtTemp.Columns.Add("street", typeof(string));
    dtTemp.Columns.Add("postcode", typeof(string));
    dtTemp.Columns.Add("shape", typeof(string));
    dsTemp.Tables.Add(dtTemp);
    return dsTemp;
}

Now that we have a valid file that can be processed by ADLA, we can upload it to ADLS and start performing geospatial analysis. To do so, I simply followed the logic described in Sacha's blog (here).

The following U-SQL takes into consideration a dataset that contains details of the trajectory of a courier, tracked on a daily basis. With the following code, we identify if a courier drove by a park by using the Intersect function. Because we have to cross two datasets, a C# function was created to help evaluate multiple events.

// Reference the assemblies we require in our script.
REFERENCE SYSTEM ASSEMBLY [System.Xml];
REFERENCE ASSEMBLY [SQLServerExtensions].[SqlSpatial];
REFERENCE ASSEMBLY [USQL.Core];

// Once the appropriate assemblies are registered, we can alias them using the USING keyword.
USING Geometry = Microsoft.SqlServer.Types.SqlGeometry;
USING Geography = Microsoft.SqlServer.Types.SqlGeography;
USING SqlChars = System.Data.SqlTypes.SqlChars;
USING [USQL].[Core].[Utilities];

// Extract the list of parks
@parks =
    EXTRACT [ID] string,
            [PARKNAME] string,
            [STREET] string,
            [POSTCODE] string,
            [SHAPE] string
    FROM "RAW/Parks.csv"
    USING Extractors.Text(delimiter : ',', silent : false, quoting : true, skipFirstNRows : 1);

//Extract data from the file containing the courier trajectory
@trajectories =
    EXTRACT GPSDateTimeUTC DateTime,
            ReceivedDatetimeUTC DateTime,
            VehicleKey string,
            Altitude int,
            Longitude double,
            Latitude double,
            Distance decimal,
            VehicleSpeedMph decimal
    FROM "CURATED/Trajectory/Trajectory.TXT"
    USING Extractors.Text(delimiter : '|', silent : false, quoting : true, skipFirstNRows : 1);

//Get the list of vehicles that drove by the park.
@vehicleIntersection =
    SELECT DISTINCT a.*, "1" AS VehicleIntersected
    FROM @trajectories AS a
    CROSS JOIN @parks AS b
    WHERE Utilities.Intersect(b.[SHAPE], a.[Longitude], a.[Latitude]).ToString() == "True";

//Get the list of vehicles that didn't drive by the park.
@vehicleWithoutIntersection =
    SELECT a.*, "0" AS VehicleIntersected
    FROM @trajectories AS a
    LEFT JOIN @vehicleIntersection AS b
        ON b.VehicleKey == a.VehicleKey AND b.GPSDateTimeUTC == a.GPSDateTimeUTC
    WHERE b.VehicleKey IS NULL;

//Union both datasets to get the complete set of data
@finalData =
    SELECT *
    FROM @vehicleIntersection
    UNION ALL
    SELECT *
    FROM @vehicleWithoutIntersection;

//Export the results to a csv file
OUTPUT @finalData
TO "LABORATORY/GeoSpatialIntersection.csv"
USING Outputters.Text(outputHeader : true, delimiter : ',', quoting : true);

And here is the C# function.
It accepts three parameters and calculates the intersection of a point with a shape.

public static string Intersect(string shape, double longitude, double latitude)
{
    //Because we had a csv file, the coordinates in the polygon were separated by |
    //It is important to use the .MakeValid() method to validate any invalid shape
    //In case the dataset had multipolygon shapes, without the MakeValid() the function would throw an error
    var g = Geography.STGeomFromText(
                new SqlChars(
                    shape.Replace('|', ',')), 4326).MakeValid();

    var h = Geography.Point(longitude, latitude, 4326);

    return g.STIntersects(h).ToString();
}

As always, if you have any questions or comments, do let me know.

Geographic Spatial Analysis with Azure Data Lake Analytics (ADLA)

Whilst working on an Azure Data Lake project, a requirement hit the backlog that could be easily solved with a Geographical Information System (GIS) or even SQL Server - spatial data type support was introduced into SQL Server 2008. However, Azure Data Lake Analytics (ADLA) does not natively support spatial data analytics, so we'll have to extract the data into another service, right? Wrong! :) Due to the extensibility of Azure Data Lake Analytics, we can enhance it to do practically anything. In fact, we can lean on existing components and enhance the service without having to develop the enhancement itself. This blog is a quick run through demonstrating how to enhance ADLA such that it will support spatial analytics and meet our project requirement.

Problem

For simplicity I've trivialised the problem. Here's the requirement: Indicate which Bus Stops are within 1.5 km of Southwark Tube Station. To support this requirement, we have two datasets:

· A list of all the Bus Stops in London, including their geo location (circa 20k records)
· The geo location record of Southwark Tube Station (a single record!)

In fact, the location of the tube station is pretty accurate and is geo located to the entrance pavement outside the tube station. This would be an easy problem for a GIS to solve. You would specify the central point, i.e. our Southwark Tube Station marker, draw a circle, or buffer, with a radius of 1.5 km around it, and select all bus stops that fall within or intersect with that circle. This spatial analysis is easy for these systems as that's essentially what they are built to do. SQL Server 2008 introduced the Spatial Data Type; this allowed spatial style analysis to be performed on geo data using T-SQL in conjunction with the supplied Geometry and Geography data types. More info on those can be found here. So, how can we solve our problem in ADLA, without a GIS and without having to export the data to SQL Server?

Solution

You can register existing assemblies with ADLA. It so happens that the SQL Server Data Types and Spatial assemblies are nicely packaged up and can be used directly within ADLA itself - think about that, it's pretty awesome!

Caveat: At the time of writing we have no idea of the licence implications. It will be up to you to ensure you are not in breach :)

Those assemblies can be downloaded from here. You only need to download and install the following file: ENU\x64\SQLSysClrTypes.msi. This installs two key assemblies, which you'll need to grab and upload to your Data Lake Store (a PowerShell sketch for the upload is included at the end of this post):

· C:\Program Files (x86)\Microsoft SQL Server\130\SDK\Assemblies\Microsoft.SqlServer.Types.dll
· C:\Windows\System32\SqlServerSpatial130.dll

Once they have been uploaded to your Data Lake Store, you need to register those assemblies with ADLA.

DECLARE @ASSEMBLY_PATH string = "/5.UTILITY/USQL-Extend/SQL-Server/";
DECLARE @TYPES_ASM string = @ASSEMBLY_PATH+"Microsoft.SqlServer.Types.dll";
DECLARE @SPATIAL_ASM string = @ASSEMBLY_PATH+"SqlServerSpatial130.dll";

CREATE DATABASE IF NOT EXISTS SQLServerExtensions;
USE DATABASE SQLServerExtensions;

DROP ASSEMBLY IF EXISTS SqlSpatial;
CREATE ASSEMBLY SqlSpatial
FROM @TYPES_ASM
WITH ADDITIONAL_FILES =
(
    @SPATIAL_ASM
);

Following registration of the assemblies, we can see the registration loaded in the ADLA catalog database we created. We are now ready to use this U-SQL enhancement in our U-SQL query - let's go right ahead and solve our problem in one U-SQL script.

// Reference the assemblies we require in our script.
// System.Xml we get for free as a System Assembly, so we didn't need to register that, and our SQLServerExtensions.SqlSpatial assembly.
REFERENCE SYSTEM ASSEMBLY [System.Xml];
REFERENCE ASSEMBLY SQLServerExtensions.SqlSpatial;

// Once the appropriate assemblies are registered, we can alias them using the USING keyword.
USING Geometry = Microsoft.SqlServer.Types.SqlGeometry;
USING Geography = Microsoft.SqlServer.Types.SqlGeography;
USING SqlChars = System.Data.SqlTypes.SqlChars;

// First create the centralised point.
// In this case it's the pavement outside the entrance of Southwark Tube Station, London.
// Format is Longitude, Latitude and then SRID.
// NB: It's Longitude then Latitude, that's the opposite way to what you might expect..
DECLARE @southwarkTube Geography = Geography.Point(-0.104777,51.503829,4326);

// Next we extract our entire London bus stop data set from the file.
// There's about 20k of them.
@busStopInput =
    EXTRACT [StopCode] string,
            [StopName] string,
            [Latitude] double?,
            [Longitude] double?
    FROM @"/1.RAW/OpenData/Transport/bus-stops-narrow-full-london.csv"
    USING Extractors.Csv(skipFirstNRows:1,silent:true);

// This is effectively the transform step and where the magic happens.
// Very similar syntax to what you would do in T-SQL.
// We are returning all the bus stops that fall within 1500m of Southwark Tube.
// Essentially we return all stops that intersect with a 1500m buffer around the central tube point.
@closeBusStops =
    SELECT *
    FROM @busStopInput
    WHERE @southwarkTube.STBuffer(1500).STIntersects(Geography.Point((double)@busStopInput.Longitude,(double)@busStopInput.Latitude,4326)).ToString()=="True";

// The results are written out to a csv file.
OUTPUT @closeBusStops
TO "/4.LABORATORY/Desks/Sach/spatial-closebusstops.csv"
USING Outputters.Csv(outputHeader: true);

The query outputs a list of bus stops that are within the specified spatial distance from Southwark Tube Station. If we have a look at all the bus stops (in red) and overlay all the 'close' bus stops (in green), we can see the results. Pretty neat. Azure Data Lake Analytics does not natively support spatial data analytics, but by simply utilising the assemblies that ship with SQL Server, we can extend the capability of U-SQL to provide that functionality, or practically any functionality we desire.
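A quick note on the upload step mentioned earlier: the two DLLs can be copied into the Data Lake Store through the portal or Visual Studio, or scripted. The sketch below is one possible way to do it with the AzureRM.DataLakeStore module; the store account name is a placeholder, and the destination matches the @ASSEMBLY_PATH used in the registration script.

# Sketch: upload the two assemblies installed by SQLSysClrTypes.msi to the folder
# referenced by @ASSEMBLY_PATH. Account name is a placeholder; assumes an existing Azure login.
$adlsAccount = "myadlsaccount"
$destination = "/5.UTILITY/USQL-Extend/SQL-Server"

Import-AzureRmDataLakeStoreItem -Account $adlsAccount `
    -Path "C:\Program Files (x86)\Microsoft SQL Server\130\SDK\Assemblies\Microsoft.SqlServer.Types.dll" `
    -Destination "$destination/Microsoft.SqlServer.Types.dll" -Force

Import-AzureRmDataLakeStoreItem -Account $adlsAccount `
    -Path "C:\Windows\System32\SqlServerSpatial130.dll" `
    -Destination "$destination/SqlServerSpatial130.dll" -Force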

Azure Data Lake Store : Authorization Failed - The user does not have the permission to perform this operation

Just a quick one, as it took slightly longer to solve than it perhaps should have, so this might help others with the same problem. I was working fine with an Azure Data Lake Store for a few weeks until yesterday, when I couldn't access the files anymore. The Azure Portal reported the following:

Authorization Failed - The user does not have the permission to perform this operation.

It turns out I was accessing it from a new location and the ADLS firewall needed updating to whitelist my IP address. After adding my IP, the Data Lake Store is now accessible again.
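If you would rather script the fix than click through the portal, the firewall rule can also be added with PowerShell. This is a minimal sketch assuming the AzureRM.DataLakeStore module; the account name and IP address are placeholders.

# Sketch: whitelist a public IP address on the Data Lake Store firewall.
# Account name and IP are placeholders; assumes an existing Azure login.
Add-AzureRmDataLakeStoreFirewallRule -Account "myadlsaccount" `
    -Name "NewOfficeLocation" `
    -StartIpAddress "203.0.113.10" `
    -EndIpAddress "203.0.113.10"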