Adatis BI Blogs

Unit Testing Overview

In a previous post, I touched on the point of testing and briefly talked about unit testing. In this post, I will be going into more detail about what unit testing is and why it's important to do it. In the previous post, I said that Unit Tests are: low-level tests, meaning that they are close to the source of the product. They should be written with the aim of testing individual methods and functions for a given code base, using a unit test framework to support the authoring and execution of a test. As a developer, you would typically author the unit tests in a development tool like Visual Studio; you'd run them locally to ensure that the tests pass; and then they would be executed on a regular basis as a task in a Build Pipeline within Azure DevOps. Unit Tests are cheap to automate and should be quick to run.

To expand on this, unit tests are written by a developer to apply to a unit of code. But what do we mean by "unit of code"? A unit of code is the smallest testable part of a solution – verifying that the individual part or component of a solution works as intended, independently from other parts. A unit could be a C# method, a PowerShell function, a T-SQL stored procedure, and many others. Like most forms of testing, unit tests follow a pattern of:

1. Initialise the system under test
2. Call the method under test
3. Assert the expected outcome against the result of the method

A best practice would be to write the unit test before writing any code, but if you've not got to that level of maturity with your test approach, writing tests after code is still good practice.

How do you write a good unit test?

Keep it simple
A unit test shouldn't replicate the code it is intended to test. You'll be writing lots of them, so make them quick and easy to write.

Readable
By keeping it simple, the test should also be readable, making it easy to know which method is being tested and what the expected behaviour of the method is. By making it readable, you can easily address any failures that may surface.

Reliable and repeatable
Unit tests should only fail if there are bugs in the system, not because there are bugs in the tests. Keeping it simple and readable will avoid that issue. Unit tests need to be run many times, sometimes multiple times throughout the course of a day, so they need to be executed quickly in a repeatable manner. Keeping it simple helps achieve this aim.

How do you write a unit test?

We've got an understanding of what a unit test is, but how do we write one? For this example, we'll be writing our code and tests using C#. Our application is a very simple calculator, which adds two numbers together. To add a new unit test, we can simply right-click on the method and select Create Unit Tests. Because we've not built any unit tests before, we can use it to create a new unit test project using a framework of choice. If we already had a unit test project, we could add the new test to the existing project. Using this method, it creates a skeleton of a unit test which we can amend for our needs. As you can see, this doesn't contain what we need, so we amend the test so that it reflects our requirements, as in the below. To run a unit test, you can either right-click on the test method and click on Run Test(s), or open up the Test Explorer window, navigate to the desired test and click on Run Selected Tests.
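Although the example in this post is built in C# inside Visual Studio, the initialise/call/assert shape is the same in any unit test framework. As a minimal sketch, the equivalent calculator test written with Python's unittest would look something like this (the Calculator class here is an illustrative stand-in for the class under test):

# Python stand-ins for the post's C# calculator example (illustrative only)
import unittest

class Calculator:
    def add(self, a, b):
        return a + b

class CalculatorTests(unittest.TestCase):
    def test_add_returns_sum_of_two_numbers(self):
        calculator = Calculator()        # initialise the system under test
        result = calculator.add(2, 3)    # call the method under test
        self.assertEqual(5, result)      # assert the expected outcome against the result

if __name__ == "__main__":
    unittest.main()

Whatever the framework, keeping each test to that three-step shape is what keeps it simple, readable and repeatable.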
Unit Tests in Azure DevOps

We've written our unit tests and have run them locally, but how do we make it repeatable? We utilise the power of Azure DevOps to have repeatable tests run against a changing code base as part of the Build or Continuous Integration process. The process is:

1. Install NuGet on the Build Agent
2. Restore any packages from NuGet that your application requires
3. Build the solution
4. Run the tests
5. Publish the test results
6. Copy successfully built and tested artifacts to a staging directory
7. Publish those artifacts

Using Azure DevOps, or another CI tool, we can rely upon our tests in a repeatable manner.

Additional Reading

There's a good post by Sergey Kolodiy on the importance of writing good code and how unit testing encourages good behaviour. My colleague Jon has also written a post on the subject: Setup Unit Testing with NUnit and NBi.

Lambda vs Azure Databricks Delta Architecture

Historically, when implementing big data processing architectures, Lambda has been the desired approach. However, as technology evolves, new paradigms arise and, with that, more efficient approaches become available, such as the Databricks Delta architecture. In this blog, I'll describe both architectures and demonstrate how to build a data pipeline in Azure Databricks following the Databricks Delta architecture.

The Lambda architecture, originally defined by Nathan Marz, is a big data processing architecture that combines both batch and real-time processing methods. This approach attempts to balance latency, throughput and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. From a high-level perspective, the Lambda architecture is as follows.

A valid approach to using Lambda in Azure could be as demonstrated below (please be aware that there are different options for each layer that I won't be detailing in this blog).

For the speed layer we could use Azure Stream Analytics, a serverless, scalable event-processing engine that enables the development and running of real-time analytics on multiple streams of data from sources such as devices, sensors, web sites, social media and other applications.

For the batch layer, we could use Azure Data Lake Storage (ADLS) and Azure Databricks. ADLS is an enterprise-wide hyper-scale repository for big data analytic workloads that enables us to capture data of any size, type and ingestion speed in one single place for operational and exploratory analytics. Currently there are two versions of ADLS, Gen 1 and Gen 2, with the latter still being in private preview. Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform that allows us to create streamlined workflows and interactive workspaces, enabling collaboration between data scientists, data engineers and business analysts.

For the serving layer, we could use Azure SQL Data Warehouse, a cloud-based Enterprise Data Warehouse (EDW) that leverages Massively Parallel Processing (MPP) to quickly run complex queries across petabytes of data.

With this architecture, the events are consumed by Azure Stream Analytics and landed in ADLS in flat files, which can be partitioned by hour. Once the processing of the file is completed, we can create a batch process via Azure Databricks and store the data in Azure SQL Data Warehouse. To obtain the data that was not captured by the batch process, we can use PolyBase to query the file being updated and then create a view to union both tables. Every time that view is queried, the PolyBase table will get the latest streamed data, meaning we have a real-time query with the capability to obtain the most recent data.

The major problem with the Lambda architecture is that we have to build two separate pipelines, which can be very complex, and, ultimately, it is difficult to combine the processing of batch and real-time data. However, it is now possible to overcome this limitation if we are able to change our approach.

Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Apache Spark and the Databricks File System (DBFS). It is a single data management tool that combines the scale of a data lake, the reliability and performance of a data warehouse, and the low latency of streaming in a single system.
The core abstraction of Databricks Delta is an optimized Spark table that stores data as parquet files in DBFS and maintains a transaction log that tracks changes to the table. From a high-level perspective, the Databricks Delta architecture can be described as follows.

An Azure Databricks Delta Raw table stores the data that is either produced by streaming sources or is stored in data lakes. Query tables contain the normalized data from the Raw tables. Summary tables, often used as the source for the presentation layer, contain the aggregated key business metrics that are frequently queried. This unified approach means that there is less complexity due to the removal of storage systems and data management steps and, more importantly, output queries can be performed on streaming and historical data at the same time.

In the next steps, I'll demonstrate how to implement the Databricks Delta architecture using a Python notebook.

# If Databricks Delta is not enabled on the cluster, run this cell
spark.sql("set spark.databricks.delta.preview.enabled=true")

# Define variables
basePath = "/kafka"
taxiRidesRawPath = basePath + "/taxiRidesRaw.delta"
taxiRidesQueryPath = basePath + "/taxiRidesQuery.delta"
taxiFaresQueryPath = basePath + "/taxiFaresQuery.delta"
taxiSummaryPath = basePath + "/taxiSummary.delta"
checkpointPath = basePath + "/checkpoints"

# Load the Kafka stream data to a DataFrame
kafkaDF = (spark
  .readStream
  .option("kafka.bootstrap.servers", "192.168.1.4:9092")
  .option("subscribe", "taxirides")
  .option("startingOffsets", "earliest")
  .option("checkpointLocation", "/taxinyc/kafka.checkpoint")
  .format("kafka")
  .load()
)

# Kafka transmits information using a key, a value, and metadata such as topic and partition.
# The information we're interested in is the value column. Since this is a binary value, we must
# first cast it to a StringType and then split the columns.
# Stream into the Raw Databricks Delta directory. By using a checkpoint location, the metadata on
# which data has already been processed will be maintained, so the cluster can be shut down
# without a loss of information.
from pyspark.sql.types import StructType, StructField, LongType, TimestampType, StringType, FloatType, IntegerType
from pyspark.sql.functions import col, split

(kafkaDF
  .select(split(col("value").cast(StringType()), ",").alias("message"))
  .writeStream
  .format("delta")
  .option("checkpointLocation", checkpointPath + "/taxiRidesRaw")
  .outputMode("append")
  .start(taxiRidesRawPath)
)

# Create and populate the raw delta table. Data is stored in a single column as an array, e.g.
# ["6","START","2013-01-01 00:00:00","1970-01-01 00:00:00","-73.866135","40.771091","-73.961334","40.764912","6","2013000006","2013000006"]
spark.sql("DROP TABLE IF EXISTS TaxiRidesRaw")
spark.sql("""
  CREATE TABLE TaxiRidesRaw
  USING Delta
  LOCATION '{}'
""".format(taxiRidesRawPath))
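At this point it can be useful to confirm that messages are actually arriving in the Raw table. As a quick sketch (using only the table and column defined above), the comma-split messages can be inspected with a simple query:

# Optional sanity check: peek at a few of the raw comma-split messages
display(spark.sql("SELECT message FROM TaxiRidesRaw LIMIT 10"))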
# Stream into the Query Databricks Delta directory
(spark.readStream
  .format("delta")
  .load(str(taxiRidesRawPath))
  .select(col("message")[0].cast(IntegerType()).alias("rideId"),
    col("message")[1].cast(StringType()).alias("rideStatus"),
    col("message")[2].cast(TimestampType()).alias("rideEndTime"),
    col("message")[3].cast(TimestampType()).alias("rideStartTime"),
    col("message")[4].cast(FloatType()).alias("startLong"),
    col("message")[5].cast(FloatType()).alias("startLat"),
    col("message")[6].cast(FloatType()).alias("endLong"),
    col("message")[7].cast(FloatType()).alias("endLat"),
    col("message")[8].cast(IntegerType()).alias("passengerCount"),
    col("message")[9].cast(IntegerType()).alias("taxiId"),
    col("message")[10].cast(IntegerType()).alias("driverId"))
  .filter("rideStartTime <> '1970-01-01T00:00:00.000+0000'")
  .writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", checkpointPath + "/taxiRidesQuery")
  .start(taxiRidesQueryPath)
)

# Create and populate the query delta table. Data is no longer in a single column
spark.sql("DROP TABLE IF EXISTS TaxiRidesQuery")
spark.sql("""
  CREATE TABLE TaxiRidesQuery
  USING Delta
  LOCATION '{}'
""".format(taxiRidesQueryPath))

# Load the fares data to a DataFrame. The parquet files are stored in a blob storage
taxiFaresDF = (spark.read
               .parquet("/mnt/geospatial/kafka/NYC")
               .write
               .format("delta")
               .mode("append")
               .save(taxiFaresQueryPath)
              )

# Create and populate the query delta table
spark.sql("DROP TABLE IF EXISTS TaxiFaresQuery")
spark.sql("""
  CREATE TABLE TaxiFaresQuery
  USING Delta
  LOCATION '{}'
""".format(taxiFaresQueryPath))

# Load the streaming rides data to a DataFrame
taxiRidesDF = (spark
               .readStream
               .format("delta")
               .load(str(taxiRidesQueryPath))
              )

# Load the fares data to a DataFrame
taxiFaresDF = (spark
               .read
               .format("delta")
               .load(str(taxiFaresQueryPath))
              )

# Join the streaming data and the batch data. Group by date and taxi driver to obtain the number of rides per day
from pyspark.sql.functions import date_format, col, sum

RidesDf = (taxiRidesDF.join(taxiFaresDF, (taxiRidesDF.taxiId == taxiFaresDF.taxiId) & (taxiRidesDF.driverId == taxiFaresDF.driverId))
           .withColumn("date", date_format(taxiRidesDF.rideStartTime, "yyyyMMdd"))
           .groupBy(col("date"), taxiRidesDF.driverId)
           .count()
           .withColumnRenamed("count", "RidesPerDay")
           .writeStream
           .format("delta")
           .outputMode("complete")
           .option("checkpointLocation", checkpointPath + "/taxiSummary")
           .start(taxiSummaryPath)
)

# Create and populate the summary delta table
spark.sql("DROP TABLE IF EXISTS TaxiSummary")
spark.sql("""
  CREATE TABLE TaxiSummary
  USING Delta
  LOCATION '{}'
""".format(taxiSummaryPath))

As always, if you have any questions or comments, do let me know.

Geospatial analysis in Azure Databricks – Part II

After my last post on running geospatial analysis in Azure Databricks with Magellan (here), I decided to investigate which other libraries were available and discover whether they performed better or worse. The first library I investigated was GeoMesa, an Apache-licensed open source suite of tools that enables large-scale geospatial analytics on cloud and distributed computing systems, letting you manage and analyze huge spatio-temporal datasets. GeoMesa does this by providing spatio-temporal data persistence on top of the Accumulo, HBase and Cassandra distributed column-oriented databases for massive storage of point, line and polygon data. It allows rapid access to this data via queries that take full advantage of geographical properties to specify distance and area. GeoMesa also provides support for near real-time stream processing of spatio-temporal data by layering spatial semantics on top of the Apache Kafka messaging system (further details here).

Although their website is rich in documentation, I immediately stumbled at the most basic operation: reading a GeoJSON file with the GeoMesa format. The reason is that, in their tutorials, they assume Apache Accumulo, a distributed key/value store, is used as the backing data store. Because I wanted to make sure I could ingest data from either Azure Blob Storage or Azure Data Lake Storage, I decided not to follow their recommendation. As such, after many hours of failed attempts, I decided to abandon the idea of using GeoMesa.

My next option was GeoSpark, a cluster computing system for processing large-scale spatial data. GeoSpark extends Apache Spark / SparkSQL with a set of out-of-the-box Spatial Resilient Distributed Datasets (SRDDs) / SpatialSQL that efficiently load, process and analyze large-scale spatial data across machines (further details here). GeoSpark immediately impresses with the possibility of either creating Spatial RDDs and running spatial queries using GeoSpark-core, or creating Spatial SQL/DataFrames to manage spatial data using GeoSparkSQL. Their website contains tutorials that are easy to follow and offers the possibility to chat with the community on gitter.

In the spirit of keeping my approach as simple as possible, I decided to compare Magellan with GeoSparkSQL, since SparkSQL is easier to use and working with RDDs can be a complex task. However, it is important to highlight that their recommendation is to use GeoSpark core rather than GeoSparkSQL. The reason for this is that SparkSQL has some limitations, such as not supporting clustered indices, making it difficult to expose all GeoSpark core features through it.

The data used in the following test cases was based on the NYC Taxicab datasets to create the geometry points and the Magellan NYC Neighbourhoods GeoJSON to extract the polygons. Both datasets were stored in a blob storage and added to Azure Databricks as a mount point.
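Mounting the storage account is a one-off step per workspace; as a rough sketch (the container, storage account and access key below are placeholders), the mount can be created from a Python cell like this:

# Placeholders: replace <container>, <storage-account> and <access-key> with your own values
dbutils.fs.mount(
  source = "wasbs://<container>@<storage-account>.blob.core.windows.net/",
  mount_point = "/mnt/geospatial",
  extra_configs = {"fs.azure.account.key.<storage-account>.blob.core.windows.net": "<access-key>"}
)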
The table below details the versions of the libraries and the cluster configuration. There are a couple of points to notice:

- Magellan does not support Apache Spark 2.3+.
- The Magellan library 1.0.6 is about to be released this month and should cover some of the limitations identified below.
- The GeoSpark library 1.2.0 is currently available in SNAPSHOT and will hopefully fix the load of multiline GeoJSON files.

Library | Version | Runtime Version | Cluster Specification
Magellan | 1.0.5 | 3.5 LTS (includes Apache Spark 2.2.1, Scala 2.11) | Standard_DS3_v2 driver type with 14GB Memory, 4 Cores and auto scaling enabled
GeoSpark / GeoSparkSQL | 1.1.3 / 2.3-1.1.3 | 4.3 (includes Apache Spark 2.3.1, Scala 2.11) | Standard_DS3_v2 driver type with 14GB Memory, 4 Cores and auto scaling enabled

To test the performance of both libraries, I implemented a set of queries and ran them 3 times, registering how long each run took. The best results are highlighted in green.

DS1 – NYC Neighbourhoods dataset containing the polygons
DS2 – NYC Taxicab dataset containing the geometry points for the month of January 2015
DS3 – NYC Taxicab dataset containing the geometry points for the year of 2015

Test Number | Description | Number of Results | Magellan (avg in sec) | GeoSparkSQL (avg in sec)
1 | Select all rows from DS1 | 310 | 0.86 | 0.69
2 | Select all rows from DS2 | 12.748.986 | 19.82 | 15.64
3 | Select all rows from DS1 where borough is Manhattan | 37 | 2.22 | 0.69
4 | Select all rows from DS2 where total amount is bigger than 20 | 2.111.707 | 18.71 | 17.23
5 | Select 100 rows from DS1 ordered by the distance between one point and all polygons | 100 | N/A* | 0.8
6 | Select all rows from DS1 where a single point is within all polygons | 1 | 1.63 | 0.68
7 | Select all rows from DS1 where one point with buffer 0.1 intersects all polygons | 73 | N/A* | 0.80
8 | Join DS1 and DS2 and select all rows where polygons contain points | 12.492.678 | 29.17 | 1573.8 (~26min)
9 | Join DS1 and DS2 and select all rows where points are within polygons | 12.492.678 | 29.31 | 1518 (~25min)
10 | Select all rows from DS3 | 146.113.001 | 187.8 | 155.4
11 | Select all rows from DS3 where total amount is bigger than 20 | 29.333.130 | 94.8 | 119.4
12** | Join DS1 and DS3 and select all rows where points are within polygons | 143.664.028 | 168 | N/A

* Although the following link mentions Magellan can perform Distance and Buffer operations, I couldn't find documentation demonstrating how to perform them, or, in the cases I tried, Azure Databricks threw an error indicating the class was not available.
** Considering the time it took to run queries 8/9 using DS2 (~1.8GB), I decided not to test the performance against DS3 (~21.3GB), since I already knew the results were not going to be positive.

From the tests above, we can see that GeoSparkSQL is generally faster, except when performing joins with spatial ranges, where its performance drastically decreases compared with Magellan.
On the other hand, Magellan is still an ongoing project and seems to be lacking some of the basic operations that might be of big importance for some analysis; however, it clearly excels when we need to run spatial analysis on joined datasets.

Based on my experience using the libraries and the tests conducted in this blog, my recommendation would be to use Magellan, since even when GeoSparkSQL was better, the performance gains were not that significant. However, as already mentioned, Magellan might not be an option if the requirements involve operations that are not yet available, such as distances or buffers.

Following is the implementation of the tests using GeoSparkSQL.

//Import Libraries and config session
import org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator
import org.datasyslab.geosparksql.utils.Adapter
import org.datasyslab.geosparksql.UDF.UdfRegistrator
import org.datasyslab.geosparksql.UDT.UdtRegistrator
// Needed for the Kryo registrator referenced below (assumed package for GeoSpark 1.1.x)
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.geosparksql.strategy.join.JoinQueryDetector
import org.apache.spark.sql.Row
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

//Initiate Spark Session
var sparkSession = SparkSession.builder()
                    .appName("NYCTaxis")
                    // Enable GeoSpark custom Kryo serializer
                    .config("spark.serializer", classOf[KryoSerializer].getName)
                    .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
                    .getOrCreate()

//Register GeoSparkSQL
GeoSparkSQLRegistrator.registerAll(sparkSession)

//Define schema for the NYC taxi data
val schema = StructType(Array(
    StructField("vendorId", StringType, false),
    StructField("pickup_datetime", StringType, false),
    StructField("dropoff_datetime", StringType, false),
    StructField("passenger_count", IntegerType, false),
    StructField("trip_distance", DoubleType, false),
    StructField("pickup_longitude", DoubleType, false),
    StructField("pickup_latitude", DoubleType, false),
    StructField("rateCodeId", StringType, false),
    StructField("store_fwd", StringType, false),
    StructField("dropoff_longitude", DoubleType, false),
    StructField("dropoff_latitude", DoubleType, false),
    StructField("payment_type", StringType, false),
    StructField("fare_amount", StringType, false),
    StructField("extra", StringType, false),
    StructField("mta_tax", StringType, false),
    StructField("tip_amount", StringType, false),
    StructField("tolls_amount", StringType, false),
    StructField("improvement_surcharge", StringType, false),
    StructField("total_amount", DoubleType, false)))

//Read data from the NYC Taxicab dataset
var trips = sparkSession.read
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .schema(schema)
            .load("/mnt/geospatial/nyctaxis/*")

trips.createOrReplaceTempView("tripstable")

//Read GeoJSON file
var polygonJsonDF = spark.read
                    .option("multiline", "true")
                    .json("/mnt/geospatial/neighborhoods/neighborhoods.geojson")

//GeoSparkSQL can't read multiline GeoJSON files. This workaround will only work if the file only contains one geometry type (e.g. polygons)
val polygons = polygonJsonDF
                .select(explode(col("features")).as("feature"))
                .withColumn("polygon", callUDF("ST_GeomFromGeoJson", to_json(col("feature"))))
                .select($"polygon", $"feature.properties.borough", $"feature.properties.boroughCode", $"feature.properties.neighborhood")

polygons.createOrReplaceTempView("polygontable")

//Test 1
var polygonAll = sparkSession.sql(
        """
          | SELECT *
          | FROM polygontable
        """)
polygonAll.count()

//Test 2
var tripsAll = sparkSession.sql(
        """
          | SELECT *
          | FROM tripstable
        """)
tripsAll.count()

//Test 3
var polygonWhere = sparkSession.sql(
        """
          | SELECT *
          | FROM polygontable
          | WHERE borough = 'Manhattan'
        """)
polygonWhere.count()

//Test 4
var tripsWhere = sparkSession.sql(
        """
          | SELECT *
          | FROM tripstable
          | WHERE total_amount > 20
        """)
tripsWhere.count()

//Test 5
var polygonGeomDistance = sparkSession.sql(
        """
          | SELECT *
          | FROM polygontable
          | ORDER BY ST_Distance(polygon, ST_PointFromText('-74.00672149658203, 40.73177719116211', ','))
          | LIMIT 100
        """)
polygonGeomDistance.count()

//Test 6
var polygonGeomWithin = sparkSession.sql(
        """
          | SELECT *
          | FROM polygontable
          | WHERE ST_Within(ST_PointFromText('-74.00672149658203, 40.73177719116211', ','), polygon)
        """)
polygonGeomWithin.show()

//Test 7
var polygonGeomInterset = sparkSession.sql(
        """
          | SELECT *
          | FROM polygontable
          | WHERE ST_Intersects(ST_Circle(ST_PointFromText('-74.00672149658203, 40.73177719116211', ','), 0.1), polygon)
        """)
polygonGeomInterset.count()

//Test 8
var polygonContainsJoin = sparkSession.sql(
        """
          | SELECT *
          | FROM polygontable, tripstable
          | WHERE ST_Contains(polygontable.polygon, ST_Point(CAST(tripstable.pickup_longitude AS Decimal(24,20)), CAST(tripstable.pickup_latitude AS Decimal(24,20))))
        """)
polygonContainsJoin.count()

//Test 9
var polygonWithinJoin = sparkSession.sql(
        """
          | SELECT *
          | FROM polygontable, tripstable
          | WHERE ST_Within(ST_Point(CAST(tripstable.pickup_longitude AS Decimal(24,20)), CAST(tripstable.pickup_latitude AS Decimal(24,20))), polygontable.polygon)
        """)
polygonWithinJoin.count()

Following is the implementation of the tests using Magellan.

//Import Libraries
import magellan._
import org.apache.spark.sql.magellan.dsl.expressions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

//Define schema for the NYC taxi data
val schema = StructType(Array(
    StructField("vendorId", StringType, false),
    StructField("pickup_datetime", StringType, false),
    StructField("dropoff_datetime", StringType, false),
    StructField("passenger_count", IntegerType, false),
    StructField("trip_distance", DoubleType, false),
    StructField("pickup_longitude", DoubleType, false),
    StructField("pickup_latitude", DoubleType, false),
    StructField("rateCodeId", StringType, false),
    StructField("store_fwd", StringType, false),
    StructField("dropoff_longitude", DoubleType, false),
    StructField("dropoff_latitude", DoubleType, false),
    StructField("payment_type", StringType, false),
    StructField("fare_amount", StringType, false),
StructField("extra", StringType, false),     StructField("mta_tax", StringType, false),     StructField("tip_amount", StringType, false),     StructField("tolls_amount", StringType, false),     StructField("improvement_surcharge", StringType, false),     StructField("total_amount", DoubleType, false)))//Read data from the NYC Taxicab dataset and create a Magellan point val trips = sqlContext.read       .format("com.databricks.spark.csv")       .option("mode", "DROPMALFORMED")       .schema(schema)       .load("/mnt/geospatial/nyctaxis/*")       .withColumn("point", point($"pickup_longitude",$"pickup_latitude"))//Read GeoJSON file and define index precision val neighborhoods = sqlContext.read       .format("magellan")       .option("type", "geojson")       .load("/mnt/geospatial/neighborhoods/neighborhoods.geojson")       .select($"polygon",               $"metadata"("borough").as("borough"),              $"metadata"("boroughCode").as("boroughCode"),              $"metadata"("neighborhood").as("neighborhood"))       .index(30)//Test 1 magellan.Utils.injectRules(spark) neighborhoods.count()//Test 2 trips.count()//Test 3 neighborhoods.filter("borough == 'Manhattan'").count()//Test 4 trips.filter("total_amount > 20").count()//Test 6 val points = sc.parallelize(Seq((-74.00672149658203, 40.73177719116211))).toDF("x", "y").select(point($"x", $"y").as("point")) val polygons =neighborhoods.join(points) polygons.filter($"point" within $"polygon").count()//Test 8 trips.join(neighborhoods)         .where($"polygon" >? $"point")         .count()//Test 9 trips.join(neighborhoods)         .where($"point" within $"polygon")         .count()

New Features for Workload Management in Azure SQL Data Warehouse

I was lucky enough to attend the PASS Summit in Seattle last week and was therefore able to see some early demos of new features coming to various Azure products. One of the key new releases that caught my eye was the implementation of Importance for workload management in Azure SQL Data Warehouse. The concept of workload management is a key factor for Azure SQL DW, as there are only a limited number of concurrency slots available and, depending on the resource class, these slots can fill up pretty quickly. Once the concurrency slots are full, queries are queued until a sufficiently sized slot opens up.

Let's recap what Resource Classes are and how they affect workload management. A Resource Class is a pre-configured database role that determines how much resource is allocated to queries coming from users that belong to that role. For example, an ETL service account may use a "large" resource class and be allocated a generous amount of the server, whereas an analyst may use a "small" resource class and therefore only use up a small amount of the server with their queries. There are actually two types of resource class, Dynamic and Static. The Dynamic resource classes will grant a set percentage of memory to a query, and the actual value of this percentage will vary as the Warehouse scales up and down. The key factor is that an xLargeRc (extra-large resource class) will always take up 70% of the server and will not allow any other queries to be run concurrently. No matter how much you scale up the Warehouse, queries run with an xLargeRc will run one at a time. Conversely, queries run with a smallrc will only be allocated 4% of the server, so as a Warehouse scales up, this 4% becomes a larger amount of resource and can therefore process data quicker. The Static resource classes offer a fixed amount of memory per query regardless of scale. This means that as the Warehouse gets scaled up, the concurrency increases with it. More detail on Resource Classes in SQL Data Warehouse is available here (https://docs.microsoft.com/en-us/azure/sql-data-warehouse/resource-classes-for-workload-management).

The new Importance feature, which will be released into preview on 1st December, allows administrators to specify the importance of a user, thereby affecting the order in which jobs are processed off the queue by the server.
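Based on the preview announcement, importance is assigned by classifying a user or group; the sketch below is an assumption of what that could look like once the preview is available (the classifier name, login, resource class, server and credentials are all placeholders, and the exact syntax may change), run here from Python with pyodbc:

# Sketch only: CREATE WORKLOAD CLASSIFIER syntax assumed from the preview announcement;
# server, database, credentials and the TheCEO login are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;"
    "DATABASE=<data-warehouse>;"
    "UID=<admin-user>;PWD=<password>",
    autocommit=True)

conn.execute("""
    CREATE WORKLOAD CLASSIFIER CEOClassifier
    WITH (WORKLOAD_GROUP = 'xlargerc',
          MEMBERNAME = 'TheCEO',
          IMPORTANCE = HIGH);
""")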
To illustrate Importance, we can use the example of a CEO and an Analyst, both of whom are trying to run queries against the SQL DW instance. Assuming the CEO has been tagged with high importance, their queries will be pulled off the queue before the analyst's, as illustrated in the below diagram. In the diagram, Job 2 finishes and Job 7 is pulled off the queue because it originated from the CEO with High Importance. The analyst job with lower importance is left in the queue until another slot opens up.

Importance can also affect how SQL DW handles locking. If a job is awaiting a lock on a frequently used table, that lock can often be unavailable for a long time, increasing the time it takes to complete the operation. Importance on top of locking ensures the locks are reserved for high-importance queries so that they can complete much quicker. See the below diagram: Jobs 1, 2 and 3 are running, and in the top queue Job 1 currently has the lock on a table that both Job 2 and Job 3 will need. As Job 2 – the Analyst job – was submitted before Job 3 – the CEO job – it would usually secure the lock before the CEO, meaning the CEO would wait longer for their query. However, with Importance enabled, SQL DW knows to hold the lock and grant it to the higher-importance job so that it can complete first.

Finally, Importance affects how the optimiser chooses which queries to run and when. Generally speaking, the optimiser accelerates throughput, so as soon as an appropriately sized slot becomes available, the next job is pulled off the queue and executed. However, if a large query is waiting to be run, there is the possibility that it will wait a long time as small jobs continually jump into the smaller slots. See the diagram below: there are 4 small jobs running, a small job queued for an analyst and a large job queued for the CEO. With Importance turned on, the optimiser knows that when Job 1 finishes it needs to hold the resource so that it can make room for the highly important Job 6 and not automatically start Job 5. When Job 2 finishes, the optimiser will then push Job 6 to be processed, as there is now enough resource to run it.

Hopefully the above diagrams have illustrated how Importance for Workload Management can ensure that the essential jobs are run before the less essential ones. Please look out for the preview release of this feature on 1st December.

Azure Data Factory v2 : ForEach Activity : Handling Null Items

The ForEach activity is a great addition to Azure Data Factory v2 (ADF v2); however, you can encounter issues in some situations where you pass a null in its 'Items' setting for it to iterate. When you pass a null, you receive the error:

Error {
  "errorCode": "InvalidTemplate",
  "message": "The function 'length' expects its parameter to be an array or a string. The provided value is of type 'Null'.",
  "failureType": "UserError",
  "target": "ForEach"
}

This happens because the initialisation of the ForEach iterator checks the length of the string or array that is passed in on the 'Items' setting. When a null is supplied (i.e. there are no items to create an array from), the length function fails. I would like to see the ADF ForEach activity check for null first, and only check the length and continue with the iterator when it's not null, but it doesn't – although I'm sure that will tighten up in future versions.

Arguably the correct way to handle this is to implement an If Condition activity within your pipeline that tests for null and only executes the ForEach iterator activities when you can confirm the object you want to iterate is not null. However, for me that slightly overcomplicates the pipeline, as you end up with nested activities or additional pipelines that make it difficult to maintain and really understand what's happening in the pipeline.

There's an alternative, and I'd be interested in understanding if there are any better ways to achieve the same result – please comment below. Rather than passing the null object, we can run an inline test within the Items attribute and pass either a valid object for iterating (when the object is not null) or an empty array (when the object is null). It took me a while to work out how to create an empty array. @array('') returned an array with a length of 1, so the ForEach loop fired but then subsequently failed as there was nothing to grab from the array. An empty string had the same effect too. We need to generate an array (or a string) with a length of zero. The solution was to 'take' from array('') – how many items? Zero. This returned a non-null array with a length of 0 – the ForEach loop now didn't fail, it succeeded and, better still, didn't trigger the sub-activities – the result I wanted. Wrapping this into a coalesce provides a self-contained defensive null-handling pattern that should arguably always be used on the Items setting of the ForEach activity:

@coalesce(<##Your string or array to iterate that might be null##>, take(array(''), 0))

I would love to hear better/alternative approaches that don't rely on the If Condition activity.

Geospatial analysis with Azure Databricks

A few months ago, I wrote a blog demonstrating how to extract and analyse geospatial data in Azure Data Lake Analytics (ADLA) (here). The article aimed to prove that it was possible to run spatial analysis using U-SQL, even though it does not natively support spatial data analytics. The outcome of that experience was positive, however, with serious limitations in terms of performance. ADLA dynamically provisions resources and can perform analytics on terabytes to petabytes of data. However, because it must use the SQL Server Data Types and Spatial assemblies to perform spatial analysis, all the parallelism capabilities are suddenly limited. For example, if you are running an aggregation, ADLA will split the processing between multiple vertices, making it faster; but when running intersections between points and polygons, because it is a SQL-threaded operation, it will only use one vertex, and consequently the job might take hours to complete.

Since I last wrote my blog, the data analytics landscape has changed and, with that, new options became available, namely Azure Databricks. In this blog, I'll demonstrate how to run spatial analysis and export the results to a mounted point using the Magellan library and Azure Databricks. Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries (further details here).

Although their GitHub page mentions that the 1.0.5 Magellan library is available for Apache Spark 2.3+ clusters, I learned through a very difficult process that the only way to make it work in Azure Databricks is with an Apache Spark 2.2.1 cluster with Scala 2.11. The cluster I used for this experience consisted of a Standard_DS3_v2 driver type with 14GB Memory, 4 Cores and auto scaling enabled. In terms of datasets, I used the NYC Taxicab dataset to create the geometry points and the Magellan NYC Neighbourhoods GeoJSON dataset to extract the polygons. Both datasets were stored in a blob storage and added to Azure Databricks as a mount point.

As always, first we need to import the libraries.

//Import Libraries
import magellan._
import org.apache.spark.sql.magellan.dsl.expressions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

Next, we define the schema of the NYC Taxicab dataset and load the data to a DataFrame. While loading the data, we convert the pickup longitude and latitude into a Magellan Point.
If there was a need, in the same operation we could also add another Magellan Point from the drop-off longitude and latitude.

//Define schema for the NYC taxi data
val schema = StructType(Array(
    StructField("vendorId", StringType, false),
    StructField("pickup_datetime", StringType, false),
    StructField("dropoff_datetime", StringType, false),
    StructField("passenger_count", IntegerType, false),
    StructField("trip_distance", DoubleType, false),
    StructField("pickup_longitude", DoubleType, false),
    StructField("pickup_latitude", DoubleType, false),
    StructField("rateCodeId", StringType, false),
    StructField("store_fwd", StringType, false),
    StructField("dropoff_longitude", DoubleType, false),
    StructField("dropoff_latitude", DoubleType, false),
    StructField("payment_type", StringType, false),
    StructField("fare_amount", StringType, false),
    StructField("extra", StringType, false),
    StructField("mta_tax", StringType, false),
    StructField("tip_amount", StringType, false),
    StructField("tolls_amount", StringType, false),
    StructField("improvement_surcharge", StringType, false),
    StructField("total_amount", DoubleType, false)))

//Read data from the NYC Taxicab dataset and create a Magellan point
val trips = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("mode", "DROPMALFORMED")
      .schema(schema)
      .load("/mnt/geospatial/nyctaxis/*")
      .withColumn("point", point($"pickup_longitude", $"pickup_latitude"))

The next step is to load the neighbourhood data. As mentioned in their documentation, Magellan supports the reading of ESRI, GeoJSON, OSM-XML and WKT formats. From the GeoJSON dataset, Magellan will extract a collection of polygons and read the metadata into a Map. There are three things to notice in the code below: first, the extraction of the polygon; second, the selection of the key corresponding to the neighbourhood name; and finally, the provision of a hint that defines what the index precision should be. This operation, alongside the injection of a spatial join rule into Catalyst, massively increases the performance of the queries. To have a better understanding of this operation, read this excellent blog.

//Read GeoJSON file and define index precision
val neighborhoods = sqlContext.read
      .format("magellan")
      .option("type", "geojson")
      .load("/mnt/geospatial/neighborhoods/neighborhoods.geojson")
      .select($"polygon",
        $"metadata"("neighborhood").as("neighborhood"))
      .index(30)

Now that we have our two datasets loaded, we can run our geospatial query to identify in which neighbourhood the pickup points fall. To achieve our goal, we need to join the two DataFrames and apply a within predicate. As a curiosity, if we consider that m represents the number of points (12.748.987), n the number of polygons (310), p the average number of edges per polygon (104), and the cost of the naive approach is O(mnp), then we would roughly perform 4 trillion calculations on a single node to determine where each point falls.

//Inject rules and join DataFrames with within predicate
magellan.Utils.injectRules(spark)

val intersected = trips.join(neighborhoods)
        .where($"point" within $"polygon")

The above code does not take longer than 1 second to execute. It is only when we want to obtain details about our DataFrame that the computing time is visible.
For example, if we want to know which neighbourhood has the most pickups, we can write the following code, which takes on average 40 seconds.

//Neighbourhoods that received the most pickups
display(intersected
       .groupBy('neighborhood)
       .count()
       .orderBy($"count".desc))

If we want to save the data and identify which pickups fall inside the NYC neighbourhoods, then we have to rewrite our intersected DataFrame to select all columns except the Magellan Points and Polygons, add a new column to the DataFrame and export the data back to the blob, as shown below.

//Select pickup points that don't fall inside a neighbourhood
val nonIntersected = trips
                      .select($"vendorId", $"pickup_datetime", $"dropoff_datetime", $"passenger_count", $"trip_distance", $"pickup_longitude", $"pickup_latitude", $"rateCodeId", $"store_fwd", $"dropoff_longitude", $"dropoff_latitude", $"payment_type", $"fare_amount", $"extra", $"mta_tax", $"tip_amount", $"tolls_amount", $"improvement_surcharge", $"total_amount")
                      .except(intersected)

//Add new column intersected_flag
val intersectedFlag = "1"
val nonIntersectedFlag = "0"
val tripsIntersected = intersected.withColumn("intersected_flag", expr(intersectedFlag))
val tripsNotIntersected = nonIntersected.withColumn("intersected_flag", expr(nonIntersectedFlag))

//Union DataFrames
val allTrips = tripsNotIntersected.union(tripsIntersected)

//Save data to the blob
intersected.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/mnt/geospatial/trips/trips.csv")

In summary, reading a dataset of 1.8GB, applying geospatial analysis and exporting it back to the blob storage only took on average 1 min, which is miles better when compared with my previous attempt with U-SQL.

As always, if you have any comments or questions, do let me know.

Parsing nested JSON lists in Databricks using Python

Parsing complex JSON structures is usually not a trivial task. When your destination is a database, what you expect naturally is a flattened result set. Things get more complicated when your JSON source is a web service and the result consists of multiple nested objects, including lists in lists and so on. Things get even more complicated if the JSON schema changes over time, which is often a real-life scenario.

We have these wonderful Azure Logic Apps, which help us consistently get the JSON results from various sources. However, Logic Apps are not so good at parsing more complex nested structures. And they definitely don't like even subtle source schema changes.

Enter Databricks! With Databricks you get:

- An easy way to infer the JSON schema and avoid creating it manually
- Subtle changes in the JSON schema won't break things
- The ability to explode nested lists into rows in a very easy way (see the Notebook below)
- Speed!

Following is an example Databricks Notebook (Python) demonstrating the above claims. The JSON sample consists of an imaginary JSON result set, which contains a list of car models within a list of car vendors within a list of people. We want to flatten this result into a dataframe. Here you go:
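As a minimal sketch of that approach (the sample records and the names below are illustrative, not the original notebook's data), the nested lists can be exploded into rows like this:

# Illustrative sample: a list of people, each with a list of car vendors, each with a list of models
from pyspark.sql.functions import col, explode

sample = [
    '{"name": "Alice", "vendors": [{"vendor": "Ford", "models": ["Fiesta", "Focus"]}, {"vendor": "BMW", "models": ["i3"]}]}',
    '{"name": "Bob", "vendors": [{"vendor": "Tesla", "models": ["Model 3", "Model S"]}]}'
]

# Let Spark infer the schema from the JSON itself instead of defining it manually
df = spark.read.json(sc.parallelize(sample))

# Explode the nested lists into one row per person/vendor/model
flattened = (df
             .select(col("name"), explode(col("vendors")).alias("vendor"))
             .select(col("name"),
                     col("vendor.vendor").alias("car_vendor"),
                     explode(col("vendor.models")).alias("car_model")))

display(flattened)

Because the schema is inferred, adding an extra attribute to the source JSON simply surfaces as a new column rather than breaking the parse.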
Purpose","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":18409,"instance_type_id":"Standard_DS4_v2","node_type_id":"Standard_DS4_v2","description":"Standard_DS4_v2","support_cluster_tags":true,"container_memory_mb":23011,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_DS4_v2","provider":"Azure","local_disk_size_gb":56,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":28672,"num_cores":8,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":28672,"is_hidden":false,"category":"General Purpose","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":40658,"instance_type_id":"Standard_DS5_v2","node_type_id":"Standard_DS5_v2","description":"Standard_DS5_v2","support_cluster_tags":true,"container_memory_mb":50823,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_DS5_v2","provider":"Azure","local_disk_size_gb":112,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":57344,"num_cores":16,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":57344,"is_hidden":false,"category":"General Purpose","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":21587,"instance_type_id":"Standard_D8s_v3","node_type_id":"Standard_D8s_v3","description":"Standard_D8s_v3","support_cluster_tags":true,"container_memory_mb":26984,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":348},"node_instance_type":{"instance_type_id":"Standard_D8s_v3","provider":"Azure","local_disk_size_gb":64,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":32768,"num_cores":8,"cpu_quota_type":"Standard DSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":32768,"is_hidden":false,"category":"General 
Purpose","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":47015,"instance_type_id":"Standard_D16s_v3","node_type_id":"Standard_D16s_v3","description":"Standard_D16s_v3","support_cluster_tags":true,"container_memory_mb":58769,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":348},"node_instance_type":{"instance_type_id":"Standard_D16s_v3","provider":"Azure","local_disk_size_gb":128,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":65536,"num_cores":16,"cpu_quota_type":"Standard DSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":65536,"is_hidden":false,"category":"General Purpose","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":97871,"instance_type_id":"Standard_D32s_v3","node_type_id":"Standard_D32s_v3","description":"Standard_D32s_v3","support_cluster_tags":true,"container_memory_mb":122339,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":348},"node_instance_type":{"instance_type_id":"Standard_D32s_v3","provider":"Azure","local_disk_size_gb":256,"supports_accelerated_networking":true,"compute_units":32.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":131072,"num_cores":32,"cpu_quota_type":"Standard DSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":131072,"is_hidden":false,"category":"General Purpose","num_cores":32.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":199583,"instance_type_id":"Standard_D64s_v3","node_type_id":"Standard_D64s_v3","description":"Standard_D64s_v3","support_cluster_tags":true,"container_memory_mb":249479,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":348},"node_instance_type":{"instance_type_id":"Standard_D64s_v3","provider":"Azure","local_disk_size_gb":512,"supports_accelerated_networking":true,"compute_units":64.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":262144,"num_cores":64,"cpu_quota_type":"Standard DSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":262144,"is_hidden":false,"category":"General 
Purpose","num_cores":64.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":7284,"instance_type_id":"Standard_D3_v2","node_type_id":"Standard_D3_v2","description":"Standard_D3_v2","support_cluster_tags":true,"container_memory_mb":9105,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D3_v2","provider":"Azure","local_disk_size_gb":200,"supports_accelerated_networking":true,"compute_units":4.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":14336,"num_cores":4,"cpu_quota_type":"Standard Dv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":14336,"is_hidden":false,"category":"General Purpose (HDD)","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":21587,"instance_type_id":"Standard_D8_v3","node_type_id":"Standard_D8_v3","description":"Standard_D8_v3","support_cluster_tags":true,"container_memory_mb":26984,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D8_v3","provider":"Azure","local_disk_size_gb":200,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":32768,"num_cores":8,"cpu_quota_type":"Standard Dv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":32768,"is_hidden":false,"category":"General Purpose (HDD)","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":47015,"instance_type_id":"Standard_D16_v3","node_type_id":"Standard_D16_v3","description":"Standard_D16_v3","support_cluster_tags":true,"container_memory_mb":58769,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D16_v3","provider":"Azure","local_disk_size_gb":400,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":65536,"num_cores":16,"cpu_quota_type":"Standard Dv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":65536,"is_hidden":false,"category":"General Purpose 
(HDD)","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":97871,"instance_type_id":"Standard_D32_v3","node_type_id":"Standard_D32_v3","description":"Standard_D32_v3","support_cluster_tags":true,"container_memory_mb":122339,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D32_v3","provider":"Azure","local_disk_size_gb":800,"supports_accelerated_networking":true,"compute_units":32.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":131072,"num_cores":32,"cpu_quota_type":"Standard Dv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":131072,"is_hidden":false,"category":"General Purpose (HDD)","num_cores":32.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":199583,"instance_type_id":"Standard_D64_v3","node_type_id":"Standard_D64_v3","description":"Standard_D64_v3","support_cluster_tags":true,"container_memory_mb":249479,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D64_v3","provider":"Azure","local_disk_size_gb":1600,"supports_accelerated_networking":true,"compute_units":64.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":262144,"num_cores":64,"cpu_quota_type":"Standard Dv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":262144,"is_hidden":false,"category":"General Purpose (HDD)","num_cores":64.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":18409,"instance_type_id":"Standard_D12_v2","node_type_id":"Standard_D12_v2","description":"Standard_D12_v2","support_cluster_tags":true,"container_memory_mb":23011,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D12_v2","provider":"Azure","local_disk_size_gb":200,"supports_accelerated_networking":true,"compute_units":4.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":28672,"num_cores":4,"cpu_quota_type":"Standard Dv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":28672,"is_hidden":false,"category":"Memory Optimized (Remote 
HDD)","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":40658,"instance_type_id":"Standard_D13_v2","node_type_id":"Standard_D13_v2","description":"Standard_D13_v2","support_cluster_tags":true,"container_memory_mb":50823,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D13_v2","provider":"Azure","local_disk_size_gb":400,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":57344,"num_cores":8,"cpu_quota_type":"Standard Dv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":57344,"is_hidden":false,"category":"Memory Optimized (Remote HDD)","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":85157,"instance_type_id":"Standard_D14_v2","node_type_id":"Standard_D14_v2","description":"Standard_D14_v2","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D14_v2","provider":"Azure","local_disk_size_gb":800,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":114688,"num_cores":16,"cpu_quota_type":"Standard Dv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"Memory Optimized (Remote HDD)","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":107407,"instance_type_id":"Standard_D15_v2","node_type_id":"Standard_D15_v2","description":"Standard_D15_v2","support_cluster_tags":true,"container_memory_mb":134259,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D15_v2","provider":"Azure","local_disk_size_gb":1000,"supports_accelerated_networking":true,"compute_units":20.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":143360,"num_cores":20,"cpu_quota_type":"Standard Dv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":143360,"is_hidden":false,"category":"Memory Optimized (Remote 
HDD)","num_cores":20.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":18409,"instance_type_id":"Standard_DS12_v2","node_type_id":"Standard_DS12_v2","description":"Standard_DS12_v2","support_cluster_tags":true,"container_memory_mb":23011,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_DS12_v2","provider":"Azure","local_disk_size_gb":56,"supports_accelerated_networking":true,"compute_units":4.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":28672,"num_cores":4,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":28672,"is_hidden":false,"category":"Memory Optimized","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":40658,"instance_type_id":"Standard_DS13_v2","node_type_id":"Standard_DS13_v2","description":"Standard_DS13_v2","support_cluster_tags":true,"container_memory_mb":50823,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_DS13_v2","provider":"Azure","local_disk_size_gb":112,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":57344,"num_cores":8,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":57344,"is_hidden":false,"category":"Memory Optimized","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":85157,"instance_type_id":"Standard_DS14_v2","node_type_id":"Standard_DS14_v2","description":"Standard_DS14_v2","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_DS14_v2","provider":"Azure","local_disk_size_gb":224,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":114688,"num_cores":16,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"Memory 
Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":107407,"instance_type_id":"Standard_DS15_v2","node_type_id":"Standard_DS15_v2","description":"Standard_DS15_v2","support_cluster_tags":true,"container_memory_mb":134259,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_DS15_v2","provider":"Azure","local_disk_size_gb":280,"supports_accelerated_networking":true,"compute_units":20.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":143360,"num_cores":20,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":143360,"is_hidden":false,"category":"Memory Optimized","num_cores":20.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":47015,"instance_type_id":"Standard_E8s_v3","node_type_id":"Standard_E8s_v3","description":"Standard_E8s_v3","support_cluster_tags":true,"container_memory_mb":58769,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_E8s_v3","provider":"Azure","local_disk_size_gb":128,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":65536,"num_cores":8,"cpu_quota_type":"Standard ESv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":65536,"is_hidden":false,"category":"Memory Optimized","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":97871,"instance_type_id":"Standard_E16s_v3","node_type_id":"Standard_E16s_v3","description":"Standard_E16s_v3","support_cluster_tags":true,"container_memory_mb":122339,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_E16s_v3","provider":"Azure","local_disk_size_gb":256,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":131072,"num_cores":16,"cpu_quota_type":"Standard ESv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":131072,"is_hidden":false,"category":"Memory 
Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":199583,"instance_type_id":"Standard_E32s_v3","node_type_id":"Standard_E32s_v3","description":"Standard_E32s_v3","support_cluster_tags":true,"container_memory_mb":249479,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_E32s_v3","provider":"Azure","local_disk_size_gb":512,"supports_accelerated_networking":true,"compute_units":32.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":262144,"num_cores":32,"cpu_quota_type":"Standard ESv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":262144,"is_hidden":false,"category":"Memory Optimized","num_cores":32.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":21587,"instance_type_id":"Standard_L4s","node_type_id":"Standard_L4s","description":"Standard_L4s","support_cluster_tags":true,"container_memory_mb":26984,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_L4s","provider":"Azure","local_disk_size_gb":678,"supports_accelerated_networking":false,"compute_units":4.0,"number_of_ips":2,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":32768,"num_cores":4,"cpu_quota_type":"Standard LS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":32768,"is_hidden":false,"category":"Storage Optimized","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":47015,"instance_type_id":"Standard_L8s","node_type_id":"Standard_L8s","description":"Standard_L8s","support_cluster_tags":true,"container_memory_mb":58769,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_L8s","provider":"Azure","local_disk_size_gb":1388,"supports_accelerated_networking":false,"compute_units":8.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":65536,"num_cores":8,"cpu_quota_type":"Standard LS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":65536,"is_hidden":false,"category":"Storage 
Optimized","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":97871,"instance_type_id":"Standard_L16s","node_type_id":"Standard_L16s","description":"Standard_L16s","support_cluster_tags":true,"container_memory_mb":122339,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_L16s","provider":"Azure","local_disk_size_gb":2807,"supports_accelerated_networking":false,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":131072,"num_cores":16,"cpu_quota_type":"Standard LS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":131072,"is_hidden":false,"category":"Storage Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":199583,"instance_type_id":"Standard_L32s","node_type_id":"Standard_L32s","description":"Standard_L32s","support_cluster_tags":true,"container_memory_mb":249479,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_L32s","provider":"Azure","local_disk_size_gb":5630,"supports_accelerated_networking":false,"compute_units":32.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":262144,"num_cores":32,"cpu_quota_type":"Standard LS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":262144,"is_hidden":false,"category":"Storage Optimized","num_cores":32.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":2516,"instance_type_id":"Standard_F4s","node_type_id":"Standard_F4s","description":"Standard_F4s","support_cluster_tags":true,"container_memory_mb":3146,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_F4s","provider":"Azure","local_disk_size_gb":16,"supports_accelerated_networking":true,"compute_units":4.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":8192,"num_cores":4,"cpu_quota_type":"Standard FS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":8192,"is_hidden":false,"category":"Compute 
Optimized","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":8873,"instance_type_id":"Standard_F8s","node_type_id":"Standard_F8s","description":"Standard_F8s","support_cluster_tags":true,"container_memory_mb":11092,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_F8s","provider":"Azure","local_disk_size_gb":32,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":16384,"num_cores":8,"cpu_quota_type":"Standard FS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":16384,"is_hidden":false,"category":"Compute Optimized","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":21587,"instance_type_id":"Standard_F16s","node_type_id":"Standard_F16s","description":"Standard_F16s","support_cluster_tags":true,"container_memory_mb":26984,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_F16s","provider":"Azure","local_disk_size_gb":64,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":16,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":32768,"num_cores":16,"cpu_quota_type":"Standard FS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":32768,"is_hidden":false,"category":"Compute Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":85157,"instance_type_id":"Standard_H16","node_type_id":"Standard_H16","description":"Standard_H16","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":8},"node_instance_type":{"instance_type_id":"Standard_H16","provider":"Azure","local_disk_size_gb":2000,"supports_accelerated_networking":false,"compute_units":16.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":114688,"num_cores":16,"cpu_quota_type":"Standard H Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"Compute Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":2,"spark_heap_memory":85157,"instance_type_id":"Standard_NC12","node_type_id":"Standard_NC12","description":"Standard_NC12 
(beta)","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":48},"node_instance_type":{"instance_type_id":"Standard_NC12","provider":"Azure","local_disk_size_gb":680,"supports_accelerated_networking":false,"compute_units":12.0,"number_of_ips":2,"local_disks":1,"reserved_compute_units":1.0,"gpus":2,"memory_mb":114688,"num_cores":12,"cpu_quota_type":"Standard NC Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":48,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"GPU Accelerated","num_cores":12.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":4,"spark_heap_memory":174155,"instance_type_id":"Standard_NC24","node_type_id":"Standard_NC24","description":"Standard_NC24 (beta)","support_cluster_tags":true,"container_memory_mb":217694,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":48},"node_instance_type":{"instance_type_id":"Standard_NC24","provider":"Azure","local_disk_size_gb":1440,"supports_accelerated_networking":false,"compute_units":24.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":4,"memory_mb":229376,"num_cores":24,"cpu_quota_type":"Standard NC Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":229376,"is_hidden":false,"category":"GPU Accelerated","num_cores":24.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":1,"spark_heap_memory":85157,"instance_type_id":"Standard_NC6s_v3","node_type_id":"Standard_NC6s_v3","description":"Standard_NC6s_v3 (beta)","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":0},"node_instance_type":{"instance_type_id":"Standard_NC6s_v3","provider":"Azure","local_disk_size_gb":736,"supports_accelerated_networking":false,"compute_units":6.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":1,"memory_mb":114688,"num_cores":6,"cpu_quota_type":"Standard NCSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":12,"supported_disk_types":[{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"GPU Accelerated","num_cores":6.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":2,"spark_heap_memory":174155,"instance_type_id":"Standard_NC12s_v3","node_type_id":"Standard_NC12s_v3","description":"Standard_NC12s_v3 (beta)","support_cluster_tags":true,"container_memory_mb":217694,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":0},"node_instance_type":{"instance_type_id":"Standard_NC12s_v3","provider":"Azure","local_disk_size_gb":1474,"supports_accelerated_networking":false,"compute_units":12.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":2,"memory_mb":229376,"num_cores":12,"cpu_quota_type":"Standard NCSv3 Family 
vCPUs","local_disk_type":"AHCI","max_attachable_disks":24,"supported_disk_types":[{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":229376,"is_hidden":false,"category":"GPU Accelerated","num_cores":12.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":4,"spark_heap_memory":352151,"instance_type_id":"Standard_NC24s_v3","node_type_id":"Standard_NC24s_v3","description":"Standard_NC24s_v3 (beta)","support_cluster_tags":true,"container_memory_mb":440189,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":0},"node_instance_type":{"instance_type_id":"Standard_NC24s_v3","provider":"Azure","local_disk_size_gb":2948,"supports_accelerated_networking":false,"compute_units":24.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":4,"memory_mb":458752,"num_cores":24,"cpu_quota_type":"Standard NCSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":458752,"is_hidden":false,"category":"GPU Accelerated","num_cores":24.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false}],"default_node_type_id":"Standard_DS3_v2"},"enableDatabaseSupportClusterChoice":true,"enableClusterAcls":true,"notebookRevisionVisibilityHorizon":0,"serverlessClusterProductName":"Serverless Pool","showS3TableImportOption":false,"redirectBrowserOnWorkspaceSelection":true,"maxEbsVolumesPerInstance":10,"enableRStudioUI":false,"isAdmin":true,"deltaProcessingBatchSize":1000,"timerUpdateQueueLength":100,"sqlAclsEnabledMap":{"spark.databricks.acl.enabled":"true","spark.databricks.acl.sqlOnly":"true"},"enableLargeResultDownload":true,"maxElasticDiskCapacityGB":5000,"serverlessDefaultMinWorkers":2,"zoneInfos":[],"enableCustomSpotPricingUIByTier":true,"serverlessClustersEnabled":true,"enableWorkspaceBrowserSorting":true,"enableSentryLogging":false,"enableFindAndReplace":true,"disallowUrlImportExceptFromDocs":false,"defaultStandardClusterModel":{"cluster_name":"","node_type_id":"Standard_DS3_v2","spark_version":"4.0.x-scala2.11","num_workers":null,"autoscale":{"min_workers":2,"max_workers":8},"autotermination_minutes":120,"default_tags":{"Vendor":"Databricks","Creator":"ivv@adatis.co.uk","ClusterName":null,"ClusterId":""}},"enableEBSVolumesUIForJobs":true,"enablePublishNotebooks":false,"enableBitbucketCloud":true,"shouldShowCommandStatus":true,"createTableInNotebookS3Link":{"url":"https://docs.azuredatabricks.net/_static/notebooks/data-import/s3.html","displayName":"S3","workspaceFileName":"S3 Example"},"sanitizeHtmlResult":true,"enableClusterPinningUI":true,"enableJobAclsConfig":false,"enableFullTextSearch":false,"enableElasticSparkUI":true,"enableNewClustersCreate":true,"clusters":true,"allowRunOnPendingClusters":true,"useAutoscalingByDefault":true,"enableAzureToolbar":true,"fileStoreBase":"FileStore","enableEmailInAzure":true,"enableRLibraries":true,"enableTableAclsConfig":false,"enableSshKeyUIInJobs":true,"enableDetachAndAttachSubMenu":true,"configurableSparkOptionsSpec":[{"keyPattern":"spark\\.kryo(\\.[^\\.]+)+","valuePattern":".*","keyPatternDisplay":"spark.kryo.*","valuePatternDisplay":"*","description":"Configuration options for Kryo 
serialization"},{"keyPattern":"spark\\.io\\.compression\\.codec","valuePattern":"(lzf|snappy|org\\.apache\\.spark\\.io\\.LZFCompressionCodec|org\\.apache\\.spark\\.io\\.SnappyCompressionCodec)","keyPatternDisplay":"spark.io.compression.codec","valuePatternDisplay":"snappy|lzf","description":"The codec used to compress internal data such as RDD partitions, broadcast variables and shuffle outputs."},{"keyPattern":"spark\\.serializer","valuePattern":"(org\\.apache\\.spark\\.serializer\\.JavaSerializer|org\\.apache\\.spark\\.serializer\\.KryoSerializer)","keyPatternDisplay":"spark.serializer","valuePatternDisplay":"org.apache.spark.serializer.JavaSerializer|org.apache.spark.serializer.KryoSerializer","description":"Class to use for serializing objects that will be sent over the network or need to be cached in serialized form."},{"keyPattern":"spark\\.rdd\\.compress","valuePattern":"(true|false)","keyPatternDisplay":"spark.rdd.compress","valuePatternDisplay":"true|false","description":"Whether to compress serialized RDD partitions (e.g. for StorageLevel.MEMORY_ONLY_SER). Can save substantial space at the cost of some extra CPU time."},{"keyPattern":"spark\\.speculation","valuePattern":"(true|false)","keyPatternDisplay":"spark.speculation","valuePatternDisplay":"true|false","description":"Whether to use speculation (recommended off for streaming)"},{"keyPattern":"spark\\.es(\\.[^\\.]+)+","valuePattern":".*","keyPatternDisplay":"spark.es.*","valuePatternDisplay":"*","description":"Configuration options for ElasticSearch"},{"keyPattern":"es(\\.([^\\.]+))+","valuePattern":".*","keyPatternDisplay":"es.*","valuePatternDisplay":"*","description":"Configuration options for ElasticSearch"},{"keyPattern":"spark\\.(storage|shuffle)\\.memoryFraction","valuePattern":"0?\\.0*([1-9])([0-9])*","keyPatternDisplay":"spark.(storage|shuffle).memoryFraction","valuePatternDisplay":"(0.0,1.0)","description":"Fraction of Java heap to use for Spark's shuffle or storage"},{"keyPattern":"spark\\.streaming\\.backpressure\\.enabled","valuePattern":"(true|false)","keyPatternDisplay":"spark.streaming.backpressure.enabled","valuePatternDisplay":"true|false","description":"Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5). This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values `spark.streaming.receiver.maxRate` and `spark.streaming.kafka.maxRatePerPartition` if they are set."},{"keyPattern":"spark\\.streaming\\.receiver\\.maxRate","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.receiver.maxRate","valuePatternDisplay":"numeric","description":"Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no limit on the rate. 
See the deployment guide in the Spark Streaming programing guide for mode details."},{"keyPattern":"spark\\.streaming\\.kafka\\.maxRatePerPartition","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.kafka.maxRatePerPartition","valuePatternDisplay":"numeric","description":"Maximum rate (number of records per second) at which data will be read from each Kafka partition when using the Kafka direct stream API introduced in Spark 1.3. See the Kafka Integration guide for more details."},{"keyPattern":"spark\\.streaming\\.kafka\\.maxRetries","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.kafka.maxRetries","valuePatternDisplay":"numeric","description":"Maximum number of consecutive retries the driver will make in order to find the latest offsets on the leader of each partition (a default value of 1 means that the driver will make a maximum of 2 attempts). Only applies to the Kafka direct stream API introduced in Spark 1.3."},{"keyPattern":"spark\\.streaming\\.ui\\.retainedBatches","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.ui.retainedBatches","valuePatternDisplay":"numeric","description":"How many batches the Spark Streaming UI and status APIs remember before garbage collecting."}],"enableReactNotebookComments":true,"enableAdminPasswordReset":false,"checkBeforeAddingAadUser":true,"enableResetPassword":true,"maxClusterTagValueLength":256,"enableJobsSparkUpgrade":true,"createTableInNotebookDBFSLink":{"url":"https://docs.azuredatabricks.net/_static/notebooks/data-import/dbfs.html","displayName":"DBFS","workspaceFileName":"DBFS Example"},"perClusterAutoterminationEnabled":true,"enableNotebookCommandNumbers":true,"measureRoundTripTimes":true,"allowStyleInSanitizedHtml":false,"sparkVersions":[{"key":"3.3.x-scala2.10","displayName":"3.3 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-86a9b375074f5afad339e70230ec0ec265c4cefbd280844785fab3bcde5869f9","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"4.1.x-scala2.11","displayName":"4.1 (includes Apache Spark 2.3.0, Scala 2.11)","packageLabel":"spark-image-e69ab61f9eb55b59a5df3b9d9ca1605268022efa7b699e11b408049038d8ea8c","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS","SUPPORTS_RSTUDIO"]},{"key":"4.1.x-ml-gpu-scala2.11","displayName":"4.1 ML Beta (includes Apache Spark 2.3.0, GPU, Scala 2.11)","packageLabel":"spark-image-5907529b625e97ac8feb0a069002b4fdb861a16740752b5df568fe4efb1c004e","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":[]},{"key":"4.0.x-scala2.11","displayName":"4.0 (includes Apache Spark 2.3.0, Scala 2.11)","packageLabel":"spark-image-958dfd1fcde8070c85e13f869b8d816b71d63cac31357210d4858c3ff3be83ce","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"3.4.x-scala2.11","displayName":"3.4 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-35a5008cd4a7aac70818911317758515b85d0fcab6ead08fa3f66157119fa6ce","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION"]},{"key":"3.2.x-scala2.10","displayName":"3.2 (includes Apache Spark 2.2.0, Scala 
2.10)","packageLabel":"spark-image-557788bea0eea16bbf7a8ba13ace07e64dd7fc86270bd5cea086097fe886431f","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"4.1.x-ml-scala2.11","displayName":"4.1 ML Beta (includes Apache Spark 2.3.0, Scala 2.11)","packageLabel":"spark-image-ad599fbbca53898d7531a4b94c73f3e68b3c2e49e3502c09f6bf01468d801882","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":[]},{"key":"latest-experimental-scala2.10","displayName":"[DO NOT USE] Latest experimental (3.5 snapshot, Scala 2.10)","packageLabel":"spark-image-5e4f1f2feb631875a6036dffb069ec14b436939b5efe0ecb3ff8220c835298d6","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-rc-scala2.11","displayName":"Latest RC (4.2 snapshot, Scala 2.11)","packageLabel":"spark-image-eb6b629259e901623758e884730de4a93e6babcb8995b191e759e09a596490fd","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS","SUPPORTS_RSTUDIO"]},{"key":"latest-stable-scala2.11","displayName":"Latest stable (Scala 2.11)","packageLabel":"spark-image-e69ab61f9eb55b59a5df3b9d9ca1605268022efa7b699e11b408049038d8ea8c","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS","SUPPORTS_RSTUDIO"]},{"key":"4.1.x-gpu-scala2.11","displayName":"4.1 (includes Apache Spark 2.3.0, GPU, Scala 2.11)","packageLabel":"spark-image-79462cd30eaa7c6e877243f75bd986e68290424e3856fad0fbd47a3b6353032e","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_RSTUDIO"]},{"key":"3.5.x-scala2.10","displayName":"3.5 LTS (includes Apache Spark 2.2.1, Scala 2.10)","packageLabel":"spark-image-2c23eb0f5a3d83904705cf416a815421bdca898db8835c0dcf9084c01509594e","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-rc-scala2.10","displayName":"[DO NOT USE] Latest RC (3.5 snapshot, Scala 2.10)","packageLabel":"spark-image-5e4f1f2feb631875a6036dffb069ec14b436939b5efe0ecb3ff8220c835298d6","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-stable-scala2.10","displayName":"[DEPRECATED] Latest stable (Scala 2.10)","packageLabel":"spark-image-5e4f1f2feb631875a6036dffb069ec14b436939b5efe0ecb3ff8220c835298d6","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"3.1.x-scala2.11","displayName":"3.1 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-241fa8b78ee6343242b1756b18076270894385ff40a81172a6fb5eadf66155d3","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"3.1.x-scala2.10","displayName":"3.1 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-7efac6b9a8f2da59cb4f6d0caac46cfcb3f1ebf64c8073498c42d0360f846714","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"3.3.x-scala2.11","displayName":"3.3 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-46cc39a9afa43fbd7bfa9f4f5ed8d23f658cd0b0d74208627243222ae0d22f8d","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"3.5.x-scala2.11","displayName":"3.5 LTS (includes Apache Spark 2.2.1, Scala 
2.11)","packageLabel":"spark-image-ddab4ca82a96df57f2dd2d5acfecd2373ac2db360d9559cd59e632dee270f05b","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-experimental-scala2.11","displayName":"Latest experimental (4.2 snapshot, Scala 2.11)","packageLabel":"spark-image-eb6b629259e901623758e884730de4a93e6babcb8995b191e759e09a596490fd","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS","SUPPORTS_RSTUDIO"]},{"key":"3.2.x-scala2.11","displayName":"3.2 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-5537926238bc55cb6cd76ee0f0789511349abead3781c4780721a845f34b5d4e","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"latest-rc-gpu-scala2.11","displayName":"Latest RC (4.2 snapshot, GPU, Scala 2.11)","packageLabel":"spark-image-8fe44f4a94defa1f3e13ffcbe83d5fe86c26e651f20c9250f5f09930295a9e66","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_RSTUDIO"]},{"key":"3.4.x-scala2.10","displayName":"3.4 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-3e68b33974ebcf196fd048476d71c8747b8e3596456ec9e6621d73388e5484f9","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION"]}],"enablePresentationMode":false,"enableClearStateAndRunAll":true,"enableTableAclsByTier":true,"enableRestrictedClusterCreation":false,"enableFeedback":false,"enableClusterAutoScaling":true,"enableUserVisibleDefaultTags":true,"defaultNumWorkers":8,"serverContinuationTimeoutMillis":10000,"jobsUnreachableThresholdMillis":60000,"driverStderrFilePrefix":"stderr","roundTripReportTimeoutMs":5000,"enableNotebookRefresh":true,"createTableInNotebookImportedFileLink":{"url":"https://docs.azuredatabricks.net/_static/notebooks/data-import/imported-file.html","displayName":"Imported File","workspaceFileName":"Imported File 
Example"},"accountsOwnerUrl":"https://portal.azure.com/?feature.customportal=false&microsoft_azure_marketplace_ItemHideKey=DatabricksExtensionHidden&Microsoft_Azure_Databricks=true#resource/subscriptions/76dd74d5-e8e7-493d-91dc-d8113ee1f20c/resourceGroups/RGABI/providers/Microsoft.Databricks/workspaces/abiweuadlsdev","driverStdoutFilePrefix":"stdout","showDbuPricing":true,"databricksDocsBaseHostname":"docs.azuredatabricks.net","defaultNodeTypeToPricingUnitsMap":{"Standard_E64s_v3":16,"r3.2xlarge":2,"i3.4xlarge":4,"Standard_NC12s_v2":6.75,"class-node":1,"m4.2xlarge":1.5,"Standard_D11_v2":0.5,"r4.xlarge":1,"m4.4xlarge":3,"p3.2xlarge":4.15,"Standard_DS5_v2":3,"Standard_D2s_v3":0.5,"Standard_DS4_v2_Promo":1.5,"Standard_DS14":4,"Standard_DS11_v2_Promo":0.5,"r4.16xlarge":16,"Standard_NC6":1.5,"Standard_DS11":0.5,"Standard_D2_v3":0.5,"Standard_DS14_v2_Promo":4,"Standard_D64s_v3":12,"p2.8xlarge":9.76,"m4.10xlarge":8,"Standard_D8s_v3":1.5,"Standard_E32s_v3":8,"Standard_DS3":0.75,"Standard_DS2_v2":0.5,"r3.8xlarge":8,"r4.4xlarge":4,"dev-tier-node":1,"Standard_L8s":2,"Standard_D13_v2":2,"p3.16xlarge":33.2,"Standard_NC24rs_v3":20,"Standard_DS13_v2_Promo":2,"Standard_E4s_v3":1,"Standard_D3_v2":0.75,"Standard_NC24":6,"Standard_NC24r":6,"Standard_DS15_v2":5,"Standard_D16s_v3":3,"Standard_D5_v2":3,"Standard_E8s_v3":2,"Standard_DS2_v2_Promo":0.5,"c3.8xlarge":4,"Standard_D4_v3":0.75,"Standard_E2s_v3":0.5,"Standard_D32_v3":6,"Standard_DS3_v2":0.75,"Standard_NC6s_v3":5,"r3.4xlarge":4,"Standard_DS4":1.5,"i2.4xlarge":6,"Standard_DS3_v2_Promo":0.75,"m4.xlarge":0.75,"r4.8xlarge":8,"Standard_D14_v2":4,"Standard_H16":4,"Standard_NC12":3,"Standard_DS14_v2":4,"r4.large":0.5,"Standard_D15_v2":5,"Standard_DS12":1,"development-node":1,"i2.2xlarge":3,"Standard_NC6s_v2":3.38,"g2.8xlarge":6,"Standard_D12_v2":1,"i3.large":0.75,"Standard_NC12s_v3":10,"memory-optimized":1,"m4.large":0.4,"Standard_D16_v3":3,"Standard_F4s":0.5,"p2.16xlarge":19.52,"Standard_NC24rs_v2":13.5,"i3.8xlarge":8,"Standard_D32s_v3":6,"i3.16xlarge":16,"Standard_DS12_v2":1,"Standard_L32s":8,"Standard_D4s_v3":0.75,"Standard_DS13":2,"Standard_DS11_v2":0.5,"Standard_DS12_v2_Promo":1,"Standard_DS13_v2":2,"c3.2xlarge":1,"Standard_L4s":1,"Standard_F16s":2,"c4.2xlarge":1,"Standard_L16s":4,"i2.xlarge":1.5,"Standard_DS2":0.5,"compute-optimized":1,"c4.4xlarge":2,"Standard_DS5_v2_Promo":3,"Standard_D64_v3":12,"Standard_D2_v2":0.5,"Standard_D8_v3":1.5,"i3.2xlarge":2,"Standard_E16s_v3":4,"Standard_F8s":1,"c3.4xlarge":2,"Standard_NC24s_v2":13.5,"Standard_NC24s_v3":20,"Standard_D4_v2":1.5,"g2.2xlarge":1.5,"p3.8xlarge":16.6,"p2.xlarge":1.22,"m4.16xlarge":12,"Standard_DS4_v2":1.5,"c4.8xlarge":4,"i3.xlarge":1,"r3.xlarge":1,"r4.2xlarge":2,"i2.8xlarge":12},"tableFilesBaseFolder":"/tables","enableSparkDocsSearch":true,"sparkHistoryServerEnabled":true,"enableClusterAppsUIOnServerless":false,"enableEBSVolumesUI":true,"homePageWelcomeMessage":"","metastoreServiceRowLimit":1000000,"enableIPythonImportExport":true,"enableClusterTagsUIForJobs":true,"enableClusterTagsUI":true,"enableNotebookHistoryDiffing":true,"branch":"2.72.251","accountsLimit":-1,"enableSparkEnvironmentVariables":true,"enableX509Authentication":false,"useAADLogin":true,"enableStructuredStreamingNbOptimizations":true,"enableNotebookGitBranching":true,"terminatedClustersWindow":2592000000,"local":false,"enableNotebookLazyRenderWrapper":false,"enableClusterAutoScalingForJobs":true,"enableStrongPassword":false,"showReleaseNote":false,"displayDefaultContainerMemoryGB":30,"broadenedEditPermission":false,"enableWorkspacePu
rgeDryRun":false,"disableS3TableImport":true,"enableArrayParamsEdit":true,"deploymentMode":"production","useSpotForWorkers":true,"removePasswordInAccountSettings":true,"preferStartTerminatedCluster":false,"enableUserInviteWorkflow":true,"createTableConnectorOptionLinks":[{"url":"https://docs.databricks.com/_static/notebooks/data-import/azure-blob-store.html","displayName":"Azure Blob Storage","workspaceFileName":"Azure Blob Storage Import Example Notebook"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/data-import/jdbc.html","displayName":"JDBC","workspaceFileName":"JDBC Example"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/cassandra.html","displayName":"Cassandra","workspaceFileName":"Cassandra Example"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/structured-streaming-etl-kafka.html","displayName":"Kafka","workspaceFileName":"Kafka Example"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/redis.html","displayName":"Redis","workspaceFileName":"Redis Example"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/elasticsearch.html","displayName":"Elasticsearch","workspaceFileName":"Elasticsearch Example"}],"enableStaticNotebooks":true,"enableNewLineChart":true,"shouldReportUnhandledPromiseRejectionsToSentry":false,"sandboxForUrlSandboxFrame":"allow-scripts allow-popups allow-popups-to-escape-sandbox allow-forms","enableCssTransitions":true,"serverlessEnableElasticDisk":true,"minClusterTagKeyLength":1,"showHomepageFeaturedLinks":true,"pricingURL":"https://databricks.com/product/pricing","enableClusterEdit":true,"enableClusterAclsConfig":false,"useTempS3UrlForTableUpload":false,"notifyLastLogin":false,"enableFilePurge":true,"enableSshKeyUIByTier":true,"enableCreateClusterOnAttach":false,"defaultAutomatedPricePerDBU":0.35,"enableNotebookGitVersioning":true,"defaultMinWorkers":2,"commandStatusDebounceMaxWait":1000,"files":"files/","feedbackEmail":"feedback@databricks.com","enableDriverLogsUI":true,"enableExperimentalCharts":false,"defaultMaxWorkers":8,"enableWorkspaceAclsConfig":false,"serverlessRunPythonAsLowPrivilegeUser":false,"dropzoneMaxFileSize":2047,"enableNewClustersList":true,"enableNewDashboardViews":true,"enableJobListPermissionFilter":true,"terminatedInteractiveClustersMax":70,"driverLog4jFilePrefix":"log4j","enableSingleSignOn":false,"enableMavenLibraries":true,"updateTreeTableToV2Schema":false,"displayRowLimit":1000,"deltaProcessingAsyncEnabled":true,"enableSparkEnvironmentVariablesUI":false,"defaultSparkVersion":{"key":"4.0.x-scala2.11","displayName":"4.0 (includes Apache Spark 2.3.0, Scala 
2.11)","packageLabel":"spark-image-958dfd1fcde8070c85e13f869b8d816b71d63cac31357210d4858c3ff3be83ce","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},"enableNewLineChartParams":false,"deprecatedEnableStructuredDataAcls":false,"enableCustomSpotPricing":true,"enableRStudioFreeUI":false,"enableMountAclsConfig":false,"defaultAutoterminationMin":120,"useDevTierHomePage":false,"disableExportNotebook":false,"enableClusterClone":true,"enableNotebookLineNumbers":true,"enablePublishHub":false,"notebookHubUrl":"http://hub.dev.databricks.com/","commandStatusDebounceInterval":100,"showSqlEndpoints":true,"enableNotebookDatasetInfoView":true,"defaultTagKeys":{"CLUSTER_NAME":"ClusterName","VENDOR":"Vendor","CLUSTER_TYPE":"ResourceClass","CREATOR":"Creator","CLUSTER_ID":"ClusterId"},"enableClusterAclsByTier":true,"databricksDocsBaseUrl":"https://docs.azuredatabricks.net/","azurePortalLink":"https://portal.azure.com","cloud":"Azure","customSparkVersionPrefix":"custom:","disallowAddingAdmins":false,"enableSparkConfUI":true,"enableClusterEventsUI":true,"featureTier":"STANDARD_W_SEC_TIER","mavenCentralSearchEndpoint":"http://search.maven.org/solrsearch/select","defaultServerlessClusterModel":{"cluster_name":"","node_type_id":"Standard_DS13_v2","spark_version":"latest-stable-scala2.11","num_workers":null,"enable_jdbc_auto_start":true,"custom_tags":{"ResourceClass":"Serverless"},"autoscale":{"min_workers":2,"max_workers":20},"spark_conf":{"spark.databricks.cluster.profile":"serverless","spark.databricks.repl.allowedLanguages":"sql,python,r"},"autotermination_minutes":0,"enable_elastic_disk":true,"default_tags":{"Vendor":"Databricks","Creator":"ivv@adatis.co.uk","ClusterName":null,"ClusterId":""}},"enableClearRevisionHistoryForNotebook":true,"enableOrgSwitcherUI":true,"bitbucketCloudBaseApiV2Url":"https://api.bitbucket.org/2.0","clustersLimit":-1,"enableJdbcImport":true,"enableClusterAppsUIOnNormalClusters":false,"enableElasticDisk":true,"logfiles":"logfiles/","enableRelativeNotebookLinks":true,"enableMultiSelect":true,"homePageLogo":"login/DB_Azure_Lockup_2x.png","enableWebappSharding":true,"enableNotebookParamsEdit":true,"enableClusterDeltaUpdates":true,"enableSingleSignOnLogin":false,"separateTableForJobClusters":true,"ebsVolumeSizeLimitGB":{"GENERAL_PURPOSE_SSD":[100,4096],"THROUGHPUT_OPTIMIZED_HDD":[500,4096]},"enableClusterDeleteUI":true,"enableMountAcls":false,"requireEmailUserName":true,"enableRServerless":true,"frameRateReportIntervalMs":10000,"dbcFeedbackURL":"http://feedback.databricks.com/forums/263785-product-feedback","enableMountAclService":true,"showVersion":false,"serverlessClustersByDefault":false,"collectDetailedFrameRateStats":true,"enableWorkspaceAcls":true,"maxClusterTagKeyLength":512,"gitHash":"","clusterTagReservedPrefixes":["azure","microsoft","windows"],"tableAclsEnabledMap":{"spark.databricks.acl.dfAclsEnabled":"true","spark.databricks.repl.allowedLanguages":"python,sql"},"showWorkspaceFeaturedLinks":true,"signupUrl":"","databricksDocsNotebookPathPrefix":"^https://docs\\.azuredatabricks\\.net/_static/notebooks/.+$","serverlessAttachEbsVolumesByDefault":false,"enableTokensConfig":true,"allowFeedbackForumAccess":true,"frameDurationReportThresholdMs":1000,"enablePythonVersionUI":true,"enableImportFromUrl":true,"allowDisplayHtmlByUrl":true,"enableTokens":true,"enableMiniClusters":false,"enableNewJobList":true,"maxPinnedClustersPerOrg":20,"enableDebugUI":false,"enableStreamingMetricsDashboard":true,"allowNonAd
minUsers":true,"enableSingleSignOnByTier":true,"enableJobsRetryOnTimeout":true,"loginLogo":"/login/DB_Azure_Lockup_2x.png","useStandardTierUpgradeTooltips":true,"staticNotebookResourceUrl":"https://databricks-prod-cloudfront.cloud.databricks.com/static/c0a57b890925d4a38b701f56755414e0d7e15ba065243871740ecb804faf39d5/","enableSpotClusterType":true,"enableSparkPackages":true,"checkAadUserInWorkspaceTenant":false,"dynamicSparkVersions":false,"useIframeForHtmlResult":false,"enableClusterTagsUIByTier":true,"enableUserPromptForPendingRpc":true,"enableNotebookHistoryUI":true,"addWhitespaceAfterLastNotebookCell":true,"enableClusterLoggingUI":true,"setDeletedAtForDeletedColumnsOnWebappStart":false,"enableDatabaseDropdownInTableUI":true,"showDebugCounters":false,"enableInstanceProfilesUI":true,"enableFolderHtmlExport":true,"homepageFeaturedLinks":[{"linkURI":"https://docs.azuredatabricks.net/_static/notebooks/azure/gentle-introduction-to-apache-spark-azure.html","displayName":"Introduction to Apache Spark on Databricks","icon":"img/home/Python_icon.svg"},{"linkURI":"https://docs.azuredatabricks.net/_static/notebooks/azure/databricks-for-data-scientists-azure.html","displayName":"Databricks for Data Scientists","icon":"img/home/Scala_icon.svg"},{"linkURI":"https://docs.azuredatabricks.net/_static/notebooks/structured-streaming-python.html","displayName":"Introduction to Structured Streaming","icon":"img/home/Python_icon.svg"}],"enableClusterStart":true,"maxImportFileVersion":5,"enableEBSVolumesUIByTier":true,"enableTableAclService":true,"removeSubCommandCodeWhenExport":true,"upgradeURL":"","maxAutoterminationMinutes":10000,"showResultsFromExternalSearchEngine":true,"autoterminateClustersByDefault":false,"notebookLoadingBackground":"#fff","sshContainerForwardedPort":2200,"enableStaticHtmlImport":true,"enableInstanceProfilesByTier":true,"showForgotPasswordLink":true,"defaultMemoryPerContainerMB":28000,"enablePresenceUI":true,"minAutoterminationMinutes":10,"accounts":true,"useOnDemandClustersByDefault":false,"enableAutoCreateUserUI":true,"defaultCoresPerContainer":4,"showTerminationReason":true,"enableNewClustersGet":true,"showPricePerDBU":true,"showSqlProxyUI":true,"enableNotebookErrorHighlighting":true}; var __DATABRICKS_NOTEBOOK_MODEL = 
Blog - Nested JSON Arrays in Databricks (Python notebook)

Import only the pyspark functions that we need:

from pyspark.sql.functions import explode, col

Create a sample JSON:

source_json = """
{
    "persons": [
        {
            "name": "John",
            "age": 30,
            "cars": [
                { "name": "Ford", "models": ["Fiesta", "Focus", "Mustang"] },
                { "name": "BMW", "models": ["320", "X3", "X5"] }
            ]
        },
        {
            "name": "Peter",
            "age": 46,
            "cars": [
                { "name": "Huyndai", "models": ["i10", "i30"] },
                { "name": "Mercedes", "models": ["E320", "E63 AMG"] }
            ]
        }
    ]
}
"""

Write the JSON sample into the file system (this writes 1074 bytes to DBFS):

dbutils.fs.put("/tmp/source.json", source_json, True)

Load the JSON file into a dataframe:

source_df = spark.read.option("multiline", "true").json("/tmp/source.json")

Explode the nested arrays into rows, one level at a time:

# Explode all persons into different rows
persons = source_df.select(explode("persons").alias("persons"))

# Explode all car brands into different rows
persons_cars = persons.select(
    col("persons.name").alias("persons_name")
  , col("persons.age").alias("persons_age")
  , explode("persons.cars").alias("persons_cars_brands")
  , col("persons_cars_brands.name").alias("persons_cars_brand")
)

# Explode all car models into different rows
persons_cars_models = persons_cars.select(
    col("persons_name")
  , col("persons_age")
  , col("persons_cars_brand")
  , explode("persons_cars_brands.models").alias("persons_cars_model")
)
3A%22persons_age%22%2C%22nullable%22%3Atrue%2C%22type%22%3A%22long%22%7D%2C%7B%22metadata%22%3A%7B%7D%2C%22name%22%3A%22persons_cars_brands%22%2C%22nullable%22%3Atrue%2C%22type%22%3A%7B%22fields%22%3A%5B%7B%22metadata%22%3A%7B%7D%2C%22name%22%3A%22models%22%2C%22nullable%22%3Atrue%2C%22type%22%3A%7B%22containsNull%22%3Atrue%2C%22elementType%22%3A%22string%22%2C%22type%22%3A%22array%22%7D%7D%2C%7B%22metadata%22%3A%7B%7D%2C%22name%22%3A%22name%22%2C%22nullable%22%3Atrue%2C%22type%22%3A%22string%22%7D%5D%2C%22type%22%3A%22struct%22%7D%7D%2C%7B%22metadata%22%3A%7B%7D%2C%22name%22%3A%22persons_cars_brand%22%2C%22nullable%22%3Atrue%2C%22type%22%3A%22string%22%7D%5D%2C%22type%22%3A%22struct%22%7D%2C%22tableIdentifier%22%3Anull%7D%2C%7B%22name%22%3A%22persons_cars_models%22%2C%22typeStr%22%3A%22pyspark.sql.dataframe.DataFrame%22%2C%22schema%22%3A%7B%22fields%22%3A%5B%7B%22metadata%22%3A%7B%7D%2C%22name%22%3A%22persons_name%22%2C%22nullable%22%3Atrue%2C%22type%22%3A%22string%22%7D%2C%7B%22metadata%22%3A%7B%7D%2C%22name%22%3A%22persons_age%22%2C%22nullable%22%3Atrue%2C%22type%22%3A%22long%22%7D%2C%7B%22metadata%22%3A%7B%7D%2C%22name%22%3A%22persons_cars_brand%22%2C%22nullable%22%3Atrue%2C%22type%22%3A%22string%22%7D%2C%7B%22metadata%22%3A%7B%7D%2C%22name%22%3A%22persons_cars_model%22%2C%22nullable%22%3Atrue%2C%22type%22%3A%22string%22%7D%5D%2C%22type%22%3A%22struct%22%7D%2C%22tableIdentifier%22%3Anull%7D%5D%7D%2C%22errorSummary%22%3Anull%2C%22error%22%3Anull%2C%22workflows%22%3A%5B%5D%2C%22startTime%22%3A1528377838885%2C%22submitTime%22%3A1528377834256%2C%22finishTime%22%3A1528377839087%2C%22collapsed%22%3Afalse%2C%22bindings%22%3A%7B%7D%2C%22inputWidgets%22%3A%7B%7D%2C%22displayType%22%3A%22table%22%2C%22width%22%3A%22auto%22%2C%22height%22%3A%22auto%22%2C%22xColumns%22%3Anull%2C%22yColumns%22%3Anull%2C%22pivotColumns%22%3Anull%2C%22pivotAggregation%22%3Anull%2C%22customPlotOptions%22%3A%7B%7D%2C%22commentThread%22%3A%5B%5D%2C%22commentsVisible%22%3Afalse%2C%22parentHierarchy%22%3A%5B%5D%2C%22diffInserts%22%3A%5B%5D%2C%22diffDeletes%22%3A%5B%5D%2C%22globalVars%22%3A%7B%7D%2C%22latestUser%22%3A%22a%20user%22%2C%22latestUserId%22%3Anull%2C%22commandTitle%22%3A%22Explode%20all%20nested%20lists%20into%20rows%22%2C%22showCommandTitle%22%3Atrue%2C%22hideCommandCode%22%3Afalse%2C%22hideCommandResult%22%3Afalse%2C%22iPythonMetadata%22%3Anull%2C%22streamStates%22%3A%7B%7D%2C%22nuid%22%3A%2211d78aa5-7c55-4119-9ba3-0e9901c26f3a%22%7D%2C%7B%22version%22%3A%22CommandV1%22%2C%22origId%22%3A1379691019282560%2C%22guid%22%3A%220101cb78-3132-4206-a358-b731a3537248%22%2C%22subtype%22%3A%22command%22%2C%22commandType%22%3A%22auto%22%2C%22position%22%3A3.0%2C%22command%22%3A%22display(persons_cars_models)%22%2C%22commandVersion%22%3A0%2C%22state%22%3A%22finished%22%2C%22results%22%3A%7B%22type%22%3A%22table%22%2C%22data%22%3A%5B%5B%22John%22%2C30%2C%22Ford%22%2C%22Fiesta%22%5D%2C%5B%22John%22%2C30%2C%22Ford%22%2C%22Focus%22%5D%2C%5B%22John%22%2C30%2C%22Ford%22%2C%22Mustang%22%5D%2C%5B%22John%22%2C30%2C%22BMW%22%2C%22320%22%5D%2C%5B%22John%22%2C30%2C%22BMW%22%2C%22X3%22%5D%2C%5B%22John%22%2C30%2C%22BMW%22%2C%22X5%22%5D%2C%5B%22Peter%22%2C46%2C%22Huyndai%22%2C%22i10%22%5D%2C%5B%22Peter%22%2C46%2C%22Huyndai%22%2C%22i30%22%5D%2C%5B%22Peter%22%2C46%2C%22Mercedes%22%2C%22E320%22%5D%2C%5B%22Peter%22%2C46%2C%22Mercedes%22%2C%22E63%20AMG%22%5D%5D%2C%22arguments%22%3A%7B%7D%2C%22addedWidgets%22%3A%7B%7D%2C%22removedWidgets%22%3A%5B%5D%2C%22schema%22%3A%5B%7B%22name%22%3A%22persons_name%22%2C%22type%22%3A%22%5C%22string%5C%22%22%2
C%22metadata%22%3A%22%7B%7D%22%7D%2C%7B%22name%22%3A%22persons_age%22%2C%22type%22%3A%22%5C%22long%5C%22%22%2C%22metadata%22%3A%22%7B%7D%22%7D%2C%7B%22name%22%3A%22persons_cars_brand%22%2C%22type%22%3A%22%5C%22string%5C%22%22%2C%22metadata%22%3A%22%7B%7D%22%7D%2C%7B%22name%22%3A%22persons_cars_model%22%2C%22type%22%3A%22%5C%22string%5C%22%22%2C%22metadata%22%3A%22%7B%7D%22%7D%5D%2C%22overflow%22%3Afalse%2C%22aggData%22%3A%5B%5D%2C%22aggSchema%22%3A%5B%5D%2C%22aggOverflow%22%3Afalse%2C%22aggSeriesLimitReached%22%3Afalse%2C%22aggError%22%3A%22%22%2C%22aggType%22%3A%22%22%2C%22plotOptions%22%3Anull%2C%22isJsonSchema%22%3Atrue%2C%22dbfsResultPath%22%3Anull%2C%22datasetInfos%22%3A%5B%5D%2C%22columnCustomDisplayInfos%22%3A%7B%7D%7D%2C%22errorSummary%22%3Anull%2C%22error%22%3Anull%2C%22workflows%22%3A%5B%5D%2C%22startTime%22%3A1528377839096%2C%22submitTime%22%3A1528377834273%2C%22finishTime%22%3A1528377841654%2C%22collapsed%22%3Afalse%2C%22bindings%22%3A%7B%7D%2C%22inputWidgets%22%3A%7B%7D%2C%22displayType%22%3A%22table%22%2C%22width%22%3A%221773%22%2C%22height%22%3A%22344%22%2C%22xColumns%22%3Anull%2C%22yColumns%22%3Anull%2C%22pivotColumns%22%3Anull%2C%22pivotAggregation%22%3Anull%2C%22customPlotOptions%22%3A%7B%7D%2C%22commentThread%22%3A%5B%5D%2C%22commentsVisible%22%3Afalse%2C%22parentHierarchy%22%3A%5B%5D%2C%22diffInserts%22%3A%5B%5D%2C%22diffDeletes%22%3A%5B%5D%2C%22globalVars%22%3A%7B%7D%2C%22latestUser%22%3A%22a%20user%22%2C%22latestUserId%22%3Anull%2C%22commandTitle%22%3A%22Display%20the%20flattened%20data%22%2C%22showCommandTitle%22%3Atrue%2C%22hideCommandCode%22%3Afalse%2C%22hideCommandResult%22%3Afalse%2C%22iPythonMetadata%22%3Anull%2C%22streamStates%22%3A%7B%7D%2C%22nuid%22%3A%226bd284df-6c41-454d-b241-55a356b0bfdf%22%7D%5D%2C%22dashboards%22%3A%5B%5D%2C%22guid%22%3A%22c54c9e3f-f940-4b32-8bd6-41fbcd1dd131%22%2C%22globalVars%22%3A%7B%7D%2C%22iPythonMetadata%22%3Anull%2C%22inputWidgets%22%3A%7B%7D%7D'; if (window.mainJsLoadError) { var u = 'https://databricks-prod-cloudfront.cloud.databricks.com/static/c0a57b890925d4a38b701f56755414e0d7e15ba065243871740ecb804faf39d5/js/notebook-main.js'; var b = document.getElementsByTagName('body')[0]; var c = document.createElement('div'); c.innerHTML = ('Network Error' + 'Please check your network connection and try again.' + 'Could not load a required resource: ' + u + ''); c.style.margin = '30px'; c.style.padding = '20px 50px'; c.style.backgroundColor = '#f5f5f5'; c.style.borderRadius = '5px'; b.appendChild(c); } "> We've seen here how we can use Databricks to flatten JSON with just a few lines of code. Keep your eyes open for future Databricks related blogs, which will demonstrate more of the versatility of this great platform. More on some of the used functions (PySpark 2.3.0 documentation): DataFrameReader.json explode (function(root, factory) { // `root` does not resolve to the global window object in a Browserified // bundle, so a direct reference to that object is used instead. var _srcDoc = window.srcDoc; if (typeof define === "function" && define.amd) { define(['exports'], function(exports) { factory(exports, _srcDoc); root.srcDoc = exports; }); } else if (typeof exports === "object") { factory(exports, _srcDoc); } else { root.srcDoc = {}; factory(root.srcDoc, _srcDoc); } })(this, function(exports, _srcDoc) { var idx, iframes; var isCompliant = !!("srcdoc" in document.createElement("iframe")); var sandboxMsg = "Polyfill may not function in the presence of the " + "`sandbox` attribute. 
Consider using the `force` option."; var sandboxAllow = /\ballow-same-origin\b/; /** * Determine if the operation may be blocked by the `sandbox` attribute in * some environments, and optionally issue a warning or remove the * attribute. */ var validate = function( iframe, options ) { var sandbox = iframe.getAttribute("sandbox"); if (typeof sandbox === "string" && !sandboxAllow.test(sandbox)) { if (options && options.force) { iframe.removeAttribute("sandbox"); } else if (!options || options.force !== false) { logError(sandboxMsg); iframe.setAttribute("data-srcdoc-polyfill", sandboxMsg); } } }; var implementations = { compliant: function( iframe, content, options ) { if (content) { validate(iframe, options); iframe.setAttribute("srcdoc", content); } }, legacy: function( iframe, content, options ) { var jsUrl; if (!iframe || !iframe.getAttribute) { return; } if (!content) { content = iframe.getAttribute("srcdoc"); } else { iframe.setAttribute("srcdoc", content); } if (content) { validate(iframe, options); // The value returned by a script-targeted URL will be used as // the iFrame's content. Create such a URL which returns the // iFrame element's `srcdoc` attribute. jsUrl = "javascript: window.frameElement.getAttribute('srcdoc');"; // Explicitly set the iFrame's window.location for // compatability with IE9, which does not react to changes in // the `src` attribute when it is a `javascript:` URL, for // some reason if (iframe.contentWindow) { iframe.contentWindow.location = jsUrl; } iframe.setAttribute("src", jsUrl); } } }; var srcDoc = exports; var logError; if (window.console && window.console.error) { logError = function(msg) { window.console.error("[srcdoc-polyfill] " + msg); }; } else { logError = function() {}; } // Assume the best srcDoc.set = implementations.compliant; srcDoc.noConflict = function() { window.srcDoc = _srcDoc; return srcDoc; }; // If the browser supports srcdoc, no shimming is necessary if (isCompliant) { return; } srcDoc.set = implementations.legacy; // Automatically shim any iframes already present in the document iframes = document.getElementsByTagName("iframe"); idx = iframes.length; while (idx--) { srcDoc.set( iframes[idx] ); } });

How to support your organisation with Azure Cosmos DB Graph (Gremlin)?

Let me start this blog with two questions. 1) How can Joanna, from the HR department, query the mentor/mentee structure from the organisation? 2) How can Robin, from the resourcing department, find the most suitable and available consultant to start a new project next week? I’m sure at this point you are thinking that to solve both problems they could simply query the HR and the Resource Planning systems, and you are right, but, what if they could get the answer for both questions from a single system? In this blog I’ll demonstrate how to achieve such requirement using Azure Cosmos DB Graph (Gremlin).Graph TheoryIn graph theory, a graph is an ordered pair comprising a set of vertices and edges. A vertex is the fundamental unit of which graphs are formed and are usually represented by a circle with a label. An edge is represented by a line or arrow extending from one vertex to another.Graphs can be used to model many types of relations and processes in many different areas. For example, we can represent the link structure of a website as a graph, the web pages as vertices and the links from one page to another as edges.Azure Cosmos DBAzure Cosmos DB is a globally distributed multi-model database with the capacity to store various types of data, such as document, relational, graph and key values. It provides all the necessary tools to elastically scale throughput and storage across any number of Azure's geographic regions (further details here). Azure Cosmos DB supports a variety of popular API’s to access and query the data, such as SQL, MongoDB, Cassandra, Graph (Gremlin) and Table API. In this instance, I will focus on the Graph (Gremlin) API.GremlinGremlin is the graph traversal language of Apache TinkerPop, an open source Graph Computing Framework. Gremlin allows the users to write complex queries to traverse their graphs by using a composed sequence of steps, with each step performing an operation on the data stream (further details here). There are 4 fundamental steps:· transform: transform the objects in the stream· filter: remove objects from the stream· sideEffect: pass the object, but yield some side effect· branch: decide which step to takeScenarioThe image in the left is an example of the mentor/mentee structure. If we convert it to a graph (image in the right), we have the people represented as vertices and the relationship mentor as edges.Now, let’s create the database and the graph. When creating the Azure Cosmos DB, we need to ensure we select the Gremlin (graph) API. To populate and query our graph, we have three options, the Azure Portal, the Gremlin Console or the Guided Gremlin Tour. The last two tools can be downloaded from the Quick Start section once we create the sample container or directly downloaded here and here.In the Azure Portal, CRUD operations can be performed via the Data Explorer UI and Gremlin queries. The results can be visualised in a Graph or in GraphJSON format, the Gremlin standard format to represent vertices, edges and properties using JSON.In the proposed scenario, we will create 7 vertices and 6 edges. The vertices will be labelled as person and will have 3 properties, First Name, Position and In Bench (indicates if a consultant is allocated to a project or not). The edges, labelled as mentor, will define the relation between consultants. 
Create a vertex

g.addV('person').property('firstName', 'Tim').property('position', 'Director').property('inBench', '1')

Create an edge

g.V().hasLabel('person').has('firstName', 'Tim').addE('mentor').to(g.V().hasLabel('person').has('firstName', 'Neil'))

Below is how the graph looks once all vertices and edges are created. Joanna now has everything she needs to obtain valuable information. For example, she can traverse the graph and obtain the list of Principal and Senior Consultants that are mentored by Tim.

g.V().hasLabel('person').has('firstName', 'Tim').outE('mentor').inV().hasLabel('person').has('position', within('Principal Consultant','Senior Consultant')).group().by('firstName').by(values('position'))

[{"Neil": "Principal Consultant","Scott": "Senior Consultant"}]

To support Robin, we will add two new types of vertex, Blog and Project. The Blog vertex will be related to the Person vertex and will indicate who wrote blogs and which technology was covered. The Project vertex will be related to the Person vertex and will indicate who worked on a certain project. The Project vertex will have the client, the name of the project, the duration and the technology used as properties. If Robin needs to find a consultant with experience in Azure Cosmos DB, he can query the graph and check who either wrote a blog or worked on a previous project with that technology. On top of that, he can filter the results by indicating he is only interested in consultants that are currently on the bench.

g.V().hasLabel('person').where(outE('worked').or().outE('wrote').has('tech', 'Azure CosmosDB')).has('inBench', '1')

Conclusions

Hopefully, using a very simple scenario, I managed to demonstrate the potential of Azure Cosmos DB Graph to build complex queries, implement powerful graph traversal logic and help the business quickly obtain insights from very complex models. As always, if you have any questions or comments do let me know.

Azure Active Directory Authentication and Azure Data Catalog

In a previous post I introduced Azure Data Catalog. Because it’s great for data discovery and for data asset management, it makes sense to automate, as much as possible, the process of registering new data assets, and allowing users to discover data in a more natural, perhaps conversational, way. In order to automate the registration of data assets or to allow discovery through other tools, it’s necessary to look at how Azure Data Catalog authenticates users using Azure Active Directory (AAD). This post is going to explore some of the options the Azure Data Catalog uses for authentication and a walkthrough of a code example to make authentication work without user input.Azure Active Directory AuthenticationIf you have interacted with Azure Data Catalog before, you will find that there are two ways of doing so. First, there’s the web application that allows you to conduct data discovery and data asset management. Then there’s the native application that sits on your local machine that can be used for registering data assets. These use different methods of authenticating using Azure Active Directory. The first one uses Web Browser to Web Application authentication. The second uses Native Application to Web API authentication. Web Browser to Web Application What is involved with Web Browser to Web Application authentication? Simply put, the web application directs the user’s browser to get them to sign-in AAD. AAD then returns a token which authenticates the user to use the web application. In practice, it’s a bit more complex, so here’s a diagram to help explain it. In a bit more detail, the process it follows is:1) A user visits the application and needs to sign in, they are redirected via a sign-in request to the authentication endpoint in AAD. 2) The user signs in on the sign-in page. 3) If authentication is successful, AAD creates an authentication token and returns a sign-in response to the application’s Reply URL that was configured in the Azure Portal. The returned token includes claims about the user and AAD that are required by the application to validate the token. 4) The application validates the token by using a public signing key and issuer information available at the federation metadata document for Azure AD. After the application validates the token, Azure AD starts a new session with the user. This session allows the user to access the application until it expires.This method of authentication is used by Azure Data Catalog when discovering data through the browser. Native Application to Web APIWhat’s the process of Native Application to Web API authentication? Simply put, the application will ask you to sign-in to AAD, so that it can acquire a token in order to access resources from the Web API. In practice, it’s a bit more complex, so here’s a diagram to help explain it. In a bit more detail, the process it follows is:1) The native application makes a request to the authorisation endpoint in AAD, but using a browser pop-up. This request includes the Application ID and redirect URI of the native application (see the following article for native applications and registering them in Azure) and the Application ID URI of the Web API. The user is then requested to sign-in.2) AAD authenticates the user. AAD then issues an authorisation code response back to the application’s redirect URI. 3) The Application then stops the browser activity and extracts the authorisation code from the response. Using the authorisation code, the Application then requests an access token from AAD. 
It also uses details about the native application and the desired resource (Web API). 4) The authorisation code and details are checked by AAD, which then returns an access token and a refresh token. 5) The Application then adds the access token to the authorisation header in its request to the Web API, which returns the requested resource based on successful authentication. 6) When the access token expires, the refresh token is used to acquire a new access token without requiring the user to sign in again. This method of authentication is used by Azure Data Catalog when registering data assets via the desktop application.

Automated Interaction with Azure Data Catalog

Both of the examples above require the user to interact in order to provide sign-in credentials. This is not ideal if we want to automate the registration of data assets or conduct data discovery outside of the browser. Therefore we'll need to use a different method of authentication: the Server Application to Web API authentication method. Simply put, it assumes that the server has already required a user to log in and therefore has the user's credentials. It then uses those credentials to request the access and refresh tokens from AAD. In a bit more detail, the process it follows is:

1) The Server Application makes a request to AAD's Token Endpoint, bypassing the Authentication Endpoint, providing the credential, Application ID and Application URI.
2) AAD authenticates the application and returns an access token that can be used to call the Web API.
3) The Application uses the access token in the authorisation header of its request to the Web API, which returns the requested resource based on successful authentication.

This method is what we're going to use to automate our interaction with Azure Data Catalog. From an authentication aspect, the code for Server Application to Web API is simple, and this example will take us to the point of returning the token, which we can then use to request resources from the Azure Data Catalog API. The full code can be found in my GitHub repo.

We are going to use the Client Id and Secret from an application we've registered in AAD (the full process can be found in this Microsoft article on Integrating Applications with AAD).

private static string clientId = "ApplicationId";
private static string secret = "ApplicationKey";

Then, we're going to make sure we're connecting to the correct AAD instance

private static string authorityUri = string.Format("https://login.windows.net/{0}", tenantId);

So we can create an authorisation context

AuthenticationContext authContext = new AuthenticationContext(authorityUri);

In order to acquire a token

authResult = await authContext.AcquireTokenAsync(resourceUri, new ClientCredential(clientId, secret));

Which can then be used in an authorisation header in requests to the Azure Data Catalog API. In the next related post, we'll explore how to make a call to the API using this authentication method.
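In the meantime, the snippets above can be pulled together into a small, self-contained sketch. The resource URI for the Data Catalog API and the variable values below are illustrative placeholders rather than values from the real solution; the working version is in the GitHub repo mentioned above.

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

public static class DataCatalogAuth
{
    private static string tenantId = "TenantId";
    private static string clientId = "ApplicationId";
    private static string secret = "ApplicationKey";
    // Assumed resource URI for the Azure Data Catalog API - treat as a placeholder.
    private static string resourceUri = "https://api.azuredatacatalog.com";
    private static string authorityUri = string.Format("https://login.windows.net/{0}", tenantId);

    public static async Task<HttpClient> GetAuthenticatedClientAsync()
    {
        // Acquire a token for the application itself - no user interaction required.
        var authContext = new AuthenticationContext(authorityUri);
        AuthenticationResult authResult =
            await authContext.AcquireTokenAsync(resourceUri, new ClientCredential(clientId, secret));

        // Attach the token to the Authorization header for subsequent requests to the API.
        var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", authResult.AccessToken);
        return client;
    }
}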

Can the Custom Vision Service support the calculation of KPIs?

Let’s assume we have a company that distributes alcoholic drinks across the country. To determine their performance, they define a set of KPIs that will evaluate, between others, how many establishments (eg. pubs, bars, …) have their products exposed in the shelf. To achieve this goal, they have a set of sales reps that visit each establishment and take note of which products are exposed. One possible way to track the data is by accessing a mobile application, manually fill the form and upload the data, but, what if we could automate the process of identifying the products in a shelf by simply uploading a picture? To do that, we would need to apply a machine learning algorithm to classify the image and identify each product. To prove if the above scenario is achievable, I’ll demonstrate how to create a project using a tool called Custom Vision, a service that allow us to easily build a predictive model with just a few clicks and without the need of deep machine learning knowledge. What is the Custom Vision Service?Azure Custom Vision Service is a Microsoft Cognitive Services tool for tagging images using a custom computer vision model. Although very similar to Microsoft’s Computer Vision API, it has the advantage of fine-tuning a predictive model to a specific dataset, however, there are still a couple of minor disadvantages. For example, the service can only identify if an object is in an image and not where it stands within the image.Build the modelTo build a predictive model with this service we can either use the web interface or the REST API with support for C# and Python. The first step was to create a project by providing a name, a domain and a resource group. When selecting a domain, we can either choose a general domain optimized for a range of images or select a specific domain optimized for a certain scenario. In my case, I selected the Food domain, given I wanted to identify different kind of bottles.Below a description of each domain detailed by Microsoft. The next step was to upload and tag images. Here are a couple of considerations:- To start the prototype, we need at least two different tags and a couple of images, usually, a minimum of 30 per class- It is best practice to use a variety of good quality images (different angles, lights, background, size, …) to ensure a better differentiation and accurate results from the classifier. In my case, Google was the source of the images, which was a bit limited in some cases (surprisingly!!!)- It is best practice to include images that represent what the classifier will find in the real world, rather than images with neutral backgrounds- It is advised to avoid images with multiple entities. If we upload images with a bottle and a glass, because the classifier will learn the characteristics that the photos have in common, when comparing two images, the classifier might be comparing a bottle + cup with a single bottleI started by uploading 60 images for 2 classes, Baileys and Others. Once the images were uploaded, I trained my model and obtained the following results.The Precision and Recall indicators demonstrates how good the classifier is performing. Above, we can see that the analysis is done for the entire model and for each tag. It is important to refer that 100% precision is usually not achievable, however, having a model with 75%+ in Precision and Recall is an indication of an effective model.Precision – Indicates how likely the classifier is correctly classifying an image. 
Knowing that we had 60 images, having a precision of 84.3% means that roughly 51 of the images were correctly tagged.

Recall – Out of all the images that should have been classified correctly, how many did the classifier identify accurately. Having a recall of 34.5% means that only around 20 images were correctly classified.

Probability Threshold – The slider, set by default at 90%, indicates the value used to calculate Precision and Recall. Let's consider the following example. The probability that image A has a Baileys bottle is 94%. If the probability threshold is 90%, then Image A will be taken into consideration as a "correct prediction".

In the example below, we can see that I obtained better results when I changed the probability threshold to 75%. Based on this information I had two options: either correctly tag the wrong images or replace them with better ones. Fortunately, Custom Vision Service has a very handy piece of functionality that, for each iteration, highlights which images confused the model according to the probability threshold value. In the image below, there are two images with a red square. If we hover the mouse over an image we can see the prediction for each tag. In this case, the prediction is below 90% for both tags, meaning the image was not considered in the Precision and Recall calculation.

After a couple of iterations we finally obtained a model we could use to run a quick test. When we have multiple iterations, we select the one with the best results by selecting the option Make Default. To test the model, I selected two images from the internet showing individual brands. As we can see, the model correctly classified each image.

Since I was happy with the results, I decided to increase the complexity of the model by creating a new tag and uploading a new set of images. After training the model, I noticed the results were not as good as before, since the new images were causing some confusion for the model. It took a couple of iterations until I got an acceptable model. Below is the result of a quick test.

I now had a model that could correctly classify 3 different entities, so I decided to increase the challenge and added an image with multiple entities. The result I obtained helped me understand the flaw in my model. The model identified that the image should be classified as Others; however, although we have a Baileys and a Hendricks bottle in the image, the probability for those two classes was too low. Here is why:

- When uploading images to the project I only used 1 tag per image. Based on that, the model will always try to classify an image with a single tag. As soon as I added more tags per image, my predictions improved
- All the examples used were showing a single entity, i.e. only one type of bottle per image, except for the Others category. Example: I uploaded 30 images of Baileys bottles, 30 images of Hendricks bottles, and, for the category Others, 3 different types of bottles
- My model was trained to identify bottles where the image didn't have multiple entities. As mentioned above, the model should always be trained with images that represent what the classifier will predict
- The number of images per class didn't have enough variety. As with any machine learning model, if we increase the number of examples, the model will perform better

Final Considerations

With just a couple of clicks and no deep machine learning knowledge I was able to create a predictive model that could accurately classify a set of images.
With further iterations I could potentially build a model that meets the requirements of the scenario proposed at the beginning of this blog, but would Custom Vision Service be the right tool? In my opinion the Computer Vision API would be more appropriate for this particular scenario, but Custom Vision is definitely a service with massive potential (a short example of calling the trained model from code follows below). As always, if you have any queries or considerations do let me know.
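As a rough illustration of how the sales-rep scenario could be automated, here is a minimal C# sketch that sends a photo to the trained model over the Prediction API. The prediction URL and key are placeholders to be copied from the Custom Vision portal for your own project, and the JSON response (tags with probabilities) would still need to be parsed.

using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class ShelfClassifier
{
    public static async Task<string> ClassifyAsync(string imagePath)
    {
        // Placeholders - copy both values from the Custom Vision portal for your project.
        var predictionUrl = "<prediction URL for image files, from the Custom Vision portal>";
        var predictionKey = "<prediction key>";

        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Add("Prediction-Key", predictionKey);

            // Send the raw image bytes, e.g. the photo taken by the sales rep.
            var content = new ByteArrayContent(File.ReadAllBytes(imagePath));
            content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

            var response = await client.PostAsync(predictionUrl, content);

            // The response is JSON containing each tag and its probability.
            return await response.Content.ReadAsStringAsync();
        }
    }
}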

Azure Event Grid in a Modern Data Warehouse Architecture

In this blog I’ll give a light introduction to Azure Event Grid and demonstrate how it is possible to integrate the service into a modern data warehouse architecture.

Azure Event Grid is a fully managed event routing service that went into general availability on the 30th January 2018. With this service, we can subscribe to any event happening across our Azure resources and take advantage of serverless platforms like Azure Functions and Logic Apps to easily create serverless workflows. It has built-in publish support for events from services like Blob Storage and Resource Groups, and supports custom webhooks that can publish events to Azure and third-party services.

Following is a list of key terms from this service.

Events – In this context, an Event is a message that contains data describing what happened in the service. E.g. a new file was uploaded to a container in a Blob Storage. The event will contain information about the file, such as the name of the file.

Event Publishers – The source of the events published to Event Grid. Following is a list of current and planned event publishers.

Available
- Azure Subscriptions (management operations)
- Custom Topics
- Event Hubs
- IoT Hub
- Resource Groups (management operations)
- Storage Blob
- Storage General-purpose v2 (GPv2)

Planned
- Azure Automation
- Azure Active Directory
- API Management
- Logic Apps
- IoT Hub
- Service Bus
- Azure Data Lake Store
- Cosmos DB

Topics – The endpoint where publishers send events.

Event Subscriptions – Receive specific events from the Topic and send them to the Event Handlers.

Event Handlers – The receivers of the events subscribed to by the event subscriptions. Following is a list of current and planned event handlers.

Available
- Azure Automation
- Azure Functions
- Event Hubs
- Logic Apps
- Microsoft Flow
- WebHooks

Planned
- Fabric Controller
- Service Bus
- Event Hubs
- Azure Data Factory
- Storage Queues

Azure Event Grid can be used like any other message queue service; however, the service stands out when integrated into an event-based architecture. Let's consider that we have an application that uploads csv files to a Blob Storage several times a day. As soon as a set of files is available, we want to move them to an Azure Data Lake Store (ADLS) to apply data transformations using Azure Data Lake Analytics (ADLA). Azure Data Factory (ADF) will be used to orchestrate the data movement. Finally, we need to extract complete data sets from the ADLS using the PolyBase features in Azure SQL DW and present them as tables to Azure Analysis Services (AAS). A Tabular model hosted in AAS will populate a set of Power BI reports. In the diagram below we can see that we can subscribe to the Blob Storage events using an Event Grid subscription and trigger a loading process with a Logic App as soon as a file is uploaded to the Blob Storage. Below I'll detail how we can implement the sections surrounded by the red square.

Blob Storage

Blob storage events are only available in the Blob Storage and StorageV2 (general purpose v2) accounts. In this example, I created a Blob Storage account. Once the storage account was deployed I created the container to which the application was uploading the files.

Logic App

In Logic Apps we subscribed to the events from the Blob Storage and implemented logic to validate whether we had all the required files to start a new data load. If true, we called an Azure Function that triggered an ADF pipeline. An Azure Function was required because, as of the time of writing, there wasn't a Logic App connector to ADF.
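As an illustration, here is a trimmed-down sketch of what that Azure Function could look like. It assumes the Microsoft.Azure.Management.DataFactory (ADF v2) .NET SDK and a service principal with permission to run the pipeline; all of the names and IDs are placeholders rather than values from the actual solution.

using System;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest;

public static class AdfPipelineTrigger
{
    public static string RunPipeline()
    {
        // Placeholder values - in a real function these would come from app settings.
        string tenantId = "<tenant id>", appId = "<service principal id>", appKey = "<service principal key>";
        string subscriptionId = "<subscription id>", resourceGroup = "<resource group>";
        string dataFactoryName = "<data factory name>", pipelineName = "<pipeline name>";

        // Authenticate against Azure AD and build the ADF management client.
        var context = new AuthenticationContext("https://login.windows.net/" + tenantId);
        var token = context.AcquireTokenAsync("https://management.azure.com/", new ClientCredential(appId, appKey)).Result;
        var client = new DataFactoryManagementClient(new TokenCredentials(token.AccessToken)) { SubscriptionId = subscriptionId };

        // Kick off a new pipeline run and return its run id.
        var runResponse = client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup, dataFactoryName, pipelineName).Result.Body;
        return runResponse.RunId;
    }
}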
The ADF pipeline then executed a couple of U-SQL stored procedures that applied data transformations to the ingested data and created our dimension and fact files in the ADLS. The following screens demonstrate how to create the logic app, the Azure Event Grid trigger and an overview of the workflow with all the requested steps. When adding the Azure Event Grid trigger we are prompted to sign in. We should use the account associated with our subscription. Once we hit save, we can go back to the blob storage account and find that a new Event Grid subscription has been created.

Final Considerations

Azure Event Grid uses a pay-per-event pricing model, meaning we only pay for what we use. The first 100,000 operations per month are free and, beyond that, it is £0.448 per million operations. As always, if you have any queries or considerations do let me know.

Tabular Automation and NuGet

In a recent blog post, I wrote about processing an Azure Analysis Services tabular model using Azure Functions. In it, there's a lengthy process of downloading some DLLs and uploading them to the Azure Function. Handily, the Analysis Services team at Microsoft have released the Analysis Services NuGet package, which means that the necessary DLLs can be automatically installed to an Azure Function without much hassle. This blog is going to go through the steps of adding the NuGet package to your Azure Function.

Add a new file to your function called project.json and input the following code in the newly created file:

{
  "frameworks": {
    "net46": {
      "dependencies": {
        "Microsoft.AnalysisServices.retail.amd64": "15.0.2"
      }
    }
  }
}

Then save the Azure Function to proceed with the NuGet restore and compile your function. You should see the following logs in your log window. That is the entire process - much easier than documented previously! A small sketch of how the restored package is then referenced follows below.
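As a minimal illustration, this is roughly what a C# script (.csx) function can look like once the package has been restored - the point being that the Tabular object model now resolves from the NuGet package, with no "bin" folder or manually uploaded DLLs required. The connection string and database name are placeholders, and the processing body is abbreviated; the full code is in the earlier post.

using System;
using Microsoft.AnalysisServices.Tabular;

public static void Run(TimerInfo myTimer, TraceWriter log)
{
    // The Tabular object model comes from the restored NuGet package.
    var asServer = new Server();
    asServer.Connect("<AAS connection string>");        // placeholder - ideally read from an app setting

    var db = asServer.Databases["<database name>"];     // placeholder database name
    db.Model.RequestRefresh(RefreshType.Full);          // mark the model for refresh
    db.Model.SaveChanges();                             // commit, which executes the refresh

    asServer.Disconnect();
    log.Info($"Processed tabular model at: {DateTime.Now}");
}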

Introduction to Azure Data Catalog

With the rise of self-service business intelligence tools, like Power BI, and an increased engagement with data in the workplace, people's expectations of where they can find expert information about data have changed. Where previously there would be an expert that people would have to book time with in order to understand data, now people expect to get quick and detailed information about the data assets that an enterprise holds and maintains without going through a single contact. With Azure Data Catalog, data consumers can quickly discover data assets and gain knowledge about the data from documentation, tags and glossary terms from the subject matter experts. This post aims to give a brief introduction to Azure Data Catalog and what it can broadly be used for.

What is Azure Data Catalog?

Azure Data Catalog is a fully managed Azure service which is an enterprise-wide metadata catalogue that enables data discovery. With Azure Data Catalog, you register; discover; annotate; and, for some sources, connect to data assets. Azure Data Catalog is designed to manage disparate information about data; to make it easy to find data assets, understand them, and connect to them. Any user (analyst, data scientist, or developer) can discover, understand, and consume data sources. Azure Data Catalog is a central, one-stop shop for all users to contribute their knowledge and build a community and culture of data.

What can Azure Data Catalog be used for?

As mentioned in the earlier headings, Azure Data Catalog can be used for data asset management; data governance; and data discovery. For data asset management, this means knowing what data is available and where; for data governance teams, this means answering questions like: where is my customer data? or what does this data model look like?; for data discovery, this means knowing which data is suitable for particular reports and who you can go to if you have any questions. There are some common scenarios for using Azure Data Catalog that Microsoft has put together, and they're well worth reading to get a fuller understanding of what Azure Data Catalog can be used for.

Process an Azure Analysis Services Tabular Model from an Azure Function

A couple of weeks ago I wrote a blog post detailing how to process a tabular model from a stored procedure. The challenge there was to have a way of processing a tabular model on demand from a web application. The challenge still exists, but how do you achieve the same in Azure using Platform as a Service (PaaS) objects, which do not have the benefit of a full server and items, such as msdb and SQL Agent, to aid in the process? In this post, I'm going to show you how to process a tabular model using only Azure PaaS offerings. Not only am I going to show you how to process a tabular model on-demand, but also how to process a tabular model on a schedule. This post has taken inspiration and part of the code base from a Microsoft blog: Automating Azure Analysis Services processing with Azure Functions.

Azure Functions

Before we begin properly, it's worth spending some time introducing Azure Functions. According to Microsoft, Azure Functions are: "…a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP. Pay only for the time your code runs and trust Azure to scale as needed. Azure Functions lets you develop serverless applications on Microsoft Azure." They are super useful for extending the capabilities of any solution and are not just limited to what we're going to cover here.

On-Demand Refresh

The use of Azure Functions creates a trigger for the on-demand refresh of a tabular model from the web application or webhook; this is to make sure that selected elements of data in a tabular model, for example hot partitions, are always up to date. The following describes the process that Azure Functions will be involved in for this scenario. The steps that are needed to create an Azure Function for On-Demand Refresh are as follows:

1) Create an Azure Function App

Navigate to the Azure Portal and create a Function App (the name changes quite a bit, so don't be concerned if it's not exactly displayed as it is below).

2) Create a new Function

After you've created the Function App, we need to add a new Webhook + API function, which we'll use as the basis for our on-demand refresh. Click on the + button next to Functions, select Webhook + API, choose C# as your language and click Create this function.

3) Configure the Function

Download the latest client libraries for Analysis Services. This needs to be done on your local machine so you can then copy these files to your Azure Function App. After you've downloaded the client libraries, the DLLs can be found in C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies. The two files you need are:

C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies\Microsoft.AnalysisServices.Core.DLL
C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies\Microsoft.AnalysisServices.Tabular.DLL

The documentation references the 130 assemblies, which is not correct and will not work. You need the assemblies in 140. In order to add these assemblies to your function, you'll need to add a folder called "bin". To do this, select your Function App, click Platform features, and under Development Tools, click Advanced tools (Kudu). In Kudu, click Debug console and select Cmd.
Navigate to the site\wwwroot\OnDemandHttpTrigger folder and add the “bin” folder here by clicking the + button.Once you’ve added the “bin” folder, go back over to the Azure portal and select your function OnDemandHttpTrigger.On the right under View files, navigate to the bin folder. Then click the Upload button to add the two previously mentioned DLLs to the bin folder.You should see the two DLLs in your bin folder now.4) Add Azure Analysis Services Connection String to the Function AppThis step is going to add the connection string to the Azure Analysis Services (AAS) service to the entire Function App, not just individual functions. Click the name of your Function App, then select Platform features. Select Application settings under General Settings.Now we need to add our connection string under the Connection strings section. You’ll need your AAS server name and a user ID and password that has access to the AAS database.You can find your AAS server name by navigating to your AAS database in the Azure portal and copying the value constructed after clicking Show server connection strings:Your connection string should look like this:Provider=MSOLAP;Data Source=<your aas server>; Initial Catalog=<aas database name>;User ID=<your username>;Password=<your password>Back in the screen for the Function App, fill in the Name textbox with a name for your connection string and paste your connection string in the Value text box:Click Save at the top to save these settings for your Functions.5) Time for CodeOur Function App has been configured, now we need to add code to the function. The function comes with a working function, for which to test out the functionality, but we don’t need everything that is on offer.We’re going to programmatically process the tabular model. In doing so, we’ll leverage Analysis Services Management Objects (AMO). 
If you’re new to AMO, start here.Paste in the following code (all code referenced can also be download from my GitHub Repo):#r "Microsoft.AnalysisServices.Tabular.DLL"#r "Microsoft.AnalysisServices.Core.DLL"#r "System.Configuration"using System;using System.Configuration;using Microsoft.AnalysisServices.Tabular;using System.Net;public static async Task<HttpResponseMessage> Run(HttpRequestMessage req, TraceWriter log){log.Info("C# HTTP trigger function processed a request.");// parse query parameterstring status = req.GetQueryNameValuePairs().FirstOrDefault(q => string.Compare(q.Key, "status", true) == 0).Value;if (status == null){// Get request bodydynamic data = await req.Content.ReadAsAsync<object>();status = data?.status;}if (status == "execute"){log.Info($"C# trigger function started at: {DateTime.Now}");  try            {Microsoft.AnalysisServices.Tabular.Server asSrv = new Microsoft.AnalysisServices.Tabular.Server();var connStr = ConfigurationManager.ConnectionStrings["AASTabular"].ConnectionString;asSrv.Connect(connStr);Database db = asSrv.Databases["azureadventureworks"];Model m = db.Model;//db.Model.RequestRefresh(RefreshType.Full);     // Mark the model for refreshm.RequestRefresh(RefreshType.Full);     // Mark the model for refresh//m.Tables["Date"].RequestRefresh(RefreshType.Full);     // Mark only one table for refreshdb.Model.SaveChanges();     //commit  which will execute the refreshasSrv.Disconnect();            }catch (Exception e)            {log.Info($"C# trigger function exception: {e.ToString()}");            }log.Info($"C# trigger function finished at: {DateTime.Now}"); }return status == "execute"?req.CreateResponse(HttpStatusCode.OK, "Successfully Processed Tabular Model ") :req.CreateResponse(HttpStatusCode.BadRequest, "Please pass a status on the query string or in the request body");}Click the Save button at the top.6) Test, Test, TestClick the Run button at the top to test the function The function can also be tested in a web browser, and be called by a Web App using the POST HTTP method.Now we have a fully functioning method of refreshing an Azure Analysis Services tabular model on-demand.Scheduled RefreshThe use of Azure Functions creates a trigger for the scheduled refresh of a tabular model, this is to make sure that the entire tabular model has the latest data and is always up to date. The following describes the process that Azure Functions will be involved in this scenario:The steps that are needed to create an Azure Function for Scheduled Refresh are as follow:1) Create a FunctionWe’ve created our Function App, and now we need to add a new Timer Trigger function, which we’ll use as the basis for our scheduled refresh. Click on the + button next to Functions, select Timer, choose C# as your language and click Create this function.2) Configure the TimerWhat use is a timer without a schedule? To give the timer a schedule, click Integrate, set the schedule and click Save.The Schedule text box expects a CRON expression to define the days and times that the function should execute. Click the little Documentation button on the screen above to read about CRON expressions. The schedule I’ve set is to run once everyday at 09:30AM.3) Configure the FunctionSee step 3 of the On-Demand Function for detailed steps. You’ll need to create the bin folder and upload the DLLs to the bin folder.4) Time for CodeWe’ve configured our function, so now it’s time to add the code. The code base is much simpler than the On-Demand code base, mainly because it’s doing fewer tasks. 
But the AMO section is exactly the same. Paste in the following code:

#r "Microsoft.AnalysisServices.Tabular.DLL"
#r "Microsoft.AnalysisServices.Core.DLL"
#r "System.Configuration"

using System;
using System.Configuration;
using Microsoft.AnalysisServices.Tabular;

public static void Run(TimerInfo myTimer, TraceWriter log)
{
    log.Info($"C# Timer trigger function started at: {DateTime.Now}");
    try
    {
        Microsoft.AnalysisServices.Tabular.Server asSrv = new Microsoft.AnalysisServices.Tabular.Server();
        var connStr = ConfigurationManager.ConnectionStrings["AASTabular"].ConnectionString;
        asSrv.Connect(connStr);
        Database db = asSrv.Databases["azureadventureworks"];
        Model m = db.Model;
        //db.Model.RequestRefresh(RefreshType.Full);            // Mark the model for refresh
        m.RequestRefresh(RefreshType.Full);                     // Mark the model for refresh
        //m.Tables["Date"].RequestRefresh(RefreshType.Full);    // Mark only one table for refresh
        db.Model.SaveChanges();                                 // Commit, which will execute the refresh
        asSrv.Disconnect();
    }
    catch (Exception e)
    {
        log.Info($"C# Timer trigger function exception: {e.ToString()}");
    }
    log.Info($"C# Timer trigger function finished at: {DateTime.Now}");
}

Click the Save button at the top.

5) Test, Test, Test

Click the Run button at the top to test the function. Now we have a fully functioning method of refreshing an Azure Analysis Services tabular model on a schedule.

Conclusion

I have shown you how simple it is to invoke two methods of refreshing a tabular model using Azure Functions: an on-demand refresh and a refresh on a schedule. I hope that you take inspiration and use these methods in your use of both Azure Analysis Services and Azure Functions. A short example of calling the on-demand function from client code follows below.
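To round things off, here is a minimal sketch of how a web application might call the on-demand function. The function URL and key are placeholders copied from the Azure portal, and the request body matches the "status" check in the on-demand code shown earlier.

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class TabularRefreshClient
{
    public static async Task<string> TriggerRefreshAsync()
    {
        // Placeholder URL - copy the function URL (including the function key) from the Azure portal.
        var functionUrl = "https://<your-function-app>.azurewebsites.net/api/OnDemandHttpTrigger?code=<function key>";

        using (var client = new HttpClient())
        {
            // The function looks for "status" either on the query string or in the request body.
            var body = new StringContent("{\"status\":\"execute\"}", Encoding.UTF8, "application/json");
            var response = await client.PostAsync(functionUrl, body);
            return await response.Content.ReadAsStringAsync();
        }
    }
}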

Extraction and Analysis of GeoSpatial data with Azure Data Lake Analytics

I recently had to implement a solution to prove it was possible to integrate a shape file (.SHP) in Azure Data Lake Store (ADLS) for post geographic spatial analysis using Azure Data Lake Analytics (ADLA). A shape file is a data set used by a geographic analysis application that stores a collection of geographic features, such as streets or zip code boundaries, in the form of points, lines or area features.As you already figured, storing a shape file in ADLS is not a difficult goal to achieve, however, how can you possibly use ADLA to obtain the geographic data from the file? In this blog I’ll explain how we can extract the data to a supported format, such as CSV, and use it to run geographic spatial analysis in ADLA, with the support of the spatial data types introduced in the SQL Server 2008 (details here).As always, whenever we face a limitation of ADLA, C# is our best friend. In order to read the content of a shape file, we need to start by adding a geospatial assembly to our solution, which, in my case, was the “Catfood” ESRI Shapefile Reader (details here).The shape file used in this example contains a list of parks in London. The following code demonstrates how to extract the metadata and the geographic shapes to a CSV file. The only shapes extracted are polygons, although it is possible to add more if needed. public static void CreateWorkForThreads() { //Create a new dataset and store the data in a table DataSet ds = CreateNewDataSet(); DataTable dt = ds.Tables[0]; int i; int count = 0; // Parse the shapefile and select the columns we are interested in using (Shapefile shapefile = new Shapefile(@"path\file.shp")) { foreach (Shape shape in shapefile) { string[] metadataNames = shape.GetMetadataNames(); string geometry = ""; int countParts = 0; int countShape = 0; DataRow dr = dt.NewRow(); //Extract the metadata. The first iteraction will extract the name of the columns if (metadataNames != null) { foreach (string metadataName in metadataNames) { if (count == 0) dr[metadataName] = metadataName; else dr[metadataName] = shape.GetMetadata(metadataName); } } //Shape is not part of the metadata, so manually defining the name of the column if (count == 0) { dr["shape"] = "shape"; } else { // cast shape based on the type switch (shape.Type) { case ShapeType.Point: // a point is just a single x/y point ShapePoint shapePoint = shape as ShapePoint; MessageBox.Show("Point (" + shapePoint.Point.X.ToString() + ", " + shapePoint.Point.Y.ToString() + ")"); break; case ShapeType.Polygon: // a polygon contains one or more parts - each part is a list of points which // are clockwise for boundaries and anti-clockwise for holes // see http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf ShapePolygon shapePolygon = shape as ShapePolygon; foreach (PointD[] part in shapePolygon.Parts) { countShape = 0; if (countParts == 0) geometry = "("; else geometry = geometry + " | ("; foreach (PointD point in part) { if (part.Length - 1 != countShape) geometry = geometry + point.X + " " + point.Y + " |"; else geometry = geometry + point.X + " " + point.Y + " )"; countShape++; } countParts++; } break; default: break; } //Build our Polygon. //Eg. 
POLYGON((-122.358 47.653, -122.348 47.649| -122.348 47.658, -122.358 47.658, -122.358 47.653)) dr["shape"] = "POLYGON(" + geometry + ")"; } dt.Rows.Add(dr); count++; } } //Extract the data to a csv file using (System.IO.StreamWriter sw = new System.IO.StreamWriter(@"path\filename.csv")) { foreach (DataRow row in dt.Rows) { object[] array = row.ItemArray; for (i = 0; i < array.Length - 1; i++) { sw.Write(array[i].ToString() + ","); } sw.WriteLine(array[i].ToString()); } } } public static DataSet CreateNewDataSet() { DataSet dsTemp = new DataSet(); DataTable dtTemp = new DataTable("londonparks"); dtTemp.Columns.Add("id", typeof(string)); dtTemp.Columns.Add("parkname", typeof(string)); dtTemp.Columns.Add("street", typeof(string)); dtTemp.Columns.Add("postcode", typeof(string)); dtTemp.Columns.Add("shape", typeof(string)); dsTemp.Tables.Add(dtTemp); return dsTemp; }Now that we have a valid file that can be processed by ADLA, we can upload it to ADLS and start performing geospatial analysis. To do so, I simply followed the logic described in Sacha’s blog (here).The following U-SQL has in consideration a dataset that contains details of the trajectory of a courier, tracked on a daily basis. With the following code, we identify if a courier drove by a park by using the Intersect function. Because we have to cross two datasets, a C# function was created to help the evaluation of multiple events. // Reference the assemblies we require in our script. REFERENCE SYSTEM ASSEMBLY [System.Xml]; REFERENCE ASSEMBLY [SQLServerExtensions].[SqlSpatial]; REFERENCE ASSEMBLY [USQL.Core]; // Once the appropriate assemblies are registered, we can alias them using the USING keyword. USING Geometry = Microsoft.SqlServer.Types.SqlGeometry; USING Geography = Microsoft.SqlServer.Types.SqlGeography; USING SqlChars = System.Data.SqlTypes.SqlChars; USING [USQL].[Core].[Utilities]; // Extract the list of parks @parks = EXTRACT [ID] string, [PARKNAME] string, [STREET] string, [POSTCODE] string, [SHAPE] string FROM "RAW/Parks.csv" USING Extractors.Text(delimiter : ',', silent : false, quoting : true, skipFirstNRows : 1); //Extract data from the file containing the courier trajectory @trajectories = EXTRACT GPSDateTimeUTC DateTime, ReceivedDatetimeUTC DateTime, VehicleKey string, Altitude int, Longitude double, Latitude double, Distance decimal, VehicleSpeedMph decimal FROM "CURATED/Trajectory/Trajectory.TXT" USING Extractors.Text(delimiter : '|', silent : false, quoting : true, skipFirstNRows : 1); //Get the list of vehicles that drove by the park. @vehicleIntersection = SELECT DISTINCT a. *, "1" AS VehicleIntersected FROM @trajectories AS a CROSS JOIN @parks AS b WHERE Utilities.Intersect(b.[SHAPE], a.[Longitude], a.[Latitude]).ToString() == "True"; //Get the list of vehicles that didn't drive by the park. @vehicleWithoutIntersection = SELECT a. *, "0" AS VehicleIntersected FROM @trajectories AS a LEFT JOIN @vehicleIntersection AS b ON b.VehicleKey == a.VehicleKey AND b.GPSDateTimeUTC == a.GPSDateTimeUTC WHERE b.VehicleKey IS NULL; //Union both datasets to get the complete set of data @finalData = SELECT * FROM @vehicleIntersection UNION ALL SELECT * FROM @vehicleWithoutIntersection; //Export the results to a csv file OUTPUT @finalData TO "LABORATORY/GeoSpatialIntersection.csv" USING Outputters.Text(outputHeader : true, delimiter : ',', quoting : true);And here is the C# function. 
It accepts three parameters and calculates the intersection of a point with a shape.

public static string Intersect(string shape, double longitude, double latitude)
{
    //Because we had a csv file, the coordinates in the polygon were separated by |
    //It is important to use the .MakeValid() method to validate any invalid shape
    //In case the dataset had multipolygon shapes, without the MakeValid() the function would throw an error
    var g = Geography.STGeomFromText(
                new SqlChars(shape.Replace('|', ',')), 4326).MakeValid();
    var h = Geography.Point(longitude, latitude, 4326);

    return g.STIntersects(h).ToString();
}

As always, if you have any questions or comments, do let me know.
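As a footnote, the REFERENCE ASSEMBLY statements in the U-SQL above assume that the spatial and utility assemblies have already been registered in the ADLA catalog. Below is a minimal, hedged sketch of that one-off registration, following the approach in the referenced blog; the database name comes from the script above, but the assembly file paths in the Data Lake Store are assumptions, so adjust them to wherever you upload the DLLs.

// One-off registration of the assemblies referenced by the script above.
// Assumes the DLLs have been uploaded to an /Assemblies folder in ADLS.
CREATE DATABASE IF NOT EXISTS SQLServerExtensions;
USE DATABASE SQLServerExtensions;

// The managed spatial types, with the native spatial library shipped as an additional file.
CREATE ASSEMBLY IF NOT EXISTS SqlSpatial
FROM @"/Assemblies/Microsoft.SqlServer.Types.dll"
WITH ADDITIONAL_FILES = (@"/Assemblies/SqlServerSpatial130.dll");

// The custom utilities assembly containing the Intersect function, registered in master
// so that REFERENCE ASSEMBLY [USQL.Core] resolves without a database qualifier.
USE DATABASE master;
CREATE ASSEMBLY IF NOT EXISTS [USQL.Core]
FROM @"/Assemblies/USQL.Core.dll";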

Different ways of performing file operations on Azure Data Lake

Recently at a client, we needed to come up with a few different ways to perform file management operations within their Data Lake – for example, moving files once processed and, in their case, renaming folders. We needed solutions different from what we currently used in order to stay within their desired architecture. So we started looking at using the REST API and calling it from C# within an SSIS package. The other option I looked at was using Python. I will explain more about both methods below, but first there is some set-up we need to do.

Pre-Requisites

Aside from having an Azure subscription and a Data Lake Store account, you will need an Azure Active Directory application. For our case we needed a Web Application, as we will be doing Service-to-Service authentication. This is where the application provides its own credentials to perform the operations, whereas with End-User authentication a user must log into your application using Azure AD.

Service-to-service authentication setup

Data Lake Store uses Azure Active Directory (AAD) for authentication, which results in our application being provided with an OAuth 2.0 token that gets attached to each request made to the Azure Data Lake Store. To read more about how it works, how to create the app, how to get the relevant credentials and how to give it access to the Data Lake Store, follow the Microsoft tutorial here. Make a note of the following, as we will need them when we develop the solutions:

- Tenant ID (also known as Directory ID)
- Client ID (also known as the Application ID). Within App registrations, if you look for your app under 'All Apps' and click on it, you will be able to retrieve the Application ID.
- Client Secret (also known as Authentication Key). Within the app area used above, click on Settings and then 'Keys'. If you haven't previously created one, you can create it there, and you must remember to save it when it appears as you will not get another chance!
- Data Lake name

Using the REST API

Now that we have everything set up and all the credentials we need, we can make a start constructing the requests. In order to test them out I used Postman, which can be downloaded here.

Authentication

Firstly, before we can begin any folder management operations, we need to authenticate against the Data Lake. To do this we perform a POST request to obtain an access token. This token will be passed to the other REST API calls we make (e.g. deleting files), as we always need to authenticate against the Data Lake. To retrieve the access token, we need to pass through the TENANT ID, CLIENT ID and CLIENT SECRET, and the request looks as follows:

curl -X POST https://login.microsoftonline.com/<TENANT-ID>/oauth2/token \
  -F grant_type=client_credentials \
  -F resource=https://management.core.windows.net/ \
  -F client_id=<CLIENT-ID> \
  -F client_secret=<AUTH-KEY>

Within Postman, it looks like this:

1. Make sure the request type is set to POST.
2. Make sure you have added your tenant ID to the request.
3. Fill out the body with your Client ID and Client Secret (grant_type and resource are set as constant values, as shown above).
4. Make a note of the Bearer access token, as we will need it to perform any file management operation on the Data Lake.
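Since the production solution discussed later calls this API from C# inside SSIS script tasks, here is a minimal hedged sketch of the same call issued from code rather than Postman. It is not production code: it assumes .NET's HttpClient and Newtonsoft.Json, the placeholders mirror the curl example, and the delete call at the end simply previews the operation covered in the next section.

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class AdlsRestSketch
{
    // Requires C# 7.1+ for async Main; in an SSIS script task this logic would
    // sit inside the task and be waited on synchronously.
    static async Task Main()
    {
        var http = new HttpClient();

        // 1. Request an OAuth 2.0 token (service-to-service), mirroring the curl call above.
        var tokenResponse = await http.PostAsync(
            "https://login.microsoftonline.com/<TENANT-ID>/oauth2/token",
            new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["grant_type"] = "client_credentials",
                ["resource"] = "https://management.core.windows.net/",
                ["client_id"] = "<CLIENT-ID>",
                ["client_secret"] = "<AUTH-KEY>"
            }));
        var accessToken = (string)JObject.Parse(
            await tokenResponse.Content.ReadAsStringAsync())["access_token"];

        // 2. Attach the Bearer token and call the WebHDFS-compatible endpoint
        //    (here, the DELETE operation shown in the next section).
        http.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", accessToken);
        var deleteResponse = await http.DeleteAsync(
            "https://<yourstorename>.azuredatalakestore.net/webhdfs/v1/mytempdir/myinputfile1.txt?op=DELETE");

        // A successful delete returns {"boolean":true}
        Console.WriteLine(await deleteResponse.Content.ReadAsStringAsync());
    }
}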
Deleting a File

Now that we have our access token, we can delete a file with the following:

curl -i -X DELETE -H "Authorization: Bearer <REDACTED>" 'https://<yourstorename>.azuredatalakestore.net/webhdfs/v1/mytempdir/myinputfile1.txt?op=DELETE'

Within Postman, it looks like the following:

1. This is a DELETE request, so the request type dropdown has been changed to reflect that.
2. Remember to add your Data Lake Store name into the request; in my example it is called emma1.
3. You can point to a particular file, or you can point to a folder and add &recursive=true to the request, and it will delete all the files within the folder, including the folder itself. I haven't managed to find a way to delete just the contents of the folder while leaving the folder as it is.
4. The access token is sent as a header called 'Authorization'. Make sure to include 'Bearer ' before your access token, as highlighted above.

Once you have sent the request, you will receive some JSON in the output to show whether the action has been successful (true). You can perform many more file management operations using the REST API, and the code can be found here.

Common Errors

1. The following error is caused by running the delete file request with an access token that has expired. To fix this issue, re-generate the access token and pass that through instead.
2. The error below is caused by the application we are using to access the Data Lake Store not having sufficient permissions to perform the operations. Make sure it has access to the relevant folder(s). Check step 3 here to find out how to set the access.

Summary

Now that we have managed to get them working manually within Postman, we need to consider how to call them in a production environment. The solution we implemented was an SSIS package (in keeping with the client's current architecture) with script tasks calling C#, which in turn calls the API. Before the file system operation is called, we run the authentication API call to obtain the latest access token and place the value in a variable to be used later in the package, ensuring the token is always current.

Using Python

Having had a play around with Python to do similar file management operations, it seems rather limited in comparison and you can't do as much. Nevertheless, I am sure more functionality will be added, and it is useful to know how it works. Firstly, if you don't already have Python, you can download the latest version from here. As an IDE, I have been using Visual Studio 2017, which now comes with Python support; see here for further information.

In order for us to be able to perform operations on the Data Lake, we need to install three Azure modules. To install them, open up the command prompt and run the following:

pip install azure-mgmt-resource
pip install azure-mgmt-datalake-store
pip install azure-datalake-store

Now we need to create the Python app (I used Visual Studio) to do the folder management tasks. In order to reference the modules we have just installed, we need to import them so we can use them within our app; each app we create for Data Lake folder manipulation needs these imports added. The code below shows how to do this. Save the application, but don't run it yet!
## Use this for Azure AD authentication
from msrestazure.azure_active_directory import AADTokenCredentials

## Required for Azure Data Lake Store account management
from azure.mgmt.datalake.store import DataLakeStoreAccountManagementClient
from azure.mgmt.datalake.store.models import DataLakeStoreAccount

## Required for Azure Data Lake Store filesystem management
from azure.datalake.store import core, lib, multithread

## Common Azure imports
import adal
from azure.mgmt.resource.resources import ResourceManagementClient
from azure.mgmt.resource.resources.models import ResourceGroup

## Use these as needed for your application
import logging, getpass, pprint, uuid, time

Firstly, we need to authenticate with Azure AD. Again, as described above, there are two ways: End-User and Service-to-Service. We will be using Service-to-Service again in this example. To set this up, we run the following:

adlCreds = lib.auth(tenant_id = 'FILL-IN-HERE', client_secret = 'FILL-IN-HERE', client_id = 'FILL-IN-HERE', resource = 'https://datalake.azure.net/')

And fill in the TENANT ID, CLIENT SECRET and CLIENT ID that we captured earlier on. Now that we can authenticate against the Data Lake, we can attempt to delete a file. We need to import some more modules, so add the script below to your application:

## Use this only for Azure AD service-to-service authentication
from azure.common.credentials import ServicePrincipalCredentials

## Required for Azure Data Lake Store filesystem management
from azure.datalake.store import core, lib, multithread

We now need to create a filesystem client:

## Declare variables
subscriptionId = 'FILL-IN-HERE'
adlsAccountName = 'FILL-IN-HERE'

## Create a filesystem client object
adlsFileSystemClient = core.AzureDLFileSystem(adlCreds, store_name=adlsAccountName)

We are now ready to perform some file management operations, such as deleting a file:

## Delete a directory
adlsFileSystemClient.rm('/mysampledirectory', recursive=True)

Please see the script below for the full piece of code. You can find information on the other operations you can complete (e.g. creating directories) here.

## Use this for Azure AD authentication
from msrestazure.azure_active_directory import AADTokenCredentials

## Required for Azure Data Lake Store filesystem management
from azure.datalake.store import core, lib, multithread

# Common Azure imports
import adal
from azure.mgmt.resource.resources import ResourceManagementClient
from azure.mgmt.resource.resources.models import ResourceGroup

## Use these as needed for your application
import logging, getpass, pprint, uuid, time

## Service to service authentication with client secret for file system operations
adlCreds = lib.auth(tenant_id = 'XXX', client_secret = 'XXX', client_id = 'XXX', resource = 'https://datalake.azure.net/')

## Create filesystem client
## Declare variables
subscriptionId = 'XXX'
adlsAccountName = 'emma1'

## Create a filesystem client object
adlsFileSystemClient = core.AzureDLFileSystem(adlCreds, store_name=adlsAccountName)

## Create a directory
#adlsFileSystemClient.mkdir('/mysampledirectory')

## Delete a directory
adlsFileSystemClient.rm('/mysampledirectory', recursive=True)

Summary

In summary, there are a few different ways in which you can handle your file management operations within Data Lake, and the principles behind the methods are very similar. So, if one way doesn't fit into your architecture, there is always an alternative.
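As a final illustration of the Python route, since the original requirement also covered moving files and renaming folders, the same filesystem client can do that too. This is a hedged sketch based on the azure-datalake-store documentation; the paths are made up, and the behaviour should be verified against your own store.

## A minimal sketch, assuming adlsFileSystemClient has been created as above
## and that the source paths already exist in the Data Lake Store.

## Move a file into a 'processed' folder once it has been handled
adlsFileSystemClient.rename('/raw/myinputfile1.txt', '/processed/myinputfile1.txt')

## Rename a folder in the same way
adlsFileSystemClient.rename('/mysampledirectory', '/mysampledirectory_archive')

## List the target folder to confirm the move
print(adlsFileSystemClient.ls('/processed'))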

Deploying Scheduled Triggers In Data Factory v2 With PowerShell

So in ADFv2, scheduled triggers have been overhauled to become their own deployable components, able to kick off multiple parameterised pipelines together in each trigger. This new version of the trigger seems far more modular and flexible than the functionality in v1; however, there are a couple of catches and easy-to-miss steps during implementation if you choose to use PowerShell rather than the brand new visual authoring tool, so I'll break down the process below.

First, you need to build your trigger as a separate JSON object to the pipeline. No additional code is needed in the pipeline itself. There is a more complete example in the documentation Microsoft has provided, but on top of that here is a working example I've written:

{
    "name": "ST_OrchestratorTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2017-12-20T04:00:00-05:00",
                "endTime": "2099-11-02T22:00:00-08:00",
                "schedule": {
                    "hours": [ 4 ],
                    "minutes": [ 0 ]
                }
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "PL_MasterOrchestration"
                },
                "parameters": {}
            }
        ]
    }
}

As you might guess, this trigger runs the PL_MasterOrchestration pipeline (with no input parameters) at 4am each day.

A caveat on the recurrence: it seems that the trigger doesn't strictly adhere to the "startTime" after the first time it runs, so despite starting at 4am, the subsequent run could be minutes or possibly hours off on the next day. To ensure it always triggers at exactly 4am, you need to add the "schedule" recurrence attribute, as shown above.

Deployment of the trigger (as of writing this blog) is still only done through PowerShell scripts, using the cmdlet below:

Set-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
                                -DataFactoryName $DataFactoryName `
                                -DefinitionFile $TriggerFile `
                                -Name $TriggerName `
                                -Force;

Each of these parameters is exactly as you would expect from the v1 cmdlets for setting other objects. -Force shouldn't strictly be necessary, but is sometimes worth including depending on the exact circumstances; -DefinitionFile expects the .JSON file itself; and so on.

So, at this point you've created the scheduler and deployed it to your data factory. Scheduled triggers in v2 need to be started once they are deployed. You can do this with the following:

Start-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
                                  -DataFactoryName $DataFactoryName `
                                  -Name $TriggerName

Just another quick note here: you will need to stop the trigger again if you want to redeploy it. Trying to overwrite a trigger while it's running will just result in an error message (see the sketch at the end of this post for the full stop/redeploy/start sequence).

As well as the necessary cmdlets being about two-thirds of the way down the Microsoft document linked above, there is a list of all the relevant cmdlets here.

Hope this helps!

The documentation again:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-schedule-trigger
https://docs.microsoft.com/en-us/powershell/module/azurerm.datafactoryv2/start-azurermdatafactoryv2trigger?view=azurermps-4.4.1
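As promised, here is a hedged sketch of the stop/redeploy/start pattern a redeployment ends up as in practice, using the cmdlets referenced in this post. It assumes the $ResourceGroupName, $DataFactoryName, $TriggerName and $TriggerFile variables are populated earlier in your deployment script, and that the Get/Stop cmdlets from the AzureRM.DataFactoryV2 module are available.

# A minimal sketch: stop the trigger if it already exists, redeploy it, then start it again.
$existingTrigger = Get-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
                                                   -DataFactoryName $DataFactoryName `
                                                   -Name $TriggerName `
                                                   -ErrorAction SilentlyContinue

if ($existingTrigger) {
    # Running triggers cannot be overwritten, so stop it first (-Force suppresses the prompt).
    Stop-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
                                     -DataFactoryName $DataFactoryName `
                                     -Name $TriggerName `
                                     -Force
}

# Deploy the trigger definition from the JSON file.
Set-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
                                -DataFactoryName $DataFactoryName `
                                -DefinitionFile $TriggerFile `
                                -Name $TriggerName `
                                -Force

# Start the trigger again so the schedule is live.
Start-AzureRmDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName `
                                  -DataFactoryName $DataFactoryName `
                                  -Name $TriggerName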

The Road to Microsoft MVP (Most Valuable Professional)

It is with great pleasure that I can share with you all that I am now a Microsoft MVP (Most Valuable Professional). The MVP programme (https://mvp.microsoft.com/) is designed to recognise those professionals who give their time to support technical communities and Microsoft. This could be through blogging, speaking at conferences, providing feedback, helping others and contributing code to projects. There are many award categories, ranging from PowerPoint to Xbox, each with its own contribution areas. I was awarded MVP for the Data Platform, which covers SQL Server, Data Lake, HDInsight and many of the topics I am particularly interested in. I am exceptionally pleased to be amongst the 19 Data Platform MVPs in the UK and 1 of the 418 Data Platform MVPs worldwide.

In a recent blog, Kevin Kline discussed what it takes to be an MVP; in his opinion, it all boils down to being passionate about what you do and sharing that passion with others (https://blogs.sentryone.com/kevinkline/how-can-i-become-a-microsoft-mvp/). I could not agree more! I talk at events because I want to help people. I still get so much out of attending sessions, and I want to make sure that I give back everything I took from the community. Kevin's blog gives a fantastic overview of what it takes to become an MVP and is well worth a read.

When I first started attending user groups and conferences, I began to become aware of the term MVP. At that point there were only a handful of MVPs in the UK (back then you were a SQL Server MVP). Their sessions at conferences were always held in high regard, and rightly so: these men and women were at the top of their game. I looked up to these professionals and would always learn something from reading their blogs and books and seeing them talk. When I started talking at user groups and SQL Saturdays I always wanted to become an MVP, but it was never my reason for presenting; it is now a very happy by-product of giving back to a technical community.

MVP is awarded based on your contributions in the last 12 months, so what have I done in the last 12 months?

- Blogged (http://blogs.adatis.co.uk/terrymccann/)
- Delivered 25 technical sessions at user groups and conferences
- Spoke at the PASS Summit in Seattle
- Organised a data science meetup

It has been a real blast over the last 12 months; I could not have spoken at so many events without the support of Adatis - so thanks. I also want to thank everyone who took the time to nominate me, and Microsoft for their help and support. (Some of the MVPs who nominated me - at a recent MVP event)

What will I be doing over the next 12 months? A lot of the same, although I would like to get a few more blogs written this year. If you want to see me talk, I will be at SQLBits in London in February, delivering a pre-conference training day and a general session:

- A Data Engineer’s Guide to Azure SQL Data Warehouse
- Enhancing relational models with graph in SQL Server 2017

Beyond that, keep an eye out at your local user group or SQL Saturday. See you soon.

Archiving the Data Lake

In the blog introducing the Data Lake Framework, keen readers will have noticed that the diagram contains a box titled “ARCHIVE” which has not been brought up since. The reason the Archive layer in the data lake has not been discussed is that we’ve been waiting for the Archive tier in Blob Storage. To remind readers of the framework and where the archive layer sits in it, here it is again with the archive layer highlighted.

The Archive Blob

The Archive access tier in Blob Storage was made generally available today (13th December 2017), and with it comes the final piece in the puzzle for archiving data from the data lake. Where the Hot and Cool access tiers can be applied at a storage account level, the Archive access tier can only be applied to a blob storage container. To understand why the Archive access tier can only be applied to a container, you need to understand the features of the Archive access tier. It is intended for data that has no or low SLAs for availability within an organisation, and the data is stored offline (the Hot and Cool access tiers are online); therefore, it can take up to 15 hours for data to be brought online and made available. Bringing Archive data online is a process called rehydration (fitting for the data lake). If you have lots of blob containers in a storage account, you can archive them and rehydrate them as required, rather than having to rehydrate the entire storage account.

Archive Pattern

An intended use for the Archive access tier is to store raw data that must be preserved, even after it has been fully processed, and does not need to be accessed within 180 days. Data gets loaded into the RAW area of the data lake, is fully processed through to CURATED, and a copy of the raw data is archived off to a blob container with a Cool access tier applied to it. When the archive cycle comes around, a new Cool access tiered blob container is created and the now old container has its access tier changed to Archive.

For example, our archive cycle is monthly and we have a Cool access tiered blob container in our storage account called “December 2017”. When data has finished being processed in the Azure Data Lake, the raw data is archived to this blob container. When January comes around, we create a new blob container called “January 2018” with Cool access tier settings and change the access tier of “December 2017” from Cool to Archive. This data has now been formally archived and is only available for disaster recovery, auditing or compliance purposes.
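To make the pattern concrete, here is a hedged PowerShell sketch of the monthly rollover described above. It assumes the Azure PowerShell storage module and a storage SDK recent enough to expose blob tiers; the account, key and container names are placeholders, and the tier change is applied to the blobs inside the container being archived.

# A minimal sketch of the monthly archive rollover - placeholder names throughout.
$context = New-AzureStorageContext -StorageAccountName "myarchiveaccount" `
                                   -StorageAccountKey "<storage-key>"

# 1. Create the container that will receive this month's raw copies
#    (blobs written here stay in the Cool tier used for recent archives).
New-AzureStorageContainer -Name "january2018" -Context $context

# 2. Move last month's container from Cool to Archive by setting the tier
#    on each blob it contains.
Get-AzureStorageBlob -Container "december2017" -Context $context |
    ForEach-Object { $_.ICloudBlob.SetStandardBlobTier("Archive") }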

The thing that stumps everyone new to Azure Data Factory

Recently I was playing around with Azure Data Factory and was making my way through an online tutorial when I came across a frustrating error message that was not very helpful, and my data wasn’t moving anywhere! The aim of this exercise was to move some text files in Azure Data Lake Store over to a table in an Azure SQL Database. Sounds easy enough!

The data within the Data Lake Store was organised into a Year and Month hierarchy for the folders, and each day’s transactions were stored in a file named after the day, within the relevant month folder. The task then was to create a pipeline which copies the dataset in the Data Lake Store over to the dbo.Orders table in Azure SQL DB every day within the scheduled period (Q1 2016). After creating all the JSON scripts and deploying them (with no errors), I clicked on the ‘Monitor and Manage’ tile to monitor the activities and check everything was working as it should be. After waiting for at least 10 minutes, I started to get frustrated.

As you can see, all the pipeline activities for each slice have a status of ‘Waiting: Dataset Dependencies’, and if you look at the Activity Window Explorer (see below), the activity window for each day shows a status of Waiting (a solid orange square).

The files existed in the correct folders within my Data Lake Store and there were no errors during deployment, so what was it waiting for?! Well, it turns out we need to set an additional property on the input dataset to let Data Factory know that the data is generated externally and not internally within Data Factory by another pipeline. That property is "external": true, at the bottom of the script. Below is an example of a data factory, and the arrows point to the datasets that need to have this external property set to true. Using the diagram of the pipeline once deployed, it is easy to identify which datasets need to be external, as they are not linked to anything upstream.
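For clarity, here is a hedged sketch of what an input dataset with the property set can look like; the dataset name, linked service and folder path are illustrative rather than taken from the tutorial, and the important line is "external": true sitting alongside availability in the properties section.

{
    "name": "InputDatasetADLS",
    "properties": {
        "type": "AzureDataLakeStore",
        "linkedServiceName": "AzureDataLakeStoreLinkedService",
        "typeProperties": {
            "folderPath": "Orders/2016/01/",
            "format": { "type": "TextFormat" }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true
    }
}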