Adatis

Adatis BI Blogs

Geospatial analysis with Azure Databricks

A few months ago, I wrote a blog demonstrating how to extract and analyse geospatial data in Azure Data Lake Analytics (ADLA) (here). The article aimed to prove that it was possible to run spatial analysis using U-SQL, even though it does not natively support spatial data analytics. The outcome of that experience was positive, however, with serious limitations in terms of performance. ADLA dynamically provisions resources and can perform analytics on terabytes to petabytes of data, however, because it must use the SQL Server Data Types and Spatial assemblies to perform spatial analysis, all the parallelism capabilities are suddenly limited. For example, if you are running an aggregation, ADLA will split the processing between multiple vertices, making it faster, however, when running intersections between points and polygons, because it is a SQL threaded operation, it will only use one vertex, and consequently, the job might take hours to complete. Since I last wrote my blog, the data analytics landscape has changed, and with that, new options became available, namely Azure Databricks. In this blog, I’ll demonstrate how to run spatial analysis and export the results to a mounted point using the Magellan library and Azure Databricks.Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries (further details here).Although people mentioned in their GitHub page that the 1.0.5 Magellan library is available for Apache Spark 2.3+ clusters, I learned through a very difficult process that the only way to make it work in Azure Databricks is if you have an Apache Spark 2.2.1 cluster with Scala 2.11. The cluster I used for this experience consisted of a Standard_DS3_v2 driver type with 14GB Memory, 4 Cores and auto scaling enabled. In terms of datasets, I used the NYC Taxicab dataset to create the geometry points and the Magellan NYC Neighbourhoods GeoJSON dataset to extract the polygons. Both datasets were stored in a blob storage and added to Azure Databricks as a mount point.As always, first we need to import the libraries.//Import Libraries import magellan._ import org.apache.spark.sql.magellan.dsl.expressions._ import org.apache.spark.sql.Row import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._Next, we define the schema of the NYC Taxicab dataset and load the data to a DataFrame. While loading the data, we convert the pickup longitude and latitude into a Magellan Point. If there was a need, in the same operation we could also add another Magellan Point from the drop off longitude and latitude.//Define schema for the NYC taxi data val schema = StructType(Array(     StructField("vendorId", StringType, false),     StructField("pickup_datetime", StringType, false),     StructField("dropoff_datetime", StringType, false),     StructField("passenger_count", IntegerType, false),     StructField("trip_distance", DoubleType, false),     StructField("pickup_longitude", DoubleType, false),     StructField("pickup_latitude", DoubleType, false),     StructField("rateCodeId", StringType, false),     StructField("store_fwd", StringType, false),     StructField("dropoff_longitude", DoubleType, false),     StructField("dropoff_latitude", DoubleType, false),     StructField("payment_type", StringType, false),     StructField("fare_amount", StringType, false),     StructField("extra", StringType, false),     StructField("mta_tax", StringType, false),     StructField("tip_amount", StringType, false),     StructField("tolls_amount", StringType, false),     StructField("improvement_surcharge", StringType, false),     StructField("total_amount", DoubleType, false)))//Read data from the NYC Taxicab dataset and create a Magellan point val trips = sqlContext.read       .format("com.databricks.spark.csv")       .option("mode", "DROPMALFORMED")       .schema(schema)       .load("/mnt/geospatial/nyctaxis/*")       .withColumn("point", point($"pickup_longitude",$"pickup_latitude"))The next step is to load the neighbourhood data. As mentioned in their documentation, Magellan supports the reading of ESRI, GeoJSON, OSM-XML and WKT formats. From the GeoJSON dataset, Magellan will extract a collection of polygons and read the metadata into a Map. There are three things to notice in the code below. First, the extraction of the polygon, second, the selection of the key corresponding to the neighbourhood name and finally, the provision of a hint that defines what the index precision should be. This operation, alongside with the injection of a spatial join rule into Catalyst, massively increases the performance of the queries. To have a better understanding of this operation, read this excellent blog.//Read GeoJSON file and define index precision val neighborhoods = sqlContext.read       .format("magellan")       .option("type", "geojson")       .load("/mnt/geospatial/neighborhoods/neighborhoods.geojson")       .select($"polygon",         $"metadata"("neighborhood").as("neighborhood"))       .index(30)Now that we have our two datasets loaded, we can run our geospatial query, to identify in which neighbourhood the pickup points fall under. To achieve our goal, we need to join the two DataFrames and apply a within predicate. As a curiosity, if we consider that m represents the number of points (12.748.987), n the number of polygons (310) p the average # of edges per polygon (104) and O(mnp), then, we will roughly perform 4 trillion calculations on a single node to determine where each point falls. //Inject rules and join DataFrames with within predicate magellan.Utils.injectRules(spark) val intersected = trips.join(neighborhoods)         .where($"point" within $"polygon")The above code, does not take longer than 1 second to execute. It is only when we want to obtain details about our DataFrame that the computing time is visible. For example, if we want to know which state has the most pickups, we can write the following code which takes in average 40 seconds.//Neighbourhoods that received the most pickups display(intersected        .groupBy('neighborhood)       .count()       .orderBy($"count".desc))If we want to save the data and identify which pickups fall inside the NYC neighbourhoods, then we have to rewrite our intersected DataFrame to select all columns except the Magellan Points and Polygons, add a new column to the DataFrame and export the data back to the blob, as shown below.//select pickup points that don't fall inside a neighbourhood val nonIntersected = trips                       .select($"vendorId",$"pickup_datetime", $"dropoff_datetime", $"passenger_count", $"trip_distance", $"pickup_longitude", $"pickup_latitude", $"rateCodeId", $"store_fwd",$"dropoff_longitude", $"dropoff_latitude",$"payment_type",$"fare_amount", $"extra", $"mta_tax", $"tip_amount", $"tolls_amount", $"improvement_surcharge", $"total_amount")                       .except(intersected)//add new column intersected_flagval intersectedFlag = "1"val nonIntersectedFlag = "0"val tripsIntersected = intersected.withColumn("intersected_flag",expr(intersectedFlag)) val tripsNotIntersected = nonIntersected.withColumn("intersected_flag",expr(nonIntersectedFlag))//Union DataFramesval allTrips = tripsNotIntersected.union(tripsIntersected)//Save data to the blobintersected.write   .format("com.databricks.spark.csv")   .option("header", "true")   .save("/mnt/geospatial/trips/trips.csv")In summary, reading a dataset with 1.8GB, apply geospatial analysis and export it back to the blob storage only took in average 1 min, which is miles better when compared with my previous attempt with U-SQL.As always, if you have any comments or questions, do let me know.

Spark Streaming in Azure Databricks

Real-time stream processing is becoming more prevalent on modern day data platforms, and with a myriad of processing technologies out there, where do you begin? Stream processing involves the consumption of messages from either queue/files, doing some processing in the middle (querying, filtering, aggregation) and then forwarding the result to a sink – all with a minimal latency. This is in direct contrast to batch processing which usually occurs on an hourly or daily basis. Often is this the case, both of these will need to be combined to create a new data set. In terms of options for real-time stream processing on Azure you have the following: Azure Stream Analytics Spark Streaming / Storm on HDInsight Spark Streaming on Databricks Azure Functions Stream Analytics is a simple PaaS offering. It connects easily into other Azure resources such as Event Hubs, IoT Hub, and Blob, and outputs to a range of resources that you’d expect. It has its own intuitive query language, with the added benefit of letting you create functions in JavaScript. Scaling can be achieved by partitions, and it has windowing and late arrival event support that you’d expect from a processing option. For most jobs, this service will be the quickest/easiest to implement as long as its relatively small amount of limitations fall outside the bounds of what you want to achieve. Its also worth noting that the service does not currently support Azure network security such as Virtual Networks or IP Filtering. I suspect this may only be time with the Preview of this in EventHubs. Both Spark Streaming on HDInsight and Databricks open up the options for configurability and are possibly more suited to an enterprise level data platform, allowing us to use languages such as Scala/Python or even Java for the processing in the middle. The use of these options also allows us to integrate Kafka (an open source alternative to EventHubs) as well as HDFS, and Data Lake as inputs. Scalability is determined by the cluster sizes and the support for other events mentioned above is also included. These options also give us the flexibility for the future, and allow us to adapt moving forward depending on evolving technologies. They also come with the benefit of Azure network security support so we can peer our clusters onto a virtual network. Lastly – I wouldn’t personally use this but we can also use Functions to achieve the same goal through C#/Node.js. This route however does not include support for those temporal/windowing/late arrival events since functions are serverless and act on a per execution basis. In the following blog, I’ll be looking at Spark Streaming on Databricks (which is fast becoming my favourite research topic). A good place to start this is to understand the structured streaming model which I’ve seen a documented a few times now. Essentially treating the stream as an unbounded table, with new records from the stream being appended as a new rows to the table. This allows us to treat both batch and streaming data as tables in a DataFrame, therefore allowing similar queries to be run across them.     At this point, it will be useful to include some code to help explain the process. Before beginning its worth mounting your data sink to your databricks instance so you can reference it as if it were inside the DBFS (Databricks File System) – this is merely a pointer. For more info on this, refer to the databricks documentation here. Only create a mount point if you want all users in the workspace to have access. If you wish to apply security, you will need to access the store directly (also documented in the same place) and then apply permissions to the notebook accordingly. As my input for my stream was from EventHubs, we can start by defining the reading stream. You’ll firstly need to add the maven coordinate com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.2 to add the EventHub library to the cluster to allow the connection. Further options can be added for the consumer group, starting positions (for partitioning), timeouts and events per trigger. Positions can also be used to define starting and ending points in time so that the stream is not running continuously. connectionString = "Endpoint=sb://{EVENTHUBNAMESPACE}.servicebus.windows.net/{EVENTHUBNAME};EntityPath={EVENTHUBNAME};SharedAccessKeyName={ACCESSKEYNAME};SharedAccessKey={ACCESSKEY}" startingEventPosition = { "offset": "-1", # start of stream "seqNo": -1, # not in use "enqueuedTime": None, # not in use "isInclusive": True } endingEventPosition = { "offset": None, # not in use "seqNo": -1, # not in use "enqueuedTime": dt.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ"), # point in time "isInclusive": True } ehConf = {} ehConf['eventhubs.connectionString'] = connectionString ehConf['eventhubs.startingPosition'] = json.dumps(startingEventPosition) ehConf['eventhubs.endingPosition'] = json.dumps(endingEventPosition) df = spark \ .readStream \ .format("eventhubs") \ .options(**ehConf) \ .load() The streaming data that is then output then follows the following schema – the body followed by a series of metadata about the streaming message.     Its important to note that the body comes out as a binary stream (this contains our message). We will need to cast the body to a String to deserialize the column to the JSON that we are expecting. This can be done by using some Spark SQL to turn the binary into a string as JSON and then parsing the column into a StructType with specified schema. If multiple records are coming through in the same message, you will need to explode out the result into separate records. Flattening out the nested columns is also useful as long as the data frame is still manageable. Spark SQL provides some great functions here to make our life easy. rawData = df. \ selectExpr("cast(body as string) as json"). \ select(from_json("json", Schema).alias("data")). \ select("data.*") While its entirely possible to construct your schema manually, its also worth noting that you can take a sample JSON, read it into a data frame using spark.read.json(path) and then calling printSchema() on top of it to return the inferred schema. This can then used be used to create the StructType. # Inferred schema: # root # |-- LineTotal: string (nullable = true) # |-- OrderQty: string (nullable = true) # |-- ProductID: string (nullable = true) # |-- SalesOrderDetailID: string (nullable = true) # |-- SalesOrderID: string (nullable = true) # |-- UnitPrice: string (nullable = true) # |-- UnitPriceDiscount: string (nullable = true) Schema = StructType([ StructField('SalesOrderID', StringType(), False), StructField('SalesOrderDetailID', StringType(), False), StructField('OrderQty', StringType(), False), StructField('ProductID', StringType(), False), StructField('UnitPrice', StringType(), False), StructField('UnitPriceDiscount', StringType(), False), StructField('LineTotal', StringType(), False) ]) At this point, you have the data streaming into your data frame. To output to the console you can use display(rawData) to see the data visually. However this is only useful for debugging since the data is not actually going anywhere! To write the stream into somewhere such as data lake you would then use the following code. The checkpoint location can be used to recover from failures when the stream is interrupted, and this is important if this code were to make it to a production environment. Should a cluster fail, the query be restarted on a new cluster from a specific point and consistently recover, thus enabling exactly-once guarantees. This also means we can change the query as long as the input source and output schema are the same, and not directly interrupt the stream. Lastly, the trigger will check for new rows in to stream every 10 seconds. rawData.writeStream \ .format("json") \ .outputMode("append") \ .option("path", PATH) \ .trigger(processingTime = "10 seconds") \ .option("checkpointLocation", PATH) \ .start() Checking our data lake, you can now see the data has made its way over, broken up by the time intervals specified.     Hopefully this is useful for anyone getting going in the topic area. I’d advise to stick to Python given the extra capacity of the PySpark language over Scala, even though a lot of the Databricks documentation / tutorials uses Scala. This was just something that felt more comfortable. If you intend to do much in this area I would definitely suggest you use the PySpark SQL documentation which can be found here. This is pretty much a bible for all commands and I’ve been referencing it quite a bit. If this is not enough there is also a cheat sheet available here. Again, very useful for reference when the language is still not engrained.

Databricks – Cluster Sizing

IntroductionSetting up Clusters in Databricks presents you with a wrath of different options. Which cluster mode should I use? What driver type should I select? How many worker nodes should I be using? In this blog I will try to answer those questions and to give a little insight into how to setup a cluster which exactly meets your needs to allow you to save money and produce low running times. To do this I will first of all describe and explain the different options available, then we shall go through some experiments, before finally drawing some conclusions to give you a deeper understanding of how to effectively setup your cluster.Cluster TypesDatabricks has two different types of clusters: Interactive and Job. You can see these when you navigate to the Clusters homepage, all clusters are grouped under either Interactive or Job. When to use each one depends on your specific scenario. Interactive clusters are used to analyse data with notebooks, thus give you much more visibility and control. This should be used in the development phase of a project. Job clusters are used to run automated workloads using the UI or API. Jobs can be used to schedule Notebooks, they are recommended to be used in Production for most projects and that a new cluster is created for each run of each job. For the experiments we will go through in this blog we will use existing predefined interactive clusters so that we can fairly assess the performance of each configuration as opposed to start-up time.Cluster ModesWhen creating a cluster, you will notice that there are two types of cluster modes. Standard is the default and can be used with Python, R, Scala and SQL. The other cluster mode option is high concurrency. High concurrency provides resource utilisation, isolation for each notebook by creating a new environment for each one, security and sharing by multiple concurrently active users. Sharing is accomplished by pre-empting tasks to enforce fair sharing between different users. Pre-emption can be altered in a variety of different ways. To enable, you must be running Spark 2.2 above and add the following coloured underline lines to Spark Config, displayed in the image below. It should be noted high concurrency does not support Scala.Enabled – Self-explanatory, required to enable pre-emption.Threshold – Fair share fraction guaranteed. 1.0 will aggressively attempt to guarantee perfect sharing. 0.0 disables pre-emption. 0.5 is the default, at worse the user will get half of their fair share. Timeout – The amount of time that a user is starved before pre-emption starts. A lower value will cause more interactive response times, at the expense of cluster efficiency. Recommended to be between 1-100 seconds.Interval – How often the scheduler will check for pre-emption. This should be less than the timeout above.Driver Node and Worker NodesCluster nodes have a single driver node and multiple worker nodes. The driver and worker nodes can have different instance types, but by default they are the same. A driver node runs the main function and executes various parallel operations on the worker nodes. The worker nodes read and write from and to the data sources.When creating a cluster, you can either specify an exact number of workers required for the cluster or specify a minimum and maximum range and allow the number of workers to automatically be scaled. When auto scaling is enabled the number of total workers will sit between the min and max. If a cluster has pending tasks it scales up, once there are no pending tasks it scales back down again. This all happens whilst a load is running.PricingIf you’re going to be playing around with clusters, then it’s important you understand how the pricing works. Databricks uses something called Databricks Unit (DBU), which is a unit of processing capability per hour. Based upon different tiers, more information can be found here.You will be charged for your driver node and each worker node per hour.You can find out much more about pricing Databricks clusters by going to my colleague’s blog, which can be found here.ExperimentFor the experiments I wanted to use a medium and big dataset to make it a fair test. I started with the People10M dataset, with the intention of this being the larger dataset. I created some basic ETL to put it through its paces, so we could effectively compare different configurations. The ETL does the following: read in the data, pivot on the decade of birth, convert the salary to GBP and calculate the average, grouped by the gender. The People10M dataset wasn’t large enough for my liking, the ETL still ran in under 15 seconds. Therefore, I created a for loop to union the dataset to itself 4 times. Taking us from 10 million rows to 160 million rows. The code used can be found below:# Import relevant functions.import datetimefrom pyspark.sql.functions import year, floor# Read in the People10m table.people = spark.sql("select * from clusters.people10m ORDER BY ssn")# Explode the dataset.for i in xrange(0,4):     people = people.union(people)# Get decade from birthDate and convert salary to GBP.people = people.withColumn('decade', floor(year("birthDate")/10)*10).withColumn('salaryGBP', floor(people.salary.cast("float") * 0.753321205))# Pivot the decade of birth and sum the salary whilst applying a currency conversion.people = people.groupBy("gender").pivot("decade").sum("salaryGBP").show()To push it through its paces further and to test parallelism I used threading to run the above ETL 5 times, this brought the running time to over 5 minutes, perfect! The following code was used to carry out orchestration:from multiprocessing.pool import ThreadPoolpool = ThreadPool(10)pool.map(     lambda path: dbutils.notebook.run(          "/Users/mdw@adatis.co.uk/Cluster Sizing/PeopleETL160M",          timeout_seconds = 1200),     ["","","","",""])To be able to test the different options available to us I created 5 different cluster configurations. For each of them the Databricks runtime version was 4.3 (includes Apache Spark 2.3.1, Scala 2.11) and Python v2.Default – This was the default cluster configuration at the time of writing, which is a worker type of Standard_DS3_v2 (14 GB memory, 4 cores), driver node the same as the workers and autoscaling enabled with a range of 2 to 8. Total available is 112 GB memory and 32 cores.Auto scale (large range) – This is identical to the default but with autoscaling range of 2 to 14. Therefore total available is 182 GB memory and 56 cores. I included this to try and understand just how effective the autoscaling is. Static (few powerful workers) – The worker type is Standard_DS5_v2 (56 GB memory, 16 cores), driver node the same as the workers and just 2 worker nodes. Total available is 112 GB memory and 32 cores.Static (many workers new) – The same as the default, except there are 8 workers. Total available is 112 GB memory and 32 cores, which is identical to the Static (few powerful workers) configuration above. Therefore, will allow us to understand if few powerful workers or many weaker workers is more effective.High Concurrency – A cluster mode of ‘High Concurrency’ is selected, unlike all the others which are ‘Standard’. This results in a worker type of Standard_DS13_v2 (56 GB memory, 8 cores), driver node is the same as the workers and autoscaling enabled with a range of 2 to 8. Total available is 448 GB memory and 64 cores. This cluster also has all of the Spark Config attributes specified earlier in the blog. Here we are trying to understand when to use High Concurrency instead of Standard cluster mode.The results can be seen below, measured in seconds, a new row for each different configuration described above and I did three different runs and calculated the average and standard deviation, the rank is based upon the average. Run 1 was always done in the morning, Run 2 in the afternoon and Run 3 in the evening, this was to try and make the tests fair and reduce the effects of other clusters running at the same time.Before we move onto the conclusions, I want to make one important point, different cluster configurations work better or worse depending on the dataset size, so don’t discredit the smaller dataset, when you are working with smaller datasets you can’t apply what you know about the larger datasets.Comparing the default to the auto scale (large range) shows that when using a large dataset allowing for more worker nodes really does make a positive difference. With just 1 million rows the difference is negligible, but with 160 million on average it is 65% quicker.Comparing the two static configurations: few powerful worker nodes versus many less powerful worker nodes yielded some interesting results. Remember, both have identical memory and cores. With the small data set, few powerful worker nodes resulted in quicker times, the quickest of all configurations in fact. When looking at the larger dataset the opposite is true, having more, less powerful workers is quicker. Whilst this is a fair observation to make, it should be noted that the static configurations do have an advantage with these relatively short loading times as the autoscaling does take time.The final observation I’d like to make is High Concurrency configuration, it is the only configuration to perform quicker for the larger dataset. By quite a significant difference it is the slowest with the smaller dataset. With the largest dataset it is the second quickest, only losing out, I suspect, to the autoscaling. High concurrency isolates each notebook, thus enforcing true parallelism. Why the large dataset performs quicker than the smaller dataset requires further investigation and experiments, but it certainly is useful to know that with large datasets where time of execution is important that High Concurrency can make a good positive impact.To conclude, I’d like to point out the default configuration is almost the slowest in both dataset sizes, hence it is worth spending time contemplating which cluster configurations could impact your solution, because choosing the correct ones will make runtimes significantly quicker.

Getting Started with Databricks Cluster Pricing

The use of databricks for data engineering or data analytics workloads is becoming more prevalent as the platform grows, and has made its way into most of our recent modern data architecture proposals – whether that be PaaS warehouses, or data science platforms. To run any type of workload on the platform, you will need to setup a cluster to do the processing for you. While the Azure-based platform has made this relatively simple for development purposes, i.e. give it a name, select a runtime, select the type of VMs you want and away you go – for production workloads, a bit more thought needs to go into the configuration/cost.  In the following blog I’ll start by looking at the pricing in a bit more detail which will aim to provide a cost element to the cluster configuration process. For arguments sake, the work that we tend to deliver with databricks is based on data engineering usage – spinning up resource for an allocated period to perform a task. Therefore this is generally the focus for the following topic.   To get started in this area, I think it would be useful to included some definitions. DBU – a databricks unit (unit of processing capability per hour billed on per second usage) Data Engineering Workload - a job that both starts and terminates the cluster which it runs on (via the job scheduler) Data Analytics Workload – a non automated workload, for example running a command manually within a databricks notebook. Multiple users can share the cluster to perform interactive analysis Cluster – made up of instances of processing (VMs) and constitute of a driver, and workers. Workers can either be provisioned upfront, or autoscaled between a min no. workers / max no. workers. Tier – either standard or premium. Premium includes role based access control, ODBC endpoint authentication, audit logs, Databricks Delta (unified data management system). The billing for the clusters primarily works depending on the type of workload you initiate and tier (or functionality) you require. As you might of guessed data engineering workloads on the standard tier offer the best price. I’ve taken the DS3 v2 instance (VM) pricing from the Azure Databricks pricing page.   The pricing can be broken down as follows: Each instance is charged at £0.262/hour. So for example, the cost of a very simple cluster - 1 driver and 2 workers is £0.262/hour x 3 = £0.786/hour. The VM cost does not depend on the workload type/tier. The DBU cost is then calculated at £0.196/hour. So for example, the cost of 3 nodes (as above) is £0.196/hour x 3 = £0.588/hour. This cost does change depending on workload type/tier. The total cost is then £0.786/hour (VM Cost) + £0.588/hour (DBU Cost) = £1.374/hour. Also known as the pay as you go price. Discounts are then added accordingly for reserved processing power. I thought this was worth simplifying since the pricing page doesn’t make this abundantly clear with the way the table is laid out and often this is overlooked. Due to the vast amount of options you can have to setup clusters, its worth understanding this element to balance against time. The DBU count is merely to be used as reference to compare the different VMs processing power and is not directly included in the calculations. Its also worth mentioning that by default databricks services are setup as premium and can be downgraded to standard only by contacting support. In some cases, this can add some massive cost savings depending upon the type of work you are doing on the platform so please take this into account before spinning up clusters and don’t just go with the default. With regards to configuration, clusters can either be setup under a High Concurrency mode (previously known as serverless) or as Standard. The high concurrency mode is optimised for concurrent workloads and therefore is more applicable to data analytics workloads and interactive notebooks which are used by multiple users simultaneously. This piece of configuration does not effect the pricing. By using the following cost model, we can then assume for a basic batch ETL run where we have a driver and 8 worker nodes on relatively small DS3 instances, would cost £123.60/month given a standard 1 hour daily ETL window. Hopefully this provides as a very simple introduction into the pricing model used by Databricks.

Databricks UDF Performance Comparisons

I’ve recently been spending quite a bit of time on the Azure Databricks platform, and while learning decided it was worth using it to experiment with some common data warehousing tasks in the form of data cleansing. As Databricks provides us with a platform to run a Spark environment on, it offers options to use cross-platform APIs that allow us to write code in Scala, Python, R, and SQL within the same notebook. As with most things in life, not everything is equal and there are potential differences in performance between them. In this blog, I will explain the tests I produced with the aim of outlining best practice for Databricks implementations for UDFs of this nature. Scala is the native language for Spark – and without going into too much detail here, it will compile down faster to the JVM for processing. Under the hood, Python on the other hand provides a wrapper around the code but in reality is a Scala program telling the cluster what to do, and being transformed by Scala code. Converting these objects into a form Python can read is called serialisation / deserialisation, and its expensive, especially over time and across a distributed dataset. This most expensive scenario occurs through UDFs (functions) – the runtime process for which can be seen below. The overhead here is in (4) and (5) to read the data and write into JVM memory. Using Scala to create the UDFs, the execution process can skip these steps and keep everything native. Scala UDFs operate within the JVM of the executor so we can skip serialisation and deserialisation.   Experiments As part of my data for this task I took a list of company names from a data set and then run them through a process to codify them, essentially stripping out characters which cause them to be unique and converting them to upper case, thus grouping a set of companies together under the same name. For instance Adatis, Adatis Ltd, and Adatis (Ltd) would become ADATIS. This was an example of a typical cleansing activity when working with data sets. The dataset in question was around 2.5GB and contained 10.5m rows. The cluster I used was Databricks runtime 4.2 (Spark 2.3.1 / Scala 2.11) with Standard_DS2_v2 VMs for the driver/worker nodes (14GB memory) with autoscaling disabled and limited to 2 workers. I disabled the autoscaling for this as I was seeing wildly inconsistent timings each run which impacted the tests. The goods news is that with it enabled and using up to 8 workers, the timings were about 20% faster albeit more erratic from a standard deviation point of view. The following approaches were tested: Scala program calls Scala UDF via Function Scala program calls Scala UDF via SQL Python program calls Scala UDF via SQL Python program calls Python UDF via Function Python program calls Python Vectorised UDF via Function Python program uses SQL While it was true in previous versions of Spark that there was a difference between these using Scala/Python, in the latest version of Spark (2.3) it is believed to be more of a level playing field by using Apache Arrow in the form of Vectorised Pandas UDFs within Python. As part of the tests I also wanted to use Python to call a Scala UDF via a function but unfortunately we cannot do this without creating a Jar file of the Scala code and importing it separately. This would be done via SBT (build tool) using the following guide here. I considered this too much of an overhead for the purposes of the experiment. The following code was then used as part of a Databricks notebook to define the tests. A custom function to time the write was required for Scala whereas Python allows us to use %timeit for a similar purpose.   Scala program calls Scala UDF via Function // Scala program calls Scala UDF via Function %scala def codifyScalaUdf = udf((string: String) => string.toUpperCase.replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP","")) spark.udf.register("ScalaUdf", codifyScalaUdf) val transformedScalaDf = table("DataTable").select(codifyScalaUdf($"CompanyName").alias("CompanyName")) val ssfTime = timeIt(transformedScalaDf.write.mode("overwrite").format("parquet").saveAsTable("SSF"))   Scala program calls Scala UDF via SQL // Scala program calls Scala UDF via SQL %scala val sss = spark.sql("SELECT ScalaUdf(CompanyName) as a from DataTable where CompanyName is not null") val sssTime = timeIt(sss.write.mode("overwrite").format("parquet").saveAsTable("SSS"))   Python program calls Scala UDF via SQL # Python program calls Scala UDF via SQL pss = spark.sql("SELECT ScalaUdf(CompanyName) as a from DataTable where CompanyName is not null") %timeit -r 1 pss.write.format("parquet").saveAsTable("PSS", mode='overwrite')   Python program calls Python UDF via Function # Python program calls Python UDF via Function from pyspark.sql.functions import * from pyspark.sql.types import StringType @udf(StringType()) def pythonCodifyUDF(string): return (string.upper().replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP","")) pyDF = df.select(pythonCodifyUDF(col("CompanyName")).alias("CompanyName")).filter(col("CompanyName").isNotNull()) %timeit -r 1 pyDF.write.format("parquet").saveAsTable("PPF", mode='overwrite')   Python program calls Python Vectorised UDF via Function # Python program calls Python Vectorised UDF via Function from pyspark.sql.types import StringType from pyspark.sql.functions import pandas_udf, col @pandas_udf(returnType=StringType()) def pythonCodifyVecUDF(string): return (string.replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP","")).str.upper() pyVecDF = df.select(pythonCodifyVecUDF(col("CompanyName")).alias("CompanyName")).filter(col("CompanyName").isNotNull()) %timeit -r 1 pyVecDF.write.format("parquet").saveAsTable("PVF", mode='overwrite')   Python Program uses SQL # Python Program uses SQL sql = spark.sql("SELECT upper(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(CompanyName,' ',''),'&',''),';',''),'#',''),' AND ',''),' THE ',''),'LTD',''),'LIMITED',''),'PLC',''),'.',''),',',''),'[',''),']',''),'LLP',''),'INC',''),'CORP','')) as a from DataTable where CompanyName is not null") %timeit -r 1 sql.write.format("parquet").saveAsTable("SQL", mode='overwrite')   Results and Observations It was interesting to note the following: The hypothesis above does indeed hold true and the 2 methods which were expected to be slowest were within the experiment, and by a considerable margin. The Scala UDF performs consistently regardless of the method used to call the UDF. The Python vectorised UDF now performs on par with the Scala UDFs and there is a clear difference between the vectorised and non-vectorised Python UDFs. The standard deviation for the vectorised UDF was surprisingly low and the method was performing consistently each run. The non-vectorised Python UDF was the opposite. To summarise, moving forward – as long as you adopt to writing your UDFs in Scala or use the vectorised version of the Python UDF, the performance will be similar for this type of activity. Its worth noting to definitely avoid writing the UDFs as standard Python functions due to the theory and results above. Over time, across a complete solution and with more data, this time would add up.

Embedding Databricks Notebooks into a BlogEngine.NET post

Databricks is a buzzword now. This means that each day more and more related content appears on the net. With Databricks Notebooks it’s so easy to share code. If by any chance you need to share a notebook directly in your blog post here are some short guidelines on how to do so. If your favourite blog engine appears to be BlogEngine.NET it’s not so straightforward task. Fear not – following are the steps you need to take: Export your notebook to HTML from the Databricks portal: Make the following text replacements in the exported HTML: replace & with &amp; and “ with &quot; Paste the following code in the blog’s source replacing both the [[HEIGHT]] placeholders with the notebook’s height in pixels (you may need a little trial and error to get to the exact value so that the vertical scrollers disappear) and [[NOTEBOOK_HTML]] placeholder with the resulting HTML from the above point: <iframe height="[[HEIGHT]]px" frameborder="0" scrolling="no" width="100%" style="width: 100%; height: [[HEIGHT]]px;" sandbox="allow-forms allow-pointer-lock allow-popups allow-presentation allow-same-origin allow-scripts allow-top-navigation" srcdoc="[[NOTEBOOK_HTML]]"></iframe> To make everything compatible with Internet Explorer and Edge, at the very end of your blog script place between <script></script> tags the wonderful srcdoc-polyfill script Here is an example notebook with the above steps applied: Blog2 - Databricks window.settings = {"enableUsageDeliveryConfiguration":false,"enableNotebookNotifications":true,"enableSshKeyUI":false,"defaultInteractivePricePerDBU":0.55,"enableDynamicAutoCompleteResourceLoading":false,"enableClusterMetricsUI":false,"allowWhitelistedIframeDomains":true,"enableOnDemandClusterType":true,"enableAutoCompleteAsYouType":[],"devTierName":"Community Edition","enableJobsPrefetching":true,"workspaceFeaturedLinks":[{"linkURI":"https://docs.azuredatabricks.net/index.html","displayName":"Documentation","icon":"question"},{"linkURI":"https://docs.azuredatabricks.net/release-notes/product/index.html","displayName":"Release Notes","icon":"code"},{"linkURI":"https://docs.azuredatabricks.net/spark/latest/training/index.html","displayName":"Training & Tutorials","icon":"graduation-cap"}],"enableReservoirTableUI":true,"enableClearStateFeature":true,"dbcForumURL":"http://forums.databricks.com/","enableProtoClusterInfoDeltaPublisher":true,"enableAttachExistingCluster":true,"sandboxForSandboxFrame":"allow-scripts allow-popups allow-popups-to-escape-sandbox allow-forms","resetJobListOnConnect":true,"serverlessDefaultSparkVersion":"latest-stable-scala2.11","maxCustomTags":8,"serverlessDefaultMaxWorkers":8,"enableInstanceProfilesUIInJobs":true,"nodeInfo":{"node_types":[{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":7284,"instance_type_id":"Standard_DS3_v2","node_type_id":"Standard_DS3_v2","description":"Standard_DS3_v2","support_cluster_tags":true,"container_memory_mb":9105,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":342},"node_instance_type":{"instance_type_id":"Standard_DS3_v2","provider":"Azure","local_disk_size_gb":28,"supports_accelerated_networking":true,"compute_units":4.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":14336,"num_cores":4,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":14336,"is_hidden":false,"category":"General Purpose","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":18409,"instance_type_id":"Standard_DS4_v2","node_type_id":"Standard_DS4_v2","description":"Standard_DS4_v2","support_cluster_tags":true,"container_memory_mb":23011,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":342},"node_instance_type":{"instance_type_id":"Standard_DS4_v2","provider":"Azure","local_disk_size_gb":56,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":28672,"num_cores":8,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":28672,"is_hidden":false,"category":"General Purpose","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":40658,"instance_type_id":"Standard_DS5_v2","node_type_id":"Standard_DS5_v2","description":"Standard_DS5_v2","support_cluster_tags":true,"container_memory_mb":50823,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":342},"node_instance_type":{"instance_type_id":"Standard_DS5_v2","provider":"Azure","local_disk_size_gb":112,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":57344,"num_cores":16,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":57344,"is_hidden":false,"category":"General Purpose","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":21587,"instance_type_id":"Standard_D8s_v3","node_type_id":"Standard_D8s_v3","description":"Standard_D8s_v3","support_cluster_tags":true,"container_memory_mb":26984,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":348},"node_instance_type":{"instance_type_id":"Standard_D8s_v3","provider":"Azure","local_disk_size_gb":64,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":32768,"num_cores":8,"cpu_quota_type":"Standard DSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":32768,"is_hidden":false,"category":"General Purpose","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":47015,"instance_type_id":"Standard_D16s_v3","node_type_id":"Standard_D16s_v3","description":"Standard_D16s_v3","support_cluster_tags":true,"container_memory_mb":58769,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":348},"node_instance_type":{"instance_type_id":"Standard_D16s_v3","provider":"Azure","local_disk_size_gb":128,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":65536,"num_cores":16,"cpu_quota_type":"Standard DSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":65536,"is_hidden":false,"category":"General Purpose","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":97871,"instance_type_id":"Standard_D32s_v3","node_type_id":"Standard_D32s_v3","description":"Standard_D32s_v3","support_cluster_tags":true,"container_memory_mb":122339,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":348},"node_instance_type":{"instance_type_id":"Standard_D32s_v3","provider":"Azure","local_disk_size_gb":256,"supports_accelerated_networking":true,"compute_units":32.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":131072,"num_cores":32,"cpu_quota_type":"Standard DSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":131072,"is_hidden":false,"category":"General Purpose","num_cores":32.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":199583,"instance_type_id":"Standard_D64s_v3","node_type_id":"Standard_D64s_v3","description":"Standard_D64s_v3","support_cluster_tags":true,"container_memory_mb":249479,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":348},"node_instance_type":{"instance_type_id":"Standard_D64s_v3","provider":"Azure","local_disk_size_gb":512,"supports_accelerated_networking":true,"compute_units":64.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":262144,"num_cores":64,"cpu_quota_type":"Standard DSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":262144,"is_hidden":false,"category":"General Purpose","num_cores":64.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":7284,"instance_type_id":"Standard_D3_v2","node_type_id":"Standard_D3_v2","description":"Standard_D3_v2","support_cluster_tags":true,"container_memory_mb":9105,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D3_v2","provider":"Azure","local_disk_size_gb":200,"supports_accelerated_networking":true,"compute_units":4.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":14336,"num_cores":4,"cpu_quota_type":"Standard Dv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":14336,"is_hidden":false,"category":"General Purpose (HDD)","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":21587,"instance_type_id":"Standard_D8_v3","node_type_id":"Standard_D8_v3","description":"Standard_D8_v3","support_cluster_tags":true,"container_memory_mb":26984,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D8_v3","provider":"Azure","local_disk_size_gb":200,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":32768,"num_cores":8,"cpu_quota_type":"Standard Dv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":32768,"is_hidden":false,"category":"General Purpose (HDD)","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":47015,"instance_type_id":"Standard_D16_v3","node_type_id":"Standard_D16_v3","description":"Standard_D16_v3","support_cluster_tags":true,"container_memory_mb":58769,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D16_v3","provider":"Azure","local_disk_size_gb":400,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":65536,"num_cores":16,"cpu_quota_type":"Standard Dv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":65536,"is_hidden":false,"category":"General Purpose (HDD)","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":97871,"instance_type_id":"Standard_D32_v3","node_type_id":"Standard_D32_v3","description":"Standard_D32_v3","support_cluster_tags":true,"container_memory_mb":122339,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D32_v3","provider":"Azure","local_disk_size_gb":800,"supports_accelerated_networking":true,"compute_units":32.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":131072,"num_cores":32,"cpu_quota_type":"Standard Dv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":131072,"is_hidden":false,"category":"General Purpose (HDD)","num_cores":32.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":199583,"instance_type_id":"Standard_D64_v3","node_type_id":"Standard_D64_v3","description":"Standard_D64_v3","support_cluster_tags":true,"container_memory_mb":249479,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D64_v3","provider":"Azure","local_disk_size_gb":1600,"supports_accelerated_networking":true,"compute_units":64.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":262144,"num_cores":64,"cpu_quota_type":"Standard Dv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":262144,"is_hidden":false,"category":"General Purpose (HDD)","num_cores":64.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":18409,"instance_type_id":"Standard_D12_v2","node_type_id":"Standard_D12_v2","description":"Standard_D12_v2","support_cluster_tags":true,"container_memory_mb":23011,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D12_v2","provider":"Azure","local_disk_size_gb":200,"supports_accelerated_networking":true,"compute_units":4.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":28672,"num_cores":4,"cpu_quota_type":"Standard Dv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":28672,"is_hidden":false,"category":"Memory Optimized (Remote HDD)","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":40658,"instance_type_id":"Standard_D13_v2","node_type_id":"Standard_D13_v2","description":"Standard_D13_v2","support_cluster_tags":true,"container_memory_mb":50823,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D13_v2","provider":"Azure","local_disk_size_gb":400,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":57344,"num_cores":8,"cpu_quota_type":"Standard Dv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":57344,"is_hidden":false,"category":"Memory Optimized (Remote HDD)","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":85157,"instance_type_id":"Standard_D14_v2","node_type_id":"Standard_D14_v2","description":"Standard_D14_v2","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D14_v2","provider":"Azure","local_disk_size_gb":800,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":114688,"num_cores":16,"cpu_quota_type":"Standard Dv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"Memory Optimized (Remote HDD)","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":107407,"instance_type_id":"Standard_D15_v2","node_type_id":"Standard_D15_v2","description":"Standard_D15_v2","support_cluster_tags":true,"container_memory_mb":134259,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_D15_v2","provider":"Azure","local_disk_size_gb":1000,"supports_accelerated_networking":true,"compute_units":20.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":143360,"num_cores":20,"cpu_quota_type":"Standard Dv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":143360,"is_hidden":false,"category":"Memory Optimized (Remote HDD)","num_cores":20.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":18409,"instance_type_id":"Standard_DS12_v2","node_type_id":"Standard_DS12_v2","description":"Standard_DS12_v2","support_cluster_tags":true,"container_memory_mb":23011,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":342},"node_instance_type":{"instance_type_id":"Standard_DS12_v2","provider":"Azure","local_disk_size_gb":56,"supports_accelerated_networking":true,"compute_units":4.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":28672,"num_cores":4,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":28672,"is_hidden":false,"category":"Memory Optimized","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":40658,"instance_type_id":"Standard_DS13_v2","node_type_id":"Standard_DS13_v2","description":"Standard_DS13_v2","support_cluster_tags":true,"container_memory_mb":50823,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":342},"node_instance_type":{"instance_type_id":"Standard_DS13_v2","provider":"Azure","local_disk_size_gb":112,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":57344,"num_cores":8,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":57344,"is_hidden":false,"category":"Memory Optimized","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":85157,"instance_type_id":"Standard_DS14_v2","node_type_id":"Standard_DS14_v2","description":"Standard_DS14_v2","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":342},"node_instance_type":{"instance_type_id":"Standard_DS14_v2","provider":"Azure","local_disk_size_gb":224,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":114688,"num_cores":16,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"Memory Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":107407,"instance_type_id":"Standard_DS15_v2","node_type_id":"Standard_DS15_v2","description":"Standard_DS15_v2","support_cluster_tags":true,"container_memory_mb":134259,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":342},"node_instance_type":{"instance_type_id":"Standard_DS15_v2","provider":"Azure","local_disk_size_gb":280,"supports_accelerated_networking":true,"compute_units":20.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":143360,"num_cores":20,"cpu_quota_type":"Standard DSv2 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":143360,"is_hidden":false,"category":"Memory Optimized","num_cores":20.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":47015,"instance_type_id":"Standard_E8s_v3","node_type_id":"Standard_E8s_v3","description":"Standard_E8s_v3","support_cluster_tags":true,"container_memory_mb":58769,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_E8s_v3","provider":"Azure","local_disk_size_gb":128,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":65536,"num_cores":8,"cpu_quota_type":"Standard ESv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":65536,"is_hidden":false,"category":"Memory Optimized","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":97871,"instance_type_id":"Standard_E16s_v3","node_type_id":"Standard_E16s_v3","description":"Standard_E16s_v3","support_cluster_tags":true,"container_memory_mb":122339,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_E16s_v3","provider":"Azure","local_disk_size_gb":256,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":131072,"num_cores":16,"cpu_quota_type":"Standard ESv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":131072,"is_hidden":false,"category":"Memory Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":199583,"instance_type_id":"Standard_E32s_v3","node_type_id":"Standard_E32s_v3","description":"Standard_E32s_v3","support_cluster_tags":true,"container_memory_mb":249479,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_E32s_v3","provider":"Azure","local_disk_size_gb":512,"supports_accelerated_networking":true,"compute_units":32.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":262144,"num_cores":32,"cpu_quota_type":"Standard ESv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":262144,"is_hidden":false,"category":"Memory Optimized","num_cores":32.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":21587,"instance_type_id":"Standard_L4s","node_type_id":"Standard_L4s","description":"Standard_L4s","support_cluster_tags":true,"container_memory_mb":26984,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_L4s","provider":"Azure","local_disk_size_gb":678,"supports_accelerated_networking":false,"compute_units":4.0,"number_of_ips":2,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":32768,"num_cores":4,"cpu_quota_type":"Standard LS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":32768,"is_hidden":false,"category":"Storage Optimized","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":47015,"instance_type_id":"Standard_L8s","node_type_id":"Standard_L8s","description":"Standard_L8s","support_cluster_tags":true,"container_memory_mb":58769,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_L8s","provider":"Azure","local_disk_size_gb":1388,"supports_accelerated_networking":false,"compute_units":8.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":65536,"num_cores":8,"cpu_quota_type":"Standard LS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":65536,"is_hidden":false,"category":"Storage Optimized","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":97871,"instance_type_id":"Standard_L16s","node_type_id":"Standard_L16s","description":"Standard_L16s","support_cluster_tags":true,"container_memory_mb":122339,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_L16s","provider":"Azure","local_disk_size_gb":2807,"supports_accelerated_networking":false,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":131072,"num_cores":16,"cpu_quota_type":"Standard LS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":131072,"is_hidden":false,"category":"Storage Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":199583,"instance_type_id":"Standard_L32s","node_type_id":"Standard_L32s","description":"Standard_L32s","support_cluster_tags":true,"container_memory_mb":249479,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_L32s","provider":"Azure","local_disk_size_gb":5630,"supports_accelerated_networking":false,"compute_units":32.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":262144,"num_cores":32,"cpu_quota_type":"Standard LS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":262144,"is_hidden":false,"category":"Storage Optimized","num_cores":32.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":2516,"instance_type_id":"Standard_F4s","node_type_id":"Standard_F4s","description":"Standard_F4s","support_cluster_tags":true,"container_memory_mb":3146,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_F4s","provider":"Azure","local_disk_size_gb":16,"supports_accelerated_networking":true,"compute_units":4.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":8192,"num_cores":4,"cpu_quota_type":"Standard FS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":8192,"is_hidden":false,"category":"Compute Optimized","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":8873,"instance_type_id":"Standard_F8s","node_type_id":"Standard_F8s","description":"Standard_F8s","support_cluster_tags":true,"container_memory_mb":11092,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_F8s","provider":"Azure","local_disk_size_gb":32,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":16384,"num_cores":8,"cpu_quota_type":"Standard FS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":16384,"is_hidden":false,"category":"Compute Optimized","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":21587,"instance_type_id":"Standard_F16s","node_type_id":"Standard_F16s","description":"Standard_F16s","support_cluster_tags":true,"container_memory_mb":26984,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_F16s","provider":"Azure","local_disk_size_gb":64,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":16,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":32768,"num_cores":16,"cpu_quota_type":"Standard FS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":32768,"is_hidden":false,"category":"Compute Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":85157,"instance_type_id":"Standard_H16","node_type_id":"Standard_H16","description":"Standard_H16","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":8},"node_instance_type":{"instance_type_id":"Standard_H16","provider":"Azure","local_disk_size_gb":2000,"supports_accelerated_networking":false,"compute_units":16.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":114688,"num_cores":16,"cpu_quota_type":"Standard H Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"Compute Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":2,"spark_heap_memory":85157,"instance_type_id":"Standard_NC12","node_type_id":"Standard_NC12","description":"Standard_NC12 (beta)","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":48},"node_instance_type":{"instance_type_id":"Standard_NC12","provider":"Azure","local_disk_size_gb":680,"supports_accelerated_networking":false,"compute_units":12.0,"number_of_ips":2,"local_disks":1,"reserved_compute_units":1.0,"gpus":2,"memory_mb":114688,"num_cores":12,"cpu_quota_type":"Standard NC Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":48,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"GPU Accelerated","num_cores":12.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":4,"spark_heap_memory":174155,"instance_type_id":"Standard_NC24","node_type_id":"Standard_NC24","description":"Standard_NC24 (beta)","support_cluster_tags":true,"container_memory_mb":217694,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":48},"node_instance_type":{"instance_type_id":"Standard_NC24","provider":"Azure","local_disk_size_gb":1440,"supports_accelerated_networking":false,"compute_units":24.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":4,"memory_mb":229376,"num_cores":24,"cpu_quota_type":"Standard NC Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":229376,"is_hidden":false,"category":"GPU Accelerated","num_cores":24.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":1,"spark_heap_memory":85157,"instance_type_id":"Standard_NC6s_v3","node_type_id":"Standard_NC6s_v3","description":"Standard_NC6s_v3 (beta)","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":0},"node_instance_type":{"instance_type_id":"Standard_NC6s_v3","provider":"Azure","local_disk_size_gb":736,"supports_accelerated_networking":false,"compute_units":6.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":1,"memory_mb":114688,"num_cores":6,"cpu_quota_type":"Standard NCSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":12,"supported_disk_types":[{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"GPU Accelerated","num_cores":6.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":2,"spark_heap_memory":174155,"instance_type_id":"Standard_NC12s_v3","node_type_id":"Standard_NC12s_v3","description":"Standard_NC12s_v3 (beta)","support_cluster_tags":true,"container_memory_mb":217694,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":0},"node_instance_type":{"instance_type_id":"Standard_NC12s_v3","provider":"Azure","local_disk_size_gb":1474,"supports_accelerated_networking":false,"compute_units":12.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":2,"memory_mb":229376,"num_cores":12,"cpu_quota_type":"Standard NCSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":24,"supported_disk_types":[{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":229376,"is_hidden":false,"category":"GPU Accelerated","num_cores":12.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":4,"spark_heap_memory":352151,"instance_type_id":"Standard_NC24s_v3","node_type_id":"Standard_NC24s_v3","description":"Standard_NC24s_v3 (beta)","support_cluster_tags":true,"container_memory_mb":440189,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":0},"node_instance_type":{"instance_type_id":"Standard_NC24s_v3","provider":"Azure","local_disk_size_gb":2948,"supports_accelerated_networking":false,"compute_units":24.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":4,"memory_mb":458752,"num_cores":24,"cpu_quota_type":"Standard NCSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":458752,"is_hidden":false,"category":"GPU Accelerated","num_cores":24.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false}],"default_node_type_id":"Standard_DS3_v2"},"enableDatabaseSupportClusterChoice":true,"enableClusterAcls":true,"notebookRevisionVisibilityHorizon":0,"serverlessClusterProductName":"Serverless Pool","showS3TableImportOption":false,"redirectBrowserOnWorkspaceSelection":true,"maxEbsVolumesPerInstance":10,"enableRStudioUI":false,"isAdmin":true,"deltaProcessingBatchSize":1000,"timerUpdateQueueLength":100,"sqlAclsEnabledMap":{"spark.databricks.acl.enabled":"true","spark.databricks.acl.sqlOnly":"true"},"enableLargeResultDownload":true,"maxElasticDiskCapacityGB":5000,"serverlessDefaultMinWorkers":2,"zoneInfos":[],"enableCustomSpotPricingUIByTier":true,"serverlessClustersEnabled":true,"enableWorkspaceBrowserSorting":true,"enableSentryLogging":false,"enableFindAndReplace":true,"disallowUrlImportExceptFromDocs":false,"defaultStandardClusterModel":{"cluster_name":"","node_type_id":"Standard_DS3_v2","spark_version":"4.0.x-scala2.11","num_workers":null,"autoscale":{"min_workers":2,"max_workers":8},"autotermination_minutes":120,"enable_elastic_disk":false,"default_tags":{"Vendor":"Databricks","Creator":"ivv@adatis.co.uk","ClusterName":null,"ClusterId":""}},"enableEBSVolumesUIForJobs":true,"enablePublishNotebooks":false,"enableBitbucketCloud":true,"shouldShowCommandStatus":true,"createTableInNotebookS3Link":{"url":"https://docs.azuredatabricks.net/_static/notebooks/data-import/s3.html","displayName":"S3","workspaceFileName":"S3 Example"},"sanitizeHtmlResult":true,"enableClusterPinningUI":true,"enableJobAclsConfig":false,"enableFullTextSearch":false,"enableElasticSparkUI":true,"enableNewClustersCreate":true,"clusters":true,"allowRunOnPendingClusters":true,"useAutoscalingByDefault":true,"enableAzureToolbar":true,"enableRequireClusterSettingsUI":true,"fileStoreBase":"FileStore","enableEmailInAzure":true,"enableRLibraries":true,"enableTableAclsConfig":false,"enableSshKeyUIInJobs":true,"enableDetachAndAttachSubMenu":true,"configurableSparkOptionsSpec":[{"keyPattern":"spark\\.kryo(\\.[^\\.]+)+","valuePattern":".*","keyPatternDisplay":"spark.kryo.*","valuePatternDisplay":"*","description":"Configuration options for Kryo serialization"},{"keyPattern":"spark\\.io\\.compression\\.codec","valuePattern":"(lzf|snappy|org\\.apache\\.spark\\.io\\.LZFCompressionCodec|org\\.apache\\.spark\\.io\\.SnappyCompressionCodec)","keyPatternDisplay":"spark.io.compression.codec","valuePatternDisplay":"snappy|lzf","description":"The codec used to compress internal data such as RDD partitions, broadcast variables and shuffle outputs."},{"keyPattern":"spark\\.serializer","valuePattern":"(org\\.apache\\.spark\\.serializer\\.JavaSerializer|org\\.apache\\.spark\\.serializer\\.KryoSerializer)","keyPatternDisplay":"spark.serializer","valuePatternDisplay":"org.apache.spark.serializer.JavaSerializer|org.apache.spark.serializer.KryoSerializer","description":"Class to use for serializing objects that will be sent over the network or need to be cached in serialized form."},{"keyPattern":"spark\\.rdd\\.compress","valuePattern":"(true|false)","keyPatternDisplay":"spark.rdd.compress","valuePatternDisplay":"true|false","description":"Whether to compress serialized RDD partitions (e.g. for StorageLevel.MEMORY_ONLY_SER). Can save substantial space at the cost of some extra CPU time."},{"keyPattern":"spark\\.speculation","valuePattern":"(true|false)","keyPatternDisplay":"spark.speculation","valuePatternDisplay":"true|false","description":"Whether to use speculation (recommended off for streaming)"},{"keyPattern":"spark\\.es(\\.[^\\.]+)+","valuePattern":".*","keyPatternDisplay":"spark.es.*","valuePatternDisplay":"*","description":"Configuration options for ElasticSearch"},{"keyPattern":"es(\\.([^\\.]+))+","valuePattern":".*","keyPatternDisplay":"es.*","valuePatternDisplay":"*","description":"Configuration options for ElasticSearch"},{"keyPattern":"spark\\.(storage|shuffle)\\.memoryFraction","valuePattern":"0?\\.0*([1-9])([0-9])*","keyPatternDisplay":"spark.(storage|shuffle).memoryFraction","valuePatternDisplay":"(0.0,1.0)","description":"Fraction of Java heap to use for Spark's shuffle or storage"},{"keyPattern":"spark\\.streaming\\.backpressure\\.enabled","valuePattern":"(true|false)","keyPatternDisplay":"spark.streaming.backpressure.enabled","valuePatternDisplay":"true|false","description":"Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5). This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values `spark.streaming.receiver.maxRate` and `spark.streaming.kafka.maxRatePerPartition` if they are set."},{"keyPattern":"spark\\.streaming\\.receiver\\.maxRate","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.receiver.maxRate","valuePatternDisplay":"numeric","description":"Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no limit on the rate. See the deployment guide in the Spark Streaming programing guide for mode details."},{"keyPattern":"spark\\.streaming\\.kafka\\.maxRatePerPartition","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.kafka.maxRatePerPartition","valuePatternDisplay":"numeric","description":"Maximum rate (number of records per second) at which data will be read from each Kafka partition when using the Kafka direct stream API introduced in Spark 1.3. See the Kafka Integration guide for more details."},{"keyPattern":"spark\\.streaming\\.kafka\\.maxRetries","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.kafka.maxRetries","valuePatternDisplay":"numeric","description":"Maximum number of consecutive retries the driver will make in order to find the latest offsets on the leader of each partition (a default value of 1 means that the driver will make a maximum of 2 attempts). Only applies to the Kafka direct stream API introduced in Spark 1.3."},{"keyPattern":"spark\\.streaming\\.ui\\.retainedBatches","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.ui.retainedBatches","valuePatternDisplay":"numeric","description":"How many batches the Spark Streaming UI and status APIs remember before garbage collecting."}],"enableReactNotebookComments":true,"enableAdminPasswordReset":false,"checkBeforeAddingAadUser":true,"enableResetPassword":true,"maxClusterTagValueLength":256,"enableJobsSparkUpgrade":true,"createTableInNotebookDBFSLink":{"url":"https://docs.azuredatabricks.net/_static/notebooks/data-import/dbfs.html","displayName":"DBFS","workspaceFileName":"DBFS Example"},"perClusterAutoterminationEnabled":true,"enableNotebookCommandNumbers":true,"measureRoundTripTimes":true,"allowStyleInSanitizedHtml":false,"sparkVersions":[{"key":"3.3.x-scala2.10","displayName":"3.3 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-86a9b375074f5afad339e70230ec0ec265c4cefbd280844785fab3bcde5869f9","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"4.1.x-scala2.11","displayName":"4.1 (includes Apache Spark 2.3.0, Scala 2.11)","packageLabel":"spark-image-f439d0c801506f0362ffed2e4541799ffeb433963a5b449822b3878a3751bcf8","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS","SUPPORTS_RSTUDIO"]},{"key":"4.1.x-ml-gpu-scala2.11","displayName":"4.1 ML Beta (includes Apache Spark 2.3.0, GPU, Scala 2.11)","packageLabel":"spark-image-5907529b625e97ac8feb0a069002b4fdb861a16740752b5df568fe4efb1c004e","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":[]},{"key":"4.0.x-scala2.11","displayName":"4.0 (includes Apache Spark 2.3.0, Scala 2.11)","packageLabel":"spark-image-f55a02a6e4d4df4eae4ac60d16781794f2061a3e86e0fc2c9697f69fd8211a22","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"3.4.x-scala2.11","displayName":"3.4 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-ae9f3e3a4d6c99f1d9704425a13db5561c44c9ec0b9a45121cc9c7e8036e4051","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION"]},{"key":"3.2.x-scala2.10","displayName":"3.2 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-557788bea0eea16bbf7a8ba13ace07e64dd7fc86270bd5cea086097fe886431f","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"4.1.x-ml-scala2.11","displayName":"4.1 ML Beta (includes Apache Spark 2.3.0, Scala 2.11)","packageLabel":"spark-image-ad599fbbca53898d7531a4b94c73f3e68b3c2e49e3502c09f6bf01468d801882","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":[]},{"key":"latest-experimental-scala2.10","displayName":"[DO NOT USE] Latest experimental (3.5 snapshot, Scala 2.10)","packageLabel":"spark-image-5e4f1f2feb631875a6036dffb069ec14b436939b5efe0ecb3ff8220c835298d6","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-rc-scala2.11","displayName":"Latest RC (4.2 snapshot, Scala 2.11)","packageLabel":"spark-image-0592d44fb91741b05c314304a91661216c672f0424e342fe8a2a0b2c794d3cd0","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS","SUPPORTS_RSTUDIO"]},{"key":"latest-stable-scala2.11","displayName":"Latest stable (Scala 2.11)","packageLabel":"spark-image-f439d0c801506f0362ffed2e4541799ffeb433963a5b449822b3878a3751bcf8","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS","SUPPORTS_RSTUDIO"]},{"key":"4.1.x-gpu-scala2.11","displayName":"4.1 (includes Apache Spark 2.3.0, GPU, Scala 2.11)","packageLabel":"spark-image-aea6646f13f50000babc2af707cc31c43b11e68e7f452eb25c7dfd73ad24d705","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_RSTUDIO"]},{"key":"3.5.x-scala2.10","displayName":"3.5 LTS (includes Apache Spark 2.2.1, Scala 2.10)","packageLabel":"spark-image-9fa5a2faf500af1478f5d9e67ccbde7f9342a5c91a3b6260c974bc15e922928d","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-rc-scala2.10","displayName":"[DO NOT USE] Latest RC (3.5 snapshot, Scala 2.10)","packageLabel":"spark-image-5e4f1f2feb631875a6036dffb069ec14b436939b5efe0ecb3ff8220c835298d6","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-stable-scala2.10","displayName":"[DEPRECATED] Latest stable (Scala 2.10)","packageLabel":"spark-image-5e4f1f2feb631875a6036dffb069ec14b436939b5efe0ecb3ff8220c835298d6","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-experimental-gpu-scala2.11","displayName":"Latest experimental (4.2 snapshot, GPU, Scala 2.11)","packageLabel":"spark-image-dfabb78470bd1abaaf209f0f6397e0da699a115933f76bea620aa664f8667e7b","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_RSTUDIO"]},{"key":"3.1.x-scala2.11","displayName":"3.1 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-241fa8b78ee6343242b1756b18076270894385ff40a81172a6fb5eadf66155d3","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"3.1.x-scala2.10","displayName":"3.1 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-7efac6b9a8f2da59cb4f6d0caac46cfcb3f1ebf64c8073498c42d0360f846714","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"3.3.x-scala2.11","displayName":"3.3 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-46cc39a9afa43fbd7bfa9f4f5ed8d23f658cd0b0d74208627243222ae0d22f8d","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"3.5.x-scala2.11","displayName":"3.5 LTS (includes Apache Spark 2.2.1, Scala 2.11)","packageLabel":"spark-image-6f5628e4962f9e1963e43a5acfbc89a3e32ef15d3da0462bca09a985bae2fe14","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-experimental-scala2.11","displayName":"Latest experimental (4.2 snapshot, Scala 2.11)","packageLabel":"spark-image-0592d44fb91741b05c314304a91661216c672f0424e342fe8a2a0b2c794d3cd0","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS","SUPPORTS_RSTUDIO"]},{"key":"3.2.x-scala2.11","displayName":"3.2 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-5537926238bc55cb6cd76ee0f0789511349abead3781c4780721a845f34b5d4e","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"latest-rc-gpu-scala2.11","displayName":"Latest RC (4.2 snapshot, GPU, Scala 2.11)","packageLabel":"spark-image-ed2636c4e44f9d4ce48b6ea1b4de7d172e33d9971f82700788751efee16d6bd1","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_RSTUDIO"]},{"key":"3.4.x-scala2.10","displayName":"3.4 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-f28e110ac5f9efcb3f48d812bf34961f4fe17336d732e33de5ffa64f269ed59a","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION"]}],"enablePresentationMode":false,"enableClearStateAndRunAll":true,"enableTableAclsByTier":true,"enableRestrictedClusterCreation":false,"enableFeedback":false,"enableClusterAutoScaling":true,"enableUserVisibleDefaultTags":true,"defaultNumWorkers":8,"serverContinuationTimeoutMillis":10000,"jobsUnreachableThresholdMillis":60000,"driverStderrFilePrefix":"stderr","roundTripReportTimeoutMs":5000,"enableNotebookRefresh":true,"createTableInNotebookImportedFileLink":{"url":"https://docs.azuredatabricks.net/_static/notebooks/data-import/imported-file.html","displayName":"Imported File","workspaceFileName":"Imported File Example"},"accountsOwnerUrl":"https://portal.azure.com/?feature.customportal=false&microsoft_azure_marketplace_ItemHideKey=DatabricksExtensionHidden&Microsoft_Azure_Databricks=true#resource/subscriptions/76dd74d5-e8e7-493d-91dc-d8113ee1f20c/resourceGroups/RGABI/providers/Microsoft.Databricks/workspaces/abiweuadlsdev","driverStdoutFilePrefix":"stdout","showDbuPricing":true,"databricksDocsBaseHostname":"docs.azuredatabricks.net","defaultNodeTypeToPricingUnitsMap":{"Standard_E64s_v3":16,"r3.2xlarge":2,"i3.4xlarge":4,"Standard_NC12s_v2":6.75,"class-node":1,"m4.2xlarge":1.5,"Standard_D11_v2":0.5,"r4.xlarge":1,"m4.4xlarge":3,"p3.2xlarge":4.15,"Standard_DS5_v2":3,"Standard_D2s_v3":0.5,"Standard_DS4_v2_Promo":1.5,"Standard_DS14":4,"Standard_DS11_v2_Promo":0.5,"r4.16xlarge":16,"Standard_NC6":1.5,"Standard_DS11":0.5,"Standard_D2_v3":0.5,"Standard_DS14_v2_Promo":4,"Standard_D64s_v3":12,"p2.8xlarge":9.76,"m4.10xlarge":8,"Standard_D8s_v3":1.5,"Standard_E32s_v3":8,"Standard_DS3":0.75,"Standard_DS2_v2":0.5,"r3.8xlarge":8,"r4.4xlarge":4,"dev-tier-node":1,"Standard_L8s":2,"Standard_D13_v2":2,"p3.16xlarge":33.2,"Standard_NC24rs_v3":20,"Standard_DS13_v2_Promo":2,"Standard_E4s_v3":1,"Standard_D3_v2":0.75,"Standard_NC24":6,"Standard_NC24r":6,"Standard_DS15_v2":5,"Standard_D16s_v3":3,"Standard_D5_v2":3,"Standard_E8s_v3":2,"Standard_DS2_v2_Promo":0.5,"c3.8xlarge":4,"Standard_D4_v3":0.75,"Standard_E2s_v3":0.5,"Standard_D32_v3":6,"Standard_DS3_v2":0.75,"Standard_NC6s_v3":5,"r3.4xlarge":4,"Standard_DS4":1.5,"i2.4xlarge":6,"Standard_DS3_v2_Promo":0.75,"m4.xlarge":0.75,"r4.8xlarge":8,"Standard_D14_v2":4,"Standard_H16":4,"Standard_NC12":3,"Standard_DS14_v2":4,"r4.large":0.5,"Standard_D15_v2":5,"Standard_DS12":1,"development-node":1,"i2.2xlarge":3,"Standard_NC6s_v2":3.38,"g2.8xlarge":6,"Standard_D12_v2":1,"i3.large":0.75,"Standard_NC12s_v3":10,"memory-optimized":1,"m4.large":0.4,"Standard_D16_v3":3,"Standard_F4s":0.5,"p2.16xlarge":19.52,"Standard_NC24rs_v2":13.5,"i3.8xlarge":8,"Standard_D32s_v3":6,"i3.16xlarge":16,"Standard_DS12_v2":1,"Standard_L32s":8,"Standard_D4s_v3":0.75,"Standard_DS13":2,"Standard_DS11_v2":0.5,"Standard_DS12_v2_Promo":1,"Standard_DS13_v2":2,"c3.2xlarge":1,"Standard_L4s":1,"Standard_F16s":2,"c4.2xlarge":1,"Standard_L16s":4,"i2.xlarge":1.5,"Standard_DS2":0.5,"compute-optimized":1,"c4.4xlarge":2,"Standard_DS5_v2_Promo":3,"Standard_D64_v3":12,"Standard_D2_v2":0.5,"Standard_D8_v3":1.5,"i3.2xlarge":2,"Standard_E16s_v3":4,"Standard_F8s":1,"c3.4xlarge":2,"Standard_NC24s_v2":13.5,"Standard_NC24s_v3":20,"Standard_D4_v2":1.5,"g2.2xlarge":1.5,"p3.8xlarge":16.6,"p2.xlarge":1.22,"m4.16xlarge":12,"Standard_DS4_v2":1.5,"c4.8xlarge":4,"i3.xlarge":1,"r3.xlarge":1,"r4.2xlarge":2,"i2.8xlarge":12},"tableFilesBaseFolder":"/tables","enableSparkDocsSearch":true,"sparkHistoryServerEnabled":true,"enableClusterAppsUIOnServerless":false,"enableEBSVolumesUI":true,"minDaysSinceDeletedToPurge":"30 days","homePageWelcomeMessage":"","metastoreServiceRowLimit":1000000,"enableIPythonImportExport":true,"enableClusterTagsUIForJobs":true,"enableClusterTagsUI":true,"enableNotebookHistoryDiffing":true,"branch":"2.73.271","accountsLimit":-1,"enableSparkEnvironmentVariables":true,"enableX509Authentication":false,"useAADLogin":true,"enableStructuredStreamingNbOptimizations":true,"enableNotebookGitBranching":true,"terminatedClustersWindow":2592000000,"local":false,"enableNotebookLazyRenderWrapper":false,"enableClusterAutoScalingForJobs":true,"enableStrongPassword":false,"showReleaseNote":false,"displayDefaultContainerMemoryGB":30,"broadenedEditPermission":false,"enableWorkspacePurgeDryRun":false,"disableS3TableImport":true,"enableArrayParamsEdit":true,"deploymentMode":"production","useSpotForWorkers":true,"removePasswordInAccountSettings":true,"preferStartTerminatedCluster":false,"enableUserInviteWorkflow":true,"createTableConnectorOptionLinks":[{"url":"https://docs.databricks.com/_static/notebooks/data-import/azure-blob-store.html","displayName":"Azure Blob Storage","workspaceFileName":"Azure Blob Storage Import Example Notebook"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/data-import/jdbc.html","displayName":"JDBC","workspaceFileName":"JDBC Example"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/cassandra.html","displayName":"Cassandra","workspaceFileName":"Cassandra Example"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/structured-streaming-etl-kafka.html","displayName":"Kafka","workspaceFileName":"Kafka Example"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/redis.html","displayName":"Redis","workspaceFileName":"Redis Example"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/elasticsearch.html","displayName":"Elasticsearch","workspaceFileName":"Elasticsearch Example"}],"enableStaticNotebooks":true,"enableNewLineChart":true,"shouldReportUnhandledPromiseRejectionsToSentry":false,"sandboxForUrlSandboxFrame":"allow-scripts allow-popups allow-popups-to-escape-sandbox allow-forms","enableCssTransitions":true,"serverlessEnableElasticDisk":true,"minClusterTagKeyLength":1,"showHomepageFeaturedLinks":true,"pricingURL":"https://databricks.com/product/pricing","enableClusterEdit":true,"enableClusterAclsConfig":false,"useTempS3UrlForTableUpload":false,"notifyLastLogin":false,"enableFilePurge":true,"enableSshKeyUIByTier":true,"enableCreateClusterOnAttach":false,"defaultAutomatedPricePerDBU":0.35,"enableNotebookGitVersioning":true,"defaultMinWorkers":2,"commandStatusDebounceMaxWait":1000,"files":"files/","feedbackEmail":"feedback@databricks.com","enableDriverLogsUI":true,"enableExperimentalCharts":false,"defaultMaxWorkers":8,"enableWorkspaceAclsConfig":false,"serverlessRunPythonAsLowPrivilegeUser":false,"dropzoneMaxFileSize":2047,"enableNewClustersList":true,"enableNewDashboardViews":true,"enableJobListPermissionFilter":true,"terminatedInteractiveClustersMax":70,"driverLog4jFilePrefix":"log4j","enableSingleSignOn":false,"enableMavenLibraries":true,"updateTreeTableToV2Schema":false,"displayRowLimit":1000,"notebookTooManyCommandsNotificationLimit":100,"deltaProcessingAsyncEnabled":true,"enableSparkEnvironmentVariablesUI":false,"defaultSparkVersion":{"key":"4.0.x-scala2.11","displayName":"4.0 (includes Apache Spark 2.3.0, Scala 2.11)","packageLabel":"spark-image-f55a02a6e4d4df4eae4ac60d16781794f2061a3e86e0fc2c9697f69fd8211a22","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},"enableNewLineChartParams":false,"deprecatedEnableStructuredDataAcls":false,"enableCustomSpotPricing":true,"enableRStudioFreeUI":false,"enableMountAclsConfig":false,"defaultAutoterminationMin":120,"useDevTierHomePage":false,"disableExportNotebook":false,"enableClusterClone":true,"enableNotebookLineNumbers":true,"enablePublishHub":false,"notebookHubUrl":"http://hub.dev.databricks.com/","commandStatusDebounceInterval":100,"enableTrashFolder":true,"showSqlEndpoints":true,"enableNotebookDatasetInfoView":true,"defaultTagKeys":{"CLUSTER_NAME":"ClusterName","VENDOR":"Vendor","CLUSTER_TYPE":"ResourceClass","CREATOR":"Creator","CLUSTER_ID":"ClusterId"},"enableClusterAclsByTier":true,"databricksDocsBaseUrl":"https://docs.azuredatabricks.net/","azurePortalLink":"https://portal.azure.com","cloud":"Azure","customSparkVersionPrefix":"custom:","disallowAddingAdmins":false,"enableSparkConfUI":true,"enableClusterEventsUI":true,"featureTier":"STANDARD_W_SEC_TIER","mavenCentralSearchEndpoint":"http://search.maven.org/solrsearch/select","defaultServerlessClusterModel":{"cluster_name":"","node_type_id":"Standard_DS13_v2","spark_version":"latest-stable-scala2.11","num_workers":null,"enable_jdbc_auto_start":true,"custom_tags":{"ResourceClass":"Serverless"},"autoscale":{"min_workers":2,"max_workers":8},"spark_conf":{"spark.databricks.cluster.profile":"serverless","spark.databricks.repl.allowedLanguages":"sql,python,r"},"autotermination_minutes":120,"enable_elastic_disk":true,"default_tags":{"Vendor":"Databricks","Creator":"ivv@adatis.co.uk","ClusterName":null,"ClusterId":""}},"enableClearRevisionHistoryForNotebook":true,"enableOrgSwitcherUI":true,"bitbucketCloudBaseApiV2Url":"https://api.bitbucket.org/2.0","clustersLimit":-1,"enableJdbcImport":true,"enableClusterAppsUIOnNormalClusters":false,"enableMergedServerlessUI":false,"enableElasticDisk":true,"logfiles":"logfiles/","enableRelativeNotebookLinks":true,"enableMultiSelect":true,"homePageLogo":"login/DB_Azure_Lockup_2x.png","enableWebappSharding":true,"enableNotebookParamsEdit":true,"enableClusterDeltaUpdates":true,"enableSingleSignOnLogin":false,"separateTableForJobClusters":true,"ebsVolumeSizeLimitGB":{"GENERAL_PURPOSE_SSD":[100,4096],"THROUGHPUT_OPTIMIZED_HDD":[500,4096]},"enableClusterDeleteUI":true,"enableMountAcls":false,"requireEmailUserName":true,"enableRServerless":true,"frameRateReportIntervalMs":10000,"dbcFeedbackURL":"http://feedback.databricks.com/forums/263785-product-feedback","enableMountAclService":true,"showVersion":false,"serverlessClustersByDefault":false,"collectDetailedFrameRateStats":true,"enableWorkspaceAcls":true,"maxClusterTagKeyLength":512,"gitHash":"","clusterTagReservedPrefixes":["azure","microsoft","windows"],"tableAclsEnabledMap":{"spark.databricks.acl.dfAclsEnabled":"true","spark.databricks.repl.allowedLanguages":"python,sql"},"showWorkspaceFeaturedLinks":true,"signupUrl":"","databricksDocsNotebookPathPrefix":"^https://docs\\.azuredatabricks\\.net/_static/notebooks/.+$","serverlessAttachEbsVolumesByDefault":false,"enableTokensConfig":true,"allowFeedbackForumAccess":true,"frameDurationReportThresholdMs":1000,"enablePythonVersionUI":true,"enableImportFromUrl":true,"allowDisplayHtmlByUrl":true,"enableTokens":true,"enableMiniClusters":false,"enableNewJobList":true,"maxPinnedClustersPerOrg":20,"enableDebugUI":false,"enableStreamingMetricsDashboard":true,"allowNonAdminUsers":true,"enableSingleSignOnByTier":true,"enableJobsRetryOnTimeout":true,"loginLogo":"/login/DB_Azure_Lockup_2x.png","useStandardTierUpgradeTooltips":true,"staticNotebookResourceUrl":"https://databricks-prod-cloudfront.cloud.databricks.com/static/d9b6f94329a854d5e4b563a3854b37b2e3a5420ba1b2e8434244d6f80244e6c6/","enableSpotClusterType":true,"enableSparkPackages":true,"checkAadUserInWorkspaceTenant":false,"dynamicSparkVersions":false,"useIframeForHtmlResult":false,"enableClusterTagsUIByTier":true,"enableUserPromptForPendingRpc":true,"enableNotebookHistoryUI":true,"addWhitespaceAfterLastNotebookCell":true,"enableClusterLoggingUI":true,"setDeletedAtForDeletedColumnsOnWebappStart":false,"enableDatabaseDropdownInTableUI":true,"showDebugCounters":false,"enableInstanceProfilesUI":true,"enableFolderHtmlExport":true,"homepageFeaturedLinks":[{"linkURI":"https://docs.azuredatabricks.net/_static/notebooks/azure/gentle-introduction-to-apache-spark-azure.html","displayName":"Introduction to Apache Spark on Databricks","icon":"img/home/Python_icon.svg"},{"linkURI":"https://docs.azuredatabricks.net/_static/notebooks/azure/databricks-for-data-scientists-azure.html","displayName":"Databricks for Data Scientists","icon":"img/home/Scala_icon.svg"},{"linkURI":"https://docs.azuredatabricks.net/_static/notebooks/structured-streaming-python.html","displayName":"Introduction to Structured Streaming","icon":"img/home/Python_icon.svg"}],"enableClusterStart":true,"maxImportFileVersion":5,"enableEBSVolumesUIByTier":true,"enableTableAclService":true,"removeSubCommandCodeWhenExport":true,"upgradeURL":"","maxAutoterminationMinutes":10000,"showResultsFromExternalSearchEngine":true,"autoterminateClustersByDefault":false,"notebookLoadingBackground":"#fff","sshContainerForwardedPort":2200,"enableStaticHtmlImport":true,"enableInstanceProfilesByTier":true,"showForgotPasswordLink":true,"defaultMemoryPerContainerMB":28000,"enablePresenceUI":true,"minAutoterminationMinutes":10,"accounts":true,"useOnDemandClustersByDefault":false,"enableAutoCreateUserUI":true,"defaultCoresPerContainer":4,"showTerminationReason":true,"enableNewClustersGet":true,"showPricePerDBU":true,"showSqlProxyUI":true,"enableNotebookErrorHighlighting":true}; var __DATABRICKS_NOTEBOOK_MODEL = 'JTdCJTIydmVyc2lvbiUyMiUzQSUyMk5vdGVib29rVjElMjIlMkMlMjJvcmlnSWQlMjIlM0E0MzQxNTg5Nzc5MzMxMjU4JTJDJTIybmFtZSUyMiUzQSUyMkJsb2cyJTIyJTJDJTIybGFuZ3VhZ2UlMjIlM0ElMjJweXRob24lMjIlMkMlMjJjb21tYW5kcyUyMiUzQSU1QiU3QiUyMnZlcnNpb24lMjIlM0ElMjJDb21tYW5kVjElMjIlMkMlMjJvcmlnSWQlMjIlM0E0MzQxNTg5Nzc5MzMxMjU5JTJDJTIyZ3VpZCUyMiUzQSUyMjNmYmJiYzI1LWVhZGEtNGZhOS1hOGNiLTJiNjFkODg2NWYzMCUyMiUyQyUyMnN1YnR5cGUlMjIlM0ElMjJjb21tYW5kJTIyJTJDJTIyY29tbWFuZFR5cGUlMjIlM0ElMjJhdXRvJTIyJTJDJTIycG9zaXRpb24lMjIlM0ExLjAlMkMlMjJjb21tYW5kJTIyJTNBJTIyJTIzJTIwRmlyc3QlMjBjb21tYW5kJTVDbiU1Q25wcmludCglNUMlMjJQaW5nJTVDJTIyKSUyMiUyQyUyMmNvbW1hbmRWZXJzaW9uJTIyJTNBMCUyQyUyMnN0YXRlJTIyJTNBJTIyZmluaXNoZWQlMjIlMkMlMjJyZXN1bHRzJTIyJTNBJTdCJTIydHlwZSUyMiUzQSUyMmh0bWwlMjIlMkMlMjJkYXRhJTIyJTNBJTIyJTNDZGl2JTIwY2xhc3MlM0QlNUMlMjJhbnNpb3V0JTVDJTIyJTNFUGluZyU1Q24lM0MlMkZkaXYlM0UlMjIlMkMlMjJhcmd1bWVudHMlMjIlM0ElN0IlN0QlMkMlMjJhZGRlZFdpZGdldHMlMjIlM0ElN0IlN0QlMkMlMjJyZW1vdmVkV2lkZ2V0cyUyMiUzQSU1QiU1RCUyQyUyMmRhdGFzZXRJbmZvcyUyMiUzQSU1QiU1RCU3RCUyQyUyMmVycm9yU3VtbWFyeSUyMiUzQW51bGwlMkMlMjJlcnJvciUyMiUzQW51bGwlMkMlMjJ3b3JrZmxvd3MlMjIlM0ElNUIlNUQlMkMlMjJzdGFydFRpbWUlMjIlM0ExNTI4NzE1ODg2MzcxJTJDJTIyc3VibWl0VGltZSUyMiUzQTE1Mjg3MTU4ODYwODIlMkMlMjJmaW5pc2hUaW1lJTIyJTNBMTUyODcxNTg4NjM4NSUyQyUyMmNvbGxhcHNlZCUyMiUzQWZhbHNlJTJDJTIyYmluZGluZ3MlMjIlM0ElN0IlN0QlMkMlMjJpbnB1dFdpZGdldHMlMjIlM0ElN0IlN0QlMkMlMjJkaXNwbGF5VHlwZSUyMiUzQSUyMnRhYmxlJTIyJTJDJTIyd2lkdGglMjIlM0ElMjJhdXRvJTIyJTJDJTIyaGVpZ2h0JTIyJTNBJTIyYXV0byUyMiUyQyUyMnhDb2x1bW5zJTIyJTNBbnVsbCUyQyUyMnlDb2x1bW5zJTIyJTNBbnVsbCUyQyUyMnBpdm90Q29sdW1ucyUyMiUzQW51bGwlMkMlMjJwaXZvdEFnZ3JlZ2F0aW9uJTIyJTNBbnVsbCUyQyUyMmN1c3RvbVBsb3RPcHRpb25zJTIyJTNBJTdCJTdEJTJDJTIyY29tbWVudFRocmVhZCUyMiUzQSU1QiU1RCUyQyUyMmNvbW1lbnRzVmlzaWJsZSUyMiUzQWZhbHNlJTJDJTIycGFyZW50SGllcmFyY2h5JTIyJTNBJTVCJTVEJTJDJTIyZGlmZkluc2VydHMlMjIlM0ElNUIlNUQlMkMlMjJkaWZmRGVsZXRlcyUyMiUzQSU1QiU1RCUyQyUyMmdsb2JhbFZhcnMlMjIlM0ElN0IlN0QlMkMlMjJsYXRlc3RVc2VyJTIyJTNBJTIyYSUyMHVzZXIlMjIlMkMlMjJsYXRlc3RVc2VySWQlMjIlM0FudWxsJTJDJTIyY29tbWFuZFRpdGxlJTIyJTNBJTIyQ29tbWFuZCUyMDElMjBUaXRsZSUyMiUyQyUyMnNob3dDb21tYW5kVGl0bGUlMjIlM0F0cnVlJTJDJTIyaGlkZUNvbW1hbmRDb2RlJTIyJTNBZmFsc2UlMkMlMjJoaWRlQ29tbWFuZFJlc3VsdCUyMiUzQWZhbHNlJTJDJTIyaVB5dGhvbk1ldGFkYXRhJTIyJTNBbnVsbCUyQyUyMnN0cmVhbVN0YXRlcyUyMiUzQSU3QiU3RCUyQyUyMm51aWQlMjIlM0ElMjJhYjdkMjhjZC04NzZkLTRlNGUtYWU2Mi05N2UwZTVkM2Q0ZWElMjIlN0QlMkMlN0IlMjJ2ZXJzaW9uJTIyJTNBJTIyQ29tbWFuZFYxJTIyJTJDJTIyb3JpZ0lkJTIyJTNBNDM0MTU4OTc3OTMzMTI2MiUyQyUyMmd1aWQlMjIlM0ElMjI5OGY1ODFiMS01MTQyLTRhZWEtYjlhNy0zYTAyYTM5ZDFmYzMlMjIlMkMlMjJzdWJ0eXBlJTIyJTNBJTIyY29tbWFuZCUyMiUyQyUyMmNvbW1hbmRUeXBlJTIyJTNBJTIyYXV0byUyMiUyQyUyMnBvc2l0aW9uJTIyJTNBMS41JTJDJTIyY29tbWFuZCUyMiUzQSUyMiUyMyUyMFNlY29uZCUyMGNvbW1hbmQlNUNuJTVDbnByaW50KCU1QyUyMlBvbmclNUMlMjIpJTIyJTJDJTIyY29tbWFuZFZlcnNpb24lMjIlM0EwJTJDJTIyc3RhdGUlMjIlM0ElMjJmaW5pc2hlZCUyMiUyQyUyMnJlc3VsdHMlMjIlM0ElN0IlMjJ0eXBlJTIyJTNBJTIyaHRtbCUyMiUyQyUyMmRhdGElMjIlM0ElMjIlM0NkaXYlMjBjbGFzcyUzRCU1QyUyMmFuc2lvdXQlNUMlMjIlM0VQb25nJTVDbiUzQyUyRmRpdiUzRSUyMiUyQyUyMmFyZ3VtZW50cyUyMiUzQSU3QiU3RCUyQyUyMmFkZGVkV2lkZ2V0cyUyMiUzQSU3QiU3RCUyQyUyMnJlbW92ZWRXaWRnZXRzJTIyJTNBJTVCJTVEJTJDJTIyZGF0YXNldEluZm9zJTIyJTNBJTVCJTVEJTdEJTJDJTIyZXJyb3JTdW1tYXJ5JTIyJTNBbnVsbCUyQyUyMmVycm9yJTIyJTNBbnVsbCUyQyUyMndvcmtmbG93cyUyMiUzQSU1QiU1RCUyQyUyMnN0YXJ0VGltZSUyMiUzQTE1Mjg3MTU4ODYzOTIlMkMlMjJzdWJtaXRUaW1lJTIyJTNBMTUyODcxNTg4NjMzMSUyQyUyMmZpbmlzaFRpbWUlMjIlM0ExNTI4NzE1ODg2NDAxJTJDJTIyY29sbGFwc2VkJTIyJTNBZmFsc2UlMkMlMjJiaW5kaW5ncyUyMiUzQSU3QiU3RCUyQyUyMmlucHV0V2lkZ2V0cyUyMiUzQSU3QiU3RCUyQyUyMmRpc3BsYXlUeXBlJTIyJTNBJTIydGFibGUlMjIlMkMlMjJ3aWR0aCUyMiUzQSUyMmF1dG8lMjIlMkMlMjJoZWlnaHQlMjIlM0ElMjJhdXRvJTIyJTJDJTIyeENvbHVtbnMlMjIlM0FudWxsJTJDJTIyeUNvbHVtbnMlMjIlM0FudWxsJTJDJTIycGl2b3RDb2x1bW5zJTIyJTNBbnVsbCUyQyUyMnBpdm90QWdncmVnYXRpb24lMjIlM0FudWxsJTJDJTIyY3VzdG9tUGxvdE9wdGlvbnMlMjIlM0ElN0IlN0QlMkMlMjJjb21tZW50VGhyZWFkJTIyJTNBJTVCJTVEJTJDJTIyY29tbWVudHNWaXNpYmxlJTIyJTNBZmFsc2UlMkMlMjJwYXJlbnRIaWVyYXJjaHklMjIlM0ElNUIlNUQlMkMlMjJkaWZmSW5zZXJ0cyUyMiUzQSU1QiU1RCUyQyUyMmRpZmZEZWxldGVzJTIyJTNBJTVCJTVEJTJDJTIyZ2xvYmFsVmFycyUyMiUzQSU3QiU3RCUyQyUyMmxhdGVzdFVzZXIlMjIlM0ElMjJhJTIwdXNlciUyMiUyQyUyMmxhdGVzdFVzZXJJZCUyMiUzQW51bGwlMkMlMjJjb21tYW5kVGl0bGUlMjIlM0ElMjJDb21tYW5kJTIwMiUyMFRpdGxlJTIyJTJDJTIyc2hvd0NvbW1hbmRUaXRsZSUyMiUzQXRydWUlMkMlMjJoaWRlQ29tbWFuZENvZGUlMjIlM0FmYWxzZSUyQyUyMmhpZGVDb21tYW5kUmVzdWx0JTIyJTNBZmFsc2UlMkMlMjJpUHl0aG9uTWV0YWRhdGElMjIlM0FudWxsJTJDJTIyc3RyZWFtU3RhdGVzJTIyJTNBJTdCJTdEJTJDJTIybnVpZCUyMiUzQSUyMjE1MjY0NzBlLTgxMjgtNDIyZi04MzQ1LTQ1MWI5OGUwZTgyNyUyMiU3RCU1RCUyQyUyMmRhc2hib2FyZHMlMjIlM0ElNUIlNUQlMkMlMjJndWlkJTIyJTNBJTIyZWQxNWFlNjMtNjVmYi00MTMwLTk2ZmQtOWIzNWU3MTQ5MDJiJTIyJTJDJTIyZ2xvYmFsVmFycyUyMiUzQSU3QiU3RCUyQyUyMmlQeXRob25NZXRhZGF0YSUyMiUzQW51bGwlMkMlMjJpbnB1dFdpZGdldHMlMjIlM0ElN0IlN0QlN0Q='; if (window.mainJsLoadError) { var u = 'https://databricks-prod-cloudfront.cloud.databricks.com/static/d9b6f94329a854d5e4b563a3854b37b2e3a5420ba1b2e8434244d6f80244e6c6/js/notebook-main.js'; var b = document.getElementsByTagName('body')[0]; var c = document.createElement('div'); c.innerHTML = ('Network Error' + 'Please check your network connection and try again.' + 'Could not load a required resource: ' + u + ''); c.style.margin = '30px'; c.style.padding = '20px 50px'; c.style.backgroundColor = '#f5f5f5'; c.style.borderRadius = '5px'; b.appendChild(c); } ">Your browser is not supported. Please use any of the following browsers: Chrome, Firefox, Safari, Opera. Not all notebook features are available within an iframe, but the main ones, such as syntax highlighting are intact and so is the ability to display a table results (with column sort capability!). The proposed way of embedding notebook code in a blog post is just the first idea that came to my mind. Maybe there are more clever ways to achieve this. Please let me know in the comments. (function(root, factory) { // `root` does not resolve to the global window object in a Browserified // bundle, so a direct reference to that object is used instead. var _srcDoc = window.srcDoc; if (typeof define === "function" && define.amd) { define(['exports'], function(exports) { factory(exports, _srcDoc); root.srcDoc = exports; }); } else if (typeof exports === "object") { factory(exports, _srcDoc); } else { root.srcDoc = {}; factory(root.srcDoc, _srcDoc); } })(this, function(exports, _srcDoc) { var idx, iframes; var isCompliant = !!("srcdoc" in document.createElement("iframe")); var sandboxMsg = "Polyfill may not function in the presence of the " + "`sandbox` attribute. Consider using the `force` option."; var sandboxAllow = /\ballow-same-origin\b/; /** * Determine if the operation may be blocked by the `sandbox` attribute in * some environments, and optionally issue a warning or remove the * attribute. */ var validate = function( iframe, options ) { var sandbox = iframe.getAttribute("sandbox"); if (typeof sandbox === "string" && !sandboxAllow.test(sandbox)) { if (options && options.force) { iframe.removeAttribute("sandbox"); } else if (!options || options.force !== false) { logError(sandboxMsg); iframe.setAttribute("data-srcdoc-polyfill", sandboxMsg); } } }; var implementations = { compliant: function( iframe, content, options ) { if (content) { validate(iframe, options); iframe.setAttribute("srcdoc", content); } }, legacy: function( iframe, content, options ) { var jsUrl; if (!iframe || !iframe.getAttribute) { return; } if (!content) { content = iframe.getAttribute("srcdoc"); } else { iframe.setAttribute("srcdoc", content); } if (content) { validate(iframe, options); // The value returned by a script-targeted URL will be used as // the iFrame's content. Create such a URL which returns the // iFrame element's `srcdoc` attribute. jsUrl = "javascript: window.frameElement.getAttribute('srcdoc');"; // Explicitly set the iFrame's window.location for // compatability with IE9, which does not react to changes in // the `src` attribute when it is a `javascript:` URL, for // some reason if (iframe.contentWindow) { iframe.contentWindow.location = jsUrl; } iframe.setAttribute("src", jsUrl); } } }; var srcDoc = exports; var logError; if (window.console && window.console.error) { logError = function(msg) { window.console.error("[srcdoc-polyfill] " + msg); }; } else { logError = function() {}; } // Assume the best srcDoc.set = implementations.compliant; srcDoc.noConflict = function() { window.srcDoc = _srcDoc; return srcDoc; }; // If the browser supports srcdoc, no shimming is necessary if (isCompliant) { return; } srcDoc.set = implementations.legacy; // Automatically shim any iframes already present in the document iframes = document.getElementsByTagName("iframe"); idx = iframes.length; while (idx--) { srcDoc.set( iframes[idx] ); } });

Connecting Azure Databricks to Data Lake Store

Just a quick post here to help anyone who needs integrate their Azure Databricks cluster with Data Lake Store. This is not hard to do but there are a few steps so its worth recording them here in a quick and easy to follow form.This assumes you have created your Databricks cluster and have created a data lake store you want to integrate with. If you haven’t created your cluster that’s described in a previous blog here which you may find useful.The objective here is to create a mount point, a folder in the lake accessible from Databricks so we can read from and write to ADLS. Here this is done in notebooks in Databricks using Python but if Scala is your thing then its just as easy. To create the mount point you need to run the following command:-configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",            "dfs.adls.oauth2.client.id": "{YOUR SERVICE CLIENT ID}",            "dfs.adls.oauth2.credential": "{YOUR SERVICE CREDENTIALS}",            "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/{YOUR DIRECTORY ID}/oauth2/token"} dbutils.fs.mount(   source = "adl://{YOUR DATA LAKE STORE ACCOUNT NAME}.azuredatalakestore.net{YOUR DIRECTORY NAME}",   mount_point = "{mountPointPath}",   extra_configs = configs)So to do this we need to collect together the values to use for{YOUR SERVICE CLIENT ID}{YOUR SERVICE CREDENTIALS}{YOUR DIRECTORY ID}{YOUR DATA LAKE STORE ACCOUNT NAME}{YOUR DIRECTORY NAME}{mountPointPath}First the easy ones, my data lake store is called “pythonregression” and I want the folder I am going to use to be ‘/mnt/python’, these are just my choices.I need the service client id and credentials, for this I will create a new Application Registration by going to the Active Directory blade in the Azure portal and clicking on “New Application Registration”Fill in you chosen App name, here I have used the name ‘MyNewApp’, I know not really original. Then press ‘Create’ to create the App registrationThis will only take a few seconds to create and you should then see your App registration in the list of available apps. Click on the App you have created to see the details which will look something like this:Make a note of the ApplicationId GUID (partially deleted here), this is the SERVICE CLIENT ID you will need. Then from this screen click the “Settings” button and then the “Keys” link. We are going to create a key specifically for the purpose. Enter a Key Description, choose a Duration from the drop down and when you hit “Save” a key will be produced. Save this key, its the value you need for YOUR SERVICE CREDENTIALS and as soon as you leave the blade it will disappear. We now have everything we need except the DIRECTORY ID. To get the DIRECTORY ID go back to the Active Directory blade and click on “Properties” as shown below:-From here you can get the DIRECTORY IDOk, one last thing to do. You need to grant access to the “MyNewApp” App to the Data Lake Store, otherwise you will get access forbidden messages when you try to access ADLS from Databricks. This can be done from Data Explorer in ADLS using the link highlighted below.Now we have everything we need to mount the drive. In Databricks launch a workspace then create a new notebook (as described in my previous post). Run the command we put together above in the python notebookYou can then create directories and files in the lake from within your databricks notebookIf you want to you can unmount the drive using the following commandSomething to note. If you Terminate the cluster (terminate meaning shutdown) you can restart the cluster and the mounted folder will still be available to you, it doesn’t need to be remounted.You can access the file system from Python as follows:with open("/dbfs/mnt/python/newdir/iris_labels.txt", "w") as outfile:and then write to the file in ADLS as if it was a local file system.Ok, that was more detailed than I intended when I started, but I hope that was interesting and helpful. Let me know if you have any questions on any of the above and enjoy the power of Databricks and ADLS combined.