
Adatis BI Blogs

Geospatial Data Visualisation: R, Python or BI Tools

Introduction

As a newbie to Adatis I begin my journey of learning the vast, ever-growing range of languages and platforms. For my first few steps I am taking a problem I already have experience with in R, and finding out whether the solution in Python presents similar hurdles, as a comparison between the two languages. Before joining Adatis I had just graduated with a degree in Biology. During my course I discovered Movebank, a website containing free-to-use animal GPS tracking data sets. So I am using a data set I am familiar with, containing GPS tracking data for 27 Turkey Vultures (Cathartes aura), and using both tools to plot the 350,000 points on a map.

How the data set looks as a CSV file loaded into R's viewing window.

Note: While both platforms are open source, there may be packages out there that would optimise the process I was trying to achieve. As a beginner with limited knowledge, this was my experience of trying to map geospatial data in R compared to Python, and to ArcGIS in Power BI.

Python

Packages concerned: basic (numpy, pandas, matplotlib) and geometry (geopandas and shapely).

Python failed to install the geopandas and shapely packages, and I had to use the pip install command in the command window; alternatively the packages can be downloaded in the Anaconda environment. The data loaded in fine as expected. First the longitude and latitude values were zipped together, and shapely's Point function was applied to turn each pair into a single co-ordinate value. If I were to plot now, I could retrieve a world map (a template from the geopandas library, changing the fill, colour of sea and country outline colour) with custom size and shape and a single colour for all the data points – which is something, but we can make it more informative. When plotting, I could not find a function that would let Python colour the points by group based on a string value (the name of the vulture). Instead I created a function to convert every name to a number, so each co-ordinate had a group number assigned to it depending on which vulture it came from. This means a list of 350,000 numbers ranging from 1 to 27 can be passed as the colour for the points. After plotting, Python offers a zoom function which allows you to zoom in down to a single point, but it is limited by the resolution of the map/plot. I did, however, limit the axes to the Americas just for initial plot clarity. The point conversion and plotting process takes around 15 seconds according to time.time() calls placed at the beginning and end of the code. Ignoring the library imports, the whole process took 14 lines of unique code.
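To make the steps concrete, here is a minimal sketch of that workflow. The file name and Movebank-style column names are assumptions rather than the exact ones from the original data set, and the built-in naturalearth_lowres map ships with older geopandas releases:

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point

# Load the tracking data (file and column names are assumed).
df = pd.read_csv("turkey_vultures.csv")

# Zip longitude/latitude pairs into shapely Points.
gdf = gpd.GeoDataFrame(
    df, geometry=[Point(xy) for xy in zip(df["location-long"], df["location-lat"])])

# Convert each vulture's name to a group number for colouring.
codes = {name: i for i, name in enumerate(df["individual-local-identifier"].unique())}
gdf["group"] = df["individual-local-identifier"].map(codes)

# Plot the points over the template world map, limited to the Americas.
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
ax = world.plot(color="lightgrey", edgecolor="black", figsize=(12, 8))
gdf.plot(ax=ax, column="group", cmap="tab20", markersize=2)
ax.set_xlim(-130, -30)
plt.show()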
R

Packages concerned: rworldmap.

Now R has some useful packages like ggmap, mapdata and ggplot2 which allow you to source your map satellite images directly from Google Maps, but this does require a free Google API key to source from the cloud. These packages can also plot the map around the data, whereas I am currently trimming the map to fit the data. But for a fair test I also used a simplistic pre-built map in R. This was from the package rworldmap, which allows plotting at a country level with defined borders. Axes can be scaled to act like a zoom function, but without a higher resolution map or raster satellite image map it is pointless to go past country level.

R works a little differently to Python: initially we plot a blank map, and in a sequential but new line we add the points. This differs from Python, where the points and the map are contained in the same plot function. The first step is to get the map from the rworldmap library and plot it, limiting the axes to act as a zoom. However, the axes do not automatically show this way, so next we manually define the intervals of the axes, i.e. longitude and latitude. Finally we plot the points with customisation of size, shape and colour. The colour of each point can be assigned using 'col=factor(dataframe$column)', so there is no need to build a group number index as we did in Python. Using the function system.time(), R reported the plotting takes 7 seconds, but the actual graphic doesn't show up for a further 8 seconds, making the total time 15 seconds – the same as Python. R, however, again ignoring the library imports, executed this in only 6 lines of unique code!

Just for fun I did check for a correlation between distance travelled and weight, and a regression analysis would suggest a statistically significant relationship. But without a time scale we cannot conclude anything; for example, some birds could have been tagged longer than others, resulting in a greater distance travelled.

Is geospatial data suitable to present in R and Python?

During the write-up of this blog I read Jason's blog(s) on ArcGIS Maps for Power BI (Jason's blog), so I trialled my own data. Quite simply, the program is drag and drop. An image presenting the same final information as both R and Python can be created in roughly 2 minutes. One of the features I really got on well with is the variable zoom, down to the precision of a street. If you have read Jason's blog you will know about all the customisable features; if not, read it! Quite simply, ArcGIS Maps gives you the option to fully explore the data visually, but it is limited by analysis and 'map features'.

Power BI handles statistical analysis similarly to Excel: possible, but tedious, having to generate each column. The map feature limit poses another problem; ArcGIS (free version) only allows for 1,500 'map features' (5,000 in the paid version), which equated to around 30,000 points. With this data set it meant not all the data could be plotted. I thinned out the data in R just by removing 9 of every 10 points; considering the data was continuous, this meant the readings were once every half/full day instead of every 1-3 hours, which shouldn't really affect the overall distance travelled. I suspect a limit like this is in place to preserve performance, as a similar platform called Tableau plotted all of the data but at a cost to performance, to the point that the zoom navigation was unusable.

Conclusion

In summary, I feel the difference between R/Python and Power BI just depends on the user. Power BI makes the data very accessible and gives great control for visually exploring the data, to a business analyst for example. But if the user is more technical, e.g. a data scientist, then R/Python might be more fitting. Between R and Python there was no difference in performance with this data set, but that could be because, in the grand scheme of things, 350,000 points isn't 'big'. There was a notable difference in the amount of code needed, but this could be down to my greater experience with R and my lack of knowledge of a Python package capable of achieving the same task.

Spark Streaming in Azure Databricks

Real-time stream processing is becoming more prevalent on modern day data platforms, and with a myriad of processing technologies out there, where do you begin? Stream processing involves the consumption of messages from either queues or files, doing some processing in the middle (querying, filtering, aggregation) and then forwarding the result to a sink – all with a minimal latency. This is in direct contrast to batch processing, which usually occurs on an hourly or daily basis. Often, both of these will need to be combined to create a new data set.

In terms of options for real-time stream processing on Azure you have the following:

Azure Stream Analytics
Spark Streaming / Storm on HDInsight
Spark Streaming on Databricks
Azure Functions

Stream Analytics is a simple PaaS offering. It connects easily into other Azure resources such as Event Hubs, IoT Hub, and Blob, and outputs to a range of resources that you'd expect. It has its own intuitive query language, with the added benefit of letting you create functions in JavaScript. Scaling can be achieved by partitions, and it has the windowing and late arrival event support that you'd expect from a processing option. For most jobs, this service will be the quickest/easiest to implement, as long as its relatively small set of limitations falls outside the bounds of what you want to achieve. It's also worth noting that the service does not currently support Azure network security such as Virtual Networks or IP Filtering. I suspect this may only be a matter of time, given the preview of this support in Event Hubs.

Both Spark Streaming on HDInsight and Databricks open up the options for configurability and are possibly more suited to an enterprise-level data platform, allowing us to use languages such as Scala/Python or even Java for the processing in the middle. The use of these options also allows us to integrate Kafka (an open source alternative to Event Hubs) as well as HDFS and Data Lake as inputs. Scalability is determined by the cluster sizes, and the support for the other events mentioned above is also included. These options also give us flexibility for the future, and allow us to adapt moving forward depending on evolving technologies. They also come with the benefit of Azure network security support, so we can peer our clusters onto a virtual network.

Lastly – I wouldn't personally use this, but we can also use Functions to achieve the same goal through C#/Node.js. This route however does not include support for those temporal/windowing/late arrival events, since Functions are serverless and act on a per-execution basis.

In the following blog, I'll be looking at Spark Streaming on Databricks (which is fast becoming my favourite research topic). A good place to start is to understand the structured streaming model, which I've seen documented a few times now: essentially treating the stream as an unbounded table, with new records from the stream being appended as new rows to the table. This allows us to treat both batch and streaming data as tables in a DataFrame, therefore allowing similar queries to be run across them.

At this point, it will be useful to include some code to help explain the process. Before beginning, it's worth mounting your data sink to your Databricks instance so you can reference it as if it were inside the DBFS (Databricks File System) – this is merely a pointer. For more info on this, refer to the Databricks documentation here. Only create a mount point if you want all users in the workspace to have access. If you wish to apply security, you will need to access the store directly (also documented in the same place) and then apply permissions to the notebook accordingly.
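As a rough illustration, a mount might look like the following. The storage account, container and mount point names here are placeholders, not values from the original post:

# Mount blob storage into DBFS (account/container/key names are placeholders).
dbutils.fs.mount(
  source = "wasbs://mycontainer@mystorageaccount.blob.core.windows.net",
  mount_point = "/mnt/streamsink",
  extra_configs = {"fs.azure.account.key.mystorageaccount.blob.core.windows.net": ACCESS_KEY})

# The mounted store can then be referenced like any other DBFS path,
# e.g. "/mnt/streamsink/output".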
As my input for my stream was from Event Hubs, we can start by defining the reading stream. You'll firstly need to add the Maven coordinate com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.2 to add the EventHub library to the cluster to allow the connection. Further options can be added for the consumer group, starting positions (for partitioning), timeouts and events per trigger. Positions can also be used to define starting and ending points in time so that the stream is not running continuously.

import json
from datetime import datetime as dt  # used for json.dumps and dt.now() below

connectionString = "Endpoint=sb://{EVENTHUBNAMESPACE}.servicebus.windows.net/{EVENTHUBNAME};EntityPath={EVENTHUBNAME};SharedAccessKeyName={ACCESSKEYNAME};SharedAccessKey={ACCESSKEY}"

startingEventPosition = {
  "offset": "-1",        # start of stream
  "seqNo": -1,           # not in use
  "enqueuedTime": None,  # not in use
  "isInclusive": True
}

endingEventPosition = {
  "offset": None,  # not in use
  "seqNo": -1,     # not in use
  "enqueuedTime": dt.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ"),  # point in time
  "isInclusive": True
}

ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
ehConf['eventhubs.startingPosition'] = json.dumps(startingEventPosition)
ehConf['eventhubs.endingPosition'] = json.dumps(endingEventPosition)

df = spark \
  .readStream \
  .format("eventhubs") \
  .options(**ehConf) \
  .load()

The streaming data that is output follows a set schema: the body, followed by a series of metadata about the streaming message. It's important to note that the body comes out as a binary stream (this contains our message). We will need to cast the body to a string to deserialise the column into the JSON that we are expecting. This can be done by using some Spark SQL to turn the binary into a string as JSON, and then parsing the column into a StructType with a specified schema. If multiple records are coming through in the same message, you will need to explode out the result into separate records. Flattening out the nested columns is also useful, as long as the data frame is still manageable. Spark SQL provides some great functions here to make our life easy.

from pyspark.sql.functions import from_json  # used for the parsing below

rawData = df. \
  selectExpr("cast(body as string) as json"). \
  select(from_json("json", Schema).alias("data")). \
  select("data.*")

While it's entirely possible to construct your schema manually, it's also worth noting that you can take a sample JSON, read it into a data frame using spark.read.json(path) and then call printSchema() on top of it to return the inferred schema. This can then be used to create the StructType.

# Inferred schema:
# root
#  |-- LineTotal: string (nullable = true)
#  |-- OrderQty: string (nullable = true)
#  |-- ProductID: string (nullable = true)
#  |-- SalesOrderDetailID: string (nullable = true)
#  |-- SalesOrderID: string (nullable = true)
#  |-- UnitPrice: string (nullable = true)
#  |-- UnitPriceDiscount: string (nullable = true)

from pyspark.sql.types import StructType, StructField, StringType

Schema = StructType([
  StructField('SalesOrderID', StringType(), False),
  StructField('SalesOrderDetailID', StringType(), False),
  StructField('OrderQty', StringType(), False),
  StructField('ProductID', StringType(), False),
  StructField('UnitPrice', StringType(), False),
  StructField('UnitPriceDiscount', StringType(), False),
  StructField('LineTotal', StringType(), False)
])
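Picking up the earlier point about multiple records per message: if each body holds a JSON array of records, the parsed column can be exploded into separate rows. A sketch of this, assuming an array-of-struct version of the schema above (the exact code was not in the original post):

from pyspark.sql.functions import explode
from pyspark.sql.types import ArrayType

# Parse each body as an array of records, then explode to one row per record.
multiData = df \
  .selectExpr("cast(body as string) as json") \
  .select(from_json("json", ArrayType(Schema)).alias("data")) \
  .select(explode("data").alias("record")) \
  .select("record.*")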
At this point, you have the data streaming into your data frame. To output to the console you can use display(rawData) to see the data visually. However, this is only useful for debugging, since the data is not actually going anywhere! To write the stream into somewhere such as a data lake you would then use the following code. The checkpoint location can be used to recover from failures when the stream is interrupted, and this is important if this code were to make it to a production environment. Should a cluster fail, the query can be restarted on a new cluster from a specific point and consistently recover, thus enabling exactly-once guarantees. This also means we can change the query, as long as the input source and output schema are the same, without directly interrupting the stream. Lastly, the trigger will check for new rows in the stream every 10 seconds.

rawData.writeStream \
  .format("json") \
  .outputMode("append") \
  .option("path", PATH) \
  .trigger(processingTime = "10 seconds") \
  .option("checkpointLocation", PATH) \
  .start()

Checking our data lake, you can now see the data has made its way over, broken up by the time intervals specified.

Hopefully this is useful for anyone getting going in this topic area. I'd advise sticking to Python given the extra capability of the PySpark language over Scala, even though a lot of the Databricks documentation / tutorials use Scala – this was just something that felt more comfortable. If you intend to do much in this area I would definitely suggest you use the PySpark SQL documentation, which can be found here. This is pretty much a bible for all commands and I've been referencing it quite a bit. If this is not enough, there is also a cheat sheet available here. Again, very useful for reference while the language is still not engrained.

Getting Started with PyTorch: A Deep Learning Tutorial

PyTorch is a deep learning framework created by the Artificial Intelligence Research Group at Facebook to build neural networks for machine learning projects. It isn't brand new; PyTorch has been around since October 2016, almost exactly two years ago, but only now is it gaining the momentum it deserves. When used as an alternative to Keras, TensorFlow or NumPy, PyTorch shines in the following areas:

It is very tightly integrated with native Python APIs, which allows it to seamlessly interact with Python libraries such as NumPy, SciPy and Pandas.
This also includes native debuggers, so there is no need for a specialist debugger – looking at you, TensorFlow...
It supports both forward passes for prediction and back propagation using the Autograd library for training.
And most importantly: PyTorch builds dynamic computation graphs which can run code immediately, with no separate build and run phases. This makes the neural networks much easier to extend, debug and maintain, as you can edit your neural network during runtime or build your graph one step at a time.

** Note: If the basics of Deep Learning, Neural Networks and Back Propagation are alien to you, or even if you fancy a little revision, I really recommend that you check out 3Blue1Brown's awesome YouTube playlist covering the foundations of Neural Networks: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi **

In this tutorial I will introduce a basic deep neural network in PyTorch and explain the important steps along the way. Now, let's get started!

Installation

The simplest and recommended way to install PyTorch is through a package management tool like Conda or Pip, but it can be installed directly from source using the instructions found at: https://pytorch.org/

Install using Pip

Linux:
pip3 install torch torchvision

Mac:
pip3 install torch torchvision
# MacOS binaries don't support CUDA, install from source if CUDA is needed

Windows:
pip3 install http://download.pytorch.org/whl/cu90/torch-0.4.1-cp37-cp37m-win_amd64.whl
pip3 install torchvision

Install using Conda

Linux:
conda install pytorch torchvision -c pytorch

Mac:
conda install pytorch torchvision -c pytorch
# MacOS binaries don't support CUDA, install from source if CUDA is needed

Windows:
conda install pytorch -c pytorch
pip3 install torchvision

PyTorch can be installed on Azure Databricks as a Databricks PyPI library, and comes preinstalled and configured on Azure Data Science Virtual Machines.

Dataset Selection

I originally found the dataset used in this tutorial in the UCI Machine Learning Repository. The dataset represents data on 11,000+ instances of phishing and non-phishing webpages, which have 30 categorical attributes including PageRank, AbnormalURL, Google_Index and age_of_domain. The data also contains a label, 1 or -1, indicating whether the webpage was a phishing webpage or not. The aim of this tutorial is to show how to use a deep neural network, so all data was cleansed before being split into two separate CSV files, train.csv and test.csv.

Simple Neural Network Design

The neural network I will build consists of:

30 input nodes, each representing a column of the dataset.
These are passed through the first ReLU layer.
Rectified Linear Units (ReLU) improve neural networks by speeding up training – all negative numbers are set to 0 and positive ones aren't changed, to speed up computations.
Next is the first hidden layer, consisting of 128 fully connected nodes.
Then we have the second ReLU layer and the second fully connected hidden layer, which also consists of 128 nodes.
The next layer is the Logarithmic Softmax layer. This layer integrates both the softmax and log functions to calculate probabilities of each output in the range of 0 and 1.
Finally, the last layer comprises just two nodes, representing the two labels – phishing and non-phishing.

This neural network is visualised below.

Simple Neural Network Build

As always, the first step is to import the libraries we'll need.

import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable
import pandas as pd
import torch.utils.data

The next step is to read the CSVs that contain the data we need into pandas dataframes.

trainDataset = pd.read_csv("..\\train.csv", header=None)
testDataset = pd.read_csv("..\\test.csv", header=None)

Now we choose our hyperparameters. The input size will equal the number of attributes in our dataset, 30. The size of both hidden layers will be 128 nodes. The number of classes will be 2, non-phishing and phishing. The number of epochs will be 100; an epoch is both one forward pass through the network and one backward pass for training, and we want to do these two steps 100 times. Our batch size will also be 100, meaning we will work through each epoch in batches of 100 rows of the dataset to speed up the training. The learning rate will be 0.001. The lower the learning rate, generally, the slower but more accurate the training is. The learning rate can be thought of as the step size of the gradient descent – too big and you may overshoot the minimum, too small and it'll take a long, long time to converge. These parameters can be set using the following code:

inputSize = len(trainDataset.columns) - 1
hidden1Size = 128
hidden2Size = 128
numClasses = 2
numEpoch = 100
batchSize = 100
learningRate = 0.001

Data loaders are a really simple abstraction over the standard batch machine learning pipeline. Behind the scenes, the data loader will handle:

Batching the data.
Shuffling the data.
Loading the data in parallel using multiprocessing workers.

trainLoader = torch.utils.data.DataLoader(dataset=torch.tensor(trainDataset.values), batch_size=batchSize, shuffle=True)
testLoader = torch.utils.data.DataLoader(dataset=torch.tensor(testDataset.values), batch_size=batchSize, shuffle=False)

Now it's time to define our neural network. The best way to do this is to subclass the nn.Module class. The new DeepNeuralNetwork class is made up of the seven layers discussed earlier.

class DeepNeuralNetwork(nn.Module):
    def __init__(self, inputSize, hidden1Size, hidden2Size, numClasses):
        super(DeepNeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(inputSize, hidden1Size)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(hidden1Size, hidden2Size)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(hidden2Size, numClasses)
        self.logsm1 = nn.LogSoftmax(dim=1)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu1(out)
        out = self.fc2(out)
        out = self.relu2(out)
        out = self.fc3(out)
        out = self.logsm1(out)
        return out

dnn = DeepNeuralNetwork(inputSize, hidden1Size, hidden2Size, numClasses)

We now need to define both the loss function and the optimizer.
The loss function will calculate how effective our current weights and biases are at producing an accurate classification, and we will be using the NLL function, or negative log-likelihood function. NLL is very often the loss function of choice alongside the softmax activation we have in our neural network. Additionally, we need an optimizer to take the results from the loss function and alter the weights and biases to move the accuracy in a positive direction. The optimizer we will want to use is called Adam (Adaptive Moment Estimation) – a popular optimizer that will outperform (almost) every other optimization algorithm.

lossFN = nn.NLLLoss()
optimizer = torch.optim.Adam(dnn.parameters(), lr=learningRate)

Time to train the network! We will loop through the training steps 100 times, as we stated in our hyperparameters. These steps consist of:

Looping through each batch of training data.
Separating each batch of training data into two variables – one for the attributes, one for the class labels.
Zeroing the gradient of the previous epoch.
Completing a forward pass through the network with the batch of training attributes.
Calculating the loss with respect to the class labels.
And finally, performing back propagation.

for epoch in range(0, numEpoch):
    for i, data in enumerate(trainLoader, 0):
        labels = Variable(data[:,-1])
        data = Variable(data[:,0:30].float())
        optimizer.zero_grad()
        outputs = dnn(data)
        loss = lossFN(outputs, labels.long())
        loss.backward()
        optimizer.step()
    print('Epoch [%d/%d], Loss: %.4f' %(epoch+1, numEpoch, loss.item()))

Finally, test our network using the following code:

correct = 0
total = 0
for data in testLoader:
    labels = Variable(data[:,-1])
    data = Variable(data[:,0:30].float())
    outputs = dnn(data)
    _, predicted = torch.max(outputs.data, 1)
    total += labels.size(0)
    correct += (predicted == labels.long()).sum()

print('Accuracy of the network on the data: %d %%' % (100 * correct / total))

This tells us that our accuracy is at 95%. This is pretty good for a first try! We can now identify phishing websites with a high accuracy using only 30 features. Due to its Python integration and dynamic computational graphs, PyTorch is relatively easy to pick up, making it a more approachable neural network framework than TensorFlow. However, PyTorch is a relatively new framework, so it only has a small community and limited resources, hindering the ability to learn and debug. But as with any tech, it's all a matter of personal preference. The dataset and code used in this tutorial have been uploaded to my GitHub account, which can be found at: https://github.com/ToriTompkins/DataShare
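As a closing illustration, the trained network can be used to score a single webpage. This is a sketch, not from the original tutorial, assuming (as above) that the first 30 columns of a test row are the features:

# Classify one webpage from the test set (illustrative only).
sample = torch.tensor(testDataset.values[0, 0:30]).float().unsqueeze(0)  # shape [1, 30]
prediction = torch.argmax(dnn(Variable(sample)), dim=1)
print(prediction.item())  # 0 or 1, mapping to the two class labels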

Databricks UDF Performance Comparisons

I've recently been spending quite a bit of time on the Azure Databricks platform, and while learning decided it was worth using it to experiment with some common data warehousing tasks in the form of data cleansing. As Databricks provides us with a platform to run a Spark environment on, it offers options to use cross-platform APIs that allow us to write code in Scala, Python, R, and SQL within the same notebook. As with most things in life, not everything is equal and there are potential differences in performance between them. In this blog, I will explain the tests I produced with the aim of outlining best practice for Databricks implementations for UDFs of this nature.

Scala is the native language for Spark – and without going into too much detail here, it will compile down faster to the JVM for processing. Python, on the other hand, provides a wrapper around the code, but in reality it is a Scala program telling the cluster what to do, with the work being carried out by Scala code. Converting these objects into a form Python can read is called serialisation / deserialisation, and it's expensive, especially over time and across a distributed dataset. The most expensive scenario occurs with UDFs (functions): at runtime the data must be serialised out of the JVM to the Python worker, and the result deserialised and written back into JVM memory. Using Scala to create the UDFs, the execution process can skip these steps and keep everything native. Scala UDFs operate within the JVM of the executor, so we can skip serialisation and deserialisation.

Experiments

As part of my data for this task I took a list of company names from a data set and then ran them through a process to codify them, essentially stripping out characters which cause them to be unique and converting them to upper case, thus grouping a set of companies together under the same name. For instance, Adatis, Adatis Ltd, and Adatis (Ltd) would all become ADATIS. This is an example of a typical cleansing activity when working with data sets.

The dataset in question was around 2.5GB and contained 10.5m rows. The cluster I used was Databricks runtime 4.2 (Spark 2.3.1 / Scala 2.11) with Standard_DS2_v2 VMs for the driver/worker nodes (14GB memory), with autoscaling disabled and limited to 2 workers. I disabled the autoscaling for this as I was seeing wildly inconsistent timings each run, which impacted the tests. The good news is that with it enabled and using up to 8 workers, the timings were about 20% faster, albeit more erratic from a standard deviation point of view.

The following approaches were tested:

Scala program calls Scala UDF via Function
Scala program calls Scala UDF via SQL
Python program calls Scala UDF via SQL
Python program calls Python UDF via Function
Python program calls Python Vectorised UDF via Function
Python program uses SQL

While it was true in previous versions of Spark that there was a difference between these using Scala/Python, in the latest version of Spark (2.3) it is believed to be more of a level playing field, achieved by using Apache Arrow in the form of Vectorised Pandas UDFs within Python. As part of the tests I also wanted to use Python to call a Scala UDF via a function, but unfortunately we cannot do this without creating a Jar file of the Scala code and importing it separately. This would be done via SBT (build tool) using the following guide here. I considered this too much of an overhead for the purposes of the experiment.

The following code was then used as part of a Databricks notebook to define the tests.
A custom function to time the write was required for Scala, whereas Python allows us to use %timeit for a similar purpose.

Scala program calls Scala UDF via Function

%scala
def codifyScalaUdf = udf((string: String) => string.toUpperCase.replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP",""))
spark.udf.register("ScalaUdf", codifyScalaUdf)
val transformedScalaDf = table("DataTable").select(codifyScalaUdf($"CompanyName").alias("CompanyName"))
val ssfTime = timeIt(transformedScalaDf.write.mode("overwrite").format("parquet").saveAsTable("SSF"))

Scala program calls Scala UDF via SQL

%scala
val sss = spark.sql("SELECT ScalaUdf(CompanyName) as a from DataTable where CompanyName is not null")
val sssTime = timeIt(sss.write.mode("overwrite").format("parquet").saveAsTable("SSS"))

Python program calls Scala UDF via SQL

pss = spark.sql("SELECT ScalaUdf(CompanyName) as a from DataTable where CompanyName is not null")
%timeit -r 1 pss.write.format("parquet").saveAsTable("PSS", mode='overwrite')

Python program calls Python UDF via Function

from pyspark.sql.functions import *
from pyspark.sql.types import StringType

@udf(StringType())
def pythonCodifyUDF(string):
    return (string.upper().replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP",""))

pyDF = df.select(pythonCodifyUDF(col("CompanyName")).alias("CompanyName")).filter(col("CompanyName").isNotNull())
%timeit -r 1 pyDF.write.format("parquet").saveAsTable("PPF", mode='overwrite')

Python program calls Python Vectorised UDF via Function

from pyspark.sql.types import StringType
from pyspark.sql.functions import pandas_udf, col

@pandas_udf(returnType=StringType())
def pythonCodifyVecUDF(string):
    return (string.replace(" ", "").replace("#","").replace(";","").replace("&","").replace(" AND ","").replace(" THE ","").replace("LTD","").replace("LIMITED","").replace("PLC","").replace(".","").replace(",","").replace("[","").replace("]","").replace("LLP","").replace("INC","").replace("CORP","")).str.upper()

pyVecDF = df.select(pythonCodifyVecUDF(col("CompanyName")).alias("CompanyName")).filter(col("CompanyName").isNotNull())
%timeit -r 1 pyVecDF.write.format("parquet").saveAsTable("PVF", mode='overwrite')

Python Program uses SQL

sql = spark.sql("SELECT upper(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(CompanyName,' ',''),'&',''),';',''),'#',''),' AND ',''),' THE ',''),'LTD',''),'LIMITED',''),'PLC',''),'.',''),',',''),'[',''),']',''),'LLP',''),'INC',''),'CORP','')) as a from DataTable where CompanyName is not null")
%timeit -r 1 sql.write.format("parquet").saveAsTable("SQL", mode='overwrite')

Results and Observations

It was interesting to note the following:
The hypothesis above does indeed hold true: the two methods expected to be slowest were indeed the slowest in the experiment, and by a considerable margin.
The Scala UDF performs consistently regardless of the method used to call the UDF.
The Python vectorised UDF now performs on par with the Scala UDFs, and there is a clear difference between the vectorised and non-vectorised Python UDFs.
The standard deviation for the vectorised UDF was surprisingly low; the method was performing consistently each run. The non-vectorised Python UDF was the opposite.

To summarise: moving forward, as long as you write your UDFs in Scala or use the vectorised version of the Python UDF, the performance will be similar for this type of activity. It's worth noting that you should definitely avoid writing UDFs as standard Python functions, due to the theory and results above. Over time, across a complete solution and with more data, this time would add up.

Visualising Network Data in Power BI with Python Integration and NetworkX

The long-awaited Python integration in Power BI, added earlier this month, welcomes the opportunity for further customised reporting by exploiting the vast range of Python visualisation libraries. Among my favourites of these Python visualisation / data science libraries is NetworkX, a powerful package designed to manipulate and study the structure and dynamics of complex networks. While NetworkX excels most at applying graph theory algorithms on network graphs in excess of 100 million edges, it also provides the capability to visualise these networks efficiently and, in my opinion, more easily than the equivalent packages in R. In this article, I will explain how to visualise network data in Power BI utilising the new Python integration and the NetworkX Python library.

Getting Started

To begin experimenting with NetworkX and Python in Power BI, there are several pre-requisites:

Enable Python integration in the preview settings by going to File -> Options and Settings -> Options -> Preview features and enabling Python support.
Ensure Python is installed and fully up-to-date.
Install the following Python libraries: NetworkX, NumPy, pandas, Matplotlib.

Loading Data

The data I used was created to demonstrate this task in Power BI, but there are many real-world network datasets to experiment with, provided by the Stanford Network Analysis Project. This small dummy dataset represents a co-purchasing network of books.

The data I loaded into Power BI consisted of two separate CSVs. One, Books.csv, consisted of metadata pertaining to the top 40 bestselling books according to Wikipedia and their assigned IDs. The other, Relationship.csv, was an edgelist of the book IDs, which is a popular method for storing/delivering network data. The graph I wanted to create was an undirected, unweighted graph which I wanted to be able to cross-filter accurately. Because of this, I duplicated this edgelist and reversed the columns so the ToNodeId and FromNodeId were swapped. Adding this new edgelist onto the end of the original edgelist created a dataset which can be filtered on both columns later down the line. For directed graphs, this step is unnecessary and can be ignored.
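A small pandas sketch of that edgelist preparation step (the column names match those above; the exact code was not included in the original post):

import pandas as pd

# Relationship.csv is an edgelist of book IDs with FromNodeId/ToNodeId columns.
edges = pd.read_csv("Relationship.csv")

# Swap the columns and append, so an undirected edge can be filtered from either end.
reversedEdges = edges.rename(columns={"FromNodeId": "ToNodeId", "ToNodeId": "FromNodeId"})
undirected = pd.concat([edges, reversedEdges], ignore_index=True)
undirected.to_csv("UndirectedRelationship.csv", index=False)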
Once loaded into Power BI, I duplicated the Books table to create the following relationship diagram, as it isn't possible to replicate the relationship between FromNodeId to Book ID and ToNodeId to Book ID with only one Books table. From here I can build my network graph.

Building the Network Graph

Finally, we can begin the Python integration!

Select the Python visual from the visualizations pane and drag this onto the dashboard.
Drag the Book Title columns of both Books and Books2 into Values.

Power BI will create a data frame from these values. This can be seen in the top 4 lines in the Python script editor. The following Python code (also shown above) will create and draw a simple undirected and unweighted network graph, with node labels, from the data frame Power BI generated:

import networkx as nx
import matplotlib.pyplot as plt

G = nx.from_pandas_edgelist(dataset, source="Book Title", target="Book Title.1")
nx.draw(G, with_labels = True)
plt.show()

** NOTE: You may find that the code above will fail to work with large networks. This is because by default NetworkX will draw the graph according to the Fruchterman-Reingold layout, which positions the nodes for the highest readability. This layout is unsuitable for networks larger than 1,000 nodes due to the memory and run time required to run the algorithm. As an alternative, you can position the nodes in a circle or randomly by editing the line

nx.draw(G, with_labels = True)

to

nx.draw(G, with_labels = True, pos=nx.circular_layout(G))
or
nx.draw(G, with_labels = True, pos=nx.random_layout(G)) **

This will produce the network graph below. You are also able to cross-filter the network graph by selecting rows in the table on the right-hand side.

Conclusion

Python visuals are simple to produce, and although the visual itself isn't interactive, they will update with data refreshes and cross-filtering, much like the R integration added 3 years ago. The introduction of Python in Power BI has opened doors for visualisation with libraries such as NetworkX, to visualise all BI networks, from airline connection flights and co-purchasing networks to social network analysis.

Parsing nested JSON lists in Databricks using Python

Parsing complex JSON structures is usually not a trivial task. When your destination is a database, what you expect naturally is a flattened result set. Things get more complicated when your JSON source is a web service and the result consists of multiple nested objects, including lists in lists and so on. Things get even more complicated if the JSON schema changes over time, which is often a real-life scenario. We have these wonderful Azure Logic Apps, which help us consistently get the JSON results from various sources. However, Logic Apps are not so good at parsing more complex nested structures. And they definitely don't like even subtle source schema changes. Enter Databricks! With Databricks you get:

An easy way to infer the JSON schema and avoid creating it manually
Subtle changes in the JSON schema won't break things
The ability to explode nested lists into rows in a very easy way (see the Notebook below)
Speed!

Following is an example Databricks Notebook (Python) demonstrating the above claims. The JSON sample consists of an imaginary JSON result set, which contains a list of car models within a list of car vendors within a list of people. We want to flatten this result into a dataframe. Here you go:
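A minimal PySpark sketch of the approach (the JSON structure and field names below are illustrative rather than the exact sample from the original notebook, and spark/sc are assumed to come from the Databricks notebook context):

from pyspark.sql.functions import col, explode

# An illustrative nested result set: people -> car vendors -> car models.
source = """{"people": [
  {"name": "Alice", "vendors": [
    {"vendor": "Ford", "models": ["Fiesta", "Focus"]},
    {"vendor": "BMW", "models": ["3 Series"]}]},
  {"name": "Bob", "vendors": [
    {"vendor": "Fiat", "models": ["500"]}]}
]}"""

# Let Spark infer the JSON schema rather than defining it manually.
df = spark.read.json(sc.parallelize([source]))

# Explode each nested list into rows, one level at a time.
flat = (df
    .select(explode("people").alias("person"))
    .select(col("person.name").alias("Name"), explode("person.vendors").alias("v"))
    .select("Name", col("v.vendor").alias("Vendor"), explode("v.models").alias("Model")))

flat.show()  # one row per (person, vendor, model) combination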
Optimized","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":97871,"instance_type_id":"Standard_L16s","node_type_id":"Standard_L16s","description":"Standard_L16s","support_cluster_tags":true,"container_memory_mb":122339,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_L16s","provider":"Azure","local_disk_size_gb":2807,"supports_accelerated_networking":false,"compute_units":16.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":131072,"num_cores":16,"cpu_quota_type":"Standard LS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":131072,"is_hidden":false,"category":"Storage Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":199583,"instance_type_id":"Standard_L32s","node_type_id":"Standard_L32s","description":"Standard_L32s","support_cluster_tags":true,"container_memory_mb":249479,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_L32s","provider":"Azure","local_disk_size_gb":5630,"supports_accelerated_networking":false,"compute_units":32.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":262144,"num_cores":32,"cpu_quota_type":"Standard LS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":262144,"is_hidden":false,"category":"Storage Optimized","num_cores":32.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":2516,"instance_type_id":"Standard_F4s","node_type_id":"Standard_F4s","description":"Standard_F4s","support_cluster_tags":true,"container_memory_mb":3146,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_F4s","provider":"Azure","local_disk_size_gb":16,"supports_accelerated_networking":true,"compute_units":4.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":8192,"num_cores":4,"cpu_quota_type":"Standard FS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":16,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":8192,"is_hidden":false,"category":"Compute 
Optimized","num_cores":4.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":8873,"instance_type_id":"Standard_F8s","node_type_id":"Standard_F8s","description":"Standard_F8s","support_cluster_tags":true,"container_memory_mb":11092,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_F8s","provider":"Azure","local_disk_size_gb":32,"supports_accelerated_networking":true,"compute_units":8.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":16384,"num_cores":8,"cpu_quota_type":"Standard FS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":16384,"is_hidden":false,"category":"Compute Optimized","num_cores":8.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":21587,"instance_type_id":"Standard_F16s","node_type_id":"Standard_F16s","description":"Standard_F16s","support_cluster_tags":true,"container_memory_mb":26984,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":350},"node_instance_type":{"instance_type_id":"Standard_F16s","provider":"Azure","local_disk_size_gb":64,"supports_accelerated_networking":true,"compute_units":16.0,"number_of_ips":16,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":32768,"num_cores":16,"cpu_quota_type":"Standard FS Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"},{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":32768,"is_hidden":false,"category":"Compute Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":0,"spark_heap_memory":85157,"instance_type_id":"Standard_H16","node_type_id":"Standard_H16","description":"Standard_H16","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":8},"node_instance_type":{"instance_type_id":"Standard_H16","provider":"Azure","local_disk_size_gb":2000,"supports_accelerated_networking":false,"compute_units":16.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":0,"memory_mb":114688,"num_cores":16,"cpu_quota_type":"Standard H Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"Compute Optimized","num_cores":16.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":2,"spark_heap_memory":85157,"instance_type_id":"Standard_NC12","node_type_id":"Standard_NC12","description":"Standard_NC12 
(beta)","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":48},"node_instance_type":{"instance_type_id":"Standard_NC12","provider":"Azure","local_disk_size_gb":680,"supports_accelerated_networking":false,"compute_units":12.0,"number_of_ips":2,"local_disks":1,"reserved_compute_units":1.0,"gpus":2,"memory_mb":114688,"num_cores":12,"cpu_quota_type":"Standard NC Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":48,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"GPU Accelerated","num_cores":12.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":4,"spark_heap_memory":174155,"instance_type_id":"Standard_NC24","node_type_id":"Standard_NC24","description":"Standard_NC24 (beta)","support_cluster_tags":true,"container_memory_mb":217694,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":48},"node_instance_type":{"instance_type_id":"Standard_NC24","provider":"Azure","local_disk_size_gb":1440,"supports_accelerated_networking":false,"compute_units":24.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":4,"memory_mb":229376,"num_cores":24,"cpu_quota_type":"Standard NC Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":64,"supported_disk_types":[{"azure_disk_volume_type":"STANDARD_LRS"}],"reserved_memory_mb":4800},"memory_mb":229376,"is_hidden":false,"category":"GPU Accelerated","num_cores":24.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":1,"spark_heap_memory":85157,"instance_type_id":"Standard_NC6s_v3","node_type_id":"Standard_NC6s_v3","description":"Standard_NC6s_v3 (beta)","support_cluster_tags":true,"container_memory_mb":106447,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":0},"node_instance_type":{"instance_type_id":"Standard_NC6s_v3","provider":"Azure","local_disk_size_gb":736,"supports_accelerated_networking":false,"compute_units":6.0,"number_of_ips":4,"local_disks":1,"reserved_compute_units":1.0,"gpus":1,"memory_mb":114688,"num_cores":6,"cpu_quota_type":"Standard NCSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":12,"supported_disk_types":[{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":114688,"is_hidden":false,"category":"GPU Accelerated","num_cores":6.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":2,"spark_heap_memory":174155,"instance_type_id":"Standard_NC12s_v3","node_type_id":"Standard_NC12s_v3","description":"Standard_NC12s_v3 (beta)","support_cluster_tags":true,"container_memory_mb":217694,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":0},"node_instance_type":{"instance_type_id":"Standard_NC12s_v3","provider":"Azure","local_disk_size_gb":1474,"supports_accelerated_networking":false,"compute_units":12.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":2,"memory_mb":229376,"num_cores":12,"cpu_quota_type":"Standard NCSv3 Family 
vCPUs","local_disk_type":"AHCI","max_attachable_disks":24,"supported_disk_types":[{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":229376,"is_hidden":false,"category":"GPU Accelerated","num_cores":12.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false},{"display_order":0,"support_ssh":true,"num_gpus":4,"spark_heap_memory":352151,"instance_type_id":"Standard_NC24s_v3","node_type_id":"Standard_NC24s_v3","description":"Standard_NC24s_v3 (beta)","support_cluster_tags":true,"container_memory_mb":440189,"node_info":{"status":["NotEnabledOnSubscription"],"available_core_quota":0},"node_instance_type":{"instance_type_id":"Standard_NC24s_v3","provider":"Azure","local_disk_size_gb":2948,"supports_accelerated_networking":false,"compute_units":24.0,"number_of_ips":8,"local_disks":1,"reserved_compute_units":1.0,"gpus":4,"memory_mb":458752,"num_cores":24,"cpu_quota_type":"Standard NCSv3 Family vCPUs","local_disk_type":"AHCI","max_attachable_disks":32,"supported_disk_types":[{"azure_disk_volume_type":"PREMIUM_LRS"}],"reserved_memory_mb":4800},"memory_mb":458752,"is_hidden":false,"category":"GPU Accelerated","num_cores":24.0,"is_io_cache_enabled":false,"support_port_forwarding":true,"support_ebs_volumes":true,"is_deprecated":false}],"default_node_type_id":"Standard_DS3_v2"},"enableDatabaseSupportClusterChoice":true,"enableClusterAcls":true,"notebookRevisionVisibilityHorizon":0,"serverlessClusterProductName":"Serverless Pool","showS3TableImportOption":false,"redirectBrowserOnWorkspaceSelection":true,"maxEbsVolumesPerInstance":10,"enableRStudioUI":false,"isAdmin":true,"deltaProcessingBatchSize":1000,"timerUpdateQueueLength":100,"sqlAclsEnabledMap":{"spark.databricks.acl.enabled":"true","spark.databricks.acl.sqlOnly":"true"},"enableLargeResultDownload":true,"maxElasticDiskCapacityGB":5000,"serverlessDefaultMinWorkers":2,"zoneInfos":[],"enableCustomSpotPricingUIByTier":true,"serverlessClustersEnabled":true,"enableWorkspaceBrowserSorting":true,"enableSentryLogging":false,"enableFindAndReplace":true,"disallowUrlImportExceptFromDocs":false,"defaultStandardClusterModel":{"cluster_name":"","node_type_id":"Standard_DS3_v2","spark_version":"4.0.x-scala2.11","num_workers":null,"autoscale":{"min_workers":2,"max_workers":8},"autotermination_minutes":120,"default_tags":{"Vendor":"Databricks","Creator":"ivv@adatis.co.uk","ClusterName":null,"ClusterId":""}},"enableEBSVolumesUIForJobs":true,"enablePublishNotebooks":false,"enableBitbucketCloud":true,"shouldShowCommandStatus":true,"createTableInNotebookS3Link":{"url":"https://docs.azuredatabricks.net/_static/notebooks/data-import/s3.html","displayName":"S3","workspaceFileName":"S3 Example"},"sanitizeHtmlResult":true,"enableClusterPinningUI":true,"enableJobAclsConfig":false,"enableFullTextSearch":false,"enableElasticSparkUI":true,"enableNewClustersCreate":true,"clusters":true,"allowRunOnPendingClusters":true,"useAutoscalingByDefault":true,"enableAzureToolbar":true,"fileStoreBase":"FileStore","enableEmailInAzure":true,"enableRLibraries":true,"enableTableAclsConfig":false,"enableSshKeyUIInJobs":true,"enableDetachAndAttachSubMenu":true,"configurableSparkOptionsSpec":[{"keyPattern":"spark\\.kryo(\\.[^\\.]+)+","valuePattern":".*","keyPatternDisplay":"spark.kryo.*","valuePatternDisplay":"*","description":"Configuration options for Kryo 
serialization"},{"keyPattern":"spark\\.io\\.compression\\.codec","valuePattern":"(lzf|snappy|org\\.apache\\.spark\\.io\\.LZFCompressionCodec|org\\.apache\\.spark\\.io\\.SnappyCompressionCodec)","keyPatternDisplay":"spark.io.compression.codec","valuePatternDisplay":"snappy|lzf","description":"The codec used to compress internal data such as RDD partitions, broadcast variables and shuffle outputs."},{"keyPattern":"spark\\.serializer","valuePattern":"(org\\.apache\\.spark\\.serializer\\.JavaSerializer|org\\.apache\\.spark\\.serializer\\.KryoSerializer)","keyPatternDisplay":"spark.serializer","valuePatternDisplay":"org.apache.spark.serializer.JavaSerializer|org.apache.spark.serializer.KryoSerializer","description":"Class to use for serializing objects that will be sent over the network or need to be cached in serialized form."},{"keyPattern":"spark\\.rdd\\.compress","valuePattern":"(true|false)","keyPatternDisplay":"spark.rdd.compress","valuePatternDisplay":"true|false","description":"Whether to compress serialized RDD partitions (e.g. for StorageLevel.MEMORY_ONLY_SER). Can save substantial space at the cost of some extra CPU time."},{"keyPattern":"spark\\.speculation","valuePattern":"(true|false)","keyPatternDisplay":"spark.speculation","valuePatternDisplay":"true|false","description":"Whether to use speculation (recommended off for streaming)"},{"keyPattern":"spark\\.es(\\.[^\\.]+)+","valuePattern":".*","keyPatternDisplay":"spark.es.*","valuePatternDisplay":"*","description":"Configuration options for ElasticSearch"},{"keyPattern":"es(\\.([^\\.]+))+","valuePattern":".*","keyPatternDisplay":"es.*","valuePatternDisplay":"*","description":"Configuration options for ElasticSearch"},{"keyPattern":"spark\\.(storage|shuffle)\\.memoryFraction","valuePattern":"0?\\.0*([1-9])([0-9])*","keyPatternDisplay":"spark.(storage|shuffle).memoryFraction","valuePatternDisplay":"(0.0,1.0)","description":"Fraction of Java heap to use for Spark's shuffle or storage"},{"keyPattern":"spark\\.streaming\\.backpressure\\.enabled","valuePattern":"(true|false)","keyPatternDisplay":"spark.streaming.backpressure.enabled","valuePatternDisplay":"true|false","description":"Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5). This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values `spark.streaming.receiver.maxRate` and `spark.streaming.kafka.maxRatePerPartition` if they are set."},{"keyPattern":"spark\\.streaming\\.receiver\\.maxRate","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.receiver.maxRate","valuePatternDisplay":"numeric","description":"Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no limit on the rate. 
See the deployment guide in the Spark Streaming programing guide for mode details."},{"keyPattern":"spark\\.streaming\\.kafka\\.maxRatePerPartition","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.kafka.maxRatePerPartition","valuePatternDisplay":"numeric","description":"Maximum rate (number of records per second) at which data will be read from each Kafka partition when using the Kafka direct stream API introduced in Spark 1.3. See the Kafka Integration guide for more details."},{"keyPattern":"spark\\.streaming\\.kafka\\.maxRetries","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.kafka.maxRetries","valuePatternDisplay":"numeric","description":"Maximum number of consecutive retries the driver will make in order to find the latest offsets on the leader of each partition (a default value of 1 means that the driver will make a maximum of 2 attempts). Only applies to the Kafka direct stream API introduced in Spark 1.3."},{"keyPattern":"spark\\.streaming\\.ui\\.retainedBatches","valuePattern":"^([0-9]{1,})$","keyPatternDisplay":"spark.streaming.ui.retainedBatches","valuePatternDisplay":"numeric","description":"How many batches the Spark Streaming UI and status APIs remember before garbage collecting."}],"enableReactNotebookComments":true,"enableAdminPasswordReset":false,"checkBeforeAddingAadUser":true,"enableResetPassword":true,"maxClusterTagValueLength":256,"enableJobsSparkUpgrade":true,"createTableInNotebookDBFSLink":{"url":"https://docs.azuredatabricks.net/_static/notebooks/data-import/dbfs.html","displayName":"DBFS","workspaceFileName":"DBFS Example"},"perClusterAutoterminationEnabled":true,"enableNotebookCommandNumbers":true,"measureRoundTripTimes":true,"allowStyleInSanitizedHtml":false,"sparkVersions":[{"key":"3.3.x-scala2.10","displayName":"3.3 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-86a9b375074f5afad339e70230ec0ec265c4cefbd280844785fab3bcde5869f9","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"4.1.x-scala2.11","displayName":"4.1 (includes Apache Spark 2.3.0, Scala 2.11)","packageLabel":"spark-image-e69ab61f9eb55b59a5df3b9d9ca1605268022efa7b699e11b408049038d8ea8c","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS","SUPPORTS_RSTUDIO"]},{"key":"4.1.x-ml-gpu-scala2.11","displayName":"4.1 ML Beta (includes Apache Spark 2.3.0, GPU, Scala 2.11)","packageLabel":"spark-image-5907529b625e97ac8feb0a069002b4fdb861a16740752b5df568fe4efb1c004e","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":[]},{"key":"4.0.x-scala2.11","displayName":"4.0 (includes Apache Spark 2.3.0, Scala 2.11)","packageLabel":"spark-image-958dfd1fcde8070c85e13f869b8d816b71d63cac31357210d4858c3ff3be83ce","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"3.4.x-scala2.11","displayName":"3.4 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-35a5008cd4a7aac70818911317758515b85d0fcab6ead08fa3f66157119fa6ce","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION"]},{"key":"3.2.x-scala2.10","displayName":"3.2 (includes Apache Spark 2.2.0, Scala 
2.10)","packageLabel":"spark-image-557788bea0eea16bbf7a8ba13ace07e64dd7fc86270bd5cea086097fe886431f","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"4.1.x-ml-scala2.11","displayName":"4.1 ML Beta (includes Apache Spark 2.3.0, Scala 2.11)","packageLabel":"spark-image-ad599fbbca53898d7531a4b94c73f3e68b3c2e49e3502c09f6bf01468d801882","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":[]},{"key":"latest-experimental-scala2.10","displayName":"[DO NOT USE] Latest experimental (3.5 snapshot, Scala 2.10)","packageLabel":"spark-image-5e4f1f2feb631875a6036dffb069ec14b436939b5efe0ecb3ff8220c835298d6","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-rc-scala2.11","displayName":"Latest RC (4.2 snapshot, Scala 2.11)","packageLabel":"spark-image-eb6b629259e901623758e884730de4a93e6babcb8995b191e759e09a596490fd","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS","SUPPORTS_RSTUDIO"]},{"key":"latest-stable-scala2.11","displayName":"Latest stable (Scala 2.11)","packageLabel":"spark-image-e69ab61f9eb55b59a5df3b9d9ca1605268022efa7b699e11b408049038d8ea8c","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS","SUPPORTS_RSTUDIO"]},{"key":"4.1.x-gpu-scala2.11","displayName":"4.1 (includes Apache Spark 2.3.0, GPU, Scala 2.11)","packageLabel":"spark-image-79462cd30eaa7c6e877243f75bd986e68290424e3856fad0fbd47a3b6353032e","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_RSTUDIO"]},{"key":"3.5.x-scala2.10","displayName":"3.5 LTS (includes Apache Spark 2.2.1, Scala 2.10)","packageLabel":"spark-image-2c23eb0f5a3d83904705cf416a815421bdca898db8835c0dcf9084c01509594e","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-rc-scala2.10","displayName":"[DO NOT USE] Latest RC (3.5 snapshot, Scala 2.10)","packageLabel":"spark-image-5e4f1f2feb631875a6036dffb069ec14b436939b5efe0ecb3ff8220c835298d6","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-stable-scala2.10","displayName":"[DEPRECATED] Latest stable (Scala 2.10)","packageLabel":"spark-image-5e4f1f2feb631875a6036dffb069ec14b436939b5efe0ecb3ff8220c835298d6","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"3.1.x-scala2.11","displayName":"3.1 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-241fa8b78ee6343242b1756b18076270894385ff40a81172a6fb5eadf66155d3","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"3.1.x-scala2.10","displayName":"3.1 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-7efac6b9a8f2da59cb4f6d0caac46cfcb3f1ebf64c8073498c42d0360f846714","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"3.3.x-scala2.11","displayName":"3.3 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-46cc39a9afa43fbd7bfa9f4f5ed8d23f658cd0b0d74208627243222ae0d22f8d","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"3.5.x-scala2.11","displayName":"3.5 LTS (includes Apache Spark 2.2.1, Scala 
2.11)","packageLabel":"spark-image-ddab4ca82a96df57f2dd2d5acfecd2373ac2db360d9559cd59e632dee270f05b","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},{"key":"latest-experimental-scala2.11","displayName":"Latest experimental (4.2 snapshot, Scala 2.11)","packageLabel":"spark-image-eb6b629259e901623758e884730de4a93e6babcb8995b191e759e09a596490fd","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS","SUPPORTS_RSTUDIO"]},{"key":"3.2.x-scala2.11","displayName":"3.2 (includes Apache Spark 2.2.0, Scala 2.11)","packageLabel":"spark-image-5537926238bc55cb6cd76ee0f0789511349abead3781c4780721a845f34b5d4e","upgradable":true,"deprecated":true,"customerVisible":false,"capabilities":[]},{"key":"latest-rc-gpu-scala2.11","displayName":"Latest RC (4.2 snapshot, GPU, Scala 2.11)","packageLabel":"spark-image-8fe44f4a94defa1f3e13ffcbe83d5fe86c26e651f20c9250f5f09930295a9e66","upgradable":true,"deprecated":false,"customerVisible":false,"capabilities":["SUPPORTS_RSTUDIO"]},{"key":"3.4.x-scala2.10","displayName":"3.4 (includes Apache Spark 2.2.0, Scala 2.10)","packageLabel":"spark-image-3e68b33974ebcf196fd048476d71c8747b8e3596456ec9e6621d73388e5484f9","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION"]}],"enablePresentationMode":false,"enableClearStateAndRunAll":true,"enableTableAclsByTier":true,"enableRestrictedClusterCreation":false,"enableFeedback":false,"enableClusterAutoScaling":true,"enableUserVisibleDefaultTags":true,"defaultNumWorkers":8,"serverContinuationTimeoutMillis":10000,"jobsUnreachableThresholdMillis":60000,"driverStderrFilePrefix":"stderr","roundTripReportTimeoutMs":5000,"enableNotebookRefresh":true,"createTableInNotebookImportedFileLink":{"url":"https://docs.azuredatabricks.net/_static/notebooks/data-import/imported-file.html","displayName":"Imported File","workspaceFileName":"Imported File 
Example"},"accountsOwnerUrl":"https://portal.azure.com/?feature.customportal=false&microsoft_azure_marketplace_ItemHideKey=DatabricksExtensionHidden&Microsoft_Azure_Databricks=true#resource/subscriptions/76dd74d5-e8e7-493d-91dc-d8113ee1f20c/resourceGroups/RGABI/providers/Microsoft.Databricks/workspaces/abiweuadlsdev","driverStdoutFilePrefix":"stdout","showDbuPricing":true,"databricksDocsBaseHostname":"docs.azuredatabricks.net","defaultNodeTypeToPricingUnitsMap":{"Standard_E64s_v3":16,"r3.2xlarge":2,"i3.4xlarge":4,"Standard_NC12s_v2":6.75,"class-node":1,"m4.2xlarge":1.5,"Standard_D11_v2":0.5,"r4.xlarge":1,"m4.4xlarge":3,"p3.2xlarge":4.15,"Standard_DS5_v2":3,"Standard_D2s_v3":0.5,"Standard_DS4_v2_Promo":1.5,"Standard_DS14":4,"Standard_DS11_v2_Promo":0.5,"r4.16xlarge":16,"Standard_NC6":1.5,"Standard_DS11":0.5,"Standard_D2_v3":0.5,"Standard_DS14_v2_Promo":4,"Standard_D64s_v3":12,"p2.8xlarge":9.76,"m4.10xlarge":8,"Standard_D8s_v3":1.5,"Standard_E32s_v3":8,"Standard_DS3":0.75,"Standard_DS2_v2":0.5,"r3.8xlarge":8,"r4.4xlarge":4,"dev-tier-node":1,"Standard_L8s":2,"Standard_D13_v2":2,"p3.16xlarge":33.2,"Standard_NC24rs_v3":20,"Standard_DS13_v2_Promo":2,"Standard_E4s_v3":1,"Standard_D3_v2":0.75,"Standard_NC24":6,"Standard_NC24r":6,"Standard_DS15_v2":5,"Standard_D16s_v3":3,"Standard_D5_v2":3,"Standard_E8s_v3":2,"Standard_DS2_v2_Promo":0.5,"c3.8xlarge":4,"Standard_D4_v3":0.75,"Standard_E2s_v3":0.5,"Standard_D32_v3":6,"Standard_DS3_v2":0.75,"Standard_NC6s_v3":5,"r3.4xlarge":4,"Standard_DS4":1.5,"i2.4xlarge":6,"Standard_DS3_v2_Promo":0.75,"m4.xlarge":0.75,"r4.8xlarge":8,"Standard_D14_v2":4,"Standard_H16":4,"Standard_NC12":3,"Standard_DS14_v2":4,"r4.large":0.5,"Standard_D15_v2":5,"Standard_DS12":1,"development-node":1,"i2.2xlarge":3,"Standard_NC6s_v2":3.38,"g2.8xlarge":6,"Standard_D12_v2":1,"i3.large":0.75,"Standard_NC12s_v3":10,"memory-optimized":1,"m4.large":0.4,"Standard_D16_v3":3,"Standard_F4s":0.5,"p2.16xlarge":19.52,"Standard_NC24rs_v2":13.5,"i3.8xlarge":8,"Standard_D32s_v3":6,"i3.16xlarge":16,"Standard_DS12_v2":1,"Standard_L32s":8,"Standard_D4s_v3":0.75,"Standard_DS13":2,"Standard_DS11_v2":0.5,"Standard_DS12_v2_Promo":1,"Standard_DS13_v2":2,"c3.2xlarge":1,"Standard_L4s":1,"Standard_F16s":2,"c4.2xlarge":1,"Standard_L16s":4,"i2.xlarge":1.5,"Standard_DS2":0.5,"compute-optimized":1,"c4.4xlarge":2,"Standard_DS5_v2_Promo":3,"Standard_D64_v3":12,"Standard_D2_v2":0.5,"Standard_D8_v3":1.5,"i3.2xlarge":2,"Standard_E16s_v3":4,"Standard_F8s":1,"c3.4xlarge":2,"Standard_NC24s_v2":13.5,"Standard_NC24s_v3":20,"Standard_D4_v2":1.5,"g2.2xlarge":1.5,"p3.8xlarge":16.6,"p2.xlarge":1.22,"m4.16xlarge":12,"Standard_DS4_v2":1.5,"c4.8xlarge":4,"i3.xlarge":1,"r3.xlarge":1,"r4.2xlarge":2,"i2.8xlarge":12},"tableFilesBaseFolder":"/tables","enableSparkDocsSearch":true,"sparkHistoryServerEnabled":true,"enableClusterAppsUIOnServerless":false,"enableEBSVolumesUI":true,"homePageWelcomeMessage":"","metastoreServiceRowLimit":1000000,"enableIPythonImportExport":true,"enableClusterTagsUIForJobs":true,"enableClusterTagsUI":true,"enableNotebookHistoryDiffing":true,"branch":"2.72.251","accountsLimit":-1,"enableSparkEnvironmentVariables":true,"enableX509Authentication":false,"useAADLogin":true,"enableStructuredStreamingNbOptimizations":true,"enableNotebookGitBranching":true,"terminatedClustersWindow":2592000000,"local":false,"enableNotebookLazyRenderWrapper":false,"enableClusterAutoScalingForJobs":true,"enableStrongPassword":false,"showReleaseNote":false,"displayDefaultContainerMemoryGB":30,"broadenedEditPermission":false,"enableWorkspacePu
rgeDryRun":false,"disableS3TableImport":true,"enableArrayParamsEdit":true,"deploymentMode":"production","useSpotForWorkers":true,"removePasswordInAccountSettings":true,"preferStartTerminatedCluster":false,"enableUserInviteWorkflow":true,"createTableConnectorOptionLinks":[{"url":"https://docs.databricks.com/_static/notebooks/data-import/azure-blob-store.html","displayName":"Azure Blob Storage","workspaceFileName":"Azure Blob Storage Import Example Notebook"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/data-import/jdbc.html","displayName":"JDBC","workspaceFileName":"JDBC Example"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/cassandra.html","displayName":"Cassandra","workspaceFileName":"Cassandra Example"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/structured-streaming-etl-kafka.html","displayName":"Kafka","workspaceFileName":"Kafka Example"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/redis.html","displayName":"Redis","workspaceFileName":"Redis Example"},{"url":"https://docs.azuredatabricks.net/_static/notebooks/elasticsearch.html","displayName":"Elasticsearch","workspaceFileName":"Elasticsearch Example"}],"enableStaticNotebooks":true,"enableNewLineChart":true,"shouldReportUnhandledPromiseRejectionsToSentry":false,"sandboxForUrlSandboxFrame":"allow-scripts allow-popups allow-popups-to-escape-sandbox allow-forms","enableCssTransitions":true,"serverlessEnableElasticDisk":true,"minClusterTagKeyLength":1,"showHomepageFeaturedLinks":true,"pricingURL":"https://databricks.com/product/pricing","enableClusterEdit":true,"enableClusterAclsConfig":false,"useTempS3UrlForTableUpload":false,"notifyLastLogin":false,"enableFilePurge":true,"enableSshKeyUIByTier":true,"enableCreateClusterOnAttach":false,"defaultAutomatedPricePerDBU":0.35,"enableNotebookGitVersioning":true,"defaultMinWorkers":2,"commandStatusDebounceMaxWait":1000,"files":"files/","feedbackEmail":"feedback@databricks.com","enableDriverLogsUI":true,"enableExperimentalCharts":false,"defaultMaxWorkers":8,"enableWorkspaceAclsConfig":false,"serverlessRunPythonAsLowPrivilegeUser":false,"dropzoneMaxFileSize":2047,"enableNewClustersList":true,"enableNewDashboardViews":true,"enableJobListPermissionFilter":true,"terminatedInteractiveClustersMax":70,"driverLog4jFilePrefix":"log4j","enableSingleSignOn":false,"enableMavenLibraries":true,"updateTreeTableToV2Schema":false,"displayRowLimit":1000,"deltaProcessingAsyncEnabled":true,"enableSparkEnvironmentVariablesUI":false,"defaultSparkVersion":{"key":"4.0.x-scala2.11","displayName":"4.0 (includes Apache Spark 2.3.0, Scala 
2.11)","packageLabel":"spark-image-958dfd1fcde8070c85e13f869b8d816b71d63cac31357210d4858c3ff3be83ce","upgradable":true,"deprecated":false,"customerVisible":true,"capabilities":["SUPPORTS_END_TO_END_ENCRYPTION","SUPPORTS_TABLE_ACLS"]},"enableNewLineChartParams":false,"deprecatedEnableStructuredDataAcls":false,"enableCustomSpotPricing":true,"enableRStudioFreeUI":false,"enableMountAclsConfig":false,"defaultAutoterminationMin":120,"useDevTierHomePage":false,"disableExportNotebook":false,"enableClusterClone":true,"enableNotebookLineNumbers":true,"enablePublishHub":false,"notebookHubUrl":"http://hub.dev.databricks.com/","commandStatusDebounceInterval":100,"showSqlEndpoints":true,"enableNotebookDatasetInfoView":true,"defaultTagKeys":{"CLUSTER_NAME":"ClusterName","VENDOR":"Vendor","CLUSTER_TYPE":"ResourceClass","CREATOR":"Creator","CLUSTER_ID":"ClusterId"},"enableClusterAclsByTier":true,"databricksDocsBaseUrl":"https://docs.azuredatabricks.net/","azurePortalLink":"https://portal.azure.com","cloud":"Azure","customSparkVersionPrefix":"custom:","disallowAddingAdmins":false,"enableSparkConfUI":true,"enableClusterEventsUI":true,"featureTier":"STANDARD_W_SEC_TIER","mavenCentralSearchEndpoint":"http://search.maven.org/solrsearch/select","defaultServerlessClusterModel":{"cluster_name":"","node_type_id":"Standard_DS13_v2","spark_version":"latest-stable-scala2.11","num_workers":null,"enable_jdbc_auto_start":true,"custom_tags":{"ResourceClass":"Serverless"},"autoscale":{"min_workers":2,"max_workers":20},"spark_conf":{"spark.databricks.cluster.profile":"serverless","spark.databricks.repl.allowedLanguages":"sql,python,r"},"autotermination_minutes":0,"enable_elastic_disk":true,"default_tags":{"Vendor":"Databricks","Creator":"ivv@adatis.co.uk","ClusterName":null,"ClusterId":""}},"enableClearRevisionHistoryForNotebook":true,"enableOrgSwitcherUI":true,"bitbucketCloudBaseApiV2Url":"https://api.bitbucket.org/2.0","clustersLimit":-1,"enableJdbcImport":true,"enableClusterAppsUIOnNormalClusters":false,"enableElasticDisk":true,"logfiles":"logfiles/","enableRelativeNotebookLinks":true,"enableMultiSelect":true,"homePageLogo":"login/DB_Azure_Lockup_2x.png","enableWebappSharding":true,"enableNotebookParamsEdit":true,"enableClusterDeltaUpdates":true,"enableSingleSignOnLogin":false,"separateTableForJobClusters":true,"ebsVolumeSizeLimitGB":{"GENERAL_PURPOSE_SSD":[100,4096],"THROUGHPUT_OPTIMIZED_HDD":[500,4096]},"enableClusterDeleteUI":true,"enableMountAcls":false,"requireEmailUserName":true,"enableRServerless":true,"frameRateReportIntervalMs":10000,"dbcFeedbackURL":"http://feedback.databricks.com/forums/263785-product-feedback","enableMountAclService":true,"showVersion":false,"serverlessClustersByDefault":false,"collectDetailedFrameRateStats":true,"enableWorkspaceAcls":true,"maxClusterTagKeyLength":512,"gitHash":"","clusterTagReservedPrefixes":["azure","microsoft","windows"],"tableAclsEnabledMap":{"spark.databricks.acl.dfAclsEnabled":"true","spark.databricks.repl.allowedLanguages":"python,sql"},"showWorkspaceFeaturedLinks":true,"signupUrl":"","databricksDocsNotebookPathPrefix":"^https://docs\\.azuredatabricks\\.net/_static/notebooks/.+$","serverlessAttachEbsVolumesByDefault":false,"enableTokensConfig":true,"allowFeedbackForumAccess":true,"frameDurationReportThresholdMs":1000,"enablePythonVersionUI":true,"enableImportFromUrl":true,"allowDisplayHtmlByUrl":true,"enableTokens":true,"enableMiniClusters":false,"enableNewJobList":true,"maxPinnedClustersPerOrg":20,"enableDebugUI":false,"enableStreamingMetricsDashboard":true,"allowNonAd
minUsers":true,"enableSingleSignOnByTier":true,"enableJobsRetryOnTimeout":true,"loginLogo":"/login/DB_Azure_Lockup_2x.png","useStandardTierUpgradeTooltips":true,"staticNotebookResourceUrl":"https://databricks-prod-cloudfront.cloud.databricks.com/static/c0a57b890925d4a38b701f56755414e0d7e15ba065243871740ecb804faf39d5/","enableSpotClusterType":true,"enableSparkPackages":true,"checkAadUserInWorkspaceTenant":false,"dynamicSparkVersions":false,"useIframeForHtmlResult":false,"enableClusterTagsUIByTier":true,"enableUserPromptForPendingRpc":true,"enableNotebookHistoryUI":true,"addWhitespaceAfterLastNotebookCell":true,"enableClusterLoggingUI":true,"setDeletedAtForDeletedColumnsOnWebappStart":false,"enableDatabaseDropdownInTableUI":true,"showDebugCounters":false,"enableInstanceProfilesUI":true,"enableFolderHtmlExport":true,"homepageFeaturedLinks":[{"linkURI":"https://docs.azuredatabricks.net/_static/notebooks/azure/gentle-introduction-to-apache-spark-azure.html","displayName":"Introduction to Apache Spark on Databricks","icon":"img/home/Python_icon.svg"},{"linkURI":"https://docs.azuredatabricks.net/_static/notebooks/azure/databricks-for-data-scientists-azure.html","displayName":"Databricks for Data Scientists","icon":"img/home/Scala_icon.svg"},{"linkURI":"https://docs.azuredatabricks.net/_static/notebooks/structured-streaming-python.html","displayName":"Introduction to Structured Streaming","icon":"img/home/Python_icon.svg"}],"enableClusterStart":true,"maxImportFileVersion":5,"enableEBSVolumesUIByTier":true,"enableTableAclService":true,"removeSubCommandCodeWhenExport":true,"upgradeURL":"","maxAutoterminationMinutes":10000,"showResultsFromExternalSearchEngine":true,"autoterminateClustersByDefault":false,"notebookLoadingBackground":"#fff","sshContainerForwardedPort":2200,"enableStaticHtmlImport":true,"enableInstanceProfilesByTier":true,"showForgotPasswordLink":true,"defaultMemoryPerContainerMB":28000,"enablePresenceUI":true,"minAutoterminationMinutes":10,"accounts":true,"useOnDemandClustersByDefault":false,"enableAutoCreateUserUI":true,"defaultCoresPerContainer":4,"showTerminationReason":true,"enableNewClustersGet":true,"showPricePerDBU":true,"showSqlProxyUI":true,"enableNotebookErrorHighlighting":true}; var __DATABRICKS_NOTEBOOK_MODEL = 
Nested JSON Arrays in Databricks

The embedded Databricks notebook (Python) walks through flattening nested JSON arrays into rows with PySpark's explode function, step by step.

Import only the pyspark functions that we need:

    from pyspark.sql.functions import explode, col

Create a sample JSON:

    source_json = """
    {
        "persons": [
            {
                "name": "John",
                "age": 30,
                "cars": [
                    { "name": "Ford", "models": ["Fiesta", "Focus", "Mustang"] },
                    { "name": "BMW", "models": ["320", "X3", "X5"] }
                ]
            },
            {
                "name": "Peter",
                "age": 46,
                "cars": [
                    { "name": "Huyndai", "models": ["i10", "i30"] },
                    { "name": "Mercedes", "models": ["E320", "E63 AMG"] }
                ]
            }
        ]
    }
    """

Write the JSON sample into the file system:

    dbutils.fs.put("/tmp/source.json", source_json, True)

Load the JSON file into a dataframe. The multiline option is needed because the sample document spans multiple lines, whereas Spark's JSON reader expects one object per line by default:

    source_df = spark.read.option("multiline", "true").json("/tmp/source.json")

Explode all nested lists into rows:

    # Explode all persons into different rows
    persons = source_df.select(explode("persons").alias("persons"))

    # Explode all car brands into different rows
    persons_cars = persons.select(
       col("persons.name").alias("persons_name")
     , col("persons.age").alias("persons_age")
     , explode("persons.cars").alias("persons_cars_brands")
     , col("persons_cars_brands.name").alias("persons_cars_brand")
    )

    # Explode all car models into different rows
    persons_cars_models = persons_cars.select(
       col("persons_name")
     , col("persons_age")
     , col("persons_cars_brand")
     , explode("persons_cars_brands.models").alias("persons_cars_model")
    )

Finally, display(persons_cars_models) returns the fully flattened table, one row per person, car brand and model:

    persons_name  persons_age  persons_cars_brand  persons_cars_model
    John          30           Ford                Fiesta
    John          30           Ford                Focus
    John          30           Ford                Mustang
    John          30           BMW                 320
    John          30           BMW                 X3
    John          30           BMW                 X5
    Peter         46           Huyndai             i10
    Peter         46           Huyndai             i30
    Peter         46           Mercedes            E320
    Peter         46           Mercedes            E63 AMG
C%22metadata%22%3A%22%7B%7D%22%7D%2C%7B%22name%22%3A%22persons_age%22%2C%22type%22%3A%22%5C%22long%5C%22%22%2C%22metadata%22%3A%22%7B%7D%22%7D%2C%7B%22name%22%3A%22persons_cars_brand%22%2C%22type%22%3A%22%5C%22string%5C%22%22%2C%22metadata%22%3A%22%7B%7D%22%7D%2C%7B%22name%22%3A%22persons_cars_model%22%2C%22type%22%3A%22%5C%22string%5C%22%22%2C%22metadata%22%3A%22%7B%7D%22%7D%5D%2C%22overflow%22%3Afalse%2C%22aggData%22%3A%5B%5D%2C%22aggSchema%22%3A%5B%5D%2C%22aggOverflow%22%3Afalse%2C%22aggSeriesLimitReached%22%3Afalse%2C%22aggError%22%3A%22%22%2C%22aggType%22%3A%22%22%2C%22plotOptions%22%3Anull%2C%22isJsonSchema%22%3Atrue%2C%22dbfsResultPath%22%3Anull%2C%22datasetInfos%22%3A%5B%5D%2C%22columnCustomDisplayInfos%22%3A%7B%7D%7D%2C%22errorSummary%22%3Anull%2C%22error%22%3Anull%2C%22workflows%22%3A%5B%5D%2C%22startTime%22%3A1528377839096%2C%22submitTime%22%3A1528377834273%2C%22finishTime%22%3A1528377841654%2C%22collapsed%22%3Afalse%2C%22bindings%22%3A%7B%7D%2C%22inputWidgets%22%3A%7B%7D%2C%22displayType%22%3A%22table%22%2C%22width%22%3A%221773%22%2C%22height%22%3A%22344%22%2C%22xColumns%22%3Anull%2C%22yColumns%22%3Anull%2C%22pivotColumns%22%3Anull%2C%22pivotAggregation%22%3Anull%2C%22customPlotOptions%22%3A%7B%7D%2C%22commentThread%22%3A%5B%5D%2C%22commentsVisible%22%3Afalse%2C%22parentHierarchy%22%3A%5B%5D%2C%22diffInserts%22%3A%5B%5D%2C%22diffDeletes%22%3A%5B%5D%2C%22globalVars%22%3A%7B%7D%2C%22latestUser%22%3A%22a%20user%22%2C%22latestUserId%22%3Anull%2C%22commandTitle%22%3A%22Display%20the%20flattened%20data%22%2C%22showCommandTitle%22%3Atrue%2C%22hideCommandCode%22%3Afalse%2C%22hideCommandResult%22%3Afalse%2C%22iPythonMetadata%22%3Anull%2C%22streamStates%22%3A%7B%7D%2C%22nuid%22%3A%226bd284df-6c41-454d-b241-55a356b0bfdf%22%7D%5D%2C%22dashboards%22%3A%5B%5D%2C%22guid%22%3A%22c54c9e3f-f940-4b32-8bd6-41fbcd1dd131%22%2C%22globalVars%22%3A%7B%7D%2C%22iPythonMetadata%22%3Anull%2C%22inputWidgets%22%3A%7B%7D%7D'; if (window.mainJsLoadError) { var u = 'https://databricks-prod-cloudfront.cloud.databricks.com/static/c0a57b890925d4a38b701f56755414e0d7e15ba065243871740ecb804faf39d5/js/notebook-main.js'; var b = document.getElementsByTagName('body')[0]; var c = document.createElement('div'); c.innerHTML = ('Network Error' + 'Please check your network connection and try again.' + 'Could not load a required resource: ' + u + ''); c.style.margin = '30px'; c.style.padding = '20px 50px'; c.style.backgroundColor = '#f5f5f5'; c.style.borderRadius = '5px'; b.appendChild(c); } "> We've seen here how we can use Databricks to flatten JSON with just a few lines of code. Keep your eyes open for future Databricks related blogs, which will demonstrate more of the versatility of this great platform. More on some of the used functions (PySpark 2.3.0 documentation): DataFrameReader.json explode (function(root, factory) { // `root` does not resolve to the global window object in a Browserified // bundle, so a direct reference to that object is used instead. var _srcDoc = window.srcDoc; if (typeof define === "function" && define.amd) { define(['exports'], function(exports) { factory(exports, _srcDoc); root.srcDoc = exports; }); } else if (typeof exports === "object") { factory(exports, _srcDoc); } else { root.srcDoc = {}; factory(root.srcDoc, _srcDoc); } })(this, function(exports, _srcDoc) { var idx, iframes; var isCompliant = !!("srcdoc" in document.createElement("iframe")); var sandboxMsg = "Polyfill may not function in the presence of the " + "`sandbox` attribute. 
Consider using the `force` option."; var sandboxAllow = /\ballow-same-origin\b/; /** * Determine if the operation may be blocked by the `sandbox` attribute in * some environments, and optionally issue a warning or remove the * attribute. */ var validate = function( iframe, options ) { var sandbox = iframe.getAttribute("sandbox"); if (typeof sandbox === "string" && !sandboxAllow.test(sandbox)) { if (options && options.force) { iframe.removeAttribute("sandbox"); } else if (!options || options.force !== false) { logError(sandboxMsg); iframe.setAttribute("data-srcdoc-polyfill", sandboxMsg); } } }; var implementations = { compliant: function( iframe, content, options ) { if (content) { validate(iframe, options); iframe.setAttribute("srcdoc", content); } }, legacy: function( iframe, content, options ) { var jsUrl; if (!iframe || !iframe.getAttribute) { return; } if (!content) { content = iframe.getAttribute("srcdoc"); } else { iframe.setAttribute("srcdoc", content); } if (content) { validate(iframe, options); // The value returned by a script-targeted URL will be used as // the iFrame's content. Create such a URL which returns the // iFrame element's `srcdoc` attribute. jsUrl = "javascript: window.frameElement.getAttribute('srcdoc');"; // Explicitly set the iFrame's window.location for // compatability with IE9, which does not react to changes in // the `src` attribute when it is a `javascript:` URL, for // some reason if (iframe.contentWindow) { iframe.contentWindow.location = jsUrl; } iframe.setAttribute("src", jsUrl); } } }; var srcDoc = exports; var logError; if (window.console && window.console.error) { logError = function(msg) { window.console.error("[srcdoc-polyfill] " + msg); }; } else { logError = function() {}; } // Assume the best srcDoc.set = implementations.compliant; srcDoc.noConflict = function() { window.srcDoc = _srcDoc; return srcDoc; }; // If the browser supports srcdoc, no shimming is necessary if (isCompliant) { return; } srcDoc.set = implementations.legacy; // Automatically shim any iframes already present in the document iframes = document.getElementsByTagName("iframe"); idx = iframes.length; while (idx--) { srcDoc.set( iframes[idx] ); } });
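For completeness, the notebook's opening command ("Load the JSON file into a dataframe") produces the source_df used above. A minimal sketch of that step, assuming a pretty-printed JSON file at a hypothetical DBFS path (the multiLine option is only needed when the JSON spans multiple lines per record):

from pyspark.sql.functions import col, explode

# Hypothetical path - substitute the location of your own JSON file
source_df = spark.read.option("multiLine", "true").json("/mnt/data/persons.json")

In a Databricks notebook, spark is the pre-created SparkSession, so no further setup is required.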

Introduction to Azure Notebooks

Azure Notebooks is a Microsoft Azure Platform as a Service (PaaS) offering of Jupyter Notebooks. Here I'd like to introduce you to the Azure version and some of its benefits over the traditional version.

A brief overview of Jupyter Notebooks

By downloading and installing Jupyter you're able to create documents with a mixture of markdown text, code, and visualisations from within your web browser. This isn't just any document though – it allows for code modification, execution, and live visualisation generation. It's a fantastic tool for documentation, training, learning, and interactive report generation and exploration. For more information on Jupyter, check out Nigel Meakins' Introduction to Jupyter Notebooks post.

Benefits of Azure Notebooks

No installation, no maintenance: as with any PaaS solution, Azure Notebooks makes it far quicker and easier to get up and running, as there's no download or installation required. Microsoft handles all the maintenance for you too!

Easier sharing: just click your library and hit Share to be presented with a popup of sharing options. At the moment, you can share via a direct URL, social media (Twitter, Facebook, and Google+), embed code, or by emailing directly from the pop-up.

Other features of Azure Notebooks

Slides: this is an excellent tool for presenting your work directly from within your notebook, meaning you can modify and execute code from your slideshow. This allows you to adapt your presentation to your audience, explaining or answering questions with additional examples without having to swap to another application. To set up your presentation, open your notebook, go to the View menu, then Cell Toolbar, and select Slideshow. This gives you the Slide Type option on each cell of your notebook. Once you've configured each of your cells, select Enter/Exit RISE Slideshow to enter presentation mode and see how it looks.

Notebook cloning: companies and universities are turning their books and other content into Azure Notebooks, making them publicly available for anyone to clone to their own libraries to play around with and learn from. Here are a few to get you started: https://notebooks.azure.com/Microsoft/libraries and https://notebooks.azure.com/jakevdp/libraries/PythonDataScienceHandbook. It's also smart to clone your own notebook if you plan on tinkering and don't quite know where you'll end up, essentially backing up and preserving an original copy.

Public and private notebooks: not ready to share your work? Keep your notebook private until you are. If you've already shared your notebook but later want to lock it down to make changes, go into the settings of your library and make it private again. It couldn't be simpler.

Limitations

This product is still in preview, and I've no doubt it will grow in capability, but at the time of writing the following limitations are in place: Jupyter supports over 40 languages, while Azure Notebooks supports three – Python (2 and 3), R, and F#; there is a 4GB memory usage limit; and the service restrictions documentation mentions Microsoft reserving the right to remove your data after 60 days of inactivity. I've read elsewhere online that there's a 1GB storage limit, but I haven't been able to find this detailed in Microsoft documentation.

Get started

To start producing your own notebook, head to https://notebooks.azure.com, sign in with a Microsoft account, and create a library. Add a notebook file to your library by pressing "New", select the type of notebook you'd like to create (Python/R/F#), give it a name and press New again to create it. The same New button can be used to add files, such as CSVs you can reference in your notebook, to your library, either from your own computer or from the web. Click on your notebook file to start editing. Once in the editor, I'd recommend taking the self-paced user interface tour available from the Help menu if you haven't worked with Jupyter before.
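To sanity-check a fresh Python notebook, a first cell along the lines of the sketch below will confirm that code execution and inline visualisation are both working. The figures are arbitrary sample data I've made up purely for the test:

# Jupyter magic to render plots inline in the notebook
%matplotlib inline
import matplotlib.pyplot as plt

# Arbitrary sample data, purely for testing the environment
years = [2014, 2015, 2016, 2017]
values = [3, 7, 4, 9]

plt.plot(years, values, marker="o")
plt.title("Hello, Azure Notebooks")
plt.xlabel("Year")
plt.ylabel("Value")
plt.show()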

Python in SQL Server 2017

One of the new features of SQL Server 2017 is the ability to execute Python scripts within SQL Server. For anyone who hasn't heard of Python, it is the language of choice for data analysis: it has a lot of libraries for data analysis and predictive modelling, offers power and flexibility for various machine learning tasks, and is also a much simpler language to learn than many others. The release of SQL Server 2016 saw the integration of the database engine with R Services, a data science language. By extending this support to Python, Microsoft have renamed R Services to 'Machine Learning Services' to include both R and Python. The benefit of being able to run Python from SQL Server is that you can keep analytics close to the data (if your data is held within a SQL Server database) and reduce unnecessary data movement. In a production environment you can simply execute your Python solution via a T-SQL stored procedure, and you can also deploy the solution using the familiar development tool, Visual Studio.

Installation and Setup

When you install SQL Server 2017, make sure that on the Feature Selection page you select the following:

Database Engine Services
Machine Learning Services (In-Database)
Python

Please see here for detailed instructions on the setup. Make sure you download the latest version of SQL Server 2017, as there are errors within the pre-built Python packages in previous versions. Once the installation is complete, you can try executing Python scripts from within Management Studio. Before we begin, we need to make sure the execution of these scripts is enabled. To see if we can run Python scripts, run the following:

EXEC sp_configure 'external scripts enabled'
GO

If run_value = 1, we are allowed to run our Python scripts. If it is 0, run the script below:

sp_configure 'external scripts enabled', 1
RECONFIGURE WITH OVERRIDE;
GO

Now, for the change to take effect, you need to restart the SQL Server service and you are good to go!

Executing Python scripts via T-SQL

The basic syntax for executing Python scripts is as follows:

sp_execute_external_script
    @language = N'language',
    @script = N'script'
    [ , @input_data_1 = N'input_data_1' ]
    [ , @input_data_1_name = N'input_data_1_name' ]
    [ , @output_data_1_name = N'output_data_1_name' ]
    [ , @parallel = 0 | 1 ]
    [ , @params = N'@parameter_name data_type [ OUT | OUTPUT ] [ ,...n ]' ]
    [ , @parameter1 = 'value1' [ OUT | OUTPUT ] [ ,...n ] ]
    [ WITH <execute_option> ]
[;]

<execute_option> ::= { { RESULT SETS UNDEFINED } | { RESULT SETS NONE } | { RESULT SETS ( ) } }

The mandatory arguments are @language and @script:

@language = Indicates the script's language. Values are R or Python.
@script = The body of the Python script.
@input_data_1 = A T-SQL statement that reads some data from a table within the database.
@input_data_1_name = The name of the variable used to represent the T-SQL query defined above. For Python scripts the data must be tabular; for R it is slightly different.
@output_data_1_name = The name of the variable that contains the data to be returned to SQL Server upon completion of the stored procedure. For Python, the output must be a pandas dataframe.

By default, result sets returned by this stored procedure are output with unnamed columns. If you would like your result set to contain column names, you can add WITH RESULT SETS to the end of the stored procedure call.
As well as specifying column names, you will also need to provide the data types. You will see the difference between including it and not in the examples shown below. This system stored procedure can also be used to execute R scripts, simply by specifying the language in the @language parameter. Please see here for more information about this stored procedure.

Examples

N.B. Please be aware that formatting is very important and should be one of the first things you check if you get an error during execution. There are various Python formatting sites online to help with this. The examples below demonstrate the syntax and are basic in the grand scheme of what Python can do as a language.

Example 1

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'OutputDataSet = InputDataSet',
    @input_data_1 = N'SELECT 1 AS Test'

In the above example, we simply pass the input dataset straight back out. If you look at the output returned in SSMS, we receive the value 1 but with no column header. If we add WITH RESULT SETS (for example, WITH RESULT SETS ((Test INT))), the value comes back with its column header.

Example 2

In this piece of code, we loop through the rows of a table (dbo.Test) and print the value of each row.

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
for i in InputDataSet.Id:
    print(i)
',
    @input_data_1 = N'SELECT Id FROM dbo.Test'

The output in SSMS is each Id value printed on its own line.

Example 3

This piece of code shows how you can use variables and print their values.

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
var1 = "Emma"
var2 = "Test"
print(var1 + var2)
'

There is a lot we can do, however we can achieve these basic concepts using normal T-SQL, so there has been nothing new or exciting to see so far.

Example 4

A more interesting scenario, which is slightly harder to do using T-SQL, is using Python to produce descriptive statistics of data we pass into it. For this, we need to import the pandas library. The pandas library is a package which provides data structures designed to make working with relational data easy and intuitive. See here for more information.

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import pandas as pd
from pandas import DataFrame
OutputDataSet = pd.DataFrame(InputDataSet.describe())
',
    @input_data_1 = N'
SELECT
    CAST(TotalProductCost AS FLOAT),
    CAST(UnitPrice AS FLOAT),
    CAST(OrderQuantity AS FLOAT)
FROM FactInternetSales'
WITH RESULT SETS ((TotalProductCost FLOAT, UnitPrice FLOAT, OrderQuantity FLOAT))

By using describe we can get all the usual statistical measures for the columns we pass in. The statistics are returned in the following order: count, mean, standard deviation, min, 25th percentile, 50th percentile (median), 75th percentile and max.

Now, a few words about the Python code used above:

Data frame: a data frame is a data structure within Python which is like a table in SQL Server. It contains a built-in function named describe which calculates the basic statistics of our dataset. We pass the InputDataSet to the describe function and the result is converted to a data frame using the DataFrame function.
OutputDataSet: the resulting data frame is assigned to the output stream using the default output name, OutputDataSet.

The example above uses data from FactInternetSales in AdventureWorksDW. The fields needed to be converted to float as they are of the 'money' datatype, which is not a supported datatype in Python.
Sentiment Analysis

Once you have got to grips with the basics, you can move on to what Python is great at: machine learning scenarios. One popular machine learning scenario is text analysis, or sentiment analysis – analysing a piece of text to see whether the sentiment is positive or negative. A good example of this would be applying it to tweets on Twitter to see if they are positive or negative. Using Python in SQL Server 2017 brings the added advantage that you can use pre-trained models out of the box to do your analysis. To use the pre-trained models, you need to add them to the SQL Server instance where Machine Learning Services is installed, as follows:

1. Run the separate Windows-based installer for Machine Learning Server. Detailed instructions of what you need to install can be found here. You should only need to tick the box for pre-trained models, as this is an update to what we already have.

2. To check that they have installed correctly, open the command prompt (Run as administrator), navigate to C:\Program Files\Microsoft SQL Server\140\Setup Bootstrap\SQL2017\x64\ and run the following:

RSetup.exe /install /component MLM /version 9.2.0.24 /language 1033 /destdir "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\library\MicrosoftML\mxLibs\x64"

Now you have everything set up, you can begin using the pre-trained models, as sketched below. I will be using this and giving my thoughts in a future blog; in the meantime there is a Microsoft blog which provides step-by-step instructions on how to perform this analysis.

In summary, Microsoft have made it easy to integrate Python code within SQL Server, and have made it more accessible to people who are used to working within a SQL Server environment.
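As a taster, the pre-trained sentiment model is exposed through the microsoftml package's get_sentiment transform. A minimal sketch of scoring some text is below; the sample sentences are made up, and this assumes the pre-trained models were installed as described above:

import pandas as pd
from microsoftml import rx_featurize, get_sentiment

# Hypothetical sample sentences to score
text = pd.DataFrame(data={"text": ["I loved this product, great service!",
                                   "Terrible experience, would not recommend."]})

# get_sentiment uses the pre-trained model to score each sentence
# on a 0 (negative) to 1 (positive) scale
scores = rx_featurize(data=text,
                      ml_transforms=[get_sentiment(cols=dict(scores="text"))])
print(scores)

Wrapped in sp_execute_external_script, the resulting data frame can be assigned to OutputDataSet and returned to SQL Server like any other result set.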

My Experience of the Microsoft Professional Program for Data Science

(Image 1 – Microsoft 2017 - https://academy.microsoft.com/en-us/professional-program/data-science)

In 2016 I was talking to Andrew Fryer (@DeepFat), Microsoft technical evangelist (after he attended Dundee University to present about Azure Machine Learning), about how Microsoft were piloting a degree course in data science. My interest was immediately piqued. Shortly after this, hints began to appear and the Edx page went live, and the degree was rebranded as the "Professional Program". I registered to be part of the pilot, however I was not accepted until the course went live in September 2016.

Prior to 2016 my background was in BI, predominantly in Microsoft Kimball data warehousing using SQL Server. At the end of 2015 I enrolled on a Master's degree in Data Science through the University of Dundee. I did this with the intention of getting exposure to tools I had an interest in but little commercial experience with (R, machine learning and statistics). This course is ongoing and will finish in 2018 – I highly recommend it! I would argue that it is the best Data Science Master's degree course in the UK. So going in to the MPP I had a decent idea of what to expect, plus a lot of SQL experience, R and Power BI. Beyond that I had attended a few sessions at various conferences on Azure ML. When the syllabus for the MPP came out, it directly complemented my studies.

Link to program - https://academy.microsoft.com/en-us/professional-program/data-science
Link to Dundee Masters - https://www.dundee.ac.uk/study/pg/data-science/

Structure of the program

The program is divided into 9 modules and a final project. All modules need to be completed, but there are different options you can take, so you can customise the course to suit your interests. You can choose to pay for the course (which you will need to do if you intend to work towards the certification) or audit it for free. I will indicate which modules I took and why. Most modules recommend at least 6 weeks part-time to complete. I started the first module in the middle of September 2016 and completed the final project in the middle of January 2017, so the 6-week estimate is quite high, especially if you already have a decent base knowledge of the concepts.

You can, if you wish, complete multiple modules at once. I am not sure I recommend this approach, as to get the most out of the course you should read around the subject as well as watching the videos. Each module has a start date and an end date that you need to complete it between; if you do not, you will need to do it all again. You can start a module in one period and wait until the next period for another module – you do not need to complete them all in 3 months. If you pay for a module but do not request your certificate before the course closes, you will need to take it again (top tip: as soon as you're happy with your score, request your certificate).
Module list

Module / Detail / Time taken:
- Data Science Orientation: Data Science Orientation – 2-3 days
- Query Relational Data: Querying Data with Transact-SQL – 1 day (exam only)
- Analyze and Visualize Data: Analyzing and Visualizing Data with Excel, or Analyzing and Visualizing Data with Power BI – 2-4 days
- Understand Statistics: Statistical Thinking for Data Science and Analytics – 7-9 days
- Explore Data with Code: Introduction to R for Data Science, or Introduction to Python for Data Science – 7-9 days
- Understand Core Data Science Concepts: Data Science Essentials – 7-9 days
- Understand Machine Learning: Principles of Machine Learning – 2 weeks
- Use Code to Manipulate and Model Data: Programming with R for Data Science, or Programming with Python for Data Science – R: 2-3 days, Python: 3 weeks
- Develop Intelligent Solutions: Applied Machine Learning, Implementing Predictive Solutions with Spark in HDInsight, or Developing Intelligent Applications – 2 weeks
- Final Project: Data Science Challenge – 2 months*

*The times taken are based on the time I had spare; I completed each module between projects, in the evenings and at weekends. The final project itself can be completed in a few days, however you need to wait until the module period has ended to get your grade.

Structure of the modules

Each module is online. You log on to the Edx website and watch videos by leading experts. Either at the end of a video, after reading some text, or at the end of a section of the module, you are given a multiple-choice test. The multiple-choice questions are graded and form part of your overall score. The other main assessment method is labs, where you are required to complete a series of tasks and enter the results. Unlike certifications, you get to see your score as you progress through the module. The multiple-choice questions generally allow you two to three attempts at the answer; sometimes these are true/false with two attempts, which does undermine the integrity of the course.

There is normally a final section where you're only given one chance to answer, which holds a higher percentage towards your final mark. You need 70% to pass. Once you hit 70% you can claim your certificate – if you have chosen to pay for the module. Modules range from $20 to $100. For the most part I answered the questions fully and tried for the highest score possible. However, in all honesty, towards the end, once I hit around 80%, I started looking at the next module. If the module was really interesting I would persevere.

Modules

Data Science Orientation, Query Relational Data & Analyze and Visualize Data. These modules are very basic and really only skim the surface of the topics they describe. The first module is a gentle introduction to the main concepts you will learn throughout the program. The next module focuses on querying data with SQL. Regardless of your opinion of SQL, you must agree that SQL is the language of data. Having an understanding of the fundamentals of SQL is paramount, as almost every level of the Microsoft data science stack has integration with databases. If you're familiar with SQL (I already held an MCSE in SQL 2012) you can skip the main content of this module and just take the test at the end. For the next module you have the option of Excel or Power BI for visualisation. As I have experience with Power BI, I opted for that module. Once again, this is a very basic introduction to Power BI, but it will get you familiar enough with the tool to do basic data exploration. Some parts of this course jarred with me.
Data visualisation is so important and a key skill for any data scientist, yet in the Power BI module one of the exercises was to create a 3D pie chart. Pie charts are not a good visualisation, as it is hard to differentiate between angles, and making one 3D only escalates the issue. I wish Microsoft had referenced some of the great data viz experts when making this module. I cannot comment on the Excel version.

Understand Statistics. This module is different from its predecessors in that it is not run by Microsoft. It is a MOOC from Columbia University, which you might have completed before. It covers a lot of the basic and more advanced statistics you need to know for data science, in particular a solid grounding in probability and probability theory. In BI you become familiar with descriptive statistics and measures of variance, however I had not done a great deal of statistics beyond this. I had been researching statistical methods for the MSc, but I had not done any real statistics since A-Level maths. This course was really interesting and I learnt a lot. I don't know if this is the best way to really learn statistics, but it is a good primer for what you need to know. I found topping up my understanding with blogs, books and YouTube helped support this module.

Explore Data with Code. You have two options again for this module: R and Python. Which should you learn, I imagine you're asking? Well, the simple answer is both. Knowing either R or Python will get you so far; knowing both will make you a unicorn. Many ask why learn one language over the other. R is very easy to get into: it has a rich catalogue of libraries written by some of the smartest statistical minds, it has a simple interface and it is easy to install. Python is harder to learn in my opinion, as the language is massive! I found Python harder to work with, but it is much richer. I would recommend Python just for scikit-learn, the machine learning library. The Python module is extended to use Code Dojo (the great online tuition site): as you progress through the questions and examples, you have an IDE which will check your understanding and grade you as you go. I found this really helpful. This module is again a bit on the easier side; if you think the later Python module will be similar, you are in for a surprise! I did not take the R module as I was already using R in my day job.

Understand Core Data Science Concepts. Almost a redo of the first module and the Understand Statistics module. Not a lot to say here, but the repetition helped me understand and remember the concepts – the more I had to think about the core concepts, the more they stuck. This module could have been removed with little to no impact on the course, but it helped solidify my knowledge.

Understand Machine Learning. As this is a Microsoft course, this module is all about Azure Machine Learning. If you have not used Azure ML before, it has a nice drag-and-drop interface which allows you to build quick, simple models and create a web API (with a key) to which you can then pass data from any tool that can make REST calls. This module is half theory and half practical. There are a lot of labs, so you will need to take your time; if you skip ahead you will get the answers wrong and might not make it to 70%.

Use Code to Manipulate and Model Data. This section has two options again, R and Python. I knew quite a bit of R already, so I started with Python. I wanted to do them both to see how you can do machine learning in each.
I was expecting a continuation of the Code Dojo format from the previous module; this was far from the case. Each of the modules up until this point had worked with you to find the right answer. This module will equip you with the basics but expects you to find the correct function and answer yourself. Believe me when I say it was hard (with little prior experience of Python). The course will lead you towards the right resources, but you need to read the documentation to answer the questions. This was a great change of pace: having to search for the answers made me absorb far more than the quizzes alone. This module was a struggle. Once I completed it, I did the same for R. On a difficulty scale, if the Python module was 100, R was only at 20. The disparity in difficulty is massive and frankly unfair. I was able to complete the R module very quickly, and I left feeling disappointed that it did not have the same complexity the Python module did.

Develop Intelligent Solutions. For this section you can pick one of three modules: machine learning, Spark or microservices. I went with Spark. Why? Because I had already worked with Spark and Hadoop as part of the MSc at Dundee. I knew how it worked and what it did from an open-source point of view, but not from a Microsoft HDInsight perspective. This module was tricky, but nothing compared to the Python module. I spent the best part of a week working on Spark, setting up HDInsight clusters and forgetting to tear them down (top tip: don't leave an HDInsight cluster running – they are EXPENSIVE!). The last module is a machine learning project, so picking the "Applied Machine Learning" option might put you in a better place than your competition. I did not attempt either the machine learning or the microservices modules.

Final Project. Here is where the fun begins. You're given a problem and a dataset. You need to clean the dataset, reduce it, derive features and process it, then apply an ML technique to predict something – in my case, whether or not someone would default on a loan. You could use any technique you liked, as long as the final result was in Azure ML. I was pretty happy with my model early on and made very few tweaks as the course progressed. Unlike the previous modules, where you can complete a module and get your score, your final score is only available once the module has ended. You build an ML experiment and test it against a private dataset. You can submit your experiment 3 times a day to be scored against the private data (maximum of 100 attempts). This gives you an indication of your score, but it is not your final score: that is calculated against a different dataset after the module has finished. Your top 5 scores are used to test against the private, closed data. If you have over-fitted your model, you might have a shock (as many did on the forums) when your score is marked.

I completed all modules at the start of January and waited until February to get my final score. My highest-scoring answer, when used against the closed private dataset, did not get over the required 70% to pass. This was surprising but not all that unexpected: I had over-fitted the model (the sketch below shows one cheap way to catch this before submitting). To counterbalance this, I had created 5 different experiments with 5 similar but different approaches. All scored similarly (~1-3% difference in accuracy), and this was enough to see me past the required 70% and obtain the MPP in Data Science. The private dataset has been published now.
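The final project itself is scored in Azure ML, but you can run the same over-fitting check in plain Python before submitting. A minimal sketch using scikit-learn follows; the synthetic data and model choice are illustrative assumptions, not the actual challenge dataset. If the cross-validated score sits well below the training score, the model is over-fitted:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the loan-default data (illustrative only)
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Training accuracy is usually optimistic...
print("Train accuracy:", model.score(X, y))

# ...cross-validation gives a more honest estimate of how the model
# will score on unseen (private) data
cv_scores = cross_val_score(model, X, y, cv=5)
print("5-fold CV accuracy:", cv_scores.mean())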
In the coming weeks I will blog about the steps I took to predict whether someone would default on their loan.

I have been asked at different stages of the course, "would you recommend it?". It really depends on what you want out of the course. If you expect to be a data scientist after completing the MPP, you might be in for a shock. To get the most out of the course you need to supplement it with wider reading and research; YouTube has many great videos and recorded lectures which will really help you process the content and see it taught from a different angle. If you're looking to get an understanding of the key techniques in data science (from a Microsoft point of view), then you should take this course. If you're doing a degree where you need to do research, many of the modules will really help and build upon what you already know.

I hope you have found this interesting and that it has helped you decide whether or not you want to invest the time and money (each module is not free). If you do decide to, and you persevere, you too will be the owner of the MPP in Data Science (as seen below).

Terry McCann - Adatis Data Science Consultant & Organiser of the Exeter Data Science User Group - You can find us on MeetUp.