Ivan's Blog

Installing Databricks Cluster Libraries from a Python notebook

Working with interactive clusters in Databricks makes it possible to manually install libraries using the workspace UI. It’s done once and for all. You don’t have to worry about it anymore.

Creating job clusters is another story. You must take care of library installations prior to executing the notebooks which reference these libraries. In the case you’re using Azure Data Factory to orchestrate the whole process you’re lucky, because appending libraries to job clusters is an out-of-the-box functionality. For all other scenarios using the Databricks REST API is one possible option. In fact, you can do this right from a Python notebook. That’s what I’m going to demonstrate in the following lines.

Let’s have a look at the REST API documentation first. We’ll be using the Cluster Status and Install endpoints only. For installing a library, we need to provide the library source and its properties. We need to create a proper HTTP request body in JSON format including the library source and properties. Here’s one example:

  "pypi": {
    "package": "simplejson",
    "repo": "http://my-pypi-mirror.com"

Here "pypi" is the source and {"package": "simplejson", "repo": "http://my-pypi-mirror.com"} are its properties. For more examples check the documentation here.

Let’s create a new Python notebook and then some functions in it. The code is simple and self-explanatory. Comments are available where needed.

We’ll be using only json and requests libraries:

import json
import requests
from requests.auth import HTTPBasicAuth

Now, let’s create a REST API wrapper function:

def adbAPI(endPoint, body, method = "GET", region, token):
  """Execute HTTP request against Databricks REST API 2.0"""
  domain = region + ".azuredatabricks.net"
  baseURL = "https://%s/api/" % (domain)

  if method.upper() == "GET":
    response = requests.get(
        baseURL + endPoint
      , auth = HTTPBasicAuth("token", token)
      , json = body
    response = requests.post(
        baseURL + endPoint
      , auth = HTTPBasicAuth("token", token)
      , json = body

  return response

As parameters we’ll take the API endpoint, HTTP request body, HTTP method (GET or POST), Databricks workspace region (westeurope, northeurope, etc.) and, finally, a Databricks token.

Next is a helper function for translating the library status response into a human readable format:

def describeClusterLibraryState(source, clusterId, status):
  """Converts cluster library status response to a verbose message"""
  result_map = {
      "NOT_INSTALLED"       : "{} library is not installed on cluster {}.".format(source.title(), clusterId)
    , "INSTALLED"           : "{} library is already installed on cluster {}.".format(source.title(), clusterId)
    , "PENDING"             : "Pending installation of {} library on cluster {} . . .".format(source, clusterId)
    , "RESOLVING"           : "Retrieving metadata for the installation of {} library on cluster {} . . .".format(source, clusterId)
    , "INSTALLING"          : "Installing {} library on cluster {} . . .".format(source, clusterId)
    , "FAILED"              : "{} library failed to install on cluster {}.".format(source.title(), clusterId)
    , "UNINSTALL_ON_RESTART": "{} library installed on cluster {} has been marked for removal upon cluster restart.".format(source.title(), clusterId)

  return result_map[status.upper()]

The parameters here include the library source, the cluster ID and the API’s status response.

Now let’s get to the serious business. Obviously, we need a way to check the current state of the library to see if it’s not already installed or if something went wrong during installation:

def getClusterLibraryStatus(source, properties, dbRegion, dbToken, verbose = True):
  """Gets the current library status """
  source = source.lower()

  # Validate input
  assert source in ("jar", "egg", "whl", "pypi", "maven", "cran"), \
    "Error: Invalid library source specified. Valid sources are: jar, egg, whl, pypi, maven, cran"
  assert properties is not None, \
    "Error: Empty properties provided"

  # Get the cluster ID from the Spark environment
  clusterId = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")

  # Set default result to not installed
  result = describeClusterLibraryState(source, clusterId, "NOT_INSTALLED") if verbose else "NOT_INSTALLED" 

  # Execute REST API request to get the library statuses
  libStatuses = adbAPI("2.0/libraries/cluster-status?cluster_id=" + clusterId, None, "GET", dbRegion, dbToken)

  if libStatuses.status_code == 200:
    statuses = libStatuses.json()
    if "library_statuses" in statuses:
      for status in statuses["library_statuses"]:
        if status["library"] == {source: properties}:
          if verbose:
            result = describeClusterLibraryState(source, clusterId, status["status"])
            result = status["status"]

  return result

Parameters here include the library source and properties, Databricks region and token and a verbose flag, which if set to true (default) will make the function return human readable library status.

Finally, we have to create the most important function that will actually install the library:

def installClusterLibrary(source, properties, dbRegion, dbToken):
  Installs a cluster library given correct source and properties are provided
  For examples see https://docs.databricks.com/api/latest/libraries.html#install
  source = source.lower()

  # Validate input
  assert source in ("jar", "egg", "whl", "pypi", "maven", "cran"), \
    "Error: Invalid library source specified. Valid sources are: jar, egg, whl, pypi, maven, cran"
  assert properties is not None, \
    "Error: Empty properties provided"
  # Get the cluster ID from the Spark environment
  clusterId = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")

  status = getClusterLibraryStatus(source, properties, dbRegion, dbToken, False).upper()
  if status != "INSTALLED":
    # Create the HTTP request body based on the cluster ID, library source and properties
    body = json.loads('{"cluster_id": "' + clusterId + '", "libraries": [{"' + source + '": ' + json.dumps(properties) + '}]}')
    # Execute REST API request to install the library
    runAPI = adbAPI("2.0/libraries/install", body, "POST", dbRegion, dbToken)
    if runAPI.status_code == 200:
      print("Installation started . . .")
    print(describeClusterLibraryState(source, clusterId, status))

  return status

The parameters here are: the library source and properties and the Databricks region and token.

Example function calls (azure-sqldb-spark library hosted on Maven)

Show the library status:

print(getClusterLibraryStatus("maven", {"coordinates": "com.microsoft.azure:azure-sqldb-spark:1.0.2"}, "westeurope", "dapi********************************"))

Install the library:

installClusterLibrary("maven", {"coordinates": "com.microsoft.azure:azure-sqldb-spark:1.0.2"}, "westeurope", "dapi********************************")

One thing to note here is that the library installation starts and runs asynchronously. This means that you can execute the getClusterLibraryStatus function multiple times after installClusterLibrary and get different results until the installation is done (or failed).

Embedding Databricks Notebooks into a BlogEngine.NET post

Databricks is a buzzword now. This means that each day more and more related content appears on the net. With Databricks Notebooks it’s so easy to share code. If by any chance you need to share a notebook directly in your blog post here are some short guidelines on how to do so. If your favourite blog engine appears to be BlogEngine.NET it’s not so straightforward task. Fear not – following are the steps you need to take:

  1. Export your notebook to HTML from the Databricks portal:
  2. Make the following text replacements in the exported HTML: replace & with & and with "
  3. Paste the following code in the blog’s source replacing both the [[HEIGHT]] placeholders with the notebook’s height in pixels (you may need a little trial and error to get to the exact value so that the vertical scrollers disappear) and [[NOTEBOOK_HTML]] placeholder with the resulting HTML from the above point:
    <iframe height="[[HEIGHT]]px" frameborder="0" scrolling="no" width="100%" style="width: 100%; height: [[HEIGHT]]px;" sandbox="allow-forms allow-pointer-lock allow-popups allow-presentation allow-same-origin allow-scripts allow-top-navigation" srcdoc="[[NOTEBOOK_HTML]]"></iframe>
  4. To make everything compatible with Internet Explorer and Edge, at the very end of your blog script place between <script></script> tags the wonderful srcdoc-polyfill script

Here is an example notebook with the above steps applied:

Not all notebook features are available within an iframe, but the main ones, such as syntax highlighting are intact and so is the ability to display a table results (with column sort capability!).

The proposed way of embedding notebook code in a blog post is just the first idea that came to my mind. Maybe there are more clever ways to achieve this. Please let me know in the comments.