Ivan's Blog

Embedding Databricks Notebooks into a BlogEngine.NET post

Databricks is a buzzword now. This means that each day more and more related content appears on the net. With Databricks Notebooks it’s so easy to share code. If by any chance you need to share a notebook directly in your blog post here are some short guidelines on how to do so. If your favourite blog engine appears to be BlogEngine.NET it’s not so straightforward task. Fear not – following are the steps you need to take:

  1. Export your notebook to HTML from the Databricks portal:
  2. Make the following text replacements in the exported HTML: replace & with & and with "
  3. Paste the following code in the blog’s source replacing both the [[HEIGHT]] placeholders with the notebook’s height in pixels (you may need a little trial and error to get to the exact value so that the vertical scrollers disappear) and [[NOTEBOOK_HTML]] placeholder with the resulting HTML from the above point:
    <iframe height="[[HEIGHT]]px" frameborder="0" scrolling="no" width="100%" style="width: 100%; height: [[HEIGHT]]px;" sandbox="allow-forms allow-pointer-lock allow-popups allow-presentation allow-same-origin allow-scripts allow-top-navigation" srcdoc="[[NOTEBOOK_HTML]]"></iframe>
  4. To make everything compatible with Internet Explorer and Edge, at the very end of your blog script place between <script></script> tags the wonderful srcdoc-polyfill script

Here is an example notebook with the above steps applied:

Not all notebook features are available within an iframe, but the main ones, such as syntax highlighting are intact and so is the ability to display a table results (with column sort capability!).

The proposed way of embedding notebook code in a blog post is just the first idea that came to my mind. Maybe there are more clever ways to achieve this. Please let me know in the comments.

Parsing nested JSON lists in Databricks using Python

Parsing complex JSON structures is usually not a trivial task. When your destination is a database, what you expect naturally is a flattened result set. Things get more complicated when your JSON source is a web service and the result consists of multiple nested objects including lists in lists and so on. Things get even more complicated if the JSON schema changes over time, which is often a real-life scenario.

We have these wonderful Azure Logic Apps, which help us consistently get the JSON results from various sources. However, Logic Apps are not so good at parsing more complex nested structures. And they definitely don’t like even subtle source schema changes.

Enter Databricks!

With Databricks you get:

  • An easy way to infer the JSON schema and avoid creating it manually
  • Subtle changes in the JSON schema won’t break things
  • The ability to explode nested lists into rows in a very easy way (see the Notebook below)
  • Speed!

Following is an example Databricks Notebook (Python) demonstrating the above claims. The JSON sample consists of an imaginary JSON result set, which contains a list of car models within a list of car vendors within a list of people. We want to flatten this result into a dataframe. Here you go:

We've seen here how we can use Databricks to flatten JSON with just a few lines of code.

Keep your eyes open for future Databricks related blogs, which will demonstrate more of the versatility of this great platform.

More on some of the used functions (PySpark 2.3.0 documentation):