This is the third post in a series on Introduction To Spark.
There are a large number of kernels that will run within Jupyter Notebooks, as listed here.
I’ll take you through installing and configuring a few of the more commonly used ones, as listed below:
- Python 3
- PySpark
- Jupyter-Scala (Scala)
- Apache Toree (Scala)
Each kernel has its own kernel.json file, containing the required configuration settings. Jupyter will use this when loading the kernels registered in the environment. These are created in a variety of locations, depending on the kernel installation specifics. The file must be named kernel.json, and located within a folder that matches the kernel name.
There are various locations for the installed kernels. For those included in this article the locations below have been identified:
- <UserProfileDir>\AppData\Roaming\jupyter\kernels
- <ProgramDataDir>\jupyter\kernels
- <AnacondaInstallDir>\envs\<EnvName>\share\jupyter\kernels
Here <UserProfileDir> is the value of the environment variable %UserProfile%, <ProgramDataDir> is the value of %ProgramData%, and <AnacondaInstallDir> is the installation root directory for Anaconda, assuming you are using Anaconda for your Python installation.
Listing Jupyter Kernels
You can see what kernels are currently installed by issuing the following:
jupyter kernelspec list
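To get a feel for what that listing is built from, here is a minimal sketch that walks a single kernels folder using only the Python standard library. It is not how Jupyter itself does it (Jupyter searches several folders); the function name list_kernels is my own, and the only assumption about the file contents is that a kernel.json usually carries a display_name entry.

```python
import json
from pathlib import Path


def list_kernels(kernels_dir):
    """Map kernel folder name -> display name for each kernel.json found.

    `kernels_dir` is one 'kernels' folder; every kernel lives in its own
    sub-folder, which must contain a file named kernel.json.
    """
    found = {}
    root = Path(kernels_dir)
    if not root.is_dir():
        return found
    for folder in sorted(root.iterdir()):
        spec_file = folder / "kernel.json"
        if spec_file.is_file():
            spec = json.loads(spec_file.read_text(encoding="utf-8"))
            # Fall back to the folder name when display_name is absent.
            found[folder.name] = spec.get("display_name", folder.name)
    return found
```

Calling list_kernels on one of your kernels folders should return something like a dictionary of folder names and the names Jupyter shows in its ‘New’ dropdown; folders without a kernel.json are simply skipped.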
The Python 3 kernel comes ‘out of the box’ with the Python 3 environment, so it should require no actual setup in order to use. You’ll find the configuration file at <AnacondaInstallDir>\envs\Python36\share\jupyter\kernels\Python3. The configuration contains little other than the location of the python.exe file, some flags, and the Jupyter display name and language to use. The kernel will only be available within the Python environment in which it is installed, so you will need to change to that Python environment prior to starting Jupyter notebooks, using ‘activate <EnvName>’ from a conda prompt.
Setting up a PySpark kernel requires a little more effort than the Python 3 kernel. You will need to create a PySpark directory in the kernels location for your Python environment, i.e. <AnacondaInstallDir>\envs\<EnvName>\share\jupyter\kernels\PySpark
Within this directory, create a kernel.json file, with the following data:
All Windows paths will of course use backslashes, which must be escaped with a further backslash in JSON, hence the ‘\\’. You need to include the paths to the zip archives for py4j and pyspark in order to have full kernel functionality. In addition to the basic Python pointers we saw in the Python 3 configuration, we have set a number of Windows environment variables for the lifetime of the kernel. These could of course be set ‘globally’ within the machine settings (see here for details on setting these variables), but this is not necessary and I have avoided it to reduce clutter.
"env": {
  "PYSPARK_PYTHON": "<AnacondaInstallDir>\\envs\\<EnvName>\\python.exe",
  "PYTHONPATH": "<SparkInstallDir>\\python;<SparkInstallDir>\\python\\pyspark;<SparkInstallDir>\\python\\lib\\py4j-0.10.4-src.zip;<SparkInstallDir>\\python\\lib\\pyspark.zip",
  "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell"
}
The PYSPARK_SUBMIT_ARGS parameter will vary based on how you are using your Spark environment. Above I am using a local install with all cores available (local[*]).
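If you would rather not hand-escape all those backslashes, one option is to build the configuration in Python and let the json module do the escaping. The following is only a sketch: the function name, the example install paths (C:\Anaconda3, C:\Spark) and the display name are hypothetical, and the argv entry assumes the standard ipykernel launch command rather than anything specific to my setup. The three environment variables are the ones described above.

```python
import json
from pathlib import PureWindowsPath


def build_pyspark_kernel_spec(python_exe, spark_home,
                              py4j_zip="py4j-0.10.4-src.zip",
                              master="local[*]"):
    """Return a dict laid out like a kernel.json for a PySpark kernel."""
    spark_python = PureWindowsPath(spark_home) / "python"
    python_path = [
        spark_python,
        spark_python / "pyspark",
        spark_python / "lib" / py4j_zip,
        spark_python / "lib" / "pyspark.zip",
    ]
    return {
        "display_name": "PySpark",  # hypothetical display name
        "language": "python",
        # Assumption: the usual ipykernel launch command.
        "argv": [str(python_exe), "-m", "ipykernel_launcher",
                 "-f", "{connection_file}"],
        "env": {
            "PYSPARK_PYTHON": str(python_exe),
            # Windows separates PYTHONPATH entries with ';' and no spaces.
            "PYTHONPATH": ";".join(str(p) for p in python_path),
            "PYSPARK_SUBMIT_ARGS": f"--master {master} pyspark-shell",
        },
    }


if __name__ == "__main__":
    spec = build_pyspark_kernel_spec(
        python_exe=r"C:\Anaconda3\envs\Python36\python.exe",  # hypothetical
        spark_home=r"C:\Spark",                               # hypothetical
    )
    # json.dumps doubles every backslash, as kernel.json requires.
    print(json.dumps(spec, indent=2))
```

The printed text can be saved as kernel.json in the PySpark directory. Passing a different master value, for example "local[2]" to limit Spark to two cores, changes only the PYSPARK_SUBMIT_ARGS entry.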
In order to use the kernel within Jupyter you must then ‘install’ it into Jupyter, using the following:
jupyter kernelspec install <AnacondaInstallDir>\envs\<EnvName>\share\jupyter\kernels\PySpark
The Jupyter-Scala kernel can be downloaded from here. Unzip it and run the jupyter-scala.ps1 script on Windows with elevated permissions in order to install it.
The kernel files will end up in <UserProfileDir>\AppData\Roaming\jupyter\kernels\scala-develop and the kernel will appear in Jupyter with the default name of ‘Scala (develop)’. You can of course change this in the respective kernel.json file.
Apache Toree allows the use of the Scala, Python and R languages (you will only see Scala listed after install, but apparently it can also process Python and R), and is currently at incubator status within the Apache Software Foundation. The package can be downloaded from Apache here; however, to install it, just run pip install with the URL of the required tarball archive and then jupyter toree install, as below (from an elevated command prompt):
pip install http://apache.mirror.anlx.net/incubator/toree/0.1.0-incubating/toree-pip/apache-toree-0.1.0.tar.gz
jupyter toree install
This will install the kernel to <ProgramDataDir>\jupyter\kernels\apache_toree_scala.
You should now see your kernels listed when running Jupyter from the respective Python environment. Select the ‘New’ dropdown to create a new notebook, and select your kernel of choice.
In part 4 of this series we’ll take a quick look at the Azure HDInsight Spark offering.