IBM  Data Science  Experience

Part 1: Add a custom library to a Jupyter Scala notebook in IBM Data Science Experience (DSX)

I have been using IBM’s Data Science Experience platform for a few months now. It's a great platform to perform data analyses using the latest tools, like Jupyter Notebooks and Apache Spark. If you are at all familiar with using Jupyter Notebooks you know that they're good for sharing code and running quick analyses. However, I have found that, while the notebook is good for presenting high-level code, it doesn’t work as well when you have too much low-level code — you can quickly get lost in the notebook and lose the main point.

With this idea in mind, I thought it would be a fun experiment to write a few custom Scala packages that could be used to declutter my notebooks. This post covers the basic environment setup you need to create your own custom library, and walks through a simple hello world example of how to do this. I plan to write a follow-up post that will show a more advanced example of using this pattern to write a simple Spark benchmark program.

Environment setup outline

Here is the list of steps to set up your local environment. (Note: I'll show what I did on my MacBook Pro, which should apply to Linux environments as well, but your mileage may vary.)

  • Set up an account in IBM Data Science Experience
  • Create a project and add a notebook
  • Install Scala
  • Install the Scala build tool (SBT)
  • Download the Hello World example from my GitHub repo
  • Compile your JAR file and load it into DSX
  • Test your Hello World library

1. Set up an account in IBM Data Science Experience

Browse to datascience.ibm.com and sign up for the 30-day free trial of DSX, which includes Jupyter, Spark, RStudio, and other tools. You will need to provide a personal email address to set up the account. This step should take about 10–15 minutes to complete.

2. Create a project and add a notebook in DSX

Once logged into DSX, click Create project in the upper right-hand corner.

Fill in the name of your project, and accept the defaults for the Spark and Object Storage instance. Click Create project.

Next, add a notebook in your project by clicking Add notebooks.

Once you type in the notebook name and select the language and Spark service, click Create notebook.

You will now move to setting up your development environment on your local machine.

3. Install Scala

There are several ways to install Scala on the Mac. You can either use brew, or download the tarball directly from the Scala download page. I prefer to download the tarball directly and do a manual setup because I can pick the exact version of Scala I want. Here is a quick example using the terminal:

cd <YOUR_SCALA_DIR> # Path for your scala binaries  
wget http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz  
tar -zxvf scala-2.11.8.tgz  

Once you have done this, set these values in your shell environment. On my Mac, I configure my .bash_profile.

# Replace /data/app-setup with <YOUR_SCALA_DIR>  
export SCALA_HOME=/data/app-setup/scala-2.11.8  
export PATH=$PATH:$SCALA_HOME/bin  

To test your Scala installation, type scala at your command line; you should see the Scala interpreter invoked.
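As an extra check, you can ask the interpreter which Scala version it is running. This is a minimal sketch that uses scala.util.Properties from the standard library; if you followed the download step above, the version it reports should be 2.11.8.

```scala
// Paste into the interpreter that starts when you type `scala`.
import scala.util.Properties

// Version of the Scala library the interpreter is running on
println(Properties.versionString)

// Version of the JVM underneath it
println(Properties.javaVersion)
```

If the version is not the one you downloaded, another Scala installation is probably earlier on your PATH.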

4. Install SBT

SBT stands for Simple Build Tool (which I think is a misnomer, but that will wait for a different post!). SBT is the build tool that will compile and package your Scala code. Follow these steps to set up SBT on your machine:

wget https://dl.bintray.com/sbt/native-packages/sbt/0.13.13/sbt-0.13.13.tgz  
tar -zxvf sbt-0.13.13.tgz  

Modify your .bash_profile:

SBT_HOME=/data/app-setup/sbt-launcher-packaging-0.13.13/  
export PATH=$SBT_HOME/bin:$PATH  

To test your SBT installation, type sbt at the command line and SBT should load.

5. Clone 'hello world' Git repo

Next, grab a hello world example from my GitHub repository. You will compile this example with SBT and create a simple JAR file that you can upload into DSX. The advantage of doing it this way is that the repository already has the directory structure and a simple build.sbt file defined, so compiling and assembling the JAR should be easy.

cd <YOUR_CODE_DIR> # Path for your code  
git clone https://github.com/dustinvanstee/dv-hw-scala.git  

6. Compile the Hello World code and turn it into a skinny JAR file

The hello world code is simple, but it's important to review a few key aspects of the code so that you can reference it in your Jupyter notebook:

package dv.hw

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}

The main things to note here are the package line, the object name, and the main method definition. The package line is important because it is the name you will import in the Jupyter notebook. The object is important because it is what you will reference in the notebook (a Scala object is a singleton, so there is nothing to instantiate). Finally, main is the method you will call; this will become clear in the notebook.
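To make the relationship between the package line, the object, and the import concrete, here is a small self-contained sketch. It repeats the HelloWorld object and adds a hypothetical demo.Caller object (not part of the repository) that stands in for the notebook cell. The braces after package are only there so that both sides fit in one file; the repository keeps the usual one-line package declaration.

```scala
// Library side: the same object as in the repository, written with the
// braced package syntax so the whole sketch fits in a single file.
package dv.hw {
  object HelloWorld {
    def main(args: Array[String]): Unit = {
      println("Hello, world!")
    }
  }
}

// Caller side: a hypothetical stand-in for a notebook cell.
package demo {
  import dv.hw._ // the package line above is what this import refers to

  object Caller {
    def main(args: Array[String]): Unit = {
      // An object is a singleton, so it is referenced by name, without `new`.
      HelloWorld.main(Array.empty[String])
      // The fully qualified name also works and needs no import.
      dv.hw.HelloWorld.main(Array.empty[String])
    }
  }
}
```

Saved as a single file, this should compile with scalac and run with scala demo.Caller, calling the library method once through the import and once through the fully qualified name.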

To compile this code, run sbt as shown:

# Path for your code, which should have build.sbt in this dir
cd <YOUR_CODE_DIR>  
sbt  
> compile
...[success]
> assembly
...
[info] Packaging .. scala-2.10/dv-hw-scala-assembly-1.0.jar ...
[info] Done packaging.
[success] Total time: 0 s, completed Dec 12, 2016 2:04:27 PM

SBT has a number of commands available, but here you are using the compile command, which compiles your code into class files. Next, use the assembly command to create a skinny JAR file. SBT uses the build.sbt file in the project directory to control how the code is compiled. It performs many tasks, but the key ones in this case are automatic dependency management and defining how the JAR should be built.

Note: I am using the sbt-assembly plugin for this example. SBT picks it up automatically because I added it to the ./project/plugins.sbt file. sbt-assembly provides the capability to build portable JAR files. It can be used to build uber JAR files that have all the library dependencies built into the archive, but it can also be used to build skinny JAR files that are useful in notebook environments where most of the libraries are already available. I am using the latter method here.
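For orientation, here is a sketch of what a build.sbt for a skinny JAR might look like. It is not copied from the repository, so treat the details as assumptions: the name and version are inferred from the JAR file name in the output above, the exact 2.10.x Scala version is a guess, and the includeScala setting is the option sbt-assembly 0.14.x documents for leaving the Scala library out of the archive. The plugin itself would be enabled by a line such as addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3") in project/plugins.sbt, where the version number is again an assumption.

```scala
// build.sbt -- illustrative sketch; the file in the repository may differ.
name := "dv-hw-scala"     // inferred from dv-hw-scala-assembly-1.0.jar
version := "1.0"          // inferred from the "-1.0" suffix of the JAR
scalaVersion := "2.10.6"  // assumption: some 2.10.x release (target/scala-2.10)

// Leave the Scala standard library out of the assembled JAR,
// since the notebook environment already provides it.
assemblyOption in assembly :=
  (assemblyOption in assembly).value.copy(includeScala = false)
```

Library dependencies that the notebook environment already provides would typically be marked "provided" so that they are also left out of the archive.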

To verify the contents of the JAR file, use the jar -tvf command as shown below:

$ jar -tvf .../target/scala-2.10/dv-hw-scala-assembly-1.0.jar
   273 Mon Dec 12 14:04:26 EST 2016 META-INF/MANIFEST.MF
     0 Mon Dec 12 14:04:26 EST 2016 dv/
     0 Mon Dec 12 14:04:26 EST 2016 dv/hw/
   603 Mon Dec 12 14:04:26 EST 2016 dv/hw/HelloWorld$.class
   600 Mon Dec 12 14:04:26 EST 2016 dv/hw/HelloWorld.class

As you can see, we have only the class files from our simple build. If you had built an uber JAR file, you would potentially have hundreds of lines for different Scala packages. You don’t want this, because those bundled classes can conflict with the Scala packages already installed in DSX.
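If you would rather automate this check, the sketch below does roughly what a visual scan of the jar -tvf listing does: it reads the entry names with java.util.jar.JarFile and reports whether any of them sit under scala/, which is where the bundled Scala library would appear in an uber JAR. JarCheck and its method names are hypothetical helpers, not part of the repository.

```scala
import java.util.jar.JarFile
import scala.collection.JavaConverters._

// Hypothetical helper: reports whether a JAR bundles the Scala library.
object JarCheck {
  // True when an entry name belongs to the Scala standard library.
  def isScalaLibraryEntry(name: String): Boolean = name.startsWith("scala/")

  // Names of all entries in the JAR at the given path.
  def entryNames(jarPath: String): List[String] = {
    val jar = new JarFile(jarPath)
    try jar.entries().asScala.map(_.getName).toList
    finally jar.close()
  }

  def main(args: Array[String]): Unit = {
    val names = entryNames(args(0)) // path to the JAR, passed on the command line
    val bundled = names.filter(isScalaLibraryEntry)
    if (bundled.isEmpty)
      println(s"Skinny JAR: ${names.size} entries, none under scala/")
    else
      println(s"Uber JAR: ${bundled.size} entries under scala/")
  }
}
```

Pass the path to your assembled JAR as the only argument when you run it.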

7. Test your 'hello world' library

Once you have built your JAR file, make it available to the DSX environment. The method I use is to host the JAR right back on GitHub. So let's get back to the DSX notebook you created in Step 2.

In the first cell, use this line to add your JAR file. You can use my URL for testing:

%AddJar https://github.com/dustinvanstee/dv-spark-bench/raw/master/target/scala-2.10/dv-spark-bench-assembly-1.0.jar -f

In the next cell, add the following lines, then run the cell.

import dv.hw._  
HelloWorld.main(Array("1","2"))  

If all was successful, you should see Hello, world! echoed to the screen.

Conclusion

While this tutorial may not be exciting, you have accomplished quite a bit in terms of your environment. You are now ready to create your own custom Scala libraries that you can call from within your Jupyter Scala notebooks. In my next post, see how you can build from this simple pattern to write some custom Spark benchmarking code that you can call from your notebooks!

You can see the notebook here: http://ibm.co/2hXuC2D
Original post on medium.com

Dustin VanStee

I am an Open Source Solution Engineer at IBM specializing in technologies like Spark and Hadoop.

Hopewell Junction, NY https://medium.com/@DustinVanStee
