IBM Data Science Experience

Part 2: Add a custom library to a Jupyter Scala notebook in IBM Data Science Experience (DSX) - Spark Benchmark Application

This post is part 2 in a series about how to simplify your Jupyter Scala notebooks by moving complex code into precompiled jar files. It builds on the environment that was set up in Part 1. For a quick refresher see this link .
Ok, now let's get to the fun part. Imagine you are using a Spark environment and want to get a sense of the platform's capabilities. Typically you would run benchmarks against that platform to characterize metrics such as network I/O, CPU, and disk I/O, then record the results for later reference and comparison. Rather than roll my own benchmarks, I decided to make use of some benchmarks made available by the IBM Spark Technology Center, called spark-bench.

Run a Simple Benchmark in DSX

First, let's jump right into a simple benchmark test using linear regression; from there I will highlight a few interesting code snippets and concepts that you can use to create other custom apps with relative ease. You can copy the code below right into a new notebook and run it.

%AddJar -f
import dv.sparkbench.LinearRegression._

// Create a new benchmark object
val bmk = new linearRegressionBenchmark(sc)
bmk setRunLabel "dsx-linreg"
bmk.verbose = false  // verbose flag that controls output

// For LinReg, the threshold is set so low that the algorithm will run the requested number of iterations and then bail
//                       ex       feat  eps part int numIters
bmk addRun linearSettings(100    , 1000,0.5,10,  0.0, 10)  
bmk addRun linearSettings(100    , 1000,0.5,10,  0.0, 100)  
bmk addRun linearSettings(1000   , 1000,0.5,10,  0.0, 10)  
bmk addRun linearSettings(10000  , 1000,0.5,10,  0.0, 10)  
bmk addRun linearSettings(100000 , 1000,0.5,10,  0.0, 10)  
bmk addRun linearSettings(100000 , 1000,0.5,100, 0.0, 10)

bmk loop  
bmk printResults  

I also have a pre-built Jupyter Scala notebook with Linear Regression and Terasort examples, located here DSX Jupyter Notebook , which you can import directly into your own environment.

Notebook Overview

  • Add JAR and import library : As you can see from lines 2 and 3, step one is to import a custom library that contains all the heavy lifting for the linear regression benchmark.

  • Instantiate the benchmark : Next, a custom linear regression benchmark object (bmk) is instantiated on line 6. This object contains a number of methods for quickly testing linear regression with a fabricated data set.

  • addRun : I wanted the benchmark to support multiple kinds of runs in a single invocation, so I added a function called addRun. It lets you queue any sequence of custom benchmark parameters on the benchmark object, which seemed nicer to me than adding some kind of looping construct.

    • Note that each addRun line requires a linearSettings object. Here you enter the number of training examples, features, epsilon, partitions, intercept, and iterations that you want run.
  • loop and printResults : The real heavy lifting is done on line 20. The bmk loop command runs the benchmark for each run you specified above. The runtimes are saved within the object and printed for you with the bmk printResults command.
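A note on syntax: lines like bmk addRun linearSettings(...) use Scala's infix method notation, where obj method arg is equivalent to obj.method(arg). Here is a minimal, self-contained sketch of that style; the class and fields below are illustrative stand-ins, not the actual dv.sparkbench API.

```scala
// Illustrative only: a stand-in for the benchmark object, showing that
// Scala lets you call any single-argument method in infix style.
case class Settings(examples: Int, features: Int)

class MiniBenchmark {
  var label = ""
  var runs = List.empty[Settings]
  def setRunLabel(l: String): Unit = label = l
  def addRun(s: Settings): Unit = runs = runs :+ s
}

val bmk = new MiniBenchmark
bmk setRunLabel "dsx-linreg"    // same as bmk.setRunLabel("dsx-linreg")
bmk addRun Settings(100, 1000)  // same as bmk.addRun(Settings(100, 1000))
println(s"${bmk.label}: ${bmk.runs.size} run(s) queued")
```

Both spellings compile to identical bytecode; the notebook just reads a little more like a configuration script with the dotless form.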

Development Workflow

This is how I structured my workflow:

Initially I started with a notebook and a Spark environment running locally on my laptop. I took special care to match the development environment's software versions to the target cloud environment (in this case DSX). Once the environment was set up, all code prototyping was performed locally. After I had a simple working prototype, I made design decisions about how to structure a simple benchmarking class: I packaged the complex functional code in a JAR, while keeping the high-level code in the notebook.
The final step was to load the compiled JAR file into the DSX environment, do some integration testing, and iterate when I hit issues running the program.

The result of keeping most of the low-level code in the jar file is a clean notebook that can be shared and used to communicate a high-level concept without burying the main point somewhere in the code.

Deeper Dive and Project Review

The code hosted on GitHub provides a great template for your own custom projects. I will cover a few key points in the rest of this post. First, start by cloning or downloading the benchmark source code.

git clone  

Code Packaging

Once you have downloaded the code, let's take a look at how to build it. The build.sbt file sits in the cloned root directory (dv-spark-bench). This project file is matched to the DSX environment for Spark 1.6.0 notebooks.

name := "dv-spark-bench"  
version := "1.0"  
organization := "dv.sparkbench"  
scalaVersion := "2.10.4"

// Note the use of the %% vs % sign.
// %% appends scala version while % does not
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"  
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.6.0" % "provided"  
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.0" % "provided"  
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0" % "provided"

// This is for times where dependencies are called multiple times
assemblyMergeStrategy in assembly := {  
  case PathList("org", "apache", "hadoop", xs @ _*)        => MergeStrategy.first
  case PathList("org", "apache", "spark", xs @ _*)         => MergeStrategy.first
  case PathList("com", "google", xs @ _*)                  => MergeStrategy.first
  case PathList("org", "apache", xs @ _*)                  => MergeStrategy.first
  case PathList("javax", "xml", xs @ _*)                   => MergeStrategy.first
  case PathList("com", "esotericsoftware", xs @ _*)        => MergeStrategy.first
  case PathList(ps @ _*) if ps.last endsWith ".html"       => MergeStrategy.first
  case "application.conf"                                  => MergeStrategy.concat
  case "unwanted.txt"                                      => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

//Important line below.  This strips out all the scala dependencies and shrinks down your jar into skinny jar
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)  

As you can see, this is a lot more complex than our helloworld build.sbt file. First, note the keyword provided in the libraryDependencies lines. This is important to highlight because it tells sbt not to add the Spark and Hadoop dependencies to your jar file. Second, there is a section defining a merge strategy for cases where the same dependency is referenced more than once; here I select the first reference. Finally, the assemblyOption line tells sbt not to include any of the Scala libraries in the jar file. This creates a very skinny jar containing only your own classes.

Compile The Project

For starters, let's compile the source as is. Do this by invoking sbt from the top-level project directory and running the assembly task.

Expected Output :  
[info] Checking every *.class/*.jar file's SHA-1.
[info] Merging files...
[info] Assembly up to date: /dv-spark-bench/target/scala-2.10/dv-spark-bench-assembly-1.0.jar
[success] Total time: 1 s

Once the jar is assembled, take note of the path shown in the sbt output to locate the jar file. You can use this file in your Jupyter notebook. I have been hosting the jar files on GitHub as a convenient way to load them into my DSX notebooks.

Benchmark High Level Design Overview

This project is arranged as a typical Scala project: all the source lives under the src/main/scala directory. Currently I have implemented two benchmarks (Linear Regression and Terasort).
Since benchmarks typically share a few fundamental building blocks, I created an abstract class interface in bmCommon.scala to enforce a standard method for designing benchmarks. In general you have the following:

  • Create a new benchmark object
  • Add custom run parameters to specify how the benchmark should loop
  • For each run:
    • Generate synthetic data
    • Run the benchmark
    • Record the runtime
  • Print the results
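The steps above can be sketched as an abstract base class. This is a hedged approximation of the pattern only; the actual interface lives in bmCommon.scala, and the names and signatures below are my assumptions, not the repo's code.

```scala
// Sketch of a common benchmark skeleton: subclasses supply data
// generation and the measured work; the base class handles queuing
// runs, timing each one, and printing results.
abstract class BenchmarkBase[S] {
  private var runs = List.empty[S]
  var results = List.empty[(S, Long)]

  def addRun(s: S): Unit = runs = runs :+ s

  def generateData(s: S): Unit  // build synthetic data for one run
  def runOnce(s: S): Unit       // the work being timed

  def loop(): Unit = results = runs.map { s =>
    generateData(s)                              // not timed
    val t0 = System.nanoTime()
    runOnce(s)                                   // timed
    (s, (System.nanoTime() - t0) / 1000000)      // runtime in ms
  }

  def printResults(): Unit =
    results.foreach { case (s, ms) => println(s"$s ran in $ms ms") }
}
```

With a structure like this, a concrete benchmark only has to implement generateData and runOnce; addRun, loop, and printResults come for free from the base class.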

This type of structure, which is fairly common across benchmarks, worked well for me; the notebook code for running Terasort or linear regression benefits from a standard implementation. The following section covers how you could go about creating your own benchmark.

Example Implementation

This post won’t cover detailed code implementation, but if you are interested, the file BenchmarkExample.scala in the GitHub repo contains documented code that you could use to implement your own benchmark. It is a simple contrived example that adds a bunch of random numbers, but it shows how you could add your own custom benchmark to the project (also see Terasort.scala and LinearRegression.scala for more complex implementations).
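In that spirit, here is a small self-contained sketch of what such a contrived benchmark could look like. This is not the repo's actual code; the function name and structure are hypothetical, chosen only to show the generate-then-time pattern.

```scala
import scala.util.Random

// Toy benchmark: time how long it takes to sum n pseudo-random doubles.
// Seeded so repeated runs are comparable.
def timeRandomSum(n: Int): (Double, Long) = {
  val rng = new Random(42)
  val t0 = System.nanoTime()
  var sum = 0.0
  var i = 0
  while (i < n) { sum += rng.nextDouble(); i += 1 }
  (sum, (System.nanoTime() - t0) / 1000000)  // (result, runtime in ms)
}

val (total, ms) = timeRandomSum(1000000)
println(f"summed 1000000 randoms in $ms ms (total=$total%.1f)")
```

Returning the sum alongside the runtime keeps the compiler (and JIT) from optimizing the work away, and gives you a sanity check that each run actually did something.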


In this post, I covered some ideas on developing applications for use with your Data Science Experience environment, showing how complex low-level code can be offloaded from your notebook environments and packaged up neatly in jar files. While the development environment setup does take some time, I have found it worth the effort in terms of code reuse, simplification, and efficiency. If you have questions, reach out to me at or on Twitter @dustinvanstee.


Dustin VanStee

I am an Open Source Solution Engineer at IBM specializing in technologies like Spark and Hadoop.

Hopewell Junction, NY
