Working with dashDB in Data Science Experience

Because it is optimized in many ways for analytic workloads, dashDB is an excellent home for the data you analyze with Data Science Experience.

In this article, we review the options for accessing dashDB and reading or writing table data from Data Science Experience.

Overview

There are two approaches to working with dashDB data:

  • R & Python APIs for direct access and push-down to dashDB
  • Access through Spark service to dashDB

Which method to choose depends on your specific needs.

Use the first method when you want to rely on dashDB to scale your analytics through generated SQL statements or invocation of built-in analytic routines running inside the database. Also use the first method when you want to define the SQL statements that run inside dashDB yourself, and you only want to use R or Python to work on the results of these SQL statements, for example, to visualize them.

Use the second method if you want to define your own analytic logic that needs to be scaled out and run in parallel. Also use the second method when you want to employ Spark-based machine learning libraries to work on your dashDB data.

Python and R APIs for dashDB

dashDB provides dedicated API libraries for both R and Python. They both abstract dashDB tables as regular data frames, which is the established mechanism for tabular data representation in both languages. In addition, these APIs provide R and Python wrapper methods to invoke dashDB's built-in predictive analytic routines to train, persist, manage and score predictive models inside dashDB.

ibmdbR

The ibmdbR library is deployed out of the box in Data Science Experience. It allows you to connect directly from R notebooks or R scripts in RStudio to a dashDB database and to interact with table data as if it were held in standard R data frames. Operations that you perform on such data frames are automatically translated into SQL statements that run in dashDB. The library also provides a set of methods to invoke the dashDB predictive analytics routines.

As a quick start, work through the R demo notebooks for ibmdbR.

In addition, the extension package ibmdbRXt is deployed out of the box in Data Science Experience; it provides API methods for working with dashDB's geospatial data and analytic functions.

ibmdbpy

The ibmdbpy library is deployed out of the box in Data Science Experience. It allows you to connect directly from Python notebooks to a dashDB database and to interact with table data as if it were held in pandas data frames. Operations that you perform on such data frames are automatically translated into SQL statements that run in dashDB. The library also provides a set of methods to invoke the dashDB predictive analytics routines.

As a quick start, work through the Python demo notebooks for ibmdbpy.
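
As an illustration (a minimal sketch, not taken from the demo notebooks), the following shows a typical ibmdbpy workflow. The server, port, credentials, and table name are placeholders you need to substitute, and the connection string assumes the DB2 JDBC driver is available in the notebook environment.

# Minimal ibmdbpy sketch: connect to dashDB and work with a table in-database
from ibmdbpy import IdaDataBase, IdaDataFrame

# JDBC-style connection string; replace the placeholders with your own values
idadb = IdaDataBase(dsn="jdbc:db2://<dashDB server>:<port>/<dbname>:user=<username>;password=<password>;")

# Wrap an existing table as an IdaDataFrame; no data is copied at this point
idadf = IdaDataFrame(idadb, "<SCHEMA>.<TABLE>")

# These calls are translated into SQL that runs inside dashDB
print(idadf.shape)       # row and column counts, computed in-database
print(idadf.head(5))     # first rows, returned as a pandas DataFrame
print(idadf.describe())  # summary statistics, pushed down to dashDB

idadb.close()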

Accessing dashDB through Spark

You can use the Apache Spark service in Data Science Experience to read and write dashDB data just as you would with any other storage backend. Specifically, dashDB access relies on Apache Spark's standard JDBC data source mechanism. There are two things to take care of to make this work smoothly: specify the correct JDBC URL for the DB2 driver, and define a dashDB-specific JDBC dialect for Spark to fix the default String data type mapping, which does not fit dashDB.

To start with best practices, take a look at the demo notebooks for accessing dashDB from Spark.

Optimizing data reading for dashDB MPP

Spark provides mechanisms to read data in parallel. When you already have a partitioned data backend such as a dashDB MPP instance, it makes sense to read the individual data partitions into Spark in parallel. This saves a lot of reshuffling work that would otherwise have to happen under the hood.

Here is how you can specify that reading happens in parallel, aligned with the dashDB MPP partitions:

val df = spark.read.
  format("jdbc").
  option("url", "jdbc:db2://<DB2 server>:<DB2 port>/<dbname>").
  option("user", "<username>").
  option("password", "<password>").
  option("dbtable", "<your table>").
  option("partitionColumn", "DBPARTITIONNUM(<a column name>)").
  option("lowerBound", "<lowest partition number>").
  option("upperBound", "<largest partition number>").
  option("numPartitions", "<number of partitions>").
  load()
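
For Python notebooks, the same parallel read can be expressed with the Spark Python API. This is a sketch that mirrors the Scala example above and uses the same placeholders; it assumes a SparkSession is available as spark, as in the Scala example.

# Read the dashDB table in parallel, one Spark partition per MPP partition
df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:db2://<DB2 server>:<DB2 port>/<dbname>")
    .option("user", "<username>")
    .option("password", "<password>")
    .option("dbtable", "<your table>")
    .option("partitionColumn", "DBPARTITIONNUM(<a column name>)")
    .option("lowerBound", "<lowest partition number>")
    .option("upperBound", "<largest partition number>")
    .option("numPartitions", "<number of partitions>")
    .load())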

If you don't know the partitioning of your dashDB MPP system, you can determine it with the following SQL query; the minimum and maximum member numbers correspond to the lowerBound and upperBound options above, and the count to numPartitions:

SELECT min(member_number), max(member_number), count(member_number)
FROM TABLE(SYSPROC.DB_MEMBERS())
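
If you prefer to determine these numbers programmatically in a Python notebook, one option is to run the same query through an ibmdbpy connection, for example with its ida_query method. This is a sketch that assumes the idadb connection object from the ibmdbpy example earlier.

# Fetch the MPP member bounds as a pandas DataFrame via ibmdbpy
bounds = idadb.ida_query(
    "SELECT min(member_number), max(member_number), count(member_number) "
    "FROM TABLE(SYSPROC.DB_MEMBERS())")

# Pass these values as lowerBound, upperBound and numPartitions to the JDBC read above
lower_bound = int(bounds.iloc[0, 0])
upper_bound = int(bounds.iloc[0, 1])
num_partitions = int(bounds.iloc[0, 2])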

Torsten Steinbach
