Because it is heavily optimized for analytic workloads, Db2 Warehouse on Cloud (formerly known as IBM dashDB) is an excellent home for the data you want to analyze with Data Science Experience.
In this article we review the options for accessing Db2 Warehouse on Cloud from Data Science Experience to read and write table data.
There are two approaches to working with Db2 Warehouse on Cloud data:
- R & Python APIs for direct access and push-down to Db2 Warehouse
- Access through Spark service to Db2 Warehouse on Cloud
Which method to choose depends on your specific needs.
Use the first method when you want to rely on Db2 Warehouse on Cloud to scale your analytics through generated SQL statements or invocation of built-in analytic routines running inside the database. Also use the first method when you want to define the SQL statements that run inside Db2 Warehouse on Cloud yourself, and you only want to use R or Python to work on the results of these SQL statements, for example, to visualize them.
Use the second method if you want to define your own analytic logic that needs to be scaled out and run in parallel. Also use the second method when you want to employ Spark-based machine learning libraries to work on your Db2 Warehouse on Cloud data.
Python and R APIs for Db2 Warehouse
Db2 Warehouse provides dedicated API libraries for both R and Python. They both abstract Db2 Warehouse tables as regular data frames, which is the established mechanism for tabular data representation in both languages. In addition, these APIs provide R and Python wrapper methods to invoke Db2 Warehouse's built-in predictive analytic routines to train, persist, manage and score predictive models inside Db2 Warehouse.
The ibmdbR library is deployed out of the box in Data Science Experience. It allows you to connect directly from R notebooks or R scripts in RStudio to a Db2 Warehouse database, and to interact with the table data as if it were standard R data frames. Operations that you perform on such data frames are automatically translated into SQL statements running in Db2 Warehouse. The library also provides a set of methods to invoke the Db2 Warehouse predictive analytics routines.
As a quick starter, use these R demo notebooks:
- Running SQL from R notebook in Db2 Warehouse
- KMeans clustering from R notebooks inside Db2 Warehouse
- Naive Bayes from R notebooks inside Db2 Warehouse on Cloud
- Linear Regression from R notebooks inside Db2 Warehouse
In addition, the extension package ibmdbRXt, which provides API methods for working with Db2 Warehouse's geospatial data and analytic functions, is also deployed out of the box in Data Science Experience.
The ibmdbpy library is deployed out of the box in Data Science Experience. It allows you to connect directly from Python notebooks to a Db2 Warehouse database, and to interact with the table data as if it were pandas data frames. Operations that you perform on such data frames are automatically translated into SQL statements running in Db2 Warehouse. The library also provides a set of methods to invoke the Db2 Warehouse predictive analytics routines.
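A minimal ibmdbpy session looks like the sketch below. The hostname, credentials, and the GOSALES.EMPLOYEE sample table name are placeholders; substitute your own connection details. The JDBC-style DSN shown here additionally requires the jaydebeapi package in the Python runtime.

```python
from ibmdbpy import IdaDataBase, IdaDataFrame

# Placeholders: substitute your own host, credentials, and table name.
dsn = "jdbc:db2://<hostname>:50000/BLUDB:user=<username>;password=<password>"

idadb = IdaDataBase(dsn=dsn)                      # open the connection
idadf = IdaDataFrame(idadb, "GOSALES.EMPLOYEE")   # wrap a table; no data is moved yet

# Pandas-like operations are translated into SQL and pushed down:
print(idadf.shape)    # runs a COUNT(*) inside Db2 Warehouse
print(idadf.head())   # fetches only the first few rows

idadb.close()
```

Because the IdaDataFrame is only a reference to the table, even very large tables can be wrapped cheaply; data is moved to the notebook only when you explicitly materialize results.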
As a quick starter, use these Python demo notebooks:
Accessing Db2 Warehouse on Cloud through Spark
You can use the Apache Spark service in Data Science Experience to read and write Db2 Warehouse on Cloud data just as you would any other storage backend. Under the hood, this access relies on Apache Spark's standard JDBC data source mechanism. Two things need attention to make this work smoothly: specify the correct JDBC URL for the Db2 driver, and register a custom Db2 Warehouse dialect with Spark to fix the default String data type mapping, which is not valid for Db2 Warehouse on Cloud.
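The JDBC URL is easy to get wrong, so the following sketch shows how one might assemble it. The hostname is a placeholder, and the helper function itself is illustrative, not part of any library; the key facts are the `jdbc:db2://` scheme, the default plain port 50000 versus the SSL port 50001, and the `:property=value;` suffix syntax for connection properties such as `sslConnection`.

```python
def db2_jdbc_url(host, database="BLUDB", ssl=False):
    """Build a Db2 Warehouse on Cloud JDBC URL (illustrative helper).

    Plain connections use port 50000; SSL connections use port 50001
    and need the sslConnection property appended after the database name.
    """
    if ssl:
        return f"jdbc:db2://{host}:50001/{database}:sslConnection=true;"
    return f"jdbc:db2://{host}:50000/{database}"

# Placeholder hostname:
print(db2_jdbc_url("dashdb-entry.example.bluemix.net"))
# → jdbc:db2://dashdb-entry.example.bluemix.net:50000/BLUDB
```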
To start off with the right best practices, take a look at these demo notebooks:
- Reading and writing Spark data frames from and to Db2 Warehouse on Cloud with Scala
- Reading and writing Spark data frames from and to Db2 Warehouse on Cloud with PySpark
Optimizing data reading for Db2 Warehouse on Cloud MPP
Spark provides mechanisms to read data in parallel. When you already have a partitioned data backend such as a Db2 Warehouse on Cloud MPP instance, then it makes sense to read the individual data partitions in parallel into Spark. This saves a lot of reshuffling work that would otherwise have to be performed under the hood.
Here is how you can specify that reading happens in parallel along the lines of the Db2 Warehouse on Cloud MPP partitions:
var df = spark.read.
  format("jdbc").
  option("url", "jdbc:db2://<DB2 server>:<DB2 port>/<dbname>").
  option("user", "<username>").
  option("password", "<password>").
  option("dbtable", "<your table>").
  option("partitionColumn", "DBPARTITIONNUM(<a column name>)").
  option("lowerBound", "<lowest partition number>").
  option("upperBound", "<largest partition number>").
  option("numPartitions", "<number of partitions>").
  load()
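To see why these four options matter, it helps to know what Spark does with them: it splits the range between lowerBound and upperBound into numPartitions strides and issues one query per stride, each with its own WHERE predicate on the partition column. The Python sketch below is a simplified illustration of that splitting logic, not Spark's actual implementation, which handles additional corner cases:

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Simplified sketch of how Spark's JDBC source turns the
    lowerBound/upperBound/numPartitions options into per-partition
    WHERE clauses."""
    if num_partitions <= 1:
        return ["1=1"]  # single partition: read the whole table
    stride = max((upper - lower) // num_partitions, 1)
    preds, bound = [], lower
    for i in range(num_partitions):
        if i == 0:
            # first stride also picks up NULLs
            preds.append(f"{column} < {bound + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # last stride is open-ended
            preds.append(f"{column} >= {bound}")
        else:
            preds.append(f"{column} >= {bound} AND {column} < {bound + stride}")
        bound += stride
    return preds

# Hypothetical 4-partition MPP instance, partition numbers 0..3:
preds = partition_predicates("DBPARTITIONNUM(ID)", 0, 3, 4)
for p in preds:
    print(p)
# → DBPARTITIONNUM(ID) < 1 OR DBPARTITIONNUM(ID) IS NULL
# → DBPARTITIONNUM(ID) >= 1 AND DBPARTITIONNUM(ID) < 2
# → DBPARTITIONNUM(ID) >= 2 AND DBPARTITIONNUM(ID) < 3
# → DBPARTITIONNUM(ID) >= 3
```

Because DBPARTITIONNUM(<column>) evaluates to the database partition holding each row, every generated query touches exactly one Db2 Warehouse MPP partition, so the parallel Spark reads line up with the physical data layout.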
In case you don't know the partitioning of your Db2 Warehouse on Cloud MPP system, here is how you can find it out using SQL:
SELECT min(member_number), max(member_number), count(member_number) FROM TABLE(SYSPROC.DB_MEMBERS())
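The three values this query returns map directly onto the partitioning options of the Spark read above. As a sketch, assuming a hypothetical 24-partition MPP instance with member numbers 0 through 23:

```python
# Hypothetical values returned by the SYSPROC.DB_MEMBERS() query:
min_member, max_member, member_count = 0, 23, 24

# How they map onto the Spark JDBC partitioning options
# ("ID" is a placeholder column name):
partition_options = {
    "partitionColumn": "DBPARTITIONNUM(ID)",
    "lowerBound": str(min_member),
    "upperBound": str(max_member),
    "numPartitions": str(member_count),
}
print(partition_options)
```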