Data Science Experience (DSX) is a unified analytics environment providing access to Jupyter notebooks and RStudio on top of IBM Analytics for Apache Spark (think Spark-as-a-Service). DSX is tightly integrated with SoftLayer Object Storage, which serves as the default storage mechanism for files of any size. That said, DSX is flexible and can connect to any cloud or on-premises data source that Spark supports. DSX also includes several predefined connectors for cloud data sources, including Amazon S3 and Microsoft Azure.
This tutorial covers uploading a Parquet table with 30 partitions into SoftLayer Object Storage, then reading that table into a Spark DataFrame. The sample Parquet table used for this tutorial is small (36 MB); you can download it here. The table is a folder containing 30 data files, one for each partition, plus a few metadata files. Be sure to download the entire folder. The data is a sample of public transit schedule information for buses and rapid transit lines from the Massachusetts Bay Transportation Authority. You can access the full datasets here in .csv format.
Although the tutorial uses a Parquet file and a Python notebook, the Object Storage portion of this tutorial is applicable to other file types and notebooks. If you haven’t tried DSX yet, you can create an account here.
1. Get Object Storage Credentials
While logged into DSX, click the profile icon in the top-right corner. On the pane that appears, choose Settings. On the Settings page, choose the Services tab. All IBM Bluemix services that your DSX account has access to will be listed. At a minimum, you should see Apache Spark and Object Storage, as these are created during the DSX setup process. You might have others too.
Click on the ellipsis next to Object Storage, then click Manage on Bluemix. The screen that appears will show all files that are in your SoftLayer Object Storage instance. Click the Service Credentials tab. Copy the JSON block from that page into something you can reference later (e.g. a text file).
Note: The JSON credentials block contains your password. Make sure to keep the block a secret, as anyone who has access to it can access files in your account.
SoftLayer Object Storage is powered by OpenStack Swift, a distributed and scalable object/blob store. OpenStack Keystone is used by Object Storage for authentication. The key-value pairs in the JSON credentials above are used to access Object Storage through the Keystone API.
This tutorial uses the latest version of the Keystone API, version 3. The concept of a domain was added in API v3, and the concept of a tenant was replaced with a project. This can be confusing when reading older tutorials or posts involving Keystone authentication. You can read more about API v3 here.
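To make the v3 vocabulary concrete, here is a minimal sketch of the request body that a Keystone v3 password authentication uses. Note that it is scoped to a project rather than a tenant. The function name and placeholder values are my own, for illustration only; the openstack client builds this request for you, so you never need to write it by hand in this tutorial.

```python
import json


def build_keystone_v3_auth(user_id, password, project_id):
    """Return the JSON body for a Keystone v3 password login scoped to a project.

    The user and the project are both identified by ID, so no domain
    needs to be named in the request.
    """
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {"user": {"id": user_id, "password": password}},
            },
            # In API v2 this scope would have been expressed as a tenant.
            "scope": {"project": {"id": project_id}},
        }
    }


if __name__ == "__main__":
    # Placeholder values; substitute the ones from your credentials block.
    body = build_keystone_v3_auth("<userId>", "<password>", "<projectId>")
    print(json.dumps(body, indent=2))
```

A client would POST this body to the /v3/auth/tokens path of the identity service and read the token from the X-Subject-Token response header.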
2. Install the OpenStack Client
OpenStack provides two clients for Object Storage: an individual client called swift and a client common to all services called openstack. The individual clients are being deprecated in favor of using the common client. This tutorial uses the newer openstack client, which is available for Linux, macOS, and Windows. I am using a Mac, but the experience on other platforms should be similar. From the command line, install the client using pip:
$ pip install python-openstackclient
3. Create an OpenStack RC File
The Object Storage client reads several environment variables to get the credentials. Although these credentials can be provided as arguments, it is usually easier to define them beforehand in a script. This is referred to as an OpenStack RC file. Create a file called dsx-openrc.sh containing the lines below. Replace the four fields in <brackets> with the corresponding values in the JSON block from Step 1.
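As a rough sketch of what such an RC file might contain, the Python snippet below renders one from the credentials JSON. The OS_* names are standard environment variables read by the openstack client. The JSON keys (auth_url, userId, password, projectId, region) and the step of appending /v3 to the auth URL are assumptions on my part, so check them against the block you copied in Step 1.

```python
import json
import shlex


def render_openrc(credentials):
    """Return the text of an OpenStack RC file for Keystone v3 password auth.

    The dictionary keys used here are assumed names; adjust them to match
    your own credentials block.
    """
    auth_url = credentials["auth_url"].rstrip("/")
    if not auth_url.endswith("/v3"):
        auth_url += "/v3"  # assumption: the JSON holds the bare identity host
    settings = [
        ("OS_USER_ID", credentials["userId"]),
        ("OS_PASSWORD", credentials["password"]),
        ("OS_PROJECT_ID", credentials["projectId"]),
        ("OS_REGION_NAME", credentials["region"]),
        ("OS_AUTH_URL", auth_url),
        ("OS_IDENTITY_API_VERSION", "3"),
    ]
    # shlex.quote protects values (passwords especially) containing shell metacharacters.
    return "".join(
        "export {}={}\n".format(name, shlex.quote(value)) for name, value in settings
    )


# Hypothetical usage, assuming the JSON block was saved as credentials.json:
#
#     with open("credentials.json") as f:
#         text = render_openrc(json.load(f))
#     with open("dsx-openrc.sh", "w") as f:
#         f.write(text)
```

Generating the file this way avoids copy-and-paste mistakes, but the result still contains your password, so treat it with the same care as the JSON block.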
Save dsx-openrc.sh to the same directory as your Parquet table. Be sure to save the script to a location not managed by version control since it contains your password.
In the shell from which you want to run OpenStack commands, navigate to the directory containing the table and source the RC file as shown below. This sets the environment variables defined in the script for that shell session.
$ source dsx-openrc.sh
4. Upload the Parquet File
Object Storage includes the concept of a container, which serves as a namespace for objects. By default, DSX creates a container called notebooks during the setup process; we will store the Parquet table in this container.
Object Storage is a key-value store; it isn’t hierarchical. However, users can create pseudo-hierarchies by using the ‘/’ character in object names, such as “/marketing/2015/file.parquet”.
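To illustrate how a flat namespace can still look like folders, here is a small, purely local sketch that emulates a one-level listing over a set of object names. The function and the sample names are hypothetical; nothing here talks to Object Storage. Swift's container listing accepts similar prefix and delimiter parameters.

```python
def list_level(object_names, prefix="", delimiter="/"):
    """Emulate a one-level 'directory' listing over flat object names.

    Names that continue past the next delimiter are collapsed into a
    single pseudo-folder entry ending in the delimiter.
    """
    entries = set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if not rest:
            continue  # the name is exactly the prefix; nothing to list below it
        head, sep, _ = rest.partition(delimiter)
        entries.add(head + sep)
    return sorted(entries)


# Sample object names, made up for this illustration.
names = [
    "marketing/2015/file.parquet",
    "marketing/2016/file.parquet",
    "sales/report.csv",
    "readme.txt",
]

print(list_level(names))                 # top level
print(list_level(names, "marketing/"))   # one level down
```

The store only ever sees four independent keys; the folders exist solely in how the names are grouped when listing.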
The Parquet table in this case is actually a folder containing several parts of the data:
MBTAStopFrequency.parquet/
    _SUCCESS
    _common_metadata
    _metadata
    part-r-00000-374be5e3-1ba6-43e4-bcd3-7dc126fa0f24.gz.parquet
    part-r-00001-374be5e3-1ba6-43e4-bcd3-7dc126fa0f24.gz.parquet
    ...
    part-r-00029-374be5e3-1ba6-43e4-bcd3-7dc126fa0f24.gz.parquet
Upload the contents of the folder using the following command:
$ openstack object create notebooks MBTAStopFrequency.parquet/*
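If you would rather drive the upload from a script, the sketch below builds the same command line in Python. The helper name is my own. Like the shell glob, it picks up every non-hidden file in the folder, and each file's relative path becomes its object name, which is what preserves the MBTAStopFrequency.parquet/ prefix in the container.

```python
import glob
import os


def build_upload_command(container, table_dir):
    """Return the argv for `openstack object create` covering each file in table_dir."""
    files = sorted(
        path
        for path in glob.glob(os.path.join(table_dir, "*"))
        if os.path.isfile(path)
    )
    return ["openstack", "object", "create", container] + files


if __name__ == "__main__":
    cmd = build_upload_command("notebooks", "MBTAStopFrequency.parquet")
    print(" ".join(cmd))
    # To actually run it, the OS_* variables from the RC file must already be
    # set in the environment of this process:
    #
    #     import subprocess
    #     subprocess.run(cmd, check=True)
```

Afterwards, running openstack object list notebooks should show one object per file in the folder.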
5. Use the File From Data Science Experience
Click here to open the ParquetFromObjectStorage notebook in DSX. This notebook uses Spark SQLContext to import the Parquet file and create a Spark DataFrame. The notebook itself is also available as a Jupyter .ipynb file and can be downloaded here.
Follow the instructions embedded in the notebook to insert your Object Storage credentials, then click Cell→Run All. If successful, the last cell will print the schema, followed by the first 10 rows from the DataFrame.
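For orientation, here is a rough sketch of the kind of setup a notebook like this performs. It is not the notebook's actual code. The property names follow the Hadoop Swift connector's fs.swift.service.<name>.* convention, and the credential keys, the service name "spark", and the v3 token URL are all assumptions on my part; whether v3 authentication works this way depends on the connector version in your Spark service, so defer to the notebook's own instructions.

```python
def swift_hadoop_settings(credentials, service_name="spark"):
    """Build Hadoop configuration entries for a Swift service (assumed key names)."""
    prefix = "fs.swift.service." + service_name
    return {
        # Assumption: the connector accepts the Keystone v3 token endpoint here.
        prefix + ".auth.url": credentials["auth_url"].rstrip("/") + "/v3/auth/tokens",
        prefix + ".tenant": credentials["projectId"],
        prefix + ".username": credentials["userId"],
        prefix + ".password": credentials["password"],
        prefix + ".region": credentials["region"],
        prefix + ".public": "false",
    }


def swift_url(container, service_name, object_name):
    """Return a swift:// URL of the form swift://<container>.<service>/<object>."""
    return "swift://{}.{}/{}".format(container, service_name, object_name)


# Inside a DSX Python notebook, where `sc` and `sqlContext` are predefined
# and `credentials` is a dict holding your values, usage might look like:
#
#     hconf = sc._jsc.hadoopConfiguration()
#     for key, value in swift_hadoop_settings(credentials).items():
#         hconf.set(key, value)
#     df = sqlContext.read.parquet(
#         swift_url("notebooks", "spark", "MBTAStopFrequency.parquet"))
#     df.printSchema()
#     df.show(10)
```

Pointing the reader at the table's folder name, rather than at an individual part file, lets Spark pick up all 30 partitions at once.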
At this point, you have successfully uploaded a Parquet table to SoftLayer Object Storage, then read the table into Spark as a DataFrame. This concludes the tutorial. For more DSX tutorials and Big Data courses, check out Big Data University.
This entry was originally posted to my personal blog on Medium.