
Working with Object Storage in Data Science Experience - Python Edition

In this post I'll run through some code needed to read and write data in CSV and JSON formats in Object Storage. This is the storage associated with your Data Science Experience account, which works great for any kind of flat file storage.

Getting Started



To work with Object Storage, you will need your API credentials to authenticate your account. In Data Science Experience notebooks, you can find your data by clicking the 1001 (data) icon in the top right, which opens a panel on the right side showing your data assets. The screenshot below shows the options you have when you click Insert to code for a CSV file in a Python notebook:

As you can see, you can easily insert this CSV file as any of the most popular data types in Python with one click. By choosing one of the first three options, code is inserted into your notebook that leaves you with an object named something like df_data_1 that will be ready for analysis.
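
The exact code that DSX inserts varies with your file and credentials, but conceptually the end result is equivalent to the sketch below (the file name and the way the data is read are illustrative only; the real inserted code reads the file directly from Object Storage):

import pandas as pd

# Sketch only: the generated snippet authenticates to Object Storage and
# streams the file over HTTP; what you end up with is a pandas DataFrame.
df_data_1 = pd.read_csv('my_data.csv')  # 'my_data.csv' is a hypothetical file name
df_data_1.head()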

For the purposes of this post, click Insert Credentials so that you have the object storage API credentials as an object available for use. Inserting credentials in a Python notebook creates code that looks like this:

credentials_1 = {  
  'auth_url':'https://identity.open.softlayer.com',
  'project':'object_storage_9-----3',
  'project_id':'7babac2********e0',
  'region':'dallas',
  'user_id':'9603b8************70f',
  'domain_id':'2c66d***********b9d26',
  'domain_name':'1026***',
  'username':'member_******************',
  'password':"""***************""",
  'container':'TemplateNotebooks',
  'tenantId':'undefined',
  'filename':'data_by_var.json'
}

Reading from Object Storage

As mentioned previously, if you want to read CSV data from Object Storage in a notebook, you can do it with one click.

If you are working with JSON data, the only Insert to code option is to insert the credentials for the data file. You can use the code snippet below to read the JSON data into a Python object:

Reading JSON

Update 12/02/2016 - Data Science Experience now supports one-click importing of JSON data inside Python and R notebooks. The code below will still work in DSX and may be helpful for reference.

import requests
import json
import pandas as pd

def get_data(credentials):
    """This function reads a file from Bluemix Object Storage V3
    and returns its parsed JSON content as a Python object."""

    # Authenticate against the identity (Keystone) service to obtain a token
    url1 = ''.join([credentials['auth_url'], '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],
                                  'domain': {'id': credentials['domain_id']},
                                  'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()

    # Find the public object-store endpoint for your region in the service catalog
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == credentials['region']:
                    url2 = ''.join([e2['url'], '/', credentials['container'], '/', credentials['filename']])

    # Fetch the file using the token returned by the auth request
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2)
    return json.loads(resp2.content)

This code returns a Python dictionary with your JSON data. If you want this data in a Pandas data frame, use the following code:

my_json = get_data(credentials_1)  
my_json_df = pd.DataFrame.from_dict(my_json)  

(Note that I am passing credentials_1 as the argument to get_data because that is the name of the credentials object inserted in the first section.)

Writing to Object Storage

Now, imagine that you have read in your data, done some manipulations, maybe even scored your data with a model. To save the data set that you worked with, you will need to write this data out to object storage. The function below accepts credentials in the same format as the first section, as well as a file name that refers to a file in the virtual machine where your notebook is running.

For example, if you are working with a Pandas data frame (df), you could save it with the following code: df.to_csv('myPandasData.csv',index=False). If you are working with JSON/Python dictionaries, you can save those using the following code:

with open('my_json.json', 'w') as my_json_file:  
    json.dump(my_json, my_json_file)

With your CSV or JSON file saved, you can use the following function:

Writing CSV/JSON

import requests
import json

def put_file(credentials, local_file_name):
    """This function writes a local file to the container in
    Bluemix Object Storage V3 named in the credentials."""

    # Read the local file that will be uploaded
    with open(local_file_name, 'r') as f:
        my_data = f.read()

    # Authenticate against the identity (Keystone) service to obtain a token
    url1 = ''.join([credentials['auth_url'], '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],
                                  'domain': {'id': credentials['domain_id']},
                                  'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()

    # Find the public object-store endpoint for your region in the service catalog
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == credentials['region']:
                    url2 = ''.join([e2['url'], '/', credentials['container'], '/', local_file_name])

    # Upload the file contents with an HTTP PUT using the auth token
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.put(url=url2, headers=headers2, data=my_data)
    print(resp2)

Once again, call this function by passing the credentials you inserted earlier, as well as the name of the file you saved from your data, either as CSV or JSON. If you have a JSON file named 'my_json.json', putting it in Object Storage looks like this: put_file(credentials_1, 'my_json.json').
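
The same call works for the CSV file saved earlier, using the example file name from above:

put_file(credentials_1, 'myPandasData.csv')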

Special Case: Writing Spark DataFrames

If you are working with a large data set in Spark and want to push it directly to Object Storage, you can use the code below. The credentials dictionary is omitted here, but the prerequisite for this step is having your Object Storage credentials in a dictionary. For this example, assume that the credentials dictionary is called creds and your Spark DataFrame is called df.
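
If you have been following along, a minimal setup for this section might look like the sketch below; it assumes the sqlContext object that DSX Spark notebooks provide and reuses the pandas data frame from the reading section:

creds = credentials_1
df = sqlContext.createDataFrame(my_json_df)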

First, set your Hadoop configuration using the function below. It is a slightly tweaked version of the code DSX inserts when you use Insert to code for a Spark DataFrame from Object Storage, and it configures the notebook's Spark service to work with data in the associated Object Storage container.

def set_hadoop_config_with_credentials(creds):  
    """This function sets the Hadoop configuration so it is possible to
    access data from Bluemix Object Storage using Spark"""

    # you can choose any name
    name = 'keystone'

    prefix = 'fs.swift.service.' + name
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + '.auth.url', 'https://identity.open.softlayer.com'+'/v3/auth/tokens')
    hconf.set(prefix + '.auth.endpoint.prefix', 'endpoints')
    hconf.set(prefix + '.tenant', creds['project_id'])
    hconf.set(prefix + '.username', creds['user_id'])
    hconf.set(prefix + '.password', creds['password'])
    hconf.setInt(prefix + '.http.port', 8080)
    hconf.set(prefix + '.region', 'dallas')
    hconf.setBoolean(prefix + '.public', False)

Pass the creds dict to the set_hadoop_config_with_credentials() function. Nothing will be returned from this function.

set_hadoop_config_with_credentials(creds)  

With the configuration in place, we can write a Spark DataFrame. The code below uses the creds dict again to reference the Object Storage container. The output file name is hardcoded here as export_data.csv; be sure to update it as needed.

fileNameOut = 'swift://'+ creds['container'] + '.keystone/export_data.csv'  
df.write.format('com.databricks.spark.csv').options(header='true').save(fileNameOut)  
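
If you later need that file back as a Spark DataFrame, the same swift URL works with the Spark CSV reader. Here is a sketch that assumes the Hadoop configuration above is still in place:

df_readback = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load(fileNameOut)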

Now you should be comfortable reading and writing data in Object Storage from Data Science Experience using Python. In the upcoming weeks I'll show how to do the same tasks in R and Scala.

Tweet me if you want to see anything else or have a tip that I missed!

Greg Filla
