
Working with Object Storage in Data Science Experience - Python Edition

In this post I'll run through some code needed to read and write data in CSV and JSON formats in Object Storage. This is the storage associated with your Data Science Experience account, which works great for any kind of flat file storage.

Getting Started



To work with Object Storage, you will need your API credentials to authenticate your account. In Data Science Experience notebooks, you can find your data by clicking the 1001 icon in the top right, which opens a panel on the right showing your data assets. The screenshot below shows the options you have when you click Insert to code for a CSV file in a Python notebook:

As you can see, you can insert this CSV file as any of the most popular data structures in Python with one click. Choosing one of the first three options inserts code into your notebook that leaves you with an object named something like df_data_1, ready for analysis.
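
Once the generated snippet has run, the object is ready to use. For example, with the pandas DataFrame option:

# Inspect the first rows of the inserted data frame
df_data_1.head()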

For the purposes of this post, click Insert Credentials so that you have the object storage API credentials as an object available for use. Inserting credentials in a Python notebook creates code that looks like this:

credentials_1 = {  
  'auth_url':'https://identity.open.softlayer.com',
  'project':'object_storage_9-----3',
  'project_id':'7babac2********e0',
  'region':'dallas',
  'user_id':'9603b8************70f',
  'domain_id':'2c66d***********b9d26',
  'domain_name':'1026***',
  'username':'member_******************',
  'password':"""***************""",
  'container':'TemplateNotebooks',
  'tenantId':'undefined',
  'filename':'data_by_var.json'
}

Reading from Object Storage

As mentioned previously, if you want to read CSV data from Object Storage in a notebook, you can do it with one click.

If you are working with JSON data, the only option is to insert the file's credentials into your code. You can then use the code snippet below to read the JSON data into a Python object:

Reading JSON

Update 12/02/2016 - Data Science Experience now supports one-click importing of JSON data inside Python and R notebooks. The code below will still work in DSX and may be helpful for reference.

import requests
import json
import pandas as pd

def get_data(credentials):
    """Return the parsed JSON content of a file stored in
    Bluemix Object Storage V3."""
    # Authenticate against the OpenStack Identity (Keystone) V3 endpoint
    url1 = credentials['auth_url'] + '/v3/auth/tokens'
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],
                                  'domain': {'id': credentials['domain_id']},
                                  'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    # Find the public object-store endpoint for your region in the service catalog
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == credentials['region']:
                    url2 = ''.join([e2['url'], '/', credentials['container'], '/', credentials['filename']])
    # Fetch the file using the auth token returned in the response headers
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2)
    return json.loads(resp2.content)

This code returns a Python dictionary with your JSON data. If you want this data in a Pandas data frame, use the following code:

my_json = get_data(credentials_1)  
my_json_df = pd.DataFrame.from_dict(my_json)  

(Note that I am passing credentials_1 as the argument to the get_data function because that is the name of the credentials object inserted in the first section.)
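
One caveat: pd.DataFrame.from_dict expects a dictionary mapping column names to values. If your JSON is instead a list of records, you can build the frame directly; a minimal sketch, assuming a hypothetical flat record structure:

# Assumes my_json is a list of flat dictionaries (hypothetical structure), e.g.
# [{'var': 'age', 'value': 34}, {'var': 'income', 'value': 52000}]
my_json_df = pd.DataFrame(my_json)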

Writing to Object Storage

Now, imagine that you have read in your data, done some manipulations, maybe even scored your data with a model. To save the data set that you worked with, you will need to write this data out to object storage. The function below accepts credentials in the same format as the first section, as well as a file name that refers to a file in the virtual machine where your notebook is running.

For example, if you are working with a Pandas data frame (df), you could save it with the following code: df.to_csv('myPandasData.csv', index=False). If you are working with JSON/Python dictionaries, you can save those using the following code:

with open('my_json.json', 'w') as my_json_file:  
    json.dump(my_json, my_json_file)

With your CSV or JSON file saved, you can use the following function:

Writing CSV/JSON

import requests
import json

def put_file(credentials, local_file_name):
    """Write the contents of a local file to Bluemix
    Object Storage V3."""
    # Read the local file to upload
    with open(local_file_name, 'r') as f:
        my_data = f.read()
    # Authenticate against the OpenStack Identity (Keystone) V3 endpoint
    url1 = credentials['auth_url'] + '/v3/auth/tokens'
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],
                                  'domain': {'id': credentials['domain_id']},
                                  'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    # Find the public object-store endpoint for your region in the service catalog
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == credentials['region']:
                    url2 = ''.join([e2['url'], '/', credentials['container'], '/', local_file_name])
    # PUT the file contents to the container using the auth token
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.put(url=url2, headers=headers2, data=my_data)
    print(resp2)

Once again, call this function by passing the credentials you inserted earlier, along with the name of the file you saved from your data, either as CSV or JSON. If you have a JSON file named my_json.json, putting it in Object Storage looks like this: put_file(credentials_1, 'my_json.json').
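
The CSV case is identical; to upload the Pandas data frame saved earlier:

# Uploads the CSV written by df.to_csv above
put_file(credentials_1, 'myPandasData.csv')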

Special Case: Writing Spark DataFrames

If you are working with a large data set in Spark and want to push it directly to Object Storage, you can use the code below. Assume for this example that your credentials are called creds and your data frame is called df.

fileNameOut = 'swift://' + creds['container'] + '.spark/export_data.csv'
df.write.format('com.databricks.spark.csv').options(header='true').save(fileNameOut)
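
This path relies on the Hadoop swift driver already knowing your Object Storage credentials. In DSX notebooks backed by the Spark service this is typically configured for you; if you need to set it up yourself, the sketch below shows the usual pattern, assuming the creds dictionary from above and a SparkContext named sc. The provider name ('spark') must match the name after the dot in the swift:// URL.

# A sketch of the Hadoop swift configuration; assumes creds has the same
# fields as the credentials inserted earlier and that sc is your SparkContext.
prefix = 'fs.swift.service.spark'  # 'spark' matches the '.spark' in the URL
hconf = sc._jsc.hadoopConfiguration()
hconf.set(prefix + '.auth.url', creds['auth_url'] + '/v3/auth/tokens')
hconf.set(prefix + '.auth.endpoint.prefix', 'endpoints')
hconf.set(prefix + '.tenant', creds['project_id'])
hconf.set(prefix + '.username', creds['user_id'])
hconf.set(prefix + '.password', creds['password'])
hconf.setInt(prefix + '.http.port', 8080)
hconf.set(prefix + '.region', creds['region'])
hconf.setBoolean(prefix + '.public', True)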

Now you should be comfortable reading data from and writing data to Object Storage in Data Science Experience using Python. In the upcoming weeks I'll show how to do the same tasks in R and Scala.

Tweet me if you want to see anything else or have a tip that I missed!

Greg Filla
