IBM  Data Science  Experience

Excel Files: Loading from Object Storage - Python

In this blog post we will work with pulling an Excel file from Object Storage into our notebook.

This blog leverages a previous post that introduced work with Object Storage in DSx.

Prerequisites

In order to read the excel file into a Pandas DataFrame, you might need to install a prerequisite package:
!pip install xlrd

Getting Data

Data Science Experience allows you to insert a file into your notebook with a single click. Once in the notebook, you can see your data assets you have in Object Storage by clicking the 1001 icon in the top right, opening up a panel on the right side of the notebook. The screenshot below shows the options you have when you click Insert to code for an Excel file in a Python notebook. Selecting the first option Insert StringIO object will bring in the code that will need to slightly tweak for working with Excel files.
alt

With just one click you bring in the code that uses your Object Storage API credentials to pull in your data into a generic data_1 object. You can rename it later if you choose.

While the StringIO code will allow you to seamlessly import CSV files with one click, you will need to make two minor adjustments in order to load Excel Files:
1. On first line of code in the cell, replace from io import StringIO with from io import BytesIO - BytesIO allows you to work with binary data. Using StringIO class might generate Unicode-related errors when you attempt to read the Excel File into a Pandas DataFrame
2. Change the last line of the function get_object_storage_.... that calls your Object Storage to return BytesIO(resp2.content) Simply change StringIO to BytesIO

The full code is below:

from io import BytesIO  
import requests  
import json  
import pandas as pd

# @hidden_cell
# This function accesses a file in your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def get_object_storage_file_with_credentials_***(container, filename):  
    """This functions returns a StringIO object containing
    the file content from Bluemix Object Storage."""

    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': 'member_********************','domain': {'id': '**************'},
            'password': '*************}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if(e1['type']=='object-store'):
            for e2 in e1['endpoints']:
                        if(e2['interface']=='public'and e2['region']=='dallas'):
                            url2 = ''.join([e2['url'],'/', container, '/', filename])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2)
    return BytesIO(resp2.text)

# Your data file was loaded into a StringIO object and you can process the data.
# Please read the documentation of pandas to learn more about your possibilities to load your data.
# pandas documentation: http://pandas.pydata.org/pandas-docs/stable/io.html
data_1 = get_object_storage_file_with_credentials_***('SampleProject', 'Superstore.xlsx')  

With the file object loaded we can now use pandas imported when the one-click Insert StringIO Object code was execute.
We create an excel file object and pull in the desired tab into a pandas dataframe with 2 simple lines of code, previewing it with myd.head()

xls_file = pd.ExcelFile(data_1) #create instance of ExcelFile class  
myd = xls_file.parse('Orders') #read the data from selected Excel sheet  
myd.head()  

Note: This is a sample data set. Customer names and details are not real. The data set is available for download.