On May 20, 2015, then U.S. Chief Data Scientist DJ Patil tweeted,
"I’ve found in my experience that cleaning the data is 80% of the hard work." (1)
Since then, this metric has been ubiquitously quoted because it's both accurate and data cleansing is crucial to the success of any analytics project.
Data tidying includes, but is not limited to:
- Meaningful column names
- Atomic data elements
- Consistent observational granularity
- Relevant features and if required, a target variable
Why does data need to be tidy? Because most modeling algorithms work best with well formed data that is in a table format. Exceptions to this include analytics on any unstructured data source like text analytics.
This hands-on tutorial demonstrates the basics of data tidying in Data Science Experience using an R notebook.
Specifically, you will be shown:
- How to access data that is local and external
- Approaches to identifying and fixing data issues using R
- Best practices on documentation, sharing, and reproducibility
Upon completion, you will have a working knowledge of how to tidy a data set. Also, you will be provided with all the code presented here. All source data used here is retrieved from an Amazon Web Services public repository. It is available in the public domain and is attributed to Lending Club "Lending Statistics."
The complete tutorial, which includes the Jupyter R notebook and associated data sets, is available for download on Github.
- Familiarity with the R programming language
- A DSX account - sign up for your free account
- Basic understanding of Jupyter notebooks
- Optional: Working knowledge of Github
Table of Contents
- Load and Explore Data
- Tidy Data
- Fix Column Headers
- Convert % Character Data to Numeric
- Unpack Packed Fields
- Discover and Remove Null Data
- Convert Character Categories to Numeric
- Plot Data and Fit Linear Model
- Summary and Next Steps
The Target Container assumes the name of the project but with the spaces removed, e.g. “TidyData.” Click “Create” to continue. You will be brought to the project’s dashboard:
Although not required, I do recommend that Github be used for notebook versioning. It is free to join.
To use Github:
- Create a repository (Github)
- Generate an access token for the repository (Github)
- Add the token to your profile (DSX -> profile -> settings -> integration)
- Add Github repository to project
Adding the URL is accomplished by clicking the “Settings” tab in the dashboard, scrolling to the bottom, and then pasting the URL in the “Connect to a GitHub Repository” field:
Click “Add” to tie the repository to the project.
There are several ways to get data into a DSX project. The simplest way is to add it manually:
Browse local file system and select data file to load:
In the overview page, you will see the file has been uploaded. Check it off, then click "Apply":
The CSV is now part of the project and is accessible to any notebook within the project. There is only one more thing needed before coding can begin; create an R notebook.
From the "Overview" page of the project, click on "add notebooks" to create a new notebook:
Give the notebook a meaningful name and description. Be sure to choose "R" as the "Language and accept the remaining defaults:
Click "Create Notebook" to continue. You will be prompted with an empty notebook:
Recall that the data set “sample.csv” is part of the project; it is stored in a container within a Bluemix Object Storage container. Access to the container requires credentials.
In the upper right of the notebook header there is an small icon made up of “1’s” and “0’s.” Click on this icon and a panel will appear and you will see the “sample.csv” file. Beneath “sample.csv” click on “Insert to code” then choose “Insert Credentials.”
The result is that the first code cell "In" has been populated with credential information necessary to move data in and out of the object store:
The object “credentials_1” contains a lot of information, I am only showing a bit of it here because it contains sensitive credential information that I don’t want to expose. Keep this in mind if you choose to share a notebook.
If you chose to use Github, you can publish the notebook at any time. To publish, click the publish icon on the header of the notebook, then choose “Publish on Github”:
You will be prompted with a dialogue in which you add a commit message. You also need to choose an option for “Cell content.” Choose “All content except hidden code cells.”
By choosing this option, the notebook is saved without sensitive, or “hidden cell” items as indicated by the token “# @hidden_cell”.
A quick check of Github shows that the notebook has been published:
From the point forward, you can check in changes to Github at your discretion. The last step in the bootstrap process is to install and load R packages that make moving data in and out of Bluemix Object Storage easier:
The three new codes cells install and activate two R packages "devtools" and "objectStoreR."
The first is part of the R eco-system. The second was developed by the IBM Data Science Experience team and is open source under the Apache License 2.0.
Everything up to this point has laid the plumbing enabling us work with R, in a Jupyter notebook, accessing both internally and externally. Recall that we uploaded the file “sample.csv.” Let’s see how easy it is to retrieve it from object storage and use it in an R data frame:
We now know how to pull data from object storage but what if the data is external and can only be accessed by URL? Below, I demonstrate how to pull data back from an external data source, put it into a data frame, and how to self-document:
Access external open source loan data using R:
The data set was imported as a CSV file "loanData.csv" but it had to be converted to an R data frame "loanDataRaw" to make it usable in R.
Datafile "loanDataRaw" has 2500 records and 14 columns.
Great, but what does the data look like?
This brief output demonstrates that the data, and its metadata, are not tidy. A few of the many items that should be addressed include:
- Column headings: Inconsistent case and use of periods
- Interest.Rate: Numerical data stored as strings
- Loan.Length: Can be simplified and represented as numerical data
- Fico.Range: Single column range would be more useful as distinct columns
The complete raw data set can be viewed on GitHub.
With a better understanding of the data and its issues, we first need to copy this data to storage before making the necessary changes.
The helper function "objectStore.put" makes quick work of copying the raw data set to object storage. The output of the "status" function is "201" which indicates success.
Let's dive in and start tidying up this data.
Step 1: Fix the column headings, remove the periods, and change the names so that they are consistent:
You can see several of the column names before and after the treatment. This consistent naming will make things easier moving forward.
Step 2: Remove "%" sign from "InterestRate" and "DebtToIncomeRatio" then cast as numeric:
Step 3: Break up FICORange into three new columns: FICOLow, FICOHigh, FICOAvg:
Step 4: Find and remove rows that have null values:
Step 5: Convert string values to a numerical representation for several variables: LoanPurpose, EmploymentLength, HomeOwnership, and LoanLength.
I show just one transformation for brevity. The other three transformations are detailed in the complete tutorial R Notebook that supports this tutorial.
A quick check of the data shows that the transformations have been applied:
- Column headings: Consistent case
- InterestRate, DebtToIncomeRatio: "%" removed, numerical represenation
- LoanLength: Numerical representation
- FicoRange: Single concatenated string transformed to three numerical columns
- Plus several other transformations
The data is tidy and ready for analysis. Before I proceed, I want to write the tidy data set to BlueMix Object Storage. This step is not always necessary because we can always regenerate the final data set. However, I may want to perform analysis in another notebook but avoid having to process the raw data again.
As indicated earlier in the post, a "201" return message indicates it was successfully saved to object storage.
Using the tidy data I will now demonstrate basics graphing, analysis, and the creation of a linear model.
Simple plot of Interest Rate vs. Avg FICO Score:
Check for correlations between variables in the data. This is an example of the need for numerical data. Recall the string to numerical representation step:
Checking for correlations is an important step that helps identify variables that are tied to both independent and dependent variables; known as confounders. This type of analysis is outside the scope of this blog, but I do want to point out how this output might be used to indicate confounding data.
The highlighted output shows the relationships between InterestRate, AmountFunded, LoanLength, and FICOavg. Nothing seems too suspicious here. It is reasonable to expect that the interest rate charged for a loan is functionally dependent on the variables marked in green. However, a good question to ask is whether or not AmountFundedByInvestors and LoanLength are necessary for analysis. They have similar correlations to InterestRate. Without further analysis, this question can't be answered, but it certainly warrants further exploration.
In any data analysis or in machine learning, an output is a model that can be used to make predictions on new data elements as they are generated. The R programming language makes it easy to fit data to a linear model. Let's look at Interest Rate vs. FICO Avg Score:
Plot the data points and overlay the linear model:
In this tutorial I demonstrated how Data Science Experience, along with an R notebook is used to tidy data and ready it for analytics.
- Project creation in Data Science Experience
- Github association to a project
- Bluemix object storage credentials: retrieval and use
- Accessing data in Bluemix, and externally via HTTPS
- Basic methods of data exploration using R
- Data operations on R data frames
- Basic introduction to data analysis using R
I strongly encourage you to pursue a deep dive on the topics presented in this tutorial.
- Data Science Experience Blog - Tutorials, Sample Notebooks, Open Data Sets, plus more
- Getting Started with Github - Short but informative tutorials on learning Github
- Introduction to R - New to R? This is where you start
- Tidy Data - Complete guide to data tidying
- IBM Data Science Experience Githu Repository - Great resource to see what others are doing with DSX
- The Jupyter Notebook - In-depth Jupyter notebook documentatio
Lending Club - Lending Club Statistics. Accessed via https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv?accessType=DOWNLOAD on 03/25/2017
 DJ Patil. (2015). Twitter. https://twitter.com/dj44/status/601119768955920384
 Hadley Wickham. (2014). Journal of Statistical Software: Tidy Data. https://www.jstatsoft.org/article/view/v059i10