On June 6 we introduced the IBM Data Science Experience to the world at the Spark Maker Event, held at Galvanize. We demonstrated the Experience with a real use case developed in partnership with BlocPower.
BlocPower is a startup based in New York City. Its technology and finance platform develops clean energy projects in American inner cities. IBM Data Science Experience helped BlocPower perform a comprehensive energy audit of each property to determine the correct mix of high-efficiency technology to reduce each customer's energy consumption. Tooraj Arvajeh, Chief Engineering Officer at BlocPower, explained how IBM Data Science Experience made this process simpler.
"BlocPower operation is diverse from outreach and targeting, origination of investment-grade clean energy projects to financing projects through our crowdfunding marketplace. Data is the underlying tool of our operation and IBM's Data Science Experience will facilitate a closer integration across it and help our business scale up faster."
Goals of the demo:
- Easily import data into a notebook from object storage to quickly start analyzing data and creating predictive models.
- Model energy usage of buildings in kWh.
- Identify buildings that consume energy inefficiently.
- Create a project and collaborate with other data scientists.
- Create an easy-to-use application to make the outcome of the models consumable by any user.
To do that, we used tools that data scientists love today that are integrated into the IBM Data Science Experience: Jupyter notebooks connected to Apache Spark, RStudio, Shiny, and GitHub.
These are the steps that we followed:
1- GitHub + Jupyter notebooks = <3
When starting a new project, the data scientist can choose to start from scratch or to build on someone else's work. In this case, we showcased the Import from URL capability to import an existing notebook from GitHub and start working on it right away. There are more than 200k public Jupyter notebooks out there that you can use!
2- Load and clean data
To analyze data in a Jupyter notebook, you first need to load it. Many libraries and commands can do that, but it's not always obvious which one to use. One add-on to Jupyter notebooks in the Data Science Experience is the ability to access data files stored in object storage or available through data connections and, in one click, insert the code needed to load the data into the notebook.
Once the data is loaded, the next step is to clean it. We created a library called Sparkling.Data, which can scale to big data, to help the data scientist perform this task.
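The API of Sparkling.Data is not shown in this post, so here is a minimal sketch of the same load-and-clean step using pandas as a stand-in. The column names and CSV values are made up for illustration, and an in-memory string replaces the file that would come from object storage.

```python
import io

import pandas as pd

# Hypothetical CSV contents; in the demo the data came from object storage.
RAW_CSV = """building_id,age_years,stories,square_feet,electricity_kwh
B1,85,6,42000,310000
B2,40,,18000,120000
B3,102,12,95000,not_available
B4,15,3,9000,45000
B4,15,3,9000,45000
"""


def load_and_clean(csv_text):
    """Load a CSV string, coerce numeric columns, drop duplicates and incomplete rows."""
    df = pd.read_csv(io.StringIO(csv_text))
    numeric_cols = [c for c in df.columns if c != "building_id"]
    # Non-numeric placeholders such as "not_available" become NaN.
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
    # Remove exact duplicate rows, then rows with any missing value.
    return df.drop_duplicates().dropna().reset_index(drop=True)


clean = load_and_clean(RAW_CSV)
print(clean)
```

Dropping incomplete rows is only one possible policy; imputing missing values may be a better fit when the data set is small.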
3- Data Exploration
After cleaning the data, we used Matplotlib, a widely used Python library for data visualization, to explore the correlations between energy usage and building characteristics such as age, number of stories, square footage, amount of plugged equipment, and domestic and heating gas consumption. By analyzing variable relationships, the data scientist can, for example, determine the best model to use and which variables have more predictive power.
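To make the idea of "predictive power" concrete, the sketch below computes the Pearson correlation between each feature and energy usage, which is the number a scatter plot lets you estimate by eye. The data is synthetic and the relationship between floor area and usage is an assumption made purely for illustration; it is not BlocPower data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic building characteristics (illustrative ranges only).
square_feet = rng.uniform(5_000, 100_000, n)
age_years = rng.uniform(5, 110, n)

# Assumption: usage grows with floor area, plus noise; age has no effect here.
electricity_kwh = 3.0 * square_feet + rng.normal(0, 20_000, n)

features = {"square_feet": square_feet, "age_years": age_years}

# Pearson correlation of each feature with energy usage.
correlations = {
    name: float(np.corrcoef(values, electricity_kwh)[0, 1])
    for name, values in features.items()
}

# Print features from strongest to weakest absolute correlation.
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: r = {r:+.2f}")
```

The same arrays can be passed to a Matplotlib scatter plot to inspect each relationship visually, which also reveals nonlinear patterns that a single correlation coefficient hides.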
4- Create a Prediction Model
Our goal was to create a model that predicts the energy consumption in kWh of different buildings based on characteristics such as square footage, age, number of stories, and so on. We modeled energy usage with a linear regression using the implementation included in scikit-learn, a widely used Python library for machine learning. Before running the linear regression, we scaled the data with scikit-learn's MaxAbsScaler. To visualize the fit of this model, we used a scatter plot of the observed vs. the predicted values. The resulting R-squared value was approximately 0.72.
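A minimal sketch of this modeling step might look like the following. It chains MaxAbsScaler and LinearRegression in a scikit-learn pipeline and reports R-squared. The features, coefficients, and noise level are synthetic assumptions, so the score it prints is unrelated to the 0.72 obtained in the demo.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(42)
n = 300

# Synthetic features: square footage, age in years, number of stories.
X = np.column_stack([
    rng.uniform(5_000, 100_000, n),
    rng.uniform(5, 110, n),
    rng.integers(1, 30, n),
])

# Assumed linear relationship plus noise (illustration only).
y = 3.0 * X[:, 0] + 500.0 * X[:, 1] + 2_000.0 * X[:, 2] + rng.normal(0, 25_000, n)

# Scale each feature by its maximum absolute value, then fit the regression.
model = make_pipeline(MaxAbsScaler(), LinearRegression())
model.fit(X, y)

predicted = model.predict(X)
r2 = r2_score(y, predicted)  # in-sample R-squared
print(f"R-squared on the synthetic data: {r2:.2f}")
```

For ordinary least squares, scaling the features does not change the predictions; it mainly puts the coefficients on a comparable scale. Note also that the score above is computed on the training data, so a held-out test set would give a more honest estimate.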
5- Classify buildings by efficiency
We used the popular K-means algorithm to cluster buildings in NYC based on four dimensions that indicate energy efficiency: gas use for heating, gas use for domestic purposes, electricity use for plugged equipment, and electricity use for air conditioning. In a Matplotlib scatter plot, we colored the buildings by their K-means labels (K=4) along two of the four dimensions. This visualization, along with others not shown here, helped us reduce the four clusters to two. We interpreted these two clusters as the efficient and the inefficient groups of buildings.
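The clustering step can be sketched as follows, using scikit-learn's KMeans with K=4 on four usage dimensions. The data is synthetic, and the rule that merges four clusters into two (ranking clusters by their mean scaled usage) is a hypothetical stand-in: in the demo, that reduction came from inspecting the plots, not from an automated rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(7)

# Columns: heating gas, domestic gas, plug electricity, AC electricity
# (arbitrary units; the four group centers are made up for illustration).
centers = [
    [1.0, 0.5, 2.0, 1.0],  # lower usage, plug-heavy
    [1.0, 0.5, 1.0, 2.0],  # lower usage, AC-heavy
    [3.0, 1.5, 4.0, 3.0],  # higher usage, plug-heavy
    [3.0, 1.5, 3.0, 4.0],  # higher usage, AC-heavy
]
usage = np.vstack([rng.normal(loc=c, scale=0.1, size=(60, 4)) for c in centers])

# Put the four dimensions on a comparable scale before clustering.
scaled = MaxAbsScaler().fit_transform(usage)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)
labels = kmeans.labels_

# Hypothetical merge rule: the two clusters with the highest mean
# scaled usage across all four dimensions are called "inefficient".
centroid_score = kmeans.cluster_centers_.mean(axis=1)
inefficient_clusters = np.argsort(centroid_score)[2:]
is_inefficient = np.isin(labels, inefficient_clusters)

print(f"{int(is_inefficient.sum())} of {len(usage)} buildings flagged as inefficient")
```

Plotting any two columns of `usage` colored by `labels` gives the kind of view described above, and coloring by `is_inefficient` gives the two-group view.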
6- Flexdashboard and Shiny in RStudio
RStudio recently published a new R package on CRAN called flexdashboard. This package makes it easy to create dashboards, and you can include Shiny code to make them interactive. A dashboard can be shared with anyone by simply sending the URL.
The dashboard is divided into 4 sections:
- Data Exploration: A map of buildings colored by their electricity consumption. When a building is selected, a bar plot indicates how this building is doing with respect to the average energy efficiency measured in four dimensions.
- Clustering: A map of buildings classified as efficient or inefficient.
- Prediction: Scoring of the linear regression model built in the notebook to predict the energy usage in kWh and annual cost of electricity for the buildings. On the left side are sliders for selecting the properties of the building to score the model.
- Raw Data: We use the DT package, an R interface to the DataTables library, to display the data set with search and sorting capabilities.
You can check out the 10-minute demo of IBM Data Science Experience here:
We created a GitHub repository with all of the material and instructions needed to run this demo, too. Enjoy!