IBM  Data Science  Experience

Analyze open data sets using pandas in a Python notebook

Open data is freely available, which means you can modify, store, and use it without any restrictions. Governments, academic institutions, and publicly focused agencies are the most common providers of open data. They typically share things like environmental, economic, census, and health data sets. You can learn more about open data from The Open Data Institute or from wikipedia.

Two great places to start browsing are data.gov and data.gov.uk where you can find all sorts of data sets. Other good sources are the World Bank, the FAO, eurostat and the bureau for labor statistics. If you're interested in a specific country or region, just do a quick Google search, and you'll likely uncover other sources as well.

Open data can be a powerful analysis tool, especially when you connect multiple data sets to derive new insights. This tutorial features a notebook that helps you get started with analysis using pandas. Pandas is one of my favorite data analysis packages. It's very flexible and includes tools that make it easy to load, index, classify, and group data.

In this tutorial, you will learn how to work with a DataFrame in 2 basic steps:

  1. Load data from open data sets into a Python notebook in Data Science Experience.
  2. Work with a Python notebook on Data Science Experience (join data frames, clean, check, and analyze the data using simple statistical tools).

Data & analytics on Data Science Experience

Data Science Experience features a selection of open data sets that you can download and use any way you want. It's easy to get an account, start a notebook, and grab some data:

  1. Sign in to Data Science Experience (or sign up for a free trial).
  2. Open the sample notebook called Analyze open data sets with pandas DataFrames . To open the sample notebook, click here (or type its name in the Search field on the home page of Data Science Experience and select the card for the notebook), then click the Shows the Create Notebook icon button on the top of the preview page that opens. Select a project and Spark service and click Create Notebook. The sample notebook opens for you to work with.
  3. Find the first data set and get its access key URL.
    1. From the Data Science Experience home page, search for "life expectancy".
    2. Click the card with the title Life expectancy at birth by country in total years.
    3. Click the Manage Access Keys button.
    4. Click Request a New Access Key.
    5. Copy the access key URL, and click Close. You'll use this link in a minute to load data into the Python notebook.

    Tip: If you don't want to run the commands yourself, you can also just open the notebook in your browser and follow along: https://apsportal.ibm.com/exchange/public/entry/view/47ed96c50374ccd15f93ef262c1af63b

    Load data into a DataFrame

    Paste the access key URL you copied from the Life Expectancy data set into the following code (replacing the <LINK-TO-DATA> string). Then run the following code to load the data in a data frame. This code keeps 3 columns and renames them.

    import pandas as pd
    import numpy as np
    
    # life expectancy at birth in years
    life = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area','Year','Value'])
    life.columns = ['country','year','life']
    life[0:5] 
    
    country year life
    0 Afghanistan 2012 60.509122
    1 Afghanistan 2011 60.065366
    2 Afghanistan 2010 59.600098
    3 Afghanistan 2009 59.112341
    4 Afghanistan 2008 58.607098

    Life expectancy figures might be more meaningful if we combine them with other open data sets from Data Science Experience. Let's start by loading the data set Total Population by country. To do so, find the data set on the DSX home page, request an access key for it, and replace <LINK-TO-DATA> with your access key URL in the following code. Then run the code.

    # population
    population = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value'])
    population.columns = ['country', 'year','population']
    
    print "Nr of countries in life:", np.size(np.unique(life['country']))
    print "Nr of countries in population:", np.size(np.unique(population['country'])) 
    

    Nr of countries in life: 246
    Nr of countries in population: 277 
    

    Joining data frames

    These two data sets don't fit together perfectly. For instance, one lists more countries than the other. When we join the two data frames we're sure to introduce nulls or NaNs into the new data frame. We'll use the pandas merge function to handle this problem. This function includes many options. In the following code, how='outer' makes sure we keep all data from life and population. on=['country','year'] specifies which columns to perform the merge on.

    df = pd.merge(life, population, how='outer', sort=True, on=['country','year'])
    df[400:405]    
    
    country year life population
    400 Antigua and Barbuda 1998 72.973780 74206.0
    401 Antigua and Barbuda 1999 73.186024 76041.0
    402 Antigua and Barbuda 2000 73.397293 77648.0
    403 Antigua and Barbuda 2001 73.606073 78972.0
    404 Antigua and Barbuda 2002 73.813390 80030.0

    We can add more data to the data frame in a similar way. For each data set in the following list, find the data set on the DSX home page, request an access key URL, and copy the the URL into the code (again replacing the <LINK-TO-DATA> string with the corresponding access key URL):

    • Population below national poverty line, total, percentage
    • Primary school completion rate % of relevant age group by country
    • Total employment, by economic activity (Thousands)
    • Births attended by skilled health staff (% of total) by country
    • Measles immunization % children 12-23 months by country
        # poverty (%)
        poverty = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value'])
        poverty.columns = ['country', 'year','poverty']
        df = pd.merge(df, poverty, how='outer', sort=True, on=['country','year'])
    
        # school completion (%)
        school = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value'])
        school.columns = ['country', 'year','school']
        df = pd.merge(df, school, how='outer', sort=True, on=['country','year'])
    
        # employment
        employmentin = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area','Year','Value','Sex','Subclassification'])
        employment = employmentin.loc[(employmentin.Sex=='Total men and women') &  (employmentin.Subclassification=='Total.')]
        employment = employment.drop('Sex', 1)
        employment = employment.drop('Subclassification', 1)
        employment.columns = ['country', 'year','employment']
        df = pd.merge(df, employment, how='outer', sort=True, on=['country','year'])
    
        # births attended by skilled staff (%)
        births = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value'])
        births.columns = ['country', 'year','births']
        df = pd.merge(df, births, how='outer', sort=True, on=['country','year'])
    
        # measles immunization (%)
        measles = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value'])
        measles.columns = ['country', 'year','measles']
        df = pd.merge(df, measles, how='outer', sort=True, on=['country','year'])
    
        df[0:50]    
    

    The resulting table looks kind of strange, as it contains incorrect values, like numbers in the country column and text in the year column. You can manually remove these errors from the data frame. Also, we can now create a multi-index with country and year.

    
    df2=df.drop(df.index[0:40])
    
    df2 = df2.set_index(['country','year'])
    
    df2[0:10]     
    
    life population poverty school employment births measles
    country year
    Afghanistan 1980 NaN NaN NaN NaN NaN NaN 11.0
    1982 NaN NaN NaN NaN NaN NaN 8.0
    1983 NaN NaN NaN NaN NaN NaN 9.0
    1984 NaN NaN NaN NaN NaN NaN 14.0
    1985 NaN NaN NaN NaN NaN NaN 14.0
    1986 NaN NaN NaN NaN NaN NaN 14.0
    1987 NaN NaN NaN NaN NaN NaN 31.0
    1988 NaN NaN NaN NaN NaN NaN 34.0
    1989 NaN NaN NaN NaN NaN NaN 22.0
    1990 NaN NaN NaN NaN NaN NaN 20.0

    If you are curious about other variables, you can keep adding data sets from Data Science Experience to this data frame. Be aware that not all data is equally formatted and might need some clean-up before you add it. Use the code samples you just read about, and make sure you keep checking results with a quick look at each of your tables when you load or change them with commands like df2[0:10].

    Check the data

    You can run a first check of the data with describe(), which calculates some basic statistics for each of the columns in the dataframe. It gives you the number of values (count), the mean, the standard deviation (std), the min and max, and some percentiles.

    df2.describe()    
    
    life population poverty school employment births measles
    count 11969.000000 1.309100e+04 651.000000 5078.000000 2909.000000 1523.000000 6944.000000
    mean 63.156417 1.409922e+08 30.763209 78.018509 14337.147966 83.944882 76.452661
    std 11.290103 5.450133e+08 17.349350 25.675860 57236.797036 23.885349 22.153693
    min 19.504927 4.279000e+03 1.700000 1.522030 0.663000 5.000000 1.000000
    25% 54.884268 8.189045e+05 17.245109 60.831905 954.300000 73.650000 65.000000
    50% 66.171191 5.366554e+06 26.900000 88.120480 3256.500000 98.000000 84.000000
    75% 71.691415 2.574550e+07 43.700000 97.417360 9463.000000 99.700000 94.000000
    max 83.480488 7.124544e+09 96.000000 193.263340 737400.000000 100.000000 99.000000

    Data analysis

    At this point, we have enough sample data to work with. Let's start by finding the correlation between different variables. First we'll create a scatter plot, and relate the values for two variables of each row. In our code, we also customize the look by defining the font and figure size and colors of the points with matplotlib.

    import matplotlib.pyplot as plt
    %matplotlib inline
    
    plt.rcParams['font.size']=11
    plt.rcParams['figure.figsize']=[8.0, 3.5]
    fig, axes=plt.subplots(nrows=1, ncols=2)
    df2.plot(kind='scatter', x='life', y='population', ax=axes[0], color='Blue');
    df2.plot(kind='scatter', x='life', y='school', ax=axes[1], color='Red');
    plt.tight_layout()    
    

    The figure on the left shows that increased life expectancy leads to higher population. The figure on the right shows that the life expectancy increases with the percentage of school completion. But the percentage ranges from 0 to 200, which is odd for a percentage. You can remove the outliers by keeping the values within a specified range df2[df2.school>100]=float('NaN').

    Even better, would be to check where these values in the original data came from. In some cases, a range like this could indicate an error in your code somewhere. In this case, the values are correct, see the description of the school completion data.

    We don't have data for all the exact same years. So we'll group by country (be aware that we lose some information by doing so). Also because variables are percentages, we'll convert our employment figures to percent. Probably, we no longer need the population column, so let's drop it. Then we create scatter plots from the data frame using scatter_matrix, which creates plots for all variables and also adds a histogram for each.

    from pandas.tools.plotting import scatter_matrix
    
    # group by country
    grouped = df2.groupby(level=0)
    dfgroup = grouped.mean()
    
    # employment in % of total population
    dfgroup['employment']=(dfgroup['employment']*1000.)/dfgroup['population']*100
    dfgroup=dfgroup.drop('population',1)
    
    scatter_matrix(dfgroup,figsize=(12, 12), diagonal='kde')    
    

    You can see that the data is now in a pretty good state. There are no large outliers. We can even start to see some relationships: life expectancy increases with schooling, employment, safe births, and measles vaccination. You are deriving insights from the data and can now build a statistical model--for instance, have a look at an ordinary least squares regression (OLS) from StatsModels.

    Summary

    In this tutorial, you learned how to use open data from Data Science Experience in a Python notebook. You saw how to load, clean and explore data using pandas. As you can see from this example, data analysis entails lots of trial and error. This experimentation can be challenging, but is also a lot of fun!

Margriet Groenendijk

Margriet is Developer Advocate at IBM. She is all about data from storing, cleaning, munging, analysing to visualising. She uses a range of tools for this, such as Cloudant, dashDB, Spark and Python.

Bristol, UK

Subscribe to IBM Data Science Experience Blog

Get the latest posts delivered right to your inbox.

or subscribe via RSS with Feedly!