IBM Data Science Experience

Trump and Clinton may have used some Machine Learning

Photo credit: The New York Times.

The US Presidential Election has undoubtedly been one of the topics of greatest attention in the past year. Social media has disrupted traditional political campaigning strategies and allowed for better understanding of people's views towards political issues and candidates. Paradoxically, despite the large amount of available data, the election results went against most polls, analyses, and predictions.

In this study, we use social media data (specifically Twitter) to provide insights on each candidate's popularity, tweeting patterns, and most common topics. Additionally, we attempt to model and predict the success of a new candidate's tweet.

The dataset


We use a public Twitter dataset containing a total of about 6,000 tweets from the candidates' official Twitter accounts, @realDonaldTrump and @HillaryClinton (roughly 3,000 tweets each). Each record contains the tweet's text, date, retweet count, and favorite count, along with some other metadata.

The dataset can be downloaded from the Kaggle website:
https://www.kaggle.com/benhamner/clinton-trump-tweets

The code


The code was written in R using the IBM Watson Data Platform. You can sign up for free at http://datascience.ibm.com/. A Jupyter Notebook containing the source code is available here:

https://github.com/IBMDataScience/election2016

Exploring the tweets


Donald Trump's most frequently used words are as follows:

Note how the most common words in Trump's tweets (e.g., great, will, thank) carry very positive connotations, which is ideal for political campaigning.

Interestingly, Hillary Clinton's most frequently used word on Twitter is trump:
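These word clouds come from basic term counting over each account's tweets after removing stopwords; the notebook does this in R with the tm package. As a language-neutral illustration, here is a minimal Python sketch over a toy corpus with a made-up stopword list (both are hypothetical, not from the dataset):

```python
import re
from collections import Counter

def top_terms(tweets, stopwords, n=3):
    """Tokenize the tweets, drop stopwords, and return the n most common terms."""
    counts = Counter()
    for text in tweets:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

# Toy corpus; the real analysis runs over ~3,000 tweets per candidate.
tweets = [
    "Thank you America! Great rally tonight.",
    "We will make America great again!",
    "Great crowd, great energy. Thank you!",
]
stopwords = {"a", "the", "you", "we", "and"}
print(top_terms(tweets, stopwords))
```

The same counts, restricted to the most frequent terms, are what a word cloud renders with font size proportional to frequency.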

Trump's most viral tweets

1. "How long did it take your staff of 823 people to think that up--and where are your 33,000 emails that you deleted?" (167,274 retweets, 294,162 likes)
2. "The media is spending more time doing a forensic analysis of Melania's speech than the FBI spent on Hillary's emails." (120,817 retweets, 247,883 likes)
3. "Happy #CincoDeMayo! The best taco bowls are made in Trump Tower Grill. I love Hispanics!" (82,653 retweets, 115,107 likes)

Clinton's most viral tweets

1. "Delete your account" (490,180 retweets, 660,384 likes)
2. ""I never said that." —Donald Trump, who said that." (91,670 retweets, 134,808 likes)
3. "Great speech. She's tested. She's ready. She never quits. That's why Hillary should be our next @POTUS. (She'll get the Twitter handle, too)" (63,628 retweets, 190,992 likes)

Basic statistics on tweets

Note that Trump's tweets are retweeted about twice as often as Clinton's. However, Trump's tweets span December 2015 to September 2016, whereas Clinton's span April 2016 to September 2016, meaning that Clinton tweets more frequently than Trump.
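These comparisons reduce to simple per-account aggregates: mean retweets, and tweet count divided by the length of each account's active span. A rough sketch of the computation in Python, over hypothetical records (the handles match the dataset, but the retweet counts and dates here are illustrative):

```python
from datetime import date

# Hypothetical per-tweet records; the real dataset has ~3,000 rows per candidate.
tweets = [
    {"handle": "realDonaldTrump", "retweets": 12000, "date": date(2015, 12, 1)},
    {"handle": "realDonaldTrump", "retweets": 8000,  "date": date(2016, 9, 1)},
    {"handle": "HillaryClinton",  "retweets": 5000,  "date": date(2016, 4, 1)},
    {"handle": "HillaryClinton",  "retweets": 5000,  "date": date(2016, 9, 1)},
]

def stats(rows):
    """Mean retweets and tweets-per-day over each account's active span."""
    out = {}
    for handle in {r["handle"] for r in rows}:
        sub = [r for r in rows if r["handle"] == handle]
        days = (max(r["date"] for r in sub) - min(r["date"] for r in sub)).days
        out[handle] = {
            "mean_retweets": sum(r["retweets"] for r in sub) / len(sub),
            "tweets_per_day": len(sub) / max(days, 1),
        }
    return out

print(stats(tweets))
```

Normalizing by the active span is what makes the comparison fair: a higher raw tweet count over a longer window can still mean a lower tweeting frequency.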

Note that after the Democratic and Republican nominations, both candidates received a surge of attention. However, after the first debate (arguably won by Clinton), Clinton's attention spiked in a way Trump's did not.

Modeling tweet success


After exploring the dataset, one interesting problem that comes to mind is to model and predict how successful a new tweet would be.

The first step is to define success. In our case, we assume a successful tweet is one that is retweeted many times. This naturally frames a regression problem where the goal is to predict the number of retweets.

The second step is to engineer features that describe the meaning of the text. A common and simple approach is TF-IDF (Term Frequency/Inverse Document Frequency) weighting. R's tm (text mining) package offers built-in capabilities to extract TF-IDF features. The resulting matrix has as many columns as terms (i.e., words) in the corpus and as many rows as tweets.
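The notebook computes this with tm; as a language-neutral illustration, here is a minimal TF-IDF computation in Python using one common weighting variant (tm's weightTfIdf differs in details such as the logarithm base), over a tiny hypothetical corpus:

```python
import math
import re

def tfidf(docs):
    """Build a TF-IDF matrix: one row per document, one column per term."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted(set(t for doc in tokenized for t in doc))
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    matrix = []
    for doc in tokenized:
        row = []
        for t in vocab:
            tf = doc.count(t) / len(doc)       # term frequency within the doc
            idf = math.log(n / df[t])          # penalize terms common across docs
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix

docs = ["make america great again", "great speech", "delete your account"]
vocab, X = tfidf(docs)
```

Note the IDF factor: a term appearing in every document gets weight zero regardless of how often it is repeated, which matters later when we discuss tweets built from very common words.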

The third step is to build a regression model to predict the number of retweets. We chose the Multivariate Adaptive Regression Splines (MARS) algorithm since it can model non-linear relationships in the data (which may be present in TF-IDF features) and works well with high-dimensional data.
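MARS builds piecewise-linear models from hinge basis functions max(0, x − c) and max(0, c − x); the full algorithm also searches for the knot locations c and prunes terms (R's earth package implements it). The sketch below, in Python with synthetic data, only shows the core idea: a fixed hinge basis fitted by ordinary least squares can capture a non-linear relationship exactly.

```python
import numpy as np

def hinge_features(x, knots):
    """MARS-style basis: intercept plus hinge pairs max(0, x-c) and max(0, c-x)."""
    cols = [np.ones_like(x)]
    for c in knots:
        cols.append(np.maximum(0.0, x - c))
        cols.append(np.maximum(0.0, c - x))
    return np.column_stack(cols)

# A piecewise-linear target: flat up to x = 5, then rising with slope 2.
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * np.maximum(0.0, x - 5.0)

X = hinge_features(x, knots=[5.0])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef
```

A plain linear fit cannot represent this kink; the hinge at the knot gives the model a separate slope on each side, which is the flexibility MARS brings to high-dimensional TF-IDF inputs.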

Prediction results


A model was trained on Trump's tweets (the reader can use a similar approach for Clinton's tweets). After training, we tested the model on 11 tweets covering various topics.

Tweet 1 was part of the training set and was in fact Donald Trump's most retweeted tweet (167,274 retweets). The predicted number of retweets was very close (161,918), which gives us a good upper bound on the model's quality. R-squared was around 0.75 during training.

Tweet 2 is a random text completely unrelated to Trump's vocabulary. As expected, it received a much smaller predicted number of retweets (4,517).

Tweet 3 is related to Trump's ideology and therefore obtains almost 20K predicted retweets. Note that the number of times a word is repeated within the text is irrelevant, as tweet 4 shows.

Tweet text (predicted retweets):

1. "How long did it take your staff of 823 people to think that up--and where are your 33,000 emails that you deleted?" (161,918)
2. "I love crocodiles" (4,517)
3. "I will defeat Isis" (19,886)
4. "I will defeat Isis I will defeat Isis I will defeat Isis I will defeat Isis I will defeat Isis" (19,886)
5. "will take trump realdonaldtrump hillari" (3,323)
6. "I will make America great again" (23,866)
7. "I like Hillary Clinton" (14,138)
8. "I hate Hillary Clinton" (14,138)
9. "Climate change is a hoax, and a very expensive one!" (40,023)
10. "Climate change is real. We must act to save the planet." (2,521)
11. "College education should be free for everyone!" (4,517)

Tweet 5 is a random permutation of Trump's most common words (see the word cloud above). It is predicted to be very unsuccessful (~3K retweets). This is because TF-IDF not only considers term frequency but also normalizes feature values by document frequency.

Tweet 6 (Trump's slogan) is predicted to be successful, as is tweet 9, given that Trump tweeted a similar idea using slightly different words.

Tweets 10 and 11 are predicted to do as badly as or worse than "I love crocodiles", which is unsurprising since they run counter to Trump's political views.

Interestingly, tweets 7 and 8 have exactly the same predicted success. This may be because the frequencies of the words "like" and "hate" do not correlate with tweet success. The interested reader may use sentiment analysis (e.g., R's qdap package) to enhance the model in this regard.
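One simple way to add such a signal is a lexicon-based polarity score appended as an extra column of the feature matrix. A toy sketch in Python with a hypothetical mini-lexicon (real work would use an established lexicon or a package such as qdap):

```python
# Hypothetical mini-lexicon; a real model would use an established one.
POSITIVE = {"love", "like", "great", "best"}
NEGATIVE = {"hate", "worst", "bad"}

def sentiment_score(text):
    """Positive-minus-negative lexicon hits, usable as an extra feature column."""
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(sentiment_score("I like Hillary Clinton"))
print(sentiment_score("I hate Hillary Clinton"))
```

Unlike raw term frequencies, this feature assigns tweets 7 and 8 different values, giving the regression model a chance to separate positive from negative mentions of the same topic.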

Lessons learned


We have shown how the IBM Watson Data Platform can be leveraged to explore, visualize, and model data using the R language.

Simple statistics such as retweet and favorite counts already favored Trump, which is consistent with the election results.

Despite the small size of both the dataset and the vocabulary (i.e., 3K tweets and 6K words), the predictions of Twitter success were shown to align with the candidate's political views. We believe a larger dataset would improve accuracy and broaden the topics the model can handle.

The model was not able to differentiate negative vs. positive views towards a particular topic. We believe this can be improved by incorporating additional features from the sentiment analysis domain.

The MARS algorithm selects a smaller subset of the features. This means that some words were discarded, reducing the size of the vocabulary the model can handle. Using all the features would be ideal, but it raises scalability issues; in fact, algorithms such as Linear Regression and Support Vector Machines failed to complete when run in vanilla R. We suggest using SparkR to build a regression model with all features (i.e., words).

Oscar D. Lara-Yejas

Oscar D. Lara Yejas is a Data Scientist with IBM Watson Data Platform. He specializes in Machine Learning, Big Data, Human-centric Sensing, and Software Engineering.

San Jose, CA
