Photo credit: The New York Times.
The US Presidential Election has undoubtedly been one of the topics of greatest attention in the past year. Social media has disrupted traditional political campaigning strategies and allowed for better understanding of people's views towards political issues and candidates. Paradoxically, despite the large amount of available data, the election results went against most polls, analyses, and predictions.
In this study, we use social media data (specifically Twitter) to provide insights on each candidate's popularity, tweeting patterns, and most common topics. Additionally, we attempt to model and predict the success of a new candidate's tweet.
We use a public Twitter dataset containing about 6,000 tweets from the candidates' official Twitter accounts, @realDonaldTrump and @HillaryClinton (roughly 3,000 tweets each). Each tweet includes its text, date, number of retweets, number of times it was marked as favorite, and some other metadata.
The dataset can be downloaded from the Kaggle website:
The code was written in R using the IBM Watson Data Platform. You can sign up for free at
http://datascience.ibm.com/. A Jupyter Notebook containing the source code is available here:
Exploring the tweets
Donald Trump's most frequently used words are as follows:
Note how the most common words in Trump's tweets (i.e., great, will, thank) have very positive meanings, which is ideal for political campaigning.
Interestingly, Hillary Clinton's most frequently used word on Twitter is trump:
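These word frequencies can be computed with the tm package. Below is a minimal sketch; the sample tweets and the variable name `tweets` are illustrative, and in practice the vector would hold the ~3,000 tweets of one candidate:

```r
library(tm)

# Illustrative input; in practice, the column of tweet texts for one candidate
tweets <- c("Thank you America! Together we will make America great again",
            "Great rally tonight. Thank you!")

corpus <- VCorpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Raw term frequencies across the whole corpus
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)
```

The word cloud itself can then be drawn from `freq`, e.g. with `wordcloud::wordcloud(names(freq), freq)`.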
Most viral Trump tweets

| Tweet text |
|---|
| How long did it take your staff of 823 people to think that up--and where are your 33,000 emails that you deleted? |
| The media is spending more time doing a forensic analysis of Melania's speech than the FBI spent on Hillary's emails. |
| Happy #CincoDeMayo! The best taco bowls are made in Trump Tower Grill. I love Hispanics! |
Most viral Clinton tweets

| Tweet text | Retweets | Favorites |
|---|---|---|
| Delete your account | 490,180 | 660,384 |
| "I never said that." —Donald Trump, who said that. | 91,670 | 134,808 |
| Great speech. She's tested. She's ready. She never quits. That's why Hillary should be our next @POTUS. (She'll get the Twitter handle, too) | | |
Basic statistics on tweets
Note that Trump's tweets are retweeted about twice as often as Clinton's. However, Trump's tweets span December 2015 through September 2016, whereas Clinton's span April 2016 through September 2016, meaning that Clinton tweeted more frequently than Trump over the period covered.
Note that after the Democratic and Republican nominations, both candidates received a surge of attention. After the first debate (arguably won by Clinton), however, Clinton gained huge attention while Trump did not.
Modeling tweet success
After exploring the dataset, one interesting problem that comes to mind is modeling and predicting how successful a new tweet would be.
The first step is to define success. In our case, we consider a successful tweet to be one that is retweeted many times. This immediately suggests a regression problem where the goal is to predict the number of retweets.
The second step is to engineer features that capture the meaning of the text. A common and simple approach is TF/IDF (Term Frequency/Inverse Document Frequency). R's tm (text mining) package offers built-in capabilities for extracting TF/IDF features. The resulting matrix has as many columns as terms (i.e., words) in the corpus and as many rows as tweets.
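The TF/IDF extraction can be sketched as follows with tm; the sample tweets are illustrative placeholders for the real dataset:

```r
library(tm)

# Illustrative tweet texts; in practice, the ~3,000 tweets of one candidate
tweets <- c("I will make America great again",
            "The best taco bowls are made in Trump Tower Grill")

corpus <- VCorpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# One row per tweet, one column per term, with TF/IDF weighting
dtm <- DocumentTermMatrix(corpus,
                          control = list(weighting = weightTfIdf))
features <- as.matrix(dtm)
dim(features)  # rows = tweets, columns = terms
```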
The third step is to build a regression model to predict the number of retweets. We chose the Multivariate Adaptive Regression Splines (MARS) algorithm since it can model non-linear relationships in the data (which may be the case for TF/IDF features) and works well with high-dimensional data.
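In R, MARS is provided by the earth package. A sketch of the model-building step, assuming `features` is the TF/IDF matrix from the previous step and `retweets` holds the observed retweet count of each tweet (both names illustrative):

```r
library(earth)

# Fit a MARS regression of retweet counts on the TF/IDF features.
# `features` and `retweets` are assumed to come from the feature
# engineering step above.
fit <- earth(x = features, y = retweets)
summary(fit)  # lists the selected terms and the training R-squared

# Predicting the success of a new tweet: transform its text with the
# same vocabulary, then call predict() on the resulting feature vector
pred <- predict(fit, new_features)
```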
A model was trained on Trump's tweets (the reader can use a similar approach for Clinton's tweets). After training, we tested it on 11 tweets covering various topics.
Tweet number 1 was part of the training set and was in fact Donald Trump's most retweeted tweet (167,274 retweets). The predicted number of retweets was very close (161,918), which gives us a good upper bound on the model's quality. The R-squared was around 0.75 during training.
Tweet number 2 is random text completely unrelated to Trump's vocabulary. As expected, it received a much smaller predicted number of retweets (4,517).
Tweet number 3 is related to Trump's ideology and obtains almost 20K predicted retweets. Notice that the number of times a word appears in the text is irrelevant, as shown by tweet number 4.
| # | Tweet text | Predicted retweets |
|---|---|---|
| 1 | How long did it take your staff of 823 people to think that up--and where are your 33,000 emails that you deleted? | 161,918 |
| 2 | I love crocodiles | 4,517 |
| 3 | I will defeat Isis | 19,886 |
| 4 | I will defeat Isis I will defeat Isis I will defeat Isis I will defeat Isis I will defeat Isis | 19,886 |
| 5 | will take trump realdonaldtrump hillari | 3,323 |
| 6 | I will make America great again | 23,866 |
| 7 | I like Hillary Clinton | 14,138 |
| 8 | I hate Hillary Clinton | 14,138 |
| 9 | Climate change is a hoax, and a very expensive one! | 40,023 |
| 10 | Climate change is real. We must act to save the planet. | 2,521 |
| 11 | College education should be free for everyone! | 4,517 |
Tweet number 5 is a random permutation of Trump's most common words (see the word cloud above). It is predicted to be quite unsuccessful (~3K retweets). This is because TF/IDF not only considers term frequency but also normalizes feature values by document frequency.
Tweet number 6 (Trump's slogan) is predicted to be successful, as is tweet number 9, since Trump tweeted a similar idea using slightly different words.
Tweets 10 and 11 are predicted to do as badly as or even worse than "I love crocodiles", which is not surprising since they go against Trump's political views.
Interestingly, tweets 7 and 8 have exactly the same predicted success. This may be because the frequencies of the words "like" and "hate" do not correlate with tweet success. The interested reader may use sentiment analysis (R has the qdap package) to enhance the model in this regard.
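As a sketch of that idea, qdap's polarity score could be computed per tweet and appended as an extra feature; how to fold it into the MARS model is left to the reader:

```r
library(qdap)

# Polarity distinguishes the two tweets that TF/IDF treats identically:
# "like" carries positive sentiment and "hate" negative sentiment
pol <- polarity(c("I like Hillary Clinton", "I hate Hillary Clinton"))
pol$all$polarity  # per-tweet polarity scores, usable as a model feature
```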
We have shown how the IBM Watson Data Platform can be leveraged to explore, visualize, and model data using the R language.
Simple statistics such as the number of retweets/favorites already favored Trump, which is consistent with the election results.
Despite the small size of both the dataset and the vocabulary (i.e., 3K tweets and 6K words), the predicted Twitter success was shown to align with the candidate's political views. We believe a larger dataset would improve accuracy and broaden the range of topics the model can handle.
The model was not able to differentiate negative vs. positive views towards a particular topic. We believe this can be improved by incorporating additional features from the sentiment analysis domain.
The MARS algorithm selects a smaller subset of the features. This means that some words were discarded, reducing the size of the vocabulary the model can handle. Using all the features would be ideal, but it raises scalability issues. In fact, algorithms such as Linear Regression and Support Vector Machines failed when run on vanilla R. We suggest using SparkR to build a regression model with all features (i.e., words).
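That suggestion could be sketched with SparkR's generalized linear model, assuming a Spark installation is available and that `features` and `retweets` come from the feature engineering step above (names illustrative):

```r
library(SparkR)
sparkR.session()

# Combine the response and the full TF/IDF feature matrix
# into a distributed Spark DataFrame
local_df <- cbind(data.frame(retweets = retweets),
                  as.data.frame(features))
df <- createDataFrame(local_df)

# Linear regression over all terms, with the computation
# distributed by Spark instead of running in a single R process
model <- spark.glm(df, retweets ~ ., family = "gaussian")
summary(model)
```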