Creating Twitter Wordclouds in R · Michael Harper (2024)

I recently finished my PhD, and my supervisor, Patrick James, always described me as a “data monster” in reference to how much I enjoyed playing with data. He was a massive influence throughout my PhD, so I felt it was only appropriate to get him a data-related gift when I finished. To that end, I made him a wordcloud of his entire tweet history!

This blog post explains how we can interact with Twitter data in R using the rtweet (Kearney 2018) package, and convert this raw data into pretty visualisations using the wordcloud2 (Lang 2018) package. Hopefully it is of use to others who may want to replicate the analysis themselves.

There are three key stages to the process of making the wordcloud:

  1. Access the data from Twitter: this is done via the rtweet (Kearney 2018) package.
  2. Clean and extract the word data: removing all additional characters, hyperlinks, etc.
  3. Format the wordcloud: we need to stylise the appearance of the wordcloud.

The packages used in the analysis are listed as follows:

library(rtweet)     # Used for extracting the tweets
library(tm)         # Text mining cleaning
library(stringr)    # Removing characters
library(qdapRegex)  # Removing URLs
library(wordcloud2) # Creating the wordcloud

Extracting Tweets

The Twitter API makes it very easy to download the tweet history for a user, and the rtweet (Kearney 2018) package provides an interface to this API from R. You will need to sign up for a developer account to be able to access the API. From my experience, the process was not overly difficult, but there was almost a three-week wait before my application was approved. Once you have an account, you will need to authenticate it with R as explained here.
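As a rough sketch of what that authentication step looks like, rtweet provides a create_token function that takes the keys from your developer app's settings page. The app name and all credential strings below are placeholders:

```r
library(rtweet)

# Placeholder credentials: copy the real values from the
# "Keys and tokens" page of your Twitter developer app.
token <- create_token(
  app             = "my_wordcloud_app",     # hypothetical app name
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)
```

Once created, the token is cached and picked up automatically by subsequent rtweet calls in the session.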

Having set up the package, the tweet history for a user can be extracted using the get_timelines function. This extracts up to 3200 recent tweets from a user and provides lots of metadata for each tweet (date, time, text, links, location etc.). This is shown below:

# scrape the tweets
tweets_pab <- get_timelines(c("pab_james"), n = 3200)

Cleaning the Data

Once the tweet history has been extracted, it must be formatted and cleaned for the plot. Firstly, the column text is collapsed into a single character vector:

# Clean the data: collapse with a space so words from
# adjacent tweets do not run together
text <- str_c(tweets_pab$text, collapse = " ")

We need to clean the text in the string. The str_remove_all function is used to remove linebreaks, hashtags and mentions. We are also not interested in keeping basic words such as “a”, “the”, “and” etc., so we can use the removeWords and stopwords functions from the tm (Feinerer and Hornik 2018) package. In addition, the qdapRegex package (Rinker 2017) is used to strip out the URLs:

# continue cleaning the text
text <- text %>%
  str_remove_all("\\n") %>%                 # remove linebreaks
  rm_twitter_url() %>%                      # remove Twitter URLs
  rm_url() %>%                              # remove other URLs
  str_remove_all("#\\S+") %>%               # remove any hashtags
  str_remove_all("@\\S+") %>%               # remove any @ mentions
  removeWords(stopwords("english")) %>%     # remove common words (a, the, it etc.)
  removeNumbers() %>%
  stripWhitespace() %>%
  removeWords(c("amp"))                     # final cleanup of other artefacts

Having cleaned the data, we can format the table. The TermDocumentMatrix function is used to construct a frequency table of the words from the text string above. This table is sorted by frequency to make it easier to inspect. A quick summary of the most common words is shown in Table 1.

# Convert the data into a summary table
textCorpus <- Corpus(VectorSource(text)) %>%
  TermDocumentMatrix() %>%
  as.matrix()

textCorpus <- sort(rowSums(textCorpus), decreasing = TRUE)
textCorpus <- data.frame(word = names(textCorpus), freq = textCorpus, row.names = NULL)
Table 1: Six most commonly used words
word         freq
energy        410
today         224
new           169
students      111
now           108
southampton   107

Building the Wordcloud

Finally, we can build the wordcloud. There are two main packages which can be used for this: wordcloud or wordcloud2. For this example, I have used the wordcloud2 package (Lang 2018), as it offers a few more functions for customising the output. Below, we use the frequency table developed above to create the wordcloud, as shown in Figure 1.

# build wordcloud
wordcloud <- wordcloud2(data = textCorpus, minRotation = 0,
                        maxRotation = 0, ellipticity = 0.6)
wordcloud

Figure 1: Our tweet wordcloud

We can play around with this basic setup, and I would recommend checking out the package documentation to see some of the things that can be done. For example, we can provide our own image as a mask to customise the shape of the wordcloud.
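As a hedged sketch of the mask idea, wordcloud2 accepts a figPath argument pointing at an image file whose dark pixels define the shape the words fill. The file name here is hypothetical; you would supply your own silhouette image:

```r
# Hypothetical example: "mask.png" is a black-on-white silhouette
# in the working directory. Words are packed into the dark region.
wordcloud2(data = textCorpus, figPath = "mask.png", size = 1.5)
```

Note that figure-masked wordclouds render in the HTML viewer, so they may need a browser refresh to display inside RStudio.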

If we want to create wordclouds for multiple users, we can wrap the above code up into a function. Below is the TweetsToWordcloud function:

TweetsToWordcloud <- function(username){
  tweets <- get_timelines(username, n = 3200)

  # Clean the data
  text <- str_c(tweets$text, collapse = " ") %>%
    str_remove_all("\\n") %>%                 # remove linebreaks
    rm_twitter_url() %>%                      # remove Twitter URLs
    rm_url() %>%                              # remove other URLs
    str_remove_all("#\\S+") %>%               # remove any hashtags
    str_remove_all("@\\S+") %>%               # remove any @ mentions
    removeWords(stopwords("english")) %>%     # remove common words (a, the, it etc.)
    removeNumbers() %>%
    stripWhitespace() %>%
    removeWords(c("amp"))                     # final cleanup of other artefacts

  # Convert the data into a summary table
  textCorpus <- Corpus(VectorSource(text)) %>%
    TermDocumentMatrix() %>%
    as.matrix()
  textCorpus <- sort(rowSums(textCorpus), decreasing = TRUE)
  textCorpus <- data.frame(word = names(textCorpus), freq = textCorpus, row.names = NULL)

  wordcloud <- wordcloud2(data = textCorpus, minRotation = 0,
                          maxRotation = 0, ellipticity = 0.6)
  return(wordcloud)
}

We can then use this function on another of my academic supervisors:

TweetsToWordcloud(username = "dataknut")

Figure 2: A wordcloud using the TweetsToWordcloud function

This post highlights how we can extract tweets from Twitter and use them to build data visualisations like wordclouds. I certainly feel there is a lot more that can be done with this data, so keep an eye out for more posts on this in the future!

Feinerer, Ingo, and Kurt Hornik. 2018. Tm: Text Mining Package. https://CRAN.R-project.org/package=tm.

Kearney, Michael W. 2018. Rtweet: Collecting Twitter Data. https://CRAN.R-project.org/package=rtweet.

Lang, Dawei. 2018. Wordcloud2: Create Word Cloud by htmlWidget. https://github.com/lchiffon/wordcloud2.

Rinker, Tyler. 2017. QdapRegex: Regular Expression Removal, Extraction, and Replacement Tools. https://CRAN.R-project.org/package=qdapRegex.
