Generating a word cloud from tweets containing a particular hashtag (like #SpaceX)
Vijay Vankayalapati
A word cloud is a popular tool for visualising textual data. Words are sized according to their frequency of occurrence in a corpus and arranged in random order, giving a quick-and-dirty view of what a text corpus contains.
Twitter is a micro-blogging site known for quick updates on what’s happening in real time. A word cloud can give an overview of the words mentioned alongside specific hashtag(s) or keyword(s) on Twitter. One quick use case: brands can track what their customers are saying about them.
In this post, we shall make a word cloud from tweets containing the hashtag #SpaceX. You can use any hashtags or keywords.
Import the required libraries:
import tweepy
import pandas as pd
import re   #used later for cleaning the text
import time #used in the scraping function
Tweepy is an easy-to-use Python library for accessing the Twitter API.
Using Tweepy we shall scrape tweets from Twitter. To use Tweepy we need Twitter developer credentials. Apply here for a developer account; you will be asked about your intended use, and approval may take a couple of days. Once you are approved, set up a development environment in the dashboard and go to the “Keys and Tokens” tab of your app to retrieve your developer credentials: Consumer API Key, Consumer API Secret Key, Access Token and Access Token Secret.
Now that you have dev credentials, you need to request authorisation from Twitter to use their data. The following snippet of code does that job:
consumer_key = "Enter your consumer key" #Enter your key as string
consumer_secret = "Enter your consumer key secret"
access_token = "Enter your access token"
access_token_secret = "Enter your access token secret"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
We will use a keyword_to_csv() method to scrape tweets matching the desired keyword(s) and save them to a csv file. First we create a query using the tweepy.Cursor() method with our keyword as a parameter. Next we pull the text from the iterable object ‘tweets’ into tweets_list, then build a pandas dataframe from tweets_list. When this method is called, a csv file of the scraped tweets is created.
def keyword_to_csv(keyword, recent):
    try:
        tweets = tweepy.Cursor(api.search, q=keyword).items(recent) #creates query method
        tweets_list = [[tweet.text] for tweet in tweets] #pulls text information from tweets
        df = pd.DataFrame(tweets_list, columns=['Text']) #creates a pandas dataframe
        df.to_csv('{}.csv'.format(keyword), sep=',', index=False) #creates a csv from the dataframe
    except BaseException as e:
        print('failed on_status,', str(e))
        time.sleep(3)
We are now interested in scraping the 3,000 most recent tweets with hashtag #SpaceX, excluding retweets.
keyword = 'SpaceX'+ " -filter:retweets" #excludes retweets
recent = 3000
keyword_to_csv(keyword, recent)
A csv file (SpaceX -filter:retweets.csv) with the scraped tweets will be saved to your path.
We need to clean our data before we make a word cloud from our tweets. First load the generated csv file into a pandas dataframe.
df = pd.read_csv("./SpaceX -filter:retweets.csv") #loads csv file into pandas dataframe
pd.options.display.max_colwidth = 200
df.head() #prints out the first few rows of the dataframe
df.shape #prints the shape of the dataframe
Then we will clean the data loaded into pandas dataframe in four steps.
Step-1: Removing emojis and symbols
Emoticons, symbols etc. will be removed using regular expressions in Python.
a = df.loc[1272].to_string() #loads the row from dataframe
print(a)
The following pattern will remove most emoticons and symbols in the text.
regex_pattern = re.compile(pattern = "["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
"]+", flags = re.UNICODE)

match = re.sub(regex_pattern,'',a) #replaces pattern with ''
print(match)
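To see what the pattern does, here is a self-contained sketch that applies it to a made-up tweet (the sample string is invented for illustration, not from the scraped data):

```python
import re

# The emoji/symbol pattern from above
regex_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)

sample = "Falcon 9 lifts off \U0001F680\U0001F525 amazing!"  # made-up tweet
print(re.sub(regex_pattern, '', sample))  # → "Falcon 9 lifts off  amazing!"
```

Note that consecutive emoji are removed as one run (because of the trailing `+`), so only the surrounding whitespace is left behind.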
Step-2: Removing URLs
a = df.loc[0].to_string()
print(a)
The below block of code removes URLs from the text.
pattern = re.compile(r'(https?://)?(www\.)?(\w+\.)?(\w+)(\.\w+)(/.+)?')
match = re.sub(pattern,'',a)
print(match)
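As a quick sanity check, the same pattern can be applied to an invented example string. Be aware that the trailing (/.+)? group is greedy, so it consumes everything from the path onwards:

```python
import re

# The URL pattern from above
pattern = re.compile(r'(https?://)?(www\.)?(\w+\.)?(\w+)(\.\w+)(/.+)?')

sample = "Launch update https://t.co/abc123"  # made-up tweet
print(re.sub(pattern, '', sample))  # → "Launch update "
```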
Step-3: Removing @mentions and hash symbols
a = df.loc[3].to_string()
print(a)
The following block removes @mentions and hashes from the text.
re_list = ['@[A-Za-z0-9_]+', '#']
combined_re = re.compile('|'.join(re_list))
match = re.sub(combined_re,'',a)
print(match)
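Here is a minimal, self-contained demonstration of that combined pattern on a made-up tweet:

```python
import re

# Combined pattern: @mentions or the '#' symbol
re_list = ['@[A-Za-z0-9_]+', '#']
combined_re = re.compile('|'.join(re_list))

sample = "@elonmusk just confirmed the #SpaceX launch"  # made-up tweet
print(re.sub(combined_re, '', sample))  # → " just confirmed the SpaceX launch"
```

Notice that the hashtag text itself (SpaceX) is kept; only the '#' symbol is stripped, so the word still counts toward the cloud.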
Step-4: HTML to text
When HTML encoding is not converted to text, we will see entities like &amp;, &quot; etc. in the text. To convert them back to plain text, we will use BeautifulSoup.
from bs4 import BeautifulSoup

a = df.loc[27].to_string()
print(a)
The block below converts the HTML-encoded characters in the text back to plain text.
del_amp = BeautifulSoup(a, 'lxml')
del_amp_text = del_amp.get_text()
print(del_amp_text)
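If you only need to decode HTML entities (rather than also strip tags), the standard library’s html.unescape does the same job without an extra dependency; a small sketch with an invented sample string:

```python
import html

sample = "SpaceX &amp; NASA announce &quot;Starship&quot; update"  # made-up tweet
print(html.unescape(sample))  # → 'SpaceX & NASA announce "Starship" update'
```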
Now we will aggregate all these steps into a function cleaning_tweets(). Additionally we will tokenise the words to remove white spaces. Also all the words will be converted to lowercase to avoid same words like ‘Falcon’ & ‘falcon’ appearing in the wordcloud. We will also consider only those words which have at least three characters.
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer

token = WordPunctTokenizer()

def cleaning_tweets(t):
    del_amp = BeautifulSoup(t, 'lxml')
    del_amp_text = del_amp.get_text() #converts html encoding to text
    del_url = re.sub(pattern, '', del_amp_text) #removes urls
    del_link_mentions = re.sub(combined_re, '', del_url) #removes @mentions and '#'
    del_emoticons = re.sub(regex_pattern, '', del_link_mentions) #removes emoticons and symbols
    lower_case = del_emoticons.lower()
    words = token.tokenize(lower_case)
    result_words = [x for x in words if len(x) > 2] #keeps words with at least three characters
    return (" ".join(result_words)).strip()
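For readers without bs4 or nltk installed, here is a dependency-light sketch of the same pipeline using only the standard library. The sample tweet, the html.unescape stand-in for BeautifulSoup, and the simplified \w+ tokenisation are illustrative assumptions, not the author’s exact code:

```python
import html
import re

# Emoji and mention/hash patterns, repeated here so the sketch runs standalone
regex_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"
                           "\U0001F300-\U0001F5FF"
                           "\U0001F680-\U0001F6FF"
                           "\U0001F1E0-\U0001F1FF"
                           "]+", flags=re.UNICODE)
combined_re = re.compile('@[A-Za-z0-9_]+|#')

def clean_tweet_lite(t):
    t = html.unescape(t)                   # decode HTML entities (stand-in for BeautifulSoup)
    t = re.sub(combined_re, '', t)         # drop @mentions and '#' symbols
    t = re.sub(regex_pattern, '', t)       # drop emoji
    words = re.findall(r'\w+', t.lower())  # crude word tokenisation
    return " ".join(w for w in words if len(w) > 2)  # keep words of 3+ characters

print(clean_tweet_lite("@NASA Falcon 9 &amp; Starship #SpaceX \U0001F680"))
# → "falcon starship spacex"
```

The single-digit “9” and the mention are dropped, while the hashtag’s text survives as a plain word, mirroring the full function above.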
Now pass the dataframe into our function and clean the text. The function will return a list of cleaned tweets. You can see progress of the process in the output window.
print("Cleaning the tweets...\n")
cleaned_tweets = []
for i in range(0,3000): #3000 rows in our dataframe
    if (i+1) % 100 == 0:
        print("Tweets {} of {} have been processed".format(i+1,3000))
    cleaned_tweets.append(cleaning_tweets(df.Text[i]))
Next we will use pandas.Series.str.cat() to concatenate the strings in the list cleaned_tweets, separated by a space.
string = pd.Series(cleaned_tweets).str.cat(sep=' ')
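A tiny illustration of str.cat() on an invented two-tweet list:

```python
import pandas as pd

cleaned_tweets = ["falcon heavy launch", "starlink batch deployed"]  # made-up cleaned tweets
string = pd.Series(cleaned_tweets).str.cat(sep=' ')
print(string)  # → "falcon heavy launch starlink batch deployed"
```

The result is one long string, which is exactly the input format WordCloud.generate() expects.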
In this last section, we will use wordcloud library of python to generate word cloud of the tweets.
Stopwords are commonly occurring words in the English language, such as “the”, “a”, “an”, “in”…, that don’t add much meaning to sentences. These words are ignored in natural language processing tasks. We can add our own stopwords as per our need. We know that words like “elonmusk”, “elon musk”, “elon”, “musk”… will be very common in tweets with hashtag #SpaceX, so we add them to the stopwords list and they won’t appear in the word cloud.
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

stopwords = set(STOPWORDS)
stopwords.update(["elonmusk","elon musk","elon","musk","spacex"]) #adding our own stopwords
In the WordCloud() function, we can pass arguments as per our requirements.
max_words: 200 is the default value; 50–100 words are recommended so that the word cloud stays clean and legible.
collocations: Default value is True. We will choose False to avoid bigrams of words.
background_color: Use any colour you wish, like 'blue', 'green', 'grey'… Black is the default value.
Finally we generate our word cloud with this block of code. You can play with the arguments in WordCloud() function to generate clouds of different colors and sizes.
wordcloud = WordCloud(width=1600, stopwords=stopwords, height=800, max_font_size=200, max_words=50, collocations=False, background_color='black').generate(string)
plt.figure(figsize=(40,30))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Looking at the above cloud, we can see which words are mentioned with #SpaceX: starlink, space, nasa, falcon9 etc. The bigger the word, the more frequently it occurs in the tweets.
We can generate custom word clouds with the additional argument mask in WordCloud().
Upload your own image to your path and the code below will do the rest. This works best with images that have white backgrounds. You can see some sample images and the corresponding clouds.
import numpy as np
from PIL import Image

mask = np.array(Image.open('./your_image.jpg'))
wordcloud = WordCloud(width=1600, mask=mask, stopwords=stopwords, height=800, max_font_size=200, max_words=50, collocations=False).generate(string)

f = plt.figure(figsize=(50,50))
f.add_subplot(1,2, 1)
plt.imshow(mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.title('Original Image', size=40)
plt.axis("off")
f.add_subplot(1,2, 2)
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Generated Word Cloud', size=40)
plt.axis("off")
plt.show()
The link for Jupyter notebook containing the code can be found here.