Topic Analysis of Twitter Profiles


This blog post is an exercise in using two different APIs to collect and analyse tweets.

To pull the tweets we will use the tweepy library, which provides a Python wrapper for the Twitter API. To use it you will have to create a free Twitter developer account and register an application in order to get the credentials.

To analyse the content of a tweet and extract media topics from it we will use the TextRazor API, which also has a free plan that includes 500 calls/day.
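In the snippets below both APIs read their credentials from environment variables. Here is a minimal sketch of a quick check that everything is in place before making any calls; the variable names are the ones used in the code in the next steps:

import os

# Credentials expected by the snippets in Steps 1 and 2
required = ['TWITTER_CONSUMER_KEY', 'TWITTER_CONSUMER_SECRET',
            'TWITTER_ACCESS_KEY', 'TWITTER_ACCESS_SECRET',
            'TEXTRAZOR_API_KEY']
missing = [name for name in required if name not in os.environ]
if missing:
    raise RuntimeError(f'Missing credentials: {missing}')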


Step 1: Pulling the tweets

Let's pull the most recent 600 tweets from BBCWorld and from basecamp_ai and compare their topics. From each tweet we will extract the text and the shared URL, and then convert the results to a pandas dataframe.

 

import os

import pandas as pd
import tweepy

# Authenticate with the credentials of the Twitter application
auth = tweepy.OAuthHandler(os.environ['TWITTER_CONSUMER_KEY'],
                           os.environ['TWITTER_CONSUMER_SECRET'])
auth.set_access_token(os.environ['TWITTER_ACCESS_KEY'],
                      os.environ['TWITTER_ACCESS_SECRET'])
api = tweepy.API(auth)

screen_name = 'basecamp_ai'
tweets_per_page = 200
num_pages = 3  # 3 pages x 200 tweets = 600 most recent tweets

tweets = {}
for res in (tweepy.Cursor(api.user_timeline,
                          id=screen_name,
                          count=tweets_per_page)
            .pages(num_pages)):

    if len(res) > 0:
        for r in res:
            tweet = {}
            tweet['id'] = r.id
            tweet['published_at'] = r.created_at
            tweet['content'] = r.text
            # Keep the first expanded url, but ignore links back to twitter.com
            if len(r.entities['urls']) > 0:
                url = r.entities['urls'][0]['expanded_url']
                if 'twitter.com' not in url:
                    tweet['shared_url'] = url
                else:
                    tweet['shared_url'] = None
            else:
                tweet['shared_url'] = None

            tweets[tweet['id']] = tweet

tweets_df = pd.DataFrame.from_dict(tweets, orient='index')

Step 2: Extracting the topics

Out of the 600 basecamp_ai tweets, 572 contain at least one shared URL. This is great for topic analysis, since the tweet text itself is too short to extract topics reliably. Luckily you can also extract topics from URLs directly using the TextRazor API. Go to www.textrazor.com and create an account to get your API key. For each URL the API will usually return a number of media topics with corresponding scores.
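A quick way to verify that count is to look at the non-empty shared_url values. This is a minimal sketch; I assume here that basecamp_tweets is simply the tweets_df built in Step 1 for basecamp_ai, which is also the dataframe the next snippet iterates over:

# Assumption: basecamp_tweets is the Step 1 dataframe for basecamp_ai
basecamp_tweets = tweets_df
print(basecamp_tweets['shared_url'].notna().sum(), 'of',
      len(basecamp_tweets), 'tweets contain a shared url')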

 

import os

import pandas as pd
import textrazor

textrazor.api_key = os.environ['TEXTRAZOR_API_KEY']
client = textrazor.TextRazor(extractors=["topics"])
client.set_classifiers(['textrazor_mediatopics'])

# Sentinel categories for tweets where no topic can be extracted
category_names = {'-1': 'no topics discovered',
                  '-2': 'no shared urls'}

results = []

# basecamp_tweets is the dataframe built in Step 1 for basecamp_ai
for tweet in basecamp_tweets.itertuples():
    url = tweet.shared_url
    labels = []
    if url:
        response = client.analyze_url(url)

        for c in response.categories():
            labels.append((c.category_id, c.score))
            if c.category_id not in category_names:
                category_names[c.category_id] = c.label

        if len(labels) == 0:
            labels = [('-1', 0)]  # no topics discovered
    else:
        labels = [('-2', 0)]  # no shared urls

    for subject_code, score in labels:
        results.append({'tweet_id': tweet.id,
                        'content': tweet.content,
                        'shared_url': url,
                        'subject_code': subject_code,
                        'score': score,
                        'topic_name': category_names[subject_code]})

basecamp_topics = pd.DataFrame.from_records(results)

 

The topic structure is tree-like and the full list of media topics can be found on the TextRazor website. That means that for one URL we can get the topic "science and technology" with a score of 0.3465 and "science and technology>social sciences>geography" with a score of 0.4041.

Before reaching the daily limit I could extract 3964 topics from 500 URLs. The full topic labels have the form "science and technology>social sciences>psychology". From those I will create two additional columns in the dataframe with the 1st and 2nd level topic.

 

basecamp_topics['1st_level_topic'] = basecamp_topics['topic_name'].apply(lambda x: x.split('>')[0])
basecamp_topics['2nd_level_topic'] = basecamp_topics['topic_name'].apply(lambda x: '>'.join(x.split('>')[0:2]))

 

Let's see which topics are most common.

 

basecamp_topics['topic_name'].value_counts()

 

No surprises here. As expected from a data science bootcamp, most tweets are about computer sciences, software and mathematics.

 

economy, business and finance>economic sector>computing and information technology                        418
science and technology                                                                                    380
science and technology>technology and engineering>IT/computer sciences                                    366
economy, business and finance>economic sector>computing and information technology>software               319
science and technology>technology and engineering                                                         241
science and technology>mathematics                                                                        224

 

In the next step I would like to calculate the average score per 1st level topic for each tweet, and then sum those scores for each 1st level topic.

 

basecamp_topic_profile = (basecamp_topics.groupby(['tweet_id', '1st_level_topic'])
                                          .agg({'score': 'mean'})
                                          .groupby('1st_level_topic')
                                          .agg({'score': 'sum'}))
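The chart below was produced from this profile. Here is a minimal sketch of how to reproduce it, assuming matplotlib (the post does not name the plotting library that was actually used):

import matplotlib.pyplot as plt

# Horizontal bar chart of the summed scores per 1st level topic
(basecamp_topic_profile['score']
     .sort_values()
     .plot(kind='barh', color='grey', figsize=(8, 6),
           title='basecamp_ai topic profile'))
plt.xlabel('sum of scores')
plt.tight_layout()
plt.show()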

basecamp_ai topic profile (grey = sum of scores)

I was a little bit surprised that "economy, business and finance" came before "science and technology". A possible explanation could be that there are a lot of tweets mentioning data science and AI companies and startups. And maybe this 1st level topic is more easily recognised with high confidence (score) than others.

And indeed, checking the numbers revealed that "science and technology" has been detected in 440 tweets and "economy, business and finance" in 437 tweets.
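That check is just the number of distinct tweets in which each 1st level topic appears; a minimal sketch, using the basecamp_topics dataframe from Step 2:

# Number of distinct tweets per 1st level topic
tweets_per_topic = (basecamp_topics.groupby('1st_level_topic')['tweet_id']
                                   .nunique()
                                   .sort_values(ascending=False))
print(tweets_per_topic.head())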

 

Number of tweets per 1st level topic (grey = tweets)

The topics of BBCWorld show a more evenly distributed profile, with a focus on politics and the economy.

 

Topic profile of BBCWorld (grey = sum of scores)

Step 3: Your Turn

This short exercise shows how easily you can perform a task as complex as topic classification using the right API.

Try it on your own Twitter profile and share your results in the comments :-)

 
