Semantic Similarity Across Facebook Posts

 

In this post we will use modern Natural Language Processing techniques to find similar posts in a Facebook group.

You have probably wanted to post something in a Facebook group without being sure whether an almost identical post already exists, perhaps hidden on the next page.

You could click through the pages and read all the posts, or you could use the search function, but both approaches have their disadvantages. The first is obviously very tedious, and the second forces you to try different words and phrasings so that you don't miss anything.

Our goal here is to find the most similar posts in a group, based on their meaning.
 

SCRAPING THE GROUP POSTS

First you will need to get an App-ID and App-Secret from Facebook in order to access the content through the API. It is very simple and you can find the required steps here.

Second you will need the Group-ID for the group you want to analyse. I decided to look at the posts in the group Python Programming Language.

The easiest way to get the Group-ID is to use this website.

We will use the library facepy "which makes it really easy to use Facebook's Graph API". You can install facepy with pip install facepy.

Once you have installed facepy and have got your App-ID, App-Secret and Group-ID, scraping all posts from the group wall can be done with just a couple of lines of code:

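from facepy import GraphAPI

# Placeholders: fill in your own credentials and the ID of the group
app_id = "YOUR_APP_ID"
app_secret = "YOUR_APP_SECRET"
group_id = "YOUR_GROUP_ID"

# An app access token has the form "<App-ID>|<App-Secret>"
access_token = app_id + "|" + app_secret
graph = GraphAPI(access_token)
# page=True returns a generator that yields one page of results at a time
pages = graph.get(group_id + "/feed",
                  page=True,
                  retry=3,
                  limit=100)

# Collect the posts from all pages into a single list
posts = []
for p in pages:
    posts.extend(p['data'])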

There were about 7,000 posts in the Python group at the time of writing. You can store them in a simple list or in a pandas DataFrame.

Most of the posts are quite short as you would expect:


Here you can see the most active hours of the group:
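If you want to reproduce these two statistics yourself, here is a minimal sketch using only the standard library. It assumes each scraped post is a dictionary with an optional "message" field and a "created_time" field in the format shown below; the three sample posts are made up for illustration and stand in for the scraped posts list.

```python
from collections import Counter
from datetime import datetime

# Made-up sample posts in the assumed shape of the scraped data;
# replace this list with the `posts` list from the scraping step.
posts = [
    {"message": "How do I read a CSV file with pandas?",
     "created_time": "2017-03-01T14:05:33+0000"},
    {"message": "Check out my new Flask tutorial",
     "created_time": "2017-03-02T14:45:10+0000"},
    {"created_time": "2017-03-02T09:12:00+0000"},  # e.g. a photo without text
]

# Word count per post; posts without a "message" count as empty.
lengths = [len(p.get("message", "").split()) for p in posts]

# Number of posts per hour of day, in the time zone of the timestamps.
hours = Counter(
    datetime.strptime(p["created_time"], "%Y-%m-%dT%H:%M:%S%z").hour
    for p in posts
)

print(lengths)               # word counts, one per post
print(hours.most_common(3))  # busiest hours first
```

From here a histogram of lengths and a bar chart of hours can be drawn with any plotting library.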
 


COMPARING THE POSTS

There are many different ways to calculate the similarity between two documents.

First we need to decide how we want to represent a document (bag of words, tf-idf, word embeddings).

Second we have to choose the distance metric (Euclidean distance, cosine similarity, word mover's distance), which will tell us how close (similar) two documents are.
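To make these two choices concrete, here is a small sketch that combines the simplest representation (bag of words) with two of the metrics (Euclidean distance and cosine similarity). The three example documents and the helper names are made up for illustration; this is not the approach used in the rest of the post.

```python
import math
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc_a = "python resources for beginners"
doc_b = "python tutorials for newcomers"
doc_c = "learning materials"

# The vocabulary is the set of all words that occur in any document.
vocabulary = sorted(set(f"{doc_a} {doc_b} {doc_c}".split()))
vec_a, vec_b, vec_c = (bag_of_words(d, vocabulary) for d in (doc_a, doc_b, doc_c))

print(cosine_similarity(vec_a, vec_b))   # the documents share "python" and "for"
print(cosine_similarity(vec_a, vec_c))   # no shared words at all
print(euclidean_distance(vec_a, vec_b))
```

Note that doc_a and doc_c get a cosine similarity of zero even though "resources" and "materials" mean nearly the same thing. A bag-of-words representation only sees exact word matches.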

We are going to represent the content of a Facebook post using word embeddings and compare the transformed posts using the word mover's distance. This combination has been shown to yield lower k-nearest-neighbor document classification error rates than other state-of-the-art techniques.

The advantage of word embeddings is that words with similar meanings will still have similar vectors (be close) in the embedding space even if they don't have any letters in common (e.g. lion and tiger).
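A toy illustration of this idea: the three-dimensional vectors below are hand-picked for the example and are not real embeddings (real ones come from a trained model and have hundreds of dimensions), but they show what "close in the embedding space" means.

```python
import math

# Hand-picked toy vectors, NOT taken from a trained model.
toy_embeddings = {
    "lion":  [0.9, 0.8, 0.1],
    "tiger": [0.8, 0.9, 0.1],
    "car":   [0.1, 0.1, 0.9],
}

# Euclidean distance between two word vectors (math.dist needs Python 3.8+).
lion_tiger = math.dist(toy_embeddings["lion"], toy_embeddings["tiger"])
lion_car = math.dist(toy_embeddings["lion"], toy_embeddings["car"])

print(f"lion <-> tiger: {lion_tiger:.3f}")
print(f"lion <-> car:   {lion_car:.3f}")
```

"lion" and "tiger" share no letters in the same positions, yet their vectors are much closer to each other than either is to "car".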

This requires a model that has been trained on a large corpus of text in the respective language. Luckily for us, such models are already available and we don't have to train our own. We will use the library spaCy to transform the documents. You can find the instructions on how to install spaCy and how to download the language models here. Currently four languages (EN, DE, ES, FR) are supported out of the box, but you can find even more open-source language models and add them to the library yourself.

Additionally we will use the library textacy, which is built on top of spaCy, to compute the word mover's distance.

After the transformation, each word in every post will be represented by a 300-dimensional vector in the embedding space.

Now you can think of another post or question you would like to ask in the group and transform it in the same way as you did the other posts.

For example:

I want to become a data scientist. What online resources can you recommend? 

Then you will need to loop over the old posts and calculate the word mover's distance between each of them and your new post. Once the loop has finished, print out the posts with the highest similarity.
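The loop itself could be sketched as follows. To keep the example self-contained, it swaps in two simplifications that are not what the post actually used: made-up two-dimensional toy vectors instead of the 300-dimensional spaCy vectors, and the relaxed word mover's distance (a cheap lower bound in which every word simply moves to its nearest counterpart in the other document) instead of the full word mover's distance computed by textacy. All function names are hypothetical.

```python
import math

# Made-up 2-dimensional toy vectors standing in for real word embeddings.
toy_vectors = {
    "become": [0.9, 0.1], "acquire": [0.8, 0.2],
    "resources": [0.2, 0.9], "knowledge": [0.3, 0.8],
    "snake": [-0.7, -0.6], "photo": [-0.6, -0.8],
}

def embed(text):
    """Return the vectors of all known words in the text."""
    return [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]

def one_sided_cost(source, target):
    """Average distance from each source word to its nearest target word."""
    return sum(min(math.dist(s, t) for t in target) for s in source) / len(source)

def relaxed_wmd(a, b):
    """Relaxed word mover's distance: the larger of the two one-sided costs."""
    if not a or not b:
        return math.inf  # a post without any known words cannot be compared
    return max(one_sided_cost(a, b), one_sided_cost(b, a))

query = embed("become resources")
old_posts = ["snake photo", "acquire knowledge", "become resources snake"]

# A smaller distance means a more similar post, so sort in ascending order.
ranked = sorted(old_posts, key=lambda post: relaxed_wmd(query, embed(post)))
for post in ranked[:5]:
    print(round(relaxed_wmd(query, embed(post)), 3), post)
```

With these toy vectors, "acquire knowledge" ranks first although it shares no words with the query, because every query word has a close neighbour in it. The post that repeats the query words but adds an unrelated one ranks lower, since the extra word has to travel far.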
 

RESULTS

Here are the five most similar group posts to my question:

1) We are so excited to interview Kai Xin Thia ( Co-Founder of DataScience SG the largest data science community in Singapore) as the first data scientist for our DataAspirant blog lovers. He has shared some interesting things about data science which every data science lover has to know.

(Similarity: 0.583, Link)


2) Hello everyone, I have collected some useful resources in Data science, Python and R, which I found useful, I am looking forward to adding more resources. I would appreciate it if everyone could share their valuable data science learning resources with every learner and enthusiast.

(Similarity: 0.577, Link)
 

3) 12 Python Resources for Data Science

(Similarity: 0.575, Link)
 

4) What to become an expert in data science by learning some set of online coursers targeted to help you truly master your data science skill. Then this month is the great way start. coursera offering different data science specializations to full fill your dream to become data science expert. Have look at coursera data science specializations list ordered by DataAspirant team. ALL THE BEST

(Similarity: 0.573, Link)
 

5) What exactly  statistical knowledge I need to acquire for data science,machine learning. Predictive modelling etc?? Want answers from experienced data scientists.

(Similarity: 0.572, Link)
 

Notice how, for example, result 5 does not contain the words "resources" or "become", but since the words "knowledge" and "acquire" are close to them in the embedding space, the similarity based on the word mover's distance is still quite high.

Try it out yourself and post your findings in the comments!

For less-spammed Facebook pages ;-)