Topic Modeling On Articles From Slovak Newspapers

 

Today we focus on text mining techniques, more specifically on topic modeling.

The main goal is to find clusters of articles based on their content. These articles were scraped from the websites www.sme.sk, www.pravda.sk and www.dennikn.sk. As you might have already noticed, the tricky part is that we will be dealing with articles in Slovak, which makes our job a bit harder. We will compare two techniques and see whether they can deliver good-quality results.
 

Data gathering

To get our data, we use the package BeautifulSoup to scrape the text of each article, as well as its category. BeautifulSoup parses a webpage using its HTML tags, and it is very simple to access separate parts of the page through those tags. The category of the article is extracted from the URL; for example, from the following URL we can identify tech as the topic of the article: https://tech.sme.sk/c/20652556/vymena-baterii-namiesto-nabijania-elektroauta-sa-mozu-zmenit.html?ref=trz
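A minimal sketch of the URL-based category extraction, assuming (as in the example above) that the section name is the first label of the hostname; `category_from_url` is an illustrative helper, not the actual scraper:

```python
from urllib.parse import urlparse

def category_from_url(url):
    """Return the article category encoded in the subdomain,
    e.g. "tech" for tech.sme.sk."""
    host = urlparse(url).netloc  # e.g. "tech.sme.sk"
    return host.split(".")[0]

url = ("https://tech.sme.sk/c/20652556/vymena-baterii-namiesto-"
       "nabijania-elektroauta-sa-mozu-zmenit.html?ref=trz")
print(category_from_url(url))  # -> tech
```

The same rule works for the other two sites as long as their section pages also live on subdomains; anything else would need a site-specific fallback.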

Using our own scraper, we were able to download the text and category of 28,545 articles. We will use two unsupervised techniques to identify the clusters and topics of those articles.

Afterwards, we will use the extracted categories to evaluate the models. You can see a sample of our data below (an article and its label):

label: techbox

text: Gmail na prispôsobenie reklamy používa prehliadanie správ, ktoré sú doručené do mailovej schránky. Tomu má do konca tohto roka definitívne odzvoniť.
Mailový klient od Google patrí medzi najpopulárnejšie produkty spoločnosti. Niektorí však môžu namietať, že celý vyhľadávací gigant je len veľký obchod s reklamou a zberom osobných údajov.
Pozrite siGmailify – funkcie Gmail aj bez @gmail.com
Gmail má v sebe zakomponovaný skener, ktorý hľadá kľúčové slová. Na základe nich vie veľký brat prispôsobiť reklamy každému osobne. A i napriek tomu, že toto prehliadanie má na starosti stroj a nie človek majú niektorí z 1,2 miliardy používateľov na tento zásah do súkromia ťažké srdce. To by sa však malo už o chvíľu zmeniť.
Po tom, ako spoločnosť zrušila túto aktivitu v business verzii Gmailu, ktorý spadá do skupiny G Suite sa tak zmeny dočkáme aj my, bežní používatelia. Nové pravidlo nadobudne účinnosť do konca tohto roka.
Pozrite si7 tipov, ktoré oceníte ak používate Gmail
To samozrejme neznamená, že ostatné služby od Google, ale aj iných spoločností nebudú vaše dáta zbierať. Vyhľadávač si na nás aj ďalej posvieti. Či to už bude história prehliadania, stránky, ktoré navštevujeme, videá, ktoré sledujeme na YouTube a ďalšie.
ZdrojGoogle Blog 


KMeans clustering

The first approach we use is KMeans clustering. Before that, we need to transform the texts into numeric features. We will use the tf-idf algorithm, which transforms the documents into numeric vectors. It is short for term frequency - inverse document frequency: it uses the counts of terms in each article and compares them with the counts across all documents. For more information about the algorithm, you can visit this page.
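As a sketch, the plain tf-idf weighting can be written out directly in standard-library Python (in practice one would use scikit-learn's TfidfVectorizer, which applies a smoothed variant of the same idea):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute plain tf-idf weights for tokenised documents.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = log(N / number of documents containing t)
    """
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts, total = Counter(doc), len(doc)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in counts.items()})
    return weights

docs = [["auto", "motor", "auto"], ["vláda", "voľby"], ["auto", "voľby"]]
w = tfidf(docs)
```

A term that appears often in one article but rarely elsewhere (like "motor" above) gets a high weight, which is exactly what makes the vectors useful for clustering.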

We have fitted the algorithm for 10 to 20 clusters and compared the solutions using inertia and the silhouette score (click here for more information):

[Line charts: inertia and silhouette score for the 10 to 20 cluster solutions]

We can see that the inertia of the clusters decreases as the number of clusters grows, while the silhouette score increases. Both metrics hint that the twenty-cluster solution is the best, so we will use it to check how the extracted labels/topics of the articles are distributed across the separate groups. There are many possible article topics, and it would be hard to fit them all into a ten-cluster solution.

Now we will check the distribution of topics in the different clusters, always looking at the five most frequent topics in each group. Let's start with clusters that are clearly defined by one topic. The topics are shown on the x-axis and the percentage of articles with that topic on the y-axis.

[Bar charts: topic distributions of clusters 9, 0 and 11, each dominated by a single topic]

The charts above show some of the cleanest clusters: it is very easy to identify the main article topic of each. However, this wasn't the case for all of them; some have no clear leader and their topics are a bit mixed.

[Bar chart: topic distribution of cluster 14, with no clearly dominant topic]

In general, the KMeans algorithm performed quite well and similar articles were grouped together. Now we can take a look at the next approach: LDA.
 

Latent Dirichlet Allocation (LDA)

LDA is a statistical algorithm that allows observations to be explained by unobserved (latent) groups, which account for why some parts of the data are similar, e.g. the topics of articles. In our case, each group is made up of specific words taken from the articles. For more information about the algorithm, please visit this link.

Before modeling we have pre-processed the data using the following steps:

  1. Stopwords removal
  2. Punctuation removal
  3. Numbers removal
  4. Normalization
  5. Removal of words with a length shorter than 4 characters
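A minimal sketch of these steps in plain Python; the stopword list here is a tiny illustrative stand-in for a full Slovak stopword file, and punctuation is stripped first so that stopwords match cleanly:

```python
import string

# Illustrative subset only; a real run needs a full Slovak stopword list.
STOPWORDS = {"a", "aj", "ale", "na", "do", "je", "sa", "si", "to", "v", "z"}

def preprocess(text):
    """Apply the five steps above: stopword, punctuation and number
    removal, lower-casing, and dropping words shorter than 4 characters."""
    # 2. punctuation removal (done first so tokens are clean)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = []
    for word in text.split():
        word = word.lower()                   # 4. normalisation
        if word in STOPWORDS:                 # 1. stopword removal
            continue
        if any(ch.isdigit() for ch in word):  # 3. number removal
            continue
        if len(word) < 4:                     # 5. short-word removal
            continue
        tokens.append(word)
    return tokens
```

For Slovak a real pipeline would likely also add lemmatisation or stemming, since the language is heavily inflected ("strany", "strana", "voľby", "volieb" in the results below are forms of the same words).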


After the pre-processing, each document is transformed into a list of words, e.g.:

[gmail,prispôsobenie,reklamy,používa,prehliadanie,správ,doručené,
mailovej,schránky,konca,tohto,roka,definitívne,odzvoniť,mailový,
klient,google,patrí,najpopulárnejšie,produkty,spoločnosti,niektorí,
môžu,namietať,.....]
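Before the model can be fitted, these token lists are converted into a sparse bag-of-words corpus. A standard-library sketch of the mapping that gensim's Dictionary and doc2bow perform (the function name is illustrative):

```python
from collections import Counter

def build_corpus(tokenised_docs):
    """Map each word to an integer id and convert every document into
    a bag-of-words: a sorted list of (word_id, count) pairs. This
    mirrors what gensim's Dictionary + doc2bow produce."""
    vocab = {}
    corpus = []
    for doc in tokenised_docs:
        bow = []
        for word, count in Counter(doc).items():
            if word not in vocab:
                vocab[word] = len(vocab)
            bow.append((vocab[word], count))
        corpus.append(sorted(bow))
    return vocab, corpus
```

The vocabulary is shared across all documents, so the same word always maps to the same id regardless of which article it appears in.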


After fitting the model to the dataset in the pre-processed format, we will use the code below to get the topics of the articles:
 

ldamodel.print_topics(num_topics=20, num_words=10)[19]


There are two parameters: num_topics indicates how many topics we expect to have (we use 20 so we are consistent with KMeans above), and num_words tells print_topics how many of each topic's highest-weighted words to show (a topic is really a distribution over the whole vocabulary). You can see some example results below:
 

0.005*"zadné" + 0.005*"kilometrov" + 0.004*"príspevok" + 0.004*"zdieľa"
+ 0.004*"sagan" + 0.004*"etapa" + 0.003*"júl" + 0.003*"              "
+ 0.002*"august" + 0.002*"september"

0.006*"krajiny" + 0.005*"európskej" + 0.004*"povedal" +
0.004*"slovensko" + 0.004*"utečencov" + 0.004*"únie" + 0.004*"krajín" +
0.003*"proti" + 0.003*"európy" + 0.002*"teraz"

0.011*"strany" + 0.007*"vlády" + 0.006*"strana" + 0.005*"povedal" +
0.005*"parlamentu" + 0.005*"voľbách" + 0.005*"fico" + 0.004*"predseda"
+ 0.004*"voľby" + 0.004*"volieb"

0.007*"mäso" + 0.005*"pridáme" + 0.004*"cukor" + 0.004*"soľ" +
0.003*"mäsa" + 0.003*"postup" + 0.003*"necháme" + 0.003*" minút" +
0.003*"korenie" + 0.003*"jedlo"

0.003*"zápas" + 0.003*"veľmi" + 0.003*"prémiovému" + 0.002*"tréner" +
0.002*"miesto" + 0.002*"tour" + 0.002*"finále" + 0.002*"peter" +
0.002*"sveta" + 0.002*"tímu"

0.006*"polícia" + 0.003*"ukrajiny" + 0.003*"policajti" +
0.003*"informovala" + 0.002*"lietadlo" + 0.002*"informácie" +
0.002*"lietadla" + 0.002*"polície" + 0.002*"hovorkyňa" +
0.002*"vyšetrovanie"

0.005*"slovensku" + 0.005*"nemocnice" + 0.004*"pacientov" +
0.003*"zdravotníctva" + 0.003*"ceny" + 0.003*"potravín" +
0.002*"zdravotnej" + 0.002*"prievidza" + 0.002*"slovenska" +
0.002*"starostlivosti"

0.007*"audi" + 0.006*" litra" + 0.005*"motory" + 0.004*"ford" +
0.004*"sedadlá" + 0.003*"predné" + 0.003*"toyota" + 0.003*"opel" +
0.002*"fiat" + 0.002*"disky"


You can see that the results are quite nice: each topic consists of words with related meanings, even though we are working with Slovak text.
 

SUMMARY

We have applied two different modeling techniques to cluster the articles based on their topics: KMeans and LDA. Although there is still room for improvement, both have shown pretty nice results. Let us know if you have any experience with topic modeling in languages other than English.