Gensim - Creating LDA Mallet Model



This chapter will explain what is a Latent Dirichlet Allocation (LDA) Mallet Model and how to create the same in Gensim.

In the previous section we have implemented LDA model and get the topics from documents of 20Newsgroup dataset. That was Gensim’s inbuilt version of the LDA algorithm. There is a Mallet version of Gensim also, which provides better quality of topics. Here, we are going to apply Mallet’s LDA on the previous example we have already implemented.

What is LDA Mallet Model?

Mallet, an open source toolkit, was written by Andrew McCullum. It is basically a Java based package which is used for NLP, document classification, clustering, topic modeling, and many other machine learning applications to text. It provides us the Mallet Topic Modeling toolkit which contains efficient, sampling-based implementations of LDA as well as Hierarchical LDA.

Mallet2.0 is the current release from MALLET, the java topic modeling toolkit. Before we start using it with Gensim for LDA, we must download the mallet-2.0.8.zip package on our system and unzip it. Once installed and unzipped, set the environment variable %MALLET_HOME% to the point to the MALLET directory either manually or by the code we will be providing, while implementing the LDA with Mallet next.

Gensim Wrapper

Python provides Gensim wrapper for Latent Dirichlet Allocation (LDA). The syntax of that wrapper is gensim.models.wrappers.LdaMallet. This module, collapsed gibbs sampling from MALLET, allows LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents as well.

Implementation Example

We will be using LDA Mallet on previously built LDA model and will check the difference in performance by calculating Coherence score.

Providing Path to Mallet File

Before applying Mallet LDA model on our corpus built in previous example, we must have to update the environment variables and provide the path the Mallet file as well. It can be done with the help of following code −

import os
from gensim.models.wrappers import LdaMallet
os.environ.update({'MALLET_HOME':r'C:/mallet-2.0.8/'}) 
#You should update this path as per the path of Mallet directory on your system.
mallet_path = r'C:/mallet-2.0.8/bin/mallet' 
#You should update this path as per the path of Mallet directory on your system.

Once we provided the path to Mallet file, we can now use it on the corpus. It can be done with the help of ldamallet.show_topics() function as follows −

ldamallet = gensim.models.wrappers.LdaMallet(
   mallet_path, corpus=corpus, num_topics=20, id2word=id2word
)
pprint(ldamallet.show_topics(formatted=False))

Output

[
   (4,
   [('gun', 0.024546225966016102),
   ('law', 0.02181426826996709),
   ('state', 0.017633545129043606),
   ('people', 0.017612848479831116),
   ('case', 0.011341763768445888),
   ('crime', 0.010596684396796159),
   ('weapon', 0.00985160502514643),
   ('person', 0.008671896020034356),
   ('firearm', 0.00838214293105946),
   ('police', 0.008257963035784506)]),
   (9,
   [('make', 0.02147966482730431),
   ('people', 0.021377478029838543),
   ('work', 0.018557122419783363),
   ('money', 0.016676885346413244),
   ('year', 0.015982015123646026),
   ('job', 0.012221540976905783),
   ('pay', 0.010239117106069897),
   ('time', 0.008910688739014919),
   ('school', 0.0079092581238504),
   ('support', 0.007357449417535254)]),
   (14,
   [('power', 0.018428398507941996),
   ('line', 0.013784244460364121),
   ('high', 0.01183271164249895),
   ('work', 0.011560979224821522),
   ('ground', 0.010770484918850819),
   ('current', 0.010745781971789235),
   ('wire', 0.008399002000938712),
   ('low', 0.008053160742076529),
   ('water', 0.006966231071366814),
   ('run', 0.006892122230182061)]),
   (0,
   [('people', 0.025218349201353372),
   ('kill', 0.01500904870564167),
   ('child', 0.013612400660948935),
   ('armenian', 0.010307655991816822),
   ('woman', 0.010287984892595798),
   ('start', 0.01003226060272248),
   ('day', 0.00967818081674404),
   ('happen', 0.009383114328428673),
   ('leave', 0.009383114328428673),
   ('fire', 0.009009363443229208)]),
   (1,
   [('file', 0.030686386604212003),
   ('program', 0.02227713642901929),
   ('window', 0.01945561169918489),
   ('set', 0.015914874783314277),
   ('line', 0.013831003577619592),
   ('display', 0.013794120901412606),
   ('application', 0.012576992586582082),
   ('entry', 0.009275993066056873),
   ('change', 0.00872275292295209),
   ('color', 0.008612104894331132)]),
   (12,
   [('line', 0.07153810971508515),
   ('buy', 0.02975597944523662),
   ('organization', 0.026877236406682988),
   ('host', 0.025451316957679788),
   ('price', 0.025182275552207485),
   ('sell', 0.02461728860071565),
   ('mail', 0.02192687454599263),
   ('good', 0.018967419085797303),
   ('sale', 0.017998870026097017),
   ('send', 0.013694207538540181)]),
   (11,
   [('thing', 0.04901329901329901),
   ('good', 0.0376018876018876),
   ('make', 0.03393393393393394),
   ('time', 0.03326898326898327),
   ('bad', 0.02664092664092664),
   ('happen', 0.017696267696267698),
   ('hear', 0.015615615615615615),
   ('problem', 0.015465465465465466),
   ('back', 0.015143715143715144),
   ('lot', 0.01495066495066495)]),
   (18,
   [('space', 0.020626317374284855),
   ('launch', 0.00965716006366413),
   ('system', 0.008560244332602057),
   ('project', 0.008173097603991913),
   ('time', 0.008108573149223556),
   ('cost', 0.007764442723792318),
   ('year', 0.0076784101174345075),
   ('earth', 0.007484836753129436),
   ('base', 0.0067535595990880545),
   ('large', 0.006689035144319697)]),
   (5,
   [('government', 0.01918437232469453),
   ('people', 0.01461203206475212),
   ('state', 0.011207097828624796),
   ('country', 0.010214802708381975),
   ('israeli', 0.010039691804809714),
   ('war', 0.009436532025838587),
   ('force', 0.00858043427504086),
   ('attack', 0.008424780138532182),
   ('land', 0.0076659662230523775),
   ('world', 0.0075103120865437)]),
   (2,
   [('car', 0.041091194044470564),
   ('bike', 0.015598981291017729),
   ('ride', 0.011019688510138114),
   ('drive', 0.010627877363110981),
   ('engine', 0.009403467528651191),
   ('speed', 0.008081104907434616),
   ('turn', 0.007738270153785875),
   ('back', 0.007738270153785875),
   ('front', 0.007468899990204721),
   ('big', 0.007370947203447938)])
]

Evaluating Performance

Now we can also evaluate its performance by calculating the coherence score as follows −

ldamallet = gensim.models.wrappers.LdaMallet(
   mallet_path, corpus=corpus, num_topics=20, id2word=id2word
)
pprint(ldamallet.show_topics(formatted=False))

Output

Coherence Score: 0.5842762900901401
Advertisements