Gensim - Creating LDA Mallet Model
This chapter explains what a Latent Dirichlet Allocation (LDA) Mallet model is and how to create one in Gensim.
In the previous section we implemented an LDA model and extracted topics from the documents of the 20Newsgroup dataset. That was Gensim’s built-in version of the LDA algorithm. Gensim also provides a wrapper around Mallet’s version of LDA, which often yields better-quality topics. Here, we apply Mallet’s LDA to the example we have already implemented.
What is LDA Mallet Model?
Mallet, an open source toolkit, was written by Andrew McCallum. It is a Java-based package used for NLP, document classification, clustering, topic modeling, and other machine learning applications to text. It includes the Mallet Topic Modeling toolkit, which contains efficient, sampling-based implementations of LDA as well as Hierarchical LDA.
Mallet 2.0 is the current release of MALLET, the Java topic modeling toolkit. Before using it with Gensim for LDA, download the mallet-2.0.8.zip package and unzip it. Then set the environment variable %MALLET_HOME% to point to the MALLET directory, either manually or with the code provided in the implementation example later in this chapter.
Gensim Wrapper
Gensim provides a Python wrapper for MALLET’s Latent Dirichlet Allocation (LDA), available as gensim.models.wrappers.LdaMallet. This module, which uses collapsed Gibbs sampling from MALLET, allows LDA model estimation from a training corpus as well as inference of topic distributions on new, unseen documents. Note that the gensim.models.wrappers module was removed in Gensim 4.0, so this example requires an earlier 3.x release.
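To give a feel for what collapsed Gibbs sampling does, here is a toy pure-Python sketch. It is not MALLET's implementation and makes several assumptions: the function name gibbs_lda, the hyperparameter values alpha and beta, and the tiny corpus of integer word ids are all illustrative. Each token's topic is resampled in turn from its conditional distribution given every other assignment.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iterations=50,
              alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not MALLET's code)."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Sample a new topic and restore the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Four tiny documents over a six-word vocabulary (word ids 0..5).
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 2], [3, 5, 4, 4]]
doc_topic, topic_word = gibbs_lda(docs, num_topics=2, vocab_size=6)
print(doc_topic)
```

The returned count tables play the role of the document-topic and topic-word statistics from which a real implementation would estimate its distributions.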
Implementation Example
We will apply LDA Mallet to the corpus used for the previously built LDA model and check the difference in performance by calculating the coherence score.
Providing Path to Mallet File
Before applying the Mallet LDA model to the corpus built in the previous example, we must update the environment variables and provide the path to the Mallet executable. This can be done with the following code −
import os
from gensim.models.wrappers import LdaMallet

# Update both paths to match the Mallet directory on your system.
os.environ.update({'MALLET_HOME': r'C:/mallet-2.0.8/'})
mallet_path = r'C:/mallet-2.0.8/bin/mallet'
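Before handing the path to the wrapper, it can help to confirm that the launcher is where we expect it. The following is a minimal sketch using only the standard library; the helper name configure_mallet is hypothetical, and the install location is the same example path used above, which you should adjust for your system.

```python
import os
from pathlib import Path

def configure_mallet(mallet_home):
    """Set MALLET_HOME and report whether bin/mallet exists (hypothetical helper)."""
    home = Path(mallet_home)
    os.environ["MALLET_HOME"] = str(home)
    launcher = home / "bin" / "mallet"
    return launcher, launcher.exists()

# Example install location; change it to wherever Mallet was unzipped.
launcher, found = configure_mallet("C:/mallet-2.0.8")
print(launcher, "found" if found else "not found; check the install path")
```

A check like this surfaces a wrong path before training starts, rather than as an error from the Java process.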
Once we have provided the path to the Mallet executable, we can use it on the corpus. We train the model with LdaMallet and then print its topics with the ldamallet.show_topics() function as follows −
import gensim
from pprint import pprint

# corpus and id2word come from the previous example.
ldamallet = gensim.models.wrappers.LdaMallet(
   mallet_path, corpus=corpus, num_topics=20, id2word=id2word
)
pprint(ldamallet.show_topics(formatted=False))
Output
[
 (4, [('gun', 0.024546225966016102), ('law', 0.02181426826996709), ('state', 0.017633545129043606), ('people', 0.017612848479831116), ('case', 0.011341763768445888), ('crime', 0.010596684396796159), ('weapon', 0.00985160502514643), ('person', 0.008671896020034356), ('firearm', 0.00838214293105946), ('police', 0.008257963035784506)]),
 (9, [('make', 0.02147966482730431), ('people', 0.021377478029838543), ('work', 0.018557122419783363), ('money', 0.016676885346413244), ('year', 0.015982015123646026), ('job', 0.012221540976905783), ('pay', 0.010239117106069897), ('time', 0.008910688739014919), ('school', 0.0079092581238504), ('support', 0.007357449417535254)]),
 (14, [('power', 0.018428398507941996), ('line', 0.013784244460364121), ('high', 0.01183271164249895), ('work', 0.011560979224821522), ('ground', 0.010770484918850819), ('current', 0.010745781971789235), ('wire', 0.008399002000938712), ('low', 0.008053160742076529), ('water', 0.006966231071366814), ('run', 0.006892122230182061)]),
 (0, [('people', 0.025218349201353372), ('kill', 0.01500904870564167), ('child', 0.013612400660948935), ('armenian', 0.010307655991816822), ('woman', 0.010287984892595798), ('start', 0.01003226060272248), ('day', 0.00967818081674404), ('happen', 0.009383114328428673), ('leave', 0.009383114328428673), ('fire', 0.009009363443229208)]),
 (1, [('file', 0.030686386604212003), ('program', 0.02227713642901929), ('window', 0.01945561169918489), ('set', 0.015914874783314277), ('line', 0.013831003577619592), ('display', 0.013794120901412606), ('application', 0.012576992586582082), ('entry', 0.009275993066056873), ('change', 0.00872275292295209), ('color', 0.008612104894331132)]),
 (12, [('line', 0.07153810971508515), ('buy', 0.02975597944523662), ('organization', 0.026877236406682988), ('host', 0.025451316957679788), ('price', 0.025182275552207485), ('sell', 0.02461728860071565), ('mail', 0.02192687454599263), ('good', 0.018967419085797303), ('sale', 0.017998870026097017), ('send', 0.013694207538540181)]),
 (11, [('thing', 0.04901329901329901), ('good', 0.0376018876018876), ('make', 0.03393393393393394), ('time', 0.03326898326898327), ('bad', 0.02664092664092664), ('happen', 0.017696267696267698), ('hear', 0.015615615615615615), ('problem', 0.015465465465465466), ('back', 0.015143715143715144), ('lot', 0.01495066495066495)]),
 (18, [('space', 0.020626317374284855), ('launch', 0.00965716006366413), ('system', 0.008560244332602057), ('project', 0.008173097603991913), ('time', 0.008108573149223556), ('cost', 0.007764442723792318), ('year', 0.0076784101174345075), ('earth', 0.007484836753129436), ('base', 0.0067535595990880545), ('large', 0.006689035144319697)]),
 (5, [('government', 0.01918437232469453), ('people', 0.01461203206475212), ('state', 0.011207097828624796), ('country', 0.010214802708381975), ('israeli', 0.010039691804809714), ('war', 0.009436532025838587), ('force', 0.00858043427504086), ('attack', 0.008424780138532182), ('land', 0.0076659662230523775), ('world', 0.0075103120865437)]),
 (2, [('car', 0.041091194044470564), ('bike', 0.015598981291017729), ('ride', 0.011019688510138114), ('drive', 0.010627877363110981), ('engine', 0.009403467528651191), ('speed', 0.008081104907434616), ('turn', 0.007738270153785875), ('back', 0.007738270153785875), ('front', 0.007468899990204721), ('big', 0.007370947203447938)])
]
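The raw output is a list of (topic id, [(word, weight), ...]) pairs, which is hard to scan. As a minimal sketch, the snippet below condenses such a list into one line per topic. The helper name summarize is hypothetical, and the sample data is two topics from the output above, truncated to three words each so the snippet runs without Mallet.

```python
# Two topics from the output above, truncated to three words each.
topics = [
    (4, [('gun', 0.024546225966016102), ('law', 0.02181426826996709),
         ('state', 0.017633545129043606)]),
    (2, [('car', 0.041091194044470564), ('bike', 0.015598981291017729),
         ('ride', 0.011019688510138114)]),
]

def summarize(topics, top_n=3):
    """Return one 'Topic <id>: word, word, ...' line per topic, ordered by id."""
    lines = []
    for topic_id, word_weights in sorted(topics):
        words = ", ".join(word for word, _ in word_weights[:top_n])
        lines.append(f"Topic {topic_id}: {words}")
    return lines

for line in summarize(topics):
    print(line)
```

With a trained model, the same helper could be given the result of ldamallet.show_topics(formatted=False) directly.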
Evaluating Performance
Now we can also evaluate its performance by calculating the coherence score as follows −
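from gensim.models import CoherenceModel

# data_lemmatized is assumed to be the tokenized, lemmatized texts from the previous example.
coherence_model_ldamallet = CoherenceModel(
   model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v'
)
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('Coherence Score:', coherence_ldamallet)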
Output
Coherence Score: 0.5842762900901401
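To illustrate what a coherence score measures, the sketch below computes a simplified UMass-style coherence from document co-occurrence counts. This is a deliberately simpler measure chosen for illustration, not necessarily the one behind the score above; the function name umass_coherence and the toy texts are assumptions. The idea is that a topic whose top words tend to appear in the same documents scores higher.

```python
import math

def umass_coherence(top_words, texts):
    """Simplified UMass-style coherence, averaged over word pairs (illustrative).

    Every word in top_words must occur in at least one text.
    """
    doc_sets = [set(text) for text in texts]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            scores.append(math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j)))
    return sum(scores) / len(scores)

# Toy tokenized documents, not taken from the 20Newsgroup data.
texts = [
    ["gun", "law", "state"],
    ["gun", "law"],
    ["car", "bike"],
    ["car", "ride", "bike"],
]
coherent = umass_coherence(["gun", "law"], texts)  # words that co-occur
mixed = umass_coherence(["gun", "car"], texts)     # words that never co-occur
print(coherent > mixed)
```

Scores from different coherence measures are on different scales, so only scores computed with the same measure and the same texts should be compared across models.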