
Topic Modeling With Latent Dirichlet Allocation

A practical introduction to topic modeling with LDA, including bag-of-words, document-topic distributions, topic-word distributions, preprocessing, and evaluation.

2019.07.21 · 1 min read · by Zhenlin Wang

Introduction

Topic modeling discovers recurring themes in a collection of documents without requiring labels. Latent Dirichlet Allocation (LDA) is a classic probabilistic topic model.

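LDA assumes that each document is a mixture of topics and that each topic is a probability distribution over words.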

Intuition

A document about machine learning systems might be:

60% engineering, 25% evaluation, 15% deployment

A topic might assign high probability to words such as:

model, training, metric, feature, deployment

LDA tries to infer these hidden topic and word distributions from observed documents.
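
To make the intuition concrete, here is a minimal sketch of the generative story LDA assumes, written with only the Python standard library. The vocabulary, the two topic-word rows, and the Dirichlet parameter `[0.5, 0.5]` are made-up values for illustration, and the helper names are hypothetical. Fitting LDA means running this story in reverse: given only the words, estimate the mixtures and the topics.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, n_words, rng):
    """Generate one document by following LDA's generative story."""
    # 1. Draw this document's topic mixture.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position from the mixture.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

vocab = ["model", "training", "metric", "feature", "deployment", "server"]
# Two hypothetical topics; each row is a distribution over vocab and sums to 1.
topic_word = [
    [0.35, 0.35, 0.10, 0.15, 0.03, 0.02],  # engineering-like topic
    [0.05, 0.05, 0.05, 0.05, 0.40, 0.40],  # deployment-like topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, [0.5, 0.5], 12, rng)
print([round(p, 2) for p in theta], words)
```

Word order in the generated document carries no meaning here, since each word is drawn independently given the document's topic mixture.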

Preprocessing

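Preprocessing matters because LDA only sees word counts: the tokens that survive preprocessing are all the model has to build topics from.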

Bad preprocessing often creates bad topics.
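
As a rough sketch of what a preprocessing step might look like, the snippet below lowercases text, keeps alphabetic tokens, drops stopwords and very short tokens, and optionally filters rare words. These particular steps, the tiny stopword list, and the `min_df` threshold are assumptions chosen for illustration; the right choices depend on the corpus.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "on", "for", "we"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

def build_bow(docs, min_df=1):
    """Return per-document word counts, keeping words found in at least min_df documents."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    keep = {w for w, df in doc_freq.items() if df >= min_df}
    return [Counter(t for t in toks if t in keep) for toks in tokenized]

docs = [
    "We trained the model on new features.",
    "The model is deployed to a server.",
]
bow = build_bow(docs)
print(bow[0])
```

Raising `min_df` removes words that appear in too few documents to support a topic, which is one common way to reduce noise before fitting.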

Choosing the Number of Topics

The number of topics is a modeling choice. Standard LDA does not learn it from the data, so it has to be set before fitting.

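Too few topics: distinct themes get merged into broad, mixed topics.

Too many topics: themes split into narrow, overlapping topics that are hard to tell apart.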

Use both quantitative metrics and human inspection. Topic coherence can help, but the topics should also be useful for the actual task.
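
One quantitative signal is a co-occurrence-based coherence score. The sketch below implements a UMass-style variant in plain Python: for each pair of top words it asks how often the lower-ranked word appears in the same documents as the higher-ranked one. The function name and the toy corpus are hypothetical, and library implementations may differ in smoothing and normalization.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs_tokens):
    """UMass-style coherence: sum of log((D(w_hi, w_lo) + 1) / D(w_hi)) over word pairs.

    top_words must be ordered from most to least probable. D counts the
    documents containing the given word(s). Scores closer to 0 mean the
    top words co-occur in documents more often.
    """
    doc_sets = [set(toks) for toks in docs_tokens]

    def doc_count(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for w_hi, w_lo in combinations(top_words, 2):  # w_hi ranks above w_lo
        d_hi = doc_count(w_hi)
        if d_hi == 0:
            continue  # skip words that never appear in the corpus
        score += math.log((doc_count(w_hi, w_lo) + 1) / d_hi)
    return score

docs_tokens = [
    ["model", "training", "metric"],
    ["model", "training", "feature"],
    ["server", "deployment", "latency"],
    ["server", "deployment", "model"],
]
coherent = umass_coherence(["model", "training", "metric"], docs_tokens)
mixed = umass_coherence(["model", "latency", "feature"], docs_tokens)
print(coherent > mixed)  # True: the first word set co-occurs more often
```

To compare candidate values for the number of topics, one option is to fit a model for each value, average this score over its topics, and then read the resulting topics by hand before deciding.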

Interpreting Topics

Inspect both the top words of each topic and the documents in which that topic has the highest weight.

Name topics after looking at documents, not only top words.
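
A small sketch of that inspection, assuming a fitted model has already produced a topic-word matrix and a document-topic matrix as nested lists. The numbers below are invented for illustration, and `top_words` and `top_documents` are hypothetical helper names.

```python
def top_words(topic_word_row, vocab, n=3):
    """Return the n highest-probability words for one topic."""
    ranked = sorted(zip(topic_word_row, vocab), reverse=True)
    return [w for _, w in ranked[:n]]

def top_documents(doc_topic, topic_id, n=2):
    """Return indices of the n documents with the largest weight on topic_id."""
    order = sorted(
        range(len(doc_topic)),
        key=lambda d: doc_topic[d][topic_id],
        reverse=True,
    )
    return order[:n]

vocab = ["model", "training", "metric", "server", "deployment"]
# Hypothetical fitted matrices: rows of topic_word are topics, rows of doc_topic are documents.
topic_word = [
    [0.40, 0.30, 0.20, 0.05, 0.05],
    [0.10, 0.05, 0.05, 0.45, 0.35],
]
doc_topic = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]
for k in range(len(topic_word)):
    print(k, top_words(topic_word[k], vocab), top_documents(doc_topic, k))
```

Reading the documents returned by `top_documents` is what guards against naming a topic from a few words that happen to look related.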

Limitations

LDA uses a bag-of-words representation, so it ignores word order and much of syntax.
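
A quick illustration of what ignoring word order means in practice, using whitespace tokenization as a simplifying assumption:

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, discarding their order."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)  # True: both sentences contain the same words the same number of times
```

The two sentences mean different things, but to LDA they are the same document.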

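It can struggle with very short documents, where few word co-occurrences are observed, and with words whose meaning depends on context.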

For modern semantic discovery, embeddings plus clustering may work better. LDA is still useful when a simple probabilistic model and interpretable word distributions are desired.

Closing

LDA is best understood as a model of document-topic mixtures and topic-word distributions. It is not magic, but it gives a clear framework for exploring large text collections.