Arnie Bhadury

Topic Models - Simple, Modular and Effective!

Probabilistic topic models are a suite of Machine Learning (ML) algorithms for discovering the thematic structure that pervades large text corpora. Topic models learn lower-dimensional representations of data and are heavily used in text compression, information retrieval and text classification.

Latent Dirichlet Allocation (LDA), one of the cornerstones of Natural Language Processing (NLP) and Bayesian Machine Learning, is one such model: a hierarchical Bayesian graphical model that learns sparse "topics" and represents each document as a mixture of these learned topics. This reduces the representation of a document from a large vocabulary to a small set of topics. LDA is very useful for such structuring tasks in unsupervised settings, while also being extremely modular and extensible for customized tasks.
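
To make the idea of representing documents by topic proportions concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, using only the Python standard library. Collapsed Gibbs sampling is one common inference scheme and is assumed here for illustration; it is not necessarily the method covered in the talk. The function name `lda_gibbs`, its parameters, and the toy corpus are all hypothetical.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of documents, each a list of integer word ids in
    range(vocab_size). Returns (theta, topic_word), where theta[d] is the
    estimated topic mixture of document d and topic_word[k][w] counts how
    often word w is currently assigned to topic k.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word w assigned to topic k
    topic_total = [0] * num_topics                              # all tokens assigned to topic k

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of each document's topic proportions.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return theta, topic_word


# Toy corpus over a 4-word vocabulary: each document becomes a 2-number topic mixture.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=4)
for d, mixture in enumerate(theta):
    print(d, [round(p, 2) for p in mixture])
```

Each row of `theta` sums to one and has `num_topics` entries regardless of vocabulary size, which is the dimensionality reduction described above. For real corpora, an established library implementation would be a more practical choice than this sketch.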

In this talk I will discuss my industry and research experience with LDA in unsupervised tasks, tricks to speed up posterior inference for LDA, some interesting extensions such as Correlated Topic Models and Dynamic Topic Models, and the use of topic models for traditional supervised classification tasks. I will conclude with interesting industry applications and some emerging research questions.