probabilistic Latent Semantic Analysis

(45 minutes to learn)

Summary

Probabilistic Latent Semantic Analysis (pLSA), also known as probabilistic Latent Semantic Indexing (pLSI), is a matrix decomposition technique for binary and count data. It assumes that one component of the data (e.g. the terms) is conditionally independent of the other component (e.g. the documents) given some unobserved factor. pLSA is most commonly used for document modeling, where the count data are the number of times each term appears in each document (forming an observed term-by-document count matrix), and the factors are interpreted as the latent/unobserved topics.
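
As a rough illustration of the decomposition described above, here is a minimal sketch of fitting pLSA with expectation-maximization on a small term-by-document count matrix. It assumes the asymmetric parameterization p(w|d) = sum_z p(w|z) p(z|d). The function name `plsa_em`, its parameters, and the toy counts are hypothetical and chosen only for this example; it is not taken from any of the resources listed on this page.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """Sketch of EM for pLSA on a (terms x docs) count matrix.

    Returns p(w|z) with shape (terms, topics), p(z|d) with shape
    (topics, docs), and the log-likelihood recorded at each iteration.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = counts.shape
    eps = 1e-12  # guards against division by zero and log(0)

    # Random initialization; every column is a probability distribution.
    p_w_z = rng.random((n_terms, n_topics))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((n_topics, n_docs))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)

    log_liks = []
    for _ in range(n_iters):
        # E-step: responsibilities p(z | w, d), shape (terms, topics, docs).
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]
        p_w_d = joint.sum(axis=1)  # mixture p(w | d), shape (terms, docs)
        resp = joint / np.maximum(p_w_d[:, None, :], eps)

        # Log-likelihood of the counts under the current parameters.
        log_liks.append(float((counts * np.log(np.maximum(p_w_d, eps))).sum()))

        # M-step: re-estimate both factors from expected counts.
        weighted = counts[:, None, :] * resp
        p_w_z = weighted.sum(axis=2)
        p_w_z /= np.maximum(p_w_z.sum(axis=0, keepdims=True), eps)
        p_z_d = weighted.sum(axis=0)
        p_z_d /= np.maximum(p_z_d.sum(axis=0, keepdims=True), eps)

    return p_w_z, p_z_d, log_liks

# Toy data: 4 terms x 4 documents with two loosely separated "topics".
toy_counts = np.array([[4, 3, 0, 0],
                       [3, 5, 0, 1],
                       [0, 0, 5, 4],
                       [0, 1, 4, 3]], dtype=float)
p_w_z, p_z_d, log_liks = plsa_em(toy_counts, n_topics=2)
```

The product `p_w_z @ p_z_d` gives the model's estimate of p(w|d) for every term and document, which is the sense in which pLSA decomposes the (column-normalized) count matrix. The dense three-way array keeps the sketch short but would not scale to a realistic vocabulary.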

Context

This concept has the prerequisites:

Goals

  • Understand the difference between pLSA and LSA
    • Why is pLSA considered a statistical model while LSA is not?
    • What objective function does pLSA maximize in order to determine the decomposition?
  • How would a trained pLSA model handle new documents? (see Blei et al.'s LDA paper)

Core resources (read/watch one of the following)

-Free-

Bayesian Reasoning and Machine Learning
A textbook for a graduate machine learning course.
Author: David Barber
Additional dependencies:
  • Expectation-Maximization algorithm
Other notes:
  • presents the expectation-maximization algorithm, which is the standard technique for learning the matrix decomposition
Probabilistic Latent Semantic Indexing
Author: Thomas Hofmann
Other notes:
  • You can skim section 3 if you're not familiar with the expectation-maximization algorithm

Supplemental resources (the following are optional, but you may find them useful)

-Free-

Latent Dirichlet Allocation (2003)
The research paper that introduced latent Dirichlet allocation.
Authors: David Blei, Andrew Ng, Michael I. Jordan
Other notes:
  • points out some of the weaknesses of pLSA

See also