latent semantic analysis

(60 minutes to learn)

Summary

Latent Semantic Analysis (LSA), or Latent Semantic Indexing (LSI), is a statistical technique typically used for analyzing relationships between a set of documents and the terms they contain. At its core, LSA performs singular value decomposition (SVD) on a term-by-document count matrix of a corpus and interprets the SVD factors as the "topics" of the documents. We can then use these resulting factors (topics) to determine the document-document, document-term, and term-term similarities in the given corpus.
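The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the four-document toy corpus, the choice of k = 2 retained factors, and the helper name cosine are all assumptions made for the example, and raw counts are used without the weighting schemes (such as tf-idf) that are often applied in practice.

```python
import numpy as np

# Hypothetical toy corpus: documents 0-1 are about pets, 2-3 about finance.
docs = [
    "cat dog pet",
    "dog pet cat cat",
    "stock market trade",
    "market trade stock stock stock",
]

# Build the term-by-document count matrix X (rows = terms, columns = documents).
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[index[w], j] += 1

# Singular value decomposition: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values; each retained factor plays the role of a "topic".
k = 2  # assumed number of topics for this toy corpus
term_vecs = U[:, :k] * s[:k]     # one k-dimensional row per term
doc_vecs = Vt[:k, :].T * s[:k]   # one k-dimensional row per document

def cosine(a, b):
    """Cosine similarity between two vectors in the reduced topic space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_same_topic = cosine(doc_vecs[0], doc_vecs[1])   # two pet documents
sim_diff_topic = cosine(doc_vecs[0], doc_vecs[2])   # pet vs. finance document
sim_terms = cosine(term_vecs[index["cat"]], term_vecs[index["dog"]])

print(sim_same_topic, sim_diff_topic, sim_terms)
```

Because the two groups of documents share no vocabulary in this toy corpus, the two retained factors separate them cleanly: documents and terms from the same group end up nearly parallel in the reduced space, while those from different groups are nearly orthogonal. Real corpora are far less tidy, and the number of factors to keep is a modelling choice.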

Context

This concept has the prerequisites:

Core resources (read/watch one of the following)

-Free-

Introduction to Information Retrieval
A textbook on information retrieval techniques.
Location: Section 18.4
Authors: Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
Other notes:
  • this section focuses on LSA from an information retrieval perspective, where it is referred to as Latent Semantic Indexing (LSI)
An Introduction to Latent Semantic Analysis
Authors: Thomas Landauer, Peter Foltz, Darrell Laham
Other notes:
  • a full understanding of SVD is not needed for this introduction

Supplemental resources (the following are optional, but you may find them useful)

See also

  • LSA has an odd interpretation when viewed as a probabilistic model. Probabilistic LSA is a similar model with a more principled probabilistic interpretation.
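  • LSA is closely related to principal component analysis (PCA) applied to a term-by-document count matrix. The two coincide when the matrix is mean-centered first; LSA typically applies SVD to the uncentered counts.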
  • LSA is commonly used in information retrieval. There, it's referred to as latent semantic indexing.