latent semantic analysis
(60 minutes to learn)
Summary
Latent Semantic Analysis (LSA), or Latent Semantic Indexing (LSI), is a statistical technique typically used for analyzing relationships between a set of documents and the terms they contain. At its core, LSA performs singular value decomposition (SVD) on a term-by-document count matrix of a corpus and interprets the SVD factors as the "topics" of the documents. We can then use these resulting factors (topics) to determine the document-document, document-term, and term-term similarities in the given corpus.
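Below is a minimal NumPy sketch of the pipeline just described: take a term-by-document count matrix, compute a truncated SVD, and compare documents in the resulting low-dimensional "topic" space. The 4-term-by-3-document toy counts, the vocabulary, and the choice of k = 2 topics are illustrative assumptions, not taken from any of the resources below.

```python
# A minimal LSA sketch with NumPy (toy 4-term x 3-document count matrix).
import numpy as np

# Rows = terms, columns = documents (term-by-document count matrix).
X = np.array([
    [2, 0, 1],   # "ship"
    [1, 0, 0],   # "boat"
    [0, 2, 1],   # "tree"
    [0, 1, 2],   # "forest"
], dtype=float)

# Full SVD, then keep the top-k singular values/vectors as the "topics".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Represent each document in the k-dimensional latent (topic) space:
# each column of diag(s_k) @ Vt_k is one document.
doc_vecs = (np.diag(s_k) @ Vt_k).T          # shape: (num_docs, k)

# Document-document similarity = cosine similarity in the latent space.
unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
doc_sims = unit @ unit.T
print(np.round(doc_sims, 3))
```

Term-term similarities can be computed the same way using the rows of U_k scaled by the singular values; scaling by the singular values is one common convention, chosen here so that dot products in the latent space approximate those of the original count matrix.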
Context
This concept has the prerequisites:
- singular value decomposition (The SVD is one step of LSA.)
Core resources (read/watch one of the following)
-Free-
→ Introduction to Information Retrieval
A textbook on information retrieval techniques.
Location:
Section 18.4
Other notes:
- this section focuses on LSA from an information retrieval perspective, where it is referred to as Latent Semantic Indexing (LSI)
→ An Introduction to Latent Semantic Analysis
Supplemental resources (the following are optional, but you may find them useful)
-Free-
→ Wikipedia
See also
- LSA has an awkward interpretation when viewed as a probabilistic model (it implicitly assumes Gaussian noise on count data). Probabilistic LSA is a similar model with a more principled probabilistic interpretation.
- LSA is equivalent to applying principal component analysis (PCA) to a term-by-document count matrix, without mean-centering the data.
- LSA is commonly used in information retrieval, where it is referred to as latent semantic indexing (LSI); a small retrieval sketch follows below.
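To illustrate the LSI use case, the sketch below folds a query into the latent space via q_k = Σ_k^{-1} U_k^T q and ranks documents by cosine similarity. It reuses the toy term-by-document matrix from the earlier sketch; the query "ship boat" and the vocabulary order are assumed for illustration.

```python
# LSI-style retrieval sketch: fold a query into the latent space, then rank documents.
import numpy as np

X = np.array([[2, 0, 1],    # "ship"
              [1, 0, 0],    # "boat"
              [0, 2, 1],    # "tree"
              [0, 1, 2]],   # "forest"
             dtype=float)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Query "ship boat" as a term-count vector over the same vocabulary.
q = np.array([1, 1, 0, 0], dtype=float)

# Fold the query into the k-dimensional space: q_k = Sigma_k^{-1} U_k^T q.
q_k = np.diag(1.0 / s_k) @ U_k.T @ q

# Document coordinates in the latent space (rows of V_k), one row per document.
doc_vecs = Vt_k.T

# Rank documents by cosine similarity to the folded-in query (best first).
cos = (doc_vecs @ q_k) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_k))
print(np.argsort(-cos))
```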