Questions tagged [latent-semantic-analysis]

LSA stands for Latent Semantic Analysis, a natural language processing technique which involves analysing the relationships between documents and the terms they contain by producing a set of related concepts.

50 questions
28
votes
3 answers

LSA vs. PCA (document clustering)

I'm investigating various techniques used in document clustering and I would like to clear up some doubts concerning PCA (principal component analysis) and LSA (latent semantic analysis). First thing - what are the differences between them? I know that…
user1315305
13
votes
4 answers

Fast alternatives to the EM algorithm

Are there any speedy alternatives to the EM algorithm for learning models with latent variables (especially pLSA)? I'm okay with sacrificing precision in favor of speed.
10
votes
1 answer

A parallel between LSA and pLSA

In the original pLSA paper, the author, Thomas Hofmann, draws a parallel between the pLSA and LSA data structures that I would like to discuss with you. Background: Taking inspiration from Information Retrieval, suppose we have a collection of $N$…
10
votes
3 answers

K-means on cosine similarities vs. Euclidean distance (LSA)

I am using latent semantic analysis to represent a corpus of documents in lower dimensional space. I want to cluster these documents into two groups using k-means. Several years ago, I did this using Python's gensim and writing my own k-means…
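
A point that often comes up with this kind of question: for vectors scaled to unit length, squared Euclidean distance and cosine similarity are tied together by $\|a - b\|^2 = 2 - 2\cos(a, b)$, so Euclidean k-means on L2-normalized LSA vectors is closely related to clustering by cosine similarity. The following is a minimal sketch that checks the identity numerically; the two vectors are made-up illustration values, not data from the question.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical 3-dimensional LSA document vectors (assumed values).
doc_a = [0.9, -0.2, 0.4]
doc_b = [0.1, 0.8, -0.3]

a_unit, b_unit = l2_normalize(doc_a), l2_normalize(doc_b)

lhs = squared_euclidean(a_unit, b_unit)  # distance after normalization
rhs = 2 - 2 * cosine(doc_a, doc_b)       # same quantity via cosine
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point error. Note that this only relates the distance measures; a dedicated spherical k-means also renormalizes the centroids, which plain k-means does not do.
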
9
votes
1 answer

When to choose PCA vs. LSA/LSI

Question: Are there any general guidelines with respect to the input data characteristics, that can be used to decide between applying PCA versus LSA/LSI? Brief summary of PCA vs. LSA/LSI: Principal Component Analysis (PCA) and Latent Semantic…
qi5d02lx
8
votes
2 answers

Is it ok to get negative Cosine Similarity using LSA?

I am getting a negative cosine similarity value between two documents in Latent Semantic Analysis. How should it be treated?
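
To illustrate how this can happen even when every raw count is non-negative, here is a small sketch using NumPy's SVD. The 6-term by 3-document count matrix is a made-up toy example (an assumption, not data from the question), and scaling the document coordinates by the singular values is one common convention among several.

```python
import numpy as np

# Hypothetical term-document counts: rows are terms, columns are documents.
# Documents 0 and 1 share no terms, so their raw cosine similarity is 0.
A = np.array([
    [2, 0, 0],
    [1, 0, 1],
    [0, 2, 0],
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two 1-D arrays."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions kept
# One row per document, coordinates scaled by the singular values.
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T

raw_sim = cosine(A[:, 0], A[:, 1])
lsa_sim = cosine(doc_vectors[0], doc_vectors[1])
print(raw_sim, lsa_sim)
```

In this toy case the raw similarity is zero while the rank-2 LSA similarity comes out slightly negative: apart from the first, the singular vectors of a non-negative matrix carry mixed signs, and truncation drops the components that would otherwise cancel them.
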
6
votes
2 answers

Deriving mathematical model of pLSA

After learning how LSA works, I went on to read about pLSA but couldn't really make sense of the mathematical formula. This is what I get from Wikipedia (other academic papers/tutorials show a similar form): \begin{align} P(w,d) & = \sum_{c} P(c)…
6
votes
1 answer

How to cluster LDA/LSI topics generated by gensim?

I'm an enthusiastic single developer working on a small start-up idea. I reduced a corpus of mine to an LSA/LDA vector space using gensim. Now I have a bunch of topics hanging around and I am not sure how to cluster the corpus documents. I see that…
5
votes
2 answers

Latent Dirichlet Allocation vs. pLSA

In the original LDA paper it is stated that: The parameters for a k-topic pLSI model are k multinomial distributions of size V and M mixtures over the k hidden topics. This gives kV + kM parameters and therefore linear growth in M. The linear…
5
votes
2 answers

What is a "tempered EM algorithm"?

In Hofmann's paper on Probabilistic Latent Semantic Analysis, the author fits the model for the document $\times$ word matrix using the EM algorithm in section 3. I was able to follow the derivation and meaning of the model derived in it. However in…
Learner
4
votes
1 answer

Computing document similarity in latent semantic analysis

I have a question regarding Latent Semantic Analysis - after performing SVD of the term-document matrix and choosing some number of dimensions, I get a set of new document vectors. Now, how can I calculate similarity between two…
user1315305
4
votes
0 answers

Finding similarity between a reference and few working documents

I have to find the similarity between a reference document and a set of documents in a repository. Here is my method: 1. I find the term-document matrix for all the documents including the reference document. 2. The SVD is calculated for this…
siddharth
4
votes
1 answer

pLSA - Probabilistic Latent Semantic Analysis, how to choose topic number?

I am learning about pLSA (Probabilistic Latent Semantic Analysis) right now, in the hopes of being able to apply it to biomolecular annotation prediction. I have a very simple question: How do you choose the number of topics / classes to use in the…
4
votes
1 answer

Latent Semantic Indexing and Data Centering

In PCA it's common to center the data, i.e. preprocess the data matrix such that the columns have zero mean. PCA can be done via SVD, but in this case the data matrix also has to be mean-centered. If we don't center it, the found principal…
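
The role of centering can be checked numerically. The sketch below, which assumes NumPy and uses synthetic 2-D data invented for illustration, compares the leading right singular vector of the centered and uncentered data matrix against the leading eigenvector of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 2 features, high variance along the
# first axis, with the whole cloud shifted far from the origin.
X = rng.normal(size=(200, 2)) @ np.diag([3.0, 0.5]) + np.array([10.0, 10.0])

# Reference: first principal axis from the covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
pc1_cov = eigvecs[:, -1]

# SVD of the centered and of the uncentered data matrix.
Xc = X - X.mean(axis=0)
_, _, Vt_centered = np.linalg.svd(Xc, full_matrices=False)
_, _, Vt_uncentered = np.linalg.svd(X, full_matrices=False)

# Absolute cosine between each leading singular vector and pc1_cov
# (absolute value because the sign of a singular vector is arbitrary).
align_centered = abs(float(Vt_centered[0] @ pc1_cov))
align_uncentered = abs(float(Vt_uncentered[0] @ pc1_cov))
print(align_centered, align_uncentered)
```

With centering, the leading singular vector coincides with the first principal axis. Without it, the leading direction is pulled toward the data mean, which is the behaviour the question is asking about in the LSI setting, where term-document matrices are usually left uncentered to preserve sparsity.
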
3
votes
0 answers

Application of LSA/LSI; Is it common to include the use of an edit distance?

I have been using Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI) to identify whether different email addresses belong to the same individual by matching on the names used for each email address. An email address represents a…
ErikKou