
I have come across Latent Semantic Analysis, but I could not understand it.

  1. Can humans use Latent Semantic Analysis to cluster datasets? For convenience, assume the datasets are two-dimensional. Can humans cluster such datasets manually, without the help of clustering algorithms?
  2. How can clustering algorithms use Latent Semantic Analysis?
Alexey Grigorev
Ramseyl

1 Answer


Latent Semantic Analysis (LSA) is a dimension reduction technique for text mining, akin to Principal Component Analysis. The input to LSA is a term-document matrix (often reweighted using tf-idf), so each document is represented by a bag-of-words count of the terms that appear in it (I'm going to call this a vector). LSA assigns each document a set of "loadings" on the topics (the reduced dimensions).

You could also cluster the documents' term vectors directly with a clustering algorithm such as k-means. The difference between clustering and LSA is that a clustering algorithm assigns each document to one specific "cluster", while LSA assigns a set of topic loadings to each document. For example, a clustering algorithm might be able to separate documents about "medicine" from documents about "sports" using the term vector of each document, but it would do a bad job of assigning documents about "sports medicine" or "sports injuries" to the correct cluster, since those documents are about multiple topics. LSA (or LDA, or another topic modeling algorithm) would in theory show that certain documents (such as the "sports injuries" documents) have loadings on both the "sports" and "medicine" topics.
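To make the idea of loadings concrete, here is a minimal sketch using only NumPy. The toy corpus and all variable names are hypothetical, and it uses raw counts rather than tf-idf to keep things short. It is one plausible way to carry out the reduction, not a reference implementation.

```python
import numpy as np

# Hypothetical toy corpus: two "medicine" documents, two "sports"
# documents, and one mixed "sports injuries" document.
docs = [
    "doctor patient treatment hospital doctor",
    "patient hospital treatment medicine",
    "match ball game team match",
    "team game ball player",
    "player injury treatment game doctor",
]

# Build the document-term count matrix (rows: documents, columns: terms).
vocab = sorted({w for d in docs for w in d.split()})
index = {w: j for j, w in enumerate(vocab)}
X = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w in d.split():
        X[i, index[w]] += 1

# LSA step: singular value decomposition, keeping only the k largest
# singular values. k = 2 is an arbitrary choice for this example.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_loadings = U[:, :k] * s[:k]   # k topic loadings per document
term_loadings = Vt[:k, :]         # word loadings that define each topic

print(np.round(doc_loadings, 2))
```

Each row of `doc_loadings` is a short dense vector, in contrast to the single cluster label that k-means would give each document. A clustering algorithm can then be run on these rows instead of on the raw term vectors, which is one way clustering and LSA get combined.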

Andrew Cassidy
  • You could also say that LSA transforms a document from word features (often very high dimensional and sparse) to topic features (often low dimensional and dense). Documents from the same topic are expected to be similar with respect to, for example, the cosine metric. – Vladislavs Dovgalecs Mar 26 '15 at 17:31
  • What do you mean by "topic loadings"? – user697911 Nov 26 '16 at 06:04
  • @user697911 each reduced dimension is made up of different loadings from each word. – Andrew Cassidy Nov 27 '16 at 00:02
  • The sports topic, for example, might have high loadings on words such as "play", "match", "ball", "game", ... Each document then has a set of loadings on the different topics. – Andrew Cassidy Nov 27 '16 at 00:02
  • Does each topic have a name, as you mentioned "sports"? Is a 'topic' in LSA the same as a 'cluster'? – user697911 Nov 27 '16 at 05:18
  • @user697911 no, a topic does not have an explicit name. The name comes from an interpretation of the different words. A topic in LSA is not at all like a cluster. A cluster is a set of points. A point can have loadings for multiple topics. If all of this is foreign to you, I'd recommend reading Wikipedia on clustering and dimension reduction. LSA is a dimension reduction technique. Also, if my post is of help, you should upvote. – Andrew Cassidy Nov 28 '16 at 02:20
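
The points raised in the comments can be sketched in a few lines: topics are unnamed dimensions that a person interprets from their highest-loading words, and documents can be compared by cosine similarity in the reduced space. The count matrix and term list below are made up for illustration, and `cosine` is a hypothetical helper, not a library function.

```python
import numpy as np

terms = ["ball", "doctor", "game", "match", "patient", "treatment"]
# Hypothetical document-term counts (rows: documents, columns: terms).
X = np.array([
    [0, 2, 0, 0, 1, 1],   # medicine
    [0, 1, 0, 0, 2, 1],   # medicine
    [2, 0, 1, 1, 0, 0],   # sports
    [1, 0, 2, 1, 0, 0],   # sports
    [1, 1, 1, 0, 0, 1],   # sports injuries
], dtype=float)

k = 2  # number of latent dimensions kept (arbitrary for this example)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
docs_k = U[:, :k] * s[:k]   # dense k-dimensional document vectors

# Words with the largest absolute loading on each latent dimension.
# A human reads these lists to give the dimension an informal name.
for t in range(k):
    top = np.argsort(-np.abs(Vt[t]))[:3]
    print(f"dimension {t}:", [terms[j] for j in top])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pairwise cosine similarity between documents in the reduced space.
sim = np.array([[cosine(a, b) for b in docs_k] for a in docs_k])
print(np.round(sim, 2))
```

The printed word lists are all LSA provides. Any label such as "sports" is an interpretation added afterwards, and the sign of a loading is arbitrary, which is why the sketch ranks words by absolute value.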