
The input to a TF-IDF vectorizer is a list of text documents; the output is a matrix containing a numerical value for each (word, document) pair. How can I use that matrix to perform feature selection, i.e. reduce the number of features, given that the matrix is very sparse and has 2000+ unique words (terms)?

So I would like to ask about the following ideas:

  • Can I apply a TF-IDF threshold, so that a word is kept if its TF-IDF vector across all documents exceeds the threshold, and removed otherwise? If this idea is valid, can I select the threshold based on the mean value of each word's vector? Can someone give me a suggestion about this?

  • I have read many text-mining books that mention TF, IDF, and TF-IDF as suitable algorithms for selecting a subset of features from unlabeled documents. Can someone give me some ideas about using these methods?

  • Finally, can I use NMF (non-negative matrix factorization) or SVD to find an optimal number of features, and then use those features?

Thanks for any suggestions.

azifallail

1 Answer


Note that this has some overlap with an earlier, somewhat similar question, where I suggested grouping the words in the TF-IDF matrix by their covariance and selecting the most frequent word in each group as the best feature.

Typical approaches just take the $n$ most frequent words (or some top fraction $x$), which you can do, as you suggest, after various forms of TF-IDF scaling of those word frequencies. While spectral analysis and clustering (e.g., of word embeddings instead of TF-IDF values, then selecting the most central word in each cluster) have indeed been suggested recently (2012-2016) to improve unsupervised word feature selection, they are not very common, and they are far more complex to set up than a quick TF-IDF-ranked frequency filter.

As to measuring the "correct" choice of $n$ (or $x$): if all your work is unsupervised, you can only measure intrinsic correctness (cf. model perplexity); otherwise, you need to evaluate your unsupervised results against some supervised task with a simple setup, e.g., as is common practice when evaluating word embeddings.
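The supervised-evaluation route could be sketched like this: sweep over candidate values of $n$ and score each feature set on a simple downstream classification task with cross-validation (the labeled corpus, the candidate values, and the logistic-regression proxy task are all hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny made-up labeled corpus standing in for a real evaluation task (0 = spam-like).
docs = [
    "cheap pills buy now limited offer",
    "buy cheap watches best offer today",
    "win money now cheap prize offer",
    "meeting agenda for the project review",
    "please review the attached project report",
    "project meeting moved to friday afternoon",
]
labels = [0, 0, 0, 1, 1, 1]

# Score each candidate vocabulary size n on the proxy task.
for n in (5, 20, 100):
    X = TfidfVectorizer(max_features=n).fit_transform(docs)
    scores = cross_val_score(LogisticRegression(), X, labels, cv=3)
    print(f"n={n}: mean CV accuracy {scores.mean():.2f}")
```

The $n$ with the best (or good-enough) cross-validated score is then a defensible choice, at least with respect to that task.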

fnl