
Typically, I see people calculate the perplexity/likelihood of a topic model against a held-out set of documents, but is this always necessary/appropriate?

I'm thinking in particular of a use case in which we're modeling a full corpus, with no plans to use the model for prediction on new documents. In other words, the corpus is complete and self-contained, and the topic model is only for exploration and dimensionality reduction on the original corpus.

At least intuitively, this seems to be a case where overfitting isn't really possible/meaningful, so isn't it best to calculate the model likelihood/perplexity against the full training set of documents, and try to minimize perplexity (equivalently, maximize likelihood) on that set?
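
For reference, the held-out evaluation being questioned here usually looks something like the following sketch, assuming gensim's `LdaModel` and tokenized documents in `train_texts` / `test_texts` (names and settings are illustrative, not from the question):

```python
# Sketch of held-out perplexity evaluation for an LDA topic model (illustrative).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# `train_texts` and `test_texts` are assumed to be lists of tokenized documents.
dictionary = Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(doc) for doc in train_texts]
test_corpus = [dictionary.doc2bow(doc) for doc in test_texts]

lda = LdaModel(corpus=train_corpus, id2word=dictionary,
               num_topics=50, passes=10, random_state=0)

# log_perplexity returns a per-word variational lower bound (base 2);
# perplexity = 2 ** (-bound), and lower perplexity is better.
bound = lda.log_perplexity(test_corpus)
perplexity = 2 ** (-bound)
```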

moustachio
    *Why* do you model? If you are not predicting on a holdout sample, how do you guard against overfitting? – Stephan Kolassa Nov 16 '15 at 14:59
  • Well I guess this is fairly conceptual, but the whole question hinges on precisely that. We're modeling a complete corpus, for the purposes of dimensionality reduction and topic exploration. My question boils down to, if we have the complete data we want to model (not a sample), then is "overfitting" really bad? Is it even possible? Overfitting is typically an issue where you have a sample of some larger population, and worry about fitting noise in the sample you're using. But if you have the full population, is overfitting really possible? – moustachio Nov 16 '15 at 15:02
  • 2
    Overfitting does not just mean your model is not generalizable. Overfitted models are often based on spurious relationships between your variables. You mention dimensionality reduction. Wouldn't you want to know that the reduced dimensions are meaningful? If not, what is the purpose of the dimensionality reduction? – Erik Nov 16 '15 at 15:19
  • See [Does it make sense to compute confidence intervals and to test hypotheses when data from whole population is available?](http://stats.stackexchange.com/q/68886/17230), [Statistical inference when the sample “is” the population](http://stats.stackexchange.com/q/2628/17230), & [Machine learning on big data: capability of generalization](http://stats.stackexchange.com/q/70019/17230). – Scortchi - Reinstate Monica Nov 16 '15 at 15:23
  • Ok, these are all helpful pointers. I see that I should use the traditional holdout method, then minimize perplexity on the test set. The issue this raises for my use case is that I need to know the topic weights for *all* words in my corpus, but when training on a sample I'm not guaranteed to get that (i.e. relatively uncommon words might never make it into the model if they only occur in test documents). Given this, is it appropriate to use the holdout method to pick my "best" number of topics, then retrain a new model on the complete corpus to ensure I get the topic weights for all terms? – moustachio Nov 16 '15 at 15:44
  • If you are really only interested in describing patterns in this population, & shun any form of inference, prediction, or generalization, then you don't need to worry about over-fitting. (But it's rather common for people to *say* they just want to describe & then go on to infer, &c.) – Scortchi - Reinstate Monica Nov 16 '15 at 15:48
  • My ultimate goal with the model actually focuses on being able to determine similarity between arbitrary pairs of terms in the corpus (incidentally, see my [other question here](https://stats.stackexchange.com/questions/181889/calculation-of-word-word-similarity-in-an-lda-topic-model)), so perhaps what I describe in my previous comment is a good compromise. Use the hold-out method to pick my number of topics, then train a model on all documents with that number of topics to ensure I have good data for all unique terms. – moustachio Nov 16 '15 at 15:54
  • @moustachio: It's probably worth asking about *how to* rather than *whether to* validate your model as a separate question. – Scortchi - Reinstate Monica Nov 20 '15 at 13:37
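
A minimal sketch of the compromise discussed in the comments above: select the number of topics on a held-out split, then refit on the complete corpus so every term (including rare ones) receives topic weights. It again assumes gensim; the variable names (`all_texts`, `train_texts`, `test_texts`) and candidate topic counts are illustrative:

```python
# Sketch: choose num_topics on a held-out split, then retrain on the full corpus.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# `all_texts` is the complete corpus as tokenized documents;
# `train_texts` / `test_texts` are a split of it.
# Building the dictionary on the full corpus keeps rare, test-only terms.
dictionary = Dictionary(all_texts)
train_corpus = [dictionary.doc2bow(doc) for doc in train_texts]
test_corpus = [dictionary.doc2bow(doc) for doc in test_texts]
full_corpus = [dictionary.doc2bow(doc) for doc in all_texts]

def heldout_perplexity(num_topics):
    lda = LdaModel(corpus=train_corpus, id2word=dictionary,
                   num_topics=num_topics, passes=10, random_state=0)
    return 2 ** (-lda.log_perplexity(test_corpus))

# Pick the candidate topic count with the lowest held-out perplexity.
best_k = min([10, 25, 50, 100], key=heldout_perplexity)

# Refit on the complete corpus so all terms get topic weights.
final_model = LdaModel(corpus=full_corpus, id2word=dictionary,
                       num_topics=best_k, passes=10, random_state=0)
topic_term_weights = final_model.get_topics()  # shape: (best_k, vocabulary size)
```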

1 Answer


"but is this always necessary/appropriate?"

No, this is not always necessary. Many papers, e.g. "Improving Topic Models with Latent Feature Word Representations", use topic coherence, document clustering, document classification, and/or information retrieval performance to compare topic models, rather than computing perplexity on held-out data.
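
For instance, topic coherence can be computed directly on the full corpus, with no held-out split; here is a minimal sketch using gensim's `CoherenceModel` (the `c_v` measure and variable names are illustrative choices, not taken from the cited paper):

```python
# Sketch: compare topic models by coherence instead of held-out perplexity.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# `texts` is assumed to be the full corpus as a list of tokenized documents.
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

for k in (10, 25, 50):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, random_state=0)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence='c_v').get_coherence()
    print(k, coherence)  # higher coherence generally indicates better topics
```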

NQD