
In Hofmann's paper on Probabilistic Latent Semantic Analysis, the author fits the model to the document $\times$ word matrix via the EM algorithm in Section 3. I was able to follow the derivation and the meaning of the model.

However, in a later section, the author mentions Tempered EM for improving generalization. Could anyone explain the Tempered EM algorithm, or point me to a reference where I can understand what it actually means?

Learner
  • Also found another paper by the same author on the [Tempered EM algorithm](http://www.cs.pitt.edu/~milos/courses/cs3750/Readings/Link-analysis/Hofmann-SIGIR99.pdf) – Learner May 03 '11 at 15:13
  • Looks like the one that was suggested to you on [MetaOptimize](http://metaoptimize.com/qa/questions/5673/tempered-em), right? – chl May 04 '11 at 08:40
  • @chl: yep. But I still haven't found the right material for it. – Learner May 04 '11 at 08:48

2 Answers


I found an answer via Google in a UTexas paper. As I suspected from the name, it introduces a temperature that decreases à la simulated annealing, which changes the E step of the algorithm slightly.
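As a rough illustration, here is a minimal NumPy sketch of how a temperature $\beta$ can enter the pLSA E step: the likelihood terms are raised to the power $\beta$ before normalizing. The function and array names are my own assumptions, not from the paper.

```python
import numpy as np

def tempered_e_step(p_z, p_d_given_z, p_w_given_z, beta):
    """Tempered E step for pLSA (sketch).

    Responsibilities are computed as
        P(z | d, w)  ∝  P(z) * [P(d | z) * P(w | z)]**beta
    so beta = 1 recovers the ordinary E step, and smaller beta
    flattens (smooths) the posterior over the topics z.

    Assumed shapes: p_z (K,), p_d_given_z (K, D), p_w_given_z (K, W).
    Returns responsibilities of shape (K, D, W).
    """
    # Likelihood term for every (z, d, w), raised to the temperature beta.
    lik = (p_d_given_z[:, :, None] * p_w_given_z[:, None, :]) ** beta
    post = p_z[:, None, None] * lik
    # Normalize over z for each (d, w) pair.
    post /= post.sum(axis=0, keepdims=True)
    return post
```

If I recall Hofmann's papers correctly, $\beta$ is started at 1 and decreased whenever performance on held-out data degrades, rather than being annealed upward as in classical simulated annealing.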

Wayne
  • Unfortunately the UTexas Paper link is dead now - does anyone know the title of the paper referred to here? – Dylan Hogg Dec 18 '17 at 06:54
  • "On the equivalence between Non-negative Matrix Factorization and Probabilistic Latent Semantic Indexing" by Chris Ding, Tao Li, Wei Peng – Rob Romijnders Oct 04 '19 at 09:40

For those still seeking an answer in 2019:

Tempered EM has aims orthogonal to those of annealing methods. In annealing methods, the temperature scaling helps the optimization find an optimum faster or more reliably; in tempered EM, the temperature scaling provides a trade-off between overfitting and underfitting the likelihood terms.

We can see the trade-off between under- and overfitting as follows:

The Helmholtz free energy comprises two terms:

$$J(q) = \beta \cdot (\text{expected negative log posterior}) - (\text{entropy}) = \beta\, \mathbb{E}_q[-\log p(x)] - H[q]$$

1) The expected negative log posterior pushes towards an approximation $q$ under which the data are fitted extremely well.

2) The entropy term pushes towards an approximation $q$ with high entropy. High entropy can be interpreted as lots of uncertainty: a broad, diffuse distribution with a large typical set.
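A worked step that may make this concrete (a standard variational argument, not from the original papers; read $p(x)$ above as the joint $p(x, z)$ over data and latents): minimizing $J(q)$ over distributions $q(z)$ gives a tempered posterior,

$$q^*(z) \;\propto\; \exp\!\big(-\beta\,[-\log p(x, z)]\big) \;=\; p(x, z)^{\beta},$$

so $\beta = 1$ recovers the exact posterior, while $\beta \to 0$ flattens $q^*$ towards the uniform, maximum-entropy distribution.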

Tempered EM now provides a trade-off between these two terms. For $\beta$ close to 1, the approximation can follow the likelihood terms very closely, potentially overfitting. For $\beta$ close to 0, the approximation will have high entropy and will not follow the likelihood terms, thereby potentially underfitting.
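A tiny numerical illustration of this trade-off (made-up numbers, not from the paper): raising the same likelihood terms to different temperatures $\beta$ and renormalizing shows how the posterior moves between sharp (follows the likelihood, low entropy) and flat (high entropy).

```python
import numpy as np

# Made-up likelihood terms p(x | z) for three latent states z.
lik = np.array([0.70, 0.25, 0.05])

for beta in [1.0, 0.5, 0.1, 0.0]:
    q = lik ** beta                # temper the likelihood terms
    q /= q.sum()                   # renormalize to a distribution
    entropy = -(q * np.log(q)).sum()
    print(f"beta={beta:.1f}  q={np.round(q, 3)}  H[q]={entropy:.3f}")

# beta = 1.0 follows the likelihood closely (low entropy, can overfit);
# beta -> 0.0 tends to the uniform distribution (high entropy, underfits).
```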