
The perplexity formula in the official t-SNE paper is not the same as the one in its implementation.

In the implementation (MATLAB):

% Function that computes the Gaussian kernel values given a vector of
% squared Euclidean distances, and the precision of the Gaussian kernel.
% The function also computes the perplexity of the distribution.
function [H, P] = Hbeta(D, beta)
    % Here D is a single row of the squared Euclidean distance matrix.
    P = exp(-D * beta);
    sumP = sum(P);
    H = log(sumP) + beta * sum(D .* P) / sumP; %<==== PERPLEXITY FORMULA
    P = P / sumP;
end

However, in the paper, perplexity is defined as:

$$ \mathrm{Perp}(P_i) = 2^{H(P_i)} $$ $$ H(P_i) = -\sum _j p_{j|i} \log_2{p_{j|i}} $$

I understand they must be equivalent. The question is: how can this equivalence be shown?
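
A quick numerical check with arbitrary values of my own (just a sanity check, not from the t-SNE code) suggests that they do agree; what I am looking for is how to show it algebraically:

D = [0.5 1.2 3.0 7.5];                % one row of squared distances (arbitrary example values)
beta = 0.8;                           % arbitrary precision
[H, P] = Hbeta(D, beta);              % H from the implementation (natural log)
Perp_code  = exp(H)                   % perplexity implied by the code
Perp_paper = 2^(-sum(P .* log2(P)))   % perplexity as defined in the paper; same value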

  • Note that in the code, `sumP` is not necessarily unity, which means that until the very end, `P` does not contain probabilities. (It is strange that `P` is not immediately normalized, because that would simplify the formula for `H` and immediately answer your question, I believe.) – whuber Mar 11 '19 at 15:45
  • @whuber I suspect the reason to compute it in this roundabout way was to prevent possible numerical problems -- see my answer. – amoeba Mar 26 '19 at 10:06
  • Hi ElegantLogic! Did my answer make sense to you? If it resolved your question then please mark it as "accepted" by clicking on a tick sign to the left of it. Otherwise feel free to ask for clarifications. – amoeba Apr 01 '19 at 21:41

1 Answer


For a given $i$, the similarities $p_{j|i}$ are obtained from the squared Euclidean distances $d_{ij}$ via a Gaussian similarity kernel and then normalized to sum to one: $$p_{j|i} = \frac{\exp(-\beta d_{ij})}{\sum_k \exp(-\beta d_{ik})}.$$ Plugging this into the entropy formula, we obtain:
\begin{align}
H(p_{\cdot|i}) &= -\sum_j p_{j|i} \log p_{j|i} = -\sum_j \frac{\exp(-\beta d_{ij})}{\sum_k \exp(-\beta d_{ik})} \log \frac{\exp(-\beta d_{ij})}{\sum_k \exp(-\beta d_{ik})}\\
&=-\frac{1}{\sum_k \exp(-\beta d_{ik})}\sum_j \exp(-\beta d_{ij}) \big[\log \exp(-\beta d_{ij}) - \log\sum_k \exp(-\beta d_{ik})\big]\\
&=\frac{1}{\sum_k \exp(-\beta d_{ik})}\sum_j \exp(-\beta d_{ij}) \big[\beta d_{ij} + \log\sum_k \exp(-\beta d_{ik})\big]\\
&=\frac{\beta\sum_j d_{ij} \exp(-\beta d_{ij})}{\sum_k \exp(-\beta d_{ik})} + \frac{\sum_j \exp(-\beta d_{ij})}{\sum_k \exp(-\beta d_{ik})}\log\sum_k \exp(-\beta d_{ik})\\
&=\frac{\beta\sum_j d_{ij} \exp(-\beta d_{ij})}{\sum_k \exp(-\beta d_{ik})} + \log\sum_j \exp(-\beta d_{ij}).
\end{align}

This is what is computed by

P = exp(-D * beta);
sumP = sum(P);
H = log(sumP) + beta * sum(D .* P) / sumP;
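
Term by term: `log(sumP)` is $\log\sum_j \exp(-\beta d_{ij})$, and `beta * sum(D .* P) / sumP` is $\beta\sum_j d_{ij} \exp(-\beta d_{ij}) \,/ \sum_k \exp(-\beta d_{ik})$, since at this point in the code `P` still holds the unnormalized kernel values $\exp(-\beta d_{ij})$.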

There are two questions left. First, shouldn't we have used $\log_2$ and not $\log$? It turns out this does not matter, because $$2^{-\sum a_i\log_2 a_i} = \prod a_i^{-a_i} = e^{-\sum a_i \log a_i},$$ so one can use any power/logarithm to define perplexity. The paper uses the binary logarithm, while the MATLAB code uses the natural logarithm.
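
For example, with an arbitrary probability vector both choices give the same perplexity:

a = [0.2 0.3 0.5];              % any probability vector (arbitrary example values)
2^(-sum(a .* log2(a)))          % perplexity via the binary logarithm
exp(-sum(a .* log(a)))          % perplexity via the natural logarithm; identical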

Second, why wasn't this function written simply as

P = exp(-D * beta);
sumP = sum(P);
P = P / sumP;
H = -sum(P .* log(P));

I think it was done this way to avoid possible numerical problems due to very small values of $p_{j|i}$. As written, there is no need to explicitly compute log(P) (which would have a large absolute value) and then to multiply P (very close to zero) by log(P) (very far from zero).
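
Here is a small illustration of the kind of problem this avoids, with made-up values of my own chosen so that some kernel values underflow to zero:

D = [0.1 1 50 100];      % made-up squared distances (not from the t-SNE code)
beta = 50;               % large precision: exp(-D*beta) underflows to 0 for the larger distances
P = exp(-D * beta);
sumP = sum(P);
H1 = log(sumP) + beta * sum(D .* P) / sumP   % finite
P = P / sumP;
H2 = -sum(P .* log(P))                       % NaN, because 0 * log(0) terms appear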

amoeba