For a given $i$, the similarities $p_{j|i}$ are obtained from squared Euclidean distances $d_{ij}$ via a Gaussian similarity kernel and then normalized to sum to one: $$p_{j|i} = \frac{\exp(-\beta d_{ij})}{\sum_k \exp(-\beta d_{ik})}.$$ Plugging this into the entropy formula, we obtain:
\begin{align}
H(p_{\cdot|i}) &= -\sum_j p_{j|i} \log p_{j|i}=
-\sum_j \frac{\exp(-\beta d_{ij})}{\sum_k \exp(-\beta d_{ik})} \log \frac{\exp(-\beta d_{ij})}{\sum_k \exp(-\beta d_{ik})}\\
&=-\frac{1}{\sum_k \exp(-\beta d_{ik})}\sum_j \exp(-\beta d_{ij}) \big[\log \exp(-\beta d_{ij}) - \log\sum_k \exp(-\beta d_{ik})\big]\\
&=\frac{1}{\sum_k \exp(-\beta d_{ik})}\sum_j \exp(-\beta d_{ij}) \big[\beta d_{ij} + \log\sum_k \exp(-\beta d_{ik})\big]\\
&=\frac{\beta\sum_j d_{ij} \exp(-\beta d_{ij})}{\sum_k \exp(-\beta d_{ik})} + \frac{\sum_j \exp(-\beta d_{ij}) \log\sum_k \exp(-\beta d_{ik})}{\sum_k \exp(-\beta d_{ik})}\\
&=\frac{\beta\sum_j d_{ij} \exp(-\beta d_{ij})}{\sum_k \exp(-\beta d_{ik})} + \log\sum_k \exp(-\beta d_{ik}).
\end{align}
This is what is computed by
P = exp(-D * beta);                         % unnormalized similarities exp(-beta * d_ij)
sumP = sum(P);                              % normalizing constant sum_k exp(-beta * d_ik)
H = log(sumP) + beta * sum(D .* P) / sumP;  % entropy in the form derived above
There are two questions left. First, shouldn't we have used $\log_2$ rather than $\log$? It turns out this does not matter, because $$2^{-\sum_i a_i\log_2 a_i} = \prod_i a_i^{-a_i} = e^{-\sum_i a_i \log a_i},$$ so any base can be used for the logarithm, as long as the same base is used for the power defining the perplexity. The paper uses the binary logarithm while the Matlab code uses the natural logarithm.
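As a quick sanity check of this identity (Octave/Matlab, with an arbitrary made-up distribution):
a = [0.1 0.2 0.3 0.4];        % any distribution summing to one
H2 = -sum(a .* log2(a));      % entropy in bits
He = -sum(a .* log(a));       % entropy in nats
disp([2^H2, exp(He)])         % both give the same perplexity
The two printed values coincide, so the choice of base has no effect on the perplexity.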
Second, why wasn't this function written simply as
P = exp(-D * beta);
sumP = sum(P);
P = P / sumP;
H = -sum(P .* log(P));
I think it was a way to avoid possible numerical problems due to very small values of $p_{j|i}$. As written, there is no need to explicitly compute log(P) (which would have large absolute value) and then to multiply P (very close to zero) by log(P) (very far from zero).
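To make this concrete, here is a minimal sketch (Octave/Matlab, with made-up distances and beta) of the failure mode that the original formulation presumably sidesteps: once some exp(-beta * d_ij) underflow to exactly zero, the normalized P contains zeros, log(P) contains -Inf, and P .* log(P) yields 0 * (-Inf) = NaN, while the original expression stays finite.
D = [0 1 2 3]; beta = 1;                     % well-scaled made-up distances
P = exp(-D * beta); sumP = sum(P);
H1 = log(sumP) + beta * sum(D .* P) / sumP;  % formulation used in the code
Pn = P / sumP;
H2 = -sum(Pn .* log(Pn));                    % "simple" formulation
disp([H1, H2])                               % identical up to rounding
D = [0 1 1000]; beta = 50;                   % now exp(-50000) underflows to zero
P = exp(-D * beta); sumP = sum(P);
H1 = log(sumP) + beta * sum(D .* P) / sumP;  % still finite
Pn = P / sumP;
H2 = -sum(Pn .* log(Pn));                    % NaN, because 0 * log(0) = NaN
disp([H1, H2])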