I am trying to understand different forms of loss functions, and I get confused by the terms cross-entropy loss and negative log-likelihood loss. I have seen the following two definitions (both terms are used for both formulas). The first one is from the paper Improved Knowledge Graph Embedding using Background Taxonomic Information (Fatemi et al.), in the "Objective Function and Training" section:
- $-\sum_{n=1}^{N}\{ t_n\log(y_n) + (1 - t_n)\log (1-y_n) \}$, where $t_n$ is the label (either 0 or 1) and $y_n$ is the predicted probability
and the second one is from the paper Low-Dimensional Hyperbolic Knowledge Graph Embeddings (Chami et al.), Equation 11:
- $\sum_{n=1}^{N} \log(1+\exp(-t_ny_n))$, where $t_n$ is the label (but this time either -1 or 1) and $y_n$ is *not* a probability but a similarity score (a distance-based score in this case)
I have changed the notation so that it is the same for both formulas. Apparently, as stated in the following post, cross-entropy loss and negative log-likelihood are equivalent. But are the above two formulas the same? I don't think so. Where is the difference? Is it that one uses probabilities and the other just a similarity score?
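For what it's worth, here is a small sketch I used to compare the two formulas numerically. I'm assuming (possibly incorrectly) that the probability in the first formula is obtained by applying a sigmoid to the raw score used in the second formula; the scores and labels are just made-up toy values:

```python
import numpy as np

# Toy raw scores y_n and labels (hypothetical values, just for comparison)
scores = np.array([2.0, -1.0, 0.5])   # raw similarity scores for formula 2
labels_pm1 = np.array([1, -1, 1])     # labels in {-1, 1} for formula 2
labels_01 = (labels_pm1 + 1) // 2     # same labels mapped to {0, 1} for formula 1

# Formula 2: sum_n log(1 + exp(-t_n * y_n)), applied directly to the raw scores
loss_logistic = np.sum(np.log1p(np.exp(-labels_pm1 * scores)))

# Formula 1: binary cross-entropy, using sigmoid(score) as the probability y_n
probs = 1.0 / (1.0 + np.exp(-scores))
loss_bce = -np.sum(labels_01 * np.log(probs) + (1 - labels_01) * np.log(1 - probs))

print(loss_logistic, loss_bce)  # the two values coincide under this sigmoid assumption
```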
Thank you for your help! I really appreciate it! :D
(There are quite a few papers that use these two formulas; the two above are just examples.)