Questions tagged [cross-entropy]

A measure of the difference between two probability distributions for a given random variable or set of events.

In information theory, the cross-entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution $q$, rather than the true distribution $p$.

The cross-entropy of the distribution $q$ relative to a distribution $p$ over a given set is defined as follows: $$ H(p,q)=-\mathbb{E}_p(\log q) $$ where $\mathbb{E}_p(\cdot)$ is the expected value operator with respect to the distribution $p$.

Source: Wikipedia.
Excerpt source: Brownlee, "A Gentle Introduction to Cross-Entropy for Machine Learning" (2019).
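As a minimal illustration of this definition (not taken from any question below; the distributions are made up), the NumPy sketch computes $H(p,q)$ and checks Gibbs' inequality $H(p,q) \ge H(p)$:

```python
import numpy as np

# Two made-up discrete distributions over the same three events.
p = np.array([0.5, 0.3, 0.2])   # "true" distribution p
q = np.array([0.4, 0.4, 0.2])   # estimated distribution q used for the coding scheme

# H(p, q) = -E_p[log q] = -sum_x p(x) log q(x)   (in nats; use np.log2 for bits)
cross_entropy = -np.sum(p * np.log(q))
entropy_p = -np.sum(p * np.log(p))

print(cross_entropy, entropy_p)
assert cross_entropy >= entropy_p   # Gibbs' inequality: H(p, q) >= H(p), equal only when q = p
```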

230 questions
113
votes
6 answers

What loss function for multi-class, multi-label classification tasks in neural networks?

I'm training a neural network to classify a set of objects into n classes. Each object can belong to multiple classes at the same time (multi-class, multi-label). I read that for multi-class problems it is generally recommended to use softmax and…
aKzenT • 1,231
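For context, the setup usually suggested for multi-label targets is an independent sigmoid per class with binary cross-entropy averaged over the classes. The sketch below is a minimal NumPy illustration with made-up logits and a multi-hot target, not code from the answers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One example with 4 classes; the object belongs to classes 0 and 2.
logits = np.array([2.0, -1.0, 0.5, -3.0])
targets = np.array([1.0, 0.0, 1.0, 0.0])   # multi-hot target, not one-hot

probs = sigmoid(logits)                     # independent per-class probabilities
eps = 1e-12                                 # guard against log(0)

# Binary cross-entropy for each class, averaged over classes.
bce = -(targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps))
loss = bce.mean()
print(loss)
```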
75
votes
4 answers

What is the difference between cross-entropy and KL divergence?

Both the cross-entropy and the KL divergence are tools to measure the distance between two probability distributions, but what is the difference between them? $$ H(P,Q) = -\sum_x P(x)\log Q(x) $$ $$ KL(P \| Q) = \sum_{x} P(x)\log {\frac{P(x)}{Q(x)}}…
yoyo • 979
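The two quantities are tied together by the identity $H(P,Q) = H(P) + KL(P\,\|\,Q)$; here is a quick NumPy check with made-up distributions:

```python
import numpy as np

# Made-up distributions P and Q over the same three outcomes.
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.5, 0.25, 0.25])

cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)
entropy = -np.sum(p * np.log(p))         # H(P)
kl = np.sum(p * np.log(p / q))           # KL(P || Q)

# Cross-entropy = entropy of P + the extra nats incurred by coding with Q instead of P.
assert np.isclose(cross_entropy, entropy + kl)
```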
63
votes
4 answers

Should I use a categorical cross-entropy or binary cross-entropy loss for binary predictions?

First of all, I realized that if I need to perform binary predictions, I have to create at least two classes by performing one-hot encoding. Is this correct? However, is binary cross-entropy only for predictions with only one class? If I were to…
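One relevant fact: for two classes, categorical cross-entropy over a 2-way softmax and binary cross-entropy over a single sigmoid give the same loss, assuming the sigmoid is applied to the logit difference. A NumPy sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([0.3, 1.7])            # made-up two-class logits
target_onehot = np.array([0.0, 1.0])     # the example belongs to class 1

# Categorical cross-entropy with a 2-way softmax ...
cat_ce = -np.sum(target_onehot * np.log(softmax(logits)))

# ... equals binary cross-entropy with one sigmoid on the logit difference.
p1 = sigmoid(logits[1] - logits[0])
bin_ce = -np.log(p1)
assert np.isclose(cat_ce, bin_ce)
```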
60
votes
5 answers

Backpropagation with Softmax / Cross Entropy

I'm trying to understand how backpropagation works for a softmax/cross-entropy output layer. The cross entropy error function is $$E(t,o)=-\sum_j t_j \log o_j$$ with $t_j$ and $o_j$ as the target and output at neuron $j$, respectively. The sum is over…
micha • 703
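The standard result is that the combined softmax/cross-entropy gradient with respect to the logits is simply $o - t$. The NumPy sketch below verifies this against a finite-difference estimate using made-up logits and a one-hot target:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, t):
    # E(t, o) = -sum_j t_j log o_j with o = softmax(z)
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.2, -1.3, 0.8])   # made-up logits feeding the softmax
t = np.array([0.0, 0.0, 1.0])    # one-hot target

analytic = softmax(z) - t        # combined softmax/cross-entropy gradient dE/dz = o - t

# Central finite differences as an independent check.
numeric = np.zeros_like(z)
h = 1e-6
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = h
    numeric[j] = (loss(z + dz, t) - loss(z - dz, t)) / (2 * h)

assert np.allclose(analytic, numeric, atol=1e-5)
```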
58
votes
4 answers

Cross Entropy vs. Sparse Cross Entropy: When to use one over the other

I am playing with convolutional neural networks using Keras+Tensorflow to classify categorical data. I have a choice of two loss functions: categorical_crossentropy and sparse_categorical_crossentropy. I have a good intuition about the…
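The practical difference is only in how the labels are encoded: categorical_crossentropy expects one-hot targets, while sparse_categorical_crossentropy expects integer class indices, and both compute the same quantity. A minimal NumPy illustration with made-up numbers:

```python
import numpy as np

probs = np.array([0.1, 0.7, 0.2])      # softmax output for one example (made-up)

# categorical_crossentropy expects a one-hot target ...
onehot = np.array([0.0, 1.0, 0.0])
cat_ce = -np.sum(onehot * np.log(probs))

# ... while sparse_categorical_crossentropy expects the integer class index.
label = 1
sparse_ce = -np.log(probs[label])

assert np.isclose(cat_ce, sparse_ce)   # same loss value, different label encoding
```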
57
votes
1 answer

Why do we use Kullback-Leibler divergence rather than cross entropy in the t-SNE objective function?

In my mind, the KL divergence from the sample distribution to the true distribution is simply the difference between cross-entropy and entropy. Why do we use cross-entropy as the cost function in many machine learning models, but use Kullback-Leibler…
JimSpark • 673
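Part of the usual answer: in t-SNE the distribution $P$ is fixed, so $KL(P\,\|\,Q)$ and $H(P,Q)$ differ only by the constant $H(P)$ and are minimized by the same $Q$. A small NumPy check with randomly generated, made-up distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / x.sum()

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = normalize(rng.random(5) + 0.1)       # fixed high-dimensional similarities P (made-up)
entropy_p = -np.sum(p * np.log(p))       # H(P), a constant once P is fixed

# For any candidate Q, H(P, Q) and KL(P || Q) differ only by H(P),
# so they share the same minimizer and the same gradient in Q.
for _ in range(3):
    q = normalize(rng.random(5) + 0.1)
    assert np.isclose(cross_entropy(p, q) - kl(p, q), entropy_p)
```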
44
votes
3 answers

Dice-coefficient loss function vs cross-entropy

When training a pixel segmentation neural network, such as a fully convolutional network, how do you make the decision to use the cross-entropy loss function versus Dice-coefficient loss function? I realize this is a short question, but not quite…
Christian • 1,382
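For reference, the two losses can be computed side by side. The NumPy sketch below evaluates pixel-wise binary cross-entropy and a soft Dice loss on a made-up 2×3 probability map and mask (an illustration, not code from the answers):

```python
import numpy as np

# Made-up 2x3 foreground-probability map and binary ground-truth mask.
pred = np.array([[0.9, 0.8, 0.2],
                 [0.1, 0.7, 0.3]])
mask = np.array([[1.0, 1.0, 0.0],
                 [0.0, 1.0, 0.0]])

eps = 1e-7

# Pixel-wise binary cross-entropy, averaged over all pixels (including true negatives).
bce = -np.mean(mask * np.log(pred + eps) + (1 - mask) * np.log(1 - pred + eps))

# Soft Dice loss: 1 - 2|X ∩ Y| / (|X| + |Y|), computed on probabilities;
# it only measures foreground overlap, which matters under class imbalance.
intersection = np.sum(pred * mask)
dice = 1.0 - (2.0 * intersection + eps) / (np.sum(pred) + np.sum(mask) + eps)

print(bce, dice)
```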
39
votes
2 answers

Why is mean squared error the cross-entropy between the empirical distribution and a Gaussian model?

In Section 5.5 of Deep Learning (by Ian Goodfellow, Yoshua Bengio and Aaron Courville), it is stated that "Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability…
Mufei Li • 553
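A quick numerical check of the claim, assuming a Gaussian model with fixed unit variance and made-up targets and predictions: the average negative log-likelihood is half the MSE plus a constant, so the two losses share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10)        # made-up targets
y_hat = rng.normal(size=10)    # made-up model predictions

# Average negative log-likelihood of y under N(y_hat, 1),
# i.e. a Gaussian model with fixed unit variance.
nll = np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (y - y_hat) ** 2)

mse = np.mean((y - y_hat) ** 2)

# The NLL is half the MSE plus a constant that does not depend on y_hat,
# so minimizing one minimizes the other.
assert np.isclose(nll, 0.5 * mse + 0.5 * np.log(2 * np.pi))
```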
28
votes
2 answers

Loss function for autoencoders

I am experimenting a bit with autoencoders, and with TensorFlow I created a model that tries to reconstruct the MNIST dataset. My network is very simple: X, e1, e2, d1, Y, where e1 and e2 are the encoding layers, d1 and Y are the decoding layers (and Y is the…
AkiRoss • 465
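For intuition, the two common reconstruction losses for inputs scaled to $[0,1]$ can be compared directly; a NumPy sketch with a made-up MNIST-like vector and a noisy "reconstruction":

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)                                                     # made-up image in [0, 1]
x_hat = np.clip(x + rng.normal(scale=0.05, size=784), 1e-6, 1 - 1e-6)   # noisy "reconstruction"

# Mean squared error treats each pixel as an unbounded real value.
mse = np.mean((x - x_hat) ** 2)

# Per-pixel binary cross-entropy treats each pixel as a Bernoulli parameter;
# it is only sensible when inputs and reconstructions live in [0, 1].
bce = -np.mean(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

print(mse, bce)
```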
20
votes
6 answers

Tensorflow Cross Entropy for Regression?

Does the cross-entropy cost make sense in the context of regression (as opposed to classification)? If so, could you give a toy example through TensorFlow, and if not, why not? I was reading about cross-entropy in Neural Networks and Deep Learning by…
JacKeown • 628
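One way cross-entropy does get used for regression is to discretize the target range into bins and treat the problem as classification over the bins. A minimal NumPy sketch of the idea with made-up numbers (not TensorFlow code):

```python
import numpy as np

# Discretize the target range into 10 bins and classify the bin (made-up numbers).
bins = np.linspace(0.0, 10.0, 11)      # bin edges covering the target range
y = 3.7                                # continuous regression target
label = np.digitize(y, bins) - 1       # index of the bin containing y

probs = np.full(10, 0.02)              # model's predicted distribution over bins
probs[label] = 0.82                    # most of the mass on the correct bin
probs = probs / probs.sum()

loss = -np.log(probs[label])           # ordinary categorical cross-entropy on the bins
print(label, loss)
```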
16
votes
4 answers

How meaningful is the connection between MLE and cross entropy in deep learning?

I understand that given a set of $m$ independent observations $\mathbb{O}=\{\mathbf{o}^{(1)}, . . . , \mathbf{o}^{(m)}\}$ the Maximum Likelihood Estimator (or, equivalently, the MAP with flat/uniform prior) that identifies the parameters…
orome • 368
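The core of the connection can be checked numerically: the average negative log-likelihood of a sample equals the cross-entropy between the empirical distribution and the model. A NumPy sketch with a made-up discrete sample and model:

```python
import numpy as np

# A made-up sample of m independent observations over the outcomes {0, 1, 2}.
obs = np.array([0, 1, 1, 2, 1, 0, 1, 2])
m = len(obs)

# A made-up model distribution q_theta over the same outcomes.
q = np.array([0.2, 0.5, 0.3])

# Average negative log-likelihood of the sample under q_theta ...
avg_nll = -np.mean(np.log(q[obs]))

# ... equals the cross-entropy between the empirical distribution p_hat and q_theta,
# so maximizing the likelihood in theta minimizes this cross-entropy.
p_hat = np.bincount(obs, minlength=3) / m
cross_entropy = -np.sum(p_hat * np.log(q))

assert np.isclose(avg_nll, cross_entropy)
```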
15
votes
2 answers

How to construct a cross-entropy loss for general regression targets?

It's common short-hand in neural networks literature to refer to categorical cross-entropy loss as simply "cross-entropy." However, this terminology is ambiguous because different probability distributions have different cross-entropy loss…
Sycorax • 76,417
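One common construction, sketched here under the assumption of a Gaussian model whose mean and log-variance are both predicted (all numbers made up), is to use the negative log-likelihood directly as the cross-entropy loss for a real-valued target:

```python
import numpy as np

# Negative log-likelihood of a real-valued target under a Gaussian whose mean
# and log-variance are both predicted by the network (all numbers made up).
y = 2.3                     # regression target
mu = 2.0                    # predicted mean
log_var = np.log(0.25)      # predicted log-variance (unconstrained parameterization)

var = np.exp(log_var)
nll = 0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var)
print(nll)

# With the variance held fixed, the same expression reduces to squared error
# plus a constant, recovering the usual MSE loss.
```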
15
votes
2 answers

Different definitions of the cross entropy loss function

I started off learning about neural networks with the neuralnetworksanddeeplearning dot com tutorial. In particular, the 3rd chapter has a section about the cross-entropy function, which defines the cross-entropy loss as: $C = -\frac{1}{n}…
Reginald • 153
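The two definitions genuinely differ: the per-neuron binary form also penalizes the probabilities assigned to the wrong classes, while the categorical form only scores the true class. A NumPy comparison on made-up outputs:

```python
import numpy as np

a = np.array([0.7, 0.2, 0.1])   # made-up network outputs for one example
y = np.array([1.0, 0.0, 0.0])   # one-hot target

# Per-neuron binary form (as in the neuralnetworksanddeeplearning tutorial, here n = 1 example):
binary_form = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

# Categorical (softmax-style) form, which only scores the true class:
categorical_form = -np.sum(y * np.log(a))

print(binary_form, categorical_form)   # the binary form also penalizes the other output neurons
```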
14
votes
2 answers

Why do we use the log function for cross-entropy?

I'm learning about a binary classifier. It uses the cross-entropy function as its loss function: $y_i \log p_i + (1-y_i) \log(1-p_i)$. But why does it use the log function? How about just using a linear form as follows: $y_i p_i + (1-y_i)(1-p_i)$? Is there…
Viridisjun • 141
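A quick numerical comparison with made-up probabilities shows why the log matters: the log loss grows without bound on confidently wrong predictions, while the proposed linear form is bounded and treats all errors alike.

```python
import numpy as np

y = 1.0                                 # true label
for p in [0.9, 0.6, 0.1, 0.01]:         # predicted probability of the positive class
    log_loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    linear = y * p + (1 - y) * (1 - p)  # the "linear" alternative from the question
    print(p, round(log_loss, 3), linear)

# The log loss blows up as the prediction becomes confidently wrong (p -> 0 with y = 1),
# while the linear form stays bounded in [0, 1] and its gradient never changes magnitude,
# so confident mistakes are not penalized any harder than mild ones.
```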
13
votes
1 answer

The relationship between maximizing the likelihood and minimizing the cross-entropy

There is a statement that maximizing the likelihood is equivalent to minimizing the cross-entropy. Is there any proof of this statement?
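A standard one-line argument, sketched here (with $\hat{p}$ denoting the empirical distribution that places mass $1/m$ on each of the $m$ observations, matching the notation used above): $$ \hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta \sum_{i=1}^m \log q_\theta\big(\mathbf{o}^{(i)}\big) = \arg\min_\theta \left(-\frac{1}{m}\sum_{i=1}^m \log q_\theta\big(\mathbf{o}^{(i)}\big)\right) = \arg\min_\theta \Big(-\mathbb{E}_{\hat{p}}\big[\log q_\theta\big]\Big) = \arg\min_\theta H(\hat{p}, q_\theta). $$ Rescaling by $-1/m$ does not change the optimizer, which is why the two problems have the same solution.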