
Let's suppose that a neural network is used to map a set of training data to a continuous interval between 0 and 1 using a sigmoid function on its output layer. Is it correct to optimize the model with a log-likelihood loss function such as:

\begin{equation} J(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log\big(h(\theta^Tx_i)\big)+(1-y_i)\log\big(1-h(\theta^Tx_i)\big)\right] \end{equation}

Or does the loss function have to be modified somehow because of the continuous-valued output?
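
For concreteness, here is a minimal NumPy sketch of what I mean (the arrays `X`, `y`, `theta` are just placeholders; the targets are continuous in $[0, 1]$ rather than binary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta, X, y, eps=1e-12):
    """The loss from the equation above: mean negative log-likelihood
    with sigmoid outputs h = sigmoid(X @ theta)."""
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data with continuous targets in [0, 1] instead of hard 0/1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.uniform(0.0, 1.0, size=100)
print(J(np.zeros(3), X, y))   # = log(2) ≈ 0.693 when every output is 0.5
```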

Marcus

2 Answers


The loss function you wrote is the cross-entropy loss, which arises from the assumption that your target values follow a Bernoulli distribution. That means, for your data pairs $(x_i, y_i)$, the likelihood function is $$L(\theta)=\prod_i (h(\theta^T x_i))^{y_i}[1-h(\theta^T x_i)]^{1-y_i}.$$ Taking the negative logarithm of this equation leads to your loss function $J(\theta)$. Minimizing $J(\theta)$ would push the sigmoid outputs away from 0.5 towards either 0 or 1.
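
Spelling that step out in the same notation (up to the $1/N$ factor), the negative log-likelihood is
$$-\log L(\theta) = -\sum_{i=1}^{N}\Big[y_i\log\big(h(\theta^T x_i)\big)+(1-y_i)\log\big(1-h(\theta^T x_i)\big)\Big] = N\,J(\theta),$$
so minimizing $J(\theta)$ is exactly maximum-likelihood estimation under the Bernoulli assumption.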

So, back to your question: if you want your continuous target values to behave like this, just keep the loss. However, if the assumed target distribution is not Bernoulli but more like Gaussian, then you might want to use the mean squared error (MSE) or the mean absolute error (MAE) on the sigmoid outputs instead.
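
If it helps, here is a small NumPy sketch of those alternatives (the targets `y` and raw scores are made up purely for illustration); MSE and MAE are simply computed on the sigmoid outputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Squared / absolute error on the sigmoid outputs, as suggested above for
# Gaussian-like (rather than Bernoulli-like) continuous targets.
def mse(y, h):
    return np.mean((y - h) ** 2)

def mae(y, h):
    return np.mean(np.abs(y - h))

rng = np.random.default_rng(1)
y = rng.uniform(0.0, 1.0, size=1000)     # continuous targets in [0, 1]
h = sigmoid(rng.normal(size=1000))       # sigmoid outputs for some raw scores
print(mse(y, h), mae(y, h))
```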

doubllle

We have a closely related question here

What are the impacts of choosing different loss functions in classification to approximate 0-1 loss

Note that doubllle gave a nice answer from the MLE perspective. On the other hand, in some machine learning literature, people simply treat the logistic loss or the hinge loss as a convex approximation of the 0-1 loss, without any probabilistic interpretation.

In short: in a classification problem we want to minimize the 0-1 loss*. But since the 0-1 loss is hard to minimize, we use the logistic loss as an approximation (see the small sketch after the footnote below).

  • The 0-1 loss is the "misclassification cost": for a given point, the loss is 1 when the prediction is wrong and 0 when the prediction is right.
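
A small numeric sketch of that approximation, assuming labels in $\{-1, +1\}$ and writing both losses as functions of the margin $z = y f(x)$ (the base-2 scaling is only there so the two curves meet at $z = 0$):

```python
import numpy as np

margins = np.linspace(-3.0, 3.0, 7)                 # margin z = y * f(x)

zero_one = (margins <= 0).astype(float)             # 0-1 loss: 1 if misclassified, else 0
logistic = np.log1p(np.exp(-margins)) / np.log(2)   # base-2 logistic loss, equals 1 at z = 0

for z, l01, llog in zip(margins, zero_one, logistic):
    print(f"z = {z:+.1f}   0-1 loss: {l01:.0f}   logistic loss: {llog:.3f}")
```

The base-2 logistic loss is a convex upper bound on the 0-1 loss, which is what makes it a convenient surrogate to minimize.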
Haitao Du