
Let's suppose that a neural network is used to map a set of training data to a continuous interval between 0 and 1 using a sigmoid function on its output layer. Is it correct to optimize the model with a log-likelihood loss function such as:

\begin{equation} J(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log\big(h(\theta^Tx_i)\big)+(1-y_i)\log\big(1-h(\theta^Tx_i)\big)\right] \end{equation}

Or does the loss function have to be modified somehow because of the continuous-valued output?
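
For concreteness, here is a minimal NumPy sketch of what I mean (the arrays `X`, `y`, `theta` are just placeholders; the targets are continuous in $[0, 1]$ rather than binary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta, X, y, eps=1e-12):
    """The loss from the equation above: mean negative log-likelihood
    with sigmoid outputs h = sigmoid(X @ theta)."""
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data with continuous targets in [0, 1] instead of hard 0/1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.uniform(0.0, 1.0, size=100)
print(J(np.zeros(3), X, y))   # = log(2) ≈ 0.693 when every output is 0.5
```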

Marcus

2 Answers


The loss function you wrote is the cross-entropy loss, which arises from the assumption that your target values follow a Bernoulli distribution. That means, for your data pairs $(x_i, y_i)$, the likelihood function is $$L(\theta)=\prod_i (h(\theta^T x_i))^{y_i}[1-h(\theta^T x_i)]^{1-y_i}.$$ Taking the negative logarithm of this equation leads to your loss function $J(\theta)$. Minimizing $J(\theta)$ would push the sigmoid outputs away from 0.5 towards either 0 or 1.
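
Spelling that step out in the same notation (up to the $1/N$ factor), the negative log-likelihood is
$$-\log L(\theta) = -\sum_{i=1}^{N}\Big[y_i\log\big(h(\theta^T x_i)\big)+(1-y_i)\log\big(1-h(\theta^T x_i)\big)\Big] = N\,J(\theta),$$
so minimizing $J(\theta)$ is exactly maximum-likelihood estimation under the Bernoulli assumption.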

So, back to your question: if you want your continuous target values to behave like this, just keep the loss. However, if the assumed target distribution is not Bernoulli but more like Gaussian, then you might want to use the mean squared error (MSE) or the mean absolute error (MAE) on the sigmoid outputs instead.
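
If it helps, here is a small NumPy sketch of those alternatives (the targets `y` and raw scores are made up purely for illustration); MSE and MAE are simply computed on the sigmoid outputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Squared / absolute error on the sigmoid outputs, as suggested above for
# Gaussian-like (rather than Bernoulli-like) continuous targets.
def mse(y, h):
    return np.mean((y - h) ** 2)

def mae(y, h):
    return np.mean(np.abs(y - h))

rng = np.random.default_rng(1)
y = rng.uniform(0.0, 1.0, size=1000)     # continuous targets in [0, 1]
h = sigmoid(rng.normal(size=1000))       # sigmoid outputs for some raw scores
print(mse(y, h), mae(y, h))
```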

doubllle

We have a closely related question here

What are the impacts of choosing different loss functions in classification to approximate 0-1 loss

Note that doubllle gave a nice answer from the MLE perspective. On the other hand, in some machine learning literature, people simply treat the logistic loss or the hinge loss as a convex approximation of the 0-1 loss, without any probabilistic interpretation.

In short: in a classification problem we want to minimize the 0-1 loss*. But since the 0-1 loss is hard to minimize, we use the logistic loss as an approximation (see the small sketch after the footnote below).

  • The 0-1 loss is the "misclassification cost": for a given point, the loss is 1 when the prediction is wrong and 0 when the prediction is right.
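
A small numeric sketch of that approximation, assuming labels in $\{-1, +1\}$ and writing both losses as functions of the margin $z = y f(x)$ (the base-2 scaling is only there so the two curves meet at $z = 0$):

```python
import numpy as np

margins = np.linspace(-3.0, 3.0, 7)                 # margin z = y * f(x)

zero_one = (margins <= 0).astype(float)             # 0-1 loss: 1 if misclassified, else 0
logistic = np.log1p(np.exp(-margins)) / np.log(2)   # base-2 logistic loss, equals 1 at z = 0

for z, l01, llog in zip(margins, zero_one, logistic):
    print(f"z = {z:+.1f}   0-1 loss: {l01:.0f}   logistic loss: {llog:.3f}")
```

The base-2 logistic loss is a convex upper bound on the 0-1 loss, which is what makes it a convenient surrogate to minimize.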
Haitao Du