Loss function for semantic segmentation?

Question

Apologies for any misuse of technical terms. I am working on a semantic segmentation project using convolutional neural networks (CNNs). I am trying to implement an encoder-decoder architecture, so the output is the same size as the input.

How do you design the labels? What loss function should one apply, especially under heavy class imbalance (where the ratio between the classes varies from image to image)?

The problem deals with two classes (objects of interest and background). I am using Keras with the TensorFlow backend.

So far, I have designed the expected outputs to have the same dimensions as the input images, with pixel-wise labels. The final layer of the model has either a softmax activation (for 2 classes) or a sigmoid activation (to express the probability that a pixel belongs to the object class). I am having trouble designing a suitable objective function for this task, of the form:

function(y_pred,y_true),

in agreement with Keras.

Please try to be specific about the dimensions of the tensors involved (input/output of the model). Any thoughts and suggestions are much appreciated.

Have a read of this: https://arxiv.org/pdf/1511.00561.pdf "We use the cross-entropy loss as the objective function for training the network." — tea_pea, May 14 '17 at 15:42

score 8 · Answer 1 · answered Aug 30 '17 at 09:58

Cross entropy is definitely the way to go. I don't know Keras but TF has this: https://www.tensorflow.org/api_docs/python/tf/nn/sigmoid_cross_entropy_with_logits
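To make the tensor shapes concrete, here is a minimal NumPy sketch of pixel-wise binary cross-entropy. It assumes a single sigmoid output channel, so both the mask and the prediction have shape (batch, height, width, 1). The function name and the toy data are illustrative only. In Keras, the built-in `binary_crossentropy` loss covers this case, and a custom loss would be written with tensor ops and called as `loss(y_true, y_pred)`.

```python
import numpy as np

def pixelwise_bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all pixels.

    y_true: 0/1 mask, shape (batch, height, width, 1)
    y_pred: sigmoid probabilities, same shape as y_true
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_pixel = -(y_true * np.log(y_pred)
                  + (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_pixel.mean()

# Hypothetical toy batch: 2 images of 4x4 pixels, one output channel.
y_true = np.zeros((2, 4, 4, 1))
y_true[:, 1:3, 1:3, 0] = 1.0                 # a small object in each image

y_uninformed = np.full_like(y_true, 0.5)     # predicts 0.5 for every pixel
y_good = np.where(y_true == 1.0, 0.9, 0.1)   # mostly correct prediction

loss_uninformed = pixelwise_bce(y_true, y_uninformed)
loss_good = pixelwise_bce(y_true, y_good)
print(loss_uninformed, loss_good)
```

The uninformed prediction scores log(2) per pixel, and the loss drops as the predicted probabilities move towards the mask.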

Here is a paper directly implementing this: Fully Convolutional Networks for Semantic Segmentation by Shelhamer et al.

The U-Net paper is also a very successful implementation of the idea, using skip connections to avoid loss of spatial resolution. You can find many implementations of it online.

From my personal experience, you might want to start with a simple encoder-decoder network first, but without strides larger than 1 (keep strides=1); otherwise you lose a lot of resolution because the upsampling is not perfect. Go with small kernel sizes. I don't know your specific application, but even a network with 2-3 hidden layers can give very good results. Use 32-64 channels at each layer. Start simple (2 hidden layers, 32 channels each, 3x3 kernels, stride=1) and change one parameter at a time to see its effect. For starters, always keep the dimensions equal to the input dimensions to avoid resolution loss. Afterwards you can switch on strides and upsampling and implement ideas like U-Net. U-Net works extremely well for medical image segmentation.
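As a rough sketch of that starting point (2 hidden layers, 32 channels, 3x3 kernels, stride 1), something like the following could work with `tf.keras`. The function name, the 64x64x3 input size, the `padding="same"` setting (used here to keep the spatial size unchanged), and the 1x1 sigmoid output layer are my assumptions, not part of the recipe above.

```python
import tensorflow as tf

def build_simple_segmenter(height=64, width=64, channels=3, filters=32):
    """Small fully convolutional model whose output matches the input size."""
    inputs = tf.keras.Input(shape=(height, width, channels))
    # Two hidden layers; stride 1 and "same" padding preserve height and width.
    x = tf.keras.layers.Conv2D(filters, 3, strides=1, padding="same",
                               activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(filters, 3, strides=1, padding="same",
                               activation="relu")(x)
    # One sigmoid channel: per-pixel probability of the object class.
    outputs = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)

model = build_simple_segmenter()
model.compile(optimizer="adam", loss="binary_crossentropy")
print(model.output_shape)  # (batch, height, width, 1)
```

With this layout the labels are 0/1 masks of shape (batch, height, width, 1), matching the model output.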

For class imbalance, see https://swarbrickjones.wordpress.com/2017/03/28/cross-entropy-and-training-test-class-imbalance/. The idea there is to weight the different classes with $\alpha$ and $\beta$ parameters.
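One way such a weighting could look, as a NumPy sketch: here I assume $\alpha$ scales the object-class term and $\beta$ the background term of the binary cross-entropy. That assignment, the function name, and the toy data are my own illustration and are not taken from the linked post.

```python
import numpy as np

def weighted_bce(y_true, y_pred, alpha=1.0, beta=1.0, eps=1e-7):
    """Class-weighted binary cross-entropy, averaged over all pixels.

    alpha weights the object term (y_true == 1),
    beta weights the background term (y_true == 0).
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_pixel = -(alpha * y_true * np.log(y_pred)
                  + beta * (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_pixel.mean()

# Hypothetical 10x10 image with a single object pixel (1% of the pixels).
y_true = np.zeros((1, 10, 10, 1))
y_true[0, 0, 0, 0] = 1.0

# A model that predicts "background" everywhere misses the object entirely.
y_all_background = np.full_like(y_true, 0.01)

n_object = y_true.sum()
n_background = y_true.size - n_object
alpha = n_background / n_object   # 99.0: up-weight the rare class

loss_plain = weighted_bce(y_true, y_all_background)
loss_weighted = weighted_bce(y_true, y_all_background, alpha=alpha)
print(loss_plain, loss_weighted)
```

Without weights, ignoring the object costs very little. With the object class up-weighted, the same prediction is penalised much more heavily.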

I'm not an expert in this domain, but shouldn't classes be exclusive in this setting? If yes wouldn't the softmax loss be the better option? https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits — Harald Thomson, Feb 21 '18 at 07:17
@HaraldThomson, Segmentation is a binary problem. Many people use softmax for binary problems, but it's completely unnecessary and overkill. Instead of having two output nodes, have one output node that represents P(y=1), then use cross-entropy. — Ricardo Magalhães Cruz, Mar 13 '18 at 16:10

score 2 · Answer 2 · answered Oct 09 '18 at 08:17

Use a weighted Dice loss and a weighted cross-entropy loss. Dice loss is very good for segmentation. A good starting point for the weights is the inverse class frequencies, i.e. take a sample of, say, 50-100 images, find the mean number of pixels belonging to each class, and set that class's weight to 1/mean. You may have to implement Dice yourself, but it's simple. Additionally, you can look into inverse Dice loss and focal loss.
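A minimal NumPy sketch of both pieces, under my own assumptions: the Dice part is the common unweighted "soft Dice" form with a smoothing constant, not a weighted variant, and the helper names are hypothetical. The weight helper follows the 1/mean recipe above and assumes both classes occur in the sampled masks.

```python
import numpy as np

def soft_dice_loss(y_true, y_pred, smooth=1.0):
    """1 - Dice coefficient, computed per image and averaged over the batch.

    y_true, y_pred: shape (batch, height, width, 1), values in [0, 1].
    """
    axes = (1, 2, 3)  # sum over everything except the batch axis
    intersection = (y_true * y_pred).sum(axis=axes)
    denom = y_true.sum(axis=axes) + y_pred.sum(axis=axes)
    dice = (2.0 * intersection + smooth) / (denom + smooth)
    return (1.0 - dice).mean()

def inverse_frequency_weights(masks):
    """Return (w_background, w_object) as 1 / mean pixel count per class.

    masks: sample of 0/1 masks, shape (n, height, width, 1).
    Assumes both classes occur somewhere in the sample.
    """
    mean_object = masks.sum(axis=(1, 2, 3)).mean()
    mean_background = (1.0 - masks).sum(axis=(1, 2, 3)).mean()
    return 1.0 / mean_background, 1.0 / mean_object

# Hypothetical toy masks: 2 images of 4x4 pixels, 4 object pixels each.
y_true = np.zeros((2, 4, 4, 1))
y_true[:, 1:3, 1:3, 0] = 1.0

loss_perfect = soft_dice_loss(y_true, y_true)                # perfect overlap
loss_empty = soft_dice_loss(y_true, np.zeros_like(y_true))   # object missed
w_background, w_object = inverse_frequency_weights(y_true)
print(loss_perfect, loss_empty, w_background, w_object)
```

Because Dice is a ratio of overlap to total object size, it does not depend on how many background pixels there are, which is why it tends to cope with imbalance.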

score -1 · Answer 3 · answered Feb 07 '17 at 19:20

Let me be more specific at first, and then more general. I apologize if I misunderstand you.

I think you are talking about needing an autoencoder neural network because you mentioned encode and decode, and you mentioned that the input size is the same as the output size. If so, then your loss function is based on reproducing the input vector while also compressing the data into a shorter vector in the middle hidden layer. Choices would be to minimize the mean squared error (for regression) or the log loss or misclassification rate (for classification). However, CNNs are not something I have seen used in an autoencoder, but I do think it would be both possible and useful to do so in cases where translational invariance is important, such as edge and object detection in images.

More generally, you seem to be building a very complex machine learning model since you mentioned CNNs. CNNs and other deep learning models are some of the most complex machine learning models that exist.

Choosing dimensions, labels, and loss functions is more like elementary machine learning however. I think you might be in over your head with deep learning. Did you take a class on plain old machine learning first?

Is this even necessary? For example, see [Pixon method](http://www.adass.org/adass/proceedings/adass98/puetterrc/). — Carl, Feb 08 '17 at 02:57
"CNNs and other deep learning models are some of the most complex machine learning models that exist.". I tend to disagree. The model in itself may be complex, but they are actually incredibly simple to use with very little theoretical understanding. That is the reason for the whole hype about DL: little theory, easy-to-write models and very high accuracies... — , Aug 30 '17 at 09:40
