I am training a neural network for multilabel classification with a large number of classes (1000), which means more than one output can be active for each input. On average, I have two classes active per output frame. When training with a cross-entropy loss, the neural network resorts to outputting only zeros, because it gets the least loss with this output since 99.8% of my labels are zeros. Any suggestions on how I can push the network to give more weight to the positive classes?
-
What are you using as software? Python + Keras? – Tommaso Guerrini Feb 10 '17 at 14:49
-
Btw: 99.8% is just a number; an average error of 0.2% corresponds to 0.002*1000, so 2 wrong labels per training instance on average. Are you using categorical_crossentropy or binary_crossentropy with sigmoids on the last layer? – Tommaso Guerrini Feb 10 '17 at 14:52
-
@TommasoGuerrini I used Python + Keras, sigmoid and binary_crossentropy. Now testing with categorical_crossentropy; the network is outputting values closer to 1, but the loss is too high for now. Waiting to see how it trains over more epochs. – Yakku Feb 10 '17 at 15:14
-
@TommasoGuerrini I did not understand the purpose of the callback. – Yakku Feb 10 '17 at 15:27
-
My bad, it was just an example of which loss value makes sense – Tommaso Guerrini Feb 10 '17 at 15:29
-
@TommasoGuerrini just FYI, I got a loss of less than 0.01 in just 3 epochs with binary crossentropy, and it stays around 0.01 forever. – Yakku Feb 10 '17 at 15:37
-
How many training instances do you have? Batch size? – Tommaso Guerrini Feb 10 '17 at 15:42
-
200,000 instances; I tried batch sizes of 8 and 64 and can't go beyond that due to memory constraints. The network has approximately the same number of parameters as there are instances. – Yakku Feb 10 '17 at 15:51
-
*inputsize* = $2*10^5$, right? Hmm, you may want to look for someone with more expertise than me. I can only think of Dropout to speed up training with so many parameters, or creating a custom loss function where you assign weights according to the class distribution (it doesn't solve the *all-zeros* problem, but it may help) – Tommaso Guerrini Feb 10 '17 at 16:05
-
You may try sparse_categorical_crossentropy. By the way: when training, don't just look at the loss function, also look at the binary_accuracy. I have a case similar to yours, and using mean squared error as the loss function I obtained a better binary accuracy than with binary logloss :) – Tommaso Guerrini Feb 10 '17 at 16:10
-
@TommasoGuerrini I have a multilabel loss function which I calculate for every epoch. I could not convert it to the Keras format, so I can't use it for backpropagation. Though the MSE loss was 0.01, my metric was really high; that's how I figured out the network was outputting only zeros in order to reduce the MSE. – Yakku Feb 10 '17 at 16:13
-
Post the function and I'll try to convert it to the Keras format for you – Tommaso Guerrini Feb 10 '17 at 16:15
-
Ah thanks, but it needs some external data to measure the loss, so I'd need to store that in memory and so on; it would need some workaround. I posted the problem here wondering if someone else had faced similar problems and what methods worked for them. – Yakku Feb 10 '17 at 16:25
2 Answers
Tensorflow has a loss function `weighted_cross_entropy_with_logits`, which can be used to give more weight to the 1's. So it should be applicable to a sparse multi-label classification setting like yours.
From the documentation:
This is like sigmoid_cross_entropy_with_logits() except that pos_weight allows one to trade off recall and precision by up- or down-weighting the cost of a positive error relative to a negative error.
The argument `pos_weight` is used as a multiplier for the positive targets.
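For illustration, a tiny self-contained sketch of calling the op directly on logits (the numbers are made up; this uses the TF 1.x keyword `targets`, which the wrapper below also relies on):

import tensorflow as tf

# toy batch: 2 samples, 4 classes, multi-hot targets
targets = tf.constant([[1., 0., 0., 1.],
                       [0., 1., 0., 0.]])
logits = tf.constant([[2.0, -1.0, -2.0, 0.5],
                      [-1.5, 1.0, -0.5, -2.0]])

# per the docs, this computes
#   targets * -log(sigmoid(logits)) * pos_weight
#   + (1 - targets) * -log(1 - sigmoid(logits))
# so pos_weight > 1 penalizes missed positives more than false positives
loss = tf.nn.weighted_cross_entropy_with_logits(targets=targets,
                                                logits=logits,
                                                pos_weight=10.0)
per_sample_loss = tf.reduce_mean(loss, axis=-1)  # one value per sample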
If you use the tensorflow backend in Keras, you can use the loss function like this (Keras 2.1.1):
import tensorflow as tf
import keras.backend.tensorflow_backend as tfb

POS_WEIGHT = 10  # multiplier for positive targets, needs to be tuned

def weighted_binary_crossentropy(target, output):
    """
    Weighted binary crossentropy between an output tensor
    and a target tensor. POS_WEIGHT is used as a multiplier
    for the positive targets.

    Combination of the following functions:
    * keras.losses.binary_crossentropy
    * keras.backend.tensorflow_backend.binary_crossentropy
    * tf.nn.weighted_cross_entropy_with_logits
    """
    # transform back to logits
    _epsilon = tfb._to_tensor(tfb.epsilon(), output.dtype.base_dtype)
    output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
    output = tf.log(output / (1 - output))
    # compute weighted loss
    loss = tf.nn.weighted_cross_entropy_with_logits(targets=target,
                                                    logits=output,
                                                    pos_weight=POS_WEIGHT)
    return tf.reduce_mean(loss, axis=-1)
Then in your model:
model.compile(loss=weighted_binary_crossentropy, ...)
I have not found many resources yet that report values of `pos_weight` that work well in relation to the number of classes, average active classes, etc.
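As a rough starting point (this heuristic is my own suggestion, not something reported in those resources), you can derive `pos_weight` from the ratio of negative to positive entries in the training labels and then tune downwards from there:

import numpy as np

def estimate_pos_weight(y_train):
    """Ratio of negative to positive entries in a multi-hot label matrix."""
    num_pos = y_train.sum()
    num_neg = y_train.size - num_pos
    return num_neg / max(num_pos, 1)

# e.g. with ~2 of 1000 classes active per sample this gives roughly 499,
# which is usually too aggressive -- treat it as an upper bound, not a default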
-
Also, it might be a good idea to evaluate the f-measure in a callback after each epoch when tuning the hyperparameters (such as `pos_weight`). – tobigue Nov 15 '17 at 16:56
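A minimal sketch of such a callback (the validation data, the 0.5 threshold and the use of scikit-learn's `f1_score` are my assumptions):

from keras.callbacks import Callback
from sklearn.metrics import f1_score

class F1Callback(Callback):
    """Prints the micro-averaged F1 on held-out data after every epoch."""

    def __init__(self, x_val, y_val, threshold=0.5):
        super(F1Callback, self).__init__()
        self.x_val = x_val
        self.y_val = y_val
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        # binarize the sigmoid outputs before scoring
        y_pred = (self.model.predict(self.x_val) > self.threshold).astype(int)
        print(' - val_f1: %.4f' % f1_score(self.y_val, y_pred, average='micro'))

It can then be passed via `model.fit(..., callbacks=[F1Callback(x_val, y_val)])`.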
-
Is there a corresponding `weighted_binary_accuracy` metric that can be used for the model as well? – CMCDragonkai Oct 21 '19 at 08:20
-
Lifesaver, but I could also use something like `weighted_binary_accuracy` – David Cian Jun 16 '20 at 17:26
-
You can just use [binary accuracy](https://stackoverflow.com/questions/57331013/custom-keras-binary-crossentropy-loss-function-not-working) actually, unless you really want to weigh the accuracy as well – David Cian Jun 16 '20 at 17:50
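There is no built-in `weighted_binary_accuracy` in Keras as far as I know; a hypothetical sketch of one (the weighting scheme, which reuses the same `POS_WEIGHT` idea as the loss, is my own assumption):

import keras.backend as K

def weighted_binary_accuracy(y_true, y_pred, pos_weight=10.0, threshold=0.5):
    # each positive label counts pos_weight times as much as a negative one
    y_pred_bin = K.cast(K.greater(y_pred, threshold), K.floatx())
    weights = y_true * (pos_weight - 1.0) + 1.0
    correct = K.cast(K.equal(y_true, y_pred_bin), K.floatx())
    return K.sum(weights * correct) / K.sum(weights)

It would be passed as `model.compile(..., metrics=[weighted_binary_accuracy])`.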
-
About the proper values for `pos_weight`: the documentation suggests that any value above 1 increases recall, while any value less than 1 increases precision. – Naveen Reddy Marthala Oct 27 '21 at 12:08
-
I am using tf.keras. I have Dense as my final layer, with the number of units equal to the number of unique labels. Should I use no activation or sigmoid activation in my final layer while using this loss? I shouldn't, correct? – Naveen Reddy Marthala Nov 09 '21 at 07:46
Update for tensorflow 2.6.0:
I was going to write a comment, but there are many things that need to be changed for @tobigue's answer to work, and I am not entirely sure everything in my answer is correct. To make things work:
- You need to replace `import keras.backend.tensorflow_backend as tfb` with `import keras.backend as tfb`.
- The `targets` keyword argument of `tf.nn.weighted_cross_entropy_with_logits` needs to be changed to `labels`.
- `tf.log` needs to be called like this: `tf.math.log`.
- To make this custom loss function work with Keras, you need to import `get_custom_objects` and register the custom loss function: `from keras.utils.generic_utils import get_custom_objects`, and then before you compile the model call `get_custom_objects().update({"weighted_binary_crossentropy": weighted_binary_crossentropy})`.
- I also encountered this error, but it may not be the same for everyone: `TypeError: Input 'y' of 'Mul' Op has type float32 that does not match type int32 of argument 'x'.` To fix it, I converted `target` to `float32` like this: `target = tf.cast(target, tf.float32)`.
So, the final code that I am using is this:
import tensorflow as tf
import keras.backend as tfb
from keras.utils.generic_utils import get_custom_objects

POS_WEIGHT = 10  # multiplier for positive targets, needs to be tuned

def weighted_binary_crossentropy(target, output):
    """
    Weighted binary crossentropy between an output tensor
    and a target tensor. POS_WEIGHT is used as a multiplier
    for the positive targets.

    Combination of the following functions:
    * keras.losses.binary_crossentropy
    * keras.backend.tensorflow_backend.binary_crossentropy
    * tf.nn.weighted_cross_entropy_with_logits
    """
    # transform back to logits
    _epsilon = tfb._to_tensor(tfb.epsilon(), output.dtype.base_dtype)
    output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
    output = tf.math.log(output / (1 - output))
    # compute weighted loss
    target = tf.cast(target, tf.float32)
    loss = tf.nn.weighted_cross_entropy_with_logits(labels=target,
                                                    logits=output,
                                                    pos_weight=POS_WEIGHT)
    return tf.reduce_mean(loss, axis=-1)
Then in your model:
get_custom_objects().update({"weighted_binary_crossentropy": weighted_binary_crossentropy})
model.compile(loss='weighted_binary_crossentropy', ...)
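For completeness, a usage sketch of how this might be wired into a model (the architecture, input size and dummy data below are placeholders, not from the question). Note that the final layer uses a sigmoid, since the loss expects probabilities and converts them back to logits internally:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

num_classes = 1000

model = Sequential([
    Dense(512, activation='relu', input_shape=(4096,)),
    Dense(num_classes, activation='sigmoid'),  # probabilities, as the loss expects
])
model.compile(optimizer='adam',
              loss='weighted_binary_crossentropy',  # resolved via get_custom_objects above
              metrics=['binary_accuracy'])

# dummy multi-hot labels with roughly 2 of 1000 classes active, just to show shapes
x = np.random.rand(64, 4096).astype('float32')
y = (np.random.rand(64, num_classes) < 0.002).astype('float32')
model.fit(x, y, epochs=1, batch_size=8)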

-
I am using tf.keras. I have Dense as my final layer, with the number of units equal to the number of unique labels. Should I use no activation or sigmoid activation in my final layer while using this loss? I shouldn't, correct? – Naveen Reddy Marthala Nov 09 '21 at 07:48