33

I have a neural network set up to predict something where the output variable is ordinal. I will describe below using three possible outputs A < B < C.

It is pretty obvious how to use a neural network to output categorical data: the output is just a softmax of the last (usually fully connected) layer, one unit per category, and the predicted category is the one with the largest output value (this is the default in many popular models). I have been using the same setup for ordinal values. However, in this case the outputs often don't make sense: for example, the network outputs for A and C are high but B is low, which is not plausible for ordinal values.
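For concreteness, here is a minimal sketch of that default categorical setup in Keras (the layer sizes and input dimension are placeholders, not my actual network):

from tensorflow import keras

# Standard multi-class setup: one softmax output per category (A, B, C).
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(10,)),  # placeholder sizes
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# Predicted category = index of the largest of the three outputs.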

I have one idea for this, which is to calculate the loss by comparing the outputs with 1 0 0 for A, 1 1 0 for B, and 1 1 1 for C. The exact thresholds can be tuned later using another classifier (e.g. Bayesian), but this seems to capture the essential idea of an ordering of the outputs without prescribing any specific interval scale.
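A minimal sketch of this cumulative target encoding, trained with independent sigmoid outputs and binary cross-entropy (the helper name, layer sizes, and input dimension are illustrative):

import numpy as np
from tensorflow import keras

# Cumulative targets: A (0) -> [1 0 0], B (1) -> [1 1 0], C (2) -> [1 1 1]
def cumulative_encode(labels, num_classes=3):
    return (np.arange(num_classes)[None, :] <= np.asarray(labels)[:, None]).astype("float32")

print(cumulative_encode([0, 1, 2]))
# [[1. 0. 0.]
#  [1. 1. 0.]
#  [1. 1. 1.]]

# Compare sigmoid outputs against these targets with binary cross-entropy:
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(10,)),  # placeholder sizes
    keras.layers.Dense(3, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")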

What is the standard way of solving this problem? Is there any research or references that describe the pros and cons of different approaches?

Alex I
  • I got lots of interesting hits on Google for "ordinal logistic regression" e.g. [this paper](http://pubsonline.informs.org/doi/pdf/10.1287/serv.3.4.304) – shadowtalker Mar 03 '15 at 01:53
  • @ssdecontrol: Interesting. I tried it; the results were better than picking the one output with the highest value but slightly worse than other methods (naive Bayesian, etc). This is useful, but it doesn't help train the network, only improves results slightly after the fact... or at least I don't see how to make it help train the network. – Alex I Mar 03 '15 at 08:22
  • which "it" did you try? My only point is that the search engine could be more helpful than you might expect – shadowtalker Mar 03 '15 at 08:56
  • Also I'm not sure I understand what you mean by "for example the network outputs for A and C are high but B is low: this is not plausible". You mean you're predicting lots of As and Cs but few Bs? I don't see why that should be implausible unless you have substantive or domain-specific reason to think so – shadowtalker Mar 03 '15 at 09:05
  • I also don't know how you could ever have an output like "1 1 0". I think there's some confusion about terminology here. Are you describing _cumulative_ ordinal outcomes? As in a cumulative logit model? – shadowtalker Mar 03 '15 at 09:11
  • @ssdecontrol: For prediction of categories, the softmax of the outputs of the last layer of the network produces values in 0–1 for each category, and they sum to 1. The single category predicted as most likely is the one with the highest output value. By "the network outputs for A and C are high but B is low" I mean the outputs corresponding to those ordinal values are relatively high, say 0.5, 0.1, 0.4: in other words, the network is saying the ordinal output is likely to be A or C but not B. But B is defined such that everything in C is also in B (they are ordered by set inclusion). – Alex I Mar 03 '15 at 11:44
  • @ssdecontrol: Treating A, B, C as totally independent categorical values, the standard setup is to have one output per category. But usually what is used to calculate/backprop the loss is a comparison of the outputs with a vector in which only one value (the correct category) is 1. – Alex I Mar 03 '15 at 11:46
  • @ssdecontrol: Which "it"? I set up an ordinal logit regression model using the outputs of the network (on the training data set) as the independent variables and the true ordinal value as the dependent variable. I then ran the network outputs from the test data set through the same model. It is much better than picking the category with the highest value, but slightly worse than using a naive Bayesian classifier. It was a good suggestion, thanks. – Alex I Mar 03 '15 at 11:49
  • See also: https://datascience.stackexchange.com/questions/23233/cost-function-for-ordinal-regression-using-neural-networks/23468 and https://stackoverflow.com/questions/38375401/neural-network-ordinal-classification-for-age and https://arxiv.org/pdf/0704.1028.pdf – kjetil b halvorsen Aug 01 '18 at 21:49

2 Answers

24

I believe what most people do is to simply treat ordinal classification as a generic multi-class classification. So, if they have $K$ classes, they will have $K$ outputs, and simply use cross-entropy as the loss.

But some people have devised a clever encoding for your ordinal classes (see this stackoverflow answer). It's a sort of cumulative one-hot encoding:

  • class 1 is represented as [0 0 0 0 ...]

  • class 2 is represented as [1 0 0 0 ...]

  • class 3 is represented as [1 1 0 0 ...]

i.e. each neuron predicts the probability $P(\hat y > k)$. You still have to use a sigmoid as the activation function, but I think this helps the network understand some continuity between classes, I don't know. Afterwards, you do a post-processing step (np.sum over the thresholded outputs) to convert the binary output into your class, as sketched below.
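A minimal sketch of this encoding and the np.sum decoding, assuming five classes (the function names are illustrative):

import numpy as np

def encode(labels, num_classes=5):
    # Class k (0-indexed) becomes k leading ones: 0 -> [0 0 0 0], 1 -> [1 0 0 0], ...
    return (np.arange(num_classes - 1)[None, :] < np.asarray(labels)[:, None]).astype("float32")

def decode_sum(sigmoid_outputs, threshold=0.5):
    # Count how many outputs exceed the threshold; the count is the class index.
    return (np.asarray(sigmoid_outputs) > threshold).sum(axis=1)

print(encode([0, 1, 2]))
# [[0. 0. 0. 0.]
#  [1. 0. 0. 0.]
#  [1. 1. 0. 0.]]
print(decode_sum([[0.9, 0.8, 0.2, 0.1]]))  # -> [2]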

This strategy resembles the ensemble approach of Frank and Hall, which I think was the first publication of this idea.

  • This approach seems much more appealing. It is important to realize that using predicted modes to turn this into a classification problem is not a good idea. Predicted cumulative probabilities can be turned into predicted individual probabilities, and so the utility function for making a final decision can be inserted much later, when utilities are known. See http://fharrell.com/post/classification . – Frank Harrell Jan 30 '18 at 12:29
  • @RicardoCruz - Hmm, that sounds a lot like what I had suggested: "1 0 0 for A, 1 1 0 for B, and 1 1 1 for C". Good to know that works! Also, wow, that paper is from 2007; this idea has been around for a long time. – Alex I Mar 08 '18 at 20:51
  • Yeah, I was surprised myself when I found that paper! – Ricardo Magalhães Cruz Mar 08 '18 at 23:15
  • Note: As stated in "A Neural Network Approach to Ordinal Regression": "...using independent sigmoid functions for output nodes does not guarantee the monotonic relation $(o_1 \ge o_2 \ge \dots \ge o_K)$, which is not necessary but desirable for making predictions." Therefore, just performing an np.sum at prediction time is not the best method. – sccrthlt Apr 19 '18 at 14:51
  • Edit to my comment above: Performing np.sum on the outputs of the neural network is misleading. The following situation may arise where the output vector is [0 1 0 1 0]. Performing a summation on this vector would produce a class prediction of 2, when in fact the neural network is unsure. – sccrthlt Apr 19 '18 at 15:17
  • To perform prediction, the paper "A Neural Network Approach to Ordinal Regression" states: "...our method scans output nodes in the order $O_1, O_2, \ldots, O_K$. It stops when the output of a node is smaller than the predefined threshold $T$ (e.g. 0.5) or no nodes are left. The index $k$ of the last node $O_k$ whose output is bigger than $T$ is the predicted category of the data point." (See the sketch after this comment thread.) – sccrthlt Apr 19 '18 at 15:30
  • You write "not softmax obviously". Why not? Isn't softmax exactly for multi-category outputs, to guarantee that the sum of the probabilities is 1? – cheesus Oct 26 '20 at 21:57
  • @cheesus, I wrote this two years ago and I don't remember... :) I removed that because I also can't make sense of it. Thank you for helping improve it! – Ricardo Magalhães Cruz Oct 27 '20 at 22:27
  • Thanks for the clarification! :) – cheesus Oct 28 '20 at 08:18
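Here is a minimal sketch of the scan-based decoding quoted in the comments above, which stops at the first output below the threshold instead of summing everything (the function name is illustrative):

import numpy as np

def decode_scan(outputs, threshold=0.5):
    # Scan o_1, ..., o_K in order and stop at the first output below the
    # threshold; the number of outputs passed so far is the predicted class.
    k = 0
    for o in np.asarray(outputs):
        if o < threshold:
            break
        k += 1
    return k

print(decode_scan([0.9, 0.8, 0.2, 0.9]))  # -> 2, whereas np.sum would give 3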
9

I think the approach of only encoding the ordinal labels as

  • class 1 is represented as [0 0 0 0 ...]

  • class 2 is represented as [1 0 0 0 ...]

  • class 3 is represented as [1 1 0 0 ...]

and using binary cross-entropy as the loss function is suboptimal. As mentioned in the comments, the predicted vector might for example be [1 0 1 0 ...], which is undesirable for making predictions.

The paper Rank-consistent ordinal regression for neural networks describes how to restrict the neural network to rank-consistent predictions: you have to make sure that the last layer shares its weights but has a separate bias per output. You can implement this in TensorFlow by adding the following as the last part of the network (credit to https://stackoverflow.com/questions/59656313/how-to-share-weights-and-not-biases-in-keras-dense-layers):

import tensorflow as tf
from tensorflow import keras


class BiasLayer(tf.keras.layers.Layer):
    """Adds a trainable bias vector to its input; it has no other weights."""

    def __init__(self, units, *args, **kwargs):
        super(BiasLayer, self).__init__(*args, **kwargs)
        self.bias = self.add_weight('bias',
                                    shape=[units],
                                    initializer='zeros',
                                    trainable=True)

    def call(self, x):
        # x has shape (batch, 1); broadcasting against the length-`units`
        # bias vector yields one sigmoid input per ordinal threshold.
        return x + self.bias


# Add the following as the output of the Sequential model:
model.add(keras.layers.Dense(1, use_bias=False))  # shared weights, single score
model.add(BiasLayer(4))                           # K-1 = 4 separate biases
model.add(keras.layers.Activation("sigmoid"))

Note that the number of ordinal classes here is 5, hence the $K - 1 = 4$ biases.

I tested the difference in performance on actual data, and the predictive accuracy improved substantially. Hope this helps.
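For completeness, here is a sketch of how the full model might be assembled and trained end to end, reusing the BiasLayer above (the hidden layer, input size, dummy data, and cumulative label encoding are illustrative assumptions, not prescriptions from the paper):

import numpy as np
from tensorflow import keras

NUM_CLASSES = 5  # ordinal classes 0..4, so NUM_CLASSES - 1 = 4 binary outputs

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),  # placeholder sizes
    keras.layers.Dense(1, use_bias=False),   # shared weights: one score per sample
    BiasLayer(NUM_CLASSES - 1),              # BiasLayer as defined above
    keras.layers.Activation("sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Cumulative label encoding: class k becomes k leading ones, e.g. 3 -> [1 1 1 0]
def encode(labels, num_classes=NUM_CLASSES):
    return (np.arange(num_classes - 1)[None, :] < np.asarray(labels)[:, None]).astype("float32")

x = np.random.rand(100, 10).astype("float32")            # dummy features
y = encode(np.random.randint(0, NUM_CLASSES, size=100))  # dummy ordinal labels
model.fit(x, y, epochs=2, verbose=0)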

thijsvdp
  • Neat!! I am searching for a resource to implement the original regression in PyTorch. "I tested the difference in performance on actual data and the predictive accuracy improved substantially." Do you mean the accuracy on the testing set? Which metric are you using? Which models did you compare it with? – Phume Nov 08 '20 at 03:44
  • I highly recommend this answer; I was able to implement it successfully without much additional work thanks to this repo: https://github.com/ck37/coral-ordinal/ – Chris Farr Dec 16 '20 at 17:13
  • If I understand the paper, then the Keras-based solution in the answer by @thijsvdp also guarantees the proper ordering of the biases, right? If so, this should be the top answer! – Russell Richie Jul 29 '21 at 18:56
  • @thijsvdp Have you encountered an ordinal loss function with two-sided ranking? For example, if the ordinal classes are [0, 1, 2, 3, 4] and the target is 2, then the outputs should satisfy (0) < (1) < (2) > (3) > (4). – Kenenbek Arzymatov Sep 29 '21 at 10:14
  • I guess this is not possible, because if I understand you correctly you have the following p0 – thijsvdp Sep 29 '21 at 11:41