
I am working on a machine learning problem where I have to predict a set of $N$ numbers (proportions) for each data point, all of them summing to one. A toy example to illustrate my problem would be predicting, at a daily level, the percentage of the total volume of rain in the country that falls in each US state - in this example $N=50$ (the number of states) and $\sum_{n=1}^{50}{\hat{y}_n}=1$.

I was thinking of designing a neural net with $N$ outputs, applying a softmax at the output, and then backpropagating the MSE or RMSE... I am a bit unsure about the convergence guarantees (potential vanishing gradients). I would also like to know whether you would approach the problem in another way.
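
For concreteness, this is roughly the setup I have in mind - a quick sketch in PyTorch, where the input dimension and hidden layer size are arbitrary placeholders:

```python
import torch
import torch.nn as nn

N = 50            # number of proportions to predict (the states in the toy example)
N_FEATURES = 100  # placeholder input dimension

# Feed-forward net whose outputs go through a softmax, so that the N
# predictions are positive and sum to one.
model = nn.Sequential(
    nn.Linear(N_FEATURES, 64),
    nn.ReLU(),
    nn.Linear(64, N),
    nn.Softmax(dim=-1),
)

x = torch.randn(8, N_FEATURES)                # dummy batch of 8 data points
y = torch.softmax(torch.randn(8, N), dim=-1)  # dummy targets that sum to one

y_hat = model(x)                              # each row of y_hat sums to one
loss = nn.functional.mse_loss(y_hat, y)       # the MSE I was planning to backpropagate
loss.backward()
```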

Stephan Kolassa
ivallesp
  • Depending on your independent variables, I should imagine that the better approach would be to build a model that predicts the amount of rain in each area. If you then want percentages, simply divide by the total predicted rain over all areas. – Him Apr 01 '20 at 14:50
  • But that is very dangerous, given that an outlier in just one of the regions would bias the whole distribution... I would rather adjust all the probabilities at the same time, with the sum-to-one constraint. – ivallesp Apr 01 '20 at 14:55
  • It may be worth noting that, at least for classification problems, NNs often need special care with [calibration](https://en.wikipedia.org/wiki/Probabilistic_classification#Probability_calibration). I would think that this problem would be equally bad for regression on a compositional variable. Just something to think about if you end up going that route. – Him Apr 01 '20 at 14:55
  • "an outlier in one of the communities would bias all your distribution" I'm not sure how constraining the model to sum to 1 would alleviate this. Would you care to elaborate? – Him Apr 01 '20 at 14:56
  • As you increase $N$, the probability of having a very large value in one of the sub-models increases. This approach optimizes the different sub-models separately while the approach I am looking for would optimize a calibrated output (with a Softmax for example). Hence the model I am looking for should account for this risk of overestimation given that overestimating one of the outputs affects the prediction of all the outputs. – ivallesp Apr 01 '20 at 15:07
  • "Hence the model I am looking for should account for this risk of overestimation" I don't think that the NN model that you propose will naively accomplish this. Perhaps with some kind of regularization... but this applies, of course, to the proposed individual models as well. – Him Apr 01 '20 at 15:25
  • I think that, if your actual concern is about your model being biased due to outliers, then a question to that effect will yield much more productive answers. – Him Apr 01 '20 at 15:27
  • For the loss I would think about `KLDivergence` or any `crossentropy`. – quester Apr 05 '20 at 17:05
  • @quester Cross-entropy with continuous data? Say you have a target like (0.2, 0.3, 0.5) and a prediction like (0.3, 0.3, 0.4). Do you have any reference for cross-entropy applied to continuous data? – ivallesp Apr 05 '20 at 22:57

2 Answers


You have what is called *compositional data*. There is quite some literature on how to model this. Take a look through the [compositional-data] tag, or search for the term.

Typically, one would choose a reference category and work with log ratios, or something similar. One paper I personally know about predicting compositional data is Snyder et al. (2017, IJF). They use a state space approach, not an NN, but their transformation may still be useful to you.
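
To make the reference-category idea concrete, here is a minimal sketch in NumPy - my own illustration of an additive log-ratio (ALR) transform, not the transformation from the paper. You map the proportions to an unconstrained space, fit any regression model there, and map the predictions back:

```python
import numpy as np

def alr(y, eps=1e-12):
    """Additive log-ratio transform: proportions (..., N) -> unconstrained (..., N-1).
    The last component serves as the reference category."""
    y = np.clip(y, eps, None)                 # guard against log(0)
    return np.log(y[..., :-1] / y[..., -1:])

def alr_inv(z):
    """Inverse ALR: unconstrained (..., N-1) -> proportions (..., N) that sum to one."""
    z = np.concatenate([z, np.zeros_like(z[..., :1])], axis=-1)  # reference gets log-ratio 0
    e = np.exp(z - z.max(axis=-1, keepdims=True))                # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

y = np.array([0.2, 0.3, 0.5])
z = alr(y)           # two unconstrained numbers; model these with any regressor
y_back = alr_inv(z)  # recovers [0.2, 0.3, 0.5]
```

Modelling the $N-1$ log ratios removes the sum-to-one constraint from the regression itself; the inverse transform restores it at prediction time.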

Stephan Kolassa

Answering my past self... One elegant solution is to use the cross-entropy with "soft targets" as the loss. This means that your targets will not be in one-hot-encoded format, but they will still sum to one. The original cross-entropy formula still applies.
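
A minimal sketch of this loss, assuming the network outputs raw logits (written in PyTorch, but the same few lines work in any framework):

```python
import torch

def soft_target_cross_entropy(logits, targets):
    """Cross-entropy with soft targets: -sum_n y_n * log(softmax(logits)_n).
    The targets must be non-negative and sum to one along the last axis,
    but they need not be one-hot."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

# Example from the comments above: target (0.2, 0.3, 0.5) vs. prediction (0.3, 0.3, 0.4).
targets = torch.tensor([[0.2, 0.3, 0.5]])
logits = torch.log(torch.tensor([[0.3, 0.3, 0.4]]))  # logits whose softmax equals the prediction
loss = soft_target_cross_entropy(logits, targets)
```

With a softmax output, the gradient of this loss with respect to the logits is $\hat{y} - y$, exactly as in the one-hot case, and the loss is minimised when the predicted distribution equals the target.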

The cross-entropy loss with soft targets is widely used in the knowledge-distillation literature; see, e.g., Hinton et al. (2015), "Distilling the Knowledge in a Neural Network".

ivallesp