Predicting values in a given range [0, 1] (not probability)

Question

I need to predict the impact of a set of node failures in a network, based on 2 features: the fraction of failed nodes and a measure of their network centrality.

Failure of less important nodes will have low to no repercussions, while failure of important nodes can bring down the whole network.

The predicted output should be a number in the range [0, 1], where 0 means "no damage" and 1 means "failure of all nodes". The range includes 0 because I use it to cover the "no damage" case, when the fraction of failed nodes is 0.

Here's an example mapping 2 features (x and y axes) to the actual simulation result (z axis). I plotted the scatter and the projection of the points. It looks like something that can be learned, right?

I'm using multivariate linear regression in scikit-learn to train a predictor. I take my training set, pick the 2 features I want, expand them into 4th-degree polynomial features, standardize those, and use the result to train my model.
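For reference, the setup described above can be sketched roughly as follows. The data here is a synthetic stand-in (an assumption, not the real simulation output), and the exact pipeline options are guesses. The point is that nothing in this model restricts its output to [0, 1], so checking the prediction range on a grid of inputs is a quick diagnostic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the simulation data (hypothetical):
# column 0 = fraction of failed nodes, column 1 = normalized centrality,
# z = damage in [0, 1] with a sharp transition plus a little noise.
X = rng.uniform(0.0, 1.0, size=(500, 2))
signal = 1.0 / (1.0 + np.exp(-12.0 * (X[:, 0] + X[:, 1] - 0.8)))
z = np.clip(signal + rng.normal(0.0, 0.05, 500), 0.0, 1.0)

# 4th-degree polynomial features -> standardization -> ordinary least squares.
model = make_pipeline(
    PolynomialFeatures(degree=4, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
model.fit(X, z)

# Evaluate on a regular 51 x 51 grid over [0, 1]^2 and inspect the range.
axis = np.linspace(0.0, 1.0, 51)
grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)
pred = model.predict(grid)
print("min prediction:", pred.min(), "max prediction:", pred.max())
```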

The problem is, for some combinations of my 2 feature values, the predictor outputs negative values or values greater than 1. I can post a picture if needed.

I've tried changing the degree of the polynomial, using regularization (lasso, ridge and ElasticNet) and applying normalization. Nothing completely fixes the problem, especially for predictions >1.

Is linear regression the wrong tool for predicting values in a range?
Any suggestion for a drop-in replacement? I'm tempted to try logistic regression, but it does not sound like the right tool, since it's for classification, and it outputs probabilities, while my values have a different meaning (not a "chance" but a degree of damage).

Can you be more precise about what is the format of your data? You could simply map the output to $[0,1]$ in the obvious way. — Olivier, Apr 02 '17 at 15:25
I have two features, one is the initial fraction of failed nodes (between 0 and 1) while the other is a sum of the centrality of those nodes (this could be normalized, actually). I can make the second feature a fraction too, as in "fraction of lost centrality" (according to a specific metric). Would that help? — Agostino, Apr 02 '17 at 15:29
The outcome for (0, 0) is always 0, that is (0 attacked, 0 loss of centrality) -> 0 damage. There might be some points where the same (x, y) value has different z values, but there shouldn't be many of them. Turning this into a classification problem is a possibility, yes, but I'd like to make some prediction too. Is there no way to inform the predictor that the values only make sense within a specific range? Mapping them afterwards sounds like a lost opportunity. — Agostino, Apr 02 '17 at 15:52
In x you have the initial size of the failure (fraction of failed nodes), while in z you have the final size of the failure (again, fraction of failed nodes). This is a prediction for the effect of cascading failures. When an initial damage results in a total failure you have z=1. About the constrained regression, could you point me to an example using scikit-learn? — Agostino, Apr 02 '17 at 16:03
I can try considering more than 2 features to differentiate, although that might still not be enough. I don't fully understand what you are proposing, though. Could you add some background? — Agostino, Apr 02 '17 at 16:36

Olivier · Answer 1 · 2017-04-02T17:55:05.030

You are facing extremely different responses to very similar inputs. Regression will not give you meaningful answers.

Classification approach.

Consider, alternatively, categorical responses such as 'low damage' and 'high damage'. You should be looking for a function $f$ such that $f(p, c)$ is the probability that you will have 'low damage' when a fraction $p$ of the nodes have failed with loss of centrality $c$.

You can construct $f$ in the following way. Let $n \in \mathbb{N}$, let $a_{i,j}$ be the number of points in the region $R_{i,j} = [\frac{i}{n+1}, \frac{i+1}{n+1}]\times [\frac{j}{n+1}, \frac{j+1}{n+1}]$ corresponding to 'low damage' and let $N_{i,j}$ be the total number of points in this region. Thus $$ f_{i,j} = \frac{a_{i,j}}{N_{i,j}} $$ is the fraction of points in $R_{i,j}$ corresponding to 'low damage'. You can then let $$ f(p,c) = f_{i,j}\quad \text{ if } (p,c) \in R_{i,j}. $$
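A minimal NumPy sketch of this piecewise-constant estimate, assuming both features are already scaled to $[0,1]$ and that each training point carries a boolean 'low damage' label (how that label is derived from the simulated damage, e.g. by a threshold, is left open here). The function names `fit_grid` and `predict_grid` are made up for illustration.

```python
import numpy as np

def fit_grid(p, c, low, n):
    """Return (f, a, N) on an (n+1) x (n+1) grid of cells R_ij.

    f[i, j] is the fraction of 'low damage' points in cell R_ij,
    and is NaN where the cell contains no points (N[i, j] == 0).
    """
    m = n + 1
    # Cell index of each point; a coordinate of exactly 1.0 goes in the last cell.
    i = np.minimum((np.asarray(p, dtype=float) * m).astype(int), n)
    j = np.minimum((np.asarray(c, dtype=float) * m).astype(int), n)
    a = np.zeros((m, m))
    N = np.zeros((m, m))
    np.add.at(N, (i, j), 1.0)                           # points per cell
    np.add.at(a, (i, j), np.asarray(low, dtype=float))  # 'low damage' points per cell
    with np.errstate(invalid="ignore", divide="ignore"):
        f = a / N
    return f, a, N

def predict_grid(f, p, c):
    """Piecewise-constant estimate: look up the cell that contains (p, c)."""
    n = f.shape[0] - 1
    i = min(int(p * (n + 1)), n)
    j = min(int(c * (n + 1)), n)
    return f[i, j]

# Tiny made-up example with n = 1, i.e. a 2 x 2 grid.
p = [0.1, 0.2, 0.4, 0.9, 0.8]
c = [0.1, 0.3, 0.4, 0.9, 0.7]
low = [True, True, False, False, False]
f, a, N = fit_grid(p, c, low, n=1)
print(f)
```

Empty cells come out as NaN, which is the $N_{i,j} = 0$ problem mentioned in the answer.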

If you want something smooth, you can use $$ f(p,c) = \sum_{i,j} f_{i,j} B_{i,n}(p) B_{j,n}(c), $$ where the sum runs over $0 \le i, j \le n$ and $B_{i,n}(p) = {n \choose i} p^{i}(1-p)^{n-i}$ are the Bernstein polynomials.
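The smooth version can be sketched as below, assuming an $(n+1) \times (n+1)$ array of cell fractions with no missing entries (the values here are made up). Since the Bernstein polynomials of degree $n$ are non-negative on $[0,1]$ and sum to 1, the result is a weighted average of the $f_{i,j}$ and therefore stays inside $[0,1]$ whenever they do. The helper names are hypothetical.

```python
from math import comb

import numpy as np

def bernstein_basis(n, x):
    """Return the vector [B_{0,n}(x), ..., B_{n,n}(x)] for a scalar x in [0, 1]."""
    i = np.arange(n + 1)
    binom = np.array([comb(n, k) for k in range(n + 1)], dtype=float)
    return binom * x**i * (1.0 - x) ** (n - i)

def smooth_f(f_grid, p, c):
    """Evaluate sum_{i,j} f_ij * B_{i,n}(p) * B_{j,n}(c)."""
    n = f_grid.shape[0] - 1
    return bernstein_basis(n, p) @ f_grid @ bernstein_basis(n, c)

# Made-up cell fractions for n = 1 (a 2 x 2 grid).
f_grid = np.array([[1.0, 0.5],
                   [0.5, 0.0]])
print(smooth_f(f_grid, 0.5, 0.5))
```

At the corners of the unit square the surface reproduces the corner cell values exactly; everywhere else it blends neighbouring cells.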

Improving the model.

You could put a prior on the $f_{i,j}$ and on $n$. This would give you a reasonable Bayesian model, and fix the small problem that arises when $N_{i,j} = 0$.
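The answer does not say which prior to use. As one possible illustration, assume an independent Beta($\alpha$, $\beta$) prior on each $f_{i,j}$ and keep $n$ fixed (so the prior on $n$ is not covered here). The posterior mean is then $(a_{i,j} + \alpha) / (N_{i,j} + \alpha + \beta)$, which is defined even for empty cells. The counts below are made up.

```python
import numpy as np

def posterior_mean(a, N, alpha=1.0, beta=1.0):
    """Posterior mean of f_ij under an assumed Beta(alpha, beta) prior.

    a[i, j] = number of 'low damage' points in cell R_ij,
    N[i, j] = total number of points in cell R_ij.
    """
    a = np.asarray(a, dtype=float)
    N = np.asarray(N, dtype=float)
    return (a + alpha) / (N + alpha + beta)

# Made-up counts on a 2 x 2 grid; two cells are empty.
a = np.array([[2.0, 0.0],
              [0.0, 0.0]])
N = np.array([[3.0, 0.0],
              [0.0, 2.0]])
f_post = posterior_mean(a, N)
print(f_post)
```

With $\alpha = \beta = 1$ (a uniform prior), empty cells fall back to 0.5 and sparsely populated cells are pulled toward 0.5.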

Visualizing the results.

The set of points $(p,c)$ such that $f(p,c) = \frac{1}{2}$ will give you a classification boundary that is an algebraic variety. You also have critical regions $C_\varepsilon = \{(p,c) \,|\, f(p,c) < \varepsilon\}$ where 'high damage' is very likely.

You can calculate the marginal probability of high or low damages for a given fraction of failed nodes and plot this together with the probability surface.

It's an interval. Basically, the regions $R_{i,j}$ are the small squares in the $xy$ plane of your plots. You divide the input space $[0,1]^2$ into a grid of squares $R_{i,j}$ of side length $1/(n+1)$. — Olivier, Apr 02 '17 at 17:41
OK, thanks. I'll give a shot at classification. Not sure if there's anything about Bernstein's polynomials in sklearn. Still +1 and my thanks. — Agostino, Apr 02 '17 at 17:51
@Agostino Try to avoid sharp binary classification, however, because you don't have separable classes. What you want, for each value $(p,c)$, is a *probability* that you will end up with low or high damage. It's not too difficult to directly program what I suggested, up to the section 'improving the model'. — Olivier, Apr 02 '17 at 17:57
@Coderji This is a bit of a crappy estimation method, but it works and it's easy to implement. I wrote about a Bayesian variant to this on my blog (https://mathstatnotes.com/2017/07/23/bayesian-binary-classification-using-partitions-of-unity/) and I don't have other references. — Olivier, Feb 22 '18 at 16:21
