When $a$ and $b$ are not given, just use the usual logistic model (or whatever is appropriate), because (if it uses a suitable link function) it is guaranteed to fit probabilities with a lower bound no smaller than $0$ and an upper bound no greater than $1.$ These bounds give interval estimates for $a$ and $b.$
The interesting question concerns when $a$ and $b$ are known. The kind of model you are entertaining appears to be the following. You have in mind a one-parameter family of distributions $\mathcal{F} = \{F_\theta\}$ where $\theta$ corresponds to some "probability" parameter. For instance, $F_\theta$ might be a Bernoulli$(\theta)$ distribution when the responses $Y$ are binary.
For an observation associated with a vector of explanatory variables $x,$ the model for the response $Y_x$ then takes the form
$$Y_x \sim F_{\theta(x)};\quad \theta(x) = g(x\beta)$$
for some "inverse link function" $g$ that we must specify: it's part of the model. In logistic regression, for instance, $g$ is frequently taken to be the logistic function defined by
$$g(x) = \frac{1}{1 + \exp(-x)}.$$
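(As an aside, base R provides this function as `plogis` and its inverse, the logit, as `qlogis`; they appear in a few quick numerical checks below.)

plogis(2.2)   # 1 / (1 + exp(-2.2)), about 0.900
qlogis(0.9)   # log(0.9 / 0.1), about 2.197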
Regardless of the details, when making $n$ independent observations $y_i$ (each associated with a vector $x_i$) assumed to conform to this model, their likelihood is
$$L(\beta) = \prod_{i=1}^n \Pr(Y_{x_i} = y_i\mid \theta(x_i) = g(x_i\beta))$$
and you can proceed to maximize this as usual. (The vertical stroke merely means the parameter value following it determines which probability function to use: it's not a conditional probability.)
Let $\hat\beta$ be the associated parameter estimate. The predicted conditional distributions for the $Y_i$ therefore are
$$Y_i \sim F_{\hat\theta(x_i)};\quad \hat\theta(x_i) = g(x_i\hat\beta).$$
When the image of $g$ is contained in the interval $[a,b],$ then manifestly every $\hat\theta(x)$ lies in that interval, too, no matter what $x$ may be. (That is, this conclusion applies both to $x$ in the dataset and for extrapolation to other $x.$)
One attractive choice for $g$ simply rescales the usual logistic function,
$$g(x;a,b) = a + (b-a)\,g(x).$$
Consider this a point of departure: as usual, exploratory analysis and goodness-of-fit testing will help you decide whether this is a suitable form for $g.$
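In R, a minimal sketch of this rescaled inverse link might look like the following (it is the same `logistic.ab` function used in the full code at the end of this post):

# Rescaled logistic inverse link: maps the whole real line into (a, b).
logistic.ab <- function(x, a=0, b=1) {
  a + (b - a) / (1 + exp(-x))
}
logistic.ab(c(-100, 0, 100), a=1/10, b=1/2)   # 0.1, 0.3, 0.5: never outside [a, b]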
For later use, note that $g$ and $g(;a,b)$ have a more complicated relationship than might appear, because ultimately they are used to determine $\hat\beta$ via their argument $x\beta.$ The relationship is therefore characterized by the function $x\to y$ determined by
$$g(x) = g(y;a,b) = a + (b-a)\,g(y),$$
with solution (provided $g$ is invertible, as it usually is, and $g(x)$ lies strictly between $a$ and $b$)
$$y = g^{-1}\left(\frac{g(x) - a}{b-a}\right).$$
Unless $g$ is linear to begin with, this relationship is usually nonlinear.
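As a quick numerical check of this relationship (a sketch using `plogis` and `qlogis` for $g$ and $g^{-1},$ together with the `logistic.ab` function above; the check only makes sense when $g(x)$ lies strictly between $a$ and $b$):

# If y = g^{-1}((g(x) - a) / (b - a)), then g(x) = g(y; a, b).
a <- 1/10; b <- 1/2
x <- -1                                      # chosen so that g(x) falls inside (a, b)
y <- qlogis((plogis(x) - a) / (b - a))
all.equal(plogis(x), logistic.ab(y, a, b))   # TRUE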
To address the issues expressed elsewhere in this thread, let's compare the solutions obtained using $g$ and $g(;a,b).$ Consider the simplest case of $n=1$ observation and a scalar explanatory variable, requiring estimation of a parameter vector $\beta=(\beta_1).$ Suppose $\mathcal{F}$ is the family of Binomial$(10,\theta)$ distributions, let $x_1 = (1),$ and imagine $Y_1 = 9$ is observed. Writing $\theta$ for $\theta(x_1),$ the likelihood is
$$L(\beta) = \binom{10}{9}\theta^9(1-\theta)^1;\quad \theta = g((1)(\beta_1)) = g(\beta_1).$$
$L$ is maximized when $g(\beta_1) = \theta = 9/10,$ with the unique solution $$\hat\beta = g^{-1}(9/10) = \log(9/10 / (1/10)) = \log(9) \approx 2.20.$$
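A minimal numerical check of this maximization in base R:

# Maximize the single-observation Binomial(10, theta) likelihood over beta.
negloglik <- function(beta) -dbinom(9, 10, plogis(beta), log=TRUE)
optimize(negloglik, c(-10, 10))$minimum   # about 2.197 = log(9)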
Let us now suppose $a=0$ and $b=1/2:$ that is, we presume $\theta \le 1/2$ no matter what value $x$ might have. With the scaled version of $g$ we compute exactly as before, merely substituting $g(;a,b)$ for $g:$
$$L(\beta;0,1/2) = \binom{10}{9}\theta^9(1-\theta)^1;\quad \theta = g((1)(\beta_1);0,1/2) = g(\beta_1;0,1/2).$$
This is no longer maximized at $\theta=9/10,$ because by design $g(\beta_1;0,1/2)$ cannot exceed $1/2.$ Instead, $L(\beta;0,1/2)$ increases as $\theta$ gets as close as possible to $9/10,$ which happens as $\beta$ grows arbitrarily large. The estimate using the restricted inverse link function, then, is
$$\hat\beta = \infty.$$
Obviously neither $\hat\theta$ nor $\hat\beta$ is any simple function of the original (unrestricted) estimates; in particular, they are not related by any rescaling.
This simple example exposes one of the perils of the entire program: when what we presume about $a$ and $b$ (and everything else about the model) is inconsistent with the data, we may wind up with outlandish estimates of the model parameter $\beta.$ That's the price we pay.
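The divergence in this example is easy to see numerically: with the restricted inverse link, the negative log likelihood keeps decreasing as $\beta$ grows, so a numerical optimizer simply runs off toward the upper end of whatever search interval it is given (again a sketch, reusing the `logistic.ab` function above):

# With a = 0, b = 1/2 the fit can never reach theta = 9/10: the negative
# log likelihood is strictly decreasing in beta, so its minimum is never attained.
negloglik.ab <- function(beta) -dbinom(9, 10, logistic.ab(beta, 0, 1/2), log=TRUE)
optimize(negloglik.ab, c(-10, 50))$minimum   # close to 50, the upper search limit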
But what if our assumptions are correct, or at least reasonable? Let's rework the previous example with $b=0.95$ instead of $b=1/2.$ This time, $\hat\theta=9/10$ does maximize the likelihood, whence the estimate of $\beta$ satisfies
$$\frac{9}{10} = g(\hat\beta;0,0.95) = 0 + (0.95 - 0)\,g(\hat\beta),$$
so
$$g(\hat\beta) = \frac{9/10}{0.95} = \frac{18}{19} \approx 0.947,$$
entailing
$$\hat\beta = \log\left(\frac{18/19}{1/19}\right) = \log(18) \approx 2.89.$$
In this case, $\hat\theta$ is unchanged but $\hat\beta$ has changed in a complicated way ($2.89$ is not a rescaled version of $2.20$).
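Again, a quick numerical confirmation (same sketch as before, with the new limits):

# With a = 0, b = 0.95 the maximum is attained where 0.95 * plogis(beta) = 9/10.
negloglik.ab <- function(beta) -dbinom(9, 10, logistic.ab(beta, 0, 0.95), log=TRUE)
optimize(negloglik.ab, c(-10, 10))$minimum   # about 2.89
qlogis((9/10) / 0.95)                        # log(18), also about 2.89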
In these examples, $\hat\theta$ had to change when the original estimate was not in the interval $[a,b].$ In more complex examples, $\hat\theta$ at one value of $x$ might have to change in order to keep the estimates at other values of $x$ within the interval, because a single $\beta$ ties all the estimates together. This is one effect of the $[a,b]$ restriction. The other effect is that even when the restriction changes none of the estimated probabilities $\hat\theta,$ the nonlinear relationship between the original inverse link $g$ and the restricted link $g(;a,b)$ induces nonlinear (and potentially complicated) changes in the parameter estimates $\hat\beta.$
To illustrate, I created data according to this model with $\beta=(4,-7)$ and limits $a=1/10$ and $b=1/2$ for $n$ equally spaced values of the explanatory variable $x$ between $0$ and $1$ inclusive, and then fit them once using ordinary logistic regression (no constraints) and again with the known constraints using the scaled inverse link method.
Here are the results for $n=12$ Binomial$(8, \theta(x))$ observations (which, in effect, reflect $12\times 8 = 96$ independent binary results):

[Figure: the data with the true curve (left), the unconstrained logistic fit (middle), and the constrained fit with $a=0.1,$ $b=0.5$ (right); dotted horizontal lines mark $a$ and $b.$]
This already provides insight: the model (left panel) predicts probabilities near the upper limit $b=1/2$ for small $x.$ Random variation causes some of the observed values to have frequencies greater than $1/2.$ Without any constraints, logistic regression (middle panel) tends to predict higher probabilities there. A similar phenomenon happens for large $x.$
The restricted model drastically changes the estimated slope from $-3.45$ to $-21.7$ in order to keep the predictions within $[a,b].$ This occurs partly because it's a small dataset.
Intuitively, larger datasets should produce results closer to the underlying (true) data generation process. One might also expect the unrestricted model to work well. Does it? To check, I created a dataset one thousand times greater: $n=1200$ observations of a Binomial$(80,\theta(x))$ response.

[Figure: the same three panels for the larger dataset.]
Of course the correct model (right panel) now fits beautifully. However, the random variation in observed frequencies still causes the ordinary logistic model to exceed the limits.
Evidently, when the presumed values of $a$ and $b$ are (close to) correct and the link function is roughly the right shape, maximum likelihood works well--but it definitely does not yield the same results as logistic regression.
In the interests of providing full documentation, here is the R code that produced the first figure. Changing 12 to 1200 and 8 to 80 produced the second figure.
#
# Rescaled inverse link, predicted probabilities, and the binomial negative log likelihood.
#
logistic.ab <- function(x, a=0, b=1) {      # Inverse link with image (a, b)
  a + (b - a) / (1 + exp(-x))
}
predict.ab <- function(beta, x, invlink=logistic.ab) {  # Fitted probabilities
  invlink(cbind(1, x) %*% beta)
}
Lambda <- function(beta, n, k, x, invlink=logistic.ab, tol=1e-9) {
  p <- predict.ab(beta, x, invlink)
  p <- (1-2*tol) * p + tol                  # Prevents numerical problems
  - sum((k * log(p) + (n-k) * log(1-p)))    # Binomial negative log likelihood
}
#
# Simulate data.
#
N <- 12 # Number of binomial observations
x <- seq(0, 1, length.out=N) # Explanatory values
n <- rep(8, length(x)) # Binomial counts per observation
beta <- c(4, -7) # True parameter
a <- 1/10 # Lower limit
b <- 1/2 # Upper limit
set.seed(17)
p <- predict.ab(beta, x, function(x) logistic.ab(x, a, b))
X <- data.frame(x = x, p = p, n = n, k = rbinom(length(x), n, p))
#
# Create a data frame for plotting predicted and true values.
#
Y <- with(X, data.frame(x = seq(min(x), max(x), length.out=101)))
Y$p <- with(Y, predict.ab(beta, x, function(x) logistic.ab(x, a, b)))
#
# Plot the data.
#
par(mfrow=c(1,3))
col <- hsv(0,0,max(0, min(1, 1 - 200/N)))
with(X, plot(x, k / n, ylim=0:1, col=col, main="Data with True Curve"))
with(Y, lines(x, p))
abline(h = c(a,b), lty=3)
#
# Reference fit: ordinary logistic regression.
#
fit <- glm(cbind(k, n-k) ~ x, data=X, family=binomial(link = "logit"),
           control=list(epsilon=1e-12))
#
# Fit two models: ordinary logistic and constrained.
#
for (ab in list(c(a=0, b=1), c(a=a, b=b))) {
  #
  # MLE.
  #
  g <- function(x) logistic.ab(x, ab[1], ab[2])
  beta.hat <- c(0, 1)
  fit.logistic <- with(X, nlm(Lambda, beta.hat, n=n, k=k, x=x, invlink=g,
                              iterlim=1e3, steptol=1e-9, gradtol=1e-12))
  if (fit.logistic$code > 3) stop("Check the fit.")
  beta.hat <- fit.logistic$estimate
  # Check:
  print(rbind(Reference=coefficients(fit), Constrained=beta.hat))
  # Plot:
  Y$p.hat <- with(Y, predict.ab(beta.hat, x, invlink=g))
  with(X, plot(x, k / n, ylim=0:1, col=col,
               main=paste0("Fit with a=", signif(ab[1], 2),
                           " and b=", signif(ab[2], 2))))
  with(Y, lines(x, p.hat, col = "Red", lwd=2))
  with(Y, lines(x, p))
  abline(h = c(a, b), lty=3)
}
par(mfrow=c(1,1))
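One small follow-up: because the last pass through the loop leaves beta.hat and g holding the constrained fit, predictions at new values of $x$ (including extrapolations beyond $[0,1],$ as discussed above) can be obtained directly, and they necessarily respect the limits:

# Constrained predictions at new (even extrapolated) values of x;
# every value lies between a = 0.1 and b = 0.5 by construction.
predict.ab(beta.hat, c(-0.5, 0.25, 1.5), invlink=g)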