How do I train a (logistic?) regression in R using L1 loss function?

Question

I can train a logistic regression in R using

glm(y ~ x, family=binomial(logit)))

but, IIUC, this optimizes for log likelihood.

Is there a way to train the model using the linear ($L_1$) loss function (which in this case is the same as the total variation distance)?

I.e., given a numeric vector $x$ and a bit (logical) vector $y$, I want to construct a monotonic (in fact, increasing) function $f$ such that $\sum |f(x)-y|$ is minimized.

See also

How do I train a logistic regression in R using L1 loss function?

What you want doesn't exist, and to be blunt, it doesn't make much sense. We can discuss alternatives but you need to state more thoroughly what are you trying to do. Why do you want to fit a logistic model with an L1 loss? — user603, May 22 '14 at 19:25
@user603: Because I want to evaluate my model using [TVD](http://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures) — sds, May 22 '14 at 20:59
You seem to be talking about fitting a logistic *curve* to data, rather than fitting binomially-distributed data - that is, a form of *nonlinear regression*, but using $L_1$ rather than $L_2$ norm. Indeed, the loss function $\sum |f(x)-y|$ suggests that the maximum is not $1$ (if that is the case, it makes reference to the binomial GLM misleading). On the other hand, if it really *is* constrained to 0-1, the loss function doesn't make sense. Can you give details of your actual situation please? — Glen_b, May 22 '14 at 22:52
Please note that the [help](http://stats.stackexchange.com/help/on-topic) ask that you *don't* cross post the same question to multiple sites, but instead choose one single site. If you later change your mind about which site is best, flag it for moderator attention and ask for it to be moved. — Glen_b, May 22 '14 at 22:57
@Glen_b: I think "bit (logical) vector y" does imply 0/1 response. — sds, May 23 '14 at 01:53
Yes, I see it now, sorry. I missed it - it's buried rather late in the post. — Glen_b, May 23 '14 at 03:23

user603 · Accepted Answer · 2020-06-06T07:26:40.720

What you want to do does not exist because it is, for lack of better word, mathematically flawed.

But first, I will stress why I think the premises of your question are sound. I will then try to explain why I think the conclusions you draw from them rest on a misunderstanding of the logistic model and, finally, I will suggest an alternative approach.

I will denote $\{(\pmb x_i,y_i)\}_{i=1}^n$ your $n$ observations (the bolder letters denote vectors) which lie in $p$ dimensional space (the first entry of $\pmb x_i$ is 1) with $p<n$, $y_i\in [0,1]$ and $f(\pmb x_i)= f(\pmb x_i'\pmb\beta)$ is an monotonous function of $\pmb x_i'\pmb\beta$, say like the logistic curve to fix ideas. For expediency, I will just assume that $n$ is sufficiently large compared to $p$.

You are correct that if you intend to use TVD as criterion to evaluate the fitted model, then it is reasonable to expect your fit to optimize that same criterion among all possible candidate, on your data. Hence

$$\pmb\beta^*=\underset{\pmb\beta\in\mathbb{R}^{p}}{\arg\min}\;\;\;\;\;||\pmb y-f(\pmb x_i'\pmb\beta)||_1$$

The problem is the error term: $\epsilon_i=y_i-f(\pmb x_i'\pmb\beta)$ and if we enforce $E(\pmb\epsilon)=0$ (we simply want our model to be asymptotically unbiased), then, $\epsilon_i$ must be heteroskedastic. This is because $y_i$ can take on only two values, 0 and 1. Therefore, given $\pmb x_i$, $\epsilon_i$ can also only take on two values: $1-f(\pmb x_i'\pmb\beta)$ when $y_i=1$, which occurs with probability $f(\pmb x_i'\pmb\beta)$, and $-f(\pmb x_i'\pmb\beta)$ when $y_i=1$, which occurs with probability $1-f(\pmb x_i'\pmb\beta)$.

These consideration together imply that:

$$\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\text{var}(\pmb\epsilon)=E(\pmb\epsilon^2)=(1-f(\pmb x'\pmb\beta))^2f(\pmb x'\pmb\beta)+(-f(\pmb x'\pmb\beta))^2(1-f(\pmb x'\pmb\beta))\\ \;=(1-f(\pmb x'\pmb\beta))f(\pmb x'\pmb\beta)\\ =E(\pmb y|\pmb x)E(1-\pmb y|\pmb x)$$

hence $\text{var}(\pmb\epsilon)$ is not constant but concave parabola shaped and is maximized when $\pmb x$ is such that $E(y|\pmb x)\approx .5$.

This inherent heteroskedasticity of the residuals has consequences. It implies among other things that when minimizing the $l_1$ loss function, you are asymptotically over-weighting part of your sample. That is, the fitted $\pmb\beta^*$ don't fit the data at all but only the portion of it that is clustered around places where $\pmb x$ is such that $E(\pmb y|\pmb x)\approx .5$. To wit, these are the least informative data points in your sample: they correspond to those observations for which the noise component is the largest. Hence, your fit is pulled towards $\pmb\beta^*=\pmb\beta:f(\pmb x'\pmb\beta)\approx .5$, e.g. made irrelevant.

One solution, as is clear from the exposition above is to drop the requirement of unbiased-ness. A popular way to bias the estimator (with some Bayesian interpretation attached) is by including a shrinkage term. If we re-scale the response:

$$y^+_i=2(y_i-.5),1\leq i\leq n$$

and, for computational expediency, replace $f(\pmb x'\pmb\beta)$ by another monotone function $g(\pmb x,[c,\pmb\gamma])=\pmb x'[c,\pmb\gamma]$ --it will be convenient for the sequel to denote the first component of the vector of parameter as $c$ and the remaining $p-1$ ones $\pmb\gamma$-- and include a shrinkage term (for example one of the form $||\pmb\gamma||_2$), the resulting optimization problem becomes:

$$[c^*,\pmb\gamma^{*}]=\underset{\pmb[c,\pmb\gamma]\in\mathbb{R}^{p}}{\arg\min}\;\;\sum_{i=1}^n\max(0,1-y_i^+\pmb x_i'\pmb[c,\pmb\gamma])+\frac{1}{2}||\pmb\gamma||_2$$

Note that in this new (also convex) optimization problem, the penalty for a correctly classified observations is 0 and it grows linearly with $\pmb x'\pmb[c,\gamma]$ for a miss-classified one --as in the $l_1$ loss. The $[c^*,\pmb\gamma^*]$ solution to this second optimization problem are the celebrated linear svm (with perfect separation) coefficients. As opposed to the $\pmb\beta^*$, it makes sense to learn these $[c^*,\pmb\gamma^{*}]$ from the data with an TVD-type penalty ('type' because of the bias term). Consequently, this solution is widely implemented. See for example the R package LiblineaR.

@sds; thanks: it was a great question :) I will come back during the day and fill in the details, correct some typo. — user603, May 23 '14 at 05:18

score 9 · Answer 2 · answered May 22 '14 at 20:24

I'm not sure why you would want to use L1 loss for something constrained between 0 and 1. Depending on what your goal is, you may want to consider something like hinge loss instead, which is similar to L1 loss in one direction and flat in the other.

In any case, the code below should do what you've asked for. Note that the optimal response is basically a step function.

set.seed(1)

# Fake data
x = seq(-1, 1, length = 100)
y = rbinom(100, plogis(x), size = 1) # plogis is the logistic function

# L1 loss
loss = function(y, yhat){
  sum(abs(y - yhat))
}

# Function to estimate loss associated with a given slope & intercept
fn = function(par){
  a = par[1]
  b = par[2]
  loss(y = y, yhat = plogis(a + b * x))
}

# Find the optimal parameters
par = optim(
  par = c(a = 0, b = 0),
  fn = fn
)$par

# Plot the results
plot(y ~ x)
curve(plogis(par[1] + par[2] * x), add = TRUE, n = 1000)

score -1 · Answer 3 · answered Dec 08 '14 at 15:27

You can use the glmnet package for fitting L1, L2 models. It's not limited to logistic regression but includes it.

Here is the vignette: http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

There is also a webminar: https://www.youtube.com/watch?v=BU2gjoLPfDc

Liblinear is good, but i've found glmnet easier to get started. Glmnet includes a function thats does cross-validation and selects a regularization parameter for you based in different metrics such as the AUC.

Regarding theory, I would read the tibshiarini paper regarding lasso (L1 regularization) and the chapter in elements of statistical learning. http://statweb.stanford.edu/~tibs/lasso/lasso.pdf

About the log loss, it's just to evaluate models. It's not a loss function for model fitting.

They were asking about L1 loss not L1 regularization. – daknowles Oct 16 '21 at 20:56 — daknowles, Oct 16 '21 at 20:56

How do I train a (logistic?) regression in R using L1 loss function?

3 Answers3

Linked