The sigmoid function in the logistic regression model precludes the closed-form algebraic parameter estimation available in ordinary least squares (OLS). Instead, iterative numerical methods, such as gradient descent or Newton's method, are used to minimize a cost function of the form:
$\text{cost}(\sigma(\Theta^\top {\bf x}),{\bf y})=\color{blue}{-{\bf y}\log(\sigma(\Theta^\top {\bf x}))}\color{red}{-(1-{\bf y})\log(1-\sigma(\Theta^\top {\bf x}))}$, where
$\large \sigma(z)=\frac{1}{1+e^{-z}}$ with $z=\Theta^\top{\bf x}$, i.e. the sigmoid function. Notice that if $y=1$, only the blue part of the cost is active, and we want the predicted probability, $\sigma(\Theta^\top x)$, to be high: the closer it gets to $1$, the closer $-\log(\sigma(\Theta^\top x))$ gets to zero. Conversely, if $y=0$, only the red part of the equation comes into play, and the smaller $\sigma(\Theta^\top x)$, the closer the cost will be to zero.
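As a minimal sketch (my own illustration, not part of the original derivation), the two branches of the cost behave exactly as described; `sigmoid` and `cost` below are just the formulas above written in R:

    sigmoid = function(z) 1 / (1 + exp(-z))                      # the sigmoid function above
    cost    = function(p, y) -y * log(p) - (1 - y) * log(1 - p)  # cross-entropy cost

    p = sigmoid(c(-4, 0, 4))   # predicted probabilities, roughly 0.02, 0.5, 0.98
    cost(p, y = 1)             # large when p is near 0, near zero when p is near 1
    cost(p, y = 0)             # the mirror image: near zero when p is near 0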
Equivalently, we can maximize the likelihood function as:
$p({\bf y \vert x,\theta}) = \left(\sigma(\Theta^\top {\bf x})\right)^{\bf y}\,\left(1 - \sigma(\Theta^\top {\bf x})\right)^{1 -\bf y}$.
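For instance (a sketch assuming the `mtcars` example used further down, not a method prescribed by the quoted text), the Bernoulli negative log-likelihood can be minimized numerically with `optim()`, and the estimates land close to what `glm()` returns:

    # Sketch: numerical maximum likelihood for am ~ mpg in mtcars
    X = cbind(1, mtcars$mpg)                    # design matrix with an intercept column
    y = mtcars$am

    neg_log_lik = function(theta) {
      p = 1 / (1 + exp(-X %*% theta))           # sigmoid of the linear predictor
      -sum(y * log(p) + (1 - y) * log(1 - p))   # negative Bernoulli log-likelihood
    }

    optim(c(0, 0), neg_log_lik)$par             # compare to coef(glm(am ~ mpg, binomial, mtcars))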
The sentence you quote, though, refers, I believe, to the relatively linear part of the sigmoid curve:
> Because the model can be expressed as a generalized linear model (see below), for $0<p<1$, ordinary least squares can suffice, with R-squared as the measure of goodness of fit in the fitting space. When $p=0$ or $1$, more complex methods are required.
The logistic regression model is:
$$\text{odds(Y=1)} = \frac{p\,(Y=1)}{1\,-\,p\,(Y=1)} = e^{\,\theta_0 + \theta_1 x_1 + \cdots + \theta_p x_p} $$
or,
$\log \left(\text{odds(Y=1)}\right) = \log\left(\frac{p\,(Y=1)}{1\,-\,p\,(Y=1)}\right) = \theta_0 + \theta_1 x_1 + \cdots + \theta_p x_p=\Theta^\top{\bf X}\tag{*}$
Hence, this is "close enough" to an OLS model ($\bf y=\Theta^\top \bf X+\epsilon$) to be fit as such, and for the parameters to be estimated in closed form, provided the probability of $\bf y = 1$ (remember the Bernoulli modeling of the response variable in logistic regression) is not close to $0$ or $1$. In other words, while $\log\left(\frac{p\,(Y=1)}{1\,-\,p\,(Y=1)}\right)$ in Eq. * stays away from the asymptotic regions.
See for instance this interesting entry in Statistical Horizons, which I wanted to test with the `mtcars` dataset in R. The transmission variable, `am` (0 = automatic, 1 = manual), is binary, so we can regress it on miles-per-gallon, `mpg`. Can we predict a car model's transmission type based on its gas consumption?
If I go ahead and just plow through the problem with OLS estimates, I get a prediction accuracy of $75\%$ based on this single predictor. And guess what? I get the exact same confusion matrix and accuracy rate if I fit a logistic regression.
The thing is that the output of OLS is not binary, but continuous; since it tries to estimate the actual binary $\bf y$ values, the fitted values typically fall between $0$ and $1$, much like probabilities, although they are not strictly bounded the way they are in logistic regression (the sigmoid function).
Here is the code:
> d = mtcars
> summary(as.factor(d$am))
0 1
19 13
> fit_LR = glm(as.factor(am) ~ mpg, family = binomial, d)
> pr_LR = predict(fit_LR, type="response")   # predicted probabilities P(Y = 1)
>
> # all.equal(pr_LR, 1 / (1 + exp( - predict(fit_LR) ) ) )   # predict() returns the log-odds of P(Y = 1) by default
>
> d$predict_LR = ifelse(pr_LR > 0.5, 1, 0)
> t_LR = table(d$am, d$predict_LR)   # confusion matrix: actual vs. predicted
> (accuracy = (t_LR[1,1] + t_LR[2,2]) / sum(t_LR))
[1] 0.75
>
> fit_OLS = lm(am ~ mpg, d)
> pr_OLS = predict(fit_OLS)   # continuous fitted values, not probabilities
> d$predict_OLS = ifelse(pr_OLS > 0.5, 1, 0)
> (t_OLS = table(d$am, d$predict_OLS))
0 1
0 17 2
1 6 7
> (accuracy = (t_OLS[1,1] + t_OLS[2,2]) / sum(t_OLS))
[1] 0.75
The frequency of automatic vs. manual cars is fairly balanced, and the OLS model does well enough as a simple linear classifier (a perceptron of sorts).
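A plot along these lines (a sketch reusing `d`, `fit_LR`, and `fit_OLS` from the code above; `mpg_grid` is just a plotting grid I introduce) makes the comparison visible: both fitted curves cross the $0.5$ threshold between the same data points, which is consistent with the identical confusion matrices.

    plot(d$mpg, d$am, pch = 19, xlab = "mpg", ylab = "am / fitted value")
    mpg_grid = seq(min(d$mpg), max(d$mpg), length.out = 200)
    lines(mpg_grid, predict(fit_LR, data.frame(mpg = mpg_grid), type = "response"), col = "blue")
    lines(mpg_grid, predict(fit_OLS, data.frame(mpg = mpg_grid)), col = "red")
    abline(h = 0.5, lty = 2)   # the 0.5 classification threshold
    legend("topleft", legend = c("logistic", "OLS"), col = c("blue", "red"), lty = 1)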
