Firstly, least squares (or sum of squared errors) is a possible loss function to use to fit your coefficients. There's nothing technically wrong with it.
However, there are a number of reasons why MLE is a more attractive option. In addition to those in the comments, here are two more:
Computational efficiency
Because the logistic regression model is a member of the exponential family, we can use Fisher's scoring algorithm to solve efficiently for $\beta$. In my experience, this algorithm converges in only a few steps; solving the least squares problem numerically will likely take longer.
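To make the Fisher scoring point concrete, here is a minimal NumPy sketch (the function name and setup are mine, not from any particular library). For the canonical logit link, Fisher scoring coincides with Newton-Raphson, i.e. the IRLS iteration that GLM routines use.

```python
import numpy as np

def fisher_scoring_logistic(X, y, max_iter=25, tol=1e-10):
    """Fit a logistic regression by Fisher scoring (illustrative sketch).

    X is assumed to already include an intercept column.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        w = p * (1.0 - p)                     # weights on the diagonal of the Fisher information
        score = X.T @ (y - p)                 # gradient of the log-likelihood
        info = X.T @ (X * w[:, None])         # expected Fisher information matrix
        step = np.linalg.solve(info, score)   # Newton/Fisher scoring step
        beta = beta + step
        if np.max(np.abs(step)) < tol:        # typically converges in a handful of iterations
            break
    return beta
```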
Lest this get lost, per @vbox's comment:
learning parameters for any machine learning model (such as logistic regression) is much easier if the cost function is convex. And, it's not too difficult to show that, for logistic regression, the cost function for the sum of squared errors is not convex, while the cost function for the log-likelihood is.
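You can see this numerically with a small one-parameter toy example (illustrative only; the simulation setup is my own). On a grid of slope values, the negative log-likelihood has nonnegative second differences while the squared-error loss does not:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# simulate from a logistic model with true slope 3 (no intercept, to keep it 1-D)
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-3.0 * x))).astype(float)

betas = np.linspace(-10.0, 10.0, 401)
sse, nll = [], []
for b in betas:
    p = 1.0 / (1.0 + np.exp(-b * x))
    p = np.clip(p, 1e-12, 1 - 1e-12)                              # guard against log(0)
    sse.append(np.sum((y - p) ** 2))                              # sum of squared errors
    nll.append(-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))  # negative log-likelihood

# In this simulated example the nll curve is convex in b (second differences
# stay essentially nonnegative), while the sse curve flattens out and is not.
print("min 2nd diff, nll:", np.min(np.diff(nll, 2)))
print("min 2nd diff, sse:", np.min(np.diff(sse, 2)))
```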
MLE has very nice properties
Solutions using MLE have nice properties such as:
- consistency: meaning that with more data, our estimate of $\beta$ converges to the true value.
- asymptotic normality: meaning that with more data, our estimate of $\beta$ is approximately normally distributed, with variance that decreases at rate $O(\frac{1}{n})$.
- functional invariance: a nice property to have when dealing with multiple parameters (nuisance parameters) and calculating the profile likelihood.
Among others.
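As a quick, purely illustrative simulation of the first two properties: refit the model on growing samples and the estimates tighten around the true coefficients, with standard errors shrinking roughly like $1/\sqrt{n}$ (here I use statsmodels for the MLE fit; the true coefficients below are made up for the simulation).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
true_beta = np.array([-0.5, 2.0])           # intercept and slope used to simulate data

for n in (100, 1_000, 10_000):
    x = rng.normal(size=n)
    X = sm.add_constant(x)                  # design matrix with an intercept column
    p = 1.0 / (1.0 + np.exp(-X @ true_beta))
    y = (rng.uniform(size=n) < p).astype(float)
    fit = sm.Logit(y, X).fit(disp=0)        # MLE fit
    print(n, fit.params, fit.bse)           # estimates approach true_beta; std. errors shrink
```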
However, using least squares does have some benefits
Least squares tends to be more robust to outliers: an outlier's contribution to the squared-error loss is bounded by 1 (because $(1-0)^2 = 1$), whereas under a negative log-likelihood loss function its contribution can be arbitrarily large.
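A toy calculation (not from the original answer) makes the contrast concrete for a single observation with $y = 1$ whose fitted probability is badly wrong:

```python
import numpy as np

# Loss contribution of one observation with y = 1: the squared error is capped
# at 1, while the log loss grows without bound as p -> 0.
for p in (0.5, 0.1, 0.01, 1e-6):
    squared_error = (1.0 - p) ** 2
    log_loss = -np.log(p)
    print(f"p = {p:<8g}  squared error = {squared_error:.4f}  log loss = {log_loss:.2f}")
```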
For more information check this or this out.
Edited
My interpretation of the OP's question is: why do we use MLE instead of a squared loss function to determine $\beta$ in a logistic regression model of the form
$$\operatorname{logit}(P(Y=1 \mid X)) = x\beta$$
where $P(Y=1 \mid X) = f(x;\beta) = \frac{e^{x\beta}}{1 + e^{x\beta}} = \frac{1}{1 + e^{-x\beta}}$.
So the loss function looks like:
$$\sum_{i} \left(y_i - f(x_i;\beta)\right)^2 = \sum_{i} \left(y_i - \frac{1}{1 + e^{-x_i\beta}}\right)^2$$
where the $y_i$ take values 0 or 1.
When I talk about computational efficiency, I mean finding the $\beta$ that minimizes the above, versus running Fisher scoring on the likelihood function.
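For reference, here is a sketch of what "finding the $\beta$ that minimizes the above" might look like with a generic optimizer (the setup and names are illustrative; a dedicated GLM routine would handle the MLE side via Fisher scoring):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])             # intercept + slope
p_true = 1.0 / (1.0 + np.exp(-(1.0 + 2.0 * x)))       # true beta = (1, 2), chosen for illustration
y = (rng.uniform(size=500) < p_true).astype(float)

def sse_loss(beta):
    """Sum of squared errors between y and the fitted probabilities."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum((y - p) ** 2)

# Generic quasi-Newton minimization of the squared-error loss.  Because this
# surface is not convex, the result can depend on the starting point, and it
# typically needs more iterations than the handful Fisher scoring takes.
res = minimize(sse_loss, x0=np.zeros(2), method="BFGS")
print(res.x, res.nit)
```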