
What's the intuitive difference between logistic regression of $P(Y|X)$ on the dataset

X1 X2 Y
0  0  0
1  0  0
1  0  1
0  1  0
0  1  0
0  1  0
0  1  1
1  1  1
1  1  1

and a weighted linear regression on

X1 X2 Y    w
0  0  0    1 
1  0  0.5  2
0  1  0.25 4
1  1  1    2

I know they are defined differently, and I feel like the second approach is probably kind of "wrong" philosophically; however, for a very large dataset it could be very beneficial computationally. I'm asking more for a feel of how big a tradeoff this would be from a practical point of view, and how different the classifiers would be in practice.
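To make the comparison concrete, here is a minimal sketch of the two approaches (a sketch assuming scikit-learn and NumPy; the aggregated table is derived from the raw rows in code rather than typed in, so the rates and weights are guaranteed consistent):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Raw dataset: one row per observation, as in the first table above.
X_raw = np.array([[0, 0],
                  [1, 0], [1, 0],
                  [0, 1], [0, 1], [0, 1], [0, 1],
                  [1, 1], [1, 1]])
y_raw = np.array([0, 0, 1, 0, 0, 0, 1, 1, 1])

# Aggregate to one row per distinct (X1, X2): empirical rate Y and count w.
cells = {}
for x, y in zip(map(tuple, X_raw), y_raw):
    n, s = cells.get(x, (0, 0))
    cells[x] = (n + 1, s + y)
X_agg = np.array(list(cells))
w_agg = np.array([n for n, s in cells.values()])
y_agg = np.array([s / n for n, s in cells.values()])

# Approach 1: logistic regression on the raw 0/1 outcomes
# (note scikit-learn applies an L2 penalty by default).
logit = LogisticRegression().fit(X_raw, y_raw)

# Approach 2: weighted linear regression on the aggregated rates.
wls = LinearRegression().fit(X_agg, y_agg, sample_weight=w_agg)

print(logit.predict_proba(X_agg)[:, 1])  # always inside [0, 1]
print(wls.predict(X_agg))                # unconstrained fitted values
```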

Max Flander
  • How are you coding the predictors? Interaction terms? Intercepts? Without more information the answer would probably be pretty long – Taylor Oct 24 '16 at 22:31
  • @Taylor I don't really have a good answer to this; I'm not a statistician, so I would probably just try them all and see which one had better cross-validated performance. I would love a long answer if you have the time! – Max Flander Oct 24 '16 at 23:02
  • You write "intuitive/statistical" as if "intuitive" and "statistical" meant the same thing. Are you looking for an intuitive answer or a mathematically precise answer? – Kodiologist Oct 24 '16 at 23:47
  • @Kodiologist Definitely more after a practical/common-sense answer about how bad it is to take a binary classification problem and reframe it as a continuous regression on a rate. I know it's probably the "wrong" thing to do from a theoretical point of view, but it could be very beneficial computationally; I'm interested in how serious the tradeoffs are – Max Flander Oct 25 '16 at 01:25
  •
    One thing about your particular example is that you have two categorical predictor variables, while a logistic regression doesn't require that. If your example had two continuous predictors, it wouldn't work out as neatly. Your aggregated data is essentially a contingency table. How would you deal with the uncertainties in your table? How would you reflect whether 0.25 was 1000 out of 4000 or 1 out of 4? – Wayne Oct 25 '16 at 23:42
  • To follow up on @Wayne 's comment -- R's GLM implements a method to work with precisely this scenario: the dependent variable can be a 2-column matrix reflecting the total number of successes and the total number of trials for binomial regression. – Sycorax Oct 26 '16 at 00:06
  • @Sycorax OK, so if I included weights in a GLM, would you expect a difference from a cross-validation point of view? – Max Flander Oct 26 '16 at 01:00
  • @MaxFlander Difference in what? Between a weighted and unweighted GLM? Yes, because the variance is different. Difference between a linear model and a binomial GLM? Yes, because a binomial GLM is predicting probability of a success (a real between 0 and 1) that is multilinear in its predictors, while the OLS model is a response (any real) that is multilinear in its predictors. In other words, a binomial GLM minimizes cross-entropy loss while OLS minimizes squared error loss; these are not the same thing. – Sycorax Oct 26 '16 at 01:08
  • @Sycorax thanks for taking the time to explain this, actually I'm wondering about the difference between GLM on the weighted contingency table, and Logistic regression on the underlying dataset – Max Flander Oct 26 '16 at 01:12
  • @MaxFlander Ok, but comparing two different GLMs is a different question than what is written in your post, which I understood to be about OLS vs. GLMs. – Sycorax Oct 26 '16 at 02:04
  • @Sycorax Hmm, yes, I agree that the question would have been clearer if I had phrased it in terms of GLMs, but I didn't make this connection when I was asking. If I change "linear regression" to "weighted linear regression", does the meaning become clear? – Max Flander Oct 26 '16 at 02:28
  • That would be yet a third question, since WLS and OLS are two different things; so far I had understood this to be a contrast between OLS and logistic regression, or else weighted logistic regression and "vanilla" logistic regression. – Sycorax Oct 26 '16 at 02:38
  • @Sycorax Sorry for the confusion; I'm not a statistician, so I don't really know the precise way to phrase things, and I appreciate you sticking with me on this. The way I'm understanding it is from a machine-learning point of view, where the model is a black box that takes X as an input and spits out P(Y|X) as an output; I could model it on the raw data with a logistic regression, or I could model it on the contingency table with a weighted continuous-valued linear model. I want to understand the difference between these approaches. – Max Flander Oct 26 '16 at 02:59
  • I realise it may seem very weird/arbitrary to you but it is sort of a natural question from a machine-learning point of view where models can be viewed as black boxes which are either "classifiers" (modelling discrete values) or "regressors" (modelling continuous values) – Max Flander Oct 26 '16 at 03:00

1 Answer


The big difference between the two is that logistic regression, because it models the logit, produces fitted probabilities constrained to lie between 0 and 1. So whenever your dependent variable is binary (0, 1), you should use logistic regression instead of linear regression: the latter can produce estimates that are < 0 or > 1, which makes no sense in a binomial (0, 1) context. Logistic regression can be thought of as modelling the probability of a binomial event happening (0 = it does not occur; 1 = it occurs), and by definition a probability cannot be negative or exceed 1. Logistic regression will respect those constraints; linear regression will not.
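A quick numerical illustration of that point (a sketch assuming scikit-learn): fit both models to the same 0/1 outcome and then predict slightly outside the training range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# A binary outcome that switches from 0 to 1 as x grows.
x = np.arange(10).reshape(-1, 1)
y = (np.arange(10) >= 5).astype(int)  # [0,0,0,0,0,1,1,1,1,1]

lin = LinearRegression().fit(x, y)
log = LogisticRegression().fit(x, y)

grid = np.array([[-5.0], [15.0]])     # extrapolate beyond the data
print(lin.predict(grid))              # one value below 0, one above 1
print(log.predict_proba(grid)[:, 1])  # both stay inside [0, 1]
```

The linear fit is a straight line through the 0/1 points, so it must eventually leave [0, 1]; the logistic curve flattens toward 0 and 1 instead.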

Sympa