
What's the intuitive difference between logistic regression of $P(Y|X)$ on the dataset

X1 X2 Y
0  0  0
1  0  0
1  0  1
0  1  0
0  1  0
0  1  0
0  1  1
1  1  1
1  1  1

and a weighted linear regression on

X1 X2 Y    w
0  0  0    1 
1  0  0.5  2
0  1  0.25 4
1  1  1    2

I know they are defined differently, and I feel like the second approach is probably kind of "wrong" philosophically; however, for a very large dataset it could be very beneficial computationally. I'm asking more for a feel of how big a tradeoff this would be from a practical point of view, and how different the classifiers would be in practice.
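To make the comparison concrete, here is a minimal sketch of the two approaches (a sketch assuming scikit-learn and NumPy; the aggregated table is derived from the raw rows in code rather than typed in, so the rates and weights are guaranteed consistent):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Raw dataset: one row per observation, as in the first table above.
X_raw = np.array([[0, 0],
                  [1, 0], [1, 0],
                  [0, 1], [0, 1], [0, 1], [0, 1],
                  [1, 1], [1, 1]])
y_raw = np.array([0, 0, 1, 0, 0, 0, 1, 1, 1])

# Aggregate to one row per distinct (X1, X2): empirical rate Y and count w.
cells = {}
for x, y in zip(map(tuple, X_raw), y_raw):
    n, s = cells.get(x, (0, 0))
    cells[x] = (n + 1, s + y)
X_agg = np.array(list(cells))
w_agg = np.array([n for n, s in cells.values()])
y_agg = np.array([s / n for n, s in cells.values()])

# Approach 1: logistic regression on the raw 0/1 outcomes
# (note scikit-learn applies an L2 penalty by default).
logit = LogisticRegression().fit(X_raw, y_raw)

# Approach 2: weighted linear regression on the aggregated rates.
wls = LinearRegression().fit(X_agg, y_agg, sample_weight=w_agg)

print(logit.predict_proba(X_agg)[:, 1])  # always inside [0, 1]
print(wls.predict(X_agg))                # unconstrained fitted values
```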

Max Flander
  • How are you coding the predictors? Interaction terms? Intercepts? Without more information the answer would probably be pretty long – Taylor Oct 24 '16 at 22:31
  • @Taylor I don't really have a good answer to this; I'm not a statistician, so I would probably just try them all and see which one had better cross-validated performance. I would love a long answer if you have the time! – Max Flander Oct 24 '16 at 23:02
  • You write "intuitive/statistical" as if "intuitive" and "statistical" meant the same thing. Are you looking for an intuitive answer or a mathematically precise answer? – Kodiologist Oct 24 '16 at 23:47
  • @Kodiologist Definitely more after a practical/common-sense answer about how bad it is to take a binary classification problem and reframe it as a continuous regression on a rate. I know it's probably the "wrong" thing to do from a theoretical point of view, but it could be very beneficial computationally; I'm interested in how serious the tradeoffs are – Max Flander Oct 25 '16 at 01:25
  •
    One thing about your particular example is that you have two categorical predictor variables, while a logistic regression doesn't require that. If your example had two continuous predictors, it wouldn't work out as neatly. Your aggregated data is essentially a contingency table. How would you deal with the uncertainties in your table? How would you reflect whether 0.25 was 1000 out of 4000 or 1 out of 4? – Wayne Oct 25 '16 at 23:42
  • To follow up on @Wayne 's comment -- R's GLM implements a method to work with precisely this scenario: the dependent variable can be a 2-column matrix reflecting the total number of successes and the total number of trials for binomial regression. – Sycorax Oct 26 '16 at 00:06
  • @Sycorax OK, so if I included weights in a GLM, would you expect a difference from a cross-validation point of view? – Max Flander Oct 26 '16 at 01:00
  • @MaxFlander Difference in what? Between a weighted and unweighted GLM? Yes, because the variance is different. Difference between a linear model and a binomial GLM? Yes, because a binomial GLM is predicting probability of a success (a real between 0 and 1) that is multilinear in its predictors, while the OLS model is a response (any real) that is multilinear in its predictors. In other words, a binomial GLM minimizes cross-entropy loss while OLS minimizes squared error loss; these are not the same thing. – Sycorax Oct 26 '16 at 01:08
  • @Sycorax thanks for taking the time to explain this, actually I'm wondering about the difference between GLM on the weighted contingency table, and Logistic regression on the underlying dataset – Max Flander Oct 26 '16 at 01:12
  • @MaxFlander Ok, but comparing two different GLMs is a different question than what is written in your post, which I understood to be about OLS vs. GLMs. – Sycorax Oct 26 '16 at 02:04
  • @Sycorax Hmm, yes, I agree that the question would have been clearer if I had phrased it in terms of GLMs, but I didn't make this connection when I was asking. If I change "linear regression" to "weighted linear regression", does the meaning become clear? – Max Flander Oct 26 '16 at 02:28
  • That would be yet a third question, since WLS and OLS are two different things; so far I had understood this to be a contrast between OLS and logistic regression, or else weighted logistic regression and "vanilla" logistic regression. – Sycorax Oct 26 '16 at 02:38
  • @Sycorax Sorry for the confusion; I'm not a statistician, so I don't really know the precise way to phrase things, and I appreciate you sticking with me on this. The way I'm understanding it is from a machine-learning point of view, where the model is a black box that takes X as an input and spits out P(Y|X) as an output; I could model it on the raw data with a logistic regression, or I could model it on the contingency table with a weighted continuous-valued linear model. I want to understand the difference between these approaches. – Max Flander Oct 26 '16 at 02:59
  • I realise it may seem very weird/arbitrary to you but it is sort of a natural question from a machine-learning point of view where models can be viewed as black boxes which are either "classifiers" (modelling discrete values) or "regressors" (modelling continuous values) – Max Flander Oct 26 '16 at 03:00

1 Answer


The big difference between the two is that logistic regression, because it models the logit, produces fitted probabilities constrained to lie between 0 and 1. So whenever your dependent variable is binary (0, 1), you should use logistic regression instead of linear regression: the latter can produce estimates that are < 0 or > 1, which makes no sense in a binomial (0, 1) context. Logistic regression can be thought of as modelling the probability of a binomial event happening (0 = it does not occur; 1 = it occurs), and by definition a probability cannot be negative or exceed 1. Logistic regression will respect those constraints; linear regression will not.
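A quick numerical illustration of that point (a sketch assuming scikit-learn): fit both models to the same 0/1 outcome and then predict slightly outside the training range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# A binary outcome that switches from 0 to 1 as x grows.
x = np.arange(10).reshape(-1, 1)
y = (np.arange(10) >= 5).astype(int)  # [0,0,0,0,0,1,1,1,1,1]

lin = LinearRegression().fit(x, y)
log = LogisticRegression().fit(x, y)

grid = np.array([[-5.0], [15.0]])     # extrapolate beyond the data
print(lin.predict(grid))              # one value below 0, one above 1
print(log.predict_proba(grid)[:, 1])  # both stay inside [0, 1]
```

The linear fit is a straight line through the 0/1 points, so it must eventually leave [0, 1]; the logistic curve flattens toward 0 and 1 instead.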

Sympa