My understanding is that logistic regression assumes a linear relationship between the logit of the outcome and each predictor variable.

I'm working on a case study from this MIT course. My model is making really poor predictions, and I suspect the cause is non-linearity.

# Randomly assign each row to train (45%), validation (35%), or test (20%)
idx <- sample(1:3, size = nrow(Book), replace = TRUE, prob = c(.45, .35, .2))
train <- Book[idx == 1, ]
val <- Book[idx == 2, ]
test <- Book[idx == 3, ]

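# Logistic regression of Florence on all other columns
glm.fit1 <- glm(Florence ~ ., family = binomial, data = train)
summary(glm.fit1)
# Predicted probabilities on the test set, classified with a 0.5 cutoff
glm.probs1 <- predict(glm.fit1, test, type = "response")
glm.pred1 <- rep("0", nrow(test))
glm.pred1[glm.probs1 > .5] <- "1"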

This is the confusion matrix on the test set (rows are predictions, columns are actual values):

> table(glm.pred1,test$Florence)

glm.pred1   0   1
        0 787  73
        1   0   1
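
To put those counts in context, here is a minimal sketch that re-enters the table by hand and computes a few summary rates. The object names (`cm`, `accuracy`, and so on) are illustrative and not part of the original code.

```r
# Counts copied by hand from the confusion matrix above
# (rows = predicted class, columns = actual class; filled column by column)
cm <- matrix(c(787, 0, 73, 1), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

n_test       <- sum(cm)                       # 861 test cases
accuracy     <- sum(diag(cm)) / n_test        # 788 / 861
base_rate    <- sum(cm[, "1"]) / n_test       # 74 / 861 actual positives
sensitivity  <- cm["1", "1"] / sum(cm[, "1"]) # 1 / 74 positives detected
majority_acc <- sum(cm[, "0"]) / n_test       # 787 / 861 if always predicting 0

print(round(c(accuracy = accuracy, base_rate = base_rate,
              sensitivity = sensitivity, majority_acc = majority_acc), 3))
```

With these counts, the overall accuracy (788/861) is almost identical to what always predicting 0 would achieve (787/861), because only 74 of the 861 test cases are positive.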

How can I check whether the linearity assumption holds?

What I've tried: I plotted the predictions against the log-transformed probabilities that came out of my model. I was told that this approach doesn't work for poorly performing classifiers. Here is a post with info.

Sebastian
  • You said "logistic regression assumes a linear relationship between the logit of the outcome and..." -- well, no, not quite. It assumes a linear relationship between the logit of $p$ (which is the *probability that the outcome is 1*) and the predictors, not the logit of the outcome itself. – Glen_b Jan 21 '19 at 03:31
  • @Glen_b is correct, but to be even more precise, it assumes a linear relationship between the logit of $p$ and the *continuous* predictors. – StatsStudent Jan 21 '19 at 03:33
  • I wondered whether to stress that I meant the predictors *in the design matrix* but figured that it wasn't necessary (perhaps wrongly, likely I should have done so). The linear predictor is clearly linear in $X$, since $\eta = X\beta$. – Glen_b Jan 21 '19 at 03:39
  • There are four primary methods that I use to assess the linearity of the relationship between the logit and continuous variables in logistic regression: smoothed scatterplots, fractional polynomials, splines, and the method of design variables. These are also the methods described, if I recall correctly, in the Hosmer, Lemeshow, and Sturdivant text Applied Logistic Regression (the diagnostics chapter). Check out the book for detailed examples and explanations. I may provide some additional details if time allows this evening or tomorrow. – StatsStudent Jan 21 '19 at 03:55
  • Can you tell us what your variables represent? Also, try a model with splines for the continuous variables. – kjetil b halvorsen Jan 21 '19 at 08:33
  • @kjetilbhalvorsen the link is to the problem description (not course :) – seanv507 Jan 21 '19 at 19:02
  • So you have built a model to predict the probability that a customer would buy "The Art History of Florence" (based on a bunch of demographic data). All the model is saying is that the predicted probability is lower than 50% for nearly all your customers (based on your variables). That doesn't mean the model is useless: e.g., perhaps 40-year-olds have a 40% probability but 20-year-olds have a 20% probability. – seanv507 Jan 21 '19 at 19:06
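
The spline suggestion from the comments can be sketched as follows. This is only an illustration on simulated data, since the `Book` columns are not shown here: `x`, `y`, the sample size, and the quadratic true logit are all assumptions made for the example. It compares a model that is linear in `x` with one that uses a natural cubic spline (`splines::ns`) through a likelihood ratio test; a small p-value suggests that the linear-in-the-logit form is inadequate for that predictor.

```r
library(splines)  # part of the standard R distribution

set.seed(1)
n <- 2000
x <- runif(n, -2, 2)                      # hypothetical continuous predictor
y <- rbinom(n, 1, plogis(-1.5 + x^2))     # true logit is quadratic, not linear

# Model 1: logit(p) assumed linear in x
fit_lin <- glm(y ~ x, family = binomial)
# Model 2: logit(p) modelled as a natural cubic spline in x (4 basis columns)
fit_spl <- glm(y ~ ns(x, df = 4), family = binomial)

# Likelihood ratio test: the linear model is nested in the spline model
dev_diff <- deviance(fit_lin) - deviance(fit_spl)
df_diff  <- df.residual(fit_lin) - df.residual(fit_spl)
p_value  <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)

print(c(deviance_drop = dev_diff, df = df_diff, p_value = p_value))
```

On real data, the same comparison would be run one continuous predictor at a time, keeping the other terms of the model unchanged in both fits.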

0 Answers