Logistic regression results (coefficients) counterintuitive?

Question

I ran a logistic regression model in SPSS with a yes/no dependent variable indicating whether the respondent chose the bus (the alternative being a personal vehicle) and 5 independent variables (Waiting Time, Trip Time, Total Daily Expense, Overall Mode Comfort, and Overall Mode Ease-of-use). While the Omnibus and Hosmer-Lemeshow tests show the model to be very good, and the significance of most variables is adequate, the coefficients of some of the variables have unexpected signs. As a result, the estimated probability responds to those predictors in a way that goes against intuition about real-life conditions.

For example, the Comfort variable has a coefficient of -0.102821; this translates to a lower probability of choosing the bus when the Comfort value is high. Who wouldn't choose the bus when the Comfort value is over the top? I think the coefficient should be positive rather than negative. I should probably also point out that the intercept is negative; I'm not sure how much this affects the model.

So what seems to be the problem with my model?

It might be the correlations between the predictors - what is the coefficient for comfort if you include no other predictors? — Jeremy Miles, Jun 16 '15 at 04:24
What do you mean by the correlations between predictors? The coefficient for Comfort is -0.035395 when I don't include the other predictors (though statistically insignificant). Thanks for the response. — Umar Al Faruq, Jun 16 '15 at 10:08
How exactly is "Comfort" coded? That is, what is the assignment of numerical values to comfort levels? — whuber, Jun 16 '15 at 15:18
With 120 observations and 6 parameters, I would worry about finite sample bias of MLE. — dimitriy, Jun 17 '15 at 22:21
Hosmer-Lemeshow is considered obsolete: https://stats.stackexchange.com/questions/273966/logistic-regression-with-poor-goodness-of-fit-hosmer-lemeshow — kjetil b halvorsen, May 14 '20 at 12:18

Jason Sanchez · Answer 1 · 2015-06-16T15:00:29.653

3

If the beta parameter estimate is statistically significant, then the issue is not correlation among variables in your model.

Instead, the issue is one of omitted variable bias. This means there is a variable your model does not control for that is correlated with the comfort variable and the response variable. The impact of this omitted variable is absorbed by the comfort variable.

As a theoretical example, the buses that are the most comfortable might be located in the areas that are more wealthy. Perhaps people in wealthier areas are more likely to use their personal vehicle.

Because you did not control for the wealth of the driver (the variable was omitted) and this variable could be correlated with which buses are more comfortable, the comfort coefficient could be negatively biased (potentially by enough to flip the sign of the parameter estimate).
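To make the mechanism concrete, here is a minimal simulation sketch. It is not the asker's data: the variable names (`wealth`, `comfort`), the effect sizes, and the sample size are all made up for illustration, and the small Newton-Raphson fitter is a stand-in for whatever SPSS does internally. Comfort has a true positive effect on choosing the bus, wealth has a negative effect, and the two are positively correlated.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def fit_logit(rows, y, iters=25):
    """Maximum-likelihood logistic regression via Newton-Raphson.

    rows: one list of predictor values per observation (an intercept is added here).
    Returns [intercept, beta_1, beta_2, ...].
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * xi[i]
                for j in range(k):
                    hess[i][j] += w * xi[i] * xi[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


random.seed(0)
n = 4000  # deliberately large so that sampling noise does not blur the sign

# Hypothetical data-generating process (assumed, not estimated from any survey):
wealth = [random.gauss(0, 1) for _ in range(n)]
comfort = [0.8 * w + random.gauss(0, 0.6) for w in wealth]  # correlated with wealth
y = []
for c, w in zip(comfort, wealth):
    p_bus = 1.0 / (1.0 + math.exp(-(0.5 * c - 1.5 * w)))  # true comfort effect is +0.5
    y.append(1 if random.random() < p_bus else 0)

b_omitted = fit_logit([[c] for c in comfort], y)               # wealth left out
b_full = fit_logit([[c, w] for c, w in zip(comfort, wealth)], y)  # wealth included

print("comfort coefficient, wealth omitted: ", round(b_omitted[1], 3))
print("comfort coefficient, wealth included:", round(b_full[1], 3))
```

Under these assumptions the comfort coefficient comes out negative when wealth is left out and positive once wealth is included, even though nothing about comfort itself changed. Whether wealth (or anything like it) plays this role in the real survey is something only subject knowledge can settle.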

Remember that when you interpret a beta parameter as holding everything else constant, you really mean that you are holding all of the variables in your model constant. Any variable not in your model that is correlated with a variable in your model is not considered to be held constant.
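As a worked illustration of that interpretation, using the comfort coefficient reported in the question (the symbols $\beta_0$, $\beta_j$ and $x_j$ are my notation, not SPSS output):

$$
\log\frac{p}{1-p} = \beta_0 + \sum_j \beta_j x_j,
\qquad
\frac{\operatorname{odds}(x_{\text{comfort}} + 1)}{\operatorname{odds}(x_{\text{comfort}})}
= e^{\beta_{\text{comfort}}} = e^{-0.102821} \approx 0.902 .
$$

So, with the other variables in the model fixed, each one-unit increase in Comfort multiplies the estimated odds of choosing the bus by about 0.90. Nothing outside the model is held fixed in that comparison.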

Omitted variable bias violates a fundamental assumption of linear models and leads to biased parameter estimates. Because of this negative impact, you should always include all variables you believe are part of the model. Unfortunately, there are no good ways to know what is omitted through the model alone. You must use your own experience and judgment to understand what might be omitted.

As a side note, if your goal is just to estimate the effect of the comfort variable, then you don't need to include every possible variable that you might have omitted from your model. Instead, you only need to include the variables that are correlated with both the comfort variable and the response variable.

edited Jun 16 '15 at 15:00

answered Jun 16 '15 at 07:37

Jason Sanchez


So you suggest I should put in more independent variables? Also, how can I know which variable is absorbed by the Comfort variable? Thanks for the response. – Umar Al Faruq Jun 16 '15 at 10:11
I updated the answer to directly answer these questions. – Jason Sanchez Jun 16 '15 at 15:00
I ran a correlation analysis for all variables and found that Comfort isn't correlated with my response variable, and when I didn't include it in my model the results were better. Should I omit Comfort instead? – Umar Al Faruq Jun 16 '15 at 23:55
2 questions. First, is the comfort variable correlated with any of the other variables in your model? Second, do you believe the comfort variable should be part of the equation? – Jason Sanchez Jun 17 '15 at 01:04
Yes, Comfort is correlated with Total Daily Expense and Mode Safety (I forgot to mention this variable in my question) out of all the variables. Secondly, also yes, because ideally if an individual feels that a certain mode is more physically comfortable than another, they will likely choose that one over the other, with the other variables in mind as well. – Umar Al Faruq Jun 17 '15 at 01:59
Because you believe the comfort variable should be part of the model and the variable is correlated with total daily expense and mode safety, if you remove the comfort variable then you guarantee that the total daily expense and mode safety variables will be biased. Compare the beta parameters of those two variables when the comfort variable is in the model versus when it is not in the model. – Jason Sanchez Jun 17 '15 at 02:40
I should probably correct you: when I enter the 6 variables (not 5 as I said before), that is (1) Waiting Time, (2) Trip Time, (3) Total Daily Expense, (4) Mode Comfort, (5) Mode Safety and (6) Mode Ease-of-use, I get 6 coefficients + 1 intercept. I used 5 of those since the Mode Safety variable is not statistically significant. When I put those coefficients into the probability equation and change one variable while holding the others constant I get an S-shaped graph; so far so good. But the thing is some of the graphs are strange. [cont.] – Umar Al Faruq Jun 17 '15 at 06:42
[cont.] For example: when I change the values for the Comfort variable while holding the others constant, I get a graph showing that as the value of Comfort (in this case the comfort of the bus since I coded Bus as 1) goes higher, the probability of a person choosing the bus will decrease. This goes against intuition. The probability of me choosing the bus would certainly increase if I knew they put in sofas and plasma TVs inside. For the other ones they are either fine or can be interpreted in such a way that it isn't that much of a problem. – Umar Al Faruq Jun 17 '15 at 06:47
I think I do understand what you're asking, but I wanted to warn you that removing the comfort variable might greatly impact the marginal effects associated with your other variables. The likely cause of the beta parameter being negative is that you don't have enough variables in your model and an omitted and important variable is severely biasing the comfort beta parameter. Removing the comfort variable means that your model is even more poorly specified and the remaining parameters are even more biased. – Jason Sanchez Jun 17 '15 at 18:46
Oh I see. But I don't have any other variables. I do, but those aren't considered mode attributes (since I'm calculating the mode utility using their attributes through the model), such as Income and Age of the sample (which are attributes of the sample, not of the mode of transport). I won't be omitting any of the variables, but adding more variables confuses me even more because I don't know how I can interpret the model after adding other variables along with the mode attributes (since similar studies, read: previous undergraduate theses by my seniors, have used only mode attributes). That's my dilemma. – Umar Al Faruq Jun 17 '15 at 19:16
Those other variables seem valuable to me. Interpretation is similar to before. Instead of just controlling for the mode of transportation, you also control for differences in passenger characteristics. Try adding the income and age variable to the model and see how it impacts the comfort variable. – Jason Sanchez Jun 17 '15 at 19:31
Okay, so when I put the 6 variables in along with Age and Income, I find that all variables have a p-value over 0.05 except for Waiting Time, Trip Time, Total Daily Expense, and Mode Ease-of-use/convenience. So should I leave Comfort and the other non-significant variables out of the probability equation? – Umar Al Faruq Jun 17 '15 at 20:03
Does comfort still have a negative beta value? – Jason Sanchez Jun 17 '15 at 20:05
Yes it still does. – Umar Al Faruq Jun 17 '15 at 20:20
How large is the dataset? – Jason Sanchez Jun 17 '15 at 20:21
120 sample units. – Umar Al Faruq Jun 17 '15 at 20:24

score 0 · Answer 2 · answered Jun 17 '15 at 21:35

0

Excluded variables are likely biasing your beta estimates (potentially severely for the comfort variable), and because of the small sample size you will have difficulty estimating the parameters precisely.
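To give a rough sense of the precision problem, here is a small simulation sketch. The true slope of 0.1, the standard-normal predictor, and the single-predictor model are assumptions chosen for illustration (only the sample size of 120 comes from the comments), and `fit_slope` is a hypothetical helper, not an SPSS routine. It counts how often a logistic fit on 120 observations returns a negative slope when the true slope is small and positive.

```python
import math
import random


def fit_slope(x, y, iters=15):
    """Newton-Raphson fit of logit P(y=1) = a + b*x; returns the slope b."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. intercept
            g1 += (yi - p) * xi     # gradient w.r.t. slope
            h00 += w                # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return b


random.seed(1)
TRUE_SLOPE = 0.1  # assumed small positive effect, not taken from the survey
N = 120           # sample size mentioned in the comments
REPS = 400        # number of simulated datasets

negative = 0
for _ in range(REPS):
    x = [random.gauss(0, 1) for _ in range(N)]
    y = [1 if random.random() < 1.0 / (1.0 + math.exp(-TRUE_SLOPE * xi)) else 0
         for xi in x]
    if fit_slope(x, y) < 0:
        negative += 1

share_negative = negative / REPS
print("share of fits with the wrong (negative) sign:", share_negative)
```

Under these assumptions a normal approximation puts the standard error of the slope near 0.18, so a wrong sign should turn up in something like 30% of samples. With more parameters and correlated predictors, as in the actual model, the estimates would be less precise still.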

answered Jun 17 '15 at 21:35

Jason Sanchez


Is it possible that, aside from the small sample size (this I admit, due to time and manpower restrictions), the respondents filled in incorrect Comfort values, as in misinterpreting the question? I put in more passenger characteristics variables and the beta for Comfort turned positive, but the p-values for all variables are either 1 or 0.999. – Umar Al Faruq Jun 17 '15 at 22:15
Bummer, I am using the mobile app and didn't realize I accidentally added a new answer. I will fix that when I get to a computer. Yes it is possible people incorrectly interpreted the question on the survey. If you can confirm that happened, then at least you will have the correct sign on the variable. – Jason Sanchez Jun 17 '15 at 23:15
In my report, can I just write the results down and explain that there is indeed a bias? Or is there a workaround? – Umar Al Faruq Jun 20 '15 at 16:06
