I am conducting a logistic regression. I created the following test data (the two predictors and the criterion are binary variables):
UV1 UV2 AV
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
7 1 1 1
8 0 0 1
9 0 0 1
10 0 0 1
11 1 1 0
12 1 1 0
13 1 0 0
14 1 0 0
15 1 0 0
16 1 0 0
17 1 0 0
18 0 0 0
19 0 0 0
20 0 0 0
AV = dependent variable (criterion)
UV1, UV2 = independent variables (predictors)
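For reproducibility, the data can be constructed as follows (a minimal sketch; I assume the data frame is called lrdata, matching the model calls below):
> lrdata <- data.frame(
+   UV1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0),
+   UV2 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
+   AV  = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))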
To measure the UVs' effect on the AV, a logistic regression is necessary, as the AV is a binary variable. Hence I used the following code,
> lrmodel <- glm(AV ~ UV1 + UV2, data = lrdata, family = "binomial")
including "family = "binomial"". Is this correct?
Regarding my test data, I was wondering about the whole model, especially the coefficient estimates and their significance:
> summary(lrmodel)
Call:
glm(formula = AV ~ UV1 + UV2, family = "binomial", data = lrdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7344 -0.2944 0.3544 0.7090 1.1774
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.065e-15 8.165e-01 0.000 1.000
UV1 -1.857e+01 2.917e+03 -0.006 0.995
UV2 1.982e+01 2.917e+03 0.007 0.995
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.726 on 19 degrees of freedom
Residual deviance: 17.852 on 17 degrees of freedom
AIC: 23.852
Number of Fisher Scoring iterations: 17
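To see what the model actually predicts for each combination of the predictors, the fitted probabilities can be inspected (a quick sketch, using the lrmodel object fitted above):
> cbind(lrdata, fitted = round(fitted(lrmodel), 3))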
Why is UV2 not significant? Note that for group AV = 1 there are 7 cases with UV2 = 1, while for group AV = 0 there are only 2 cases with UV2 = 1. I expected UV2 to be a significant discriminator.
Despite the non-significance of the UVs, the coefficient estimates are, in my opinion, very large (e.g. 1.982e+01 for UV2). How is this possible?
Why isn't the intercept 0.5? We have 10 cases with AV = 1 and 10 cases with AV = 0.
Further, I created UV1 as a predictor that I expected not to be significant: for group AV = 1 there are 7 cases with UV1 = 1, and for group AV = 0 there are 7 cases with UV1 = 1 as well.
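The group counts mentioned above can be verified with cross-tabulations (again assuming lrdata as constructed above):
> table(AV = lrdata$AV, UV2 = lrdata$UV2)   # 7 vs. 2 cases with UV2 = 1
> table(AV = lrdata$AV, UV1 = lrdata$UV1)   # 7 vs. 7 cases with UV1 = 1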
The whole "picture" I gained from the logistic is confusing me...
What confused me even more: when I run a "not-logistic" regression (by omitting family = "binomial"),
> lrmodel <- glm(AV ~ UV1 + UV2, data = lrdata)
I get the results I expected:
Call:
glm(formula = AV ~ UV1 + UV2, data = lrdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7778 -0.1250 0.1111 0.2222 0.5000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5000 0.1731 2.889 0.01020 *
UV1 -0.5000 0.2567 -1.948 0.06816 .
UV2 0.7778 0.2365 3.289 0.00433 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.1797386)
Null deviance: 5.0000 on 19 degrees of freedom
Residual deviance: 3.0556 on 17 degrees of freedom
AIC: 27.182
Number of Fisher Scoring iterations: 2
- UV1 is not significant! :-)
- UV2 has a positive effect on AV = 1! :-)
- The intercept is 0.5! :-)
My overall question: why does the logistic regression (with family = "binomial") not produce the results I expected, while the "not-logistic" regression (without family = "binomial") does?
Update: Are the observations described above due to the correlation between UV1 and UV2 (corr = 0.56)? After manipulating UV2's data
AV: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
UV1: 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0
UV2: 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
(I swapped the positions of the three 0s and the three 1s in UV2 to obtain a correlation < 0.1 between UV1 and UV2), hence:
UV1 UV2 AV
1 1 0 1
2 1 0 1
3 1 0 1
4 1 1 1
5 1 1 1
6 1 1 1
7 1 1 1
8 0 1 1
9 0 1 1
10 0 1 1
11 1 1 0
12 1 1 0
13 1 0 0
14 1 0 0
15 1 0 0
16 1 0 0
17 1 0 0
18 0 0 0
19 0 0 0
20 0 0 0
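The modified data and the correlation between the predictors can be checked like this (a sketch, reusing the lrdata name from above):
> lrdata$UV2 <- c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
> cor(lrdata$UV1, lrdata$UV2)   # should now be well below 0.1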
With the correlation removed, my results come closer to my expectations:
Call:
glm(formula = AV ~ UV1 + UV2, family = "binomial", data = lrdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.76465 -0.81583 -0.03095 0.74994 1.58873
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1248 1.0862 -1.036 0.3004
UV1 0.1955 1.1393 0.172 0.8637
UV2 2.2495 1.0566 2.129 0.0333 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.726 on 19 degrees of freedom
Residual deviance: 22.396 on 17 degrees of freedom
AIC: 28.396
Number of Fisher Scoring iterations: 4
But why does the correlation influence the results of the logistic regression but not the results of the "not-logistic" regression?