
I am running a logistic regression in R on a binary dependent variable with a single independent variable. I found an odds ratio of 0.99 for the outcome, and reasoned as follows. Odds are defined as $\text{odds}(H) = \frac{P(H=1)}{1-P(H=1)}$. As given above, $\text{odds}(H) = 0.99$, which implies $P(H=1) = 0.497$, close to a 50% probability. That would mean the probability of an H case versus a non-H case is about 50% at a given value of the independent variable. This does not seem realistic, because only about 20% of the data fall in one of the two classes (the H = 0 class, as detailed below). Please clarify, and explain how to interpret this kind of situation in logistic regression.
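For concreteness, this is the conversion I used (a minimal sketch in base R, no packages needed):

# Convert an odds value to a probability: p = odds / (1 + odds)
odds <- 0.99
odds / (1 + odds)
# [1] 0.4974874   i.e. close to 50%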

Here is my model output:

M1 <- glm(H ~ X, data = data, family = binomial())  # logistic regression of H on X
summary(M1)

Call:
glm(formula = H ~ X, family = binomial(), data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8563   0.6310   0.6790   0.7039   0.7608  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.6416666  0.2290133   7.168 7.59e-13 ***
X           -0.0014039  0.0009466  -1.483    0.138    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1101.1  on 1070  degrees of freedom
Residual deviance: 1098.9  on 1069  degrees of freedom
  (667 observations deleted due to missingness)
AIC: 1102.9

Number of Fisher Scoring iterations: 4


exp(cbind(OR = coef(M1), confint(M1)))  # odds ratios with 95% profile-likelihood CIs
Waiting for profiling to be done...
                   OR     2.5 %   97.5 %
(Intercept) 5.1637680 3.3204509 8.155564
X           0.9985971 0.9967357 1.000445

My dataset has 1738 observations in total, with H as the binary dependent variable: 19.95% of cases fall in the H=0 category and the rest in H=1. This dependent variable is modeled against the covariate X, whose minimum is 82.23, mean 223.8, and maximum 391.6. The 667 missing values are all in the covariate X, i.e., X is missing for 667 of the 1738 observations.
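A minimal sketch of how these summaries can be reproduced in R (assuming the data frame data with columns H and X, as in the model call above):

prop.table(table(data$H))  # shares of H = 0 (19.95%) and H = 1
summary(data$X)            # min 82.23, mean 223.8, max 391.6, plus the NA count
sum(is.na(data$X))         # 667 missing values of X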

Saurabh Sinha
  • What is your interpretation of the intercept estimate? – whuber Jun 20 '16 at 17:24
  • The estimated intercept is 1.6416666, so the odds of being in the H=1 class when the independent variable X is zero are exp(1.6416666) = 5.1637680. Those odds are high, but if we look at the variable X, its minimum observed value is 82.23 and its mean is 223.8. So the intercept in this model corresponds to the log odds of being in the H=1 class when X is at the hypothetical value of zero. Any corrections or suggestions are extremely welcome. – Saurabh Sinha Jun 20 '16 at 17:53
  • Doesn't that odds of 1.64 correspond to the "~20%" you mention? – whuber Jun 20 '16 at 18:21
  • No no, that 20% is the share of the data in the H=0 class. But with odds of 0.99, the probabilities of the H=0 class and the H=1 class would be almost equal, i.e., about 50% each, which is not what the dataset shows, as only about 20% of the data belong to the H=0 class. Hope that helps. – Saurabh Sinha Jun 20 '16 at 18:27
  • It doesn't help me at all, I'm afraid: your comments appear to confuse the intercept and the slope. So that everything can be clear, why not *present* your data? From your description, you can summarize them with a $2\times 2$ contingency table: just four counts is all you need. It would also be interesting to see a description of what those $667$ missing values correspond to. – whuber Jun 20 '16 at 18:30
  • I have 1738 observations in total, of which H is the binary dependent variable: 19.95% fall in the H=0 category and the rest in H=1. This dependent variable is modeled against the covariate X, whose minimum is 82.23, mean 223.8, and maximum 391.6. The 667 missing values are all in the covariate X, i.e., X is missing for 667 of the 1738 observations. – Saurabh Sinha Jun 20 '16 at 18:40
  • Please include that information in your question: most people will not read through these comments (which themselves could eventually be deleted or migrated elsewhere). In the meantime, you might find it of great interest to note that $$1/(1 + \exp(1.6416666 - 0.0014039\times 223.8)) = 20.96\%$$ is remarkably close to $19.95\%$. *This is not a coincidence.* – whuber Jun 20 '16 at 18:46
  • Sorry whuber, I am not getting what you mean by the above calculation of 20.96% being close to 19.95%. How should I interpret this? – Saurabh Sinha Jun 20 '16 at 18:56
  • Dear whuber, in your last comment you did not include exp(1.6416666 − 0.0014039 × 223.8) in the numerator. I think it should be exp(1.6416666 − 0.0014039 × 223.8)/(1 + exp(1.6416666 − 0.0014039 × 223.8)). Kindly explain what you meant. – Saurabh Sinha Jun 23 '16 at 10:36

1 Answer


Summary

The question misinterprets the coefficients.

The software output shows that the log odds of the response don't depend appreciably on $X$, because its coefficient is small and not significant ($p=0.138$). Therefore the proportion of positive results in the data, equal to $100\% - 19.95\% \approx 80\%$, ought to have log odds close to the intercept of $1.64$. Indeed,

$$\log\left(\frac{80\%}{20\%}\right) = \log(4) \approx 1.4$$

is only about one standard error ($0.22$) away from the intercept. Everything looks consistent.
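As a quick numerical check (a sketch in base R, using only the figures reported in the question):

# How far log(4) sits from the estimated intercept, in standard-error units
(1.6416666 - log(0.80 / 0.20)) / 0.2290133
# [1] 1.115   -- roughly one standard error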


Detailed analysis

This generalized linear model supposes that the log odds of the response $H$ being $1$ when the independent variable $X$ has a particular value $x$ is some linear function of $x$,

$$\text{Log odds}(H=1\,|\,X=x) = \beta_0 + \beta_1 x.\tag{1}$$

The glm command in R estimated these unknown coefficients with values $$\hat\beta_0 = 1.6416666\pm 0.2290133$$ and $$\hat\beta_1 = -0.0014039\pm 0.0009466.$$

The dataset contains a large number $n$ of observations with various values of $x$, written $x_i$ for $i=1, 2, \ldots, n$, which range from $82.23$ to $391.6$ and average $\bar x = 223.8$. Formula $(1)$ enables us to compute the estimated probabilities of each outcome, $\Pr(H=1\,|\,X=x_i)$. If the model is any good, the average of those probabilities ought to be close to the average of the outcomes.
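Incidentally, this can be checked directly in R from the fitted object: with an intercept and the canonical logit link, the likelihood equations force the two averages to agree exactly on the complete cases (a sketch, assuming H is coded 0/1):

mean(fitted(M1))         # average of the estimated Pr(H = 1 | X = x_i)
mean(model.frame(M1)$H)  # observed proportion of H = 1; identical by construction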

Since the odds are, by definition, the ratio of a probability to its complement, we can use simple algebra to find the estimated probabilities in terms of the log odds:

$$\widehat\Pr(H=1\,|\,X=x) = 1 - \frac{1}{1 + \exp\left(\hat\beta_0 + \hat\beta_1 x\right)}.$$
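In R this inverse-logit function is available as plogis; a sketch evaluating it at the mean of $X$ with the reported estimates:

b0 <- 1.6416666; b1 <- -0.0014039
plogis(b0 + b1 * 223.8)  # Pr-hat(H = 1 | X = 223.8) = 0.7904; complement 0.2096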

As a nonlinear function of $x$, that's difficult to average. However, provided $\hat\beta_1 x$ is small (much less than $1$ in size) and $1+\exp(\hat\beta_0)$ is not small (it exceeds $6$ in this case), we can safely use a linear approximation

$$\frac{1}{1 + \exp\left(\hat\beta_0 + \hat\beta_1 x\right)} = \frac{1}{1 + \exp(\hat\beta_0)}\left(1 - \hat\beta_1 x\, \frac{\exp(\hat\beta_0)}{1 + \exp(\hat\beta_0)}\right) + O\left(\left(\hat\beta_1 x\right)^2\right).$$
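A sketch of how good this linearization is across the observed range of $X$ (base R, reported estimates only):

b0 <- 1.6416666; b1 <- -0.0014039
x  <- c(82.23, 223.8, 391.6)  # min, mean, max of X
exact  <- 1 / (1 + exp(b0 + b1 * x))
linear <- (1 - b1 * x * exp(b0) / (1 + exp(b0))) / (1 + exp(b0))
round(rbind(x, exact, linear), 4)
# exact  0.1785 0.2096 0.2513
# linear 0.1779 0.2049 0.2370   -- worst at the extreme, as expected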

Since the $x_i$ never exceed $391.6$, $|\hat\beta_1 x_i|$ never exceeds $391.6\times 0.0014039 \approx 0.55$, so we're ok. Consequently, the average estimated probability of $H=0$ may be approximated as

$$\eqalign{ \frac{1}{n}\sum_{i=1}^n \widehat\Pr(H=0\,|\,X=x_i) &\approx \frac{1}{n}\sum_{i=1}^n \frac{1}{1 + \exp(\hat\beta_0)}\left(1 - \hat\beta_1 x_i \frac{\exp(\hat\beta_0)}{1 + \exp(\hat\beta_0)}\right)\\ &= 0.162238 + 0.000190814 \bar{x} \\ &= 20.4943\%. }$$

Although that is not exactly equal to the $19.95\%$ of $H=0$ cases observed in the data, it is more than close enough, because $\hat\beta_1$ has a relatively large standard error. For example, if $\hat\beta_1$ were increased by only $0.3$ of its standard error to $-0.0011271$, the estimated probability of $H=0$ at $\bar x$ would be $19.95\%$ exactly.
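A sketch verifying that last figure from the reported estimates alone (qlogis is base R's logit, the inverse of plogis):

b0 <- 1.6416666; se1 <- 0.0009466
# Slope at which the exact estimate at x-bar = 223.8 gives Pr(H = 0) = 19.95%
b1.star <- (qlogis(1 - 0.1995) - b0) / 223.8
b1.star                           # [1] -0.0011271
(b1.star - (-0.0014039)) / se1    # [1] 0.2924  -- about 0.3 standard errors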

whuber
  • In some posts I have been critical of the use of too many significant figures in publications. In this answer I have displayed many more sig figs than necessary in order to help anyone who might be trying to reproduce the calculations. At most two significant figures of precision are supported by the information presented in the question (because we just don't know the details of how the $x_i$ are distributed). – whuber Jun 23 '16 at 15:08
  • Thanks whuber. Can you suggest some literature and texts for a better understanding of logistic regression, its diagnostics, goodness of fit, and so on, or perhaps some hands-on practice papers? I am doing all of this in R, particularly with the mgcv package. It would be very helpful for me. – Saurabh Sinha Jun 24 '16 at 05:57
  • Hosmer & Lemeshow, *Applied Logistic Regression.* (I am familiar with the second edition; it's now in a third edition.) – whuber Jun 24 '16 at 13:36
  • Hi @whuber, are there any assumptions of a linear relationship in logistic regression? Even in the logit link function and $X$? – Pierre L Sep 20 '16 at 19:22
  • @Pierre Definitely! The model must be linear in its parameters. Specifically, with the usual (logit) link, the log odds of the response for a regressor (row) vector $x$ is assumed to be given by $x\beta$ for a parameter vector $\beta$. – whuber Sep 20 '16 at 20:16
  • Okay, thank you. I'm putting together a list of the reasons why logistic regression may be problematic for variable selection. If the true relationship is non-linear, the coefficients will be unreliable. I appreciate it. I've asked this question, hopefully you can help. http://stats.stackexchange.com/questions/236020/model-selection-ultimately-for-variable-selection – Pierre L Sep 20 '16 at 20:19
  • @Pierre Logistic regression shares most of the benefits and problems of least-squares multiple regression. Many of the solutions to OLS problems also apply to logistic regression. For instance, splining the regressors can both test for and account for nonlinearities. – whuber Sep 20 '16 at 20:25