
I have looked at the answers at:

In a linear regression, you regress $Y$ on $X$. For each subject $i$, you have an $X_i$ and a $Y_i$.

Assuming I want to do the transformation in a logistic regression by hand, how do I obtain $P(Y_i)$ for each subject? I found a website that goes through the procedure for categorical predictors (http://vassarstats.net/logreg1.html), which essentially computes the log odds for each combination of categorical predictors. How, then, do you deal with continuous predictors?

For illustration, I am using the dataset from: https://stats.idre.ucla.edu/r/dae/logit-regression/

bindata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

What I would like to achieve is for the results from:

glm.logistic <- glm(admit ~ gpa, bindata, family = "binomial")

to match with

glm.linear <- glm(admit_transform ~ gpa, bindata, family = "gaussian")

where `admit_transform` is the log odds of `admit`.

The whole point of this is really to understand how logistic regression works, not as a practical way to do logistic regression.


1 Answer


In a logistic regression you don't know the probability $p(x)$ of $Y=1$ given $X=x$, and hence calculating the log odds (under the assumption that $p(x)$ is correctly specified by the logistic link) is not readily possible. How did you calculate the odds of `admit`?

To get identical results you need to calculate $p(x)$:

bindata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
glm.logistic <- glm(admit ~ gpa, bindata, family = "binomial")
pp <- predict(glm.logistic, type = "response")  # predicted p(x) for each subject
y1 <- log(pp / (1 - pp))                        # transform to log odds
lm.out <- lm(y1 ~ gpa, bindata)                 # regress log odds on gpa

This will result in the same estimated coefficients.

> coef(glm.logistic)
 (Intercept)         gpa 
  -4.357587    1.051109 
> coef(lm.out)
(Intercept)         gpa 
 -4.357587    1.051109 
  • What I understand is that there is an observed `Y` and a fitted `Y`. Observed `Y` needs to be transformed before it can be fitted, i.e. for each subject I will have a `P(Y)`. The observed `Y` would also be where the residuals come from; more specifically, there is an observed and a fitted/predicted `log(P(Y)/(1 - P(Y)))`. My question really is: how do I compute this observed `P(Y)` manually in the case of continuous predictors? – R J Mar 23 '18 at 12:56
  • Your observed data is $(X_i, Y_i)$ where $Y_i \in \{0,1\}$. You are interested in the unknown $E(Y \mid X=x) = P(Y=1 \mid X=x) = p(x)$. For calculating the odds you need knowledge of $p(x)$, which you don't have but aim to estimate. The example on http://vassarstats.net is nothing more than a toy example. In order to calculate $p(x)$, you run a (in your case simple) logistic regression to obtain coefficients $\alpha$, $\beta$. The estimated linear predictor is then $\eta(x) = \alpha + \beta \cdot x$, and the estimated probabilities of $Y=1$ given $X=x$ are $p(x) = \exp(\eta(x))/(1+\exp(\eta(x)))$ (see the first sketch after these comments). – chRrr Mar 23 '18 at 13:13
  • Thanks, I am beginning to understand this better. One last question, how does the observed $Y$ come into the computations? – R J Mar 23 '18 at 13:47
  • $Y$ enters the computations via maximum likelihood. The conditional distribution of $Y$ given $X=x$ is Bernoulli with parameter $p(x)$, where, in the case of a logistic regression, you assume $p(x) = \exp(\eta(x))/(1+\exp(\eta(x)))$ with $\eta(x) = \alpha + \beta \cdot x$, which depends on the unknown parameters $\alpha$ and $\beta$. From this you can construct the (conditional) likelihood and maximize it w.r.t. the two unknown parameters, using your data $(X_i, Y_i)$ (see the second sketch below). – chRrr Mar 23 '18 at 14:21
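
To make the second comment concrete, here is a minimal sketch that computes $p(x)$ by hand from the fitted coefficients rather than via `predict()`; the variable names (`alpha`, `beta`, `eta`, `p.manual`) are mine, not from the original answer:

bindata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
glm.logistic <- glm(admit ~ gpa, bindata, family = "binomial")

alpha <- coef(glm.logistic)[1]           # estimated intercept
beta  <- coef(glm.logistic)[2]           # estimated slope for gpa
eta   <- alpha + beta * bindata$gpa      # linear predictor eta(x)
p.manual <- exp(eta) / (1 + exp(eta))    # inverse logit gives p(x)

# should agree with the predict() call used in the answer
all.equal(unname(p.manual),
          unname(predict(glm.logistic, type = "response")))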
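
And a sketch of the maximum-likelihood step described in the last comment, assuming numerical optimization with `optim()` (the helper name `negloglik` is mine); it should recover coefficients close to those from `glm()`:

# negative Bernoulli log-likelihood as a function of (alpha, beta)
negloglik <- function(par, x, y) {
  eta <- par[1] + par[2] * x                # linear predictor
  p   <- exp(eta) / (1 + exp(eta))          # logistic link
  -sum(y * log(p) + (1 - y) * log(1 - p))   # negate so optim() can minimize
}

fit <- optim(c(0, 0), negloglik, x = bindata$gpa, y = bindata$admit)
fit$par   # close to coef(glm.logistic): roughly -4.36 and 1.05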