I am attempting to carry out a discrete time survival analysis using a logistic regression model, and I'm not sure I completely understand the process. I would greatly appreciate assistance with a few basic questions.

Here is the setup:

I'm looking at membership in a group within a five-year time window. Each member has one record for each month that the member is in the group. I'm considering only the members whose membership began during the five-year window (to avoid "left censoring" issues with members who joined earlier). Each record is indexed by time, with time one being the month the member joined. So, a member who stays for two and a half years will have thirty monthly records, numbered from one to thirty. Each record also carries a binary variable, which has a value of one for the last month of membership and zero otherwise; a value of one marks the event that the member has left the group. For each member whose membership continues beyond the five-year analysis window, all the binary variable values are zero (these are the right-censored individuals in the survival analysis).
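A minimal sketch of the record layout described above, in plain Python. The function name `person_period_rows` and the field names `member`, `month`, and `event` are made up for illustration; they are not taken from any particular library.

```python
def person_period_rows(member_id, months_observed, left_group):
    """Expand one membership spell into one record per month.

    months_observed: number of months the member is seen inside the window.
    left_group: True if membership ended in the last observed month,
                False if the member is right-censored.
    """
    rows = []
    for month in range(1, months_observed + 1):
        # The event flag is 1 only on the final month of a member who left.
        event = 1 if (left_group and month == months_observed) else 0
        rows.append({"member": member_id, "month": month, "event": event})
    return rows


# A member who stays two and a half years and then leaves: thirty records.
leaver = person_period_rows("A", 30, left_group=True)

# A member still present when the window closes: every flag is zero.
censored = person_period_rows("B", 12, left_group=False)
```

Stacking these rows over all members gives the data set on which the logistic regression is fitted, with `event` as the response.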

So, the logistic regression model is built to predict the values of the binary event variable. So far, so good. One of the typical ways to evaluate a binary predictive model is to measure the lift on a holdout sample. For the logistic regression model I have built to predict the membership-ending event, I have computed the lift on a holdout data set with a five-to-one ratio of non-events to events. I ranked the predicted values into deciles. The decile with the highest predicted values contains seventy percent ones, a lift of more than four. The first two deciles combined contain sixty-five percent of all the ones in the holdout. In certain contexts this would be considered a fairly decent predictive model, but I wonder whether it's good enough to carry out a survival analysis.
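For reference, the lift figure can be checked by hand: a five-to-one ratio means a base rate of 1/6, and 0.70 / (1/6) = 4.2. The sketch below shows one way to compute lift by decile in plain Python; the helper name `decile_lift` and the toy data are assumptions for illustration, and it assumes at least ten records so that no decile is empty.

```python
def decile_lift(scores, events):
    """Return the lift of each decile, highest predicted scores first.

    Lift is the event rate inside the decile divided by the overall event rate.
    """
    # Indices of the records, sorted by predicted value, largest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    base_rate = sum(events) / n
    lifts = []
    for d in range(10):
        idx = order[d * n // 10:(d + 1) * n // 10]
        rate = sum(events[i] for i in idx) / len(idx)
        lifts.append(rate / base_rate)
    return lifts


# Toy holdout of 20 records: scores already descending, 4 events in total.
toy_scores = [s / 20 for s in range(20, 0, -1)]
toy_events = [1, 1, 1, 0, 1, 0] + [0] * 14
toy_lifts = decile_lift(toy_scores, toy_events)
```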

Let $h[j,k]$ be the hazard function for individual $j$ in month $k$, and let $S[j,k]$ be the probability that individual $j$ survives through month $k$.

Here are my fundamental questions:

  1. Is the discrete hazard function, $h[j,k]$, the conditional probability of non-survival (leaving the group) in each month?

  2. Are the predicted values from the logistic regression model estimates of the hazard function? (i.e., is $h[j,k]$ equal to the model predicted value for individual $j$ in month $k$, or does something more need to be done to obtain hazard function estimates?)

  3. Is the probability of survival up to month $q$ for individual $j$ equal to the product of one minus the hazard function from month one up to $q$, that is, does $S[j,q] = (1 - h[j,1]) \cdot (1 - h[j,2]) \cdot \ldots \cdot (1 - h[j,q])$?

  4. Is the mean value of $S[j,k]$ over all individuals $j$ for each time $k$ a reasonable estimate of the overall population mean survival probability?

  5. Should a plot of the overall population mean survival probability by month resemble the monthly Kaplan-Meier graph?

If the answer to any of these questions is no, then I have a serious misunderstanding, and could really use some assistance / explanation. Also, is there any rule of thumb for how good the binary predictive model needs to be in order to produce an accurate survival profile?
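For the comparison in question 5, the monthly Kaplan-Meier curve can be computed directly from each member's last observed month and event flag. This is a rough sketch under the setup above; the function name `kaplan_meier` and its arguments are chosen here for illustration, and a right-censored member is treated as still at risk in the last month observed.

```python
def kaplan_meier(durations, events, horizon):
    """Monthly Kaplan-Meier survival estimates for months 1..horizon.

    durations[i]: last month in which member i was observed.
    events[i]: 1 if the member left in that month, 0 if right-censored.
    """
    survival, s = [], 1.0
    for k in range(1, horizon + 1):
        # Members with a record in month k form the risk set.
        at_risk = sum(1 for d in durations if d >= k)
        left = sum(1 for d, e in zip(durations, events) if d == k and e == 1)
        if at_risk > 0:
            s *= 1.0 - left / at_risk
        survival.append(s)
    return survival


# Four toy members: two leave (months 1 and 2), two are censored (months 2 and 3).
km_curve = kaplan_meier([1, 2, 2, 3], [1, 0, 1, 0], horizon=3)
```

Plotting this curve next to the month-by-month average of the model-based survival probabilities is one way to see how closely the two agree.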

Ferdi
Talbot Katz
  • Maybe [this](http://stats.stackexchange.com/questions/45738/appropriate-application-of-survival-analysis/258015#258015) can help you with some of your questions – jujae Feb 27 '17 at 08:41

1 Answer

Assume $K$ is the largest value of $k$ (i.e. the largest month/period observed in your data).

  1. Here is the hazard function with a fully discrete parameterization of time, with a vector of parameters $\mathbf{B}$ and a vector of conditioning variables $\mathbf{X}$: $h_{j,k} = \frac{e^{\alpha_{k} + \mathbf{BX}}}{1 + e^{\alpha_{k} + \mathbf{BX}}}$. The hazard function may also be built around alternative parameterizations of time (e.g. include $k$ or functions of it as a variable in the model), or around a hybrid of both.

    The baseline logit hazard function describes the probability of event occurrence in time $k$, conditional upon having survived to time $k$. Adding predictors ($\mathbf{X}$) to the model further constrains this conditionality.

  2. No, logistic regression estimates (e.g. $\hat{\alpha}_{1}$, $\dots$, $\hat{\alpha}_{K}$, $\mathbf{\hat{B}}$) are not the hazard functions themselves. The logistic regression models $\operatorname{logit}(h_{j,k}) = \alpha_{k} + \mathbf{BX}$, and you need to perform the anti-logit transform in (1) above to get the hazard estimates.

  3. Yes, although I would write it as $\hat{S}_{j,q} = \prod_{i=1}^{q}{(1-\hat{h}_{j,i})}$. The survival function is the probability of not experiencing the event by time $q$, and of course may also be conditioned on $\mathbf{X}$.

  4. This is a subtle question, not sure I have answers. I do have questions, though. :) The sample size at each time period decreases over time due to right-censoring and due to event occurrence: would you account for this in your calculation of mean survival time? How? What do you mean by "the population?" What population are the individuals recruited to your study generalizing to? Or do you mean some statistical "super-population" concept? Inference is a big challenge in these models, because we estimate $\beta$s and their standard errors, but need to do delta-method back-flips to get standard errors for $\hat{h}_{j,k}$, and (from my own work) deriving valid standard errors for $\hat{S}_{j,k}$ works only on paper (I can't get correct CI coverages for $\hat{S}_{j,k}$ in conditional models).

  5. You can use Kaplan-Meier-like step-function graphs, and you can also use straight-up line graphs (i.e. connect the dots between time periods with a line). You should use the latter only when the concept of "discrete time" itself admits the possibility of subdivided periods. You can also plot/communicate estimates of cumulative incidence, which is $1 - S_{j,k}$. (At least, epidemiologists will often define "cumulative incidence" this way; the term is used differently in competing risks models. The term "uptake" may also be used here.)
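As a rough illustration of items 1 to 4, the sketch below applies the anti-logit to a linear predictor to get a hazard, takes the running product of one minus the hazard to get a survival curve, and averages curves across individuals. The function names and the numeric values of `alphas`, `beta`, and `x` are invented for the example and do not come from a fitted model; whether the average is a good population estimate is exactly the open point in item 4.

```python
import math


def hazard(alpha_k, beta, x):
    """Anti-logit of the linear predictor alpha_k + beta . x."""
    eta = alpha_k + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))


def survival_curve(alphas, beta, x):
    """S_q = product over i <= q of (1 - h_i), for q = 1..K."""
    curve, s = [], 1.0
    for alpha_k in alphas:
        s *= 1.0 - hazard(alpha_k, beta, x)
        curve.append(s)
    return curve


def mean_survival(curves):
    """Average S_{j,k} over individuals j, month by month."""
    n = len(curves)
    return [sum(c[k] for c in curves) / n for k in range(len(curves[0]))]


# Hypothetical period intercepts for K = 3 months and one covariate.
alphas = [-2.0, -1.5, -1.0]
beta = [0.3]
curve_low = survival_curve(alphas, beta, x=[0.0])
curve_high = survival_curve(alphas, beta, x=[1.0])
average_curve = mean_survival([curve_low, curve_high])
```

With a positive coefficient, the individual with the larger covariate value has the higher hazard in every month and therefore the lower survival curve.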

Alexis
  • I think in question 2, OP is asking about the predicted value from the logistic model, not the estimates of the regression coefficients. [This](http://stats.stackexchange.com/questions/16533/properties-of-logistic-regressions) might be relevant – jujae Feb 27 '17 at 08:47
  • @jujae I explicitly gave the logistic function in my answer to #2, and directed OP's attention to the use of the anti-logit to transform logit parameter estimates into $\hat{h}(t)$, so I am not understanding your comment. – Alexis Feb 27 '17 at 19:47
  • Isn't the predicted value of a logistic model the probability of success of the binary rv, such that no anti-logit is needed? That is, $y_\mathrm{pred}= \exp(\beta^Tx)/(1+\exp(\beta^Tx))$? – jujae Feb 28 '17 at 11:47
  • Back to the original question 2, the OP asked: "Are the predicted values from the logistic regression model estimates of the hazard function?" I would say yes (if my understanding of predicted value is correct). And you are saying no and give the argument that the estimated coefficients are not the same as hazard estimation. I agree with your statement, they are correct but it is not what OP asked from my understanding. – jujae Feb 28 '17 at 12:01
  • And for question 4, I think OP is asking about the survival probability at each interval $k$, and the average of the estimated $\hat{S}_j(k)$ is indeed a reasonable estimator for $S(k)$. In your answer, you first refer to mean survival time, which is confusing to me as a reader. Meanwhile, I also believe that the estimator we are discussing is essentially Kaplan-Meier, and (for instance) Greenwood's variance estimator for KM can be directly used, so I fail to appreciate the difficulties you stated above about the calculation of the variances. – jujae Feb 28 '17 at 12:35
  • @jujae "Isn't the predicted value of a logistic model the probability of success…" No, it is not. The logistic regression is a logit model, so that $\beta_{x}$ describes how much the $\log[\text{odds}(y)]$ changes due to a 1-unit increase in $x$. Because of this, the estimated hazard function in a discrete time event history model is $\hat{h}(t) = \frac{e^{\hat{\beta}_{0} + \mathbf{\hat{B}X}}}{1+ e^{\hat{\beta}_{0} + \mathbf{\hat{B}X}} }$, as I have indicated above (oops! minus the hats... gonna edit that in a sec). – Alexis Feb 28 '17 at 21:39
  • @jujae See, for example: Allison, P. (1982). Discrete-time methods for the analysis of event histories. *Sociological methodology*, 13(1982), 61–98; Allison, P. (1984). *Event history analysis: Regression for longitudinal event data*. Sage Beverly Hills, CA; or Singer, J. D. & Willett, J. B. (2003, March). *Applied longitudinal data analysis: modeling change and event occurrence*. New York, NY: Oxford University Press – Alexis Feb 28 '17 at 21:44
  • @jujae I have edited "mean survival" to "estimated survival" in 4. – Alexis Feb 28 '17 at 21:45
  • I am NOT disagreeing with you about the interpretation of the coefficient $\beta$ or $\hat{\beta}$ or what $\hat{h}(t)$ should be equal to; I completely agree with you on this part. I am arguing about the meaning of the "predicted value", which I believe means "predicted value of the outcome" (see the Singer and Willett reference you provided). This means that in a logit model or logistic regression $y_{\mathrm{pred}} = \hat{h}(t)$, so no anti-logit is actually needed. Can you give a precise definition of what your "predicted value" means, so we can clear up the confusion here? – jujae Mar 01 '17 at 14:37
  • "estimated survival time" is even worse than "mean survival time" don't you think? survival time is a random variable that has distribution of $F(t)=1-S(t)$, you can estimate the parameter (mean survival time here makes sense), what is that "estimated survival time" trying to estimate then? Can you maybe please define what you mean by "estimated survival time" ? Estimating survival probability is what the OP asked ($\hat{S}(k)$), am I correct here? – jujae Mar 01 '17 at 14:51
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/54574/discussion-between-jujae-and-alexis). – jujae Mar 01 '17 at 14:51
  • @Alexis About your answer to question 4: did you find a solution for the valid estimation of standard errors in conditional models estimated as discrete time logistic regression model? Thanks! – emanuela.struffolino Jun 26 '18 at 15:53