
I would greatly appreciate if you could let me know how to do discrete time survival analysis with time varying covariates. Some part of my data set is as follows:

ID TIME EVENT   x1   x2   x3   x4   x5 
1    1    0    1.28 0.02 0.87 1.22 0.06 
1    2    0    1.27 0.01 0.82 1.00 -0.01 
1    3    0    1.05 -0.06 0.92 0.73 0.02 
1    4    0    1.11 -0.02 0.86 0.81 0.08 
1    5    1    1.22 -0.06 0.89 0.48 0.01 
2    1    0    1.06 0.11 0.81 0.84 0.20 
2    2    0    1.06 0.08 0.88 0.69 0.14 
2    3    0    0.97 0.08 0.91 0.81 0.17 
2    4    0    1.06 0.13 0.82 0.88 0.23 
2    5    0    1.12 0.15 0.76 1.08 0.28 
2    6    0    1.60 0.26 0.55 1.31 0.37 
2    7    0    1.58 0.26 0.56 1.16 0.35 
2    8    0    1.54 0.24 0.59 1.08 0.33 
2    9    0    1.72 0.22 0.55 0.84 0.29 
2    10   0    1.72 0.21 0.53 0.79 0.29 
2    11   0    1.63 0.19 0.55 0.73 0.27 
2    12   0    2.17 0.32 0.44 0.95 0.43 
3    1    0    0.87 -0.03 0.79 0.61 0.00 
3    2    1   0.83 -0.14 0.95 0.57 -0.02 

My data set concerns companies' bankruptcy. My covariates are financial ratios computed at the end of each year. Likewise, whether a company has gone bankrupt or not is also determined at the end of each year, after the financial statements are prepared.

Which method should be used: a non-parametric method (logit, cloglog), a semi-parametric method (Cox), or a parametric method (exponential, log-logistic, log-normal, Weibull, gamma)? Should the model be estimated using fixed effects, random effects, mixed effects, or pooled regression?

Some R code is also provided here.

  • Can you clarify what the question is? – The Laconic Apr 01 '17 at 18:08
  • @TheLaconic. Thanks. I don't know which method I should use. – ebrahimi Apr 01 '17 at 18:10
  • I suspect you might get better answers on a Stata site. You would also improve your chances by telling us what you have tried and why it did not seem to answer your scientific question, whatever that is. – mdewey Apr 02 '17 at 13:36
  • @mdewey Thanks. I asked it on Statalist but there was no answer. Really, it is not important to use Stata; I know R to some extent. In fact, I have more covariates, so I want to identify those variables which most affect bankruptcy. – ebrahimi Apr 03 '17 at 12:52
  • @mdewey. Sorry, but my question is similar to this one: http://stats.stackexchange.com/questions/141528/comparing-different-methods-of-discrete-time-survival-analysis?rq=1 – ebrahimi Apr 04 '17 at 09:09
  • Without knowing what your various X's might be, it's pretty much sheer speculation, but I also wonder if you need to take into account auto-correlation? If a company has one bad year is it just as likely that there will be a bankruptcy in the immediately following year as when there are three or more bad years back-to-back? What about recessions where all your "subjects" will have cross-correlated decreases in measures of economic health? And what about buyouts at "distressed" pricing? – DWin Apr 05 '17 at 19:36
  • @DWin Thanks. In fact, I have about 120 independent variables, which are divided into 5 categories: financial ratios based on accrual accounting, financial ratios based on cash flow accounting, stock market liquidity variables, corporate governance variables and macroeconomic variables. Therefore, I want to identify the most relevant variables but since I couldn't yet decide about the underlying regression model, I have not yet decided which variable selection method to use. The above data are related to the first category. – ebrahimi Apr 06 '17 at 17:18
  • @DWin Currently, it is organized so that the event is determined at time t and the X's also belong to time t. Buyouts, mergers and acquisitions are not investigated. I have about 1550 firm-year observations belonging to 152 firms over 12 years, of which 50 firms went bankrupt. Really, I tried xgboost to classify companies into bankrupt and non-bankrupt based on features from 1 and 2 years earlier, but since my data set is small, the result was not satisfactory. – ebrahimi Apr 06 '17 at 17:23
  • One non-economist's opinion: I would be attempting to set up a smaller simulation with features that I understood and then run a survival analysis with time varying covariates. Then I would add in additional noisy covariate columns and scale the problem up to see what level of discrimination I could achieve as far as identifying noise versus signal with different methods. I would also be searching with the terms "auto-correlation" and "cross-correlation" since I think your problem is even _more_ complex than the prediction efforts that plague survival analysis of patient data. – DWin Apr 06 '17 at 19:26
  • @DWin Thanks a lot. As you suggested, I reduced the number of covariates so I need to know which model should be used for discrete time survival analysis with time-varying covariates. – ebrahimi Apr 06 '17 at 19:41
  • The R package `survival` has a `Surv`-function that supports time varying covariates. It's not discrete time but I don't think that putting in times that are "aligned" will break the logic. Running sos::findFn("discrete time survival analysis") brings up several other candidates, but I have no experience with them. SurvDisc: package labeled: "The Discrete Time Survival and Longitudinal Data Analysis" sounds like it might be a fit, as does `dynamichazard`: "Dynamic Hazard Models using State Space Models". Also look at `dse`, `dlm`, `KFAS`, `INLA` (not in CRAN), and `sspir`. – DWin Apr 06 '17 at 20:17
  • @JiebiaoWang I would appreciate if you could let me know how to use mixed-effect models for discrete time survival analysis? Is it right to do: `require(lme4) model – ebrahimi Apr 08 '17 at 20:30
  • Incidentally, the logit, complementary log-log (and probit) discrete time hazard models are all *fully parametric* specifications. If you are using Stata, the [**dthaz**](https://alexisdinno.com/stata/dthaz.html) package estimates all these models, and permits the use of time-varying covariates. – Alexis Oct 11 '17 at 21:50

1 Answer


You can do this with the static_glm function in the dynamichazard package, which I wrote. The model you get is exactly the multiperiod logit model used in

Shumway, T. (2001). Forecasting bankruptcy more accurately: A simple hazard model. The Journal of Business, 74(1), 101-124.

This is a common method in the literature. The R code for your data would be

fit <- dynamichazard::static_glm(
  formula = Surv(tstart, tstop, EVENT) ~ x1 + x2 + x3 + x4 + x5,
  data = the_data_frame_you_used, # you have to change this
  max_T = 12,                     # the last time you observe
  by = 1)                         # bin into period of one year

You will, though, first have to transform your data into the start-stop setup. This is easily done with the tmerge function from the survival package. See the "Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model" vignette in the survival package for examples of how to use the tmerge function.
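Since the data you posted already has exactly one row per firm-year, a minimal sketch of the transformation is the following (assuming the data frame is called dat and has the ID, TIME and EVENT columns from the question; tmerge is the more general tool when covariates change at irregular times):

# each row covers the one-year interval (TIME - 1, TIME]
dat$tstart <- dat$TIME - 1  # interval start
dat$tstop  <- dat$TIME      # interval end
# dat can now be passed as the data argument in the static_glm call above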

Of course, you can use any other survival method which supports time-varying covariates once you have your data.frame in the start-stop format. There is a long list of options in R. E.g., see the Survival Analysis task view.
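For instance, a minimal sketch of the semi-parametric (Cox) alternative with time-varying covariates, assuming the start-stop columns created above and that your data frame is called dat:

library(survival)
cox_fit <- coxph(Surv(tstart, tstop, EVENT) ~ x1 + x2 + x3 + x4 + x5,
                 data = dat)  # Efron handling of the (many) tied event times by default
summary(cox_fit)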

An issue, though, is that your companies (likely) do not default exactly at TIME but somewhere between TIME - 1 and TIME. I.e., you are dealing with interval censoring, which you may want to account for when you choose the survival model you use.

Your question is related to this one. In particular, you can include TIME as a random effect, much like the answer there, as follows

require(lme4)
ans <- glmer(
  EVENT ~  x1 + x2 + x3 + x4 + x5 + (1|TIME), 
  data = your_initial_data_frame, # data.frame as you posted it 
  family = binomial) 

Update to OP's further questions

Could you please let me know if it is possible to use "cloglog" for both of your methods?

You cannot get an interval-censored model (i.e., a cloglog link function) with static_glm. However, you can use the get_survival_case_weights_and_data function in the same package, as I show in the "Comparing methods for time varying logistic models" vignette, and then use whatever classifier you want, such as glm with a cloglog link function.
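For illustration, a minimal sketch of a plain discrete-time cloglog model fitted directly on the firm-year rows (assuming the data frame dat from the question; factor(TIME) gives each period its own baseline hazard, which is one common choice):

fit_cll <- glm(EVENT ~ x1 + x2 + x3 + x4 + x5 + factor(TIME),
               data = dat, family = binomial(link = "cloglog"))
summary(fit_cll)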

Is it allowed to use your suggestions if some companies enter the study at time 4, others at time 7, and so on?

This is called delayed entry. It should not be a problem in a discrete-time default model if your time scale is the calendar date/year.

Really, I want to predict bankruptcy using survival analysis, so my covariates should be lagged, for example by 1 year.

Yes, you need to lag your covariates.
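For example, a minimal sketch of building one-year-lagged covariates within each firm in base R (assuming dat is the firm-year data frame from the question):

dat <- dat[order(dat$ID, dat$TIME), ]   # sort by firm and year
lag1 <- function(x) c(NA, head(x, -1))  # shift a vector down by one period
for (v in c("x1", "x2", "x3", "x4", "x5"))
  dat[[paste0(v, "_lag1")]] <- ave(dat[[v]], dat$ID, FUN = lag1)
# the first observed year of each firm gets NA lags and would be dropped before fitting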

As I tried logistic regression in Python's sklearn, the solver "sag" had better performance. Is it allowed to use this solver with your suggestions? Thanks a lot.

Seems like "sag" is a penalized logistic model. It should not be problem if you set up your data correctly.

  • You may be interested in [this vignette](https://cran.r-project.org/web/packages/dynamichazard/vignettes/Comparing_methods_for_logistic_models.pdf) in my package. – Benjamin Christoffersen Oct 25 '17 at 21:39
  • @Benjamin. Thanks a lot for sharing your time and knowledge. Could you please let me know if it is possible to use "cloglog" for both of your methods? Is it allowed to use your suggestions if some companies enter the study at time 4, others at time 7, and so on? Really, I want to predict bankruptcy using survival analysis, so my covariates should be lagged, for example by 1 year. As I tried logistic regression in Python's sklearn, the solver "sag" had better performance. Is it allowed to use this solver with your suggestions? Thanks a lot. – ebrahimi Nov 17 '17 at 18:05
  • @Benjamin. I would appreciate it if you could introduce me to a good book that describes research designs in finance and accounting. In fact, I should explain what kind of research this is. – ebrahimi Nov 18 '17 at 08:22
  • @Benjamin.Thanks a lot. I thank you very much for answering the questions. – ebrahimi Nov 19 '17 at 13:27
  • Glad to help, if this answer solved your problem please mark it as accepted by clicking the check mark next to the answer. – Benjamin Christoffersen Nov 19 '17 at 17:28
  • @Benjamin. I would appreciate it if you could let me know why the performance of my bankruptcy prediction is not good. In fact, unlike most studies I did not use a pair-matching method for sample selection, so my data is very imbalanced (5% bankrupt vs. 95% non-bankrupt firm-year observations). A link to some part of my data is provided [here](https://datascience.stackexchange.com/questions/17191/why-the-estimated-lasso-coefficients-of-almost-all-variables-are-equal-to-zero?noredirect=1#comment21766_17191) (comment section). – ebrahimi Nov 21 '17 at 14:33
  • My target variable is retained earnings to capital stock, which, according to commercial law, means the firm is considered bankrupt if the ratio falls below -0.50, and not bankrupt otherwise. My new data sample is doubled in size, but it is not possible to increase it any more. I selected my covariates after reviewing more than 100 bankruptcy papers. Thanks a lot. – ebrahimi Nov 21 '17 at 14:36