It's hard to be specific without seeing your data, so I'll keep this generic. First, here are the main ways a data frame can be laid out for use with the survival package:
The bare-bones
- ID - a unique variable to identify each unit of analysis (e.g., patient, country, organization)
- Event - a binary variable indicating whether the event of interest occurred (e.g., death, revolution, bankruptcy)
- Time - time until the event or until observation ends (right-censoring). The Cox model is best suited to continuous time, but when a study spans years (especially with countries), monthly spells can do.
- (Oftentimes) Some covariates
Let's use a made-up model that estimates the hazard of countries falling into civil war (the event) over ten years, in monthly spells, using a single covariate (prevCivilWar) that is not time-dependent:
# the first country was censored before an event and the second
# experienced the event after 8 years
id time event prevCivilWar
1 120 0 0
2 96 1 1
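As a minimal sketch, the bare-bones table above could be entered in R as follows. The data frame name `wars` is only an illustration, and the `coxph()` call is left commented out because two rows are far too few to fit anything meaningful.

```r
library(survival)

# Hypothetical data mirroring the table above: one row per country.
wars <- data.frame(
  id           = c(1, 2),
  time         = c(120, 96),  # months until the event or censoring
  event        = c(0, 1),     # 1 = civil war broke out, 0 = censored
  prevCivilWar = c(0, 1)      # fixed (not time-dependent) covariate
)

# Surv() pairs each follow-up time with its event indicator.
y <- Surv(wars$time, wars$event)
print(y)

# With a realistically sized data set, the model call would look like:
# fit <- coxph(Surv(time, event) ~ prevCivilWar, data = wars)
```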
Adding time-dependent covariates: Method 1
- Covariate - you need to know the original value, whether it changed, what it changed to, and when (at which spell).
- Time - replace the single time variable with a start and an end time, splitting a subject's record wherever any of the covariates changes.
Here we add a binary variable, is40pov, indicating whether more than 40% of the population lives in poverty:
id time1 time2 event prevCivilWar is40pov
1 0 80 0 0 0
1 80 120 0 0 1
2 0 24 0 1 0
2 24 60 0 1 1
2 60 96 1 1 1
When using time-dependent covariates we need to specify the exact time frame until any change in any covariate occurs. Note that a subject's consecutive intervals must share endpoints: each row's time1 equals the previous row's time2, with no gaps in between. If a subject has no changes in any covariate, then one row suffices.
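A small sketch of the Method 1 layout in R, assuming the rows are already sorted by id and start time. The data frame name `wars_tdc` and the contiguity check are illustrative additions, not something the survival package requires you to write.

```r
library(survival)

# Hypothetical data mirroring the Method 1 table above.
wars_tdc <- data.frame(
  id           = c(1, 1, 2, 2, 2),
  time1        = c(0, 80, 0, 24, 60),
  time2        = c(80, 120, 24, 60, 96),
  event        = c(0, 0, 0, 0, 1),
  prevCivilWar = c(0, 0, 1, 1, 1),
  is40pov      = c(0, 1, 0, 1, 1)
)

# Within each id, every interval should start where the previous one ended.
contiguous <- sapply(split(wars_tdc, wars_tdc$id), function(d) {
  all(head(d$time2, -1) == tail(d$time1, -1))
})
print(contiguous)

# The three-argument form of Surv() takes (start, stop, event).
y <- Surv(wars_tdc$time1, wars_tdc$time2, wars_tdc$event)

# With enough data, the model call would look like:
# fit <- coxph(Surv(time1, time2, event) ~ prevCivilWar + is40pov,
#              data = wars_tdc)
```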
Method 2 - best for continuously changing covariates
This layout has one row per spell for each unique ID: $k$ rows if the subject is censored at the end of the study, or fewer if the event happened earlier. So, if you have a database covering the time frame studied, decide on a spell length that makes theoretical sense: if the covariates change on an hourly basis, make it hourly, and so on. Once you have decided on the spell length (e.g., a month) and the total time (e.g., ten years), each ID will have at most $120$ spells.
If needed, first create the longitudinal dataset with placeholder values (NA, 0, or whatever) for the time-dependent covariates, and add two extra utility columns for the dates/times of each spell. Then you can access the database, fetch the specific values of your covariates at those dates/times, and fill them in. It is OK if certain rows have no changes in covariates. You will end up with something like:
# The variable pov is the poverty percent of population and measured monthly
id time1 time2 event prevCivilWar pov
1 0 1 0 0 0.34
1 1 2 0 0 0.34
...
1 79 80 0 0 0.43
...
1 119 120 0 0 0.41
2 0 1 0 1 0.25
...
2 23 24 0 1 0.42
...
2 95 96 1 1 0.58
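The empty spell-level skeleton described above can be sketched in base R as follows. The helper `expand_spells()` and the data frame names are hypothetical, and `pov` is left as NA to be filled in later from your own source.

```r
# Hypothetical bare-bones data: one row per country.
wars <- data.frame(
  id           = c(1, 2),
  time         = c(120, 96),
  event        = c(0, 1),
  prevCivilWar = c(0, 1)
)

# Build one row per monthly spell for a single country.
expand_spells <- function(row) {
  n <- row$time
  data.frame(
    id           = row$id,
    time1        = seq_len(n) - 1,
    time2        = seq_len(n),
    # Only the last spell can carry the event; earlier spells are 0.
    event        = c(rep(0, n - 1), row$event),
    prevCivilWar = row$prevCivilWar,
    pov          = NA_real_  # placeholder for the monthly poverty share
  )
}

spells <- do.call(
  rbind,
  lapply(seq_len(nrow(wars)), function(i) expand_spells(wars[i, ]))
)

head(spells, 3)
```

Filling `pov` is then a matter of joining your monthly measurements onto `spells` by id and spell.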
For more info on time-dependent covariates and coefficients, see Therneau, Crowson and Atkinson's 2016 Vignette.