21

Let's say I have a "kidney catheter" data set. I'm trying to model a survival curve using a Cox model. If I consider a Cox model: $$h(t,Z) = h_0 \exp(b'Z),$$ I need the estimate of the baseline hazard. By using the built-in survival package R function basehaz(), I can easily do it like this:

library(survival)

data(kidney)
fit <- coxph(Surv(time, status) ~ age , kidney)
basehaz(fit)

But if I want to write a step by step function of the baseline hazard for a given estimate of parameter b how can I proceed? I tried:

bhaz <- function(beta, time, status, x) {

    data <- data.frame(time,status,x)
    data <- data[order(data$time), ]
    dt   <- data$time
    k    <- length(dt)
    risk <- exp(data.matrix(data[,-c(1:2)]) %*% beta)
    h    <- rep(0,k)

    for(i in 1:k) {
        h[i] <- data$status[data$time==dt[i]] / sum(risk[data$time>=dt[i]])          
    }

    return(data.frame(h, dt))
}

h0 <- bhaz(fit$coef, kidney$time, kidney$status, kidney$age)

But this does not give the same result as basehaz(fit). What is the problem?

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
Dihan
  • 317
  • 1
  • 2
  • 9
  • @gung could you help with [this question](http://stats.stackexchange.com/questions/269217/how-to-manually-calculate-survfit-in-cox-hazard-model)? I struggled for couple of days... – Haitao Du Mar 22 '17 at 20:46

1 Answers1

22

Apparently, basehaz() actually computes a cumulative hazard rate, rather than the hazard rate itself. The formula is as follows: $$ \hat{H}_0(t) = \sum_{y_{(l)} \leq t} \hat{h}_0(y_{(l)}), $$ with $$ \hat{h}_0(y_{(l)}) = \frac{d_{(l)}}{\sum_{j \in R(y_{(l)})} \exp(\mathbf{x}^{\prime}_j \mathbf{\beta})} $$ where $y_{(1)} < y_{(2)} < \cdots$ denote the distinct event times, $d_{(l)}$ is the number of events at $y_{(l)}$, and $R(y_{(l)})$ is the risk set at $y_{(l)}$ containing all individuals still susceptible to the event at $y_{(l)}$.

Let's try this. (The following code is there for illustration only and is not intended to be very well written.)

#------package------
library(survival)

#------some data------
data(kidney)

#------preparation------
tab <- data.frame(table(kidney[kidney$status == 1, "time"])) 
y <- as.numeric(levels(tab[, 1]))[tab[, 1]] #ordered distinct event times
d <- tab[, 2]                               #number of events

#------Cox model------
fit<-coxph(Surv(time, status)~age, data=kidney)

#------cumulative hazard obtained from basehaz()------
H0 <- basehaz(fit, centered=FALSE)
H0 <- H0[H0[, 2] %in% y, ] #only keep rows where events occurred

#------my quick implementation------
betaHat <- fit$coef

h0 <- rep(NA, length(y))
for(l in 1:length(y))
{
  h0[l] <- d[l] / sum(exp(kidney[kidney$time >= y[l], "age"] * betaHat))
}

#------comparison------
cbind(H0, cumsum(h0))

partial output:

       hazard time cumsum(h0)
1  0.01074980    2 0.01074980
5  0.03399089    7 0.03382306
6  0.05790570    8 0.05757756
7  0.07048941    9 0.07016127
8  0.09625105   12 0.09573508
9  0.10941921   13 0.10890324
10 0.13691424   15 0.13616338

I suspect that the slight difference might be due to the approximation of the partial likelihood in coxph() due to ties in the data...

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
ocram
  • 19,898
  • 5
  • 76
  • 77
  • Thanks a lot. Yes, there are slight difference for approximation method. But there are 76 time points with ties, if I want to find the baseline hazard for every time point. What can i do? What type of modification in R code is needed? – Dihan Dec 26 '12 at 11:11
  • 1
    The discretised hazard is zero, except at event times. This indeed gives the largest contribution to the likelihood if a discrete hazard function is supposed. You might want to interpolate between any two estimates assuming, for example, that the hazard stays constant. – ocram Dec 26 '12 at 11:32
  • Method of Breslow (1974) – tomka Feb 27 '17 at 17:49
  • I need to note some problems with this implementation. Using `kidney$time >= y[l]` can run into numerical problems when time is numeric due to the tabulation in creating $y$. Furthermore, the way you define your risk set is inaccurate, because if there is a tie of two observations, one with `status=0` and one with `status=1`, then $d=2$ but your code gives $d=1$ as you exclude all `status=0` observations. The latter problem applies higher numbers of ties likewise. – tomka Feb 28 '17 at 18:10
  • As @tomka mentioned. Replacing the `coxph` call with `fit – mr.bjerre May 16 '18 at 07:16
  • As \@ocram defined the baseline hazard, you can find this formula in the book *Applied Survival Analysis Using R-2016* by Dirk F. Moore (page 64), either. I applied the formula for my dataset and it works, in fact, it estimated the baseline hazrad correctly as I checked it couple of times for some different covariates. And also, the slight difference in the results because of method is used. As @mr.bjerre stated, we will get the same results when we choose the method='breslow' in fitting cox proportional hazard. – mhy Mar 21 '20 at 13:07