Although I would have to admit that not all published survival models follow the best statistical practices, there are some important differences between what is typically considered modern machine learning and most epidemiological/clinical studies.
One is scale. Machine learning often involves tens of thousands or millions of observations, enough that one can afford to set aside separate training and test sets. (Frank Harrell has estimated that you need thousands of cases to do so without losing power relative to techniques like bootstrapping that use all the available data.)
The second is the intended use of the model. Machine learning is typically interested in prediction; survival modeling much less often is. Although survival models can be used for prediction, they are more typically used to determine whether one or a few variables of particular interest are related to outcome once other variables are taken into account. Cox proportional hazards models provide a way to approach this problem, even with the correlated predictors that are common in clinical/epidemiological studies. Multicollinearity will tend to increase the standard errors of individual predictors, but if the proportional hazards assumption is met then the point estimates of the coefficients are still useful for thinking about the underlying biology, even if their magnitudes are somewhat optimistic.
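That effect of multicollinearity is easy to see in simulation. Here is a minimal Python/numpy sketch, using an ordinary linear model for simplicity rather than a Cox model (the variance-inflation behavior is analogous): with two strongly correlated predictors, the standard error of each coefficient grows substantially, while the point estimates stay close to the true values.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; returns coefficients and their standard errors."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)               # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

def simulate(rho, n=2000, seed=42):
    """Two predictors with correlation rho; true coefficients 1.0 and 0.5."""
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
    return fit_ols(X, y)

beta_ind, se_ind = simulate(rho=0.0)   # independent predictors
beta_cor, se_cor = simulate(rho=0.9)   # strongly correlated predictors
```

With a correlation of 0.9, the variance inflation factor is 1/(1 - 0.81) ≈ 5.3, so each standard error is inflated by a factor of roughly 2.3 relative to the independent case, yet the coefficient estimates themselves remain unbiased.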
There certainly are ways to use clinical/epidemiological data sets more efficiently and intelligently. Harrell's text, Regression Modeling Strategies, and the associated rms package in R illustrate better approaches to validating models and assessing their optimism.
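The core of one such approach can be sketched without the rms machinery. Below is a minimal Python/numpy illustration of the optimism bootstrap implemented by functions like rms::validate, using R² of an ordinary linear model as the performance measure (a survival model scored by a concordance index would follow the same pattern). With a small sample and pure-noise predictors, the apparent R² is inflated by overfitting, and the optimism correction pulls it back down.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def r2(X, y, beta):
    """Coefficient of determination for a given fitted model."""
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def optimism_corrected_r2(X, y, n_boot=200, seed=0):
    """Harrell-style optimism bootstrap: every observation is used for both
    fitting and validation, instead of sacrificing a hold-out test set."""
    rng = np.random.default_rng(seed)
    apparent = r2(X, y, fit(X, y))        # performance on the data that built the model
    n, optimism = len(y), 0.0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        b = fit(X[idx], y[idx])
        # optimism = (performance on bootstrap sample) - (performance on original data)
        optimism += r2(X[idx], y[idx], b) - r2(X, y, b)
    return apparent, apparent - optimism / n_boot

# Small sample, ten pure-noise predictors plus an intercept: a recipe for overfitting.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 10))])
y = rng.normal(size=40)
apparent, corrected = optimism_corrected_r2(X, y)
```

The corrected estimate is always the apparent performance minus the average amount by which a model fit to a bootstrap sample outperforms on its own sample versus on the original data; here the correction is substantial because the model is fit to noise.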
I think, however, that those who read studies reporting model results without attempts at further validation will tend to discount the magnitude of the result. As an experienced physician told me decades ago, only somewhat in jest: "When a report about a great new drug comes out, you'd better use it while it still works." Eventual regression to the mean is a reasonable expectation when striking, poorly validated results are published.