1

My question is about external validation of gene model.

After analyzing my HTA 2.0 microarrays, I obtained 350 genes expressed significantly in patients before treatment and after treatment. I adapted the code from this post (https://www.biostars.org/p/344233/#448480) to do a survival analysis with genes expression with COX regression, obtaining 35 genes that affect patient survival.

Now that I have the genes identified, I want to validate them with a validation set samples. What method would you use? I have taken my genes that affect patient survival and used them using the clinical data from the validation set patients, and I get a 0.9 AUC in ROC. But I think this method is not optimal, right? I read that this is not correct, as I am redoing the coefficients, not validating them.

To do a validation, I found this package that allows you to do internal and external validation, but I haven't found any articles that use this package. https://cran.r-project.org/web/packages/hdnom/vignettes/hdnom.html#2_build_survival_models but I haven't found any articles that use this package.

Enrique
  • 111
  • 2
  • Nobody????????? – Enrique Jul 11 '20 at 18:38
  • I read the question a few times and i am not really sure about a lot of things. First of all, instead of pointing to a post, maybe you can summarize what was done to obtain the 35 genes. Now I assume this was done using a cox regression model on each gene. You used this onto a validation set and got a AUC of 0.9 – StupidWolf Jul 13 '20 at 23:15
  • what do you mean by "this method is not correct?" or "this method is not optimal".. I see nothing to do with coefficients unless you mean you refit the cox regression model onto the validation set, instead of using the model constructed from the training set to predict – StupidWolf Jul 13 '20 at 23:17
  • so I suggest you clarify a few things 1) what exactly was run to train the model 2) what was done to get you a AUC of 0.9 3) what you think is sub-optimal. Also, it's not a matter of running another package to make your issues go away – StupidWolf Jul 13 '20 at 23:19
  • @StupidWolf Sorry for the delay in answering. I haven't trained any models. Basically, I've done a univariate analysis by COX regression, using the Death variable (Event) and the average survival of patients. With this, I've come up with 35 significant genes. What I need now is how I can compare those genes with a validation set to see if they are truly optimal. – Enrique Aug 10 '20 at 16:49

1 Answers1

1

I think that you need to move your validation scheme up to an earlier step in your modeling.

You examined 350 genes whose expression was significantly different before and after treatment, and then tested each of them individually to see whether their expression (or perhaps their change in expression) was associated with outcome.

If you chose a significance level of p < 0.05, then without any true association with survival you would find a "significant" association just by chance in about 5% of comparisons. When you start with 350 genes, that means 17 genes of your set of 35 that "affect patient survival" could easily be false positives. That's an example of the multiple comparisons problem that becomes very large in studies on gene expression.

Also, the one-at-a-time evaluation removes any ability to see if accounting for some of those genes might make it easier to see the association of other genes with outcome. As with omitted-variable bias in logistic regression, if you omit any predictor associated with outcome from a survival model then you might underestimate the true magnitude of the coefficient for a predictor that you examined.

Furthermore, unless you have many thousands of cases you probably shouldn't be using separate training and test sets for your modeling. Otherwise you lose power in the training set and you have too few cases in your test set to provide a sensitive test of model performance. You should use the types of internal validation provided, for example, by the hdnom package. The only exception might be if you have a completely independent set of gene-expression and outcome data from an external source (like another hospital) to use as a test set.

If you want to develop a model for survival that is somehow based on the 350 genes that were differentially expressed, you should use an approach that starts broadly and considers multiple genes together. Ridge regression, elastic net, and LASSO (also evidently provided by hdnom) are so-called penalized methods often used for this purpose. These span a range from using all of the genes while differentially penalizing their coefficients to avoid overfitting (ridge), down to selecting a combination of just a few that are most closely--but together--associated with outcome (LASSO).

I haven't used the hdnom package, but I suspect that it's just a convenient interface to other R packages like glmnet for modeling, and cross-validation and bootstrapping for validation and calibration. It seems to have a reasonable workflow, although I can't say that it necessarily is the "best" package of all. So go back a couple of steps to your 350-gene list, use a penalized approach to identify the genes for your model, and then do internal validation.

EdM
  • 57,766
  • 7
  • 66
  • 187