I think that you need to move your validation scheme up to an earlier step in your modeling.
You examined 350 genes whose expression was significantly different before and after treatment, and then tested each of them individually to see whether their expression (or perhaps their change in expression) was associated with outcome.
If you chose a significance level of p < 0.05, then without any true association with survival you would find a "significant" association just by chance in about 5% of comparisons. When you start with 350 genes, that means roughly 17 or 18 expected false positives (0.05 × 350 = 17.5), so half of your set of 35 genes that "affect patient survival" could easily be chance findings. That's an example of the multiple-comparisons problem, which becomes very large in gene-expression studies.
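As a quick sanity check in R (the `pvals` vector below is a hypothetical stand-in for your 350 per-gene p-values):

```r
## Expected chance findings among 350 tests at p < 0.05, and the standard
## remedy of controlling the false discovery rate with p.adjust() (base R).
n_genes <- 350
alpha   <- 0.05

n_genes * alpha            # 17.5 expected false positives under the global null
1 - (1 - alpha)^n_genes    # P(at least one false positive), assuming
                           # independent tests: essentially 1

## Benjamini-Hochberg FDR adjustment; 'pvals' is a hypothetical placeholder
## for your per-gene p-values (here drawn under the null for illustration).
pvals <- runif(n_genes)
sum(p.adjust(pvals, method = "BH") < 0.05)   # typically 0 under the null
```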
Also, one-at-a-time evaluation removes any ability to see whether accounting for some of those genes would make it easier to detect the associations of other genes with outcome. As with omitted-variable bias in logistic regression, if you omit from a survival model any predictor associated with outcome, you can underestimate the true magnitude of the coefficient for a predictor that you do examine.
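This is easy to demonstrate by simulation; here is a minimal sketch with the survival package (the effect sizes and sample size are arbitrary):

```r
## Two independent predictors both affect the hazard; omitting one shrinks
## the estimated coefficient of the other toward zero even though they are
## uncorrelated (hazard ratios are non-collapsible).
library(survival)

set.seed(42)
n  <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)                         # independent of x1
lp <- 1 * x1 + 1 * x2                  # true log-hazard: beta1 = beta2 = 1
time   <- rexp(n, rate = exp(lp))      # exponential baseline hazard
status <- rep(1, n)                    # no censoring, for simplicity

coef(coxph(Surv(time, status) ~ x1 + x2))["x1"]  # close to the true 1
coef(coxph(Surv(time, status) ~ x1))["x1"]       # noticeably attenuated
```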
Furthermore, unless you have many thousands of cases, you probably shouldn't use separate training and test sets for your modeling. Splitting the data loses power in the training set and leaves too few cases in the test set to provide a sensitive test of model performance. Instead, use the types of internal validation provided, for example, by the hdnom package. The only exception might be if you have a completely independent set of gene-expression and outcome data from an external source (like another hospital) to use as a test set.
If you want to develop a model for survival that is somehow based on the 350 genes that were differentially expressed, you should use an approach that starts broadly and considers multiple genes together. Ridge regression, elastic net, and LASSO (also evidently provided by hdnom) are so-called penalized methods often used for this purpose. They span a range from using all of the genes while differentially penalizing their coefficients to avoid overfitting (ridge), down to selecting a combination of just a few that, taken together, are most closely associated with outcome (LASSO).
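For illustration, here is a minimal sketch of a penalized Cox fit with glmnet; `expr` (a patients × 350 numeric matrix with gene names as column names), `surv_time`, and `surv_event` are hypothetical stand-ins for your data:

```r
## Penalized Cox regression with cross-validated choice of the penalty.
library(glmnet)
library(survival)

y <- Surv(surv_time, surv_event)   # or cbind(time = ..., status = ...)
                                   # for older glmnet versions

## alpha = 1 is LASSO, alpha = 0 is ridge, values in between are elastic
## net. Cross-validation picks the penalty strength lambda; for Cox models
## type.measure = "C" scores folds by Harrell's C-index.
cvfit <- cv.glmnet(x = expr, y = y, family = "cox",
                   alpha = 1, type.measure = "C", nfolds = 10)

## Genes retained by LASSO at the CV-chosen penalty:
b <- coef(cvfit, s = "lambda.min")
rownames(b)[as.vector(b != 0)]
```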
I haven't used the hdnom package, but I suspect that it's just a convenient interface to other R packages like glmnet for modeling, with cross-validation and bootstrapping for validation and calibration. It seems to have a reasonable workflow, although I can't say that it is necessarily the "best" package of all. So go back a couple of steps to your 350-gene list, use a penalized approach to identify the genes for your model, and then do internal validation.