
Using R and the caret package, I have performed feature selection on a microarray gene expression dataset (60 samples in total: 30 cancer and 30 control) with respect to a binary categorical outcome (disease status). My final selected subset comprises both gene features and continuous clinical variables. The initial dataset was produced by merging and batch-effect correcting 2 Affymetrix microarray datasets with a similar phenotype condition; the design is also paired, with each patient contributing both a cancer and a control sample, so in total there are 30 patients: 30 cancer and 30 adjacent control samples. Regarding the feature selection methodology: briefly, the initial pool of features was the differentially expressed (DE) gene list from my comparison of cancer versus control samples. An initial feature selection was then performed with a methodology we have implemented in our lab, resulting in 338 genes. Finally, I performed recursive feature elimination on these genes together with the 9 continuous clinical variables mentioned above, which gave 41 final "composite" features.

Moreover, beyond a simple initial inspection of my combined composite feature set with cross-validation, I would also like to perform an initial validation on an independent dataset, by testing the classifier trained on my initial dataset with these features. The major problem with simply selecting a microarray dataset from GEO or other repositories is that the PET features have been measured only in the same patients from whom the microarrays were produced (an important novelty that I would somehow like to test). So I cannot obtain any external samples or datasets with these clinical features.

Thus, is there any package or methodology I could implement in R to simulate my dataset above with only these 41 features, and then use this "synthetic" dataset for external validation of the classifier constructed on my initial training dataset? If not, what are my alternatives for validating this model?

(*Just to mention, disease status is the same phenotype information across my whole dataset: that is, colorectal adenocarcinoma samples or normal samples.)

  • Questions about software implementations (like "how to do X in R") are off-topic here. Your question, however, raises some important statistical issues that I will try to address over the weekend. In the meantime, could you please clarify the relations among samples and patients and "phenotype conditions" and disease status? For example, are there 30 patients each with 1 cancer and 1 control sample, or 30 cancer patients and 30 normals? Does "Disease status" represent different status within cancer patients? Also, please describe how you performed the feature selection. – EdM Dec 02 '16 at 21:28
  • Also, what do you mean by PET features being "an important novelty that i would like somehow to test"? For safety, since comments are easily lost, please edit the original question to include the clarifications I requested in this and my previous comment. – EdM Dec 02 '16 at 21:28
  • How many cases are there in the least-frequent categorical outcome, how many predictors/features are in your model, and did you do any type of penalization (LASSO, ridge, etc.)? – EdM Dec 02 '16 at 21:34
  • Dear EdM, thank you for your comments. Firstly, I understand the nature of my post, but because this putative implementation is a complicated and challenging task (as I'm not a bioinformatician), I will try to include as much information as possible in my original post. I also mentioned R because it is the main language I use for data analysis. – Jason Dec 02 '16 at 23:03
  • Regarding the PET features, I can't give any further information except that these variables are novel, as this is the first time they are being evaluated as continuous quantitative features beyond the original imaging technique (PET refers to positron emission tomography). Hope that makes it clearer. – Jason Dec 02 '16 at 23:14
  • So am I correct that the idea is to distinguish normal from cancer tissue by a combination of microarray expression values and some 9 continuous values determined from PET? Also, please indicate the approach you used for feature selection. It seems from other posts that you used the `caret` package in R, but that provides many different ways to do feature selection. – EdM Dec 03 '16 at 01:07
  • Yes, my goal is to further test this "composite" signature and its ability to discriminate primary colorectal tumors from adjacent controls. Regarding the feature selection approach, as I mentioned, for a first dimension reduction of the DE gene list we implemented a methodology we created in our lab based on the Gene Ontology tree structure. Then I created a union of the "selected DE genes" with the PET variables, and by applying the rfe function through caret with a number of different iterations, I ended up with the subset that constantly appeared in common: the 41 features. – Jason Dec 03 '16 at 02:47
  • I changed the title to be more general (and thus more generally interesting, an important consideration on this site) and added a sentence to your question to open the possibility for alternatives to your proposed use of synthetic data. – EdM Dec 03 '16 at 17:58
  • 1
    I dont think you can use synthetic data to validate models – Aksakal Dec 03 '16 at 17:59

1 Answer


There are 3 statistical (rather than coding) questions here. First is whether or how to use synthetic data as external validation for your classification scheme. As my answer is "you shouldn't," the second is how you can properly validate the scheme based on the data that you do have. The third is whether, in a data set with 30 positive cases, you will be able to validate a model with 41 features.

First, generating a synthetic data set of positron emission tomography (PET) values based on your microarray and other data might be possible, but unless there is a strong theoretical basis for your mapping of microarray and other data to PET values (which seems unlikely) then you could at best use the empirical relations within your present data to generate the synthetic PET data. That does not seem to be a true external validation of your classification scheme as the empirical mapping might depend heavily on the peculiarities of your present data.

Second, it is still possible to do substantial validation of a model, or at least of your model-generation process, based on your data set. Depending on exactly how you used the functions of the caret package, you might already have the necessary information, and you certainly could use the interfaces provided by caret to perform internal validation. This, however, requires careful attention to which parts of your feature selection were based simply on the predictor features and which depended on their relations to known cancer/normal status. That is the topic of this page among many other pages on this site.

You are free to select features for model building based on the features themselves without reference to the known classifications. For example, microarray data analyses typically remove genes with little variance in expression among all samples, as such genes are unlikely to contain any useful information. Unsupervised clustering to find groups of samples with similar expression profiles, without reference to the classification, is a powerful tool. This can provide a single score based on the set of expression values; you can then explore the relation of this score to your classes of interest. You can look for predictors highly correlated to each other (again, without reference to the known "outcome" classes) and choose to remove some or combine them in your set of candidate features. These types of choices do not affect the statistical validity of your later model building.
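For illustration, a minimal R sketch of this kind of label-free filtering with caret is given below; `expr_mat` is a hypothetical samples-by-features matrix (expression values plus the continuous clinical variables) and is not an object from your analysis.

```r
## Minimal sketch (assumed object names): unsupervised filtering that uses only
## the predictors, never the cancer/normal labels.
library(caret)

## drop features with (near-)zero variance across all samples
nzv <- nearZeroVar(expr_mat)
if (length(nzv) > 0) expr_mat <- expr_mat[, -nzv]

## drop one member of each pair of highly correlated features
high_corr <- findCorrelation(cor(expr_mat), cutoff = 0.9)
if (length(high_corr) > 0) expr_mat <- expr_mat[, -high_corr]

## unsupervised clustering of samples with similar profiles (labels never used)
hc <- hclust(dist(scale(expr_mat)))
plot(hc)
```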

Once you use the known cancer/normal classes to choose features, however, the assumptions of independence, etc, needed to perform standard statistical tests have been violated. That use of the outcome data to choose features has to be incorporated into model validation. That's why the author of the caret package warns:

One potential issue [is] over-fitting to the predictor set such that the wrapper procedure could focus on nuances of the training data that are not found in future samples (i.e. over-fitting to predictors and samples).

Cross-validation (CV; ideally, with K-fold CV done many times on independent data splits, a good deal more than the "simple initial inspection ... with cross-validation" you have done so far) or bootstrap resampling can be used for evaluating the quality of your model, but the validation must include all the steps, including the feature selection process, that were used in your model building. Again, from the web page on recursive feature elimination in caret:

Since feature selection is part of the model building process, resampling methods (e.g. cross-validation, the bootstrap) should factor in the variability caused by feature selection when calculating performance...improper use of resampling to measure performance will result in models that perform poorly on new samples.

In your case, as your gene set was chosen based on known cancer/normal classes, you would at least have to include that part of the feature selection within each CV fold or bootstrap resample. If PET features were selected based on known classes, that selection would also need to be repeated for each fold or resample. As noted in this answer, the choice of features will probably vary from fold to fold in CV or among bootstrap resamples. That's OK, as what you are validating is not the final model per se but the process you used to build it. You still present the features chosen from the full data set; the cross-validation or bootstrapping provides information about the validity of your approach and the potential optimism of your model when applied to new data.
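To illustrate what this looks like in caret, here is a hedged sketch using `rfe()` with the resampling declared in `rfeControl()`, so that the recursive elimination is repeated inside every resample. The objects `x` (predictor matrix) and `y` (two-level cancer/normal factor) are placeholders, and note that this does not cover any earlier label-based filtering (such as selecting DE genes), which would also have to be repeated within each fold or resample.

```r
## Sketch only: resampled recursive feature elimination with caret.
library(caret)

set.seed(123)
ctrl <- rfeControl(functions = rfFuncs,      # random-forest helper functions
                   method    = "repeatedcv", # repeated 10-fold CV
                   number    = 10,
                   repeats   = 5)

rfe_fit <- rfe(x, y,
               sizes      = c(2, 5, 10, 20, 41),  # candidate subset sizes
               rfeControl = ctrl)

rfe_fit              # resampled performance for each subset size
predictors(rfe_fit)  # features chosen on the full data set
```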

Third is the issue of having 41 "features" with only 30 cancer cases. Without further explanation, that raises fears of substantial over-fitting. The usual rule of thumb in binary classification is to limit the number of features to 1 per 10-20 cases in the least frequent class: maybe 2 to 3 features for your data set. Although you didn't specify the particular classification method you used during your recursive feature elimination and for your final model, you evidently did not use standard logistic regression. Perhaps you used logistic ridge regression (with its penalization of regression coefficients to reduce the effective number of parameters), or a random forest (which can handle large numbers of correlated predictors). Please be careful in how you present your number of "features"; it seems likely that you have a large number of highly correlated "features." If your model passes appropriate internal validation that's OK, but you need to make sure that you are ready to respond to a reviewer who claims that you have substantially over-fit your data.

Added in response to comments:

Please step back for a moment and focus on your strategy rather than the tactics that you have enquired about in this and other questions on this site.

Your primary goal seems to be to determine whether your PET measures help distinguish cancer from normal tissue. PET data come from imaging on living individuals, while microarray data require biopsies or surgically resected tumor specimens. If you already have a biopsy or tumor specimen, a pathologist would probably be preferred to microarray or PET for the cancer/normal determination. (A pathologist probably provided the ground truth of cancer/normal for your samples). Also there is over a decade of previous work on microarray distinctions between colon cancer and normal tissue. So your novel PET measures, obtained with essentially non-invasive techniques and not requiring intervention by a surgeon, seem to be what's crucial here.

On that basis, your strategy should be to test statistically how well your PET measures specifically work. Data-mining with tools like those accessed via caret may be an appropriate strategy in some applications (like determining useful gene-expression sets), but that strategy (lumping in your PET measures with a large number of microarray values) may be inappropriate for documenting that your PET measures are useful.

Happily, dealing with 9 continuous PET measures on 30 regions each of cancerous and normal tissue is a fairly simple problem. For example, you could perform PCA on those 9 measures over the entire data set (including both normal and cancer) and perform logistic regression, restricting your predictors to the 2 principal components that provide the largest contributions to the variance. That gives you 2 orthogonal predictors for 30 cancer cases, for an acceptable 15 cases per predictor. Or you can use logistic regression methods that penalize individual regression coefficients, like LASSO (which provides variable selection) or ridge regression (essentially a weighted rather than all-or-none principal-components regression, which I have found better behaved in situations with modest numbers of correlated predictors). Internal validation would be done as I noted above, repeating all steps of your modeling process on multiple re-samples (bootstrap or CV) of your data.
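A minimal sketch of that approach in R, assuming a data frame `pet` with the 9 continuous PET measures and a two-level factor `status` (both names are placeholders, not from your analysis):

```r
## PCA on the PET measures only, ignoring the cancer/normal labels
pca    <- prcomp(pet, center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:2])          # keep the top 2 components

## logistic regression on the 2 principal-component scores
fit <- glm(status ~ PC1 + PC2,
           data = cbind(scores, status = status),
           family = binomial)
summary(fit)

## penalized alternative via glmnet: ridge (alpha = 0) or LASSO (alpha = 1),
## with the penalty chosen by cross-validation
library(glmnet)
cv_fit <- cv.glmnet(as.matrix(pet), status, family = "binomial", alpha = 0)
coef(cv_fit, s = "lambda.min")
```

Ridge keeps all 9 measures with shrunken coefficients, while LASSO will zero some out; either keeps the effective number of parameters in line with the 30 cancer cases.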

Focusing on the novel PET measures would seem to meet both your interests (and those of other cancer researchers, clinicians and patients) better than folding them in with microarray results into a large data-mining operation. That doesn't mean that you should abandon your microarray work, but you should not let the abundance of microarray measurements get in the way of evaluating the value of your PET data, which should be of primary importance. As you are evidently at an academic medical center, there should be local statisticians available to provide close guidance on how to accomplish your goals reliably.

EdM
  • Dear EdM, thank you for your detailed and comprehensive answers thus far. Because I have not reported all the details, I will try to add any important information I did not mention above, in order to discuss some crucial and specific points further. Firstly, regarding the feature selection procedure: as the pool of features I used the combination of the 338 DE genes I mentioned along with the 8 PET variables, with 5 repeats of 10-fold cross-validation. But I used 20 different random iterations (with 50 as the size of the feature subset to be returned). – Jason Dec 04 '16 at 00:49
  • To continue, my notion was to select from the 20 different iterations only the features that appeared in all of them, thus resulting in the 41 final features. Also, I wanted to inspect whether any PET variables would be selected along with the genes, in order to have an initial measure of their possible importance. Then I just used extreme gradient boosting and trained my model with the above selected features. So firstly, regarding the size of the selected feature set, you suggest that I could use various sizes in the rfe function, repeat the same process again, and see the final outcome? – Jason Dec 04 '16 at 01:00
  • Overall, regarding your crucial point about the over-fitting and bias that result from using feature selection with the known labels (as recursive feature elimination and other methods do) and then training the model with these features on the same data: since my main goal is to evaluate and promote the possible significance of the PET features alongside any other important gene features, would my above procedure not be an "appropriate" initial approach? Or does your indication above apply: "you would at least have to include that part of the feature selection within each CV fold or bootstrap resample"? – Jason Dec 04 '16 at 01:14
  • Finally, perhaps an alternative way of doing further feature selection instead of rfe would be to perform a PCA, or maybe an MFA, as I have different "groups" of features describing the same observations (patients). I could then select, for instance, the top K features that are most correlated with the first dimensions, or that contribute most to the dimensions separating the two classes. But then, if this approach is less biased, how could I evaluate these selected features on the same dataset using classification/error metrics? – Jason Dec 04 '16 at 01:23
  • Dear EdM, for some weird reason I only just saw your updated answer, without any notification, even though I checked every day. Nevertheless, I would like to mention two crucial points on this matter, unless you prefer that I create a separate question for this. Firstly, regarding your comments about the importance and novelty of the PET variables: yes, I have your suggestions fully in mind, but the main reason for mixing the PET variables together with our selected genes is also to find any interesting correlations between them; that is the notion of a "composite signature"! – Jason Dec 12 '16 at 21:08
  • In parallel, on this point I also believe that Multiple Factor Analysis could handle a larger number of genes as well as the PET variables, as it balances the influence of each group; an initial analysis with this approach again ranks specific PET variables along with specific genes among the top contributors to the dimensions. Regarding the other crucial point, selection bias from the feature selection methodology: because it is an important point, in the caret tutorial (section 18.2), does the rfe function not perform the resampling-based validation you mentioned above? – Jason Dec 12 '16 at 21:14
  • http://topepo.github.io/caret/recursive-feature-elimination.html – Jason Dec 12 '16 at 21:16
  • I don't have experience with rfe; you can check the documentation and code about whether it automatically does bootstrapping or repeated CV on the _entire model building process_ as is recommended. It's unlikely that MFA will avoid the overfitting with only 30 cancer cases. You should consider no more than 2 or 3 predictors unless you use penalization like LASSO or ridge regression. You can look for correlations of PET to genes without including genes in the logistic regression. What practical clinical situation would use both PET and gene expression in a composite signature? – EdM Dec 12 '16 at 21:58
  • Dear EdM, actually in my opinion (as I have understood from the paper and the R package FactoMineR), MFA is rather a variant of PCA and not a classification method; it is suitable for my analysis, as you can describe the same observations (i.e. patient samples) with multiple groups of "different variables", so I naively believe that the notion of "overfitting" does not apply in this case. Now, regarding the second part of your answer, an interesting correlation within a "composite signature" could perhaps indicate some gene that could be detected along with the PET variable via experimental tests. – Jason Dec 12 '16 at 22:15
  • Although MFA and PCA incorporate all the original predictor variables, to avoid over-fitting your data you should only use the top 2 or 3 dimensions returned by MFA/PCA as predictors (without reference to the cancer/normal label), unless you impose some penalty as in ridge regression. Repeat the process on multiple bootstrap samples to validate the model-building process. [An Introduction to Statistical Learning with Applications in R](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf) is one of many good references to consult on the dangers of over-fitting. – EdM Dec 12 '16 at 22:40
  • Thank you again. I have checked the book, but just to verify your above approach: you again mention over-fitting when using PCA or MFA as an initial feature selection to reduce the features further, right? Because I mentioned MFA just in order to analyze the 60 samples simultaneously based on the various groups of variables (i.e. genes and PET variables), and to identify the top K variables that contribute most to the construction of the first 2 or 3 dimensions (among other capabilities): http://factominer.free.fr/advanced-methods/multiple-factor-analysis.html – Jason Dec 12 '16 at 23:38
  • 1
    It depends on what you do with the "top K variables." If you simply identify them to explain to your audience what they are, that's OK. Do not, however, fall into the trap of taking those K variables and using them as the predictors in a new standard logistic regression or similar analysis. That would almost certainly be over-fit if K>3, and might not repeat well on new cases even if K= 1 or 2 or 3. Your final model should have no more than 3 unpenalized predictors, whether original variables or dimensions in PCA/MFA, or you should penalize predictors with LASSO or ridge regression. – EdM Dec 13 '16 at 01:20
  • Yes, exactly, that is my initial goal: to show with an initial approach like MFA which genes and/or PET variables are the top contributors, and their possible biological relevance to the cancer under study. And thank you again for pinpointing the second crucial part of the analysis. Perhaps utilizing further external data would be an unbiased way of further validating these top K features, along with other methodologies. – Jason Dec 13 '16 at 01:30