
I've been working with WEKA to build class predictors using this (rather old) breast cancer dataset. The dataset is divided into a training set and a test set. I've been testing different learning schemes (mostly focused on feature selection) using 10-fold cross-validation experiments on the training set. Unfortunately, when I try the trained models out on the test set, there seems to be no correlation whatsoever between the cross-validation scores and actual test set performance.

Is this a common problem for microarray or other high-dimensional / low-sample-count data? Is there another approach that would be more suitable than cross-validation for evaluating models on the training data?
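
For reference, here is roughly the kind of setup I mean, using the WEKA Java API. The file name, evaluator (`CfsSubsetEval`), search method (`BestFirst`), and base classifier (`SMO`) below are just placeholders for the various schemes I've been sweeping over:

```java
import java.util.Random;

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CvWithFeatureSelection {
    public static void main(String[] args) throws Exception {
        // Load the training set (ARFF path is a placeholder).
        Instances train = DataSource.read("breast_cancer_train.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // Wrap feature selection and the base learner together, so that
        // attribute selection is redone inside every cross-validation fold.
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(new CfsSubsetEval());   // example evaluator
        asc.setSearch(new BestFirst());          // example search strategy
        asc.setClassifier(new SMO());            // example base classifier

        // 10-fold cross-validation on the training data only.
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(asc, train, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```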

Ben
  • I think the real problem is that gene expression profiling does not, in fact, predict the clinical outcome in breast cancer. – Alexander May 31 '12 at 17:36
  • Are you sure you have included feature selection / model selection (FS/MS) in your CV loop? Doing feature selection once and then testing its output with CV is not enough, and is a straight road to overfitting (see the sketch after these comments). – May 31 '12 at 18:56
  • The feature selection routines are included in the CV loop. (In WEKA I use the AttributeSelectedClassifier and then test different selection approaches.) – Ben May 31 '12 at 19:10
  • Another non-statistical comment. It is now widely understood that older studies like this one that measured gene products from single tumor biopsies provide data that do not accurately represent the tumor mutational landscape. A [paper](http://www.nejm.org/doi/full/10.1056/NEJMoa1113205) published this year in NEJM and its accompanying [editorial](http://www.nejm.org/doi/full/10.1056/NEJMe1200656) provide a good overview of this fact. It's also worth noting that [another group](http://www.pnas.org/content/100/14/8418.long) tried to replicate the study you cite above and were unsuccessful. – Alexander May 31 '12 at 20:22
  • Anyway, I know these comments don't directly answer your question. And perhaps indeed you can coerce the data into letting you train models that appear to be predictive in this particular cohort of patients (*i.e.*, your data source). But the reality is that predictive models based on gene expression data do not yet work in real life (although such data are frequently being used to *classify* subtypes of tumors, like in this major *Nature* [paper](http://www.nature.com/nature/journal/vaop/ncurrent/full/nature10983.html) published in April). – Alexander May 31 '12 at 20:27
  • Also, you might be interested in the paper [Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data](http://dx.doi.org/10.1093/bib/bbr001). [Gene Expression–Based Prognostic Signatures in Lung Cancer: Ready for Clinical Use?](http://jnci.oxfordjournals.org/content/102/7/464.short) would also be of interest to you. Both provide really wonderful discussions of this topic, although if I recall they don't provide full discussions of intra-tumor heterogeneity. – Alexander May 31 '12 at 20:58
  • Maybe this tutorial could help you: http://cbio.mskcc.org/~lianos/tips/svms-and-gene-signatures – friveroll Jun 01 '12 at 01:31
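
To make the comment about keeping FS/MS inside the CV loop concrete, here is a sketch of the problematic pattern it warns against: running WEKA's supervised `AttributeSelection` filter once on the full training set and only then cross-validating. The data path, evaluator, search method, and classifier are arbitrary examples, not taken from the question:

```java
import java.util.Random;

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class LeakyFeatureSelection {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("breast_cancer_train.arff"); // placeholder path
        train.setClassIndex(train.numAttributes() - 1);

        // ANTI-PATTERN: attribute selection is run once on the full training set,
        // so the selected features have already "seen" every class label ...
        AttributeSelection selectOnce = new AttributeSelection();
        selectOnce.setEvaluator(new CfsSubsetEval());
        selectOnce.setSearch(new BestFirst());
        selectOnce.setInputFormat(train);
        Instances reduced = Filter.useFilter(train, selectOnce);

        // ... which makes this cross-validation estimate optimistically biased.
        // Contrast with AttributeSelectedClassifier (sketch in the question),
        // which repeats the selection inside each training fold.
        Evaluation leaky = new Evaluation(reduced);
        leaky.crossValidateModel(new SMO(), reduced, 10, new Random(1));
        System.out.println(leaky.toSummaryString());
    }
}
```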

2 Answers


The answer really seems to be that cross-validation is not great, because its results are extremely variable, but it remains the best option available. The only other competitive approach seems to be the 0.632 bootstrap estimator, which has slightly lower variance but also tends to under-estimate the true performance. See *Is cross-validation valid for small-sample microarray classification?*. Also relevant (and perhaps obvious): the more features that are included, the higher the variance of the CV estimates.
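
For concreteness, here is a minimal sketch of how a 0.632 estimate could be computed by hand with the WEKA API (WEKA 3.7-style `AbstractClassifier.makeCopy`); the file name, wrapped classifier, and number of bootstrap replicates are arbitrary placeholders:

```java
import java.util.Random;

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Bootstrap632 {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("breast_cancer_train.arff"); // placeholder path
        train.setClassIndex(train.numAttributes() - 1);

        // Feature selection wrapped with the learner, as in the question.
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(new CfsSubsetEval());
        asc.setSearch(new BestFirst());
        asc.setClassifier(new SMO());

        int n = train.numInstances();
        int B = 50;                 // number of bootstrap replicates (arbitrary choice)
        double oobErrSum = 0;

        for (int b = 0; b < B; b++) {
            Random rnd = new Random(b);
            boolean[] inBag = new boolean[n];
            Instances boot = new Instances(train, n);
            for (int i = 0; i < n; i++) {       // draw n instances with replacement
                int idx = rnd.nextInt(n);
                inBag[idx] = true;
                boot.add(train.instance(idx));
            }
            Instances oob = new Instances(train, n);
            for (int i = 0; i < n; i++) {       // out-of-bag instances are the test set
                if (!inBag[i]) oob.add(train.instance(i));
            }
            Classifier clf = AbstractClassifier.makeCopy(asc);
            clf.buildClassifier(boot);
            Evaluation ev = new Evaluation(boot);
            ev.evaluateModel(clf, oob);
            oobErrSum += ev.errorRate();        // out-of-bag (leave-one-out bootstrap) error
        }

        // Resubstitution error: train and evaluate on the full training set.
        Classifier full = AbstractClassifier.makeCopy(asc);
        full.buildClassifier(train);
        Evaluation resub = new Evaluation(train);
        resub.evaluateModel(full, train);

        // 0.632 combination of the optimistic and pessimistic estimates.
        double err632 = 0.368 * resub.errorRate() + 0.632 * (oobErrSum / B);
        System.out.println("0.632 bootstrap error estimate: " + err632);
    }
}
```

The reduced variance relative to 10-fold CV comes largely from averaging over many more resamples, at the cost of the bias mentioned above.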

Ben

I think the problem may be that your training set is too small and therefore not representative of the entire population; if you then test on even smaller test sets, the data can look very different. This is a general large p, small n problem and applies to that type of problem whether it is genetics or not. It has nothing to do with how well genes predict outcomes in breast cancer. In fact, I think there are several biomarkers that are useful for estimating the probability of recurrence for patients who had the tumor completely removed.

Michael R. Chernick
  • Michael - this is my feeling as well. Could you suggest another approach that handles the large p small n problem more effectively? – Ben May 31 '12 at 19:12
  • That is a difficult problem and is not settled statistically. But you should take a look at Efron's empirical Bayes approach which is given in his recent monograph "Large Scale Inference." – Michael R. Chernick May 31 '12 at 19:52
  • @Michael, Why do you think it has nothing to do with how well DNA microarray expression data predict breast cancer recurrence? There are no clinically useful microarray-based predictive tests currently used in breast cancer treatment. Currently used markers including the estrogen receptor, progesterone receptor, and the human epidermal growth factor receptor 2 protein are not microarray-based tests. Other markers including Ki-67 and p53 analysis are also not based on the measurement of gene expression levels. – Alexander May 31 '12 at 20:06
  • @Alexander I have seen ER and HER2 be effective predictive biomarkers in cancer studies I have worked on at LIMR. I have no idea how they were identified. I assumed it might have been through microarray analysis. But you probably know better than I do. I have also seen a biomarker we published a paper on, TIMP-4, appear to be a good predictor of whether or not there would be rapid growth of small-area breast cancer tumors. – Michael R. Chernick May 31 '12 at 20:51
  • @Alexander Whether or not any genetically based biomarkers used in the OP's study are effective is beside the point. The issue is why the performance is different on the test set, and I think that my statistical explanation makes more sense and is based on what I know about the data. You base your remarks on your past experience but know nothing about the OP's data. – Michael R. Chernick May 31 '12 at 20:56
  • @Michael, I won't argue with your statistical explanation, as I am not a statistician. However, my point is that gene microarray data like these are not predictive of recurrence, period. This data set contains two groups of patients who total 78 individuals. They were divided into two groups based on whether or not they had relapsed at five years. The expression levels of over 24,000 genes were measured in each sample. These measurements do not even accurately reflect the totality of the mutations in the tumors. – Alexander May 31 '12 at 21:11
  • The question here is really about the methods. If I were consistently seeing 55% accuracy in cross-validation and then also seeing 55% accuracy on the test set, then the biological explanation would be great. The challenge here is that I sometimes see 90% correct in CV and then 42% on the test set (and vice versa). In fact, if I run a parameter sweep and generate many different models, I see absolutely no correlation between CV performance and test set performance. (The scatterplot is a shotgun blast; see the sketch at the end of this thread for the kind of sweep I mean.) What would you say, as a statistician, if you observed this in data you had to process? – Ben May 31 '12 at 22:51
  • I have no basis to argue that point Alexander. You may be completely right and you may have a valid critique of the analysis. I am not an MD. But the point here is to understand the statistical principles that explain what the OP is puzzled about. Analysis of large scale data and the large p small n problem make it very difficult to identify real predictive biomarkers and too easy to find spurious ones. That may explain the lack of success with gene expression. But we are starting to get better at it and things could change. – Michael R. Chernick May 31 '12 at 22:57
  • @Ben I am not sure whether you are addressing your question to me or Alexander. But the point you are making now is exactly the point I thought you made with your initial question. I think that although you may feel that the results are crazy, this is new territory for statistics, an area we used to say should not be touched. Statistical dogma always said that to model, n must be a lot bigger than p. Now we are accepting that something rational can be done when p is a lot larger than n. 20th-century statisticians would be rolling over in their graves! – Michael R. Chernick May 31 '12 at 23:03
  • I agree with you about the point of the question, sorry if I seemed a bit pedantic about the point I was trying to make. Anyway that's why I left it as a comment rather than an answer as I realize that it wasn't really an "answer" to the question itself. – Alexander May 31 '12 at 23:06
  • @Alexander I find your comments here to be knowledgeable, well thought out and interesting. – Michael R. Chernick Jun 01 '12 at 01:14
  • @Michael thanks for your thoughts on this. Looks like there is no easy answer, but that's reality for you! Looking through the reviews of Efron's book [link](http://www.amazon.com/Large-Scale-Inference-Estimation-Prediction-Mathematical/dp/0521192498), it seems like his course Stats 329 at Stanford [link](http://www-stat.stanford.edu/~omkar/329/) is a good place to start building background knowledge in this area. – Ben Jun 01 '12 at 05:36
  • I agree. I studied math stat from him when I was a graduate student (many years ago) and I have heard him lecture on this topic. He is a great teacher along with being famous for his work in geometry of exponential families, empirical Bayes methods and the bootstrap. I think I should take the course! – Michael R. Chernick Jun 01 '12 at 09:25
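
Here is a sketch of the sort of sweep referred to in the comment above about the shotgun-blast scatterplot: for each parameter setting, record both the 10-fold CV accuracy on the training set and the accuracy of the refit model on the held-out test set, then plot the pairs. The file names, parameter grid, and classifier choices are only examples:

```java
import java.util.Random;

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CvVersusTestSweep {
    public static void main(String[] args) throws Exception {
        // Placeholder ARFF paths; the train/test split is the one from the question.
        Instances train = DataSource.read("breast_cancer_train.arff");
        Instances test = DataSource.read("breast_cancer_test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        double[] cValues = {0.1, 1.0, 10.0, 100.0};   // example parameter grid
        for (double c : cValues) {
            SMO base = new SMO();
            base.setC(c);

            AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
            asc.setEvaluator(new CfsSubsetEval());
            asc.setSearch(new BestFirst());
            asc.setClassifier(base);

            // 10-fold CV estimate on the training set.
            Evaluation cv = new Evaluation(train);
            cv.crossValidateModel(asc, train, 10, new Random(1));

            // Accuracy on the test set of the model refit on the whole training set.
            Classifier fitted = AbstractClassifier.makeCopy(asc);
            fitted.buildClassifier(train);
            Evaluation onTest = new Evaluation(train);
            onTest.evaluateModel(fitted, test);

            System.out.printf("C=%5.1f   CV accuracy=%5.1f%%   test accuracy=%5.1f%%%n",
                    c, cv.pctCorrect(), onTest.pctCorrect());
        }
    }
}
```

Scatter-plotting the printed pairs is what produces the "shotgun blast" described in the comments when the two estimates do not track each other.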