
I have a cancer classification problem (type A vs type B) on radiological images, from which I have generated 766 texture-based predictive features (wavelet transform followed by texture analysis, i.e., features described by Haralick, Amadasun, etc.) and 8 semantic features based on subjective assessment by an expert radiologist. This is entirely for research and publication, to show that these predictive features may be useful in this particular problem. I do not intend to deploy the model for practitioners.

I have 107 cases: 60% are type A and 40% type B (in keeping with their natural proportions in the population). I have done several iterations of model development with varying results. One particular method is giving me 80% training and 80% test classification accuracy, but I am suspicious that my method will not stand up to critical analysis. I am going to outline my method and a few alternatives, and I would be grateful if someone can point out whether it is flawed. I have used R for this:

Step 1: Split into 71 training and 36 test cases.
Step 2: Remove correlated features from the training dataset (766 -> 240) using the findCorrelation function (caret package).
Step 3: Rank the training features by Gini index (CORElearn package).
Step 4: Train multivariate logistic regression models on the top 10 ranked features, using subsets of sizes 3, 4, 5, and 6 in all possible combinations (10C3 = 120, 10C4 = 210, 10C5 = 252, 10C6 = 210). So in total 792 multivariate logistic regression models were trained with 10-fold cross-validation and then evaluated on the test dataset.
Step 5: Of these I selected the model that gave the best combination of training and test accuracy, i.e., a 3-feature model with 80% training / 80% test accuracy. (A rough R sketch of this pipeline is given below.)
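
Roughly, the pipeline looks like this in R. This is a simplified sketch, not my exact code: it assumes a data frame `dat` with a two-level factor outcome `type` and all-numeric features (names and cutoffs are placeholders for illustration).

```r
library(caret)      # createDataPartition, findCorrelation, train
library(CORElearn)  # attrEval for the Gini ranking

set.seed(1)

## Step 1: stratified 71/36 train-test split
idx       <- createDataPartition(dat$type, p = 71/107, list = FALSE)
train_dat <- dat[idx, ]
test_dat  <- dat[-idx, ]

## Step 2: drop highly correlated features, using the training data only
feat <- setdiff(names(train_dat), "type")
drop <- findCorrelation(cor(train_dat[, feat]), cutoff = 0.90)
if (length(drop) > 0) feat <- feat[-drop]

## Step 3: rank the remaining features by Gini index
gini  <- attrEval(type ~ ., data = train_dat[, c(feat, "type")], estimator = "Gini")
top10 <- names(sort(gini, decreasing = TRUE))[1:10]

## Step 4: one 10-fold-CV logistic model per subset of size 3 to 6
ctrl    <- trainControl(method = "cv", number = 10)
subsets <- unlist(lapply(3:6, function(k) combn(top10, k, simplify = FALSE)),
                  recursive = FALSE)
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- train(train_dat[, vars], train_dat$type, method = "glm", trControl = ctrl)
  data.frame(vars     = paste(vars, collapse = " + "),
             acc_cv   = max(fit$results$Accuracy),
             acc_test = mean(predict(fit, test_dat[, vars]) == test_dat$type))
}))

## Step 5: pick the model that looks best on both -- the step I am unsure about
head(results[order(-(results$acc_cv + results$acc_test)), ])
```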

Somehow fitting hundreds of candidate models and picking the best seems quite dodgy to me and likely to have introduced some false discovery. I just want to confirm whether this is a valid ML technique, or whether I should skip step 4 and only train on the top 5 ranked features without trying any combinations.

Thanks.

PS: I experimented a bit with naive Bayes and random forests but got rubbish test-set accuracy, so I dropped them.

====================

UPDATE

Following discussion with SO members, I have changed the model drastically and have therefore moved the more recent questions regarding model optimisation into a new post: Is my LASSO regularised classification method correct?

Maelstorm
  • Of your 107 cases, how many are in each type of cancer? – EdM Aug 15 '16 at 13:26
  • Hi, 65 type A and 42 type B. That is also their natural frequency. Also, I forgot to mention clearly that the models are multivariate logistic regression. Naive Bayes and random forests were giving about 60% test-set accuracy with massive overfitting (95% training accuracy). – Maelstorm Aug 15 '16 at 14:12
  • To be clear, your main goal is classification accuracy and not explanation. Correct? – shadowtalker Aug 15 '16 at 14:44
  • And the features are all constructed from the radiological images? What kind of algorithm are you using to construct the features? – shadowtalker Aug 15 '16 at 14:44
  • Hi, my goal is only accuracy and not explanation. The features constructed from radiological images are based on texture analysis. From each image, I extract matrices (e.g., gray-level co-occurrence matrix, neighbourhood gray-tone difference matrix) and from those matrices a number of scalar quantities are derived, such as auto-correlation, variance, mean, and standard deviation. I have also used a single-level wavelet transform on each image to get 8 further images from it. On each of those images the same process of texture analysis is repeated, thus the 766 total features. – Maelstorm Aug 15 '16 at 19:01
  • There's a lot to deal with in your update, beyond what's in your original question, so you might want to pose the update as a separate question (with a link to this one), keeping this thread more readable. Note that the odds-ratios for the PCs would be per unit change in the PC value. I'm a little worried that your LASSO is keeping so many variables. It would help to have more specific information about how you standardized variables before LASSO, how you implemented LASSO and CV, maybe plots of accuracy versus penalty in CV, etc, rather than just the summarized results. – EdM Aug 31 '16 at 18:05
  • Thanks. I have uploaded another question here http://stats.stackexchange.com/questions/232829/lasso-regularised-classification-highly-variable-choice-of-lambda-min-on-repeate – Maelstorm Sep 01 '16 at 10:49
  • As you suspect, there is a huge possibility for data dredging given your small data set and high dimensionality. I would do a few runs of your procedure with scrambled labels (i.e., A vs B) and see whether you still get good results. – seanv507 Sep 01 '16 at 11:11
  • thanks. I have moved away from this train / test split due to small sample size and implemented a regularized model on the entire dataset. Please have a look at http://stats.stackexchange.com/questions/232829/lasso-regularised-classification-highly-variable-choice-of-lambda-min-on-repeate – Maelstorm Sep 01 '16 at 11:30

1 Answer


I see 3 potential problems with this approach. First, if you intend to use your model for classifying new cases, your variable-selection procedure might lead to a choice of variables too closely linked to peculiarities of this initial data set. Second, the training/test set approach might not be making the most efficient use of the data you have. Third, you might want to reconsider your metric for evaluating models.

First, variable selection tends to find variables that work well for a particular data set but don't generalize well. It's fascinating and frightening to take a variable selection scheme (best subset as you have done, or even LASSO) and see how much the set of selected variables differs just among bootstrap re-samples from the same data set, particularly when many predictors are inter-correlated.
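
For example, a quick way to see that instability is to rerun a selection procedure on bootstrap resamples and tabulate how often each feature is chosen. A minimal sketch with glmnet's LASSO, assuming a numeric feature matrix `X` with column names and a binary outcome `y` (both placeholder names):

```r
library(glmnet)

set.seed(2)
sel <- replicate(100, {
  i   <- sample(nrow(X), replace = TRUE)                          # bootstrap resample
  cvf <- cv.glmnet(X[i, ], y[i], family = "binomial", alpha = 1)  # LASSO with CV
  b   <- coef(cvf, s = "lambda.min")[-1, 1]                       # coefficients, intercept dropped
  names(b)[b != 0]                                                # features kept in this resample
}, simplify = FALSE)

## how often each feature survives selection across the 100 resamples
sort(table(unlist(sel)), decreasing = TRUE)[1:20]
```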

For this application, where many of your predictors seem to be correlated, you might be better off taking an approach like ridge regression that treats correlated predictors together. Some initial pruning of your 766 features might still be wise (maybe better based on subject-matter knowledge than on automated selection), or you could consider an elastic net hybrid of LASSO with ridge regression to get down to a reasonable number of predictors. But when you restrict yourself to a handful of predictors you risk throwing out useful information from other potential predictors in future applications.
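
A minimal glmnet sketch of that suggestion, again with placeholder `X` (e.g. your 240 pre-pruned features) and `y`; the `alpha` argument controls the ridge/LASSO mix:

```r
library(glmnet)

set.seed(3)
## alpha = 0 is ridge, alpha = 1 is LASSO, values in between give an elastic net
cv_ridge <- cv.glmnet(X, y, family = "binomial", alpha = 0, nfolds = 10)
ridge    <- glmnet(X, y, family = "binomial", alpha = 0, lambda = cv_ridge$lambda.min)

## predicted probabilities (rather than hard class labels) for a set of cases
p_hat <- predict(ridge, newx = X, type = "response")
```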

Second, you may be better off using the entire data set to build the model and then using bootstrapping to estimate its generalizability. For example, you could use cross-validation on the entire data set to find the best choice of penalty for ridge regression, then apply that choice to the entire data set. You would then test the quality of your model on bootstrap samples of the data set. That approach tends to maximize the information that you extract from the data, while still documenting its potential future usefulness.
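
One way to implement that idea, sketched with the `rms` package: fit on all 107 cases, then let `validate()` and `calibrate()` estimate the optimism by repeating the fit on bootstrap resamples. The three predictors here are purely hypothetical placeholders.

```r
library(rms)

dd <- datadist(dat); options(datadist = "dd")

## feat1-feat3 stand in for whatever predictors you keep
fit <- lrm(type ~ feat1 + feat2 + feat3, data = dat, x = TRUE, y = TRUE)

## refit on bootstrap resamples to get optimism-corrected performance indexes
validate(fit, method = "boot", B = 200)

## optimism-corrected calibration curve
plot(calibrate(fit, method = "boot", B = 200))
```

The `rms` package also provides `pentrace` for adding a ridge-type penalty to `lrm`, in the same spirit as the penalized approach suggested above.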

Third, your focus on classification accuracy makes the hidden assumption that both types of classification errors have the same cost and that both types of classification successes have the same benefit. If you have thought hard about this issue and that is your expert opinion, OK. Otherwise, you might consider a different metric for, say, choosing the ridge-regression penalty during cross-validation. Deviance might be a more generally useful metric, so that you get the best estimates of predicted probabilities and then can later consider the cost-benefit tradeoffs in the ultimate classification scheme.
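
With `cv.glmnet`, for instance, the metric is just an argument, so the two criteria are easy to compare (a sketch, with `X` and `y` as above):

```r
library(glmnet)

cv_cls <- cv.glmnet(X, y, family = "binomial", alpha = 0, type.measure = "class")
cv_dev <- cv.glmnet(X, y, family = "binomial", alpha = 0, type.measure = "deviance")

## the two criteria can prefer noticeably different penalties
c(lambda_by_class_error = cv_cls$lambda.min,
  lambda_by_deviance    = cv_dev$lambda.min)
```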

In terms of avoiding overfitting, the penalty in ridge regression means that the effective number of variables in the model can be far fewer than the number of variables nominally included. With only 42 cases of the less common class, you were right to end up with only 3 features (about 14 cases of the less common class per selected feature). The penalization provided by ridge regression, if chosen well by cross-validation, will allow you to combine information from more features in a way that is less dependent on the peculiarities of your present data set, while avoiding overfitting and remaining generalizable to new cases.

EdM
  • Thanks for the elaborate answer. Not sure if you noticed, but I mentioned in step 2 that I weeded out highly correlated features from the 766 starting set to obtain 240 less correlated ones. I suppose this wouldn't affect your opinion, since you have covered it where you recommend using regularised regression. Could you confirm that despite it being a train/test scenario, my results are highly data-driven, i.e., overfitting? I am not a professional as you might have guessed but will start working on these ideas. Thanks – Maelstorm Aug 15 '16 at 19:06
  • I'm not sure your bootstrap approach makes sense, since you'd be testing the same samples that were used to train the model. – Firebug Aug 15 '16 at 19:24
  • @Firebug bootstrapping from the original data is a well-established way to evaluate the performance of a modeling process. See [this answer](http://stats.stackexchange.com/a/14550/28500) and the link in it. Resampling the original data _with replacement_ mimics taking a new sample from the full population; doing it many times gives reasonable estimates for many purposes. It allows for calibration of models, correcting for optimism in the original model, and provides useful estimates of variance in future applications to the same underlying population. See the `rms` package in R. – EdM Aug 16 '16 at 17:47
  • @Maelstorm what I fear is that the particular 3 variables you end up with based on this data sample are unlikely to be the same 3 that you would have selected based on some other sample. Whether this is technically "overfitting" doesn't really matter. Test that by repeating your model building on multiple bootstrap samples from your data. Your cross-validation seems to have tested the model, not the model-building process itself, which is important. Your initial pruning of features is one way to start, providing features for ridge regression or elastic net modeling. – EdM Aug 16 '16 at 17:59
  • @EdM Perhaps I'm misunderstanding your answer. Predictions should be done on out-of-bootstrap samples; the rest is used to build the model iteratively. Building a model on the whole data and then throwing the same data at it will potentially lead to an optimistic performance estimate. – Firebug Aug 16 '16 at 18:02
  • @Firebug done properly, bootstrapping can correct for optimism of a model built on the entire data sample. The `rms` package in R provides such facilities. Frank Harrell, author of that package, argues that models built on only a portion of the available data throw away too much information and thus diminish the power of modeling. See his course notes and Regression Modeling Strategies book, available through [this page](http://biostat.mc.vanderbilt.edu/wiki/Main/RmS), for extensive details. – EdM Aug 16 '16 at 18:12
  • How would that work on Random Forests, which usually achieve perfect performance on training data? – Firebug Aug 17 '16 at 00:03
  • @Firebug I don't have direct experience with random forests. The question and my answer were in the context of (logistic) regression. My hunch is that your cautions might be much more warranted with random forests. – EdM Aug 17 '16 at 03:50
  • @EdM I see. You are treating the logistic regression just like any other GLM in regression analysis, while I'm treating it as a regression learner in machine learning (there's a machine learning tag on this question). Usually, in machine-learning, it's necessary to compare different algorithms and I guarantee this kind of bootstrap won't work most of the time with Random Forests (there's another way to use it though that works, akin to Cross Validation). That was my concern. I do agree on your other points. – Firebug Aug 17 '16 at 12:40
  • Dear @EdM, I have updated my question based on your feedback. I will be very grateful if you can look at the update and suggest whether I'm almost there. Thanks a lot! – Maelstorm Aug 30 '16 at 18:20
  • @EdM thanks a lot for your help with this model. I am finishing my thesis (this study was a part of it). I would like to acknowledge you in it. Also, when I submit the paper for publication, may I invite you to become a co-author? If you're happy with that, please send me an email: drusmanbashir@gmail.com – Maelstorm May 28 '17 at 14:22