
I am working on a regression problem where I have around 5-10 thousand features and only 65 samples. I am training my algorithm on 55 samples and testing on 10 samples. I am using both Pearson correlation and random-forest feature importance to remove irrelevant features. On the test data, the random forest simply predicts the average of the training targets. I have a few questions:

(1) Which ML algorithm should I use when I have very few samples for training (55 samples in my case)?

(2) What is the best way to do feature selection when there are many more features than samples?

Kamal
  • I fear you lack enough observations to do any kind of serious predictive modeling unless you have a ton of domain knowledge to inform the model and a very high signal-to-noise ratio. – Dave Jul 25 '21 at 22:04
  • You would probably be better off using regularisation rather than feature selection. Is feature selection a goal of the analysis or do you just want the best predictor? – Dikran Marsupial Jul 26 '21 at 09:03
  • @Dikran, thanks for the comment. Yes, finding the relevant features is one of the goals of the analysis. @Dave, thanks for the comment. – Kamal Jul 28 '21 at 02:58

4 Answers


You're seeking a needle in a haystack. As noted in the comments and another answer, you do not have enough data. Even setting aside the number of features, 65 is already a very small sample size for any machine learning model, and adding feature selection on top makes the problem pretty much doomed.

You say that you have between 5 and 10 thousand features, so let's assume 7,500. With 55 training samples, your model would easily overfit. Below you can see a model trained on completely random data that "achieves" nearly perfect $R^2$.

from sklearn.ensemble import AdaBoostRegressor
import numpy as np

np.random.seed(42)
y_train = np.random.rand(55)        # 55 random targets
X_train = np.random.rand(55, 7500)  # 7,500 completely random features

model = AdaBoostRegressor(random_state=42)
model.fit(X_train, y_train)
model.score(X_train, y_train)       # in-sample R^2
## 0.9895214625949762

You would probably say, "Hey, wait a minute! That's the train score. What about the test score?" You'd be right: the test score is bad. So suppose you train the model, validate the results on the test set, and repeat until you find acceptable results. Notice that your test set is only ten samples, so you only need to get ten numbers right. Let me give another example: suppose your "model" now returns a completely random result. How many iterations are needed to obtain a result with a high $R^2$ (which, for linear regression, equals the squared Pearson correlation $r^2$)? Apparently, just a few thousand.

from scipy import stats as sp

np.random.seed(42)

best_r2 = 0
y_test = np.random.rand(10)      # ten random "test" targets

for i in range(10000):
    y_pred = np.random.rand(10)  # a completely random "prediction"
    r, _ = sp.pearsonr(y_pred, y_test)
    r2 = r**2
    if r2 > best_r2:
        best_r2 = r2
        print(f"iter={i}, r={r2}")  # the printed value is the squared correlation

## iter=0, r=0.49601681572673695
## iter=6, r=0.6467516405878888
## iter=92, r=0.6910478084107202
## iter=458, r=0.6971821688682832
## iter=580, r=0.6988719722383485
## iter=1257, r=0.721148489188462
## iter=2015, r=0.7437673627048644
## iter=2253, r=0.7842495052355497
## iter=4579, r=0.8189207386492211
## iter=5465, r=0.8749525244481782

How does this apply to the machine learning scenario? Imagine that instead of a random "model" you have an actual machine learning model, trained on the training set and validated on the test set. Say that you "tune" the model's random seed for many iterations. If you wait long enough, you will find a completely random solution, on completely random data, that matches your test data well. The same applies to data-based feature selection.
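For illustration, a seed-"tuning" loop of that kind might look like the sketch below (the choice of AdaBoostRegressor and the number of seeds are arbitrary; the data are again pure noise):

from sklearn.ensemble import AdaBoostRegressor
import numpy as np

rng = np.random.default_rng(0)
X_train, y_train = rng.random((55, 7500)), rng.random(55)  # pure noise
X_test, y_test = rng.random((10, 7500)), rng.random(10)    # ten-sample "test set"

best_score = -np.inf
for seed in range(100):
    model = AdaBoostRegressor(random_state=seed).fit(X_train, y_train)
    score = model.score(X_test, y_test)  # R^2 on the ten test samples
    best_score = max(best_score, score)
# best_score can only increase with more seeds, even though there is nothing to learn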

You can find similar arguments in How to choose the training, cross-validation, and test set sizes for small sample-size data?

  • If your sample size is already small I recommend avoiding any data driven optimization. Instead, restrict yourself to models where you can fix hyperparameters by your knowledge about model and application/data. This makes one of the validation/test levels unnecessary, leaving more of your few cases for training of the surrogate models in the remaining cross validation.

Also, rather than using a single held-out test set, it is better to use cross-validation. Keep in mind that with a small sample, cross-validation is still not very reliable (see Varoquaux, 2017) and does not offer a good estimate of out-of-sample performance.
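A rough sketch of that setup, assuming scikit-learn (ridge regression with a fixed penalty is just a placeholder model; the spread of the fold scores shows how unstable the estimate is):

from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.random((65, 7500)), rng.random(65)  # replace with your data

cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")
print(scores.mean(), scores.std())  # a large std means the estimate is unreliable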

TL;DR:

  • The more iterations you make, the more likely you are to overfit. Avoid data-based optimization as much as possible: rather than using data-based feature selection, pick a handful (< 10) of meaningful features by hand.
  • The same applies to model choice: with such small data, you don't want to tune the hyperparameters. Use domain knowledge to pick a model that is likely to work for the data. You want a simple model that is less likely to overfit.
  • Use cross-validation rather than a held-out set. A test set of ten samples is too small and unreliable; you could easily overfit to it.
Tim

Others have described your impossible challenge very well. To state it in other ways:

  • Your sample size is too small by a factor of 100 for split-sample validation to have any hope of working, i.e., of being precise and stable
  • Statistical testing to discard features has no hope of discarding the right features
  • Penalization (regularization; shrinkage) has no hope here because the sample size is too small for you to be able to choose the shrinkage factor
  • Your only hope is to use unsupervised learning methods (data reduction; these do not use the outcome variable in any way), such as principal components, variable clustering, or sparse principal components, to reduce the dimensionality of the feature space down to 1 or 2 numbers per observation, and then to use those 1 or 2 numbers to predict the outcome. In other words, the only hope of meeting your goal is to be lucky enough that the ways the features are redundant with each other can be collapsed into some kind of signal summary that relates to Y. A rough sketch of this idea follows below.
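A rough sketch of that last point, assuming scikit-learn (PCA and the choice of two components are only illustrative): compress the features to one or two unsupervised scores, then regress Y on those scores alone.

from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.random((65, 7500)), rng.random(65)  # replace with your data

# PCA never looks at Y, so the reduction step itself cannot overfit the outcome.
model = make_pipeline(PCA(n_components=2), LinearRegression())
print(cross_val_score(model, X, y, cv=5, scoring="r2"))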
Frank Harrell
  • Thanks for the valuable comment. Could you please explain point 2 a bit more for this case where we have many more features than samples: "Statistical testing to discard features has no hope of discarding the right features"? – Kamal Jul 28 '21 at 03:50
  • That is correct. The false negative rate is also staggeringly high, i.e., actually useful features are likely to be removed. It helps to think Bayesian: for each feature you can compute a posterior probability that the feature is associated with Y. You'll see a large number of features with probabilities between 0.05 and 0.95 and you can compute the expected number of false inclusions or false exclusions. You'll be shocked. Decisions are only reliable if the probabilities of the unknown truths are very small or very large. – Frank Harrell Jul 28 '21 at 12:38

I think the answer to this question is highly data and problem dependent. Some people have reasonably suggested fitting a simple model with regularization, like ridge regression or Lasso, using all features. I also think this is a good idea, and you should try it. But see here: Is ridge regression useless in high dimensions ($n \ll p$)? How can OLS fail to overfit?

In that question, the original poster had a real-world data set with approximately as many data points as you and the same order of magnitude of features. They found that regularization didn't help, and another user pointed out that, for some data sets with many more features than data points, regularization moves in the wrong direction relative to the minimum-norm solution.

So the point is, even very general suggestions like "use regularization" might end up being wrong in your specific case, depending on the data.
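One rough way to check this on your own data is sketched below (the penalty grid is arbitrary; a near-zero penalty approximates the minimum-norm least-squares fit):

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.random((65, 7500)), rng.random(65)  # replace with your data

for alpha in [1e-6, 1e-3, 1, 10, 100]:
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2")
    print(f"alpha={alpha:g}: mean R^2 = {scores.mean():.3f}")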

If I were you, I would start by checking whether there is some domain-dependent reason to include certain variables. If not, check for variables that seem different from the others: different ranges, missing values, etc. Try to figure out what the reason for those differences is and see if that gives any insight into the data.

Univariate screening like you've tried with Pearson correlation is probably only going to be useful if you find some variables with very high correlation. Use your training set and just check whether most of your variables have correlations between -0.5 and 0.5 (consistent with random noise in your case) while a few have correlations around 0.65 (unlikely even if you had 100,000 random features). Don't take the top 5% or 10% of features, and don't auto-exclude features below some correlation threshold. Just check whether there is some obviously important feature that you should already know about for domain reasons, in case you forgot something important or someone forgot to tell you something.
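A small sketch of that sanity check (illustrative only; it uses the training set alone):

import numpy as np

rng = np.random.default_rng(0)
X_train, y_train = rng.random((55, 7500)), rng.random(55)  # replace with your data

# Pearson correlation of every feature with the outcome, computed in one pass.
Xc = X_train - X_train.mean(axis=0)
yc = y_train - y_train.mean()
r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc**2).sum(axis=0) * (yc**2).sum())

# With 55 samples and thousands of noise features, |r| around 0.5 is expected
# by chance; only correlations far beyond that are worth a second look.
print(np.sort(np.abs(r))[-10:])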

Failing something obvious like that, I think you're back to data- and problem-dependent answers. It matters how you ended up in a situation where 10,000 features were created from 65 data points. If all or most of your features are pairwise correlated with similar scales, maybe they are all somehow noisy measurements of the same thing. In that case, averaging all the predictors and using the result as the single feature in a regression might do very well.
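A minimal sketch of that last idea (illustrative only): collapse the features to their row-wise average and fit a one-variable regression.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.random((65, 7500)), rng.random(65)  # replace with your data

x_mean = X.mean(axis=1).reshape(-1, 1)  # one feature: the average of all predictors
print(cross_val_score(LinearRegression(), x_mean, y, cv=5, scoring="r2"))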

Jonny Lomond
  • The chances of any predictive model doing well here are very very slim. You need to either get more data, or use domain knowledge to narrow down the number of predictors you consider.
  • That said, Lasso regression is the usual choice in situations like this; a rough sketch follows below.
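A rough sketch of what that usually looks like, assuming scikit-learn (LassoCV picks the penalty by internal cross-validation; see the comments below for why this is unlikely to recover the right features with so little data):

from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.random((65, 7500)), rng.random(65)  # replace with your data

# Standardize, then let LassoCV choose the penalty by cross-validation.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)
print(np.sum(model.named_steps["lassocv"].coef_ != 0), "features kept")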
Eoin
  • In this situation the probability that _lasso_ will find the right features is zero. See https://www.fharrell.com/talk/stratos19/ – Frank Harrell Jul 28 '21 at 12:35
  • I don't disagree! I guess I meant *situations like this, but not quite so data poor and feature rich*. – Eoin Jul 29 '21 at 21:53
  • For virtually all feature selection algorithms the reliability only increases a bit as the sample size increases. You're asking the data to do things that the information content in the data does not support, especially if features are collinear. Feature selection also wastes some of the information on selection that could have been used for prediction. – Frank Harrell Jul 30 '21 at 11:37