What do you think about applying machine learning techniques, such as Random Forests or penalized regression (with an L1 or L2 penalty, or a combination of the two, as in the elastic net), in small-sample clinical studies when the objective is to isolate interesting predictors in a classification context? This is not a question about model selection, nor am I asking how to find optimal estimates of variable effect or importance. I don't plan to do strong inference; I just want to use multivariate modeling, thereby avoiding testing each predictor against the outcome of interest one at a time, while taking the predictors' interrelationships into account.
I was just wondering whether such an approach has already been applied in this particularly extreme case, say 20-30 subjects with data on 10-15 categorical or continuous variables. It is not exactly the $n\ll p$ case, and I think the problem here relates to the number of classes we try to explain (which are often not well balanced) and the (very) small $n$. I am aware of the huge literature on this topic in the context of bioinformatics, but I couldn't find any reference related to biomedical studies with psychometrically measured phenotypes (e.g., phenotypes assessed through neuropsychological questionnaires).
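To make the setting concrete, here is a minimal sketch of the kind of analysis I have in mind, using scikit-learn on simulated data with the sample sizes described above (n = 25 subjects, p = 12 predictors, binary outcome). The data, the choice of `l1_ratio = 0.5`, and all other tuning values are purely illustrative assumptions, not recommendations:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Simulated stand-in for the real clinical data (hypothetical).
rng = np.random.default_rng(0)
n, p = 25, 12
X = rng.normal(size=(n, p))
# Only the first two variables carry signal in this toy setup.
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Elastic-net-penalized logistic regression: shrinks coefficients,
# setting some exactly to zero, which gives a crude predictor screen.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000)
enet.fit(X, y)

# Random forest: impurity-based variable importances.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y)

# With n this small, leave-one-out CV is about the only option
# for an honest estimate of classification accuracy.
loo_acc = cross_val_score(rf, X, y, cv=LeaveOneOut()).mean()

print("nonzero elastic-net coefficients:", np.flatnonzero(enet.coef_[0]))
print("RF importances (descending):", np.argsort(rf.feature_importances_)[::-1])
print("LOO accuracy:", round(loo_acc, 2))
```

My worry, of course, is how stable either the selected coefficients or the importance ranking can be at these sample sizes, which is exactly why I am asking for references.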
Any hints or pointers to relevant papers?
Update
I am open to any other solutions for analyzing this kind of data, e.g. the C4.5 algorithm or its derivatives, association rule methods, or any data mining technique for supervised or semi-supervised classification.