If you take jackknifing to include not only leave-one-out but any kind of resampling without replacement, such as $k$-fold procedures, I consider it a viable option and use it regularly, e.g. in
Beleites et al.: Raman spectroscopic grading of astrocytoma tissues: using soft reference information. Anal Bioanal Chem, 2011, 400, 2801-2816
see also: Confidence interval for cross-validated classification accuracy
I avoid LOO for several reasons and instead use an iterated/repeated $k$-fold scheme. In my field (chemistry/spectroscopy/chemometrics), cross validation is far more common than out-of-bootstrap validation. For our data/typical applications we found that $i$ times iterated $k$-fold cross validation and $i \cdot k$ iterations of out-of-bootstrap performance estimates have very similar total error [Beleites et al.: Variance reduction in estimating classification error using sparse datasets. Chemom. Intell. Lab. Syst., 2005, 79, 91-100].
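As a minimal sketch of what I mean by an iterated $k$-fold scheme (the data set, the LDA classifier, and the choices $k = 5$, $i = 20$ are made-up stand-ins, not from any of the papers above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold

# made-up stand-in data: 80 cases, 2 classes (not real spectra)
X, y = make_classification(n_samples=80, n_features=20,
                           n_informative=5, n_redundant=0, random_state=0)

k, n_iter = 5, 20                    # k folds, i iterations
run_accuracy = np.empty(n_iter)      # one pooled performance estimate per CV run

for i in range(n_iter):
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=i)
    pred = np.empty_like(y)
    for train, test in folds.split(X, y):
        model = LinearDiscriminantAnalysis().fit(X[train], y[train])
        pred[test] = model.predict(X[test])   # each case is tested exactly once per run
    run_accuracy[i] = np.mean(pred == y)

print("mean accuracy over runs:", run_accuracy.mean())
print("std. dev. between runs (model instability):", run_accuracy.std(ddof=1))
```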
The particular advantage I see in iterated cross validation schemes over bootstrapping is that I can very easily derive stability/model-uncertainty measures that can be explained intuitively, and that it separates two different causes of variance uncertainty in the performance measurement which are more intertwined in out-of-bootstrap measurements.
One line of reasoning that gets me to cross validation/jackknifing is looking at the robustness of the model: cross validation corresponds rather directly to questions of the type "What happens to my model if I exchange $x$ cases for $x$ new cases?" or "How robust is my model against perturbing the training data by exchanging $x$ cases?" This also applies to bootstrapping, but less directly.
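To make the "exchange $x$ cases" question concrete, a toy sketch (all numbers and the spare-case pool are hypothetical) could perturb the training set and count how many predictions on a fixed test set flip:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=120, n_features=20,
                           n_informative=5, n_redundant=0, random_state=0)

# hypothetical split: 60 training cases, 40 spare cases to exchange in, 20 fixed test cases
X_tr, y_tr = X[:60], y[:60]
X_new, y_new = X[60:100], y[60:100]
X_test = X[100:]

ref_pred = LinearDiscriminantAnalysis().fit(X_tr, y_tr).predict(X_test)

x_exch = 10                           # exchange x = 10 training cases per perturbation
flip_rate = []
for _ in range(50):
    out_idx = rng.choice(len(y_tr), size=x_exch, replace=False)
    in_idx = rng.choice(len(y_new), size=x_exch, replace=False)
    X_pert = np.vstack([np.delete(X_tr, out_idx, axis=0), X_new[in_idx]])
    y_pert = np.concatenate([np.delete(y_tr, out_idx), y_new[in_idx]])
    pert_pred = LinearDiscriminantAnalysis().fit(X_pert, y_pert).predict(X_test)
    flip_rate.append(np.mean(pert_pred != ref_pred))   # how many predictions change?

print("average fraction of test predictions that change:", np.mean(flip_rate))
```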
Note that I do not try to derive confidence intervals, because my data is inherently clustered ($n_s$ spectra of $n_p \ll n_s$ patients), so I prefer to report

1. a (conservative) binomial confidence interval using the average observed performance and $n_p$ as sample size, and
2. the variance I observe between the $i$ iterations of the cross validation. After $k$ folds, each case is tested exactly once, though by different surrogate models. Thus any kind of variation observed between the $i$ runs must be caused by model instability.
Typically, i.e. if the model is well set up, 2. is needed only to show that it is much smaller than the variance in 1., and that the model is therefore reasonably stable. If 2. turns out to be non-negligible, it is time to consider aggregated models: model aggregation helps only against variance caused by model instability; it cannot reduce the variance uncertainty in the performance measurement that is due to the finite number of test cases.
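With made-up numbers (say, an average observed accuracy of 0.80 over $n_p = 40$ patients, and between-run accuracies as they might come out of the iterated $k$-fold sketch above), the comparison of 1. and 2. could look like this:

```python
import numpy as np
from scipy.stats import beta

# 1. conservative binomial (Clopper-Pearson) CI, using n_p patients as sample size
p_hat, n_p = 0.80, 40                     # made-up average performance and patient count
s = round(p_hat * n_p)                    # "successes" out of n_p
alpha = 0.05
lower = beta.ppf(alpha / 2, s, n_p - s + 1) if s > 0 else 0.0
upper = beta.ppf(1 - alpha / 2, s + 1, n_p - s) if s < n_p else 1.0
print("95 % binomial CI:", (round(lower, 3), round(upper, 3)))
print("binomial std. error:", round(np.sqrt(p_hat * (1 - p_hat) / n_p), 3))

# 2. variability between the i cross-validation runs (model instability)
run_accuracy = np.array([0.79, 0.81, 0.80, 0.82, 0.78, 0.80, 0.81, 0.79])  # made-up
print("std. dev. between CV runs:", round(run_accuracy.std(ddof=1), 3))
# for a well set-up model, the between-run scatter is much smaller than the binomial uncertainty
```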
Note that in order to construct performance confidence intervals for such data, I'd at least take into account that the variance observed between the $i$ runs of the cross validation is the variance of an average over $k$ surrogate models, i.e. I'd say the model instability variance is $k \cdot{}$ the observed variance between cross-validation runs; plus the variance due to the finite number of test cases - for classification (hit/error) performance measures this is binomial. For continuous measures, I'd try to derive the variance from the within-run variance, $k$, and the estimate of the instability-type variance for the $k$ surrogate models derived from the between-run variance as above.
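Spelling that out for the classification case (my reading of the argument above, a rough accounting rather than a rigorous derivation): with $s^2_{\text{runs}}$ the observed variance between the $i$ cross-validation runs and $\hat p$ the average observed proportion over $n$ independent test cases (here $n_p$),

$$\sigma^2_{\text{instability}} \;\approx\; k \cdot s^2_{\text{runs}}, \qquad \sigma^2_{\text{finite test set}} \;\approx\; \frac{\hat p\,(1-\hat p)}{n},$$

and a confidence interval would need to account for both contributions.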
The advantage of cross validation here is that you get a clear separation between the uncertainty caused by model instability and the uncertainty caused by the finite number of test cases. The corresponding disadvantage is of course that if you forget to take the finite number of actual cases into account, you'll severely underestimate the true uncertainty. However, this would happen for bootstrapping as well (though to a lesser extent).
So far, the reasoning concentrates on measuring the performance of the model you derive from a given data set. If you instead consider a data set of the given sample size for the given application, there is a third contribution to the variance that fundamentally cannot be measured by resampling validation; see e.g. Bengio & Grandvalet: No Unbiased Estimator of the Variance of K-Fold Cross-Validation. Journal of Machine Learning Research, 2004, 5, 1089-1105. We also have figures showing these three contributions in Beleites et al.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33. DOI: 10.1016/j.aca.2012.11.007.
I think what happens here is that the assumption that resampling is similar to drawing a completely new sample breaks down.
This is important if model building algorithms/strategies/heuristics are to be compared, rather than constructing and validating a particular model for the application.