Background:
Here is the background for this question: the question "Training with the full dataset after cross-validation?" and the answer given there by Dikran Marsupial.
That question asks whether, after k-fold cross-validation, there is a way to "squeeze more information out of the data without getting any noisy chunky bits".
My thoughts:
If I were to fix one of the alphas and repeat the cross-validation-informed learning, then I would essentially be working with a different model. My search would operate in a lower-dimensional slice (a subset) of the original search space, and the error reduction would be projected onto that smaller space. Using the same cost-functional minimization and the same original data would then yield different parameters.
If I were to bootstrap-resample the fit, I could get a distribution of estimates for each of the alphas. Whether I measured spread by variance, range, or IQR, one parameter would likely show the least variation.
If I were to "freeze" it at its central tendency, then repeat the fitting process I would be attempting to funnel the information into reducing the variation of the remaining parameters.
Questions:
- Does a parameter-specific restriction like this help "squeeze more information out of the data without getting any noisy chunky bits"?
- In what cases would an approach like this fail or be less effective?
- What are the textbook criticisms or alternatives to this approach?
Clarification:
Let's say that I have three parameters, $\alpha_1, \alpha_2, \alpha_3$. They form the coordinate axes of a 3-d space, a volume. Every point in that space is a set of parameter values, i.e. a single model. If I had enough data, then at every point in the space I could compute a vector pointing toward the location of lowest error (the minimum of the cost functional). The vector would have a direction and a magnitude, and every training method is trying to follow it to that minimum.
When I hold one parameter constant, I lock myself onto a plane in that space, a plane that may or may not contain the global minimum. The region I search is then only 2-d, not 3-d: I am searching within a plane, not within the volume.
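In symbols (writing the cost functional as $J$, notation introduced here for this post): the vector pointing toward the lowest error is the negative gradient, and fixing $\alpha_1$ at some value $\hat{\alpha}_1$ means the constrained search can do no better than the unconstrained one,

$$
v(\alpha_1,\alpha_2,\alpha_3) = -\nabla J(\alpha_1,\alpha_2,\alpha_3),
\qquad
\min_{\alpha_2,\,\alpha_3} J(\hat{\alpha}_1,\alpha_2,\alpha_3) \;\ge\; \min_{\alpha_1,\alpha_2,\alpha_3} J(\alpha_1,\alpha_2,\alpha_3),
$$

with equality only when the plane $\alpha_1 = \hat{\alpha}_1$ happens to pass through the overall minimizer.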
At the minimum itself the vector pointing toward the minimum is the zero vector. I am never going to land exactly on that point: finite sample size and non-zero noise imply a zone of "good enough" around it, and adjacent to the point the vectors have very small magnitudes. The delta rule, a truncated Taylor series, relates the derivative (the vector's size) to the variance: when the variance is small, the gradient is small. The idea, then, is to run iterated searches in smaller and smaller pieces of the space, trying to minimize variance so as to approach the global minimum. I have the advantage of having started with cross-validation, so I can have some non-zero confidence that my initial state is near the global minimum rather than a local "pocket".
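Written out, the truncated Taylor series I am referring to is the quadratic approximation around the minimizer $\boldsymbol{\alpha}^*$ (with $\boldsymbol{\alpha} = (\alpha_1,\alpha_2,\alpha_3)$ and $H$ the Hessian of $J$ there):

$$
J(\boldsymbol{\alpha}) \approx J(\boldsymbol{\alpha}^*) + \tfrac{1}{2}\,(\boldsymbol{\alpha}-\boldsymbol{\alpha}^*)^{\top} H\,(\boldsymbol{\alpha}-\boldsymbol{\alpha}^*),
\qquad
\nabla J(\boldsymbol{\alpha}) \approx H\,(\boldsymbol{\alpha}-\boldsymbol{\alpha}^*),
$$

so the gradient shrinks as the estimate enters the "good enough" zone around $\boldsymbol{\alpha}^*$, and the curvature $H$ is, loosely, what links the size of that zone to the variance of the estimates.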
The hypothesis is that, all else being equal, fixing the least-variable parameter at its central value constrains the search to a region closer to the global minimum. Compared to classical cross-validation, would this sequential locking give better parameter estimates?