Background:

The background is the question "Training with the full dataset after cross-validation?", both the question itself and the answer given by Dikran Marsupial.

It asks whether, after k-fold cross-validation, there is a way to squeeze more information out of the data without getting any "noisy chunky bits".

My thoughts:

If I were to fix one of the alphas and repeat the cross-validation-informed learning, then I would essentially be working with a different model. The search would operate in a subset (a "membrane") of the original search space, and the error reduction would be projected onto that smaller space. Using the same cost-functional minimization and the same original data would then result in different parameters.

If I were to bootstrap-resample the fit, I could get a distribution of estimates for each of the alphas, and there could be a least-variable parameter. I could measure the spread by variance, range, or IQR, but one parameter may show less variation than the others.
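To make that concrete, here is a minimal sketch of the bootstrap step in Python. The fitting routine `fit_by_cv` is a hypothetical placeholder for whatever cross-validation-informed fit is in use; only the resampling and the spread measurements are the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_by_cv(X, y):
    """Hypothetical placeholder: run the k-fold-CV-informed fit on (X, y)
    and return the chosen parameters as an array [alpha1, alpha2, alpha3]."""
    raise NotImplementedError

def bootstrap_alphas(X, y, n_boot=200):
    """Refit on bootstrap resamples to get a distribution for each alpha."""
    n = len(y)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        estimates.append(fit_by_cv(X[idx], y[idx]))
    return np.asarray(estimates)              # shape (n_boot, 3)

def spread_per_alpha(estimates):
    """Variance, range, and IQR of each alpha across the resamples."""
    variance = estimates.var(axis=0)
    value_range = estimates.max(axis=0) - estimates.min(axis=0)
    q75, q25 = np.percentile(estimates, [75, 25], axis=0)
    return variance, value_range, q75 - q25
```

The parameter whose column shows the smallest spread would be the candidate for freezing.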

If I were to "freeze" it at its central tendency, then repeat the fitting process I would be attempting to funnel the information into reducing the variation of the remaining parameters.

Questions:

  • Does a parameter-specific restriction like this help "squeeze more information out of the data without getting any noisy chunky bits"?
  • What are the cases where an approach like this would fail, or be less effective?
  • What are the textbook criticisms or alternatives to this approach?

Clarification:

Let's say that I have 3 parameters, $\alpha_1, \alpha_2, \alpha_3$. They are the coordinate axes of a 3D space, so together they span a volume. Every point in that space is a set of parameter values that defines a single model. If I had enough data, then at every point in the space I could compute a vector pointing toward the location of lowest error (i.e. the minimized cost functional). The vector would have direction and magnitude. All training methods are trying to go there.
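In symbols (my own notation, just to pin the picture down): writing the cost functional as $E(\alpha_1, \alpha_2, \alpha_3)$, the vector at each point is the negative gradient,

$$\mathbf{v}(\boldsymbol{\alpha}) = -\nabla E(\boldsymbol{\alpha}) = -\left(\frac{\partial E}{\partial \alpha_1},\, \frac{\partial E}{\partial \alpha_2},\, \frac{\partial E}{\partial \alpha_3}\right),$$

and every training method is trying to reach $\boldsymbol{\alpha}^{*} = \arg\min_{\boldsymbol{\alpha}} E(\boldsymbol{\alpha})$.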

When I hold one parameter constant, I lock myself onto a plane in the space that may or may not contain the minimum over the whole volume. The region I search in this case is only 2D, not 3D: I am searching within a plane, not within the volume.
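In the same notation, fixing $\alpha_1$ at some value $\hat{\alpha}_1$ means solving

$$\min_{\alpha_2,\,\alpha_3} E(\hat{\alpha}_1, \alpha_2, \alpha_3) \;\ge\; \min_{\alpha_1,\,\alpha_2,\,\alpha_3} E(\alpha_1, \alpha_2, \alpha_3),$$

with equality only when the plane $\alpha_1 = \hat{\alpha}_1$ actually passes through the volume minimum.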

At the minimum point the vector pointing to the minimum is the zero vector. I am never going to land exactly on that point: finite sample size and non-zero noise imply a zone of "good enough" around it. Adjacent to the point the vectors have very small magnitudes. The delta rule, a truncated Taylor series, relates the derivative (i.e. the vector's size) to the variance: when the variance is small, the gradient is small. The idea, then, is iterated searches in smaller and smaller pieces of the space, trying to minimize variance in order to approach the global minimum. I have the advantage of having started with cross-validation, so I can have some nonzero confidence that my initial state is near the global minimum and not a local "pocket".
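One way to make the "small gradient near the minimum" remark precise (my framing, not a textbook quote): expand the cost functional around the minimizer $\boldsymbol{\alpha}^{*}$,

$$E(\boldsymbol{\alpha}) \approx E(\boldsymbol{\alpha}^{*}) + \tfrac{1}{2}\,(\boldsymbol{\alpha} - \boldsymbol{\alpha}^{*})^{\top} H\, (\boldsymbol{\alpha} - \boldsymbol{\alpha}^{*}), \qquad \nabla E(\boldsymbol{\alpha}) \approx H\,(\boldsymbol{\alpha} - \boldsymbol{\alpha}^{*}),$$

where $H$ is the Hessian at the minimum (the first-order term vanishes there), so the gradient shrinks linearly with the distance from $\boldsymbol{\alpha}^{*}$, which is what produces the "good enough" zone of small vectors around it.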

The hypothesis is that, all else being equal, fixing the parameter with the smallest variance at its central value constrains our search to be closer to the global minimum. Compared to classical cross-validation, would we get better parameter estimation by performing this sequential locking?

EngrStudent
  • Hyperparameter surfaces are typically highly irregular with lots of local optima, so any gradient-based approach only works when you are already in the vicinity of the global optimum. – Marc Claesen Sep 14 '14 at 15:08
  • See updated question. I had two comments and was working on my third. I felt that indicated it should go in the question itself. The initial conditions are the cross-validation result. That should give some confidence that the global optimum is nearby. – EngrStudent Sep 14 '14 at 15:17
  • In some sense you only have gradient-based approaches and brute force. Particle swarm (PSO) is a re-contrived cost functional. EM is a gradient descent. GA is a local, semi-discrete gradient descent. Either your previous model results give you a hint where to go next, or you have to go through many points. You can reformulate your error, but it's still GD. Many approaches live on the continuum between pure GD and pure brute force. Cross-validation itself could be argued to live there. Bootstrap too. – EngrStudent Sep 14 '14 at 15:22
  • Those are some sweeping generalizations, I wouldn't call every direct search method gradient-based. Specific to this question: you may want to check out sequential model-based optimization approaches (like [SMAC](http://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf)), which typically use Gaussian processes to model the response surface including the variance. Seems related to what you are considering. – Marc Claesen Sep 14 '14 at 15:28
  • @MarcClaesen - I agree that those are sweeping generalizations. I am not satisfied with them, or I would not have posted the question. They entirely ignore the difference between model and truth, and assume that the model is appropriate and that a global minimum exists. I will check out SMAC. – EngrStudent Sep 14 '14 at 18:21

0 Answers