
Today I overheard somebody say something along the lines of: "there is a proof that shows that in optimization, more options are better or at least neutral." I searched but could not find such a proof. Does it sound familiar to anyone? Sorry for the vague question, but I figured others might be interested in seeing it. The idea makes intuitive sense, but it would be cool to see it formalized.

Edit: by options I mean the parameter space.

invictus
  • With options being what, exactly? Sounds more like cross-validation for hyperparameters. – Sam Aug 31 '17 at 22:55
  • There are formal proofs of this, but the basic idea is really simple. Consider a system $f(\theta, \pi)$ which you are optimizing over. If you don't optimize over $\pi$, it hasn't disappeared; it's just been fixed at some value $\hat{\pi}$ that you can't affect (zero is not an uncommon one). If the optimal solution for the original system is $(\theta^*, \pi^*)$, it must be that $f(\theta^*, \pi^*) \leq f(\theta^{**}, \hat{\pi})$ for all $\theta^{**}$, otherwise the first one wouldn't be optimal. And sometimes that inequality will be strict, because $\hat{\pi}$ is not optimal. – jbowman Aug 31 '17 at 22:57
  • @jbowman, why not develop that into an answer? I find the question vague & unclear, but if you understand it, go for it. In addition, you might see if you could help the OP formulate the question in a sufficiently clear manner. – gung - Reinstate Monica Sep 01 '17 at 00:08
  • @gung Congrats on your election. [Omitted variable bias](https://en.wikipedia.org/wiki/Omitted-variable_bias) and the like do not really need development into a formal answer here, do they? – Carl Sep 01 '17 at 01:24
  • Thanks, @Carl. Is this question about omitted variable bias? That's a long way from what I thought this was about, but I find it unclear, so maybe I missed something obvious. – gung - Reinstate Monica Sep 01 '17 at 01:38
  • @Carl - I think this is really about optimization, not estimation, but there's a lot of overlap with omitted variable bias in a way. This result underlies a general approach relating to relaxation of constraints, where it shows that the relaxed optimal solution is at least as good as the real optimal solution, and there are ways of bounding the "optimality gap" between the two which help determine whether an iterative algorithm has converged sufficiently. It appears in other contexts too. – jbowman Sep 01 '17 at 03:39
  • @jbowman Gee, now it's my turn to say I must have missed something. In general, regression has an optimization target which is some estimator of error of some parameter, not necessarily OLS of anything or MLE in any sense. How does that relate to what you are saying? – Carl Sep 01 '17 at 03:54
  • @Carl Now I'm getting confused! When I read the question, I saw it from an operations research perspective, not a stats perspective, for whatever reason. I can see, though, that had I looked at it from a stats perspective I would have likely been thinking "omitted variable bias". I'll post an answer later today and hopefully figure out how to clarify all this, probably with some help from you in comments. – jbowman Sep 01 '17 at 14:02
  • The OP clearly stated that the context is optimization, and that's the context in which people say what the OP overheard. There's no question about the context, in my opinion, so there's no need to make things more confusing by dragging the question into other contexts. – Aksakal Sep 05 '17 at 12:08

3 Answers


It's a general comment that's made in any kind of optimization. Here's the intuition.

Let us say we are looking for the minimum (or maximum) of a function of many parameters (variables). Every combination of parameter values that satisfies the constraints is a candidate.

If you add one more constraint, your parameter space can only shrink or stay the same; it can't become larger. So now you are searching for the optimum over a smaller (or identical) set of parameters, which means the new solution cannot be better than the previous one.

Silly example: I have the function $f(x) = x^2 - x$ and want to find its minimum.

[Plot of the parabola $f(x) = x^2 - x$.]

First, with no constraints, the minimum is at $x_0 = 1/2$.

Second, let's add the constraint $x > 0$: the minimum doesn't change, it's still $x_1 = 1/2$.

Third, let's use a different constraint, $x \le 0$: the minimum moves to $x_2 = 0$, and it's not as good as in the unconstrained case, since $f(x_0) < f(x_2)$. The solution is less optimal.
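Here is a quick numerical check of this example (my addition, not part of the original argument), using SciPy's `minimize_scalar`; the bounds of $\pm 10$ simply stand in for "no bound on that side":

```python
# Minimize f(x) = x^2 - x without constraints, with x >= 0, and with x <= 0,
# and compare the optimal objective values.
from scipy.optimize import minimize_scalar

def f(x):
    return x**2 - x

unconstrained = minimize_scalar(f)                              # no constraint
nonneg = minimize_scalar(f, bounds=(0, 10), method="bounded")   # x >= 0
nonpos = minimize_scalar(f, bounds=(-10, 0), method="bounded")  # x <= 0

print("no constraint: x* =", round(unconstrained.x, 3), " f =", round(unconstrained.fun, 3))
print("x >= 0:        x* =", round(nonneg.x, 3), " f =", round(nonneg.fun, 3))
print("x <= 0:        x* =", round(nonpos.x, 3), " f =", round(nonpos.fun, 3))
# Neither constrained optimum beats the unconstrained one.
```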

You could also consider the global vs. local optimum. The latter cannot be better than the former.

Aksakal

In the pure optimization sense, this is true, and the intuition for why is simple. Suppose that $x^*$ is a solution to the optimization problem $\max f(x) \;\text{s.t.}\; g(x) = b$. If you relax the problem, say by dropping the condition $g(x) = b$, you have more candidate options, including your original one. In other words, you can always pick $x^*$ again in the relaxed problem. Hence, if you pick a different solution $x' \ne x^*$, it must be that $f(x') \ge f(x^*)$, otherwise you would not have picked $x'$. In this sense, having more options never hurts.
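A minimal numerical sketch of this relaxation argument (my addition; the particular $f$, $g$, and starting point are made up for illustration):

```python
# Maximize f(x) = -(x1 - 1)^2 - (x2 - 2)^2, once subject to g(x) = x1 + x2 = 1
# and once with that constraint dropped, then compare the optimal values.
from scipy.optimize import minimize

def neg_f(x):  # SciPy minimizes, so we work with -f
    return (x[0] - 1)**2 + (x[1] - 2)**2

eq_constraint = {"type": "eq", "fun": lambda x: x[0] + x[1] - 1}

constrained = minimize(neg_f, x0=[0.0, 0.0], constraints=[eq_constraint])
relaxed = minimize(neg_f, x0=[0.0, 0.0])  # same problem, constraint dropped

print("constrained max of f:", -constrained.fun)  # about -2, at x = (0, 1)
print("relaxed max of f:    ", -relaxed.fun)      # about  0, at x = (1, 2)
# The relaxed optimum is at least as good as the constrained one.
```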

However, in a statistical sense this is not true, because in statistics our goal is usually not to optimize the in-sample fit. That is, under the appropriate metric (out-of-sample prediction, causal inference, etc.), having more options can hurt. I'm bringing this up because this is a statistics Q&A site, not a math Q&A site; if your question is purely about optimization, it may be better suited to a math site.

So, for statistical problems, having more options in your optimization guarantees a better (or equal) fit in sample, but it could hurt your fit out of sample. That is, allowing your search space to include every possible complex model might lead to overfitting. This is the very reason we regularize: hence constrained regression methods such as ridge regression and the lasso, and that's also why people regularize their estimates with prior information in Bayesian inference.
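A small sketch of that point (my addition; the data-generating process, sample sizes, and the ridge penalty `alpha=10` are arbitrary choices for illustration), using scikit-learn:

```python
# Compare OLS (more options) with ridge regression (a constrained/regularized
# fit) on noisy data where most features are irrelevant.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 60, 40                                  # few observations, many features
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(scale=2.0, size=n)    # only the first feature matters

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("OLS  ", LinearRegression()), ("ridge", Ridge(alpha=10.0))]:
    model.fit(X_tr, y_tr)
    mse_in = mean_squared_error(y_tr, model.predict(X_tr))
    mse_out = mean_squared_error(y_te, model.predict(X_te))
    print(name, "in-sample MSE:", round(mse_in, 2), " out-of-sample MSE:", round(mse_out, 2))
# OLS always wins in sample; the regularized model typically wins out of sample here.
```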

To sum up: the claim that "more is better" is true when you are optimizing in-sample fit. It's not necessarily true when you are optimizing out-of-sample fit or when you have other inferential goals.

Carlos Cinelli

Let's consider a problem where you want to find $\arg \min_{\theta, \pi} f(\theta, \pi)$. Define the set of values that minimize $f$ as $S = \{(\theta^*, \pi^*)\} = \arg \min_{\theta, \pi} f(\theta, \pi)$.

This is the "more options" problem. The "fewer options" problem is one where we don't get to choose $\pi$; it's fixed, often at a value such as 0 or 1, which can cause various sub-expressions to disappear from $f$ altogether. This problem can be written as finding $\arg \min_{\theta(\pi)} f(\theta(\pi), \pi)$, where we write $\theta(\pi)$ to make clear the dependency of the optimal $\theta$ on the fixed value of $\pi$.

Define $\Pi^* = \{\pi^*\}$, the set of $\pi$ for which there exists a $\theta$ such that $f(\theta, \pi)$ achieves its minimum, and similarly define $\Theta^*$. Now, if the fixed $\pi \in \Pi^*$, it is clear that there exist $\theta(\pi) \in \Theta^*$ such that $f(\theta(\pi), \pi) = \min_{\theta, \pi} f(\theta, \pi)$; you can find them by extracting from $S$ all the elements whose second coordinate equals $\pi$. However, there is no $\theta(\pi)$ for which $f(\theta(\pi), \pi) < \min_{\theta, \pi} f(\theta, \pi)$, which can be shown with a short proof by contradiction. So, in this case, giving ourselves more options by allowing ourselves to choose $\pi$ is neutral; the optimal value is the same for the more-options problem as for the fewer-options problem.

On the other hand, if the fixed $\pi \notin \Pi^*$, then there does not exist any $\theta(\pi)$ such that $f(\theta(\pi), \pi) = \min_{\theta, \pi} f(\theta, \pi)$, because the minimum is only achieved for values of $\pi \in \Pi^*$. This implies that $\min_{\theta(\pi)} f(\theta(\pi), \pi) > \min_{\theta, \pi} f(\theta, \pi)$ (you can get the "$>$" with a short proof by contradiction). In this case, giving ourselves more options by allowing ourselves to optimize over $(\theta, \pi)$ instead of just over $\theta$ has allowed us to improve the value of the objective function by the difference $\min_{\theta(\pi)} f(\theta(\pi), \pi) - \min_{\theta, \pi} f(\theta, \pi)$.
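To make the two cases concrete, here is a numerical illustration (my addition, with a made-up objective): take $f(\theta, \pi) = (\theta - \pi)^2 + (\pi - 3)^2$, whose global minimum is $0$ at $\theta = \pi = 3$, so $\Pi^* = \{3\}$.

```python
# Compare the "fewer options" minimum (pi held fixed) with the
# "more options" minimum over (theta, pi).
from scipy.optimize import minimize_scalar

def f(theta, pi):
    return (theta - pi)**2 + (pi - 3)**2   # global minimum 0 at theta = pi = 3

def fewer_options(pi_fixed):
    """Optimize over theta only, with pi held fixed."""
    return minimize_scalar(lambda th: f(th, pi_fixed)).fun

full_minimum = 0.0                         # min over (theta, pi), attained at (3, 3)

print(fewer_options(3.0))  # pi in Pi*:     ~0.0 -> more options are neutral
print(fewer_options(0.0))  # pi not in Pi*: ~9.0 -> more options strictly help
# In both cases, fewer_options(pi) >= full_minimum.
```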

Therefore, having more parameters to optimize over never hurts: it's either neutral or helpful. Whether adding those parameters makes the problem so complex that it becomes practically impossible to solve is quite a different question!

jbowman