I don't believe that the linked question (Why does increasing the sample size lower the variance?) appropriately addresses my question!
The linked question explains why the average of $n$ iid random variables has a smaller variance than any single one of them. In that setting, the effect of adding a "training example" is simply to add one more random variable to the summation representing the average: $$ Y = \frac{\sum_{i = 1}^{n} X_{i}}{n}$$
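For reference, the calculation behind that claim: if the $X_i$ are iid with $\operatorname{Var}(X_i) = \sigma^2$, then
$$ \operatorname{Var}(Y) = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \frac{\sigma^2}{n}, $$
which shrinks as $n$ grows.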
Adding a training example in an ML context has a quite different effect (see: What is meant by the variance of *functions* in *Introduction to Statistical Learning*?):
$$ \mathcal{A} : \mathcal{T} \rightarrow \{f \mid f: X \rightarrow \mathbb{R} \} $$
A learning algorithm $\mathcal{A}$ is a mapping from any training set $t \in \mathcal{T}$ (a subset of the set of all training examples) to some predictor $f_t$ trained by running $\mathcal{A}$ on $t$. Each such subset has a certain probability of being selected. Thus, to add another training example to the set of training examples is to, perhaps, fundamentally change the probability of some subsets being selected. This, in turn, alters $E[A_{x_{0}}]$, the expectation of the random variable $A_{x_{0}} : \mathcal{T} \rightarrow \mathbb{R},\; t \mapsto f_{t}(x_{0})$, where $f_{t}$ is the predictor given by a particular choice of subset $t \in \mathcal{T}$. As a result, this also changes the variance of $A_{x_{0}}$.
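To be explicit about what I mean by that expectation and variance, writing $p(t)$ for the probability that the subset $t$ is selected:
$$ E[A_{x_0}] = \sum_{t \in \mathcal{T}} p(t)\, f_t(x_0), \qquad \operatorname{Var}[A_{x_0}] = \sum_{t \in \mathcal{T}} p(t)\,\bigl(f_t(x_0) - E[A_{x_0}]\bigr)^2. $$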
Andrew Ng's ML course makes the claim that increasing the number of training examples in some training set (some subset of the set of all training examples) decreases the variance of the learning algorithm. Can this be shown rigorously using the above definition of variance, or at least be given some intuitive grounding? I can't find any explanation in his course of why this should be true, only a learning curve showing the phenomenon.
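In case it is useful, here is how I am picturing the claim empirically (a toy sketch of my own; the sinusoidal data-generating process, the polynomial model, and the point $x_0$ are arbitrary assumptions, not anything from the course): repeatedly draw a training set of size $n$, fit a predictor, record its prediction at $x_0$, and look at the spread of those predictions as $n$ grows.

```python
# Toy Monte Carlo estimate of Var[A_{x0}] (my own setup, not from the course):
# repeatedly draw a training set of size n, fit a model, and record its
# prediction at a fixed point x0. The empirical variance of those predictions
# is the variance of the learning algorithm at x0.
import numpy as np

rng = np.random.default_rng(0)

def draw_training_set(n):
    """Assumed data-generating process: y = sin(2*pi*x) + Gaussian noise."""
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

def fit_and_predict(x, y, x0, degree=3):
    """The 'learning algorithm': fit a degree-3 polynomial, predict at x0."""
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x0)

x0 = 0.5
for n in [10, 30, 100, 300]:
    # Each repetition is one draw of the random variable A_{x0}.
    preds = [fit_and_predict(*draw_training_set(n), x0) for _ in range(2000)]
    print(f"n = {n:4d}   estimated Var[A_x0] = {np.var(preds):.5f}")
```

If the claim holds here, the printed variances should shrink as $n$ grows, mirroring the learning curve from the course; but a toy simulation is not the rigorous (or even intuitive) argument I am asking for.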