In a paper giving an overview of machine learning (Domingos, 2012), the author writes:
> Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of $100$ and a huge training set of a trillion examples, the latter covers only a fraction of about $10^{-18}$ of the input space. This is what makes machine learning both necessary and hard.
I don't understand what he means by the "input space" in this context. I know he's referring to a vector space, and I think he might mean a vector space that somehow represents all of the parameters. But I don't understand how this relates to the training set examples, or where he gets the fraction $10^{-18}$ from $100$ features and a trillion training examples. Is there some way of calculating this, or is it some kind of estimate?
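My best guess at the arithmetic, assuming (and this is purely my assumption) that each of the $100$ features is binary, is that the input space would then contain $2^{100}$ distinct possible examples, so a trillion ($10^{12}$) training examples would cover only

$$
\frac{10^{12}}{2^{100}} \approx \frac{10^{12}}{1.27 \times 10^{30}} \approx 8 \times 10^{-19} \approx 10^{-18}
$$

of it. But I can't tell whether that is actually what the author intends, or whether $10^{-18}$ is meant as a rougher estimate for real-valued features.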
I have the impression that the fraction of the input space covered by the training set relates to the idea that one cannot uniquely solve a system of equations with more unknowns (columns) than equations (rows), and I can vaguely see how in machine learning there could be problems if one didn't have many more rows (examples) than columns (features). But I'm not sure, and I wish I understood why this matters; I don't even know where to look for this information.
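To make the analogy I have in mind concrete (this is just my own illustration, not something from the paper): a system like

$$
\begin{aligned}
x + y + z &= 1,\\
x - y + 2z &= 0
\end{aligned}
$$

has more unknowns (three "columns") than independent equations (two "rows"), so it has infinitely many solutions rather than a unique one, and I imagine having too few training examples relative to the number of features causes a similar kind of underdetermination. Is that the right way to think about it?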
References
Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87.