
It is common to say that machine learning techniques are purely data-driven methods, and that they are effective only if we have *a large amount of data*. I focus here on supervised/predictive learning. If "large" means a large number of eligible predictors, I agree. However, some people say that a large number of observations is also needed. I am dubious about that, because one key result behind the opportunity of predictive learning is the bias-variance tradeoff (these threads can help: Minimizing bias in explanatory modeling, why? (Galit Shmueli's "To Explain or to Predict"); Bias/variance tradeoff tutorial; Question about bias-variance tradeoff; Endogeneity in forecasting). It is possible to show that if the number of observations goes to infinity, the tradeoff disappears and only the bias remains relevant. Now, in order to address the bias, theoretical knowledge about the phenomenon under investigation is much more important than the computational aspects. Therefore it seems to me that the more observations we have, the more important theory becomes. If what I said is correct, the opposite of the italicized claim is true: few data is a good situation for predictive learning, even though the fewer observations we have, the simpler the (predictive) model should be.
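To make the vanishing-tradeoff argument concrete, here is a minimal simulation sketch. It is my own illustration, not taken from the linked threads: the quadratic data-generating process, the deliberately misspecified linear fit, and all numbers are assumptions. As $N$ grows, the variance of the prediction shrinks towards zero while the squared bias from misspecification stays put.

```python
# Sketch: bias-variance behaviour of a misspecified model as N grows.
# The assumed data-generating process is quadratic; we fit a straight line
# by OLS, so the model is biased by construction. Settings are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return 1.0 + 2.0 * x + 3.0 * x**2   # assumed "true" mean function

x0 = 0.5                                # point at which predictions are evaluated

for n in [20, 200, 2000, 20000]:
    preds = []
    for _ in range(500):                # repeated sampling to estimate bias/variance
        x = rng.uniform(-1, 1, n)
        y = true_f(x) + rng.normal(0, 1, n)
        X = np.column_stack([np.ones(n), x])        # misspecified: no x**2 term
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        preds.append(beta[0] + beta[1] * x0)
    preds = np.array(preds)
    bias2 = (preds.mean() - true_f(x0))**2
    var = preds.var()
    print(f"n={n:6d}  bias^2={bias2:.4f}  variance={var:.5f}")

# As n grows, variance -> 0 while bias^2 stays roughly constant: the remaining
# error can only be removed by a better model, i.e. by theory.
```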

Am I making mistakes? Or maybe the truth is somewhere in the middle?

  • When people say *a large amount of data*, they mean a large number of observations. A large amount of variables would be expressed as *high-dimensional data*. For machine learning, you need a large amount of data, not high-dimensional data. If I find the time, I will post a more complete answer; for now this is just a comment. – Richard Hardy Oct 09 '20 at 13:28
  • I’m not an expert in machine learning and I may well make big mistakes. However, sticking to the nomenclature you suggest, it sounds strange to me that “for machine learning, you need a large amount of data, not high-dimensional data”. Now, if you mean that $p>N$ is not needed in ML … that’s ok. But if you mean that small values of $p$ are usual in ML … that seems wrong to me. At least in linear regression examples, the tool with which I’m most confident, my main argument is that in ML something like an “automated predictor selection rule” is the core. – markowitz Oct 09 '20 at 19:13
  • On the other side, in the classical linear regression framework, tools such as stepwise selection are considered very dangerous. In fact, Prof. Studenmund calls it the “right tool for bad econometrics”. He suggests theory-driven selection instead. That suggestion works much better if $p$ is small … the independent variables come from theory in the first round … endogeneity is the core. For linear regression, the distinction between classical statistics and ML is addressed here (https://stats.stackexchange.com/questions/268755/when-should-linear-regression-be-called-machine-learning) – markowitz Oct 09 '20 at 19:14
  • The reply by david25272 there is the one I appreciate most, and in it many predictors are presented as typical for ML. Moreover, if a large $N$ is all we need for ML … it seems that it boils down to the usual asymptotic theory. No? Surely I will be able to understand more when/if you find time for a more exhaustive answer. – markowitz Oct 09 '20 at 19:14
  • Hi Richard, your suggestions would also be appreciated here: https://stats.stackexchange.com/questions/497271/arma-models-and-predictive-learning – markowitz Nov 22 '20 at 13:50
  • @markowitz - think about fitting a line. If you have one point, how do you fit it? You need at least as many samples as you have parameters. Number of equations >= number of unknowns. The real world has noise, and some of it is terribly clever. This means you need more or many more samples than parameters. There are many papers on things like "the unreasonable effectiveness of big data". – EngrStudent Jan 12 '21 at 23:56
  • @markowitz - Imagine your line is in 3D space instead of 2D space. If your information were perfect then 2 samples would still work, but given the nature of noise, your actual fit is going to be farther off in higher dimensions for the same number of samples. Each point is bumped by a bit on each axis, so the distance between the truth and the estimate is larger. If the nature of the noise is different on each axis it gets harder. You need to account for that. If you are in a 20k-dimensional space and have 2 samples with additive normal noise, what is the mean L2 distance to the truth? (A quick numerical sketch of this is added after the comments.) – EngrStudent Jan 12 '21 at 23:57
  • @EngrStudent - I know that for the OLS estimate the number of observations should be at least equal to the number of parameters, and that in general a minimum number of observations is needed. I also know that the more complex the model is, the more observations I need. However, this does not resolve the, at least apparent, contradiction above. I worry about what happens when the number of observations approaches infinity. This is usually regarded as a good thing in ML. Yet most ML techniques and tools are justified on the basis of the bias-variance trade-off, which in that very case tends to disappear. – markowitz Jan 13 '21 at 10:50
  • See figures 1 and 2 in http://arxiv.org/abs/1903.07571v2. The paper is about the bias-variance trade-off with imperfect and prescient models, varying the parameter count. If I were looking at this, I would find the economic value of adding one more sample and the economic cost, in terms of compute and memory, of processing one more sample, and look for when it becomes a wash, i.e. when the net benefit of additional samples goes to zero (a toy version of this rule is sketched after the comments). – EngrStudent Jan 13 '21 at 11:58
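On EngrStudent's question about the mean L2 distance in a 20k-dimensional space: with independent $N(0,\sigma^2)$ noise on each of the $d$ axes, the expected squared distance to the truth is $d\sigma^2$, so the typical distance grows roughly like $\sigma\sqrt{d}$. A minimal numerical sketch, where $\sigma = 1$ and the chosen dimensions are arbitrary assumptions of mine:

```python
# Sketch: typical L2 distance between a noisy point and the truth in d
# dimensions, with independent N(0, sigma^2) noise on each axis.
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0

for d in [2, 3, 20, 20_000]:
    noise = rng.normal(0.0, sigma, size=(200, d))   # 200 noisy replicates
    dist = np.linalg.norm(noise, axis=1)            # L2 distance to the truth
    print(f"d={d:6d}  mean distance={dist.mean():8.2f}  sigma*sqrt(d)={sigma*np.sqrt(d):8.2f}")

# The same per-axis noise translates into a far larger distance in high
# dimensions, roughly sigma * sqrt(d): about 141 for d = 20,000.
```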
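And a toy version of the cost/benefit stopping rule suggested in the last comment (the error model, the value of a unit of loss, and the per-sample cost are all made-up numbers, purely for illustration): keep adding observations while the marginal reduction in expected loss pays for the marginal cost, and stop when it becomes a wash. At that point the remaining expected loss is dominated by the bias term, i.e. by theory.

```python
# Sketch: cost/benefit stopping rule for collecting more observations.
# Assumed toy model: expected_loss(n) = bias^2 + sigma^2 / n, a fixed monetary
# value per unit of loss reduction, and a constant cost per extra sample.
sigma2 = 4.0                  # variance scale of the estimator (assumed)
bias2 = 0.1                   # squared bias of the misspecified model (assumed)
value_per_unit_loss = 100.0   # money gained per unit reduction in expected loss
cost_per_sample = 0.0005      # money spent collecting/processing one more sample

def expected_loss(n):
    return bias2 + sigma2 / n

n = 10
while True:
    marginal_benefit = value_per_unit_loss * (expected_loss(n) - expected_loss(n + 1))
    if marginal_benefit <= cost_per_sample:
        break                 # one more sample no longer pays for itself: stop
    n += 1

print(f"stop collecting data at roughly n = {n}")
```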

0 Answers