
I am using a multilinear regression model to calibrate a set of 2 sensors; my current regression model is based on a set of 2048 samples per sensor. From those samples, I fit my multilinear model.

I am wondering whether, for this kind of regression, there is a theoretical expression that gives the maximum number of samples required to achieve a given error value, or vice versa. For instance, in linear prediction there is a stalling condition: beyond a certain number of samples, the prediction accuracy does not increase any more.


1 Answer


Unless you go down to quantum randomness, there is usually no theoretical lower limit on your error. After all, theoretically, you can always collect more information (= predictors), plus more samples as you increase the number of predictors, and what you can't collect, you may be able to constrain. (Q: How can I optimally forecast the amount of milk sold? - A: Ration it far below demand, then you'll be sure to always sell the entire amount you rationed, and you can forecast with 100% accuracy.)

The problems are usually more of a practical nature. It may be prohibitively expensive to collect data fast enough. Or you may not even know what kind of data you need to collect, or where to get it (I discuss "unknown unknowns" here).

Here is what I would do in your situation: pick 100 data points per sensor, fit your model, calculate the error. Do the same for a different sample of 100 data points. Repeat a few times. Then you'll have an idea of the distribution of the error from a model based on 100 data points. Then do the same for 200, 500, 1000 data points. Finally, look at how the error goes down as you increase your $N$.
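To make this concrete, here is a minimal sketch of that resampling procedure in Python/NumPy. The synthetic `X` and `y` and the ordinary-least-squares fit are placeholders; swap in your actual 2048 calibration samples and your multilinear model.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    """Ordinary least squares with intercept, via lstsq."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coefs

def rmse(coefs, X, y):
    """Root-mean-squared error of a linear fit with intercept."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.sqrt(np.mean((y - X1 @ coefs) ** 2))

# Placeholder data: (2048, 3) predictors and noisy responses.
X = rng.normal(size=(2048, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=2048)

for n in (100, 200, 500, 1000):
    errors = []
    for _ in range(20):                  # repeat to see the spread of the error
        idx = rng.choice(len(X), size=n, replace=False)
        coefs = fit_ols(X[idx], y[idx])
        errors.append(rmse(coefs, X[idx], y[idx]))
    print(n, np.mean(errors), np.std(errors))
```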

Alternatively: start with a sample of 100 data points, fit your model, then add another 100 data points, update your model and so forth (and do this repeatedly, to get an idea of the variability). This simulates actually collecting more data, and you may get an idea of where the error trajectory flattens out.

And: it is better to look at the error on a holdout sample. The in-sample error estimate will be optimistic, because the model partly fits the noise in the very data it was trained on; that is simply overfitting.
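A similar sketch of the incremental version, with the error measured on a fixed holdout set; again, the data-generating lines and the OLS fit are placeholders for your own samples and model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder data: swap in your real predictors / sensor readings.
X = rng.normal(size=(2048, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=2048)

# Hold out a fixed test set that is never used for fitting.
perm = rng.permutation(len(X))
test, train = perm[:500], perm[500:]

def fit_ols(X, y):
    X1 = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coefs

def rmse(coefs, X, y):
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.sqrt(np.mean((y - X1 @ coefs) ** 2))

# Grow the training set in steps of 100 and track both errors.
for n in range(100, len(train) + 1, 100):
    subset = train[:n]
    coefs = fit_ols(X[subset], y[subset])
    print(n,
          rmse(coefs, X[subset], y[subset]),  # in-sample error (optimistic)
          rmse(coefs, X[test], y[test]))      # holdout error: watch where it flattens
```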

Stephan Kolassa