
Suppose we run a DIY market and, for all the wrenches we sell, we collect a number of more or less important attributes (size, weight, price, hardness of the material, etc., all of them quantitative). We also know how many items we sold last year, so we can compute the revenue. I'd like to model the revenue as a linear function of the attributes.

Choosing the right model is obviously an important question. I've thought of generating all possible linear models (in R: revenue ~ size, revenue ~ weight, revenue ~ size + weight, revenue ~ size + weight + price, ...) and comparing them. The method of comparison would be leave-one-out cross-validation, comparing the average squared prediction error of each model.
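To make the idea concrete, here is a minimal sketch of what I have in mind, using base R only. The data frame `wrenches` and its columns are made-up illustrative data, and the PRESS shortcut (for OLS, the leave-one-out residual equals the ordinary residual divided by 1 − hᵢᵢ) is one way to get the LOOCV error without refitting n times:

```r
# Illustrative data; column names match the attributes mentioned above.
set.seed(1)
wrenches <- data.frame(size = runif(30), weight = runif(30), price = runif(30))
wrenches$revenue <- 2 * wrenches$size + wrenches$price + rnorm(30, sd = 0.1)

predictors <- c("size", "weight", "price")

# All non-empty subsets of the predictors -> one formula per candidate model.
subsets <- unlist(lapply(seq_along(predictors),
                         function(k) combn(predictors, k, simplify = FALSE)),
                  recursive = FALSE)
formulas <- lapply(subsets, function(s) reformulate(s, response = "revenue"))

# LOOCV mean squared error via the PRESS shortcut:
# for OLS, the leave-one-out residual is e_i / (1 - h_ii).
loocv_mse <- function(f, data) {
  fit <- lm(f, data = data)
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

scores <- sapply(formulas, loocv_mse, data = wrenches)
best <- formulas[[which.min(scores)]]
```

With p predictors this enumerates 2^p − 1 models, which is what makes me worry about scaling.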

  1. Is this approach to model comparison a good idea? If not, why not? I'm aware of the possible memory issues if our data has too many attributes (i.e. columns in the data frame), since the number of candidate models grows exponentially. In a Stack Overflow answer on how to generate all possible models in R I found the following warning:

    Make very sure that is what you want, in general this kind of model comparison is strongly advised against. Forget about any kind of inference as well when you do this.

  2. Is the squared error from leave-one-out cross-validation a good measure for picking the best model, especially when comparing models with different numbers of independent variables? Are there other or better ways to compare the models?

Roland
    Indeed, yes, full subset selection (aka best subset selection, all subsets regression, etc) along with approximations like stepwise regression lead to a long list of problems (biased estimators, the inference procedures not having the anticipated properties, that kind of thing). Many questions here discuss the issue. I will try to find some. For starters see [here](http://stats.stackexchange.com/questions/74956/does-full-subset-selection-regression-model-building-suffer-from-the-same-handic) and [here](http://stats.stackexchange.com/a/11276/805). – Glen_b Feb 02 '14 at 10:28
    Using information from all possible models is at the heart of Bayesian Model Averaging. I wouldn't necessarily say that it's a bad thing. Essentially, what you're doing with BMA is taking into account model uncertainty and attaching probabilities to each possible model rather than choosing one single model. A good tutorial can be found here: [Bayesian Model Averaging: A Tutorial By Hoeting, Madigan, Raftery and Volinsky. _Statistical Science Vol. 14, No. 4 (Nov., 1999), pp. 382-401_](http://www.jstor.org/discover/10.2307/2676803?uid=3738744&uid=2134&uid=2&uid=70&uid=4&sid=21103374765067) – Graeme Walsh Feb 02 '14 at 16:54
    Note that "all linear models" would include all possible high order interactions as well as all possible transformations of the input variables... an impossible task. – Michael M Feb 28 '14 at 08:25

0 Answers