
I would like to create a model which predicts the amount of energy used in an area, dependent on the number of properties in 5 categories (detached, semi, flats, bungalow and terrace). I have daily data, giving the total daily energy consumption, with the number of properties in each of the 5 categories.

My question: would it be better to build a multiple linear regression model using total energy consumption as the dependent variable and the property-type counts as the explanatory variables (one coefficient for each of the 5 groups)? Or would it be better to create 5 simple regression models, each using energy as the dependent variable and the count of a single property type as the explanatory variable?

To be clear, I have some daily data which only contains readings from areas with one property type (daily energy readings for an area with only detached properties, for example), and some data which is from areas with a combination of properties (for example, daily energy readings for an area with semi-detached and flats).

What would be the difference between the methods, and are there any benefits/caveats to doing either way?
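
To make the two options concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the per-property usage figures in `true_kwh`, the noise level, the independent uniform counts, and the choice to fit through the origin (zero properties, zero energy). It is not the real data, only a way to see how the two sets of coefficients can differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 area-days with property counts in the 5 categories
# (detached, semi, flats, bungalow, terrace). All numbers are made up.
n_obs = 500
counts = rng.integers(0, 50, size=(n_obs, 5)).astype(float)
true_kwh = np.array([12.0, 10.0, 6.0, 9.0, 8.0])  # assumed daily usage per property
energy = counts @ true_kwh + rng.normal(0.0, 20.0, size=n_obs)

# Option 1: one multiple regression (no intercept), all 5 counts together.
mlr_coef, *_ = np.linalg.lstsq(counts, energy, rcond=None)

# Option 2: five simple regressions (no intercept), one count at a time.
# Each slope is sum(x * y) / sum(x * x) for that column alone.
simple_coef = (counts * energy[:, None]).sum(axis=0) / (counts ** 2).sum(axis=0)

print("assumed truth     :", true_kwh)
print("multiple regression:", mlr_coef.round(2))
print("simple regressions :", simple_coef.round(2))
```

In this setup each simple regression attributes the energy of all the other property types to the single type it sees, so its slope comes out well above the assumed per-property usage, while the multiple regression stays close to it. Real data with correlated counts may behave less cleanly.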

Richard Hardy
sym246
  • What is your aim? Do you want good predictability, good interpretability, or both? This [CV thread](https://stats.stackexchange.com/questions/127370/multiple-regression-or-separate-simple-regressions) looks related. – Krrr Aug 29 '17 at 08:52
  • Both, I guess. One problem I am facing with MLR modelling is some negative coefficients, which don't make sense in the context of the problem (negative energy usage for flats, but they don't produce energy!). So I was thinking that a model for each category may solve this. – sym246 Aug 29 '17 at 09:19
  • Aha! Perhaps add that to your question so we get a better handle on it. Also, [this document](http://www.stat.columbia.edu/~gelman/stuff_for_blog/oh_no_I_got_the_wrong_sign.pdf) gives several reasons why a coefficient can come out with the wrong sign. – Krrr Aug 29 '17 at 09:23
  • Based on your comment, it's possible that your best bet is a separate model for each property type -- the logic being that they will behave differently. That said, as DataD'oh says, cross-validation, used to compare the tools on whatever metric you're interested in, is a good general-purpose solution to this style of question. – user5957401 Aug 29 '17 at 09:32
  • See https://stats.stackexchange.com/questions/17336/how-exactly-does-one-control-for-other-variables/ https://stats.stackexchange.com/questions/94807/what-are-the-major-differences-between-the-parameter-estimation-of-a-simple-line https://stats.stackexchange.com/questions/78828/is-there-a-difference-between-controlling-for-and-ignoring-other-variables-i https://stats.stackexchange.com/questions/17336/how-exactly-does-one-control-for-other-variables/113207#113207 https://stats.stackexchange.com/questions/297373/differences-between-a-sequence-of-simple-linear-regressions-vs-a-single-multiple – Glen_b Aug 29 '17 at 09:57
  • See also https://en.wikipedia.org/wiki/Simpson%27s_paradox (particularly the illustration, which succinctly demonstrates why leaving out predictors that are related to the response can be a problem, above and beyond the impact on standard errors). – Glen_b Aug 29 '17 at 10:00
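
The cross-validation comparison suggested in the comments could be sketched roughly as follows. This is only an illustration under loud assumptions: the data are synthetic (half the rows are single-type areas, half contain all five types), the per-property usages in `true_kwh` are invented, both models are fitted through the origin, and "separate models" is interpreted as estimating each type's slope from the single-type areas only and summing the contributions when predicting. The helper names `fit_joint`, `fit_separate` and `cv_rmse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
true_kwh = np.array([12.0, 10.0, 6.0, 9.0, 8.0])  # assumed daily usage per property

# Synthetic rows: 300 single-type areas (type cycles 0..4) and 300 mixed areas.
n_single, n_mixed = 300, 300
single = np.zeros((n_single, 5))
single[np.arange(n_single), np.arange(n_single) % 5] = rng.integers(1, 50, size=n_single)
mixed = rng.integers(1, 50, size=(n_mixed, 5)).astype(float)
counts = np.vstack([single, mixed])
energy = counts @ true_kwh + rng.normal(0.0, 20.0, size=len(counts))

def fit_joint(X, y):
    """Multiple regression through the origin on all rows."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def fit_separate(X, y):
    """One simple regression per type, using only rows containing that type alone."""
    coef = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        only_j = (X[:, j] > 0) & ((X > 0).sum(axis=1) == 1)
        x = X[only_j, j]
        coef[j] = (x @ y[only_j]) / (x @ x)
    return coef

def cv_rmse(fit, X, y, k=5, seed=0):
    """k-fold cross-validated RMSE of predictions X @ coef."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    sq_err = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        coef = fit(X[train], y[train])
        sq_err.append((y[test] - X[test] @ coef) ** 2)
    return float(np.sqrt(np.concatenate(sq_err).mean()))

rmse_joint = cv_rmse(fit_joint, counts, energy)
rmse_separate = cv_rmse(fit_separate, counts, energy)
print(f"joint MLR RMSE      : {rmse_joint:.2f}")
print(f"separate models RMSE: {rmse_separate:.2f}")
```

Because the synthetic energy here is exactly additive in the counts, both approaches should predict about equally well; on real data, a clear gap between the two errors would be the interesting result.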
