
Given the common problem of predicting a response variable $Y$ from predictor variables $X$ and $Z$, is there any way to determine the "theoretical best" prediction accuracy that is possible?

When I am asked to find a model for such a prediction, I might try different techniques: linear regression, KNN, etc. However, if $X$ and $Z$ are simply not predictive of $Y$ at all, then no matter how good a model I build, it is a waste of time. For example, if I am trying to predict a student's grade in a class, then using the temperature in Hawaii and the GDP of France will be a complete waste of time. How can I determine that without trying it (or knowing it a priori)?

In other words, how do I find out whether I should even be using $X$ and $Z$ to predict $Y$ in the first place? Is there some way to calculate an upper bound on how well any model I could possibly build would perform?

user310374

1 Answer


We will typically not know the true data-generating process, unless we have simulated the data ourselves.

Even in your example of predicting a student's grade, the temperature of Hawaii and the GDP of France may have an impact: if the weather was not nice during the student's holiday in Hawaii, he may have studied more and gotten a better grade. Or better weather may have made for a more relaxing holiday. A higher GDP in France may contribute to him doing an internship there, which could again take time away from his studies - or motivate him to do really well.

I am a firm believer in "tapering effect sizes": everything could conceivably have an impact on everything else - but the effects get weaker and weaker.

And even when we do know that a particular predictor $X$ has an impact on $Y$, sampling variability and the bias-variance tradeoff may mean that including $X$ in the model is counterproductive: a misspecified model can yield better predictions than a correctly specified one. This is part of the reason why shrinkage works.
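
To make this concrete, here is a minimal simulation sketch (my own illustration, not part of the original answer; the coefficients, sample sizes, and ridge penalty are arbitrary choices). It generates data in which $Z$ genuinely affects $Y$, but so weakly that, with a small training set, a model omitting $Z$ typically predicts better than one including it, with ridge shrinkage landing in between:

```python
# Sketch: a weak-but-real predictor can hurt out-of-sample accuracy.
# All settings are illustrative assumptions, not from the answer above.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_train, n_test, n_reps = 20, 10_000, 500
mse = {"X only": [], "X and Z": [], "ridge, X and Z": []}

for _ in range(n_reps):
    # Assumed truth: Y = 1.0*X + 0.1*Z + noise -- Z matters, but weakly.
    XZ = rng.normal(size=(n_train + n_test, 2))
    y = 1.0 * XZ[:, 0] + 0.1 * XZ[:, 1] + rng.normal(scale=2.0, size=len(XZ))
    for name, model, cols in [
        ("X only", LinearRegression(), [0]),
        ("X and Z", LinearRegression(), [0, 1]),
        ("ridge, X and Z", Ridge(alpha=5.0), [0, 1]),
    ]:
        model.fit(XZ[:n_train, cols], y[:n_train])
        mse[name].append(
            mean_squared_error(y[n_train:], model.predict(XZ[n_train:, cols]))
        )

for name, vals in mse.items():
    print(f"{name:>15}: mean test MSE = {np.mean(vals):.3f}")
```

The point of the design: at $n = 20$, the variance added by estimating one extra coefficient outweighs the tiny bias from omitting $Z$'s 0.1 effect - the bias-variance tradeoff in miniature.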

Bottom line: we will usually not know which variables to include for best predictive performance, and even when we know that a variable has an impact, including it may be counterproductive. This is why modeling will stay at least partly an art.

Stephan Kolassa
  • (+1) And there's always another way of specifying the model - constructing the features - that might be more appropriate than what you've tried so far. What if the true data-generating process were $Y = \beta_1 + \beta_2 \sin(X) + \beta_3 \cos(X) + \beta_4 I(Z) + \varepsilon$, where $I(Z)=0$ when $Z$ is odd, & $I(Z)=1$ when $Z$ is even? (A small simulation of this case is sketched after the comments.) – Scortchi - Reinstate Monica Feb 04 '16 at 13:29
  • That is fascinating. So, the only way I can show that the GDP of France is an appropriate or inappropriate predictor for the student's grade, given training data, is to try it in various models? I suppose that makes sense - I thought there would be some way to quantify the relationship between a predictor and the response variable, but every such way would involve some kind of assumptions (i.e. a model). – user310374 Feb 04 '16 at 14:12
  • Whether a predictor $X$ is useful will *always* depend on the model - after all, your model might already include some $Z$ that is highly correlated with $X$, so that having *both* $X$ and $Z$ will be worse than having $Z$ alone, even if $X$ is the actual driver (see the second sketch below). And of course, if you look long enough, you will [always find something that looks useful](http://www.tylervigen.com/spurious-correlations), and even something that will improve predictions on the holdout sample - but that may be useless in "true" forecasting ("overfitting on the holdout sample"). – Stephan Kolassa Feb 04 '16 at 14:22
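
To illustrate the feature-construction comment above, here is a hedged sketch (my own; the coefficients and sample size are hypothetical): the same linear method fails on the raw inputs but succeeds once the features $\sin(x)$, $\cos(x)$, and the parity indicator are constructed:

```python
# Sketch: model quality can hinge entirely on feature construction.
# The data-generating process mirrors the comment; numbers are made up.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n = 2_000
x = rng.uniform(0, 10, size=n)
z = rng.integers(0, 100, size=n)
# Assumed truth: E[Y] = 1 + 2*sin(x) + 3*cos(x) + 4*1{z even}
y = 1 + 2 * np.sin(x) + 3 * np.cos(x) + 4 * (z % 2 == 0) + rng.normal(size=n)

half = n // 2
feature_sets = {
    "raw x and z": np.column_stack([x, z]),
    "sin, cos, parity": np.column_stack([np.sin(x), np.cos(x), z % 2 == 0]),
}
for name, feats in feature_sets.items():
    model = LinearRegression().fit(feats[:half], y[:half])
    err = mean_squared_error(y[half:], model.predict(feats[half:]))
    print(f"{name:>18}: holdout MSE = {err:.2f}")
```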
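
And a sketch of the collinearity point from the last comment (again with illustrative settings of my own): $Z$ is a near-copy of the true driver $X$, and with a small training sample, the model using $Z$ alone typically beats the one using both:

```python
# Sketch: with a collinear proxy Z, "Z alone" can out-predict "X and Z",
# even though X is the true driver. All settings are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
n_train, n_test, n_reps = 15, 5_000, 1_000
results = {"Z alone": [], "X and Z": []}

for _ in range(n_reps):
    x = rng.normal(size=n_train + n_test)
    z = x + rng.normal(scale=0.1, size=len(x))  # Z: a noisy near-copy of X
    y = x + rng.normal(scale=2.0, size=len(x))  # X is the actual driver
    for name, feats in [("Z alone", z[:, None]),
                        ("X and Z", np.column_stack([x, z]))]:
        model = LinearRegression().fit(feats[:n_train], y[:n_train])
        pred = model.predict(feats[n_train:])
        results[name].append(mean_squared_error(y[n_train:], pred))

for name, vals in results.items():
    print(f"{name:>8}: mean test MSE = {np.mean(vals):.3f}")
```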