
The formula I usually see for MSE is:

$$\mathrm{MSE} = \frac{\sum\limits_{t=1}^T e_t^2}{T-k-1},$$

Whereas for MSPE it is usually:

$$\mathrm{MSPE} = \frac{\sum\limits_{t=T+1}^{T+P} e_t^2}{P}.$$

So here is my question: is the formula that I showed for MSPE correct when we have $k$ variables in our regression model? And why should we not correct for $k$ in this case while we do correct for it in the MSE?
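To make the notation concrete, here is a minimal numpy sketch of how I compute the two quantities (the function names are just illustrative, and $k$ counts the regressors excluding the constant):

```python
import numpy as np

def mse_in_sample(y, X, beta_hat, k):
    """Sum of squared in-sample residuals over T - k - 1 (k regressors plus a constant)."""
    e = y - X @ beta_hat
    return np.sum(e ** 2) / (len(y) - k - 1)

def mspe_out_of_sample(y_new, X_new, beta_hat):
    """Sum of squared out-of-sample prediction errors over P, with no adjustment for k."""
    e = y_new - X_new @ beta_hat
    return np.sum(e ** 2) / len(y_new)
```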

Thanks in advance.

rbm
  • Where did you find the MSE formula with $N-k-1$ rather than $N$ in the denominator? For example, [Wikipedia](http://en.wikipedia.org/wiki/Mean_squared_error#Definition_and_basic_properties) or Diebold's [forecasting textbook](http://www.ssc.upenn.edu/~fdiebold/Teaching221/Forecasting.pdf) p.79 has only $N$ with no adjustments. – Richard Hardy Mar 24 '15 at 15:14
  • @RichardHardy See for instance here: http://www.stat.yale.edu/Courses/1997-98/101/linmult.htm – rbm Mar 24 '15 at 15:39
  • Still, as far as I know the version with no adjustment is more popular. Your source has a correct formula for the estimated error variance but I doubt its equality to MSE is correct. – Richard Hardy Mar 24 '15 at 15:53
  • The links do not work for me, but I believe you. It is interesting to find out that there is no unique definition of such a simple measure. – Richard Hardy Mar 24 '15 at 16:04
  • This answer may be of interest to you, http://stats.stackexchange.com/questions/115011/in-simple-linear-regression-where-does-the-formula-for-the-variance-of-the-resi/115040#115040 – Alecos Papadopoulos Mar 24 '15 at 20:40
  • @alecospapadopoulos Thank you. I understand where the result comes from for the MSE, but my question is more: why is there not a similar criterion for the MSPE? – rbm Mar 24 '15 at 21:14
  • The mean square error is on the same data to which you fitted the model; the "-k" adjusts for the model d.f. because the model will be closer to the data than the population values are ... but when predicting data which was not used in any way, the d.f. is irrelevant. – Glen_b Mar 25 '15 at 01:53
  • @glen_b That makes sense, but what if I look at a model like this: $y_t = \beta x_{t-1} + \epsilon_t$? That is, a predictive regression. Then I am using my $x$-variables to obtain my out-of-sample forecasts. – rbm Mar 25 '15 at 06:36
  • I'm not sure what you're asking there. Do you mean a model with no constant term? – Glen_b Mar 25 '15 at 08:26
  • @Glen_b No, I mean a model with a lagged $x$-variable in it. That will always use x to predict. Therefore, what I mean is, wouldn't it make sense to correct for the number of such $x$ variables in that case? – rbm Mar 25 '15 at 08:30
  • No, it still works the same. – Glen_b Mar 25 '15 at 09:42
  • @Glen_b I don't think you understood my question. I know what the correct formulas are. My question is wouldn't it *make sense* to have an out of sample criterion that also corrects for $k$, just like the MSE does in sample? If it doesn't, then why not? – rbm Mar 25 '15 at 17:59
  • I believe I understood the question just fine. It's you that's making the assertion that there's something inherently different about the posed situation ... why does it 'make sense' that it's different? Answering that will probably reveal a hidden assumption -- one that may not be true. – Glen_b Mar 25 '15 at 22:56

1 Answer


It appears we have a regression set-up. "Adjusting for degrees of freedom", i.e. using $T-K-1$ instead of $T-1$ (I guess in $K$ the constant term is not counted), is performed in order to make the estimator unbiased.

I will use the subscript $s$ to denote values related to the sample we used to obtain the estimates, and $p$ to denote the prediction area. In matrix notation the model we estimate, having $K$ regressors plus a constant term, and having an i.i.d. sample of $T$ available observations, is

$$\mathbf y_s = \mathbf X_{s}\beta + \mathbf u_s,\;\; E[\mathbf u_s \mid X_s]= 0,\;\; E[\mathbf u_s\mathbf u_s' \mid X_s]= \sigma^2I_T$$

where we have assumed strict exogeneity of regressors with respect to the error term. If we do not assume strict exogeneity, the property of unbiasedness is already lost, so there would be no point in discussing "correction for degrees of freedom" (except to the degree that simulated evidence has been published which shows that, nevertheless, such a correction makes the estimator perform better in some other sense).

By standard least-squares algebra we have the residual maker or "annihilator" matrix $\mathbf M_x \equiv I_T - \mathbf X_{s} \left(\mathbf X_{s}'\mathbf X_{s}\right)^{-1}\mathbf X_{s}' = I_T -\mathbf P_x$, which is symmetric and idempotent. We have (consult Hayashi, ch. 1)

$$\mathbf {\hat u} = \mathbf M_x \mathbf y = \mathbf M_x \mathbf u \implies \mathbf {\hat u}'\mathbf {\hat u} = \mathbf {u}'\mathbf M_x \mathbf {u}$$

Then, we can write

$$E\left(\sum_{t=1}^T \hat u_t^2\right)= E\Big[E\left(\sum_{t=1}^T \hat u_t^2 \mid \mathbf X_s\right)\Big]=E[E(\mathbf {\hat u}'\mathbf {\hat u}\mid \mathbf X_s)] = E\left[E(\mathbf {u}'\mathbf M_x \mathbf {u}\mid X_s)\right]$$

$$=\sigma^2\cdot {\rm tr}(\mathbf M_x) = \sigma^2\cdot {\rm tr}(I_T -\mathbf P_x) = \sigma^2(T-K-1)$$

where ${\rm tr}(\cdot)$ is the trace operator. The last equality uses the cyclic property of the trace: ${\rm tr}(\mathbf P_x) = {\rm tr}\left(\mathbf X_{s} \left(\mathbf X_{s}'\mathbf X_{s}\right)^{-1}\mathbf X_{s}'\right) = {\rm tr}\left(\left(\mathbf X_{s}'\mathbf X_{s}\right)^{-1}\mathbf X_{s}'\mathbf X_{s}\right) = {\rm tr}(I_{K+1}) = K+1$, since $\mathbf X_s$ has $K+1$ columns ($K$ regressors plus the constant). So the estimator

$$\mathrm{MSE} = \frac{\sum\limits_{t=1}^T \hat u_t^2}{T-K-1}$$

is unbiased as regards the estimation of the unknown $\sigma^2$.
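As a quick sanity check, here is a minimal Monte Carlo sketch of my own (the values of $T$, $K$, $\sigma^2$ and the number of replications are illustrative choices, not part of the derivation) showing that dividing the sum of squared residuals by $T-K-1$ centers on $\sigma^2$, while dividing by $T$ is biased downward:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, sigma2, n_reps = 50, 3, 2.0, 20_000

X = np.column_stack([np.ones(T), rng.normal(size=(T, K))])  # constant + K regressors
beta = rng.normal(size=K + 1)

mse_adj, mse_raw = [], []
for _ in range(n_reps):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=T)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta_hat) ** 2)
    mse_adj.append(ssr / (T - K - 1))  # degrees-of-freedom adjusted
    mse_raw.append(ssr / T)            # unadjusted

print(np.mean(mse_adj))  # ~ 2.0, i.e. unbiased for sigma^2
print(np.mean(mse_raw))  # ~ 2.0 * (T - K - 1) / T, i.e. biased downward
```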

PREDICTION
Let's now form predictions for $P$ periods ahead, i.e. for $T+1$ to $T+P$. For this interval too, we assume that the underlying model is the same. In matrix notation the predicted values will be

$$\mathbf {\hat y_p} = \mathbf X_{p}\hat \beta$$

To avoid misunderstandings we note that:
a) $\hat \beta$ is not recursively re-estimated. We estimate it once using the initial sample, and we use this initial vector of estimates for all subsequent predictions.
b) Given the set-up, each prediction is independent of the previous ones. So the usual result that "prediction error increases as we go further into the future" does not apply in this regression setting (it would apply to autoregressive schemes where the next prediction depends on the value of the previous one).
c) Under the i.i.d. set up, the prediction errors have identical moments (mean and variance).

The mean of the prediction error is

$$E(e_j) = E(y_j) - E(\mathbf x_j'\hat \beta) = E(\mathbf x_j')\cdot\beta +E(u_j) - E(\mathbf x_j')\cdot E(\hat \beta) = 0, \\\;j=T+1,..., T+P$$

since a) $\hat \beta$ is independent of $\mathbf x_j'$, b) $\hat \beta$ is unbiased for $\beta$, and c) the error term has expected value zero, per the model assumptions. Because the mean is zero, the expected value of each squared prediction error equals the common prediction-error variance. Then

$$E(\mathrm{MSPE}) = \frac{\sum_{j={T+1}}^{T+P} E[e_j^2]}{P} = \frac{P}{P}\,{\rm Var}(e) ={\rm Var}(e)$$

and we see that the MSPE is an unbiased estimator of the common prediction-error variance precisely when it divides by $P$, and not by $P-K-1$.
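The same kind of sketch illustrates this (again my own illustration, with arbitrary settings and the design matrices held fixed across replications, so that the conditional prediction-error variance is $\sigma^2\big(1+\mathbf x_j'(\mathbf X_s'\mathbf X_s)^{-1}\mathbf x_j\big)$): dividing by $P$, with no $k$-adjustment, matches that theoretical variance averaged over the $P$ prediction points.

```python
import numpy as np

rng = np.random.default_rng(1)
T, P, K, sigma2, n_reps = 50, 20, 3, 2.0, 20_000

Xs = np.column_stack([np.ones(T), rng.normal(size=(T, K))])  # estimation sample
Xp = np.column_stack([np.ones(P), rng.normal(size=(P, K))])  # prediction sample
beta = rng.normal(size=K + 1)

XtX_inv = np.linalg.inv(Xs.T @ Xs)
# Var(e_j) = sigma^2 * (1 + x_j' (Xs'Xs)^{-1} x_j), averaged over the P points
theory = sigma2 * (1 + np.einsum('ij,jk,ik->i', Xp, XtX_inv, Xp)).mean()

mspe = []
for _ in range(n_reps):
    ys = Xs @ beta + rng.normal(scale=np.sqrt(sigma2), size=T)
    yp = Xp @ beta + rng.normal(scale=np.sqrt(sigma2), size=P)
    beta_hat, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    e = yp - Xp @ beta_hat
    mspe.append(np.sum(e ** 2) / P)  # divide by P, no k-adjustment

print(np.mean(mspe), theory)  # the two numbers should be close
```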

PS-1: If the postulated model links $y_t$ with $x_{t-1}$, then the $T$-th observation on $X$ does not enter the estimation of $\beta$.
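To make PS-1 concrete, here is a small sketch of the lagged-regressor case raised in the comments (my own illustration; the sample size, coefficient and seed are arbitrary): the design matrix is simply $x$ shifted by one period, and the forecast for $T+1$ uses the observed $x_T$, which never entered the estimation.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 100
x = rng.normal(size=T + 1)              # x_0, ..., x_T
y = 0.5 * x[:-1] + rng.normal(size=T)   # y_t = 0.5 * x_{t-1} + eps_t, t = 1, ..., T

X = x[:-1].reshape(-1, 1)               # lagged regressor column: x_0, ..., x_{T-1}
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_forecast_T_plus_1 = beta_hat[0] * x[-1]  # uses x_T, which never entered estimation
```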

PS-2: If the "strict exogeneity" assumption regarding the regressors and the error term is in any way violated, then unbiasedness is gone and no tweaking with the denominator of the sample moment can correct that.

Alecos Papadopoulos
  • Wow, this is really a great answer! I will take some more time to read it later in the next couple of days, but the 'prediction' part of your explanation seems to be completely what I was after! Thank you very much for your help, I appreciate it greatly! – rbm Mar 25 '15 at 19:48