
I'm trying to wrap my head around why you want a $C_p$ score to be close to the number of regressors, rather than just minimizing it and getting it close to zero.

I am using Cp as a scoring function in a stepwise regression algorithm. Currently, my function seeks to minimize Cp in order to evaluate which term to add or take away from the model. However, everything I have heard says a good Cp score is around the value p.

Looking at the equation for Mallows' $C_p$:

$C_p = \frac{SSE_p}{S^2}-N+2P$

It looks like it already adds $2P$ to the equation, so a good scoring function would be equal to

$C_p = \frac{SSE_p}{S^2}-N+2P-P$, where the extra $-P$ looks extraneous.

Could somebody give an intuitive explanation as to why a good score has $C_p = P$?

Nick Cox
Joel Sinofsky
  • Explanation of edit (pedantic): Mallow's is a common typo in statistical science. But the person concerned is Colin L. Mallows, so the apostrophe belongs elsewhere. – Nick Cox Aug 10 '17 at 18:00
  • I've thought that Cp should be minimized. Do you perhaps have a reference for wanting Cp = P? – user795305 Aug 10 '17 at 18:08

1 Answer


If your model with $p$ parameters is correct, then $SSE_p\approx(n-p)\sigma^2$. If the other model, the one with $q$ parameters that is used to estimate $S^2=\frac{SSE_q}{n-q}$, is correct as well, then $SSE_q\approx (n-q)\sigma^2$, so $S^2\approx\sigma^2$.

Therefore: $C_p=\frac{SSE_p}{S^2}-n+2p=\frac{SSE_p}{\frac{SSE_q}{n-q}}-n+2p \approx \frac{(n-p)\sigma^2}{\sigma^2}-n+2p = p$
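
A quick simulation may make this more concrete. The sketch below is only an illustration under assumptions of my own choosing: the sample size, the coefficients, the Gaussian noise, and the helper names `sse` and `mallows_cp` are all hypothetical and not part of the question. It fits a correct model with $p=3$ parameters and an underfit model with $p=2$ parameters, estimates $S^2$ from a full model with $q=6$ parameters, and averages $C_p$ over many simulated data sets.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def mallows_cp(X_sub, X_full, y):
    """C_p = SSE_p / S^2 - n + 2p, with S^2 = SSE_q / (n - q) from the full model."""
    n, p = X_sub.shape
    q = X_full.shape[1]
    s2 = sse(X_full, y) / (n - q)
    return sse(X_sub, y) / s2 - n + 2 * p

rng = np.random.default_rng(0)   # fixed seed so the run is reproducible
n, reps = 200, 2000              # assumed sample size and number of replications

cp_correct, cp_underfit = [], []
for _ in range(reps):
    Z = rng.normal(size=(n, 5))
    X_full = np.column_stack([np.ones(n), Z])      # intercept + 5 predictors, q = 6
    # Assumed true model: intercept plus the first two predictors, noise variance 1.
    y = 1.0 + 2.0 * Z[:, 0] - 1.0 * Z[:, 1] + rng.normal(size=n)
    cp_correct.append(mallows_cp(X_full[:, :3], X_full, y))    # correct model, p = 3
    cp_underfit.append(mallows_cp(X_full[:, :2], X_full, y))   # omits a true predictor, p = 2

mean_cp_correct = float(np.mean(cp_correct))
mean_cp_underfit = float(np.mean(cp_underfit))
print("mean C_p, correct model (p = 3): ", mean_cp_correct)
print("mean C_p, underfit model (p = 2):", mean_cp_underfit)
```

Under these assumptions, the average $C_p$ of the correct model should come out near $p=3$, while the underfit model should land far above its own $p=2$, because its $SSE_p$ is inflated by the omitted predictor.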

Sebastian