Spuriously high R-squared is one of the pitfalls of regression through the origin (i.e. zero-intercept models). If the observed predictor values do not include zero, is forcing the fit through the origin an extrapolation? What are the uses and other pitfalls of regression through the origin? Are there any peer-reviewed articles?
-
There is a related question http://stats.stackexchange.com/questions/12888/what-kind-of-results-are-there-about-prior-knowledge. – cbeleites unhappy with SX Aug 02 '11 at 12:35
-
Spuriously *high* $R^2$? Are you sure? – cardinal Aug 02 '11 at 17:42
-
One pitfall is that such a regression often makes no substantive sense. – Peter Flom Aug 02 '11 at 11:06
-
Please say why. – rolando2 Aug 02 '11 at 23:18
-
A very good discussion of this subject is available at http://www.duke.edu/~rnau/regnotes.htm#constant – IrishStat Aug 02 '11 at 13:13
-
When we discuss the merits of regression through the origin I think it is worth specifying whether we are hunting for the best regression model (model building) or fitting a known (accepted) model to the data to estimate the parameters of the model. – Thomas Mar 18 '16 at 19:32
3 Answers
To me the main issue boils down to imposing a strong constraint on an unknown process.
Consider a specification $y_t=f(x_t)+\varepsilon_t$. If you don't know the exact form of the function $f(\cdot)$, you can try a linear approximation: $$f(x_t)\approx a+b x_t$$
Notice how this linear approximation is just the first-order Maclaurin (Taylor) series of $f(\cdot)$ around $x_t=0$, $$f(x_t)\approx f(0)+f'(0)\,x_t,$$ which identifies the coefficients as $$a=f(0), \qquad b=f'(0).$$
Hence, when you regress through the origin you are, from the Maclaurin-series point of view, asserting that $f(0)=0$. This is a very strong constraint on the model.
There are situations where imposing such a constraint makes sense, and they are driven by theory or outside knowledge. I would argue that unless you have a reason to believe that $f(0)=0$, it is not a good idea to regress through the origin: as with any unwarranted constraint, it leads to suboptimal parameter estimates.
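As a quick illustration, here is a minimal R sketch with simulated data (the numbers are made up) showing that forcing the fit through the origin when the true $f(0)\neq 0$ biases the slope and inflates $R^2$ — in R, at least, the no-intercept $R^2$ is computed against $\sum y^2$ rather than the variance of $y$:

```r
set.seed(123)
x <- runif(100, min = 5, max = 10)   # predictor values far from zero
y <- 10 + 0.5 * x + rnorm(100)       # true intercept f(0) = 10, not 0

fit_int  <- lm(y ~ x)                # intercept estimated from the data
fit_orig <- lm(y ~ x - 1)            # forced through the origin

coef(fit_int)                 # slope close to the true 0.5
coef(fit_orig)                # slope inflated to absorb the omitted intercept
summary(fit_int)$r.squared    # ordinary R^2
summary(fit_orig)$r.squared   # much higher, but computed against sum(y^2),
                              # so it is not comparable to the ordinary R^2
```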
EXAMPLE: the CAPM in finance. Here we state that the excess return $r-r_f$ on a stock is determined by its beta on the excess market return $r_m-r_f$: $$r-r_f=\beta (r_m-r_f)$$
The theory tells us that the regression should go through the origin. Now, some practitioners believe they can earn an additional return, alpha, on top of the CAPM relationship: $$r-r_f=\alpha+\beta (r_m-r_f)$$
Both regressions are used in academic research and in practice, for different reasons. This example shows when imposing a strong constraint such as regression through the origin can make sense.
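For concreteness, a small R sketch with simulated daily excess returns (purely made-up numbers, generated under the strict CAPM with $\alpha=0$) fitting both versions:

```r
set.seed(42)
n <- 250
mkt_excess   <- rnorm(n, mean = 0.0002, sd = 0.01)       # simulated market excess returns
stock_excess <- 1.2 * mkt_excess + rnorm(n, sd = 0.008)  # true beta = 1.2, true alpha = 0

# Strict CAPM: regression through the origin, beta only
coef(lm(stock_excess ~ mkt_excess - 1))

# Allowing an alpha: the estimated intercept should be close to zero here
coef(lm(stock_excess ~ mkt_excess))
```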

If the right-hand-side variables and the response have not been centered, then (by definition) the estimated coefficients are biased.
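To illustrate with a minimal R sketch (simulated data): centring both the response and the regressor removes the intercept from the model, so the through-origin fit on centred data reproduces the slope of the ordinary fit, whereas forcing a zero intercept on the raw data distorts the slope when the true intercept is not zero:

```r
set.seed(1)
x <- rnorm(50, mean = 5)
y <- 3 + 2 * x + rnorm(50)   # true intercept 3, true slope 2

coef(lm(y ~ x))["x"]         # slope from the usual fit with an intercept
coef(lm(y ~ x - 1))          # through the origin on raw data: slope is off

# Centring both sides, then fitting through the origin,
# recovers the same slope as the usual fit
xc <- x - mean(x)
yc <- y - mean(y)
coef(lm(yc ~ xc - 1))
```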

-
I don't quite understand your answer, maybe due to terseness. For example, consider the case where the *true* intercept is zero! :) – cardinal Aug 02 '11 at 15:31
-
@cardinal: ...which corresponds to the situation where the rhs variable and the response are centred! – user603 Dec 19 '15 at 11:19
The least-squares solution to the set of equations
0 = c1*x1_1 + c2*x1_2 + ... + cn*x1_n
0 = c1*x2_1 + c2*x2_2 + ... + cn*x2_n
0 = c1*x3_1 + c2*x3_2 + ... + cn*x3_n
...
0 = c1*xn_1 + c2*xn_2 + ... + cn*xn_n
is always c1 = 0, c2 = 0, ..., cn = 0, with zero error. So standard tools used for regression through the origin, e.g. the Perl module Statistics::Regression, will report a standard deviation of 0 and crash when dividing by the standard deviation.
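For what it is worth, here is a small R sketch of the same degenerate case (an all-zero response regressed through the origin): the fitted coefficients and residuals come out exactly zero, so anything divided by the residual standard deviation is undefined; whether that crashes depends on the implementation, as the comments below discuss.

```r
set.seed(7)
X <- matrix(rnorm(20), nrow = 5, ncol = 4)   # 5 observations, 4 regressors
y <- rep(0, 5)                               # response identically zero

fit <- lm(y ~ X - 1)   # regression through the origin
coef(fit)              # all coefficients are exactly 0
sum(resid(fit)^2)      # residual sum of squares is 0, hence the residual
                       # standard deviation is 0 as well
```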

-
The downvoter should explain why he/she thinks this is wrong. It is true; try and see it yourself. – Phil Goetz Mar 18 '16 at 16:49
-
@Cliff AB, you want to come up with the constants c1, c2, etc., which you can use to compute different response values for different inputs. But coming up with those constants requires computing the intercept, which, in many implementations, will cause a divide by zero. – Phil Goetz Mar 18 '16 at 17:02
-
Because this is such a special circumstance, is so specific to a particular software platform (for instance, `R` seems to have no trouble with this case), and because OLS potentially suffers from the same problem (just use one less regressor), I cannot understand why it is relevant to the question. – whuber Mar 18 '16 at 17:10