Spuriously high R-squared is one of the pitfalls of regression through the origin (i.e. zero-intercept models). If the observed predictor values do not include zero, is forcing the fit through the origin an extrapolation? What are the uses and other pitfalls of regression through the origin? Are there any peer-reviewed articles?
-
There is a related question http://stats.stackexchange.com/questions/12888/what-kind-of-results-are-there-about-prior-knowledge. – cbeleites unhappy with SX Aug 02 '11 at 12:35
-
Spuriously *high* $R^2$? Are you sure? – cardinal Aug 02 '11 at 17:42
-
One pitfall is that such a regression often makes no substantive sense. – Peter Flom Aug 02 '11 at 11:06
-
Please say why. – rolando2 Aug 02 '11 at 23:18
-
A very good discussion of this subject is available at http://www.duke.edu/~rnau/regnotes.htm#constant – IrishStat Aug 02 '11 at 13:13
-
When we discuss the merits of regression through the origin I think it is worth specifying whether we are hunting for the best regression model (model building) or fitting a known (accepted) model to the data to estimate the parameters of the model. – Thomas Mar 18 '16 at 19:32
3 Answers
To me the main issue boils down to imposing a strong constraint on an unknown process.
Consider a specification $y_t=f(x_t)+\varepsilon_t$. If you don't know the exact form of the function $f(\cdot)$, you can try a linear approximation: $$f(x_t)\approx a+b x_t$$
Notice how this linear approximation is just the first-order Maclaurin (Taylor) series of $f(\cdot)$ around $x_t=0$, $$f(x_t)\approx f(0)+f'(0)\,x_t,$$ which identifies the coefficients as $$a=f(0), \qquad b=f'(0).$$
Hence, when you regress through the origin you are, from the Maclaurin-series point of view, asserting that $f(0)=0$. This is a very strong constraint on the model.
There are situations where imposing such a constraint makes sense, and they are driven by theory or outside knowledge. I would argue that unless you have a reason to believe that $f(0)=0$, it is not a good idea to regress through the origin: as with any unwarranted constraint, it leads to suboptimal parameter estimates.
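As a quick illustration, here is a minimal R sketch with simulated data (the numbers are made up) showing that forcing the fit through the origin when the true $f(0)\neq 0$ biases the slope and inflates $R^2$ — in R, at least, the no-intercept $R^2$ is computed against $\sum y^2$ rather than the variance of $y$:

```r
set.seed(123)
x <- runif(100, min = 5, max = 10)   # predictor values far from zero
y <- 10 + 0.5 * x + rnorm(100)       # true intercept f(0) = 10, not 0

fit_int  <- lm(y ~ x)                # intercept estimated from the data
fit_orig <- lm(y ~ x - 1)            # forced through the origin

coef(fit_int)                 # slope close to the true 0.5
coef(fit_orig)                # slope inflated to absorb the omitted intercept
summary(fit_int)$r.squared    # ordinary R^2
summary(fit_orig)$r.squared   # much higher, but computed against sum(y^2),
                              # so it is not comparable to the ordinary R^2
```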
EXAMPLE: the CAPM in finance. Here we state that the excess return $r-r_f$ on a stock is determined by its beta on the excess market return $r_m-r_f$: $$r-r_f=\beta (r_m-r_f)$$
The theory tells us that the regression should go through the origin. Now, some practitioners believe they can earn an additional return, alpha, on top of the CAPM relationship: $$r-r_f=\alpha+\beta (r_m-r_f)$$
Both regressions are used in academic research and in practice, for different reasons. This example shows when imposing a strong constraint such as regression through the origin can make sense.
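For concreteness, a small R sketch with simulated daily excess returns (purely made-up numbers, generated under the strict CAPM with $\alpha=0$) fitting both versions:

```r
set.seed(42)
n <- 250
mkt_excess   <- rnorm(n, mean = 0.0002, sd = 0.01)       # simulated market excess returns
stock_excess <- 1.2 * mkt_excess + rnorm(n, sd = 0.008)  # true beta = 1.2, true alpha = 0

# Strict CAPM: regression through the origin, beta only
coef(lm(stock_excess ~ mkt_excess - 1))

# Allowing an alpha: the estimated intercept should be close to zero here
coef(lm(stock_excess ~ mkt_excess))
```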

If the right-hand-side variables and the response have not been centered, then (by definition) the estimated coefficients are biased.
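To illustrate with a minimal R sketch (simulated data): centring both the response and the regressor removes the intercept from the model, so the through-origin fit on centred data reproduces the slope of the ordinary fit, whereas forcing a zero intercept on the raw data distorts the slope when the true intercept is not zero:

```r
set.seed(1)
x <- rnorm(50, mean = 5)
y <- 3 + 2 * x + rnorm(50)   # true intercept 3, true slope 2

coef(lm(y ~ x))["x"]         # slope from the usual fit with an intercept
coef(lm(y ~ x - 1))          # through the origin on raw data: slope is off

# Centring both sides, then fitting through the origin,
# recovers the same slope as the usual fit
xc <- x - mean(x)
yc <- y - mean(y)
coef(lm(yc ~ xc - 1))
```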

-
I don't quite understand your answer, maybe due to terseness. For example, consider the case where the *true* intercept is zero! :) – cardinal Aug 02 '11 at 15:31
-
@cardinal: ...which corresponds to the situation where the rhs variable and the response are centred! – user603 Dec 19 '15 at 11:19
The least-squares solution to the set of equations
0 = c1*x1_1 + c2*x1_2 + ... + cn*x1_n
0 = c1*x2_1 + c2*x2_2 + ... + cn*x2_n
0 = c1*x3_1 + c2*x3_2 + ... + cn*x3_n
...
0 = c1*xn_1 + c2*xn_2 + ... + cn*xn_n
is always c1 = 0, c2 = 0, ..., cn = 0, with zero error. So standard tools used for regression through the origin, e.g. the Perl module Statistics::Regression, will report a standard deviation of 0 and crash when dividing by the standard deviation.
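For what it is worth, here is a small R sketch of the same degenerate case (an all-zero response regressed through the origin): the fitted coefficients and residuals come out exactly zero, so anything divided by the residual standard deviation is undefined; whether that crashes depends on the implementation, as the comments below discuss.

```r
set.seed(7)
X <- matrix(rnorm(20), nrow = 5, ncol = 4)   # 5 observations, 4 regressors
y <- rep(0, 5)                               # response identically zero

fit <- lm(y ~ X - 1)   # regression through the origin
coef(fit)              # all coefficients are exactly 0
sum(resid(fit)^2)      # residual sum of squares is 0, hence the residual
                       # standard deviation is 0 as well
```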

-
The downvoter should explain why he/she thinks this is wrong. It is true; try and see it yourself. – Phil Goetz Mar 18 '16 at 16:49
-
@Cliff AB, you want to come up with the constants c1, c2, etc., which you can use to compute different response values for different inputs. But coming up with those constants requires computing the intercept, which, in many implementations, will cause a divide by zero. – Phil Goetz Mar 18 '16 at 17:02
-
Because this is such a special circumstance, is so specific to a particular software platform (for instance, `R` seems to have no trouble with this case), and because OLS potentially suffers from the same problem (just use one less regressor), I cannot understand why it is relevant to the question. – whuber Mar 18 '16 at 17:10