
The least squares formula, $\beta = (X'X)^{-1}X'Y$, can be formulated recursively as \begin{align} \beta_t &= \beta_{t-1} +\frac{1}{t}R_t^{-1}x_t'(y_t-x_t\beta_{t-1}),\\ R_t &= R_{t-1}+\frac{1}{t}(x_t'x_t-R_{t-1}), \end{align} where $\beta_{t}$ denotes the least squares estimate based on observations $1,\ldots, t$, $x_t$ denotes the $t$-th row of the design matrix $X_t$, and $R_t$ denotes the matrix $\frac{1}{t}X_t'X_t$.

This can be shown by induction, and is neat since all we need to know about previous observations $(x_i, y_i)_{i=1,\ldots, t-1}$ is captured by $\beta_{t-1}$ and $R_{t-1}$.
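To make the claim concrete, here is a minimal numerical sanity check (a sketch using numpy; the simulated data, the true coefficients, and the choice to initialise on the first ten observations are all arbitrary) that the recursion reproduces the batch estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=n)

# Batch estimate: beta = (X'X)^{-1} X'Y
beta_batch = np.linalg.solve(X.T @ X, X.T @ y)

# Recursive estimate: initialise on the first t0 observations, then update one row at a time.
t0 = 10
X0, y0 = X[:t0], y[:t0]
R = X0.T @ X0 / t0                        # R_t = (1/t) X_t' X_t
beta = np.linalg.solve(X0.T @ X0, X0.T @ y0)

for t in range(t0 + 1, n + 1):
    x_t, y_t = X[t - 1], y[t - 1]         # new row x_t and new response y_t
    R = R + (np.outer(x_t, x_t) - R) / t  # R_t = R_{t-1} + (1/t)(x_t'x_t - R_{t-1})
    beta = beta + np.linalg.solve(R, x_t) * (y_t - x_t @ beta) / t  # beta update

print(np.allclose(beta, beta_batch))      # True: the recursion matches batch OLS
```

Initialising with the batch fit on the first $t_0$ observations makes the recursion exact at every later step, which is what the induction establishes.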

The book I am reading suggests a proof by induction, which is not too hard and settles correctness. However, it seems to me that a clever projection argument or a clever Bayesian interpretation could help in getting a feel for the recursive formulation. A quick search has not turned up either.

So in short: What is your favorite intuition behind recursive least squares?

Har
  • (+1) The related (but not the same) question at [Efficient Online Linear Regresion](http://stats.stackexchange.com/questions/6920) may offer some inspiration. BTW, what is "$c_{t-1}$"? – whuber Aug 09 '13 at 16:18
  • "$c_{t-1}$ is a typo and has now disappeared... – Har Aug 10 '13 at 09:33
  • And thanks for the link to the other answer, I will look it over soon. – Har Aug 10 '13 at 09:34

2 Answers


It is roughly reminiscent of a Kalman filter, with the least squares estimator playing the role of the state variable. In any case, it is a weighted average (and possibly a convex combination) of the past estimate and the current data, which makes it an adaptive estimator. I will use a hat to denote the estimator. Re-write the basic equation

$$\hat \beta_t = \hat \beta_{t-1} +\frac{1}{t}R_t^{-1}x_t'(y_t-x_t\hat \beta_{t-1}) $$ as

$$\hat \beta_t = \left(I- \frac{1}{t}R_t^{-1}x_t'x_t\right) \hat \beta_{t-1} +\frac{1}{t}R_t^{-1}x_t'y_t $$

To use standard Kalman filter notation, define $$F_t = I- \frac{1}{t}R_t^{-1}x_t'x_t$$

Then you arrive at

$$\hat \beta_t = F_t \hat \beta_{t-1} +(I-F_t)\,b_t, $$

where $b_t$ is any coefficient vector that fits the current observation exactly, i.e. $x_t b_t = y_t$ (in the single-regressor case, simply $b_t = y_t/x_t$).

When $F_t$ lies in $[0,1)$ (which it always does in the single-regressor case, as checked below), this weighted average is a convex combination, and the recursion amounts to exponential smoothing with a variable smoothing factor.
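In the single-regressor case everything is scalar and the check is immediate: $R_t = \frac{1}{t}\sum_{i=1}^{t}x_i^2$, so

$$F_t = 1-\frac{1}{t}\,\frac{x_t^2}{R_t} = 1-\frac{x_t^2}{\sum_{i=1}^{t}x_i^2} = \frac{\sum_{i=1}^{t-1}x_i^2}{\sum_{i=1}^{t}x_i^2}\in[0,1).$$

The weight on the previous estimate is exactly the share of the accumulated signal $\sum_i x_i^2$ that was already available at time $t-1$.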

Whatever you call it, this is highly intuitive: I give some weight to my previous result, and some weight to new data.

And the intuition doesn't stop there. Re-write the second equation as

$$R_t = \left(1-\frac{1}{t}\right)R_{t-1}+\frac{1}{t}x_t'x_t$$

This is always a convex combination of past data and current data, with more weight given to past data as it accumulates (i.e. as $t$ increases).
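This rewriting is nothing more than the definition $R_t=\frac{1}{t}X_t'X_t$ unpacked:

$$tR_t = X_t'X_t = X_{t-1}'X_{t-1}+x_t'x_t = (t-1)R_{t-1}+x_t'x_t \;\Longrightarrow\; R_t=\left(1-\frac{1}{t}\right)R_{t-1}+\frac{1}{t}x_t'x_t.$$

In other words, $R_t$ is the running sample mean of the matrices $x_i'x_i$, updated with the usual incremental-mean formula.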

REFERENCES

RLS is a stochastic approximation algorithm; the seminal paper is Ljung, L. (1977). "Analysis of recursive stochastic algorithms." IEEE Transactions on Automatic Control, 22(4), 551–575.

Recursive least squares has also seen extensive use in the adaptive learning literature in economics. A clear exposition of the mechanics and of the relation to recursive stochastic algorithms can be found in ch. 6 of Evans, G. W., & Honkapohja, S. (2001). Learning and Expectations in Macroeconomics. Princeton University Press.

Alecos Papadopoulos

I thought I'd add my intuition for it a few months later. First of all, Alecos' intuition is great in qualitative terms! My post is more of a mathematical intuition (as in, how one would rediscover the formulation of recursive least squares, using only linear algebra).

Denote by $\hat{Y}_t=X_t\beta_t$ the vector (in $\mathbb{R}^t$) of fitted values of $y$ based on the first $t$ observations. Since the least squares procedure is a projection onto the column space of $X_t$, we have that

$\langle \hat{Y}_t, X^i_t \rangle = \langle Y_t, X^i_t \rangle$ for $i\in\{1, \ldots, k\}$ and $\hat{Y}_t\in span\{X^1_t, \ldots, X^k_t\}$, where $X^i_t$ denotes the $i$-th column of $X_t$ and $\langle \cdot, \cdot \rangle$ denotes the inner product of $\mathbb{R}^t$.

Similarly, we have

$\langle \hat{Y}_{t-1}, X^i_{t-1} \rangle = \langle Y_{t-1}, X^i_{t-1} \rangle$ for $i\in\{1, \ldots, k\}$ and $\hat{Y}_{t-1}\in span\{X^1_{t-1}, \ldots, X^k_{t-1}\}$ ($\langle \cdot, \cdot \rangle$ denotes the inner product of $\mathbb{R}^{t-1}$).

We want to relate the two inner products somehow. From the definition of the inner product, it is clear that $\langle Y_t, X^i_t\rangle = \langle Y_{t-1}, X^i_{t-1}\rangle + x^i_t y_t$, where $x^i_t$ is the $i$-th entry of the new row $x_t$. Therefore, we get

$\langle \hat{Y}_t, X_t^i\rangle = \langle \hat{Y}_{t-1}, X^i_{t-1}\rangle + x^i_t y_t$. This, I think, is the essential formula.
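Stacking these $k$ equations into vector form gives

$X_t'\hat{Y}_t = X_{t-1}'\hat{Y}_{t-1} + x_t'y_t$.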

To express everything in terms of the betas, we substitute $X_t\beta_t$ for $\hat{Y}_t$, bear in mind that $X_t = (X_t^1|\ldots| X_t^k)$, and get

$X_t'X_t\beta_t = X_{t-1}'X_{t-1}\beta_{t-1} + x_t'y_t$.

Now we are led to want to relate $X_t'X_t$ to $X_{t-1}'X_{t-1}$. This is easy: $X_t'X_t = X_{t-1}'X_{t-1} + x_t'x_t$. Substituting this into the previous display gives $X_t'X_t\beta_t = (X_t'X_t - x_t'x_t)\beta_{t-1} + x_t'y_t = X_t'X_t\beta_{t-1} + x_t'(y_t-x_t\beta_{t-1})$, and hence $\beta_t = \beta_{t-1} + (X_t'X_t)^{-1}x_t'(y_t-x_t\beta_{t-1})$ (with several intuitions, see Alecos' answer).

We conclude that we have a recursive formulation:

$\beta_t = \beta_{t-1} + (X_t'X_t)^{-1}x_t'(y_t-x_t\beta_{t-1})$.

$X_t'X_t = X_{t-1}'X_{t-1} + x_t'x_t$
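To connect this back to the formulation in the question, write $R_t=\frac{1}{t}X_t'X_t$, so that $(X_t'X_t)^{-1}=\frac{1}{t}R_t^{-1}$, and divide the second equation by $t$. This gives exactly the pair of recursions we started from:

$\beta_t = \beta_{t-1} + \frac{1}{t}R_t^{-1}x_t'(y_t-x_t\beta_{t-1})$,

$R_t = R_{t-1}+\frac{1}{t}(x_t'x_t - R_{t-1})$.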

The super crisp geometric intuition (why the formula is obvious without computation) still eludes me. Maybe I will return in another six months.

Har