
I'm analysing some data on which I would like to perform ordinary linear regression. However, this is not possible because I am dealing with an online setting with a continuous stream of input data (which will quickly become too large for memory), and I need to update the parameter estimates while the stream is being consumed. That is, I cannot simply load it all into memory and perform linear regression on the entire data set.

I'm assuming a simple linear multivariate regression model, i.e.

$$\mathbf y = \mathbf A\mathbf x + \mathbf b + \mathbf e$$

What's the best algorithm for creating a continuously updating estimate of the linear regression parameters $\mathbf A$ and $\mathbf b$?

Ideally:

  • I'd like an algorithm that is at most $\mathcal O(N\cdot M)$ in space and time complexity per update, where $N$ is the dimensionality of the independent variable ($\mathbf x$) and $M$ is the dimensionality of the dependent variable ($\mathbf y$).
  • I'd like to be able to specify some parameter that determines how much the parameters are updated by each new sample, e.g. 0.000001 would mean that the next sample carries a weight of one millionth in the parameter estimate. This would give some kind of exponential decay for the effect of samples in the distant past.
mikera

6 Answers


Maindonald describes a sequential method based on Givens rotations. (A Givens rotation is an orthogonal transformation of two vectors that zeros out a given entry in one of the vectors.) At the previous step you have decomposed the design matrix $\mathbf{X}$ into a triangular matrix $\mathbf{T}$ via an orthogonal transformation $\mathbf{Q}$ so that $\mathbf{Q}\mathbf{X} = (\mathbf{T}, \mathbf{0})'$. (It's fast and easy to get the regression results from a triangular matrix.) Upon adjoining a new row $v$ below $\mathbf{X}$, you effectively extend $(\mathbf{T}, \mathbf{0})'$ by a nonzero row, too, say $t$. The task is to zero out this row while keeping $\mathbf{T}$ triangular. A sequence of Givens rotations does this: the rotation with the first row of $\mathbf{T}$ zeros the first element of $t$; then the rotation with the second row of $\mathbf{T}$ zeros the second element, and so on. The effect is to premultiply $\mathbf{Q}$ by a series of rotations, which does not change its orthogonality.

When the design matrix has $p+1$ columns (which is the case when regressing on $p$ variables plus a constant), the number of rotations needed does not exceed $p+1$, and each rotation changes two $(p+1)$-vectors. The storage needed for $\mathbf{T}$ is $O((p+1)^2)$. Thus this algorithm has a computational cost of $O((p+1)^2)$ in both time and space per update.
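Here is a rough Python/NumPy sketch of such a sequential Givens update (my own illustration of the idea, not Maindonald's code; the triangular factor $\mathbf{T}$ is called `R` in the code, and only a single response column is carried, so for $M$ responses you would keep an $M$-column right-hand side instead of the vector `qty`):

```python
import numpy as np

def givens_update(R, qty, x_new, y_new):
    """Fold one observation (x_new, y_new) into the upper-triangular factor R
    and the rotated response qty; afterwards R @ beta = qty gives the fit."""
    t = np.asarray(x_new, dtype=float).copy()   # new design row, to be zeroed out
    z = float(y_new)
    p = R.shape[0]
    for i in range(p):
        if t[i] == 0.0:
            continue
        r = np.hypot(R[i, i], t[i])             # rotation zeroing t[i] against R[i, i]
        c, s = R[i, i] / r, t[i] / r
        Ri = R[i, i:].copy()
        R[i, i:] = c * Ri + s * t[i:]           # rotate row i of R with the new row
        t[i:] = -s * Ri + c * t[i:]
        qty[i], z = c * qty[i] + s * z, -s * qty[i] + c * z
    return R, qty

# usage: start from zeros (or from a QR decomposition of an initial batch)
p = 3
R, qty = np.zeros((p, p)), np.zeros(p)
rng = np.random.default_rng(0)
true_beta = np.array([1.0, -2.0, 0.5])
for _ in range(1000):
    x = rng.normal(size=p)
    y = x @ true_beta + 0.01 * rng.normal()
    R, qty = givens_update(R, qty, x, y)
beta_hat = np.linalg.solve(R, qty)              # back-substitution on a triangular system
```

Solving the triangular system at any point gives the current coefficients; that back-substitution costs only $O(p^2)$.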

A similar approach lets you determine the effect on regression of deleting a row. Maindonald gives formulas; so do Belsley, Kuh, & Welsch. Thus, if you are looking for a moving window for regression, you can retain data for the window within a circular buffer, adjoining the new datum and dropping the old one with each update. This doubles the update time and requires additional $O(k (p+1))$ storage for a window of width $k$. It appears that $1/k$ would be the analog of the influence parameter.

For exponential decay, I think (speculatively) that you could adapt this approach to weighted least squares, giving each new value a weight greater than 1. There shouldn't be any need to maintain a buffer of previous values or delete any old data.
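As a sketch of that idea (again my own illustration, piggybacking on the hypothetical `givens_update` above): scale the stored factor by $\sqrt\lambda$, $0<\lambda<1$, before folding in each new row, which is equivalent to multiplying every past observation's weight by $\lambda$ per step. A $\lambda$ just below 1 (roughly $1 - 0.000001$ in the question's example) gives the slow forgetting asked for.

```python
import numpy as np

def decayed_update(R, qty, x_new, y_new, lam=0.999):
    """Exponentially down-weight old data: each past observation's weight
    shrinks by lam per step, while the new observation enters with weight 1."""
    R *= np.sqrt(lam)
    qty *= np.sqrt(lam)
    return givens_update(R, qty, x_new, y_new)   # sketch defined above
```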

References

J. H. Maindonald, Statistical Computation. J. Wiley & Sons, 1984. Chapter 4.

D. A. Belsley, E. Kuh, R. E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. J. Wiley & Sons, 1980.

whuber
  • Is the method Maindonald describes related to or the same as Gentleman's algorithm? http://www.jstor.org/stable/2347147 – onestop Feb 05 '11 at 21:46
  • @onestop It must be related. Maindonald uses Gentleman & Wilk (*Biometrics* **31**, 1975) as a reference. – whuber Feb 05 '11 at 21:49
  • In that case see also the extensions by Alan Miller http://www.jstor.org/stable/2347583. An archive of his Fortran software site is now at http://jblevins.org/mirror/amiller/ – onestop Feb 05 '11 at 22:44
  • @onestop Your first reference obviously solves the entire problem: at the bottom of the first page he even states it can be used for weighted least squares. And Alan Miller's very first Fortran link (lsq.f90) contains the code. Why don't you combine your comments into a reply so they are more prominently available? – whuber Feb 05 '11 at 22:51
  • An explicit algorithm appears at the bottom of p. 4 of http://saba.kntu.ac.ir/eecd/people/aliyari/NN%20%20files/rls.pdf . This can be found by Googling "recursive least squares." It does not look like an improvement on the Gentleman/Maindonald approach, but at least it is clearly and explicitly described. – whuber Feb 05 '11 at 23:20
  • The last link looks like the method I was going to suggest. The matrix identity they use is known in other places as the Sherman-Morrison-Woodbury identity. It is also quite numerically efficient to implement, but may not be as stable as a Givens rotation. – cardinal Feb 06 '11 at 00:28
  • @cardinal Thank you for the reference and for the advice. I was mainly concerned about stability. In a long sequence of updates, there may be circumstances where the data are nearly collinear and there could also be a lengthy accumulation of numerical errors, so stability seems to be of paramount concern. That is the justification for the Givens rotations (instead of Householder reflections, which take less computation). – whuber Feb 06 '11 at 19:16
  • @whuber. Please don't take it otherwise - but the reference(s) that you cite are classic, like - Freedman, Maindonald, etc.. I haven't seen a lot of people cite these books. (It's my personal observation, I may be wrong. In my graduate studies, I have seen a lot of references to Freedman (and Diaconis), but never his books.) – suncoolsu Feb 06 '11 at 22:18
  • @whuber, I agree that stability is the main concern. That's why it was very nice to see the cute Givens rotation approach. I can't imagine that a full update wouldn't be necessary every so often, too. @mikera, for an exponential-decay-like approach, you might look at the LMS algorithm from signal processing. It's normally geared toward time-series, but I suspect can be adapted to a more general setting. – cardinal Feb 07 '11 at 03:49
  • @suncoolsu Is there a problem with classic references? (BTW, it's likely that all the stats software people respect--SAS, SPSS, S-Plus/R, Systat, Stata, etc.--still uses code based on the algorithmic references from the 60's through the 80's, because the core parts of all these programs were written by the early 90's.) – whuber Feb 07 '11 at 13:45
  • @whuber. Oh no! I am impressed actually. That's why I said don't take it otherwise. It was in praise that you can actually cite literature that old. I agree with you on the algorithm part as well -- we still use Prof. Maindonald's code for optimization in R (if I am not mistaken). – suncoolsu Feb 07 '11 at 22:04
  • @suncoolsu Hmm... Maindonald's book was newly published when I started using it :-). – whuber Feb 08 '11 at 06:02
  • @whuber: Wow! Thanks for sharing years of experience with us! I don't know, but I think citing Maindonald's book may not be citing `Statistical Methods for Research Workers` by `Fisher`, but it comes pretty close to it. – suncoolsu Feb 08 '11 at 18:56
  • @cardinal Gene Golub told me that the SMW identity was so numerically unstable that it shouldn't be used in practice - I wasn't going to argue with him! ;o) I had a look at the Givens rotation method (Gentleman's paper); it looks interesting, must get round to implementing it at some point! – Dikran Marsupial Mar 23 '11 at 18:15
  • @Dikran, perhaps my comment that a full update would be necessary on occasion was a little too indirect. Yes, I generally agree with this statement. Of course, I think you *should* have argued with Gene, if for no other reason than I think he would have enjoyed it. He was rarely shy about his opinions. RIP. – cardinal Mar 24 '11 at 03:26

I think recasting your linear regression model into a state-space model will give you what you are after. If you use R, you may want to use package dlm and have a look at the companion book by Petris et al.
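As a rough illustration of the recasting (my own sketch, not the dlm API): treat the coefficient vector as a static state and run a Kalman filter over the stream. With the state held constant this reduces to recursive least squares, and a discount factor slightly below 1 gives the exponential forgetting asked for; the names `obs_var` and `discount` below are my own.

```python
import numpy as np

def kalman_regression_step(beta, P, x, y, obs_var=1.0, discount=1.0):
    """One filtering step for the state-space model y_t = x_t' beta + noise,
    with beta treated as a constant (static) state."""
    x = np.asarray(x, dtype=float)
    P = P / discount                    # discount < 1 inflates P, i.e. forgets old data
    S = x @ P @ x + obs_var             # one-step predictive variance of y
    K = P @ x / S                       # Kalman gain
    beta = beta + K * (y - x @ beta)    # correct the state with the new residual
    P = P - np.outer(K, x @ P)          # update the state covariance
    return beta, P

# usage: start from a diffuse prior
p = 3
beta, P = np.zeros(p), 1e6 * np.eye(p)
```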

F. Tusell
  • Maybe I'm confused, but this appears to refer to a time-series model? My model is actually simpler in that the samples are not a time series (effectively they are independent (x -> y) samples, just accumulated in large volumes over time). – mikera Feb 05 '11 at 19:26
  • Yes, in the general case this is used for time series with non-independent observations, but you can always assume no correlation between successive observations, which gives the special case of interest to you. – F. Tusell Feb 05 '11 at 19:46

Surprised no one else has touched on this so far. Linear regression has a quadratic objective function, so a Newton-Raphson step from any starting point leads you straight to the optimum. Now, let's say you have already done your linear regression. The objective function is:

$$ L(\beta) = (y - X \beta)^t (y - X \beta) $$ The gradient becomes $$ \nabla L (\beta) = -2 X^t (y - X \beta)$$ and the Hessian: $$ \nabla^2 L (\beta) = 2\, X^t X$$

Now, you have some past data, have done a linear regression, and are sitting on your parameters ($\beta$). The gradient at this point is zero, because $\beta$ minimizes the objective on the data seen so far. The Hessian is as given above. A new data point ($x_{new}, y_{new}$) arrives. You just calculate the gradient for the new point via:

$$\nabla L_{new}(\beta) = -2 x_{new} (y_{new}-x_{new}^T \beta)$$ and that will become your overall gradient (since the gradient from the existing data was zero). The Hessian for the new data point is:

$$\nabla^2 L_{new} = 2\, x_{new}x_{new}^T $$

Add this to the old Hessian given above. Then just take a Newton-Raphson step:

$$\beta_{new} = \beta_{old} - (\nabla^2L)^{-1} \nabla L_{new}$$

And you're done.
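A sketch of the bookkeeping this implies (my own illustration, for a single response, assuming $\beta$ and $H$ were initialised by an exact fit to an initial batch so that the old gradient really is zero):

```python
import numpy as np

def newton_update(beta, H, x_new, y_new):
    """One Newton-Raphson step after a new observation.  H accumulates the
    Hessian 2 X^T X over all data seen so far; the gradient of the old data
    vanishes at the current beta, so only the new point's gradient is needed."""
    x_new = np.asarray(x_new, dtype=float)
    H = H + 2.0 * np.outer(x_new, x_new)           # Hessian of the new term
    grad = -2.0 * x_new * (y_new - x_new @ beta)   # gradient of the new term
    beta = beta - np.linalg.solve(H, grad)         # Newton step (O(p^3) per update)
    return beta, H
```

Maintaining $H^{-1}$ directly with a rank-1 identity (as discussed in the comments) would bring the per-step cost down to $O(p^2)$.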

ryu576
  • I like the idea for its simplicity but (a) for the sake of not confusing readers, would prefer to see an explicit definition of "$\nabla L_{new}$" and (b) believe you need to compute the gradient correctly (or show why being off by a factor of 2 doesn't matter). This would be more convincing if you could give a small example demonstrating it's correct. For larger $p,$ it would be worthwhile to estimate the computational effort. Doesn't inverting the Hessian take $O(p^3)$ time? – whuber Sep 27 '18 at 22:11
  • Thanks, will add more details a little later today. Yes, inverting the Hessian takes $O(p^3)$ for large $p$. You could also try to maintain the Hessian inverse and update it directly using a power series ($(I-A)^{-1}=I+A+A^2+ \dots $). If you have a million parameters, gradient descent is more or less your only option anyway. – ryu576 Sep 27 '18 at 22:55

You can always just perform gradient descent on the sum-of-squares cost $E$ with respect to the parameters $W$ of your model. Take its gradient, but don't go for the closed-form solution; use it only as a search direction instead.

Let $E(i; W)$ be the cost of the $i$-th training sample given the parameters $W$. Your update for the $j$-th parameter is then

$$W_{j} \leftarrow W_j - \alpha \frac{\partial{E(i; W)}}{\partial{W_j}}$$

where $\alpha$ is a step rate, which you should pick via cross-validation or a sensible rule of thumb.

This is very efficient and is the way neural networks are typically trained. You can even process many samples in parallel (say, 100 or so) efficiently.

Of course more sophisticated optimization algorithms (momentum, conjugate gradient, ...) can be applied.
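In the notation of the question ($\mathbf y = \mathbf A\mathbf x + \mathbf b + \mathbf e$), a single stochastic-gradient step looks roughly like the sketch below (my own illustration; the step size `lr` plays the role of the small influence parameter asked for):

```python
import numpy as np

def sgd_step(A, b, x, y, lr=1e-6):
    """One stochastic-gradient step on 0.5 * ||y - (A x + b)||^2."""
    err = y - (A @ x + b)           # residual for this sample
    A += lr * np.outer(err, x)      # -dE/dA = err x^T
    b += lr * err                   # -dE/db = err
    return A, b
```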

bayerj
  • Seems very similar to this paper http://eprints.pascal-network.org/archive/00002147/01/CrammerDeKeShSi06.pdf . It's been implemented in an open source project called jubatus. – saccharine Mar 07 '13 at 00:24

The standard least-squares fit gives regression coefficients

$ \beta = ( X^T X )^{-1} X^T Y $

where X is a matrix of M values for each of N data points, and is N×M in size. Y is an N×1 matrix of outputs. $\beta$ of course is an M×1 vector of coefficients. (If you want an intercept, just make one set of x's always equal to 1.)

To make this online, presumably you just need to keep track of $X^T X$ and $X^T Y$, i.e. one M×M matrix and one M×1 matrix. Every time you get a new data point you update those $M^2+M$ elements and then calculate $\beta$ again, which costs you an M×M matrix inversion and the multiplication of the M×M matrix by the M×1 matrix.

For example, if M=1, then the one coefficient is

$ \beta = \frac{\sum_{i=1}^N{x_i y_i}}{\sum_{i=1}^N{x_i^2}} $

so every time you get a new data point you update both sums and calculate the ratio and you get the updated coefficient.

If you want to damp out the earlier estimates geometrically I suppose you could weight $X^T X$ and $X^T Y$ by $(1-\lambda)$ each time before adding the new term, where $\lambda$ is some small number.
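A sketch of that bookkeeping (my own illustration; `lam` is the small damping parameter described above, and the expensive $O(M^3)$ solve happens only when you ask for the coefficients):

```python
import numpy as np

class OnlineNormalEquations:
    """Accumulate X^T X and X^T Y, optionally damping old data by (1 - lam)."""
    def __init__(self, m, lam=0.0):
        self.Sxx = np.zeros((m, m))
        self.Sxy = np.zeros(m)
        self.lam = lam

    def update(self, x, y):
        decay = 1.0 - self.lam
        self.Sxx = decay * self.Sxx + np.outer(x, x)
        self.Sxy = decay * self.Sxy + y * np.asarray(x, dtype=float)

    def coefficients(self):
        return np.linalg.solve(self.Sxx, self.Sxy)   # M x M solve, O(M^3)
```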

  • It's nice to see this simple case explained. Did you notice, though, that the question specifically asks about *multivariate* regression? It's not so easy to update the denominator of $\beta$ in that case! – whuber Apr 19 '13 at 21:07
  • I think my answer works still: ie you need to keep track of the MxM matrix $X^T X$ and the Mx1 matrix $X^T Y$. Each element of those matrices is a sum like in the M=1 example. Or am I missing something? – Mark Higgins Apr 24 '13 at 13:17
  • Yes: in addition to computing a matrix product and applying a matrix to a vector, you now need to *invert* $X'X$ at each step. That's expensive. The whole point to online algorithms is to replace wholesale expensive steps by cheaper updating procedures. – whuber Apr 24 '13 at 13:24
  • @whuber By the way, a fast, online way to estimate $C^{-1}x$ for a changing matrix $C$ and vector $x$ is given by Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent. Essentially, you let $z_{t+1}=z_t + x - Cz_t$, and $z\to C^{-1}x$ as $t\to\infty$. – Neil G Dec 20 '17 at 22:27
  • $(X'X)^{-1}$ can be efficiently updated using [Sherman-Morrison](https://en.wikipedia.org/wiki/Sherman%E2%80%93Morrison_formula). – Nicholas Mancuso Apr 30 '21 at 03:39

The problem is more easily solved when you rewrite things a little bit:

Y = y

X = [x, 1 ]

then

Y = A*X

A one-time solution is found by calculating

V = X' * X

and

C = X' * Y

Note that V should be N-by-N in size and C N-by-M. The parameters you're looking for are then given by:

A = inv(V) * C

Since both V and C are calculated by summing over your data, you can calculate A at every new sample. This has a time complexity of O(N^3), however.

Since V is square and positive semi-definite, an LU decomposition exists, which makes inverting V numerically more stable. There are algorithms to perform rank-1 updates to the inverse of a matrix. Find those and you'll have the efficient implementation you're looking for.

The rank-1 update algorithms can be found in "Matrix Computations" by Golub and Van Loan. It's tough material, but it does have a comprehensive overview of such algorithms.
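One such rank-1 update is the Sherman-Morrison formula (also mentioned in the comments on other answers). A sketch with names matching this answer, keeping each update at O(N^2); it assumes inv(V) was initialised from an initial batch (or a small ridge term added to V so it is invertible):

```python
import numpy as np

def rank1_update(V_inv, C, x, y):
    """Update V_inv = inv(X' X) and C = X' Y after one new sample (x, y),
    where x already has the trailing 1 appended and y has length M."""
    Vx = V_inv @ x
    V_inv = V_inv - np.outer(Vx, Vx) / (1.0 + x @ Vx)   # Sherman-Morrison
    C = C + np.outer(x, y)                              # accumulate X' Y
    return V_inv, C, V_inv @ C                          # last term is A = inv(V) * C
```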

Note: the method above gives a least-squares estimate at each step. You can easily add weights to the updates to X and Y. When the values of X and Y grow too large, they can be scaled by a single scalar without affecting the result.

Mr. White