
As a prequel to a question about linear mixed models in R, and to share as a reference for beginner/intermediate statistics aficionados, I decided to post, as an independent Q&A-style thread, the steps involved in the "manual" computation of the coefficients and predicted values of a simple linear regression.

The example uses the built-in R dataset mtcars and is set up with miles per gallon consumed by a vehicle acting as the dependent variable, regressed on the weight of the car (a continuous variable) and the number of cylinders as a factor with three levels (4, 6, or 8), without interactions.
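In lm terms, the reference fit would be (a minimal sketch; mtcars ships with base R):

# Reference fit for the model described above
fit <- lm(mpg ~ wt + as.factor(cyl), data = mtcars)
coef(fit)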

EDIT: If you are interested in this question, you will definitely find a detailed and satisfactory answer in this post by Matthew Drury outside CV.

Chris Watson
Antoni Parellada
  • When you say "manual computation", what is it you seek? It's relatively straightforward to show a series of relatively simple steps to get parameter estimates and so on (via Gram-Schmidt orthogonalization, for example, or by SWEEP operators), but that's not how R does the calculations internally; it (and most other stats packages) uses QR decomposition (discussed in a number of posts on site - a search on *QR decomposition* turns up a number of posts, a few of which you might get direct value from) – Glen_b May 29 '15 at 02:02
  • Yes. I believe that this was very nicely addressed in the answer by M.D. I should probably edit my post, perhaps emphasizing the geometric approach behind my answer - column space, projection matrix... – Antoni Parellada May 29 '15 at 02:48
  • Yep! @Matthew Drury Do you want me to erase that line in the OP, or update the link? – Antoni Parellada Jan 05 '18 at 17:39
  • Not sure if you have this link, but this is closely related, and I really love J.M's answer. https://stats.stackexchange.com/questions/1829/what-algorithm-is-used-in-linear-regression – Haitao Du Jan 05 '18 at 17:49

2 Answers


Note: I've posted an expanded version of this answer on my website.

> Would you kindly consider posting a similar answer with the actual R engine exposed?

Sure! Down the rabbit hole we go.

The first layer is lm, the interface exposed to the R programmer. You can look at its source by just typing lm at the R console. The majority of it (like the majority of most production-level code) is busy checking inputs, setting object attributes, and throwing errors; but this line sticks out

lm.fit(x, y, offset = offset, singular.ok = singular.ok, 
                ...)
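lm.fit is another R function, and you can call it yourself. A minimal sketch of mine (not part of the original tour), using model.matrix to build the design matrix that the formula interface would otherwise construct:

# Sketch: calling lm.fit directly; model.matrix expands the formula for us
X <- model.matrix(mpg ~ wt + as.factor(cyl), data = mtcars)
fit <- lm.fit(X, mtcars$mpg)
fit$coefficients   # matches coef(lm(mpg ~ wt + as.factor(cyl), data = mtcars))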

While lm conveniently works with formulas and data frames, lm.fit wants matrices, so that's one level of abstraction removed. Checking the source for lm.fit, we find more busywork, and the following really interesting line

z <- .Call(C_Cdqrls, x, y, tol, FALSE)

Now we are getting somewhere. .Call is R's way of calling into C code. There is a C function, C_Cdqrls, in the R source somewhere, and we need to find it. Here it is.
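Before diving into the C, a hedged sketch of mine: the routine is registered in the stats namespace, so — assuming your build still exposes the unexported symbol stats:::C_Cdqrls, which may change between R versions — you can mirror the call above yourself.

# Sketch only: this pokes unexported internals and may break across versions
X <- cbind(1, mtcars$wt)
z <- .Call(stats:::C_Cdqrls, X, mtcars$mpg, 1e-7, FALSE)
z$coefficients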

Looking at the C function, again, we find mostly bounds checking, error cleanup, and busy work. But this line is different

F77_CALL(dqrls)(REAL(qr), &n, &p, REAL(y), &ny, &rtol,
        REAL(coefficients), REAL(residuals), REAL(effects),
        &rank, INTEGER(pivot), REAL(qraux), work);

So now we are on our third language: R has called C, which is calling into Fortran. Here's the Fortran code.

The first comment tells it all

c     dqrfit is a subroutine to compute least squares solutions
c     to the system
c
c     (1)               x * b = y

(Interestingly, it looks like the name of this routine was changed at some point, but someone forgot to update the comment.) So we're finally at the point where we can do some linear algebra and actually solve the system of equations. This is the sort of thing that Fortran is really good at, which explains why we passed through so many layers to get here.

The comment also explains what the code is going to do

c     on return
c
c        x      contains the output array from dqrdc2.
c               namely the qr decomposition of x stored in
c               compact form.

So Fortran is going to solve the system by finding the $QR$ decomposition.

The first thing that happens, and by far the most important, is

call dqrdc2(x,n,n,p,tol,k,qraux,jpvt,work)

This calls the Fortran subroutine dqrdc2 on our input matrix x. What's this?

 c     dqrfit uses the linpack routines dqrdc and dqrsl.

So we've finally made it to LINPACK. LINPACK is a Fortran linear algebra library that has been around since the 1970s. Most serious linear algebra eventually finds its way to LINPACK. In our case, we are using the function dqrdc2

c     dqrdc2 uses householder transformations to compute the qr
c     factorization of an n by p matrix x.

This is where the actual work is done. It would take a good full day for me to figure out what this code is doing; it is as low-level as they come. But generically, we have a matrix $X$ and we want to factor it into a product $X = QR$, where $Q$ is an orthogonal matrix and $R$ is an upper triangular matrix. This is a smart thing to do, because once you have $Q$ and $R$ you can solve the linear equations for regression

$$ X^t X \beta = X^t Y $$

very easily. Indeed

$$ X^t X = R^t Q^t Q R = R^t R $$

so the whole system becomes

$$ R^t R \beta = R^t Q^t y $$

but $R$ is upper triangular and has the same rank as $X^t X$, so as long as our problem is well posed it is full rank, which makes $R^t$ invertible; canceling $R^t$ from both sides, we may as well just solve the reduced system

$$ R \beta = Q^t y $$

But here's the awesome thing. $R$ is upper triangular, so the last linear equation here is just constant * beta_n = constant, so solving for $\beta_n$ is trivial. You can then go up the rows, one by one, and substitute in the $\beta$s you already know, each time getting a simple one variable linear equation to solve. So, once you have $Q$ and $R$, the whole thing collapses to what is called backwards substitution, which is easy. You can read about this in more detail here, where an explicit small example is fully worked out.
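To tie the whole tour together from the R level, here is a minimal sketch of mine (not from the original post); qr()'s default method routes through this same LINPACK code, and backsolve performs the backward substitution:

# Sketch: reproduce lm's coefficients via QR plus backward substitution
X   <- cbind(1, mtcars$wt)        # toy design matrix: intercept and weight
y   <- mtcars$mpg
QR  <- qr(X)                      # Householder QR, stored in compact form
R   <- qr.R(QR)                   # upper triangular factor
Qty <- qr.qty(QR, y)[1:ncol(X)]   # first p entries of Q^t y
beta <- backsolve(R, Qty)         # backward substitution solves R beta = Q^t y
all.equal(as.vector(beta), unname(coef(lm(mpg ~ wt, data = mtcars))))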

Matthew Drury
  • This was the most fun mathematical / coding short essay one can imagine. I know next to nothing about coding, but your "tour" through the guts of a seemingly innocuous R function was truly eye-opening. Excellent writing! Since "kindly" did the trick... Could you _kindly_ consider [this one](http://stats.stackexchange.com/q/154355/67822) as a related challenge? :-) – Antoni Parellada May 28 '15 at 22:14
  • +1 I hadn't seen this before, nice summary. Just to add a little bit of information in case @Antoni isn't familiar with Householder transformations; it's essentially a linear transformation that allows you to zero out one part of the R matrix you're trying to achieve without mucking up parts you've already dealt with (as long as you go through it in the right order), making it ideal for transforming matrices to upper triangular form (Givens rotations do a similar job and are perhaps easier to visualize, but are a little slower). As you build R, you must at the same time construct Q – Glen_b Jan 12 '16 at 22:35
  • @Glen_b Thanks, Glen! I will certainly look into it in detail. – Antoni Parellada Jan 12 '16 at 22:39
  • @Antoni I learned a number of useful things about computation with matrices (and indeed about linear algebra) by looking at the linpack manuals/code documentation long ago. I don't know that it's the most efficient way to gain that knowledge, but it was what I had at the time and it was surprisingly helpful. There's good clear explanations in a number of the routines. I was at one point able to help a colleague greatly speed up some MCMC algorithms that did variable selection in regression-related problems using ideas I got from it. – Glen_b Jan 12 '16 at 22:44
  • Matthew (+1), I suggest you start or end your post with a link to your much more detailed write-up http://madrury.github.io/jekyll/update/2016/07/20/lm-in-R.html. – amoeba Jul 22 '16 at 20:08
  • -1 for chickening out and not going down to machine code. – Stephan Kolassa Jul 22 '16 at 20:36
  • (Sorry, just kidding ;-) – Stephan Kolassa Jul 22 '16 at 20:36
  • Matthew, there is a question tangentially related to your answer that I have been meaning to look up for a long time. Perhaps you can save me a lot of searching time. You say, **"R has called C, which is calling into Fortran"**. The question is: do all commercial computers come loaded with C and Fortran, or does R systematically download and install them? – Antoni Parellada Jul 17 '17 at 01:45
  • @AntoniParellada Realizing I am very late on this, so sorry! C and FORTRAN are pretty much exclusively *compiled* languages, i.e. they get translated to machine code directly interpretable by CPUs. So, while almost all computers contain machine code that was compiled from C or FORTRAN source, there may not actually be a *compiler* installed on all machines. – Matthew Drury Jan 05 '18 at 17:29

The actual step-by-step calculations in R are beautifully described in the answer by Matthew Drury in this same thread. In this answer I want to walk through the process of proving to oneself, with a simple example, that the results in R can be reached by following the linear algebra of projections onto the column space and the concept of perpendicular (dot product) errors, illustrated in different posts, and nicely explained by Dr. Strang in Linear Algebra and Its Applications, and readily accessible here.

In order to estimate the coefficients $\small \beta$ in the regression,

$$\small mpg = intercept\,(cyl=4) + \beta_1\,weight + D_1\,intercept\,(cyl=6) + D_2\,intercept\,(cyl=8)\quad[*]$$

with $\small D_1$ and $\small D_2$ representing dummy variables taking values in $\{0,1\}$, we first need to include in the design matrix ($\small X$) the dummy coding for the number of cylinders, as follows:

attach(mtcars)
x1 <- wt                                      # continuous regressor: weight
x2 <- cyl; x2[x2==4] <- 1; x2[!x2==1] <- 0    # dummy for 4 cylinders
x3 <- cyl; x3[x3==6] <- 1; x3[!x3==1] <- 0    # dummy for 6 cylinders
x4 <- cyl; x4[x4==8] <- 1; x4[!x4==1] <- 0    # dummy for 8 cylinders
X <- cbind(x1, x2, x3, x4)                    # design matrix (no intercept column)
colnames(X) <- c('wt', '4cyl', '6cyl', '8cyl')

head(X)
        wt 4cyl 6cyl 8cyl
[1,] 2.620    0    1    0
[2,] 2.875    0    1    0
[3,] 2.320    1    0    0
[4,] 3.215    0    1    0
[5,] 3.440    0    0    1
[6,] 3.460    0    1    0

If the design matrix had to strictly parallel equation $\small [*]$ (above), where the first intercept corresponds to four-cylinder cars (as in an lm call without a -1), it would require a first column of just ones; however, we'll derive the same results without this intercept column.
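For comparison, a quick sketch of mine: the design matrix lm builds by default (intercept column plus treatment dummies for 6 and 8 cylinders) can be inspected directly:

# Sketch: the default design matrix lm would construct without the -1
head(model.matrix(mpg ~ wt + as.factor(cyl), data = mtcars))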

Continuing then, to calculate the coefficients ($\small\beta$) we project the vector of dependent variable values onto the column space of the vectors constituting the design matrix. The relevant matrix (strictly, the pseudoinverse of $X$, whose product with $y$ gives the coordinates of that projection) is $\small ProjMatrix = (X^{T}X)^{-1}X^{T}$; multiplied by the vector of the dependent variable, $\small [ProjMatrix]\,[y] = [RegrCoef's]$, or $\small (X^{T}X)^{-1}X^{T}\,y = \beta$:

X_tr_X_inv <- solve(t(X) %*% X)   # (X^T X)^{-1}
Proj_M <- X_tr_X_inv %*% t(X)     # (X^T X)^{-1} X^T
Proj_M %*% mpg                    # beta = (X^T X)^{-1} X^T y

          [,1]
wt   -3.205613
4cyl 33.990794
6cyl 29.735212
8cyl 27.919934

Identical to: coef(lm(mpg ~ wt + as.factor(cyl)-1)).
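A numerically friendlier check (a sketch of mine, echoing Matthew Drury's comment below about preferring solve(A, b) to an explicit inverse):

# Sketch: solve the normal equations directly instead of forming the inverse
beta <- solve(t(X) %*% X, t(X) %*% mpg)
all.equal(as.vector(beta), unname(coef(lm(mpg ~ wt + as.factor(cyl) - 1))))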

Finally, to calculate the predicted values, we will need the hat matrix, which is defined as $\small \text{Hat Matrix} = X(X^{T}X)^{-1}X^{T}$. This is readily calculated as:

HAT <- X %*% X_tr_X_inv %*% t(X)
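As a quick sanity check (mine), the hat matrix is an orthogonal projection, so it should be symmetric and idempotent:

# Sketch: verify the defining properties of a projection matrix
all.equal(HAT, t(HAT))        # symmetric
all.equal(HAT, HAT %*% HAT)   # idempotent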

And the fitted ($\hat{y}$) values as $\small X(X^{T}X)^{-1}X^{T}\,y$, in this case y_hat <- HAT %*% mpg, which gives values identical to:

cyl <- as.factor(cyl); OLS <- lm(mpg ~ wt + cyl); predict(OLS)

y_hat <- as.numeric(y_hat)
predicted <- as.numeric(predict(OLS))
all.equal(y_hat, predicted)
[1] TRUE
Antoni Parellada
  • In general, in numerical computing, I believe it is best to solve the linear equation instead of computing the inverse matrix. So, I think `beta = solve(t(X) %*% X, t(X) %*% y)` is in practice more accurate than `solve(t(X) %*% X) %*% t(X) %*% y`. – Matthew Drury May 28 '15 at 18:42
  • R doesn't do it that way - it uses a QR decomposition. If you are going to describe the *algorithm* used, on a computer I doubt *anyone* uses the one you show. – Gavin Simpson May 28 '15 at 18:44
  • Not after the algorithm, just trying to understand the linear algebra underpinnings. – Antoni Parellada May 28 '15 at 18:46
  • @AntoniParellada Even in that case, I still find thinking in terms of linear equations more illuminating in many situations. – Matthew Drury May 28 '15 at 18:51
  • @MatthewDrury I based my calculations on the linear algebra explanations you can find in Linear Algebra and Its Applications, by Gilbert Strang, but my field is far from statistics. So I am not surprised that even though the results completely replicate what I get using lm in R, the actual computational algorithm is different. Would you kindly consider posting a similar answer with the actual R engine exposed? – Antoni Parellada May 28 '15 at 18:56
  • This answer makes both the question and answer duplicates of material appearing in many threads on this site, including http://stats.stackexchange.com/questions/108591, http://stats.stackexchange.com/questions/68151, http://stats.stackexchange.com/questions/140848, and http://stats.stackexchange.com/questions/124887. That leaves me wondering what the point of this thread is. Could you indicate what is new or different about it that is on-topic on this site? (A thread dedicated to `R` coding--and this one gives the strong impression of such a thing--would *not* be on topic, for instance.) – whuber May 28 '15 at 22:10
  • @whuber Even if it were just to preserve the response by Matthew Drury, I would like to ask you to keep it. Some concepts need multiple perspectives. I was planning on modifying my answer with its more geometrical interpretation of column spaces and errors. – Antoni Parellada May 29 '15 at 12:02
  • @whuber I see a lot of good material in the other posts, but I strongly believe there is complementary value in this post for a lot of beginner students of statistics using R and without good fundamentals in linear algebra. – Antoni Parellada May 29 '15 at 12:14
  • Given the peripheral relationship of this thread to our site's objectives, while seeing value in illustrating the use of `R` for important calculations, I would like to suggest that you consider turning it into a [contribution to our blog.](http://stats.blogoverflow.com/) – whuber May 29 '15 at 13:13