
I thought R would use discriminant analysis, as described e.g. in chapter 4.4 of James et al., "An Introduction to Statistical Learning with Applications in R". But after input from this article and this SO question, both including reproducible examples, it seems that

  • a model matrix is calculated in `ans <- .External2(C_modelmatrix, t, data)` (in `model.matrix.default`)
  • the model matrix goes into `z <- .Call(C_Cdqrls, x, y, tol, FALSE)`, and I did not expect linear regression and discriminant analysis to be the same at the maths level.

From here it seems that R can always use the standard algorithm based on QR factorization.
Is that true?
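
To make these two steps concrete, here is a minimal sketch with a made-up toy data frame (base R only; the data and names are mine, chosen purely for illustration):

```r
## Toy data: one continuous predictor and one three-level factor
set.seed(1)
d <- data.frame(
  y = rnorm(9),
  x = rnorm(9),
  g = factor(rep(c("a", "b", "c"), each = 3))
)

## Step 1: model.matrix expands the factor g into k - 1 = 2 dummy columns
X <- model.matrix(y ~ x + g, data = d)
colnames(X)  # "(Intercept)" "x" "gb" "gc"

## Step 2: lm fits this purely numeric matrix; qr() on the same matrix
## reproduces its coefficients, consistent with C_Cdqrls doing a
## QR-based least-squares fit
fit <- lm(y ~ x + g, data = d)
all.equal(coef(fit), qr.coef(qr(X), d$y))  # TRUE
```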

Comment: R's help ?lm references Chambers, J. M. (1992) Linear models. Chapter 4 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole. But I do not have access to the book.

Christoph
  • I don't really know what you are asking. R transforms factors into dummy variables before processing, if that is what you are asking. – user2974951 Jul 10 '19 at 12:50
  • @user2974951 This step seems to be clear now. The question is what happens afterwards? It seems the coefficients $\widehat{\beta}$ are calculated from $\widehat{\beta}=(X^TX)^{-1}X^TY$ (see the numerical check after these comments). Thus, R has implemented a different algorithm than I thought... (And I cannot have a look at the cited reference to check :-( ) – Christoph Jul 10 '19 at 13:03
  • @FransRodenburg not really, `lm` calls `lm.fit` (line 68), then `lm.fit` calls `.Call(C_Cdqrls, ...)` in line 114. Thus "the work" is done in a C function called `Cdqrls`. I could not check there, as I could not find the code. However, as stated in the links above, this function seems to use QR decomposition, which I did not expect when factors are present. – Christoph Jul 10 '19 at 15:50
  • Thanks, I've added an answer. I think he just doesn't want to include more math than he deems necessary for ISLR. In ESL, QR-decomposition *is* covered (3.2.3). – Frans Rodenburg Jul 12 '19 at 00:56
  • 2
    An useful reference is the detailed explanation on [coding matrices](https://cran.r-project.org/web/packages/codingMatrices/vignettes/codingMatrices.pdf) provided by Bill Venables. – Yves Jul 12 '19 at 05:31
  • @Yves Thanks for your hint. Really well written. If you like, you can add something in that direction as an additional answer. It just explains how the $X$ matrices are built, and the picture in my comment is quite oversimplified in the case of factor predictors. Furthermore, it explains some fallacies: "In R the familiar stats package functions contr.treatment, contr.poly, contr.sum and contr.helmert all generate coding matrices, and not necessarily contrast matrices as the names might suggest." – Christoph Jul 12 '19 at 06:45
  • @Christoph OK, thanks. – Yves Jul 12 '19 at 09:51
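
The formula in the comment above can be checked numerically against what `lm` returns. A small sketch with simulated data (the normal-equations route shown here is mathematically equivalent to, but numerically less stable than, the QR route `lm` actually takes):

```r
## Simulated data with a two-level factor predictor
set.seed(42)
d <- data.frame(y = rnorm(20), g = factor(rep(c("a", "b"), 10)))
X <- model.matrix(~ g, data = d)

## Normal equations: beta-hat = (X'X)^{-1} X'y
beta_normal <- solve(crossprod(X), crossprod(X, d$y))

## What lm computes via QR (ultimately C_Cdqrls)
beta_qr <- coef(lm(y ~ g, data = d))

all.equal(as.vector(beta_normal), unname(beta_qr))  # TRUE
```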

2 Answers


When you have factors, a call is first made to `model.matrix` to turn your formula into a design matrix $\mathbf{X}$ and a response $\mathbf{y}$. For factors, that just means the design matrix gets an extra $k-1$ columns, where $k$ is the number of categories in the factor. The reference category coincides with the intercept to prevent perfect collinearity, hence the $-1$. From there, QR decomposition works no differently than if there were only continuous variables.

For a factor, the first category in `levels(your_factor)` is the reference level, and you can change that if needed with `relevel(your_factor, ref = "new reference")`.
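
A short sketch of how the reference level shows up in the coefficients (toy numbers, chosen only for illustration):

```r
## Three-level factor; "a" is the reference level by default
g <- factor(c("a", "a", "b", "b", "c", "c"))
y <- c(1, 2, 3, 4, 5, 6)
coef(lm(y ~ g))  # (Intercept) = mean of "a"; gb, gc = differences from it

## Make "c" the reference level instead
g2 <- relevel(g, ref = "c")
coef(lm(y ~ g2))  # (Intercept) = mean of "c"; g2a, g2b relative to it
```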

As to your comment on why ISLR skips over QR-decomposition, I searched through the book for 'decomposition' and found this quote in chapter 10:

Problem (10.3) can be solved via an eigen decomposition, a standard technique in linear algebra, but details are outside of the scope of this book.

So I guess the answer is as simple as: they don't want to linger on details that aren't crucial to beginners. Remember that ISLR has a more advanced, also free counterpart: The Elements of Statistical Learning (overlapping authors, very similar style). In ESL, QR decomposition is covered in section 3.2.3 (specifically, page 55). :)

Frans Rodenburg
  • Now I have a good start to the day :-) – Christoph Jul 12 '19 at 05:51
  • If you like, you can post a summary of this answer [here](https://stackoverflow.com/q/56953563/5784831) and I will accept. I think this really answers the original question about coding... – Christoph Jul 12 '19 at 06:48
  • I just realized that my question might not be clear enough: when talking about a linear model $y \sim x$, factor levels in $x$ are handled as you describe. If, in contrast, the response $y$ is a factor, `lm` throws an error and a different approach is necessary, e.g. discriminant analysis or `glm`. – Christoph Jul 12 '19 at 07:49
  • When the outcome has two categories, you can use a binomial GLM, yes. If it has even more, you should look into multinomial regression, or ordinal regression if there is a logical order to the categories. Note that GLMs are usually fitted by iteratively reweighted least squares. – Frans Rodenburg Jul 12 '19 at 10:33
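
Following up on that last comment, a minimal sketch of a binomial GLM for a two-level factor response (simulated data; unlike `lm`, `glm` fits this by iteratively reweighted least squares, where each iteration is itself a weighted least-squares problem):

```r
## Simulated two-class response depending on one continuous predictor
set.seed(7)
x <- rnorm(100)
p <- plogis(0.5 + 1.2 * x)                      # true success probabilities
y <- factor(rbinom(100, 1, p), labels = c("no", "yes"))

## lm(y ~ x) would throw an error for a factor response;
## glm handles it via IRLS
fit <- glm(y ~ x, family = binomial)
coef(fit)
fit$iter  # number of IRLS iterations performed
```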

A very useful resource is the document "Coding Matrices, Contrast Matrices and Linear Models" by Bill Venables, available on CRAN. It explains in detail how the model matrix of an `lm` or `glm` is built, including the case where some of the predictors are factors. The key functions `contr.poly`, `contr.sum`, `contr.treatment` and `contr.helmert` are illustrated.

Once the model matrix is built, the estimation uses the QR decomposition, as can be seen from the source code of `lm`.
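
As a quick illustration of those coding functions (base R; the vignette explains the distinction between coding matrices and contrast matrices in depth):

```r
## Coding matrices for a three-level factor
contr.treatment(3)  # dummy coding; level 1 is the reference
contr.sum(3)        # sum-to-zero coding
contr.helmert(3)    # Helmert coding

## The chosen coding determines the columns of the model matrix
g <- factor(c("a", "b", "c"))
model.matrix(~ g)                                         # default: contr.treatment
model.matrix(~ g, contrasts.arg = list(g = "contr.sum"))  # sum coding instead
```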

Yves