
I'm trying to convert my factor column to dummy variables:

str(cards$pointsBin)
# Factor w/ 5 levels ".lte100",".lte150",..: 3 2 3 1 4 4 2 2 4 4 ...

labels <- model.matrix(~ pointsBin, data=cards)

head(labels)

#     (Intercept) pointsBin.lte150 pointsBin.lte200 pointsBin.lte250 pointsBin.lte300
# 741           1                0                0                0                0
# 407           1                1                0                0                0
# 676           1                0                0                1                0
# 697           1                1                0                0                0
# 422           1                0                1                0                0
# 300           1                0                1                0                0

There is no column for the first level of my factor (".lte100"), which is the category the first row should fall into. How do I get this information back? And what does the (Intercept) column, which seems to be all 1's, mean?

digitgopher
  • When you have "K" dummy variables then your resulting model will have a.) the intercept term (which is a column of ones) and b.) "K-1" additional columns. The reason is that otherwise the columns of the resulting matrix would not be linearly independent (and, as a result, you wouldn't be able to do **OLS**). – Steve S Oct 01 '15 at 05:34
  • Basically, the idea is that you have one outcome which acts as the baseline with which the others are to be compared. Then the coefficients can be interpreted with respect to this baseline. You *could* create a matrix with only your "K" factors and no intercept by explicitly removing the intercept term in your formula (so your formula would look something like: **y ~ . - 1**) but it won't lead to a particularly meaningful model. – Steve S Oct 01 '15 at 05:40
  • Why 'not meaningful'? It's the same model with the same goodness of fit, just parameterized in a different way. – Wolfgang Oct 01 '15 at 05:44
  • @SteveS Thank you, the -1 trick was what I was looking for. However, I think I am missing something elementary - why is the intercept all 1's? – digitgopher Oct 01 '15 at 06:10
  • @digitgopher: When you run a regression and end up with a model like this: $\hat{y} = \beta_{0} + \beta_{1}*x_{1}$, you're technically ending up with a model like this: $\hat{y} = \beta_{0}*x_{0} + \beta_{1}*x_{1}$, where this new term $x_{0}$ is always equal to "1" (hence the column of ones). If you were to eliminate this column of ones when running a regular regression, you'd end up with a *biased* model, since you'd, in effect, be forcing the model through the origin. – Steve S Oct 01 '15 at 07:33
  • Basically, **R** is really statistician-friendly in that it goes out of its way to fix the user's errors (like by automatically adding an intercept term). In fact, you could even pass in a matrix with three *identical* columns and, instead of throwing back an error (which you would normally get if you tried doing linear regression manually via linear algebra, or whatever), **R** will automatically drop the two redundant columns and continue on with its calculations. – Steve S Oct 01 '15 at 07:37
  • @Wolfgang: Do the two models share the same coefficients? Nope. Do they return the same standard errors? Nuh uh. But oh, they share the same goodness of fit so they *must* be the same, right?? No, they are not. – Steve S Oct 01 '15 at 07:56
  • @SteveS: In fact R's so friendly that if you try to remove the intercept `- 1` when you have a single categorical predictor represented as a factor (as in this question), it'll assume you don't really mean that & switch to using sum-to-zero coding; which is of course just a different parametrization. Too friendly, if you ask me. – Scortchi - Reinstate Monica Oct 01 '15 at 08:56
  • @SteveS Of course the coefficients are not the same since it is a different parameterization, but the column space of the model matrices is the same and hence the fitted values and residuals are also the same. – Wolfgang Oct 01 '15 at 09:19
  • @Wolfgang: You are absolutely missing the point--you're *going out of your way* to make a point which--in the context of the given question--is entirely irrelevant. If you tell a student that both approaches are the same do you know what you'll get?? **Bad Models**. Period. – Steve S Oct 01 '15 at 09:37
  • @Scortchi, I'm not exactly sure what you mean--my **R** doesn't switch to sum-to-zero coding. Have an example in mind? – Steve S Oct 01 '15 at 11:03
  • @SteveS: Thanks. I should have checked: it switches to cell-means coding. It doesn't do what you might expect, which is fit the forced-through-the-origin model you quite rightly warn against (it will do that, though, when the column is of numeric type). – Scortchi - Reinstate Monica Oct 01 '15 at 11:33
  • Check: http://stats.stackexchange.com/questions/16921/how-to-understand-degrees-of-freedom – Tim Oct 01 '15 at 11:33
  • @SteveS I wasn't trying to make some "point" - I just tried to point out that removing the intercept and keeping all dummy variables (as illustrated in RUser4512's answer) leads to an equivalent model that is just as good as the model with an intercept and all but one of the dummy variables. As I mentioned, the column span of the model matrices is identical. And a model with all the dummy variables (and no intercept term) is quite meaningful: the coefficients are the estimated means for each factor level. That's not a "bad model". – Wolfgang Oct 01 '15 at 11:55
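
To make the comparison in the comments concrete, here is a minimal sketch with made-up data, contrasting the intercept-plus-dummies parameterization with the no-intercept (cell-means) one:

# Made-up data: a numeric response and a three-level factor
set.seed(1)
g <- factor(rep(c("A", "B", "C"), each = 10))
y <- rnorm(30, mean = c(1, 2, 3)[g])

fit1 <- lm(y ~ g)      # intercept plus K-1 dummies (treatment coding)
fit2 <- lm(y ~ g - 1)  # no intercept, one indicator per level (cell-means coding)

coef(fit1)  # (Intercept) = mean of level A; gB, gC = differences from A
coef(fit2)  # gA, gB, gC = the three estimated group means

all.equal(fitted(fit1), fitted(fit2))  # TRUE: same column space, same fitted values

The coefficients (and their standard errors) differ because the parameterizations differ, but the fitted values, residuals, and goodness of fit are identical.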

2 Answers


Consider the following:

require(mlbench)

# HouseVotes84: 1984 congressional voting records; V1 is a factor
data(HouseVotes84, package = "mlbench")
head(HouseVotes84)

# Default: an intercept column plus one dummy column per non-baseline level
labels <- model.matrix(~ V1, data = HouseVotes84)
head(labels)

# "+ 1" just makes the intercept explicit; same result as the default
labels1 <- model.matrix(~ V1 + 1, data = HouseVotes84)
head(labels1)

# "+ 0" and "- 1" both drop the intercept and keep an indicator for every level
labels0 <- model.matrix(~ V1 + 0, data = HouseVotes84)
head(labels0)

labels_1 <- model.matrix(~ V1 - 1, data = HouseVotes84)
head(labels_1)

The first two calls are identical. The last two tell `model.matrix()` not to produce the intercept and to keep a dummy column for every level of the factor.
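
For instance, comparing the column names should make the difference explicit (this assumes V1 is the two-level n/y vote factor from HouseVotes84, so there are two possible dummy columns):

colnames(labels)    # "(Intercept)" plus a dummy for the non-baseline level
colnames(labels_1)  # one indicator column per level, no intercept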

RUser4512

In statistics, when we include a factor variable with $k$ levels in a regression (with an intercept), it is converted to $k - 1$ indicator variables. We choose one level as the baseline, and then have an indicator variable for each of the remaining levels.

First let me explain why this isn't throwing away any information. Say there are levels A, B, C, and we have $I_B$ and $I_C$, the indicators for being in level B and C. An individual is in level A if and only if $I_B = 0$ (not in B) and $I_C = 0$ (not in C). So we still have kept track of the individuals in level A. This works for any number of levels.
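
A minimal sketch of this in R, using a made-up three-level factor:

# Levels A, B, C; with the default treatment coding, A is the baseline
f <- factor(c("A", "B", "C", "A", "C"))
model.matrix(~ f)
# Rows where f == "A" have 0 in both the fB and fC columns,
# so the baseline level is still fully identified.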

Now, as you note, we could code the same information with three indicator variables $I_A$, $I_B$ and $I_C$. The reason we don't is known as multicollinearity. In short, the columns of the regression matrix won't be linearly independent. This means the matrix $X^T X$ is not invertible, so we can't perform linear regression, because we need this inverse to compute the regression estimates $\hat{\beta} = \left(X^T X\right)^{-1}X^T y$.

To take a really simple example, say we have two individuals, one in each of two levels A and B, and an intercept term in our regression.

The regression matrix $X$ will be
\begin{equation}
X = \begin{bmatrix}
1 & 1 & 0 \\
1 & 0 & 1 \\
\end{bmatrix}
\end{equation}
Then
\begin{align}
X^T X & = \begin{bmatrix}
1 & 1\\
1 & 0\\
0 & 1\\
\end{bmatrix}
\begin{bmatrix}
1 & 1 & 0 \\
1 & 0 & 1 \\
\end{bmatrix}\\
& = \begin{bmatrix}
2 & 1 & 1 \\
1 & 1 & 0 \\
1 & 0 & 1\\
\end{bmatrix}
\end{align}

You can then see that adding the second and third columns of $X^T X$ gives the first column, so the inverse can't be computed.
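
You can verify this numerically in R; a small sketch reconstructing the example matrix above:

# The 2 x 3 regression matrix from the example
X <- matrix(c(1, 1, 0,
              1, 0, 1), nrow = 2, byrow = TRUE)
crossprod(X)           # X^T X, as computed above
qr(crossprod(X))$rank  # 2, not 3: rank-deficient, so no inverse exists
# solve(crossprod(X)) would fail with an "exactly singular" error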

This applies in general; the Wikipedia page on multicollinearity gives more of an explanation:

If there is an exact linear relationship (perfect multicollinearity) among the independent variables, at least one of the columns of $X$ is a linear combination of the others, and so the rank of $X$ (and therefore of $X^T X$) is less than $k+1$, and the matrix $X^T X$ will not be invertible.

It is OK to include an indicator for all levels of the factor if you don't include the intercept term, and this encodes the exact same information (the intercept column of ones minus the sum of the other indicators gives the indicator for the baseline level). But if you have more than one factor in the model, this runs into the multicollinearity problem again, so you can only include all the levels when there is a single factor variable and no intercept term in the model.
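
A small sketch of that last point, using two made-up factors:

f1 <- factor(c("A", "A", "B", "B", "C", "C"))
f2 <- factor(c("x", "y", "x", "y", "x", "y"))

# Indicator columns for *all* levels of both factors, with no intercept
X <- cbind(model.matrix(~ f1 - 1), model.matrix(~ f2 - 1))
ncol(X)     # 5 columns
qr(X)$rank  # only 4: each factor's indicators sum to a column of ones,
            # so the columns are linearly dependent

Dropping one level from one of the factors (or keeping the intercept and dropping one level from each factor) restores full column rank.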

Dan Phillips