4

I would like some clarification regarding a regression specified using Wilkinson-Rogers notation to produce coefficients for all levels in a categorical variable.

Consider the regression specified by Income ~ Sex, where Sex is a categorical variable with two levels, Male and Female. This regression will produce two coefficients: one for the intercept, and one for either Male or Female, because the effect of the other level of Sex is automatically absorbed into the intercept.

My question is: what coefficients should the notation Income ~ Sex - 1, which is used to exclude the intercept term, produce? In R, which uses a 'modified form' of Wilkinson-Rogers notation, this fits coefficients to both levels of Sex. However, in Matlab, one level is still dropped, so only one coefficient is produced.
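To make the three codings concrete, here is a minimal NumPy sketch. The incomes are made-up toy numbers, the `ols` helper is hypothetical, and the design matrices are built by hand rather than by R's or Matlab's formula parsers, so this only illustrates the codings described above and is not either package's actual implementation.

```python
import numpy as np

# Hypothetical toy data: incomes for three males and three females.
sex = np.array(["Male", "Male", "Male", "Female", "Female", "Female"])
income = np.array([50.0, 54.0, 58.0, 40.0, 44.0, 48.0])

male = (sex == "Male").astype(float)
female = (sex == "Female").astype(float)
ones = np.ones_like(income)


def ols(X, y):
    """Least-squares coefficients for the design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Income ~ Sex: intercept plus one dummy (Female is the reference level).
X_treat = np.column_stack([ones, male])
b_treat = ols(X_treat, income)

# Income ~ Sex - 1 with one dummy per level (the behaviour described for R).
X_full = np.column_stack([female, male])
b_full = ols(X_full, income)

# Income ~ Sex - 1 with one level still dropped (the behaviour described for Matlab).
X_drop = male.reshape(-1, 1)
b_drop = ols(X_drop, income)

print("intercept + dummy:", b_treat)
print("two dummies, no intercept:", b_full)
print("one dummy, no intercept:", b_drop)
```

The first two designs span the same column space, so they give identical fitted values and differ only in how the coefficients are read (reference mean plus difference versus two group means). The third is a genuinely different model: with no intercept and no Female column, the fitted value for every Female observation is forced to zero.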

usεr11852
Alex
  • I guess in some way it has to do with how you code the dummy variables: http://stats.stackexchange.com/a/115052/22199. It is not clear to me whether you should always use $k$ dummies for $k$ levels if there is no intercept, or whether it is acceptable to always use $k-1$. – Alex Jul 26 '16 at 00:24
  • (I added the tag `mixed-model` because I think this very often relates to mixed models too. I think it will give your question better visibility in the future.) – usεr11852 Jul 26 '16 at 00:50

1 Answer

5

The unsatisfying answer is that there is no definitive reference on what to do in that case.

R actually tries to be somewhat smart and allows only the first dummy variable to be fully encoded. Any additional nominal variables will still be "truncated". Even more annoyingly, there is no definitive way to control which variable is fully represented with $k$ levels instead of $k-1$. Unofficially, changing the order in which you write your model (i.e. y ~ -1 + x1 + x2 or y ~ -1 + x2 + x1) does that within lm, but you are still at the whim of a developer's interpretation of W-R notation for other functions. MATLAB apparently said: "Forget it, if you want full variables just encode them yourself using dummyvar" and stopped bothering with the issue.
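As a rough illustration of why only one variable can be fully encoded, here is a NumPy sketch with two hypothetical factors. The `dummies` helper and the toy level codes are assumptions made for this example; the matrices mimic the codings described above by hand and do not call R or MATLAB.

```python
import numpy as np

# Hypothetical factors: x1 has 2 levels, x2 has 3 levels, fully crossed.
x1 = np.array([0, 0, 0, 1, 1, 1])
x2 = np.array([0, 1, 2, 0, 1, 2])


def dummies(codes, n_levels, drop_first):
    """One indicator column per level; optionally drop the first level."""
    D = (codes[:, None] == np.arange(n_levels)).astype(float)
    return D[:, 1:] if drop_first else D


# y ~ -1 + x1 + x2: x1 gets k dummies, x2 gets k-1.
X_a = np.hstack([dummies(x1, 2, False), dummies(x2, 3, True)])

# y ~ -1 + x2 + x1: now x2 is the fully encoded variable.
X_b = np.hstack([dummies(x2, 3, False), dummies(x1, 2, True)])

# Full dummies for both factors: the columns are linearly dependent.
X_both = np.hstack([dummies(x1, 2, False), dummies(x2, 3, False)])

rank = np.linalg.matrix_rank
print("x1 full, x2 truncated:", X_a.shape[1], "columns, rank", rank(X_a))
print("x2 full, x1 truncated:", X_b.shape[1], "columns, rank", rank(X_b))
print("both full:", X_both.shape[1], "columns, rank", rank(X_both))
print("rank of the two orderings stacked:", rank(np.hstack([X_a, X_b])))
```

Encoding both factors fully adds a column without adding rank, because the dummies of each factor sum to the same constant column. The two orderings span the same column space, so swapping them changes which coefficients are reported but not the fitted values.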

usεr11852
  • thanks, this is a helpful insight into the workings of matlab's regression functionality. So, I am guessing from your answer that the W-R paper doesn't really specify what `Y ~ X - 1` means if `X` is a categorical variable... I had a read but couldn't see anything. – Alex Jul 26 '16 at 01:00
  • No, it does not. Furthermore, it does not define the difference between `y ~ -1 + x` and `y ~ 0 + x`, which is a bit of a bummer... In fairness, this was supposed to be used in a particular package, GENSTAT, and maybe it was fine for that purpose. It was not designed with the idea of a universal statistical modelling language. I do not have access to the paper at the moment, but off the top of my head I think the use of `I()` is not defined either, and this is why MATLAB did not support it. – usεr11852 Jul 26 '16 at 01:06
  • @TomLane might want to weigh in on this too. – usεr11852 Jul 26 '16 at 01:18