0

I have several independent variables. One of the independent variables (I'm calling Var1) has names of different sites (named A,B,C,D,E). This same variable has been coded as 1,2,3,4 and 5 (which I will call Var2, but is exactly the same as Var1 just in a different format). R (using str(data)) recognizes the independent variable Var1 as "Factor" but identifies Var2 as "int". When I use these variables in two separate models (of course), I get two different results. Why is that? Also, do I need to specify that Var2 is a factor before running the GLM?

EleMan
  • I am voting to leave this open because a) It already has a useful answer and b) The same issue could arise in other software packages. E.g. in SAS, if you have a variable that is coded with integers you will get different results if you put it on a CLASS statement. – Peter Flom Aug 06 '19 at 15:51
  • pretty sure it's a duplicate (aside possibly from the GLM part which is of no impact). Can't check now; will try later – Glen_b Aug 06 '19 at 16:17
  • Thank you. I notice that SPSS gives me slightly different results from R using the same data @Peter Flom – EleMan Aug 06 '19 at 16:51
  • Yes, I'm sure it's a duplicate too. So far searches have only turned up the closely-related https://stats.stackexchange.com/questions/120711 and https://stats.stackexchange.com/questions/260073. – whuber Aug 06 '19 at 20:35

3 Answers

3

R doesn't know that your variable is nominal (which I assume it should be, if it denotes sites). More precisely, it knows this for Var1: because the values are letters, the variable has type character, which is coerced to a factor when it is put into a regression model. This is the correct approach. For Var2, however, R treats it as an ordinary quantitative variable measured on a numeric scale (just as you correctly observed, it has type int).

Both models can be estimated, but the second one is meaningless: its coefficient pertains to "site being 1 unit higher, all else equal". Var2 can indeed be one unit higher, but the site can't. In other words, using Var2 as a number implies that the sites are ordered and equally spaced (neither of which is true, at least not by the nature of the variable, as R assumes).
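To make the difference concrete, here is a minimal sketch on simulated data. The Poisson family, the site means and the sample size are assumptions made up for illustration; they are not taken from the question.

```r
# Simulated data: site names, means and sample size are illustrative assumptions.
set.seed(1)
site_names <- c("A", "B", "C", "D", "E")
Var1 <- factor(sample(site_names, 200, replace = TRUE), levels = site_names)
Var2 <- as.integer(Var1)                          # same information, coded 1..5
y <- rpois(200, lambda = c(2, 6, 3, 8, 4)[Var2])  # site means with no linear ordering

fit_factor <- glm(y ~ Var1, family = poisson)  # intercept + one contrast per non-reference site
fit_int    <- glm(y ~ Var2, family = poisson)  # intercept + a single slope for "site code"

length(coef(fit_factor))  # 5
length(coef(fit_int))     # 2
```

The factor model estimates a separate effect for each site relative to the reference, while the integer model forces all five sites onto one straight line (on the link scale), which is why the two outputs differ.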

Solution: you can keep the format of Var2, but in that case you have to tell R that it is a factor! E.g. running `Var2 <- as.factor(Var2)` before the regression will solve your problem. However, you might wish to change the reference level (you can use `relevel` to do this, after the variable has been declared a factor).
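A short sketch of the conversion and releveling steps, using toy vectors invented for illustration. Note that `relevel` returns a new factor, so the result has to be assigned back, and that passing the level name as a string avoids any ambiguity between a level's label and its position.

```r
# Toy numeric site codes (made up for illustration)
Var2 <- c(1, 2, 2, 3, 2, 1, 3, 2)
Var2 <- as.factor(Var2)            # now treated as categorical
levels(Var2)                       # "1" "2" "3": the first level is the reference

Var2 <- relevel(Var2, ref = "2")   # assign the result back, or nothing changes
levels(Var2)                       # "2" "1" "3"

# The same works for a factor with site names
Var1 <- factor(c("A", "B", "C", "C", "B"))
Var1 <- relevel(Var1, ref = "C")
levels(Var1)                       # "C" "A" "B"
```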

Tamas Ferenci
  • Thank you @Tamas Ferenci for this very clear explanation that makes a lot more sense to me. I do have two more questions: 1 - Can I still use Var1 as R recognizes Var1 as a factor or not? 2 - Could you please clarify for me your last sentence about changing the reference level, as in why I would need to do that and how do I decide which one it needs to be changed to? Many thanks - very much appreciate your help! – EleMan Aug 06 '19 at 14:41
  • @EleMan 1) Yes, you can use `Var1`. It is a character rather than a factor, but the character type is automatically converted to a factor when put into a regression model. 2) Let's say you have 10 patients from site 1, 100 from site 2 and 10 from site 3. Statistically, it is best to use the level with the highest frequency (the modal category) as the reference, in this case 2. Or 2 is, for some subject-area reason, the one to which you want to compare everything else. The bottom line is that you want 2 to be the reference category. ... – Tamas Ferenci Aug 06 '19 at 15:02
  • ...however, simply using `as.factor(Var2)` won't give you this, because it'll take the lowest value as the reference level, 1 in this case. So you'll need `relevel(Var2,ref=2)` in this case to set the reference category correctly. – Tamas Ferenci Aug 06 '19 at 15:05
  • Great, thank you again for your explanation. I will try that. – EleMan Aug 06 '19 at 15:10
  • Tamas Ferenci and @Peter so I just noticed that by using Var1 (names of sites), R still automatically takes the site that begins with an 'A' as the reference category. I did try to use `relevel(Var1,ref=C)` as I want site C to be the reference category, but that did not work. It still took the site with the name beginning with A. Is there a way around this as opposed to using Var2 and specifying a ref=3 category? Thanks. – EleMan Aug 07 '19 at 13:47
  • `relevel(Var1,ref=C)` should be `relevel(Var1,ref="C")` (note the quotation mark). Did you actually replace the variable (i.e. have you run `Var 1 – Tamas Ferenci Aug 07 '19 at 13:50
  • perfect!!!! That worked exactly the way I think it should have worked. Thank you so much. That really helped. One last question here: How do I know whether the model overall is significant or not, like SPSS gives a separate overall model significance? Thank you! – EleMan Aug 07 '19 at 14:04
  • You're welcome. `anova(fit)` gives you the global test (but `summary(fit)` also displays it). – Tamas Ferenci Aug 07 '19 at 14:21
  • :( sorry, I just tried that, and googled it, found somewhere that I needed a 'fit' package and installed it, but I get error messages saying" > summary(fit) Error in summary(fit) : object 'fit' not found > annova(fit) Error in annova(fit) : could not find function "annova" – EleMan Aug 07 '19 at 14:29
  • `fit` is simply the name of the regression object, sorry, I forget to write it down. `fit – Tamas Ferenci Aug 07 '19 at 14:30
  • Cheers!!! Thank you so much! You're brilliant! :) – EleMan Aug 07 '19 at 14:32
  • Sorry, I'm just reopening this but looking at this answer: it seems like the model overall significance is given by 1-pchisq(Null deviance -Residual deviance, Null deviance degrees of freedom - Residual deviance degrees of freedom). Does that seem right, as I don't get model significance overall using anova() or summary(). – EleMan Aug 10 '19 at 05:00
  • @EleMan Sorry. You're right, I mixed `glm` and `lm`. You can find the answer for `glm` [here](https://stats.stackexchange.com/questions/129958/glm-in-r-which-pvalue-represents-the-goodness-of-fit-of-entire-model). – Tamas Ferenci Aug 13 '19 at 00:21
  • No worries, thank you for this! – EleMan Aug 14 '19 at 14:35
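The overall test discussed in the comments above can be sketched as follows. The data are simulated and the binomial family is an assumption chosen for illustration; `fit` and `null` are just names for the fitted and intercept-only model objects. The manual calculation follows the formula quoted in the comments (difference in deviances, referred to a chi-squared distribution with the difference in degrees of freedom).

```r
# Simulated data (illustrative assumption, not the asker's data)
set.seed(2)
site <- factor(sample(c("A", "B", "C"), 150, replace = TRUE), levels = c("A", "B", "C"))
y <- rbinom(150, size = 1, prob = c(0.2, 0.5, 0.8)[as.integer(site)])

fit  <- glm(y ~ site, family = binomial)  # model of interest
null <- glm(y ~ 1,    family = binomial)  # intercept-only model

# Likelihood ratio test by hand, from the deviances stored in the fit
lr_stat  <- fit$null.deviance - fit$deviance
lr_df    <- fit$df.null - fit$df.residual
p_manual <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# The same comparison via anova() on the two nested models
lrt     <- anova(null, fit, test = "Chisq")
p_anova <- lrt[2, "Pr(>Chi)"]
```

Here `pchisq(..., lower.tail = FALSE)` is equivalent to `1 - pchisq(...)` but is numerically more stable for very small p-values.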
1

R will treat Var2 as numbers and do a regression on the values. Since they are not numbers but categories, you will need to tell it that Var2 is a factor before fitting your model.

mkt
0

The question is about R, but the same thing can happen with other packages. This is one reason to try not to code categorical variables with numeric codes - it eliminates one very easy way to carelessly mess up. (I am always eager to prevent my own carelessness or that of other people).

Peter Flom
  • I see...yeah, so you're suggesting that I just leave var1 as site "names" rather than code the sites and then treat those values as factors? Thank you for enlightening me! – EleMan Aug 06 '19 at 16:42
  • Yes. Less opportunity to do something silly. – Peter Flom Aug 07 '19 at 11:53