Why, or why not, factor categorical variables in regression modeling?

Question

I'm currently in the midst of running several logistic regression models to test for effect modification (i.e., testing interaction terms) between two categorical variables (sex and age as a categorical variable).

I realized that I'm not quite sure whether I should factor all categorical variables. It seems reasonable that a categorical variable should be made into a factor rather than left as an integer, but I don't fully understand the potential implications of factoring versus not factoring. I assume that factoring is a common term across all languages, but I'm referring to R here.

If anyone could add some mathematical clarity it would be greatly appreciated.

Notably, I referenced logistic regression but I assume the implications would be similar across other distributions/links. Also, I played around with the model before posting and it didn't make much of a difference (save for interpretation if I left age category numeric) but I'm sure this is not always the case.

How many levels do your categorical variables have? When it is only two levels then the difference between categorical/scalar does not matter. — Sextus Empiricus, Jun 02 '20 at 16:47

Ryan Volpi · Answer 1 · 2020-06-02T03:38:59.357

2

Assuming your categorical features are stored as numbers, R will treat the values as interval data, which means that 3>2>1 and 1+2=3. If 1 represents "male", 2 represents "female", and 3 represents "not specified", then you can see that thinking of the variable as numeric makes no sense. If R identifies a coefficient to represent the effect of gender, then the difference in the effect between "not specified" and "male" will be twice the size of the effect between "female" and "male". That is not what you want in that case. When you make gender a factor, R creates dummy variables that represent each of the possible states, "male", "female", and "not specified" and individually estimates a coefficient for the effect of each. This is what you want.
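A minimal sketch of the difference, using simulated data rather than anything from the question. The coding (1 = male, 2 = female, 3 = not specified), the sample size, and the true log-odds are all assumptions chosen for illustration.

```r
set.seed(1)
n <- 300

# Hypothetical coding: 1 = male, 2 = female, 3 = not specified
gender <- sample(1:3, n, replace = TRUE)

# Simulated outcome whose log-odds are NOT equally spaced across the codes
true_logit <- c(-1, 1, -0.5)[gender]
y <- rbinom(n, 1, plogis(true_logit))

# Numeric coding: a single slope, so level 3 vs 1 is forced to be twice level 2 vs 1
fit_num <- glm(y ~ gender, family = binomial)

# Factor coding: one coefficient per non-reference level
fit_fac <- glm(y ~ factor(gender), family = binomial)

coef(fit_num)  # intercept + one slope
coef(fit_fac)  # intercept + two separate level effects
```

With the numeric coding the model has only two parameters, so it cannot represent a pattern in which "female" differs from both other groups in the same direction.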

Some notes:

1. If you only have two levels to your variable (e.g. you only have male and female), then turning the variable into a factor will not make any difference in performance or predictions versus representing the variable as a number. However, if you aren't using 0 and 1 to represent the two categorical levels, then the model coefficients will be harder to interpret. Thank you for the comment below pointing this out.
2. Making a variable into a factor treats it as a nominal feature, which means the options are not considered to be ordered in any way. Age group is ordinal, which means the order matters but the differences between options are somewhat arbitrary. For an ordinal variable, it is occasionally better to represent the different values as integers that preserve the original order. I imagine there are other ways to deal with ordinal features as well. Converting them to factors may very well be the best option, however, especially if you have a lot of data and not many distinct values for age range.
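The two-level claim can be checked with a small sketch. The data are simulated, and the codes 1 and 2 for the two sexes are an assumption made for illustration.

```r
set.seed(2)
n <- 200

# Hypothetical codes: 1 = male, 2 = female
sex <- sample(c(1, 2), n, replace = TRUE)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * (sex == 2)))

fit_num <- glm(y ~ sex, family = binomial)          # sex treated as a number
fit_fac <- glm(y ~ factor(sex), family = binomial)  # sex treated as a factor

# Fitted probabilities agree (up to the convergence tolerance of glm)
same_fit <- isTRUE(all.equal(unname(fitted(fit_num)),
                             unname(fitted(fit_fac)),
                             tolerance = 1e-6))
same_fit

# The slope matches because the codes differ by 1, but the intercepts differ:
# the numeric model's intercept refers to the non-existent value sex = 0
coef(fit_num)
coef(fit_fac)
```

This only holds for a single two-level variable; with three or more levels the numeric and factor fits generally differ.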

edited Jun 02 '20 at 03:38

answered Jun 02 '20 at 03:13

Ryan Volpi

1) is not true. It will make a difference if you use integers rather than factors. If the features are categorical, they should be converted to factors no matter how many levels there are. If you leave them as integers, then their numerical values take on meaning which you don't want. – mlofton Jun 02 '20 at 03:22
My mistake, @mlofton is right in the sense that it will affect the coefficient estimates, and would affect interpretability. However, I maintain that using other numbers will not change the predictions and therefore will not have a performance impact. – Ryan Volpi Jun 02 '20 at 03:27
Hi Both, I think that was my concern, @mlofton. If not turned to factors, I was worried that numeric data would cause problems. So, rule of thumb is always factor, regardless? (and order if necessary). – Brennan Beal Jun 02 '20 at 15:48
@Brennan Beal: Unfortunately, I don't have time to look at below at the moment so maybe Ryan can help you with that. But, generally speaking, if you want to think of your variables as categories in the sense that the numbers don't have any other meaning except to differentiate between the categories, then definitely code them as factors. OTOH, if you want to use the numbers in the sense that you think the numbers mean something (for example, say you had temperature or rainfall levels), then don't make them factors. I hope that helps a little. – mlofton Jun 03 '20 at 05:26
@Ryan Volpi: Thanks for confirming. Note that I think you are correct that if you just have one variable and two levels and no interactions with other variables, then you can get away with coding the zero and one as numerical. But that confuses the issue IMHO so, better, at least for a beginner, to even code the 0 and 1 in that case as factors. Also, I'm not clear on what won't change predictions and not have a performance impact but, if you don't think it's that important, then don't worry about it. Thanks again. – mlofton Jun 03 '20 at 05:30
@Brennan Beal: I was mistaken in that below is Sextus Empiricus comments-answer rather than your follow up question so hopefully that is satisfactory. It definitely looks nice and thorough at a glance. Sextus Empiricus: Thanks for nice answer. – mlofton Jun 03 '20 at 05:33
@RyanVolpi and all: Thanks for your responses! Sincerely appreciate the time. – Brennan Beal Jun 03 '20 at 16:29

score 1 · Accepted Answer · edited Jun 11 '20 at 14:32

I realized that I'm not quite sure if I should factor all categorical variables or not?

Categorical variables and factor variables are basically the same thing. By definition a categorical variable is a factor variable.

But your question seems to be closer to 'Is my numeric variable a categorical variable?'

Contrast with scalar variables

A categorical variable relates to a measurement that is not on any scale, which contrasts to measurements that have a scale. E.g. measurements like temperature, height, weight, relate to a number and different numbers can be compared to each other in terms of distance and order.

Models with such scalar variables will make use of that scale. See for instance the below graph of the mtcars dataset. It can model the relationship between fuel efficiency (mpg) and displacement (disp) in terms of a formula with only two parameters

$$\text{mpg} = 29.6 - 0.041 \cdot \text{disp}$$

For every additional unit of $\text{disp}$, the $\text{mpg}$ is 0.041 units lower.

From https://stats.stackexchange.com/a/429867/164061
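The two parameters quoted above can be reproduced from the built-in `mtcars` data. This is a minimal sketch; the object name `fit_disp` is only illustrative.

```r
# Simple linear regression of fuel efficiency on displacement
fit_disp <- lm(mpg ~ disp, data = mtcars)

# Two parameters: an intercept and a single slope that applies to every disp value
round(coef(fit_disp), 3)
```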

Categorical/factor variables

A categorical variable does not relate to any scale. There is no order, for instance green is not bigger or larger than yellow. There is no distance, for instance there is no definition for the distance between a policeman and a nurse. (although you might use variables like 'wavelength'/'salary' to make those categories 'color'/'job' relate in some way to some scale)

Models with categorical variables determine a parameter for each single category/factor*. So unlike the $\text{mpg} = 29.6 - 0.041 \cdot \text{disp}$ relationship, where a single parameter 0.041 describes the entire relationship between $\text{mpg}$ and $\text{disp}$ for all possible values of $\text{disp}$ (which is possible because it can make use of the scalar property of the value), in the case of a categorical variable more parameters must be determined (one for each category).

For instance in the case of the iris dataset we have the following relationship between sepal length (a scalar variable) and the species type (a categorical variable)

$$\text{sepal length} = 5.01 + \begin{cases} 0 & \text{if species is setosa} \\ 0.93 & \text{if species is versicolor} \\ 1.58 & \text{if species is virginica} \end{cases}$$

Here a different parameter is estimated for each species type. You often see this type of relation expressed as:

$y_i = \hat{\beta}_0 + \hat{\beta}_j x_j + \epsilon_i$

or in R we formulate a formula like

y ~ parameter1 + parameter2 etc.

This might be sometimes confusing. The model is not like a linear function of parameters with scalar variables. Instead it is determining a different parameter for each category (you also see this come back in the degrees of freedom which is different for scalar vs categorical variables, because a different number of parameters are estimated)

*There is actually one fewer parameter than the total number of categories in a variable, because one parameter can be absorbed into the intercept.
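As a sketch of the above, the iris coefficients quoted earlier (5.01, 0.93, 1.58) match a fit of `Sepal.Length` on `Species` with R's default treatment contrasts, where setosa is the reference level absorbed into the intercept.

```r
# Species is already a factor in the built-in iris data
fit_species <- lm(Sepal.Length ~ Species, data = iris)

# Intercept = mean of the reference level (setosa);
# the other coefficients are differences from that reference
round(coef(fit_species), 2)

# Group means, for comparison with intercept and intercept + coefficient
tapply(iris$Sepal.Length, iris$Species, mean)

# Three categories use 2 degrees of freedom (one fewer than the number of levels)
anova(fit_species)
```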

Categorical/factor variable encoded as a set of scalar variables

In a certain way you can rewrite the categorical variable as a set of scalar variables (more specifically, dummy variables that only have two possible values). This is called dummy encoding.

The data table like

Sepal Length     Species
5.1              Iris setosa
4.9              Iris setosa
4.7              Iris setosa
4.6              Iris setosa
 .                    .
 .                    .
 .                    .
7.0              Iris versicolor
6.4              Iris versicolor
6.9              Iris versicolor
5.5              Iris versicolor
 .                    .
 .                    .
 .                    .
6.3              Iris virginica
5.8              Iris virginica
7.1              Iris virginica
6.3              Iris virginica

turns into

Sepal Length     Iris setosa      Iris versicolor     Iris virginica
5.1              1                0                   0
4.9              1                0                   0
4.7              1                0                   0
4.6              1                0                   0
 .               .                .                   .
 .               .                .                   .
 .               .                .                   .
7.0              0                1                   0
6.4              0                1                   0
6.9              0                1                   0
5.5              0                1                   0
 .               .                .                   .
 .               .                .                   .
 .               .                .                   .
6.3              0                0                   1
5.8              0                0                   1
7.1              0                0                   1
6.3              0                0                   1

Those dummy variables with values 0 or 1 can be treated as scalar variables, although with restrictions. Each flower has the value 1 in exactly one of the columns, since it is either setosa, versicolor, or virginica. The value is only 0 or 1: a flower is setosa or it is not, and it cannot be 0.5 setosa. (The class remains a dichotomy, but mathematically we could encode it with two values other than 0 and 1.)

Then the relationship becomes like:

$$\text{sepal length} = 5.01 + 0 \cdot \text{species setosa} + 0.93 \cdot \text{species versicolor} + 1.58 \cdot \text{species virginica}$$
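A short sketch of the dummy encoding in R, using the built-in `iris` data. The data frame `d` and its column names are hypothetical helpers for illustration; `model.matrix` shows the encoding that R builds internally for a factor.

```r
# Full one-hot encoding: one 0/1 column per species, exactly one 1 per row
onehot <- model.matrix(~ Species - 1, data = iris)
head(onehot, 3)

# Hand-made dummies, leaving out setosa as the reference category
d <- data.frame(
  Sepal.Length = iris$Sepal.Length,
  versicolor   = as.numeric(iris$Species == "versicolor"),
  virginica    = as.numeric(iris$Species == "virginica")
)

fit_dummy  <- lm(Sepal.Length ~ versicolor + virginica, data = d)
fit_factor <- lm(Sepal.Length ~ Species, data = iris)

# Same estimates either way; the factor just automates the dummy columns
cbind(dummy = coef(fit_dummy), factor = coef(fit_factor))
```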

Categorical variables that are a number

You might sometimes have a numeric variable and wonder whether it is a categorical variable or not.

Often this is clear.

For instance, if you use a number to encode categories like 'category 1', 'category 2', ..., and those category numbers have no meaning as a scalar variable (there is no distance or order defined, and you could just as well replace the numbers with other labels), then the number is a categorical variable.

(This can be tricky when reading tables/files, as with R's function read.csv. If a program encounters a number, which is ambiguous, it has to guess whether it should be a scalar or a factor, and it uses some default that might not be what you expect. See also in this question where an error arose because scalar/numeric variables were treated as a factor; this happened because cbind was used on variables of different types, while it can only combine variables of the same type.)
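The cbind pitfall can be illustrated with a small sketch. The vectors `x` and `g` are made-up values, and `stringsAsFactors = TRUE` is set explicitly here to mimic the default behaviour of R versions before 4.0.

```r
x <- c(1.5, 2.5, 3.5)   # numeric measurements
g <- c("a", "b", "c")   # character labels

# cbind builds a matrix, which holds a single type: everything becomes character
m <- cbind(x, g)
is.character(m)

# Converting that matrix to a data frame can turn the numbers into a factor
d_bad <- as.data.frame(m, stringsAsFactors = TRUE)
is.factor(d_bad$x)

# as.numeric on a factor returns the level codes, not the original values
as.numeric(d_bad$x)
as.numeric(as.character(d_bad$x))  # recovers the values

# data.frame keeps each column's own type
d_ok <- data.frame(x, g)
is.numeric(d_ok$x)
```

Checking `str()` of a data frame before modeling is a cheap way to catch this kind of silent type change.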

Sometimes it might be more tricky.

For instance, people might be giving a score between 0 and 5. That could almost be seen as 6 categories: 'zero', 'one', 'two', 'three', 'four' and 'five'. Very often such values/numbers are treated as categorical variables when there is not a clear and meaningful order and distance.

The same is true for binned variables, like age groups. It is not always so good to consider them as scalar (continuous) variables because the coarseness of the binning might destroy the functional relationship with the scalar variable (in a certain sense all scalar variables are discrete because measurements are limited but with binning this may become more extreme and less negligible)

Occasionally one might on purpose treat a scalar/number as a categorical variable.

It may occur that you have some measurement where a particular variable is a scalar measured at a few levels, but you do not know what sort of relationship there is. Instead of imposing a linear relationship like the above mpg vs. disp, you could remain undecided and treat each level on its own as a category (and then use plots of the means as a function of the variable to observe potential relationships that you may wish to explore further in new experiments).
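As a sketch of this idea, the number of cylinders in `mtcars` is a number measured at only three levels (4, 6, 8). It is used here purely as an illustration; it is not an example from the original answer.

```r
# Scalar treatment: one slope, the step from 4 to 6 equals the step from 6 to 8
fit_lin <- lm(mpg ~ cyl, data = mtcars)

# Categorical treatment: a separate mean for each level, no shape imposed
fit_cat <- lm(mpg ~ factor(cyl), data = mtcars)

# Level means that could be plotted against cyl to inspect the relationship
tapply(mtcars$mpg, mtcars$cyl, mean)

# The linear model is nested in the categorical one, so they can be compared
anova(fit_lin, fit_cat)
```

If the level means lie close to a straight line, the two fits are nearly the same and the extra parameter buys little.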

Ordinal variables

It might be that you have a categorical variable that is not a scalar number but does have an order. For instance a Likert-type scale with different levels like 'Strongly disagree, Disagree, Neither agree nor disagree, Agree, Strongly agree'. Or age categories '0-4 yrs, 4-18 yrs, 18-50 yrs, 50+ yrs'. For such cases you can fit an ordinary model that treats them as categories, but you can impose some restrictions on the parameters so that you take the order of the categories into account. For example, one may not be defining a linear relationship like $\text{mpg} = 29.6 - 0.041 \cdot \text{disp}$, where the step in $\text{mpg}$ is the same for each step in $\text{disp}$, but one could still require that the parameters for the different (ordered) categories are increasing or decreasing as a function of the order of the category.
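A hedged sketch of how R handles ordered factors, on simulated data with the age categories above (the group means and sample size are assumptions). Note that this does not impose the monotonicity restriction described in the paragraph: by default `lm` only re-parametrizes an ordered factor with polynomial contrasts, and the fitted values are the same as for an unordered factor.

```r
set.seed(3)
age_levels <- c("0-4", "4-18", "18-50", "50+")

# Simulated age groups and an outcome that increases with the group
age <- factor(sample(age_levels, 200, replace = TRUE), levels = age_levels)
y <- rnorm(200, mean = c(1, 2, 2.5, 4)[as.integer(age)])

age_ord <- as.ordered(age)   # same levels, now flagged as ordered

fit_nom <- lm(y ~ age)       # treatment contrasts: differences from "0-4"
fit_ord <- lm(y ~ age_ord)   # polynomial contrasts: linear, quadratic, cubic trend

# Same fitted values, different parametrization
same_fit <- isTRUE(all.equal(unname(fitted(fit_nom)), unname(fitted(fit_ord))))
same_fit

names(coef(fit_ord))
```

The linear (`.L`) term gives a quick look at whether there is an overall trend across the ordered levels; a truly order-restricted fit would need a different method.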

Hey thanks for the lengthy, and thorough response! This is great information for readers but the question I was trying to ask was should I convert numeric variables, which represent categorical variables, to factors every time or are there exceptions. My numeric variable is ordinal, for the record, as it represents binned age. — Brennan Beal, Jun 02 '20 at 15:53
@BrennanBeal So this is more like a programming question (how to define variables) and not about statistics? If you want your number to be a categorical variable, then yes you better make sure that it is used as a categorical variable. In `R` you can convert a numeric variable to a factor by using `as.factor(x)`. But yes there are exceptions; Some functions do this automatically and also when your variable has the character type then often it is interpreted as a factor. — Sextus Empiricus, Jun 02 '20 at 16:02
Right, so the question (which I did not make super clear - sorry!) was what are the mathematical implications of not doing so. I played around and found none but I'm sure that is a coincidence. — Brennan Beal, Jun 02 '20 at 16:04
The implication is that your variable is wrongly interpreted as a scalar and that you will be doing an ANCOVA instead of ANOVA (in other words you fit it as a scalar variable, which it isn't). — Sextus Empiricus, Jun 02 '20 at 16:07
So instead of estimating a *different/separate* coefficient per each category value you only estimate a *single* coefficient (and the different estimates for the different categories are created by the multiplication of the scalar value with that coefficient). If the estimates of coefficients for the different categories happen to be on a single line then the error with the scalar will not have much effect. — Sextus Empiricus, Jun 02 '20 at 16:13
That, above, is what I was looking for. That makes sense, and was what I had intuitively had in mind. The separation of ANCOVA vs ANOVA makes sense! Thank you for your time. Was much appreciated! — Brennan Beal, Jun 03 '20 at 16:28