I realized that I'm not quite sure if I should factor all categorical variables or not?
Categorical variables and factor variables are basically the same thing. By definition a categorical variable is a factor variable.
But your questions seems to relate to the question like 'Is my numeric variable a categorical variable?'
Contrast with scalar variables
A categorical variable relates to a measurement that is not on any scale, which contrasts to measurements that have a scale. E.g. measurements like temperature, height, weight, relate to a number and different numbers can be compared to each other in terms of distance and order.
Models with such scalar variables will make use of that scale. See for instance the below graph of the mtcars dataset. It can model the relationship between fuel efficiency (mpg) and displacement (disp) in terms of a formula with only two parameters
$$\text{mgp} = 29.6 - 0.041 \cdot \text{disp}$$
For every unit $\text{disp}$ the $\text{mgp}$ is 0.041 units lower.
From https://stats.stackexchange.com/a/429867/164061

Categorical/factor variables
A categorical variable does not relate to any scale. There is no order, for instance green is not bigger or larger than yellow. There is no distance, for instance there is no definition for the distance between a policeman and a nurse. (although you might use variables like 'wavelength'/'salary' to make those categories 'color'/'job' relate in some way to some scale)
Models with categorical variables determine a parameter for each single category/factor*. So unlike the $\text{mgp} = 29.6 - 0.041 \cdot \text{disp}$ relationship where a single parameter 0.041 describes the entire relationship between $\text{mgp}$ and $\text{disp}$ for all possible values of $\text{disp}$ (which is because it can make use of the scalar property of the value), in the case of a categorical parameter more parameters must be determined (one for each category).
For instance in the case of the iris dataset we have the following relationship between sepal length (a scalar variable) and the species type (a categorical variable)
$$\begin{array}\\
\text{sepal width} = 5.01 + \begin{bmatrix} 0 \\ 0.93 \\ 1.58 \end{bmatrix}_j \cdot \text{species type} \text{} \\
\end{array}$$
Where you get a different parameter estimated for each species type. You often see those type of relations expressed as:
- $y_i = \hat{\beta}_0 + \hat{\beta}_j x_j + \epsilon_i$
or in R we formulate a formula like
y ~ parameter1 + parameter2
etc.
This might be sometimes confusing. The model is not like a linear function of parameters with scalar variables. Instead it is determining a different parameter for each category (you also see this come back in the degrees of freedom which is different for scalar vs categorical variables, because a different number of parameters are estimated)
*There is actually one less parameter then the total number of categories in a variable, because one parameter can be absorbed into the intercept
Categorical/factor variable encoded as a set of scalar variables
In a certain way you might rewrite the categorical variable as a scalar variables (but more specifically, dummy variables that only have two possible values). This way is dummy encoding.
The data table like
Petal Length Species
5.1 Iris setosa
4.9 Iris setosa
4.7 Iris setosa
4.6 Iris setosa
. .
. .
. .
7.0 Iris versicolor
6.4 Iris versicolor
6.9 Iris versicolor
5.5 Iris versicolor
. .
. .
. .
6.3 Iris virginica
5.8 Iris virginica
7.1 Iris virginica
6.3 Iris virginica
turns into
Petal Length Iris setosa Iris versicolor Iris virginica
5.1 1 0 0
4.9 1 0 0
4.7 1 0 0
4.6 1 0 0
. . . .
. . . .
. . . .
7.0 0 1 0
6.4 0 1 0
6.9 0 1 0
5.5 0 1 0
. . . .
. . . .
. . . .
6.3 0 0 1
5.8 0 0 1
7.1 0 0 1
6.3 0 0 1
And those dummy variables with values 0 or 1 could be seen as scalar variables (although with restrictions: A flower can only be value 1 in one factor and, either a flower is setosa, versicolor or virginica. The value is only 0 or 1, either a flower is setosa or it is not setosa, it can not be 0.5 setosa. But note, the class is a dichotomy either the one value or the other value, but mathematically we can use different values than 0 and 1).
Then the relationship becomes like:
$$ \Tiny{
\text{sepal width} = 5.01 + 0 \cdot \text{species setosa} + 0.93 \cdot \text{species versicolor} + 1.58 \cdot \text{species verginica} \\}
$$
Categorical variables that are a number
You might sometimes have a numeric variable and wonder whether it is a categorical variable or not.
Often this is clear.
- For instance if you use a number to encode some categories like 'category 1', 'category 2', ..., and those category numbers have no meaning as a scalar variable (there is not distance and order defined and you can just as well change the numbers with other labels) then the number is a categorical variable
(This might be tricky when reading tables/files like in R's function read.csv
, if a program encounters a number, which is ambiguous, then it is guessing whether it should be scalar vs factor and uses some default which might not be what you expect. See also in this question where an error arose because scalar/numeric variables where treated as a factor, which is because the use of cbind
on variables of different types while this can be only done with variables of the same type).
Sometimes it might be more tricky.
- For instance people might be giving a score between 0 and 5. That could almost be seen as 6 categories 'one', 'two', 'three', 'four' and 'five'. Very often such values/numbers are treated as categorical variables when there is not a clear and meaningful order and distance.
The same is true for binned variables, like age groups. It is not always so good to consider them as scalar (continuous) variables because the coarseness of the binning might destroy the functional relationship with the scalar variable (in a certain sense all scalar variables are discrete because measurements are limited but with binning this may become more extreme and less negligible)
Occasionally one might on purpose treat a scalar/number as a categorical variable.
- It may occur that you have some measurement where a particular variable is a scalar measured at a few levels. But, you do not know what sort of relationship there is. Instead of imposing some linear relationship like the above mgp vs. disp you could remain undecided and treat each level on it's own as a category (and then use plots of the means as function of the variable to observe potential relationships that you may wish to explore further in new experiments).
Ordinal variables
It might be that you have a categorical variable that is not a scalar number but does have an order. For instance a Likert-type scale with different levels like 'Strongly disagree, Disagree, Neither agree nor disagree, Agree, Strongly agree'. Or age categories '0-4 yrs, 4-18 yrs, 18-50 yrs, 50+ yrs'. For such cases you can do an ordinary model that treats them as categories, but you can impose some limitations to the parameters such that you take into account the order of the variables. For example, one may not be defining a linear relationship like $\text{mgp} = 29.6 - 0.041 \cdot \text{disp}$ where the step in $\text{mgp}$ is the same for each step in $\text{disp}$, but one could still require that the parameters for the different (ordered) categories are increasing or decreasing as function of the order of the category.