
What is the correct way to handle ordinal variables in multiple regression? (This might seem very basic to some of you here.) I'm using an ordinal variable in a multiple OLS regression model (1-4, with 1 being best and 4 being worst). The data are already numeric, taking the values 1 to 4. I'm using R to estimate the model. Should I tell R that this is an ordinal variable, or can I run the regression with the variable as it is?

Any help would be much appreciated!

Nick Cox
Anders
  • Welcome to Cross Validated! Programming questions are off-topic here. Do you have a statistics question about what kind of model to use, though? – Dave Nov 22 '20 at 18:05
  • The answer depends on what you mean by "handle" and whether this ordinal variable is an explanatory variable or response variable in the regression. – whuber Nov 22 '20 at 18:19
  • Hi! Alright, sorry about that! In that case, I still have a statistics question: Can I use an ordinal variable that has the values 1, 2, 3, and 4 in a multiple regression, just with the numeric values? Or is there some other method I should use instead? – Anders Nov 22 '20 at 18:19
  • @whuber, I use the variable as an explanatory variable in the OLS model. – Anders Nov 22 '20 at 18:20
  • How to use ordinal data as explanatory variables in OLS is a statistics question that is on topic. I voted to reopen, but you should change your question to make clear that this is not about how to use R but about how to do OLS. No, you can most certainly not use the numeric values. You should search your books and/or the web for 'dummy coding' and 'dummy variables'. R will be a great help in building these, but first you need to understand them (which IMO is on topic here). – Bernhard Nov 23 '20 at 08:25
  • @Bernhard re your "certainly not:" see Lord (1953) [On the statistical treatment of football numbers](https://www.google.com/search?client=firefox-b-1-d&q=Lord+football+numbers) for a famous and stimulating counter opinion. – whuber Nov 23 '20 at 14:53
  • You might like to look at https://stats.stackexchange.com/questions/101511/logistic-regression-and-ordinal-independent-variables which has some useful tips even though it is about logistic regression rather than linear. – mdewey Nov 23 '20 at 14:59
  • @whuber "certainly" was a poor choice of words. However, given the context of a community member whose first thought is to disregard scale levels because they know no alternative for this rather basic problem in regression, I continue to feel certain that they should not make that decision until they have improved their knowledge of the conventional ways to address it. Given the circumstance of a closed question where nobody could write a proper answer, I was quite certain that no potential answer would start with "Yes, you can." Still, you are right and I should have commented more humbly. – Bernhard Nov 23 '20 at 15:12

1 Answer

It is hard to call any single approach "correct": the answer depends on the context. So I'll compare the options here: encoding the variable (in two equivalent ways) versus using its numeric values as they are.


Ordinal coding

Consider the ordinal coding of the level $x_i \in \{1, 2, 3, 4\}$:

$$x_i \mapsto \left[\matrix{\mathcal I_{x_i>1} & \mathcal I_{x_i>2} & \mathcal I_{x_i>3}}\right]$$

where $\mathcal I_{c}$ is an indicator variable:

$$\mathcal I_{c}=\cases{0, \quad\text{if $c$ is false} \\ 1, \quad\text{if $c$ is true}}$$

So the four possible values of $x_i$ get the following encoding:

$$\left(\matrix{1 \\ 2 \\ 3 \\ 4}\right)\rightarrow \left(\matrix{ 0 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1}\right)$$
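
As a rough illustration of this encoding (a minimal Python sketch, not part of the original answer; the function name `ordinal_code` is made up here), each level is mapped to its row of cumulative indicators:

```python
def ordinal_code(level, n_levels=4):
    """Return the indicators [level > 1, level > 2, ..., level > n_levels - 1] as 0/1."""
    return [int(level > k) for k in range(1, n_levels)]


# Print the encoding of each of the four levels.
for level in (1, 2, 3, 4):
    print(level, ordinal_code(level))
```

The printed rows match the rows of the matrix above, one per level.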

Then, our model of the expectation assumes the following form:

$$\hat y_i = \beta_0 + \beta_1 \mathcal I_{x_i>1}+ \beta_2 \mathcal I_{x_i>2}+ \beta_3 \mathcal I_{x_i>3}$$

Consider what this means for different values of $x_i$:

If $x_i = 1$, $\hat y_i = \beta_0$

If $x_i = 2$, $\hat y_i = \beta_0 + \beta_1$

If $x_i = 3$, $\hat y_i = \beta_0 + \beta_1 + \beta_2$

If $x_i = 4$, $\hat y_i = \beta_0 + \beta_1 + \beta_2 + \beta_3$

So each $\beta_j$ ($j = 1, 2, 3$) is the expected change in the response when moving from level $j$ to level $j+1$.
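
To see this interpretation numerically, here is a small Python sketch. It assumes the ordinal variable is the only predictor; in that special case the least-squares fit of this four-parameter model reproduces the mean response at each level, so the coefficients can be read off the level means. The data and function names are invented for illustration. With other covariates in the model, the coefficients would have to come from an actual regression fit.

```python
from statistics import mean

# Hypothetical toy data: (level, response) pairs, invented for illustration.
data = [(1, 10.0), (1, 12.0), (2, 15.0), (2, 17.0),
        (3, 16.0), (3, 18.0), (4, 25.0), (4, 27.0)]


def group_means(pairs):
    """Mean response at each level, ordered by level."""
    levels = sorted({z for z, _ in pairs})
    return [mean(y for z, y in pairs if z == level) for level in levels]


def ordinal_coefficients(pairs):
    """[beta_0, beta_1, beta_2, beta_3] under ordinal coding (single-predictor case)."""
    m = group_means(pairs)
    # beta_0 is the mean at level 1; each later beta is a step between adjacent levels.
    return [m[0]] + [m[j] - m[j - 1] for j in range(1, len(m))]


print(ordinal_coefficients(data))  # [11.0, 5.0, 1.0, 9.0]
```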


Dummy coding

You'll notice a dummy encoding like:

$$\left(\matrix{1 \\ 2 \\ 3 \\ 4}\right)\rightarrow \left(\matrix{ 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1}\right)$$

would fit the data just as well, but now each coefficient compares a level against the base level $1$, whose expected response is $\beta_0$. Using this encoding:

If $x_i = 1$, $\hat y_i = \beta_0$

If $x_i = 2$, $\hat y_i = \beta_0 + \beta_1$

If $x_i = 3$, $\hat y_i = \beta_0 + \beta_2$

If $x_i = 4$, $\hat y_i = \beta_0 + \beta_3$

Coefficients can be converted from one encoding to the other. Writing $\beta_j^*$ for the coefficients of the ordinal encoding from before, we have $\beta_0 = \beta_0^*$, $\beta_1 = \beta_1^*$, $\beta_2= \beta_1^*+\beta_2^*$ and $\beta_3 = \beta_1^*+\beta_2^*+\beta_3^*$.
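
The conversion is a running sum in one direction and a successive difference in the other. A minimal Python sketch (the function names are hypothetical and the numbers are made up; the intercept is the same in both encodings and is left out):

```python
from itertools import accumulate


def ordinal_to_dummy(ordinal_betas):
    """Dummy-coded slopes are the running sums of the ordinal-coded slopes."""
    return list(accumulate(ordinal_betas))


def dummy_to_ordinal(dummy_betas):
    """Ordinal-coded slopes are the successive differences of the dummy-coded slopes."""
    dummy_betas = list(dummy_betas)
    return [b - a for a, b in zip([0] + dummy_betas, dummy_betas)]


ordinal = [5.0, 1.0, 9.0]          # made-up slopes under ordinal coding
dummy = ordinal_to_dummy(ordinal)  # [5.0, 6.0, 15.0]
print(dummy, dummy_to_ordinal(dummy))
```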


No coding

Consider what happens if you keep the variable as it was.

$$\hat y_i = \beta_0 + \beta_1 x_i$$

If $x_i = 1$, $\hat y_i = \beta_0 + \beta_1$

If $x_i = 2$, $\hat y_i = \beta_0 + 2\beta_1$

If $x_i = 3$, $\hat y_i = \beta_0 + 3\beta_1$

If $x_i = 4$, $\hat y_i = \beta_0 + 4\beta_1$

By analogy, this model corresponds to an encoded model in which every one-level step in the explanatory variable changes the expected response by the same amount.

In ordinal coding, this would mean $\beta_1 = \beta_2 = \beta_3$. In dummy coding, it means $\beta_2 = 2\beta_1$ and $\beta_3 = 3\beta_1$.

Also notice that here the "base" level $1$ has expected response $\beta_0 + \beta_1$ rather than $\beta_0$. If you subtract 1 from every level, the base level's expected response becomes $\beta_0$, which makes the model more directly comparable to the other approaches. I chose to keep the levels starting at one, though, since that's the direct approach.


Using the raw numeric values is the more restrictive hypothesis: it requires the change in the expected response to be directly proportional to the difference in level.

Encoding is more flexible, in that the change between adjacent levels can, and probably will, differ from step to step. It also brings a complication that can be beneficial: the change in response can switch sign between steps. If that conflicts with what the variable is supposed to mean, the coefficients can be constrained as necessary.
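
The trade-off can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the ordinal variable is the only predictor, the data are invented, and the helper names `fit_line` and `level_means` are hypothetical. In this single-predictor case either encoding reproduces the mean response at each level, while the numeric model must place the four fitted values on one straight line.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x (single predictor)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def level_means(xs, ys):
    """Mean response at each level, which is what either encoding reproduces here."""
    return {level: mean(y for x, y in zip(xs, ys) if x == level)
            for level in sorted(set(xs))}


# Hypothetical toy data, invented for illustration.
levels = [1, 1, 2, 2, 3, 3, 4, 4]
equal_steps = [9.0, 11.0, 12.0, 14.0, 15.0, 17.0, 18.0, 20.0]     # level means 10, 13, 16, 19
unequal_steps = [10.0, 12.0, 15.0, 17.0, 16.0, 18.0, 25.0, 27.0]  # level means 11, 16, 17, 26

for name, ys in (("equal steps", equal_steps), ("unequal steps", unequal_steps)):
    b0, b1 = fit_line(levels, ys)
    fitted = {level: b0 + b1 * level for level in sorted(set(levels))}
    print(name, "| line:", fitted, "| level means:", level_means(levels, ys))
```

When the level means are equally spaced, the straight line passes through them; when they are not, the line can only approximate them, and the encoded model is the one that still matches each level mean.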

Firebug
  • This was really helpful! Thank you for the time you spent explaining this. I should most likely have reformulated my question - I know that saying something is always "the right approach" is often wrong. Again, thanks so much for the help. – Anders Nov 23 '20 at 19:21
  • @Anders glad to be of help! There's a small error in one of my formulas, I'll reformulate it soon enough. But the main point stands – Firebug Nov 23 '20 at 19:58