How to use ordinal data as explanatory variables in OLS

Question

What is the correct way to handle ordinal variables in multiple regression? (This might seem very basic to some of you here.) I'm using an ordinal variable in a multiple OLS regression model (1-4, with 1 being best and 4 being worst). The data are already numeric, taking the values 1 to 4. I'm using R to estimate the model. Should I tell R that this is an ordinal variable, or can I run the regression with the variable as it is?

Any help would be much appreciated!

Welcome to Cross Validated! Programming questions are off-topic here. Do you have a statistics question about what kind of model to use, though? — Dave, Nov 22 '20 at 18:05
The answer depends on what you mean by "handle" and whether this ordinal variable is an explanatory variable or response variable in the regression. — whuber, Nov 22 '20 at 18:19
Hi! Alright, sorry about that! In that case, I still have a statistics question: Can I use an ordinal variable that has the values 1, 2, 3, and 4 in a multiple regression, just with the numeric values? Or is there some other method I should use instead? — Anders, Nov 22 '20 at 18:19
@whuber, I use the variable as an explanatory variable in the OLS model. — Anders, Nov 22 '20 at 18:20
How to use ordinal data as explanatory variables in OLS is a statistics question that is on topic. I voted to reopen, but you should change your question to make clear that this is not about how to use R but about how to do OLS. No, you can most certainly not use the numeric values. You should search your books and/or the web for 'dummy coding' and 'dummy variables'. R will be a great help in building these, but first you need to understand them (which IMO is on topic here) — Bernhard, Nov 23 '20 at 08:25
@Bernhard re your "certainly not:" see Lord (1953) [On the statistical treatment of football numbers](https://www.google.com/search?client=firefox-b-1-d&q=Lord+football+numbers) for a famous and stimulating counter opinion. — whuber, Nov 23 '20 at 14:53
You might like to look at https://stats.stackexchange.com/questions/101511/logistic-regression-and-ordinal-independent-variables which has some useful tips even though it is about logistic regression rather than linear. — mdewey, Nov 23 '20 at 14:59
@whuber "certainly" was a poor choice of words. However, given the context of a community member whose first thought is to disregard levels of measurement because they know no alternative for this rather basic problem in regression, I remain certain that they should not make that decision until they have improved their knowledge of the conventional ways to address it. Given the circumstance of a closed question where nobody could write a proper answer, I was quite certain that no potential answer would start with "Yes, you can." Still, you are right and I should have commented more humbly. — Bernhard, Nov 23 '20 at 15:12

Firebug · Accepted Answer · 2020-11-23T20:00:33.770

Calling any single approach "correct" is hard, because the answer depends on the context. So I'll compare the main options here.

Ordinal coding

Consider the ordinal coding of the level $x_i$:

$$x_i \mapsto \left[\matrix{\mathcal I_{x_i>1} & \mathcal I_{x_i>2} & \mathcal I_{x_i>3}}\right]$$

where $\mathcal I_{c}$ is an indicator variable that takes the value:

$$\mathcal I_{c}=\cases{0, \quad\text{if $c$ is false} \\ 1, \quad\text{if $c$ is true}}$$

So for the four possible values of $x_i$ we get the following encoding:

$$\left(\matrix{1 \\ 2 \\ 3 \\ 4}\right)\rightarrow \left(\matrix{ 0 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1}\right)$$
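The mapping above can be sketched in a few lines of Python. This is only an illustration: the function name `ordinal_encode` and the `n_levels` parameter are made up here, and in R you would normally let a factor with suitable contrasts build these columns for you.

```python
def ordinal_encode(level, n_levels=4):
    """Return the 0/1 indicators [level > 1, level > 2, ..., level > n_levels - 1]."""
    if not 1 <= level <= n_levels:
        raise ValueError(f"level must be in 1..{n_levels}, got {level}")
    return [int(level > k) for k in range(1, n_levels)]


# Print one encoded row per level of the 1-4 scale.
for level in range(1, 5):
    print(level, ordinal_encode(level))
```

Each row has one more leading 1 than the previous one, which is what makes every coefficient a step relative to the level just below.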

Then, our model of the expectation assumes the following form:

$$\hat y_i = \beta_0 + \beta_1 \mathcal I_{x_i>1}+ \beta_2 \mathcal I_{x_i>2}+ \beta_3 \mathcal I_{x_i>3}$$

Consider what this means for different values of $x_i$:

If $x_i = 1$, $\hat y_i = \beta_0$

If $x_i = 2$, $\hat y_i = \beta_0 + \beta_1$

If $x_i = 3$, $\hat y_i = \beta_0 + \beta_1 + \beta_2$

If $x_i = 4$, $\hat y_i = \beta_0 + \beta_1 + \beta_2 + \beta_3$

So each $\beta_j$ (for $j = 1, 2, 3$) represents the expected change in the response when moving from level $j$ to level $j+1$, that is, relative to the previous level.
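To see the "step from the previous level" interpretation numerically, here is a minimal sketch under a strong simplifying assumption: the ordinal variable is the only predictor. In that case least squares reproduces the mean response at each level, so the coefficients can be read off the level means without a matrix solver. With other covariates in the model (as in the question) this shortcut no longer applies and a proper OLS fit is needed. The data and function names are invented for illustration.

```python
from statistics import mean

# Hypothetical toy data: (level, response) pairs, invented for illustration.
data = [(1, 10.0), (1, 12.0), (2, 9.0), (2, 7.0),
        (3, 8.0), (3, 6.0), (4, 3.0), (4, 5.0)]


def level_means(pairs, n_levels=4):
    """Mean response at each level 1..n_levels."""
    return [mean(y for x, y in pairs if x == k) for k in range(1, n_levels + 1)]


def ordinal_coefficients(pairs, n_levels=4):
    """[beta_0, beta_1, ..., beta_{n_levels-1}] under ordinal coding.

    Valid only when the ordinal variable is the sole predictor: beta_0 is the
    level-1 mean and each later beta_j is the difference between the means of
    level j+1 and level j.
    """
    m = level_means(pairs, n_levels)
    return [m[0]] + [m[j] - m[j - 1] for j in range(1, n_levels)]


print(ordinal_coefficients(data))  # [11.0, -3.0, -1.0, -3.0]
```

Here the level means are 11, 8, 7 and 4, so the three step coefficients are -3, -1 and -3.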

Dummy coding

A dummy encoding such as

$$\left(\matrix{1 \\ 2 \\ 3 \\ 4}\right)\rightarrow \left(\matrix{ 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1}\right)$$

produces the same fitted values, but each coefficient now compares a level against the base level $1$, whose expected response is $\beta_0$. Using this encoding:

If $x_i = 1$, $\hat y_i = \beta_0$

If $x_i = 2$, $\hat y_i = \beta_0 + \beta_1$

If $x_i = 3$, $\hat y_i = \beta_0 + \beta_2$

If $x_i = 4$, $\hat y_i = \beta_0 + \beta_3$
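The dummy encoding and the four cases above can be sketched as follows. The names `dummy_encode` and `fitted_value` and the coefficient values are hypothetical, chosen only to make the example concrete.

```python
def dummy_encode(level, n_levels=4):
    """One 0/1 indicator per non-base level: [level == 2, ..., level == n_levels]."""
    if not 1 <= level <= n_levels:
        raise ValueError(f"level must be in 1..{n_levels}, got {level}")
    return [int(level == k) for k in range(2, n_levels + 1)]


def fitted_value(level, beta):
    """beta = [beta_0, beta_1, beta_2, beta_3]; add the active dummy's coefficient to beta_0."""
    indicators = dummy_encode(level, n_levels=len(beta))
    return beta[0] + sum(b * d for b, d in zip(beta[1:], indicators))


# Hypothetical coefficients, not estimated from any real data.
beta = [11.0, -3.0, -4.0, -7.0]
for level in range(1, 5):
    print(level, dummy_encode(level), fitted_value(level, beta))
```

At most one indicator is switched on per row, so each fitted value is the base value plus a single coefficient.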

Coefficients can be converted from one encoding to the other. Writing $\beta_j^*$ for the ordinal-coding coefficients from before, we have $\beta_1 = \beta_1^*$, $\beta_2= \beta_1^*+\beta_2^*$ and $\beta_3 = \beta_1^*+\beta_2^*+\beta_3^*$. The intercept $\beta_0$ is the same in both encodings, since both give $\hat y_i = \beta_0$ at level $1$.
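The conversion is a cumulative sum in one direction and successive differences in the other. A small sketch, with hypothetical function names and coefficient values, and with the intercept left out because it is shared:

```python
from itertools import accumulate


def ordinal_to_dummy(beta_star):
    """Dummy-coding coefficients are running totals of the ordinal-coding ones."""
    return list(accumulate(beta_star))


def dummy_to_ordinal(beta):
    """Ordinal-coding coefficients are differences between consecutive dummy ones."""
    return [beta[0]] + [b - a for a, b in zip(beta, beta[1:])]


beta_star = [-3.0, -1.0, -3.0]  # hypothetical ordinal-coding steps
beta = ordinal_to_dummy(beta_star)
print(beta)                    # [-3.0, -4.0, -7.0]
print(dummy_to_ordinal(beta))  # [-3.0, -1.0, -3.0]
```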

No coding

Consider what happens if you keep the variable as it was.

$$\hat y_i = \beta_0 + \beta_1 x_i$$

If $x_i = 1$, $\hat y_i = \beta_0 + \beta_1$

If $x_i = 2$, $\hat y_i = \beta_0 + 2\beta_1$

If $x_i = 3$, $\hat y_i = \beta_0 + 3\beta_1$

If $x_i = 4$, $\hat y_i = \beta_0 + 4\beta_1$
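As a rough illustration of the no-coding model, the sketch below fits a straight line to the raw level values with the closed-form simple-regression formulas. It assumes the ordinal variable is the only predictor, and the data and the name `fit_line` are invented for this example.

```python
from statistics import mean

# Hypothetical toy data: (level, response) pairs, invented for illustration.
data = [(1, 10.0), (1, 12.0), (2, 9.0), (2, 7.0),
        (3, 8.0), (3, 6.0), (4, 3.0), (4, 5.0)]


def fit_line(pairs):
    """Least-squares intercept and slope for y = beta_0 + beta_1 * x."""
    x_bar = mean(x for x, _ in pairs)
    y_bar = mean(y for _, y in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


beta_0, beta_1 = fit_line(data)
fitted = [beta_0 + beta_1 * level for level in range(1, 5)]
steps = [b - a for a, b in zip(fitted, fitted[1:])]
print(beta_0, beta_1)
print(steps)  # every step equals beta_1, up to floating-point rounding
```

Whatever the data, the three steps between consecutive fitted values are forced to be identical, which is the restriction this model imposes.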

Comparing with the encoded models, we can see that the no-coding model is a special case of them in which a one-level change in the explanatory variable always changes the response by the same amount, $\beta_1$.

In ordinal coding, this would mean $\beta_1 = \beta_2 = \beta_3$. In dummy coding, it means $\beta_2 = 2\beta_1$ and $\beta_3 = 3\beta_1$, that is, $3\beta_1 = \beta_1+\beta_2 = \beta_3$.
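These constraints can be derived by matching fitted values level by level. To avoid a clash of symbols, write the no-coding model as $\hat y_i = \gamma_0 + \gamma_1 x_i$ (the $\gamma$ notation is introduced here only for this comparison), and keep $\beta_j$ for the encoded model:

$$\begin{aligned}
\text{ordinal coding:}\quad & \beta_0 = \gamma_0 + \gamma_1, \quad \beta_1 = \beta_2 = \beta_3 = \gamma_1 \\
\text{dummy coding:}\quad & \beta_0 = \gamma_0 + \gamma_1, \quad \beta_1 = \gamma_1, \quad \beta_2 = 2\gamma_1, \quad \beta_3 = 3\gamma_1
\end{aligned}$$

In the dummy case, $3\beta_1 = 3\gamma_1$, $\beta_1 + \beta_2 = 3\gamma_1$ and $\beta_3 = 3\gamma_1$, which gives the chain of equalities stated above.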

Also, notice that here the "base" level $1$ has fitted value $\beta_0 + \beta_1$ rather than $\beta_0$. If you subtract 1 from every level, the base level's fitted value becomes $\beta_0$, which makes the model more directly comparable to the other approaches. I chose to keep the levels starting at one, though, since that's the direct approach.

The no-coding model is a more restrictive hypothesis: it requires the change in the response to be directly proportional to the difference in level of the explanatory variable.

Encoding is more flexible, in that the change between levels can, and probably will, differ from one step to the next. It also brings a complication that can be beneficial: the change in the response can switch sign between steps. If that does not fit the meaning of the variable in the model, the coefficients can be constrained as necessary.
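A small sketch of this trade-off, again assuming the ordinal variable is the only predictor and using invented data and function names. The encoded model fits each level's mean, so its steps may differ and even change sign, while the straight-line model is a restricted version of it and can never have a smaller residual sum of squares.

```python
from statistics import mean

# Hypothetical toy data where the response does not move monotonically with level.
data = [(1, 4.0), (1, 6.0), (2, 9.0), (2, 7.0),
        (3, 6.0), (3, 8.0), (4, 10.0), (4, 8.0)]


def rss_level_means(pairs):
    """Residual sum of squares and level means for the encoded (flexible) model."""
    levels = sorted({x for x, _ in pairs})
    means = {k: mean(y for x, y in pairs if x == k) for k in levels}
    rss = sum((y - means[x]) ** 2 for x, y in pairs)
    return rss, [means[k] for k in levels]


def rss_straight_line(pairs):
    """Residual sum of squares for the no-coding (restricted) model."""
    x_bar = mean(x for x, _ in pairs)
    y_bar = mean(y for _, y in pairs)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
             / sum((x - x_bar) ** 2 for x, _ in pairs))
    intercept = y_bar - slope * x_bar
    return sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)


rss_flexible, means = rss_level_means(data)
rss_restricted = rss_straight_line(data)
steps = [b - a for a, b in zip(means, means[1:])]
print(steps)  # [3.0, -1.0, 2.0]: the middle step has the opposite sign
print(rss_flexible, rss_restricted)
```

A lower residual sum of squares for the encoded model is guaranteed and does not by itself show that it is the better choice, since it spends two extra parameters to get there.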

This was really helpful! Thank you for the time you spent explaining this. I should most likely have reformulated my question - I know that calling something "the right approach" in every case is often wrong. Again, thanks so much for the help. — Anders, Nov 23 '20 at 19:21
@Anders glad to be of help! There's a small error in one of my formulas, I'll reformulate it soon enough. But the main point stands — Firebug, Nov 23 '20 at 19:58
