How to deal with a categorical dependent variable with more than 2 categories?

Question

I've been struggling to understand how to approach this problem.

Problem Description

I have $n$ features that describe a dog race such as:

Final time
First bend time
Track
Grade

My dependent variable is the FINAL POSITION, which can range from 1st to 6th place.

What I need to predict

Given my features and training data from $m$ past dog races, I need to predict the dependent variable $y$, that is, a dog's final position (1st to 6th).

WHAT I DON'T UNDERSTAND

How should I approach the dependent variable $y$?

The first model I created: instead of using 6 possible outcomes for a race (1st to 6th), I divided my dependent variable into WINNER and LOSER. That is:

if the dog finished 1st: $y = 1$
if the dog finished 2nd to 6th: $y = 0$

In this simple case I have a dependent variable like: $y = [ 1, 1, 0, ... , 0, 1]$
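For concreteness, the winner/loser recoding described above can be sketched as follows. The function name `to_winner_flag` and the list of finishing positions are made up for illustration; they are not from my actual data.

```python
# Hypothetical finishing positions for eight dogs (illustrative values only).
positions = [1, 4, 2, 6, 1, 3, 5, 1]


def to_winner_flag(positions):
    """Map each finishing position to 1 (finished 1st) or 0 (finished 2nd to 6th)."""
    return [1 if p == 1 else 0 for p in positions]


y = to_winner_flag(positions)
print(y)
```

This collapses five of the six outcomes into a single class, which is exactly the information I would like to keep.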

BUT: What if I want to predict every position?

Suppose I encode the positions as 1: 1st, 2: 2nd, 3: 3rd, ..., 6: 6th. As I read in the topic "How to deal with categorical features in machine learning models?", assigning "6" and "1" to a variable is not recommended, because that encoding implies that 6th place is 6 times greater than 1st place, which is not true.

How should I handle my dependent variable? Is it possible to have a multidimensional dependent variable $y$? Something like $n \times 6$?

score 2 · Accepted Answer · answered May 28 '19 at 11:39

This is not a categorical dependent variable but rather an ordinal one. The proportional odds ordinal logistic model is one of many ways to efficiently analyze such data. But the dog's characteristics are more likely to relate to his absolute run time than to the order in which he finished. Consider using run time as Y.
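As a rough illustration of what a proportional odds model produces, the sketch below turns one linear predictor into probabilities for the six positions. It assumes the parameterization $P(Y \le j \mid x) = \operatorname{expit}(\alpha_j - x^\top\beta)$ with increasing cutpoints $\alpha_1 < \dots < \alpha_5$; other software may use a different sign convention. The cutpoints and the two values of `eta` are invented for the example, not fitted to any data.

```python
import math


def expit(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def position_probabilities(cutpoints, eta):
    """Return [P(Y=1), ..., P(Y=K)] under a proportional odds model.

    Assumed parameterization: P(Y <= j | x) = expit(cutpoints[j-1] - eta),
    with increasing cutpoints and a single eta = x . beta shared by all j.
    """
    cumulative = [expit(a - eta) for a in cutpoints] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        # Each category probability is a difference of adjacent cumulative ones.
        probs.append(c - previous)
        previous = c
    return probs


# Hypothetical cutpoints: six positions need five thresholds.
cutpoints = [-1.6, -0.7, 0.0, 0.7, 1.6]
strong_dog = position_probabilities(cutpoints, eta=-1.0)  # lower eta: better placings
weak_dog = position_probabilities(cutpoints, eta=1.0)
print([round(p, 3) for p in strong_dog])
print([round(p, 3) for p in weak_dog])
```

Because the same `eta` shifts every cumulative probability, the model needs only one coefficient vector for all six positions, which is where the efficiency relative to an unordered multinomial model comes from.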

Frank Harrell

I absolutely agree! Use regression on absolute run time, build confidence intervals and extract probabilities from there. Otherwise you would entertain a model that, in a race among the 6 fastest dogs on Earth, would declare everyone a 1st place finisher – David May 28 '19 at 11:52
Amazing. If you guys don't mind, I'd like to ask some other questions! :) I agree that the dog's characteristics are more likely to relate to the absolute run time. So Frank Harrell, do you think I should use a regressor to predict Y? – Piero Costa May 28 '19 at 14:25
Second question: even if it is an ordinal one, can't this problem also be seen as categorical? What I mean is that the model doesn't know the difference between First and Sixth. I could just as easily say that "first == oranges" and "sixth == monkeys". Isn't that correct? – Piero Costa May 28 '19 at 14:30
No, the model should take into account that there's a natural/inherent ordering of the categories (thus "ordinal"), since "first > second > ... > sixth". You will lose a lot if you treat the final position as categorical (and it doesn't make sense, too, since it's more likely that a dog ends up "second or third" than that it ends up "second or sixth"). – Edgar May 28 '19 at 14:53
Edgar, thank you VERY much for the enlightenment. Thank you also, David, for the idea. So, my main algorithm should be a proportional odds ordinal logistic model with absolute time as Y? – Piero Costa May 28 '19 at 15:54
That would work. Then you don't care about the shape of the time distribution. You can get effect estimates and also convert estimated values to means and quantiles as shown in a case study in my [RMS course notes](http://fharrell.com/links). – Frank Harrell May 28 '19 at 17:55
Amazing course notes. A+++ – Piero Costa May 29 '19 at 07:34
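One way to read the suggestion in the comments above (model run time, then extract position probabilities) is a small simulation. This is only a sketch under strong assumptions that nobody in the thread stated: each dog's run time is drawn independently from a normal distribution around its predicted mean, with a common standard deviation. The function name, the predicted times, and `sd=0.3` are all hypothetical.

```python
import random


def simulate_position_probabilities(mean_times, sd, n_sims=20000, seed=0):
    """Estimate P(dog i finishes in position k) by simulating races.

    Assumes independent normal run times around each predicted mean,
    with a shared standard deviation `sd` (an illustrative simplification).
    """
    rng = random.Random(seed)
    n_dogs = len(mean_times)
    counts = [[0] * n_dogs for _ in range(n_dogs)]
    for _ in range(n_sims):
        times = [rng.gauss(mu, sd) for mu in mean_times]
        # Sort dog indices by simulated time: the fastest dog takes position 1.
        order = sorted(range(n_dogs), key=lambda i: times[i])
        for position, dog in enumerate(order):
            counts[dog][position] += 1
    return [[c / n_sims for c in row] for row in counts]


# Hypothetical predicted run times in seconds for six dogs in one race.
predicted = [29.6, 29.8, 29.9, 30.0, 30.2, 30.5]
probs = simulate_position_probabilities(predicted, sd=0.3)
for dog, row in enumerate(probs, start=1):
    print(f"dog {dog}: " + " ".join(f"{p:.2f}" for p in row))
```

Each row gives one dog's estimated probabilities of finishing 1st through 6th. Unlike six separate per-dog classifications, the simulated positions within a race are always a permutation, so exactly one dog is ranked first in every simulated race.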
