How do I run Ordinal Logistic Regression analysis in R with both numerical / categorical values?

Question

Base Data: I have ~1,000 people marked with assessments: '1,' [good] '2,' [middle] or '3' [bad] -- these are the values I'm trying to predict for people in the future. In addition to that, I have some demographic information: gender (categorical: M / F), age (numerical: 17-80), and race (categorical: black / caucasian / latino).

I mainly have four questions:

1. I was initially trying to run the dataset described above as a multiple regression analysis. But I recently learned that since my dependent variable is an ordered factor and not a continuous variable, I should use ordinal logistic regression for something like this. I was initially using something like `mod <- lm(assessment ~ age + gender + race, data = dataset)`. Can anybody point me in the right direction?
2. From there, assuming I get coefficients I feel comfortable with, I understand how to plug solely numerical values in for x1, x2, etc. -- but how would I deal with race, for example, where there are multiple responses: black / caucasian / latino? So if it tells me the caucasian coefficient is 0.289 and somebody I'm trying to predict is caucasian, how do I plug that back in since the value's not numerical?
3. I also have random values that are missing -- some for race, some for gender, etc. Do I have to do anything additional to make sure this isn't skewing anything? (I noticed that when my dataset gets loaded into RStudio with the missing data as NA, R says something like (162 observations deleted due to missingness) -- but if the missing values get loaded as blanks, it does nothing.)
4. Assuming all of this works out and I have new data with gender, age, and race that I want to predict on -- is there an easier way in R to run all of that through whatever my formula with new coefficients turns out to be, rather than doing it manually? (If this question isn't appropriate here, I can take it back to the R forum.)

score 16 · Accepted Answer · edited Apr 11 '14 at 20:03

Here's a little info that might point you in the right direction.

Regarding your data: you have a response with multiple categories, and any time you model a categorical response you are right to reach for some type of generalized linear model (GLM). In your case there is additional information to take into account: your response levels have a natural ordering (good > middle > bad). This differs from modeling a response such as which color balloon someone is likely to buy (red/blue/green), where the values have no natural ordering. For an ordered response like yours, you may want to consider a proportional odds model.

http://en.wikipedia.org/wiki/Ordered_logit

I haven't used it myself, but the polr() function in the MASS package is likely to be of some use. Alternatively, I have used the lrm() function in the rms package for similar types of analysis and have found it quite useful. Once you load these packages, use ?polr or ?lrm for the function documentation.
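As a rough sketch of what the switch from lm() to a proportional odds model might look like, the following fits polr() on simulated data. The data frame, its column names (taken from the question), and the data-generating coefficients are all made up for illustration; they are not the asker's data or results.

```r
library(MASS)  # provides polr()

set.seed(1)
n <- 1000

# Simulated stand-in for the asker's data (assumption: column names as in the question)
dataset <- data.frame(
  age    = sample(17:80, n, replace = TRUE),
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  race   = factor(sample(c("black", "caucasian", "latino"), n, replace = TRUE))
)

# Latent score used only to generate plausible ordered outcomes
latent <- 0.03 * dataset$age + 0.5 * (dataset$gender == "M") + rlogis(n)

# The response must be a factor; an ordered factor makes the ordering explicit
dataset$assessment <- cut(latent, breaks = c(-Inf, 1, 2.5, Inf),
                          labels = c("1", "2", "3"), ordered_result = TRUE)

# Proportional odds model in place of lm(); Hess = TRUE stores the Hessian
# so that summary() can report standard errors without refitting
mod <- polr(assessment ~ age + gender + race, data = dataset, Hess = TRUE)
summary(mod)
```

The summary lists one coefficient per numeric predictor and per non-reference factor level, plus the two thresholds ("intercepts") that separate the three ordered categories.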

Alright, enough background. On to your questions:

1. This should be covered above. Check out these packages/functions and read up on ordinal logistic regression and proportional odds models.
2. Any time you have a categorical covariate (race, sex, hair color), you want to treat it as a 'factor' in your R code so that it is modeled appropriately. It's important to know what a factor is and how factors are treated, but essentially each category becomes a separate level, and the model estimates an effect for each level relative to a reference level. Read up on factors in models and you should be able to tease out what's going on. Keep in mind that treating categorical variables as factors is not unique to GLMs or proportional odds models; it is how most models deal with categorical variables. http://www.stat.berkeley.edu/classes/s133/factors.html
3. Missing values can sometimes be a hassle to deal with, but if you're doing a fairly basic analysis it's probably safe to just remove the data rows that contain missing values. (This isn't always true, but based on your current experience level I'm guessing you need not be concerned with the specifics of when and how to deal with missing values.) In fact, this is pretty much what R does: if a row is missing the response or any covariate in the model, R excludes that row from the fit (this is the message you're seeing). Obviously, if you're excluding a large proportion of your data due to missingness, your results could be biased, and it's probably good to find out why there are so many missing values. But if you're missing 162 observations in 10,000 rows of data, I wouldn't sweat it too much. You can search for methods for handling missing data if you're interested in more specifics.
4. Almost all R model objects (lm, glm, lrm, ...) have an associated predict() method, which calculates predicted values for your current modeling dataset and also for another dataset for which you wish to predict an outcome. Search ?predict.glm or ?predict.lm to get more information for whatever model type you want to work with. This is a very typical thing people wish to do with models, so rest assured that there are built-in functions and methods that make it relatively straightforward.
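To make points 3 and 4 concrete, here is a minimal sketch, again on simulated data with hypothetical column names, that counts the rows a fit would drop and then predicts for new people with predict(). It uses polr() from MASS as one possible model; the same pattern should apply to other model types with their own predict methods. On the blanks mentioned in the question: when reading a file, something like `read.csv(file, na.strings = c("", "NA"))` should turn blank cells into NA so they are not treated as a category of their own.

```r
library(MASS)

set.seed(2)
n <- 500

# Simulated data (assumption: not the asker's real dataset)
dataset <- data.frame(
  age    = sample(17:80, n, replace = TRUE),
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  race   = factor(sample(c("black", "caucasian", "latino"), n, replace = TRUE))
)
latent <- 0.03 * dataset$age + rlogis(n)
dataset$assessment <- cut(latent, breaks = c(-Inf, 1, 2.5, Inf),
                          labels = c("1", "2", "3"), ordered_result = TRUE)

# Inject some missing values to mimic the situation in the question
dataset$race[sample(n, 40)] <- NA

# Rows with any NA are the ones the model will drop by default
n_dropped <- sum(!complete.cases(dataset))
n_dropped

mod <- polr(assessment ~ age + gender + race, data = dataset, Hess = TRUE)

# New people to score; factor levels must match those used in the fit
newdata <- data.frame(
  age    = c(25, 60),
  gender = factor(c("F", "M"), levels = levels(dataset$gender)),
  race   = factor(c("latino", "caucasian"), levels = levels(dataset$race))
)

pred_class <- predict(mod, newdata = newdata, type = "class")  # most likely category
pred_probs <- predict(mod, newdata = newdata, type = "probs")  # probability of each category
pred_class
pred_probs
```

Each row of `pred_probs` gives the estimated probabilities of categories 1, 2, and 3 for one new person, so no coefficients have to be plugged in by hand.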

Best of luck!

score 2 · Answer 2 · answered Apr 11 '14 at 19:05

1. Yes, ordered logit or probit would be where to start. Here's a tutorial on ordered logit that uses R. Other CV questions can probably help you with any snags you run into; try the tags 'logit,' 'probit,' and 'ordinal.'
2. A standard approach to dealing with a categorical independent variable with $k$ values is to dummy code it as $k-1$ binary values. This is more fully explained here, but in short: The effect of one category is subsumed into the intercept, and coefficients are fitted to the remaining categories. In your example, there would be a dummy variable caucasian that would be coded to 1 for a Caucasian respondent, 0 otherwise.
3. Dealing with missing data very much depends on the problem at hand, and yes, how you deal with missing data may introduce bias. This book excerpt nicely describes four mechanisms that can produce missing data, which should help you consider potential bias in your own problem at hand. (In particular, section 25.1, p. 530.)
4. Many modeling packages have a predict function of some sort, and indeed the first tutorial linked above includes a demonstration.
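The dummy coding described in point 2 can be inspected directly in R with model.matrix(). The sketch below uses a tiny made-up vector of race values; the 0.289 coefficient is the hypothetical number from the question, not an estimate.

```r
# Tiny illustrative factor (assumption: made-up values)
race <- factor(c("black", "caucasian", "latino", "caucasian"))

# Default treatment coding: the first level ("black") is the reference,
# so k = 3 categories become k - 1 = 2 dummy columns
model.matrix(~ race)

# Choose a different reference category, e.g. "latino"
race2 <- relevel(race, ref = "latino")
X <- model.matrix(~ race2)
colnames(X)

# "Plugging in" a categorical value: the coefficient is multiplied by the
# 0/1 dummy, so it contributes 0.289 for Caucasian respondents and 0 otherwise
contribution <- 0.289 * X[, "race2caucasian"]
contribution
```

When a factor appears in a model formula, R builds these dummy columns automatically, so the original single race column usually does not need to be split by hand.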

answered Apr 11 '14 at 19:05

Sean Easter

Thanks so much! Quick follow-up on #2: That was my basic assumption -- but what's the code if there are more than two categories? For example, caucasian, black, latino. – Ryan Apr 11 '14 at 22:52
Quite welcome! In that example, you would choose one category to subsume into the intercept, say `latino`, and dummies for the other two. A 1 value for the `caucasian` dummy indicates a Caucasian respondent, similar for the `black` dummy variable. A 0 value for both indicates a Latino respondent. Make sense? – Sean Easter Apr 12 '14 at 14:31
So should I just change the dataset from one column with multiple responses ('black,' 'caucasian,' and 'latino') to one 'black' column with 1s and 0s, one 'caucasian' column with 1s and 0s, and one 'latino' column with 1s and 0s? – Ryan Apr 13 '14 at 13:21
That's one approach that'll work fine. The only difference from using two columns is how you interpret the intercept. You can do this manually, but I believe factors in R can handle it for you. Try [this](http://www.ats.ucla.edu/stat/r/modules/dummy_vars.htm)—it walks through using factors with a similar example. Cheers! – Sean Easter Apr 14 '14 at 02:23
The link to the tutorial is broken. If someone can fix it, that would be great! – Dan Hicks Oct 25 '17 at 22:16
