Base Data: I have ~1,000 people, each marked with an assessment of 1 (good), 2 (middle), or 3 (bad) -- these are the values I'm trying to predict for people in the future. In addition to that, I have some demographic information: gender (categorical: M / F), age (numerical: 17-80), and race (categorical: black / caucasian / latino).
I mainly have four questions:
1. I was initially trying to run the dataset described above as a multiple regression analysis. But I recently learned that since my dependent variable is an ordered factor and not a continuous variable, I should use ordinal logistic regression for something like this. I was initially using something like
    mod <- lm(assessment ~ age + gender + race, data = dataset)
Can anybody point me in the right direction?
2. From there, assuming I get coefficients I feel comfortable with, I understand how to plug purely numerical values in for x1, x2, etc. -- but how would I deal with race, for example, where there are multiple possible responses: black / caucasian / latino? So if it tells me the caucasian coefficient is 0.289 and somebody I'm trying to predict is caucasian, how do I plug that back in, since the value isn't numerical?
3. I also have some scattered missing values -- some for race, some for gender, etc. Do I have to do anything additional to make sure this isn't skewing anything? (I noticed that when my dataset is loaded into RStudio with the missing data as NA, R says something like "(162 observations deleted due to missingness)" -- but if the missing values are loaded as blanks, it does nothing.)
4. Assuming all of this works out and I have new data with gender, age, and race that I want to predict on -- is there an easier way in R to run all of that through whatever my formula with new coefficients turns out to be, rather than doing it manually? (If this question isn't appropriate here, I can take it back to the R forum.)
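
A minimal sketch of the workflow these four questions describe, not a definitive answer. It assumes that polr() from the MASS package (a proportional-odds model) is an acceptable form of ordinal logistic regression here, and it uses simulated data in place of the real dataset: the values, the 50 blanked race entries, and the new_people data frame are all made up for illustration.

```r
library(MASS)  # provides polr(), a proportional-odds logistic regression

# Simulated stand-in for the real data (assumption: all values are made up)
set.seed(1)
n <- 1000
dataset <- data.frame(
  age    = sample(17:80, n, replace = TRUE),
  gender = sample(c("M", "F"), n, replace = TRUE),
  race   = sample(c("black", "caucasian", "latino"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
dataset$assessment <- sample(1:3, n, replace = TRUE)

# Blanks are an ordinary value to R, not missing data, so recode them to NA.
# (When reading a file, read.csv(..., na.strings = c("", "NA")) does the same.)
dataset$race[sample(n, 50)] <- ""
dataset$race[which(dataset$race == "")] <- NA

# The outcome must be an ordered factor; the level order defines the ordering
dataset$assessment <- factor(dataset$assessment, levels = c(1, 2, 3), ordered = TRUE)
dataset$gender <- factor(dataset$gender)
dataset$race   <- factor(dataset$race)

# How a categorical predictor enters the model: one 0/1 column per
# non-reference level ("black" is the reference level here, so a caucasian
# person has racecaucasian = 1 and racelatino = 0)
head(model.matrix(~ race, data = dataset))

# Rows with NA in any model variable are dropped by the default na.action;
# Hess = TRUE keeps the Hessian so summary() can report standard errors
mod <- polr(assessment ~ age + gender + race, data = dataset, Hess = TRUE)
summary(mod)

# Hypothetical new people to predict for; factor levels must match the fit
new_people <- data.frame(
  age    = c(25, 60),
  gender = factor(c("F", "M"), levels = levels(dataset$gender)),
  race   = factor(c("caucasian", "latino"), levels = levels(dataset$race))
)
classes <- predict(mod, newdata = new_people, type = "class")  # most likely category
probs   <- predict(mod, newdata = new_people, type = "probs")  # probability of each category
```

Dropping incomplete rows, as the default does, is only one way to handle missing data; whether it skews the results depends on why the values are missing, which this sketch does not address.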