1

I have a pure math background with knowledge of basic statistics (random variables, inference, etc.) but am new to predictive modeling. Here is my situation:

I have a bunch of independent variables that I am trying to use to predict a single dependent variable. The independent variables are categorical (ordinal), and the dependent variable is a metric.

I'm not sure what areas I should be looking into/learning about. Multiple correspondence analysis, multiple regression, and factor analysis all seem to pop up in my searches. Which of these kinds of analysis would best serve my situation? I'd welcome any alternatives, but there's no need to cover all available options. Thanks!

Nick Stauner
pwerth

2 Answers

3

I'm glad I asked: ordinal data can be treated rather differently than nominal data. Ordinal data is a subtype of categorical data, and differs from nominal data in that its levels are ordered. Preserving this information often improves predictions because it better reflects the nature of the predictor variable.

The binary predictors don't need to be treated differently than nominal data, but the polytomous ordinal predictors would be handled well by a penalized regression model. This is essentially a smoothing method that prevents adjacent levels of the ordinal variable from having drastically different dummy coefficients. This is a non-issue with binary variables, because they only require one dummy coefficient apiece. See "Continuous dependent variable with ordinal independent variable" for more. I think regression splines and LASSO / elastic net have some application in linear models based on ordinal predictors as well, but I know less about those myself.
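To make the smoothing idea concrete, here is a minimal sketch for a single ordinal predictor. It assumes a squared first-difference penalty on adjacent level coefficients, which is one simple choice and not necessarily the exact method in the linked thread. The function name `fit_smoothed_levels`, the toy data, and the penalty weight `lam` are all hypothetical.

```python
def fit_smoothed_levels(levels, y, n_levels, lam):
    """Fit one coefficient per ordinal level (coded 0..n_levels-1).

    Minimizes  sum_i (y_i - b[level_i])**2 + lam * sum_k (b[k+1] - b[k])**2.
    With lam = 0 this reduces to plain per-level means (every level must
    then have at least one observation); larger lam pulls adjacent
    coefficients toward each other.
    """
    counts = [0] * n_levels
    sums = [0.0] * n_levels
    for k, value in zip(levels, y):
        counts[k] += 1
        sums[k] += value

    # Setting the gradient to zero gives a tridiagonal linear system:
    # (counts[k] + lam * n_neighbours) * b[k] - lam * b[k-1] - lam * b[k+1] = sums[k]
    diag = [counts[k] + lam * ((k > 0) + (k < n_levels - 1))
            for k in range(n_levels)]
    off = [-lam] * (n_levels - 1)
    rhs = list(sums)

    # Thomas algorithm: forward elimination, then back substitution.
    for k in range(1, n_levels):
        w = off[k - 1] / diag[k - 1]
        diag[k] -= w * off[k - 1]
        rhs[k] -= w * rhs[k - 1]
    b = [0.0] * n_levels
    b[-1] = rhs[-1] / diag[-1]
    for k in range(n_levels - 2, -1, -1):
        b[k] = (rhs[k] - off[k] * b[k + 1]) / diag[k]
    return b


def roughness(b):
    """Sum of squared differences between adjacent level coefficients."""
    return sum((b[k + 1] - b[k]) ** 2 for k in range(len(b) - 1))


# Hypothetical toy data: a four-level ordinal predictor and a metric response.
levels = [0, 0, 1, 1, 2, 2, 3, 3]
y = [1.0, 1.2, 3.5, 3.1, 1.9, 2.3, 4.0, 4.4]

unpenalized = fit_smoothed_levels(levels, y, 4, lam=0.0)  # plain level means
smoothed = fit_smoothed_levels(levels, y, 4, lam=1.0)     # adjacent levels pulled together
flat = fit_smoothed_levels(levels, y, 4, lam=1e6)         # nearly one common value
```

In practice `lam` would be chosen by cross-validation, and with several predictors the same penalty would be added to a full regression design matrix instead of being solved level by level.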

For a little appeal to intuition, compare ordinal variables to continuous variables. With continuous variables in a linear model, straight lines or smooth curves often make the best regression models; even local regression applies smoothing to prevent bumpy models, as these would likely be overfitted. Many ordinal variables really represent grouped, latent, continuous variables, and as such shouldn't produce really spiky models either. Nominal levels have no such ordering, so it generally makes more sense for nominal data to be heterogeneously related to metric response variables. To prevent overestimating the difference in relationships between the response variable and various levels of an ordinal variable, some penalization of starkly different coefficients for adjacent levels of ordinal predictors often helps.

Nick Stauner
  • Sounds interesting, but complicated - will definitely check into it. Is it correct to assume that since the categorical variables can be treated as ordinals, my model would be more accurate if I did treat them as such (as opposed to nominals, which I'm assuming would be easier - again, I am new to all this stuff)? – pwerth Jun 02 '14 at 20:54
  • @pwerth: edited to respond. – Nick Stauner Jun 02 '14 at 21:08
2

Typically you would use Analysis of Variance (ANOVA) with categorical predictors (independent) and a continuous response (dependent) variable. If your main goal is prediction, then multiple regression may make more sense. The same underlying code is usually used for ANOVA and multiple regression, so it is worth learning about them together.
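As a rough illustration of why ANOVA and multiple regression share the same machinery, the sketch below fits an ordinary least-squares regression on dummy (indicator) variables for one categorical predictor and compares it with the group means that a one-way ANOVA is built on. The helper names and the toy data are hypothetical, and the solver is a bare-bones one written only to keep the example self-contained.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def dummy_regression(groups, y, n_groups):
    """OLS with an intercept plus one indicator per group, group 0 as the reference."""
    design = [[1.0] + [1.0 if g == k else 0.0 for k in range(1, n_groups)]
              for g in groups]
    p = n_groups
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * v for row, v in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)  # normal equations: (X'X) beta = X'y


def group_means(groups, y, n_groups):
    """Per-group means of the response, the quantities a one-way ANOVA compares."""
    totals = [0.0] * n_groups
    counts = [0] * n_groups
    for g, v in zip(groups, y):
        totals[g] += v
        counts[g] += 1
    return [t / c for t, c in zip(totals, counts)]


# Hypothetical toy data: three categories, metric response.
groups = [0, 0, 0, 1, 1, 2, 2, 2]
y = [2.0, 2.4, 2.2, 3.1, 3.5, 5.0, 4.6, 4.8]

coef = dummy_regression(groups, y, 3)
means = group_means(groups, y, 3)

# Regression predictions per group: intercept alone for the reference group,
# intercept plus the group's dummy coefficient for the others.
predicted = [coef[0]] + [coef[0] + coef[k] for k in range(1, 3)]
```

Here `predicted` matches `means` up to rounding error: the regression intercept is the reference group's mean and each dummy coefficient is a difference from it, which is the same model an ANOVA tests.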

Multiple correspondence analysis and factor analysis answer different questions from what you appear to be asking, so I would suggest holding off on those.

Greg Snow