
Is there a canonical regression approach for predicting the ranks of a response?

I'd like to fit a regression to a dataset where the response is highly non-normal with very large outliers. There are about 10 predictors. I haven't had much success with transformations (the best has been adding a constant and then logging the response twice, but this isn't very interpretable).

However, I only care about the ranks of the response. The response is really only a score that is used as an instrument for ranking observations. What I really want to know is which predictors explain the most variation in the ranks.

My approach has been the following:

  1. Calculate the ranks of the response. I.e. for each observation $i$, calculate $R(Y_i)$
  2. Suppose $N$ is the number of observations. Then, approximately, $U_i =\frac{R(Y_i)}{N} \sim Unif(0, 1)$
  3. By the Probability Integral Transform, $Z_i = \Phi^{-1}(U_i) \sim N(0,1)$
  4. Use $Z$ as my response in a regression of $Z$ on the predictors
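The four steps above can be sketched in a few lines of NumPy/SciPy. The data here are simulated purely for illustration (three hypothetical predictors and a skewed, heavy-tailed response). One detail is an assumption that is not in the steps as written: the ranks are divided by $N + 1$ rather than $N$, because with $R(Y_i)/N$ the largest observation gets $U_i = 1$ and $\Phi^{-1}(1)$ is infinite.

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(0)

# Hypothetical data: 3 predictors and a skewed, heavy-tailed response.
n = 500
X = rng.normal(size=(n, 3))
y = np.exp(1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.standard_t(df=2, size=n))

# Step 1: ranks of the response (ties, if any, get average ranks).
r = rankdata(y)

# Step 2: scale the ranks into the open interval (0, 1).
# Dividing by n + 1 instead of n is an assumption made here so that
# the largest rank does not map to exactly 1.
u = r / (n + 1)

# Step 3: normal scores via the inverse standard normal CDF.
z = norm.ppf(u)

# Step 4: ordinary least squares of z on the predictors (with intercept).
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, z, rcond=None)
print(beta)
```

Because both transformations are monotone, sorting by `z` gives the same ordering as sorting by `y`; only the spacing between observations changes.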

Since the rank and inverse-CDF transformations are both monotone and therefore preserve rank, I reason that this regression approach will help me identify which covariates are most predictive of rank.

Does this approach work? Is there a better or more standard approach to predicting rank with a set of covariates? Googling around, I found this paper but I don't know how accepted or well known the approach is: https://journal.r-project.org/archive/2012-2/RJournal_2012-2_Kloke+McKean.pdf

Thanks!

ttnphns
frelk
  • Why is $U_i \sim \text{U}(0,1)$? –  Dec 22 '16 at 19:49
  • $U_i$ will look uniformly distributed because $\frac{R(Y)}{N}$ takes the values $\frac{1}{N}, \frac{2}{N}, \ldots, \frac{N - 1}{N}, \frac{N}{N}$, which are perfectly evenly spaced on the unit interval. If you plot it, it will look exactly like the pdf of a standard uniform. If the $Y_i$ are independent then the $U_i$ should also be independent. – frelk Dec 22 '16 at 20:30

1 Answer


From what I can tell, the rank-based estimation this paper refers to is slightly different from what you're interested in. Note that least-squares estimation is based on the idea that $\boldsymbol \beta$ should be chosen to minimize $||\boldsymbol y - \boldsymbol X \boldsymbol \beta||^2$. This isn't suitable in your case because the distribution of $y$ isn't very nice, and $y$ itself isn't really of interest. However, the focus of the paper is still to predict $y$ as a linear function of $X$. The only difference is the way in which it estimates $\boldsymbol \beta$: they choose $\boldsymbol \beta$ to minimize a rank-based norm, which is still applied to the residuals $\boldsymbol y - \boldsymbol X \boldsymbol \beta$. Hence, this method is still largely dependent on the distribution of $y$.
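To make the contrast with least squares concrete, here is a minimal sketch of one rank-based objective of this kind: Jaeckel's dispersion with Wilcoxon scores, $\sum_i a(R(e_i))\, e_i$ with $a(i) = \sqrt{12}\left(\frac{i}{n+1} - \frac{1}{2}\right)$ and $e = \boldsymbol y - \boldsymbol X \boldsymbol \beta$. As far as I understand, this is the type of norm the paper uses, but the code below is my own illustration on simulated data, not the paper's implementation. The Nelder-Mead optimizer and the median-of-residuals intercept are choices made for this sketch.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata

def wilcoxon_dispersion(beta, X, y):
    """Jaeckel's dispersion of the residuals, using Wilcoxon scores."""
    e = y - X @ beta
    n = e.size
    # The scores depend only on the ranks of the residuals and sum to zero.
    scores = np.sqrt(12.0) * (rankdata(e) / (n + 1) - 0.5)
    return np.sum(scores * e)

rng = np.random.default_rng(1)

# Hypothetical data: linear signal plus heavy-tailed errors.
n = 300
X = rng.normal(size=(n, 2))
true_beta = np.array([2.0, -1.0])
y = 5.0 + X @ true_beta + rng.standard_t(df=2, size=n)

# Start the search from the least-squares slopes (centered data).
Xc = X - X.mean(axis=0)
start, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)

# The dispersion is unchanged by shifting y, so it identifies only the
# slopes; the intercept is estimated afterwards from the residuals.
fit = minimize(wilcoxon_dispersion, start, args=(X, y), method="Nelder-Mead")
slopes = fit.x
intercept = np.median(y - X @ slopes)
print(slopes, intercept)
```

Note that the ranks enter only through the residuals: the target of the fit is still $y$ on its original scale, which is the point made above.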

You mentioned that you only care about the ranks of the response variable. In other words, you'd be just as well off using $X$ to model $R(Y)$ rather than $Y$ itself. The fact that the scaled rank $R(Y)/N$ is confined to $(0, 1]$ means that the usual linear regression approach may not work: you could end up with predictions outside the unit interval, or the relationship between $X$ and the scaled rank might not be linear. But this really isn't a problem. The usual modeling approach in this situation is to employ a Generalized Linear Model (GLM). The only additional step in fitting this model is to choose an appropriate link function.

For example, suppose $X \sim Normal(0, 1)$ and $Y|X \sim Normal(\beta_0 + \beta_1 X, \sigma^2)$. It would then be appropriate to use $X$ to model $R(Y)$ with a GLM and a logit or probit link.
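Here is a minimal sketch of that example, under some assumptions of mine: the parameter values are arbitrary, the ranks are scaled by $N + 1$ so the response stays strictly inside $(0, 1)$, and the GLM is fit by maximizing a Bernoulli quasi-likelihood with a probit link using a general-purpose optimizer rather than a dedicated GLM routine.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, rankdata

rng = np.random.default_rng(2)

# Simulate the example: X ~ N(0, 1), Y | X ~ N(beta0 + beta1 * X, sigma^2).
# The parameter values are arbitrary choices for illustration.
n = 2000
beta0, beta1, sigma = 1.0, 1.0, 1.0
x = rng.normal(size=n)
y = beta0 + beta1 * x + rng.normal(scale=sigma, size=n)

# Scaled ranks in (0, 1); dividing by n + 1 is an assumption of this sketch.
u = rankdata(y) / (n + 1)

design = np.column_stack([np.ones(n), x])

def neg_quasi_loglik(coef):
    """Negative Bernoulli quasi-log-likelihood with a probit link."""
    eta = design @ coef
    # log Phi(eta) and log(1 - Phi(eta)) = log Phi(-eta), computed stably.
    return -np.mean(u * norm.logcdf(eta) + (1.0 - u) * norm.logcdf(-eta))

fit = minimize(neg_quasi_loglik, np.zeros(2), method="BFGS")
fitted = norm.cdf(design @ fit.x)  # predicted scaled ranks
print(fit.x)
```

The fitted values are guaranteed to lie in the unit interval because they pass through the inverse link; swapping `norm.logcdf` and `norm.cdf` for their logistic counterparts would give the logit version.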

jjet
  • After the normal inverse CDF transformation my response should be standard normal, which I think should avoid using a GLM, right? – frelk Dec 22 '16 at 20:31
  • That could work in some situations, for example if $X$ were normally distributed. But it's certainly no guarantee. Here's an illustrative counterexample: suppose $X \sim Exponential(1)$ and $Y = 0.3 + 0.1 X + \epsilon$, where $\epsilon \sim N(0, (0.05)^2)$. If you then set $Z = \Phi^{-1}(R(Y)/N)$, you'll see that $Z$ is not even remotely linear in $X$. I really think the GLM route is the best option in your case. – jjet Dec 22 '16 at 21:31
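The counterexample in this comment is easy to simulate. The sketch below compares a straight-line fit of $Z$ on $X$ with a fit that adds an $X^2$ term; a clearly nonzero quadratic coefficient is one simple sign of curvature. The sample size, the seed, and the use of $N + 1$ in the denominator (to keep $\Phi^{-1}$ finite) are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(3)

# Simulate the counterexample: X ~ Exponential(1), Y = 0.3 + 0.1 X + eps.
n = 5000
x = rng.exponential(scale=1.0, size=n)
y = 0.3 + 0.1 * x + rng.normal(scale=0.05, size=n)

# Normal scores of the ranks (n + 1 in the denominator is an assumption).
z = norm.ppf(rankdata(y) / (n + 1))

def r_squared(design, target):
    """Least-squares fit; returns R^2 and the coefficient vector."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1.0 - resid.var() / target.var(), coef

lin = np.column_stack([np.ones(n), x])
quad = np.column_stack([np.ones(n), x, x**2])

r2_lin, _ = r_squared(lin, z)
r2_quad, coef_quad = r_squared(quad, z)
print(r2_lin, r2_quad, coef_quad[2])
```

A scatter plot of `z` against `x` shows the same thing visually: the normal scores rise steeply for small `x` and flatten out for large `x`.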