4

Is there a general rule of thumb about when robust regression or quantile regression is preferred in the presence of outliers?

For example, I have a dataset where the DV exhibits extreme positive skewness. However, the large cases are actually some of the most interesting observations. When I run OLS, I find a positive relation between the DV and the IV of interest. When I estimate quantile regressions, I find that the positive relation between the DV and IV of interest is strongest in the 85th, 90th, and 95th percentiles (which is where one might expect it). It is insignificant and sometimes negative for the rest of the percentiles. However, when I run rreg in Stata, it gives essentially no weight to the large positive outliers, leading to no relation between the DV and the IV of interest. Which approach (OLS, quantile, rreg) should be reported? Which is most appropriate?

smci
KSL
  • This can't be answered for your data because (1) your verbal description isn't enough to convey what they look like (2) we can't know your scientific context implying what kind of model matches your goals. It can't be answered generally: any survey of robust regression shows numerous competing methods; people can't agree that one is best and there are many good reasons why that is inevitable. It could be that quite another answer makes as much or more sense, namely use a transformation or non-identity link function rather than try to squeeze your data into hyperplane + long-tailed errors. – Nick Cox Nov 28 '13 at 09:59
  • 1
    Note that `rreg` in Stata should be expected to mean precisely nothing to non-Stata users. It's an implementation of the method of Li, G. 1985. Robust regression. In _Exploring Data Tables, Trends, and Shapes_, ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, 281-340. New York: Wiley. I'd be surprised if anyone regards that as the method of choice in 2013. – Nick Cox Nov 28 '13 at 10:04
  • 1
    I note further that whether robust and quantile regression are alternatives appears moot. I guess many people would want a clearer idea of your precise goals before recommending one or the other. – Nick Cox Nov 28 '13 at 10:07
  • The DV is bounded between 0 and 1, with a lot of cases close to zero: approx 30% of the cases are > 0.001, and approximately 1% of the cases are > 0.01. So extreme positive skew. My precise goal is to evaluate the relation between the DV and the IV of interest. However, OLS seems inappropriate given the skewness. Two methods of dealing with skewness, quantile and robust regression, give different results. One shows a positive and significant relation between IV and DV in the right tail (which economics would suggest); robust regression discards DVs in the right tail, leading to an insignificant relation. Which should I choose, given the different results? – KSL Nov 28 '13 at 14:51
  • 1
    From the first sentence alone I would not expect any method that fits a line (plane, hyperplane) to work well with your data: differences between robust and quantile are a separate issue. None of the methods so far discussed pays **any** attention to boundedness of the response. If you don't have exact zeros you might, just possibly, benefit from transformation; otherwise some generalised linear model might help. You are seeking oracular judgements on what is right or what is better that can't be given responsibly when you are only gradually revealing vital characteristics of your data. – Nick Cox Nov 28 '13 at 14:57
  • Thanks for your patience, Nick. I am confused - why can a quantile regression not work given my DV? My understanding of quantile regression is that it makes no distributional assumptions about the error term or the distribution of the DV. Given my DV (bounded, a very small percentage of zeroes, but a long thin right tail, no 1s), the assumptions of OLS are clearly violated. I thought quantile regression could allow me to draw inferences about the effect of my IV on the DV around certain percentiles. Am I mistaken? – KSL Nov 28 '13 at 16:30
  • Distribution here is secondary. The more important point is that a linear functional form is wrong in principle for a bounded variable, as it won't respect the bounds and will predict values outside [0,1] (here) even if it doesn't do so within the observed range of your data. Moreover, it is usually implausible that the response varies linearly even within the bounds, except as a very crude approximation. This is essentially the same reasoning as that behind logit or probit models. (You are ignoring my earlier point that OLS does not in itself imply assumptions.) – Nick Cox Nov 28 '13 at 18:39
  • Thanks Nick for clarifying. I guess in my field, OLS is often used for bounded dependent variables, assuming (as you note, likely incorrectly) that the effect is locally linear. Prediction is not usually a major point of emphasis, although your main point is that inferences are likely affected as well. I appreciate your patience. – KSL Nov 28 '13 at 19:21

1 Answer

4

Is there a general rule of thumb about when robust regression or quantile regression is preferred in the presence of outliers?

Yes. So long as we're comparing regression-equivariant approaches, it is clearly possible to rank the various robust estimates of regression in terms of their capacity to find outliers.

The algorithm behind rreg is described here:

rreg first performs an initial screening based on Cook’s distance $>1$ to eliminate gross outliers before calculating starting values and then performs Huber iterations followed by biweight iterations, as suggested by Li

The Li estimate of regression is in a sense similar to an S-estimator, but with a single starting point. This estimator is not widely used and has not been studied much. I would advise you to use instead the FastS algorithm of Salibian-Barrera & Yohai (2006), about which much more is known.

For more background on why the S-estimator, a robust estimator with a re-descending $\rho$ function, is more reliable than quantile regression, check this answer. The S-estimates of regression are implemented in Stata; see the Verardi and Croux (2008) Stata package and companion paper.

For the second part of your question: the breakdown point of quantile regression is proportional to the quantile you estimate with it. So the $\tau=0.9$ quantile of the quantile regression is much less able to withstand outliers than the $\tau=0.5$ quantile (and is generally not considered robust).

By the way, the fact that an observation is flagged as an outlier does not imply anything about the quality, validity or reliability of the corresponding measurement. It simply means that the flagged observation is inconsistent with the multivariate pattern fitting the bulk of the data. Indeed, in many fields (micro-array analysis, fraud identification) revealing such data points is often the primary objective of the study.

[1] Verardi, V., Croux, C. (2008). Robust regression in Stata. The Stata Journal 9(3): 439-453.
[2] Salibian-Barrera, M., Yohai, V. J. (2006). A Fast Algorithm for S-Regression Estimates. Journal of Computational and Graphical Statistics 15: 414-427.

user603
  • 2
    http://www.stata.com/manuals13/rrreg.pdf gives a much better description of `rreg` in Stata. The account at the UCLA website omits many important details, even for a verbal sketch. – Nick Cox Nov 28 '13 at 10:29
  • I guess I should have rephrased my initial question. I want to find an appropriate method to study my DV, including the outliers. In my case, the outlying DV are the most interesting observations. I do know that fitting an OLS line between all of my observations is likely to suffer from problems, since the residual isn't normally distributed, because the relation between my DV and IV of interest is not likely linear throughout the distribution of the DV, and because the outliers in the DV will dominate the OLS coefficient, leading to incorrect inferences regarding the entire DV distribution. – KSL Nov 28 '13 at 13:57
  • Given that my DV is skewed but that I want to study the effect of my IV on all DV cases, including those in the far right tail, is quantile regression appropriate? Or will the coefficient on the IV in the 90th percentile still be "biased" in some way. Robust regression appears to simply ditch those observations. – KSL Nov 28 '13 at 13:58
  • 1
    @KSL: I'm even more confused now on what you are trying to do. If you are interested in how the IVs relates to the DV for the 10% of the observations with the largest values of the DV, then why not simply take the 10% data with the largest values of the DV and run OLS on *them*? – user603 Nov 28 '13 at 14:14
  • Sorry for not being clear. I'm interested in how the IV of interest relates to the DV of interest both overall and at certain points in the distribution of the DV. Specifically, I expect that my IV should be positively related to DV and that the effect should be strongest within the top 20% of the DV distribution, where there is a long thin tail. I was under the impression that you could not sort the distribution by the DV to get the differing effect of the IV on the DV on that part of the DV distribution. Am I wrong? – KSL Nov 28 '13 at 14:30
  • As I mentioned before, ultimately, I'd like to study the relation between the DV and the IV for all cases of the DV. However, the DV exhibits extreme positive skew, resulting in a few outliers that are actually interesting cases. OLS seems inappropriate, as it fits a line through all observations, which will be primarily determined by the top 10% of observations. Also, the skewness violates OLS assumptions. One option is robust reg, but it simply discards the interesting observations and fits a line through the rest. Is quantile regression acceptable to use in conjunction with OLS in my case? – KSL Nov 28 '13 at 14:40
  • 2
    if you want to estimate a given *conditional* quantile, then, yes, the quantile regression will do that for you. But this is neither an outlier detection tool nor a robust fitting procedure (I think the tags you placed on your question are misleading). – user603 Nov 28 '13 at 14:45
  • 2
    NB: OLS is an estimation procedure, not a model. Please don't conflate the two. – Nick Cox Nov 28 '13 at 15:00
  • Clearly I am missing lots of relevant information and knowledge, and I appreciate your patience and diligence in responding. When you note that quantile regression is not a robust fitting procedure, what do you mean? What I have read about quantile regression is that it can provide more accurate inference regarding causal effects for skewed distributions than OLS, since it can provide the causal effect of the IV on the DV at different values of the DV. Whereas using OLS basically allows certain values of the DV to dictate the line fit (that is, OLS coefficients are sensitive to outliers). – KSL Nov 28 '13 at 16:22
  • Also, sorry for the misleading outlier tag - I removed it and happy to remove/add others. – KSL Nov 28 '13 at 16:32
  • 2
    How can quantile regression far into the tails be robust? It is designed to return whatever quantile you ask for. That is why I emphasised earlier that robust and quantile regression are not to be seen as alternatives. – Nick Cox Nov 28 '13 at 18:42
  • Thanks for your help Nick. I see now that robust regression is trying to minimize outliers and fit a straight line throughout the entire dataset, whereas quantile is finding the effect of the IV at different spots in the DV distribution. As you note, these are not "alternatives". Let me ask one more general question about quantile regression. Using the data I discuss above, I find a significant relation between my DV and IV for the 80th, 85th, 90th, and 95th percentile, but no percentiles below those. Is it appropriate to state that my IV only affects the DV at high values of the DV? – KSL Nov 28 '13 at 19:11
  • 2
    Again, sorry, but just having some experience in data analysis doesn't impart to me the ability to act as your oracle. What you describe could be what you summarize it as, or it could be a side-effect of fitting on the wrong scale. If your data are piled up near zero on your response scale, then necessarily it is hard to distinguish the lower quantiles from close parallels to the x axis. – Nick Cox Nov 28 '13 at 19:23
  • So you are basically saying that inference based on quantile regression can be inappropriate if the distribution is bounded with many observations close to zero? Again, I realize I am not as knowledgeable as you, so your help and patience is appreciated. this response is surprising to me since my understanding of quantile regression is that it makes no assumptions of the distribution of the DV or error term. It just assumes that the effect of the IV on the DV is locally linear at the percentile being investigated. – KSL Nov 28 '13 at 19:46
  • 2
    You need to think more about your findings and what they mean. We know too little about the data (no graph, e.g.). So, I can only give different versions of the same advice (guesses, really). Low quantile regressions are essentially reported as flat (not significant). Perhaps that's real; perhaps it's a side-effect of the skewness of the response. Consider a transformation or link function that stretches the values near zero and see if you get similar results. Your data are possibly too squeezed together near zero to get useful results. Can you make the data accessible? Just one y, one x? – Nick Cox Nov 28 '13 at 23:41
  • Thanks Nick, it finally clicked for me in this last reply. I appreciate your persistence. I will re-examine my results using a log and rank transformation of the dependent variable. I am guessing that if quantile regressions reveals different results with these specifications, then it was a function of the data being squeezed, and not that it is a real effect, as you noted. Thanks so much, again, for your persistence and patience. – KSL Nov 29 '13 at 04:46