
I am currently making a short literature study of robust and efficient estimators. Two very well-known ones are the median absolute deviation (MAD) and the interquartile range (IQR). However, both have a weakness for non-normal distributions, although the five-number summary at least has the advantage of showing skewness through the box-and-whiskers plot.

To me, it seems unreasonable to use the IQR, because it assumes the same spacing to the right and to the left of the corresponding quartile. In a distribution like the one shown below, this would cause a lot of "useless" information to still be included (see the first box-and-whiskers plot in the figure below).

What I would like to propose instead is to use the distance from the lower quartile to the median as an estimator of the standard deviation for values below the median, and the distance from the median to the upper quartile for values above the median: $$ \text{if } x<\text{Med:} \ \text{reject if } \text{Med}-x>3(\text{Med}-Q(0.25)) \\ \text{if } x>\text{Med:} \ \text{reject if } x-\text{Med}>3(Q(0.75)-\text{Med}) $$ Of course, this method is quite robust (25% breakdown point) and very easy and fast to calculate. It would also follow the nature of the data when rejecting outliers.
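To make the rule concrete, here is a minimal Python sketch of how I read it: a point is rejected when its distance from the median exceeds three times the quartile-to-median distance on its own side. The function names, the parameter `k`, and the use of NumPy's default (linear-interpolation) quantile definition are my own assumptions for illustration.

```python
import numpy as np

def asymmetric_quartile_fences(x, k=3.0):
    """Return (lower, upper) rejection limits built from the two half-IQRs.

    k is the multiplier from the rule above (3 by default); it is a
    hypothetical parameter name, not part of any established method.
    """
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    lower = med - k * (med - q1)  # left side uses Med - Q(0.25)
    upper = med + k * (q3 - med)  # right side uses Q(0.75) - Med
    return lower, upper

def flag_outliers(x, k=3.0):
    """Boolean mask: True where a value falls outside the asymmetric limits."""
    x = np.asarray(x, dtype=float)
    lower, upper = asymmetric_quartile_fences(x, k)
    return (x < lower) | (x > upper)

if __name__ == "__main__":
    # Right-skewed toy sample (log-normal), purely for illustration.
    rng = np.random.default_rng(0)
    sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)
    print("limits:", asymmetric_quartile_fences(sample))
    print("points flagged:", int(flag_outliers(sample).sum()))
```

One caveat on the scaling: for normal data the quartile-to-median distance is about 0.6745 standard deviations, so each half-spread would have to be divided by roughly 0.6745 before it estimates the standard deviation itself. Without that rescaling, `k=3` corresponds to roughly 2 s.d. under normality.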

What I would like to ask is whether anybody has seen this method applied elsewhere. It would also be great if somebody could help me find a way to calculate the efficiency of this method and/or mention its advantages and disadvantages.
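One way I could imagine approaching the efficiency question is by simulation. The sketch below is only a rough Monte Carlo check under a normal model: it rescales each half-spread by the standard normal 75th percentile (about 0.6745) so that it targets the standard deviation, and then compares its sampling variance with that of the ordinary sample SD. The function name, the default sample size, and the choice of variance ratio as the efficiency measure are my own assumptions, and small-sample bias is ignored.

```python
import numpy as np

# 75th percentile of the standard normal distribution.
NORMAL_Q75 = 0.6744897501960817

def simulate_relative_efficiency(n=100, reps=5000, seed=0):
    """Monte Carlo variance ratio Var(sample SD) / Var(estimator), normal data.

    Values near 1 mean the estimator is about as precise as the sample SD;
    values near 0 mean it is much noisier.
    """
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal((reps, n))

    # Quartiles and median of each simulated sample (one row = one sample).
    q1, med, q3 = np.percentile(samples, [25, 50, 75], axis=1)

    sd = samples.std(axis=1, ddof=1)
    left = (med - q1) / NORMAL_Q75        # lower half-spread, rescaled
    right = (q3 - med) / NORMAL_Q75       # upper half-spread, rescaled
    iqr = (q3 - q1) / (2 * NORMAL_Q75)    # usual rescaled IQR, for reference

    return {
        "left_half": sd.var() / left.var(),
        "right_half": sd.var() / right.var(),
        "iqr": sd.var() / iqr.var(),
    }

if __name__ == "__main__":
    for name, eff in simulate_relative_efficiency().items():
        print(f"{name}: {eff:.3f}")
```

Replacing `rng.standard_normal` with a skewed generator would show how the two half-spreads behave when they no longer estimate the same quantity, which is the situation the proposal is aimed at.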

Here is a picture of what I am trying to say (only a spread of 2 s.d. is used in this case; to compare with the IQR it would be 3 s.d.): (image not shown)

Nick Cox
user3604362
  • Check this answer about shorth: https://stats.stackexchange.com/questions/76848/retrieving-minimum-width-that-contains-specified-fraction-of-all-values/76860#76860 – Tim Mar 01 '16 at 11:09
  • Hm, that is interesting! I was never shown this. However, it only considers the shorter side to make a range around the median, whereas in my case, I use the shorth to make a range on that side, and the other side I use a longer distance. That way it follows better the nature of the data. – user3604362 Mar 01 '16 at 11:18
  • "the IQR having at least the advantage of showing the skewedness through the box and whiskers plot" - I'm not sure I follow this. On its own, the IQR is only a measure of dispersion, not of skewness. – Silverfish Mar 01 '16 at 11:19
  • Silverfish, through the box and whiskers plot, you can see how the median is closer to one quartile than the other, thus showing that the data is skewed. – user3604362 Mar 01 '16 at 11:28
  • The IQR is just what it is defined to be; there is no assumption of symmetry; equally nothing means that it can be interpreted easily without looking at median and quartiles. You might as well say that the SD assumes symmetry around the mean; not so, and nothing stops that being useful for distributions that are right-skewed, e.g. exponentials and Poissons. – Nick Cox Mar 01 '16 at 11:50
  • It is quite fallacious to suppose that symmetry of the quartiles around the median requires normality. That would be true e.g. of logistic and t distributions. You could even construct distributions with quartiles equally distant from the median but asymmetric overall. – Nick Cox Mar 01 '16 at 11:51
  • Right, well, assuming I get a distribution similar to what is shown above, is this method recommendable? And what other method might be better instead? – user3604362 Mar 01 '16 at 13:22
  • I really have a hard time understanding your second paragraph. In particular, can you try to reformulate the second sentence in that paragraph? – user603 Mar 01 '16 at 13:41
  • I have added an image to what I meant. Maybe that helps? – user3604362 Mar 01 '16 at 14:14
  • All summary statistics by their very nature discard some information from the sample, so you will always find such limitations. If you want to know something about the skewness then do not look at the IQR, look at the skewness. The IQR is intended as a robust measure of dispersion and is a fairly concise way of summarizing this information. – dsaxton Mar 01 '16 at 14:15
  • @user3604362: Thank you for the additional clarification. Assuming I understood what you want (a univariate outlier detection method with 25% breakdown that takes the asymmetry of the center of the data into account?), maybe have a look at the adjusted [boxplot](http://stats.stackexchange.com/a/13429/603). – user603 Mar 01 '16 at 14:35
  • That is exactly what I was looking for. However it would be nice to still have somebody mention whether my proposed method is useful and at what point I should start using other methods such as M-estimators. – user3604362 Mar 01 '16 at 16:13
  • As I understand it you are proposing this as an estimator of the SD and the rule is to calculate the SD as usual, but to ignore points outside your limits. So, it seems that this estimator is necessarily biased and the question is then how much and does it matter. I'd expect simulation to be more illuminating here than any other mode of argument. My own preference, FW little IW, is to use something other than the SD whenever I doubt the SD. I think you'd have a hard job convincing people that this was easier to think about and less arbitrary than IQR or MAD. – Nick Cox Mar 01 '16 at 19:54

0 Answers