
I have a small-sample dataset of observations from a longitudinal study. My principal interest is in 'change scores' across three parameters (A, B, C), which calls for simple paired t-tests. However, applying the median absolute deviation (MAD) rule, I've found that the change scores for each parameter contain a large proportion of outliers (30-45%); a sketch of what I mean by this rule follows my questions below.

This is a substantial proportion of the full sample, hence my concern. I have several questions I'd appreciate any comments on:

  • Is there a rule for when removing outliers removes too much data (i.e., when outliers represent too great a proportion of the full dataset)?
  • How should I proceed with the analysis? A regular t-test on all the data? A t-test with the outliers removed? Or a robust t-test using trimmed means and Winsorized variance?
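
For reference, a minimal R sketch of the kind of rule I mean (the 2.5 cutoff and the placeholder vector are illustrative only, not my actual data):

```r
x <- c(-480, -35, -12, 0, 8, 15, 22, 40, 310)  # hypothetical change scores
robust_z <- abs(x - median(x)) / mad(x)        # mad() is scaled to agree with the SD under normality
flagged <- robust_z > 2.5                      # flag anything beyond 2.5 robust SDs of the median
mean(flagged)                                  # proportion of observations flagged
```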

Example figures:

[Figure: Histogram of Parameter A]

[Figure: Scatterplot of Parameter A vs Parameter B]

pomodoro
  • this 30%-45% number, do you obtain it by flagging an observation as an outlier whenever it is flagged on *either one* of the three parameters? – user603 Aug 23 '18 at 08:50
  • Yes. For example, the change scores for parameter A have 30% outliers, parameter B has 40%, and parameter C 35%. – pomodoro Aug 23 '18 at 10:47
  • so if you had only parameter A to apply the MAD rule on, you would have 30% outliers? Is this what you mean? Also, can you post the result of `length(unique(x))/length(x)` when, again, `x` is just the first parameter ('parameter A')? [`length(unique())` counts the number of distinct values, so `length(unique(c(1, 1, 1, 2)))` is 2] – user603 Aug 23 '18 at 12:17
  • Yes. If only looking at the change scores for parameter A, I would obtain 30% outliers following the MAD rule. The result of your function is 1: all change score observations are different. – pomodoro Aug 24 '18 at 02:41
  • ok. These 30% outliers, when you plot them (histograms of the parameters individually or, even better, plots of A vs B, A vs C, and B vs C), do they form a cohesive group (are all or a large proportion of the outliers bundled together)? – user603 Aug 24 '18 at 05:50
  • If I plot parameter A, as an example, the non-outliers bundle together (values within +/- 200) and the outliers cluster at the edges (greater than 200, less than -200). – pomodoro Aug 26 '18 at 02:48
  • Plotting A against B, as an example, shows a near-perfect correlation (r = .84). Visually, there do not appear to be outliers. – pomodoro Aug 26 '18 at 02:52
  • I think there is a simple explanation specific to your problem. It would help if you could add these plots to your question. – user603 Aug 26 '18 at 07:17
  • I've updated to add the above two plots to my question. – pomodoro Aug 27 '18 at 08:51
  • I am on vacation till the end of the week. But to give a brief answer, you have an MV outlier problem. Use an [MV outlier detection tool](https://stats.stackexchange.com/a/996/603) such as FMCD on the 3-variate problem; a rough sketch follows these comments. Visually, from the A vs B plot, I think FMCD would flag 2 observations there as outliers (and possibly a third one, it's hard to gauge by eye). – user603 Aug 27 '18 at 09:51
  • Either your "median absolute deviation rule," whatever that might refer to, is incorrect or it is being incorrectly applied, because *any* procedure that identifies 30% of all data as "outliers" is effectively useless. In the scatterplot you provide there is one outlying (and high-leverage) point near (-500,-1400), but that's all. – whuber Aug 27 '18 at 11:35
  • @pomodoro: Looking at the histogram, it is unlikely that the MAD rule would find 30% outliers in that data set. On the plot, I think (-500,-1400) and possibly (-75, 400) (it's hard to gauge by eye) could be flagged as outliers by FMCD, but nowhere near 30% of the data. I think you are computing/using the rule wrong. – user603 Aug 27 '18 at 21:09
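
For concreteness, a minimal sketch of the FMCD approach suggested above (the data here are simulated stand-ins for the three change-score columns; `robustbase::covMcd()` provides the FastMCD fit):

```r
library(robustbase)  # covMcd() implements the Fast MCD (FMCD) estimator
library(MASS)        # mvrnorm(), used only to fake three correlated change-score columns

set.seed(1)
Sigma <- matrix(c(1, .8, .8,  .8, 1, .8,  .8, .8, 1), nrow = 3)
X <- mvrnorm(40, mu = rep(0, 3), Sigma = Sigma) * 100  # hypothetical A, B, C change scores

mcd <- covMcd(X)                                # robust location and scatter
d2  <- mahalanobis(X, mcd$center, mcd$cov)      # robust squared Mahalanobis distances
outliers <- d2 > qchisq(0.975, df = ncol(X))    # flag points far from the robust fit
which(outliers)                                 # indices of flagged observations
```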

1 Answer


Removal of outliers based on simplistic rules without any assessment is generally inappropriate. Any inappropriate removal of outliers is too much and may raise suspicions of scientific misconduct - i.e. that this was done to get a desired result. Of course, it may be very appropriate to remove outliers that are due to failure of measurement equipment.

If you have too many apparent outliers compared to what you would expect under the distribution you assume, then perhaps you are assuming a distribution that does not suit your problem. What was used in other publications in your field may give you an idea of whether other people have already identified a well-suited analysis approach.
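
For instance, a quick simulation (entirely made-up numbers) illustrates the point: a heavy-tailed but perfectly legitimate distribution can make a naive MAD rule flag far more observations than the roughly 1% you would expect under normality.

```r
set.seed(1)
x <- rt(200, df = 2) * 100               # heavy-tailed "change scores", no genuine outliers
robust_z <- abs(x - median(x)) / mad(x)  # mad() is scaled to agree with the SD under normality
mean(robust_z > 2.5)                     # share flagged; typically several times the normal-theory rate
```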

Björn