I have a dataset with around 9000 cases. I'm running a factor analysis and have found that 1100 cases are identified as multivariate outliers. Is it alright for me to go ahead and delete them?
-
I meant over 10%! – Emily Jones Sep 04 '11 at 04:26
-
As you tagged this question with `factor-analysis`, could you expand a little bit more on the purpose/context of your study? Especially, your statistical units are considered 'outliers' with respect to what? – chl Sep 04 '11 at 11:35
-
What method did you use to classify these as outliers? – user603 Sep 05 '11 at 10:04
3 Answers
It's hard to see how 10% of the data could be called outlying.
There's nothing that says you can't omit them, as long as you say clearly exactly what you did. But, this particular instance seems a bit extreme.
When it comes to outliers, I first ask, are they errors? If they're errors, I'd want to fix them; if I couldn't fix them, I'd be reasonably comfortable omitting them (though I'd worry about bias).
If they seem not to be errors (or there's no way to tell), I'd ask: do they affect the results? If omitting them gives the same answer as keeping them, I'd be happy and move on. If it does matter, I would look for a more robust method of analysis.
I would look more closely at your method for identifying outliers: is it making some sort of assumption that is clearly wrong?
Most importantly, I'd look at lots and lots of different plots of the data, to see what it is that is leading those 10% of points to be called outliers, and whether it seems at all reasonable (though I can't see how it could be).
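To make the check on the outlier method concrete, here is a minimal Python sketch of what I assume is the usual criterion: comparing squared Mahalanobis distances to a chi-square quantile. The data here are simulated stand-ins, not the actual survey:

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the survey data: n cases x p variables.
rng = np.random.default_rng(0)
n, p = 9000, 20
X = rng.normal(size=(n, p))

# Squared Mahalanobis distances from the sample mean and covariance.
diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Under multivariate normality, d2 is approximately chi-square with p
# degrees of freedom, so a common flag is the 0.999 quantile.
cutoff = stats.chi2.ppf(0.999, df=p)
print("fraction flagged:", np.mean(d2 > cutoff))  # ~0.001 if MVN holds
```

If a cutoff like this is flagging over 10% of cases rather than a fraction of a percent, the assumption behind the cutoff is the first thing to question.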

-
I used Mahalanobis distance to detect the multivariate outliers, and it is actually over 10% of the data that are outliers. I checked to see if the values were entered incorrectly, but they were fine. I looked at univariate outliers and those were fine, but when I ran the multivariate analysis, over 10% of the data came out as outliers. So I can go ahead with the deletion as long as I state it? – Emily Jones Sep 04 '11 at 04:45
-
@Emily, data transcription errors aren't the only form of measurement error. More generally, Karl was probably also referring to errors in the measurement apparatus itself. If your measurement tool randomly had a very large mean-zero value added to it, then you might want to delete extreme cases from the dataset. Where you get into trouble is when, for example, your measurement tool only reports high values inaccurately (in which case you would have bias). – Macro Sep 04 '11 at 06:19
-
I'm not sure how the outlier cutoff on the Mahalanobis distance was defined, but my guess is that it was derived under the assumption of multivariate normality. And it's likely the MVN assumption that is suspect, not the points. Take a look at a histogram of the Mahalanobis distances. – Karl Sep 04 '11 at 10:22
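A sketch of the histogram check Karl describes, using heavy-tailed simulated data as a stand-in for the survey. If the data were multivariate normal, the histogram would track the overlaid chi-square density; heavy tails pile mass far to the right instead:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Heavy-tailed stand-in data: each column is t-distributed, not normal.
rng = np.random.default_rng(0)
n, p = 9000, 20
X = rng.standard_t(df=3, size=(n, p))

diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Compare the observed distances to the chi-square(p) density that the
# MVN assumption implies; a big discrepancy implicates the assumption.
grid = np.linspace(0, np.percentile(d2, 99.5), 400)
plt.hist(d2, bins=200, density=True, alpha=0.6, label='observed')
plt.plot(grid, stats.chi2.pdf(grid, df=p), 'r-', label='chi-square(p)')
plt.xlim(0, np.percentile(d2, 99.5))
plt.xlabel('squared Mahalanobis distance')
plt.legend()
plt.show()
```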
-
I'm going to look at the histogram of the Mahalanobis distances. I hadn't thought about that. Thank you! – Emily Jones Sep 05 '11 at 00:10
In addition to @Karl Broman's excellent point, I'm curious how many variables there are. You could be running into the "curse of dimensionality" (see the simulation sketch below).
Also, I would NOT delete outliers just because of some arbitrary threshold. You haven't said what it is you are studying, but, often, the outliers are where the interest is.
And I strongly agree with @Karl's point about looking at graphs first - LOTS of graphs.
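To illustrate the curse-of-dimensionality concern, here is a hypothetical Python simulation (simulated data only, not the OP's survey): under an MVN-based chi-square cutoff, the fraction of flagged cases stays near the nominal 0.1% for truly normal data, but grows steeply with dimension when the variables are even mildly heavy-tailed:

```python
import numpy as np
from scipy import stats

# Fraction of cases flagged at the chi-square 0.999 cutoff when the data
# are exactly normal vs. mildly heavy-tailed, as the dimension p grows.
rng = np.random.default_rng(1)
n = 9000
for p in (5, 50, 171):
    for label, X in (('normal', rng.normal(size=(n, p))),
                     ('t(5)  ', rng.standard_t(df=5, size=(n, p)))):
        diff = X - X.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
        d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
        frac = np.mean(d2 > stats.chi2.ppf(0.999, df=p))
        print(f"p={p:3d}  {label}  flagged: {frac:.3f}")
```

With something on the order of 171 variables, mild non-normality in each item can push the flagged fraction far past the nominal rate, which is one way a double-digit percentage of "outliers" can appear without any individual case being wrong.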

-
I have 171 variables. I'm trying to find the factor structure of the survey. It is so long that we are trying to shorten it, so I'm running it through factor analysis, and that is when I'm encountering this huge number of outliers. When you say graphs, do you mean plotting each pair of variables? – Emily Jones Sep 05 '11 at 00:08
-
I agree with Peter - I dislike throwing away data because it doesn't conform to expectation. – Fomite Sep 05 '11 at 02:48
-
OK, so we gradually learn about the data. Is the survey new or an existing one? What was it intended to do? Why are there 171 questions on the survey? How will factor analysis help shorten it? Factor analysis finds latent variables - but latent variables are *latent*. – Peter Flom Sep 05 '11 at 10:26
-
It's an existing one, but we have also included 56 newer items. The goal is to make a stronger measure, since we have one factor that has 14 items. We are also exploring, as we have added in those 56 items. I ran the histogram of the Mahalanobis distances and it's an almost flat U curve. I don't know what this means! – Emily Jones Sep 05 '11 at 15:26
-
@Emily I don't understand why you need a stronger measure because "one factor has 14 items". You would need a stronger measure if you had evidence of lack of reliability or lack of validity. What were the psychometric properties of the original scale? – Peter Flom Sep 06 '11 at 10:08
While the above topics are interesting, with 171 items I think validity is going to be a concern that overrides statistical ones. There's a real risk that people are going to answer mechanically, resulting in straightlining or in a very large initial factor that represents a halo or horn effect. I think your team should be able to use non-statistical criteria to trim down the survey to a more manageable level that will make it more worthy of the statistical analyses you want to do.
