I have a biological data set on which I would like to do both univariate and multivariate analysis, to find features that correlate with a response. Should I remove univariate outliers before the univariate analysis and, separately, remove multivariate outliers before the multivariate analysis? Or should I remove both univariate and multivariate outliers first and then run both analyses on the remaining data?

  • Many threads here on outliers. Their main burden is that attempts to identify and remove outliers are often unnecessary and in any case would be highly problematic. Look for threads on outliers with high numbers of votes. – Nick Cox Mar 22 '16 at 19:49
  • Sometimes the outliers in a biological study may be the most informative cases. – EdM Mar 22 '16 at 19:59
  • (+1) Taking to heart the wisdom in the comments by @NickCox and EdM, I still like the nature of this question, which might be rephrased to read "When I plan to perform both univariate and multivariate analyses of data, is there some preferred order in which I would *examine* the dataset for univariate and multivariate outliers? Should I be concerned about multivariate outliers when I am performing the univariate analyses?" *Etc, etc.* – whuber Mar 22 '16 at 20:04
  • Yeah, that's actually what I wanted to ask. Should I be concerned about multivariate outliers when I am performing univariate analyses, and vice versa? And is there a preferred order? – Parul Verma Mar 22 '16 at 20:33
  • A good answer might depend on details of your data. What's the general nature of your data? Is your response variable continuous, true/false, or something else? How many and what types of predictors do you have? How many cases? When you say "outliers" do you mean in terms of the responses themselves, of the predictors themselves, or of the relations of the predictors to the response? And how far away, in whichever of these meanings, are the outliers from the other cases? – EdM Mar 23 '16 at 01:15
  • Hi, the response variable is continuous in one case and categorical in another. I am trying to find the correlation of the predictors with each response variable separately. So, I want to find outliers in terms of the predictors themselves. I have around 250 predictors and 120 responses. The predictors are basically concentrations of various molecules. – Parul Verma Mar 23 '16 at 15:32
  • So, I have tried multivariate analysis to find the important predictors out of these 250. In doing so, many predictors were selected just because of a few points that were clearly outliers. The value of those predictors stays constant w.r.t. the response except for one or two points that deviate a lot from the rest. – Parul Verma Mar 23 '16 at 15:35
  • It's best to edit pertinent information into your question rather than just post as comments. Not only does this make it easier to read what your problem is, but editing the question brings it to the attention of readers by "bumping" it on the front page. (Unless people are mentioned by the "@username" convention, they will not be notified by the presence of a new comment.) – Silverfish Mar 24 '16 at 12:31
  • How do you define a univariate outlier? A multivariate one? –  Sep 07 '17 at 04:31

1 Answer

As a first approach, I usually follow the steps described in Zuur et al. (2010), "A protocol for data exploration to avoid common statistical problems". This will help you identify outliers for both univariate and multivariate analyses.

To answer your question: in my experience, an outlier in a univariate analysis is usually also an outlier in a multivariate analysis. However, the assumptions of multivariate analyses are more relaxed than those of univariate analyses. For example, for a redundancy analysis (RDA) you mainly have to check beforehand that your explanatory variables are not highly correlated (multicollinearity), and then make sure your RDA model meets the homogeneity of dispersion assumption. So in the end, the effect of an outlier might be less pronounced in a multivariate analysis.
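The "usually" above matters, because the two kinds of check do not always agree. As a rough sketch on made-up numbers (not the Zuur protocol itself), the code below flags univariate outliers with Tukey's 1.5 × IQR fences and multivariate outliers with the squared Mahalanobis distance, compared against the 97.5% chi-square quantile for 2 degrees of freedom. The data, the function names `tukey_flags` and `mahalanobis_sq`, and both cutoffs are assumptions chosen for illustration.

```python
import math
import statistics


def tukey_flags(values, k=1.5):
    """Indices of values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lo or v > hi]


def mahalanobis_sq(xs, ys):
    """Squared Mahalanobis distance of each (x, y) pair from the sample mean,
    using the ordinary (non-robust) sample covariance matrix."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    return [
        (syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my) + sxx * (y - my) ** 2) / det
        for x, y in zip(xs, ys)
    ]


# Made-up data: 20 points close to the line y = x, plus one point (5, 16)
# that is ordinary on each axis but far from the joint pattern.
xs = [float(v) for v in range(1, 21)] + [5.0]
ys = [x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs[:20])] + [16.0]

# For 2 degrees of freedom the chi-square quantile has a closed form: -2*ln(1 - p).
cutoff = -2.0 * math.log(1.0 - 0.975)

uni_x = tukey_flags(xs)
uni_y = tukey_flags(ys)
d2 = mahalanobis_sq(xs, ys)
multi_flagged = [i for i, d in enumerate(d2) if d > cutoff]

print("univariate flags (x, y):", uni_x, uni_y)
print("multivariate flags:", multi_flagged)
```

In this toy case the last point passes both univariate checks and is flagged only by the multivariate one. With real data, and especially with 250 predictors and about 120 cases, the plain sample covariance is unreliable or singular, so a robust or regularised estimate would be needed before trusting such distances.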

In any analysis, the decision to remove data should be made only after you have run the analysis on the full data and seen that the assumptions are not met because of the outlier(s).
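One simple way to see how much a suspect point drives a result is to compute the statistic on the full data and again without that point. The sketch below does this for a Pearson correlation on made-up numbers that mimic the situation described in the comments: a predictor that is nearly constant except for one extreme value. The data and the helper `pearson_r` are hypothetical, and the comparison is only a sensitivity check, not a rule for deciding what to delete.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up predictor: roughly constant around 1.0, with one extreme last value.
predictor = [1.0, 1.1, 0.9, 1.0, 1.1, 0.9, 1.0, 1.1, 0.9, 8.0]
response = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

suspect = 9  # index of the point under suspicion

# Step 1: analysis on the full data.
r_full = pearson_r(predictor, response)

# Step 2: the same analysis with the suspect point left out.
keep = [i for i in range(len(predictor)) if i != suspect]
r_trimmed = pearson_r([predictor[i] for i in keep], [response[i] for i in keep])

print(f"r with all points:        {r_full:.2f}")
print(f"r without suspect point:  {r_trimmed:.2f}")
```

Here the apparent positive correlation comes almost entirely from the single extreme point. Reporting both results, rather than silently dropping the point, keeps the analysis transparent.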

Mud Warrior