I want to perform a simple linear regression in R. However, the plot of residuals against fitted values shows outliers. Transformations (i.e., log, square root) did not solve this problem, and removing the outliers created new ones. So, what can I do? Is it wrong to assume a Poisson distribution in this case and fit a generalized linear model instead of a simple linear regression? Are there other options?

Armando
  • 3
    agree that this is better for CrossValidated, but (1) we might need more context (*why* are you running the regression? to explore data, predict future responses, or confirm hypotheses? your answer will affect the appropriate answer to the question); (2) the question is almost unanswerable without a reproducible example – Ben Bolker Jun 08 '15 at 22:49
  • 1
    Your vicious circle is nicely characterized in @MikeHunter's answer [here](http://stats.stackexchange.com/a/155773/17230). See also [Rigorous definition of an outlier?](http://stats.stackexchange.com/q/7155/17230). It *may* be the case that a change in model specification gets rid of outliers without recourse to robust methods, but you'd have to tell us more about your situation. – Scortchi - Reinstate Monica Jun 09 '15 at 09:03
  • I greatly appreciate the comments from all of you (`@Scortchi` `@MikeHunter` `@BenBolker`). My question comes from the field of biology. Specifically, I want to investigate whether a given behaviour (i.e. cannibalism in insects) depends on population density. So, I have 12 replicates: over 1 year, I recorded how often I observed the behaviour each month and the population density in that month. My problem is surely due to having very few replicates. In addition to the outliers that R highlights, the plot of residuals against fitted values resembles a cone, indicating heteroscedasticity. – Armando Jun 10 '15 at 03:53
  • @Armando: (1) Putting usernames between back-ticks allows you to bypass the restriction to one per comment, but those users won't get a notification. (2) In any case, editing your question's better - the info's less likely to be overlooked - & has the side-effect of putting it to the top of the active questions list. (3) The context you've provided is certainly useful, but I think we'd also need detail of the fitted model & diagnostics - perhaps to see those plots - to give any specific advice. – Scortchi - Reinstate Monica Jun 10 '15 at 21:04
  • @Scortchi: Thanks for your advice. I'll display my raw data here (density followed by the number of times the behaviour was observed that month): [1.125; 5], [1.175; 3], [1.425; 3], [1.3; 1], [1.275; 6], [0.675; 1], [0.675; 1], [0.4; 0], [0.5; 0], [0.55; 0], [0.85; 0], [1.225; 1]. I did not perform quantile regression because I have many zeroes. A GLM with a Poisson distribution gave an acceptable residual plot, but since the residuals were overdispersed, I switched to a quasi-Poisson family. Does this 'solution' work? – Armando Jun 11 '15 at 13:44
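
For reference, the Poisson and quasi-Poisson fits described in the last comment can be sketched in base R roughly as follows. This is only a minimal sketch using the twelve data points posted above. The variable names (`density`, `count`, `fit_pois`, `fit_quasi`) are illustrative, and the default log link is an assumption, since the thread does not say which link was used.

```r
# Data as posted in the comment above: population density and the number of
# times the behaviour was observed in each of the 12 months.
density <- c(1.125, 1.175, 1.425, 1.3, 1.275, 0.675,
             0.675, 0.4, 0.5, 0.55, 0.85, 1.225)
count   <- c(5, 3, 3, 1, 6, 1, 1, 0, 0, 0, 0, 1)

# Poisson GLM (log link is the default for this family; assumed here).
fit_pois <- glm(count ~ density, family = poisson)

# Rough overdispersion check: Pearson chi-square divided by the residual
# degrees of freedom. Values well above 1 suggest overdispersion.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) /
  df.residual(fit_pois)
print(dispersion)

# Quasi-Poisson: the coefficient estimates match the Poisson fit, but the
# standard errors are inflated using an estimated dispersion parameter.
fit_quasi <- glm(count ~ density, family = quasipoisson)
print(summary(fit_quasi))
```

With only 12 observations, any dispersion estimate and the resulting standard errors are very imprecise, so the output is best read as exploratory.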

1 Answer

Probably the key question is: what kind of outliers? Outliers in the dependent variable (DV), Y, or in the predictor(s), X? I'm going to assume that the outliers are in the DV. One inheritance of 20th-century regression practice rooted in Gaussian statistics is the assumption that the errors are normally distributed, and a plethora of techniques have been developed to constrain extreme values. The problem with this is that outliers, unless they are "bad" (erroneous or fraudulent), contain potentially valuable information.

This includes the possibility of classes of behaviors and distributions that are not (yet) widely studied, such as power-law or heavy-tailed distributions and extreme value models. Power laws can be motivated by so-called "long-tail" behaviors like sales on online sites such as Amazon. Heavy tails are related to this: consider stock returns on days with huge losses, such as those at the start of the 2008 downturn. Extreme value models (aka generalized extreme value models or GEVs) are a class of models that could use greater exposure and adoption by statisticians, as much real-life behavior is extreme-valued. They are widely used in insurance and financial risk management but have ready applications elsewhere, e.g., in creative industries such as film and music, where "blockbuster" hits dominate box-office and top-line revenues. For instance, William Goldman, a Hollywood screenwriter, has been widely quoted as saying that, in Hollywood, "Nobody knows anything." Goldman was referring to the fact that the hits that generate huge returns are inherently unpredictable.

So, what does all of this have to do with your problem and question? It's a very long-winded way of saying that my recommendation is to treat your outliers as real data and valuable information unless they can be shown to be wrong. One reliable alternative to textbook regression is quantile regression (QR), a robust, distribution-free method that is available in R (for example, in the quantreg package). The key thing about QR is that, unlike traditional OLS regression, which predicts the conditional mean, QR predicts a conditional quantile: the median by default, but any quantile of the analyst's choosing. QR is much more robust to outliers in Y than traditional OLS regression. One approach to deciding which quantile to use is a grid-like search across a range of quantiles above and below the median.
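
As a rough illustration of the quantile regression idea, here is a minimal sketch that fits a small grid of quantiles around the median to the data posted in the question's comments. It assumes the quantreg package is installed and uses its `rq()` function. The grid `taus` and the variable names are illustrative choices, not part of the original question.

```r
# Data from the question's comments: density and monthly counts (12 months).
density <- c(1.125, 1.175, 1.425, 1.3, 1.275, 0.675,
             0.675, 0.4, 0.5, 0.55, 0.85, 1.225)
count   <- c(5, 3, 3, 1, 6, 1, 1, 0, 0, 0, 0, 1)

# Illustrative grid of quantiles below, at, and above the median.
taus <- c(0.25, 0.5, 0.75)

# quantreg is not part of base R, so guard the call (assumption: the
# package may or may not be installed on your machine).
if (requireNamespace("quantreg", quietly = TRUE)) {
  # One linear quantile regression per value of tau. With so few points and
  # many tied zeroes, rq() may warn that the solution is nonunique.
  fit_qr <- quantreg::rq(count ~ density, tau = taus)
  # Rows: intercept and slope; columns: one per quantile in `taus`.
  print(coef(fit_qr))
} else {
  message("quantreg is not installed; try install.packages('quantreg')")
}
```

Comparing the slope across the columns gives a first impression of whether the effect of density differs between low and high quantiles of the response. With 12 observations and many zero counts, the estimates will be rough.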

Let me know if you have any questions.

Mike Hunter