2

I have a data set of locations and associated rent prices. There seem to be several outliers, which I would like to remove so that a plot of my original data becomes more meaningful. In the "world" of statistics, would it be acceptable to do this by eliminating any prices that deviate from the mean by more than twice the standard deviation?

The aim of what I'm working on is to test out several machine learning techniques. I don't have to be extremely accurate but I wouldn't like to do something that is totally unacceptable.
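For concreteness, the rule described above (drop any price more than two standard deviations from the mean) could be sketched roughly as follows. The `rents` values and the function name `flag_two_sd` are made up for illustration and are not from the actual data set; the sketch flags values rather than deleting them, so the decision to drop stays separate.

```python
import statistics

# Hypothetical rents for a single location (illustrative values only);
# the last value is an assumed outlier.
rents = [700, 720, 750, 780, 800, 820, 850, 900, 5000]

def flag_two_sd(prices, k=2.0):
    """Return True for each price more than k sample SDs away from the mean."""
    mean = statistics.mean(prices)
    sd = statistics.stdev(prices)  # sample standard deviation (n - 1 denominator)
    return [abs(p - mean) > k * sd for p in prices]

flags = flag_two_sd(rents)
kept = [p for p, f in zip(rents, flags) if not f]
flagged = [p for p, f in zip(rents, flags) if f]

print("kept:", kept)
print("flagged:", flagged)
```

Note that the mean and standard deviation are themselves computed from data that include the suspect values, so the cutoff moves with the outliers it is meant to detect.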

Nick Cox
Lunat1c
  • Please investigate some of the previous threads on the topic of removing outliers: you can find them by linking through the [tag:outlier] tag. – whuber Nov 16 '13 at 15:40
  • Outliers with respect to which model? "*...would it be acceptable if...*" -- acceptable to whom? What properties do you require? – Glen_b Nov 16 '13 at 17:11
  • You are dealing with a regression task. The assumptions of your model pertain to the residuals (not to the $y$'s themselves). You should be wary of observations whose associated residuals are too far from the fit. The only reliable way to reveal such observations is in terms of their distances to a robust fit of your data. See [this](http://stats.stackexchange.com/questions/15426/whether-to-delete-cases-that-are-flagged-as-outliers-by-statistical-software-whe/50780#50780) answer for more info. – user603 Nov 17 '13 at 11:58
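
As a rough illustration of the last comment's point, the sketch below compares the mean ± 2 SD rule with a simple robust alternative based on the median and the median absolute deviation (MAD). This is only a univariate stand-in, not the robust regression fit the comment refers to, and the data, function names, and the cutoff of 3 are assumptions chosen for the example.

```python
import statistics

# Hypothetical rents (illustrative values only) with two assumed gross outliers.
rents = [700, 720, 750, 780, 800, 820, 850, 900, 5000, 5200]

def two_sd_flags(values, k=2.0):
    """Classical rule: flag values more than k sample SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) > k * sd for v in values]

def mad_flags(values, cutoff=3.0):
    """Robust rule: flag values whose distance from the median exceeds
    cutoff times the scaled MAD (1.4826 makes the MAD comparable to an SD
    under normality)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; robust scores are undefined")
    scale = 1.4826 * mad
    return [abs(v - med) / scale > cutoff for v in values]

classical_flagged = [v for v, f in zip(rents, two_sd_flags(rents)) if f]
robust_flagged = [v for v, f in zip(rents, mad_flags(rents)) if f]

print("mean/SD rule flags:", classical_flagged)
print("median/MAD rule flags:", robust_flagged)
```

With these made-up values the two large rents inflate the standard deviation enough to hide themselves from the classical rule (masking), while the median and MAD are barely affected by them.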

1 Answer

1

Surely, if your machine learning techniques are to be realistic, then they should include outliers! The only thing to add is that your training dataset should be of a reasonable size, such that your "sample" for each location is probably at least 100. In my view the only outliers to exclude are those arising from data errors.

Robert Jones