
I am working with a small dataset (15 points) and have built a linear regression model with two explanatory variables. I then performed LOOCV. One of the points has a noticeably higher error than the rest. When I refit the model without this point, it looks better (both in the fit statistics and in getting the rank ordering right). Can I say that this point is an outlier, and possibly outside the domain of applicability? Can I use LOOCV like this, or should I use something else?
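For concreteness, here is a minimal sketch of what I did (the data below are made up for illustration, and I am using scikit-learn, though any OLS routine would do):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

# Illustrative stand-in for the real data: 15 points, two explanatory variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.3, size=15)

# Leave-one-out CV: fit on 14 points, predict the held-out one.
loo_errors = np.empty(len(y))
for i, (train, test) in enumerate(LeaveOneOut().split(X)):
    model = LinearRegression().fit(X[train], y[train])
    loo_errors[i] = y[test][0] - model.predict(X[test])[0]

print(np.round(np.abs(loo_errors), 3))  # per-point LOOCV error
```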

Any help is appreciated.

[Image: the per-observation LOOCV errors referred to in the comments below]

iva_lu
  • Thank you for your reply. It could be; I don't know. I was thinking that if a point 'looks wrong' (like point 13 in the set), the measurement is likely wrong, and I could use LOOCV to flag the ones that need re-doing. To give a bit more context, one of the variables in my regression model is a calculation that can have several starting points. – iva_lu Nov 24 '20 at 11:06
  • https://stats.stackexchange.com/questions/121071/can-we-use-leave-one-out-mean-and-standard-deviation-to-reveal-the-outliers/121075#121075 – user603 Nov 24 '20 at 11:39
  • Other than occasionally finding a data error (not the best way to do that), this is cheating. The result will be an overstatement of predictive performance. – Frank Harrell Nov 24 '20 at 12:46
  • Thank you very much for your help so far. So, just to check that I got this right: if my model is of the form ax + by + c, where y can take several values (I don't know a priori which one is 'correct'), I should not be using this approach to identify whether one of the points has a 'wrong' y value? In this particular case, when I refit the model with the alternative y value for point 13, the predictions are more in line with what I would expect from the experimental values, but that could just be luck, and it is not good practice or statistically sound. – iva_lu Nov 24 '20 at 12:59
  • I'd say LOOCV (see also https://stats.stackexchange.com/questions/164223/proof-of-loocv-formula/164277#164277) is mainly used to identify how the coefficient estimator changes when leaving out one observation. If it changes "much", we'd call it a high-leverage observation (see the sketch after these comments), see e.g. https://stats.stackexchange.com/questions/208242/hat-matrix-and-leverages-in-classical-multiple-regression – Christoph Hanck Nov 24 '20 at 14:50
  • Given the values of error you present, I would not have said observation 13 was particularly different from the others. As others have suggested, leaving out observations without a good scientific justification is not the best way forward. – mdewey Nov 24 '20 at 15:04
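To illustrate the leverage-based view in the last comments: for OLS, the LOOCV residuals have a closed form via the hat matrix, so no refitting loop is needed. A minimal sketch, assuming X already includes an intercept column (the function and variable names are illustrative):

```python
import numpy as np

def loocv_residuals(X, y):
    """Closed-form LOOCV residuals for OLS: e_(i) = e_i / (1 - h_ii),
    where h_ii are the diagonal entries of the hat matrix
    H = X (X'X)^{-1} X'."""
    H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix
    h = np.diag(H)                         # leverages h_ii
    e = y - H @ y                          # ordinary residuals
    return e / (1.0 - h), h

# Example: prepend an intercept column to a 15 x 2 predictor matrix.
# X2 = np.column_stack([np.ones(15), x1, x2])
# loo_res, lev = loocv_residuals(X2, y)
```

A point with a large leverage h_ii has its LOOCV residual inflated relative to its ordinary residual, which is one way to distinguish a high-leverage observation from one that is merely noisy.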

0 Answers