I would like to compute prediction intervals for predictions made by kNN regression. I can't find any explicit reference to confirm this, so my question is: is this approach to computing prediction intervals correct?

I have a reference dataset where each row is one location (e.g. city). I have two features (say, x1 and x2), describing a sample from the population of that location (e.g. x1 could be the average income of the residents). Sample size is different for each location. I predict a target variable (say, y, e.g. the total number of cars in that city) based on x1 and x2.

A prediction for a new location Z is made by finding the k nearest neighbors of Z in terms of x1 and x2 (using the Euclidean distance) and averaging the target variable over those k neighbors.

I compute prediction intervals as $y^* \pm t \cdot s$, where $y^*$ is the kNN prediction, $s$ is the standard deviation of the target among the k nearest neighbors, and $t$ comes from the standard normal distribution (e.g. $t = 1.96$ for a 95% prediction interval). I ignore x1 and x2, and I ignore the fact that x1 and x2 are estimated from samples of different sizes. Does the approach make sense?
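To make the procedure concrete, here is a minimal sketch of the approach described above, using only the Python standard library. The function name `knn_interval` and the reference data are made up for illustration; this reproduces the calculation in the question and is not a claim that the resulting interval has the stated coverage.

```python
import math
import statistics

def knn_interval(reference, query, k, z=1.96):
    """reference: list of ((x1, x2), y) rows; query: (x1, x2) of the new location."""
    # Pick the k rows closest to the query by Euclidean distance in (x1, x2).
    nearest = sorted(reference, key=lambda row: math.dist(row[0], query))[:k]
    ys = [y for _, y in nearest]
    y_star = statistics.fmean(ys)   # kNN point prediction: mean target of the neighbors
    s = statistics.stdev(ys)        # sample standard deviation of the neighbor targets
    return y_star, y_star - z * s, y_star + z * s

# Hypothetical reference dataset: three nearby locations and two distant ones.
reference = [
    ((1.0, 1.0), 10.0),
    ((1.2, 0.9), 12.0),
    ((0.8, 1.1), 14.0),
    ((5.0, 5.0), 100.0),
    ((5.5, 4.8), 110.0),
]

y_star, lower, upper = knn_interval(reference, query=(1.0, 1.0), k=3)
print(y_star, lower, upper)
```

With these made-up numbers the three neighbors have targets 10, 12 and 14, so the prediction is 12 and the standard deviation is 2.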

Pål GD
inzl

1 Answer

You've got two options, I think.

  1. Bootstrap

Generate 100 synthetic datasets by sampling with replacement from the original dataset. Run the kNN regression on each new dataset and sort the resulting point predictions. The confidence interval is simply the range from the 5th to the 95th sorted prediction (roughly a 90% interval).
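A rough sketch of this bootstrap, using only the Python standard library. The helper names `knn_predict` and `bootstrap_knn_interval`, the synthetic data, and the choice of k are all assumptions made for illustration.

```python
import math
import random
import statistics

def knn_predict(data, query, k):
    """Average the targets of the k rows closest to query in (x1, x2)."""
    nearest = sorted(data, key=lambda row: math.dist(row[0], query))[:k]
    return statistics.fmean([y for _, y in nearest])

def bootstrap_knn_interval(data, query, k, n_boot=100, seed=0):
    rng = random.Random(seed)
    preds = []
    for _ in range(n_boot):
        # Resample the rows with replacement, keeping the original size.
        resample = rng.choices(data, k=len(data))
        preds.append(knn_predict(resample, query, k))
    preds.sort()
    # With n_boot=100 these are the 5th and 95th sorted predictions.
    lo_idx = max(0, n_boot * 5 // 100 - 1)
    hi_idx = max(0, n_boot * 95 // 100 - 1)
    return preds[lo_idx], preds[hi_idx]

# Synthetic data (an assumption): y = x1 + x2 plus Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(50):
    x1, x2 = rng.uniform(0, 10), rng.uniform(0, 10)
    data.append(((x1, x2), x1 + x2 + rng.gauss(0, 1)))

lo, hi = bootstrap_knn_interval(data, query=(5.0, 5.0), k=5)
print(lo, hi)
```

The interval here reflects how much the kNN point prediction moves when the reference data are resampled.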

  2. Pseudo-Residuals

Basically, you use either a pooled variance estimator (if you have multiple observations at the same $x$) or pseudo-residuals to get an estimate of the variance. Assuming homoskedastic, normally distributed errors, you can use the t-distribution such that:
$\bar y_i \pm t(h,\alpha) \frac{\sigma}{\sqrt{n_i}}$
where $\bar y_i$ is the average predicted value, $h = \frac{n-2}{n}$ is the degrees of freedom of the t-distribution, and $n_i$ is the number of points in the neighborhood.
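As an illustration of the pooled-variance route only, here is a small sketch in standard-library Python. The function names and the data are hypothetical, and the critical value `t_crit` is a placeholder supplied by the caller (look it up from a t-table or a statistics package for your degrees of freedom and $\alpha$); the code does not compute it.

```python
import math
import statistics

def pooled_variance(groups):
    """groups: lists of y values observed at the same x (each with at least 2 values)."""
    # Sum of squared deviations from each group's own mean.
    ss = sum(sum((y - statistics.fmean(g)) ** 2 for y in g) for g in groups)
    # Each group contributes (size - 1) degrees of freedom.
    dof = sum(len(g) - 1 for g in groups)
    return ss / dof

def mean_interval(neighbor_ys, sigma, t_crit):
    """Interval y_bar +/- t_crit * sigma / sqrt(n_i) for one neighborhood."""
    y_bar = statistics.fmean(neighbor_ys)
    half_width = t_crit * sigma / math.sqrt(len(neighbor_ys))
    return y_bar - half_width, y_bar + half_width

# Hypothetical replicate observations at three distinct x values.
groups = [[10.0, 12.0], [20.0, 24.0], [31.0, 29.0]]
sigma = math.sqrt(pooled_variance(groups))

# Neighborhood of 4 points; t_crit=2.0 is a placeholder, not a looked-up value.
lower, upper = mean_interval([10.0, 12.0, 20.0, 24.0], sigma, t_crit=2.0)
print(sigma, lower, upper)
```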

You can read more about it here

CarrKnight
  • At least the first option (bootstrap) is not providing a *prediction* interval but a *confidence* interval for the true average prediction. – Michael M Jul 29 '16 at 16:13
  • This is a common misconception. Prediction interval is just as possible through bootstrap, see for example section 6.3.3 of "Bootstrap methods and their applications" by Davison – CarrKnight Jul 31 '16 at 14:49
  • I'd be very interested in learning more about that. To not hijack this question, I've opened a new thread (http://stats.stackexchange.com/questions/226565/bootstrap-prediction-interval) – Michael M Jul 31 '16 at 16:10
  • @CarrKnight Is it possible / correct to use these methods when your data are time-series? – arroba Sep 23 '16 at 12:58