
I am using a KNN model to predict quantity sold for a highly seasonal business. I chose KNN because I thought that nearest neighbors would capture that seasonality better than a standard regression. For reference, I need the predictions to be fairly close to reality, not a smoothed function that ignores the fact that December volume is millions of units greater than September volume. Which brings me to my question:

While searching for the optimal K, I have found that increasing K reduces my out-of-sample error, yet gives a worse prediction of the test data than a smaller K. By worse I mean that, when I track the predictions next to the actual values, the predicted values are significantly off in certain periods, more so than with a smaller K that has a higher RMSE. My assumption is that a higher K effectively moves my model towards a standard regression model, smoothing the curve, if you will. Is this a valid intuition, or is there something more going on here?
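
To make the smoothing intuition concrete, here is a minimal sketch on made-up numbers (not my real data). It assumes a single feature, month of year, and five years of synthetic history with a December spike. `knn_predict` is a hypothetical helper written for this illustration, not a library function. As K grows past the number of same-month observations, the December prediction gets pulled towards the overall mean, and at K equal to the training size it is exactly the training mean.

```python
def knn_predict(train_x, train_y, query_x, k):
    """Average the targets of the k training points nearest to query_x (1-D feature)."""
    nearest = sorted(zip(train_x, train_y), key=lambda xy: abs(xy[0] - query_x))[:k]
    return sum(y for _, y in nearest) / k

# Synthetic history (an assumption for illustration): 5 years of monthly volumes,
# 100 units in every month except December, which spikes to 1000 units.
train_x, train_y = [], []
for year in range(5):
    for month in range(1, 13):
        train_x.append(month)
        train_y.append(1000.0 if month == 12 else 100.0)

# Predicted December volume for increasing K.
december_by_k = {k: knn_predict(train_x, train_y, 12, k) for k in (1, 5, 10, 30, 60)}

for k, prediction in december_by_k.items():
    print(f"K={k:>2}: predicted December volume = {prediction:.0f}")
```

With K up to 5, only past Decembers are averaged and the peak is preserved. From K = 10 onward, neighboring months are mixed in and the peak flattens, which matches what I see in the high-volume periods.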

chrislee
  • It sounds like RMSE might not be such a good performance metric; you might be more interested in *percentage* error. [The Wikipedia article on mean absolute percent error](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error#Issues) gets into that metric and some alternatives that aim to remedy its issues. – Dave Feb 11 '22 at 15:21
  • If you have seasonal data, then consider informing your regression about that by including predictors that model seasonality, ideally by including a number of Fourier-type harmonics [like this](https://stats.stackexchange.com/a/478175/1352). Alternatively, take a look at classical forecasting methods like seasonal exponential smoothing or SARIMA. As a textbook, I recommend [*Forecasting: Principles and Practice* (3rd ed.) by Athanasopoulos & Hyndman](https://otexts.org/fpp3/). – Stephan Kolassa Feb 11 '22 at 19:10
  • In contrast to @Dave, I don't think switching your error measure to the MAPE will be helpful. But if you want to consider the MAPE, you may want to take a look at [What are the shortcomings of the Mean Absolute Percentage Error (MAPE)?](https://stats.stackexchange.com/q/299712/1352) In a highly seasonal situation, optimizing the MAPE will incentivize a closer fit in low season at the expense of a worse fit in high season - which is presumably the opposite of what you want. – Stephan Kolassa Feb 11 '22 at 19:12
  • @StephanKolassa that is correct. A closer fit in the high seasons is more important, as those periods significantly drive yearly revenues. I tried a SARIMA model, but found that a first-order seasonal AR component was the only nonzero component (even the order of integration was 0, because the data is stationary as is), which effectively held volumes constant with a slight decay. I will try the seasonal exponential smoothing method and other error metrics. – chrislee Feb 11 '22 at 20:19
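
The point in the comments about MAPE favoring the low season can be checked with a small sketch. Everything here is an assumption for illustration: the volumes are invented, and `rmse` and `mape` are plain hand-written helpers rather than calls into any forecasting library. Two forecasts miss by the same number of units in the same number of months, one in low season and one in high season.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Invented monthly volumes: 9 low-season months followed by 3 high-season months.
actual = [100.0] * 9 + [1000.0] * 3
miss = 60.0  # size of the forecast error in units, chosen arbitrarily

# Forecast that is 60 units too low in three low-season months, exact elsewhere.
forecast_low_miss = [a - miss if i < 3 else a for i, a in enumerate(actual)]
# Forecast that is 60 units too low in the three high-season months, exact elsewhere.
forecast_high_miss = [a - miss if i >= 9 else a for i, a in enumerate(actual)]

for name, forecast in [("low-season miss", forecast_low_miss),
                       ("high-season miss", forecast_high_miss)]:
    print(f"{name}: RMSE = {rmse(actual, forecast):.1f}, MAPE = {mape(actual, forecast):.1f}%")
```

Both forecasts have the same RMSE, because the absolute errors are identical, but the low-season miss has a MAPE ten times larger. A model tuned on MAPE would therefore spend its effort on the low season, which is the trade-off described above.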
