
If I were trying to predict the next number in a list of numbers based on past performance, would it make sense to pick the number that has occurred most frequently? For example, suppose that, based on some parameters, an event has produced the numbers [1,2,3,4,5,6,6,7,8,9,10,11,12,13,14,15] and I want to predict which number will occur next. Could I do the following?

from collections import Counter

data = [1,2,3,4,5,6,6,7,8,9,10,11,12,13,14,15]

# Most frequent value (the mode); ties go to the value encountered first
res = Counter(data).most_common(1)[0][0]

print(res) # 6

Is this statistically valid? Side note: these numbers come from a high-variance sample and won't be as uniform as in the example. All of these numbers are averages as well.

1 Answer


It depends.

First, the numbers in your example are linearly increasing. Assuming this trend holds, it looks rather obvious that 16 is a better prediction than 6.
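To illustrate the point about the trend, here is a minimal sketch that fits a straight line to the example data and extrapolates one step. It assumes the observations are equally spaced, so that their position in the list can stand in for time; the variable names are only illustrative, and a proper forecasting method would model the dynamics more carefully.

```python
data = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

# Assumption: observations are equally spaced, so the index 0, 1, 2, ... acts as "time".
n = len(data)
t = list(range(n))

t_mean = sum(t) / n
y_mean = sum(data) / n

# Ordinary least-squares slope and intercept for y = slope * t + intercept
slope = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, data)) / sum(
    (ti - t_mean) ** 2 for ti in t
)
intercept = y_mean - slope * t_mean

# Extrapolate to the next position, one step past the last observation
next_value = slope * n + intercept
print(next_value)
```

The extrapolated value comes out at about 15.6, far closer to 16 than to the most frequent value 6.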

More generally, if your sequence is ordered, you can use tools from time series analysis to tease out its dynamics. I recommend the excellent free online book Forecasting: Principles and Practice (2nd ed.) by Athanasopoulos & Hyndman.

If you have any useful predictors, of course you should use these in predicting, whether in a time series context or not.

Whatever tool you use, you will obtain some kind of predictive distribution. If your data are unordered and you have no drivers, you can probably assume that they are iid (independent and identically distributed). In this case, your predictive distribution is simply the empirical distribution of the data you have already observed. In all other cases, you will have some kind of conditional distribution.
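As a small sketch of the iid case, the predictive distribution can be estimated by the relative frequency of each observed value. This assumes the values are discrete and that no ordering or predictors matter; the name `predictive` is just illustrative.

```python
from collections import Counter

data = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

counts = Counter(data)
n = len(data)

# Relative frequency of each value = estimated probability that it occurs next
# (valid only under the iid assumption).
predictive = {value: count / n for value, count in sorted(counts.items())}

for value, prob in predictive.items():
    print(value, prob)
```

Here 6 gets probability 2/16 and every other observed value gets 1/16, so the distribution is almost flat and no single value is a strong favorite.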

A single-number prediction is then a one-number summary of such a distribution. The optimal point prediction will depend on this distribution, but also on your loss function or error measure (e.g., Kolassa, 2020, IJF). In other words, you will need to think about what constitutes a "good" forecast. So, if you have no dynamics or structure in your data and use its observed distribution as the predictive distribution:

  • If your loss function is the mean squared error, use the mean of the data.
  • If it is the mean absolute error, then use the median.
  • If your loss function is 0 if your prediction hits exactly and 1 otherwise, use the most frequent value (i.e., the mode of the distribution).
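The three cases above can be checked numerically on the example data. This is only a sketch using Python's standard `statistics` module; the helper functions `mse`, `mae`, and `zero_one` are illustrative names, and each one scores a single constant prediction `c` against all observed values.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

def mse(c):
    """Mean squared error of predicting the constant c."""
    return sum((x - c) ** 2 for x in data) / len(data)

def mae(c):
    """Mean absolute error of predicting the constant c."""
    return sum(abs(x - c) for x in data) / len(data)

def zero_one(c):
    """Share of observations that the constant c misses exactly."""
    return sum(x != c for x in data) / len(data)

candidates = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
}

# Score every candidate point prediction under every loss
for name, c in candidates.items():
    print(f"{name:>6} = {c:<6} MSE={mse(c):.3f}  MAE={mae(c):.3f}  0-1={zero_one(c):.3f}")
```

The mean has the lowest MSE, no candidate beats the median on MAE, and only the mode scores below 1 on the 0-1 loss. On this particular data the mean happens to lie between the two middle values, so it ties the median on MAE; in general that will not be the case.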

More information can be found at "Mean absolute error OR root mean squared error?" and at "Why does minimizing the MAE lead to forecasting the median and not the mean?"

Stephan Kolassa