I have become interested in the statistical analysis of sports and came across a horse racing paper: "Computer Based Horse Race Handicapping and Wagering Systems: A Report" (found at: https://www.gwern.net/docs/statistics/decision/1994-benter.pdf).

One of the features the author uses in the model is a horse's preference for the distance being run in the race. The author experimented with many different ways of calculating this feature but settled on the following specification:

[Image: excerpt from the paper giving the specification of the distance preference feature]

The specification of this feature seems strange to me, and I can't seem to gather any intuition as to why the value of this feature would show a preference for a race distance.

My understanding of the feature is as follows:

  1. For each of the horse's past races, use a model (one that uses no features relating to race distance) to predict the finishing position, $p$
  2. Calculate the residual $r = a - p$, where $a$ is the horse's actual finishing position
  3. For each of these races, calculate the similarity $s$ between its distance and the current race's distance (this could be through subtraction or using a Euclidean distance)
  4. Take the points $(s, p)$, one per race, and fit a line to them
  5. The value of the feature is the slope of the fitted line (my interpretation of "the final magnitude of the estimate") divided by the standard error of the regression (which I interpret to be the MSE between the points and the fitted line)
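A minimal sketch of steps 2 to 5 as listed above, not the paper's actual procedure. It assumes that $s$ is the absolute difference in distance, that the line is fitted with the residual $r$ as the response (step 4 writes $(s, p)$, while the paragraph below speaks of the residual), and that "standard error" means the MSE of the fit. The function name `distance_feature` and all numbers are hypothetical.

```python
import numpy as np

def distance_feature(distances, predicted, actual, today_distance):
    """Slope of residual vs. distance gap, divided by the MSE of the fit."""
    distances = np.asarray(distances, dtype=float)
    # Step 2: residual r = a - p for each past race.
    r = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    # Step 3: similarity s, taken here as the absolute distance gap (assumption).
    s = np.abs(distances - today_distance)
    # Step 4: fit a straight line; using r as the response is an assumption.
    slope, intercept = np.polyfit(s, r, 1)
    # Step 5: divide the slope by the MSE of the fitted line.
    mse = np.mean((r - (slope * s + intercept)) ** 2)
    if np.isclose(mse, 0.0):
        raise ValueError("perfect fit: MSE is zero, so the ratio is undefined")
    return slope / mse

# Made-up history: the model (step 1) always predicts 3rd place.
feature = distance_feature(
    distances=[1000, 1200, 1600, 2000],
    predicted=[3, 3, 3, 3],
    actual=[1, 2, 5, 6],
    today_distance=1000,
)
print(feature)
```

With these made-up numbers the slope is positive: since $r = a - p$ and a larger finishing position is worse, the horse finishes further behind its prediction as the distance gap grows.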

I don't understand how the strength of the relationship between the residual and the similarity to the current race's distance could indicate a horse's preference for the current race's distance.

Maybe my understanding of the specification of the feature is wrong, or is there something else I am missing?

  • See, *inter alia*, https://stats.stackexchange.com/a/46508/919, https://stats.stackexchange.com/questions/17336, https://stats.stackexchange.com/questions/28474, and https://stats.stackexchange.com/questions/21022/. – whuber Jul 01 '20 at 15:06
  • @whuber I am unsure how these other questions relate to my question of how residuals can be used to show an entity's preference. – user6817585 Jul 01 '20 at 16:05

1 Answer

I think that the problem is with point 4. The regression is not between $s$ and $p$, but between $a$ and $r$, possibly weighted by $s$. In other words, imagine you have short and long races and a horse that likes short races. If today's race is long, the horse will tend to have a negative residual (reading a negative residual as a worse-than-predicted performance); if today's race is short, it will tend to have a positive one. The term "residual" of course refers to the regression in step 1 ($p$ predicting $a$), but when it comes to step 4, $r$ is used as a predictor.
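As a toy illustration of that sign pattern (a sketch, not the paper's method): the data are invented, the residuals use the convention that positive means better than predicted, and the helper `mean_residual_near` with its `tolerance` cut-off is hypothetical.

```python
import numpy as np

# Hypothetical past races for one horse that likes short distances.
distances = np.array([1000, 1100, 1200, 1800, 2000, 2200])
# Residuals from a distance-blind model.
# Assumed convention: positive = performed better than predicted.
residuals = np.array([0.8, 0.5, 0.6, -0.7, -0.4, -0.9])

def mean_residual_near(today_distance, tolerance=300):
    """Average residual over past races within `tolerance` of today's distance."""
    near = np.abs(distances - today_distance) <= tolerance
    return residuals[near].mean()

short_mean = mean_residual_near(1100)  # today's race is short
long_mean = mean_residual_near(2000)   # today's race is long
print(short_mean, long_mean)
```

The average is positive when today's race is short and negative when it is long, which is the information a distance preference variable is meant to carry.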

  • So in step 4, we are using the residuals $r$ to predict the predicted value $p$ from the regression in step 1. When you say 'possibly weighted by $s$', do you mean that $s$ may have an influence or, indeed, may be the coefficient of the regression in step 4? – user6817585 Jul 02 '20 at 21:15
  • Sorry, I meant $a$ not $p$, so you predict actual performance again. I believe $s$ is used to weigh past races according to how similar they are to today's, i.e. use data from short races to predict short races. – JohnnieThick Jul 02 '20 at 21:29
  • Just seen your most recent edit, so when we use the residual to predict the actual finishing position, if the distance of the race had no effect _i.e. the horse had no preference_ the plot should be random and give a small regression coefficient, but if distance did have an effect then there would be some correlation and the regression coefficient would be large? – user6817585 Jul 02 '20 at 21:31
  • Exactly, effectively you have a variable telling you how much better/worse the horse is expected to perform for the given distance. If horses are performing randomly with respect to the distance, this should get a coefficient not significantly different from 0. – JohnnieThick Jul 02 '20 at 21:38
  • The only thing I'm still not understanding is the significance of $s$ in this. Could you elaborate on what you mean by weighing the past races? I don't understand what you mean by using data from short races to predict short races; surely when performing the regression in step 4, you would still use all the data? – user6817585 Jul 02 '20 at 21:59
  • Oh, do you mean that $s$ could be used to weight each datapoint in the least squares function, so that datapoints that come from a track with a closer distance are more important when fitting the regression in step 4 _(so, could weight terms by $\frac{1}{s}$)_, or that you just don't include data points in step 4 if their corresponding $s$ value is above some threshold? – user6817585 Jul 02 '20 at 22:52
  • Yes, the former is what I meant. – JohnnieThick Jul 02 '20 at 23:03
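
The weighting idea discussed in the comments above can be sketched as a weighted least-squares line fit. This is only an illustration under assumptions: the weights are $1/(s + \epsilon)$ rather than $1/s$, with a hypothetical `eps` added to avoid dividing by zero when a past race was run over exactly today's distance, and the data are invented purely to show the effect of the weights.

```python
import numpy as np

def weighted_line_fit(x, y, s, eps=1.0):
    """Fit y = slope * x + intercept, weighting each squared error by 1 / (s + eps)."""
    x, y, s = (np.asarray(v, dtype=float) for v in (x, y, s))
    weights = 1.0 / (s + eps)
    # np.polyfit applies w to the unsquared residuals, so pass the square root.
    slope, intercept = np.polyfit(x, y, 1, w=np.sqrt(weights))
    return slope, intercept

# Invented data: three races at today's distance (s = 0) lie on a rising line,
# three races 1000 units away lie on a falling line.
x = [-1, 0, 1, -1, 0, 1]
y = [2, 3, 4, 6, 5, 4]
s = [0, 0, 0, 1000, 1000, 1000]

slope, intercept = weighted_line_fit(x, y, s)
plain_slope, _ = np.polyfit(x, y, 1)  # unweighted fit for comparison
print(slope, plain_slope)
```

The unweighted fit averages the two groups and returns a slope of about zero, while the weighted fit is dominated by the races run over a similar distance and returns a slope close to 1.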