I have become interested in the statistical analysis of sports and came across a horse racing paper: "Computer Based Horse Race Handicapping and Wagering Systems: A Report" (found at: https://www.gwern.net/docs/statistics/decision/1994-benter.pdf)
One of the features the author uses in the model is a horse's preference for the distance being run in the race. The author experimented with many different ways to calculate this feature but settled on the following specification:
The specification of this feature seems strange to me, and I can't seem to gather any intuition as to why the value of this feature would show a preference for a race distance.
My understanding of the feature is as follows:
- For each of the horse's past races, use a model (that uses no features relating to race distance) to predict the finishing position, $p$
- Calculate the residual $r = a - p$, where $a$ is the horse's actual finishing position
- For each of these races, calculate the similarity $s$ in distance to the current race's distance (this could be through subtraction or using a Euclidean distance)
- For each of these races, consider the point $(s, r)$, and fit a line to these points
- The value of the feature is the slope of the fitted line (my interpretation of "the final magnitude of the estimate") divided by the standard error of the regression (which I interpret to be the MSE between the points and the fitted line)
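
To make my reading of the steps concrete, here is a minimal Python sketch of the calculation as I understand it. This is only my interpretation, not necessarily what the paper does: the function name, the use of absolute difference for $s$, the use of MSE as the "standard error", and all of the numbers are assumptions I made up for illustration.

```python
import numpy as np

def distance_preference_feature(past_distances, residuals, current_distance):
    """Slope of residual vs. distance gap, divided by the MSE of the fit.

    Hypothetical helper reflecting my reading of the feature, not the paper's code.
    """
    # Similarity s: absolute gap between each past race distance and today's distance
    s = np.abs(np.asarray(past_distances, dtype=float) - current_distance)
    # Residual r = actual finishing position - predicted finishing position
    r = np.asarray(residuals, dtype=float)
    # Fit the line r = slope * s + intercept by ordinary least squares
    slope, intercept = np.polyfit(s, r, 1)
    # My interpretation of "standard error of the regression": the MSE of the fit
    mse = np.mean((r - (slope * s + intercept)) ** 2)
    return slope / mse

# Made-up data: five past races (distances in metres) and their residuals
past_distances = [1000, 1200, 1400, 1600, 1800]
residuals = [2.0, 0.5, -1.0, 0.8, 1.5]

feature = distance_preference_feature(past_distances, residuals, current_distance=1400)
print(feature)
```

In this made-up data the residuals grow as the gap to the current distance grows, so the slope (and therefore the feature) comes out positive. Note that the sketch divides by zero if the points lie exactly on a line.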
I don't understand how the strength of the relationship between the residual and the similarity to the current race's distance could indicate a horse's preference for the current race's distance.
Maybe my understanding of the specification of the feature is wrong, or there is something else I am missing?