Context
- You have 200 observations of an individual's time to run 100 metres, measured once a day for 200 days.
- Assume the individual was not a runner before commencing practice.
- At each time point, based on that day's observation and the 199 other observations, you want to estimate the latent time it would take the individual to run 100 metres if they (a) applied maximal effort; and (b) had a reasonably good run for them (i.e., no major problems with the run, but still a typical run). Let's call this latent potential.
Of course, the actual data would not measure latent potential directly; the data would be noisy (see the simulation sketch after this list):
- Times would vary from run to run.
- On some days the individual would be particularly slow because of one or more problems (e.g., tripping at the start, getting a cramp halfway through, not putting in much effort). Such problems would produce large outliers.
- On some days the individual would be slower than you'd expect, perhaps because of more minor issues.
- In general, with practice the runner's latent potential would be expected to improve (i.e., get faster).
- In rare cases, latent potential could worsen (e.g., because of injury).
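To make these assumptions concrete, here is a small simulation of the kind of data-generating process I have in mind (all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 200

# Latent potential: improves with practice along a learning curve,
# with a rare lasting setback (e.g., an injury around day 120).
potential = 16.0 - 3.0 * (1 - np.exp(-np.arange(n_days) / 60.0))
potential[120:] += 0.8

# Observed times: potential + small symmetric luck + minor slowdowns
# + occasional large positive outliers (tripping, cramps, low effort).
luck = rng.normal(0.0, 0.15, n_days)
minor = rng.exponential(0.2, n_days)
problem = rng.binomial(1, 0.05, n_days) * rng.exponential(3.0, n_days)
observed = potential + luck + minor + problem
```

Running this gives 200 right-skewed times scattered around a falling trend, which is the shape of data I'm asking about.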
The implications of this:
- The occasional slow time might provide minimal information about what the individual is capable of.
- A fast time suggests that the individual is capable of such a fast time, but a small part of it might be good fortune on the day (e.g., a favourable wind, a little luck at the start). One way to formalise this asymmetry is sketched below.
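One way to formalise these implications (purely a sketch; every distributional choice here is an assumption on my part) is a state-space model in which latent potential $\theta_t$ evolves slowly, and each observed time $y_t$ is potential plus symmetric luck plus a nonnegative "problem" delay:

$$y_t = \theta_t + \varepsilon_t + \delta_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma_\varepsilon^2)$$
$$\theta_t = \theta_{t-1} + \eta_t, \qquad \eta_t \sim \mathcal{N}(\mu, \sigma_\eta^2), \quad \mu < 0 \text{ (practice trend)}$$
$$\delta_t \sim \begin{cases} 0 & \text{with probability } 1 - \pi \\ \mathrm{Exponential}(\lambda) & \text{with probability } \pi \end{cases}$$

Because $\delta_t$ is never negative, slow times are largely explained away by the delay component, while fast times pull the estimate of $\theta_t$ down, matching the asymmetry above. For tractability, the mixture could be collapsed into a single exponential term, giving an exponentially modified Gaussian (ExGaussian) likelihood.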
The question: How could one estimate latent potential at each of the 200 time points based on the available data and a few assumptions about the nature of running times?
Initial Thoughts: I imagine some form of Bayesian approach could combine the available information and assumptions into an estimate, but I'm not sure where to look for such models. I'm also not clear on how the effectiveness of such a model would be evaluated.
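In case it helps to see the direction I'm considering, here is a rough PyMC sketch of such a model (the priors, parameter values, and the choice of an ExGaussian likelihood are all my own assumptions, not established modelling choices):

```python
import numpy as np
import pymc as pm

def fit_latent_potential(observed_times: np.ndarray):
    n = len(observed_times)
    with pm.Model():
        # Latent potential: Gaussian random walk with a (typically negative)
        # drift, so practice gains and rare setbacks both live in the steps.
        drift = pm.Normal("drift", mu=-0.02, sigma=0.02)
        sigma_step = pm.HalfNormal("sigma_step", sigma=0.05)
        potential = pm.GaussianRandomWalk(
            "potential", mu=drift, sigma=sigma_step,
            init_dist=pm.Normal.dist(16.0, 2.0), steps=n - 1,
        )
        # Right-skewed likelihood: Normal jitter (luck in either direction)
        # plus an Exponential slowdown (problems only ever add time).
        sigma_obs = pm.HalfNormal("sigma_obs", sigma=0.3)
        nu = pm.HalfNormal("nu", sigma=1.0)  # mean extra time from problems
        pm.ExGaussian("y", mu=potential, sigma=sigma_obs, nu=nu,
                      observed=observed_times)
        return pm.sample(1000, tune=1000, target_accept=0.95)
```

The posterior over "potential" would then be the estimate of latent potential at each of the 200 days. On evaluation, the only approach I can think of is simulation-based checking: fit the model to data simulated as above, where the true potential is known, and see whether the posterior recovers it; posterior predictive checks (e.g., pm.sample_posterior_predictive) could assess fit to the real data. I'd welcome pointers to standard approaches.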