0

Context

My question is related to the binmedian algorithm which is suggested in this post and its implementation originally in C and its adaptation in python.

My issue with these algorithms is that they are not truly online in the way they are implemented: the only thing they do is compute the median in one pass given the full array $X = \left[ x_1\; x_2\;\dots\;x_{n-1}\;x_n\right]$. They claim the memory footprint is $O(1)$ but they still require the full array $X$ as input.

Question

Assume at step $k$ you only know two things: the step number $k$, the value $x_k$ and the estimator at the previous step $\theta_{k-1}$. This happens when $X$ has very large memory footprint (say $n>10^7$) and one only wants to load $X$ in batches of small sizes (here assume batch of size $1$).

What would be a good recurrence relation to get $\theta_k$ from $k$, $x_k$ and $\theta_{k-1}$ ? I do not mind storing auxiliary variables as long as they have a (small) fixed size that does not depend on $n$.

Related posts

The following posts answer the question quite well:

The livestats python package is very close to what I want, but when experimenting with gaussian mixtures I've had strange behaviors.

Experiments

When plotting the relative error percentage of the estimated median $\theta_k$ to the target $t_k$ $$ \mathrm{perr}(\theta_k, t_k) = 100\cdot\dfrac{\lvert t_k - \theta_k\rvert}{\lvert t_k\rvert}$$ where $t_k$ is computed using the full array $\texttt{np.median}(X)$, I find important variability in the first $10^3$ samples,

first1kgmm

But the relative error stays below $5\%$ after $10^3$ samples

next10kgmm

I'll open another question if I encounter problems in my experiments.

ArnoV
  • 1,170
  • 3
  • 8
  • 1
    Also related: [Online estimation of quartiles without storing observations](https://stats.stackexchange.com/q/103258/1352) – Stephan Kolassa Feb 05 '21 at 16:17
  • This is helpful, but it's quite disappointing that 10 years later there isn't a clear solution to this problem. I'd be interested in a comparison of different methods. – ArnoV Feb 05 '21 at 16:47
  • The median is not as well behaved as the mean, so I don't think there will *ever* be a "clear" solution... – Stephan Kolassa Feb 05 '21 at 16:48
  • 1
    This problem has been discussed at length in at least three threads: I suggest that the duplicate *does* contain a clear and truly online solution. – whuber Feb 05 '21 at 17:03

0 Answers0