
I need to compress the outputs of a Bernoulli process and already decided to use Golomb coding. However, there is something that still feels confusing to me and that I hope to understand better. Maybe I'm confused about the notion of an "arrival time difference distribution" for a finite-length process.

For the rest of this question, let $p < 0.5$. (In my application, $p \approx 10^{-4}$ or $10^{-5}$.)

First, the infinite case. Let $B_i$ for $i \in \mathbb{N}$ be i.i.d. $\text{Bernoulli}(p)$ random variables. I think of each $i$ as a time interval, during which something either arrives or does not. The Shannon entropy of each $B_i$ is $h_2(p)$ (the binary entropy function). The waiting times between arrivals (random variable $D$) follow a geometric distribution with parameter $p$, which has Shannon entropy $\frac{h_2(p)}{p}$. This means that a large enough number $n$ of such differences can be encoded (compressed) using arbitrarily close to $n \frac{h_2(p)}{p}$ bits. (Golomb coding is a good method to compress samples from a geometric distribution one by one.)
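To make the Golomb part concrete, here is a minimal Python sketch I would use to sanity-check this (the helper names are my own, not from any library): it encodes geometric samples with the Gallager–Van Voorhis parameter choice $M = \lceil \log(2-p)/(-\log(1-p)) \rceil$ and compares the average codeword length against $\frac{h_2(p)}{p}$.

```python
import math
import random

def golomb_encode(d, m):
    """Golomb codeword (as a '0'/'1' string) for an integer d >= 1:
    unary-coded quotient, then truncated-binary remainder."""
    q, r = divmod(d - 1, m)              # shift so the smallest value maps to 0
    code = "1" * q + "0"                 # unary part
    if m == 1:
        return code                      # pure unary code, no remainder bits
    b = (m - 1).bit_length()             # = ceil(log2(m))
    cutoff = (1 << b) - m                # remainders below this use b-1 bits
    if r < cutoff:
        return code + format(r, "b").zfill(b - 1)
    return code + format(r + cutoff, "b").zfill(b)

def h2(q):
    """Binary entropy function, in bits."""
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

p = 1e-4
# Gallager--Van Voorhis optimal Golomb parameter for Geometric(p)
m = math.ceil(math.log(2 - p) / -math.log(1 - p))

random.seed(0)
# inverse-CDF sampling of D ~ Geometric(p) on {1, 2, ...}
gaps = [1 + int(math.log(1 - random.random()) / math.log(1 - p))
        for _ in range(100_000)]

avg = sum(len(golomb_encode(d, m)) for d in gaps) / len(gaps)
print(f"avg Golomb length {avg:.3f} bits vs H(Geometric(p)) = {h2(p) / p:.3f} bits")
```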

The finite case: the random vector $X_n = (B_1, B_2, \dots, B_n)$ has Shannon entropy $H(X_n) = n h_2(p)$. Thus, for large enough $n$, a single sample from (maybe more precisely: "realization of") $X_n$ can be compressed to (arbitrarily close to) $n h_2(p)$ bits, by Shannon's source coding theorem.
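For scale, with my numbers: $h_2(10^{-4}) \approx 1.47 \times 10^{-3}$, so a block of $n = 10^6$ intervals should be compressible to roughly $1470$ bits, compared to the $10^6$ bits of the raw realization.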

One can map each outcome of $X_n$ to a list of inter-arrival times. Define a map

$$\delta: \{0, 1\}^n \to \text{TupleSumLeq}(n),$$

where $\text{TupleSumLeq}(n)$ is the set of tuples of positive integers, of length at most $n$, whose entries sum to at most $n$ (at least I think that's the right co-domain). The map could be defined such that, e.g., $\delta(0\dots0) = ()$, and the first tuple entry (if present) denotes the "arrival time" of the very first arrival, e.g., $\delta(10100\dots0) = (1, 2)$. One can always obtain a bijection by restricting the co-domain to the image; I will call tuples in the image of $\delta$ "valid lists of inter-arrival times". (I believe the image of $\delta$ is in fact all of $\text{TupleSumLeq}(n)$.)
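This can be checked by brute force for small $n$ (again, the function names below are mine): enumerate $\{0,1\}^n$, apply $\delta$, and compare the image against an explicit enumeration of $\text{TupleSumLeq}(n)$.

```python
from itertools import product

def delta(bits):
    """Map a 0/1 tuple to its tuple of inter-arrival times,
    e.g. delta((1,0,1,0,0)) == (1, 2) and delta((0,0,0)) == ()."""
    arrivals = [i + 1 for i, b in enumerate(bits) if b]   # 1-based arrival times
    return tuple(t - s for s, t in zip([0] + arrivals, arrivals))

def delta_inv(gaps, n):
    """Inverse map: place a 1 at each cumulative sum of the gaps."""
    bits = [0] * n
    t = 0
    for g in gaps:
        t += g
        bits[t - 1] = 1
    return tuple(bits)

def tuple_sum_leq(n):
    """All tuples of positive integers whose entries sum to at most n."""
    out = {()}
    def extend(prefix, budget):
        for g in range(1, budget + 1):
            out.add(prefix + (g,))
            extend(prefix + (g,), budget - g)
    extend((), n)
    return out

n = 8
strings = list(product((0, 1), repeat=n))
assert all(delta_inv(delta(s), n) == s for s in strings)    # invertible on {0,1}^n
assert {delta(s) for s in strings} == tuple_sum_leq(n)      # image is all of TupleSumLeq(n)
print(f"delta is a bijection {{0,1}}^{n} <-> TupleSumLeq({n}), size {2 ** n}")
```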

Since we have the bijection, a valid list of inter-arrival times contains exactly the same information as (a realization of) $X_n$, so the list can also be compressed to exactly the same limit. This means that, for large enough $n$, one can compress a list of inter-arrival times to (arbitrarily close to) $n h_2(p)$ bits. This gives an intuitive heuristic argument for the entropy of the geometric distribution: for large enough $n$, the number of arrivals (and thus also the number of inter-arrival times) is almost always very close to $np$, so a single inter-arrival time can be compressed to about $\frac{n h_2(p)}{np} = \frac{h_2(p)}{p} = H(\text{Geometric}(p))$ bits.

What confuses me:

In the finite case, I cannot quite wrap my head around "the random variable of inter-arrival times inside $X_n$" (call it $D_n$) and its distribution (not the "distribution of sets of arrival times"). I assume one can probably write it down?

While $D$ can take arbitrarily large values, $D_n$ can only take the values $1,\dots,n$ and does not follow a geometric distribution. I think that $H(D_n) < H(D) = \frac{h_2(p)}{p}$ (although probably very close). Is this correct, and can one calculate it exactly?
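I can at least simulate what I mean, under one possible formalization (my own assumption: pool all gaps appearing inside many realizations of $X_n$ and look at their empirical distribution), and compare against the $\text{Geometric}(p)$ pmf; the censoring at the right end of the block is presumably where $D_n$ differs.

```python
import random
from collections import Counter

def gaps_of(bits):
    """Inter-arrival times of the 1s in a 0/1 sequence (same map as delta)."""
    arrivals = [i + 1 for i, b in enumerate(bits) if b]
    return [t - s for s, t in zip([0] + arrivals, arrivals)]

# One possible formalization of D_n: pool the gaps occurring inside many
# realizations of X_n. (The trailing run of zeros after the last arrival is
# censored, which is where the finite-n distortion comes from.)
random.seed(0)
p, n, trials = 0.05, 40, 100_000
counts = Counter()
for _ in range(trials):
    bits = [random.random() < p for _ in range(n)]
    counts.update(gaps_of(bits))

total = sum(counts.values())
for d in (1, 2, 5, 10, 20, 40):
    emp = counts[d] / total
    geo = (1 - p) ** (d - 1) * p
    print(f"d={d:2d}  empirical={emp:.5f}  Geometric(p)={geo:.5f}")
```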

The question that I care the most about is this: when compressing a single realization of $X_n$ (i.e., I cannot just appeal to entropy; I need lossless compression at a finite block size), how advantageous is it to have a large $n$? How far from optimal is Golomb coding the inter-arrival times, really?
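Here is the corresponding experiment for a single finite block (my own sketch, with one assumed convention: $n$ is known to both sides, and I spend $\lceil \log_2(n+1) \rceil$ extra bits on the number of arrivals so the trailing zeros after the last arrival can be reconstructed).

```python
import math
import random

def golomb_len(d, m):
    """Bit length of the Golomb codeword for an integer d >= 1
    (no need to materialize the codeword to measure compression)."""
    q, r = divmod(d - 1, m)
    b = (m - 1).bit_length()                  # = ceil(log2(m))
    cutoff = (1 << b) - m
    return (q + 1) + (b - 1 if r < cutoff else b)

random.seed(1)
p, n = 1e-4, 1_000_000
bits = [random.random() < p for _ in range(n)]
arrivals = [i + 1 for i, b in enumerate(bits) if b]
gaps = [t - s for s, t in zip([0] + arrivals, arrivals)]

m = math.ceil(math.log(2 - p) / -math.log(1 - p))   # optimal Golomb parameter
header = math.ceil(math.log2(n + 1))                # transmit the arrival count,
                                                    # so the trailing zeros are implied
total = header + sum(golomb_len(d, m) for d in gaps)

h2 = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(f"Golomb: {total} bits   entropy bound n*h2(p) = {n * h2:.0f} bits")
```

This measures one realization against the entropy bound, but it does not tell me how close an *optimal* finite-$n$ code could get, which is exactly what I am asking.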

I'm not sure how easy it is to answer such "finite size" questions. In general it's probably very hard, but I'm dealing with quite simple, concrete distributions. Maybe in this case it's possible?
