
I don't quite know how to ask this question or what to search for, but I'm certain this method has a name.

I have a univariate binned distribution, something like x = {0, 10, 20, 30, ...} with corresponding frequencies f = {100, 123, 275, 35, ...}.

The total number of bins is small, like 10 bins or so.

I'd like to generate, in a parametric or (preferably) non-parametric way, a large number of observations that, when binned, have the same moments as this binned distribution.

What is this called, and can someone direct me to any resources or tell me how to do it? I am certain this is not a difficult problem, but it seems to be a gap in my knowledge.

user20160
Joseph

1 Answer


If I understand correctly, you have a continuous variable whose possible values are partitioned into bins, and a histogram giving a count for each bin. This defines a continuous probability distribution whose density function is constant within each bin. Here's how this distribution is defined, and how to sample from it.

The PDF

Suppose there are $k$ bins. Let $n_i$ be the count assigned to the $i$th bin, and let $a_i$ and $b_i$ be its left and right edges (so its width is $b_i - a_i$). The overall probability mass assigned to each bin is given by its count, divided by the total count:

$$p_i = \frac{n_i}{\sum_{j=1}^k n_j}$$

The probability mass in each bin is spread uniformly over its width. So, the probability density at any point $x$ is given by the probability mass of the bin containing it, divided by the bin width (or $0$ if $x$ lies outside all bins):

$$p(x) = \begin{cases} \dfrac{p_1}{b_1-a_1} & x \in \text{bin } 1 \\ \quad\vdots & \\ \dfrac{p_k}{b_k-a_k} & x \in \text{bin } k \\ 0 & \text{otherwise} \end{cases}$$
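As a rough illustration, the density above could be evaluated with NumPy as follows. This is only a sketch: the function names `bin_masses` and `histogram_pdf` are made up for this example, and the data assume that the question's x values are left bin edges with a common width of 10, which the question does not actually state.

```python
import numpy as np

# Assumed example data: the first four values from the question, read as
# left bin edges with a common width of 10 (an assumption, not given).
edges = np.array([0.0, 10.0, 20.0, 30.0, 40.0])   # a_1, ..., a_k, then b_k
counts = np.array([100, 123, 275, 35])            # n_1, ..., n_k

def bin_masses(counts):
    """Probability mass p_i of each bin: its count divided by the total count."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def histogram_pdf(x, edges, counts):
    """Piecewise-constant density p(x); returns 0 for x outside all bins."""
    x = np.atleast_1d(x).astype(float)
    edges = np.asarray(edges, dtype=float)
    heights = bin_masses(counts) / np.diff(edges)   # p_i / (b_i - a_i)
    # Index of the bin containing each x, treating bins as [a_i, b_i).
    idx = np.searchsorted(edges, x, side="right") - 1
    inside = (idx >= 0) & (idx < len(heights))
    out = np.zeros_like(x)
    out[inside] = heights[idx[inside]]
    return out

print(histogram_pdf([5.0, 25.0, 45.0], edges, counts))
```

Treating each bin as half-open, $[a_i, b_i)$, is a convention chosen here so that shared edges belong to exactly one bin; it has no effect on the distribution, since single points carry zero probability.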

Sampling

To sample a point $x$ from this distribution:

  1. Randomly select a bin according to its probability mass. That is, sample an integer $j \in \{1, \dots, k\}$ from a categorical distribution with probabilities $[p_1, \dots, p_k]$:

$$j \sim \text{Cat}(p_1, \dots, p_k)$$

  2. Sample $x$ from a uniform distribution over the chosen bin:

$$x \sim U(a_j, b_j)$$
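The two steps above can be sketched in a few lines of NumPy. The function name `sample_histogram` and the example data are assumptions for illustration (the question's x values are again read as left bin edges of width 10); treat this as one possible way to do it rather than a definitive implementation.

```python
import numpy as np

def sample_histogram(edges, counts, size, rng=None):
    """Draw `size` points from the piecewise-uniform histogram distribution."""
    rng = np.random.default_rng() if rng is None else rng
    edges = np.asarray(edges, dtype=float)
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()                 # bin probability masses p_i
    # Step 1: choose a bin index j ~ Cat(p_1, ..., p_k) for every draw.
    j = rng.choice(len(p), size=size, p=p)
    # Step 2: draw x ~ U(a_j, b_j) inside each chosen bin.
    return rng.uniform(edges[j], edges[j + 1])

# Assumed example data (bin edges are a guess based on the question).
edges = [0, 10, 20, 30, 40]
counts = [100, 123, 275, 35]

rng = np.random.default_rng(0)
x = sample_histogram(edges, counts, size=100_000, rng=rng)

# Re-binning the samples should give proportions close to counts / sum(counts).
rebinned, _ = np.histogram(x, bins=edges)
print(rebinned / rebinned.sum())
print(np.asarray(counts) / np.sum(counts))
```

Passing an explicit `rng` makes the draws reproducible; the re-binned proportions will only match the original ones up to sampling noise.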

Note

Points generated as above are samples from the given histogram. As such, they will reproduce the moments (and other properties) of the histogram itself, up to sampling error. An important distinction arises, however, if the histogram was obtained by quantizing some underlying distribution, or was fit to data generated by some underlying distribution. In that case, the histogram only approximates the underlying distribution and, as whuber has pointed out, moments of the histogram (and of samples from it) may systematically differ from those of the underlying distribution. whuber's comment below links to more information.
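For reference, the moments of the histogram distribution itself follow directly from the piecewise-constant density defined above. The midpoints $c_i = (a_i + b_i)/2$ and widths $w_i = b_i - a_i$ are notation introduced here for convenience. Integrating $x^m$ against the density bin by bin gives:

$$E[X^m] = \sum_{i=1}^k \frac{p_i}{b_i - a_i} \int_{a_i}^{b_i} x^m \, dx = \sum_{i=1}^k p_i \, \frac{b_i^{m+1} - a_i^{m+1}}{(m+1)(b_i - a_i)}$$

In particular, the mean and variance are:

$$E[X] = \sum_{i=1}^k p_i c_i, \qquad \text{Var}[X] = \sum_{i=1}^k p_i \left( c_i^2 + \frac{w_i^2}{12} \right) - \left( \sum_{i=1}^k p_i c_i \right)^2$$

The $w_i^2/12$ term is the variance contributed by spreading each bin's mass uniformly over its width, so it grows with the bin width.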

user20160
  • This approach tends to be biased; that is, it does not (in the long run) reproduce the moments of the distribution that might have generated the data. This issue comes to the fore whenever the bins are few or lengthy or populated at the endpoints. See https://stats.stackexchange.com/questions/60256 for an improved solution (which needs to adopt some quasi-parametric assumptions, however). – whuber May 19 '20 at 17:18
  • @whuber I certainly agree that it won't faithfully reproduce the underlying distribution that originally generated the data. But, the OP asked for something that will "have the same moments as this *binned* distribution" (not the underlying distribution). Since this is a way to sample from the OP's 'binned distribution', wouldn't it satisfy what they're asking for? – user20160 May 19 '20 at 17:24
  • I'm talking only about the moments of the underlying distribution--clearly one is limited by the discretization in terms of reproducing the entire distribution. It occurs to me we have at least two valid interpretations of "this binned distribution:" do we mean the underlying distribution, *which has been binned,* or do we mean the distribution *determined by the histogram bins and frequencies?* – whuber May 19 '20 at 17:39
  • @whuber Yes, it seems we interpreted that statement in different ways. Always nice to hear another perspective. I edited in a note that I hope will help clarify things. – user20160 May 19 '20 at 18:49