I have a set of experimental data (with each data-point having its own measured uncertainty), and I wish to produce a histogram of it. The x values of the edges of each bin are already defined. The trick is that I need to have uncertainties for the value of each bin, since I am then going to fit a model-histogram to it. (The model is of a physical process, the outcome of which is best described by a histogram. The model will be fit using a non-linear least squares algorithm, and I want to weight each bin based on its uncertainty).

The uncertainties of each histogram bin need to depend on both the known uncertainties associated with each data-point within the bin, and also the number of data-points within the bin. This is where I am stuck - how can I calculate this?

Bdawg N
  • I think some more details would help - for instance, are you assuming normal (or some other distribution) errors? – Silverfish May 24 '16 at 08:23
  • @Silverfish yes indeed - the measured experimental errors are normally distributed, with the SD measured for each (they arise from a photodetector that is known to have this property). Each data point can be assumed to be independent. – Bdawg N May 24 '16 at 23:51
  • Best thing to do is edit the new information into the question - not everyone reads the comments, and this might also draw some more attention to the question. – Silverfish May 25 '16 at 00:03

1 Answer

It sounds like you want to calculate a standard error for the unobserved count (i.e. counts of values without the error) in each bin.

For each bin you can calculate the probability that the true value behind a given observation ($x_i^\text{obs}$, written $x_i$ below, with associated standard deviation $\sigma_i$) lies in that bin.

So the number of observations actually in some specific bin, say bin $j$, is the sum of a collection of $\text{Bernoulli}(p_i(j))$ random variables, where $p_i(j)$ is the proportion of the area under a normal distribution $N(x_i,\sigma_i^2)$ that lies within the boundaries of the $j$-th bin.

If the Bernoulli variables are independent, this would imply that the variance of the total count is the sum of the individual variances, so its standard error is

$$\sqrt{\sum_{i=1}^n p_i(j)(1-p_i(j))}$$

where

$$p_i(j) = \int_{l_j}^{u_j} \frac{1}{\sqrt{2\pi}\sigma_i} e^{-\frac{(x_i-z)^2}{2\sigma_i^2}}\, dz$$

where $l_j$ and $u_j$ are the lower and upper boundaries of bin $j$, and so $p_i(j)$ may be written as the difference of two normal cdf values, $\Phi\left(\frac{u_j-x_i}{\sigma_i}\right)-\Phi\left(\frac{l_j-x_i}{\sigma_i}\right)$.
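As a rough illustration of the calculation, here is a minimal Python sketch using NumPy and SciPy. It assumes independent, normally distributed measurement errors with every $\sigma_i > 0$, and it takes the standard error as the square root of the summed Bernoulli variances. The function name `bin_count_uncertainty` and its argument names are made up for this example.

```python
import numpy as np
from scipy.stats import norm


def bin_count_uncertainty(x, sigma, edges):
    """Expected "true" count and its standard error for each bin.

    x, sigma : measured values and their per-point standard deviations (> 0)
    edges    : increasing bin edges, length n_bins + 1
    (Hypothetical helper; names are illustrative only.)
    """
    x = np.asarray(x, dtype=float)[:, None]          # shape (n, 1)
    sigma = np.asarray(sigma, dtype=float)[:, None]  # shape (n, 1)
    edges = np.asarray(edges, dtype=float)[None, :]  # shape (1, n_bins + 1)

    # Normal cdf of each point's error distribution, evaluated at every edge.
    cdf = norm.cdf((edges - x) / sigma)              # shape (n, n_bins + 1)

    # p[i, j] = probability that the true value of point i lies in bin j,
    # i.e. the difference of the cdf at the upper and lower edge of bin j.
    p = np.diff(cdf, axis=1)                         # shape (n, n_bins)

    expected = p.sum(axis=0)                         # sum of Bernoulli means
    std_err = np.sqrt((p * (1.0 - p)).sum(axis=0))   # sqrt of summed variances
    return expected, std_err


# Example with made-up numbers: four points, two bins.
expected, std_err = bin_count_uncertainty(
    x=[0.2, 0.9, 1.1, 1.8],
    sigma=[0.1, 0.3, 0.3, 0.2],
    edges=[0.0, 1.0, 2.0],
)
```

The resulting `std_err` array could then be passed as the per-bin weights to whatever least-squares routine is used for the fit.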

Under the assumption that the different observations' contributions to the count in a given bin are independent, the unobserved "true" count in that bin follows a Poisson-binomial distribution. I don't think we need to use that for anything, though, and - while we can work out the correlation between bin counts - I don't think we need that either if your interest is in the individual per-bin standard errors.
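Should the correlation ever be needed, here is a sketch of where it comes from. Write $N_j$ for the unobserved true count in bin $j$ (a symbol introduced here only for illustration). Assuming the observations are independent and each true value falls in exactly one bin, the indicators for bins $j$ and $k$ of the same observation cannot both be 1, so for $j \neq k$

$$\operatorname{Cov}(N_j, N_k) = -\sum_{i=1}^n p_i(j)\,p_i(k)$$

which is never positive: an observation that is likely to belong to one bin is correspondingly less likely to belong to its neighbour.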

Glen_b
  • Awesome, thanks! This makes sense - I'll give it a shot. (And no, I'm not worried about correlation between counts in this case). – Bdawg N May 28 '16 at 04:38
  • @BdawgN Note that if you have bins in a bounded interval, you need to include two additional bins - below the lower bound and above the upper bound - to get the proportions to sum to 1. – Glen_b May 28 '16 at 04:41
  • Yep, sure. A question though - if the SE of the bin count is the sum of the Bernoulli RVs, then the error would *increase* as the number of counts in the bin increases. This seems counter-intuitive, shouldn't the uncertainty decrease as more data is included? – Bdawg N May 28 '16 at 05:23
  • @BdawgN The standard error of the estimated count increases as the expectation increases, yes. The standard error of the estimate of the *proportion* of the total sample in the current bin decreases as you add more data. This is no different from standard errors of means of iid r.v.s decreasing when $n$ increases but standard errors of totals (i.e. of $n\hat\mu$) increasing when $n$ increases. – Glen_b May 28 '16 at 05:27