
I will start off by saying that I am aware of this post, but applying Sheppard's correction in my case would seem to lead to a negative variance; so either I don't understand that post, or I have a different problem.

Consider $n$ points $P = \{ p_1, ..., p_n \}$ sampled from a 2D uniform distribution in the rectangle $(a_x,b_x)\times (a_y,b_y)$, and a regular grid $G$ spanning that rectangle with $m = m_x \times m_y$ bins all with the same area: $$ v = \frac{b_x-a_x}{m_x}\cdot \frac{b_y-a_y}{m_y} $$

Then, define the following function: $$ \forall g\in G,\quad \rho(g) = \frac{\mathrm{Count}(P\cap g)}{n} $$ where we count the number of points in $P$ that fall in any single grid cell $g$. What are the mean and variance of $\rho(g)$?
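
For concreteness, here is a rough MATLAB sketch of how $\rho$ is computed (the sample size and grid dimensions below are only illustrative):

    % Rough sketch: sample n uniform points in the rectangle, bin them on an
    % mx-by-my grid, and divide the per-cell counts by n. The values below
    % are only illustrative.
    n = 1000;   ax = 0; bx = 1;   ay = 0; by = 1;   mx = 10; my = 10;
    px = ax + (bx-ax) * rand(n,1);
    py = ay + (by-ay) * rand(n,1);
    rx = linspace( ax, bx, mx+1 );            % cell edges in x
    ry = linspace( ay, by, my+1 );            % cell edges in y
    rho = histcounts2( px, py, rx, ry ) / n;  % rho(g) for every cell g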


The formula for $\rho$ looks like a mean estimator to me. The mean and variance of a continuous uniform distribution over a fixed-size interval are known. With this approach, the mean and variance of $\rho$ across $G$ given $P$ are therefore: $$ \mu_\rho = \frac{v}{mv} = \frac{1}{m} \qquad \sigma_\rho^2 = \frac{v^2}{12 n} $$ (Note that the variance term is divided by $n$ because we are talking about the variance of a mean estimator.)

From simulations though, the variance formula above seems to be wrong. If I understand correctly, applying Sheppard's correction in this case would subtract $v^2/12$ from the variance, which would give a negative result.

Jonathan H
  • You don't seem to be applying Sheppard's correction correctly. It refers to binning *univariate observations.* You appear to be doing something else altogether: namely, estimating *densities* of observations that have been binned. That raises the primary question: what do you hope your procedure will tell you about the data? What are you really trying to estimate? We need you to tell us, because the likelihood is that your procedure isn't accomplishing what you intended. – whuber Oct 27 '21 at 17:19
  • @whuber Thank you for your comment; I see my mistake in applying Sheppard's correction now: this is a different context, in which that correction does not apply. The context of my question is a little removed from the problem as presently stated. I am trying to determine the mean and variance of a measure of [discrepancy](https://en.wikipedia.org/wiki/Quasi-Monte_Carlo_method) for a set of points (taking the average rather than the supremum), with the hope of normalizing it (e.g. to a z-score). This is why I am particularly interested in the variance of $\rho$. – Jonathan H Oct 27 '21 at 17:26
  • Re the edits: how does this situation differ from a standard Binomial distribution? The chance of each $p_i$ lying in $g$ is $1/(m_xm_y)$ and the $p_i$ are independent. It looks like your "empirical" values might be wrong, but it's impossible to tell because you don't describe how they were obtained. – whuber Nov 29 '21 at 19:42
  • Thanks a lot @whuber! In checking my code to add it to the question, I realized I made a mistake in my original simulations: the formulas I derived by interpreting $\rho$ as a binomial r.v. appear to be correct (cf. the code below)! – Jonathan H Nov 29 '21 at 23:03

1 Answer


The correct approach is to interpret the count in each cell as a binomial random variable: $$ \forall g\in G,\quad \mathrm{Count}(P\cap g) \sim B\left( n, \frac{1}{m} \right) $$ Since $\rho(g) = \mathrm{Count}(P\cap g)/n$, the density variable $\rho(g)$ has the following mean and variance: $$ \begin{align} \mu_\rho &= \frac{1}{n}\cdot n \cdot \frac{1}{m} = \frac{1}{m} \\[3mm] \sigma^2_\rho &= \frac{1}{n^2}\cdot n \cdot \frac{1}{m} \cdot \left(1 - \frac{1}{m}\right) = \frac{m - 1}{m^2 n} \end{align} $$

Here is a MATLAB script to verify this via simulation:

    % grid definition
    ax = 0;
    bx = 1;
    ay = 0;
    by = 1;
    
    mx = 11;
    my = 19;
    
    m = mx * my;
    rx = linspace( ax, bx, mx+1 );
    ry = linspace( ay, by, my+1 );
    
    % try with increasing number of points
    npts = ceil(exp(linspace( log(1e2), log(1e4), 20 )));
    
    var_emp = zeros(size(npts)); 
    var_ref = zeros(size(npts));
    for k = 1:length(npts)
        n = npts(k);
        ux = ax + (bx-ax) * rand(n,1);
        uy = ay + (by-ay) * rand(n,1);
        
        H = histcounts2( ux, uy, rx, ry );  % per-cell point counts
        
        var_emp(k) = var(H(:)/n);        % empirical variance of rho across cells
        var_ref(k) = (m-1) / (m^2 * n);  % binomial prediction
    end
    
    % plot results
    plot( npts, var_emp, npts, var_ref, 'LineWidth', 2 ); 
    grid on; xlim(npts([1,end]));
    xlabel('#sample');
    ylabel('Estimator variance');
    legend('Empirical','Theory');

*[Figure: empirical variance vs. binomial formula]*
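
A further sanity check (a rough sketch rather than part of the main script; it reuses `ax`, `bx`, `ay`, `by`, `rx`, `ry` and `m` from above, and `ntrial` and `n` are arbitrary choices) is to look at the count in one fixed cell over many independent point sets, which should follow $B(n, 1/m)$:

    % Count in one fixed cell over many independent point sets; it should
    % follow B(n, 1/m). Reuses ax, bx, ay, by, rx, ry, m from the script
    % above; ntrial and n are arbitrary choices.
    ntrial = 5000;
    n = 1000;
    counts = zeros(ntrial, 1);
    for t = 1:ntrial
        ux = ax + (bx-ax) * rand(n,1);
        uy = ay + (by-ay) * rand(n,1);
        H  = histcounts2( ux, uy, rx, ry );
        counts(t) = H(1,1);              % count in the first cell
    end
    
    % Empirical vs predicted moments of rho = count/n for that cell
    fprintf('mean:     empirical %.3g, predicted %.3g\n', mean(counts)/n, 1/m);
    fprintf('variance: empirical %.3g, predicted %.3g\n', var(counts/n), (m-1)/(m^2*n));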

Jonathan H