BOUNTY:
The full bounty will be awarded to someone who provides a reference to any published paper which uses or mentions the estimator $\tilde{F}_n$ below.
Motivation:
This section is probably not important to you and I suspect it won't help you get the bounty, but since someone asked about the motivation, here's what I'm working on.
I am working on a statistical graph theory problem. The standard dense graph limiting object $W : [0,1]^2 \to [0,1]$ is a symmetric function in the sense that $W(u,v) = W(v,u)$. Sampling a graph on $n$ vertices can be thought of as sampling $n$ uniform values on the unit interval ($U_i$ for $i = 1, \dots, n$); the probability of an edge $(i,j)$ is then $W(U_i, U_j)$, with edges drawn independently given the $U_i$. Let the resulting adjacency matrix be called $A$.
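For concreteness, here is a minimal R sketch of that sampling scheme. The graphon $W(u,v) = uv$ and the function name sample_graphon are my own illustrative choices, not part of the problem.
# a minimal sketch: sample an adjacency matrix A from a graphon W
# (W(u,v) = u*v and the name sample_graphon are illustrative choices)
W <- function(u, v){u * v}
sample_graphon <- function(n, W){
  U <- runif(n)                          # latent positions U_1, ..., U_n
  P <- outer(U, U, W)                    # edge probabilities W(U_i, U_j)
  A <- matrix(rbinom(n^2, 1, P), n, n)   # independent Bernoulli draws
  A[lower.tri(A)] <- t(A)[lower.tri(A)]  # symmetrize: A[i,j] = A[j,i]
  diag(A) <- 0                           # no self-loops
  A
}
A <- sample_graphon(10, W)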
We can treat $W$ as a density $f = W / \iint W$, supposing that $\iint W > 0$. If we estimate $f$ based on $A$ without any constraints on $f$, then we cannot get a consistent estimate. I found an interesting result about consistently estimating $f$ when $f$ comes from a constrained set of possible functions. From this estimator and $\sum_{i,j} A_{ij}$, we can estimate $W$.
Unfortunately, the method that I found shows consistency when we sample from the distribution with density $f$. The way $A$ is constructed requires that I sample on a grid of points (as opposed to taking draws from the original $f$). In this stats.SE question, I'm asking about the one-dimensional (simpler) problem: what happens when we can only sample Bernoullis on a grid like this rather than sampling from the distribution directly?
references for graph limits:
L. Lovász and B. Szegedy. Limits of dense graph sequences (arXiv).
C. Borgs, J. Chayes, L. Lovász, V. Sós, and K. Vesztergombi. Convergent sequences of dense graphs I: Subgraph frequencies, metric properties and testing (arXiv).
Notation:
Consider a continuous distribution with cdf $F$ and pdf $f$ supported on the interval $[0,1]$. Suppose the distribution has no point mass, $F$ is everywhere differentiable, and $c := \sup_{z \in [0,1]} f(z) < \infty$ is the supremum of $f$ on $[0,1]$. Let $X \sim F$ mean that the random variable $X$ is sampled from the distribution $F$. The $U_i$ are iid uniform random variables on $[0,1]$.
Problem set up:
Often, we can let $X_1, \dots, X_n$ be random variables with distribution $F$ and work with the usual empirical distribution function $$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n I\{X_i \leq t\},$$ where $I$ is the indicator function. Note that this empirical distribution $\hat{F}_n(t)$ is itself a random variable (for fixed $t$).
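In R, this is exactly what the built-in ecdf() computes; a minimal illustration, using Beta(4, 1.1) draws as a stand-in for samples from $F$:
# minimal illustration of the empirical cdf, with beta draws
# standing in for X_1, ..., X_n ~ F
set.seed(1)
x <- rbeta(100, 4, 1.1)
Fhat <- ecdf(x)    # R's built-in empirical cdf
Fhat(0.5)          # \hat{F}_n(0.5)
pbeta(0.5, 4, 1.1) # the true F(0.5), for comparison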
Unfortunately, I am not able to draw samples directly from $F$. However, I know that $f$ is supported only on $[0,1]$, and I can generate random variables $Y_1, \dots, Y_n$ where $Y_i$ has a Bernoulli distribution with success probability $$p_i = f((i-1+U_i)/n)/c,$$ where $c$ and the $U_i$ are defined above. So, $Y_i \sim \text{Bern}(p_i)$. One obvious way to estimate $F$ from these $Y_i$ values is to take $$\tilde{F}_n(t) = \frac{1}{\sum_{i=1}^n Y_i} \sum_{i=1}^{\lceil tn \rceil} Y_i,$$ where $\lceil \cdot \rceil$ is the ceiling function (that is, just round up to the nearest integer), and to redraw if $\sum_{i=1}^n Y_i = 0$ (to avoid dividing by zero and making the universe collapse). Note that $\tilde{F}_n(t)$ is also a random variable, since the $Y_i$ are random variables.
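A compact sketch of this construction in R (a fuller worked example appears in the "Example in R" section below; the Beta(4, 1.1) stand-in for $f$ is an illustrative choice):
# grid-Bernoulli construction of F tilde
n <- 1000
f <- function(z){dbeta(z, 4, 1.1)}               # stand-in density f
cmax <- dbeta((4-1)/(4+1.1-2), 4, 1.1)           # c = sup_z f(z)
s <- (seq_len(n) - 1 + runif(n))/n               # grid points (i-1+U_i)/n
y <- rbinom(n, 1, f(s)/cmax)                     # Y_i ~ Bern(p_i)
while(sum(y) == 0){y <- rbinom(n, 1, f(s)/cmax)} # redraw if all zeros
Ftilde <- function(t){sum(y[1:ceiling(t*n)])/sum(y)}
Ftilde(0.5)        # the estimate at t = 0.5
pbeta(0.5, 4, 1.1) # the true F(0.5)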
Questions:
From (what I think should be) easiest to hardest.
1. Does anyone know if this $\tilde{F}_n$ (or something similar) has a name? Can you provide a reference where I can see some of its properties?
2. As $n \to \infty$, is $\tilde{F}_n(t)$ a consistent estimator of $F(t)$ (and can you prove it)?
3. What is the limiting distribution of $\tilde{F}_n(t)$ as $n \to \infty$?
4. Ideally, I'd like to bound the following as a function of $n$ -- e.g., show it is $O_P(\log(n)/\sqrt{n})$ -- but I don't know what the truth is (here $O_P$ stands for big O in probability): $$\sup_{C \subset [0,1]} \int_C |\tilde{F}_n(t) - F(t)| \, dt$$
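Note that since the integrand in question 4 is nonnegative, the supremum is attained at $C = [0,1]$, so the quantity is just the $L_1$ distance $\int_0^1 |\tilde{F}_n(t) - F(t)| \, dt$. Here is a rough Monte Carlo sketch of how this error scales with $n$; all concrete choices (the Beta(4, 1.1) stand-in for $F$, the grid sizes, the function name l1err) are illustrative.
# rough Monte Carlo estimate of E \int_0^1 |Ftilde_n(t) - F(t)| dt
l1err <- function(n, reps = 200){
  f <- function(z){dbeta(z, 4, 1.1)}
  cmax <- dbeta((4-1)/(4+1.1-2), 4, 1.1)
  tt <- seq(0, 1, length = 512) # grid of t values for the integral
  mean(replicate(reps, {
    s <- (seq_len(n) - 1 + runif(n))/n
    y <- rbinom(n, 1, f(s)/cmax)
    while(sum(y) == 0){y <- rbinom(n, 1, f(s)/cmax)}
    Ft <- c(0, cumsum(y))[ceiling(tt*n) + 1]/sum(y) # Ftilde_n at each t
    mean(abs(Ft - pbeta(tt, 4, 1.1)))               # approximates the integral
  }))
}
sapply(c(50, 200, 800), l1err) # error should shrink as n grows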
Some ideas and notes:
This looks a lot like acceptance-rejection sampling with a grid-based stratification. Note that it is not, though, because here we do not draw another sample when a proposal is rejected.
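For contrast, here is a minimal sketch of standard acceptance-rejection sampling (the function name rejection_sample is my own): on rejection, a new proposal is drawn, so each accepted value is a genuine draw from $f$.
# standard acceptance-rejection from a density f on [0,1]:
# rejected proposals are replaced until n values are accepted
rejection_sample <- function(n, f, cmax){
  out <- numeric(0)
  while(length(out) < n){
    prop <- runif(1)                                # proposal ~ Uniform(0,1)
    if(runif(1) < f(prop)/cmax) out <- c(out, prop) # accept w.p. f(prop)/c
  }
  out
}
x <- rejection_sample(500, function(z){dbeta(z, 4, 1.1)},
                      dbeta((4-1)/(4+1.1-2), 4, 1.1))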
I'm pretty sure this $\tilde{F}_n$ is biased. I think the alternative $$\tilde{F}^*_n(t) = \frac{c}{n} \sum_{i=1}^{\lceil tn \rceil} Y_i$$ is unbiased, but it has the unpleasant property that $\mathbb{P}\left(\tilde{F}^*_n(1) = 1\right) < 1$.
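A quick Monte Carlo check of this (strictly, $\tilde{F}^*_n(t)$ is unbiased for $F(\lceil tn \rceil / n)$, which differs from $F(t)$ by $O(1/n)$); the Beta(4, 1.1) density is again only a stand-in:
# check: the mean of (c/n) * sum_{i <= ceil(tn)} Y_i should be near F(t)
n <- 200; t0 <- 0.5555; reps <- 5000
f <- function(z){dbeta(z, 4, 1.1)}
cmax <- dbeta((4-1)/(4+1.1-2), 4, 1.1)
est <- replicate(reps, {
  s <- (seq_len(n) - 1 + runif(n))/n
  y <- rbinom(n, 1, f(s)/cmax)
  (cmax/n) * sum(y[1:ceiling(t0*n)])
})
mean(est)         # near F(ceiling(t0*n)/n), hence near F(t0)
pbeta(t0, 4, 1.1) # the truth F(t0)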
I'm interested in using $\tilde{F}_n$ as a plug-in estimator. I don't think this is useful information, but maybe you know of some reason why it might be.
Example in R
Here is some R code if you want to compare the empirical distribution with $\tilde{F}_n$.
# sample from a beta distribution with parameters a and b
a <- 4   # make this > 1 to get the mode right
b <- 1.1 # make this > 1 to get the mode right
qD <- function(x){qbeta(x, a, b)} # quantile function (inverse cdf)
dD <- function(x){dbeta(x, a, b)} # density
pD <- function(x){pbeta(x, a, b)} # cdf
mD <- dbeta((a-1)/(a+b-2), a, b)  # c = sup_z f(z), attained at the mode
# draw samples for the empirical distribution and \tilde{F}
draw <- function(n){ # n is the number of observations
  u <- sort(runif(n))
  x <- qD(u) # samples for the empirical dist
  z <- 0     # keep track of how many y_i == 1
  # take bernoulli samples at the points s
  s <- seq(0, 1 - 1/n, length = n) + runif(n, 0, 1/n)
  p <- dD(s) # density at s
  while(z == 0){ # make sure we get at least one y_i == 1
    y <- rbinom(n, 1, p/mD) # the y_i that we sampled
    z <- sum(y)
  }
  result <- list(x = x, y = y, z = z)
  return(result)
}
sim <- function(simdat, n, w){
  # F hat -- empirical dist at w
  fh <- mean(simdat$x <= w)
  # F tilde
  ft <- sum(simdat$y[1:ceiling(n*w)])/simdat$z
  # Uncomment this if we want an unbiased estimate.
  # It can take on values > 1, which is undesirable for a cdf.
  ### ft <- sum(simdat$y[1:ceiling(n*w)]) * (mD / n)
  return(c(fh, ft))
}
set.seed(1) # for reproducibility
n <- 50 # number observations
w <- 0.5555 # some value to test this at (called t above)
reps <- 1000 # look at this many values of Fhat(w) and Ftilde(w)
# simulate this data
samps <- replicate(reps, sim(draw(n), n, w))
# compare the true value to the empirical means
pD(w) # the truth
apply(samps, 1, mean) # sample mean of (Fhat(w), Ftilde(w))
apply(samps, 1, var) # sample variance of (Fhat(w), Ftilde(w))
apply((samps - pD(w))^2, 1, mean) # mean squared error around the truth
# now let's look at what a single realization might look like
dat <- draw(n)
plot(NA, xlim=0:1, ylim=0:1, xlab="t", ylab="empirical cdf",
     main="comparing ECDF (red), Ftilde (blue), true CDF (black)")
s <- seq(0, 1, length=1000)
lines(s, pD(s), lwd=3) # truth in black
abline(h=0:1)
lines(c(0, rep(dat$x, each=2), Inf),
      rep(seq(0, 1, length=n+1), each=2),
      col="red")
lines(c(0, rep(which(dat$y==1)/n, each=2), 1),
      rep(seq(0, 1, length=dat$z+1), each=2),
      col="blue")
EDITS:
EDIT 1 --
I edited this to address @whuber's comments.
EDIT 2 --
I added R code and cleaned it up a bit more. I changed notation slightly for readability, but it is essentially the same. I'm planning on putting a bounty on this as soon as I'm allowed to, so please let me know if you want further clarifications.
EDIT 3 --
I think I addressed @cardinal's remarks. I fixed the typos in the total variation. I'm adding a bounty.
EDIT 4 --
Added a "motivation" section for @cardinal.