9

Let $f$ be a nonnegative function. I am interested in finding $z \in [0,1]$ such that $$ \int_0^{z} f(x)\,dx = \frac{1}{2}\int_0^1 f(x)\,dx$$ The caveat: all I can do is sample $f$ at points in $[0,1]$. I can, however, choose the locations where I sample $f$ randomly, if I so desire.

Questions:

  1. Is it possible to obtain an unbiased estimate of $z$ after finitely many samples? If so, what is the smallest possible variance of such an estimate after $k$ samples?
  2. If not, what procedures are available to estimate $z$, and what are their associated convergence times?

As pointed out by Douglas Zare in the comments, this can be very difficult to do if the function is close to zero or very large. Fortunately, the function I need to use this for is bounded from above and below, so let's suppose that $1 \leq f(x) \leq 2$. Moreover, we can also assume that $f$ is Lipschitz or even differentiable if that helps.

cardinal
  • 24,973
  • 8
  • 94
  • 128
robinson
  • 403
  • 3
  • 8
  • 1
    If you don't have more information, you can have very bad behavior. Imagine that $f$ is $0$ between $1/3$ and $2/3$, and $\int_0^{1/3} f(x)~dx \approx 1/2.$ Slight changes to $f$ will make the median jump from below $1/3$ to above $2/3$. – Douglas Zare Sep 08 '12 at 03:33
  • @robinson Could you provide more information about $f$? Or are you interested in solving the problem for any density $f$? –  Sep 08 '12 at 13:37
  • @DouglasZare - Thanks for the comment; see my edit. – robinson Sep 08 '12 at 19:33
  • @Procrastinator - I edited the question with a little more information. – robinson Sep 08 '12 at 19:35
  • 3
    (+1) For the update. Dividing the left-hand side by the right, one can see that this reduces to finding the median of an unknown probability distribution supported on $[0,1]$. – cardinal Sep 08 '12 at 19:38
  • @cardinal - sure, with the caveat that you can't directly sample from the random variable whose median you are trying to find; you can only evaluate the function $f$. – robinson Sep 08 '12 at 23:08

2 Answers

7

As cardinal pointed out in his comment, your question can be restated as follows.

By simple algebra, the integral equation can be rewritten as $$ \int_0^z g(x)\,dx = \frac{1}{2} \, , $$ in which $g$ is the probability density function defined as $$ g(x)=\frac{f(x)}{\int_0^1 f(t)\,dt} \, . $$

Let $X$ be a random variable with density $g$. By definition, $P\{X\leq z\}=\int_0^z g(x)\,dx$, so your integral equation is equivalent to $$ P\{X\leq z\}=\frac{1}{2} \, , $$ which means that your problem can be stated as:

"Let $X$ be a random variable with density $g$. Find the median of $X$."

To estimate the median of $X$, use any simulation method to draw a sample of values of $X$ and take as your estimate the sample median.

One possibility is to use the Metropolis-Hastings algorithm to get a sample of points with the desired distribution. Due to the form of the acceptance probability in the Metropolis-Hastings algorithm, we don't need to know the normalization constant $\int_0^1 f(t)\,dt$ of the density $g$, so no numerical integration is required.

The code below uses a particularly simple form of the Metropolis-Hastings algorithm known as the independence sampler, which uses a proposal distribution that does not depend on the current value of the chain; here the proposals are independent uniforms. For comparison, the script outputs both the Monte Carlo estimate of $z$ and the result found by standard numerical optimization. The sample points are stored in the vector chain, but we discard the first $10000$ points, which form the so-called "burn-in" period of the simulation.

BURN_IN = 10000
DRAWS   = 100000

f = function(x) exp(sin(x))

chain = numeric(BURN_IN + DRAWS)

x = 1/2

for (i in 1:(BURN_IN + DRAWS)) {
    y = runif(1) # proposal
    if (runif(1) < min(1, f(y)/f(x))) x = y
    chain[i] = x
}

x_min = median(chain[(BURN_IN + 1) : (BURN_IN + DRAWS)])

cat("Metropolis minimum found at", x_min, "\n\n")

# MONTE CARLO ENDS HERE. The integrations below are just to check the results.

A = integrate(f, 0, 1)$value

F = function(x) (abs(integrate(f, 0, x)$value - A/2))

cat("Optimize minimum found at", optimize(F, c(0, 1))$minimum, "\n")

Here are some results:

Metropolis minimum found at 0.6005409 
Optimize minimum found at 0.601365

This code is meant just as a starting point for what you really need. Hence, use with care.
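Since the question guarantees $f$ is bounded above, a possible alternative (not part of this answer's method, just a sketch) is plain rejection sampling, which yields exactly i.i.d. draws from $g = f/\int f$ and so sidesteps burn-in and autocorrelation entirely. For the question's $1 \leq f \leq 2$ one can take the bound $M = 2$; for the demo function $f(x) = e^{\sin x}$ used above, $M = e$ suffices:

```r
set.seed(1)

f <- function(x) exp(sin(x))  # same test function as above; f <= e on [0, 1]

# Rejection sampler: propose U ~ Unif(0,1), accept with probability f(U)/M,
# where M is any upper bound on f. Accepted points are i.i.d. with density
# proportional to f, so their sample median estimates z.
M <- exp(1)
draw <- function(n) {
  out <- numeric(0)
  while (length(out) < n) {
    u <- runif(n)
    keep <- runif(n) < f(u) / M
    out <- c(out, u[keep])
  }
  out[1:n]
}

z_hat <- median(draw(1e5))
z_hat  # close to the optimize() result above
```

Each accepted point costs on average $M / \int_0^1 f$ proposals, so a tight bound $M$ matters for efficiency but not for correctness.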

Zen
  • 21,786
  • 3
  • 72
  • 114
  • Thanks for your answer. I don't know R, so I'm not sure how to parse what you are doing. Could you state in words/formulas your procedure? Thank you. In particular, I'm wondering if you are respecting the constraint that the only thing you can do is evaluate f - you are not allowed, for example, to integrate $f$, (although you can certainly form Monte-Carlo approximations to integrals based on random evaluations). – robinson Sep 08 '12 at 23:12
  • Yes, I'm just evaluating $f$ to get the Monte Carlo estimate. – Zen Sep 08 '12 at 23:26
  • The code is just an example. The R syntax is similar to other languages. Any particular statement that you don't understand? Check out the Wikipedia page on the Metropolis-Hastings algorithm. Of course, the general idea is more important. You can sample from the $f/\int f$ using any method you have available. – Zen Sep 08 '12 at 23:30
  • Did you take an introductory course on stochastic processes, covering discrete time Markov chains? – Zen Sep 09 '12 at 03:01
  • yes. Am I correct in understanding you use Metropolis-Hastings to repeatedly generate samples of $X$, the random variable whose density is proportional to $f$, and then you take the sample median? – robinson Sep 09 '12 at 03:18
  • Yes, this is it. Get a big sample from $f/\int f$ (by any simulation method) and take the sample median as your estimate of $\mathrm{Median}[X]$. – Zen Sep 09 '12 at 15:22
  • I've added more details to the answer. – Zen Sep 10 '12 at 03:18
  • (+1) Although in this case the numerical result is quite good, sometimes it is important to *thin* the MH sample in order to obtain a pseudo-independent sample. For example, by sampling every 25 iterations: `median(chain[seq(BURN_IN,BURN_IN + DRAWS,25)])`. –  Sep 10 '12 at 10:27
  • Good point, Procrastinator. We should never forget that the values of the chain are dependent. And in this case we are not taking any ergodic mean. So, your "jumping" strategy makes sense. – Zen Sep 10 '12 at 17:56
  • 1
    BTW: Procrastinators of the World, unite! But not today... – Zen Sep 10 '12 at 17:57
3

The quality of the integral approximation, at least in a case as simple as 1D, is given by (Theorem 2.10 in Niederreiter (1992)): $$ \Bigl|\frac 1N \sum_{n=1}^N f(x_n) - \int_0^1 f(u) \, {\rm d}u \Bigr| \le \omega \bigl(f; D_N^*(x_1, \ldots, x_N) \bigr) $$ where $$ \omega(f;t) = \sup \{ |f(u)-f(v)| : u, v \in [0,1], |u-v|\le t \}, \quad t>0, $$ is the function's modulus of continuity (related to total variation, and easily expressible for Lipschitz functions), and $$ D_N^*(x_1,\ldots,x_N) = \sup_u \Bigl| \frac1N \sum_n 1\bigl\{ x_n \in [0,u) \bigr\} - u \Bigr| = \frac1{2N} + \max_n \Bigl|x_{(n)} - \frac{2n-1}{2N}\Bigr| $$ is the (star) discrepancy: the maximum difference between the fraction of points of the sequence $x_1, \ldots, x_N$ falling in a semi-open interval $[0,u)$ and its Lebesgue measure $u$. The first expression is the definition; the second holds for 1D sequences in $[0,1]$, with $x_{(1)} \le \cdots \le x_{(N)}$ the ordered points (Theorem 2.6 in the same book).
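The second expression makes $D_N^*$ trivial to compute once the points are sorted. A small R check (the particular point sets are mine, chosen for illustration):

```r
# Star discrepancy of a point set in [0,1], via the closed form for the
# ordered points: D_N* = 1/(2N) + max_n | x_(n) - (2n-1)/(2N) |
star_discrepancy <- function(x) {
  N <- length(x)
  x <- sort(x)
  1 / (2 * N) + max(abs(x - (2 * seq_len(N) - 1) / (2 * N)))
}

N <- 100
midpoints <- (2 * seq_len(N) - 1) / (2 * N)
star_discrepancy(midpoints)   # attains the minimum 1/(2N) = 0.005
set.seed(42)
star_discrepancy(runif(N))    # random points do noticeably worse
```

Note that $D_N^* \geq 1/(2N)$ for any point set, with equality exactly at the midpoint sequence, which is the optimality claim below.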

So obviously, to minimize the error in the integral approximation, at least on the right-hand side of your equation, you need to take $x_n = (2n-1)/2N$. Screw the random evaluations; they run the risk of leaving a random gap at an important feature of the function.

A big disadvantage of this approach is that you have to commit to a value of $N$ in advance to produce this uniformly distributed sequence. If you are not happy with the quality of approximation it provides, all you can do is double $N$ and evaluate $f$ at the midpoints of the previously created intervals.

If you want a solution where you can increase the number of points more gradually, you can continue reading that book and learn about van der Corput sequences and radical inverses. See the Wikipedia page on low-discrepancy sequences; it provides all the details.

Update: to solve for $z$, define the partial sum $$ S_k = \frac1N \sum_{n=1}^k f\Bigl( \frac{2n-1}{2N} \Bigr).$$ Find $k$ such that $$ S_k \le \frac12 S_N < S_{k+1}, $$ and interpolate to find $$ z_N = \frac{2k-1}{2N} + \frac{S_N/2 - S_k}{N(S_{k+1}-S_k)}. $$ This interpolation assumes that $f(\cdot)$ is continuous. If additionally $f(\cdot)$ is twice differentiable, then this approximation can be refined by integrating the second-order expansion to incorporate $S_{k-1}$ and $S_{k+2}$, and solving a cubic equation for $z$.
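The update above translates directly into a few lines of R; here is a sketch using the same test function $f(x) = e^{\sin x}$ as the other answer:

```r
f <- function(x) exp(sin(x))

N  <- 1000
xs <- (2 * seq_len(N) - 1) / (2 * N)   # midpoint nodes (2n-1)/(2N)
S  <- cumsum(f(xs)) / N                # partial sums S_1, ..., S_N

# Find k with S_k <= S_N/2 < S_{k+1}, then interpolate linearly.
half <- S[N] / 2
k    <- max(which(S <= half))
z_N  <- xs[k] + (half - S[k]) / (N * (S[k + 1] - S[k]))
z_N  # agrees with the optimize() result 0.601365 from the other answer
```

With $N = 1000$ evaluations the midpoint sums are accurate to roughly $O(1/N^2)$, so the dominant error in $z_N$ comes from the linear interpolation step.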

Thomas FEL
  • 185
  • 1
  • 8
StasK
  • 29,235
  • 2
  • 80
  • 165
  • 1
    I like the gist of this. I think it would be helpful to make more explicit the strategy you propose to solve the OP's question. At present, the answer reads (to me) mostly as if it were addressing how to compute the RHS of the equation in the question. – cardinal Sep 10 '12 at 00:40
  • 1
    (+1) Nice update. $S_N$ can simply be viewed as a Riemann-sum approximation to the integral where we use the value of $f$ at the midpoint of each interval defined by the partition, rather than the left or right endpoint. :-) – cardinal Sep 10 '12 at 14:26
  • Yes; it is interesting though that this Riemann sum has this optimality justification. – StasK Sep 10 '12 at 15:27