
Let $\{x_1, \ldots, x_n\}$ be a set of $n$ i.i.d. samples from a distribution $p(x)$. I would like to evaluate the distribution of the sum $$ S = \sum_{1\leq i<j\leq n} f(x_i, x_j), $$ where $f$ is a continuous function.

The sample size $n$ is large enough that approximating the distribution by Monte Carlo simulation is impractical: each draw of $S$ requires evaluating $\binom n2$ terms. The classical central limit theorem cannot be applied because the summands, although identically distributed, are not independent: any two terms sharing an index are dependent.

Edit: The question is distinct from Limit of a convolution and sum of distribution functions. The linked question considers the behaviour of $P(S>x)$ in the limit $x\rightarrow \infty$; I am, however, interested in the limit $n\rightarrow\infty$, where $n$ is the number of samples.

Edit: Maybe adding some context will make the question more easily understandable. I am considering interaction rates between distinct entities labelled by an index $i$. A random attribute $x_i$ is associated with each entity, and the interaction rate $f(x_i, x_j)$ between two entities is a function of those attributes. I would like to determine the distribution of the total event rate $S$. At the moment, I am using a very simple functional form $f(x_i,x_j)=\exp(-a |x_i-x_j|)$.
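For concreteness, here is a minimal numpy sketch of the direct evaluation of $S$ (assuming standard-normal attributes and the placeholder value $a=1$); the $O(n^2)$ cost per draw is what makes brute-force Monte Carlo impractical for large $n$:

```python
import numpy as np

def total_rate(x, a=1.0):
    """Direct O(n^2) evaluation of S = sum_{i<j} exp(-a |x_i - x_j|)."""
    rates = np.exp(-a * np.abs(x[:, None] - x[None, :]))  # all pairwise rates
    i, j = np.triu_indices(len(x), k=1)                   # strict upper triangle, i < j
    return rates[i, j].sum()

rng = np.random.default_rng(0)
n = 500  # feasible here; the question concerns much larger n
draws = [total_rate(rng.standard_normal(n)) for _ in range(200)]
print(np.mean(draws), np.std(draws))
```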

Till Hoffmann
  • possible duplicate of [Limit of a convolution and sum of distribution functions](http://stats.stackexchange.com/questions/29052/limit-of-a-convolution-and-sum-of-distribution-functions) – Xi'an Jul 22 '15 at 12:40
  • @Xi'an: I clarified how the question differs. – Till Hoffmann Jul 22 '15 at 12:48
  • Any information on $p$ or $f$? (I don't assume $f$ is symmetric, is it?) – Stephan Kolassa Jul 22 '15 at 13:04
  • @StephanKolassa: Unfortunately no information on $p$ except that I'm happy to assume it has finite moments. I would prefer to not make the assumption that $f$ is symmetric under exchange but am happy to make the assumption if it helps with the derivation. – Till Hoffmann Jul 22 '15 at 13:07
  • My first thought was [McDiarmid's inequality](https://en.wikipedia.org/wiki/Doob_martingale#McDiarmid.27s_inequality), but that's only really helpful here if for all $x, y$, $\min\left( \sup_{x'} \lvert f(x, y) - f(x', y) \rvert, \sup_{y'} \lvert f(x, y) - f(x, y') \rvert \right) < C n^{-3/2}$. [Talagrand's inequality](https://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/#tala-conc) could also help for bounded $p(x)$ [which could possibly be worked around via truncation] and convex, Lipschitz $f$ [which I think can't be easily worked around]. – Danica Jul 22 '15 at 16:46
  • If $p$ is Gaussian and $f$ Lipschitz, there is also the [Gaussian concentration inequality](https://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/#gauss-conc), but I don't know if there are any generalizations to non-Gaussian distributions (other than Talagrand's). – Danica Jul 22 '15 at 16:51
  • Added a few more details and an example of a simple function $f$ which I am currently trying to make progress with. – Till Hoffmann Jul 22 '15 at 19:41
  • When $f$ is a [kernel function](https://en.wikipedia.org/wiki/Positive-definite_kernel), as in your edit, $T = S \big/ \binom{n}2$ is an unbiased estimator for the squared norm of the [kernel embedding](https://en.wikipedia.org/wiki/Kernel_embedding_of_distributions) of $p$ under $f$. The behavior of this embedding has been studied particularly by [Song (2008)](http://www.cc.gatech.edu/~lsong/papers/lesong_thesis.pdf) and [Gretton et al. (2012)](http://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf). I haven't looked thoroughly for a result that'll be useful here, though. – Danica Jul 22 '15 at 21:13
  • Why are the $f(X_i,X_j)$ not _identically distributed_ random variables. I agree that $f(X_i,X_j)$ and $f(X_i,X_k)$ are not independent but why the blanket claim that they are not i.i.d.? Also, surely $f(X_i,X_j)$ and $f(X_k,X_{\ell})$ are independent too? Also, in what sense is $e^{-|x|}$ a _smooth_ function? – Dilip Sarwate Jul 23 '15 at 02:47
  • @DilipSarwate, you are right that the variables would be i.i.d. if we summed up $f(X_i,X_j)$ such that no index was ever repeated. Unfortunately that's not the case here because the same $i$ occurs multiple times for different $j$. Thanks for pointing out the *smooth* function error. I'll correct that. – Till Hoffmann Jul 23 '15 at 10:05
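To illustrate the dependence structure raised in the comments above, a quick numeric check (an added sketch, using the example kernel with $a=1$ and standard-normal samples): terms sharing an index are correlated, while terms over disjoint index sets are independent.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x, y: np.exp(-np.abs(x - y))
x1, x2, x3, x4 = rng.standard_normal((4, 1_000_000))

# Terms sharing the index 1 are correlated...
print(np.corrcoef(f(x1, x2), f(x1, x3))[0, 1])  # clearly nonzero
# ...while terms over disjoint index sets are independent.
print(np.corrcoef(f(x1, x2), f(x3, x4))[0, 1])  # approximately zero
```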

2 Answers


You might start by determining some of the moments of $S$:

$$E(S)={n \choose 2}E(f(x_1,x_2))$$

$$\begin{align} E(S^2) ={}& {n \choose 2}E\left(f(x_1,x_2)^2\right) \\ &+ 2{n \choose 3}\Big(E\big(f(x_1,x_2)f(x_1,x_3)\big)+E\big(f(x_1,x_2)f(x_2,x_3)\big)+E\big(f(x_1,x_2)f(x_3,x_2)\big)\Big) \\ &+ 6{n \choose 4}E\left(f(x_1,x_2)\right)^2 \end{align}$$

either in general or for some simpler initial form such as $f(x_i,x_j)=x_i x_j^2$, and then appeal to the central limit theorem.
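As a sanity check on these formulas, here is a short Python sketch (an added illustration, assuming standard-normal $x_i$ and the simple non-symmetric choice $f(x,y)=xy^2$) that compares them against direct simulation:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(2)
f = lambda x, y: x * y**2  # simple non-symmetric example
n, reps = 20, 20_000

# Direct simulation of E(S) and E(S^2).
i, j = np.triu_indices(n, k=1)
s = np.array([f(x[i], x[j]).sum() for x in rng.standard_normal((reps, n))])

# The formulas, with each expectation estimated from i.i.d. triples.
m = 1_000_000
x1, x2, x3 = rng.standard_normal((3, m))
ES = comb(n, 2) * f(x1, x2).mean()
ES2 = (comb(n, 2) * (f(x1, x2) ** 2).mean()
       + 2 * comb(n, 3) * ((f(x1, x2) * f(x1, x3)).mean()
                           + (f(x1, x2) * f(x2, x3)).mean()
                           + (f(x1, x2) * f(x3, x2)).mean())
       + 6 * comb(n, 4) * f(x1, x2).mean() ** 2)

print(s.mean(), ES)        # the two E(S) estimates should agree
print((s**2).mean(), ES2)  # likewise for E(S^2)
```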

JimB
  • Clearly the preconditions of the CLT are not guaranteed to hold in this circumstance. What assumptions are you making about $f$ and the distribution of $x_i$ that would allow you to apply the CLT? – whuber Jul 22 '15 at 16:06
  • 1
    Sorry, I should have been clearer. By "appeal to the central limit theorem" I'm meaning something more like "hope that the central limit theorem will apply". If all moments of S exist I would think there should be some version of the CLT that can be used. Yes, there would still be more work to do in terms of specifics about the nature of the function $f$. – JimB Jul 22 '15 at 16:12
  • 1
    Thank you for the clarification. Although there are versions of the CLT that do allow for lack of independence among random variables in a sequence, they work by maintaining close control over *how much* dependence there is. It's hard to see how one could get much control here, given the extremely general nature of the question and the extensive interdependencies of the terms in this sum. I think any insight you could share about that would really get to the heart of the matter. – whuber Jul 22 '15 at 16:16
  • 1
    As whuber notes, the CLT seems difficult to use here. A moment bound could work: by Markov's inequality, $\DeclareMathOperator{\E}{\mathbb E}\Pr\left( \lvert S - \E S \rvert \ge a \right) \le a^{-k} \E \lvert S - \E S \rvert^k$. You can optimize this bound over $k$; even values are more convenient, since then you don't have to worry about the absolute values. Finding the general central moments of $S$ is probably an unpleasant task, though. – Danica Jul 22 '15 at 16:54
  • @Dougal. With a simple symmetric example (such as $f(x,y)=x y$), computing the moments with Mathematica or Maple might not be too onerous. And simple-minded simulations with the functions $x y$, $x y^2$, and $x\sqrt{|y|}$ at $n = 50$ suggest that the quality of the normal approximation to $\Pr(S>x)$ varies (and, of course, all approximations probably get worse for extreme values of $x$). – JimB Jul 22 '15 at 18:19
  • Thanks for your help. It turns out the CLT hunch was correct. Answer with references provided. – Till Hoffmann Sep 11 '15 at 13:27

Answer

After some reading, it turns out that the normalized sum $S/\binom n2$ is a degree-two U-statistic (the example $f$ above is symmetric, as the definition requires). Its asymptotic distribution in the limit of large $n$ is normal, provided $\mathrm{cov}\left(f(x_1,x_2),\,f(x_1,x_3)\right)>0$ (the non-degenerate case).

The mean and variance of the asymptotic distribution are \begin{align} \mathrm{E}(S)&=\tbinom n2 \mathrm{E}\left(f(x_1, x_2)\right)\\ \mathrm{var}(S)&=\frac{4}{n}\tbinom n2^2\mathrm{cov}\left(f(x_1, x_2), f(x_1, x_3)\right)\\ &=n(n-1)^2\,\mathrm{cov}\left(f(x_1, x_2), f(x_1, x_3)\right). \end{align}
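A short numerical check of this approximation (an added sketch, assuming standard-normal attributes and $a=1$ in the example kernel from the question):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
f = lambda x, y: np.exp(-np.abs(x - y))  # the question's kernel with a = 1
n, reps = 200, 2_000

# Simulate the distribution of S directly.
i, j = np.triu_indices(n, k=1)
s = np.array([f(x[i], x[j]).sum() for x in rng.standard_normal((reps, n))])

# Plug Monte Carlo estimates of E(f) and cov(f(x1,x2), f(x1,x3)) into the formulas.
m = 1_000_000
x1, x2, x3 = rng.standard_normal((3, m))
mu = f(x1, x2).mean()
zeta1 = (f(x1, x2) * f(x1, x3)).mean() - mu**2
mean_S = n * (n - 1) / 2 * mu
var_S = n * (n - 1) ** 2 * zeta1  # asymptotic, so agreement is approximate at finite n

print(s.mean(), mean_S)
print(s.var(), var_S)
# Normality of the standardized sum, e.g. via a Kolmogorov-Smirnov test:
print(stats.kstest((s - mean_S) / np.sqrt(var_S), "norm"))
```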

Background

Asymptotic normality was proven by Hoeffding (1948); see the wiki-link above for the original reference. Lehmann's *Elements of Large Sample Theory* provides a nice treatment in chapter 6. For a publicly-available treatment (although not quite as nice), see chapter 10 of these lecture notes: Asymptotic Tools.

Till Hoffmann