
Consider $n$ independent, standard normally distributed variables $Y_1$, $Y_2$, ..., $Y_n$. Consider also the $n$ variables defined as the average of the last $h$ terms, $X_h = \frac{1}{h}\sum^{n}_{k = n - h + 1} Y_k$ for $h \in \{1, \dots, n\}$. The variables $\sqrt{h}X_h$ are then standard normally distributed.

I'd like to understand the probability $P\{\sqrt{n}X_n > T \text{ or } \sqrt{n-1}X_{n-1} > T \text{ or } \dots \text{ or } X_1 > T\}$, in words the probability that at least one of the $\sqrt{h}X_h$ exceeds a threshold value $T$.

Do you have any references or suggestions for this problem? I can work out the $n = 2$ case geometrically from the expression $P\{Y_1 + Y_2 > \sqrt{2}T \mid Y_1 < T\}$. However, I soon get lost in integral hell as the dimension increases.
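For what it's worth, here is how one might estimate this probability for general $n$ by simulation (a sketch of my own, assuming standard normal $Y_k$; the function name `p_any_exceeds` is not from any library). Since the $Y_k$ are iid, the joint law of the partial sums taken from the start equals that of the sums of the last $h$ terms, so the code uses plain cumulative sums:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_any_exceeds(n, T, n_sims=200_000):
    """Monte Carlo estimate of P(sqrt(h) X_h > T for at least one h),
    where X_h is the average of h iid standard normals."""
    Y = rng.standard_normal((n_sims, n))
    partial_sums = np.cumsum(Y, axis=1)                   # sum of the first h terms
    scaled = partial_sums / np.sqrt(np.arange(1, n + 1))  # sqrt(h) * X_h
    return np.mean(scaled.max(axis=1) > T)
```

For $n = 2$ and $T = 3$ this lands near the $0.25\%$ value derived below, and it scales to larger $n$ without any extra integration.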

Thank you


EDIT:

I add my solution for $n = 2$. The probability we are looking for is equal to one minus the probability that all $\sqrt{h}X_h$ stay below the threshold $T$, so:

$$ P = 1 - p(Y_1 + Y_2 < \sqrt{2}T\,|\,Y_1< T)p(Y_1<T) $$

We can calculate the term $ p(Y_1 + Y_2 < \sqrt{2}T\,|\,Y_1< T) $ geometrically (see picture), through:

$$ p(Y_1 + Y_2 < \sqrt{2}T \mid Y_1 < T) = \int_{-\infty}^{\sqrt{2}T - T} N(x)\,\text{d}x + \int^{\infty}_{\sqrt{2}T - T} N(x)\Big[ 1 - \int^{T}_{\sqrt{2}T - x} N(y)\,\text{d}y\Big]\text{d}x =\\ = 1 - \int^{\infty}_{\sqrt{2}T - T} N(x)\Big( \Phi(T) - \Phi(\sqrt{2}T - x) \Big)\text{d}x =\\ = 1 - \Phi(T)\big(1 - \Phi(\sqrt{2}T - T)\big) + \int^{\infty}_{\sqrt{2}T - T} N(x)\,\Phi(\sqrt{2}T - x)\,\text{d}x $$

Naming $H(T)$ the last integral term, the probability we are looking for becomes:

$$P = 1 - \Phi(T)\Big( 1 -\Phi(T)\big(1-\Phi(\sqrt{2}T - T)\big) + H(T) \Big)$$

If $T = 3$, we find a probability $P \sim 0.25\%$, which is larger than the probability $1 - \Phi(T)$ by a factor $\frac{P(T)}{1-\Phi(T)} \sim 1.823$.

Here $\Phi$ and $N$ denote the standard normal cumulative distribution function and probability density function, respectively. This seems confirmed by simulations in Python:

import scipy.stats as sts
import numpy as np

N = 1000000
M = 2
host_matrix = np.zeros((N,M))

for n in range(N):
    arr = sts.norm(0).rvs(M)                           # M iid standard normal draws
    wind = np.cumsum(arr)/np.sqrt(np.arange(M) + 1)    # sqrt(h) * X_h for h = 1..M
    host_matrix[n] = wind
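As an aside, the same simulation can be vectorized by drawing all samples at once, which avoids the Python-level loop (a sketch, equivalent to the loop above up to random seeding):

```python
import numpy as np
import scipy.stats as sts

N, M = 1_000_000, 2
arr = sts.norm(0).rvs((N, M))                                 # all draws at once
host_matrix = np.cumsum(arr, axis=1) / np.sqrt(np.arange(M) + 1)
```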

Here is a routine for calculating the value of the formula above:

def fp_prb(thr):
    import scipy.integrate as integrate
    def hard_term_f(x, T):
        return sts.norm.pdf(x)*sts.norm.cdf(np.sqrt(2)*T - x)

    # H(T): the residual integral term, evaluated numerically
    hard_term = integrate.quad(lambda x: hard_term_f(x, thr), thr*(np.sqrt(2) - 1), +np.inf)[0]
    F_t = sts.norm.cdf(thr)
    F_sqrt = sts.norm.cdf(thr*(np.sqrt(2) - 1))

    prb = 1 - F_t*( 1 - F_t*(1 - F_sqrt) + hard_term )
    return prb
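As a cross-check on the quadrature (my addition, not part of the derivation above), the $n = 2$ probability can also be computed directly as a bivariate normal orthant probability, since $Y_1$ and $(Y_1 + Y_2)/\sqrt{2}$ are jointly normal with unit variances and correlation $1/\sqrt{2}$; the function name `fp_prb_exact` is my own:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fp_prb_exact(thr):
    # (Y1, (Y1 + Y2)/sqrt(2)) is bivariate normal with unit variances
    # and correlation 1/sqrt(2)
    r = 1 / np.sqrt(2)
    both_below = multivariate_normal(mean=[0.0, 0.0],
                                     cov=[[1.0, r], [r, 1.0]]).cdf([thr, thr])
    return 1 - both_below
```

At `thr = 3` this gives a ratio to $1 - \Phi(T)$ of about $1.82$, consistent with the formula above.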

The observed frequency and probability ratio can be computed through:

thr = 3

no_fp = len(np.argwhere((host_matrix[:,0] < thr ) & (host_matrix[:,1] < thr )))
fp_freq =  1 - no_fp/N
norm_freq = 1 - sts.norm.cdf(thr)
print('FP frequency: {:.4f}'.format(fp_freq))
print('Normal detections frequency: {:.4f}'.format(norm_freq))
print('Ratio: {:.3f}'.format(fp_freq/norm_freq))

An example output:

FP frequency: 0.0025
Normal detections frequency: 0.0013
Ratio: 1.841
deppep

  • Just to clarify, you don't wish to assume that $Y_1, \dotsc, Y_n$ are independent, nor identically distributed? – Oxonon Dec 18 '20 at 22:10
  • This doesn't need any references, because (assuming only that these variables are *jointly* Normally distributed) all the $h X_h$ will have Normal distributions whose means and variances are readily computable from the means of the $Y_i$ and entries of their covariance matrix. This requires just some additions, no integration at all. – whuber Dec 18 '20 at 22:15
  • @whuber please can I ask you to elaborate further? thank you very much! – deppep Dec 18 '20 at 22:18
  • @Oxonon thank you, corrected! – deppep Dec 18 '20 at 22:18
  • This is a FAQ. See https://stats.stackexchange.com/search?q=sum+normal+mean+variance. – whuber Dec 18 '20 at 22:20
  • @whuber not very useful comment but I will check. thank you again. – deppep Dec 18 '20 at 22:22
  • I'm not offering an answer: I'm showing where you can find any answer that suits you among the hundreds that have already appeared, so that you don't have to wait for someone to post something here. – whuber Dec 18 '20 at 22:24
  • @deppep are you sure $P(X+YT$. – fblundun Dec 18 '20 at 22:28
  • @fblundun thank you very much! I've found some typos and corrected them. Sorry, kind of late here! :) – deppep Dec 18 '20 at 22:32
  • @whuber, even with an explicit distribution for some Gaussian $d$-dimensional random vector $X$, how would you compute $P(\cap_{i=1}^d X_i > t)$ without marginalising, i.e. without integration? – Oxonon Dec 19 '20 at 13:17
  • Given the nice heads-up at [link](https://math.stackexchange.com/questions/3954390/whats-the-probability-the-cumulative-average-of-multiple-gaussian-variables-exc/3954467#3954467) by @ConMan, I added a few things to both main posts. – deppep Dec 19 '20 at 13:22
  • @Oxonon That's a good point. I would approach this using tools of stochastic processes, because the question can be framed as the chance of reaching an absorbing barrier. The barrier might be arbitrarily complicated, though, unless additional assumptions are added, such as that the $Y_i$ are iid. – whuber Dec 19 '20 at 13:24
  • BTW, I believe this question (for iid $Y_i$) is explicitly and authoritatively answered at https://stats.stackexchange.com/a/310295/919. – whuber Dec 19 '20 at 13:53
  • @whuber thank you for the link: it looks promising but I will need some time to experiment with the thing. Much appreciated! – deppep Dec 19 '20 at 14:39
  • Since you've included some computational experiments in the main post and seem interested in numerical values, one thing to highlight is that $P(X_h > T \mid X_{h-1} < T, \dotsc, X_1 < T) \leq P(X_h > T \mid X_{h-1} < T)$, and the dependence of $X_h$ on $X_j$ for $h \gg j$ looks to be weak. Hence you could bound the quantity you're interested in using just the result for the two-dimensional case you've already derived, and the result might not be awful. – Oxonon Dec 19 '20 at 15:49

0 Answers