
Suppose there exists a sequence of $n$ numbers with two possible instantiations:

  1. The sequence contains all zeros;
  2. $n-1$ of the numbers are zeros, and one is a zero-mean Gaussian random variable $y_t\sim\mathcal{N}(0,\sigma^2)$ where $t$ denotes the location of the Gaussian in the sequence of zeros. The location $t$ is uniformly distributed from 1 to $n$ (i.e. the Gaussian is equally likely to be found in every location in the sequence).

This sequence is corrupted by zero-mean i.i.d. additive Gaussian noise with variance $s^2$ (which is also independent of the sequence). Therefore, when I observe a vector $\mathbf{x}$, it is one of the following:

  1. A sequence of $n$ i.i.d. Gaussian random variables $\mathbf{x}=\{x_i\}_{i=1}^n$ where $x_i=z_i\sim\mathcal{N}(0,s^2)$ for all $i=1,2,\ldots,n$;
  2. A sequence of $n$ Gaussian random variables $\mathbf{x}=\{x_i\}_{i=1}^n$ where the location $T=t$ is drawn uniformly from $\{1,2,\ldots,n\}$ (i.e. $P(T=t)=1/n$), $x_t=y_t+z_t$ so that $x_t\sim\mathcal{N}(0,\sigma^2+s^2)$, and $x_i=z_i\sim\mathcal{N}(0,s^2)$ for $i\neq t$.
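
For concreteness, here is a minimal R sketch of this observation model under both hypotheses (the particular values of $n$, $s$ and $\sigma$ are arbitrary choices for illustration):

n     <- 100                               # sequence length
s     <- 1.2                               # noise standard deviation
sigma <- 3.1                               # standard deviation of the hidden Gaussian y_t

z <- rnorm(n, mean = 0, sd = s)            # additive noise z_1, ..., z_n

x.null <- z                                # hypothesis 1: all-zero sequence plus noise

loc        <- sample.int(n, size = 1)      # location T, uniform on {1, ..., n}
x.alt      <- z
x.alt[loc] <- rnorm(1, mean = 0, sd = sigma) + z[loc]   # hypothesis 2: y_t added at location T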

I am looking for a hypothesis test to distinguish these two sequences, taking the first case as the null hypothesis. Since all the random variates are independent, the likelihood function under the null hypothesis is just the product of Gaussians: $$f_1(\mathbf{x})=\frac{1}{(2\pi s^2)^{n/2}}e^{-\frac{1}{2s^2}\sum_{i=1}^nx_i^2}$$ while the likelihood function under the alternative hypothesis is the following mixture of products of Gaussians:

$$f_2(\mathbf{x})=\frac{1}{n}\frac{1}{(2\pi s^2)^{(n-1)/2}}\frac{1}{\sqrt{2\pi(\sigma^2+s^2)}}\sum_{t=1}^ne^{-\frac{x_t^2}{2(s^2+\sigma^2)}-\frac{1}{2s^2}\sum_{i=1,i\neq t}^nx_i^2}$$

With a little arithmetic massaging, the likelihood ratio is as follows:

$$\Lambda(\mathbf{x})=\frac{f_2(\mathbf{x})}{f_1(\mathbf{x})}=\frac{1}{n}\sqrt{\frac{s^2}{\sigma^2+s^2}}\sum_{t=1}^n e^{\frac{\sigma^2}{2s^2(s^2+\sigma^2)}x_t^2}$$

So the LRT here involves squaring each observation, exponentiating a scaled version of each square, summing the results, and comparing the resulting test statistic to a threshold. However, the distribution of this test statistic is analytically daunting (at least to me; maybe someone knows how to analyze the error probabilities without resorting to numerical analysis).
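
For reference, here is a minimal R sketch of this statistic, computed on the log scale with a log-sum-exp step for numerical stability ($s$ and $\sigma$ are assumed known):

log.LRT <- function(x, s, sigma) {
  n <- length(x)
  a <- sigma^2 / (2 * s^2 * (s^2 + sigma^2))   # coefficient multiplying each x_t^2
  v <- a * x^2                                 # one exponent per candidate location t
  m <- max(v)                                  # log-sum-exp trick for stability
  0.5 * log(s^2 / (s^2 + sigma^2)) - log(n) + m + log(sum(exp(v - m)))
}

The LRT then declares the alternative hypothesis when this quantity exceeds a threshold chosen for the desired false-alarm probability (e.g. by simulating the statistic under the null).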

The simpler, "natural" (at least to me) hypothesis test is to square the observed sequence term-by-term, find the maximum and compare to a threshold. This is much more amenable to analysis, but how close is it to the optimal LRT above? Are there asymptotic results (say, as $n$ increases)?
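
For concreteness, here is a minimal R sketch of this max test; its null distribution is available in closed form since $\max_i x_i^2 \le \tau$ exactly when every $|x_i|\le\sqrt{\tau}$, giving false-alarm probability $1-\left(2\Phi(\sqrt{\tau}/s)-1\right)^n$ (the level alpha below is an illustrative choice):

max.test <- function(x, s, alpha = 0.01) {
  n   <- length(x)
  tau <- (s * qnorm((1 + (1 - alpha)^(1/n)) / 2))^2   # threshold giving false-alarm prob. alpha
  list(statistic = max(x^2), threshold = tau, reject = max(x^2) > tau)
}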

This problem seems like something that might have been solved in the radar literature (it seems to correspond to pinging a target with a single Gaussian pulse, with some random offset in the return time of the signal if it bounces off the target); however, I can't find any appropriate references.

M.B.M.
  • How do you obtain a sum instead of a product over the data? – whuber Jul 17 '13 at 14:57
  • @whuber The sum is due to the mixture distribution over the data when alternate hypothesis is true. I clarified in the question. – M.B.M. Jul 17 '13 at 15:45
  • I think that's an unnecessary complication. A sufficient statistic is the sum of squares of the data: this will have a distribution equal under the null to a $\chi^2(n)$ scaled by $s^2$ or, under the alternate, to the sum of a $\chi^2(n)$ scaled by $s^2$ and an independently distributed $\chi^2(1)$ scaled by $\sigma^2$. – whuber Jul 17 '13 at 15:51
  • @whuber Could you please explain why the sum of squares of the data is a sufficient statistic in this case, as it's not obvious to me given that the distribution of the data when the alternate hypothesis is true is a mixture? I can see the sum of the squares of data being sufficient statistic if this was a test between $n+1$ hypotheses (i.e. the current alternate hypothesis broken into $n$ separate hypotheses for the signal $y_t$ being in each of the $n$ locations, plus the current null hypothesis for absent $y_t$). However, this problem doesn't care about learning location $t$ of the signal. – M.B.M. Jul 17 '13 at 16:03
  • The alternate is not a mixture, that's why. You have *guaranteed* that precisely one of the values is different. Therefore the data are a random permutation of an experiment in which the first observation is from one distribution and all the rest are from another (and all distributional parameters are known, apparently). Because you don't label the data and they are independent, the likelihood *conditional* on any permutation is the product of the likelihoods. Because this conditional likelihood does not depend on the permutation, it must be the full likelihood. – whuber Jul 17 '13 at 16:09
  • Aha!!! Very nice explanation. This also means that the test involving the maximum is not optimal either (which doesn't matter, as the sum of squared observations is nicer to deal with analytically). If you put your comment in an answer form, I'll accept it. – M.B.M. Jul 17 '13 at 16:18
  • I'm not convinced this is correct. The potential error I made lies in asserting that the conditional likelihood is independent of the permutation: but it *does* depend on it. And that, unfortunately, gets us right back to where you started. This pseudo-reasoning, though, suggests you might construct a successful test anyway by using the sum of squares as your test statistic. It can be justified as a first-order approximation to the correct log-likelihood. – whuber Jul 17 '13 at 16:40
  • @whuber Right, I was just about to post that I was confused again, as I was having trouble writing down the likelihood function for the alternate... Could you please elaborate on the "first-order approximation to the correct log-likelihood"? How does one justify an approximation of a function when the function is unknown in the first place? – M.B.M. Jul 17 '13 at 16:44

1 Answer


You can use Grubbs' test here

Your statistical problem is essentially to test for a single "aberration" in your data vector. This is a very similar problem to using Grubbs' test to detect an outlier. Indeed, one could reasonably say that it is the same problem. An obvious thing to do here is to test using Grubbs' statistic, which is the maximum studentised magnitude of the data values:

$$G(\mathbf{x}) \equiv \max_{i} \frac{|x_i - \bar{x}_n|}{s_n}.$$

Larger values of this test statistic are more conducive to the alternative hypothesis that there is an aberrant value in the underlying series. Under your normality assumption it is possible to get an asymptotic approximation for the tail area of the null distribution of this statistic (which is what is usually used in the test), but it is also quite simple to simulate the distribution of this statistic under the null hypothesis $H_0: \sigma = 0$.

The advantage of Grubbs' test is that it will generally detect the alternative case fairly easily when $\sigma/s$ is reasonably large. By focusing entirely on the value with the highest studentised magnitude, it zeroes in on the most aberrant value in the observed series. Note that the test depends heavily on the assumption of an underlying normal distribution, and it is not at all robust to deviations from this distribution. (If the underlying distribution has tail behaviour that deviates from the normal, this will invalidate the hypothesis test; there is not really any way around this when you are testing for a single aberrant value.)
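
As a rough illustration of this point, one can estimate the power of the test by simulation for a few values of $\sigma/s$. Here is a quick sketch (the sample size, number of simulations and signal-to-noise ratios are arbitrary choices):

set.seed(1)
n    <- 100
sims <- 2000

G.stat <- function(x) max(abs(x - mean(x))) / sd(x)

#Simulated 5% critical value under the null
crit <- quantile(replicate(sims, G.stat(rnorm(n))), 0.95)

#Estimated power for several values of sigma/s (taking s = 1)
for (ratio in c(1, 2, 3, 4, 5)) {
  power <- mean(replicate(sims, {
    x      <- rnorm(n)                        #null data with s = 1
    loc    <- sample.int(n, size = 1)         #random location of the aberration
    x[loc] <- x[loc] + rnorm(1, sd = ratio)   #add an aberration with sd = ratio * s
    G.stat(x) > crit }))
  cat(sprintf("sigma/s = %g: estimated power = %.2f\n", ratio, power)) }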


Implementation: Here is an implementation of Grubbs' test in R, where we have programmed a simulation to approximate the true null distribution of the test statistic. We program the function Grubbs.test, which is a customised hypothesis test taking a data vector x and a stipulated number of simulations sims. (If we set progress = TRUE the function displays a progress bar to track the simulations; this is useful when the number of simulations is large.)

Grubbs.test <- function(x, sims = 10^5, progress = TRUE) {

#Set description of test and data
method      <- "Grubbs' test (simulated)"
data.name   <- deparse(substitute(x))

#Check validity of inputs
if (!is.numeric(x))           { stop("Error: Input x should be numeric data") }
if (!is.numeric(sims))        { stop("Error: Input sims should be a positive integer") }
if (sims != as.integer(sims)) { stop("Error: Input sims should be a positive integer"); }
if (min(sims) < 1)            { stop("Error: Input sims should be a positive integer") }
if (!is.logical(progress))    { stop("Error: Input progress should be a logical value") }
if (length(progress) != 1)    { stop("Error: Input progress should be a single logical value") }

#Set null and alternative hypotheses
null.value  <- 0
attr(null.value, "names") <- "aberration variance"
alternative <- "greater"

#Calculate test statistics
n <- length(x)
sample.mean <- mean(x)
sample.sd   <- sd(x)
estimate    <- max(abs(x-sample.mean))/sample.sd
attr(estimate, "names")  <- "maximum studentised magnitude"
statistic   <- estimate
attr(statistic, "names") <- "G"

#Calculate p-value
SIM.STAT <- rep(0, sims)
if (progress) { PROGRESS <- txtProgressBar(min = 0, max = sims) }
for (i in 1:sims) {
  SIM.DATA    <- rnorm(n)
  MEAN        <- mean(SIM.DATA)
  SD          <- sd(SIM.DATA)
  SIM.STAT[i] <- max(abs(SIM.DATA-MEAN))/SD 
  if (progress) { setTxtProgressBar(PROGRESS, value = i) } }
if (progress) { close(PROGRESS) }
p.value <- sum(SIM.STAT >= statistic)/sims
attr(p.value, "names") <- NULL

#Create htest object
TEST        <- list(method = method, data.name = data.name,
                    null.value = null.value, alternative = alternative,
                    estimate = estimate, statistic = statistic, p.value = p.value)
class(TEST) <- "htest"
TEST }

Let's now apply this test in two cases, one where the null hypothesis is true and one where it is false. First we will generate two data sets, DATA.NULL and DATA.ALT, where the latter includes an aberrant value generated from an underlying normal distribution with non-zero variance.

#Set parameters
n     <- 100
m     <- 4.2
s     <- 1.2
sigma <- 3.1

#Generate mock data
set.seed(730921341)
ABERRATION    <- rnorm(1, mean = 0, sd = sigma)
VAL           <- sample.int(n, size = 1)
DATA.NULL     <- rnorm(n, mean = m, sd = s)
DATA.ALT      <- DATA.NULL
DATA.ALT[VAL] <- DATA.NULL[VAL] + ABERRATION

Now we will apply Grubbs' test to these data vectors to see if we can detect the underlying aberration. As you can see, in this particular case (where $\sigma/s \approx 2.6$), the test correctly identifies that the first data vector does not have an aberrant value and the second data vector does.

#Implement tests
TEST1 <- Grubbs.test(DATA.NULL)
TEST2 <- Grubbs.test(DATA.ALT)

#Show output of tests
TEST1

        Grubbs' test (simulated)

data:  DATA.NULL
G = 2.2678, p-value = 0.9452
alternative hypothesis: true aberration variance is greater than 0
sample estimates:
maximum studentised magnitude 
                     2.267813 

TEST2

        Grubbs' test (simulated)

data:  DATA.ALT
G = 4.4083, p-value = 0.00031
alternative hypothesis: true aberration variance is greater than 0
sample estimates:
maximum studentised magnitude 
                     4.408309 
Ben