
I have a variable $X$ with a finite number of values $1, 2, \ldots, k$. The parameter is simply the distribution: a vector $p \in \mathbb{R}^k$ with $p_i = P(X=i)$. I have a dataset of $N$ values of $X$ from independent trials.

Now I have a linear (or affine) function $f$ of $k$ variables, for example $f(p)=p_1+2p_2-3p_5$. I want to test the hypothesis $H_0$: $f(p)=0$.

How can I do this?

(I'm interested mainly in asymptotic methods for large $N$)

Benoit Sanchez
  • If you're after asymptotic methods you would simply use the multivariate normal approximation for the multinomial (keeping in mind that it's degenerate). The mean and the variance-covariance matrix are pretty easy to derive, so you can readily get a normal approximation for the natural estimator of $f$. – Glen_b Dec 18 '17 at 04:40
  • Thanks. I see. I realize the covariance matrix is pretty simple: https://en.wikipedia.org/wiki/Multinomial_distribution. I think it's equivalent to work directly with $Y=1_{X=1}+2\cdot 1_{X=2}-3\cdot 1_{X=5}$. – Benoit Sanchez Dec 18 '17 at 09:43

1 Answer


For large $N$ the idea is to rely on normality via the Central Limit Theorem (CLT). Write $\hat p$ for the natural estimator of $p$: the empirical distribution. We use the statistic $f(\hat p)$.

$f(\hat p)$ becomes approximately normally distributed for large $N$. There are two ways to see this:

  • $\hat p$ becomes normally distributed by the multidimensional CLT, with covariance matrix given by the multinomial distribution, and then $f(\hat p)$ is normally distributed too, as a linear combination of its components
  • $f(\hat p)=\frac{1}{N}\displaystyle\sum_{j=1}^N f(1_{X_j=1},1_{X_j=2},\ldots,1_{X_j=k})$, so you can use the one-dimensional CLT for this sequence of i.i.d. variables (see the sketch after this list)
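
A minimal sketch of this equivalence in Python, assuming a coefficient vector `a` that encodes the example $f(p)=p_1+2p_2-3p_5$ and simulated data (the names and the true $p$ are hypothetical, chosen only to illustrate):

```python
import numpy as np

rng = np.random.default_rng(0)

k, N = 5, 10_000
a = np.array([1.0, 2.0, 0.0, 0.0, -3.0])           # coefficients of f(p) = p1 + 2 p2 - 3 p5
p_true = np.array([0.30, 0.10, 0.25, 0.20, 0.15])  # assumed truth, for simulation only

# N i.i.d. draws of X, coded 0..k-1
x = rng.choice(k, size=N, p=p_true)

# View 1: empirical distribution, then the linear functional
p_hat = np.bincount(x, minlength=k) / N
f_hat = a @ p_hat

# View 2: average of Y_j = f(1_{X_j=1}, ..., 1_{X_j=k}), which is just a[X_j]
y = a[x]

assert np.isclose(f_hat, y.mean())                 # the two views coincide
```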

The mean of $f(\hat p)$ is $f(p)$. You can use a z-test.

You need an estimate of the variance of $f(\hat p)$. You can either derive it from the covariance matrix of $\hat p$ or use the usual variance estimator for the variable $f(1_{X=1},1_{X=2},\ldots,1_{X=k})$. Both methods yield the same estimator: it is a quadratic form in $\hat p$ (see the sketch below).
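
A sketch of the full computation, under the same assumed setup as above (hypothetical names, simulated data): the quadratic form with the plug-in multinomial covariance and the empirical variance of $Y$ agree exactly when both use the $1/N$ convention (`ddof=0`), and the z statistic follows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
k, N = 5, 10_000
a = np.array([1.0, 2.0, 0.0, 0.0, -3.0])   # f(p) = p1 + 2 p2 - 3 p5
x = rng.choice(k, size=N, p=[0.30, 0.10, 0.25, 0.20, 0.15])

p_hat = np.bincount(x, minlength=k) / N
y = a[x]                                   # Y_j = f(indicator vector of X_j)

# Route 1: quadratic form, plugging p_hat into the multinomial covariance
# Cov(p_hat) = (diag(p) - p p^T) / N
var_f_hat = a @ ((np.diag(p_hat) - np.outer(p_hat, p_hat)) / N) @ a

# Route 2: empirical variance of Y over N; ddof=0 makes the match exact
assert np.isclose(var_f_hat, y.var(ddof=0) / N)

# z statistic and two-sided p-value for H0: f(p) = 0
z = (a @ p_hat) / np.sqrt(var_f_hat)
p_value = 2 * norm.sf(abs(z))
```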

Note about the z-test: formally this is not exactly a z-test, since the variance is not known but estimated. Some authors still call it a z-test. Some might prefer a t-test, but they are essentially the same: the statistic is the same, only the distribution approximation under $H_0$ differs. The two approximations are extremely close except for very small sample sizes. For small sample sizes, it is unclear whether a t-test would be better; see this clarification: https://stats.stackexchange.com/questions/85804/choosing-between-z-test-and-t-test. The focus was on large $N$ anyway.
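
To see how close the two reference distributions are, compare two-sided p-values for the same (illustrative) statistic value under the normal and under Student's t; the sample sizes are hypothetical:

```python
from scipy.stats import norm, t

stat = 1.8                      # illustrative value of the test statistic
print("normal:", 2 * norm.sf(stat))
for n in (20, 100, 10_000):     # hypothetical sample sizes
    print(f"t, N={n}:", 2 * t.sf(stat, df=n - 1))
```

Already for moderate $N$ the two p-values agree to two or three decimal places.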

Benoit Sanchez
  • Under the null you know the variance exactly, so you would use a z-test rather than a t-test. $\text{Var}(\hat{f})=\text{Var}(\hat{p}_1+2\hat{p}_2-3\hat{p}_5)=\text{Var}(\hat{p}_1)+4\text{Cov}(\hat{p}_1,\hat{p}_2)+4\text{Var}(\hat{p}_2)+9\text{Var}(\hat{p}_5)-6\text{Cov}(\hat{p}_1,\hat{p}_5)-12\text{Cov}(\hat{p}_2,\hat{p}_5)=\sigma^2_0$ (say), all terms of which are known under the null. Then ... $\frac{\hat{f}-f_0}{\sigma_0}$ will be asymptotically standard normal. – Glen_b Dec 18 '17 at 23:22
  • On the other hand, you don't actually have any theory that establishes that replacing the known variance with an estimated one results in a t-distribution. (In practice it will work just fine, but there's no result I know of that actually establishes this goes to a t distribution; you just have that a t and the thing you're working with both go asymptotically to a normal.) – Glen_b Dec 18 '17 at 23:47
  • I see your points. But we can't say that the variance is really known, because the covariance of $\hat p$ depends on $p$, and the null does not give $p$ but only a set that $p$ belongs to. I first wrote "z" but read that it's not consensual to say "z" with an estimated variance. But yes, there is no guarantee that using the Student distribution is better than using the normal one. Asymptotically it doesn't matter: t = z. For smaller $N$, the t-test does not seem to be a great solution for binomial-like data... – Benoit Sanchez Dec 19 '17 at 09:52
  • Sorry; you're correct; I was mixing in something from a completely unrelated question that I read at almost the same time. However, you then don't really have an asymptotic argument for the t, only for the z. Can you link to what you read? – Glen_b Dec 19 '17 at 09:56
  • Gung and Adam's answers mainly: https://stats.stackexchange.com/questions/85804/choosing-between-z-test-and-t-test – Benoit Sanchez Dec 19 '17 at 09:58
  • I decided to call it "z" finally. Apparently, some authors use the word "z-test" with estimated variance. I added a note for a (partial) clarification. – Benoit Sanchez Dec 19 '17 at 16:25
  • Cool. To my mind it would be called a t-test if you should (at least notionally) compare the statistic with percentage points of a t-distribution; we lack any argument for that. – Glen_b Dec 19 '17 at 22:25