
I posted this to MathOverflow and no one's answering:

Scheffé's method for identifying statistically significant contrasts is widely known. A contrast among the means $\mu_i$, $i=1,\ldots,r$, of $r$ populations is a linear combination $\sum_{i=1}^r c_i \mu_i$ in which $\sum_{i=1}^r c_i=0$; since a scalar multiple of a contrast is essentially the same contrast, one could say the set of contrasts is a projective space. Scheffé's method tests the null hypothesis that all contrasts among these $r$ populations are $0$, and, given a significance level $\alpha$, rejects a true null hypothesis with probability $\alpha$. If the null hypothesis is rejected, Scheffé points out that his test tells us which contrasts differ significantly from $0$ (I'm not sure the Wikipedia article I linked to points that out).

I would like to know if one can do something similar in a different sort of situation. Consider a simple linear regression model $Y_i = \alpha + \beta x_i + \varepsilon_i$, where $\varepsilon_i\sim\operatorname{i.i.d.}N(0,\sigma^2)$, $i=1,\ldots,n$.

The null hypothesis I want to consider concerns a different sort of contrast. It says there is no subset $A\subseteq\lbrace 1,\ldots,n\rbrace$ such that $E(Y_i) = \alpha_1 + \beta x_i$ for $i\in A$ and $E(Y_i) = \alpha_2 + \beta x_i$ for $i\not\in A$ with $\alpha_1\ne\alpha_2$. If the subset $A$ is specified in advance, an ordinary two-sample $t$-test does the job, but we want a test that considers all subsets simultaneously while still controlling the probability of rejecting a true null hypothesis.
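To fix ideas, here is a minimal sketch of the easy, known-subset case just mentioned: with $A$ pre-specified, testing $\alpha_1=\alpha_2$ is an ordinary $F$-test (equivalently a $t$-test) on a dummy-variable coefficient. The data, subset, and effect size below are simulated purely for illustration.

```python
import numpy as np

# Sketch of the easy case: the subset A is known in advance, so testing
# alpha_1 = alpha_2 is just an F-test (equivalently a t-test) on the
# coefficient of a dummy variable.  All data here are simulated.
rng = np.random.default_rng(0)
n = 40
x = np.linspace(0.0, 10.0, n)
A = np.arange(n) < 20                                  # the pre-specified subset
y = 1.0 + 0.5 * x + 4.0 * A + rng.normal(0.0, 0.5, n)  # intercept shifts by 4 on A

def sse(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

X_full = np.column_stack([np.ones(n), x, A.astype(float)])  # two intercepts
X_red = np.column_stack([np.ones(n), x])                    # one intercept

# F statistic for H0: one intercept, with 1 and n - 3 degrees of freedom.
F = (sse(X_red, y) - sse(X_full, y)) / (sse(X_full, y) / (n - 3))
print(F)
```

Under the null this statistic has an $F_{1,\,n-3}$ distribution; the whole difficulty of the question is that no single pre-specified $A$ is available.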

One could construct such a test if computational efficiency were not a concern: go through all $2^{n-1}-1$ ways of splitting the observations into two nonempty groups. Even then it's problematic: the statistics for different subsets are far from independent. I asked an expert on outlier detection about this and he just said it's a combinatorial nightmare. Then I asked whether one could prove that there's no efficient way to do it, perhaps by reducing a known NP-hard problem to it. He just said he stays away from NP-hard problems.
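For intuition, the inefficient exhaustive procedure might look like the following sketch (illustrative only, for tiny $n$): maximize the $F$ statistic over all $2^{n-1}-1$ partitions. The statistics for different subsets are strongly dependent, so the null distribution of the maximum is not an $F$ distribution and the cutoff would itself have to be calibrated, e.g. by simulation.

```python
import numpy as np

# Brute-force sketch: maximize the F statistic over all 2^(n-1) - 1
# partitions of {1,...,n} into two nonempty groups.  Data simulated
# under the null (one intercept).  Illustration only.
rng = np.random.default_rng(1)
n = 10                                  # kept tiny: 2^(n-1) - 1 = 511 subsets
x = np.linspace(0.0, 1.0, n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

def sse(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

sse_reduced = sse(np.column_stack([np.ones(n), x]), y)

best_F, best_A = -np.inf, None
# Fixing the last index outside A counts each partition exactly once,
# since a subset and its complement give the same split.
for bits in range(1, 2 ** (n - 1)):
    z = np.array([(bits >> i) & 1 for i in range(n - 1)] + [0], dtype=float)
    X = np.column_stack([np.ones(n), x, z])
    s = sse(X, y)
    F = (sse_reduced - s) / (s / (n - 3))
    if F > best_F:
        best_F, best_A = F, z
print(best_F)
```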

So: Can one prove either that this problem is "hard" or that it's not?

Michael Hardy
  • (+1) Copying a comment for clarification from the [MO version](http://mathoverflow.net/questions/132786): Just a small point of clarification: As I read it, $(\alpha_1, \alpha_2, \alpha_3) = (1,2,3)$ qualifies under your null hypothesis, but $(1,2,2)$ and $(1,1,1)$ do not (regardless of $\beta$). Is that what you intended? (It doesn't seem to match some of the other allusions made in the question.) – cardinal Jun 21 '13 at 20:13
  • As stated above, the null hypothesis would be that we need only one $\alpha$, and the alternative hypothesis is that we need two. I don't know why you've got a third one. One could also consider the null hypothesis of just one $\alpha$ versus the alternative hypothesis of several, and maybe that's what I ought to do instead. – Michael Hardy Jun 21 '13 at 20:22
  • Thanks. Perhaps I was thrown off by the original statement of the model as $Y_i = \alpha + \beta x_i + \varepsilon_i$, where I took the $\alpha$ to be a potential typo for $\alpha_i$ (since it was subsequently allowed to vary). – cardinal Jun 21 '13 at 20:31
  • Well, certainly if that $\alpha$ depended on $i$ it would be an over-parametrized model, and not at all like what one normally calls a "simple linear regression model". – Michael Hardy Jun 22 '13 at 00:25

1 Answer


I noticed that no one has answered this question so far, so here is an observation.

Basically, the question is this: Is there a 0-1 vector $Z$ such that $$ y_i = \alpha + \beta x_i + \gamma z_i + \epsilon_i $$ gives a (significantly) better fit than $$ y_i = \alpha + \beta x_i + \epsilon_i? $$ "Significantly better" can be captured in terms of sums of squares as an inequality. The question then becomes whether there is a 0-1 solution to the inequality $$ f(z) \ge t, $$ where $f(z)$ is the reduction in the residual sum of squares obtained by adding $z$ as a regressor. This is a variant of the set partitioning problem, which is known to be NP-hard.
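As one concrete reading of this formulation (with illustrative names and data, and $f$ taken to be the drop in the residual sum of squares when $z$ is added as a regressor):

```python
import numpy as np

# Illustrative helper: f(z) is the drop in the residual sum of squares when
# the 0-1 vector z is added to the simple-regression design; the decision
# question is whether any z achieves f(z) >= t for a given threshold t.
def f(z, x, y):
    n = len(y)
    X_red = np.column_stack([np.ones(n), x])
    X_full = np.column_stack([X_red, np.asarray(z, dtype=float)])
    def sse(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ beta
        return r @ r
    return sse(X_red) - sse(X_full)

rng = np.random.default_rng(2)
x = np.arange(12.0)
z_true = (x >= 6).astype(float)
y = 0.3 * x + 2.0 * z_true + rng.normal(0.0, 0.5, 12)
print(f(z_true, x, y))   # the planted subset should give a sizable drop
```

Note that $f(z)=f(1-z)$, because $\lbrace 1, z\rbrace$ and $\lbrace 1, 1-z\rbrace$ span the same column space; a subset and its complement are the same split, which is why there are $2^{n-1}-1$ candidates rather than $2^n-2$.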

user3697176
  • Can the set partitioning problem actually be reduced to this problem? If so, that would prove this is a hard problem. ${}\qquad{}$ – Michael Hardy Jul 28 '15 at 19:26
  • This problem is at least as hard as the classic set partitioning problem (SPP). SPP takes a list of weights and asks whether each can be multiplied by $\pm 1$ so that the resulting sum is $0$. Here you want to satisfy an inequality. If this were solvable in polynomial time for arbitrary inputs, then a bisection argument shows that you could also solve SPP in polynomial time. That's not exactly a reduction, but it's close. – user3697176 Jul 28 '15 at 22:48
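For readers unfamiliar with the problem the comment invokes (often also called the number partition problem): given weights $w_1,\ldots,w_n$, decide whether signs $\pm 1$ exist making the signed sum zero. A brute-force toy, for intuition only:

```python
from itertools import product

# Toy illustration of the partitioning problem the comment invokes:
# given weights w, is there a choice of signs +/-1 making the signed
# sum zero?  Exhaustive search, for intuition only.
def partitionable(w):
    return any(sum(s * wi for s, wi in zip(signs, w)) == 0
               for signs in product((1, -1), repeat=len(w)))

print(partitionable([3, 1, 1, 2, 2, 1]))  # 3+1+1 = 2+2+1 -> True
print(partitionable([2, 2, 3]))           # total is odd, no balanced split -> False
```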