
$$ y = (2,4,6,8,10) $$ $$ x_1 = (1,2,3,4,5) $$

Linear model:

$$ y = \beta_0 + \beta_1x_1 $$

  • p-value of $x_1$: <2e-16
  • $R^2$: 1.00
  • p-value of model: <2e-16 with 1 variable and 3 residual df
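For anyone who wants to check these figures by hand, here is a minimal sketch of the least-squares fit in plain Python. It is not the software that produced the numbers above; the variable names are illustrative assumptions. It shows that the residual sum of squares is exactly zero, so the standard error of $\beta_1$ is zero and the t-statistic is unbounded, which is presumably why a package can only report a p-value below its smallest printable threshold.

```python
import math

# Data from the question
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products around the means
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Ordinary least-squares estimates for y = beta_0 + beta_1 * x_1
beta_1 = s_xy / s_xx
beta_0 = y_bar - beta_1 * x_bar

# Residual sum of squares, R^2, and residual degrees of freedom
rss = sum((yi - (beta_0 + beta_1 * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - rss / s_yy
df_resid = n - 2

# Standard error of the slope; it is zero when the fit is perfect,
# so the t-statistic is treated as infinite instead of dividing by zero
se_beta_1 = math.sqrt(rss / df_resid / s_xx)
t_stat = beta_1 / se_beta_1 if se_beta_1 > 0 else math.inf

print(beta_0, beta_1, r_squared, df_resid, t_stat)
```

With exact arithmetic the test statistic has no finite value here, so any specific tiny p-value printed by software reflects floating-point limits rather than information in the data.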

Why doesn't the p-value tell us to reject this model and this variable until we increase the sample size $n$?

jtd
  • What's MLR? In a forum where for some "ML" means maximum likelihood (of course, they say) and for others it means machine learning (of course, _they_ say) explaining your abbreviations does no harm and can defuse puzzlement. – Nick Cox Mar 11 '15 at 12:48
  • A P-value isn't a certification of whether your analysis is sensible (appropriate, well judged) or a quantification of how far it is sensible (etc.). It's just flagging here that a fit that good is unlikely to be a chance fluctuation with this sample size. Wouldn't you be troubled if that were not the case, as it is a perfect fit? – Nick Cox Mar 11 '15 at 12:51
  • As the question has been edited, earlier comments may appear puzzling. The short answer is that the $P$-value is (highly) sensitive to small $n$; it is just not evident in the example you give. – Nick Cox Mar 11 '15 at 13:01
  • @NickCox: Apologies for abbreviation. Given sample size, pooled variance, and valid assumptions about normality, linearity, homoscedasticity, i.i.d., etc., can we say "the likelihood that this relationship is due to random chance--that $x_1$ neither causes $y$ (nor vice versa), nor shares a causal antecedent with $y$, is <2e-16"? (cf. https://stats.stackexchange.com/questions/141253/can-two-variables-be-perfectly-correlated-but-not-share-a-single-causal-chain-an) – jtd Mar 11 '15 at 13:11
  • That wouldn't be correct. No independent observer could say whether this is a chance relationship, a legitimate systematic relationship, or even something someone cooked up. Approach it "from the other direction." "IF there were NO relationship in the larger population, random samples of 5 would show this degree of linear connection in fewer than 2 of 10^16 instances." (Although not every software package would quantify it that way. E.g., SPSS reports no p-value at all.) – rolando2 Mar 11 '15 at 13:24
  • @NickCox - it's not that "a fit that good is unlikely to be a chance fluctuation with this sample size"; it's that "chance fluctuations around a condition of zero fit are unlikely to produce a fit this good with this sample size." – rolando2 Mar 11 '15 at 13:29
  • I agree with @Rolando2. No program can tell you just by looking at data anything about "causal antecedents" or causes. Nor is there a population of relationships, some of which are caused by "random chance", whatever that means, and some of which aren't. By the way, the precise P-value of the order of 1e-16 is suppositious, if only because nothing can be stronger than perfect fit. Unfortunately there is no wording for this that is simultaneously clear, correct and charming, as it is a kind of backwards logic (indeed to many people in statistical science, quite absurd!). – Nick Cox Mar 11 '15 at 13:29
  • @Rolando2 Yes; that is more accurate wording. I am reaching for paraphrases that will make some kind of sense at the level of this question and inadvertently showing that it's dangerous to do so. – Nick Cox Mar 11 '15 at 13:32
  • @rolando2 and NickCox: Thanks! I have tried to put your knowledge into an answer. – jtd Mar 11 '15 at 13:41

1 Answer


Attempting to put the knowledge from @NickCox and @rolando2 into this answer:

The p-value of a multiple regression variable (or model) cannot tell an independent observer anything about causes, but it can say:

  • IF there were NO relationship in the population between $x_1$ and $y$, properly random samples of size $n=5$ would show this degree of fit (or relationship) in a fraction smaller than (a suppositious)* 2e-16 of the samples.

*Note that a perfect fit between $x_1$ and $y$ in the question makes the p-value suppositious.
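To make the "IF there were NO relationship" reading concrete, here is a rough sketch using an exact permutation test. This is a different technique from the t-test behind the reported p-value, swapped in only because it needs no distributional assumptions and can be enumerated fully for $n=5$. The helper name `r_squared` and the tolerance `1e-12` are illustrative assumptions.

```python
from itertools import permutations

# Data from the question
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    s_yy = sum((b - y_bar) ** 2 for b in ys)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    return s_xy ** 2 / (s_xx * s_yy)


observed = r_squared(x, y)

# Under "no relationship", every ordering of y against x is equally likely,
# so enumerate all 5! = 120 orderings and count fits at least as good.
orderings = list(permutations(y))
as_extreme = sum(1 for p in orderings if r_squared(x, p) >= observed - 1e-12)
perm_p = as_extreme / len(orderings)

print(as_extreme, len(orderings), perm_p)
```

Only the ascending and descending orderings give a perfect fit, so this test cannot report anything below 2/120 (about 0.017) with five points. The far smaller parametric figure leans on the normal-error assumptions of the linear model, and in both cases the statement is about what chance alone would produce, not about causes.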

Please feel free to edit!

jtd