3

Example: if I find a statistically significant difference between the heights of men and women, does this say something about being able to predict whether a person is a man or woman based on the height?

It seems to me that in every case where there is a statistically significant difference, we should be able to make a prediction (classify something, and evaluate it using cross-validation) of the independent variable based on knowing the dependent variable. I couldn't think of a counterexample after a long period of thinking about it.

Another example: there is a linear correlation between $A$ and $B$. Given $B$, I would be able to predict $A$ with high certainty.

gung - Reinstate Monica
elissa
  • In your question, you state "we should be able to make a prediction... of the *independent* variable based on knowing the *dependent* variable". Usually, we think of our model as predicting the *dependent* variable based on the *independent* variable. Was that a typo, or are you asking about reverse regression? – gung - Reinstate Monica May 06 '14 at 15:28
  • I guess I have seen both in articles. Some articles claim the inverse and some claim the standard prediction. Is there a difference between the claims -- is one stronger than the other? – elissa May 06 '14 at 15:40
  • To see the difference between the two, it may help you to read my answer here: [What is the difference between linear regression on Y with X and X with Y?](http://stats.stackexchange.com/q/22718/7290) – gung - Reinstate Monica May 06 '14 at 16:18

1 Answer

4

Yes, you could. But there are easier ways to distinguish men from women. Suppose you find a significant difference of 4 inches between men and women, with respective average heights of 5'9" and 5'5". Then a sensible decision rule would be to classify anyone 5'7" or taller as male. This is basically the approach taken by discriminant analysis.
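For concreteness, here is a minimal Python sketch of that cutoff rule; the normal distributions and the 3-inch standard deviation are assumptions for illustration, not figures from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, not from the question: heights roughly normal with a 3-inch SD.
mu_male, mu_female, sd = 69.0, 65.0, 3.0   # 5'9" and 5'5", in inches
n = 10_000

male = rng.normal(mu_male, sd, n)
female = rng.normal(mu_female, sd, n)

cutoff = 67.0                              # the 5'7" midpoint
heights = np.concatenate([male, female])
is_male = np.concatenate([np.ones(n, bool), np.zeros(n, bool)])

predicted_male = heights >= cutoff
accuracy = (predicted_male == is_male).mean()
print(f"accuracy of the 5'7\" cutoff: {accuracy:.2%}")   # roughly 75% here
```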

But the big question is: how often do I make a mistake by this method?

The answer to that depends on the variances of the height distributions, which, in your example, allow for considerable overlap between the two populations and hence a high probability of false classification.
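To put a number on that overlap, here is a rough calculation under the same assumed normal model (3-inch SD in each group): the misclassification rate is simply the tail area of each height distribution on the wrong side of the 5'7" cutoff.

```python
from scipy.stats import norm

# Same assumed model: normal heights, 3-inch SD in each group.
mu_male, mu_female, sd, cutoff = 69.0, 65.0, 3.0, 67.0

p_male_missed = norm.cdf(cutoff, loc=mu_male, scale=sd)     # men falling below 5'7"
p_female_missed = norm.sf(cutoff, loc=mu_female, scale=sd)  # women at or above 5'7"

print(f"P(classify a man as female): {p_male_missed:.1%}")
print(f"P(classify a woman as male): {p_female_missed:.1%}")
```

With these assumed numbers, roughly a quarter of each group lands on the wrong side of the cutoff, even though the 4-inch difference in means is perfectly real.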

Recall that your hypothesized "significant difference" assumes that you took a sample of male and female heights. Given a real difference and a sufficiently large sample size, you will get a significant result. Basically, significance depends on the distribution of the sample averages; classification success depends on the distribution of individual observations. So you can have a statistically significant result, but a totally crappy classifier.
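A quick simulation (with hypothetical numbers) makes the contrast concrete: give the test a tiny true difference and an enormous sample, and the p-value is microscopic while the best single cutoff barely beats a coin flip.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Hypothetical numbers: a tiny true difference, a huge sample.
n, sd, diff = 100_000, 3.0, 0.2
male = rng.normal(65.0 + diff, sd, n)
female = rng.normal(65.0, sd, n)

res = ttest_ind(male, female)
print(f"t-test p-value: {res.pvalue:.2e}")          # extremely "significant"

cutoff = 65.0 + diff / 2
accuracy = np.mean(np.concatenate([male >= cutoff, female < cutoff]))
print(f"classification accuracy: {accuracy:.2%}")   # barely above 50%
```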

Placidia
  • Thanks Placidia, it makes some sense that the variance is the key factor. But it still seems to me that a statistical significance test takes that into account as well. How is the "distribution of the sample averages" different from "distribution of the individual"? If the individuals have a particular distribution, I imagine the samples of those individuals would too... – elissa May 09 '14 at 13:26
  • That's where the law of large numbers and the central limit theorem come in: 1) the variance of the sample average tends to 0 as the sample size goes to infinity, and 2) the suitably scaled distribution of the sample average tends to normality. – Placidia May 09 '14 at 14:05
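A small numerical illustration of that last comment (purely illustrative numbers): the spread of the sample mean shrinks like $\sigma/\sqrt{n}$, while the spread of individual heights stays fixed.

```python
import numpy as np

rng = np.random.default_rng(2)
sd, reps = 3.0, 5_000          # same assumed 3-inch SD for individual heights

for n in (10, 100, 1_000):
    sample_means = rng.normal(65.0, sd, size=(reps, n)).mean(axis=1)
    print(f"n={n:>5}: SD of the sample mean ~ {sample_means.std():.3f}  "
          f"(theory sd/sqrt(n) = {sd / np.sqrt(n):.3f})")
```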