3

I am looking at the probability of opting into a program based on a continuous variables $X_1$, $X_2$, $X_3$, etc. When I divide the sample into people who opted in and did not opt in and do a t-test I find that the mean of all the $X$'s for those opting in is significantly different from the mean for those not opting in.

However, when I look at a correlation matrix for the $X$'s and opting in I find really low correlation coefficients -- from between $.04$ and $.13$. But my t-tests are significant at the 95%+ confidence level.

How do I square these two results?
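(For readers wanting to see the phenomenon concretely: the following is a minimal simulation with hypothetical numbers, not the asker's data. A small mean shift plus a reasonably large sample gives a clearly significant t-test while the correlation with the group indicator stays small.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 10000                                # hypothetical sample size
opt_in = rng.binomial(1, 0.5, size=n)    # 0/1 opt-in indicator
# X is shifted up by a small amount (0.15 sd) for those who opted in
x = rng.normal(size=n) + 0.15 * opt_in

t, p = stats.ttest_ind(x[opt_in == 1], x[opt_in == 0])
r = np.corrcoef(opt_in, x)[0, 1]
print(f"t-test p-value = {p:.3g}, correlation with opt-in = {r:.3f}")
# the t-test is highly significant, yet the correlation is small
```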

MånsT
Dan
  • Flipping which variable is the response variable or a predictor will result in models w/ different meanings & possibly different patterns of 'significance'. (You describe the probability of opting in as though it should be a response variable, but then discuss using it to indicate groups in a t-test.) For a discussion of this issue in a simpler regression context, see my answer [here](http://stats.stackexchange.com/questions/22718/what-is-the-difference-between-doing-linear-regression-on-y-with-x-versus-x-with/22721#22721). @PeterFlom is right that you should be doing logistic regression. – gung - Reinstate Monica Jul 26 '12 at 15:30
  • @PeterFlom is correct. Essentially for any sized effect you can find statistical significance with a large sample size. John Myles White did an interesting write-up on this in the context of social science: http://www.johnmyleswhite.com/notebook/2012/07/17/criticism-5-of-nhst-p-values-measure-effort-not-truth/ – Fraijo Jul 26 '12 at 15:54
  • I am confused by the question, the answers, and the comments! The source of my confusion is basic: since a t-test measures differences in *averages*, why would that have anything at all to do with *correlation* in the first place? It makes me wonder what averages you are actually taking: you have two groups of people and multiple "continuous variables" for each person. Are you averaging all values of all these variables together? Are you averaging each variable separately? If so, to which averages do you apply the t-test? – whuber Jul 26 '12 at 16:34
  • @whuber here is how I read the question: the program consists of 3 continuous predictor variables and a single binary response variable. If we look at the factors (0-1) for each predictor variable, the means for opting in (factor = 1) are greater than for those not opting in. This suggests there could be correlation between the X_i, since larger values seem to be linked to opting in. But as we all seem to have answered, there is no contradiction between low correlation and statistical significance. – Fraijo Jul 26 '12 at 17:58
  • Thanks, @Fraijo. I still don't follow the logic: why would the relationship among the means of $X_i$ and opting in tell us *anything* about correlation among the $X_i$? – whuber Jul 26 '12 at 18:51
  • @whuber Dan states that the correlation matrix is between the X_i and opting in. I assumed he meant the correlation between each X_i and opting in was between .04 and .13; i.e. there appears to be some small effect but nothing large. Generally we want low correlation between the X_i (if X_1 and X_2 are highly correlated, adding X_2 to a regression with X_1 does not gain us much). – Fraijo Jul 26 '12 at 21:05
  • @Fraijo: Thanks, I see where there is ambiguity here. To me, "matrix" *always* means the full array of mutual correlations, both among the $X_i$ and between them and `opting in`. But, regardless of which interpretation is intended, I still do not see how the t-tests even *ought* to imply correlation. It's easy to construct counterexamples. – whuber Jul 26 '12 at 21:08
  • @whuber you are absolutely correct, there should be plenty of counter-examples. The intuition for correlation is simple (but not correct): if the mean of X_1 for those who opt in is higher than the alternative, then an increase in X_1 would suggest opting-in is more likely. – Fraijo Jul 27 '12 at 21:14

3 Answers

7

First, there's no real contradiction. The statistical significance of any statistic is only partly due to its size; it is also a function of sample size. How many people did you have?

Second, since "opting in" is a binary variable, and since you are treating it as a dependent variable with multiple independent variables, you really want logistic regression, not correlations.
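(A sketch of what that would look like: logistic regression fit by gradient ascent on simulated data, kept self-contained in numpy rather than using a statistics package. All numbers here — the sample size, the true slope of 0.3 — are made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
# true model: P(opt in) = logistic(0.3 * x)
p_true = 1 / (1 + np.exp(-0.3 * x))
y = rng.binomial(1, p_true)

# fit intercept + slope by gradient ascent on the average log-likelihood
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(500):
    mu = 1 / (1 + np.exp(-X @ beta))       # current fitted probabilities
    beta += 0.5 * X.T @ (y - mu) / n       # gradient step

print(f"intercept = {beta[0]:.3f}, slope = {beta[1]:.3f}")
# the slope estimate should land near the true value of 0.3
```

In practice one would use a packaged routine (e.g. `glm` in R or `statsmodels`/`scikit-learn` in Python) rather than hand-rolled gradient ascent; the sketch just shows what the model estimates.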

Peter Flom
2

I agree with Peter that there is no contradiction; the t-test and the correlations are telling you very different things. First of all, as Peter mentioned, statistical significance depends on both the magnitude of the difference and the sample size. With a very large sample size, small differences can be significant (even highly statistically significant).

Now, correlation between the variables measures whether or not they move together in a linear fashion. It may be that if you have paired data for $X_1$ and $X_2$, they don't tend to increase or decrease together (or, in the sense of negative correlation, have $X_1$ decreasing while $X_2$ increases). So for your data the magnitude of this tendency is low or nonexistent.

Now, two variables can have very different means and zero correlation, for example if $X_1(k) = A + \varepsilon(k)$ and $X_2(k) = B + \eta(k)$, where the $\varepsilon(k)$ and $\eta(k)$ are uncorrelated zero-mean noise terms that are also uncorrelated with each other.

Suppose $A - B > 0$. Then for sufficiently large $n$ (how large $n$ has to be depends on the magnitude of the difference between $A$ and $B$), the t-test will say that the mean of $X_1$ is statistically significantly different from the mean of $X_2$. But $X_1$ and $X_2$ are uncorrelated. This is like your situation.

On the other hand, suppose $X_1(k) = X_2(k) + \varepsilon(k)$, where the $\varepsilon(k)$ are zero-mean independent Gaussian noise terms. Then the means of $X_1$ and $X_2$ will be the same, but the two variables can be highly correlated, with the degree of correlation depending on the variance of $\varepsilon(k)$.
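(Both constructions above can be checked numerically; this sketch uses arbitrary values $A = 0.2$, $B = 0$, and noise standard deviations of 1 and 0.5.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10000

# Case 1: different means, zero correlation
x1 = 0.2 + rng.normal(size=n)          # A = 0.2 plus noise
x2 = 0.0 + rng.normal(size=n)          # B = 0.0 plus independent noise
t, p = stats.ttest_ind(x1, x2)
r_case1 = np.corrcoef(x1, x2)[0, 1]
print(f"case 1: p = {p:.3g}, correlation = {r_case1:.3f}")
# highly significant mean difference, correlation near zero

# Case 2: equal means, high correlation
x3 = rng.normal(size=n)
x4 = x3 + 0.5 * rng.normal(size=n)     # X2 = X1 + small noise
r_case2 = np.corrcoef(x3, x4)[0, 1]
print(f"case 2: mean diff = {x3.mean() - x4.mean():.4f}, correlation = {r_case2:.3f}")
# nearly identical means, strong correlation
```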

Michael R. Chernick
  • To be clearer, I was saying (letting there be only one X): If Y=1 then the person opted into the program. If I divide the sample into two groups - people with Y=1 and Y=0 -- then the mean of X for group two is higher for the group Y=1. It is statistically different at the 95% confidence level, and the magnitude of the difference is not about equal to a fifth of the range of the variable. But when I look at the correlation between Y and X in the full sample, it is very small. I have about 1000 observations, so the sample is not huge – Dan Jul 26 '12 at 17:53
  • I thought -- OK, there is a very small but significant correlation - so there is a significant difference between the mean of X across the two groups, but it is a tiny difference so the correlation is of low magnitude. But, then I saw that the difference in means wasn't that big. -- but I do see what you are saying about different types of noise. Best to just run a logistic regression like you suggest. Thanks a lot. – Dan Jul 26 '12 at 17:59
  • @Dan 1000 observations should be large enough to detect small differences in means. It also might be able to call small differences in estimated correlations statistically significant. – Michael R. Chernick Jul 26 '12 at 18:18
2

I don't agree that you necessarily will gain much from logistic regression, given my understanding of your research questions (To what degree do OptIns differ from NonOptIns? Do we see more than chance differences?). You've already determined using your own criteria that, for each of 3 X's, there is a statistically significant group difference but a weak one. The weakness of the difference can be expressed in the mean difference, the standardized mean difference, or the point-biserial correlation with the OptIn variable, which is the technical name for the type of correlation you've calculated.

I question your need for logistic regression because you haven't said anything about wanting to see to what degree the 3 X's can jointly predict the outcome; about learning the relative importance of each when the other two are controlled; or about assigning predicted probabilities of opting in for each person. Those (among others) are the sorts of things logistic regression would give you.

rolando2