How to interpret multimodal distribution of bootstrapped correlation?

Question

I have two paired variables, x and y:

person     x      y
1        124    100
2         79     94
3        118    105
...

Here is a scatterplot of the data:

[Figure: scatterplot of x against y]

I am interested in the correlation between x and y. Bootstrapping gives me the following distribution. The lines are the actual correlation of the data (rho = 0.16) and the 2.5% and 97.5% quantiles of the bootstrap distribution.

What does it mean that the distribution of the correlation is multimodal?

[Figure: bootstrap distribution of the correlation, with lines at the sample correlation and the two quantiles]

The data are merely an example to illustrate the question. What would this outcome mean if the sample size were large enough?

Maarten Buis · Accepted Answer · 2013-07-11T10:18:00.807

7

My guess would be that there is a (set of) outlier(s) in your data. One mode represents those samples that included them and the other the samples that did not include them. My guess would be that the right mode corresponds to the samples that exclude both the point with the smallest value of $x$ and the point with the largest value of $x$ in your scatterplot. Similar patterns can also occur in larger samples.
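A minimal sketch of this explanation, using made-up data (not the asker's) in which nine points follow a rising trend and two points run against it. The data, the seed, and the helper `pearson` are assumptions for illustration. Each bootstrap replicate is filed by whether the resample contains at least one of the two outlying points, and the mean correlation of the two groups is compared:

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data: indices 0-8 lie on a rising trend,
# indices 9 and 10 are the two points that pull the correlation down.
x = [80, 85, 90, 95, 100, 105, 110, 115, 120, 75, 125]
y = [92, 95, 94, 98, 100, 101, 104, 103, 107, 110, 88]
outliers = {9, 10}

rng = random.Random(1)
with_outlier, without_outlier = [], []
for _ in range(20000):
    # Resample the (x, y) pairs with replacement.
    idx = [rng.randrange(len(x)) for _ in range(len(x))]
    if len(set(idx)) < 2:
        continue  # a resample of one repeated point has no correlation
    r = pearson([x[i] for i in idx], [y[i] for i in idx])
    if outliers & set(idx):
        with_outlier.append(r)
    else:
        without_outlier.append(r)

mean_with = sum(with_outlier) / len(with_outlier)
mean_without = sum(without_outlier) / len(without_outlier)
print("resamples containing an outlier:", len(with_outlier), "mean r:", round(mean_with, 2))
print("resamples excluding both:       ", len(without_outlier), "mean r:", round(mean_without, 2))
```

With 11 points, a resample misses both outlying points with probability (9/11)^11, roughly 11%, so the group that excludes them is small but forms its own cluster of high correlations.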

edited Jul 11 '13 at 10:18

answered Jul 11 '13 at 09:25


3

Regardless of the bimodality, confidence intervals that wide (a) include zero and (b) indicate a sample size too small to allow much to be done scientifically. I'd guess at a sample size of the order of 12. An enormous bootstrap sample (how many? 100,000?) can't squeeze out juice that isn't in the data to start with. – Nick Cox Jul 11 '13 at 09:46
@NickCox You are right, the sample size was too small to create meaningful results. Bootstrapping with fewer samples gives the same curve, only less smooth. I used a large number of samples to have a clear example for my question here. The validity of the data aside, I am trying to understand what this kind of result *would* mean, if the sample size was large enough (a few thousand people). Would such a result be possible? And what would it signify? – Jul 11 '13 at 10:01
1

@what Do you think my answer is wrong? If so, why? If not, why are you asking a question when you already know the answer? – Maarten Buis Jul 11 '13 at 10:03
2

Only a larger dataset can tell you what is in a larger dataset. Obvious indeed, but also true. – Nick Cox Jul 11 '13 at 10:08
@MaartenBuis I am not saying your answer is wrong! Why do you think so? In fact there *are* outliers, and your answer might be correct. But how would I interpret this result with your answer, if this were a large sample? You wouldn't call them "outliers" then, would you? I'm sorry if I appear obstinate, I just want to understand what this possibly might mean, if it were not an artefact of my sample. – Jul 11 '13 at 10:14
1

The sample size turns out to be 11 (my guess was 12). My guess is that the values at the top and bottom right pull correlations down most and the higher mode is dominated by samples that exclude both. – Nick Cox Jul 11 '13 at 10:14
1

@NickCox Yes, certainly. My question here comes from the fact that I learned that there *are* multimodal distributions, and how I test for them, but I have never yet encountered any real data that *were* multimodal, and I have never been given an explanation what multimodality could mean. So, if this were truly multimodal, what would it mean? – Jul 11 '13 at 10:16
2

That would be a good question in itself (but check to see what has been written already). But in essence multimodality often means a mixture of some kind (although mixtures don't always imply multimodality). People often cite male and female heights and weights, but in practice it is very hard to predict gender from height or weight and the bimodality that is predicted may be hidden in distributions. – Nick Cox Jul 11 '13 at 10:19
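To illustrate the point that mixtures don't always imply multimodality, here is a small sketch under stated assumptions: an equal-weight mixture of two normal densities with a common standard deviation, for which the density has two modes only when the means are more than two standard deviations apart. The function names and the grid-based mode count are illustrative choices, not a standard test for multimodality:

```python
import math

def mixture_pdf(t, mu1, mu2, sigma=1.0):
    """Density of an equal-weight mixture of N(mu1, sigma) and N(mu2, sigma)."""
    def normal_pdf(mu):
        z = (t - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
    return 0.5 * normal_pdf(mu1) + 0.5 * normal_pdf(mu2)

def count_modes(mu1, mu2, sigma=1.0, steps=2001):
    """Count strict local maxima of the mixture density on a fine grid."""
    lo = min(mu1, mu2) - 4 * sigma
    hi = max(mu1, mu2) + 4 * sigma
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    dens = [mixture_pdf(t, mu1, mu2, sigma) for t in grid]
    return sum(
        1 for i in range(1, steps - 1)
        if dens[i - 1] < dens[i] > dens[i + 1]
    )

# Means 1 sd apart: the two components blur into a single peak.
print("separation 1 sd:", count_modes(-0.5, 0.5), "mode(s)")
# Means 4 sd apart: the two components show up as separate peaks.
print("separation 4 sd:", count_modes(-2.0, 2.0), "mode(s)")
```

This is one way the predicted bimodality of something like pooled male and female heights can stay hidden: the component means are too close relative to the spread within each group.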
2

Are you asking about distributions in general or sampling distributions specifically? – Maarten Buis Jul 11 '13 at 10:22
2

@what As far as I can tell, the distributions of both variables in your data are not bimodal. You have to realize that what your plot represents is the sampling distribution of the correlation coefficient. This is not the same thing, and the sampling distribution would look quite different were the sample larger (for one, it would generally be much “narrower”, i.e. have smaller variance). – Gala Jul 11 '13 at 10:30
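A rough sketch of the "narrower" point, on simulated data rather than the asker's: two samples are drawn from the same hypothetical model (y = 0.3 x + noise), one with 11 pairs and one with 400, and the width of a crude 2.5% to 97.5% bootstrap percentile interval for the correlation is compared. The model, the seed, the number of replicates, and all function names are assumptions for illustration:

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate(n, rng, slope=0.3):
    """Draw n pairs from the hypothetical model y = slope * x + noise."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [slope * a + rng.gauss(0, 1) for a in x]
    return x, y

def bootstrap_interval(x, y, reps, rng):
    """Crude 2.5% and 97.5% percentiles of bootstrapped correlations."""
    n = len(x)
    rs = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        if len(set(idx)) < 2:
            continue  # a resample of one repeated point has no correlation
        rs.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    rs.sort()
    m = len(rs)
    return rs[int(0.025 * m)], rs[int(0.975 * m) - 1]

rng = random.Random(7)
widths = {}
for n in (11, 400):
    x, y = simulate(n, rng)
    lo, hi = bootstrap_interval(x, y, 500, rng)
    widths[n] = hi - lo
    print("n =", n, "interval width:", round(widths[n], 2))
```

The interval for the larger sample should come out much shorter, which is why a shape seen with 11 observations says little about what the bootstrap distribution would look like with a few thousand.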
Thank you, Gaël, I'm getting confused here :-) @MaartenBuis I'm asking about sampling distributions specifically. What you see is the output from an exercise for learning R. The data were not meant to be taken seriously. I understand (thank you all!) that the form of the distribution is due to the outliers, which in turn are very likely due to the small sample size. But for my learning experience, I would like to pretend that this were real. Because in fact there is the slight possibility that the real distribution is actually exactly like this. Or is that impossible? – Jul 11 '13 at 10:38
@MaartenBuis You wrote in your answer that similar patterns can also occur in larger samples. Does that mean that I should simply ignore the second mode or just trim the outliers from my data? – Jul 11 '13 at 10:40
2

No, it would mean that you need to think about what might have caused that bimodality, probably outliers of some kind, and what you want to do with them. You asked a general question, so I can only give a general answer, which is probably too general to be really useful. The real answer will depend on all the little details that come with a real research project using real (= messy) data. – Maarten Buis Jul 11 '13 at 11:12
4

Remember that the very purpose of the bootstrap is to give a non-parametric estimate of the sampling distribution, so bi-modality is not necessarily a problem. If we had good reasons to expect that the real sampling distribution is bimodal, then the result you found would be just fine. In practice it often means you have outliers, which might or might not be a problem depending on the exact circumstances. The solution would be to know your data and know what you want to do with it. Ignoring parts of the estimated sampling distribution as you seem to propose is unlikely to be the answer. – Maarten Buis Jul 11 '13 at 11:21
2

A strategic comment: it really is much better to ask your real question or real problem, even if you have difficulties formulating it well. Naturally, concrete examples do help, enormously. But if you only ask a specific question, people will focus on that. If you then declare that you really don't care about the example and/or that you are really interested in something quite different, people who are trying to help are likely to think that they wasted (some of) their time. There is a risk that you get identified as someone who is too confused or too indirect to be worth answering. – Nick Cox Jul 11 '13 at 11:30
