Manipulating the difficulty of guessing a correlation

Question

I am trying to generate correlations between random variables (two dimensional) with a defined linear relationship (in the $r$ sense), but with different visual patterns when plotted. I am trying to create a 'guess the correlation' task where I can systematically manipulate the difficulty for an observer to guess the linear relationship.

What I am doing now: given a correlation $r$, I draw two independent samples, $X_1$ and $X_2$, of size $n$ from the standard normal distribution. I then form $X_3$ as a linear combination of the two: $X_3 = r X_1 + \sqrt{1-r^2}\,X_2$.

Then: $Y_1 = \mu_1 + \sigma_1 X_1, \quad Y_2 = \mu_2 + \sigma_2 X_3$

Now $Y_1$ and $Y_2$ have correlation $r$ (in the population; for finite $n$ the sample correlation fluctuates around $r$).
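The construction above can be sketched in a few lines of Python. This is only an illustration using the standard library; the function names `correlated_pairs` and `pearson_r`, the default parameter values, and the `seed` argument are assumptions made for the example and are not part of the original question.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(r, n, mu1=0.0, sigma1=1.0, mu2=0.0, sigma2=1.0, seed=None):
    """Draw n pairs (Y1, Y2) whose population correlation is r.

    X1 and X2 are independent standard normals,
    X3 = r*X1 + sqrt(1 - r^2)*X2, and then
    Y1 = mu1 + sigma1*X1, Y2 = mu2 + sigma2*X3.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    rng = random.Random(seed)  # hypothetical seed argument, for reproducibility
    y1, y2 = [], []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        x3 = r * x1 + math.sqrt(1.0 - r * r) * x2
        y1.append(mu1 + sigma1 * x1)
        y2.append(mu2 + sigma2 * x3)
    return y1, y2


if __name__ == "__main__":
    y1, y2 = correlated_pairs(0.6, 200, mu1=10, sigma1=2, mu2=-3, sigma2=5, seed=1)
    print(round(pearson_r(y1, y2), 3))  # close to 0.6, but not exactly 0.6
```

The sample correlation only approximates $r$, and the approximation gets worse as $n$ shrinks, which may matter if each plot in the task is supposed to show exactly the stated value.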

To manipulate the difficulty, I have been varying the parameters of the distribution and $n$; however, I am not satisfied with the results.

Any ideas on how to systematically increase the difficulty of the task (e.g., by adding outliers)?

Note: Difficulty is a cognitive/psychology question rather than a purely statistical one. I intend to test this empirically. The idea is to generate plots with varying parameters for a given correlation value (e.g., changing the number of points, the variance, outliers, or the functional form). What are the parameters, and what would be a systematic way to manipulate them?

"Difficulty" appears to be a psychological concept, not a statistical one, and so may be difficult to address in this forum. However, a very general way to produce scatterplots of any given visual appearance and specified value of $r$ is described at http://stats.stackexchange.com/questions/152028 . — whuber, Jul 12 '16 at 15:12
True, it is a psychological concept and I intend to test this empirically. However, I am looking for ideas to start with and see which statistical parameter can control the cognitive difficulty. The link you shared is extremely useful for this. — Alhayer, Jul 12 '16 at 15:16
How do you intend to present the data? If you hold the scale of the x and y axes constant then it seems plausible that varying the variance might affect the difficulty. Or alternatively, hold the variance constant and manipulate the axes. — Ian_Fin, Oct 24 '16 at 18:58
The solution at http://stats.stackexchange.com/questions/152028 might serve as a foundation to accomplish this. — whuber, Oct 24 '16 at 22:05
Just noticed [this recent article](https://dx.doi.org/10.3758/s13423-016-1174-7). Not had the chance to read it, but may be of some interest. — Ian_Fin, Oct 27 '16 at 08:43

score 1 · Accepted Answer · answered Oct 25 '16 at 09:42

There are a few parameters you could vary:

The slope of the linear effect. For example,

$$X_1=0.1X_2 + N(0,0.1^2)$$

plotted on a Y-axis of $[-5,5]$ may look a lot more correlated than

$$X_1=X_2 + N(0,1^2)$$

even though the expected correlation is the same.
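A quick check of the slope example, as a minimal sketch: it assumes $X_2 \sim N(0,1)$, which the answer does not state explicitly. Under that assumption both models have population correlation $b/\sqrt{b^2+s^2} = 1/\sqrt{2} \approx 0.71$, where $b$ is the slope and $s$ the noise standard deviation. The names `slope_model` and `pearson_r` are made up for this illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def slope_model(slope, noise_sd, n, seed=None):
    """Simulate X1 = slope * X2 + N(0, noise_sd^2), with X2 ~ N(0, 1) (assumed)."""
    rng = random.Random(seed)
    x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x1 = [slope * v + rng.gauss(0.0, noise_sd) for v in x2]
    return x2, x1


if __name__ == "__main__":
    for slope, noise_sd in [(0.1, 0.1), (1.0, 1.0)]:
        x2, x1 = slope_model(slope, noise_sd, 5000, seed=0)
        spread = max(abs(v) for v in x1)
        print(slope, round(pearson_r(x2, x1), 3), round(spread, 2))
```

The two settings give nearly the same correlation, but the vertical spread of $X_1$ differs by about a factor of ten, so on a fixed $[-5,5]$ axis the first cloud collapses into a narrow, almost flat band.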

The distribution of the error terms. A t-distribution with few degrees of freedom has fatter tails than a normal distribution. I would guess that fatter tails would make the correlation harder to guess.
The number of plotted data points. I would expect many humans to have great difficulty determining $\rho$ from three data points. Conversely, you may see interesting biases when thousands of data points are plotted.
The plotting symbol (and its size). It may be harder to estimate the correlation if the symbol is, say, a 10-pixel wide circle than if it is a dot, or a "+".
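The error-distribution and sample-size manipulations could be combined roughly as follows. This is a sketch under stated assumptions, not a tested experimental design: the noise is drawn from a t-distribution rescaled to unit variance (which requires more than two degrees of freedom), so the population correlation stays at $r$ while the tails get heavier. The names `heavy_tailed_pairs`, `t_variate` and `pearson_r` are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_variate(rng, df):
    """One Student-t draw: Z / sqrt(chi2_df / df), with chi2_df ~ Gamma(df/2, scale 2)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return z / math.sqrt(chi2 / df)


def heavy_tailed_pairs(r, n, df, seed=None):
    """n pairs with population correlation r and unit-variance t-distributed noise."""
    if df <= 2:
        raise ValueError("df must exceed 2 so the noise has finite variance")
    rng = random.Random(seed)
    scale = math.sqrt((df - 2.0) / df)  # rescales t_df noise to variance 1
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = scale * t_variate(rng, df)
        xs.append(x)
        ys.append(r * x + math.sqrt(1.0 - r * r) * e)
    return xs, ys


if __name__ == "__main__":
    # Vary both the tail weight (df) and the number of plotted points (n).
    for df in (3, 5, 30):
        for n in (10, 100, 1000):
            xs, ys = heavy_tailed_pairs(0.6, n, df, seed=42)
            print(df, n, round(pearson_r(xs, ys), 3))
```

With small `n` or small `df`, the sample correlation can land far from the nominal $r$, so it may be worth deciding in advance whether observers are scored against the nominal value or against the value actually realised in each plot.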
