I am trying to generate two-dimensional data (pairs of random variables) with a specified linear correlation $r$, but with different visual patterns when plotted. The goal is a 'guess the correlation' task in which I can systematically manipulate how difficult it is for an observer to guess the linear relationship.

Currently, given a correlation $r$, I draw $n$ samples each of $X_1$ and $X_2$ independently from the standard normal distribution. I then define $X_3$ as a linear combination of the two: $X_3 = r X_1 + \sqrt{1-r^2}\,X_2$.

Then: $Y_1 = \mu_1 + \sigma_1 X_1, \quad Y_2 = \mu_2 + \sigma_2 X_3$

$Y_1$ and $Y_2$ then have correlation $r$ (in the population; the sample correlation of any single draw varies around $r$).
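
For concreteness, the construction above can be sketched in Python with NumPy. This is only a minimal sketch: the function name `correlated_pair`, its arguments, and the example values are illustrative choices, and it assumes both $\sigma_1$ and $\sigma_2$ are positive.

```python
import numpy as np

def correlated_pair(r, n, mu=(0.0, 0.0), sigma=(1.0, 1.0), seed=None):
    """Draw n points (Y1, Y2) whose population correlation is r.

    Hypothetical helper following the construction in the question;
    assumes sigma[0] > 0 and sigma[1] > 0.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)               # X1 ~ N(0, 1)
    x2 = rng.standard_normal(n)               # X2 ~ N(0, 1), independent of X1
    x3 = r * x1 + np.sqrt(1.0 - r**2) * x2    # corr(X1, X3) = r
    y1 = mu[0] + sigma[0] * x1                # shifting/scaling keeps the correlation
    y2 = mu[1] + sigma[1] * x3
    return y1, y2

# Example values chosen only for illustration.
y1, y2 = correlated_pair(r=0.6, n=100_000, mu=(2.0, -1.0), sigma=(3.0, 0.5), seed=0)
sample_r = np.corrcoef(y1, y2)[0, 1]          # close to 0.6 for large n
```

With small $n$ the sample correlation can differ noticeably from $r$, which is itself one of the things that changes how a given plot looks.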

To manipulate the difficulty I have been varying the parameters of the distribution and $n$, but I am not satisfied with the results.

Any ideas on how to systematically increase the difficulty of the task (e.g., by adding outliers)?

Note: Difficulty is a cognitive/psychological question rather than a purely statistical one, and I intend to test it empirically. The idea is to generate plots with varying parameters for a given correlation value (e.g., changing the number of points, the variance, outliers, or the functional form). Which parameters are worth manipulating, and what would be a systematic way to manipulate them?

Alhayer
  • "Difficulty" appears to be a psychological concept, not a statistical one, and so may be difficult to address in this forum. However, a very general way to produce scatterplots of any given visual appearance and specified value of $r$ is described at http://stats.stackexchange.com/questions/152028 . – whuber Jul 12 '16 at 15:12
  • True, it is a psychological concept and I intend to test this empirically. However, I am looking for ideas to start with and see which statistical parameter can control the cognitive difficulty. The link you shared is extremely useful for this. – Alhayer Jul 12 '16 at 15:16
  • How do you intend to present the data? If you hold the scale of the x and y axes constant then it seems plausible that varying the variance might affect the difficulty. Or alternatively, hold the variance constant and manipulate the axes. – Ian_Fin Oct 24 '16 at 18:58
  • The solution at http://stats.stackexchange.com/questions/152028 might serve as a foundation to accomplish this. – whuber Oct 24 '16 at 22:05
  • Just noticed [this recent article](https://dx.doi.org/10.3758/s13423-016-1174-7). Not had the chance to read it, but may be of some interest. – Ian_Fin Oct 27 '16 at 08:43

1 Answer

There are a few parameters you could vary:

  • The slope of the linear effect. For example,

$$X_1=0.1X_2 + N(0,0.1^2),$$

plotted on a Y-axis of $[-5,5]$, may look a lot more correlated than

$$X_1=X_2 + N(0,1^2),$$

even though the expected correlation is the same (about $0.71$ in both cases if $X_2$ has unit variance).

  • The distribution of the error terms. A t-distribution with few degrees of freedom has fatter tails than a normal distribution, so it produces more extreme points. I would guess that fatter tails make the correlation harder to guess.

  • The number of plotted data points. I would expect many humans to have great difficulty determining $\rho$ from three data points. Conversely, you may see interesting biases when thousands of data points are plotted.

  • The plotting symbol (and its size). It may be harder to estimate the correlation if each point is, say, a 10-pixel-wide circle than if it is a dot or a "+".
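
As a rough sketch of how the first three parameters could be varied in code, the Python snippet below generates data for a chosen slope, noise scale, error distribution, and number of points. The names `scatter_data` and `expected_r` are hypothetical, $X_2$ is assumed to be standard normal, and the t-distributed noise is rescaled to the requested standard deviation, which requires more than two degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(1)

def scatter_data(n, slope, noise_sd, df=None):
    """Return (x, y) with x ~ N(0, 1) and y = slope * x + noise.

    Hypothetical helper: noise is normal by default; if df is given (df > 2),
    it is Student-t noise rescaled to have standard deviation noise_sd.
    """
    x = rng.standard_normal(n)
    if df is None:
        noise = rng.normal(0.0, noise_sd, n)
    else:
        # A t variable with df > 2 has variance df / (df - 2); rescale to noise_sd.
        noise = rng.standard_t(df, n) * noise_sd / np.sqrt(df / (df - 2.0))
    return x, slope * x + noise

def expected_r(slope, noise_sd):
    """Population correlation of x and y when x has unit variance."""
    return slope / np.sqrt(slope**2 + noise_sd**2)

# The two examples from the first bullet have the same expected correlation.
r_shallow = expected_r(0.1, 0.1)
r_steep = expected_r(1.0, 1.0)

# Same expected correlation, different visual ingredients.
x_a, y_a = scatter_data(n=50_000, slope=0.1, noise_sd=0.1)        # shallow slope
x_b, y_b = scatter_data(n=50_000, slope=1.0, noise_sd=1.0)        # steep slope
x_c, y_c = scatter_data(n=50_000, slope=1.0, noise_sd=1.0, df=5)  # fat-tailed errors
x_d, y_d = scatter_data(n=3, slope=1.0, noise_sd=1.0)             # very few points
```

Holding the axis limits fixed across plots is what makes the slope manipulation visible. The plotting symbol and its size would be set at plot time, for instance through the `marker` and `s` arguments of matplotlib's `scatter`.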

JDL