I have two variables that are supposed to correlate with each other across the whole dataset, but as you can see in the scatter plot below, it appears that I have a mix of two sub-samples.

In one, the variables correlate perfectly with each other (i.e., the data points fall on the regression line); in the other, larger one, the two variables are much less strongly correlated.

[Figure: scatter plot of the two variables]

I'd like to select the data points for which the two variables correlate perfectly with each other. I tried fitting a simple regression line and removing outliers based on Cook's distance > 4/n (see the scatter plot below).

[Figure: scatter plot after removing outliers with Cook's distance > 4/n]
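For reference, the Cook's distance filter described above could be sketched roughly as follows. This is only an illustration under assumptions: the data are simulated stand-ins (the real variables are not shown), the helper name `cooks_distance` is hypothetical, and the distances are computed by hand with NumPy rather than with any particular statistics package.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance of each point in a simple linear regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares fit
    resid = y - X @ beta
    # Leverages: diagonal of the hat matrix X (X'X)^-1 X'
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    s2 = resid @ resid / (n - p)                   # residual variance estimate
    return resid**2 / (p * s2) * h / (1.0 - h) ** 2

# Simulated stand-in data (hypothetical): one group on a line, one noisy group.
rng = np.random.default_rng(0)
x_line = rng.uniform(0, 10, 100)
y_line = x_line.copy()                             # exactly on the line y = x
x_cloud = rng.uniform(0, 10, 300)
y_cloud = 0.5 * x_cloud + rng.normal(0, 2, 300)
x = np.concatenate([x_line, x_cloud])
y = np.concatenate([y_line, y_cloud])

d = cooks_distance(x, y)
keep = d <= 4 / len(x)                             # the 4/n rule of thumb
print(f"kept {keep.sum()} of {len(x)} points")
```

Note that Cook's distance measures each point's influence on a single pooled fit, so it is not designed to tell two sub-populations apart.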

I also tried singular value decomposition, to no avail!

This should be pretty basic, with numerous ways to resolve it; still, I am wondering what the best approach would be for selecting only the data points with perfect correlation across the two variables. Any ideas in R or Python are much appreciated!

whuber
RJF
  • For these data, fitting a two-component Gaussian mixture model looks attractive. There is no "best approach" for fitting *perfect* correlation, because there are so many solutions: indeed, select any pair of points having distinct coordinates: their correlation will be $\pm 1.$ – whuber May 27 '21 at 15:48
  • Could you add *why* you want to do this? If the application is clear, a better alternative might be suggested. One potential goal of your suggested approach that came to mind has an excellent answer by the wizard above: https://stats.stackexchange.com/a/313138/176202 – Frans Rodenburg May 27 '21 at 16:09
  • @FransRodenburg: Thank you both for your comments. I have constructed a model that appears to generalize very well when applied to only a portion of my validation set. I have some assumptions about this sub-sample that I'd like to investigate further (for example, I expect this sub-sample to have the same ethnicity as my training and test sets, a confounder that was not corrected for appropriately when building the model). My model can be summarized as a single score (var.1) that is supposed to correlate perfectly with another score (var.2) for the whole dataset. – RJF May 27 '21 at 17:51
  • I'd like to identify the data points that fall on the regression line with an intercept of zero. I hope that makes sense, though! – RJF May 27 '21 at 17:52
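
The two-component Gaussian mixture suggested in the comments above could be sketched roughly as follows. This is a minimal illustration, not a definitive solution: it assumes scikit-learn is available, uses simulated stand-in data (the real var.1 and var.2 are not shown), and picks the component with the highest within-component correlation, which is just one illustrative heuristic rather than something proposed in the thread.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Simulated stand-in data (hypothetical; replace with the real var.1 / var.2).
rng = np.random.default_rng(1)
x_line = rng.uniform(0, 10, 150)
y_line = x_line + rng.normal(0, 0.05, 150)        # tight, near-perfect relationship
x_cloud = rng.uniform(0, 10, 350)
y_cloud = 0.5 * x_cloud + rng.normal(0, 2, 350)   # noisier relationship
X = np.column_stack([np.concatenate([x_line, x_cloud]),
                     np.concatenate([y_line, y_cloud])])

# Fit two full-covariance Gaussian components and assign each point to one.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      n_init=5, random_state=0).fit(X)
labels = gmm.predict(X)

# Correlation implied by each component's fitted covariance matrix.
cov = gmm.covariances_                            # shape (2, 2, 2)
corr = cov[:, 0, 1] / np.sqrt(cov[:, 0, 0] * cov[:, 1, 1])

# Treat the component with the strongest correlation as the "on the line" group.
tight = int(np.argmax(np.abs(corr)))
on_line = labels == tight
print(f"component correlations: {corr.round(3)}")
print(f"{on_line.sum()} of {len(X)} points assigned to the tight component")
```

Points where the two groups overlap cannot be assigned with certainty; `gmm.predict_proba(X)` gives per-point membership probabilities if a stricter cutoff is wanted.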

0 Answers