Selecting data based on collinearity

Question

I have two variables that are supposed to correlate with each other across the whole dataset, but as you can see in the scatter plot below, it appears that I have a mix of two sub-samples.

In one, the variables correlate perfectly with each other (i.e., the data points fall on the regression line); in the other, larger one, the two variables are much less strongly correlated.

I'd like to select the data points for which the two variables correlate perfectly with each other. I tried fitting a simple regression line and removing outliers with Cook's distance > 4/n (see the scatter plot below).
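For reference, here is a minimal sketch of that Cook's distance filter in plain Python. The data are simulated purely for illustration (60 points exactly on y = x plus 140 noisy ones), and the function name `cooks_distance` is a hypothetical helper, not a library call. Cook's distance measures how much each point influences the fitted line, so the 4/n cutoff flags only the most influential points and leaves most of the noisy points in place.

```python
import random

def cooks_distance(x, y):
    """Cook's distance of every point under simple OLS, y = a + b*x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)            # residual variance
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]  # leverages h_ii
    # D_i = e_i^2 / (p * mse) * h_ii / (1 - h_ii)^2
    return [e * e / (p * mse) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]

# Simulated stand-in for the real data (an assumption, not the asker's data).
random.seed(0)
x = [random.uniform(0, 10) for _ in range(200)]
y = [xi if i < 60 else xi + random.gauss(0, 2) for i, xi in enumerate(x)]

d = cooks_distance(x, y)
keep = [i for i, di in enumerate(d) if di <= 4 / len(x)]
noisy_kept = sum(1 for i in keep if i >= 60)
print(f"kept {len(keep)} of {len(x)} points, {noisy_kept} of them noisy")
```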

I also tried singular value decomposition, to no avail!

This should be pretty basic, with numerous ways to resolve it. Still, I am wondering what the best approach would be for selecting only the data points with perfect correlation across the two variables. Any ideas in R or Python are much appreciated!

For these data, fitting a two-component Gaussian mixture model looks attractive. There is no "best approach" for fitting *perfect* correlation, because there are so many solutions: indeed, select any pair of points having distinct coordinates: their correlation will be $\pm 1.$ — whuber, May 27 '21 at 15:48
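As a rough illustration of the mixture idea, the sketch below uses a simplified one-dimensional variant rather than a full bivariate mixture: it takes the residuals from a zero-intercept least-squares fit and models them as a mixture of two zero-mean Gaussians with different spreads, fitted by EM. The simulated data, the helper name `tight_posteriors`, the starting values, and the 0.5 cutoff are all assumptions made for the example.

```python
import math
import random

def tight_posteriors(resid, n_iter=100, var_floor=1e-12):
    """EM for two zero-mean Gaussians; returns P(tight component | residual)."""
    n = len(resid)
    total = sum(r * r for r in resid) / n
    var_tight, var_wide, weight = total / 100, total, 0.5  # crude starting values
    post = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability that each residual is from the tight component.
        post = []
        for r in resid:
            a = weight * math.exp(-r * r / (2 * var_tight)) / math.sqrt(var_tight)
            b = (1 - weight) * math.exp(-r * r / (2 * var_wide)) / math.sqrt(var_wide)
            post.append(a / (a + b) if a + b > 0 else 0.0)
        # M-step: update the mixing weight and both variances (floored to avoid collapse).
        s = sum(post)
        weight = min(max(s / n, 1e-6), 1 - 1e-6)
        var_tight = max(sum(p * r * r for p, r in zip(post, resid)) / max(s, 1e-12), var_floor)
        var_wide = max(sum((1 - p) * r * r for p, r in zip(post, resid)) / max(n - s, 1e-12), var_floor)
    return post

# Simulated stand-in for the real data (an assumption, not the asker's data).
random.seed(1)
x = [random.uniform(1, 10) for _ in range(200)]
y = [xi if i < 60 else xi + random.gauss(0, 2) for i, xi in enumerate(x)]

# Residuals from a least-squares line through the origin.
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
resid = [yi - slope * xi for xi, yi in zip(x, y)]

post = tight_posteriors(resid)
selected = [i for i, p in enumerate(post) if p > 0.5]
print(f"{len(selected)} points assigned to the tight component")
```

Noisy points that happen to lie close to the line will also be assigned to the tight component, so the selection is probabilistic rather than exact.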
Could you add *why* you want to do this? If the application is clear, a better alternative might be suggested. One potential goal of your suggested approach that came to mind has an excellent answer by the wizard above: https://stats.stackexchange.com/a/313138/176202 — Frans Rodenburg, May 27 '21 at 16:09
@FransRodenburg: Thank you both for your comments. I have constructed a model that appears to generalize very well when applied to only a portion of my validation set. I have some assumptions about this sub-sample that I'd like to investigate further (for example, I expect this sub-sample to have the same ethnicity as my training and test set - a confounder that has not been corrected appropriately when building the model). My model can be summarized as a single score (var.1) that is supposed to correlate perfectly with another score (var.2) for the whole dataset. — RJF, May 27 '21 at 17:51
I'd like to identify the data points that fall on the regression line with an intercept of zero. I hope that makes sense, though! — RJF, May 27 '21 at 17:52
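One way to make "points that fall on a zero-intercept line" concrete is a RANSAC-style consensus search, sketched below. This is not a method proposed in the thread: each point proposes the slope y/x of the line through the origin, and the slope that the most points agree with (within a tolerance) wins. The simulated data, the helper name `best_origin_line`, and the tolerance `tol` are assumptions; real scores with rounding or measurement error would need a looser tolerance.

```python
import random

def best_origin_line(x, y, tol=1e-8):
    """Return (slope, inlier indices) for the through-origin line with the most near-exact points."""
    best_slope, best_inliers = None, []
    for xi, yi in zip(x, y):
        if xi == 0:
            continue  # a point at x = 0 does not define a slope
        slope = yi / xi
        # Points whose vertical distance to y = slope * x is within a relative tolerance.
        inliers = [j for j, (xj, yj) in enumerate(zip(x, y))
                   if abs(yj - slope * xj) <= tol * max(1.0, abs(yj))]
        if len(inliers) > len(best_inliers):
            best_slope, best_inliers = slope, inliers
    return best_slope, best_inliers

# Simulated stand-in for the real data (an assumption, not the asker's data):
# 60 points exactly on y = 1.5 * x and 140 noisy points around the same line.
random.seed(2)
x = [random.uniform(1, 10) for _ in range(200)]
y = [1.5 * xi if i < 60 else 1.5 * xi + random.gauss(0, 2) for i, xi in enumerate(x)]

slope, on_line = best_origin_line(x, y)
print(f"slope = {slope:.3f}, {len(on_line)} points on the line")
```

The search compares every candidate slope against every point, so it is quadratic in the number of points; for large datasets, sampling a subset of candidate points would be the usual shortcut.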
