Which method is best to find the correlation between 2 datasets in my case

Question

I have multiple datasets and need to find the correlation between them.

The problem is that my datasets consist mainly of zeros and ones (0 means the patient does not have the disease, 1 means the patient has the disease). Most of the values are zeros, and some columns are all zeros. Here is an example of two of my columns:

Column X

0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Column Y

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Computing Pearson's correlation gives a division-by-zero error.

I wonder what the best method is to measure how close the datasets are to each other:

Pearson's correlation?

Spearman’s correlation?

Mutual information?

Information Gain?

Some other method?

It looks like your $y$ is all $0$s. If a random variable has no variance (it is constant), the correlation coefficient is not defined. The software is giving you the correct answer. — Dave, Nov 04 '21 at 02:45
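To make the comment above concrete, here is a minimal sketch of Pearson's correlation written out by hand. The function name `pearson` and the short toy vectors are illustrative assumptions, not the asker's actual code or data. When either input is constant, its variance is zero, so the denominator is zero and the coefficient is undefined; the sketch returns `None` in that case instead of dividing.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns None when either input is constant, because the
    denominator sqrt(var_x * var_y) would be zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant variable
    return cov / math.sqrt(var_x * var_y)

# Toy stand-ins for the columns in the question (assumed, much shorter).
x = [0, 0, 1, 0, 0, 1, 0, 0]  # mostly zeros, a few ones
y = [0, 0, 0, 0, 0, 0, 0, 0]  # all zeros, like Column Y

result = pearson(x, y)
print(result)
```

Any column that is all zeros will hit the same undefined case, whichever library computes the coefficient.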

KirkD_CO · Answer 1 · 2021-11-04T13:16:25.907

For sparse binary vectors, try the Tanimoto coefficient (also called the Jaccard index). From this page:

Simply put, the Tanimoto Coefficient uses the ratio of the intersecting set to the union set as the measure of similarity. Represented as a mathematical equation:

$$T(a,b) = \frac{N_c}{N_a + N_b - N_c}$$

In this equation, $N$ represents the number of attributes in each object ($a$, $b$). $c$ in this case is the intersection set, so $N_c$ is the number of attributes that $a$ and $b$ have in common.
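As a rough illustration of the definition above, the following sketch computes the Tanimoto coefficient for two binary vectors, counting the ones in each vector and the positions where both are one. The function name `tanimoto` and the choice to return `None` when both vectors are all zeros (the ratio is 0/0 there) are assumptions made for this example; other conventions exist.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two equal-length 0/1 sequences."""
    n_a = sum(a)                                   # ones in a
    n_b = sum(b)                                   # ones in b
    n_c = sum(1 for u, v in zip(a, b) if u and v)  # ones shared by both
    union = n_a + n_b - n_c
    if union == 0:
        return None  # both vectors are all zeros: 0/0, left undefined here
    return n_c / union

a = [1, 1, 0, 0]
b = [1, 0, 1, 0]
print(tanimoto(a, b))  # one shared position out of three in the union
```

Because shared zeros never enter the ratio, long runs of zeros do not inflate the similarity, which is why the measure suits sparse binary data.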

This is very commonly used in chemistry, where very long binary vectors are often sparsely populated, and it is highly effective. There's a whole family of similarity measures of this type that you could try; some are discussed here.

EDIT: adding some details based on the comments - specifically, that the interest is to compare full datasets and not just two observations.

I would compute all the pairwise comparisons (every row of dataset 1 vs. every row of dataset 2) and then look at the distribution of values. The minimum, maximum, median, 25th percentile, and 75th percentile will give you an idea of the overall similarity. Plotting the similarities as a histogram will give a visual idea of how similar the two datasets are.
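One way this could look in code is sketched below, using only the Python standard library. The helper names (`tanimoto`, `cross_similarities`, `summarize`) and the tiny toy datasets are assumptions for illustration; pairs where both rows are all zeros are skipped because the coefficient is undefined for them, which is also an assumption about how to handle that case.

```python
import statistics
from itertools import product

def tanimoto(a, b):
    """Tanimoto similarity of two 0/1 rows; None if both are all zeros."""
    n_c = sum(1 for u, v in zip(a, b) if u and v)
    union = sum(a) + sum(b) - n_c
    return n_c / union if union else None

def cross_similarities(ds1, ds2):
    """Similarity of every row of ds1 against every row of ds2."""
    sims = (tanimoto(r1, r2) for r1, r2 in product(ds1, ds2))
    return [s for s in sims if s is not None]  # drop undefined pairs

def summarize(sims):
    """Five-number summary of a list of similarities (needs >= 2 values)."""
    q1, _, q3 = statistics.quantiles(sims, n=4)
    return {
        "min": min(sims),
        "q1": q1,
        "median": statistics.median(sims),
        "q3": q3,
        "max": max(sims),
    }

# Toy datasets (assumed): each inner list is one row of 0/1 values.
dataset_1 = [[1, 1, 0, 0], [0, 1, 1, 0]]
dataset_2 = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0]]

sims = cross_similarities(dataset_1, dataset_2)
summary = summarize(sims)
print(summary)
```

The list `sims` can be passed to any plotting library to draw the histogram. For large datasets the number of pairs grows as rows of dataset 1 times rows of dataset 2, so a vectorized or sampled version would likely be needed.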

It would also be good to compute the pairwise Tanimoto similarity within each dataset to give you a baseline for how self-similar the datasets are. Then when you compare between datasets you'll have an idea of how similar they are to each other relative to how self-similar they are.
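A minimal sketch of that baseline comparison follows, again with standard-library Python only. The helper names `within` and `between`, the toy datasets, and the use of the median as the single summary number are all assumptions for illustration. Within a dataset, each unordered pair of distinct rows is compared once, so a row is never compared with itself.

```python
import statistics
from itertools import combinations, product

def tanimoto(a, b):
    """Tanimoto similarity of two 0/1 rows; None if both are all zeros."""
    n_c = sum(1 for u, v in zip(a, b) if u and v)
    union = sum(a) + sum(b) - n_c
    return n_c / union if union else None

def within(ds):
    """Similarities for every unordered pair of distinct rows in one dataset."""
    sims = (tanimoto(r1, r2) for r1, r2 in combinations(ds, 2))
    return [s for s in sims if s is not None]

def between(ds1, ds2):
    """Similarities for every row of ds1 against every row of ds2."""
    sims = (tanimoto(r1, r2) for r1, r2 in product(ds1, ds2))
    return [s for s in sims if s is not None]

# Toy datasets (assumed): each inner list is one row of 0/1 values.
dataset_1 = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 0]]
dataset_2 = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]

baseline_1 = statistics.median(within(dataset_1))
baseline_2 = statistics.median(within(dataset_2))
cross = statistics.median(between(dataset_1, dataset_2))

print(baseline_1, baseline_2, cross)
```

If the between-dataset median sits well below both within-dataset medians, as it does for these toy rows, the datasets are less similar to each other than each is to itself.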

Thanks. I am talking about a dataset with many records. Shall I find the sum of Nc, Na, and Nb, or the sum of all T(a,b)? How does that work with datasets? — asmgx, Nov 04 '21 at 04:39
I would compute all the pairwise comparisons (every row of dataset 1 vs. every row of dataset 2) and then look at the distribution of values. The minimum, maximum, median, 25th percentile, and 75th percentile will give you an idea of the overall similarity. Plotting the similarities as a histogram will give a visual idea of how similar the two datasets are. — KirkD_CO, Nov 04 '21 at 12:15
Thinking about it a bit more.... It might also be good to compute the pairwise Tanimoto similarity within each dataset to give you a baseline for how self-similar the datasets are. Then when you compare between datasets you'll have an idea of how similar they are to each other relative to how self-similar they are. I'll add these comments to my answer. — KirkD_CO, Nov 04 '21 at 13:14
