
I have multiple datasets and need to find the correlation between them.

The problem is that my datasets consist mainly of zeros and ones (0 means the patient does not have the disease and 1 means the patient has it). Most of the values are zeros, and some columns are all zeros. Here is an example of two of my columns:

Columns X

0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Columns Y

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Computing Pearson's correlation gives a division-by-zero error.

I wonder what the best method is to measure how close the datasets are to each other:

Pearson's correlation?

Spearman’s correlation?

Mutual information?

Information Gain?

Some other method?

asmgx
  • It looks like your $y$ is all $0$s. If a random variable has no variance (it is constant), the correlation coefficient is not defined. The software is giving you the correct answer. – Dave Nov 04 '21 at 02:45
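
To illustrate the comment above, here is a minimal sketch of Pearson's correlation in plain Python. The function name `pearson` and the choice to return `None` for the undefined case are assumptions made for this example, not the behavior of any particular library.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations (covariance times n).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (variance times n).
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    denom = (ss_x * ss_y) ** 0.5
    if denom == 0:
        # At least one input is constant, so the ratio would be a division by zero.
        return None
    return cov / denom


column_x = [0, 1, 0, 0]
column_y = [0, 0, 0, 0]   # constant column, like Y in the question
print(pearson(column_x, column_y))
```

Because `column_y` never varies, its sum of squared deviations is zero and the coefficient has no defined value.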

1 Answer


For sparse binary vectors, try the Tanimoto coefficient (also called the Jaccard index). From this page:

Simply put, the Tanimoto Coefficient uses the ratio of the intersecting set to the union set as the measure of similarity. Represented as a mathematical equation:

$$T(a,b) = \frac{N_c}{N_a + N_b - N_c}$$

In this equation, N represents the number of attributes in each object (a,b). C in this case is the intersection set.

This is very commonly used in chemistry with very long binary vectors which are often sparsely populated and is highly effective. There's a whole family of similarity measures of this type that you could try, some are discussed here.
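
As a minimal sketch, the coefficient for two 0/1 vectors could be computed as below. The function name `tanimoto` is made up for this example, and returning NaN when both vectors are all zeros (where the ratio is 0/0) is an assumption of the sketch; you may prefer a different convention for that case.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    n_a = sum(1 for v in a if v)                    # ones in a
    n_b = sum(1 for v in b if v)                    # ones in b
    n_c = sum(1 for u, v in zip(a, b) if u and v)   # ones shared by both
    union = n_a + n_b - n_c
    if union == 0:
        # Both vectors are all zeros: 0/0 is undefined.
        # Returning NaN here is an assumption, not a standard.
        return float("nan")
    return n_c / union


x = [0, 0, 1, 0, 1, 0]
y = [0, 0, 1, 0, 0, 1]
print(tanimoto(x, y))  # one shared position out of three set in either vector
```

Note that a vector of all zeros, like column Y in the question, scores 0 against any vector that contains a 1, and is undefined only against another all-zero vector.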

EDIT: adding some details based on the comments - specifically, that the interest is to compare full datasets and not just two observations.

I would compute all the pairwise comparisons - every row of dataset 1 vs. every row of dataset 2 - and then look at the distribution of values. The minimum, maximum, median, 25th percentile, and 75th percentile will give you an idea of the overall similarity. Plotting the similarities as a histogram will give a visual idea of how similar the two datasets are.

It would also be good to compute the pairwise Tanimoto similarity within each dataset to give you a baseline for how self-similar the datasets are. Then when you compare between datasets you'll have an idea of how similar they are to each other relative to how self-similar they are.
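
A rough sketch of this comparison, using only the Python standard library, might look like the following. The helper names (`between_scores`, `within_scores`, `summarize`) and the toy datasets are hypothetical, and dropping undefined (NaN) pairs before summarizing is an assumption of the sketch. The histogram is left out to keep the example dependency-free.

```python
from itertools import combinations, product
from statistics import quantiles


def tanimoto(a, b):
    """Tanimoto similarity of two equal-length 0/1 vectors (NaN if both are all zeros)."""
    n_a = sum(1 for v in a if v)
    n_b = sum(1 for v in b if v)
    n_c = sum(1 for u, v in zip(a, b) if u and v)
    union = n_a + n_b - n_c
    return n_c / union if union else float("nan")


def between_scores(ds1, ds2):
    """Similarity of every row of ds1 against every row of ds2, skipping NaN pairs."""
    scores = (tanimoto(a, b) for a, b in product(ds1, ds2))
    return [s for s in scores if s == s]  # NaN is the only value not equal to itself


def within_scores(ds):
    """Similarity of every unordered pair of distinct rows in ds, skipping NaN pairs."""
    scores = (tanimoto(a, b) for a, b in combinations(ds, 2))
    return [s for s in scores if s == s]


def summarize(scores):
    """Min, quartiles, and max of a list of at least two scores."""
    q1, median, q3 = quantiles(scores, n=4)
    return {"min": min(scores), "q1": q1, "median": median, "q3": q3, "max": max(scores)}


# Toy data for illustration only: each inner list is one row (record).
ds1 = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 1, 0]]
ds2 = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]

print("between:", summarize(between_scores(ds1, ds2)))
print("within ds1:", summarize(within_scores(ds1)))
```

Comparing the "between" summary against the "within" summaries gives the baseline described above. For large datasets the number of pairs grows quickly, so sampling rows may be necessary.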

KirkD_CO
  • Thanks. I am talking about a dataset with many records. Should I find the sum of Nc, Na, and Nb, or the sum of all T(a,b)? How does that work with datasets? – asmgx Nov 04 '21 at 04:39
  • I would compute all the pairwise comparisons - every row of dataset 1 vs. every row of dataset 2 - and then look at the distribution of values. The minimum, maximum, median, 25th percentile, and 75th percentile will give you an idea of the overall similarity. Plotting the similarities as a histogram will give a visual idea of how similar the two datasets are. – KirkD_CO Nov 04 '21 at 12:15
  • Thinking about it a bit more.... It might also be good to compute the pairwise Tanimoto similarity within each dataset to give you a baseline for how self-similar the datasets are. Then when you compare between datasets you'll have an idea of how similar they are to each other relative to how self-similar they are. I'll add these comments to my answer. – KirkD_CO Nov 04 '21 at 13:14