
I have the following table as a pandas DataFrame with features feat1 and feat2:

import pandas as pd

testframe = pd.DataFrame(columns = ['feat1', 'feat2'])
testframe['feat1'] = [1,0,1,0,1,0,1,1,0,1]
testframe['feat2'] = [1,0,1,0,0,0,1,1,0,0]

where the index is the observation number (e.g. people).

Let's assume that the features are not normally distributed, which I found out with the Shapiro-Wilk test.

I want to find out whether there is any correlation between feat1 and feat2, so I use the Mann-Whitney U test. As a result I get a U value and a p-value. To learn more about the two features, I want to calculate the effect size. While searching for a suitable measure, I found the Pearson correlation coefficient, but as far as I remember, it is only suitable for linear relationships and normally distributed values.

What would be a proper effect size measure for the Mann-Whitney U test? And is there a Pythonic way to compute it without many intermediate steps?

Thanks!

Stochastic
nopact

1 Answer


The Mann-Whitney U test checks whether two independent samples were drawn from populations with the same distribution. It is non-parametric (meaning it does not assume a particular distribution for your data) and compares the ranks of the two groups. It says nothing about correlation.
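
For the case the test is actually designed for, two independent samples, one commonly reported effect size is the rank-biserial correlation, r = 2U/(n1*n2) - 1, where U is the statistic of the first sample. The following is only a minimal sketch in plain Python: group_a and group_b are made-up values, not data from the question, and the helper names mann_whitney_u and rank_biserial are hypothetical.

```python
def mann_whitney_u(x, y):
    """U statistic of x: number of pairs (xi, yj) with xi > yj; ties count as 0.5."""
    return sum(
        1.0 if xi > yj else 0.5 if xi == yj else 0.0
        for xi in x
        for yj in y
    )


def rank_biserial(x, y):
    """Rank-biserial correlation, ranging from -1 to 1 (0 means no tendency)."""
    u = mann_whitney_u(x, y)
    return 2.0 * u / (len(x) * len(y)) - 1.0


# Hypothetical independent samples, for illustration only
group_a = [1.1, 2.3, 3.5, 4.7, 5.9]
group_b = [0.5, 1.8, 2.0, 3.0]

u_stat = mann_whitney_u(group_a, group_b)
effect = rank_biserial(group_a, group_b)
print(u_stat, effect)
```

A value near 1 means values in group_a tend to exceed those in group_b, and a value near -1 means the opposite. The pairwise loop is quadratic in the sample sizes, so it is meant for illustration rather than large data.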

You can use Spearman's rank correlation:

from scipy import stats
stats.spearmanr(testframe['feat1'], testframe['feat2'])
# Output:
# SpearmanrResult(correlation=0.6666666666666667, pvalue=0.03526520347507997)

However, I hope the test data you provided is not the real data you have. It is binary, meaning it contains only 1 and 0. If that is the case, you can use the Jaccard index:

from sklearn.metrics import jaccard_score
jaccard_score(testframe['feat1'], testframe['feat2'])
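
To make clear what the Jaccard index measures, here is a small sketch that computes it by hand for the example columns, assuming 1 is treated as the positive label: the number of observations where both features are 1, divided by the number where at least one is 1. The variable names both and either are only illustrative.

```python
feat1 = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1]
feat2 = [1, 0, 1, 0, 0, 0, 1, 1, 0, 0]

# Observations where both features are 1 (intersection)
both = sum(1 for a, b in zip(feat1, feat2) if a == 1 and b == 1)
# Observations where at least one feature is 1 (union)
either = sum(1 for a, b in zip(feat1, feat2) if a == 1 or b == 1)

jaccard = both / either
print(both, either, jaccard)
```

Here 4 observations have both features set and 6 have at least one set, so the index is 4/6, roughly 0.67.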
StupidWolf