I want to compare the distributions of two very large datasets (on the order of petabytes).
I have tried SciPy's two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`) on each column, but every p-value comes out as 0. Only when I downsample by randomly sampling about 200 rows do some columns come out as similar (p-value > 0.05). But 200 rows is not representative of such a huge dataset.
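For reference, what I am doing looks roughly like the sketch below. The synthetic arrays, the `compare_columns` helper, and its parameters are placeholders I made up for illustration; they are not my real data or pipeline. Only `scipy.stats.ks_2samp`, the roughly 200-row downsample, and the 0.05 threshold match what I actually ran.

```python
import numpy as np
from scipy.stats import ks_2samp


def compare_columns(a, b, n_rows=None, alpha=0.05, seed=0):
    """Run a two-sample KS test on each column of a and b.

    If n_rows is given, both arrays are randomly downsampled to that many
    rows (without replacement) before testing.
    Returns a list of (column_index, ks_statistic, p_value, looks_similar).
    """
    rng = np.random.default_rng(seed)
    if n_rows is not None:
        a = a[rng.choice(a.shape[0], size=n_rows, replace=False)]
        b = b[rng.choice(b.shape[0], size=n_rows, replace=False)]

    results = []
    for j in range(a.shape[1]):
        res = ks_2samp(a[:, j], b[:, j])
        # A column is flagged "similar" when the test fails to reject at alpha.
        results.append((j, res.statistic, res.pvalue, res.pvalue > alpha))
    return results


# Hypothetical stand-in data: three numeric features per dataset,
# with a very small mean shift between the two datasets.
rng = np.random.default_rng(42)
data1 = rng.normal(loc=0.0, scale=1.0, size=(200_000, 3))
data2 = rng.normal(loc=0.02, scale=1.0, size=(200_000, 3))

full_results = compare_columns(data1, data2)               # all rows
small_results = compare_columns(data1, data2, n_rows=200)  # ~200 rows

for name, results in [("full", full_results), ("200 rows", small_results)]:
    for col, stat, pval, similar in results:
        print(f"{name:>8} | col {col} | D = {stat:.4f} | p = {pval:.3g} | similar = {similar}")
```

The printout includes the KS statistic `D` next to each p-value so that the size of the difference can be seen separately from whether it is flagged as significant.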
Is there another statistical test that can handle datasets of this size, or a better way to compare the distributions of two large datasets? Also, I am currently comparing each feature on its own. Is there a method that lets me compare a combination of features (their joint distribution) in dataset 1 vs. dataset 2?
Thank you for your help.