
I want to compare two super large datasets (petabytes).

I have tried scipy's ks_2samp (the two-sample Kolmogorov–Smirnov test), but all the p-values come out as 0. I also tried downsampling by randomly sampling only around 200 rows; then some columns do get flagged as similar (p-value > 0.05), but 200 rows is not representative of such a huge dataset.
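For reference, here is a minimal sketch of the per-column comparison I am running (the DataFrame names, the 200-row sampling, and the 0.05 threshold are placeholders for my actual pipeline):

```python
import pandas as pd
from scipy.stats import ks_2samp

def compare_columns(df1, df2, alpha=0.05):
    """Run the two-sample KS test on every column shared by the two frames."""
    results = {}
    for col in df1.columns.intersection(df2.columns):
        stat, pvalue = ks_2samp(df1[col].dropna(), df2[col].dropna())
        results[col] = {"ks_stat": stat, "pvalue": pvalue, "similar": pvalue > alpha}
    return pd.DataFrame(results).T

# Downsampled version (what I tried): ~200 random rows from each dataset
# sample1 = df1.sample(n=200, random_state=0)
# sample2 = df2.sample(n=200, random_state=0)
# print(compare_columns(sample1, sample2))
```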

Is there another statistical method that can handle large datasets, or a better way to compare the distributions of large datasets? Also, since I am currently comparing each feature on its own, is there a method that allows me to compare a combination of features in dataset 1 vs dataset 2?

Thank you for your help.

eun ji
  • What’s wrong with getting a p-value of zero? With the squillion points you have in your petabytes of data, that’s exactly what I would expect: exceptional ability to detect even the smallest of differences. // Since you’ve calculated a p-value, you’ve overcome whatever hardware issues arise with petabytes of data, so let’s set that aside. If you had mere kilobytes of data, what would you do? – Dave Jan 19 '22 at 09:54
  • @Dave is it advisable to run the test on the entire dataset when it is this large? As you mention, it detects even the smallest differences, but the distributions may still look similar overall when histograms are plotted. I do not want to rely on diagrams, though, as I want an automated process to detect whether two datasets are similar. With kilobytes of data the KS test would be able to flag which features are similar, rather than returning all p-values of 0. Do you think I could also look at skewness/kurtosis etc.? – eun ji Jan 19 '22 at 10:06
  • I tend to take the stance that, if you’re rooting for a large p-value, you’re misusing hypothesis testing. It might be best for you to say exactly what you want to do with the hypothesis test that has you hoping not to catch a difference. – Dave Jan 19 '22 at 10:09
  • One approach is first to generate a much simpler, but highly accurate, representation of each distribution and use those for comparison. The comparison will, *of course,* exhibit some differences: the point is to identify, characterize, and quantify those differences, not to perform some kind of hypothesis test. See https://stats.stackexchange.com/questions/35220 for techniques (a rough sketch of this summary-based idea follows these comments). As for comparing "combinations," you will need to explain what you mean by this term and what specific properties you wish to compare. – whuber Jan 19 '22 at 17:02
  • @Dave What I actually want to do is compare train and test sets to see which training set is most similar in distribution to the test set. Then I will use the training set that is most similar to the test set for prediction. – eun ji Jan 20 '22 at 01:18
  • That’s a problematic way of picking your test set. The point of a test set is to get an honest estimate of how your model would perform when you put it in production (e.g., push new speech recognition software to Alexa), not to rig the results to make you look good. Taking it to the extreme, why not just use the training set as your test set? – Dave Jan 20 '22 at 01:53
  • @Dave I am not picking a test set. I am picking a train set to predict on my test set. – eun ji Jan 20 '22 at 02:54
  • That’s still sketchy. You’re peeking at the data you’re not supposed to see. – Dave Jan 20 '22 at 02:56
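Following whuber's suggestion above, here is a hedged, minimal sketch of a summary-based comparison rather than a hypothesis test: each column is reduced to a fixed grid of empirical quantiles, and the two summaries are compared with descriptive distances instead of p-values. Over petabytes the quantiles would in practice come from a streaming/approximate quantile sketch rather than np.quantile on the full data; the function names and the simulated columns below are illustrative only.

```python
import numpy as np

def quantile_summary(values, n_quantiles=1000):
    """Compress one column to a fixed grid of empirical quantiles."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    return np.quantile(values, probs)

def summary_distance(q1, q2):
    """Descriptive distances between two quantile summaries of equal length."""
    return {
        "max_abs_diff": float(np.max(np.abs(q1 - q2))),   # worst-case gap between the quantile functions
        "mean_abs_diff": float(np.mean(np.abs(q1 - q2))),  # approximates the 1-Wasserstein distance
    }

# Illustrative stand-ins for one feature from each dataset
rng = np.random.default_rng(0)
col_a = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
col_b = rng.normal(loc=0.1, scale=1.0, size=1_000_000)
print(summary_distance(quantile_summary(col_a), quantile_summary(col_b)))
```

The output quantifies how far apart the two distributions are (here, roughly the 0.1 shift in location), which can be thresholded or ranked across features automatically, instead of asking whether any difference exists at all.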

0 Answers