I'm looking for a method to identify data drift of features between two different times.
Background: I'm calculating the same features, on almost the same population (for example, company employees), every month. The population size is over 100K. An example of a feature is "How many unique websites did this user visit during the month?"
Before making predictions (a classification problem), I want to check that the distribution of each feature has remained pretty much the same as last month, and raise an alert if not. I'm not interested in small changes; I'm looking for large changes that might harm my model. The model was deployed once and is not retrained on new data; it only performs predictions. I want to understand when the model is no longer relevant (or needs to be retrained) due to data drift.
I can't rely on my performance metric, because it takes 2-3 months until I get ground truth for the model's predictions (the performance metric was used to train and validate the deployed model on past data). That's why I want to focus on the distributions of the features rather than on a metric.
What I've done: For numerical features, I ran the KS test, and most of the features turned out to be significantly different (p < 0.05) between two consecutive months. However, when I examined the data, I saw only a minor difference in the mean and variance. Even when I plotted the histograms of the two months side by side, I saw a minor difference, but nothing alarming.
I think this is due to my large sample size. So I created a dummy test: I sampled 100K instances from an N(0,1) distribution and another 100K from an N(0.001,1) distribution, and ran a KS test on the two samples. I repeated this experiment multiple times, and most of the time the KS test concluded that the two distributions are different. I don't think such a small difference should affect my model (a Random Forest).
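For reference, the dummy experiment above can be reproduced in a few lines with `scipy.stats.ks_2samp` (the exact statistic and p-value will vary with the random seed):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Two 100K samples whose true distributions differ only by a
# mean shift of 0.001 -- a difference that should be practically
# irrelevant, yet may still produce a small p-value at this n.
a = rng.normal(loc=0.0, scale=1.0, size=100_000)
b = rng.normal(loc=0.001, scale=1.0, size=100_000)

stat, p = ks_2samp(a, b)
print(f"KS statistic D = {stat:.5f}, p-value = {p:.4f}")
```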
Is there another way to test for data drift between two distributions that only flags differences above some minimum effect size? (Yes, I know it's a bit of an obscure request.)
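One way I considered to stay within the KS framework but get an effect-size cutoff: ignore the p-value and threshold the KS statistic D itself (the maximum distance between the two empirical CDFs, a value in [0, 1]). A minimal sketch; the function name and the 0.1 threshold are illustrative choices, not a standard:

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(last_month, this_month, d_threshold=0.1):
    """Alert only when the KS statistic D (max CDF distance)
    exceeds an effect-size threshold, ignoring the p-value."""
    d, _p = ks_2samp(last_month, this_month)
    return d, d >= d_threshold

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100_000)
y_small = rng.normal(0.001, 1.0, 100_000)  # negligible shift
y_large = rng.normal(0.5, 1.0, 100_000)    # clearly shifted

print(drift_alert(x, y_small))  # tiny D, no alert
print(drift_alert(x, y_large))  # large D, alert
```

The threshold would have to be calibrated per feature (e.g. from historical month-to-month variation), since what counts as "large" depends on the feature.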
I saw somewhere that I can concatenate the two samples, add a label indicating the source of each row (0 for last month, 1 for the current month), and train a simple classifier (logistic regression, Random Forest) to distinguish between the two, using only the feature's values as input. If the classifier's AUC is around 0.5, the two distributions are practically indistinguishable; otherwise they differ (and I would need to choose an appropriate AUC threshold).
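A minimal sketch of that classifier two-sample test, using logistic regression and cross-validated AUC (the function name and sample sizes are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def drift_auc(last_month, this_month):
    """Train a classifier to tell the two months apart.
    Cross-validated AUC near 0.5 means no detectable drift."""
    X = np.concatenate([last_month, this_month]).reshape(-1, 1)
    y = np.concatenate([np.zeros(len(last_month)),
                        np.ones(len(this_month))])
    clf = LogisticRegression()
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

rng = np.random.default_rng(2)
# identical distributions -> AUC should hover around 0.5
same = drift_auc(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
# a clear mean shift -> AUC well above 0.5
shifted = drift_auc(rng.normal(0, 1, 10_000), rng.normal(1, 1, 10_000))
print(f"no drift AUC = {same:.3f}, drifted AUC = {shifted:.3f}")
```

Run per feature for interpretable alerts, or on all features at once to catch joint drift; a Random Forest in place of the logistic regression would also pick up changes in shape, not just location.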
Questions
- Do you think a KS test is a good indicator of data drift when the sample size is large?
- If not, is there another statistical test/method suitable for detecting data drift in large datasets?
- What do you think of the method that trains an ML classifier to distinguish between the two distributions?
Thanks