When comparing the performance of classifiers across two different datasets, I use the average precision metric (the datasets are very imbalanced, so ROC AUC or plain precision are not preferable, as has been discussed in this community often, e.g. here).
Now, what if the datasets I am comparing differ strongly in their class imbalance? We know that the baseline value for PR AUC / average precision is the share of positive examples in the dataset. Imagine I want to compare the performance of a classifier between the "raw" dataset and one where I used over- or undersampling techniques to counteract the class imbalance:
|                                                     | Raw Dataset | Over/Undersampled Dataset |
|-----------------------------------------------------|-------------|---------------------------|
| Share of Positive Class                             | 5%          | 20%                       |
| Average Precision Score                             | 50%         | 60%                       |
| Improvement over Baseline (AP − Share of Positives) | 45%         | 40%                       |
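For concreteness, here is a minimal sketch of the comparison I have in mind. The synthetic data from `make_classification`, the logistic regression model, and the simple random `undersample` helper are just placeholders for my actual setup; note that the resampling is applied before the train/test split, so the evaluation baseline itself changes, as in the table above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic data with ~5% positives, standing in for the "raw" dataset.
X, y = make_classification(
    n_samples=20_000, n_features=20, weights=[0.95, 0.05], random_state=0
)

def undersample(X, y, pos_share=0.20, seed=0):
    """Randomly drop negatives until positives make up `pos_share` of the data."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_neg = int(len(pos_idx) * (1 - pos_share) / pos_share)
    keep = np.concatenate([pos_idx, rng.choice(neg_idx, n_neg, replace=False)])
    return X[keep], y[keep]

def evaluate(X, y, name):
    """Fit a classifier on one dataset and report AP against the positive-share baseline."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    ap = average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])
    baseline = y_te.mean()  # expected AP of a no-skill classifier = share of positives
    print(f"{name}: baseline={baseline:.1%}, AP={ap:.1%}, improvement={ap - baseline:.1%}")

evaluate(X, y, "raw")
evaluate(*undersample(X, y), "undersampled")
```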
Is it correct to conclude that the classifier performs better when trained on the raw dataset? Or is this way of comparing performance between datasets flawed in itself? Are there other approaches that make sense?