I am comparing an ML classifier to several benchmark classifiers using F1 scores. By AUPRC, my classifier does worse than the benchmark methods. When I compared F1 scores, however, I got the curious result that my classifier does better than the other methods. After looking at the precision-recall curve, I realized that my model was indeed performing better in precision/recall at the default probability threshold where the F1 score is evaluated (classifier probability > 0.5), but performed worse at other thresholds.
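For concreteness, here is a minimal sketch (not my actual pipeline) of how I am computing the two metrics that disagree; `y_true` and `y_prob` stand in for one model's labels and predicted probabilities, and the synthetic data is purely illustrative:

```python
import numpy as np
from sklearn.metrics import f1_score, average_precision_score, precision_recall_curve

# Illustrative imbalanced data; in reality y_true/y_prob come from my models.
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.1, size=1000)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, size=1000), 0.0, 1.0)

# F1 at the default 0.5 cutoff vs. the threshold-free AUPRC summary.
f1_default = f1_score(y_true, (y_prob > 0.5).astype(int))
auprc = average_precision_score(y_true, y_prob)
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
print(f"F1@0.5 = {f1_default:.3f}, AUPRC = {auprc:.3f}")
```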
When I report these results, should I instead be comparing models by the maximum possible F1 score across thresholds? Note that I am working with an imbalanced dataset, and the precision/recall tradeoff matters for my use case (hence F1 as the performance metric).
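By "maximum possible F1" I mean something like the following sketch, which takes the best F1 over all thresholds on the precision-recall curve (it assumes the same `y_true`/`y_prob` arrays as above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# precision/recall have one more entry than thresholds; the final (1, 0) point
# has no associated threshold, so it is dropped before taking the argmax.
f1_curve = np.divide(2 * precision * recall, precision + recall,
                     out=np.zeros_like(precision), where=(precision + recall) > 0)
best = int(np.argmax(f1_curve[:-1]))
print(f"max F1 = {f1_curve[best]:.3f} at threshold = {thresholds[best]:.3f}")
```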