I will answer a slightly rephrased version of this question: "What is the best sampling scheme for imbalanced data?"
In my answer below I link to a few different articles and explain how to deal with imbalanced data to build a stable classifier, whether that is in a live/production environment or in a static model. Live/production models are sensitive to shifts in the data, so it is hard to know up front exactly how they will behave, but a good sampling scheme and loss function will go a long way toward addressing class imbalance.
The classic imbalanced-sampling technique is SMOTE (reference below), which
oversamples the minority class by generating synthetic examples to increase
its prevalence. Boosting algorithms such as AdaBoost do something similar:
each round up-weights the cases the previous round got wrong in order to fix
its prediction errors. Focal loss is similar in spirit in that it down-weights
the "easy" examples in the loss function, so it makes sense to use it here.
The tricky part is that boosting algorithms are prone to overfitting, since
their reweighting chases the remaining errors, so you must always be careful
about how you combine sampling schemes and loss functions. That is the only
caveat with them. All three references are listed below, followed by a short
SMOTE sketch.
SMOTE: Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002).
SMOTE: Synthetic minority over-sampling technique. Journal of Artificial
Intelligence Research, 16, 321-357.
AdaBoost: Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for
AdaBoost. Machine Learning, 42(3), 287-320.
Focal: Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal
loss for dense object detection. In Proceedings of the IEEE International
Conference on Computer Vision (pp. 2980-2988).
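Here is a minimal sketch of SMOTE-style oversampling, assuming the imbalanced-learn (imblearn) package is available; the synthetic 90/10 dataset, the logistic-regression model, and the hyperparameters are illustrative placeholders only:

```python
# Minimal SMOTE sketch (assumes imbalanced-learn is installed).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced data stands in for a real dataset.
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# SMOTE interpolates new minority-class points between existing neighbours;
# it is applied to the training split only, never to the test data.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_res))

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print("test accuracy:", clf.score(X_test, y_test))
```

Note that only the training split is resampled; the test split keeps the original class ratio, so the evaluation reflects the distribution the model would actually see in production.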
Does it make sense to use focal loss for a tree-based classifier like XGBoost?
The one big issue to consider (which was addressed in the comment above) is that, for optimal performance, you also want to train the model on data similar to what it will see in production. You will hurt it by forcing a 50/50 class balance if the live data will really be 20/80. That is where approaches that work through the loss function, like focal loss, or AdaBoost, which up-weights the "wrong" cases in each subsequent boosting round to correct the errors, come in; simply oversampling the minority class might hurt. A sketch of a focal-loss objective for XGBoost follows.
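Here is a hedged sketch of one way to plug a focal loss into XGBoost via its custom-objective hook; the gamma/alpha values, the synthetic dataset, and the finite-difference derivatives are assumptions made for brevity, not a reference implementation:

```python
# Sketch: focal loss as a custom XGBoost objective.
# XGBoost's training API accepts an objective that returns per-sample
# gradients and Hessians; here both are approximated by central finite
# differences, which is slow but keeps the example short and easy to check.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

GAMMA, ALPHA = 2.0, 0.25  # common focal-loss settings, assumed here


def focal_loss(z, y):
    """Element-wise focal loss on raw margins z with labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-z))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(ALPHA * y * (1 - p) ** GAMMA * np.log(p)
             + (1 - ALPHA) * (1 - y) * p ** GAMMA * np.log(1 - p))


def focal_objective(preds, dtrain):
    """Custom objective: gradient and Hessian of the focal loss w.r.t. preds."""
    y = dtrain.get_label()
    eps = 1e-4
    f_plus = focal_loss(preds + eps, y)
    f_minus = focal_loss(preds - eps, y)
    f_zero = focal_loss(preds, y)
    grad = (f_plus - f_minus) / (2 * eps)
    # Clamp the Hessian away from zero so splits stay numerically stable.
    hess = np.maximum((f_plus - 2 * f_zero + f_minus) / eps ** 2, 1e-6)
    return grad, hess


# Illustrative 90/10 imbalanced data, as in the SMOTE sketch above.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 4, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=focal_objective)
```

The finite differences are only there to keep the sketch easy to verify; in practice you would derive the gradient and Hessian analytically for speed, but the overall shape of the custom objective stays the same.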