Relative to the number of predictors you're considering, what you have is a very small sample, not one of "moderate size." An adequately sized dataset for logistic regression typically needs on the order of 10-20 cases in the smaller outcome class per candidate predictor, unless you use penalization. With about 10 predictors that means roughly 100-200 cases in the smaller class; with 15-20 cases per group you have only about 10% of that. So it's not surprising that you ran into perfect separation and had to move to the penalized Firth approach just to get coefficients at all.
One way to get around the problem would be to use your knowledge of the subject matter to combine the various similar datasets noted in your comment. If the samples are from the same population but some datasets are missing certain predictors present in other datasets, you could consider multiple imputation as a way to combine all the information you have.
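If you take that route, the mice package is one way to pool information across datasets that measured different subsets of predictors. A rough sketch, assuming the datasets have been stacked row-wise into a single data frame `combined`, with NA wherever a dataset lacked a predictor (all names here are placeholders, not your actual variables):

```r
library(mice)

## `combined` stacks the similar datasets row-wise, with NA where a given
## dataset did not measure a predictor; `outcome`, `x1`, `x2` are placeholders.
imp <- mice(combined, m = 20, seed = 101)   # 20 multiply imputed datasets

## Fit the logistic model on each imputed dataset and pool with Rubin's rules.
fits <- with(imp, glm(outcome ~ x1 + x2, family = binomial))
summary(pool(fits))
```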
Automated methods for model selection have many problems. Best-subset selection, even with cross-validation to minimize overfitting (as provided by the bestglm package that you linked), will depend heavily on the particular data sample at hand. With typical data sets containing correlated predictors, you will find that the identities of the "best subset" members change from fold to fold of CV, or among models fit to bootstrap samples of the data. So there won't be a single "best model"; the best you can do is show that the modeling process provides a result that is reasonably likely to meet your requirements.
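You can see this instability with your own data by rerunning the selection on bootstrap resamples and tabulating how often each predictor is chosen. A rough sketch, assuming a data frame `dat` with the candidate predictors followed by the 0/1 outcome as its last column, the layout bestglm expects (I use BIC here rather than CV because I'm not certain bestglm's CV option covers binomial models):

```r
library(bestglm)

set.seed(202)
selected <- replicate(200, {
  boot_dat <- dat[sample(nrow(dat), replace = TRUE), ]       # bootstrap resample
  fit <- bestglm(boot_dat, family = binomial, IC = "BIC")    # exhaustive search
  names(coef(fit$BestModel))[-1]                             # predictors kept
}, simplify = FALSE)

## How often each predictor makes it into the "best" subset:
sort(table(unlist(selected)), decreasing = TRUE) / 200
```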
So the answer depends on the requirements you have for this modeling. If you are interested just in prediction, you could consider the different type of penalization provided by ridge regression instead of the Firth approach. Unlike the Firth penalty, the magnitude of the ridge penalty can be chosen by cross-validation to minimize deviance. That keeps all the predictors in the model but shrinks their coefficients to minimize overfitting.
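With the glmnet package, for example, you might choose the ridge penalty by cross-validated deviance. A minimal sketch, assuming a numeric predictor matrix `x` and a 0/1 outcome vector `y` (hypothetical names):

```r
library(glmnet)

## alpha = 0 gives the ridge penalty; lambda is chosen by cross-validation
## to minimize the binomial deviance.
set.seed(303)
cv_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0,
                      type.measure = "deviance", nfolds = 10)

coef(cv_ridge, s = "lambda.min")    # all predictors kept, coefficients shrunken
predict(cv_ridge, newx = x, s = "lambda.min", type = "response")
```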
If you need to cut down on the number of predictors for some reason, LASSO might work, returning just a few predictors, but I suspect that with these small datasets you will come up against perfect separation again. Even if it works, recognize (as with best-subset selection) that LASSO's choice from among correlated predictors is somewhat arbitrary. Backward stepwise selection is not necessarily a bad way to go. It's probably the least objectionable of the stepwise approaches, and I know professional statisticians who use it regularly. Again, keep in mind that the particular subset retained might be very sample-dependent.
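Sketches of both options, reusing the hypothetical `x`, `y`, and `dat` from above; `outcome` is a placeholder column name:

```r
library(glmnet)
library(MASS)

## LASSO: alpha = 1; predictors whose coefficients are shrunk exactly to zero
## are effectively dropped from the model.
set.seed(404)
cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                      type.measure = "deviance", nfolds = 10)
coef(cv_lasso, s = "lambda.min")

## Backward stepwise selection by AIC from the full (unpenalized) model;
## expect warnings if separation reappears.
full <- glm(outcome ~ ., data = dat, family = binomial)
step_fit <- stepAIC(full, direction = "backward", trace = FALSE)
```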
Whichever approach you choose, do try to estimate the quality of your modeling process by repeating all the steps (including the predictor-selection algorithm) on multiple bootstrap samples of the data and testing the resulting models' performance on the full original data set. Even though the particular predictors retained by predictor-selection models will differ among bootstraps, you can at least document that the model-development process is reasonably effective.
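The rms package wraps up exactly that kind of loop for logistic regression with backward selection. A minimal sketch, again assuming a data frame `dat` with a 0/1 outcome named `outcome` and placeholder predictor names:

```r
library(rms)

## List all of your candidate predictors in the formula; x = TRUE, y = TRUE
## store the data needed for bootstrap validation.
fit <- lrm(outcome ~ x1 + x2 + x3, data = dat, x = TRUE, y = TRUE)

## validate() refits the model on each bootstrap sample, redoing backward
## selection each time (bw = TRUE), then evaluates on the original data to
## report optimism-corrected indices such as Dxy and the calibration slope.
set.seed(505)
validate(fit, method = "boot", B = 200, bw = TRUE)
```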