The direction of an effect may be of particular interest to the business, but that doesn't imply that a change in the direction of its estimate between training & test is a stronger indication of problems with the model selection & fitting process than any other change of similar magnitude. The test set would be expected to show some differences owing to randomness, plus an overall shrinkage of effect estimates, since it doesn't share the optimistic bias introduced by model selection on the training set; wildly different estimates, though, are a sign of problems.

It's usual to address such concerns using a validation set, on which you merely evaluate the performance of the final model without estimating anything further. Good performance metrics are in general proper scoring rules, though you may want to use e.g. the area under the receiver operating characteristic curve if only discrimination matters to you. (Note that unless your sample size runs into the thousands, cross-validation or bootstrap validation is a better approach.)
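For concreteness, here's a minimal sketch of that evaluation step in Python with scikit-learn. The simulated data, the logistic regression model, and the split proportions are placeholders for illustration only, not anything from your setup; the point is just that the held-out set is used purely for scoring, with proper scoring rules (Brier score, log loss) alongside AUC, and that cross-validation of the whole procedure replaces the single split when the sample is small:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score

# Placeholder data standing in for the real problem
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a validation set used *only* to score the final model
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

final_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_val = final_model.predict_proba(X_val)[:, 1]

print("Brier score:", brier_score_loss(y_val, p_val))  # proper scoring rule
print("Log loss:   ", log_loss(y_val, p_val))           # proper scoring rule
print("ROC AUC:    ", roc_auc_score(y_val, p_val))      # discrimination only

# With smaller samples, cross-validating the whole fitting procedure is less
# noisy than relying on a single held-out split
cv_brier = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           scoring="neg_brier_score", cv=10)
print("10-fold CV Brier score:", -cv_brier.mean())
```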
Note also that model selection invalidates significance tests, so there's no reason to be performing them on the training set in the first place.
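To see why, here's a small simulation of my own (pure-noise data, nothing to do with your model): every predictor is unrelated to the outcome, yet if you select the predictor with the smallest p-value, that p-value falls below 0.05 far more often than the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k, n_sims = 100, 50, 2000
selected_p = []

for _ in range(n_sims):
    X = rng.standard_normal((n, k))
    y = rng.standard_normal(n)          # outcome unrelated to every predictor
    # p-value of the univariate correlation test for each predictor
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(k)]
    selected_p.append(min(pvals))       # "model selection": keep the best one

selected_p = np.array(selected_p)
print("Share of selected p-values below 0.05:", (selected_p < 0.05).mean())
# well above the nominal 0.05, so the post-selection test is anti-conservative
```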
PS I think a more common, perhaps more correct, nomenclature is "train, validation, test" for my "train, test, validation"—sorry for any confusion.