This is a complicated issue that touches on several related problems: 1) clearly specifying a hypothesis, 2) understanding which causal mechanisms may underlie a hypothesized effect, and 3) the choice and style of presentation.
You're right that, under sound statistical practice, a claim that "groups are similar" would require a test of equivalence. However, tests of equivalence suffer the same issues as their NHST counterparts: the power is merely a reflection of the sample size and the number of comparisons. We expect some differences; their magnitude, and their effect on the main analysis, is far more important.
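Just to make concrete what such a test involves, here is a minimal TOST (two one-sided tests) sketch in Python; the equivalence margins of ±5 and the simulated baseline values are arbitrary assumptions for illustration, not recommendations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(50, 10, 100)     # hypothetical baseline values per arm
treated = rng.normal(52, 10, 100)
low, upp = -5.0, 5.0                  # equivalence margins (a clinical judgment)

# Two one-sided tests: reject "difference <= low" and reject "difference >= upp"
p_lower = stats.ttest_ind(treated - low, control, alternative='greater').pvalue
p_upper = stats.ttest_ind(treated - upp, control, alternative='less').pvalue
p_tost = max(p_lower, p_upper)        # equivalence is claimed only if both reject
print(f"TOST p-value: {p_tost:.3f}")
```

As with NHST, the conclusion here is driven by the sample size and the (arbitrary) margins as much as by the data.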
When confronted with these situations, baseline comparisons are almost always red herrings. Better methods (of science and statistics) can be applied. I have a few stock concepts and responses that I consider when answering questions like this.
A "total" column is more important than split-by-treatment columns; a discussion is warranted of those values.
In clinical trials, the safety sample is usually analyzed. This is the subset of those who were first approached, then consented, then randomized, and finally exposed to at least one iteration of control or treatment. In that process, we face varying degrees of participation bias.
Probably the most important and most often omitted aspect of these studies is presenting Table 1 results in aggregate. This serves the most important purpose of a Table 1: demonstrating to other investigators how generalizable the study sample is to the broader population to which the results would apply.
I find it surprising how fixated investigators, readers, and reviewers are on tangential trends within patient characteristics while completely disregarding the inclusion/exclusion criteria and the generalizability of the sample.
I'm ashamed to say I was an analyst on a trial that overlooked this issue. We recruited patients and then, due to logistical issues, waited nearly a year before implementing the intervention. Not only did the CONSORT diagram show a huge drop between those periods, but the sample shifted: it was largely un- or underemployed, older, and healthier than the people we intended to reach. I had deep concerns about the generalizability of the study, but it was difficult to lobby for those concerns to be made known.
The power and Type I error of tests to detect imbalance in baseline characteristics depend on the actual number of characteristics
The point of presenting such a detailed listing of baseline variables, as mentioned previously, is to give a thorough snapshot of the sample: patient history, labs, medications, and demographics. These are all aspects that clinicians use to recommend treatment, and they are all believed to predict the outcome. But the number of such factors is staggering: as many as 30 different variables may be compared. With 30 tests each at alpha = 0.05, the crude familywise Type I error risk is 1 - (1 - 0.05)^30 = 0.79. Bonferroni or permutation corrections are advisable if testing must be performed.
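The arithmetic above, plus the Bonferroni fix, in a few lines (the count of 30 and alpha = 0.05 are just the figures from the text):

```python
k, alpha = 30, 0.05

# Probability of at least one "significant" imbalance under perfect randomization
fwer = 1 - (1 - alpha) ** k
print(f"Crude familywise Type I error: {fwer:.2f}")        # ~0.79

# Bonferroni correction: test each comparison at alpha / k instead
alpha_bonf = alpha / k
fwer_bonf = 1 - (1 - alpha_bonf) ** k
print(f"Per-test alpha after Bonferroni: {alpha_bonf:.4f}, FWER ~ {fwer_bonf:.3f}")
```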
Statistical testing in its purest form is meant to be impartial and prespecified. However, the choice and presentation of baseline characteristics is often ad hoc. I feel the latter, more flexible approach is appropriate: if we find, as in my trial, that there are interesting traits that describe the sample effectively, we should have the liberty to present those values ad hoc. Testing can be performed if it is of any value, but the usual caveats apply: these are not hypotheses of interest, there is a high risk of confusion as to what significant and non-significant results imply, and the results are more a reflection of sample size and presentation considerations than of any truth.
Rerandomization can be done, but only before patients are exposed to treatment
As I mentioned, the analyzed sample is typically the safety sample. However, rerandomization is a heavily advocated and theoretically consistent approach for patients who have not yet been exposed to study treatment. It only applies to settings in which batch enrollment is performed: say 100 participants are recruited and randomized, and if, by chance, a high proportion of older people ends up in one group, the sample can be rerandomized to balance age (see the sketch below). This can't be done with sequential or staggered enrollment, the setting in which most trials are conducted, because timing of enrollment tends to predict patient status through prevalent-case "bias" (confusing incident and prevalent eligibility criteria).
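For illustration, here is a minimal rerandomization sketch under batch enrollment; the single covariate (age), the 1:1 allocation, and the standardized-difference threshold of 0.1 are all assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

def standardized_diff(x, assign):
    """Standardized mean difference of covariate x between the two arms."""
    a, b = x[assign == 1], x[assign == 0]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

def rerandomize(x, threshold=0.1, max_tries=10_000):
    """Redraw 1:1 allocations until covariate imbalance is below the threshold."""
    n = len(x)
    for _ in range(max_tries):
        assign = rng.permutation(np.repeat([0, 1], n // 2))
        if abs(standardized_diff(x, assign)) < threshold:
            return assign
    raise RuntimeError("No acceptable allocation found")

age = rng.normal(55, 12, size=100)      # hypothetical batch of 100 enrollees
assignment = rerandomize(age)
print(round(standardized_diff(age, assignment), 3))
```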
Balanced design is not a requirement for valid inference
Randomization implies that, in expectation, the treatment groups have equal distributions of covariates. However, as mentioned earlier, when comparing 30 or more covariates, the cumulative probability of some imbalance is non-negligible. In fact, imbalance in individual covariates may be irrelevant when the whole is considered.
If the randomization is fair, we may see that age is elevated in the treatment group while smoking is elevated in the control group, both of which contribute individually to the risk of the outcome. What is needed for efficient and valid inference is that the propensity score is balanced between groups, which is a much weaker condition. Unfortunately, propensity cannot be inspected for balance without a risk model. However, it is easy to see that such propensity depends on a combination of covariates, and an imbalance in propensity in a randomized sample is far less probable, even if this is impossible to show exactly.
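A small simulation makes the point. Since the true propensity is constant under fair randomization, I use a prognostic linear combination of the covariates (a stand-in risk score with made-up coefficients) to show that some single covariate is often "imbalanced" while the combination that actually drives risk rarely is:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p, n_sims = 200, 30, 1000
beta = rng.normal(0, 1, p)           # hypothetical risk-model coefficients

any_covariate_imbalanced = 0
risk_score_imbalanced = 0
for _ in range(n_sims):
    X = rng.normal(size=(n, p))                        # 30 baseline covariates
    arm = rng.permutation(np.repeat([0, 1], n // 2))   # fair 1:1 randomization
    pvals = [stats.ttest_ind(X[arm == 1, j], X[arm == 0, j]).pvalue
             for j in range(p)]
    any_covariate_imbalanced += min(pvals) < 0.05
    score = X @ beta                                   # single prognostic summary
    risk_score_imbalanced += stats.ttest_ind(score[arm == 1],
                                             score[arm == 0]).pvalue < 0.05

print("P(any single covariate 'imbalanced'):",
      any_covariate_imbalanced / n_sims)   # roughly 1 - 0.95**30 ~ 0.79
print("P(risk score 'imbalanced'):",
      risk_score_imbalanced / n_sims)      # roughly 0.05
```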
If a risk model is known, or strong predictors of the outcome are available, more efficient and valid RCT analyses are obtained by simply adjusting for those factors regardless of whether they are balanced between treatment groups
One of my favorite papers, "7 myths of randomized controlled trials", discusses this. Adjustment improves efficiency when the adjustment variable is strongly predictive of the outcome. It turns out that even with perfect 50/50 balance, whether from blocked randomization or as a coincidence of how the randomization played out, the adjustment shrinks confidence intervals, so fewer participants are needed for an equally powered study; this reduces costs and risks. It is shocking that this isn't done more often.
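A toy simulation of this efficiency gain (continuous outcome, linear model, invented effect sizes) could look like this; the point is only that the standard error for the treatment effect shrinks once the strong predictor is adjusted for:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)                          # strong baseline predictor
arm = rng.permutation(np.repeat([0, 1], n // 2))
y = 0.3 * arm + 2.0 * x + rng.normal(size=n)    # hypothetical true effect 0.3

unadjusted = sm.OLS(y, sm.add_constant(arm)).fit()
adjusted = sm.OLS(y, sm.add_constant(np.column_stack([arm, x]))).fit()

print("Unadjusted SE:", round(unadjusted.bse[1], 3))
print("Adjusted SE:  ", round(adjusted.bse[1], 3))   # markedly smaller
```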
Observational studies require control for confounding regardless of what Table 1 shows
The randomization assumption eliminates confounding. With nonrandomized treatment, there is confounding. A confounder is a variable that is causal of the outcome and predicts receipt of the quasi-experimental treatment. There is no test to determine which variable(s) are confounders. The risk of peeking into the data to answer these questions is that confounders are virtually indistinguishable from mediators or colliders without utterly perfect measurement of longitudinal values (and even then...). Adjusting for mediators attenuates any effect, and adjusting for colliders can induce bias in any direction. Further, one need not adjust for the total set of confounders; rather, the adjustment set must block all backdoor paths (the backdoor criterion).
For instance, in a study of lung function and smoking in adolescents: older kids are more likely to smoke, and since they are taller, their lung function is greater. It turns out that adjusting for height alone suffices to remove confounding, since it satisfies the backdoor criterion; further adjustment for age simply loses efficiency. However, merely inspecting the "balance" of a Table 1 in smokers and non-smokers would suggest that both age and height are "imbalanced" and thus should be controlled for. That is incorrect.
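A quick simulation of that example (the data-generating numbers are invented, with age affecting lung function only through height) shows the unadjusted estimate is confounded, height alone recovers the true effect, and adding age buys nothing:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 5000
age = rng.uniform(10, 18, n)
height = 100 + 4 * age + rng.normal(0, 5, n)                # age -> height
smoke = rng.binomial(1, 1 / (1 + np.exp(-(age - 15))))      # age -> smoking
fev = 0.03 * height - 0.2 * smoke + rng.normal(0, 0.3, n)   # true smoking effect -0.2

def smoking_coef(covars):
    """OLS coefficient (and SE) for smoking, given a list of adjustment covariates."""
    X = sm.add_constant(np.column_stack([smoke] + covars))
    fit = sm.OLS(fev, X).fit()
    return round(fit.params[1], 3), round(fit.bse[1], 3)

print("Unadjusted:          ", smoking_coef([]))            # biased (confounded)
print("Adjusted for height: ", smoking_coef([height]))      # ~ -0.2, backdoor blocked
print("Height + age:        ", smoking_coef([height, age])) # ~ -0.2, slightly larger SE
```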