I am using BIC to tune a lasso estimation and to select the features that will be used in further analysis. The data is quite large, and I have some prior domain knowledge about it, so I split it into several classes and run the lasso on each class separately. I find that classes with larger sample sizes lead to a lower alpha, i.e. less regularization. I then realize (after reading this answer) that this is a built-in feature of the AIC/BIC score, because given its definition,
$$\operatorname{BIC}(\hat{\boldsymbol{\mu}})=\frac{\|\mathbf{y}-\hat{\boldsymbol{\mu}}\|^{2}}{\sigma^{2}}+\log (n) \widehat{d f}(\hat{\boldsymbol{\mu}})$$
when $n$ grows large, the first (likelihood) term dominates the second, so AIC/BIC selection chooses a more flexible model. As a result, AIC/BIC selects quite different models and features when two classes in my data differ greatly in sample size, and quite similar ones when the sample sizes are close. The selected model for the largest class has over 2000 non-zero coefficients, while the model for the smallest class has 200 or fewer.
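The effect is easy to reproduce on synthetic data. Here is a minimal sketch (assuming a scikit-learn-style workflow with `LassoLarsIC` and the BIC criterion; the data below is artificial, not my real data) that fits on subsamples of increasing size and shows the chosen alpha shrinking:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

rng = np.random.RandomState(0)
# Artificial data: 100 candidate features, 20 truly informative ones.
X, y = make_regression(n_samples=20000, n_features=100, n_informative=20,
                       noise=10.0, random_state=0)

# Tune alpha by BIC on subsamples of different sizes.
for n in [500, 2000, 20000]:
    idx = rng.choice(len(y), size=n, replace=False)
    model = LassoLarsIC(criterion="bic").fit(X[idx], y[idx])
    print(f"n={n:6d}  alpha={model.alpha_:.4f}  "
          f"nonzero={np.count_nonzero(model.coef_)}")
```

The larger subsamples end up with smaller alpha and more non-zero coefficients, which is exactly the pattern I see across my classes.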
I know there are arguments (see e.g. this and this answer) that AIC/BIC should be used to select a model given a fixed sample, and that one should work with the whole sample rather than split it or do cross-validation for AIC/BIC-based model selection. However, my point is that sample collection itself is arbitrary: I can always choose how many observations to collect before I even decide which statistical method to use. Moreover, in my case I split the data based on prior domain knowledge before fitting any statistical model, which is again ad hoc. So the "whole sample" concept itself seems ill-defined to me, and the sample size effectively becomes a hyperparameter on top of the alpha in the lasso.
I thus wonder: how should I think about the relationship between sample size and AIC/BIC-based model selection? Are there methods to adjust for the sample-size differences in my case so as to make a balanced model selection (e.g. replacing $\log(n)$ with a stronger penalty term)? Or should I avoid making such an adjustment altogether?
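To make the kind of adjustment I mean concrete, here is a sketch of selecting alpha along the lasso path with a heavier penalty than plain BIC. I use the extended BIC (EBIC) of Chen and Chen (2008, Biometrika), which adds a $2\gamma \log(p)$ term per active coefficient; the $\sigma^2$ plug-in and the choice $\gamma = 0.5$ are placeholder assumptions for illustration, not a recommendation:

```python
import numpy as np
from sklearn.linear_model import lasso_path

def ebic_select(X, y, gamma=0.5):
    """Pick alpha on the lasso path by EBIC instead of plain BIC."""
    n, p = X.shape
    # lasso_path does not fit an intercept, so center the data first.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    alphas, coefs, _ = lasso_path(Xc, yc)  # alphas in decreasing order
    # Crude noise-variance plug-in from the least-regularized fit;
    # replace with a proper estimator on real data.
    sigma2 = np.mean((yc - Xc @ coefs[:, -1]) ** 2)
    ebics = []
    for k in range(len(alphas)):
        resid = yc - Xc @ coefs[:, k]
        df = np.count_nonzero(coefs[:, k])  # active-set size as df estimate
        # Plain BIC would use df * log(n); EBIC adds 2*gamma*log(p) per df.
        ebics.append(resid @ resid / sigma2
                     + df * (np.log(n) + 2 * gamma * np.log(p)))
    best = int(np.argmin(ebics))
    return alphas[best], coefs[:, best]
```

Is this sort of modification a sensible way to rebalance selection across classes of very different sizes, or does it just trade one arbitrary choice for another?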