
I am using BIC to tune a lasso estimation and select the features that will be used in further analysis. The data are quite large, and I have some prior domain knowledge about them, so I split the data into several classes and run a lasso on each class separately. I then find that classes with larger sample sizes lead to a lower alpha, i.e. less regularization. I realized (after reading this answer) that this is a feature of the AIC/BIC score, because given its definition,

$$\operatorname{BIC}(\hat{\boldsymbol{\mu}})=\frac{\|\mathbf{y}-\hat{\boldsymbol{\mu}}\|^{2}}{\sigma^{2}}+\log (n) \widehat{d f}(\hat{\boldsymbol{\mu}})$$

as $n$ grows large, the first (likelihood) term dominates the second, so AIC/BIC-based selection chooses a more flexible model. As a result, AIC/BIC selects quite different models and features when two classes in my data differ greatly in sample size, and quite similar ones when their sample sizes are close. The model selected for the largest class in my data has over 2000 non-zero coefficients, while the one for the smallest class has 200 or fewer.
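The effect described above can be reproduced with a small simulation. The sketch below (my own illustration, not code from the question; the weak-effects design and noise scale are assumptions) fits scikit-learn's `LassoLarsIC` with `criterion="bic"` on synthetic data of increasing size and records the selected penalty and the number of non-zero coefficients:

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

rng = np.random.default_rng(0)
p = 50
beta = rng.normal(scale=0.5, size=p)  # many weak true effects

alphas, counts = {}, {}
for n in (200, 2000, 20000):
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(scale=5.0, size=n)
    model = LassoLarsIC(criterion="bic").fit(X, y)
    alphas[n] = model.alpha_                       # BIC-selected penalty
    counts[n] = int(np.count_nonzero(model.coef_)) # selected features

# Typically: larger n -> smaller selected alpha -> more non-zero coefficients,
# because the likelihood term outgrows the log(n)*df penalty.
print(alphas)
print(counts)
```

With weak effects, the small sample only supports a handful of features, while the large sample makes nearly all of them detectable, mirroring the 200-vs-2000 gap in the question.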

I know that there are arguments (see e.g. this and this answer) that AIC/BIC should be used to select a model given a particular sample, and that one should work with the whole sample rather than split it or cross-validate for AIC/BIC-based model selection. However, my point is that sample collection is itself an arbitrary activity: I can always choose how many observations to collect before I even begin to think about which statistical method to use. Moreover, in my case I split the data based on my prior domain knowledge before putting them into any statistical model, which is again ad hoc. So the "whole sample" concept itself seems ill-defined to me, and the sample size effectively becomes a hyperparameter on top of the lasso's alpha.

I thus wonder how I should think about the relationship between sample size and AIC/BIC-based model selection. Are there methods to adjust for the sample-size differences in my case so as to make a balanced model selection (e.g. replacing $\log(n)$ with a stronger penalty term)? Or should I avoid making such an adjustment at all?
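One existing criterion in the "stronger penalty" direction is the extended BIC (EBIC) of Chen & Chen (2008), which adds a $2\gamma\,df\log(p)$ term so that large feature counts are penalized more harshly, independent of $n$. A minimal sketch of scoring the lasso path with EBIC (assuming centered `X` and `y`, since `lars_path` fits no intercept; `gamma` and the Gaussian RSS form are the usual choices, not something from the question):

```python
import numpy as np
from sklearn.linear_model import lars_path

def ebic_select_alpha(X, y, gamma=1.0):
    """Pick the lasso penalty minimizing the extended BIC along the LARS path.

    EBIC(df) = n * log(RSS / n) + df * (log(n) + 2 * gamma * log(p)).
    gamma = 0 recovers the classic BIC; larger gamma favors sparser models.
    Assumes X and y are centered (lars_path fits no intercept).
    """
    n, p = X.shape
    alphas, _, coefs = lars_path(X, y, method="lasso")
    best_crit, best_alpha = np.inf, alphas[0]
    for j in range(coefs.shape[1]):
        beta = coefs[:, j]
        df = np.count_nonzero(beta)
        rss = np.sum((y - X @ beta) ** 2)
        crit = n * np.log(rss / n) + df * (np.log(n) + 2.0 * gamma * np.log(p))
        if crit < best_crit:
            best_crit, best_alpha = crit, alphas[j]
    return best_alpha
```

A larger `gamma` puts more weight on `df`, so the selected knot comes earlier on the path (a larger alpha, a sparser model); this gives a per-class knob that does not grow weaker relative to the likelihood as $n$ increases, though whether equalizing sparsity across classes is desirable is exactly the judgment call raised in the comments.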

  • "I can always have a choice on how much sample to collect before I even begin to think which statistic method I should use." This is very often not the case. If you're investigating a rare illness, you may find 20 patients for your study but no more. In many situations it is at least very costly to collect more samples. For sure, if you have a larger sample you can fit a model with more df at greater precision, leading to less bias with moderate variance, but with a small sample that model may have too high a variance and therefore may not be good. I don't see what's wrong with that. – Christian Hennig Oct 06 '21 at 14:41
  • In your situation you may want an additional penalty that penalises models that are too different in your different subsamples. That's legitimate, however it's a special situation, so I expect that it either requires new research, or some specialist paper somewhere has done it already, which I don't have the time to look for. – Christian Hennig Oct 06 '21 at 14:44
  • @ChristianHennig Thanks very much for your comment. Yes, in many cases more data is better. But in big-data settings, for example, data on online consumption behavior will have very different sample sizes depending on whether you collect it at the day, week, month, or year level, and the level of aggregation, hence the sample size, then becomes a choice. Using more data fits a more flexible model with more features and, yes, greater precision. But if the purpose is to study some pattern that humans can learn from, a highly complex model with ten thousand or more significant coefficients does not seem helpful. – Alalalalaki Oct 06 '21 at 14:58
  • Also, in my case it's as if I have both data from Amazon and data from a small or mid-sized online site. I definitely want to analyze them separately rather than combining them into one dataset. But using the same procedure would yield models of very different complexity due to the difference in sample size. This is the problem I want to solve here. – Alalalalaki Oct 06 '21 at 15:01

0 Answers