Should I use AIC / BIC or rather cross-validation for discovering governing equations through linear regression (SINDy)?

Question

I want to use linear regression with a very large design matrix to discover the governing equations of, e.g., physical systems. The design matrix would include candidate terms that could be part of the equation. This procedure is called SINDy (sparse identification of nonlinear dynamics) and usually selects a parsimonious set of active terms through penalized regression (such as LASSO) or sequential thresholding. In both cases, LASSO and thresholding, there is a hyperparameter that controls how sparse the solution will be, and changing it leads to different candidate equations. To choose the correct one among these candidates, should I rely on the data via cross-validation, or should I use information criteria? What are the benefits of each?
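To make the comparison concrete, here is a minimal sketch of the two selection routes on synthetic data. It is not the PySINDy API: the helper names `stlsq`, `aic_bic` and `cv_mse`, the candidate library, the threshold grid and the toy target are all assumptions made up for illustration. The AIC/BIC formulas assume i.i.d. Gaussian errors with the noise variance profiled out, and the cross-validation uses random K-fold splits, which may be too optimistic for autocorrelated time-series data.

```python
import numpy as np

def stlsq(Theta, y, threshold, n_iter=10):
    """Sequentially thresholded least squares: zero small coefficients, refit the rest."""
    xi = np.linalg.lstsq(Theta, y, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        active = ~small
        if not active.any():
            break  # every term was thresholded away
        xi[active] = np.linalg.lstsq(Theta[:, active], y, rcond=None)[0]
    return xi

def aic_bic(Theta, y, xi):
    """AIC and BIC under assumed Gaussian errors, up to model-independent constants."""
    n = len(y)
    k = int(np.count_nonzero(xi))
    rss = float(np.sum((y - Theta @ xi) ** 2))
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

def cv_mse(Theta, y, threshold, n_folds=5, seed=0):
    """Mean held-out squared error over random K folds (selection redone per fold)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        xi = stlsq(Theta[train], y[train], threshold)
        errors.append(np.mean((y[test] - Theta[test] @ xi) ** 2))
    return float(np.mean(errors))

# Synthetic example (assumed): the true equation is y = -2*x + 0.5*x^3 plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=400)
Theta = np.column_stack([np.ones_like(x), x, x**2, x**3, np.sin(x), np.cos(x)])
y = -2.0 * x + 0.5 * x**3 + 0.01 * rng.standard_normal(x.size)

thresholds = [0.0, 0.01, 0.1, 1.0, 5.0]
results = []
for t in thresholds:
    xi = stlsq(Theta, y, t)
    aic, bic = aic_bic(Theta, y, xi)
    cv = cv_mse(Theta, y, t)
    results.append((t, np.flatnonzero(xi), aic, bic, cv))
    print(f"threshold={t:<5} support={np.flatnonzero(xi)} "
          f"AIC={aic:.1f} BIC={bic:.1f} CV-MSE={cv:.3g}")

best_bic = min(results, key=lambda r: r[3])  # candidate with the lowest BIC
best_cv = min(results, key=lambda r: r[4])   # candidate with the lowest held-out error
```

The information criteria need only one fit per threshold, while cross-validation repeats the whole selection on each training fold and scores it on held-out data. Which of the two recovers the true support more reliably depends on the noise level and on how correlated the library columns are, so the sketch only shows the mechanics.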

Another question that arises: do I actually have a maximum likelihood estimate (MLE), even if I use least squares (LS) after the terms have been selected by LASSO? The reason I ask is that the design matrix will include many transformations of the same measured variables: x1, x1^2, x1^3, ..., cos(x1), ..., everything that could possibly play a role. If I assume x1 is normally distributed and independent, the transformed terms are no longer normally distributed, right?
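For reference, the usual sense in which LS coincides with an MLE can be written out as follows. This is a sketch under the assumption of i.i.d. Gaussian errors, conditional on the design matrix; the symbols $y$, $\Theta$, $\xi$, $\varepsilon$, $\sigma^2$, $n$ and $k$ are notation introduced here and do not come from the question.

$$
y = \Theta \xi + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)
$$

$$
\log L(\xi, \sigma^2 \mid \Theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\,\lVert y - \Theta\xi \rVert_2^2
$$

For fixed $\sigma^2$, maximizing over $\xi$ is the same as minimizing the residual sum of squares $\mathrm{RSS} = \lVert y - \Theta\xi \rVert_2^2$. Plugging in $\hat\sigma^2 = \mathrm{RSS}/n$ gives, up to constants that do not depend on the model,

$$
\mathrm{AIC} = n \log\frac{\mathrm{RSS}}{n} + 2k, \qquad \mathrm{BIC} = n \log\frac{\mathrm{RSS}}{n} + k \log n,
$$

where $k$ is the number of active terms. In this conditional formulation the distributional assumption is on the errors $\varepsilon$, not on the columns of $\Theta$. Whether the likelihood interpretation still holds once the active set has itself been chosen from the data by LASSO is a separate issue that this sketch does not settle.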

Note that for large datasets, the selection criteria AIC and cross-validation are equivalent. This was shown by Stone in 1977. — cdalitz, Feb 17 '22 at 12:45
Oarz, see https://stats.stackexchange.com/questions/139175 and https://stats.stackexchange.com/questions/175417 which are related. @cdalitz, under some assumptions, that is; see https://stats.stackexchange.com/questions/407291 and https://stats.stackexchange.com/questions/406430. — Richard Hardy, Feb 17 '22 at 16:06

0 Answers