Let's say I have a large number of predictors (e.g. 2000) and I'm facing the problem of choosing a linear regression model under the following constraints:
- There are a few predictors that must be included in the final model.
- Some predictors are domain-specific transformations of an original feature, so, for example, out of 50 variables (1 original and 49 transformations) I want to choose exactly one.
- From the remaining variables (those that are neither must-include nor members of a transformation group) I can choose an arbitrary subset, simply the one that works best.
- Lastly, and most importantly, there are prior sign restrictions on some coefficients, i.e. there are some $\beta_{k_1}, \dots, \beta_{k_n}$ that should be greater than or equal to $0$, and some $\beta_{j_1}, \dots, \beta_{j_m}$ that should be less than or equal to $0$.
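To make the last point concrete, here is a minimal sketch of the kind of sign-constrained fit I have in mind, using bounded least squares (`scipy.optimize.lsq_linear`); the names `X`, `y`, `pos_idx`, and `neg_idx` are placeholders of my own, not part of any existing API:

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_sign_constrained(X, y, pos_idx=(), neg_idx=()):
    """Least squares with beta_k >= 0 for k in pos_idx and
    beta_j <= 0 for j in neg_idx; all other coefficients are free."""
    p = X.shape[1]
    lb = np.full(p, -np.inf)
    ub = np.full(p, np.inf)
    lb[list(pos_idx)] = 0.0  # enforce beta >= 0
    ub[list(neg_idx)] = 0.0  # enforce beta <= 0
    res = lsq_linear(X, y, bounds=(lb, ub))
    return res.x, 2.0 * res.cost  # coefficients and RSS (res.cost is RSS/2)
```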
I need an approach that automates the process and returns only those models that satisfy the restrictions in the last bullet point. The problem is that it is obviously not computationally feasible to consider all subsets of variables, build a model for each subset, and check which models actually satisfy the restrictions. Therefore, I need some greedy approach that yields a list of feasible models with a high chance of containing the best feasible model (according to some criterion, say AIC).
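For instance, here is a sketch of what I'm imagining (not a worked-out solution): a greedy forward search that enforces the sign constraints at fit time, so every candidate model is feasible by construction instead of being fitted first and discarded later. The `must_include` and `groups` arguments and the Gaussian AIC-from-RSS formula $n\log(\mathrm{RSS}/n) + 2k$ (valid up to an additive constant) are my own assumptions, and `fit_sign_constrained` is the helper sketched above:

```python
import numpy as np

def aic(rss, n, k):
    # Gaussian log-likelihood AIC, up to an additive constant
    return n * np.log(rss / n) + 2 * k

def greedy_search(X, y, must_include, groups, pos_idx=(), neg_idx=()):
    """Forward selection: start from the must-include columns, add one
    variable at a time (at most one per transformation group), keep the
    addition that lowers AIC the most, stop when nothing improves."""
    n = X.shape[0]
    selected = set(must_include)
    group_of = {c: g for g, cols in groups.items() for c in cols}
    used_groups = {group_of[c] for c in selected if c in group_of}

    def score(cols):
        cols = sorted(cols)
        _, rss = fit_sign_constrained(
            X[:, cols], y,
            pos_idx=[i for i, c in enumerate(cols) if c in pos_idx],
            neg_idx=[i for i, c in enumerate(cols) if c in neg_idx])
        return aic(rss, n, len(cols))

    best = score(selected)
    improved = True
    while improved:
        improved = False
        for c in range(X.shape[1]):
            if c in selected or group_of.get(c) in used_groups:
                continue  # already chosen, or its group is taken
            cand = score(selected | {c})
            if cand < best:
                best, best_add, improved = cand, c, True
        if improved:
            selected.add(best_add)
            if best_add in group_of:
                used_groups.add(group_of[best_add])
    return sorted(selected), best
```

Each step costs one constrained fit per remaining candidate, so a full run is roughly $O(p \cdot s)$ fits for $s$ selected variables, which seems tractable for $p = 2000$; keeping the top few candidates at each step instead of just one would produce the list of feasible models I'm after. Still, I'd like to know whether there is a more principled approach than this: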
- Is there a way to tackle this problem without building all possible models? Maybe there is some way to say that, given model A, we shouldn't consider adding variable X because there is only a slight chance that X will improve model A?
- Which measure would be the most appropriate for comparing models, if not the already-suggested AIC?
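For context on the second question: I know that under Gaussian errors both AIC and BIC reduce, up to an additive constant, to a penalized log-RSS, with BIC penalizing model size more heavily (for $n \ge 8$, since $\log n > 2$) and hence favoring smaller models:

$$\mathrm{AIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k, \qquad \mathrm{BIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + k \log n.$$

What I'm not sure about is how the inequality constraints affect the effective number of parameters $k$, which is part of why I'm asking.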