Let's say I have a large number of predictors (e.g. 2000) and I'm facing the problem of choosing a linear regression model under the following constraints:
- There are a few predictors that must be included in the final model.
- Some predictors are domain-specific transformations of an original feature, so, for example, out of 50 variables (1 original and 49 transformations) I want to choose exactly one.
- From the remaining variables (those that are neither must-include nor members of a transformation group) I can choose an arbitrary subset, simply the one that works best.
- Lastly, and most importantly, there are prior sign restrictions on some coefficients, i.e. there are some $\beta_{k_1}, \dots, \beta_{k_n}$ that should be greater than or equal to $0$, and some $\beta_{j_1}, \dots, \beta_{j_m}$ that should be less than or equal to $0$.
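To make the last point concrete, here is a minimal sketch of the kind of sign-constrained fit I have in mind, using bounded least squares (`scipy.optimize.lsq_linear`); the names `X`, `y`, `pos_idx`, and `neg_idx` are placeholders of my own, not part of any existing API:

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_sign_constrained(X, y, pos_idx=(), neg_idx=()):
    """Least squares with beta_k >= 0 for k in pos_idx and
    beta_j <= 0 for j in neg_idx; all other coefficients are free."""
    p = X.shape[1]
    lb = np.full(p, -np.inf)
    ub = np.full(p, np.inf)
    lb[list(pos_idx)] = 0.0  # enforce beta >= 0
    ub[list(neg_idx)] = 0.0  # enforce beta <= 0
    res = lsq_linear(X, y, bounds=(lb, ub))
    return res.x, 2.0 * res.cost  # coefficients and RSS (res.cost is RSS/2)
```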
I need an approach that automates the process and returns only those models that satisfy the restrictions in the last bullet point. The problem is that it is obviously not computationally feasible to consider all subsets of variables, build a model for each subset, and check which models actually satisfy the restrictions. Therefore, I need some greedy approach that yields a list of feasible models with a high chance of containing the best feasible model (according to some criterion, say AIC).
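For instance, here is a sketch of what I'm imagining (not a worked-out solution): a greedy forward search that enforces the sign constraints at fit time, so every candidate model is feasible by construction instead of being fitted first and discarded later. The `must_include` and `groups` arguments and the Gaussian AIC-from-RSS formula $n\log(\mathrm{RSS}/n) + 2k$ (valid up to an additive constant) are my own assumptions, and `fit_sign_constrained` is the helper sketched above:

```python
import numpy as np

def aic(rss, n, k):
    # Gaussian log-likelihood AIC, up to an additive constant
    return n * np.log(rss / n) + 2 * k

def greedy_search(X, y, must_include, groups, pos_idx=(), neg_idx=()):
    """Forward selection: start from the must-include columns, add one
    variable at a time (at most one per transformation group), keep the
    addition that lowers AIC the most, stop when nothing improves."""
    n = X.shape[0]
    selected = set(must_include)
    group_of = {c: g for g, cols in groups.items() for c in cols}
    used_groups = {group_of[c] for c in selected if c in group_of}

    def score(cols):
        cols = sorted(cols)
        _, rss = fit_sign_constrained(
            X[:, cols], y,
            pos_idx=[i for i, c in enumerate(cols) if c in pos_idx],
            neg_idx=[i for i, c in enumerate(cols) if c in neg_idx])
        return aic(rss, n, len(cols))

    best = score(selected)
    improved = True
    while improved:
        improved = False
        for c in range(X.shape[1]):
            if c in selected or group_of.get(c) in used_groups:
                continue  # already chosen, or its group is taken
            cand = score(selected | {c})
            if cand < best:
                best, best_add, improved = cand, c, True
        if improved:
            selected.add(best_add)
            if best_add in group_of:
                used_groups.add(group_of[best_add])
    return sorted(selected), best
```

Each step costs one constrained fit per remaining candidate, so a full run is roughly $O(p \cdot s)$ fits for $s$ selected variables, which seems tractable for $p = 2000$; keeping the top few candidates at each step instead of just one would produce the list of feasible models I'm after. Still, I'd like to know whether there is a more principled approach than this: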
- Is there a way to tackle this problem without building all possible models? Maybe there is some way to say that, given model A, we shouldn't consider adding variable X because there is only a slight chance that X will improve model A?
- Which measure would be the most appropriate for comparing models, if not the already-suggested AIC?
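For context on the second question: I know that under Gaussian errors both AIC and BIC reduce, up to an additive constant, to a penalized log-RSS, with BIC penalizing model size more heavily (for $n \ge 8$, since $\log n > 2$) and hence favoring smaller models:

$$\mathrm{AIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k, \qquad \mathrm{BIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + k \log n.$$

What I'm not sure about is how the inequality constraints affect the effective number of parameters $k$, which is part of why I'm asking.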