Are there any techniques for dimensionality reduction with multiple groups of highly correlated variables and variable-specific nonlinear interactions? Specific details are below.
I am mainly interested in efficiently identifying the variables that participate in interactions, since those are the ones that need to be selected.
Here is a relevant example:
- there are 15 groups of variables
- each group has 10 variables
- the variables in each group are highly correlated with each other (.75-.9 per pair)
- the variables in any given group are not significantly correlated with a variable from a different group (the groups of variables are independent)
- In 8 of the groups (for example), one variable interacts with a variable outside its group, and that interaction has a significant impact on the dependent variable. (The other variables in such a group aren't needed.)
- Some of the groups without an interacting member are still relevant to the model, and the best variable from each of these groups can be effectively selected by running a univariate random forest (see the sketch after this list).
- There are 15-20 additional variables that may contribute to the model. These are not part of any correlated group and have low correlations with the other variables.
The variables are all real numbers.
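For concreteness, here is a minimal simulation sketch of the kind of structure described above, followed by the within-group univariate selection mentioned in the list. The sample size, noise scale, number of ungrouped variables, and which variables interact are all my own hypothetical choices, not part of the real problem:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000                                    # hypothetical sample size
n_groups, group_size = 15, 10

# Each group is driven by its own latent factor plus noise, giving
# within-group pairwise correlations of roughly 0.86 (inside the stated
# .75-.9 range) and independence across groups.
groups = []
for _ in range(n_groups):
    latent = rng.normal(size=n)
    groups.append(np.column_stack(
        [latent + rng.normal(scale=0.4, size=n) for _ in range(group_size)]
    ))

X_extra = rng.normal(size=(n, 18))          # the ~15-20 ungrouped variables

# Dependent variable: cross-group interactions between the first member of
# eight of the groups (paired up), main effects for two other groups and one
# ungrouped variable, plus noise.
y = np.zeros(n)
for a, b in [(0, 1), (2, 3), (4, 5), (6, 7)]:
    y += groups[a][:, 0] * groups[b][:, 0]
y += groups[9][:, 0] + groups[10][:, 0] + X_extra[:, 0]
y += 0.5 * rng.normal(size=n)

X = np.hstack(groups + [X_extra])           # 150 grouped + 18 ungrouped columns


def best_in_group(block, y):
    """Pick the column whose single-feature random forest best predicts y."""
    scores = [
        cross_val_score(
            RandomForestRegressor(n_estimators=100, random_state=0),
            block[:, [j]], y, cv=3, scoring="r2",
        ).mean()
        for j in range(block.shape[1])
    ]
    return int(np.argmax(scores))


# Works for the main-effect groups (9 and 10 here); for the purely
# interacting groups the univariate signal is essentially absent, which is
# exactly the difficulty.
best = {g: best_in_group(block, y) for g, block in enumerate(groups)}
```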
The above is a simplification, but since I'm looking for a practical solution, it is adequate. In reality there could also be 3-way interactions, and a group could have a much less important 2nd or even 3rd interacting variable (adding the same "real" information, so it would be redundant).
Random forests are highly effective in this application when the correct variables are known.
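As a reference point for that claim, and continuing the simulated data from the sketch above, a random forest fit only on the "correct" columns (known here only because the data are simulated) can be checked with cross-validation:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# The interacting member of groups 0-7 (columns 0, 10, ..., 70), the relevant
# member of groups 9 and 10 (columns 90 and 100), and the one relevant
# ungrouped variable (column 150).
true_cols = [g * 10 for g in range(8)] + [90, 100, 150]

rf = RandomForestRegressor(n_estimators=500, random_state=0)
print("CV R^2 on the correct variables:",
      cross_val_score(rf, X[:, true_cols], y, cv=5, scoring="r2").mean())
```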
Is there a way, other than an exhaustive search (prohibitive computation time), to identify the variables that participate in interactions? I'm primarily interested in the 2-way case, but 3-way would be useful as well.