3

I'm looking for a method to identify a shortlist of potentially good 2-way interaction terms rather than trying all possible interactions. A similar question has been asked before here, but in a more general sense and not for a large data set.

The answer given there ("think" about the problem) doesn't work for me because I have around 800 features and >50K observations. I'd like to get something from the data.


Note: I also tried the random forest method given as an answer in the link above, but I'm not sure I fully understand the method. The problems with RF are: 1) it overfits the training data, so what you find on the training set doesn't hold up on the holdout set; 2) $importance doesn't really measure the strength of the interaction, only the strength of the predictor itself.

agondiken
  • 191
  • 1
  • 5

1 Answer

1

I just computed the correlation matrix of a random 1000 x 800 matrix in R, and it finished very quickly. So I don't think you need to worry about computing all pairwise correlations of your 800 features; you just need to take a random subset of the >50K observations. Then pull out the pairs with large correlations as a first pass. With that smaller group of variables, you could try a dimension reduction technique such as PCA to see whether a summary of some of the variables would work better.
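To make the first-pass screen concrete, here is a minimal sketch in Python with NumPy (not the R session mentioned above). The function name `top_correlated_pairs`, the 0.5 threshold, and the synthetic data are all assumptions made for illustration; a real cutoff would have to be tuned to your data.

```python
import numpy as np

def top_correlated_pairs(X, n_rows=1000, threshold=0.5, seed=0):
    """Return (i, j, r) for feature pairs with |r| >= threshold,
    computed on a random subsample of the rows of X."""
    rng = np.random.default_rng(seed)
    n_rows = min(n_rows, X.shape[0])
    rows = rng.choice(X.shape[0], size=n_rows, replace=False)
    # rowvar=False: columns are the variables, rows are observations
    corr = np.corrcoef(X[rows], rowvar=False)
    # Upper triangle only, so each pair appears once and the diagonal is skipped
    i, j = np.triu_indices_from(corr, k=1)
    r = corr[i, j]
    keep = np.abs(r) >= threshold
    order = np.argsort(-np.abs(r[keep]))  # strongest correlations first
    return [(int(a), int(b), float(c))
            for a, b, c in zip(i[keep][order], j[keep][order], r[keep][order])]

# Synthetic stand-in for the real data (hypothetical): 5000 rows, 50 features,
# with feature 1 built to be a noisy copy of feature 0.
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 50))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=5000)

pairs = top_correlated_pairs(X)
print(pairs)
```

On the real data you would pass the 50K x 800 feature matrix as `X`. The surviving pairs are only candidates; a high correlation between two features says nothing by itself about whether their interaction helps predict the response.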

It's hard to know how to interpret a model of 800 variables.

Another thought: are these variables typically void for most subjects? If so, you could binarize them (0 = void; 1 = non-zero) and try an association rule method such as the apriori algorithm. Pull out the pairs with large pairwise associations, then compare them to what you would expect under independence (the lift). Keep the variables with a high lift.
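As a rough illustration of the lift idea, the sketch below counts co-occurrences of binarized features and computes lift = P(A and B) / (P(A) P(B)) for each pair. It is plain Python and not a full apriori implementation; it only handles pairs, with a minimum-support cutoff. The function name `pairwise_lift`, the `min_support` value, and the synthetic data are assumptions for illustration.

```python
from itertools import combinations
import random

def pairwise_lift(rows, min_support=0.01):
    """rows: list of equal-length sequences; any non-zero value counts as present.
    Returns (i, j, lift) tuples sorted by lift, highest first."""
    n = len(rows)
    p = len(rows[0])
    single = [0] * p   # how often each feature is non-zero
    joint = {}         # how often each pair is non-zero together
    for row in rows:
        items = [j for j, v in enumerate(row) if v]
        for j in items:
            single[j] += 1
        for pair in combinations(items, 2):
            joint[pair] = joint.get(pair, 0) + 1
    out = []
    for (i, j), count in joint.items():
        support = count / n
        if support >= min_support:
            # Lift: observed joint frequency relative to independence
            lift = support / ((single[i] / n) * (single[j] / n))
            out.append((i, j, lift))
    return sorted(out, key=lambda t: -t[2])

# Synthetic stand-in (hypothetical): 20 sparse binary features,
# with feature 1 copying feature 0 about 90% of the time.
rng = random.Random(0)
rows = []
for _ in range(2000):
    row = [1 if rng.random() < 0.1 else 0 for _ in range(20)]
    if rng.random() < 0.9:
        row[1] = row[0]
    rows.append(row)

ranked = pairwise_lift(rows)
print(ranked[:5])
```

A lift near 1 means the two features co-occur about as often as independence predicts; the planted pair should stand well clear of the rest. Note that pair counting grows quadratically with the number of non-zero features per row, so this only stays cheap when the data really are sparse.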

Again, you probably want to take subsets of the 50K -- depending on how large and fast your computer is.

Placidia
  • 13,501
  • 6
  • 33
  • 62
  • On your first suggestion: My data is kind of noisy, so I'm reluctant to lose signal during the initial correlation elimination or during PCA. If I do any variable reduction, I'd like to base it on the signal. In this particular data set, the variables aren't void, but I'll try your second suggestion on another data set; I didn't know about this algorithm. – agondiken Jul 01 '14 at 10:05