You could possibly use regression or classification model trees for this purpose. The basic idea is that the tree divides your samples into groups, while the models in the leaf nodes fit the features to the target variable for the samples within each group. In detail, there exist different approaches to such model trees: in the internal (junction) nodes, some use classic feature splits, while others use models there too. Different ways of splitting the samples are possible as well (i.e. the "more homogeneous groups" idea behind trees depends on how you define "homogeneous").
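To make the basic idea concrete, here is a minimal sketch in Python (this is not M5, Cubist, or any specific published algorithm, just an illustration): a shallow regression tree divides the samples into groups, and then a separate linear model is fit inside each leaf. All names (`leaf_models`, `predict`, the toy data) are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
# toy piecewise-linear target: a different linear trend on each side of x0 = 0
y = np.where(X[:, 0] < 0, 1.0 + 2.0 * X[:, 1], -1.0 - 3.0 * X[:, 1])

# step 1: a shallow tree accounts for dividing the samples into groups (leaves)
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
leaf_ids = tree.apply(X)

# step 2: a separate linear model accounts for the fit inside each group
leaf_models = {}
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    leaf_models[leaf] = LinearRegression().fit(X[mask], y[mask])

def predict(X_new):
    """Route each sample to its leaf, then apply that leaf's linear model."""
    leaves = tree.apply(X_new)
    return np.array([leaf_models[leaf].predict(x.reshape(1, -1))[0]
                     for leaf, x in zip(leaves, X_new)])
```

Since ordinary least squares in a leaf can always do at least as well as the leaf's constant mean, this combination never fits the training data worse than the plain tree with constant leaves.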
Frequently used model trees include Quinlan's algorithms (M5 and Cubist for regression, C4.5/C5.0 for classification, ...). For more information and ideas, see e.g. this answer on SO, which briefly highlights the differences between classic model trees and CARTs; this question on CV, which highlights some model tree concepts; or the Logistic Model Tree article on Wikipedia, to get an idea of how these usually work. Further, a very practical perspective on different regression and classification model trees is given in
Kuhn, M. & Johnson, K. (2013). Applied Predictive Modeling. Springer-Verlag, New York.
PS: concerning the "subset of features": one idea behind random forests is to consider only a randomly chosen subset of features at each split. I'm not aware of model trees that do the same, or of a random forest that uses models in its nodes, but if you stumble across such a concept I'd take a closer look for sure.
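In case it helps: in scikit-learn, the random-subset-of-features-per-split behaviour is exposed through the `max_features` parameter of `RandomForestRegressor` (and its classifier counterpart). A small sketch, with a synthetic dataset just for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# toy regression data, 10 features
X, y = make_regression(n_samples=300, n_features=10, random_state=0)

# max_features=3 means each split considers only 3 randomly chosen
# features -- the "subset of features" idea mentioned above
rf = RandomForestRegressor(n_estimators=50, max_features=3,
                           random_state=0).fit(X, y)
```

Note that the leaves here are still constants, not models; this only covers the feature-subsetting half of the question.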