12

I have a large data set consisting of the values of several hundred financial variables that could be used in a multiple regression to predict the behavior of an index fund over time. I would like to reduce the number of variables to ten or so while still retaining as much predictive power as possible. Added: The reduced set of variables needs to be a subset of the original variable set in order to preserve the economic meaning of the original variables. Thus, for example, I shouldn't end up with linear combinations or aggregates of the original variables.

Some (probably naive) thoughts on how to do this:

  1. Perform a simple linear regression with each variable and choose the ten with the largest $R^2$ values. Of course, there's no guarantee that the ten best individual variables combined would be the best group of ten.
  2. Perform a principal components analysis and try to find the ten original variables with the largest associations with the first few principal axes. (A rough R sketch of both of these ideas appears after this list.)
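For concreteness, here is a rough sketch of both ideas on simulated data (the data, the variable names, and the use of the first three principal axes are illustrative assumptions only):

set.seed(1)
dat <- as.data.frame(matrix(rnorm(200 * 30), 200, 30,
                            dimnames = list(NULL, paste0("x", 1:30))))
dat$y <- dat$x1 + 0.5 * dat$x2 + rnorm(200)
predictors <- setdiff(names(dat), "y")

# Thought 1: univariate R^2 screening -- regress y on each variable separately.
r2 <- sapply(predictors, function(v)
    summary(lm(reformulate(v, response = "y"), data = dat))$r.squared)
top10_screen <- names(sort(r2, decreasing = TRUE))[1:10]

# Thought 2: PCA on the predictors, then rank the original variables by their
# absolute loadings on the first few principal axes.
pca <- prcomp(dat[predictors], scale. = TRUE)
loading_score <- rowSums(abs(pca$rotation[, 1:3]))
top10_pca <- names(sort(loading_score, decreasing = TRUE))[1:10]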

I don't think I can perform a hierarchical regression because the variables aren't really nested. Trying all possible combinations of ten variables is computationally infeasible because there are far too many combinations (with 300 candidate variables, for instance, there are $\binom{300}{10} \approx 1.4 \times 10^{18}$ possible subsets of ten).

Is there a standard approach to tackle this problem of reducing the number of variables in a multiple regression?

It seems like this would be a sufficiently common problem that there would be a standard approach.

A very helpful answer would be one that not only mentions a standard method but also gives an overview of how and why it works. Alternatively, if there is no one standard approach but rather multiple ones with different strengths and weaknesses, a very helpful answer would be one that discusses their pros and cons.

whuber's comment below indicates that the request in the last paragraph is too broad. Instead, I would accept as a good answer a list of the major approaches, perhaps with a very brief description of each. Once I have the terms I can dig up the details on each myself.

kjetil b halvorsen
Mike Spivey
  • Mike, you might browse through Chapter 3 of [ESL](http://www-stat.stanford.edu/~tibs/ElemStatLearn/), if you are unfamiliar with it. The page at the link provided points to a free, legal PDF of the text. – cardinal Feb 07 '12 at 18:50
  • Can you clarify if you are looking to only keep, say, ten of the original variables or would also be interested in methods that use a small subset of linear combinations of the original variables (the latter being what something like a traditional principal-components regression would give you). – cardinal Feb 07 '12 at 18:52
  • @cardinal: I would like to keep ten or so of the original variables, not linear combinations of them. Each variable has a particular economic meaning, whereas taking linear combinations would lose that: We need to preserve the meaning aspect of the explanatory variables. And thanks for the ESL pointer. I'll take a look at it. – Mike Spivey Feb 07 '12 at 19:07
  • [This reply](http://stats.stackexchange.com/a/14528/919) gives a concrete example of one of the (many) problems with method 1. A comment by @cardinal to Frank Harrell's reply gets to the crux of the problem with method 2: anything you do with the independent variables alone, without considering their relationships to the dependent variable, risks being irrelevant or worse. As far as standard or "canonical" answers go, asking for one here is a bit like asking for a discussion of all the methods to find rational points on elliptic curves, with their pros and cons :-). – whuber Feb 07 '12 at 19:13
  • As noted by others here, method 1 will lead to problems. For an intuitively accessible treatment of why that's true / a description of another one of the issues with this approach, you may want to read this: http://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856 – gung - Reinstate Monica Feb 07 '12 at 20:56

4 Answers

7

Method 1 doesn't work. Method 2 has hope, depending on how you do it. It's better to enter the principal components into the regression in descending order of variance explained. A more interpretable approach is to do variable clustering, then reduce each cluster to a single score (not using Y), and then fit a model with the cluster scores.
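A minimal sketch of this clustering-then-scoring workflow in R (the simulated data, the choice of ten clusters, and the use of each cluster's first principal component as its score are illustrative assumptions, not prescriptions from the answer):

library(Hmisc)

set.seed(1)
X <- as.data.frame(matrix(rnorm(200 * 40), 200, 40,
                          dimnames = list(NULL, paste0("x", 1:40))))
y <- rowSums(X[, 1:3]) + rnorm(200)

vc <- varclus(as.matrix(X))              # cluster the predictors; Y is never used
clusters <- cutree(vc$hclust, k = 10)    # cut the dendrogram into 10 clusters

# Reduce each cluster to a single score (here, its first principal component),
# then fit the regression on the 10 cluster scores.
scores <- sapply(sort(unique(clusters)), function(k)
    prcomp(X[, clusters == k, drop = FALSE], scale. = TRUE)$x[, 1])
colnames(scores) <- paste0("cluster", sort(unique(clusters)))
fit <- lm(y ~ ., data = data.frame(scores, y = y))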

Frank Harrell
  • +1. By "variable clustering", do you mean *factor analysis*--that's a strategy I might use (also before looking at y). I think of cluster analysis as grouping observations rather than variables, but I have only superficial knowledge of cluster analyses. – gung - Reinstate Monica Feb 07 '12 at 18:06
  • It does not seem like there is any *a priori* reason to believe that the directions of maximal variance of the *predictors* are necessarily highly correlated with the *response*. Perhaps I'm mistaken or have misunderstood your comment. Could you clarify? – cardinal Feb 07 '12 at 18:46
  • Also, it sounds like the OP is not (quite) describing principal-components regression in his Method 2. – cardinal Feb 07 '12 at 18:49
  • I wasn't completely clear in my original post, but I need a *subset* of the original variables. So a straight principal components analysis or clustering isn't really what I'm after. – Mike Spivey Feb 07 '12 at 20:52
  • Variable clustering is related to factor analysis but is simpler. Variables are grouped by how they correlate with each other. See the `varclus` function in the R `Hmisc` package, or PROC VARCLUS in SAS. Data reduction can help with subsetting variables if you exercise a bit of caution; you can remove an entire cluster if its $P$-value is 0.3. With principal components there are techniques such as battery reduction where you essentially approximate the PCs with a subset of their constituent variables. – Frank Harrell Feb 08 '12 at 12:04
3

You might consider using a method like the LASSO, which regularizes least squares by selecting a solution that minimizes the one-norm ($\ell_1$ norm) of the vector of parameters. In practice this has the effect of driving many entries of the parameter vector to exactly zero, so the fit itself selects a small subset of the original variables. Although the LASSO is popular in some statistical circles, many other related methods have been considered in the world of compressive sensing.
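As a hedged illustration (the `glmnet` package, the simulated data, and the choice of `lambda.1se` are assumptions of this sketch, not part of the answer), a lasso fit that ends up selecting a small subset of the original variables might look like this:

library(glmnet)

set.seed(1)
X <- matrix(rnorm(200 * 50), 200, 50)              # 50 candidate predictors
y <- X[, 1:5] %*% runif(5, 1, 2) + rnorm(200)      # only the first 5 truly matter

cv <- cv.glmnet(X, y, alpha = 1)                   # alpha = 1 gives the lasso penalty
coefs <- as.matrix(coef(cv, s = "lambda.1se"))     # coefficients at a well-regularized lambda
selected <- setdiff(rownames(coefs)[coefs != 0], "(Intercept)")
print(selected)                                    # the surviving original variables

The number of variables that survive is governed by the penalty weight, so one could move along the lambda path until roughly ten coefficients remain nonzero.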

Brian Borchers
3

In chapter 5 of *Data Mining with R*, the author shows some ways to choose the most useful predictors. (In the context of bioinformatics, where each sample row has 12,000+ columns!)

He first uses some filters based on the statistical distribution of each predictor. For instance, if you have half a dozen predictors all with a similar mean and s.d., you can get away with keeping just one of them.
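A rough, self-contained sketch of that kind of filter (rounding the mean/s.d. profile to whole numbers is my own illustrative threshold, not the book's):

set.seed(1)
X <- data.frame(a = rnorm(100, mean = 10, sd = 2),
                b = rnorm(100, mean = 10, sd = 2),   # same distribution as 'a'
                c = rnorm(100, mean = 0,  sd = 1))
profile <- round(t(sapply(X, function(col) c(mean(col), sd(col)))), 0)
X_reduced <- X[, !duplicated(profile), drop = FALSE] # 'b' should be dropped as a near-duplicate of 'a'
names(X_reduced)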

He then shows how to use a random forest to find which ones are the most useful predictors. Here is a self-contained abstract example. You can see I've got 5 good predictors and 5 bad ones. The code shows how to keep just the best 3.

set.seed(99)

d=data.frame(
      y=c(1:20),
      x1=log(c(1:20)),
      x2=sample(1:100, 20),
      x3=c(1:20)*c(11:30),
      x4=runif(20),
      x5=-c(1:20),
      x6=rnorm(20),
      x7=c(1:20),
      x8=rnorm(20,mean=100, sd=20),
      x9=jitter(c(1:20)),
      x10=jitter(rep(3.14, 20))
      )

library(randomForest)
rf=randomForest(y ~ ., d, importance=TRUE)
print(importance(rf))
#         %IncMSE IncNodePurity
# x1  12.19922383    130.094641
# x2  -1.90923082      6.455262
# ...

i=importance(rf)
best3=rownames(i)[order(i[, "%IncMSE"], decreasing=TRUE)[1:3]]
print(best3)
#[1] "x1" "x5" "x9"

reduced_dataset=d[, c(best3, 'y')]

The author's last approach uses a hierarchical clustering algorithm to group similar predictors into, say, 30 clusters. If you want 30 diverse predictors, you then choose one at random from each of those 30 groups.

Here is some code, using the same sample data as above, to choose 3 of the 10 columns:

library(Hmisc)
d_without_answer=d[,names(d)!='y']
vc=varclus(as.matrix(d_without_answer))
print(cutree(vc$hclust,3))
# x1  x2  x3  x4  x5  x6  x7  x8  x9 x10 
#  1   2   1   3   1   1   1   2   1   3 
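The code above stops at the cluster assignments; here is a short continuation (my own sketch, not from the book) that actually picks one column at random from each of the 3 clusters:

groups <- cutree(vc$hclust, 3)
chosen <- sapply(split(names(groups), groups), sample, size = 1)  # one column per cluster
d_reduced <- d[, c(chosen, 'y')]
names(d_reduced)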

My sample data does not suit this approach at all, because I have 5 good predictors and 5 that are just noise. If all 10 predictors were each slightly correlated with y and stood a good chance of predicting better when used together (which is quite possible in the financial domain), then this could be a good approach.

kjetil b halvorsen
Darren Cook
2

This problem is usually called Subset Selection, and there are quite a few different approaches. See Google Scholar for an overview of related articles.
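As one hedged example (the `leaps` package and the simulated data are my own choices here, not a recommendation from any particular article), forward subset selection up to ten variables might look like this; an exhaustive search is avoided because, as noted in the question, it is infeasible with hundreds of candidates:

library(leaps)

set.seed(1)
dat <- as.data.frame(matrix(rnorm(200 * 30), 200, 30,
                            dimnames = list(NULL, paste0("x", 1:30))))
dat$y <- dat$x1 - dat$x2 + 0.5 * dat$x3 + rnorm(200)

fit <- regsubsets(y ~ ., data = dat, nvmax = 10, method = "forward")
summ <- summary(fit)
best_size <- which.max(summ$adjr2)        # or pick the size with the lowest BIC (summ$bic)
names(which(summ$which[best_size, -1]))   # the selected original variables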

Florian Brucker