How can I choose the 10 variables that explain the most variation in a wealth index?

Question

I have household survey data with 32 questions about assets the household has or doesn't. I assume that taken together the answers to these asset questions (e.g. how many televisions does the household own) are an indication of wealth, and could be used to make a good index of wealth, e.g. using the first component in a principal components analysis.

What I want to do, however, is to choose 10 of these variables that jointly explain the largest possible proportion of the variation in wealth and use those as the questions in a shorter questionnaire that I am developing. What is the best way of doing this?

One possibility that has occurred to me is to calculate the wealth index using PCA, then regress it on every possible combination of 10 variables from the 32 (about 64.5 million regressions) and see which combination gives the highest R-squared. I'm hoping there's an easier way.

Ideally I'm looking to implement this in Stata.
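
For scale, here is a quick check of the size of the brute-force search described above. It is written in Python rather than Stata purely for illustration, and the per-regression time is an assumed figure, not a measurement.

```python
import math

# Number of distinct 10-variable subsets of the 32 asset questions.
n_subsets = math.comb(32, 10)
print(n_subsets)  # 64512240

# Assumed cost per regression (hypothetical; depends on sample size and software).
seconds_per_fit = 0.001
hours = n_subsets * seconds_per_fit / 3600
print(f"about {hours:.0f} hours at {seconds_per_fit} s per regression")
```

So an exhaustive search is slow but not impossible at a millisecond per fit. It becomes impractical quickly if each fit is slower or the pool of questions grows.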

You might also consider looking into the penalized regression literature (e.g. the lasso). – bdeonovic, Dec 11 '13 at 23:05
By the way, what is wrong with using the PCA approach? That seems to be the standard approach in such a scenario. – bdeonovic, Dec 11 '13 at 23:16
@Benjamin Can you explain what you mean by the PCA approach? Do you mean the PCA with 64 million regressions as mentioned in my question? – Stuart, Dec 11 '13 at 23:18
Well, ideally, if you just did PCA and looked at the first couple of PCs, the variables with nonzero loadings would represent the subset of your 32 questions that explain the most variance. This might not necessarily be 10, and of course your PC loadings might not be so sparse either. You should look into sparse principal component analysis, which combines PCA with penalized regression (paper: http://www.stanford.edu/~hastie/Papers/spc_jcgs.pdf). – bdeonovic, Dec 12 '13 at 00:30
@Benjamin Thanks - that paper has a reference to McCabe 1982/1984 (http://www.stat.purdue.edu/research/technical_reports/pdfs/1982/tr82-03.pdf), which uses the term 'principal variables' for exactly what I'm trying to do. That term didn't seem to catch on though... – Stuart, Dec 12 '13 at 01:12
Principal variables are alive and well; see e.g. the Cumming and Wooff reference within http://stats.stackexchange.com/questions/23863/use-of-pca-analysis-to-select-variables-for-a-regression-analysis. I am not aware of a Stata implementation. – Nick Cox, Dec 15 '13 at 10:38
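
The "principal variables" idea from the comments above can be sketched as a greedy forward search. This is a rough illustration in Python, not McCabe's exact criteria and not a Stata implementation. The function name `greedy_principal_variables` and the toy covariance matrix are made up for the example. It also targets the total variance of all the questions rather than the first-PC wealth index specifically. At each step it picks the variable that explains the most remaining variance across all variables, then removes that variable's contribution from the residual covariance.

```python
def greedy_principal_variables(cov, k):
    """Greedily pick k variables explaining the most total variance.

    cov is a symmetric covariance (or correlation) matrix as a list of lists.
    Returns the chosen column indices and the total variance they explain.
    """
    p = len(cov)
    resid = [row[:] for row in cov]  # residual covariance given chosen variables
    chosen, explained = [], 0.0
    for _ in range(k):
        best_j, best_gain = None, 0.0
        for j in range(p):
            if j in chosen or resid[j][j] < 1e-12:
                continue  # already used, or no variance left in this variable
            # Variance of all variables explained by regressing them on variable j.
            gain = sum(resid[i][j] ** 2 for i in range(p)) / resid[j][j]
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:
            break
        pivot = resid[best_j][best_j]
        col = [resid[i][best_j] for i in range(p)]
        # Partial out the chosen variable from the residual covariance.
        for i in range(p):
            for m in range(p):
                resid[i][m] -= col[i] * col[m] / pivot
        chosen.append(best_j)
        explained += best_gain
    return chosen, explained


# Hypothetical correlation matrix: variables 0 and 1 are nearly redundant.
toy_cov = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.0],
    [0.2, 0.0, 1.0],
]
chosen, explained = greedy_principal_variables(toy_cov, 2)
print(chosen, explained / sum(toy_cov[i][i] for i in range(3)))
```

On this toy matrix the search picks variables 0 and 2 and skips variable 1, because most of what variable 1 carries is already covered by variable 0. A greedy search is not guaranteed to find the best subset of 10, but it needs only a few hundred evaluations for 32 variables instead of millions.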

score 2 · Answer 1 · answered Dec 11 '13 at 20:38

I think you need to better define what you are looking for. You could have 10 variables that each individually account for 90% of the variance, but if it is the same 90% of the variance each time, that set may not be interesting to you. Regression with an L1 and/or L2 penalty can help you identify variables or groups of variables that are strongly related to your wealth index. There are also other techniques, such as Minimum Redundancy Maximum Relevance, that select features which are strong predictors without duplicating one another.
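
To make the redundancy point concrete, here is a small Python sketch using made-up correlations. It relies on the standard formula for the R-squared of a regression on two predictors, written in terms of the three pairwise correlations. The function name and the numbers are hypothetical.

```python
def r2_two_predictors(r_y1, r_y2, r_12):
    """R-squared of y regressed on x1 and x2, from pairwise correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Two near-duplicate variables, each explaining about 90% of y on its own.
redundant = r2_two_predictors(0.95, 0.95, 0.99)

# Two uncorrelated variables, each explaining only 49% of y on its own.
complementary = r2_two_predictors(0.70, 0.70, 0.0)

print(round(redundant, 3), round(complementary, 3))
```

The redundant pair jointly explains barely more than either variable alone (about 0.907 against 0.9025), while the complementary pair reaches 0.98. Ranking variables one at a time would choose the wrong pair here.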


aplassard


Thanks, I've edited the question to make clearer that I'm looking for the set of 10 variables that can *jointly* explain the most variance, i.e. the best 10 for making a wealth index that would look like a wealth index based on the full 32. – Stuart Dec 11 '13 at 23:04
I believe someone else mentioned using the lasso regression penalty. This would be effective in your case, as it reduces the number of parameters that the model uses and attempts to select meaningful covariates. – aplassard Dec 12 '13 at 01:10

score -1 · Answer 2 · answered Dec 11 '13 at 20:23

Generally the way this would be done is by performing a linear regression of the wealth index over the 32 variables. The 10 coefficients with the highest magnitude (absolute value) should be the 10 that explain the most variation.

This is very commonly done in sabermetrics (think Moneyball) to see, for example, which $n$ factors most influence a baseball team's winning percentage. See this book chapter for a good introduction to linear regression that uses a similar example: http://www.stat.wisc.edu/~wardrop/courses/371chapter14.pdf


Ben


The magnitude of the coefficient depends on the scale of the variable in question. Unless the scales are commensurate, this cannot work. – gung - Reinstate Monica Dec 11 '13 at 20:29
Even when the scales are commensurate, this often does not work anyway. A counterexample is shown at http://stats.stackexchange.com/a/14528: there, the variable $y_8$ has the second highest absolute coefficient (out of $10$ variables) but is not part of a good selection of $5$ of those variables (and is unlikely to be part of the $5$ best). – whuber Dec 11 '13 at 20:37
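
The scale objection in the first comment above is easy to check numerically. This is a minimal Python sketch with invented data and a hypothetical helper `slope_and_r2`. It uses a single predictor for brevity, but the same rescaling effect applies to each coefficient in a multiple regression.

```python
def slope_and_r2(x, y):
    """Least-squares slope and R-squared for a one-predictor regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Invented data: a wealth index and land owned, recorded in two different units.
wealth = [1.0, 2.0, 2.5, 4.0, 5.5]
land_m2 = [500.0, 1200.0, 1500.0, 2600.0, 3400.0]
land_ha = [v / 10000 for v in land_m2]  # same information, in hectares

slope_m2, r2_m2 = slope_and_r2(land_m2, wealth)
slope_ha, r2_ha = slope_and_r2(land_ha, wealth)
print(slope_m2, slope_ha, r2_m2, r2_ha)
```

The coefficient grows by a factor of 10,000 when the unit changes from square metres to hectares, while the R-squared is unchanged. That is why coefficient magnitude on raw scales says nothing about how much variation a variable explains.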
