In a set of data, I have one dependent variable and 50 independent variables. Out of these 50, how can I find the variables which are important in estimating the dependent variable?
1 Answer
A good approach that actually eliminates variables is Lasso regression. Basically, you split your dataset into several "folds" (usually 5 or 10), fit a regression model to all but one fold, and then test its accuracy on the held-out fold (repeating so that each fold serves once as the test set). This is called cross validation, and it attempts to correct for the overfitting bias inherent in using the same data to both fit and test your model.
The key with Lasso is that you are minimizing not just the MSE (as in ordinary regression), but a "regularization" penalty as well. In this case, the penalty takes the form:
$$\gamma \sum_i |a_i|$$
where the $a_i$ are your regression coefficients and $\gamma$ is a non-negative "tuning parameter".
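Putting the pieces together, the full Lasso objective is the usual sum of squared errors plus that penalty. As a sketch of the standard form (note that some implementations scale the error term by $1/n$ or $1/2n$, which changes the numerical meaning of $\gamma$):
$$\min_a \; \sum_{j=1}^{n} \left( y_j - x_j^\top a \right)^2 + \gamma \sum_i |a_i|$$
where $x_j$ is the vector of predictor values for observation $j$ and $y_j$ is the corresponding value of the dependent variable.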
You will be doing two nested optimizations while selecting your variables: an inner one that fits the Lasso coefficients for a fixed $\gamma$, and an outer search over $\gamma$ itself:
- Pick a value for $\gamma$, then calculate the cross-validation error for the resulting model (the average MSE across the held-out folds).
- Adjust $\gamma$ and repeat.
This will produce a curve of cross-validation error vs. $\gamma$. You want to find the $\gamma$ that gives you the smallest cross-validation error; let's call it $\gamma^*$.
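To make the search concrete, here is a minimal sketch in Python with scikit-learn (which calls the tuning parameter `alpha` rather than $\gamma$); the data are simulated purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Simulated stand-in for the real problem: 50 predictors, few informative.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

gammas = np.logspace(-2, 2, 50)   # grid of tuning parameters to try
cv_errors = []
for g in gammas:
    model = Lasso(alpha=g, max_iter=10_000)
    # 10-fold CV; sklearn reports negative MSE, so flip the sign.
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_mean_squared_error")
    cv_errors.append(-scores.mean())  # average MSE over held-out folds

gamma_star = gammas[int(np.argmin(cv_errors))]
print(f"gamma* (smallest CV error): {gamma_star:.4f}")
```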
Note that $\gamma^*$ is still a sample statistic (i.e., an estimate): the model with the lowest cross-validation error is not necessarily the best at predicting within your allowed space of models. Therefore, it is recommended to err on the simpler side, given the uncertainty about which model actually gives the lowest true prediction error.
In the case of Lasso, that means choosing the largest $\gamma \geq \gamma^*$ such that the cross-validation error satisfies $CVE_\gamma \leq CVE_{\gamma^*} + s(CVE_{\gamma^*})$, where $s(CVE_{\gamma^*})$ is the standard error of the cross-validation error, estimated from the spread of the per-fold errors at $\gamma^*$. In other words, pick the most heavily penalized (simplest) model whose error is within one standard error of the empirical optimum; this is often called the "one-standard-error rule".
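Continuing from the snippet above (reusing `X`, `y`, and `gammas`), here is one way to apply that rule with scikit-learn's `LassoCV`, which stores the per-fold MSEs in its `mse_path_` attribute (shape `n_alphas` × `n_folds`):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

cv_model = LassoCV(alphas=gammas, cv=10, max_iter=10_000).fit(X, y)

mean_cve = cv_model.mse_path_.mean(axis=1)  # CVE at each gamma
se_cve = cv_model.mse_path_.std(axis=1) / np.sqrt(cv_model.mse_path_.shape[1])

i_star = int(np.argmin(mean_cve))           # index of gamma*
threshold = mean_cve[i_star] + se_cve[i_star]

# LassoCV sorts alphas_ in decreasing order, so the first gamma whose
# CV error falls within one SE of the minimum is the largest such gamma.
i_1se = int(np.argmax(mean_cve <= threshold))
gamma_1se = cv_model.alphas_[i_1se]

# Refit at the chosen gamma; the surviving variables have nonzero coefficients.
kept = np.flatnonzero(Lasso(alpha=gamma_1se, max_iter=10_000).fit(X, y).coef_)
print(f"gamma (one-SE rule): {gamma_1se:.4f}")
print(f"variables kept: {kept}")
```

This mirrors the `lambda.1se` choice in R's `glmnet`, should a stats-savvy colleague end up doing this in R instead.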
- Variable selection by LASSO can be useful, but the particular variables returned may depend very heavily on the particular data set at hand. It's important to think about what is meant by "important" in the question. That's not always obvious. See [this answer](http://stats.stackexchange.com/a/203310/28500), its links and related questions, for an introduction to the extensive discussion on this site about this issue. – EdM Mar 29 '16 at 14:52
- @EdM (+1) Good point on "importance"... I took it as "most contributing to predictive accuracy", but I guess one could use other metrics as well and get different results... – Bey Mar 29 '16 at 15:13
- Thanks for the answer. Unfortunately, I do not have enough statistical background to write a program for lasso regression. Instead, I tried to use the ready-made, window-based Statistica Neural Network software to find the importance of variables. This software has two options: (1) feature selection and (2) predictor screen. Under feature selection, the program gives F and p values. Under predictor screen, it gives R-squared and F-statistic values. Unfortunately, none of these four gives me the same order of importance, so I am confused about which one to take. – Ali Mar 30 '16 at 04:54
- In continuation of our earlier discussion: suppose I have 15 variables and their R-squared values vary from 0.5 to 0.01. How can I decide which variables to consider based on these values? Can I keep those variables whose R-squared values exceed 50% of the maximum (in this example, down to 0.25), or 25%, or 10%? – Ali Mar 30 '16 at 05:03
- @Ali As hinted at by EdM, this is not a "stats 101" problem - I doubt you can point-and-click your way to a satisfactory solution... you will need to (a) be willing to learn some R or (b) have a stats-savvy colleague help you out with this. Also, make sure you have a clear objective for what success means... is it 100% predictive accuracy, or minimizing losses, or maximizing revenue (accounting for data-gathering costs)? – Bey Mar 30 '16 at 11:15
- @Ali Regarding your specific options in Statistica... it sounds like "feature selection" is looking for the most EXPLANATORY variables (i.e., the ones that explain the most variance in the SAMPLE), while "predictor screen" is finding the variables most useful in PREDICTING new values, even though they may not explain the most variance in your current sample. Again, it boils down to whether you are trying to explain your observations or to predict new values. – Bey Mar 30 '16 at 11:17
- Thanks Bey. My ultimate aim is to predict, but before developing a predictive model, I want to select the most important variables so that my predictions will be accurate. – Ali Mar 31 '16 at 08:16