
I have 20 response variables $Y = (Y_1, \dots, Y_{20})$ and 1600 predictor variables $X = (X_1, \dots, X_{1600})$, with 128 observations. I want to know which pairs of $X$ best predict each $Y_i$.

So I generated all combinations $(Y_i, X_j, X_k)$ and fitted a linear regression for each one to obtain its R-squared. Based on R-squared, I extracted the top 100 combinations for further analysis of which pairs of $X$ are the best predictors of $Y$.
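In case it helps, here is a minimal sketch of that exhaustive pair search in Python (NumPy only). The array names, random placeholder data, and the scaled-down number of predictors are illustrative assumptions, not part of my actual setup.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p, q = 128, 50, 20          # p = 1600 in the real problem; scaled down so the brute-force loop stays fast
X = rng.normal(size=(n, p))    # placeholder predictors
Y = rng.normal(size=(n, q))    # placeholder responses

def r_squared(y, x_pair):
    """R^2 of an OLS fit of y on two predictors plus an intercept."""
    design = np.column_stack([np.ones(len(y)), x_pair])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / tss

# For each response, score every (X_j, X_k) pair and keep the top 100 by R^2.
top = {}
for i in range(q):
    scores = [(r_squared(Y[:, i], X[:, [j, k]]), j, k)
              for j, k in combinations(range(p), 2)]
    top[i] = sorted(scores, reverse=True)[:100]
```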

I haven't considered multicollinearity between any pair of predictors. Should I?

My goal is to find the best pairs $(X_j, X_k)$ that predict a given $Y_i$. Can you suggest how to improve this procedure to make it statistically valid?


1 Answer


Your problem is known as the $p > n$ problem: you have more covariates than observations. In standard least-squares regression the matrix $X^\top X$ is then singular (not of full rank), so the normal equations have no unique solution. One approach is to use some form of penalized regression, such as the lasso or ridge regression. These shrink the coefficient estimates; the lasso in particular sets many coefficients exactly to zero and thus effectively selects a small subset of predictors.
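For illustration, here is a minimal sketch of that penalized-regression route using scikit-learn's `LassoCV`, fitted separately for each response. The arrays and parameter choices below are assumptions for demonstration, not the poster's data.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, q = 128, 1600, 20
X = rng.normal(size=(n, p))    # placeholder predictors
Y = rng.normal(size=(n, q))    # placeholder responses

# Fit a lasso for each response; the penalty is chosen by 5-fold cross-validation,
# and the predictors with nonzero coefficients are the ones the lasso selects.
for i in range(q):
    fit = LassoCV(cv=5, max_iter=10000).fit(X, Y[:, i])
    selected = np.flatnonzero(fit.coef_)
    print(f"Y_{i+1}: {selected.size} predictors selected, e.g. columns {selected[:5]}")
```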

The approach of using pairs of elements of $X$ as predictors will probably always be inferior to models that include more predictors, except in the special case where you know a priori that there are exactly two predictors to be found.

tomka
  • Yes, that is true; there can be more than two variables that influence the model. But what does it mean for a combination to have a high R-squared? Can we consider that combination to be associated in some way? – ChathuraG Dec 21 '15 at 18:56