
Let's say I have a matrix of values for many different variables Y1..Y1000 at X = 1, 2, 3, ..., 10. Some of these variables are directly correlated with X, some follow different shapes (e.g. a normal distribution) and some are just random. I want to build a model to predict X based on given values of Y1..Y1000.

What would be the correct approach for this? I assume a simple linear regression would not be feasible because of the number of variables and the fact that not all variables are linearly dependent on X.

user12622
  • I'm having trouble following this. First, do you have 100 or 1000 variables labeled "Y"? Are you using "Y" for the predictor variables & "X" for the response variable? (That's OK, but it's the opposite of how it's usually done.) Is the main issue here that you have a multivariate situation (ie, multiple response variables)? Is this supposed to be a $p \gg n$ problem? Re different shapes, note that the distribution of predictors is irrelevant, but if the issue is w/ response variables, can you say more about what the distributions are? (Eg, I think a random variable can be normally distributed.) – gung - Reinstate Monica Jul 19 '12 at 19:38
  • I'm sorry that I wasn't clear enough, my stats vocabulary is a bit rusty... The 100 was a typo, that should have been 1000 in both cases. So just to reiterate, I have a set of data that tells me how Y1..Y1000 behave for different known values of X and I want to build a model "X ~ Y1, Y2, Y3, ...", to predict X based on values of Y. So I think X would be the dependent/response variable and Y the independent/predictor variable. Sorry if I mislabeled them. – user12622 Jul 19 '12 at 19:56
  • So in general I think the problem is that I only have around 10 data points per variable but a lot of variables. – user12622 Jul 19 '12 at 19:59

1 Answer


I would run a PCA on the Ys and regress X on the leading components, i.e. the few factors that explain most of the variation in the Ys. Based on your comment, however, classical PCA may be problematic since you have far more variables than observations. Asymptotic PCA à la Connor and Korajczyk (1986) might be a better choice. There are other variations on PCA that may be appropriate (like probabilistic PCA).
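To make the "PCA, then regress" idea concrete, here is a minimal sketch in Python using plain PCA via the SVD (not the asymptotic or probabilistic variants mentioned above). Everything in it is an assumption for illustration: the simulated data (300 Ys linear in X, 300 bell-shaped in X, 400 pure noise), the helper names `fit_pcr` and `predict_pcr`, and the choice of two components. Note that the components are chosen without looking at X, so whether they actually predict X has to be checked, for instance by leave-one-out as at the end of the sketch.

```python
import numpy as np

def fit_pcr(Y, x, n_components=2):
    """Principal component regression: PCA on Y (n x p), then OLS of x on the scores.

    Hypothetical helper for illustration only.
    """
    y_mean = Y.mean(axis=0)
    x_mean = x.mean()
    Yc = Y - y_mean                                   # center each Y column
    # Thin SVD is fine when p >> n; at most n - 1 components carry variance.
    _, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    components = Vt[:n_components]                    # (k, p) loadings
    scores = Yc @ components.T                        # (n, k) component scores
    coef, *_ = np.linalg.lstsq(scores, x - x_mean, rcond=None)
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return {"y_mean": y_mean, "x_mean": x_mean,
            "components": components, "coef": coef, "explained": explained}

def predict_pcr(model, Y_new):
    """Project new rows of Y onto the stored components and apply the regression."""
    scores = (Y_new - model["y_mean"]) @ model["components"].T
    return model["x_mean"] + scores @ model["coef"]

# Simulated data (an assumption, loosely mirroring the question): 10 observations,
# 1000 Ys of which some are linear in X, some bell-shaped in X, some pure noise.
rng = np.random.default_rng(0)
x = np.arange(1, 11, dtype=float)
n = len(x)
linear = x[:, None] * rng.normal(size=300) + rng.normal(size=(n, 300))
bell = (3.0 * np.exp(-(x[:, None] - 5.5) ** 2 / 8.0) * rng.normal(size=300)
        + rng.normal(size=(n, 300)))
noise = rng.normal(size=(n, 400))
Y = np.hstack([linear, bell, noise])

# Fit on all 10 observations and look at the in-sample fit.
model = fit_pcr(Y, x, n_components=2)
x_hat = predict_pcr(model, Y)

# Leave-one-out predictions: refit without observation i, then predict it.
loo = np.empty_like(x)
for i in range(n):
    keep = np.arange(n) != i
    m = fit_pcr(Y[keep], x[keep], n_components=2)
    loo[i] = predict_pcr(m, Y[i:i + 1])[0]

print("share of Y variance per component:", model["explained"])
print("in-sample predictions:", np.round(x_hat, 2))
print("leave-one-out predictions:", np.round(loo, 2))
```

With only 10 observations the in-sample fit says little on its own, so the leave-one-out predictions are the more honest check. If the leading components turn out to be unrelated to X in your data, this approach will not help, and a method that uses X when building the components would be worth considering instead.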

John
  • Because the principal components of the Y's have *nothing* to do with *X*, why does this help? For an explicit example of the difficulty, please see [this comment](http://stats.stackexchange.com/questions/32471/how-can-you-handle-unstable-beta-estimates-in-linear-regression-with-high-mul/32577#comment63983_32577). – whuber Jul 20 '12 at 12:35
  • I have to admit I am not really sure how to go from a PCA on Y to a model. Is there a good tutorial somewhere out there that would explain this process step by step? – user12622 Jul 20 '12 at 15:47