
I ran a principal component analysis that yielded 7 components, which I am now using in a principal component regression on my dependent variable.

However, I want to add 2 control variables that are not component scores but "normal" variables (on a Likert scale, the same kind of variables my independent variables were).

Is that okay to do, or do I have to make these into components as well? I have run the regression both ways, with those two entered as plain variables and with them combined into a component, and the results are practically the same.

amoeba
  • 3
    This is not a duplicate. – amoeba Jan 18 '18 at 20:41
  • 2
    I have no idea why this is closed as a duplicate of that Q. It is **not** a duplicate! This is a separate and a clearly defined question (that has been asked before: https://stats.stackexchange.com/questions/47972 - but wasn't answered, so now *that* Q is closed as a duplicate of *this* one). It's well answered below and the answer is accepted. This thread should stay open. I voted to reopen. – amoeba Jan 21 '18 at 21:33

1 Answer


It depends.

As you may know, PCR starts with PCA on the independent variables:

$$X = TP' + E$$

After obtaining the scores ($T$), one carries out regression between $T$ and $Y$ so that:

$$Y = TB + F$$

The first PCA step ensures that the scores (columns of $T$) are uncorrelated, so you can find "healthy" regression coefficients rather than dealing with the original, problematic matrix (rank deficiency, multicollinearity, $p \gg n$, etc.), which can, for example, yield very large regression coefficients and cause overfitting.
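The two steps above can be sketched in a few lines of NumPy. This is only a minimal illustration on synthetic data: the sizes `n`, `p`, `k` and the way the collinear predictors are generated are assumptions made for the example, not anything taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, deliberately collinear predictors (illustrative data only).
n, p, k = 200, 10, 3
latent = rng.normal(size=(n, k))
X = latent @ rng.normal(size=(k, p)) + 0.1 * rng.normal(size=(n, p))
y = latent @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Step 1: PCA on the mean-centered predictors, X = T P' + E.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:k].T   # loadings, shape (p, k)
T = Xc @ P     # scores, shape (n, k)

# Step 2: regress y on the scores, Y = T B + F.
B, *_ = np.linalg.lstsq(T, yc, rcond=None)

# The score columns are mutually uncorrelated: T'T is (numerically) diagonal.
gram = T.T @ T
max_off_diagonal = np.abs(gram - np.diag(np.diag(gram))).max()
print("largest off-diagonal entry of T'T:", max_off_diagonal)
print("condition number of centered X:", np.linalg.cond(Xc))
print("condition number of T:", np.linalg.cond(T))
```

Comparing the two condition numbers shows why the regression on $T$ tends to be better behaved than the regression on the original, collinear $X$.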

Thus, if you add some variables, then depending on the nature of those variables you may end up with a problem similar to the one that led you to use PCR rather than OLS in the first place. On the other hand, it may be just fine. I suggest confirming each model's performance by testing it on an independent validation set, or at least by using cross-validation (CV).
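One way to run such a check is a small k-fold comparison of the two candidate models: scores plus raw control variables, versus controls folded into the PCA. The sketch below is an assumption-laden illustration: the helper `pcr_cv_mse` is a hypothetical name, the data are synthetic, and the component counts (7 and 9) are arbitrary choices for the example.

```python
import numpy as np

def pcr_cv_mse(X_pca, y, n_components, X_raw=None, n_folds=5, seed=0):
    """K-fold CV mean squared error for PCR (hypothetical helper).

    X_pca: predictors that are compressed by PCA.
    X_raw: optional predictors appended to the scores unchanged,
           e.g. control variables.
    """
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    squared_errors = []
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Fit the centering and the loadings on the training fold only.
        mean = X_pca[train].mean(axis=0)
        _, _, Vt = np.linalg.svd(X_pca[train] - mean, full_matrices=False)
        P = Vt[:n_components].T
        design = (X_pca - mean) @ P
        if X_raw is not None:
            design = np.hstack([design, X_raw])
        design = np.hstack([np.ones((n, 1)), design])  # intercept column
        coef, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        residuals = y[test_idx] - design[test_idx] @ coef
        squared_errors.append(residuals ** 2)
    return float(np.concatenate(squared_errors).mean())

# Synthetic stand-in for the data in the question (assumed structure).
rng = np.random.default_rng(1)
n = 300
latent = rng.normal(size=(n, 7))
X = latent @ rng.normal(size=(7, 20)) + 0.5 * rng.normal(size=(n, 20))
controls = rng.normal(size=(n, 2))
y = latent[:, 0] - latent[:, 1] + 0.8 * controls[:, 0] + rng.normal(size=n)

# Model A: 7 component scores plus the 2 controls entered as-is.
mse_raw_controls = pcr_cv_mse(X, y, 7, X_raw=controls)
# Model B: controls included in the PCA, 9 components retained.
mse_controls_in_pca = pcr_cv_mse(np.hstack([X, controls]), y, 9)

print("CV MSE, controls added as raw variables:", mse_raw_controls)
print("CV MSE, controls included in the PCA:   ", mse_controls_in_pca)
```

The PCA is refitted inside each fold so that the held-out observations do not influence the loadings. Whichever model has the lower held-out error on your own data is the one this check would favor.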

Personally, I would add those variables prior to PCR. If interpretability by looking at the regression coefficients is your concern, then, since $T = XP$, $$Y = XPB + F$$ thus $\hat{B} = PB$, which you can use directly on your (probably at least mean-centered) data and which can be interpreted just as easily.
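The back-transformation $\hat{B} = PB$ can be checked numerically. The following is a minimal sketch on synthetic data, assuming mean-centered predictors and orthonormal loadings taken from an SVD. The dimensions (9 original variables, 7 retained components) are chosen only to echo the question.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 150, 9, 7   # 9 original variables, 7 retained components
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

x_mean, y_mean = X.mean(axis=0), y.mean()
Xc = X - x_mean

# PCA loadings P and scores T = Xc P.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:k].T
T = Xc @ P

# Regression on the scores: Y = T B + F.
B, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)

# Coefficients expressed in terms of the original variables.
B_hat = P @ B   # shape (p,), one coefficient per original variable

pred_from_scores = y_mean + T @ B
pred_from_original = y_mean + Xc @ B_hat
print("predictions agree:", np.allclose(pred_from_scores, pred_from_original))
print("coefficients on the original variables:", np.round(B_hat, 3))
```

Each entry of `B_hat` then reads like an ordinary regression coefficient on the corresponding mean-centered original variable.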

gunakkoc
  • Well, adding a few (specifically, two) additional variables will not make $p > n$ if PCA originally reduced $p$ to $p\ll n$. Is your concern then that these 2 additional variables might be highly correlated to the retained PC scores? – amoeba Jan 18 '18 at 13:58
  • For this specific case, yes. But usually I try to provide less specific answers to aid future readers. Is this a bad practice? – gunakkoc Jan 18 '18 at 14:08
  • It's a good practice :-) I was just clarifying what you meant. – amoeba Jan 18 '18 at 14:11
  • Thank you, very helpful reply! I have done most of the assumption testing and I think I have good and valid results in the end. – Boudewijn Hulst Jan 19 '18 at 14:06