I recently stumbled upon Y-aware PCA on the Win-Vector blog.

They describe how PCA can be adjusted to explain not the variation in $X$ but the covariation of $X$ and $Y$.

This is explained for the case where $Y$ is continuous. How could one do this when $Y$ is binary? Partial least squares (PLS) does something very similar, but to my knowledge it can only be used for regression. What I would like to do is preprocess the data in this Y-aware PCA style and then apply some other classifier (e.g., a tree-based one).

Is there any reference for this? Is there an implementation (ideally open-source or in R) on the web?

EDIT: What I have done so far is apply partial least squares regression with a binary output and keep only the projection matrix, say $T$. Now I try other classifiers using $XT$ instead of $X$ as explanatory variables.

amoeba
Richi W
  • "How to do ___ in R" is off-topic, but your statistical question about PCA is on-topic. Please consider revising your question to just focus on the statistical issues -- otherwise it's at risk of closure. – Sycorax May 24 '16 at 13:40
  • Please help me: is it just the last sentence that's bad? If the question is statistical and the last sentence says: if you know an implementation then please tell me ... is this not ok? – Richi W May 24 '16 at 13:42
  • *Y-aware PCA* sounds like just another name for PLS-Partial Least Squares. – Mike Hunter May 24 '16 at 13:46
  • Yes, asking how to do something in R is off-topic. http://meta.stats.stackexchange.com/questions/1335/how-to-ask-question-related-to-the-use-of-r – Sycorax May 24 '16 at 13:52
  • Related: [PLS](http://stats.stackexchange.com/questions/179733/theory-behind-partial-least-squares-regression/179767#179767) – Sycorax May 24 '16 at 13:54
  • First: guys, I am quite active in closing questions on quant.stackexchange, and if 90% is statistical and 10% is "is there an implementation too" then it should be OK. Second: I tried PLS and I agree. I would just like to preprocess the data in this Y-aware PCA form and then apply something else (e.g., trees). I will edit the question. – Richi W May 24 '16 at 13:56
  • @GeneralAbrial I agree with Richard: if a question is fully statistical and only its last sentence asks about R implementation then it's completely fine, does not make the Q off-topic, and this sentence does not have to be edited out. – amoeba May 24 '16 at 14:02
  • Is there any reference on this "y-aware PCA" at all? The blog post does not mention any. Looks like some ad hoc thing that the blog author came up with. – amoeba May 24 '16 at 14:05
  • @amoeba .. yes, to a certain extent maybe ad hoc, but if we think about the idea then it is definitely worth analyzing further. Don't you think so? On the other hand, something similar was done in PLS. Can we carry this over to other methods, e.g. tree-based ones? Does this make sense ;) – Richi W May 24 '16 at 14:25

1 Answer

A better link to the blog on "Y-aware PCA" is here. The authors of that blog have an R package vtreat that implements this and other approaches to conditioning variables before analysis.

As noted in some comments, Y-aware PCA is related to partial least squares (PLS). It weights predictor variables according to their single-variable relations to the outcome variable, which is similar to the first step in PLS. Y-aware PCA then uses those weighted individual predictor variables, rather than their original standardized values, as the input to PCA. PLS, in contrast, uses the weighted sum of those predictor variables as its first component and constructs further orthogonal components by successive regression against residuals (e.g., ISLR, pp. 237-238).
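To make the contrast concrete, here is a minimal sketch in plain Python of the two weighting steps described above. It is an illustration under stated assumptions, not the `vtreat` implementation: the toy data and all function names (`y_aware_scale`, `first_principal_direction`, `first_pls_weights`) are hypothetical, the binary outcome is simply coded 0/1, and the PCA step is reduced to finding the first principal direction by power iteration. The "plain" comparison uses centered but unscaled columns purely to show how a high-variance noise column can dominate.

```python
from statistics import fmean


def center(x):
    """Subtract the mean from every value in a column."""
    m = fmean(x)
    return [a - m for a in x]


def slope(x, y):
    """Least-squares slope of the single-variable regression y ~ x."""
    dx, dy = center(x), center(y)
    return sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)


def y_aware_scale(columns, y):
    """Replace each column x by slope(x, y) * (x - mean(x)),
    so a unit change in the scaled column matches a unit change in y."""
    scaled = []
    for x in columns:
        b = slope(x, y)
        scaled.append([b * a for a in center(x)])
    return scaled


def first_principal_direction(columns, iterations=200):
    """Leading eigenvector of the covariance matrix of already-centered
    columns, found by power iteration (assumes the start vector is not
    orthogonal to that eigenvector)."""
    p, n = len(columns), len(columns[0])
    cov = [[sum(columns[i][k] * columns[j][k] for k in range(n)) / (n - 1)
            for j in range(p)] for i in range(p)]
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v


def first_pls_weights(columns, y):
    """Unit-norm weights proportional to cov(z_j, y), where z_j is the
    standardized j-th column; their weighted sum is the first PLS component."""
    dy = center(y)
    w = []
    for x in columns:
        dx = center(x)
        sd = (sum(a * a for a in dx) / (len(x) - 1)) ** 0.5
        w.append(sum(a / sd * b for a, b in zip(dx, dy)))
    norm = sum(c * c for c in w) ** 0.5
    return [c / norm for c in w]


# Hypothetical toy data: one column related to y, one high-variance noise column.
y = [0, 0, 0, 1, 1, 1]
x_signal = [0.1, 0.4, 0.3, 0.9, 1.1, 1.0]
x_noise = [10.0, -12.0, 9.0, -11.0, 12.0, -8.0]
columns = [x_signal, x_noise]

raw_direction = first_principal_direction([center(x) for x in columns])
aware_direction = first_principal_direction(y_aware_scale(columns, y))
pls_weights = first_pls_weights(columns, y)
```

On this toy data the first direction of the centered-only PCA loads mostly on the noise column, while both the Y-aware direction and the first PLS weight vector load mostly on the signal column. The difference between the two methods shows up only after this step: Y-aware PCA runs an ordinary PCA on all the rescaled columns, whereas PLS deflates and repeats.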

One of the vignettes shows those authors' approach to the binary-outcome issue in their package:

categorical/logical y’s are treated as 0/1 indicators

That might not seem terribly satisfying. As this review on PLS puts it:

... applying a regression method designed for continuous responses to categorical responses or performing dimension reduction with survival data without taking censoring into account is unappealing, although it is reported to give good results in many cases.

The review goes on to cite work in which Cox or logistic regression coefficients were used instead of linear regression coefficients for PLS. That might be a reasonable extension of "Y-aware PCA" to the binary outcome situation.
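As a rough sketch of what such an extension could look like, the snippet below rescales a single predictor by its single-variable logistic regression slope instead of its linear regression slope, so the rescaled column is in log-odds units. This is my own hypothetical illustration, not something `vtreat` or the cited review provides: the function names and toy data are made up, the fit is a bare Newton-Raphson loop with no convergence check, and it will fail if the predictor separates the classes perfectly.

```python
import math
from statistics import fmean


def logistic_slope(x, y, iterations=25):
    """Slope of the single-variable logistic regression y ~ x,
    fitted by Newton-Raphson on intercept b0 and slope b1."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = yi - p            # residual on the probability scale
            w = p * (1.0 - p)     # weight in the information matrix
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g for the 2x2 information matrix H
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b1


def logistic_aware_scale(x, y):
    """Rescale a column to logistic_slope * (x - mean(x))."""
    b = logistic_slope(x, y)
    m = fmean(x)
    return [b * (a - m) for a in x]


# Hypothetical toy data with overlapping classes (not perfectly separable).
x = [0.2, 0.5, 0.9, 1.1, 1.4, 1.8, 2.0, 2.3]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b1 = logistic_slope(x, y)
scaled = logistic_aware_scale(x, y)
```

The rescaled columns would then go into an ordinary PCA exactly as in the 0/1-coded version. Whether this performs better than the 0/1 coding is an empirical question.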

EdM
  • Thanks for your answer. I know all the vtreat links but the Boulesteix/Strimmer reference was new to me. All in all it sounds as if my approach makes sense: apply the PLS algorithm in its logistic form and just keep the projection matrix for transforming the data. Then later on I can apply any algorithm of choice. What do you think? – Richi W Jun 08 '16 at 20:58
  • OK, I think after trying a bit I see the problem: the software tools don't do a proper GLM PLS but ordinary PLS, and the binary factor is interpreted as a 0/1 continuous variable, right? Do you agree with this observation? – Richi W Jun 08 '16 at 21:21
  • @Richard that's my understanding: the tools in vtreat handle the binary outcome as a "continuous" variable with only {0,1} values. Using logistic regression coefficients instead would seem to make sense. A complete logistic-based PLS projection matrix would be somewhat different from the "Y-aware PCA" formulation, which stops after the first step of PLS and thus does not provide an orthogonal transformed matrix. Either seems worth a try. Whether either will work better than the "continuous" {0,1} outcome coding, evidently used by the win-vector folk in `vtreat`, is hard to say. – EdM Jun 08 '16 at 21:52
  • @amoeba, thanks for pointing out that what I had written was at best ambiguous. I've edited the second paragraph to try to be more precise about the ways that Y-aware PCA and PLS use the single-variable relations of predictor variables to Y. – EdM Jun 13 '16 at 16:07
  • Thanks, @EdM. It is clearer now. Your answer already has my +1. – amoeba Jun 13 '16 at 19:25