The book chapter referenced below (Section 4.3.1) lists a few formulations of partial least squares (PLS). The first two make sense to me and seem standard:
$$\underset{\mathbf{u}, \mathbf{v}}{\text{maximize}} \quad \frac{\mathbf{u}^\top \mathbf{X}^\top \mathbf{Y} \mathbf{v}}{\lVert \mathbf{u} \rVert \lVert \mathbf{v} \rVert} \quad \iff \quad \underset{\mathbf{u}, \mathbf{v}}{\text{maximize}} \quad \mathbf{u}^\top \mathbf{X}^\top \mathbf{Y} \mathbf{v} \quad \text{s.t.}~\lVert \mathbf{u} \rVert^2 = \lVert \mathbf{v} \rVert^2 = 1$$
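(For concreteness, here is a minimal numpy sketch of how I understand this first formulation is solved: the maximizers should be the leading left/right singular vector pair of $\mathbf{X}^\top \mathbf{Y}$. The function name is just mine.)

```python
import numpy as np

def pls_first_direction(X, Y):
    """First PLS weight pair: leading singular vectors of X^T Y."""
    M = X.T @ Y
    U, s, Vt = np.linalg.svd(M)
    u, v = U[:, 0], Vt[0, :]
    # s[0] is the maximum of u^T X^T Y v over unit-norm u, v
    return u, v, s[0]
```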
They also state the problem is equivalent to minimizing the misfit:
$$\underset{\mathbf{u}, \mathbf{v}}{\text{minimize}} \quad \lVert \mathbf{X} \mathbf{u} - \mathbf{Y} \mathbf{v} \rVert^2 \quad \quad \text{s.t.}~\lVert \mathbf{u} \rVert^2 = \lVert \mathbf{v} \rVert^2 = 1$$
But this doesn't seem equivalent to me. Expanding the quadratic objective function, we get:
$$\mathbf{u}^\top \mathbf{X}^\top\mathbf{X} \mathbf{u} + \mathbf{v}^\top \mathbf{Y}^\top\mathbf{Y} \mathbf{v} - 2\, \mathbf{u}^\top \mathbf{X}^\top \mathbf{Y} \mathbf{v} \tag{$*$}$$
It seems they would need to ignore (or treat as constant) the first two terms for all of these optimization problems to be equivalent, but I don't see how that is justified.
Reference in question: De Bie T., Cristianini N., Rosipal R. (2005). Eigenproblems in Pattern Recognition. In: Handbook of Geometric Computing. Springer, Berlin, Heidelberg.
Side note: I do understand a similar equivalence for the related case of canonical correlation analysis (CCA). In that model the constraints become $\mathbf{u}^\top \mathbf{X}^\top\mathbf{X} \mathbf{u} = \mathbf{v}^\top \mathbf{Y}^\top\mathbf{Y} \mathbf{v} = 1$, in which case the first two terms in $(*)$ are constrained to be constant.
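Concretely, under those CCA constraints the misfit $(*)$ reduces to
$$\lVert \mathbf{X}\mathbf{u} - \mathbf{Y}\mathbf{v} \rVert^2 = 1 + 1 - 2\, \mathbf{u}^\top \mathbf{X}^\top \mathbf{Y} \mathbf{v} = 2 - 2\, \mathbf{u}^\top \mathbf{X}^\top \mathbf{Y} \mathbf{v},$$
so minimizing the misfit is the same as maximizing the cross-covariance term. I don't see the analogous step under the PLS constraints $\lVert \mathbf{u} \rVert^2 = \lVert \mathbf{v} \rVert^2 = 1$.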
A counterexample (?): Consider the following for a choice of $\epsilon$ close to zero.
$$ \mathbf{X} = \begin{bmatrix} 1/\epsilon & 0 \\ 0 & 1 \end{bmatrix} ~; \quad \mathbf{Y} = \begin{bmatrix} 1 & 0 \\ 0 & \epsilon \end{bmatrix}$$
$$ \mathbf{X}^\top \mathbf{X} = \begin{bmatrix} 1 / \epsilon^2 & 0 \\ 0 & 1 \end{bmatrix} \quad \mathbf{Y}^\top \mathbf{Y} = \begin{bmatrix} 1 & 0 \\ 0 & \epsilon^2 \end{bmatrix} \quad \mathbf{X}^\top \mathbf{Y} = \begin{bmatrix} 1/\epsilon & 0 \\ 0 & \epsilon \end{bmatrix} $$
In the first formulation, this means we should be maximizing:
$$\mathbf{u}^\top \begin{bmatrix} 1 / \epsilon & 0 \\ 0 & \epsilon \end{bmatrix} \mathbf{v}$$
But in the second formulation, it means we should be minimizing:
$$ \mathbf{u}^\top \begin{bmatrix} 1 / \epsilon^2 & 0 \\ 0 & 1 \end{bmatrix} \mathbf{u} + \mathbf{v}^\top \begin{bmatrix} 1 & 0 \\ 0 & \epsilon^2 \end{bmatrix} \mathbf{v} - \mathbf{u}^\top \begin{bmatrix} 2/\epsilon & 0 \\ 0 & 2\epsilon \end{bmatrix} \mathbf{v} $$
Now, as $\epsilon \rightarrow 0$, the former case would give $\mathbf{u} = \mathbf{v} = \begin{bmatrix} 1 & 0 \end{bmatrix}^\top$ while the latter case would give $\mathbf{u} = \mathbf{v} = \begin{bmatrix} 0 & 1 \end{bmatrix}^\top$ (since the $1 / \epsilon^2$ term dominates).
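To sanity-check this, here is a quick numerical sketch (numpy; $\epsilon = 10^{-3}$ is just an arbitrary small value) evaluating both objectives at the two candidate unit vectors:

```python
import numpy as np

eps = 1e-3
X = np.diag([1 / eps, 1.0])
Y = np.diag([1.0, eps])

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

def covariance(u, v):
    # First formulation: u^T X^T Y v
    return u @ X.T @ Y @ v

def misfit(u, v):
    # Second formulation: ||Xu - Yv||^2
    return np.linalg.norm(X @ u - Y @ v) ** 2

print(covariance(e1, e1), covariance(e2, e2))  # 1/eps >> eps: the maximizer is e1
print(misfit(e1, e1), misfit(e2, e2))          # (1/eps - 1)^2 >> (1 - eps)^2: the minimizer is e2
```

So the two formulations pick out different directions for this choice of $\mathbf{X}$ and $\mathbf{Y}$, which is what makes me doubt the stated equivalence.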