
In polynomial regression problems, a feature map $\underline{\phi}(\underline{x})$ is used to send an input vector to a higher-dimensional space (an example being $(x_{1}, x_{2}) \to (x_{1}, x_{2}, x_{1}x_{2}, x_{1}^{2}, x_{2}^{2})$), and a linear model, parametrized by a weights vector $\underline{w}$, is then constructed, so that the penalised least-squares cost function is given by

$L = \sum_{i=1}^{N}\left(\underline{w}^{T}\cdot \underline{\phi}(\underline{x}_{i})-y_{i} \right)^{2} + \lambda\sum_{j=1}^{M}w_{j}^{2}$

the Dual Formulation consists of re-parametrizing this model in terms of a parameter vector $\underline{a}$ such that

$\underline{w} = \Phi ^{T} \cdot \underline{a}\hspace{4mm}(Eq. 1)$

in which the matrix $\Phi$ is defined such that $\Phi_{ij} = (\underline{\phi}(\underline{x}_{i}))_{j}$, i.e. the $j^{th}$ component of $\underline{\phi}(\underline{x}_{i})$.
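To fix notation, here's a small numpy sketch of the setup as I understand it, using the quadratic feature map from the example above (the variable names and random data are just mine for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # quadratic feature map from the example: (x1, x2) -> (x1, x2, x1*x2, x1^2, x2^2)
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, x1**2, x2**2])

N, lam = 8, 0.1
X = rng.normal(size=(N, 2))             # N raw inputs in R^2
y = rng.normal(size=N)                  # N targets
Phi = np.array([phi(x) for x in X])     # Phi[i, j] = j-th component of phi(x_i); shape (N, M), here M = 5

def loss(w):
    # L(w) = sum_i (w . phi(x_i) - y_i)^2 + lam * ||w||^2
    return np.sum((Phi @ w - y) ** 2) + lam * np.sum(w ** 2)
```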

Below, I'll provide an outline of the calculation which is often used to justify this, but first let's examine the reparametrization given in (Eq. 1). The fundamental problem, as I see it, is that $\underline{w}$ has $M$ components while $\underline{a}$ has $N$ components.

What I'm struggling to understand is any case in which $M\neq N$. I think about this as first finding $\underline{a}^{*}$, the $\underline{a}$ which minimises the loss with respect to $\underline{a}$, and then wanting to write down which "physical" model this corresponds to (the model expressed in terms of $\underline{w}$ is nice and interpretable). This re-parametrization doesn't appear to impose any constraints on $\underline{a}$; it seems that once the model has been re-parametrized, it's legitimate to do gradient descent in $\mathbb{R}^{N}$.

So what if $M<N$? If $\underline{w}$ lives in $\mathbb{R}^{M}$, which is a lower-dimensional space than the $\mathbb{R}^{N}$ where $\underline{a}$ lives, then there can be no 1:1 mapping between the two. It must be the case that there are multiple values (in fact, I assume, entire manifolds) of $\underline{a}$ which map to the same $\underline{w}$. Perhaps this isn't problematic in itself, it's just rather unintuitive. The $M>N$ case is, however, much more problematic: now $\underline{w}$ lives in a higher-dimensional space than $\underline{a}$, so the image of $\underline{a}\mapsto\Phi^{T}\cdot\underline{a}$ is only an (at most) $N$-dimensional subspace of $\mathbb{R}^{M}$, and there is a whole family of models in $\underline{w}$-space that apparently can't be reached at all.
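Just to illustrate the $M<N$ point numerically (this is my own toy example, not something from a reference): adding to $\underline{a}$ any component in the null space of $\Phi^{T}$ leaves $\underline{w}=\Phi^{T}\cdot\underline{a}$ untouched, so the map really is many-to-one.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 3                       # M < N
Phi = rng.normal(size=(N, M))     # Phi[i, j] = phi(x_i)_j

a = rng.normal(size=N)
w = Phi.T @ a                     # w = Phi^T a lives in R^M

# Build a direction in the null space of Phi^T (orthogonal complement of col(Phi)).
Q, _ = np.linalg.qr(Phi, mode="complete")   # columns M..N-1 of Q span null(Phi^T)
null_dir = Q[:, M]

a_other = a + 5.0 * null_dir
print(np.allclose(Phi.T @ a_other, w))      # True: a different a, the same w
```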

Further to all of this (and now we need to go into the derivation of how (Eq. 1) is usually justified), is it really valid to reparametrize the problem in terms of a vector in $\mathbb{R}^{N}$ and then do unconstrained minimisation/gradient descent?

The justification usually goes something like the following. The gradient of the cost function with respect to $\underline{w}$ is given by

$\nabla _{w} L = 2\left[\Phi^{T}\cdot \Phi \cdot \underline{w} -\Phi ^{T}\cdot \underline{y} + \lambda \underline{w}\right]$

and thus the gradient is zero when:

$\underline{w} = \Phi^{T}\cdot\left[\frac{1}{\lambda}\left(\underline{y} - \Phi \cdot \underline{w}\right)\right]$

or

$\underline{w} = \Phi^{T}\cdot \underline{a}$

with $\underline{a} = \frac{1}{\lambda}\left(\underline{y} - \Phi \cdot \underline{w}\right)$
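For what it's worth, the algebra above does check out numerically; here's a quick sketch with made-up data, using the closed-form ridge solution rather than gradient descent (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, lam = 10, 4, 0.3
Phi = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Primal ridge solution: gradient zero  <=>  (Phi^T Phi + lam I) w = Phi^T y
w_star = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
grad = 2 * (Phi.T @ Phi @ w_star - Phi.T @ y + lam * w_star)
print(np.allclose(grad, 0))                      # True

# The a from the rearrangement, and the self-consistency w = Phi^T a
a_star = (y - Phi @ w_star) / lam
print(np.allclose(Phi.T @ a_star, w_star))       # True

# It also agrees with the usual kernel/dual closed form a = (Phi Phi^T + lam I)^{-1} y
a_dual = np.linalg.solve(Phi @ Phi.T + lam * np.eye(N), y)
print(np.allclose(a_star, a_dual))               # True
```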

People then seem to justify this as "$\underline{w}$ can be written as $\Phi^{T}\cdot \underline{a}$, and it doesn't matter that $\underline{a}$ is a function of $\underline{w}$, it's still just a vector". It seems to me, though, that this is telling us that $\underline{a}$ isn't completely unconstrained: you have to be able to express it as $\frac{1}{\lambda}\left(\underline{y} - \Phi \cdot \underline{w}\right)$, and depending on the dimensionality of $\underline{w}$, the admissible values of $\underline{a}$ might form only a manifold within $\mathbb{R}^{N}$. In that case it wouldn't be acceptable simply to minimise the loss with respect to $\underline{a}$; it would have to be some sort of constrained minimisation. But there's no mention of this in anything I've seen on the Dual Formulation for kernel learning.

1 Answer

It might help to specialise this back to linear regression, which has all the features you don't like, but is easier.

First, $M<N$:

We have $N$ observations of two variables $y$ and $x$, which are both centred. The feature map $\phi: \mathbb{R}\to\mathbb{R}$ is the identity, so $M=1$.

The penalised OLS loss is

$$L=\left(\sum_{i=1}^N (wx_i-y_i)^2\right) + \lambda w^2$$

The dual form is $w=\Phi^Ta$, where $\Phi_{ij}=(x_i)_j=x_i$. That is, $w=\sum_{i=1}^N x_ia_i$.

The gradient is zero when $$a= \frac{1}{\lambda}(y-\Phi\cdot w)$$ or $a_i= \frac{1}{\lambda}(y_i-x_i w)$. That is, at the optimum each $a_i$ is just $1/\lambda$ times the residual from the line with the optimal slope. The apparent freedom to specify the $N$-vector $a$ is illusory -- only the component of $a$ in the column space of $\Phi$ affects $w=\Phi^T a$, and that space is only $M$-dimensional.
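Here's this one-predictor case in numpy (made-up data, just to make the bookkeeping concrete):

```python
import numpy as np

rng = np.random.default_rng(3)
N, lam = 20, 0.7
x = rng.normal(size=N); x -= x.mean()        # single centred predictor, so M = 1
y = 2.0 * x + rng.normal(size=N); y -= y.mean()

w = (x @ y) / (x @ x + lam)    # ridge slope from the (scalar) normal equation
a = (y - w * x) / lam          # dual coefficients: scaled residuals from the optimal line

print(np.allclose(np.sum(x * a), w))         # True: w = sum_i x_i a_i = Phi^T a
```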

Now $M>N$:

We still have linear regression with $\phi$ as the identity, but now there are $M>N$ predictors. As before, $\Phi_{ij}=x_{ij}$.

The dual form is $w_j=\sum_{i=1}^N x_{ij}a_i$.

The gradient is zero when $$a_i= \frac{1}{\lambda}\left(y_i-\sum_{j=1}^M\Phi_{ij}w_j\right)$$

It's true that $w$ is of higher dimension than $a$, but $\Phi\cdot w$ isn't, and that's all we see. If $M>N$ there will be multiple $w$s that could give the same fitted values; the problem of finding $w$ is underdetermined. But that was true in the original least-squares formulation as well -- it's not a problem introduced by the dual formulation.
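A quick numerical check of the $M>N$ case (again with made-up data): the dual route works with an $N\times N$ system and still lands on the same ridge solution as the primal one.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, lam = 5, 12, 0.2            # M > N: more predictors than observations
Phi = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Primal: an M x M solve (still well-posed because lam > 0)
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Dual: an N x N solve, then map back with w = Phi^T a
a = np.linalg.solve(Phi @ Phi.T + lam * np.eye(N), y)
w_dual = Phi.T @ a

print(np.allclose(w_primal, w_dual))               # True: same ridge solution
print(np.allclose(Phi @ w_primal, Phi @ w_dual))   # True: same fitted values
```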

  • Thanks for your answer. How do I best think about the $M>N$ case? If there were no regulariser, there would be multiple $\underline{w}$ which minimise the loss, and in fact make it zero. With regularisation, I don't quite have the intuition: are there still multiple $\underline{w}$ which minimise the loss? Also, if that is the case, how do I think about $\underline{w} = \Phi ^{T}\cdot \underline{a}$? Should I think of it as "there are many $\underline{w}$ which minimise the loss, but only one of these will also satisfy this constraint"? So the dual formulation "makes you choose which one"? – gazza89 Oct 14 '20 at 08:42
  • Yes, the quadratic regulariser picks a specific $w$ -- out of all the $w$ that give the optimal fit, the one with smallest norm. This has to be unique, since the set of such $w$ is flat (an affine subspace) and so has a single point of closest approach to the origin. – Thomas Lumley Oct 15 '20 at 22:29
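To see the minimum-norm behaviour from the last comment numerically (my own sketch, not from the thread): as $\lambda\to 0$ the dual/ridge solution $w = \Phi^{T}(\Phi\Phi^{T}+\lambda I)^{-1}y$ approaches the minimum-norm least-squares solution given by the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 5, 12                      # M > N: the unpenalised problem is underdetermined
Phi = rng.normal(size=(N, M))
y = rng.normal(size=N)

w_min_norm = np.linalg.pinv(Phi) @ y   # minimum-norm solution with zero training error

for lam in [1e-1, 1e-4, 1e-7]:
    w_ridge = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(N), y)
    print(lam, np.linalg.norm(w_ridge - w_min_norm))   # gap shrinks as lam -> 0
```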