
At university, I learned about ridge regression and its derivation from these slides, under the assumption that the target and predicted values have dimension $1\times1$.

However, I now need to derive ridge regression for the case where the target and predicted values have dimension $1\times k$ with $k > 1$.

I have found these very useful links:

How to derive the ridge regression solution?

Why is ridge regression called "ridge", why is it needed, and what happens when $\lambda$ goes to infinity?

https://tamino.wordpress.com/2011/02/12/ridge-regression/

https://towardsdatascience.com/ridge-regression-for-better-usage-2f19b3a202db

https://en.wikipedia.org/wiki/Tikhonov_regularization

It seems to me that all of the above-mentioned links also assume that the target and predicted values have dimension $1\times1$.

Therefore, I am asking for help with the derivation of ridge regression for multi-value target vectors.

I started the derivation by building the model equation.

$Y(W,X) = \Phi W = \begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_m(x_1)\\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_m(x_2)\\ \vdots & \vdots & \ddots & \vdots\\ \phi_1(x_n) & \phi_2(x_n) & \cdots & \phi_m(x_n) \end{pmatrix} \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1k}\\ w_{21} & w_{22} & \cdots & w_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ w_{m1} & w_{m2} & \cdots & w_{mk} \end{pmatrix} = \begin{pmatrix} \sum_{j=1}^m w_{j1} \phi_j(x_1) & \sum_{j=1}^m w_{j2} \phi_j(x_1) & \cdots & \sum_{j=1}^m w_{jk} \phi_j(x_1) \\ \sum_{j=1}^m w_{j1} \phi_j(x_2) & \sum_{j=1}^m w_{j2} \phi_j(x_2) & \cdots & \sum_{j=1}^m w_{jk} \phi_j(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{j=1}^m w_{j1} \phi_j(x_n) & \sum_{j=1}^m w_{j2} \phi_j(x_n) & \cdots & \sum_{j=1}^m w_{jk} \phi_j(x_n) \end{pmatrix}$

Please note that each row of $Y$ is one multi-value prediction vector of dimension $1\times k$. So $Y$ has dimension $n\times k$, where $n$ is the number of observations. $\phi_j$ is the $j$-th of $m$ functions; it takes $x_i \in \mathbb{R}^{e}$ with $e \in \mathbb{N}$ and returns a single value.
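To make the dimensions concrete, here is a minimal NumPy sketch; the sizes and the basis functions `phis` are made up purely for illustration, and it only checks that $Y = \Phi W$ comes out with shape $n \times k$:

```python
import numpy as np

# All sizes and basis functions below are made up, purely for illustration.
n, m, k, e = 5, 3, 2, 4            # observations, basis functions, outputs, input dimension
rng = np.random.default_rng(0)

X = rng.normal(size=(n, e))        # each row is one input x_i in R^e
W = rng.normal(size=(m, k))        # weight matrix, one column per output dimension

# m functions, each mapping an input x_i in R^e to a single value
phis = [lambda x: 1.0,
        lambda x: x[0],
        lambda x: np.sum(x ** 2)]

# design matrix Phi: entry (i, j) is phi_j(x_i), so Phi has shape (n, m)
Phi = np.array([[phi(x) for phi in phis] for x in X])

Y = Phi @ W                        # predictions: one 1 x k row per observation
print(Phi.shape, Y.shape)          # (5, 3) (5, 2)
```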

Now I adapt equation 11 of my university slides to: $E_D(W) = \frac{1}{2}\bigg((\Phi W - Z) \odot (\Phi W - Z) + \lambda W \odot W \bigg)$ where $\odot$ is the Hadamard product and $Z$ is the matrix consisting of multi-value target vectors.

I know that $W \odot W$ is not suitable here, but I have no idea what the matrix analogue of $||W||^2_2$ (from the slides) should be.

Next, I want to adjust equation 7: $\nabla_W E_D(W) = \frac{\partial}{\partial W} \frac{1}{2}\bigg((\Phi W - Z) \odot (\Phi W - Z) + \lambda W \odot W \bigg) = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ \end{pmatrix}$

And this is the point where the problems start. How can this equation be solved?

I have no idea...

For several problems, I have validated that the following equation still holds, even when $Z$ consists of multi-value target vectors:

$W_{optimal} = \underbrace{(\Phi^T \Phi+\lambda I)^{-1}\Phi^T}_{\substack{\Phi^\dagger}}Z$

where $\Phi^\dagger$ is a regularized variant of the Moore-Penrose pseudo-inverse of $\Phi$ (it reduces to the usual pseudo-inverse as $\lambda \to 0$).
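A minimal NumPy sketch of such a check (with random data, purely for illustration) is to compare the matrix formula against applying the familiar single-output formula to each target column of $Z$ separately:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k, lam = 50, 4, 3, 0.7       # arbitrary sizes and ridge parameter, just for illustration
Phi = rng.normal(size=(n, m))
Z = rng.normal(size=(n, k))        # multi-value targets: one 1 x k row per observation

# multi-output ridge solution from the matrix formula above
W_opt = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ Z)

# the same solution, obtained by applying the single-output formula to each target column
W_cols = np.column_stack([
    np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ Z[:, j])
    for j in range(k)
])

print(np.allclose(W_opt, W_cols))  # True
```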

Unfortunately, I cannot derive the last equation from the above-mentioned derivative of $E_D(W)$.

Can someone please help me?

Daniel
    Nothing in my answer at https://stats.stackexchange.com/a/164546/919 is specific to a scalar response: it applies directly, without any modification, to a vector response. – whuber Oct 07 '19 at 18:54
  • [This page](https://stats.stackexchange.com/q/178965/28500) shows a way to extend the standard normal equation for univariate outcomes to the multiple-outcome case. (The Hadamard product you propose is not the way to go.) Together with the approach recommended by @whuber, that should point the way to a solution for a single ridge parameter $\lambda$ used for all outcomes. When you get it, please post the result as an answer to your question for future visitors to the site. – EdM Oct 08 '19 at 17:32
  • @whuber Could you please tell me if the derivation is correct? – Daniel Nov 16 '19 at 21:04

1 Answer


Zhanxiong's answer helped me a lot to come up with this derivation:

$A_i$ denotes the $i$-th row of the matrix $A$.

The loss function is: $$E_D(W)=\frac{1}{2}\sum_{i = 1}^n \Big(\|\Phi_i W-Z_i\|_2^2\Big) +\frac{\lambda}{2}\sum_{i = 1}^k\|(W^T)_i\|_2^2= $$ $$\frac{1}{2}\Big(\text{tr}\big(( \Phi W - Z)^T(\Phi W - Z)\big)+\lambda\; \text{tr}( W^TW)\Big)$$ Differentiate it with respect to $W$ and set it to $0$:

(Section 2.4.2 of "The Matrix Cookbook" was very helpful for solving the equation.)
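If it helps, the two identities I rely on are (stated in the layout convention where the gradient has the same shape $m \times k$ as $W$):

$$\frac{\partial}{\partial W}\,\text{tr}\big((\Phi W - Z)^T(\Phi W - Z)\big) = 2\,\Phi^T(\Phi W - Z), \qquad \frac{\partial}{\partial W}\,\text{tr}\big(W^T W\big) = 2\,W.$$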

$$\frac{\partial E_D(W)}{\partial W} = 0$$ $$\frac{\partial \bigg( \frac{1}{2}\Big(\text{tr}\big((\Phi W -Z)^T(\Phi W - Z)\big)+\lambda\; \text{tr}( W^TW)\Big)\bigg)}{\partial W} = \frac{1}{2}\big(2\Phi^T(\Phi W-Z )+2\lambda W \big) = 0$$

$$\Phi^T \Phi W - \Phi^TZ+\lambda W = 0$$ $$(\Phi^T \Phi+ \lambda I)W= \Phi^TZ$$ $$W= (\Phi^T \Phi+ \lambda I)^{-1} \Phi^TZ$$
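As a sanity check (not part of the derivation itself), the closed form can be compared against a library implementation. Here is a small sketch with made-up data, assuming scikit-learn's `Ridge` with `fit_intercept=False`, which handles multi-output targets and minimizes the same objective up to the constant factor $\frac{1}{2}$:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n, m, k, lam = 80, 5, 3, 1.5       # made-up sizes and ridge parameter
Phi = rng.normal(size=(n, m))
Z = rng.normal(size=(n, k))

# closed-form solution derived above
W_closed = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ Z)

# scikit-learn's ridge regression on the same multi-output problem
model = Ridge(alpha=lam, fit_intercept=False).fit(Phi, Z)
W_sklearn = model.coef_.T          # coef_ has shape (k, m), so transpose to (m, k)

print(np.allclose(W_closed, W_sklearn))  # True (up to solver tolerance)
```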

What do you think about this derivation?

(Yes, I know that I do not need the $\frac{1}{2}$ in the derivation.)

Daniel