Modifying ordinary least squares (OLS) in ridge regression to perform transfer learning

Question

I have a question on using ridge regression for transfer learning.

Transfer learning is a type of machine learning in which knowledge gained while performing a task in a source domain is transferred to a target domain where the same task is performed, i.e. the task stays the same but the datasets in the two domains differ.

One way to perform transfer learning is parameter sharing. The high-level intuition is that the target domain model parameters should be very close to the source domain model parameters while still allowing for some uncertainty. Mathematically, this intuition is captured by penalizing the deviation of the parameters, i.e. $\lambda\|W_{target} - W_{source}\|^2_2$, where $\lambda$ is the penalization parameter and the $W$'s are vectors of model parameters. In the past, I have used this approach to perform transfer learning for logistic regression (LR) and conditional random fields (CRF).

I am trying to use this approach to perform transfer learning with ridge regression. Given $N$ labeled examples of the form $\{(x_1,y_1),...,(x_N,y_N)\}$, recall that the OLS approach to ridge regression is,

$E(W) = \frac{1}{2}(Y - XW)^T(Y - XW)+\frac{\lambda}{2}W^TW$

and the closed form solution is,

$\hat{W} = (X^TX+\lambda I)^{-1}X^TY$

When performing transfer learning the OLS looks like,

$E(W) = \frac{1}{2}(Y - XW)^T(Y - XW)+\frac{\lambda}{2}(W-W_s)^T(W-W_s)$

where $W_s$ are the ridge regression parameters learned from the source domain. Taking the derivative w.r.t. $W$ and setting it equal to zero gives me,

$\hat{W} = (X^TX+\lambda I)^{-1}(X^TY + \lambda W_s)$

Only the second factor, $X^TY + \lambda W_s$, differs from the original ridge regression solution.
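For concreteness, the closed form can be evaluated directly with NumPy. This is only a minimal sketch: the function name `transfer_ridge` and the synthetic data are assumptions made for illustration, and `W_s` would normally come from a model fitted on the source domain.

```python
import numpy as np

def transfer_ridge(X, Y, W_s, lam):
    """Minimize 0.5*||Y - XW||^2 + 0.5*lam*||W - W_s||^2 in closed form."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)   # X^T X + lambda I
    b = X.T @ Y + lam * W_s         # X^T Y + lambda W_s
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(A, b)

# Synthetic target-domain data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W_s = np.array([1.0, -2.0, 0.5])            # stand-in for source-domain weights
Y = X @ (W_s + 0.1) + 0.01 * rng.normal(size=50)

W_hat = transfer_ridge(X, Y, W_s, lam=1.0)
# Setting W_s = 0 recovers the ordinary ridge regression solution.
W_ridge = transfer_ridge(X, Y, np.zeros(3), lam=1.0)
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical choice, not something the formula requires.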

My questions are,

1. Is this mathematically correct? My hypothesis is that this modified OLS will still result in a unique solution for each setting of $W_s$.
2. For CRFs and LR, I was able to verify that user-supplied gradients were within a tolerance level of gradients computed by numerical differentiation. What checks can I perform here to verify that the closed form solution is correct?
3. Are there any references to back this up? I am thinking of using this as a baseline method in a paper, and I am unable to find any.
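The same kind of gradient check used for CRFs and LR carries over. The gradient of the modified objective is $-X^T(Y - XW) + \lambda(W - W_s)$, and its Hessian $X^TX + \lambda I$ is positive definite for $\lambda > 0$, which is what makes the minimizer unique. Below is a hedged sketch of such a check on synthetic data; all function names and the step size `h` are illustrative assumptions.

```python
import numpy as np

def objective(W, X, Y, W_s, lam):
    r = Y - X @ W
    d = W - W_s
    return 0.5 * r @ r + 0.5 * lam * d @ d

def analytic_grad(W, X, Y, W_s, lam):
    return -X.T @ (Y - X @ W) + lam * (W - W_s)

def numeric_grad(f, W, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(W)
    for j in range(W.size):
        e = np.zeros_like(W)
        e[j] = h
        g[j] = (f(W + e) - f(W - e)) / (2 * h)
    return g

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
Y = rng.normal(size=40)
W_s = rng.normal(size=4)
lam = 2.0
f = lambda W: objective(W, X, Y, W_s, lam)

# 1) Analytic and numerical gradients should agree at an arbitrary point.
W0 = rng.normal(size=4)
grad_error = np.max(np.abs(numeric_grad(f, W0) - analytic_grad(W0, X, Y, W_s, lam)))

# 2) The numerical gradient should vanish at the closed-form solution.
W_hat = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ Y + lam * W_s)
grad_at_solution = np.max(np.abs(numeric_grad(f, W_hat)))

# 3) Smallest Hessian eigenvalue; positive means a unique minimizer.
min_eig = np.linalg.eigvalsh(X.T @ X + lam * np.eye(4)).min()
```

Both `grad_error` and `grad_at_solution` should be small relative to a chosen tolerance, and `min_eig` should be at least about `lam`.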

EDIT: I also found a similar question on Cross Validated, but it is only vaguely answered.

Any help is appreciated. Thanks

I wonder if the [Woodbury matrix](https://en.wikipedia.org/wiki/Woodbury_matrix_identity) identity can come into play here. I would think about the parameters in the sense of projection. I don't see how you derive your second E(W) expression. Why is it correct? — EngrStudent, Jul 03 '17 at 19:49
I think the $\lambda W_s$ term should be multiplied by inverse matrix too. — seanv507, Jul 20 '17 at 18:10
and you can 'check' it by considering what happens as $\lambda$ goes to infinity — seanv507, Jul 20 '17 at 18:15
and https://stats.stackexchange.com/a/242383 gives the same answer — seanv507, Jul 20 '17 at 18:45
You are right that the $\lambda W_s$ should be multiplied with the inverse term as well. I fixed the equations above to reflect this. — anataraj, Jul 28 '17 at 12:55
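The limiting-case check suggested in the comments can be run numerically: as $\lambda$ grows, the solution should approach $W_s$, and at $\lambda = 0$ it should reduce to plain OLS (when $X$ has full column rank). The following is a rough sketch on synthetic data, with `transfer_ridge` as an assumed helper name.

```python
import numpy as np

def transfer_ridge(X, Y, W_s, lam):
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y + lam * W_s)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
Y = rng.normal(size=30)
W_s = np.array([0.5, -1.0, 2.0])   # stand-in for source-domain weights

# Very large lambda: the penalty dominates, so W should be close to W_s.
W_big = transfer_ridge(X, Y, W_s, lam=1e10)

# lambda = 0: no penalty, so W should match the OLS solution.
W_zero = transfer_ridge(X, Y, W_s, lam=0.0)
W_ols = np.linalg.lstsq(X, Y, rcond=None)[0]

gap_to_source = np.abs(W_big - W_s).max()
gap_to_ols = np.abs(W_zero - W_ols).max()
```

A version of the formula in which $\lambda W_s$ is not multiplied by the inverse would fail the first check, since it would diverge rather than converge to $W_s$.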

score 2 · Answer 1 · answered Nov 02 '17 at 16:13

It looks correct, because we can (more easily) obtain the same answer from the usual Ridge Regression solution.

Write $W - W_s = \delta$. Your objective function is (writing "$f$" rather than $E$, which could be confused with an expectation)

$$\eqalign{ 2f(W) &= (Y - XW)^\prime(Y - XW)+\lambda(W-W_s)^\prime(W-W_s)\\ &=((Y-XW_s) - X\delta)^\prime((Y-XW_s)-X\delta) + \lambda \delta^\prime \delta \\ &= (Y_s - X\delta)^\prime(Y_s - X\delta) + \lambda\delta^\prime\delta, }$$

exhibiting it as the usual Ridge Regression objective with $Y_s=Y-XW_s$ as the response and $X$ as the regressors. Therefore the optimum is given by the Ridge Regression solution (writing $\lambda$ as shorthand for $\lambda I$),

$$\hat\delta = (X^\prime X + \lambda)^{-1}X^\prime(Y_s)= (X^\prime X + \lambda)^{-1}(X^\prime Y-X^\prime XW_s).$$

Although this doesn't look the same as your solution, it is, because for any matrix $A$ for which $A+\lambda$ is invertible,

$$(A + \lambda)^{-1}A = (A + \lambda)^{-1}((A + \lambda) - \lambda) = 1 - \lambda(A+\lambda)^{-1}.$$

Taking $A=X^\prime X$ gives

$$\hat\delta=(X^\prime X + \lambda)^{-1}(X^\prime Y + \lambda W_s) - W_s.$$

Consequently

$$\hat W = \hat\delta + W_s = (X^\prime X + \lambda)^{-1}(X^\prime Y + \lambda W_s).$$

This is a nice result because it shows you can use existing software and techniques of Ridge Regression to solve this more general problem of shrinking $W$ to $W_s$ rather than to $0$. Moreover, because the Ridge Regression solution can be obtained with Ordinary Least Squares, you can apply OLS software. There shouldn't be any need to write your own or worry about convergence issues.
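To illustrate the point about reusing OLS software, here is a minimal sketch that solves the shifted problem with a plain least-squares routine. It relies on the standard data-augmentation trick (append $\sqrt{\lambda} I$ to $X$ and zeros to the response), which is one common way to get ridge from OLS and is an assumption about how one might do it, not something spelled out above. The function name and data are hypothetical.

```python
import numpy as np

def transfer_ridge_via_ols(X, Y, W_s, lam):
    """Shrink W toward W_s using only an ordinary least-squares solver."""
    d = X.shape[1]
    Y_s = Y - X @ W_s                                  # shifted response
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])   # extra rows encode the penalty
    Y_aug = np.concatenate([Y_s, np.zeros(d)])
    delta = np.linalg.lstsq(X_aug, Y_aug, rcond=None)[0]
    return delta + W_s                                 # W = delta + W_s

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))
Y = rng.normal(size=60)
W_s = rng.normal(size=5)
lam = 0.7

W_ols_route = transfer_ridge_via_ols(X, Y, W_s, lam)
# Direct closed form, for comparison.
W_closed = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ Y + lam * W_s)
```

The two routes should agree up to floating-point error, which also serves as an independent check of the closed form.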

One way to check a putative solution evidently is to compute $(X^\prime X + \lambda)\hat W$ and compare it to $X^\prime Y + \lambda W_s$.
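That check is a one-liner in practice. The sketch below wraps it in a small helper; the name `satisfies_normal_equations` and the tolerances are illustrative assumptions, and the data are synthetic.

```python
import numpy as np

def satisfies_normal_equations(X, Y, W_s, lam, W, rtol=1e-8):
    """Return True if (X'X + lam*I) W matches X'Y + lam*W_s within tolerance."""
    lhs = X.T @ (X @ W) + lam * W
    rhs = X.T @ Y + lam * W_s
    return bool(np.allclose(lhs, rhs, rtol=rtol, atol=1e-10))

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 3))
Y = rng.normal(size=25)
W_s = rng.normal(size=3)
lam = 1.0

W_hat = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ Y + lam * W_s)

ok_for_solution = satisfies_normal_equations(X, Y, W_s, lam, W_hat)
# A slightly perturbed candidate should fail the check.
ok_for_perturbed = satisfies_normal_equations(X, Y, W_s, lam, W_hat + 1e-3)
```

The check is most useful when the candidate $\hat W$ comes from a different route than the formula itself, for example an iterative optimizer.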

I don't know of any references, but I'm sure this result is well-known.
