I have a question on using ridge regression for transfer learning.
Transfer learning is a type of machine learning in which knowledge gained while performing a task in a source domain is transferred to a target domain for the same task, i.e., the task remains the same but the datasets in the two domains differ.
One way to perform transfer learning is parameter sharing. The high-level intuition is that the target domain model parameters should be very close to the source domain model parameters while still allowing for some uncertainty. Mathematically, this intuition is captured by penalizing the deviation between the two parameter vectors, i.e., $\lambda\|W_{target} - W_{source}\|^2_2$, where $\lambda$ is the penalization parameter and the $W$'s are vectors of model parameters. In the past, I have used this approach to perform transfer learning for logistic regression (LR) and conditional random fields (CRF).
I am trying to use this approach to perform transfer learning with ridge regression. Given $N$ labeled examples of the form $\{(x_1,y_1),\ldots,(x_N,y_N)\}$, recall that the penalized least-squares objective for ridge regression is,
$E(W) = \frac{1}{2}(Y - XW)^T(Y - XW)+\frac{\lambda}{2}W^TW$
and the closed form solution is,
$\hat{W} = (X^TX+\lambda I)^{-1}X^TY$
When performing transfer learning, the objective becomes,
$E(W) = \frac{1}{2}(Y - XW)^T(Y - XW)+\frac{\lambda}{2}(W-W_s)^T(W-W_s)$
where $W_s$ are the ridge regression parameters learned from the source domain. Taking the derivative w.r.t. $W$ and setting it equal to zero gives me,
$\hat{W} = (X^TX+\lambda I)^{-1}(X^TY + \lambda W_s)$
Only the second factor, $X^TY + \lambda W_s$, differs from the standard ridge solution, which has $X^TY$ in its place.
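As a sanity check on the algebra, here is a minimal NumPy sketch, not a definitive implementation. It relies on the substitution $V = W - W_s$, under which the modified objective becomes a standard ridge problem on the residual target $Y - XW_s$, so the two routes should give the same estimate. The function names `ridge` and `transfer_ridge`, the synthetic data, and the value of `lam` are my own assumptions for illustration.

```python
import numpy as np

def ridge(X, Y, lam):
    """Standard ridge estimate: (X^T X + lam I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def transfer_ridge(X, Y, W_s, lam):
    """Ridge shrunk toward W_s: (X^T X + lam I)^{-1} (X^T Y + lam W_s)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y + lam * W_s)

# Synthetic target-domain data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
W_s = rng.normal(size=5)                      # stand-in for source parameters
W_true = W_s + 0.1 * rng.normal(size=5)       # target parameters near the source
Y = X @ W_true + 0.05 * rng.normal(size=50)
lam = 2.0

# Route 1: the closed form written above.
W_direct = transfer_ridge(X, Y, W_s, lam)

# Route 2: ordinary ridge on the residual Y - X W_s, then shift back by W_s.
W_shifted = W_s + ridge(X, Y - X @ W_s, lam)

print(np.allclose(W_direct, W_shifted))
```

Setting `W_s` to the zero vector should reduce `transfer_ridge` to plain `ridge`, which is another quick consistency check.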
My questions are,
Is this mathematically correct? My hypothesis is that this modified objective will still result in a unique solution for each setting of $W_s$.
For CRFs and LR, I was able to verify that user-supplied gradients were within a tolerance level of numerically computed gradients. What checks can I perform here to confirm that the closed-form solution is correct?
Are there any references to back this up? I am thinking of using this as a baseline method in a paper, and I am unable to find any.
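One check I can think of, in the spirit of the gradient checks I used for CRFs and LR, is sketched below in NumPy under stated assumptions. It compares the analytic gradient of the modified objective, $-X^T(Y - XW) + \lambda(W - W_s)$, against a central finite-difference gradient at a random point, and then evaluates the analytic gradient at the closed-form $\hat{W}$, where it should be numerically zero. The helper names, the random data, and the step size `eps` are hypothetical choices of mine.

```python
import numpy as np

def objective(W, X, Y, W_s, lam):
    """E(W) = 0.5 * ||Y - X W||^2 + 0.5 * lam * ||W - W_s||^2."""
    r = Y - X @ W
    d = W - W_s
    return 0.5 * r @ r + 0.5 * lam * d @ d

def gradient(W, X, Y, W_s, lam):
    """Analytic gradient of E with respect to W."""
    return -X.T @ (Y - X @ W) + lam * (W - W_s)

def numerical_gradient(f, W, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at W."""
    g = np.zeros_like(W)
    for j in range(W.size):
        step = np.zeros_like(W)
        step[j] = eps
        g[j] = (f(W + step) - f(W - step)) / (2 * eps)
    return g

# Random data (assumed for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
Y = rng.normal(size=40)
W_s = rng.normal(size=4)
lam = 0.5

# Check 1: analytic vs. numerical gradient at an arbitrary point.
W0 = rng.normal(size=4)
f = lambda W: objective(W, X, Y, W_s, lam)
grad_err = np.max(np.abs(gradient(W0, X, Y, W_s, lam) - numerical_gradient(f, W0)))

# Check 2: the gradient should vanish at the closed-form solution.
W_hat = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ Y + lam * W_s)
grad_at_hat = np.max(np.abs(gradient(W_hat, X, Y, W_s, lam)))

print(grad_err, grad_at_hat)
```

Both printed values should be close to zero if the gradient and the closed form are consistent with the objective.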
EDIT: I did find a similar question on Cross Validated, but it is only vaguely answered.
Any help is appreciated. Thanks