
I'm trying to learn an estimator $f(x)=y$, where $x \in \mathbb{R}^{D_1}$ and $y \in \mathbb{R}^{D_2}$ are both column vectors. So my training sets $X$ and $Y$ are data matrices of size $D_1 \times N$ and $D_2 \times N$, respectively, where $N$ is the number of samples and the $D$'s are the input (feature) and output dimensions.

So I want to learn a $\beta$ that gives $\beta x \approx y$ in a least-squares sense. I was doing this in MATLAB simply with beta_hat = Y * pinv(X); and it seems to work without a problem. Still, I want to ask: is this correct?
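
Written out over the whole training set, the objective is $\hat \beta = \arg\min_\beta \|Y - \beta X\|_F^2$, whose normal equations are $\hat \beta \, X X^T = Y X^T$; Y * pinv(X) returns the minimum-norm least-squares solution of this system, which is why it works even when $X X^T$ is singular.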

My question:

Now I want to implement this without pinv, because I want to add regularization to it. So I came up with this solution (this is without regularization): $\hat \beta = Y (X^TX)^{-1}X^T$. Is this correct? It also works, but MATLAB complains:

Warning: Matrix is close to singular or badly scaled. Results may be inaccurate. RCOND =  2.565271e-20.

It even crashes sometimes. So I think I'm making a mistake somewhere, but where?
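
(Note: $(X^TX)^{-1}X^T$ equals pinv(X) exactly when $X$ has full column rank, i.e. $\mathrm{rank}(X) = N$; the RCOND warning is MATLAB reporting that $X^TX$ is numerically singular, so that condition is not holding here.)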

Thanks in advance,

Edit

Here is what my MATLAB code looks like (I know some variables like N appear uninitialized; I just cropped that part out, and they are set as expected):

Ntr = round(N * 0.7); % Assign first 70% of the samples as training set
Trains = 1:Ntr; Tests = Ntr+1:N;
XData = zeros(FeatureSize, N);
YData = zeros(OutputSize, N);

for n=1:N
    % Collect the independent data (into the columns of X)
    XData(:,n) = getFeature(sample(n));
    % Collect output variable for Train samples : 
    if any(Trains == n)
        YData(:,n) = getLabel(sample(n));
    end
end % for each sample

% Learn model: 
if strcmp(RegressionType, 'ordinary')
    C = YData(:,Trains) * pinv(XData(:,Trains));
elseif strcmp(RegressionType, 'ordinary_myImplementation')
    X = XData(:,Trains);
    Y = YData(:,Trains);
    C = Y * inv(X'*X)*X'; % this is where the error happens. Isn't this the same as pinv(X)?
elseif strcmp(RegressionType, 'ridge')
    X = XData(:,Trains);
    Y = YData(:,Trains);
    C = Y * inv(X'*X + alpha*eye(Ntr)) * X';
else
    error('Unknown regression type');
end

% Apply model on Test samples : 
YData(:,Tests) = C * XData(:,Tests);

Edit 2

After @Matthew Drury's suggestion, I replaced the line C = Y * inv(X'*X)*X'; with C = linsolve(X',Y')';

But now I'm getting this warning:

Warning: Rank deficient, rank = 17, tol =  2.729816e-12.

Is this normal?

jeff
  • Avoid using `pinv` if possible and use the `mldivide` operator. You do not show the way you try to solve this system. Especially if you use ridge regression, i.e. you solve $(X^TX + \lambda I)^{-1} (X^T Y)$, it is even harder to get a singular matrix (you shift all the eigenvalues up by $\lambda$, essentially). Please give some reproduction code so we can see your estimation procedure; I strongly suspect that the issue stems from there. – usεr11852 Oct 16 '15 at 17:49
  • Ok I added my MATLAB (kinda-pseudo) code. – jeff Oct 16 '15 at 18:03
  • @usεr11852 is correct. You should almost never explicitly invert a matrix in numerical code; favor a linear-equation solver instead. I'm not familiar with MATLAB, but there should be a function like `solve(A, y)` which returns the solution(s) to the linear equation $Ax = y$. Always favor this. Here's a nice blog post on the subject: http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/ – Matthew Drury Oct 16 '15 at 18:38
  • @MatthewDrury thanks! How can I do this in MATLAB? – jeff Oct 16 '15 at 18:42
  • It looks like it's called `linsolve`: http://www.mathworks.com/help/matlab/ref/linsolve.html – Matthew Drury Oct 16 '15 at 18:44
  • Thanks so much, Matthew. Please see the update. I think we are solving the equation with too few samples, so this warning makes sense to me. Is it normal? – jeff Oct 16 '15 at 19:13
  • This code is suboptimal both for the estimation you want to do and for the presentation. For the latter, see this thread: [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — it is for R but **EXACTLY** the same applies to MATLAB. Technically, just use `X\Y`; you have overcomplicated things. You do not tell us your $N$, $D_1$, or $D_2$, so we cannot say whether you have a sample-size issue. If you have $D_1$ or $D_2$ larger than 17, this will be rank deficient (based on the warning message). – usεr11852 Oct 16 '15 at 19:20
  • @usεr11852 thank you so much for your answer. I added the explanations. In short, $N$ is the number of samples and the $D$'s are the I/O dimensions. I'm probably overcomplicating things, yes. But I want to fully understand what's going on in the system, so in this case I kind of want to "re-invent the wheel" :) **Edit**: And I would really appreciate it if someone could show me the code to apply left division in the ridge regression case. – jeff Oct 16 '15 at 19:36
  • @usεr11852 I think my confusion comes from the notation. For example in [mldivide](http://www.mathworks.com/help/matlab/ref/mldivide.html), it solves for `Ax = B`, whereas I want to solve for `xA = B`. – jeff Oct 16 '15 at 19:41
  • You are just transposing things; this should not be an issue. You are mixing up your matrices anyway, as you take your $A$ to be $D_1 \times N$ where, under the usual normal-equations notation, such a system would be written as $N \times D_1$. Just to confirm the obvious: both $D_1$ and $D_2$ are smaller than $N$, right? – usεr11852 Oct 16 '15 at 21:45
  • I know, I got used to the "wrong" notation, so all my code is organized around it. And no, usually $N$ is much smaller than $D_1$; $D_2$ is small ($\sim 60$), and I reduce the number of features ($D_1$) with PCA, but the input dimension is still often bigger than the number of samples. That's why I want to regularize the learning; it overfits otherwise. – jeff Oct 16 '15 at 21:48
  • Given that $N$ is smaller than $D_1$ and $D_2$, I suggest you look at the thread [What is rank deficiency, and how to deal with it?](http://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it) in detail. Advice: bite the bullet now and get *un*used to the non-standard notation; the bugs from accidental transposition you will save yourself from, as well as the increased readability of your code to yourself and others, make it a worthwhile time investment. – usεr11852 Oct 16 '15 at 22:33
  • @usεr11852 yes, I was planning on doing that. I think it's time now :) One final question: if I switch to row-samples notation, then I need to solve $\beta X = Y$, and not $X\beta = Y$, right? – jeff Oct 16 '15 at 22:35
  • $(AB)^T = B^T A^T$... – usεr11852 Oct 16 '15 at 23:06

1 Answer


Based on what was said in the comments, it appears you are solving an under-determined system: the number of samples $N$ is smaller than the number of features $D_1$, so you have fewer equations than unknowns. Because of this, you will always be faced with rank deficiency of the design matrix (the reported rank of 17 is the numerical rank of your training data); the thread [What is rank deficiency, and how to deal with it?](http://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it) gives more information on the matter.

Notation-wise, as discussed, it will be better if one uses the standard notation where the number of samples $N$ is the number of rows rather than the number of columns of the design matrix. As mentioned, the general identity to remember when making this change is $(AB)^T = B^TA^T$.
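
Concretely, transposing the model $\beta X = Y$ gives $X^T \beta^T = Y^T$, which is exactly the $Ax = B$ shape that MATLAB's solvers expect: solve for $\beta^T$ and transpose the result back.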

As mentioned both by me and @Matthew Drury, you should seriously avoid using pinv unless it is explicitly stated in the algorithm you use (i.e. the authors say something like "taking the Moore-Penrose pseudoinverse"). You should use mldivide instead (linsolve is also an option, but it is redundant in your scenario; the Computational Science SE thread on the differences between linsolve and mldivide offers more clarification on the matter).
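
To make this concrete, here is a minimal sketch in the column-sample convention of your question (assuming X is $D_1 \times N$, Y is $D_2 \times N$, and alpha > 0 is the ridge penalty; variable names are illustrative):

% Ordinary least squares: solve X' * B = Y' for B = C', then transpose back.
% mldivide returns a least-squares solution and merely warns (instead of
% failing) when X' is rank deficient.
C_ols = (X' \ Y')';

% Ridge regression: beta = Y*X' / (X*X' + alpha*I), computed via a solve
% on the transposed system rather than an explicit inverse. For alpha > 0
% the matrix X*X' + alpha*eye(D1) is symmetric positive definite.
D1 = size(X, 1);
C_ridge = ((X*X' + alpha*eye(D1)) \ (X*Y'))';

This is algebraically the same ridge estimator as the $Y(X^TX + \alpha I_N)^{-1}X^T$ form in your code, via the identity $(X^TX + \alpha I_N)^{-1}X^T = X^T(XX^T + \alpha I_{D_1})^{-1}$, but no explicit inverse is ever formed.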

usεr11852