1

I am in the process of making a vba subroutine in Excel to perform Monte Carlo bootstrap cross validation on a regression formula. I've got the regression part, and the MSE of the CV's part finished. I am running the MonteCarloCV through 100 replications. I recently am trying to add more statistics to my output which includes a correlation matrix for each replication of input data.

This is where I am running into the problem. About 10% of the time, the diagonal of my covariance matrix is degenerate which leads to me not being able to create a correlation matrix. My question is if this is common place, and if it is what work-arounds there are for generating the correlation matrix.

I am fairly certain that I am calculating each correctly, but here's an example of what I'm seeing:

Randomly mark 30 out of 30 records to be used in the training dataset. which comes out to ~20 of the 30 being marked for training:

(These are only X values, I did not want to include Y for the matrix) $$ \begin{matrix} -27&51&65&20&1\\ -14&76&19&2&1\\ -1&82&6&24&1\\ -21&99&78&3&1\\ 27&90&79&19&1\\ -3&79&37&14&1\\ 27&79&18&15&1\\ 8&80&28&13&1\\ -29&48&40&17&1\\ -18&86&99&12&1\\ 19&85&3&12&1\\ 9&100&85&15&1\\ -23&50&86&2&1\\ -17&90&47&1&1\\ 2&41&77&20&1\\ 28&84&86&20&1\\ 24&80&78&19&1\\ \end{matrix} $$

I then find the covariance matrix of the training dataset by using the equation: $$ a = A - 11^{T} A (1/n)$$

$$V = a^{T} a (1/n)$$

Which yields the following matrix:

$$ \begin{matrix} 393.0726644&135.3667820&-53.12456747&64.27681661&0\\ 135.3667820&302.0138408&-10.9480968&-22.31141868&0\\ -53.12456747&-10.9480968&946.2975779&1.332179931&0\\ 64.27681661&-22.31141868&1.332179931&50.00692042&0\\ 0&0&0&0&0\\ \end{matrix} $$

Then I use the following formula to calculate the correlation matrix:

$$ D = diag(\sqrt{V}) $$

$$ p = D^{-1} V D^{-1} $$

and the calculation of $$ D^{-1} $$ is what is causing problems since it's determinate = 0 which throws a div/0 error.

Let me know if more information is needed. I didn't want to post the code because it is rather long, but I can if it would help.

  • You are being inconsistent in constructing $a$ and including the constant column in the calculation. You can do one or the other but not both. Use a tiny dataset--say, three rows and two columns (including the constant) and compare what you're doing with the textbook formulas. – whuber Nov 20 '17 at 20:05
  • @whuber Not exactly sure what you're asking. I did work the problem by hand just now and realized I entered a wrong number into my program to calculate the covariance matrix. I fixed it and edited my answer, but am still getting the same problem with the correlation matrix. – Acumen Simulator Nov 20 '17 at 20:52
  • Correlation to a constant is going to be zero – Aksakal Nov 20 '17 at 20:55
  • Basically, dont try to bootstrap the constant, just set it. It's not a bona fide input, anyways – Aksakal Nov 20 '17 at 21:01
  • @Aksakal I have to include the constant in my OLS algorithm, but you're saying to leave it out for calculating the Correlation matrix? I understand why its zero for the constant, just not why it only causes a 0 determinant only part of the time. – Acumen Simulator Nov 20 '17 at 21:05
  • 1
    Nobody includes the constant in correlation matrix when generating inputs while bootstrapping with monte carlo, trust me on this. – Aksakal Nov 20 '17 at 21:13
  • See https://stats.stackexchange.com/questions/68151/how-to-derive-variance-covariance-matrix-of-coefficients-in-linear-regression for the correct formula and its derivation. – whuber Nov 20 '17 at 21:57
  • @whuber I am calculating the covariance and correlation of X not b. Just included the OLS stuff for context. Wanted to ultimately see if my cross validation and b estimates were less overfit for Monte Carlo replications with low intervariable correlation. – Acumen Simulator Nov 20 '17 at 22:06
  • That helps clarify things--until now, your meaning of "$\sqrt{V}$" was ambiguous. Skip the formulas and go back to the original definition of correlation: you divide the covariance by the two standard deviations, *provided* the standard deviations are nonzero. When they are zero you can declare the correlation to be undefined or zero, depending on your application. – whuber Nov 20 '17 at 22:22

1 Answers1

1

After great comments from whuber and Aksakal, I arrived at the following conclusion:

Since the covariance and variance of the OLS constant is 0, the standard deviation is also 0 which causes the correlation to be either 0 or undefined.

Using this, I eliminated the constant vector from the analysis (since the analysis would not be telling us something we already didn't know about it) and ran the program on the remaining variables. This leading to the following correlation matrix:

$$ \begin{matrix} 1&0.392882116&-0.087105408&0.458461513\\ 0.392882116&1&-0.020479123&-0.181551026\\ -0.087105408&-0.020479123&1&0.006123983\\ 0.458461513&-0.181551026&0.006123983&1\\ \end{matrix} $$

I have also run the vba program for 100s of replications on a few datasets, in which the problem seems to be fixed in entirety.