Is there a "rule of thumb" for calculating the amount of RAM required on a computer to conduct a multivariate regression in run-time $t$ with a $p \times n$ design matrix?

Assuming I have to find the product $X^TX$ when solving for OLS coefficients, is it desirable to have enough RAM to hold an $n \times n$ dataset in memory?

I run analyses in both SAS and R, which have different capabilities, so I realize answers may differ based on which language I'm using.
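For scale, the storage side of the question can be sketched with plain arithmetic: a dense matrix of double-precision values takes 8 bytes per entry. The sizes below (`n_obs`, `n_vars`) are hypothetical, and the sketch counts only the matrices themselves, not the working copies that SAS or R may make.

```python
BYTES_PER_DOUBLE = 8  # IEEE 754 double precision


def dense_matrix_bytes(rows, cols):
    """Bytes needed to store a dense rows x cols matrix of doubles."""
    return rows * cols * BYTES_PER_DOUBLE


def to_gib(num_bytes):
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical problem size, chosen only for illustration.
n_obs = 1_000_000   # observations
n_vars = 100        # variables

design = dense_matrix_bytes(n_obs, n_vars)        # the design matrix
cross_small = dense_matrix_bytes(n_vars, n_vars)  # cross-product over variables
cross_large = dense_matrix_bytes(n_obs, n_obs)    # cross-product over observations

print(f"design matrix:            {to_gib(design):.3f} GiB")
print(f"variables x variables:    {to_gib(cross_small):.6f} GiB")
print(f"observations x observations: {to_gib(cross_large):.1f} GiB")
```

Which of the two cross-product sizes applies depends on how the design matrix is oriented; a product indexed by variables is tiny next to the data, while one indexed by observations would dwarf it.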

Momo
RobertF
  • I edited the maths in your question, and changed it from $(X^T)X$ to $X^TX$, which is hopefully what you wanted. Or was it $(X^TX)^{-1}$? – Momo Oct 02 '13 at 17:42
  • For R: see http://adv-r.had.co.nz/memory.html – Metrics Oct 02 '13 at 17:44
  • Thanks Momo - yes I'm assuming the $(X^T)X$ product is the most memory intensive step in OLS calculations. – RobertF Oct 02 '13 at 17:46
  • Although software-specific questions are off-topic here, a rough idea of how much memory the algorithms used to run a multivariate regression will require seems to me to be on-topic. – Silverfish Sep 28 '16 at 15:36
  • Are you actually computing $X^TX$? Because that's a notoriously unstable way to go about solving this problem -- you want to use a more numerically stable method such as QR decomposition. – Sycorax Sep 30 '16 at 19:28
  • @Sycorax No, and understandably the memory required to find a solution to $X^T X$ depends upon which decomposition algorithm is used. I suppose the question then becomes which algorithms are used by SAS and R in `proc reg` and `lm()`, respectively. – RobertF Sep 30 '16 at 19:44
  • @RobertF Ah. Well for what it's worth -- I can't speak to SAS but the default in R is QR decomposition. – Sycorax Sep 30 '16 at 19:45

1 Answer

I may have a simple solution to my question, inspired by the article posted by Metrics:

Run a few test regressions on several small random samples from the analysis dataset, say 0.5%, 1%, and 2% of the rows. Do the same with varying numbers of variables (columns).

Record the CPU time and RAM usage for each sample, then extrapolate to the full sample. If the relationship between sample size and resource use looks nonlinear, more sample sizes may be needed to pin it down.
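The procedure above can be sketched as follows. This is a minimal illustration, not a benchmark of `lm()` or `proc reg`: `regression_stand_in` is a hypothetical pure-Python workload (it builds random data and forms a cross-product), `n_full` and `p` are made-up sizes, and the straight-line fit assumes time and memory grow roughly linearly in the number of rows for a fixed number of columns.

```python
import random
import time
import tracemalloc


def regression_stand_in(n, p):
    """Hypothetical workload: build an n x p random matrix and form X^T X."""
    rows = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    xtx = [[0.0] * p for _ in range(p)]
    for row in rows:
        for i in range(p):
            for j in range(p):
                xtx[i][j] += row[i] * row[j]
    return xtx


def measure(n, p):
    """Return (seconds, peak_bytes) for one run of the stand-in workload."""
    tracemalloc.start()
    start = time.perf_counter()
    regression_stand_in(n, p)
    seconds = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return seconds, peak


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def extrapolate(xs, ys, x_full):
    """Predict y at x_full from a straight line fitted to (xs, ys)."""
    a, b = fit_line(xs, ys)
    return a + b * x_full


random.seed(0)
n_full, p = 200_000, 5             # hypothetical full-data size
fractions = [0.005, 0.01, 0.02]    # the 0.5%, 1%, and 2% samples

sizes, times, peaks = [], [], []
for frac in fractions:
    n = int(n_full * frac)
    seconds, peak = measure(n, p)
    sizes.append(n)
    times.append(seconds)
    peaks.append(peak)

print("estimated seconds at full size:", extrapolate(sizes, times, n_full))
print("estimated peak MiB at full size:", extrapolate(sizes, peaks, n_full) / 2**20)
```

With only three sample sizes the fit is fragile; adding more sizes and plotting the measurements before extrapolating helps reveal any curvature. In practice the measured call would be the actual regression in R or SAS rather than this stand-in.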

RobertF