2

The estimation in logistic regression (https://retostauffer.github.io/Rfoehnix/articles/logisticregression.html) is done via Newton's method, where the Hessian is $$ H = -X^T W X, $$ with $X$ the feature matrix, $W$ a diagonal weight matrix, and the sizes of $X$ and $W$ being $n \times p$ and $n \times n$, respectively. In each Newton step we need to invert the Hessian matrix $H \in \mathbb{R}^{p \times p}$, assuming $H$ is invertible.
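For concreteness, here is a minimal sketch of such a Newton/IRLS update in R (this is not the foehnix implementation; `newton_logistic` is a made-up name for illustration). The step $\delta$ is obtained by solving the linear system $H \delta = \nabla \ell(\beta)$ rather than forming $H^{-1}$ explicitly:

```r
## Minimal illustrative sketch, not the foehnix code.
## X: n x p model matrix (include an intercept column if wanted), y: 0/1 response.
newton_logistic <- function(X, y, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    eta   <- drop(X %*% beta)
    prob  <- plogis(eta)                 # fitted probabilities
    w     <- prob * (1 - prob)           # diagonal of W, stored as a length-n vector
    grad  <- crossprod(X, y - prob)      # score vector X'(y - prob), length p
    H     <- crossprod(X, w * X)         # X'WX, the p x p (negative) Hessian
    delta <- drop(solve(H, grad))        # solve H %*% delta = grad; no explicit inverse
    beta  <- beta + delta
    if (max(abs(delta)) < tol) break
  }
  beta
}
```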

The question is: given a sufficient sample size $n$, how large a $p$ can the logistic regression implementations in software such as Python, R, MATLAB, or others handle? What are the limitations for larger values of $p$?

I guess the answer depends on the sparsity of $H$, so there should be two cases: when $H$ is very dense and when $H$ has some structured sparsity.

Note that regularized logistic regression exists, but regularization techniques are not the focus of this question.

vtshen
  • "In each Newton step we need to invert the Hessian matrix." Not really, it's easier to solve the system of linear equations for the update than explicitly invert the matrix. – Matthew Drury Apr 03 '19 at 18:31
  • @MatthewDrury Thanks for the quick response. Do you mean to factorize H first, and store them for future use? Do you have an example to support the claim? – vtshen Apr 03 '19 at 19:51
  • What happens when you try to form the inverse of a rank-deficient matrix? Is $X$ full rank when $p > n$? – Sycorax Apr 03 '19 at 20:01
  • Possible duplicate of [Why does logistic regression not work in p > n setting?](https://stats.stackexchange.com/questions/139353/why-does-logistic-regression-not-work-in-p-n-setting) – Sycorax Apr 03 '19 at 20:05
  • @Sycorax That is a good point. Although I said we are given a sufficient sample size $n$, I should have written it out more clearly: in this problem, we assume $H$ is invertible. – vtshen Apr 03 '19 at 20:23
  • If you're already assuming that $H$ is invertible, then you seem to have answered your question: we're in the situation where $p \le n$. I don't understand what remains to be answered. Are you asking if there is a *computational* limit on $p$ when you're using R or python to do logistic regression? Because that answer depends more on your computer (how much memory you have and how long you're willing to wait) than it does on math -- and also is not on-topic (see [help]). – Sycorax Apr 03 '19 at 20:28
  • @Sycorax Thanks for the quick reply. I agree that we can say we are in a situation where $p < n$. There could be a computational limit in the software, and I would like to know about the relationship between the value of $p$ and the memory or the waiting time. But I feel there are also some mathematical/statistical aspects that are significant in determining how large $p$ can be. For example, MatthewDrury mentions that "it's easier to solve the system of linear equations for the update than explicitly invert the matrix". – vtshen Apr 03 '19 at 20:35

1 Answer

2

In theory, the largest dimension is only determined by the amount of memory available.

I have more often seen logistic regression implemented via iteratively reweighted least squares (IRLS). This master's thesis ("Stochastic Gradient Descent for Efficient Logistic Regression", Alexander Thorleifsson, Stockholm University, 2016) compares the memory requirements of IRLS and stochastic gradient descent (SGD):

Specifically, SGD algorithms only require O(n) memory which is the minimum required for storing the ith iterate $\hat \theta_i$ , where O is a notation for memory requirement, see more details on page 11. IRLS on the other hand, when implemented with the standard function glm in R, requires roughly O($mn^2$) of memory and when implemented with biglm, designed to work especially well with big data sets, require O($n^2$).

where $n$ is the number of parameters/predictor variables ($p$ in your notation) and $m$ is the number of observations.

Since the memory used by the bigglm function (in the biglm package) grows with $p$ but is independent of the number of observations, your logistic regression should be fine as long as $p$ is small enough that a $p \times p$ matrix fits comfortably in memory. (The biglm package also includes the capability to process the data a chunk at a time, for cases where the data set is too big to fit in memory in the first place.)
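As a rough illustration (the data below are simulated, and the variable names and chunk size are arbitrary choices for the example), a chunked logistic-regression fit with bigglm might look like this:

```r
## Illustrative only: simulated data, arbitrary chunk size.
library(biglm)

set.seed(1)
n <- 1e5; p <- 20
dat <- as.data.frame(matrix(rnorm(n * p), n, p))    # predictors V1..V20
dat$y <- rbinom(n, 1, plogis(rowSums(dat[, 1:3])))  # binary response

form <- reformulate(names(dat)[1:p], response = "y")
fit  <- bigglm(form, data = dat, family = binomial(), chunksize = 10000)
summary(fit)
```

The memory footprint of the fit is driven by the $p \times p$ cross-products matrix, not by the number of rows processed.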

The thesis goes on to point out that SGD is also computationally more efficient, because it avoids the repeated matrix inversions of IRLS:

Theoretically, SGD algorithms are more efficient since they replace the inversion of n × n matrices, where n is the number of parameters, in IRLS with a scalar sequence $\alpha_i$ and a matrix $C_i$ that are faster to manipulate, by design (Tran et. al, 2015).
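To make the contrast concrete, here is a minimal toy sketch of plain SGD for logistic regression in R (a simplified illustration, not the algorithm of Tran et al.; the step-size schedule is an arbitrary assumption). Only the length-$p$ coefficient vector is kept and updated; no $p \times p$ matrix is ever formed or inverted:

```r
## Toy SGD for logistic regression; illustrative only.
## X: n x p model matrix, y: 0/1 response.
sgd_logistic <- function(X, y, epochs = 5, alpha0 = 0.5) {
  beta <- rep(0, ncol(X))
  t <- 0
  for (e in seq_len(epochs)) {
    for (i in sample(nrow(X))) {            # one observation at a time
      t     <- t + 1
      alpha <- alpha0 / (1 + 0.01 * t)      # decaying step size (arbitrary schedule)
      p_i   <- plogis(sum(X[i, ] * beta))   # fitted probability for observation i
      beta  <- beta + alpha * (y[i] - p_i) * X[i, ]  # gradient step on the log-likelihood
    }
  }
  beta
}
```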

Ben Bolker