I have a data set of 3000 observations with 9 variables, and I'm trying to predict whether water is safe for drinking. Regular multivariate logistic regression isn't that good at forecasting, and none of the coefficients is significant, even if I run univariate logistic regressions. This is why I thought of regularization, but I wasn't able to find an explanation of it and of when it is appropriate to use. Also, if they exist, I'd be happy for a reference to R functions.
-
Is your measured response variable binary or some measure of contamination (e.g., 6 parts per million coronavirus)? – Dave Jun 27 '21 at 19:54
-
It is binary: safe or not safe – Ift h Jun 27 '21 at 20:14
-
There are plenty of regularised GLM implementations out there, I believe. glmnet is quite popular and has a vignette. However, do you have any expectation of what the relationship is between the inputs and "safe"? E.g., I could imagine "not safe to drink" being legally defined as chemical 1 > conc 1 or chemical 2 > conc 2 or chemical 3 > conc 3. I don't believe you can fit this with a logistic regression (without adding some nonlinearities). – seanv507 Jun 27 '21 at 21:10
-
Statistical significance has nothing to do with regularization and forecasting. What doesn’t work about forecasting with logistic regression for you? – Tim Jun 27 '21 at 21:17
-
This is mostly a class exercise. There are all kinds of substances and measures, such as chloramines and pH levels. The prediction accuracy is around 58%, which is quite poor in such cases, as this is a health issue. – Ift h Jun 28 '21 at 14:24
1 Answer
Regularisation aims to reduce the effects of the design matrix being overdetermined or underdetermined. Recall solving $Ax=b$ with $A \in \mathbb{R}^{m \times p}$. Regularisation is appropriate if $p \gg m$ (underdetermined) or $p \ll m$ (overdetermined). Here $m \gg p$, so the problem is overdetermined ($m=3000$, $p=9$).
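To make the link to $Ax=b$ concrete, one common way to write a regularised least-squares problem is sketched below. The penalty weight $\lambda \ge 0$ and the mixing parameter $\alpha \in [0,1]$ are notation introduced here for illustration; $\alpha = 1$ gives the LASSO, $\alpha = 0$ gives ridge, and values in between give the elastic net. For logistic regression the squared-error term would be replaced by the negative log-likelihood.

$$
\hat{x} = \arg\min_{x \in \mathbb{R}^{p}} \; \frac{1}{2m}\,\lVert Ax - b \rVert_2^2 \;+\; \lambda\left(\alpha\,\lVert x \rVert_1 + \frac{1-\alpha}{2}\,\lVert x \rVert_2^2\right)
$$

With $\lambda = 0$ this reduces to ordinary least squares; increasing $\lambda$ shrinks the coefficients towards zero, and the $\lVert x \rVert_1$ term can set some of them exactly to zero.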
Using LASSO or elastic-net regularisation is recommended instead of plain logistic regression. Without regularisation, the fitted coefficients may be unreliable. glmnet's introduction vignette gives a good idea of how to use LASSO and elastic-net regularisation.
See also rank deficiency.
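As a rough illustration of the glmnet route, here is a minimal sketch of a cross-validated LASSO logistic regression. It assumes the glmnet package is installed; the simulated `x` and `y` are hypothetical stand-ins for the nine water-quality variables and the safe/not-safe label, not the actual data.

```r
# Sketch only: simulated data stands in for the real water-quality set.
library(glmnet)

set.seed(1)
n <- 3000
p <- 9
x <- matrix(rnorm(n * p), nrow = n, ncol = p)   # hypothetical predictors
colnames(x) <- paste0("v", seq_len(p))

# Hypothetical binary outcome driven by the first two predictors only
prob <- plogis(0.8 * x[, 1] - 0.5 * x[, 2])
y <- factor(rbinom(n, size = 1, prob = prob), levels = c(0, 1))

# alpha = 1 is the LASSO; 0 < alpha < 1 is the elastic net; alpha = 0 is ridge.
# cv.glmnet picks lambda by 10-fold cross-validation on misclassification error.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    type.measure = "class", nfolds = 10)

# Coefficients at the more conservative "one standard error" lambda
beta <- coef(cv_fit, s = "lambda.1se")
print(beta)

# Predicted classes at the lambda with the lowest cross-validated error
pred <- predict(cv_fit, newx = x, s = "lambda.min", type = "class")
acc <- mean(as.vector(pred) == as.character(y))  # in-sample, so optimistic
print(acc)
```

The in-sample accuracy printed at the end is optimistic; the cross-validated error stored in `cv_fit$cvm` is a fairer guide. If the true rule is a set of thresholds on individual chemicals, as suggested in the comments, a penalty alone is unlikely to help without adding nonlinear terms.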

msuzen
-
Agreed on glmnet—and just to add on, for a general background into what the LASSO does conceptually, I recommend *Introduction to Statistical Learning* by James et al. The R labs are outdated, but the concepts are explained in an intuitive and accessible way. – Mark White Jun 27 '21 at 23:00
-
It’s just been updated with a new edition. Haven’t checked the R labs but presumably they’ve been updated, too. – Mooks Sep 06 '21 at 16:46