3

My situation:

  • small sample size: 116
  • binary outcome variable
  • long list of explanatory variables: 44
  • explanatory variables were not picked off the top of my head; their choice was based on the literature.

Statistical method chosen: logistic regression

I need to find the variables that best explain variations in the outcome variable (I am not interested in making predictions).

The problem: This question is a follow-up to the two questions listed below. From them, I gathered that performing automated stepwise regression has its downsides; in any case, it seems that my sample size would be too small for that. It also seems that my sample is too small to enter all variables at once (using the SPSS 'Enter' method). This leaves my issue unresolved: how can I select a subset of variables from my original long list in order to perform a multivariable logistic regression analysis?

UPDATE1: I am not a statistician, so I would appreciate it if jargon could be kept to a minimum. I am working with SPSS and am not familiar with other packages, so options that can be run in that software would be highly preferable.

UPDATE2: It seems that SPSS does not support the LASSO for logistic regression. So, following one of your suggestions, I am now struggling with R. I have worked through the basics and managed to run a univariate logistic regression successfully using the `glm` function. But when I tried `glmnet` with the same dataset, I received an error message. How can I fix it? Below is the code I used, followed by the error message:

```r
library(glmnet)

# values coded 99 through 9999 are treated as missing
data1 <- read.table("C:\\data1.csv", header = TRUE, sep = ";",
                    na.strings = 99:9999)

y <- data1[, 1]     # binary outcome in the first column
x <- data1[, 2:45]  # the 44 explanatory variables

glmnet(x, y, family = "binomial", alpha = 1)
```

```
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  (list) object cannot be coerced to type 'double'
```
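
The `(list) object cannot be coerced` error comes from passing a data frame where `glmnet` expects a numeric matrix; a minimal sketch of the usual fix (same hypothetical `data1` layout as above):

```r
x <- as.matrix(data1[, 2:45])  # glmnet needs a numeric matrix, not a data frame
y <- as.factor(data1[, 1])     # and a factor (or 0/1 vector) for family = "binomial"
glmnet(x, y, family = "binomial", alpha = 1)
```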

UPDATE3: I got another error message, now related to missing values. My question concerning that matter is here.

Puzzled
    What is the sample size for the smaller of the two outcome categories? – Matt Reichenbach Jun 11 '14 at 14:57
  • @Matt Reichenbach: The sample size for the smaller of the two outcome categories is 32. – Puzzled Jun 11 '14 at 21:56
  • Hard to be sure without a reproducible example, but `y` needs to be a factor; so, if it's not already, try `y <- as.factor(y)`. – Scortchi - Reinstate Monica Jun 20 '14 at 11:52
  • @Scortchi: Just tried that. But now I am getting the following error message: `Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : NA/NaN/Inf in foreign function call (arg 5)`. Maybe some problem with my missing values? What do you think? – Puzzled Jun 20 '14 at 12:32
  • Remove cases or predictors with missing values to find out (see the sketch after these comments). As to the best way to deal with missing values in the LASSO, it merits a question of its own IMO, if it hasn't already been asked. If it's just a few of the 44 predictors that have missing values, I'd probably favour excluding those predictors rather than trying to impute the missing values. – Scortchi - Reinstate Monica Jun 20 '14 at 13:43
  • @Scortchi: I tried that and got no error message. Unfortunately, most of my 44 variables have missing values. I will look up about missing values in LASSO and maybe come up with a new question. – Puzzled Jun 20 '14 at 20:25
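
A minimal sketch of the check suggested in the comments above (assuming the `data1` layout from the question; all names are placeholders):

```r
# Option 1: keep only rows with no missing values at all
cc <- complete.cases(data1)
x_cc <- as.matrix(data1[cc, 2:45])
y_cc <- as.factor(data1[cc, 1])

# Option 2: keep only predictors that contain no missing values
keep <- colSums(is.na(data1[, 2:45])) == 0
x_keep <- as.matrix(data1[, 2:45][, keep])
```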

2 Answers

5

You can perform selection and logistic regression simultaneously using the LASSO or Elastic Net regression algorithms. The basic idea behind the LASSO is to solve the $l_1$-penalized optimization problem $$\min_{\beta} \{ -\ell(\beta) + \lambda\|\beta\|_1 \},$$ where $\ell(\cdot)$ is the log-likelihood function. Popular implementations, e.g. `glmnet`, efficiently solve this for a grid of $\lambda$ values. This is useful because we usually don't know $\lambda$ a priori and need to apply some type of cross-validation. If you have correlated features, it helps to add some $l_2$ (ridge) penalty as well, which is the idea behind the Elastic Net.
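
A minimal sketch of this workflow with R's `glmnet` package (assuming `x` is a complete numeric matrix of predictors and `y` a binary factor; the object names are placeholders):

```r
library(glmnet)

# Cross-validate the lasso (alpha = 1) over glmnet's automatic lambda grid
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the cross-validated lambda: predictors whose
# coefficient is exactly zero have been selected out of the model
coef(cv_fit, s = "lambda.min")

# Setting 0 < alpha < 1 mixes in an l2 (ridge) penalty: the Elastic Net
en_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)
```

(`lambda.1se`, the largest $\lambda$ within one standard error of the cross-validation minimum, is a more conservative alternative to `lambda.min`.)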

Since you don't have a lot of data, I think this is probably your best bet. If you were to use a separate variable selection stage instead, you would need to choose both a metric (e.g. the deviance of each single-variable regression) and a threshold. The LASSO gives you only one parameter to tune and performs selection directly within the multivariable logistic regression model.

EDIT: The question now specifically requests an approach that is implemented in SPSS. As I don't have or use that software, I don't know whether lasso logistic regression is implemented there. Perhaps someone can let us know in the comments.

MichaelJ
  • I seriously doubt that logistic regression lasso is implemented in SPSS, which is what the OP apparently has to use. – StasK Jun 12 '14 at 02:48
  • 1
    SPSS has had LASSO and Elastic Net since version [17](https://teamsite.smu.edu.sg/wiki/stats/Shared%20Documents/newfeaturesSPSSv17.pdf). Does it support logistic regression? I'm not sure. Regardless, the OP mentions SPSS, but I don't think it's apparent that SPSS must be used. Anyone can use R; it's free. I gave a link to an R package, but the question was about regression, not software. – MichaelJ Jun 12 '14 at 03:34
  • I do have to use SPSS, I am not familiar with other packages. Right now I am trying to read about Lasso, and run it in SPSS. – Puzzled Jun 12 '14 at 13:11
  • @MichaelJ, anybody can use Excel, it is on everybody's computer anyway. – StasK Jun 12 '14 at 13:35
  • 1
    @StasK, your statement RE: Excel is false. I don't have it on my computer and it costs $$$. – MichaelJ Jun 12 '14 at 14:40
  • OK, anybody can write their own C code; C is available for free in the GNU Compiler Collection. Please respect what other people are asking about. In an industry environment, you may have a specific package installed for you by the company IT department, and you cannot install anything of your own, such as R. – StasK Jun 13 '14 at 13:23
  • @StasK: Companies also have procedures for approving new software, so it's useful to know what's out there. In any case, in general, & for the benefit of all readers, ideal answers to such questions should say (1) what you can do, & (2) what you can do with software $X$. Giving the first bit & hoping someone else can supply the second is in no way disrespectful. – Scortchi - Reinstate Monica Jun 14 '14 at 10:17
  • 2
    The specific request for SPSS was a later edit to the question that wasn't there when I gave my answer. Hence my later edit. – MichaelJ Jun 14 '14 at 12:59
  • IMO your answer's fine: if someone can add info about SPSS's current capabilities it's just the icing on the cake. (And it's probably easy enough for the OP himself to find out - knowing what version he uses, having the manuals to hand, &c.) – Scortchi - Reinstate Monica Jun 14 '14 at 16:54
  • 1
    From what I could gather about the LASSO, it seems a good option. But I am struggling to experiment with it in SPSS. Among the types of regression it offers there is CATREG (optimal scaling, categorical regression), which gives a LASSO option for regularization. Is this what I am aiming at? If so, I will later ask for help with setting it up and with the error messages I am getting. The logistic regression routine does not offer the LASSO as an option. Note: SPSS's help section is far from being as straightforward as non-statisticians need, in my opinion. – Puzzled Jun 16 '14 at 13:45
  • @MichaelJ: Maybe you could help me with the last update to my question? – Puzzled Jun 19 '14 at 21:39
4

Besides the excellent suggestions about using shrinkage approaches, quadratic penalization should also be considered (we have a case study on this in J Clinical Epidemiology, first author Moons). Other than that, data reduction or redundancy analysis (all masked to $Y$) can play an important role, e.g., combining variables that are hard to separate. Variable clustering and principal components are two of many data reduction methods. With the number of events available, the 15:1 rule would indicate that reduction down to two factors (masked to $Y$) is needed.
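
For concreteness, a rough sketch of both ideas in R (assuming a complete numeric predictor matrix `x` and binary outcome `y`; a sketch under those assumptions, not a prescription for this dataset):

```r
library(glmnet)

# Quadratic (ridge) penalization: alpha = 0 shrinks all coefficients
# toward zero but keeps every variable in the model
ridge_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)
coef(ridge_fit, s = "lambda.min")

# Data reduction masked to Y: summarize the predictors by their first
# two principal components (roughly the 15:1 rule with 32 events),
# then fit an ordinary logistic regression on the component scores
pc <- prcomp(x, scale. = TRUE)
scores <- pc$x[, 1:2]
summary(glm(y ~ scores, family = binomial))
```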

Frank Harrell
  • 3
    (+1) Just like to add that data reduction guided by subject-matter knowledge & common sense alone can also be effective - which might be appealing to a non-statistician. – Scortchi - Reinstate Monica Jun 12 '14 at 17:24
  • 1
    Quadratic penalization won't select a subset of variables, no? – MichaelJ Jun 14 '14 at 13:04
  • Correct, and that is the reason for its superiority in terms of predictive discrimination. – Frank Harrell Jun 14 '14 at 14:56
  • @FrankHarrell: I would like to read more about quadratic penalization, could you please send me the full reference for the paper you mentioned? – Puzzled Jun 16 '14 at 12:53
  • Also, any advice on how to run quadratic penalization on SPSS? – Puzzled Jun 16 '14 at 13:55
  • 2
    Quadratic penalization is usually called *ridge regression*, so I suggest searching using the latter keywords. As pointed out, quadratic penalization won't select a subset of variables, which was originally specified as a need. If predictive accuracy is the main concern then ridge regression (quadratic penalization) is probably the way to go. – MichaelJ Jun 16 '14 at 16:21
  • 2
    See Moons et al, J Clin Epi 57:2363-70, 2004. – Frank Harrell Jun 16 '14 at 16:26
  • @Scortchi: Could you please explain further what you meant by "data reduction guided by subject-matter knowledge & common sense"? Any examples? – Puzzled Jun 19 '14 at 21:56
  • 1
    @Puzzled: E.g. the response is likely related to a patient's fatness, & you use body-mass index as a predictor rather than height & weight (see the sketch below). Or it's likely related to an increase in the amount of rainfall, & you form a predictor from the trend component of a time-series model fitted to daily measurements. Or it's likely related to an individual's wealth, & you form a predictor by adding savings, investments & property value. Or it's likely related to a student's academic ability, & you form a predictor from a weighted average of test scores in different subjects. – Scortchi - Reinstate Monica Jun 20 '14 at 09:58
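
A one-line illustration of the first example, with hypothetical column names:

```r
# hypothetical columns: weight in kg, height in cm; 'outcome' is the binary response
data1$bmi <- data1$weight / (data1$height / 100)^2
fit <- glm(outcome ~ bmi, data = data1, family = binomial)
```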