4

I have a binary classification problem with 5K records and 60+ features/columns/variables. The dataset is slightly imbalanced (or arguably not), with a 33:67 class proportion.

What I did was

1st) Run a logistic regression (statsmodels) with all 60+ columns as input (thereby controlling for confounders) and identify the significant risk factors (p < 0.05) from the summary output. Through this approach I don't have to worry about confounders, because they are controlled via the multivariable regression. I also need to know that my risk factors are significant, meaning I build the predictive model on the basis of significant features. I say this because in a field like medical science/clinical studies, I believe it is also important to know the causal effect. If you wish to publish in a journal, I don't think you can simply list variables based on a feature-importance approach (whose results differ for each feature-selection method). Of course, I do find some features in common across all the feature-selection algorithms, but is that enough to justify that a variable is a meaningful predictor? Hence, I was hoping that a p-value would convince readers that a predictor is significant. (A rough code sketch of steps 1-3 is shown after this list.)

2nd) Use the identified 7 significant risk factors to build a classification ML model

3rd) It yielded an AUC of around 82%
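
In code, the workflow is roughly the following (a simplified sketch, not my exact script; `df` and `outcome` are placeholder names for my data):

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# df: placeholder DataFrame with 60+ feature columns plus a binary 'outcome' column
X = df.drop(columns=["outcome"])
y = df["outcome"]

# 1st) multivariable logistic regression on all predictors (statsmodels)
logit_res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(logit_res.summary())
pvals = logit_res.pvalues.drop("const")
significant = pvals[pvals < 0.05].index.tolist()   # the "significant risk factors"

# 2nd) tree-based classifier on the significant features only
X_train, X_test, y_train, y_test = train_test_split(
    X[significant], y, test_size=0.30, stratify=y, random_state=0
)
clf = GradientBoostingClassifier().fit(X_train, y_train)

# 3rd) AUC on the held-out 30% test split
print("Test AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```

(I actually use XGBoost for the classifier; scikit-learn's gradient boosting is just a stand-in here.)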

Now my questions are:

1) Out of the 7 significant factors identified, 5 risk factors were already known from domain experience and the literature. So we are considering the remaining 2 as new factors that we found, possibly because we had a very good data-collection strategy (meaning we also collected data for new variables that previous studies didn't have).

2) But when I build a model with the 5 already-known features, it produces an AUC of 82.1. When I include all 7 significant features, it still produces an AUC of 82.1-82.3, or sometimes it even drops to 81.8-81.9. Not much improvement. Why is this happening?

3) If the 2 new features are of no use, why did statsmodels logistic regression identify them as significant (p < 0.05)?

4) I guess we could look at any metric. As my data is slightly imbalanced (33:67 class proportion), I am using only metrics like AUC and F1 score. Should I be looking at accuracy instead?

5) Should I balance the dataset, given that I am using statsmodels logistic regression to identify the risk factors from the summary output? I didn't balance because I later use tree-based models for the classification, and they handle imbalance well. Basically, what I am trying to find out is: even for significant-factor identification with statsmodels logistic regression, should I balance the dataset?

6) Can you let me know what the problem is here and how I can address it?

7) How much of an improvement in performance is considered valid/meaningful enough to count as a new finding?

The Great
  • 1
    When you report your AUC performance, do you mean on unseen data? – Dave Jan 01 '20 at 07:23
  • Yes, I mean on test data. My dataset is split 70:30 into train and test sets, so 30 percent is test data. Whatever metrics I report are on the test data. – The Great Jan 01 '20 at 08:32
  • 1
    And when you call a feature significant, do you mean that your software reports a parameter p-value less than 0.05 (or some other threshold that you use, maybe 0.01) when you call model.summary()? – Dave Jan 01 '20 at 13:16
  • Yes. When I run logistic regression using the `statsmodels` API, I get a p-value less than 0.05 for certain input variables. You are right. – The Great Jan 01 '20 at 13:21

3 Answers

12

A few general points before answering the individual questions.

First, in logistic regression (unlike in linear regression), coefficient estimates will be biased if you omit any predictor associated with the outcome, whether or not it is correlated with the included predictors. This page gives an analytic demonstration for the related probit regression.
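
A quick way to see this for yourself is a small simulation (a sketch with made-up coefficients, not the analytic demonstration on the linked page): omitting a predictor that is independent of the included one still attenuates the included coefficient toward zero.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                      # independent of x1
eta = 1.0 * x1 + 2.0 * x2                    # true conditional log-odds
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

full = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
omitted = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)

print("x1 coefficient with x2 included:", full.params[1])     # close to 1.0
print("x1 coefficient with x2 omitted :", omitted.params[1])  # noticeably shrunk toward 0
```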

Second, it's not necessary (even if it's desirable) to know the mechanism through which a predictor is related to outcome. If it improves outcome prediction (either on its own or as a control for other predictors), it can be useful. "Answer[ing] the question 'does [this] new feature really affect/explain the outcome behavior?'" generally can't be done by statistical modeling; modeling like yours can point the way to the more detailed experimental studies needed to get to the mechanism.

Third, class imbalance problems typically arise from using an improper scoring rule or from just not having enough members of the minority class to get good estimates. See this page among many on this site. Your nicely designed study has over 1500 in the minority class, so the latter is certainly not a problem. Accuracy and F1 score are not strictly proper scoring rules, and the AUC (equivalent to the concordance or C-index) is not very sensitive for detecting differences among models (note that these issues are essentially the same in survival modeling or in logistic regression). So concentrate on using a correct and sensitive measure of model quality.
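
To make the distinction concrete, here is a minimal sketch (scikit-learn, with synthetic placeholder labels and predicted probabilities) of the metrics involved; the strictly proper scores reward well-calibrated probability estimates, while accuracy depends on an arbitrary threshold and AUC only on the ranking:

```python
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss, accuracy_score, roc_auc_score

# Placeholder data: y_test are 0/1 labels, p_hat are predicted probabilities
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=1000)
p_hat = np.clip(0.5 * y_test + rng.normal(0.25, 0.2, size=1000), 0.01, 0.99)

# Strictly proper scoring rules (lower is better)
print("log-loss   :", log_loss(y_test, p_hat))
print("Brier score:", brier_score_loss(y_test, p_hat))

# Not strictly proper / less sensitive
print("accuracy   :", accuracy_score(y_test, (p_hat >= 0.5).astype(int)))
print("AUC        :", roc_auc_score(y_test, p_hat))
```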

Fourth, even with your sample size using a single test/train split instead of modeling-process validation by bootstrapping might be leading you astray. See this page and its links. With bootstrapping you take several hundred samples of the same size as your data set, but with replacement, after you have built your model on the entire data set. You do not set aside separate training, validation, and test sets; you use all of the data for the model building and evaluation process. Bootstrapping mimics the process of taking your original sample from the underlying population. You repeat the entire model building process (including feature selection steps) on each bootstrap sample and test, with appropriate metrics, the performance of each model on the full original data set. Then pool the results over all the models from the bootstraps. You can evaluate bias and optimism/overfitting with this approach, and if you are doing feature selection you can compare among the hundreds of models to see the variability among the selected features.
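
A minimal sketch of this idea in Python, in the optimism-corrected flavour (here `fit_and_select` is a placeholder for your entire model-building pipeline, including any feature-selection steps, and plain logistic regression stands in for whatever model you use):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

def fit_and_select(X, y):
    """Placeholder for the *whole* modelling process (feature selection + fitting)."""
    return LogisticRegression(max_iter=1000).fit(X, y)

def bootstrap_validate(X, y, n_boot=200, metric=roc_auc_score):
    # Apparent performance: model built on and evaluated against the full data set
    apparent = metric(y, fit_and_select(X, y).predict_proba(X)[:, 1])

    optimism = []
    for b in range(n_boot):
        Xb, yb = resample(X, y, replace=True, random_state=b)   # bootstrap resample
        mb = fit_and_select(Xb, yb)                              # repeat the whole pipeline
        perf_boot = metric(yb, mb.predict_proba(Xb)[:, 1])       # on the bootstrap sample
        perf_orig = metric(y, mb.predict_proba(X)[:, 1])         # on the original data
        optimism.append(perf_boot - perf_orig)

    return apparent - np.mean(optimism)    # optimism-corrected performance estimate

# corrected = bootstrap_validate(X.to_numpy(), y.to_numpy())
```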

Fifth, with respect to feature selection, predictors in clinical data are often highly inter-correlated in practice. In such cases the specific features selected by any method will tend to depend on the particular sample you have in hand. You can check this for yourself with the bootstrapping approach described above. That will be true of any modeling method you choose. That is one of many reasons why you will find little support on this site for automated model selection. In any case, the initial choice of features to evaluate should be based on your knowledge of the subject matter.
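
For the feature-selection variability specifically, you can tally how often each predictor comes out "significant" across bootstrap resamples; a rough sketch, assuming `X` is a pandas DataFrame of predictors and `y` the binary outcome (placeholder names):

```python
import statsmodels.api as sm
from sklearn.utils import resample
from collections import Counter

def significant_features(X, y, alpha=0.05):
    """Placeholder selection rule: predictors with p < alpha in a full logistic model."""
    res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    pvals = res.pvalues.drop("const")
    return list(pvals[pvals < alpha].index)

n_boot = 200
counts = Counter()
for b in range(n_boot):
    Xb, yb = resample(X, y, replace=True, random_state=b)
    try:
        counts.update(significant_features(Xb, yb))
    except Exception:      # e.g. separation or non-convergence in some resamples
        continue

# Fraction of resamples in which each feature was selected
for feature, c in counts.most_common():
    print(f"{feature}: {c / n_boot:.2f}")
```

Features whose selection frequency bounces around tell you that the "significant" set is sample-dependent.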

So with respect to the questions:

  1. Congratulations on identifying 2 new risk factors associated with outcome. A predictive model certainly should include them if they are going to be generally available to others in your field. Under the first and second general points above, however, you might want to reconsider removing from your model any predictors that might, based on your knowledge of the subject matter, be associated with outcome. With over 1500 in the minority class you are unlikely to be overfitting with 60 features (if they are all continuous or binary categorical). The usual rule of thumb of 15 minority-class members per evaluated predictor would allow you up to 100 predictors (including levels of categorical variables beyond the second and including interaction terms). If any predictor is going to be available in practice and is expected to be related to outcome based on your knowledge of the subject matter, there's no reason to remove it just because it's not "statistically significant."

  2. The third and fourth general points above might account for this finding. AUC is not a very sensitive measure for comparing models, and using a fixed test/train split could lead to split-dependent imbalances that would be avoided if you did bootstrap-based model validation, as for example with the rms package in R. That leads to:

  3. A logistic regression model optimizes a log-loss, effectively a strictly proper scoring rule that would be expected to be more sensitive than AUC. Note that the size of your study will make it possible to detect "significance" at p < 0.05 for smaller effects than would be possible with a smaller study. Use your knowledge of the subject matter to decide if these statistically significant findings are likely to be clinically significant.

  4. Avoid accuracy. Avoid F1. Be cautious in using AUC. Use a strictly proper scoring rule.

  5. See the third general point above. If your ultimate goal is to use something like boosted classification trees, then there is probably no need to do this preliminary logistic regression. Note, however, that a well calibrated logistic regression model can be much easier to interpret than any but the simplest (and potentially most unreliable) tree models. And make sure that your optimization criterion in a tree model provides a proper scoring rule; once again, avoid accuracy as a criterion (see the sketch after this list).

  6. There really is no problem. Bootstrap-based logistic model validation and calibration instead of the single fixed test/train split could provide a much better sense of how your model will perform on new data. If your model is well calibrated (e.g., linearity assumptions hold) then you could use the logistic regression model directly instead of going on to a tree-based model. If you need to make a yes/no decision based solely on the model, choose a probability cutoff that represents the tradeoff between false-negative and false-positive findings (also illustrated in the sketch after this list).

  7. The answer to your last question depends on your knowledge of the subject matter. Again, this is the issue of statistical significance versus clinical significance. Only you and your colleagues in the field can make that determination.
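
Regarding points 5 and 6, here is a minimal sketch (scikit-learn as a stand-in for your boosting library; `X`, `y`, `y_test`, and `p_hat` are placeholders) of scoring a boosted model on log-loss, checking calibration, and turning probabilities into decisions with a cost-based cutoff:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.calibration import calibration_curve

# Point 5: tune/compare the tree model on a proper scoring rule, not accuracy
booster = HistGradientBoostingClassifier()   # optimizes log-loss internally for classification
cv_logloss = -cross_val_score(booster, X, y, cv=5, scoring="neg_log_loss")
print("5-fold CV log-loss:", cv_logloss.mean())

# Point 6: check calibration of held-out predicted probabilities p_hat
frac_pos, mean_pred = calibration_curve(y_test, p_hat, n_bins=10)
print(np.column_stack([mean_pred, frac_pos]))   # close agreement = well calibrated

# Only at the point of action turn probabilities into yes/no, using a cutoff
# that reflects the (hypothetical) relative costs of the two error types
cost_fn, cost_fp = 5.0, 1.0                      # false negative 5x as costly as false positive
threshold = cost_fp / (cost_fp + cost_fn)        # ~0.17, so more potential cases get flagged
decision = (p_hat >= threshold).astype(int)
```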

EdM
  • Hi @EdM - Thanks a ton for your response and help. Upvoted, much appreciated. May I check with you on a few things: 1) you suggest I go with bootstrap validation; 2) choose an appropriate metric to evaluate model performance - should I be looking at log-loss as a metric instead of the AUC score? Is that what you suggest? 3) Other than these two, am I right to understand that the steps I am following are correct? – The Great Jan 01 '20 at 21:30
  • @SSMK if the "two things" are directly related to this question and answer, leave a comment and I will respond either by comment or with an edited answer in a day or a few days. If unrelated, post a new question and place a comment here with a link to the new question. – EdM Jan 01 '20 at 21:43
  • Hi @EdM, thanks. As my English proficiency is limited, I would like to confirm the items mentioned in my previous comment. – The Great Jan 01 '20 at 22:48
  • Regarding point 5 in your answer, the reason I was using `Logistic Regression` was to find out the significant risk factors that influence the outcome. Using tree-based models and other models with feature selection gives me different feature subsets. Though I find common features being returned by the feature-selection algorithms, I felt it was also necessary for a feature to be significant, since we are saying that "newly found features could also help us identify patients who might develop the condition". – The Great Jan 01 '20 at 23:17
  • I understand that if it's just about prediction we can rely on `feature importance`, but since this is about medical science, shouldn't we also try to answer the question "does the new feature really affect/explain the outcome behavior"? Hence I chose logistic regression. Yes, after that I also followed it up with `boosting`, because the model performance was better with the selected features using `Xgboost` than with logistic regression. Is it incorrect to do it this way? – The Great Jan 01 '20 at 23:17
  • Hi @EdM - one quick question. By `Bootstrap-based logistic model validation and calibration instead of the single fixed test/train split could provide a much better sense of how your model will perform on new data`, do you mean that I don't split the data into train and test at all, but just create the bootstrapped datasets first and then split them into 3 groups (train, test and validation (model assessment))? Have I understood it right? – The Great Jan 03 '20 at 10:25
  • 1
    @TheGreat with bootstrapping you use all the data, no splits in the sense that you mean, repeating the _entire model building process_ on several hundred re-samples of the same size taken from the data set with replacement (so that some cases are duplicated or more and some omitted from each re-sample). Then evaluate the models against the entire data set, and collect information about the retained predictors if you wish. Am busy today but will respond in detail over the weekend. – EdM Jan 03 '20 at 15:23
  • 1
    @TheGreat I have added text _in italics_ to some of my initial general points and added a 5th general point on feature selection. If your intent is to use boosted classification trees, I don't see a reason to pre-select predictors via logistic regression, which depends on linearity and will only evaluate interactions that you specify. You might find it hard, however, to explain _how_ the predictors are related to outcome in boosted trees. Logistic regression results are easier to explain. For boosting, use the log-loss metric as in logistic regression or another proper scoring rule, not AUC etc. – EdM Jan 04 '20 at 04:09
  • Much appreciate your time and inputs. – The Great Jan 04 '20 at 04:28
  • Should you have time, can you please help me with this related post? https://stats.stackexchange.com/questions/444316/how-to-compare-and-evaluate-models-for-a-new-feature – The Great Jan 12 '20 at 00:57
6

6) Can you let me know what is the problem here and how can I address this?

With all due respect, by reading your post I see only red flags due to misapplication and misunderstanding of the statistical methods. I would suggest employing a statistician (and at the very least, reading a great deal on clinical prediction models/regression modeling from Frank Harrell or Ewout Steyerberg before continuing).

EdM gave some more pointed answers (but I will be more blunt and less specific): dumping all of your collected variables into the model is NOT a good approach, nor does it guarantee anything, and you need a high number of cases (in the smaller outcome group) per POTENTIAL predictor, i.e. per variable you are screening - so roughly 100 times the exact number of potential predictors (features, as you called them) would be a minimum number of CASES in the smaller group of the binary outcome. However, especially when subject-matter expertise is available (it almost always is in the medical literature), it is a poor choice to let variable-selection algorithms (especially ones based on p-values or ROC/sensitivity/specificity) guide variable selection, as this often leads to the WRONG set of variables with poor reproducibility. I suggest you look at the many blog posts Frank Harrell has written on this, because sensitivity/specificity and p-values are suboptimal ways to select "good" predictors.

There is a lot in your original post that indicates a rote, cook-book style of statistical practice that leads to poor model performance and dangerous inference. I say this only to provide you with appropriate caution and to encourage deeper investigation into the correct way to do this (i.e. prespecifying the model fully in advance or using better methods of variable selection than you have). Frank Harrell and Ewout Steyerberg would be excellent resources for you. They will introduce you to smooth calibration curves and other ways to assess model performance, most if not all of which you ignored in your post and which are absolutely superior to your initial approach.

LSC
  • With all due respect, this sounds like a fanboy of a particular school of statistical thought. – Josef Jan 01 '20 at 18:32
  • When you say "**wrote**, cook-book style of statistical practice" do you maybe mean *"written"* or *"rote"*? – Seldom 'Where's Monica' Needy Jan 01 '20 at 20:07
  • @LSC - Appreciate your inputs and response. Upvoted... Main reason for posting this here was to learn and correct my mistakes. Excluding the p-values, sensitivity, specificity etc, can you let me what is the optimal way to select good features? – The Great Jan 01 '20 at 21:37
  • @SeldomNeedy it appears that wasn't the only typographical error I made :) – LSC Jan 02 '20 at 01:43
  • 1
    @Josef feel free to add something useful to the discussion stating that what I've said is incorrect or wrong. The literature has great examples of how what I've said is accurate, and hence, how I've formed my opinion. The "ML", "AI", "big data" folks who are (mis)using logistic regression make claims about their "methods" that don't hold up mathematically/probabilistically or with simulations. If you're in any way advocating that creating a saturated model and then retaining the "significant" predictors is a good methodology, you're showing your ignorance. If something else, feel free to explain. – LSC Jan 02 '20 at 01:47
  • @SSMK what is the goal of your analysis? The answer, as usual, is that it depends. Subject-matter expertise is always an advantage. Glad you're not offended, because I wasn't trying to be rude, just straightforward. There is a lot of misinformation around. – LSC Jan 02 '20 at 01:49
  • 1
    @LSC I mainly don't like your language. For example, I think it is a good model/feature robustness analysis if "automatic" feature selection ends up with those that are also meaningful in field substantive way. I just went through something related https://github.com/statsmodels/statsmodels/issues/6237 https://github.com/statsmodels/statsmodels/issues/6323 – Josef Jan 02 '20 at 02:18
  • 1
    Answering a question with "hire a statistician" reminds me of the response to computer problems that often was "use Linux instead of Windows". (I will delete this comment again.) – Josef Jan 02 '20 at 02:19
  • @LSC Am trying to build a predictive model based on causal features. 1) Is it even possible? 2) Don't you think it's good to build a predictive model based on causal features? 3) If not the above(my) approach, how can I justify that these are the new risk factors that we found? Just feature importance is enough? – The Great Jan 02 '20 at 02:41
  • 4
    @Josef "hire a statistician" is code for "the analysis you are trying to perform requires intensive help from some with statistical expertise, not simply an answer on CV." I've seen many times a question begin with "I don't have much statistical training, but I'm trying to use this advanced statistical technique..." and the best response really is that CV cannot provide sufficient help for the asker to proceed confidently, and the services of a professional statistician are required. In this case, EdM gave a fantastic answer but I agree that OP might want to seek professional advice. – Noah Jan 02 '20 at 03:01
  • @Noah Is it feasible to find a statistician that can help with "causal machine learning"? I wouldn't hire a "traditional statistician" that essentially sounds like "your entire approach is wrong" when it doesn't fit his/her views. BTW I find EdM's answer useful. Disclaimer: my background is econometrics and I never read Harrell Jr. – Josef Jan 02 '20 at 03:27
  • By "statistician" I just mean someone with demonstrated expertise in the relevant field who can spend time providing detailed advice, not someone who espouses specific preferences toward "traditional" statistical methods over machine-learning methods. I am a statistician who specializes in causal inference and machine learning, so yes, it is feasible to find a statistician that specializes in that area. It's not clear to me how ideology is biasing this answer or why you think that the typical statistician is clouded by bias toward one set of methods. – Noah Jan 02 '20 at 03:41
  • 2
    Sorry for being distracting here. I just found the tone of LSC's answer not helpful for combining a causal interpretation with machine learning approaches. It just sounded like the standard argument against automatic explanatory variable selection. The question goes beyond that. – Josef Jan 02 '20 at 04:04
  • Aside: As a `statsmodels` developer I am interested in what support a statistics package can provide to aid statistical and causal interpretation of newer machine learning approaches. There are still a lot of open questions, at least for me. – Josef Jan 02 '20 at 04:09
  • What the OP asked for: help with his statistical problem. What the OP did not ask for: advice on who already knows how to solve his problem. The references you provided were helpful, sort of, but dressing someone down because they don't know as much as a biostatistician is the worst kind of elitism. How would your professional statisticians have become professional statisticians if every time they tried to learn something new someone told them to hire someone who knows better? I thought doctors were self-important but they don't hold a candle to 'statisticians'. – llewmills Jan 04 '20 at 07:14
  • 1
    @Llewmills part of helping people, including budding statisticians, is to know when and how many guardrails are needed. It doesn't appear the OP is completing a homework problem, but rather, may be doing work for a publication possibly in clinical medicine (from what the post contains). Maybe this is wrong, but even budding statisticians get told to sit back and work as programmers while the lead statistician handles problems outside the scope of the junior. Further, the OP asked a specific headline question but provided greater detail indicating this is an issue bigger than a post (1/n) – LSC Jan 04 '20 at 10:29
  • 2
    on stack exchange should/can handle. There are literally weeks of learning (at least) that the OP might need to come at this from a better angle, so really someone should be enlisted in person to speak directly with the OP and help teach the OP while addressing the problem. Is it elitism to tell a patient not to use WebMD because they don't actually understand medicine? People have a misinformed notion that statistical practice is basically a set of calculations, and fail to weigh the gravity of improper methodology and interpretation (hence why so much research is garbage). (2/n) – LSC Jan 04 '20 at 10:31
  • 1
    The scary part is that the garbage research floating around in good journals is what impacts patient care. If we encourage people to develop a better foundation and seek professional advice in the meantime, that is only a good thing. I gave OP references to help elaborate and explain the issue to start the teaching process. Also, I'm not a statistician, so there isn't much of a motivation here for me to gate-keep or act elite about a group I'm not part of in the first place. If you're paying a statistician for time, you have every right to ask for teaching along the way [good ones will](3/3) – LSC Jan 04 '20 at 10:34
6

I would like to add one point to EdM's answer that has not been mentioned yet.

Statistically significant but not important

This could be a random feature of the data: because of the multiple-testing problem, some features are significant in the dataset purely by sampling variation.

However, it could also be that the overall effect of an explanatory variable is small but it could be large for some subgroups or over some range of the values of the variables. In that case, a significant small main effect could pick up an effect from a missing interaction or from a missing nonlinearity.

Examples could be risk factors like cholesterol, where the effect increases with other factors and only a small fraction of the sample is exposed to those other factors. Some factors could be age-related, with the risk factor only important for a small age group in the sample.

Using other estimation methods like tree models might pick up some of this nonlinearity and thereby improve overall prediction.
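
As a concrete illustration, one way to check whether a weak main effect is hiding an interaction or nonlinearity is to compare nested logistic models with a likelihood-ratio test; a sketch using the statsmodels formula API (the column names `outcome`, `cholesterol`, and `age` and the DataFrame `df` are hypothetical):

```python
import statsmodels.formula.api as smf
from scipy import stats

# Main-effects-only specification
m_main = smf.logit("outcome ~ cholesterol + age", data=df).fit(disp=0)

# Allow a subgroup (interaction) effect and a nonlinear age effect
m_flex = smf.logit("outcome ~ cholesterol * age + I(age ** 2)", data=df).fit(disp=0)

# Likelihood-ratio test: does the richer specification fit meaningfully better?
lr_stat = 2 * (m_flex.llf - m_main.llf)
df_diff = m_flex.df_model - m_main.df_model
print("LR statistic:", lr_stat, "p-value:", stats.chi2.sf(lr_stat, df_diff))
```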

Josef