Questions tagged [regression-strategies]

Regression Modeling Strategies

This category is for questions and discussions about regression modeling strategies, especially when multiple methods are being combined. For example, how much data reduction should be done before using $Y$? What is best practice for model validation for specific model types? How does the choice of predictive accuracy measures impact model validation? How should parameters be assigned to various parts of a model, and how does the number of parameters assigned to one part of the model affect the number of parameters to assign to another part? What is the best way to detect that parameters in a model are hard to disentangle, and how could pre-modeling data reduction have helped? What is a good strategy for getting a complex model accepted by non-statisticians? When does one use traditional multivariable regression modeling vs. a black box?

290 questions
98
votes
8 answers

What is the benefit of breaking up a continuous predictor variable?

I'm wondering what the value is in taking a continuous predictor variable and breaking it up (e.g., into quintiles), before using it in a model. It seems to me that by binning the variable we lose information. Is this just so we can model…
Tom
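
As a rough illustration of the information loss this question asks about, the sketch below fits the same simulated outcome twice: once with the predictor kept continuous and once with it cut into quintiles. The data-generating model (a linear trend plus Gaussian noise), the sample size, the seed, and all variable names are assumptions made for this example; none of them come from the question or its answers.

```python
import numpy as np


def r_squared(y, fitted):
    """Proportion of variance in y explained by the fitted values."""
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


# Simulated data (assumed for illustration): linear signal plus noise.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 * x + rng.normal(0.0, 2.0, size=n)

# Model 1: keep x continuous and fit a straight line (2 parameters).
slope, intercept = np.polyfit(x, y, 1)
r2_continuous = r_squared(y, intercept + slope * x)

# Model 2: cut x into quintiles and predict the mean of y
# within each bin (5 parameters, one per bin).
edges = np.quantile(x, [0.2, 0.4, 0.6, 0.8])
bin_index = np.digitize(x, edges)  # integer bin labels 0..4
bin_means = np.array([y[bin_index == k].mean() for k in range(5)])
r2_binned = r_squared(y, bin_means[bin_index])

print(f"R^2, continuous x: {r2_continuous:.3f}")
print(f"R^2, quintiles:    {r2_binned:.3f}")
```

On this simulated data the quintile fit explains less variance even though it spends more parameters, because it treats every observation within a bin as having the same value of x.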
56
votes
4 answers

Can a random forest be used for feature selection in multiple linear regression?

Since RF can handle non-linearity but can't provide coefficients, would it be wise to use random forest to gather the most important features and then plug those features into a multiple linear regression model in order to obtain their coefficients?…
35
votes
5 answers

Overfitting a logistic regression model

Is it possible to overfit a logistic regression model? I saw a video saying that if my area under the ROC curve is higher than 95%, then it's very likely to be overfitted, but is it possible to overfit a logistic regression model?
carlosedubarreto
30
votes
3 answers

Should final (production ready) model be trained on complete data or just on training set?

Suppose I trained several models on the training set, chose the best one using the cross-validation set, and measured its performance on the test set. So now I have one final best model. Should I retrain it on all my available data or ship the solution trained only on…
Yurii
27
votes
1 answer

Appropriate residual degrees of freedom after dropping terms from a model

I am reflecting on the discussion around this question and particularly Frank Harrell's comment that the estimate for variance in a reduced model (i.e., one from which a number of explanatory variables have been tested and rejected) should use Ye's…
Peter Ellis
27
votes
3 answers

Evaluating logistic regression and interpretation of Hosmer-Lemeshow Goodness of Fit

As we all know, there are two methods to evaluate a logistic regression model, and they test very different things. Predictive power: get a statistic that measures how well you can predict the dependent variable based on the independent…
24
votes
5 answers

When is quantile regression worse than OLS?

Apart from some unique circumstances where we absolutely must understand the conditional mean relationship, what are the situations where a researcher should pick OLS over Quantile Regression? I don't want the answer to be "if there is no use in…
24
votes
2 answers

Bayesian thinking about overfitting

I've devoted much time to development of methods and software for validating predictive models in the traditional frequentist statistical domain. In putting more Bayesian ideas into practice and teaching I see some key differences to embrace. …
21
votes
2 answers

Does LASSO suffer from the same problems stepwise regression does?

Stepwise algorithmic variable-selection methods tend to select for models which bias more or less every estimate in regression models ($\beta$s and their SEs, p-values, F statistics, etc.), and are about as likely to exclude true predictors as…
21
votes
4 answers

How should I check the assumption of linearity to the logit for the continuous independent variables in logistic regression analysis?

I am confused about the assumption of linearity to the logit for continuous predictor variables in logistic regression analysis. Do we need to check for the linear relationship while screening for potential predictors using univariable logistic…
19
votes
1 answer

What does it mean to make the sample size a random variable?

Frank Harrell has started a blog (Statistical Thinking). In his premier post, he lists some key features of his statistical philosophy. Among other items, it includes: "Make the sample size a random variable when possible." What does it mean…
gung - Reinstate Monica
18
votes
5 answers

Can I ignore coefficients for non-significant levels of factors in a linear model?

After seeking clarification about linear model coefficients over here, I have a follow-up question concerning non-significant (high p-value) coefficients of factor levels. Example: If my linear model includes a factor with 10 levels, and only 3 of…
18
votes
2 answers

Can we use categorical independent variable in discriminant analysis?

In discriminant analysis, the dependent variable is categorical, but can I use a categorical variable (e.g., residential status: rural, urban) along with some other continuous variable as an independent variable in linear discriminant analysis?
18
votes
3 answers

Model building and selection using Hosmer et al. 2013. Applied Logistic Regression in R

This is my first post on StackExchange, but I have been using it as a resource for quite a while. I will do my best to use the appropriate format and make the appropriate edits. Also, this is a multi-part question. I wasn't sure if I should split…
GNG
17
votes
4 answers

Why does propensity score matching work for causal inference?

Propensity score matching is used to make causal inferences in observational studies (see the Rosenbaum / Rubin paper). What's the simple intuition behind why it works? In other words, why if we make sure the probability of participating in the…
max