Questions tagged [predictive-models]

Predictive models are statistical models whose primary purpose is to predict other observations of a system optimally, as opposed to models whose purpose is to test a particular hypothesis or explain a phenomenon mechanistically. As such, predictive models place less emphasis on interpretability and more emphasis on performance.

Wikipedia has articles on predictive modelling (https://en.wikipedia.org/wiki/Predictive_modelling) and predictive analytics (https://en.wikipedia.org/wiki/Predictive_analytics) with further references.

2756 questions
130 votes · 4 answers

Differences between cross validation and bootstrapping to estimate the prediction error

I would like your thoughts on the differences between cross-validation and bootstrapping for estimating the prediction error. Does one work better for small datasets or for large ones?
grant
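
As a rough illustration of the two approaches being compared, here is a minimal sketch on synthetic data (plain out-of-bag bootstrap rather than the .632 variants; the data, model, and number of replicates are assumptions, not from the question):

```python
# Minimal sketch: K-fold CV and a simple bootstrap estimate of prediction MSE
# on the same synthetic data; all settings below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(size=200)

# K-fold cross-validation estimate of the prediction MSE.
cv_mse = -cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=10).mean()

# Simple bootstrap estimate: fit on a resample, evaluate on the
# observations left out of that resample ("out-of-bag" error).
boot_mse = []
for _ in range(200):
    idx = rng.integers(0, len(y), len(y))       # bootstrap sample indices
    oob = np.setdiff1d(np.arange(len(y)), idx)  # left-out observations
    model = LinearRegression().fit(X[idx], y[idx])
    boot_mse.append(np.mean((y[oob] - model.predict(X[oob])) ** 2))

print(f"10-fold CV MSE:    {cv_mse:.3f}")
print(f"bootstrap OOB MSE: {np.mean(boot_mse):.3f}")
```
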
119 votes · 6 answers

Difference between confidence intervals and prediction intervals

For a prediction interval in linear regression you still use $\hat{E}[Y \mid x_0] = \hat{\beta}_0+\hat{\beta}_1 x_0$ to generate the interval. You also use this to generate a confidence interval of $E[Y \mid x_0]$. What's the difference between the two?
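
For simple linear regression, the standard textbook intervals differ only by an extra variance term; a sketch of the two formulas (with $s$ the residual standard error and $\bar{x}$ the mean of the training $x_i$):

$$\text{CI for } E[Y \mid x_0]:\quad \hat{\beta}_0+\hat{\beta}_1 x_0 \;\pm\; t_{n-2,\,1-\alpha/2}\, s\sqrt{\tfrac{1}{n}+\tfrac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}$$

$$\text{PI for a new } Y \mid x_0:\quad \hat{\beta}_0+\hat{\beta}_1 x_0 \;\pm\; t_{n-2,\,1-\alpha/2}\, s\sqrt{1+\tfrac{1}{n}+\tfrac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}$$

The extra "1" under the square root accounts for the irreducible noise in a single new observation, which is why the prediction interval is always wider.
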
107 votes · 15 answers

US Election results 2016: What went wrong with prediction models?

First it was Brexit, now the US election. Many model predictions were off by a wide margin; are there lessons to be learned here? As late as 4 pm PST yesterday, the betting markets were still favoring Hillary 4 to 1. I take it that the betting…
horaceT
88 votes · 8 answers

When is unbalanced data really a problem in Machine Learning?

We have already had multiple questions about unbalanced data when using logistic regression, SVMs, decision trees, and bagging, as well as a number of other similar questions, which makes it a very popular topic! Unfortunately, each of the questions seems to be…
Tim
74 votes · 16 answers

Practical thoughts on explanatory vs. predictive modeling

Back in April, I attended a talk at the UMD Math Department Statistics group seminar series called "To Explain or To Predict?". The talk was given by Prof. Galit Shmueli, who teaches at UMD's Smith Business School. Her talk was based on research she…
wahalulu
68 votes · 9 answers

How can I help ensure testing data does not leak into training data?

Suppose we have someone building a predictive model, but that someone is not necessarily well-versed in proper statistical or machine learning principles. Maybe we are helping that person as they are learning, or maybe that person is using some…
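
A common way to reduce this risk in practice is to bundle every data-dependent preprocessing step into a pipeline so that it is re-fit inside each training fold; a minimal scikit-learn sketch (the data and model choices are placeholders, not from the question):

```python
# Minimal sketch: keep preprocessing inside the cross-validation loop so the
# held-out folds never influence the scaler the model is trained with.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)

# Leaky pattern: scaler fit on all rows, including future test folds.
# X_scaled = StandardScaler().fit_transform(X)

# Safer pattern: the scaler is re-fit on the training part of every fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```
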
67 votes · 3 answers

Variables are often adjusted (e.g. standardised) before making a model - when is this a good idea, and when is it a bad one?

In what circumstances would you want to, or not want to, scale or standardize a variable prior to model fitting? And what are the advantages/disadvantages of scaling a variable?
60 votes · 5 answers

Is adjusting p-values in a multiple regression for multiple comparisons a good idea?

Let's assume you are a social science researcher/econometrician trying to find relevant predictors of demand for a service. You have 2 outcome/dependent variables describing the demand (using the service yes/no, and the number of occasions). You have…
57 votes · 6 answers

Alternatives to logistic regression in R

I would like as many algorithms as possible that perform the same task as logistic regression, that is, algorithms/models that can predict a binary response (Y) from some explanatory variables (X). I would be glad if, after you name the algorithm,…
Tal Galili
50 votes · 3 answers

What is the root cause of the class imbalance problem?

I've been thinking a lot about the "class imbalance problem" in machine/statistical learning lately, and I am being drawn ever deeper into a feeling that I just don't understand what is going on. First let me define (or attempt to define) my terms: The…
45 votes · 3 answers

Whether to rescale indicator/binary/dummy predictors for LASSO

For the LASSO (and other model-selection procedures) it is crucial to rescale the predictors. The general recommendation I follow is simply to use a 0 mean, 1 standard deviation normalization for continuous variables. But what is there to do with…
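
One common practical arrangement (a sketch under assumptions, not the only possible answer) is to standardize the continuous columns while passing the 0/1 dummies through unchanged, for example:

```python
# Minimal sketch: standardize continuous predictors only, leave dummies as 0/1,
# then fit a LASSO; column names and the penalty strength are assumptions.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "income": rng.normal(50_000, 15_000, 500),
    "is_member": rng.integers(0, 2, 500),   # binary dummy
})
y = 0.02 * df["age"] + 1.5 * df["is_member"] + rng.normal(size=500)

pre = ColumnTransformer(
    [("scale", StandardScaler(), ["age", "income"])],
    remainder="passthrough",                # dummies pass through untouched
)
model = make_pipeline(pre, Lasso(alpha=0.05)).fit(df, y)
print(model.named_steps["lasso"].coef_)
```
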
43 votes · 1 answer

Manually calculated $R^2$ doesn't match up with randomForest() $R^2$ for testing new data

I know this is a fairly specific R question, but I may be thinking about the proportion of variance explained, $R^2$, incorrectly. Here goes. I'm trying to use the R package randomForest. I have some training data and testing data. When I fit a random…
Stephen Turner
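
One frequent source of this kind of mismatch (a general reminder, not necessarily the asker's exact issue) is that "proportion of variance explained" has two common definitions that agree for OLS on the training data but not for other models or for held-out data:

$$R^2_{\text{SS}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\qquad\text{vs.}\qquad
R^2_{\text{corr}} = \operatorname{corr}(y, \hat{y})^2$$

In addition, the "% Var explained" printed by randomForest() is, as far as I know, computed from out-of-bag predictions on the training data, which is a third quantity again.
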
43 votes · 2 answers

Mean absolute percentage error (MAPE) in Scikit-learn

How can we calculate the mean absolute percentage error (MAPE) of our predictions using Python and scikit-learn? From the docs, we have only these 4 metric functions for regression: metrics.explained_variance_score(y_true,…
Nyxynyx
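
MAPE is easy to compute directly with NumPy, and newer scikit-learn versions (0.24+, if I recall correctly) also ship sklearn.metrics.mean_absolute_percentage_error; a minimal sketch with made-up arrays:

```python
# Minimal sketch: MAPE by hand with NumPy; y_true / y_pred are made-up arrays.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

# Definition: mean absolute relative error, usually reported as a percentage.
# Note that MAPE is undefined whenever y_true contains zeros.
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"MAPE: {mape:.1f}%")

# Newer scikit-learn versions (0.24+) provide the same quantity as a fraction:
# from sklearn.metrics import mean_absolute_percentage_error
# mean_absolute_percentage_error(y_true, y_pred)   # == mape / 100
```
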
43 votes · 3 answers

Variance of $K$-fold cross-validation estimates as $f(K)$: what is the role of "stability"?

TL;DR: It appears that, contrary to oft-repeated advice, leave-one-out cross-validation (LOO-CV) -- that is, $K$-fold CV with $K$ (the number of folds) equal to $N$ (the number of training observations) -- yields estimates of the generalization…
42 votes · 1 answer

When and how to use standardized explanatory variables in linear regression

I have 2 simple questions about linear regression: When is it advised to standardize the explanatory variables? Once estimation is carried out with standardized values, how can one predict with new values (how should one standardize the new…
teucer
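
On the second part of the question, the key point is that any new observation must be standardized with the training data's means and standard deviations, not its own; a minimal sketch (all numbers are placeholders):

```python
# Minimal sketch: standardize with statistics computed on the training data,
# then apply those same statistics to any new observation before predicting.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
y_train = 2.0 * X_train[:, 0] - X_train[:, 1] + rng.normal(size=100)

mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)   # training statistics
model = LinearRegression().fit((X_train - mu) / sigma, y_train)

X_new = np.array([[11.0, 9.5]])                          # new observation
y_hat = model.predict((X_new - mu) / sigma)              # reuse mu and sigma
print(y_hat)
```
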