Questions tagged [out-of-sample]

Refers to the practice of assessing model performance on a "test", "holdout", or "out-of-sample" set of data that was not used for model building.

160 questions
68 votes · 9 answers

How can I help ensure testing data does not leak into training data?

Suppose we have someone building a predictive model, but that someone is not necessarily well-versed in proper statistical or machine learning principles. Maybe we are helping that person as they are learning, or maybe that person is using some…
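
A minimal sketch of one common safeguard, on simulated data: estimate any preprocessing (here, centering and scaling) on the training rows only, then apply those frozen parameters to the test rows, so no test-set information leaks into the fit.

    set.seed(1)
    x <- matrix(rnorm(100 * 5), nrow = 100)
    train <- 1:80
    test  <- 81:100

    mu <- colMeans(x[train, ])            # statistics from training rows only
    s  <- apply(x[train, ], 2, sd)

    x_train <- scale(x[train, ], center = mu, scale = s)
    x_test  <- scale(x[test, ],  center = mu, scale = s)  # reuse training statistics
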
33 votes · 3 answers

Do we need a test set when using k-fold cross-validation?

I've been reading about k-fold validation, and I want to make sure I understand how it works. I know that for the holdout method, the data is split into three sets, and the test set is only used at the very end to assess the performance of the…
b_pcakes · 435
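
A minimal sketch of the usual arrangement, on simulated data: the test set is carved off first, k-fold cross-validation runs only on the remainder for model selection, and the test set is touched exactly once at the end.

    set.seed(1)
    d <- data.frame(x = rnorm(150))
    d$y <- 2 * d$x + rnorm(150)

    idx   <- sample(nrow(d))
    test  <- idx[1:30]                 # held out until the very end
    train <- idx[31:150]               # used for 5-fold cross-validation

    folds <- cut(seq_along(train), breaks = 5, labels = FALSE)
    cv_mse <- sapply(1:5, function(i) {
      fit <- lm(y ~ x, data = d[train[folds != i], ])
      mean((d$y[train[folds == i]] - predict(fit, d[train[folds == i], ]))^2)
    })
    mean(cv_mse)                       # CV estimate, used only for model choice

    final <- lm(y ~ x, data = d[train, ])
    mean((d$y[test] - predict(final, d[test, ]))^2)  # single look at the test set
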
29 votes · 4 answers

Has the journal Science endorsed Garden of Forking Paths analyses?

The idea of adaptive data analysis is that you alter your plan for analyzing the data as you learn more about it. In the case of exploratory data analysis (EDA), this is generally a good idea (you are often looking for unforeseen patterns in the…
22 votes · 5 answers

New revolutionary way of data mining?

The following excerpt is from Schwager's Hedge Fund Market Wizards (May 2012), an interview with the consistently successful hedge fund manager Jaffray Woodriff. To the question "What are some of the worst errors people make in data mining?": A…
vonjd · 5,886
17 votes · 1 answer

Is Kaggle's private leaderboard a good predictor of out-of-sample performance of the winning model?

While the results of the private test set cannot be used to refine the model further, isn't model selection out of a huge number of models being performed based on the private test set results? Would you not, through that process alone, end up…
rinspy · 3,188
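
The selection effect the asker suspects is easy to demonstrate with a toy simulation: score many equally useless models on one fixed "private" test set, and the winner's score drifts well above true performance.

    set.seed(1)
    n_models <- 1000
    n_test   <- 1000
    # every "model" is a coin flip with true accuracy 0.5
    scores <- replicate(n_models, mean(rbinom(n_test, 1, 0.5)))
    max(scores)   # the "winner" scores well above 0.5; the gap is pure selection bias
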
16 votes · 4 answers

Why isn't the holdout method (splitting data into training and testing) used in classical statistics?

In my classroom exposure to data mining, the holdout method was introduced as a way of assessing model performance. However, when I took my first class on linear models, this was not introduced as a means of model validation or assessment. My online…
15 votes · 0 answers

Confusion with Vowpal Wabbit's multiple-pass behavior when performing ridge regression

I have encountered many peculiarities/misunderstandings of Vowpal Wabbit when trying to do online multiple-pass learning. Specifically, I need to solve a Ridge Linear regression problem, with N=4e6 points and a total of around K=2.38e5 features.…
15 votes · 4 answers

Predictive models: statistics can't possibly beat machine learning?

I am currently following a master program focused on statistics/econometrics. In my master, all students had to do 3 months of research. Last week, all groups had to present their research to the rest of the master students. Almost every group did…
15 votes · 2 answers

How to calculate out-of-sample R squared?

I know this probably has been discussed somewhere else, but I have not been able to find an explicit answer. I am trying to use the formula $R^2 = 1 - SSR/SST$ to calculate out-of-sample $R^2$ of a linear regression model, where $SSR$ is the sum of…
crazydriver · 151
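
A minimal sketch on simulated data, assuming $SST$ is taken around the training mean (the benchmark "no-model" prediction actually available out of sample); some authors use the test mean instead, which is one source of the ambiguity.

    set.seed(1)
    d <- data.frame(x = rnorm(200))
    d$y <- 2 * d$x + rnorm(200)
    train <- 1:150
    test  <- 151:200

    fit  <- lm(y ~ x, data = d[train, ])
    pred <- predict(fit, d[test, ])

    ssr <- sum((d$y[test] - pred)^2)               # out-of-sample residual SS
    sst <- sum((d$y[test] - mean(d$y[train]))^2)   # total SS around training mean
    1 - ssr / sst                                  # out-of-sample R^2
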
13 votes · 1 answer

Difference between "in-sample" and "pseudo out-of-sample" forecasts

Is there an explicit difference between in-sample forecasts and pseudo out-of-sample forecasts? Both are meant in the context of evaluating and comparing forecasting models.
altabq · 665
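
One way to see the distinction in code: "pseudo out-of-sample" typically means re-fitting on an expanding window of the historical sample and forecasting the next point, mimicking real-time forecasting even though all the data are already in hand. A sketch on a simulated AR(1) series:

    set.seed(1)
    y <- arima.sim(list(ar = 0.7), n = 120)

    err <- sapply(100:119, function(t) {
      fit <- arima(y[1:t], order = c(1, 0, 0))   # fit on data up to time t only
      y[t + 1] - predict(fit, n.ahead = 1)$pred  # one-step-ahead forecast error
    })
    sqrt(mean(err^2))   # pseudo out-of-sample RMSE
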
12 votes · 4 answers

What is the more appropriate way to create a hold-out set: to remove some subjects or to remove some observations from each subject?

I have a dataset with 26 features and 31000 rows, covering 38 subjects. It is for a biometric system, so I want to be able to identify subjects. In order to have a testing set, I know I have to remove some values. So is it better to…
Aizzaac · 989
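
A sketch of the two splits on hypothetical data. Which is right depends on the task: if the subjects themselves are the classes to be identified, every subject must appear in training, so observations are held out within each subject; holding out whole subjects instead measures generalization to people the model has never seen.

    set.seed(1)
    d <- data.frame(subject = rep(1:38, each = 20), feature = rnorm(38 * 20))

    # Option A: hold out whole subjects (tests generalization to new people)
    out   <- sample(unique(d$subject), 8)
    testA <- d[d$subject %in% out, ]

    # Option B: hold out some observations within each subject
    # (needed when the subjects are the classes being identified)
    idxB  <- unlist(lapply(split(seq_len(nrow(d)), d$subject),
                           function(i) sample(i, 4)))
    testB <- d[idxB, ]
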
12 votes · 2 answers

How can a smaller learning rate hurt the performance of a gbm?

I've always subscribed to the folk wisdom that decreasing the learning rate in a gbm (gradient boosted tree model) does not hurt the out of sample performance of the model. Today, I'm not so sure. I'm fitting models (minimizing sum of squared…
Matthew Drury · 33,314
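
A hedged sketch of one usual explanation, using the gbm package on simulated data: shrinkage and the number of trees trade off, so with n.trees held fixed a much smaller learning rate can simply stop short of the optimum and look worse out of sample.

    library(gbm)
    set.seed(1)
    d <- data.frame(x = rnorm(500))
    d$y <- sin(3 * d$x) + rnorm(500, sd = 0.3)

    for (lr in c(0.1, 0.001)) {
      fit <- gbm(y ~ x, data = d, distribution = "gaussian",
                 n.trees = 200, shrinkage = lr, cv.folds = 5)
      cat("shrinkage =", lr, " min CV error =", min(fit$cv.error), "\n")
    }
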
11 votes · 2 answers

A "significant variable" that does not improve out-of-sample predictions: how to interpret?

I have a question that I think will be quite basic to a lot of users. I'm using linear regression models to (i) investigate the relationship between several explanatory variables and my response variable and (ii) predict my response variable using the…
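
A small simulation makes the phenomenon concrete: a predictor whose effect is real but tiny relative to the noise can easily be "significant" in a large sample while moving out-of-sample error almost not at all.

    set.seed(1)
    n <- 1000
    d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    d$y <- d$x1 + 0.1 * d$x2 + rnorm(n)   # x2 effect: real but small
    train <- 1:800
    test  <- 801:n

    f1 <- lm(y ~ x1,      data = d[train, ])
    f2 <- lm(y ~ x1 + x2, data = d[train, ])
    summary(f2)$coefficients["x2", "Pr(>|t|)"]   # typically "significant"

    mse <- function(f) mean((d$y[test] - predict(f, d[test, ]))^2)
    c(without_x2 = mse(f1), with_x2 = mse(f2))   # nearly identical
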
10 votes · 3 answers

Bootstrapping estimates of out-of-sample error

I know how to use bootstrap re-sampling to find confidence intervals for in-sample error or R^2:

    # Bootstrap 95% CI for R-Squared
    library(boot)
    # function to obtain R-Squared from the data
    rsq <- function(formula, data, indices) {
      d <-…
Zach · 22,308
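
For the out-of-sample counterpart, one standard recipe (not necessarily what the asker settled on) is the out-of-bag bootstrap: fit on each resample and score only the observations the resample missed. A sketch on simulated data:

    set.seed(1)
    d <- data.frame(x = rnorm(100))
    d$y <- 2 * d$x + rnorm(100)

    oob_mse <- replicate(500, {
      b   <- sample(nrow(d), replace = TRUE)
      oob <- setdiff(seq_len(nrow(d)), b)       # rows the resample missed
      fit <- lm(y ~ x, data = d[b, ])
      mean((d$y[oob] - predict(fit, d[oob, ]))^2)
    })
    quantile(oob_mse, c(0.025, 0.975))          # interval for out-of-sample MSE
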
10 votes · 1 answer

Does modeling with Random Forests require cross-validation?

As far as I've seen, opinions tend to differ about this. Best practice would certainly dictate using cross-validation (especially if comparing RFs with other algorithms on the same dataset). On the other hand, the original source states that the…
neuron · 269
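
The "original source" claim refers to the out-of-bag estimate: because each tree is fit on a bootstrap sample, the rows that tree never saw provide an internal error estimate for free. A minimal sketch with the randomForest package:

    library(randomForest)
    set.seed(1)
    fit <- randomForest(Species ~ ., data = iris)
    fit$err.rate[fit$ntree, "OOB"]   # out-of-bag error, no separate CV run needed
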