Questions tagged [out-of-sample]

Refers to the practice of assessing model performance on a "test", "holdout", or "out-of-sample" set of data that was not used for model building.

160 questions
68 votes · 9 answers

How can I help ensure testing data does not leak into training data?

Suppose we have someone building a predictive model, but that someone is not necessarily well-versed in proper statistical or machine learning principles. Maybe we are helping that person as they are learning, or maybe that person is using some…
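
A minimal sketch of one common safeguard, on simulated data: estimate any preprocessing (here, centering and scaling) on the training rows only, then apply those frozen parameters to the test rows, so no test-set information leaks into the fit.

    set.seed(1)
    x <- matrix(rnorm(100 * 5), nrow = 100)
    train <- 1:80
    test  <- 81:100

    mu <- colMeans(x[train, ])            # statistics from training rows only
    s  <- apply(x[train, ], 2, sd)

    x_train <- scale(x[train, ], center = mu, scale = s)
    x_test  <- scale(x[test, ],  center = mu, scale = s)  # reuse training statistics
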
33 votes · 3 answers

Do we need a test set when using k-fold cross-validation?

I've been reading about k-fold validation, and I want to make sure I understand how it works. I know that for the holdout method, the data is split into three sets, and the test set is only used at the very end to assess the performance of the…
b_pcakes · 435
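
A minimal sketch of the usual arrangement, on simulated data: the test set is carved off first, k-fold cross-validation runs only on the remainder for model selection, and the test set is touched exactly once at the end.

    set.seed(1)
    d <- data.frame(x = rnorm(150))
    d$y <- 2 * d$x + rnorm(150)

    idx   <- sample(nrow(d))
    test  <- idx[1:30]                 # held out until the very end
    train <- idx[31:150]               # used for 5-fold cross-validation

    folds <- cut(seq_along(train), breaks = 5, labels = FALSE)
    cv_mse <- sapply(1:5, function(i) {
      fit <- lm(y ~ x, data = d[train[folds != i], ])
      mean((d$y[train[folds == i]] - predict(fit, d[train[folds == i], ]))^2)
    })
    mean(cv_mse)                       # CV estimate, used only for model choice

    final <- lm(y ~ x, data = d[train, ])
    mean((d$y[test] - predict(final, d[test, ]))^2)  # single look at the test set
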
29 votes · 4 answers

Has the journal Science endorsed Garden of Forking Paths analyses?

The idea of adaptive data analysis is that you alter your plan for analyzing the data as you learn more about it. In the case of exploratory data analysis (EDA), this is generally a good idea (you are often looking for unforeseen patterns in the…
22 votes · 5 answers

New revolutionary way of data mining?

The following excerpt is from Schwager's Hedge Fund Market Wizards (May 2012), an interview with the consistently successful hedge fund manager Jaffray Woodriff. To the question "What are some of the worst errors people make in data mining?": A…
vonjd · 5,886
17 votes · 1 answer

Is Kaggle's private leaderboard a good predictor of out-of-sample performance of the winning model?

While the results of the private test set cannot be used to refine the model further, isn't model selection out of a huge number of models being performed based on the private test set results? Would you not, through that process alone, end up…
rinspy · 3,188
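
The selection effect the asker suspects is easy to demonstrate with a toy simulation: score many equally useless models on one fixed "private" test set, and the winner's score drifts well above true performance.

    set.seed(1)
    n_models <- 1000
    n_test   <- 1000
    # every "model" is a coin flip with true accuracy 0.5
    scores <- replicate(n_models, mean(rbinom(n_test, 1, 0.5)))
    max(scores)   # the "winner" scores well above 0.5; the gap is pure selection bias
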
16 votes · 4 answers

Why isn't the holdout method (splitting data into training and testing) used in classical statistics?

In my classroom exposure to data mining, the holdout method was introduced as a way of assessing model performance. However, when I took my first class on linear models, this was not introduced as a means of model validation or assessment. My online…
15 votes · 0 answers

Confusion with Vowpal Wabbit's multiple-pass behavior when performing ridge regression

I have encountered many peculiarities/misunderstandings of Vowpal Wabbit when trying to do online multiple-pass learning. Specifically, I need to solve a Ridge Linear regression problem, with N=4e6 points and a total of around K=2.38e5 features.…
15 votes · 4 answers

Predictive models: statistics can't possibly beat machine learning?

I am currently following a master program focused on statistics/econometrics. In my master, all students had to do 3 months of research. Last week, all groups had to present their research to the rest of the master students. Almost every group did…
15 votes · 2 answers

How to calculate out-of-sample R squared?

I know this probably has been discussed somewhere else, but I have not been able to find an explicit answer. I am trying to use the formula $R^2 = 1 - SSR/SST$ to calculate out-of-sample $R^2$ of a linear regression model, where $SSR$ is the sum of…
crazydriver · 151
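
A minimal sketch on simulated data, assuming $SST$ is taken around the training mean (the benchmark "no-model" prediction actually available out of sample); some authors use the test mean instead, which is one source of the ambiguity.

    set.seed(1)
    d <- data.frame(x = rnorm(200))
    d$y <- 2 * d$x + rnorm(200)
    train <- 1:150
    test  <- 151:200

    fit  <- lm(y ~ x, data = d[train, ])
    pred <- predict(fit, d[test, ])

    ssr <- sum((d$y[test] - pred)^2)               # out-of-sample residual SS
    sst <- sum((d$y[test] - mean(d$y[train]))^2)   # total SS around training mean
    1 - ssr / sst                                  # out-of-sample R^2
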
13 votes · 1 answer

Difference between "in-sample" and "pseudo out-of-sample" forecasts

Is there an explicit difference between in-sample forecasts and pseudo out-of-sample forecasts? Both are meant in the context of evaluating and comparing forecasting models.
altabq · 665
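
One way to see the distinction in code: "pseudo out-of-sample" typically means re-fitting on an expanding window of the historical sample and forecasting the next point, mimicking real-time forecasting even though all the data are already in hand. A sketch on a simulated AR(1) series:

    set.seed(1)
    y <- arima.sim(list(ar = 0.7), n = 120)

    err <- sapply(100:119, function(t) {
      fit <- arima(y[1:t], order = c(1, 0, 0))   # fit on data up to time t only
      y[t + 1] - predict(fit, n.ahead = 1)$pred  # one-step-ahead forecast error
    })
    sqrt(mean(err^2))   # pseudo out-of-sample RMSE
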
12 votes · 4 answers

What is the more appropriate way to create a hold-out set: to remove some subjects or to remove some observations from each subject?

I have a dataset with 26 features and 31000 rows, covering 38 subjects. It is for a biometric system, so I want to be able to identify subjects. In order to have a testing set, I know I have to remove some values. So is it better to…
Aizzaac · 989
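
A sketch of the two splits on hypothetical data. Which is right depends on the task: if the subjects themselves are the classes to be identified, every subject must appear in training, so observations are held out within each subject; holding out whole subjects instead measures generalization to people the model has never seen.

    set.seed(1)
    d <- data.frame(subject = rep(1:38, each = 20), feature = rnorm(38 * 20))

    # Option A: hold out whole subjects (tests generalization to new people)
    out   <- sample(unique(d$subject), 8)
    testA <- d[d$subject %in% out, ]

    # Option B: hold out some observations within each subject
    # (needed when the subjects are the classes being identified)
    idxB  <- unlist(lapply(split(seq_len(nrow(d)), d$subject),
                           function(i) sample(i, 4)))
    testB <- d[idxB, ]
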
12 votes · 2 answers

How can a smaller learning rate hurt the performance of a gbm?

I've always subscribed to the folk wisdom that decreasing the learning rate in a gbm (gradient boosted tree model) does not hurt the out of sample performance of the model. Today, I'm not so sure. I'm fitting models (minimizing sum of squared…
Matthew Drury · 33,314
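
A hedged sketch of one usual explanation, using the gbm package on simulated data: shrinkage and the number of trees trade off, so with n.trees held fixed a much smaller learning rate can simply stop short of the optimum and look worse out of sample.

    library(gbm)
    set.seed(1)
    d <- data.frame(x = rnorm(500))
    d$y <- sin(3 * d$x) + rnorm(500, sd = 0.3)

    for (lr in c(0.1, 0.001)) {
      fit <- gbm(y ~ x, data = d, distribution = "gaussian",
                 n.trees = 200, shrinkage = lr, cv.folds = 5)
      cat("shrinkage =", lr, " min CV error =", min(fit$cv.error), "\n")
    }
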
11 votes · 2 answers

A "significant variable" that does not improve out-of-sample predictions: how to interpret?

I have a question that I think will be quite basic to a lot of users. I'm using linear regression models to (i) investigate the relationship between several explanatory variables and my response variable and (ii) predict my response variable using the…
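
A small simulation makes the phenomenon concrete: a predictor whose effect is real but tiny relative to the noise can easily be "significant" in a large sample while moving out-of-sample error almost not at all.

    set.seed(1)
    n <- 1000
    d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    d$y <- d$x1 + 0.1 * d$x2 + rnorm(n)   # x2 effect: real but small
    train <- 1:800
    test  <- 801:n

    f1 <- lm(y ~ x1,      data = d[train, ])
    f2 <- lm(y ~ x1 + x2, data = d[train, ])
    summary(f2)$coefficients["x2", "Pr(>|t|)"]   # typically "significant"

    mse <- function(f) mean((d$y[test] - predict(f, d[test, ]))^2)
    c(without_x2 = mse(f1), with_x2 = mse(f2))   # nearly identical
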
10 votes · 3 answers

Bootstrapping estimates of out-of-sample error

I know how to use bootstrap re-sampling to find confidence intervals for in-sample error or R^2:

    # Bootstrap 95% CI for R-Squared
    library(boot)
    # function to obtain R-Squared from the data
    rsq <- function(formula, data, indices) {
      d <-…
Zach · 22,308
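
For the out-of-sample counterpart, one standard recipe (not necessarily what the asker settled on) is the out-of-bag bootstrap: fit on each resample and score only the observations the resample missed. A sketch on simulated data:

    set.seed(1)
    d <- data.frame(x = rnorm(100))
    d$y <- 2 * d$x + rnorm(100)

    oob_mse <- replicate(500, {
      b   <- sample(nrow(d), replace = TRUE)
      oob <- setdiff(seq_len(nrow(d)), b)       # rows the resample missed
      fit <- lm(y ~ x, data = d[b, ])
      mean((d$y[oob] - predict(fit, d[oob, ]))^2)
    })
    quantile(oob_mse, c(0.025, 0.975))          # interval for out-of-sample MSE
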
10 votes · 1 answer

Does modeling with Random Forests require cross-validation?

As far as I've seen, opinions tend to differ about this. Best practice would certainly dictate using cross-validation (especially if comparing RFs with other algorithms on the same dataset). On the other hand, the original source states that the…
neuron · 269
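
The "original source" claim refers to the out-of-bag estimate: because each tree is fit on a bootstrap sample, the rows that tree never saw provide an internal error estimate for free. A minimal sketch with the randomForest package:

    library(randomForest)
    set.seed(1)
    fit <- randomForest(Species ~ ., data = iris)
    fit$err.rate[fit$ntree, "OOB"]   # out-of-bag error, no separate CV run needed
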