Highest Voted 'train-test-split' Questions - Statistical Analysis Stack Exchange

10

votes

4 answers

I've already used my entire dataset in a regression, should I not use that as a prediction model?

At the hospital I work at we were writing a paper on what variables about a patient predict whether they'll return for a follow-up visit. We included variables such as age, gender, distance from their home to the hospital, mechanism of injury and…

asked Oct 25 '21 at 19:29

Joe Crozier

247
1
9

3

votes

1 answer

Why does error rate of kNN increase when k approaches size of training set?

I've been experimenting with the effect that different values of k have on the generalisation error of kNN classifier, and I've gotten some unexpected results towards the end when k approaches the size of the dataset. The result that I get is the…

classification k-nearest-neighbour train-test-split

asked Nov 04 '21 at 18:42

namiyousef

43
5
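
One effect that may be relevant to the question above: with unweighted voting, a kNN classifier whose k equals the training set size lets every training point vote on every query, so it always predicts the majority class. The sketch below illustrates this in plain Python. The 1-D toy data and the helper names `knn_predict` and `error_rate` are made up for illustration and are not taken from the question.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Unweighted majority vote among the k nearest training points (1-D inputs)."""
    nearest = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): six points of class 0 on the left, four of class 1 on the right.
train_X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 11.0, 12.0, 13.0]
train_y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

test_X = [0.5, 4.5, 10.5, 12.5]
test_y = [0, 0, 1, 1]


def error_rate(k):
    """Fraction of test points misclassified for a given k."""
    preds = [knn_predict(train_X, train_y, x, k) for x in test_X]
    return sum(p != t for p, t in zip(preds, test_y)) / len(test_y)


small_k_error = error_rate(3)            # local neighbourhoods decide the vote
full_k_error = error_rate(len(train_X))  # every training point votes on every query
```

With k equal to the training set size, the vote is 6 to 4 in favour of class 0 for every query, so all class 1 test points are misclassified regardless of where they lie.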

3

votes

2 answers

What is the role of 'shuffle' in train_test_split()?

Wondering what shuffle does if I set it to True when splitting a dataset into train and test splits. Can I use shuffle on a dataset which is ordered by dates? train, test = train_test_split(df, test_size=0.2, random_state=42, shuffle=True) Example…

machine-learning python scikit-learn train-test-split

asked Oct 31 '21 at 07:08

Abhiram

131
4
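
A minimal sketch of what the `shuffle` argument changes, assuming pandas and scikit-learn are installed. The frame and its column names (`date`, `value`) are hypothetical stand-ins for a date-ordered dataset like the one in the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical date-ordered frame; the column names are illustrative only.
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=10, freq="D"),
    "value": range(10),
})

# shuffle=True (the default): rows are permuted before splitting,
# so test rows can come from anywhere in the date range.
train_shuf, test_shuf = train_test_split(
    df, test_size=0.2, random_state=42, shuffle=True
)

# shuffle=False: the original row order is kept and the last 20% of rows
# become the test set, so the test period follows the training period.
train_ord, test_ord = train_test_split(df, test_size=0.2, shuffle=False)
```

For data ordered by date, the unshuffled split is usually the safer starting point when the goal is to predict later dates from earlier ones, since shuffling lets future rows into the training set.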

2

votes

1 answer

Validity of a basic train-test split for a time series using an RNN

I am trying to determine if a simple train-test split is valid for a time series if I use a Recurrent Neural Network (LSTM). Let's say I have samples (x) which consist of 2 days' values (time steps) and the y variable represents the day after (ie.…

machine-learning time-series neural-networks recurrent-neural-network train-test-split

asked Jan 28 '22 at 06:44

JmML

23
3
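
The setup in the question above (two days of values as input, the following day as target) can be sketched in plain Python as below. The series values, the helper name `make_windows`, and the 75% cut point are assumptions made for illustration. The sketch shows only a chronological split of the windows and is not a statement about which split is valid for the asker's data.

```python
def make_windows(series, n_steps=2):
    """Turn a 1-D sequence into (window of n_steps values, next value) pairs."""
    X, y = [], []
    for i in range(len(series) - n_steps):
        X.append(series[i:i + n_steps])
        y.append(series[i + n_steps])
    return X, y


series = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]  # made-up daily values
X, y = make_windows(series, n_steps=2)

# Chronological split: the first 75% of windows train, the rest test.
split = int(len(X) * 0.75)
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]
```

Because consecutive windows overlap, the first test window here reuses values that already appeared as training targets. The test targets themselves never appear in the training data, which would not hold if the windows were shuffled before splitting.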

2

votes

1 answer

Should outliers be detected before or after the train-test split?

Outliers are usually first detected using a boxplot, then the suspicious observations may be sent to experts to judge whether they are true outliers (contaminated data) or leverage points. Suppose I need to perform model selection…

cross-validation outliers train-test-split

asked Jun 05 '21 at 03:54

Master Shi

643
6
10

1

vote

1 answer

Random split vs. time-based split of train and test data

I have been working on a binary classification problem using algorithms such as Random Forest, boosting methods, neural networks and logistic regression. I have data from Jan 2017 to Jan 2022. We wish to train the model based on historical (completed…

machine-learning neural-networks classification predictive-models train-test-split

asked Feb 25 '22 at 12:45

The Great

1,380
6
18

1

vote

0 answers

How to guarantee the test set is "independent"?

In Machine Learning (ML) tasks, one splits the dataset into training and test sets. We train the ML model on the training set, and then we evaluate the performance of the model on the test set. It is always crucial to have "independent"…

machine-learning model-selection dataset independence train-test-split

asked Feb 11 '22 at 16:59

Hamed

111
2

1

vote

1 answer

If my test size is small, should the validation set be the same size?

I know there is a rule of thumb to split the data to 70%-90% train data and 30%-10% validation data. But if my test size is small, for example: its size is 5% of the size of the train set, and I can't make it bigger, should the validation data be…

machine-learning validation train-test-split

asked Dec 28 '21 at 09:43

Amit S

27
7

1

vote

0 answers

Schematic of the ML model training and testing process?

I'm currently getting confused by how to train a model and then cross-validate it. Many tutorials seem to show that the process is as follows: Define model e.g. model = LogisticRegression() Split data - X,y Cross validation…

machine-learning cross-validation train-test-split

asked Dec 24 '21 at 14:01

ryan132442

361
4
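
One common arrangement of these steps, which is not necessarily the one the truncated excerpt goes on to describe, is to hold out a test set first, cross-validate on the training part only, then refit and score once on the held-out data. The sketch below assumes scikit-learn is installed and uses synthetic data from `make_classification`. The sample sizes and parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for a real X, y.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 1. Hold out a test set that is not touched until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 2. Define the model.
model = LogisticRegression(max_iter=1000)

# 3. Cross-validate on the training part only (5 folds).
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# 4. Refit on all of the training data, then evaluate once on the test set.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
```

Here `cross_val_score` fits a fresh clone of the model on each fold, so the model passed in is only actually fitted in step 4.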

1

vote

1 answer

What's the official name of the "crop test"?

I call "crop test" or whether my model passed the "crop test" when I remove data from my dataset, conveniently before some events in the data to check whether the historical predictions match the latest predictions in the cropped version of the…

dataset train train-test-split

asked Dec 09 '21 at 08:02

SkyWalker

825
1
7
12

1

vote

1 answer

Stratify a train / test split according to some categorical variables

I would like to train/test split a dataset in such a way that all categories of the categorical variables are in both the train and test splits. I tried (using scikit-learn): df_moto_train, df_moto_test = train_test_split(df_moto, test_size=0.15,…

stratification train-test-split

asked Nov 15 '21 at 06:35

Fabrice BOUCHAREL

215
2
9
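
A minimal sketch of a stratified split on a single categorical column, assuming pandas and scikit-learn are installed. The frame contents and the column names `brand` and `price` are hypothetical. Only `df_moto`, `train_test_split` and `test_size=0.15` come from the excerpt above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: "brand" stands in for one categorical variable.
df_moto = pd.DataFrame({
    "brand": ["a"] * 40 + ["b"] * 40 + ["c"] * 20,
    "price": range(100),
})

df_moto_train, df_moto_test = train_test_split(
    df_moto,
    test_size=0.15,
    random_state=0,
    stratify=df_moto["brand"],  # keep brand proportions similar in both parts
)

train_brands = set(df_moto_train["brand"])
test_brands = set(df_moto_test["brand"])
```

Stratifying needs at least two rows per category, so very rare categories have to be merged or handled separately. For several categorical variables at once, one option is to stratify on a combined key built from those columns, at the cost of creating more and smaller groups.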

1

vote

0 answers

Train/validate/test split for time series anomaly detection

I'm trying to perform multivariate time series anomaly detection. I have training data that consists of "normal" data. I train on this data and detect anomalies on the test set that contains normal + anomalous data. My understanding is that it…

machine-learning time-series anomaly-detection train-test-split

asked Oct 01 '21 at 17:51

siaabd001

11
1

1

vote

1 answer

Train/dev/test split with limited and skewed positive labels

(Because of the sensitive nature of the actual project, I am using an analogy here. I hope it's clear; if not, please let me know!) My goal is to classify images as cats or dogs (binary classification). I have a large dataset with images of…

classification train-test-split

asked Jul 27 '21 at 11:24

stinodego

13
4

1

vote

0 answers

Python library to split dataset with token-level labels into dev, train and test

I have two datasets and I need to split them into dev, train and test. The first one is a regular dataset where in each line there is a sentence along with a label. I found the following code in a post in TowardsDataScience which is perfect for this…

python dataset train-test-split

asked Jul 20 '21 at 18:36

sthemeli

111
3

1

vote

0 answers

What is the difference between splitting the dataset into training and testing sets and collecting the training and testing data separately?

I am working on active learning and I was wondering about the difference between splitting the dataset into training and testing sets and collecting and labeling the training and testing datasets separately. Either way, the ratio between training and testing…

machine-learning active-learning train-test-split

asked Jun 24 '21 at 10:20

Phoenix

111
5
