Questions tagged [train-test-split]
30 questions
10
votes
4 answers
I've already used my entire dataset in a regression, should I not use that as a prediction model?
At the hospital I work at we were writing a paper on what variables about a patient predict whether they'll return for a follow-up visit. We included variables such as age, gender, distance from their home to the hospital, mechanism of injury and…

Joe Crozier
3
votes
1 answer
Why does error rate of kNN increase when k approaches size of training set?
I've been experimenting with the effect that different values of k have on the generalisation error of a kNN classifier, and I've gotten some unexpected results towards the end, when k approaches the size of the dataset.
The result that I get is the…

namiyousef
3
votes
2 answers
What is the role of 'shuffle' in train_test_split()?
Wondering what shuffle does if I set it to True when splitting a dataset into train and test splits. Can I use shuffle on a dataset which is ordered by dates?
train, test = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)
Example…

Abhiram
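For the `shuffle` question above, a minimal sketch of the difference may help. It assumes scikit-learn is installed; the toy list `days` is hypothetical and stands in for rows already ordered by date.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for rows already sorted by date (0 = oldest).
days = list(range(10))

# shuffle=False keeps the original order: the last 20% becomes the test set.
train_ordered, test_ordered = train_test_split(days, test_size=0.2, shuffle=False)

# shuffle=True (the default) permutes the rows before splitting, so test rows
# can be older than training rows; random_state makes the permutation repeatable.
train_shuffled, test_shuffled = train_test_split(
    days, test_size=0.2, random_state=42, shuffle=True
)

print(train_ordered, test_ordered)
print(sorted(train_shuffled), sorted(test_shuffled))
```

For date-ordered data, the unshuffled split is usually the safer choice when the goal is to predict the future from the past, since shuffling lets later observations leak into training.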
2
votes
1 answer
Validity of basic train - test - split for a time series using a RNN
I am trying to determine if a simple train-test split is valid for a time series if I use a Recurrent Neural Network (LSTM). Let's say I have samples (x) which consist of two days' values (time steps) and the y variable represents the day after (i.e.…

JmML
2
votes
1 answer
Should outliers be detected before or after the train-test split?
Outliers are usually first detected using a boxplot, then the suspicious observations may be sent to experts to judge whether they are true outliers (contaminated data) or leverage points.
Suppose I need to perform model selection…

Master Shi
1
vote
1 answer
random split vs time based split of train and test data
I have been working on a binary classification problem using algorithms such as Random Forest, boosting methods, neural networks and logistic regression. I have data from Jan 2017 to Jan 2022. We wish to train the model based on historical (completed…

The Great
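As a rough illustration of the time-based alternative raised in the question above, the sketch below splits on a cutoff date instead of sampling at random. The monthly records and the cutoff of 1 Jan 2021 are assumptions made for illustration only; the question itself only states the Jan 2017 to Jan 2022 range.

```python
from datetime import date

# Hypothetical records: one (date, label) pair per month, Jan 2017 to Jan 2022.
records = [(date(2017 + m // 12, m % 12 + 1, 1), m % 2) for m in range(61)]

# Assumed cutoff: rows before it are training data, the rest are test data.
cutoff = date(2021, 1, 1)
train = [r for r in records if r[0] < cutoff]
test = [r for r in records if r[0] >= cutoff]

print(len(train), len(test))
```

With this layout every test row is later than every training row, which mirrors how the model would be used once deployed; a random split gives no such guarantee.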
1
vote
0 answers
How to guarantee the test set is "independent"?
In Machine Learning (ML) tasks, one splits the dataset into training and test sets. We train the ML model on the training set, and then we evaluate the performance of the model on the test set.
It is always crucial to have "independent"…

Hamed
1
vote
1 answer
If my test size is small, should the validation set be the same size?
I know there is a rule of thumb to split the data to 70%-90% train data and 30%-10% validation data.
But if my test size is small, for example: its size is 5% of the size of the train set, and I can't make it bigger, should the validation data be…

Amit S
1
vote
0 answers
Schematic of ML model training and testing process?
I'm currently getting confused by how to train a model and then to cross validate it.
Many tutorials seem to show that the process is as follows:
Define model, e.g. model = LogisticRegression()
Split data - X,y
Cross validation…

ryan132442
1
vote
1 answer
What's the official name of the "crop test"?
I use the name "crop test" (and say my model passed the "crop test") for the check where I remove data from my dataset, conveniently before some events in the data, to see whether the historical predictions match the latest predictions in the cropped version of the…

SkyWalker
1
vote
1 answer
Stratify a train / test split according to some categorical variables
I would like to train / test split a dataset in such a way that all categories of categorical variables are in both train and test split.
I tried (using scikit-learn):
df_moto_train , df_moto_test = train_test_split( df_moto , test_size = 0.15 ,…

Fabrice BOUCHAREL
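One common way to approach the stratification question above is to combine the categorical variables into a single key and pass it to the `stratify` argument of `train_test_split`. The sketch below assumes scikit-learn is installed; the `brand` and `fuel` variables and the toy rows are hypothetical, not taken from the asker's `df_moto`.

```python
from sklearn.model_selection import train_test_split

# Hypothetical rows: (brand, fuel) pairs, 20 rows per combination.
rows = [(brand, fuel) for brand in ("A", "B") for fuel in ("gas", "electric")] * 20

# One stratification key per row, combining both categorical variables.
strata = [f"{brand}_{fuel}" for brand, fuel in rows]

# stratify=strata keeps the proportion of each combination roughly equal
# in the train and test parts.
train_rows, test_rows = train_test_split(
    rows, test_size=0.15, stratify=strata, random_state=0
)

print(len(train_rows), len(test_rows))
```

Note that every combined category needs at least two rows, otherwise scikit-learn raises a `ValueError`; very rare combinations may have to be merged or handled separately.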
1
vote
0 answers
train/validate/test split for time series anomaly detection
I'm trying to perform multivariate time series anomaly detection. I have training data that consists of "normal" data. I train on this data and detect anomalies on the test set that contains normal + anomalous data. My understanding is that it…

siaabd001
1
vote
1 answer
Train/dev/test split with limited and skewed positive labels
(Because of the sensitive nature of the actual project, I am using an analogy here. I hope it's clear, if not, please let me know!)
My goal is to classify images as cats or dogs (binary classification). I have a large dataset with images of…

stinodego
1
vote
0 answers
Python library to split dataset with token-level labels into dev, train and test
I have two datasets and I need to split them into dev, train and test. The first one is a regular dataset where in each line there is a sentence along with a label. I found the following code in a post in TowardsDataScience which is perfect for this…

sthemeli
1
vote
0 answers
What is the difference between splitting the dataset into training and testing or collecting the training and testing data separately?
I am working on active learning and I was wondering what the difference is between splitting the dataset into training and testing sets and collecting and labeling the training and testing datasets separately. Either way, the ratio between training and testing…

Phoenix