Questions tagged [train]

Training (or estimation) of statistical models or machine learning algorithms.

327 questions
331
votes
5 answers

What is the trade-off between batch size and number of iterations to train a neural network?

When training a neural network, what difference does it make to set batch size to $a$ and number of iterations to $b$ vs. batch size to $c$ and number of iterations to $d$, where $ab = cd$? To put it another way, assuming that we train the neural…
Franck Dernoncourt
  • 42,093
  • 30
  • 155
  • 271
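
The constraint $ab = cd$ in the question above can be made concrete with a small sketch. The function name `training_budget`, the dataset size, and the two configurations are illustrative assumptions, not taken from the question. It only shows that both settings process the same number of examples while performing different numbers of parameter updates; it says nothing about which one trains better.

```python
def training_budget(batch_size, iterations, dataset_size):
    """Summarise how much data a (batch size, iterations) setting consumes."""
    examples_seen = batch_size * iterations
    return {
        "examples_seen": examples_seen,             # a * b in the question's notation
        "parameter_updates": iterations,            # one gradient step per iteration
        "epochs": examples_seen / dataset_size,     # passes over the dataset
    }


# Two hypothetical settings with equal products: 32 * 1000 == 256 * 125.
config_1 = training_budget(batch_size=32, iterations=1000, dataset_size=8000)
config_2 = training_budget(batch_size=256, iterations=125, dataset_size=8000)

print(config_1)
print(config_2)
```

Both settings see the same number of examples, but the smaller batch size takes eight times as many gradient steps, each computed from a noisier estimate of the gradient.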
32
votes
1 answer

Benefits of stratified vs random sampling for generating training data in classification

I would like to know if there are any advantages to using stratified sampling instead of random sampling when splitting the original dataset into training and test sets for classification. Also, does stratified sampling introduce more bias…
gc5
  • 877
  • 2
  • 12
  • 23
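
For readers unfamiliar with the term, a stratified split samples each class separately so that the test set keeps roughly the same class proportions as the full dataset. The following is a minimal standard-library sketch; the function name `stratified_split`, the toy labels, and the 20% test fraction are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy imbalanced dataset: 10% positive, 90% negative.
labels = ["pos"] * 10 + ["neg"] * 90
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With a purely random split of the same data, the number of positive examples in the test set would vary from run to run and could occasionally be zero; the stratified version fixes it at the class's share of the test fraction.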
30
votes
3 answers

Imputation before or after splitting into train and test?

I have a data set with N ~ 5000 and about 1/2 missing on at least one important variable. The main analytic method will be Cox proportional hazards. I plan to use multiple imputation. I will also be splitting into a train and test set. Should I…
Peter Flom
  • 94,055
  • 35
  • 143
  • 276
23
votes
2 answers

Scikit correct way to calibrate classifiers with CalibratedClassifierCV

Scikit has CalibratedClassifierCV, which allows us to calibrate our models on a particular X, y pair. It also states clearly that data for fitting the classifier and for calibrating it must be disjoint. If they must be disjoint, is it legitimate to…
17
votes
5 answers

Can increasing the amount of training data make overfitting worse?

Suppose I train a neural network on dataset A and evaluate on dataset B (that has a different feature distribution than dataset A). If I increase the amount of data in dataset A by a factor of 10, is it likely to decrease accuracy on dataset B?
15
votes
2 answers

Can I (justifiably) train a second model only on the observations that a previous model predicted poorly?

Say I commit the following sins while building a predictive model: I take my dataset and split it into four subsets: Three for training (Train_A, Train_B, and Train_C) and one for validation. I train an initial model (Model_A) on Train_A. Because…
15
votes
2 answers

Is there a way to incorporate new data into an already trained neural network without retraining on all my data in Keras?

I have already trained a neural network on my data. In the future, I will receive some more data. How can I incorporate this data into my model without rebuilding it from scratch?
yalpsid eman
  • 273
  • 1
  • 2
  • 10
14
votes
3 answers

Training, testing, validating in a survival analysis problem

I've been browsing various threads here, but I don't think my exact question is answered. I have a dataset of ~50,000 students and their time to dropout. I am going to be performing proportional hazards regression with a large number of potential…
Peter Flom
  • 94,055
  • 35
  • 143
  • 276
14
votes
2 answers

Different results from randomForest via caret and the basic randomForest package

I am a bit confused: how can the results of a model trained via caret differ from those of the model in the original package? I read "Whether preprocessing is needed before prediction using FinalModel of RandomForest with caret package?" but I do not use any…
Malte
  • 263
  • 1
  • 2
  • 6
13
votes
1 answer

How to know if a learning curve from SVM model suffers from bias or variance?

I created this learning curve and I want to know if my SVM model suffers from bias or variance? How can I conclude that from this graph?
Afke
  • 267
  • 1
  • 3
  • 10
12
votes
4 answers

TfidfVectorizer: should it be used on train only or train+test

When training a model it is possible to train the Tfidf on the corpus of only the training set or also on the test set. It seems not to make sense to include the test corpus when training the model, though since it is not supervised, it is also…
PascalVKooten
  • 2,127
  • 5
  • 22
  • 34
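
The "train only" option in the question above can be sketched without any library. This is a simplified stand-in for a TF-IDF vectorizer, not scikit-learn's implementation: the helper names `fit_idf` and `transform` and the toy documents are assumptions, tokenization is a plain whitespace split, and no vector normalization is applied. The IDF weights are learned from the training documents alone and then reused for the test documents.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: log((1 + n) / (1 + df)) + 1
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def transform(doc, idf):
    """Weight term counts by the fitted IDF; terms unseen in training are dropped."""
    counts = Counter(doc.lower().split())
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the zebra sat"]

idf = fit_idf(train_docs)                                   # fit on train only
test_vectors = [transform(doc, idf) for doc in test_docs]   # reuse on test
print(test_vectors[0])
```

Because the vocabulary and weights come from the training set, the word "zebra" in the test document is ignored, which mirrors what a model would face with genuinely unseen text.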
11
votes
5 answers

Good examples/books/resources to learn about applied machine learning (not just ML itself)

I've taken an ML course previously, but now that I am working with ML related projects at my job, I am struggling quite a bit to actually apply it. I'm sure the stuff I'm doing has been researched/dealt with before, but I can't find specific…
stoneman
  • 11
  • 4
10
votes
4 answers

I've already used my entire dataset in a regression, should I not use that as a prediction model?

At the hospital I work at we were writing a paper on what variables about a patient predict whether they'll return for a follow-up visit. We included variables such as age, gender, distance from their home to the hospital, mechanism of injury and…
10
votes
3 answers

Is it in general helpful to add "external" datasets to the training dataset?

Several people have already asked "is more data helpful?": What impact does increasing the training data have on the overall system accuracy? Can increasing the amount of training data make overfitting worse? Will a model always score better on the…
gebbissimo
  • 410
  • 3
  • 12
10
votes
3 answers

Approaches when learning from huge datasets?

Basically, there are two common ways to learn against huge datasets (when you're confronted by time/space restrictions): Cheating :) - use just a "manageable" subset for training. The loss of accuracy may be negligible because of the law of…
andreister
  • 3,257
  • 17
  • 29