
I'm currently working on a project where I'm also the one labelling the data that I'm going to use for the model.

My model is a document-classification model that classifies whether a document belongs to a certain category or not.

The problem is that I built models at different stages, when I had different amounts of training data.

Initially, when I had about 2.8k training examples, the best performance (F1-score) I could get was about 75-76%. But when I got more data, about 3.3k training examples, my model only gave me about 73-75% F1-score.

Could anyone enlighten me as to why the model with less training data got better test scores than the model with more training data?

I'm implementing my model in Keras as a DNN, by the way (I doubt this helps much, since my question is somewhat theoretical).

  • How often did you run the training? It could be a coincidence. It could also be that the 2.8k data represents the underlying distribution well and the 3.3k set adds noise. – hh32 Nov 06 '17 at 10:16
  • Yeah, well, I don't really have a fixed number of epochs for training, since I've added an early-stopping monitor that checks whether the validation loss has stopped improving. Hmm, wouldn't more training data give the model more intuition about the different kinds of instances found in the different categories? Or does it really just add more noise? Or does it just vary depending on the situation? @hh32 – Vincent Pakson Nov 06 '17 at 10:24
  • Does your F1 score relate to running the model on the same data you've used for training or did you split train-test? – Spätzle Nov 06 '17 at 10:31
  • I split into train and test sets. My test data is not seen by the model, as is suggested when building most models. :) – Vincent Pakson Nov 06 '17 at 10:48

3 Answers


Your dataset size is too small by a factor of 10 for split-sample validation to work; everything is too unstable at this sample size. You are also using a discontinuous improper accuracy score, which will easily trick you into selecting the wrong model. I recommend using the Efron-Gong optimism bootstrap for strong internal validation with the same sample used to build the model, and using accuracy scores related to the log-likelihood, or the Brier score.
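As a rough illustration of the scoring-rule point (my own sketch, not Frank Harrell's code), scikit-learn exposes both the log loss (related to the log-likelihood) and the Brier score; `y_true` and `p_hat` below are placeholders for held-out labels and predicted probabilities:

```python
# Hypothetical sketch: evaluate predicted probabilities with proper scoring rules
# instead of a thresholded F1-score. y_true and p_hat are placeholders standing in
# for held-out labels and the model's predicted class-1 probabilities.
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)                                        # placeholder labels
p_hat = np.clip(y_true * 0.7 + rng.normal(0.15, 0.1, size=300), 0.01, 0.99)  # placeholder probabilities

print("log loss   :", log_loss(y_true, p_hat))          # logarithmic scoring rule
print("Brier score:", brier_score_loss(y_true, p_hat))  # mean squared error of the probabilities
```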

Frank Harrell
  • Hmm, interesting. Do you happen to have resources for that, sir? On the other hand, is mean squared log error related to the log-likelihood? – Vincent Pakson Nov 08 '17 at 04:25
  • A logarithmic scoring rule does not square anything. My RMS course notes and book go into details. See [links](http://www.fharrell.com/p/blog-page.html). – Frank Harrell Nov 08 '17 at 13:27

It depends on the data. Let's assume the data is valid, because you labeled it yourself. I try to think of it as a new problem. Imagine this simple example:

Let's say this is your data and the line (your DNN) that separates them:

[figure: two classes of points cleanly separated by a line]

Everything is fine, right? So now you get new data and it looks like this:

[figure: the same data plus a new black circle that falls on the wrong side of the old (blue) line]

The new black circle is classified wrongly by your old line (blue) and you need to find a new one (green). Hence adaptation is needed.

In more technical terms: it seems that the new data you introduce changes the distribution in feature space, which in turn causes the decision boundary to change.
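To make that concrete with a toy example (my own sketch, not part of the original answer), fitting a simple linear classifier before and after adding one awkward point shows the boundary move; scikit-learn's LogisticRegression stands in for the line in the figures:

```python
# Toy illustration: one new point shifts a linear decision boundary.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_old = np.array([[0.0, 0.0], [1.0, 0.2], [3.0, 3.0], [4.0, 2.8]])
y_old = np.array([0, 0, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(X_old, y_old)
print("old boundary:", clf.coef_, clf.intercept_)

# One new class-0 point that falls on the class-1 side of the old boundary.
X_new = np.vstack([X_old, [[2.8, 2.5]]])
y_new = np.append(y_old, 0)

clf_new = LogisticRegression(max_iter=1000).fit(X_new, y_new)
print("new boundary:", clf_new.coef_, clf_new.intercept_)
```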

I would try the following, since you are using a Neural Network:

  1. Adapt the batch size. As you may know, mini-batch updating introduces noise into the gradient. It could simply be that the batch you use per update is too large or too small for 3.3k data points.

  2. Change the learning rate decay. I'll just assume you are using this technique. You can try to change the rate and make it slower so that the net can adapt to the additional data.

  3. Train your net on the original 2.8k data points, use the resulting weights as an initialization, and then fine-tune on the remaining 500 data points.

  4. Change the architecture of the DNN.

These are the standard ways to approach this problem; a rough Keras sketch of points 1-3 is below.
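Here is that sketch (my own illustration, not the asker's actual model); `X_old`/`y_old` and `X_new`/`y_new` are placeholders for the original 2.8k documents and the extra ~500, already turned into feature vectors:

```python
# Hypothetical sketch of points 1-3 above, not the asker's actual code.
# X_old/y_old stand in for the original 2.8k labelled documents (already vectorised),
# X_new/y_new for the extra ~500; n_features is whatever your feature extractor produces.
import numpy as np
from tensorflow import keras

n_features = 1000
X_old, y_old = np.random.rand(2800, n_features), np.random.randint(0, 2, 2800)
X_new, y_new = np.random.rand(500, n_features), np.random.randint(0, 2, 500)

def build_model(learning_rate=1e-3):
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
# Point 2: slow the learning rate down when the validation loss plateaus.
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2)

# Train on the original 2.8k examples first.
model = build_model()
model.fit(X_old, y_old, validation_split=0.2,
          batch_size=32,                       # Point 1: try e.g. 16 / 32 / 64 here
          epochs=100, callbacks=[early_stop, reduce_lr], verbose=0)

# Point 3: keep the learned weights and fine-tune on the new ~500 examples
# with a smaller learning rate instead of retraining from scratch.
model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_new, y_new, validation_split=0.2, batch_size=32,
          epochs=50, callbacks=[early_stop], verbose=0)
```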

hh32
  • Thank you for this answer. First, I'd like to look more into your first three points, seeing that I have not implemented my model to train by batch and I've used the default learning rate. On the other hand, I used this link to help me determine the structure of my DNN: https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw – Vincent Pakson Nov 06 '17 at 10:52
  • So you are training it sample by sample? – hh32 Nov 06 '17 at 10:55
  • Sorry, I meant that I decided not to mind the size of the batch during my training; I used the default value from Keras, which is 32. I'll look into this much more. Thank you for your ideas. – Vincent Pakson Nov 06 '17 at 10:57

The first question you need to ask yourself is: is this reduction in F1-score larger than could be expected by chance? You can think of your classifier as having some "true" performance level (for instance 75%), but any particular instance of training and testing will give you a noisy estimate of that "true" performance. It seems like the difference you're seeing may be within the confidence intervals of your estimates. If that is the case, then I think a likely explanation is that your classifier is already at ceiling level for the model you're using. This means that adding more training data cannot significantly improve its performance any more. In that case, the bottleneck is not the amount of training examples, but rather the architecture of the model or the inherent noise in the data.
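One way to put numbers on "could be expected by chance" (my own sketch, not part of the original answer) is to bootstrap the test set and look at the width of the resulting confidence interval for the F1-score; `y_test` and `y_pred` below are placeholders:

```python
# Hypothetical sketch: bootstrap a 95% confidence interval for the F1-score on the
# held-out test set, to judge whether a ~1-2 point drop is within sampling noise.
# y_test and y_pred are placeholders for the true and predicted test labels.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 400                                                      # placeholder test-set size
y_test = rng.integers(0, 2, size=n)
y_pred = np.where(rng.random(n) < 0.85, y_test, 1 - y_test)  # ~85% of labels correct

scores = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)                         # resample the test set with replacement
    scores.append(f1_score(y_test[idx], y_pred[idx]))

lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"F1 = {f1_score(y_test, y_pred):.3f}, 95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```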

One thing you could try is to use a larger test set, so that your performance estimates become more precise. This will give you a better sense of whether the difference in performance is "real". ~~You could also try some intermediate sizes for your train batch (making sure that the data you add at each intermediate point is from the same pool of 5k training examples that you added originally). If the drop in performance is real, you should see a systematic decline across sizes in between 28k and 33k. If it's due to chance, you would expect to see a more noisy pattern of performance estimates across train batch sizes.~~

Edit: my 2nd suggestion above (now struck through) wasn't correct. I said you would expect the performance to drop systematically only if the drop were "real", and to see a more random pattern if the drop were due to chance. I now realize this isn't true, because the intermediate steps aren't independent: you're adding the same data to get from 28k to 33k, only you're doing it bit by bit. So in fact you could expect the same result (a smooth change) in both cases, and therefore this isn't a good test.

If you find that the drop can't be explained by chance, then you could examine those new training examples to see whether they are somehow systematically different from the initial training batch, and/or from the test set. But I wouldn't start worrying about that until you're fairly confident that it's not just a coincidence (which seems plausible given your description).

Ruben van Bergen
  • This is an interesting point. Thank you for this answer. I'd like to ask whether what you meant by intermediate sizes for the train batch is my batch size. Apparently, as I slowly increase my batch size, the F1-score of my model also decreases. And yes, the batches come from the same pool of training examples that I have. – Vincent Pakson Nov 06 '17 at 11:20
  • I meant to try increasing the training batch from the initial 28k samples, in small steps, to the final size of 33k, training & testing the classifier at each step. However, now that I think about it, this isn't really a very good diagnostic and my prediction above was incorrect (I'll edit my answer to explain). – Ruben van Bergen Nov 06 '17 at 12:33
  • Unfortunately, my training set is not 28k but 2.8k, since labelling the data is a costly task for a single person. As for the test set: initially, when I used 30% of my data as test (a 70-30 train-test split), the test score was about 70%; when I increased it to a 60-40 split, I got my current results. Last time I tried a 50-50 split, which for me is inadvisable, and I got a 68% test score. I believe I've found the optimal split for training my model. – Vincent Pakson Nov 07 '17 at 03:07