The first question you need to ask yourself is: is this reduction in F1 score larger than could be expected by chance? You can think of your classifier as having some "true" performance level (say, 75%), but any particular instance of training and testing gives you a noisy estimate of that "true" performance. It sounds like the difference you're seeing may well be within the confidence intervals of your estimates. If that is the case, then I think a likely explanation is that your classifier is already at ceiling for the model you're using. This means that adding more training data cannot significantly improve its performance any more; the bottleneck is no longer the number of training examples, but rather the architecture of the model or the inherent noise in the data.
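To make the "larger than chance" question concrete, here is a minimal sketch of a percentile-bootstrap confidence interval for F1 on a fixed test set, assuming binary labels and scikit-learn. The variable names (`y_test`, `preds_28k`, `preds_33k`) are placeholders for your own data, and note that this only captures test-set sampling noise, not run-to-run training noise:

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for F1, resampling test examples with replacement."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    scores = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample the test set
        scores[i] = f1_score(y_true[idx], y_pred[idx])
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Hypothetical usage: predictions from the 28k-model and the 33k-model
# on the same held-out test set.
# ci_28k = bootstrap_f1_ci(y_test, preds_28k)
# ci_33k = bootstrap_f1_ci(y_test, preds_33k)
```

If the two intervals overlap substantially, the drop you're seeing may just be noise.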
One thing you could try is to use a larger test set, so that your performance estimates become more precise. This will give you a better sense of whether the difference in performance is "real". ~~You could also try some intermediate training-set sizes (making sure that the data you add at each intermediate point comes from the same pool of 5k examples you added originally). If the drop in performance is real, you should see a systematic decline across sizes between 28k and 33k; if it's due to chance, you would expect a noisier pattern of performance estimates across training-set sizes.~~
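For the larger-test-set suggestion, a rough way to gauge how much the precision of your F1 estimate depends on test size is to repeatedly score random subsamples of your current test set. Again a hedged sketch: `y_test` and `preds` are placeholders, the subsample sizes obviously can't exceed your current test set, and the spread at your full test size will be somewhat underestimated by this trick:

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_spread_vs_test_size(y_true, y_pred, sizes, n_rep=500, seed=0):
    """Estimate how noisy the F1 estimate is at different test-set sizes,
    by repeatedly scoring random subsamples of the existing test set."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    for n in sizes:
        scores = [
            f1_score(y_true[idx], y_pred[idx])
            for idx in (rng.choice(len(y_true), size=n, replace=False)
                        for _ in range(n_rep))
        ]
        print(f"test size {n}: F1 std ~ {np.std(scores):.4f}")

# Hypothetical usage on your current predictions:
# f1_spread_vs_test_size(y_test, preds, sizes=[500, 1000, 2000, 4000])
```

If the standard deviation at your current test size is comparable to the drop you observed, that's a strong hint the drop is within noise.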
Edit: my 2nd suggestion above (now struck through) wasn't correct. I said you would expect performance to drop systematically only if the drop were "real", and to see a more random pattern if it were due to chance. I now realize this isn't true, because the intermediate steps aren't independent: you're adding the same data to get from 28k to 33k, just bit by bit. So you could expect the same result (a smooth change) in either case, which means this isn't a good test.
If you find that the drop can't be explained by chance, then you could examine those new training examples to see whether they are somehow systematically different from the initial training batch, and/or from the test set. But I wouldn't start worrying about that until you're fairly confident the drop isn't just a coincidence (and a coincidence seems plausible given your description).
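One simple (and admittedly crude) way to screen for such differences, assuming a numeric feature matrix, is to compare label balance and per-feature means between the two batches; for text data you'd compare things like document length or vocabulary statistics instead. All names here are hypothetical:

```python
import numpy as np

def compare_batches(X_old, y_old, X_new, y_new):
    """Quick screen for systematic differences between the original 28k
    batch and the extra 5k examples (hypothetical variable names)."""
    # 1. Label balance: a shifted class ratio alone can move F1.
    print("old positive rate:", np.mean(y_old))
    print("new positive rate:", np.mean(y_new))
    # 2. Per-feature mean shift, in units of the old batch's std, as a
    #    crude covariate-shift check; large values flag suspect features.
    shift = (X_new.mean(axis=0) - X_old.mean(axis=0)) / (X_old.std(axis=0) + 1e-12)
    worst = np.argsort(-np.abs(shift))[:10]
    for j in worst:
        print(f"feature {j}: standardized mean shift {shift[j]:+.2f}")
```

This won't catch every kind of distribution shift, but if the new batch has a noticeably different class ratio or a handful of features with large standardized shifts, that's a natural place to start looking.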