I'm training an LSTM that takes a time series as input and outputs one of five classes: 'a', 'b', 'null', 'd', 'e'. In the data, over 78% of the y-labels are 'null', so the LSTM becomes quite good at predicting 'null', with very high recall and reasonable precision, but it's essentially useless at picking out the other labels.
As it stands, then, my LSTM is no better than a trivial classifier that always predicts 'null', regardless of the input.
It's entirely possible that the data simply lacks the discriminative patterns that would let the LSTM identify non-null outcomes; for the purposes of this question, please set that possibility aside. Given that, how should one account for outputs that are this unevenly distributed? Do you force the 'null' spike into a number of sub-classifications to artificially smooth the distribution?
Is there a solution in changing the loss function, so that it weights non-null results more heavily than null results, or so that it penalizes null predictions?
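For concreteness, something like the following is what I have in mind, using the class_weight argument to Keras's model.fit. The 'balanced' inverse-frequency weights and the names X_train, y_train_int, y_train_onehot, etc. are placeholders, not my actual setup:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# y_train_int: integer class ids 0..4 for the training targets (placeholder name).
classes = np.unique(y_train_int)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train_int)
class_weight = {int(c): w for c, w in zip(classes, weights)}
# With ~78% 'null', this gives roughly {0: 3.7, 1: 3.6, 2: 0.26, 3: 3.5, 4: 3.9}.

# Keras scales each sample's contribution to the loss by the weight of its true class,
# so mistakes on the rare classes cost more than mistakes on 'null'.
model.fit(X_train, y_train_onehot,
          epochs=20, batch_size=128,
          class_weight=class_weight,
          validation_data=(X_val, y_val_onehot))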
Does it make any sense to remove all null examples from the dataset, train a classifier on that non-null subset, and then train a second network whose only job is to pick out the nulls from the full dataset?
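To make that concrete, this is roughly the two-stage setup I'm picturing. X, y_int, the choice of class indices, and the hyperparameters are placeholders rather than my actual pipeline:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.utils import to_categorical

# X: (samples, timesteps, features); y_int: integer labels 0..4 (placeholders).
NULL_CLASS = 2                # index of 'null' in the 5-way labels (assumed)
NON_NULL = [0, 1, 3, 4]       # the remaining label ids, in order

def make_lstm(n_out, activation, loss):
    m = Sequential([LSTM(1000, input_shape=X.shape[1:]),
                    Dense(n_out, activation=activation)])
    m.compile(loss=loss, optimizer='adam')
    return m

# Stage 1: binary gate -- null vs. not-null.
gate_model = make_lstm(1, 'sigmoid', 'binary_crossentropy')
gate_model.fit(X, (y_int == NULL_CLASS).astype(np.float32), epochs=10, batch_size=128)

# Stage 2: 4-way classifier trained only on the non-null subset.
mask = y_int != NULL_CLASS
y_sub = np.array([NON_NULL.index(c) for c in y_int[mask]])
sub_model = make_lstm(4, 'softmax', 'categorical_crossentropy')
sub_model.fit(X[mask], to_categorical(y_sub, 4), epochs=10, batch_size=128)

# Inference: gate first, then the 4-way model (mapped back to the original label ids).
def predict_one(x):
    if gate_model.predict(x[None])[0, 0] > 0.5:
        return NULL_CLASS
    return NON_NULL[int(np.argmax(sub_model.predict(x[None])[0]))]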
In case it is helpful, I've included the model summary and a classification report below.
Layer (type)           Output Shape         Param #
lstm_1 (LSTM)          (None, 1000)         4052000
dense_1 (Dense)        (None, 5)            5005

Total params: 4,057,005
Trainable params: 4,057,005
Non-trainable params: 0
Loss function: categorical cross-entropy
Optimizer: Adam
Dense activation: softmax
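For reference, the summary above corresponds to roughly the following Keras model. The sequence length is a placeholder, and the 12 input features are only an inference from the LSTM's parameter count, 4 * (12 + 1000 + 1) * 1000 = 4,052,000:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

TIMESTEPS = 100    # placeholder -- the actual sequence length isn't shown in the summary
N_FEATURES = 12    # inferred from the LSTM parameter count above

model = Sequential([
    LSTM(1000, input_shape=(TIMESTEPS, N_FEATURES)),
    Dense(5, activation='softmax'),   # one output per class: 'a', 'b', 'null', 'd', 'e'
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()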
             precision    recall  f1-score   support

          0       0.69      0.52      0.60      4356
          1       0.29      0.00      0.00      4472
          2       0.82      0.98      0.89     64371
          3       0.76      0.30      0.43      4666
          4       0.00      0.00      0.00      4152

avg / total       0.74      0.81      0.76     82017
I don't claim that my LSTM structure is the best. I've tried several, but I think I need to address this output distribution before settling on a structure.