I am currently working on a multiclass classification task on sequence data and am using tf.contrib.crf.crf_log_likelihood to compute sentence-level log-likelihood values.
In particular, it implements a linear-chain CRF: the log-likelihood of a tag sequence is the sum of the unary and binary (transition) scores along that sequence, normalised by subtracting the log-sum-exp over the alpha values from the forward pass, i.e. the log partition function.
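For context, this is roughly how I call it (the shapes and variable names here are illustrative placeholders, not my exact code):

```python
import tensorflow as tf

batch_size, max_seq_len, num_tags = 32, 50, 10  # illustrative sizes

# Unary (emission) scores from the network, gold tags, true sentence lengths
unary_scores = tf.placeholder(tf.float32, [batch_size, max_seq_len, num_tags])
tag_indices = tf.placeholder(tf.int32, [batch_size, max_seq_len])
seq_lengths = tf.placeholder(tf.int32, [batch_size])

# Returns one sentence-level log-likelihood per example, already normalised
# by the log partition function, plus the transition (binary) score matrix
log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    inputs=unary_scores,
    tag_indices=tag_indices,
    sequence_lengths=seq_lengths)
```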
If I understand correctly, the output of this function is the same quantity as formula (13) in Natural Language Processing (Almost) from Scratch by Collobert et al. (2011).
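Quoting from memory (so the notation may differ slightly from the paper), that formula is the sentence-level log-likelihood

```latex
\log p\big([y]_1^T \mid [x]_1^T, \theta\big)
  = s\big([x]_1^T, [y]_1^T, \theta\big)
  - \log \sum_{\forall [j]_1^T} \exp\, s\big([x]_1^T, [j]_1^T, \theta\big),
```

where s(·) sums the unary network scores and the binary transition scores along a tag path, and the second term is the log-sum-exp over all possible tag paths, which crf_log_likelihood computes via the forward pass.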
Furthermore, in my system the training objective is the average negative log-likelihood over each batch, minimised with tf.train.AdamOptimizer.
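Concretely, the loss and update step look like this (continuing the sketch above; the learning rate is a placeholder):

```python
# Average negative log-likelihood over the batch as the training objective
loss = tf.reduce_mean(-log_likelihood)
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)
```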
After training for roughly two epochs, the maximum log-likelihood value over a batch starts to become positive.
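This is how I observe it (a sketch continuing from the snippets above; `feed` stands for one of my real batches):

```python
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... run train_op over batches for ~2 epochs ...
    batch_ll = sess.run(log_likelihood, feed_dict=feed)  # `feed` is a placeholder name
    print('max log-likelihood over batch:', batch_ll.max())  # becomes > 0
```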
I am wondering how this can happen. Would a positive log-likelihood not imply a probability greater than 1?