I agree with you that the paper's methodology is not totally clear. To understand it better, you should also check the paper they refer to, Bertsimas et al. (2003), which is more verbose about the model. It appears that what they mean by "mixture" is that they modeled the probability of observing a particular session $\boldsymbol{S}_i$ given the outcome of the session $C_i$, i.e. $P(\boldsymbol{S}_i \mid C_i)$, using separate RNNs. In such a case, the probability of observing a particular outcome $C_i$ given the session can be calculated using Bayes' theorem
$$
P(C_i = \omega \mid \boldsymbol{S}_i) = \frac{P(\boldsymbol{S}_i \mid C_i = \omega)\,P(C_i = \omega)}{\sum_{\omega' \in \Omega} P(\boldsymbol{S}_i \mid C_i = \omega')\,P(C_i = \omega')}
$$
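To make this concrete, here is a minimal sketch (in Python, not from the paper) of how per-outcome sequence models could be combined through Bayes' theorem. The `logliks` and `log_priors` containers are hypothetical placeholders for the trained class-conditional models and the outcome frequencies:

```python
import numpy as np

def posterior_over_outcomes(session, logliks, log_priors):
    """Bayes' rule: P(C_i = w | S_i) is proportional to P(S_i | C_i = w) * P(C_i = w).

    logliks    -- dict mapping outcome w to a function session -> log P(S_i | C_i = w)
                  (one trained sequence model per outcome, e.g. an RNN or n-gram model)
    log_priors -- dict mapping outcome w to log P(C_i = w), e.g. the log class
                  frequencies of the outcomes in the training data
    """
    outcomes = list(logliks)
    # unnormalized log-posterior for every outcome
    scores = np.array([logliks[w](session) + log_priors[w] for w in outcomes])
    # normalize with the log-sum-exp trick for numerical stability
    scores -= scores.max()
    probs = np.exp(scores)
    probs /= probs.sum()
    return dict(zip(outcomes, probs))
```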
It is hard to say what exactly $P(\boldsymbol{S}_i \mid C_i)$ is. They refer to Bertsimas et al. (2003), who modeled it using a Markov chain; here they use RNNs instead. Then they say:
> Taking inspiration from the Automatic Speech Recognition (ASR) community and similarities to “Language Modeling”, we adapted some of their more recent techniques to our problem. In preliminary experiments, 5-grams performed better than shorter chains, so we used them.
From this description we cannot know what they mean by "some of their more recent techniques". Maybe they used an RNN with skip-grams, i.e. given the history they predicted the next event in the sequence, but this is only a guess.
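If the n-gram reading is right, each class-conditional likelihood $P(\boldsymbol{S}_i \mid C_i = \omega)$ would be a small language model over session events, trained only on sessions with outcome $\omega$. Below is a rough sketch of that interpretation, assuming add-one smoothing; it is not the paper's actual implementation:

```python
from collections import Counter
import math

class EventNgramModel:
    """Toy class-conditional 5-gram model over session events.

    Only a sketch of the n-gram guess above, not the paper's method.
    A session is a list of event tokens; log P(S | C) is the sum of
    log P(event_t | previous n-1 events), with add-one smoothing.
    """

    def __init__(self, n=5):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def fit(self, sessions):
        for s in sessions:
            padded = ["<s>"] * (self.n - 1) + list(s) + ["</s>"]
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = tuple(padded[i - self.n + 1:i])
                self.ngram_counts[context + (padded[i],)] += 1
                self.context_counts[context] += 1
        return self

    def log_likelihood(self, session):
        """Return log P(S | C) for one session under this class's model."""
        padded = ["<s>"] * (self.n - 1) + list(session) + ["</s>"]
        V = len(self.vocab)
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = tuple(padded[i - self.n + 1:i])
            count = self.ngram_counts[context + (padded[i],)] + 1  # add-one smoothing
            total += math.log(count / (self.context_counts[context] + V))
        return total
```

Fitting one such model per outcome and passing their `log_likelihood` methods as the `logliks` dict in the Bayes-rule sketch above would give the posterior over outcomes.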
They call it a "mixture" in the sense of a mixture distribution (see the mixture-distribution tag), where the distribution can be thought of as a mixture of different distributions $P(\boldsymbol{S}_i \mid C_i)$ with some mixing weights $P(C_i = \omega)$. Mixtures are used in clustering, but the idea also generalizes to mixtures of regression models or mixtures of neural networks.
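On that reading, the marginal distribution of a session is literally a mixture of the class-conditional sequence models, with the outcome probabilities as mixing weights,
$$
P(\boldsymbol{S}_i) = \sum_{\omega \in \Omega} P(C_i = \omega)\, P(\boldsymbol{S}_i \mid C_i = \omega),
$$
which is exactly the denominator of the Bayes formula above.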
TL;DR: they have separate models for different outcomes $C_i$ and then weight the models using the overall frequencies of the outcomes in the data, $P(C_i = \omega)$, but from the description it is not very clear how exactly they do this, because they didn't give the details.