There are a number of applications of Gaussian Mixture Models (GMMs) to acoustic/audio data for the purposes of classification, e.g. paper1 and paper2. The use of GMMs for clustering and for position source generation is understandable.
What is unclear from the various papers is how such a model, which does not explicitly represent temporal dependencies, is applied when the data are produced with varying temporal characteristics. For example: what if the class source changes its rate of signal production? Does the methodology examine only the latest temporal component (block/window)? Such issues would change the parameterization of a GMM, would they not?
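To make the concern about temporal dependencies concrete, here is a minimal sketch showing that a plain GMM scores a sequence of frames identically regardless of their order. It assumes each frame is a scalar that the mixture treats as an independent draw. The parameter values and the names `gmm_logpdf` and `sequence_loglik` are hypothetical, chosen only for illustration and not taken from any of the papers.

```python
import math
import random


def gmm_logpdf(x, weights, means, sigmas):
    """Log-density of a 1-D Gaussian mixture at a scalar x."""
    logs = [
        math.log(w) - 0.5 * math.log(2 * math.pi * s * s) - 0.5 * ((x - m) / s) ** 2
        for w, m, s in zip(weights, means, sigmas)
    ]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in logs))


def sequence_loglik(frames, weights, means, sigmas):
    """Total log-likelihood of frames treated as independent draws."""
    return math.fsum(gmm_logpdf(x, weights, means, sigmas) for x in frames)


# Hypothetical mixture parameters (illustration only).
weights = [0.3, 0.7]
means = [-1.0, 2.0]
sigmas = [0.5, 1.5]

rng = random.Random(0)
frames = [rng.gauss(0.0, 1.0) for _ in range(200)]  # stand-in for audio frames
shuffled = frames[:]
rng.shuffle(shuffled)  # destroy the temporal order

ll_ordered = sequence_loglik(frames, weights, means, sigmas)
ll_shuffled = sequence_loglik(shuffled, weights, means, sigmas)
print(math.isclose(ll_ordered, ll_shuffled, rel_tol=1e-12))
```

Because the total is a sum of per-frame terms, shuffling the frames leaves it unchanged. Any sensitivity to ordering or production rate would therefore have to come from how the frames or features are constructed, not from the mixture itself.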
It also appears that the parameters $\boldsymbol{\mu}_i, \boldsymbol{\sigma}_i$ would have the same dimensionality as the number of samples in the audio sequence under examination, correct? That is, if $\boldsymbol{\mu}_i \in \mathbb{R}^\tau$, is the signal data $D$ being examined the window $D_{T-\tau,\ldots,T}$?
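The reading described above can be sketched as follows. This is only an illustration of the assumption in the question (raw samples from the most recent window fed directly to a diagonal-covariance GMM), not a claim about what the papers do. The window length `tau`, the toy sine signal, the parameter values, and the function `diag_gmm_logpdf` are all hypothetical.

```python
import math


def diag_gmm_logpdf(x, weights, means, sigmas):
    """Log-density of a diagonal-covariance GMM at a vector x."""
    logs = []
    for w, mu, sigma in zip(weights, means, sigmas):
        if not (len(mu) == len(sigma) == len(x)):
            raise ValueError("mu_i and sigma_i must have the same length as x")
        log_comp = math.log(w)
        for xj, mj, sj in zip(x, mu, sigma):
            log_comp += -0.5 * math.log(2 * math.pi * sj * sj) - 0.5 * ((xj - mj) / sj) ** 2
        logs.append(log_comp)
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in logs))


tau = 8  # hypothetical window length
signal = [math.sin(0.3 * t) for t in range(100)]  # toy signal, illustration only
window = signal[-tau:]  # the tau most recent samples

# Two hypothetical components; each mu_i and sigma_i lives in R^tau.
weights = [0.5, 0.5]
means = [[0.0] * tau, [0.5] * tau]
sigmas = [[1.0] * tau, [2.0] * tau]

log_p = diag_gmm_logpdf(window, weights, means, sigmas)
print(len(window), len(means[0]), len(sigmas[0]))
```

Under this reading, every component's mean and standard-deviation vectors must have exactly `tau` entries, so changing the window length changes the dimensionality of every parameter in the model.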
The question is, how can GMMs be adapted in practice for the purposes of signal classification?