
I have 200k sequences, and each element of a sequence is a vector of length 200. I plan to learn an HMM from this data, using the Baum-Welch EM algorithm to infer the transition and emission probabilities. I want to know whether I can do the fitting in batches of sequences (i.e., learn an HMM from, say, 1000 sequences first, then train that HMM on the next 1000 sequences, and so on). Is this right? Why should I, or should I not, do this? How does this compare with fitting an HMM to all 200k sequences at once?

Subraveti Suraj
  • You might want to consult [Rabiner's](http://ieeexplore.ieee.org/document/18626/) excellent paper, section _V.B: Multiple Observation Sequences_. He has a clear explanation of how you have to modify the learning step. Also, have a look at [this](https://stats.stackexchange.com/a/95553/108209) answer. – DimP Apr 27 '17 at 15:17
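The multiple-sequence modification mentioned in the comment above can be sketched roughly as follows: within one EM iteration, the expected counts from every sequence are summed while the parameters stay fixed, and the parameters are re-estimated only once at the end. This is a minimal pure-Python sketch, not a reference implementation. It assumes discrete emissions for brevity (the data in the question are 200-dimensional vectors, which would need a continuous emission model), and the names `forward_backward`, `accumulate`, and `em_iteration` are hypothetical.

```python
def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass for one discrete observation sequence.

    Returns the per-step state posteriors (gamma) and the expected
    transition counts summed over time (xi_sum).
    """
    n, T = len(pi), len(obs)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[i] * B[i][obs[0]] for i in range(n)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                   for j in range(n)]
        c = sum(row)                      # scaling factor avoids underflow
        scale.append(c)
        alpha.append([a / c for a in row])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    xi_sum = [[sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                   / scale[t + 1] for t in range(T - 1))
               for j in range(n)] for i in range(n)]
    return gamma, xi_sum


def accumulate(stats, batch, pi, A, B):
    """E-step for one batch: add its expected counts to the running totals."""
    n = len(pi)
    for obs in batch:
        gamma, xi_sum = forward_backward(obs, pi, A, B)
        for i in range(n):
            stats["start"][i] += gamma[0][i]
            for j in range(n):
                stats["trans"][i][j] += xi_sum[i][j]
            for t, o in enumerate(obs):
                stats["emit"][i][o] += gamma[t][i]


def normalize(row):
    total = sum(row)
    return [x / total for x in row]


def em_iteration(batches, pi, A, B):
    """One Baum-Welch iteration: E-step over every batch, then a single M-step."""
    n, m = len(pi), len(B[0])
    stats = {"start": [0.0] * n,
             "trans": [[0.0] * n for _ in range(n)],
             "emit": [[0.0] * m for _ in range(n)]}
    for batch in batches:                 # parameters stay fixed across batches
        accumulate(stats, batch, pi, A, B)
    return (normalize(stats["start"]),
            [normalize(r) for r in stats["trans"]],
            [normalize(r) for r in stats["emit"]])


# Toy data: four short sequences over the symbols {0, 1, 2}.
seqs = [[0, 0, 1, 2, 2], [2, 1, 0, 0, 1], [1, 1, 2, 0, 2], [0, 2, 2, 1, 0]]
pi0 = [0.6, 0.4]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.5, 0.3, 0.2], [0.1, 0.3, 0.6]]

full = em_iteration([seqs], pi0, A0, B0)                  # all sequences at once
batched = em_iteration([seqs[:2], seqs[2:]], pi0, A0, B0) # same data in two batches
```

Because the batches here only feed a shared set of counts, `full` and `batched` should agree up to floating-point error. That differs from the scheme in the question, where the model is re-estimated after every batch.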

1 Answer


I wouldn't recommend the batch method you suggested, since the final trained HMM will mainly reflect the final 1000 sequences. The influence of the remaining sequences will be limited: they contribute only to the starting parameters of the model before the last training step.

That said, I realize that Baum-Welch training on a sequence set this large can be rather slow. You may wish to consider learning an initial model with the faster Viterbi training method (a.k.a. the segmental K-means algorithm; see Juang & Rabiner (1990) IEEE Transactions on Acoustics... 38, 1639-1641), and then refining the parameters further with Baum-Welch. Alternatively, you could use Viterbi training on its own; I have found that this method often yields models of comparable quality to those trained with the Baum-Welch algorithm.
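To illustrate the idea behind Viterbi training, here is a minimal pure-Python sketch: each iteration decodes the single most likely state path for every sequence and then re-estimates the parameters from hard counts along those paths. It is not taken from the cited paper or from any library. It assumes discrete emissions for brevity, and the function names and the `pseudo` smoothing constant are hypothetical choices made for this example.

```python
import math


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi_path(obs, pi, A, B):
    """Most likely state path for one discrete sequence (log-space Viterbi)."""
    n = len(pi)
    delta = [_log(pi[i]) + _log(B[i][obs[0]]) for i in range(n)]
    back = []                                   # back[t-1][j]: best predecessor of j at time t
    for o in obs[1:]:
        prev, delta, ptr = delta, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + _log(A[i][j]))
            ptr.append(best)
            delta.append(prev[best] + _log(A[best][j]) + _log(B[j][o]))
        back.append(ptr)
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(back):                  # trace back from the final state
        state = ptr[state]
        path.append(state)
    return path[::-1]


def normalize(row):
    total = sum(row)
    return [x / total for x in row]


def viterbi_train(seqs, pi, A, B, n_iter=10, pseudo=1e-3):
    """Hard-assignment EM: decode paths, count events, renormalize, repeat."""
    n, m = len(pi), len(B[0])
    for _ in range(n_iter):
        # A small pseudocount keeps unseen events from getting probability zero.
        start = [pseudo] * n
        trans = [[pseudo] * n for _ in range(n)]
        emit = [[pseudo] * m for _ in range(n)]
        for obs in seqs:
            path = viterbi_path(obs, pi, A, B)
            start[path[0]] += 1
            for s, o in zip(path, obs):
                emit[s][o] += 1
            for s, s_next in zip(path, path[1:]):
                trans[s][s_next] += 1
        pi = normalize(start)
        A = [normalize(r) for r in trans]
        B = [normalize(r) for r in emit]
    return pi, A, B


# Toy data: runs of symbol 0 followed by runs of symbol 1.
seqs = [[0, 0, 0, 0, 1, 1, 1, 1]] * 5
pi_hat, A_hat, B_hat = viterbi_train(
    seqs,
    pi=[0.5, 0.5],
    A=[[0.8, 0.2], [0.2, 0.8]],
    B=[[0.6, 0.4], [0.4, 0.6]],
)
```

Each iteration needs only one decoding pass per sequence and simple counting, which is where the speed advantage over Baum-Welch comes from. The resulting parameters could then serve as the starting point for a Baum-Welch refinement.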

Shaun Wilkinson
  • Thank you! Do you know of any libraries that implement this training method? – Subraveti Suraj Apr 29 '17 at 19:12
  • The **aphid** R package should be able to do it. It's designed for analyzing biological sequences but works well for other applications. The package isn't on CRAN yet but is available with download instructions at https://github.com/shaunpwilkinson/aphid – Shaun Wilkinson Apr 30 '17 at 23:41
  • Aah, I was looking for something in Python. Pandas/NumPy fan :p – Subraveti Suraj May 01 '17 at 00:33