I am building a speech recognition system using a Hidden Markov Model (HMM) in Python. I referred to this and this question and their answers, which were very helpful.
In my approach, I split the continuous speech into separate words. I am thinking of using an HMM to recognize each word, so the states of the HMM will be phones.
What I understand so far is that an HMM estimates the next state based on the current state (phone). But I don't get how to estimate the first state of the HMM (i.e., the first phone of the word), since there is no previous state to condition on.
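To make my current understanding concrete, here is a minimal sketch. The phone names, the `transitions` table, and all probability values are made up for illustration and are not estimated from any data.

```python
# Hypothetical three-phone word (e.g. "cat" -> k, ae, t); names are illustrative only.
phones = ["k", "ae", "t"]

# P(next phone | current phone). Each row sums to 1.
# The numbers are invented for this sketch, not learned from data.
transitions = {
    "k":  {"k": 0.6, "ae": 0.4, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.7, "t": 0.3},
    "t":  {"k": 0.0, "ae": 0.0, "t": 1.0},
}


def next_state_distribution(current):
    """Return P(next phone | current phone) as a dict."""
    return transitions[current]


def most_likely_next(current):
    """Return the phone with the highest transition probability from `current`."""
    dist = next_state_distribution(current)
    return max(dist, key=dist.get)


# This works once a current phone is known:
print(most_likely_next("k"))

# For the very first frame there is no `current` to look up,
# which is exactly the part I am asking about.
```

Every lookup in this sketch needs a current phone, so it does not say anything about how the first phone of a word should be chosen.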
Can you suggest the best approach for using an HMM to achieve this?
Also, the states of the HMM will be phones, but I don't understand what the observations should be in this problem. There are multiple frames in a single phone, and there is a feature vector corresponding to each frame. What should I use as the observation?
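Here is a small sketch of the data I am dealing with and two candidate choices I can think of. The phone labels, the 3-dimensional feature vectors, and the helper `mean_vector` are all hypothetical; real features would have more dimensions, and the labels are shown only to illustrate the frame-to-phone grouping.

```python
# Hypothetical word: (phone label, feature vector for one frame).
# Values are invented; each phone spans several frames.
word_frames = [
    ("k",  [0.1, 0.3, 0.2]),
    ("k",  [0.3, 0.1, 0.2]),
    ("ae", [0.9, 0.5, 0.4]),
    ("ae", [0.7, 0.5, 0.6]),
    ("t",  [0.2, 0.8, 0.1]),
    ("t",  [0.4, 0.6, 0.3]),
]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Candidate A: one observation per frame (the sequence is as long as the frame count).
per_frame_observations = [vec for _, vec in word_frames]

# Candidate B: one observation per phone (frames of a phone averaged together).
frames_per_phone = {}
for phone, vec in word_frames:
    frames_per_phone.setdefault(phone, []).append(vec)
per_phone_observations = {
    phone: mean_vector(vecs) for phone, vecs in frames_per_phone.items()
}

print(len(per_frame_observations), len(per_phone_observations))
```

I am not sure which of these (if either) is the right notion of an observation for the HMM.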