I have the perfect paper for you: http://www.stat.columbia.edu/~liam/teaching/neurostat-fall17/papers/hmm/rabiner.pdf
Note that, as far as I know, there are newer Baum-Welch (BW) methods.
Attempted simple explanation for Viterbi:
The goal of the Viterbi algorithm is to find the most probable sequence of hidden states over some discrete period of time.
Start with the initial probabilities of being in each state $s$ at time $t=0$. Then increase $t$, and for each state at that time ($t=1$), find the best state at $t-1$ (best meaning the predecessor that yields the sequence with the highest probability). Save this information: for each state $s$ at $t=1$, record which state at $t=0$ is its best predecessor, along with the probability of that optimal sequence.
Then increase $t$ to $t=2$ and repeat the procedure, using the just-computed best probabilities as the probabilities for each state $s$ at $t-1$. Eventually you reach $t=T$, the last point in time. There you pick the state with the highest probability; since for each state at each time you have saved the best predecessor, you take the best state at $t=T$ and work your way backwards, always following the saved optimal predecessor.
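The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the two-state model and all its numbers are made up for the example.

```python
# Hypothetical HMM: two hidden states, two possible observations.
# All probabilities are invented purely for illustration.
states = ["Rainy", "Sunny"]
obs = [0, 1, 1]  # observation indices at t = 0, 1, 2
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": [0.1, 0.9], "Sunny": [0.6, 0.4]}

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s]    = probability of the best sequence ending in state s at time t
    # back[t][s] = best predecessor of state s at time t (the saved information)
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # find the predecessor that maximises the path probability
            prev = max(states, key=lambda r: V[t - 1][r] * trans_p[r][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev
    # pick the best final state, then walk the saved predecessors backwards
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, V[-1][last]

path, prob = viterbi(obs, states, start_p, trans_p, emit_p)
# path == ["Sunny", "Rainy", "Rainy"] for this toy model
```

Note that real implementations usually work with log probabilities, since products of many probabilities underflow quickly.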
To elaborate on what it does at a more general level: if you have a set of states $S$, each of which can be occupied at any time over some duration $T$, a naive way to find the best sequence would be to enumerate all possible sequences and compute the probability of each (i.e. the product of the state/emission and transition probabilities along the sequence). The mistake here is that many calculations are duplicated: for example, two sequences $p_1, p_2$ (each $p$ a sequence of states) may differ only in their final state, yet this approach recomputes the probability of the shared prefix for both $P(p_1)$ and $P(p_2)$.
By taking the iterative approach used in Viterbi, no calculations are duplicated: instead of the exponential $O(|S|^T)$ cost of full enumeration, it runs in $O(T \cdot |S|^2)$ time, which is a lot more efficient.