
I'm trying to understand the underlying mechanisms of LSTM from a programming perspective. I am no math person, and a lot of articles and papers look like alphabet soup to me. But I thought that if I can translate the process to a programming language, I may be able to understand it better (e.g. step through a debugger, discover that some equations are just APIs I already know, etc.).

The forward pass is pretty straightforward. Its equations are posted everywhere and appear to be the same, with some minor variations. Translated to code, they are pretty much a 6-liner:

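f = sigmoid(add(dot(wf, xt), dot(uf, hp), bf))   # forget gate
i = sigmoid(add(dot(wi, xt), dot(ui, hp), bi))   # input gate
o = sigmoid(add(dot(wo, xt), dot(uo, hp), bo))   # output gate
c = tanh(add(dot(wc, xt), dot(uc, hp), bc))      # candidate cell values
ct = add(multiply(f, cp), multiply(i, c))        # new cell state (cp = previous cell state)
ht = multiply(o, tanh(ct))                       # new hidden state (hp = previous hidden state)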
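For anyone who wants to step through it in a debugger, here is a minimal runnable sketch of the same 6-liner in NumPy. The weight names follow the pseudocode above; the `init_params` helper, the dict layout, the sizes (3 inputs, 4 hidden units), and the random initialization are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward_step(xt, hp, cp, p):
    """One LSTM time step. p is a dict of weights named as in the 6-liner."""
    f = sigmoid(p["wf"] @ xt + p["uf"] @ hp + p["bf"])  # forget gate
    i = sigmoid(p["wi"] @ xt + p["ui"] @ hp + p["bi"])  # input gate
    o = sigmoid(p["wo"] @ xt + p["uo"] @ hp + p["bo"])  # output gate
    c = np.tanh(p["wc"] @ xt + p["uc"] @ hp + p["bc"])  # candidate cell values
    ct = f * cp + i * c                                  # element-wise products
    ht = o * np.tanh(ct)
    return ht, ct

def init_params(n_in, n_hid, seed=0):
    """Hypothetical helper: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in "fioc":
        p["w" + g] = rng.normal(0.0, 0.1, (n_hid, n_in))   # input weights
        p["u" + g] = rng.normal(0.0, 0.1, (n_hid, n_hid))  # recurrent weights
        p["b" + g] = np.zeros(n_hid)
    return p

params = init_params(n_in=3, n_hid=4)
xt = np.ones(3)   # input at step t
hp = np.zeros(4)  # previous hidden state
cp = np.zeros(4)  # previous cell state
ht, ct = lstm_forward_step(xt, hp, cp, params)
print(ht.shape, ct.shape)
```

Here `@` plays the role of `dot()` and `*` the role of `multiply()` for 1-D vectors and 2-D weight matrices.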

But what I can't seem to find is a canonical set of equations for the backward pass.

I am using this seq2seq repo as reference to get my head around the general flow and process. So far, it's the only repo I've found to have a concise implementation of LSTM and the other things around it (embeddings, softmax, training, etc.). Just in case this implementation is flawed or tweaked, I've been gathering articles that present equations and/or code for the backward pass as cross-reference.

these:

https://blog.aidangomez.ca/2016/04/17/Backpropogating-an-LSTM-A-Numerical-Example/
https://nicodjimenez.github.io/2014/08/08/lstm.html
https://stackoverflow.com/a/46689998

and these:

https://blog.varunajayasiri.com/numpy_lstm.html
https://www.geeksforgeeks.org/lstm-derivation-of-back-propagation-through-time/
https://wiseodd.github.io/techblog/2016/08/12/lstm-backprop/

and these:

https://arxiv.org/abs/1808.03314
https://arxiv.org/abs/1610.02583

But all of them seem to have different versions of the backward pass.

  • The first three links have backward pass equations that do not use the derivative of sigmoid (sigmoid(x) * (1 - sigmoid(x))) to compute the gate derivatives, while the second three and the repo I referenced do use it (dsigmoid() or sigmoid_grad() in the code).
  • The equations variously use dot products, outer products, or element-wise products.
  • The arXiv papers linked sound like they were trying to solve this issue of a missing canonical implementation. But aside from the alphabet soup and dense wall of words, their equations look different from the rest and just add to the confusion.
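A possible reading of the first two bullets, sketched for a single gate and not checked against any of the links: since the derivative of sigmoid at z equals s * (1 - s) with s = sigmoid(z), code that multiplies by `gate * (1 - gate)` on the stored gate output computes the same thing as code that calls a `dsigmoid()` on the pre-activation. Likewise, the three kinds of product can all appear in one backward step, each for a different quantity. Every name, shape, and value below is made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(z):
    """Derivative of sigmoid evaluated at the pre-activation z."""
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # gate weights, shape (hidden, input)
x = rng.normal(size=3)        # input vector
d_gate = rng.normal(size=4)   # gradient arriving at the gate output (made up)

z = W @ x                     # pre-activation
gate = sigmoid(z)             # activated gate, as a forward pass would store it

# Two spellings of the same local gradient (element-wise products).
dz_explicit = d_gate * dsigmoid(z)
dz_from_output = d_gate * gate * (1.0 - gate)
same = np.allclose(dz_explicit, dz_from_output)

# Where the other products show up for this one gate:
dW = np.outer(dz_explicit, x)  # outer product: weight gradient, shape (4, 3)
dx = W.T @ dz_explicit         # matrix-vector (dot) product: input gradient, shape (3,)
print(same, dW.shape, dx.shape)
```

Under this reading, an equation set that never names the sigmoid derivative may still be applying it, just written in terms of the gate output.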

Again, I'm not well-versed in math, so some of these might really be just shortcuts, alternate implementations, or clever manipulations that I can't reason about at my current level.
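One way to test whether a given backward pass is correct, however it is written, is to compare it with a finite-difference gradient of the forward pass. The sketch below does that for a single time step. It assumes the forward pass from the 6-liner and a made-up scalar loss, `sum(ht) + sum(ct)`. The `backward()` function is the chain rule applied by hand to those six lines; it is not copied from any of the links or from the repo, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(xt, hp, cp, p):
    f = sigmoid(p["wf"] @ xt + p["uf"] @ hp + p["bf"])
    i = sigmoid(p["wi"] @ xt + p["ui"] @ hp + p["bi"])
    o = sigmoid(p["wo"] @ xt + p["uo"] @ hp + p["bo"])
    c = np.tanh(p["wc"] @ xt + p["uc"] @ hp + p["bc"])
    ct = f * cp + i * c
    ht = o * np.tanh(ct)
    return ht, ct, (xt, hp, cp, f, i, o, c, ct)

def backward(dht, dct_in, cache, p):
    """dht, dct_in: gradients of the loss with respect to ht and ct."""
    xt, hp, cp, f, i, o, c, ct = cache
    tc = np.tanh(ct)
    dct = dct_in + dht * o * (1.0 - tc ** 2)  # ct also feeds ht
    # Gradients at each gate's pre-activation (element-wise).
    dz = {
        "f": dct * cp * f * (1.0 - f),
        "i": dct * c * i * (1.0 - i),
        "o": dht * tc * o * (1.0 - o),
        "c": dct * i * (1.0 - c ** 2),
    }
    grads = {}
    dxt = np.zeros_like(xt)
    dhp = np.zeros_like(hp)
    for g, d in dz.items():
        grads["w" + g] = np.outer(d, xt)  # weight gradients: outer products
        grads["u" + g] = np.outer(d, hp)
        grads["b" + g] = d
        dxt += p["w" + g].T @ d           # input gradients: transposed dot
        dhp += p["u" + g].T @ d
    dcp = dct * f
    return grads, dxt, dhp, dcp

def loss(xt, hp, cp, p):
    ht, ct, _ = forward(xt, hp, cp, p)
    return ht.sum() + ct.sum()            # made-up loss for the check

def numerical_grad(name, xt, hp, cp, p, eps=1e-6):
    """Central finite differences over every entry of p[name]."""
    g = np.zeros_like(p[name])
    for idx in np.ndindex(p[name].shape):
        old = p[name][idx]
        p[name][idx] = old + eps
        up = loss(xt, hp, cp, p)
        p[name][idx] = old - eps
        down = loss(xt, hp, cp, p)
        p[name][idx] = old
        g[idx] = (up - down) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
p = {}
for g in "fioc":
    p["w" + g] = rng.normal(size=(n_hid, n_in))
    p["u" + g] = rng.normal(size=(n_hid, n_hid))
    p["b" + g] = rng.normal(size=n_hid)
xt, hp, cp = rng.normal(size=n_in), rng.normal(size=n_hid), rng.normal(size=n_hid)

ht, ct, cache = forward(xt, hp, cp, p)
# For loss = sum(ht) + sum(ct), both incoming gradients are all ones.
grads, dxt, dhp, dcp = backward(np.ones_like(ht), np.ones_like(ct), cache, p)
max_err = max(
    np.abs(grads[name] - numerical_grad(name, xt, hp, cp, p)).max() for name in p
)
print("largest analytic vs numeric difference:", max_err)
```

If a different-looking set of backward equations passes the same check against the same forward pass, it is most likely an alternate spelling rather than a different algorithm.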

So my questions are:

  1. Why isn't there a canonical set of equations for LSTM back propagation?
  2. A lot of articles seem to show how people are deriving them. Why?
  3. If the forward equations are the same, why do people end up having very different backward equations?
  4. Where can I find a good, minimal reference for implementing LSTM in full (both forward and backward)? The backward equations especially.
Jim