
I'm currently attempting to build a Seq2Seq chatbot with LSTMs. The data I'm using is the Cornell Movie-Dialogs Corpus.

Here's the link to my code on GitHub; I'd appreciate it if you took a look: Seq2Seq Chatbot. (You'll need to change the file path in order for it to run correctly.)

I'm using two GTX 1080s with 8 GB of memory, and I'm training my code with GPU support.

Here's the error I got:

...
2018-01-14 17:09:40.102649: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 1002348032 totalling 955.91MiB
2018-01-14 17:09:40.102656: I tensorflow/core/common_runtime/bfc_allocator.cc:683] Sum Total of in-use chunks: 7.30GiB
2018-01-14 17:09:40.102665: I tensorflow/core/common_runtime/bfc_allocator.cc:685] Stats:
Limit:                  7968181453
InUse:                  7836243200
MaxInUse:               7836262144
NumAllocs:                   48210
MaxAllocSize:           1002348032

2018-01-14 17:09:40.103459: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **********************_******************************xx*****************************************xxxx
2018-01-14 17:09:40.103484: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[3,1143,44592]
Traceback (most recent call last):
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3,1143,44592]
         [[Node: training/RMSprop/gradients/dense_1/Max_grad/Cast = Cast[DstT=DT_FLOAT, SrcT=DT_BOOL, _class=["loc:@dense_1/Max"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/RMSprop/gradients/dense_1/Max_grad/Equal)]]
         [[Node: loss/mul/_93 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3108_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/edward/デスクトップ/Chatbot/Seq2Seq Model/main (One-hot Batch).py", line 100, in <module>
    model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=3, epochs=epochs, validation_split=0.)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/training.py", line 1657, in fit
    validation_steps=validation_steps)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/training.py", line 1213, in _fit_loop
    outs = f(ins_batch)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/backend/tensorflow_backend.py", line 2357, in __call__
    **self.session_kwargs)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3,1143,44592]
         [[Node: training/RMSprop/gradients/dense_1/Max_grad/Cast = Cast[DstT=DT_FLOAT, SrcT=DT_BOOL, _class=["loc:@dense_1/Max"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/RMSprop/gradients/dense_1/Max_grad/Equal)]]
         [[Node: loss/mul/_93 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3108_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'training/RMSprop/gradients/dense_1/Max_grad/Cast', defined at:
  File "/home/edward/デスクトップ/Chatbot/Seq2Seq Model/main (One-hot Batch).py", line 100, in <module>
    model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=3, epochs=epochs, validation_split=0.)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/training.py", line 1634, in fit
    self._make_train_function()
  File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/training.py", line 990, in _make_train_function
    loss=self.total_loss)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/optimizers.py", line 225, in get_updates
    grads = self.get_gradients(loss, params)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/optimizers.py", line 73, in get_gradients
    grads = K.gradients(loss, params)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/backend/tensorflow_backend.py", line 2394, in gradients
    return tf.gradients(loss, variables, colocate_gradients_with_ops=True)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py", line 581, in gradients
    grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py", line 353, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py", line 581, in <lambda>
    grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/math_grad.py", line 87, in _MaxGrad
    return _MinOrMaxGrad(op, grad)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/math_grad.py", line 77, in _MinOrMaxGrad
    indicators = math_ops.cast(math_ops.equal(y, op.inputs[0]), grad.dtype)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/math_ops.py", line 745, in cast
    return gen_math_ops.cast(x, base_type, name=name)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gen_math_ops.py", line 892, in cast
    "Cast", x=x, DstT=DstT, name=name)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

...which was originally created as op 'dense_1/Max', defined at:
  File "/home/edward/デスクトップ/Chatbot/Seq2Seq Model/main (One-hot Batch).py", line 73, in <module>
    decoder_outputs = decoder_dense(decoder_outputs)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/layers/core.py", line 847, in call
    output = self.activation(output)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/activations.py", line 26, in softmax
    e = K.exp(x - K.max(x, axis=axis, keepdims=True))
  File "/home/edward/.local/lib/python3.4/site-packages/keras/backend/tensorflow_backend.py", line 1213, in max
    return tf.reduce_max(x, axis=axis, keep_dims=keepdims)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/math_ops.py", line 1525, in reduce_max
    name=name)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2485, in _max
    keep_dims=keep_dims, name=name)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,1143,44592]
         [[Node: training/RMSprop/gradients/dense_1/Max_grad/Cast = Cast[DstT=DT_FLOAT, SrcT=DT_BOOL, _class=["loc:@dense_1/Max"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/RMSprop/gradients/dense_1/Max_grad/Equal)]]
         [[Node: loss/mul/_93 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3108_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Exception ignored in: <bound method Session.__del__ of <tensorflow.python.client.session.Session object at 0x7f18568a7e10>>
Traceback (most recent call last):
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 696, in __del__
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/c_api_util.py", line 30, in __init__
TypeError: 'NoneType' object is not callable

It's telling me that I ran out of memory.

The funny thing is, I'm dividing my input sentences into batches of 20 (I changed this value later; see below) and training on them, and I can see the program allocating a huge number of memory chunks for just this one batch (i.e., lines like 2018-01-14 17:09:40.102649: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 1002348032 totalling 955.91MiB are printed to the console a ridiculous number of times), and it never gets to the next batch.

Following similar issues I found online, I've tried:

  • Reducing memory usage by changing the data type from float32 to float16 (see the snippet below)
  • Reducing the batch size to 10, then 5, then 3
  • Reducing the number of epochs to 3

and none of these worked. Most of the similar issues I found involved image data, by the way.
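For reference, the float32 -> float16 change in the first bullet was made with the standard Keras backend switch, something like this (called before building the model):

```python
# Standard Keras backend call to change the default float type.
from keras import backend as K

K.set_floatx('float16')
print(K.floatx())  # 'float16'
```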

I'm thinking it may be related to the size of one batch (it's a NumPy array of shape [3, 1143, 44592]), or simply an error in my code; but honestly, I'm terribly stuck right now.

Any help would be greatly appreciated!

narutatsuri
2 Answers


This question is better suited to Stack Overflow, but I'll give you a hint.

First up, the tensor that the engine is trying to allocate is enormous:

$$3 \cdot 1143 \cdot 44592 = 152{,}905{,}968 \approx 150\text{M}$$

With float32 values, that single tensor takes ~600 MB; even with float16 it's still ~300 MB, which is a lot. And this is just one layer of the network (not counting the training data, intermediate tensors, and internal TensorFlow infrastructure). Also note that the RMSprop optimizer keeps an extra accumulator for every variable in order to perform gradient updates, so the memory for the variables is doubled as well.
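To make that arithmetic concrete, a quick back-of-the-envelope check:

```python
# Rough footprint of the single [3, 1143, 44592] tensor from the error.
elements = 3 * 1143 * 44592       # 152,905,968 values
print(elements * 4 / 1e6, "MB")   # float32: ~611.6 MB
print(elements * 2 / 1e6, "MB")   # float16: ~305.8 MB
```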

I've skimmed through your code and noticed that you use a softmax output layer over the whole 44,592-word vocabulary for word prediction. Not only is this inefficient in terms of memory, it's also hard to train.

In NLP, it's common to use a sampled loss function to classify words over a large vocabulary, most commonly negative sampling and NCE (noise-contrastive estimation). These losses are implemented in TensorFlow but require a bit of manual work in Keras (see this discussion on GitHub); they are much more memory- and compute-efficient.

So I think the biggest improvement for you would be to implement an NCE loss. You can also try training with a plain SGD optimizer to save memory.
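To illustrate the idea, here is a minimal TF 1.x sketch of a sampled-softmax training loss; the variable names, hidden size, and num_sampled value are illustrative, not taken from your code, and tf.nn.nce_loss has the same call signature:

```python
import tensorflow as tf

vocab_size = 44592   # from the error message
hidden_dim = 256     # illustrative decoder hidden size
num_sampled = 512    # candidates sampled per step instead of all 44,592 classes

# Output projection held as plain variables instead of a huge Dense layer.
proj_w = tf.get_variable("proj_w", [vocab_size, hidden_dim])
proj_b = tf.get_variable("proj_b", [vocab_size])

decoder_states = tf.placeholder(tf.float32, [None, hidden_dim])  # [batch, dim]
target_ids = tf.placeholder(tf.int64, [None, 1])                 # true word ids

# At training time the softmax is evaluated over num_sampled negative
# candidates plus the true class, never over the full vocabulary.
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=proj_w, biases=proj_b,
    labels=target_ids, inputs=decoder_states,
    num_sampled=num_sampled, num_classes=vocab_size))
```

At inference time you still run the full softmax, but only one step at a time, which is cheap compared to training.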

Maxim
  • re your first line: Migration to [SO] was rejected because the OP is blocked from asking questions there. – whuber Jan 16 '18 at 19:15
  • Ok. Thanks. Just curious if it's ok to ask programming questions on CV in this case? – Maxim Jan 16 '18 at 19:16
  • That's viewed as circumventing SE policies. Such efforts are usually greeted with substantial negative prejudice :-). – whuber Jan 16 '18 at 19:18

You don't quite have your LSTM set up right. Some thoughts:

  1. Use keras.preprocessing.text.text_to_word_sequence to split your texts into words, then map them to word ids.
  2. Use keras.preprocessing.sequence.pad_sequences to truncate/pad all your sequences to something like 32 or 64 words.
  3. Use an embedding layer after your input layer to map the sequences of word ids to a sequence of word vectors.
  4. Use the sequences of word vectors as input to the LSTM (steps 1-4 are sketched in code after this list)
  5. Try a GRU instead of an LSTM (it's a little bit simpler)
  6. After the decoding layer, the output will be a sequence of word vectors. Use nearest neighbors to look up the nearest word in your vocabulary for each "word" in the output (a lookup sketch follows below).
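Here's a minimal sketch of steps 1-4; the toy texts, dimensions, and hand-built vocabulary are mine, for illustration (note that text_to_word_sequence only splits text into words, so the word-to-id map is built manually):

```python
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, GRU

texts = ["how are you doing", "fine thanks and you"]   # toy corpus
tokenized = [text_to_word_sequence(t) for t in texts]  # step 1: split into words

# Build a word -> id map by hand (id 0 is reserved for padding).
vocab = {w: i + 1 for i, w in enumerate(sorted({w for s in tokenized for w in s}))}
sequences = [[vocab[w] for w in s] for s in tokenized]

maxlen = 32
x = pad_sequences(sequences, maxlen=maxlen)            # step 2: pad/truncate to 32

model = Sequential()
model.add(Embedding(input_dim=len(vocab) + 1,          # step 3: ids -> word vectors
                    output_dim=128, input_length=maxlen))
model.add(GRU(256))                                    # steps 4/5: recurrent encoder
print(model.predict(x).shape)                          # (2, 256)
```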

You can learn the embeddings while training the network, or you can use a pre-trained embedding layer.
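And for step 6, a sketch of the nearest-neighbor lookup; the embedding matrix and decoder output here are random placeholders:

```python
import numpy as np

embedding_matrix = np.random.randn(10000, 128)  # [vocab_size, dim], placeholder
decoded = np.random.randn(32, 128)              # decoder output: one vector per position

# Cosine similarity between each output vector and every vocabulary vector.
unit_emb = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
unit_dec = decoded / np.linalg.norm(decoded, axis=1, keepdims=True)
similarity = unit_dec.dot(unit_emb.T)           # shape: [32, vocab_size]

word_ids = similarity.argmax(axis=1)            # nearest word id per output position
```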

Zach