I'm currently building a Seq2Seq chatbot with LSTMs, using data from the Cornell Movie-Dialogs Corpus.
Here's the link to my code on GitHub; I'd appreciate it if you took a look at it: Seq2Seq Chatbot. (You'll need to change the file path for it to run correctly.)
I'm using two GTX 1080s with 8 GB of memory each, and I'm training with GPU support.
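For reference, the relevant part of the model looks roughly like the sketch below. It's simplified and shrunk so it can run anywhere: the real dimensions are 44592 tokens and sequences up to 1143 steps, latent_dim and the loss are illustrative placeholders (the actual values are in the repo), and the two marked lines correspond to lines 73 and 100 of my script in the traceback below.

import numpy as np
from keras.models import Model
from keras.layers import Input, LSTM, Dense

num_tokens  = 100    # real value: 44592 (one-hot vocabulary size)
max_len     = 20     # real value: 1143 (longest sequence)
latent_dim  = 64     # illustrative; the real value is in the repo
num_samples = 6      # illustrative

# Encoder: keep only the final LSTM states
encoder_inputs = Input(shape=(None, num_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: conditioned on the encoder states, one softmax over the vocabulary per step
decoder_inputs = Input(shape=(None, num_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_dense = Dense(num_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)          # "dense_1" / line 73 in the traceback

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop',                        # RMSprop matches the traceback
              loss='categorical_crossentropy')            # loss here is illustrative

# One-hot arrays of shape (samples, max_len, num_tokens); dummy zeros stand in
# for the arrays built from the Cornell corpus.
encoder_input_data  = np.zeros((num_samples, max_len, num_tokens), dtype='float32')
decoder_input_data  = np.zeros((num_samples, max_len, num_tokens), dtype='float32')
decoder_target_data = np.zeros((num_samples, max_len, num_tokens), dtype='float32')

model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=3, epochs=1, validation_split=0.)    # line 100 in the traceback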
Here's the error I got:
...
2018-01-14 17:09:40.102649: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 1002348032 totalling 955.91MiB
2018-01-14 17:09:40.102656: I tensorflow/core/common_runtime/bfc_allocator.cc:683] Sum Total of in-use chunks: 7.30GiB
2018-01-14 17:09:40.102665: I tensorflow/core/common_runtime/bfc_allocator.cc:685] Stats:
Limit: 7968181453
InUse: 7836243200
MaxInUse: 7836262144
NumAllocs: 48210
MaxAllocSize: 1002348032
2018-01-14 17:09:40.103459: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **********************_******************************xx*****************************************xxxx
2018-01-14 17:09:40.103484: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[3,1143,44592]
Traceback (most recent call last):
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3,1143,44592]
[[Node: training/RMSprop/gradients/dense_1/Max_grad/Cast = Cast[DstT=DT_FLOAT, SrcT=DT_BOOL, _class=["loc:@dense_1/Max"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/RMSprop/gradients/dense_1/Max_grad/Equal)]]
[[Node: loss/mul/_93 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3108_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/edward/デスクトップ/Chatbot/Seq2Seq Model/main (One-hot Batch).py", line 100, in <module>
model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=3, epochs=epochs, validation_split=0.)
File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/training.py", line 1657, in fit
validation_steps=validation_steps)
File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/training.py", line 1213, in _fit_loop
outs = f(ins_batch)
File "/home/edward/.local/lib/python3.4/site-packages/keras/backend/tensorflow_backend.py", line 2357, in __call__
**self.session_kwargs)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3,1143,44592]
[[Node: training/RMSprop/gradients/dense_1/Max_grad/Cast = Cast[DstT=DT_FLOAT, SrcT=DT_BOOL, _class=["loc:@dense_1/Max"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/RMSprop/gradients/dense_1/Max_grad/Equal)]]
[[Node: loss/mul/_93 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3108_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'training/RMSprop/gradients/dense_1/Max_grad/Cast', defined at:
File "/home/edward/デスクトップ/Chatbot/Seq2Seq Model/main (One-hot Batch).py", line 100, in <module>
model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=3, epochs=epochs, validation_split=0.)
File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/training.py", line 1634, in fit
self._make_train_function()
File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/training.py", line 990, in _make_train_function
loss=self.total_loss)
File "/home/edward/.local/lib/python3.4/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
File "/home/edward/.local/lib/python3.4/site-packages/keras/optimizers.py", line 225, in get_updates
grads = self.get_gradients(loss, params)
File "/home/edward/.local/lib/python3.4/site-packages/keras/optimizers.py", line 73, in get_gradients
grads = K.gradients(loss, params)
File "/home/edward/.local/lib/python3.4/site-packages/keras/backend/tensorflow_backend.py", line 2394, in gradients
return tf.gradients(loss, variables, colocate_gradients_with_ops=True)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py", line 581, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py", line 353, in _MaybeCompile
return grad_fn() # Exit early
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py", line 581, in <lambda>
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/math_grad.py", line 87, in _MaxGrad
return _MinOrMaxGrad(op, grad)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/math_grad.py", line 77, in _MinOrMaxGrad
indicators = math_ops.cast(math_ops.equal(y, op.inputs[0]), grad.dtype)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/math_ops.py", line 745, in cast
return gen_math_ops.cast(x, base_type, name=name)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gen_math_ops.py", line 892, in cast
"Cast", x=x, DstT=DstT, name=name)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
...which was originally created as op 'dense_1/Max', defined at:
File "/home/edward/デスクトップ/Chatbot/Seq2Seq Model/main (One-hot Batch).py", line 73, in <module>
decoder_outputs = decoder_dense(decoder_outputs)
File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/topology.py", line 603, in __call__
output = self.call(inputs, **kwargs)
File "/home/edward/.local/lib/python3.4/site-packages/keras/layers/core.py", line 847, in call
output = self.activation(output)
File "/home/edward/.local/lib/python3.4/site-packages/keras/activations.py", line 26, in softmax
e = K.exp(x - K.max(x, axis=axis, keepdims=True))
File "/home/edward/.local/lib/python3.4/site-packages/keras/backend/tensorflow_backend.py", line 1213, in max
return tf.reduce_max(x, axis=axis, keep_dims=keepdims)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/math_ops.py", line 1525, in reduce_max
name=name)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2485, in _max
keep_dims=keep_dims, name=name)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,1143,44592]
[[Node: training/RMSprop/gradients/dense_1/Max_grad/Cast = Cast[DstT=DT_FLOAT, SrcT=DT_BOOL, _class=["loc:@dense_1/Max"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/RMSprop/gradients/dense_1/Max_grad/Equal)]]
[[Node: loss/mul/_93 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3108_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Exception ignored in: <bound method Session.__del__ of <tensorflow.python.client.session.Session object at 0x7f18568a7e10>>
Traceback (most recent call last):
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 696, in __del__
File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/c_api_util.py", line 30, in __init__
TypeError: 'NoneType' object is not callable
It's telling me that I ran out of memory.
The odd thing is that I'm splitting my input sentences into batches of 20 (I changed this value later; see further down) and training on one batch at a time, yet the program allocates an enormous number of memory chunks for just this one batch, i.e. lines like
2018-01-14 17:09:40.102649: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 1002348032 totalling 955.91MiB
are printed to the console a ridiculous number of times, and it never gets to the next batch.
Following similar issues I found online, I've tried:
- Reducing memory usage by changing the data type from float32 to float16
- Reducing the batch size from 20 to 10, then 5, then 3
- Reducing the number of epochs to 3
None of these worked (a code sketch of these changes follows below). The similar issues I found mostly involved image data, by the way.
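In code terms, these attempts boil down to the following (illustrative only; the array and model names follow the sketch near the top of the post):

# float32 -> float16 for the one-hot arrays
encoder_input_data  = encoder_input_data.astype('float16')
decoder_input_data  = decoder_input_data.astype('float16')
decoder_target_data = decoder_target_data.astype('float16')

# smaller batches and fewer epochs
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=3,        # tried 10, 5, then 3 (originally 20)
          epochs=3,            # reduced to 3
          validation_split=0.)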
I'm thinking it may be related to the size of a single batch (a NumPy array of shape [3, 1143, 44592]), or simply to an error in my code, but honestly I'm terribly stuck right now.
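For scale, here's a quick back-of-the-envelope check on the tensor from the error message (plain arithmetic, using only the shape TensorFlow reports):

batch, seq_len, vocab = 3, 1143, 44592             # shape[3,1143,44592] from the error

elements = batch * seq_len * vocab                 # ~153 million elements
print(elements * 4 / 2**20, "MiB")                 # ~583 MiB for a single float32 tensor

# The failing Cast inside the dense_1/Max gradient needs one more tensor of this
# shape on top of the activations and gradients already resident, which I assume
# is how the ~7.3 GiB of in-use memory in the allocator log adds up.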
Any help would be greatly appreciated!