7

I'm using Q-learning for a side project. After a few million episodes, the cumulative rewards seem to have stabilized. I'm wondering if there is a scientific way to determine when to stop training, rather than just observing the cumulative rewards.

user2131907
  • In Q-learning each episode tends to be "complete". You could use an alternative paradigm where you make one tour of one step in size from each possible starting state in the domain and look at the change in the rewards; then you make a few complete tours, repeat the array of micro-tours, and look at the difference in the reward field between the two arrays of micro-tours. You could temporarily "darken" a random subset of the allowed states, like simulated annealing, and you could temporarily "lighten" too. You could look at the smaller eigenvectors of the PCA of the Q-matrix over time... – EngrStudent Jan 17 '18 at 12:00

1 Answer

9

This depends very much on what your goal is. Here are some different cases I can think of:


Goal: Train until convergence, but no longer

From your question, I get the impression that this seems to be your goal. The easiest way is probably the "old-fashioned" way of plotting your episode returns during training (if it's an episodic task), inspecting the plot yourself, and interrupting the training process when it seems to have stabilized / converged. This assumes that you actually implemented something (like a very simple GUI with a stop button) so that you are able to decide manually when to interrupt the training loop.
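For instance, a rough sketch of that approach in Python (where `run_episode()` is a stand-in for your own Q-learning episode loop that returns the episode's cumulative reward, and Ctrl+C plays the role of the stop button) could look something like this:

```python
import matplotlib.pyplot as plt

returns = []
plt.ion()                                  # interactive mode so the window refreshes during training
try:
    for episode in range(10_000_000):
        returns.append(run_episode())      # hypothetical: runs one episode, returns its cumulative reward
        if episode % 1_000 == 0:
            plt.clf()
            plt.plot(returns)
            plt.xlabel("episode")
            plt.ylabel("return")
            plt.pause(0.001)               # redraw without blocking the training loop
except KeyboardInterrupt:
    pass                                   # hit Ctrl+C once the curve looks flat
```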

To do this automatically (which is what I suppose you're looking for when you say "scientific way(s) to determine when to stop training"), I suppose you could do something simple like measuring the average performance over the last 10 episodes, over the last 50 episodes, and over the last 100 episodes (for example). If those averages are all very similar, it may be safe to stop training. Or, maybe better, you could measure the variance in performance over such a window of episodes, and stop once the variance drops below a certain threshold.
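A minimal sketch of such an automated check (the window sizes, tolerance and variance threshold here are arbitrary and would need tuning for your task; `returns` is assumed to be the list of episode returns you're already tracking):

```python
import numpy as np

def should_stop(returns, windows=(10, 50, 100), tol=1.0):
    """Stop once the mean returns over several recent windows agree to within `tol`."""
    if len(returns) < max(windows):
        return False
    means = [np.mean(returns[-w:]) for w in windows]
    return max(means) - min(means) < tol

def should_stop_by_variance(returns, window=100, var_threshold=1.0):
    """Alternative: stop once the variance of recent returns drops below a threshold."""
    if len(returns) < window:
        return False
    return np.var(returns[-window:]) < var_threshold
```

You'd then call one of these at the end of every episode and break out of the training loop as soon as it returns True.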


Goal: Compare performance of an algorithm to another algorithm / performance described in publications

In this case, you'd simply want to make sure to use a similar amount of training time / number of training steps as was used for the baseline you're comparing to. What often happens in current Reinforcement Learning research is to measure the mean performance over the last X (e.g. X = 10 or X = 100) episodes at specific points in training (e.g. after 10M, 50M, 100M and 200M frames in Atari games, see: https://arxiv.org/abs/1709.06009). Even better, in my opinion, is to do exactly this at every point during training and plot a learning curve. In this case it really doesn't matter all too much when you stop training, as long as you do it consistently in the same way for all algorithms you're comparing. Note, though, that your decision of when to stop training will influence which conclusions you can reasonably draw: if you stop training very early, you can't conclude anything about long-term training performance.
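As a sketch of that kind of periodic evaluation (here `train_one_step`, `run_episode`, `agent`, `env` and `total_steps` are placeholders for your own setup, and the evaluation interval and episode count are arbitrary), you could log a learning curve like this:

```python
import numpy as np

eval_every = 10_000       # evaluate at fixed step counts (arbitrary; keep identical across algorithms)
eval_episodes = 100       # X in "mean performance over the last X episodes"
learning_curve = []       # (training_step, mean_return) pairs to plot afterwards

for step in range(1, total_steps + 1):
    train_one_step(agent, env)            # placeholder for one Q-learning update
    if step % eval_every == 0:
        returns = [run_episode(agent, env, learn=False) for _ in range(eval_episodes)]
        learning_curve.append((step, float(np.mean(returns))))
```

The important part is that every algorithm you compare is evaluated with exactly the same protocol (same evaluation points, same number of evaluation episodes).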


Goal: Implement an agent that is intended to be deployed for a long period of time

In this case, you may even want to consider simply never stopping the learning process ("life-long learning"). You can keep updating the agent as it is deployed and acts in its environment. Or you could halt training whenever performance seems adequate, if you're afraid that it may degrade afterwards during deployment.

Dennis Soemers