Q1: Are there common or accepted methods for dealing with non-stationary environments in reinforcement learning in general?
Most basic RL agents are online, and online learning can usually deal with non-stationary problems. In addition, update rules for state-value and action-value estimators in control problems are usually written for non-stationary targets, because the targets already change as the policy improves. This is nothing complicated: using a fixed learning rate $\alpha$ in the value updates effectively computes an exponentially-weighted (recency-weighted) moving average of recent targets, as opposed to averaging over all history in an unweighted fashion.
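As a concrete sketch (in Python, with illustrative values for $\alpha$ and the discount factor $\gamma$), this is just the standard recency-weighted tabular Q-learning update:

```python
import numpy as np

# Illustrative constants, not tuned for any particular problem.
ALPHA = 0.1    # constant learning rate
GAMMA = 0.99   # discount factor

def q_update(Q, s, a, r, s_next, done):
    """One Q-learning step: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a)).

    Because ALPHA is a constant rather than 1/N, the estimate is an
    exponentially-weighted average of past targets, so it keeps tracking
    a target that moves as the policy (or the environment) changes.
    """
    target = r if done else r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (target - Q[s, a])
```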
However, this addresses longer-term non-stationarity, such as the problem changing between episodes, or over an even longer time scale. Your description looks more like you wish to change the reward structure based on actions the agent has taken, within a short timescale. That dynamic response to actions is better framed as a different, more complex MDP, not as "non-stationarity" within a simpler MDP.
An agent cannot learn changes to the environment that it has not yet sampled, so changing the reward structure will not prevent the agent from returning to previously-visited states. Unless you are using something like an RNN in the agent, it will have no "memory" of what happened earlier in the episode other than whatever is represented in the current state (arguably, using an RNN makes the hidden layer of the RNN part of the state). Across multiple episodes, a tabular Q-learning agent will simply learn that certain states have low value; it will not be able to learn that second or third visits to a state cause that effect, because it has no way to represent that knowledge. Nor will it be able to adjust to the change fast enough to learn online, mid-episode.
Q2: In my gridworld, I have the reward function changing when a state is visited. All I want my agent to learn is "Don't go back unless you really need to", but this makes the environment non-stationary.
If that is all you need the agent to learn, perhaps it can be encouraged by a suitable reward structure. Before you can do that, you need to understand for yourself what "really need to" implies, and how tight that logic has to be. You may be fine, though, simply assigning some penalty for visiting any location that the agent has already (or recently) visited.
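As a minimal sketch of that idea (assuming the gridworld exposes `reset`/`step` and the observation identifies the agent's cell; the wrapper class and penalty value are made up for illustration, not part of any library):

```python
# Hypothetical wrapper that subtracts a fixed penalty whenever the agent
# steps onto a cell it has already visited in the current episode. The
# penalty size would need tuning so backtracking is discouraged but still
# possible when it is genuinely worth it.
REVISIT_PENALTY = 0.5

class RevisitPenaltyEnv:
    def __init__(self, env):
        self.env = env
        self.visited = set()

    def reset(self):
        obs = self.env.reset()          # obs assumed to identify the agent's cell
        self.visited = {obs}
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if obs in self.visited:
            reward -= REVISIT_PENALTY   # shaping term for re-entering a cell
        self.visited.add(obs)
        return obs, reward, done, info
```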
Can/Should this very simple rule be incorporated in the MDP model, and how?
Yes, you should add the information about visited locations into the state. This will immediately make your state model more complex than a simple grid world, increasing the dimensionality of the problem, but it is unavoidable. Most real-world problems very quickly outgrow the toy examples used to teach RL concepts.
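For example (a sketch only, assuming an $8 \times 8$ grid with cells indexed $0$ to $63$; the helper functions are hypothetical), the state could become the pair (current position, bitmask of visited cells):

```python
GRID_CELLS = 64  # 8 x 8 grid, cells indexed 0..63

def initial_state(start_position):
    """State = (current position, bitmask with one bit per visited cell)."""
    return (start_position, 1 << start_position)

def next_state(state, new_position):
    """Move to new_position and mark it as visited."""
    _, visited_mask = state
    return (new_position, visited_mask | (1 << new_position))

def already_visited(state, position):
    _, visited_mask = state
    return bool(visited_mask & (1 << position))
```

With the visited map in the state, the "penalise revisits" rule becomes a function of the current state and action alone, so the problem is an ordinary (if much larger) MDP again.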
One alternative is to frame the problem as a Partially Observable Markov Decision Process (POMDP). In that case the "true" state would still include all the necessary history in order to calculate the rewards (and as this is a toy problem on a computer, you would still have to represent it somehow), but the agent can attempt to learn from restricted knowledge of the state, i.e. whatever you let it observe. In general this is a much harder approach than expanding the state representation, and I would not recommend it here. However, if you find the idea interesting, you could use your problem to explore POMDPs. Here is a recent paper (from Google's DeepMind team, 2015) that looks at two RL algorithms combined with RNNs to solve POMDPs.
Q3: I have been looking into Q-learning with experience replay as a solution to dealing with non-stationary environments, as it decorrelates successive updates. Is this the correct use of the method, or is it more about making learning more data-efficient?
Experience replay will not help with non-stationary environments. In fact it could make performance worse in them. However, as already stated, your problem is not really about a non-stationary environment, but about handling more complex state dynamics.
What you may need to look into, if the number of states grows large enough, is function approximation. For instance, if you want to handle any back-tracking and have a complex reward-modifying rule that tracks each visited location, then your state might change from a single location number to a map showing visited locations. So, for example, it might go from $64$ states for an $8 \times 8$ grid world to a map with $2^{64}$ possible combinations of visited squares. This is far too many to track in a value table, so you will typically use a neural network (or a convolutional neural network) to estimate state values instead.
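A sketch of what that could look like, assuming PyTorch and the bitmask state from the earlier snippet (architecture and sizes are illustrative, not tuned):

```python
import torch
import torch.nn as nn

GRID_SIZE = 8
N_ACTIONS = 4  # e.g. up, down, left, right

def encode_state(position, visited_mask):
    """Encode the augmented state as a 2-channel 8x8 'image':
    channel 0 marks the agent's cell, channel 1 marks visited cells."""
    x = torch.zeros(2, GRID_SIZE, GRID_SIZE)
    x[0, position // GRID_SIZE, position % GRID_SIZE] = 1.0
    for cell in range(GRID_SIZE * GRID_SIZE):
        if visited_mask & (1 << cell):
            x[1, cell // GRID_SIZE, cell % GRID_SIZE] = 1.0
    return x

class QNet(nn.Module):
    """Small convolutional network mapping the encoded state to one
    estimated action value per action, instead of a 2^64-row table."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * GRID_SIZE * GRID_SIZE, N_ACTIONS),
        )

    def forward(self, x):           # x: (batch, 2, 8, 8)
        return self.net(x)          # -> (batch, N_ACTIONS)
```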
With a function approximator, experience replay is very useful; without it, the learning process is likely to be unstable. The recent DQN approach for playing Atari games uses experience replay for this reason.
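For reference, a minimal replay buffer is nothing more than a store of past transitions sampled uniformly at random (a sketch, not the DQN authors' code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions and
    samples them uniformly, which breaks the correlation between
    successive updates when training a function approximator."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))    # -> [states, actions, rewards, next_states, dones]

    def __len__(self):
        return len(self.buffer)
```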