2

In the standard textbook RL setting we usually use the MDP framework, where we assume that the next state is conditionally independent of the whole history given the current state (and action). Obviously, in real life this is not always a valid assumption, and it can often be the reason an RL algorithm fails in a specific environment. Yet, the majority of current RL research assumes the Markov property. Why is that?
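To be concrete, the assumption I mean is that the transition distribution depends only on the current state and action, not on the rest of the history:

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t).$$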

EDIT: I am aware of higher-order MDPs, as mentioned in the comments. My question was more about what is currently done in practice by state-of-the-art RL algorithms. For example, DDPG with non-image observations (i.e. low-level observations such as torques, accelerations, etc.) considers only the last observation (without any observation augmentation). DQN and its derivatives applied to Atari do use several previous images, but the main reason is to infer velocities and movement of pixels (i.e. to make the image observations equivalent to the low-level observations mentioned earlier).
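To make it concrete, by "observation augmentation" I mean something like the following minimal sketch (the class name and the stack size `k` are just illustrative, not taken from any particular codebase):

```python
from collections import deque

import numpy as np


class FrameStack:
    """Keeps the last k observations and presents them as a single input.

    This is the kind of 'observation augmentation' I mean: the agent's
    effective state is the last k raw observations rather than only the
    most recent one, and k is typically small and hand-tuned.
    """

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # At the start of an episode, fill the buffer with copies of the
        # first observation so the stacked shape is always the same.
        for _ in range(self.k):
            self.frames.append(first_obs)
        return self._stacked()

    def step(self, obs):
        # Append the newest observation; the oldest one falls out automatically.
        self.frames.append(obs)
        return self._stacked()

    def _stacked(self):
        # Stack along a new leading axis, e.g. (k, H, W) for image observations.
        return np.stack(self.frames, axis=0)
```

With `k = 4` this is essentially the usual Atari setup.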

Indeed, the trick of augmenting observations with a few previous ones is sometimes used, but still rather rarely. Also, the number of previous observations considered is usually very small and manually tuned. But apart from empirical testing, how do we know that using a big number, say 50, is not the better choice (putting aside the computational complexity of using 50 images as input to a NN)? Furthermore, these models do not really account for the actions that were previously taken. I guess what I am trying to ask is: why are we not trying to use some more automated approach for determining these dependencies, for example something like an LSTM (apart from the fact that training such a model becomes more difficult)?
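For the LSTM idea, what I have in mind is something along these lines: feed (observation, previous action) pairs through a recurrent network and let it learn which parts of the history matter. Again, this is just a sketch, assuming a discrete action space and PyTorch; all names and sizes are made up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentQNetwork(nn.Module):
    """Q-network that reads a sequence of (observation, previous action) pairs.

    The LSTM is meant to learn which dependencies on the history matter,
    instead of us hand-picking how many past observations to stack.
    """

    def __init__(self, obs_dim, n_actions, hidden_size=128):
        super().__init__()
        self.n_actions = n_actions
        # Input at each time step: observation concatenated with a one-hot
        # encoding of the previously taken action.
        self.lstm = nn.LSTM(obs_dim + n_actions, hidden_size, batch_first=True)
        self.q_head = nn.Linear(hidden_size, n_actions)

    def forward(self, obs_seq, prev_action_seq, hidden=None):
        # obs_seq:         (batch, time, obs_dim) float tensor
        # prev_action_seq: (batch, time) long tensor of previous action indices
        prev_actions = F.one_hot(prev_action_seq, self.n_actions).float()
        x = torch.cat([obs_seq, prev_actions], dim=-1)
        out, hidden = self.lstm(x, hidden)
        # Q-values for each action, conditioned on the whole history seen so far.
        return self.q_head(out[:, -1]), hidden
```

(This is close in spirit to DRQN-style recurrent Q-networks, except that I also append the previous action to the input.)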

niko
  • 288
  • 2
  • 8
  • Possible duplicate of [Markov Process about only depending on previous state](https://stats.stackexchange.com/questions/2457/markov-process-about-only-depending-on-previous-state) – mlwida Feb 16 '18 at 15:12
  • You can encode the states in such a way that the previous state contains the relevant information of the whole history. See https://stats.stackexchange.com/questions/2457/markov-process-about-only-depending-on-previous-state – mlwida Feb 16 '18 at 15:13
  • Thanks, I did not clarify my initial question well enough. Please see the edit. – niko Feb 16 '18 at 15:59

1 Answer

2

Yet, the majority of current RL research assumes the Markov property. Why is that?

The main reason for assuming the Markov property is that it enables theoretical proofs (for example, proofs of convergence to optimal policies in the limit) for certain algorithms. Intuitively, you can interpret the Markov property as saying "my state representation contains all information that is relevant for decision-making". With that intuition, I think it's easy to see that you're never going to be able to prove anything about convergence to optimality without that assumption.

I suppose you can argue that theoretical proofs relying on unrealistic assumptions have limited value, but it is still considered useful to prove these kinds of properties at least for certain cases.

In practice, there's plenty of research where RL algorithms are empirically evaluated in settings where the Markov property may not entirely hold (or where it's unknown whether it holds). The assumption is just necessary for a strong theoretical framework.

But apart from empirical testing, how do we know that using a big number, say 50, is not the better choice (putting aside the computational complexity of using 50 images as input to a NN)?

We don't, and in practice it is precisely that computational complexity that will be the deciding factor.
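Just for a rough sense of scale (assuming the standard 84×84 grayscale preprocessing used for Atari): 4 stacked frames give 4 × 84 × 84 = 28,224 input values, while 50 stacked frames give 50 × 84 × 84 = 352,800, more than a tenfold increase in the input alone, and the same blow-up applies to every transition you store in a replay buffer.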

Furthermore, these models do not really account for the actions that were previously taken. I guess what I am trying to ask is: why are we not trying to use some more automated approach for determining these dependencies, for example something like an LSTM (apart from the fact that training such a model becomes more difficult)?

Under the Markov property, taking previously taken actions into account shouldn't be necessary: if older actions are still important, their effects should already be encapsulated in the state representation. Of course, that may not hold in practice, in which case your suggestions may lead to better empirical performance. I'm not 100% sure, but I suspect some people have already tried such ideas. Increasing network complexity in that way can also lead to a more difficult learning problem, though. You may theoretically improve the extent to which your network is capable of representing the truly optimal policy, but in practice make the learning problem so much harder that it struggles to learn anything. The simpler network may in theory be unable to learn a truly optimal policy, but at least be able to learn something "good enough" in practice.

Dennis Soemers
  • 2,238
  • 1
  • 12
  • 22