
Classical reinforcement learning (Q- or Sarsa-Learning) can be extended with models of the environment. These models are usually transition tables that contain the probability of arriving at a particular state given another state and one action.

In model-free learning, these transition probabilities are "incorporated" into the evaluation function. A model consisting of only a single transition table therefore offers no advantage over the model-free variants: its state predictions are just as limited by the Markov property as the value predictions of model-free methods are.
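For concreteness, such a single transition table can be learned by counting observed transitions. The following is a minimal hypothetical sketch (my own illustration, not tied to any particular library), assuming discrete states and actions:

```python
# Hypothetical sketch: estimating a single transition table from
# experience by counting (state, action) -> next_state occurrences.
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # (state, action) -> {next_state: n}

def update(state, action, next_state):
    """Record one observed transition."""
    counts[(state, action)][next_state] += 1

def probability(state, action, next_state):
    """Empirical probability of next_state given (state, action)."""
    total = sum(counts[(state, action)].values())
    return counts[(state, action)][next_state] / total if total else 0.0

# Example experience in a two-state world:
update("A", "right", "B")
update("A", "right", "B")
update("B", "left", "A")

print(probability("A", "right", "B"))  # 1.0
```

The question below is whether model learning can go beyond maintaining one such table.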

Are there machine learning methods to automatically generate the model for an exploring agent with incomplete knowledge about its environment (e.g. its position) that go beyond the update of a single transition table?

Are there, for example, methods that generate different transition tables which "segment" the environment based on their predictive success?

Reviewing the literature, I could not find an answer. The Encyclopedia of Machine Learning provides a current overview of Hierarchical Reinforcement Learning. But this is different from what I want to know, for at least one of the following reasons.

  1. The model is already given; the problem is learning an evaluation function with a hierarchical model.
  2. Learning the model is intricately connected to external reward, not independent from it.
  3. All these methods are concerned with optimizing reinforcement learning by pooling different states that can be treated as one. I am looking for one that differentiates apparently identical states to improve predictions. (This also distinguishes my question from this question.)

A simple illustration is a grid world where the agent has to reach a particular goal position. In contrast to traditional reinforcement learning as well as the hierarchical methods mentioned above, however, the state of the agent is not its absolute position in the environment but the four cells surrounding it. This introduces transition ambiguity.

I am looking for a way to automatically resolve this ambiguity.
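To make the relative-state setup concrete, here is a minimal hypothetical sketch (my own illustration, assuming the corridor grid from the edit below, encoded as a list of strings):

```python
# Hypothetical sketch: the agent in a corridor perceives only the four
# cells north, east, south, and west of its position, not the position itself.
GRID = ["xxxxxx",
        "x....x",
        "xxxxxx"]

def observe(row, col):
    # Contents of the cells north, east, south, and west of the agent,
    # concatenated into a 4-character observation string.
    return (GRID[row - 1][col] + GRID[row][col + 1]
            + GRID[row + 1][col] + GRID[row][col - 1])

# The four free cells are at columns 1..4 of the middle row:
print([observe(1, c) for c in range(1, 5)])
# -> ['x.xx', 'x.x.', 'x.x.', 'xxx.']
```

The two middle positions yield the same observation ("x.x."), which is exactly the transition ambiguity described above.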

Edit:

xxxxxx
x....x
xxxxxx

An agent that moves in the above grid world and that perceives the four surrounding cells, for example, might model the environment with two separate transition tables. One for the left half and one for the right half of the environment. Individually, each table is unambiguous, although one single transition table for the whole environment would not be.

Edit:

A single ambiguous transition table (each state is a 4-character string giving the contents of the cells north, east, south, and west of the agent, in that order; the actions n, e, s, w are movements in these directions; columns are the possible next states):

        x.xx x.x. xxx.
x.xx, n    1    0    0
x.xx, e    0    1    0
x.xx, s    1    0    0
x.xx, w    1    0    0
x.x., n    0    1    0
x.x., e    0    1    1
x.x., s    0    1    0
x.x., w    1    1    0
xxx., n    0    0    1
xxx., e    0    0    1
xxx., s    0    0    1
xxx., w    0    1    0

Notice the ambiguity in x.x., e and x.x., w. This ambiguity can be resolved by "segmenting" the environment or splitting up the transition table as follows.

Two unambiguous transition tables:

        x.xx x.x.
x.xx, n    1    0
x.xx, e    0    1
x.xx, s    1    0
x.xx, w    1    0
x.x., n    0    1
x.x., e    0    1
x.x., s    0    1
x.x., w    1    0


        x.x. xxx.
xxx., n    0    1
xxx., e    0    1
xxx., s    0    1
xxx., w    1    0
x.x., n    1    0
x.x., e    0    1
x.x., s    1    0
x.x., w    1    0
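The splitting heuristic suggested in the comments below (group transitions into one table as long as none contradicts it) could be sketched as follows. This is a hypothetical greedy implementation of my own, assuming deterministic tables and a temporally ordered stream of transitions:

```python
# Hypothetical sketch: greedily segment a stream of transitions into
# deterministic tables; a contradiction (same (state, action) key, a
# different next state) triggers a switch to a consistent table, or
# opens a new one if none exists.
def segment(transitions):
    """transitions: ordered list of (state, action, next_state) triples."""
    tables = [{}]   # each table maps (state, action) -> next_state
    current = 0
    for s, a, s2 in transitions:
        if tables[current].get((s, a), s2) != s2:
            # Contradiction: find another consistent table or open a new one.
            for i, t in enumerate(tables):
                if t.get((s, a), s2) == s2:
                    current = i
                    break
            else:
                tables.append({})
                current = len(tables) - 1
        tables[current][(s, a)] = s2
    return tables

# Walking right through the corridor and back again:
trace = [("x.xx", "e", "x.x."), ("x.x.", "e", "x.x."), ("x.x.", "e", "xxx."),
         ("xxx.", "w", "x.x."), ("x.x.", "w", "x.x."), ("x.x.", "w", "x.xx")]
print(len(segment(trace)))  # 2
```

On this trace the heuristic produces two tables whose entries correspond to the left-half and right-half tables shown above. Whether such a greedy, order-dependent scheme converges to a stable segmentation in general is exactly what I am asking about.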

Edit: Related question

  • The transition function that is typically learned is from states to states. Based on what other information could you "segment" the environment? – Neil G Jan 12 '17 at 22:45
  • @NeilG transitions might be grouped into one transition table *as long* as there is no contradicting transition. temporal proximity is a possible heuristic on which transitions "belong" together and which don't. – wehnsdaefflae Jan 12 '17 at 23:04
  • Are you suggesting using a history of states to predict the next state? – Neil G Jan 13 '17 at 00:42
  • @NeilG Only as an example. Of course one could always extend the conditional in the transition table past the last state to also include the state before that and the one before *that* etc. However, this does not segment the environment as I intend to. Also, there will always be environments which cannot be described by any finite ordered Markov process. I'm looking for something that provides several transition tables for a process that cannot be described by one. – wehnsdaefflae Jan 13 '17 at 07:53
  • @NeilG Also note the above-mentioned definition of "state" as relative to the agent (it's immediate neighborhood). – wehnsdaefflae Jan 13 '17 at 08:08
  • "I'm looking for something that provides several transition tables for a process that cannot be described by one." This doesn't make any sense to me. Like you said, you can always extend the state space to have all of the information you want. – Neil G Jan 13 '17 at 08:17
  • @NeilG please consider the example transition tables I added. Does it now make sense? – wehnsdaefflae Jan 13 '17 at 08:42
  • The segmenting Boolean is just another component of your state. If you're not allowed to observe that, your state is *partially observable*, but you can still have transition *probabilities*. – Neil G Jan 13 '17 at 09:28
  • @NeilG first off thanks for your comments. i know i can derive probabilities from ambiguous transition tables and that the problem can be regarded as a partially observable mdp. *to avoid that* i defined "state" not as the absolute position of the agent in the world but as its relative surroundings. the intention is to describe an agent **that already considers individual segments of its environment to be fully observable** (i.e., order-1 Markov determined). contradicting evidence enables the agent to switch tables. is there a way to automatically decompose transition matrices as shown above? – wehnsdaefflae Jan 13 '17 at 09:39
  • Know which table you should be using is another component of the state. You either have to observe it or you have to infer it from observations or else you cannot know which table you should be using and it's just as if the universe is flipping a coin as to what happens. – Neil G Jan 13 '17 at 09:52
  • @NeilG only if you apply the usual conception of "state" from the perspective of an outside observer. my intention is, however, to consider states from the agent's perspective. the whole gridworld is only to make the transitions more understandable. shall we move on to chat or did we arrive at an impasse ;) ? – wehnsdaefflae Jan 13 '17 at 10:00
  • 1
    I'm talking about states from the agent's perspective. Everything that you use to make a decision is the *state*. What you get from the outside world is the *observation*. You infer a distribution over states given observations. – Neil G Jan 13 '17 at 11:13
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/51699/discussion-between-neil-g-and-wehnsdaefflae). – Neil G Jan 13 '17 at 11:13
  • Have you seen this [paper](http://mlg.eng.cam.ac.uk/pub/pdf/DeiRas11.pdf) before? – jkt Jan 14 '17 at 02:25
  • @YBE No I didn't. It looks very promising, thanks a lot. – wehnsdaefflae Jan 14 '17 at 03:46
