Reinforcement learning is often described in an MDP or POMDP framework. By "framework" I mean a set of abstract concepts that can be used to describe a large number of different specific problems or games at once. Frameworks are useful because they let you reason about many different specific things at the same time. In the (PO)MDP framework, these concepts include things like "reward", "state", and "transition".
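To make those concepts concrete, here is the standard tuple formulation (a minimal sketch in the usual textbook notation, not anything specific to your problem):

$$\text{MDP}: \; \langle S, A, T, R, \gamma \rangle, \qquad T(s' \mid s, a), \quad R(s, a, s')$$

$$\text{POMDP}: \; \langle S, A, T, R, \Omega, O, \gamma \rangle, \qquad O(o \mid s', a)$$

where $S$ is the set of states, $A$ the set of actions, $T$ the transition function, $R$ the reward function, $\gamma$ the discount factor, and (for POMDPs) $\Omega$ the set of observations with observation function $O$.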
Driving a car is an example of a task that can be abstracted as a POMDP: the state consists of the relevant state of the world (e.g. the road ahead, nearby cars, pedestrians and other objects, the car itself and its mechanical parts), the transition function is simply the laws of physics, and the reward is a bit subjective, but you can imagine being rewarded for getting to your destination and penalized for crashing into things.
A robot trying to navigate a maze can also be abstracted as a POMDP: the state consists of the location of the robot in the maze, the transition is again governed by the laws of physics that determine how the robot can physically move, and the reward is presumably positive if the robot solves the maze.
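As a rough illustration of how those three pieces map onto code, here is a toy gridworld version of the maze (the maze layout, goal cell, and reward values are all made up for the example):

```python
# Toy maze: 0 = free cell, 1 = wall. The state is the robot's (row, col).
MAZE = [
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
]
GOAL = (2, 2)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def transition(state, action):
    """Deterministic 'laws of physics': move one cell unless a wall or the
    edge of the maze blocks the way, in which case the robot stays put."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < len(MAZE) and 0 <= new_col < len(MAZE[0]) \
            and MAZE[new_row][new_col] == 0:
        return (new_row, new_col)
    return state

def reward(state, action, next_state):
    """+1 for reaching the goal cell, 0 otherwise."""
    return 1.0 if next_state == GOAL else 0.0
```

As written this toy version is fully observable, so it is really an MDP; to make it a POMDP you would add an observation function, e.g. the robot only sees which of its four neighbouring cells are walls rather than its exact coordinates.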
So returning to your questions:
how to generate the next state?
The next state comes from the transition function of your (PO)MDP. Exactly what that transition function is depends on what your (PO)MDP is modeling; it may be physical laws, or the rules of a board game, etc. If it's a board game, you can just use the rules of the game to determine what happens next.
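For example, if the transition probabilities of a small (PO)MDP are known explicitly, generating the next state is just sampling from that distribution (the states, action, and probabilities below are hypothetical):

```python
import random

# Hypothetical transition table: P(s' | s, a) stored as nested dicts.
P = {
    ("s0", "a"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a"): {"s1": 1.0},
}

def next_state(state, action):
    """Sample s' from P(. | s, a)."""
    dist = P[(state, action)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs, k=1)[0]

# next_state("s0", "a") returns "s1" roughly 80% of the time.
```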
And for the reward r(s,a,s′), in the algorithms, why isn't it an input function
In order for the (PO)MDP framework to be able to model a large number of different games and problems, the abstract reward function is often formulated as a random variable. Maybe you're playing a game where you roll a die and receive the resulting number of dollars (aka reward). If MDPs could only have deterministic rewards, it would be difficult to fit this type of game into the framework. So, in an effort to make the framework as general as possible, rewards are allowed to be stochastic.
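Continuing the dice example, a stochastic reward function just returns a sample; a deterministic reward is then simply the special case where the same (s, a, s′) always produces the same number (the state and action arguments here are placeholders):

```python
import random

def reward(state, action, next_state):
    """Roll a six-sided die and receive that many dollars as reward."""
    return random.randint(1, 6)
```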