
This is from the paper 'Algorithms for Inverse Reinforcement Learning' by Ng and Russell (2000).

We assume that we have the ability to simulate trajectories in the MDP (from the initial state $s_0$) under the optimal policy, or under any policy of our choice. For each policy $\pi$ that we will consider (including the optimal one), we will need a way of estimating $V^{\pi}(s_0)$ for any setting of the $\alpha_i$'s. To do this, we first execute $m$ $\underline{\text{Monte Carlo}}$ trajectories under $\pi$.

Sorry for the long quote. What is the meaning of 'Monte Carlo' in the last sentence?

My first thought was simply to run the simulation again and again, $m$ times. But on rethinking it, I might be very wrong.
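
To make my guess concrete, here is a minimal sketch of what I had in mind (in Python; the `simulate_trajectory` helper and the discount factor `gamma` are hypothetical placeholders of mine, not anything from the paper):

```python
def estimate_value(simulate_trajectory, m, gamma=0.9):
    """Rough guess: estimate V^pi(s0) by running the simulator m times
    under pi and averaging the discounted returns.

    `simulate_trajectory()` is assumed to return the list of rewards
    observed along one simulated trajectory starting from s0 under pi.
    """
    total = 0.0
    for _ in range(m):
        rewards = simulate_trajectory()
        total += sum(gamma**t * r for t, r in enumerate(rewards))
    return total / m  # Monte Carlo estimate of V^pi(s0)
```

Is that roughly what is meant, or does 'Monte Carlo' refer to something more specific here?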

Tim
cgo

2 Answers


What Ng and Russell seem to be saying is that for each policy $\pi$ they simulate $m$ "possible" outcomes for processes starting at the initial state $s_0$. By "trajectories" they mean the possible developments in time of the simulated processes -- different possible scenarios created by simulation. So you were correct: 'Monte Carlo' here stands for "simulation" (see also this thread).

Tim

Monte Carlo here simply means using sampling to estimate the values. In practice this means collecting a sequence of (state, action) pairs, i.e. a trajectory, under some policy of your choice; from such trajectories you can then compute the relevant quantities, such as $V^{\pi}(s_0)$. See the sketch below.
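
For illustration, here is a minimal sketch of that procedure (the `env.reset()` / `env.step(a)` interface, the `policy` function, and the discount factor are assumptions made for the example, not part of Ng and Russell's paper):

```python
def rollout(env, policy, horizon=100):
    """Collect one trajectory (a list of (state, action, reward) tuples)
    starting from the initial state s0 and following `policy`."""
    trajectory = []
    s = env.reset()                    # assumed interface: returns the initial state s0
    for _ in range(horizon):
        a = policy(s)
        s_next, r, done = env.step(a)  # assumed interface: next state, reward, terminal flag
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory


def mc_value_estimate(env, policy, m=1000, gamma=0.9):
    """Monte Carlo estimate of V^pi(s0): average the discounted return
    over m simulated trajectories under `policy`."""
    total = 0.0
    for _ in range(m):
        traj = rollout(env, policy)
        total += sum(gamma**t * r for t, (_, _, r) in enumerate(traj))
    return total / m
```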

makokal