
Suppose I have an episodic Markov Decision Process where all episodes start in the same state, $s_0$. I also have a parameterized policy $\pi_\theta$, and I'm trying to find a $\theta$ such that the performance of the policy is maximized in this environment.

Let's examine two different ways of defining performance for the policy. The first one is simply the value (expected accumulated reward) of the policy in the initial state: $$ J_1(\theta)=V_{\pi_\theta}(s_0) $$

The second one is the average reward for the same policy: $$ J_2(\theta)=\mathbb E_{S\sim d_{\pi_\theta}, A\sim \pi_\theta(\cdot|S)}\big[R(S,A)\big] $$ where $d_{\pi_\theta}$ is the on-policy distribution for policy $\pi_\theta$.
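
To be explicit about what I mean by $d_{\pi_\theta}$ in the episodic case: as far as I recall, Sutton & Barto define the on-policy distribution as the expected number of visits to each state per episode, normalized to sum to one: $$ d_{\pi_\theta}(s)=\frac{\eta_{\pi_\theta}(s)}{\sum_{s'}\eta_{\pi_\theta}(s')}, \qquad \eta_{\pi_\theta}(s)=\mathbb E_{\pi_\theta}\Big[\sum_{t=0}^{T-1}\mathbf 1\{S_t=s\}\Big], $$ where $T$ is the (random) episode length.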

If I remember Sutton's book correctly, then $\nabla_\theta J_1(\theta)\propto\nabla_\theta J_2(\theta)$. Is this really the case? In other words, does maximizing the average reward also mean we're maximizing the expected return from the start state? If so, how can one prove it?
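
To make the comparison concrete, here is a minimal numerical sketch of the two objectives (not a proof, just a way to poke at the question). It assumes the undiscounted episodic setting, takes $d_{\pi_\theta}$ to be the normalized expected visit counts as above, and uses finite-difference gradients; the tiny two-state MDP, the tabular softmax parameterization, and all the numbers are made up purely for illustration:

```python
import numpy as np

# Tiny undiscounted episodic MDP: 2 non-terminal states, 2 actions, plus an
# implicit absorbing terminal state. All numbers are arbitrary.
#
# P[a, s, s'] = probability of moving to non-terminal state s' after taking
#               action a in state s; the remaining mass ends the episode.
# R[s, a]     = expected immediate reward for taking action a in state s.
P = np.array([[[0.0, 0.9],    # action 0 in s0: go to s1 w.p. 0.9, else terminate
               [0.3, 0.0]],   # action 0 in s1: go to s0 w.p. 0.3, else terminate
              [[0.0, 0.2],    # action 1 in s0
               [0.0, 0.5]]])  # action 1 in s1 (self-loop w.p. 0.5)
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
START = np.array([1.0, 0.0])  # all episodes start in s0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def objectives(theta):
    """Return (J1, J2) for a tabular softmax policy with parameters theta[s, a]."""
    pi = np.array([softmax(theta[s]) for s in range(2)])   # pi[s, a]
    P_pi = np.einsum('sa,ast->st', pi, P)                   # policy-induced transitions
    r_pi = (pi * R).sum(axis=1)                             # expected one-step reward per state
    eta = np.linalg.solve(np.eye(2) - P_pi.T, START)        # expected visits per episode
    J1 = eta @ r_pi                                         # = V_pi(s0), undiscounted return
    d = eta / eta.sum()                                     # on-policy distribution d_pi
    J2 = d @ r_pi                                           # average reward under d_pi
    return J1, J2

def fd_grad(f, theta, eps=1e-6):
    """Finite-difference gradient of a scalar function of theta."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        t = theta.copy()
        t[idx] += eps
        g[idx] = (f(t) - f(theta)) / eps
    return g

theta = np.array([[0.3, -0.2],
                  [0.1,  0.4]])
g1 = fd_grad(lambda t: objectives(t)[0], theta)
g2 = fd_grad(lambda t: objectives(t)[1], theta)
print("grad J1:\n", g1)
print("grad J2:\n", g2)
print("elementwise ratio:\n", g1 / g2)  # a constant ratio would mean the gradients are proportional here
```

Comparing the two finite-difference gradients elementwise only checks the claim for this particular MDP and parameter setting, of course; it doesn't prove anything in general.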

Note

This question might seem similar to Average expected reward vs expected reward for start-state, but the similarity is superficial. I am aware that $V_{\pi_\theta}(s_0)\neq\mathbb E_{S\sim d_{\pi_\theta}}[V_{\pi_\theta}(S)]$, which seems to be the source of confusion for that asker. I'm not asking whether such quantities are equal; I'm asking whether maximizing one implies maximizing the other.

JLagana
  • Hopefully someone else will provide a thorough answer, because I don't know much about this material, but it sounds like it should hold: isn't the expected cumulative reward divided by the horizon length equal to the average reward? Also, do you recommend Sutton's book as the one to read for this material? Thanks. – mlofton Apr 02 '20 at 13:25
  • Hm, I'm not sure. Well, if the return is discounted then they're definitely not the same, right? I'm not sure if your last sentence is a question. If it is, then yes, I can definitely recommend Sutton's book. – JLagana Apr 03 '20 at 09:21
  • Thanks for the book recommendation. Still, I would think that, even with discounting, maximizing one would maximize the other, because discounting is just a scale factor. So maximizing [scale factor * (cumulative reward / n)] = maximizing [scale factor * (average reward)]? – mlofton Apr 04 '20 at 13:11
  • But the scale factor inside the sum changes from term to term: $\sum_i \gamma^i R_{i+1}$. Each $\gamma$ is raised to a different power, so you can't factor it out (a small worked example follows this thread). – JLagana Apr 08 '20 at 15:23
  • Yes, I see what you mean: you're saying that maximizing the discounted average reward, step by step, is not the same as maximizing the discounted cumulative reward, step by step? I think you are correct; my mistake. Still, it would be interesting to ask an expert what the actual equivalence statement is. Thanks. – mlofton Apr 09 '20 at 16:23
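
A small worked example of the point in the thread above (numbers chosen purely for illustration): take $\gamma=0.5$ and two two-step reward sequences, $(1,1)$ and $(0,2)$. Their undiscounted average rewards are equal, but their discounted returns are not: $$ 1+0.5\cdot 1=1.5 \quad\text{vs.}\quad 0+0.5\cdot 2=1, \qquad\text{while}\qquad \tfrac{1+1}{2}=\tfrac{0+2}{2}=1. $$ So the discounted return is not a fixed multiple of the average reward, and the discount cannot be pulled out of the sum as a single scale factor.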

0 Answers