Questions tagged [markov-decision-process]

38 questions
3
votes
1 answer

What is the difference between Reinforcement Learning (RL) and a Markov Decision Process (MDP)?

What is the difference between Reinforcement Learning (RL) and a Markov Decision Process (MDP)? I believed I understood the principles of both, but now that I need to compare the two I feel lost. They mean almost the same to me. Surely they are…
Pluviophile
  • 2,381
  • 8
  • 18
  • 45
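One way to see the difference concretely: an MDP is the model (states, actions, transitions, rewards), and with that model in hand you can plan, e.g. by value iteration; RL methods such as Q-learning recover the same values from sampled experience alone. A minimal sketch on a hypothetical two-state, two-action MDP (all numbers invented for illustration):

```python
import numpy as np

# P[a, s, s2] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.5, 0.5], [0.1, 0.9]]])   # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Planning (the MDP is fully known): value iteration on the Bellman equation.
V = np.zeros(2)
for _ in range(1000):
    V = (R.T + gamma * P @ V).max(axis=0)   # max over actions

# RL (the MDP is treated as unknown): Q-learning estimates the same values
# purely from sampled transitions, with epsilon-greedy exploration.
rng = np.random.default_rng(0)
Q_hat = np.zeros((2, 2))
s = 0
for _ in range(100000):
    a = rng.integers(2) if rng.random() < 0.1 else int(Q_hat[s].argmax())
    s2 = rng.choice(2, p=P[a, s])
    Q_hat[s, a] += 0.05 * (R[s, a] + gamma * Q_hat[s2].max() - Q_hat[s, a])
    s = s2
```

The planner needs `P` and `R` explicitly; the Q-learner only ever touches them through sampled `(s, a, r, s2)` tuples, which is the practical boundary between the two notions.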
3
votes
1 answer

States in Bandit Problems

I am wondering if there is an interpretation of the bandit problem with more than one state. I know that there are versions which view each slot machine as an independent Markovian machine, and as such the states evolve when an arm is pulled.…
3
votes
1 answer

UCB Exploration in Reinforcement Learning

I have two questions regarding upper confidence bound (UCB) exploration in reinforcement learning: UCB exploration is derived from Hoeffding's inequality, which assumes that the reward is bounded in the interval [0,1]. If the rewards are not…
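On the first question, the usual fix when rewards live in a known interval [lo, hi] is to rescale them to [0, 1] before applying the Hoeffding-based bonus. A sketch of UCB1 with that rescaling (the two arms and their payouts below are hypothetical):

```python
import math
import random

def ucb1(pull, n_arms, horizon, lo=0.0, hi=1.0):
    """UCB1 with rewards rescaled from [lo, hi] to [0, 1], so the
    Hoeffding-based bonus sqrt(2 ln t / n_i) remains valid."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                      # play each arm once first
        else:
            arm = max(range(n_arms), key=lambda i:
                      means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = (pull(arm) - lo) / (hi - lo)     # rescale reward to [0, 1]
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return counts, means

# Two hypothetical arms paying 0 or 10, with success rates 0.3 and 0.7.
random.seed(0)
counts, means = ucb1(lambda a: 10.0 * (random.random() < (0.3, 0.7)[a]),
                     n_arms=2, horizon=5000, lo=0.0, hi=10.0)
```

If the rewards are unbounded (e.g. sub-Gaussian), the same structure works but the bonus must come from a concentration inequality matching the tail, not from Hoeffding.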
3
votes
0 answers

Model or State Uncertainty in Queueing Model due to uncertain arrival rate

$\textbf{Introduction}$ I am currently modelling a scenario where two queues need to be served by a single server under a non-preemptive discipline. I am quite sorted on generating the optimal policy via Value or Policy Iteration when given the arrival…
3
votes
2 answers

Uniqueness of the optimal value function for an MDP

Suppose we have a Markov decision process with a finite state set and a finite action set. We calculate the expected reward with a discount factor $\gamma \in [0,1]$. In Chapter 3.8 of the book "Reinforcement Learning: An Introduction" (by Andrew Barto…
jakab922
  • 181
  • 1
  • 9
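A sketch of the standard uniqueness argument, assuming $\gamma < 1$ (the case $\gamma = 1$ needs extra conditions): the Bellman optimality operator $T^*$ is a $\gamma$-contraction in the sup norm, so by Banach's fixed-point theorem it has exactly one fixed point, the optimal value function $V^*$:

```latex
\|T^*V - T^*W\|_\infty
  \le \gamma \max_{s,a} \sum_{s'} p(s' \mid s, a)\,\lvert V(s') - W(s') \rvert
  \le \gamma \,\|V - W\|_\infty .
```

The first inequality uses $\lvert \max_a f(a) - \max_a g(a) \rvert \le \max_a \lvert f(a) - g(a) \rvert$.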
2
votes
1 answer

Is a policy $\pi(s)$ on Markov decision process a random variable?

Citing Wikipedia: The goal in a Markov decision process is to find a good "policy" for the decision maker: a function $\pi$ that specifies the action $\pi(s)$ that the decision maker will choose when in state $s$. Once a Markov decision process…
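For reference, the Wikipedia definition quoted above uses a deterministic policy; a stochastic policy generalizes it, and in neither case is $\pi(s)$ itself a random variable: the policy is a fixed function, and randomness enters only when actions are sampled from it.

```latex
\pi : S \to A \ \text{(deterministic)}, \qquad
\pi : S \to \Delta(A), \quad \pi(a \mid s) = \Pr(A_t = a \mid S_t = s) \ \text{(stochastic)}.
```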
2
votes
1 answer

How to solve a Markov Decision Problem with State Transition Matrix and Reward Matrix

I'm stuck solving a simple dynamic probabilistic model. I have three states {Sunny, Cloudy, Rainy}. I have the transition probability matrix for the states transitioning to one another (e.g. Sunny -> Cloudy or Sunny -> Sunny). For the action…
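Given per-action transition matrices and a reward matrix, policy iteration solves such a model directly. A sketch with hypothetical placeholder numbers standing in for the question's own matrices:

```python
import numpy as np

states = ["Sunny", "Cloudy", "Rainy"]
# Hypothetical matrices (substitute the question's own numbers):
# P[a, s, s2] = transition probability under action a, R[s, a] = reward.
P = np.array([
    [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]],   # action 0
    [[0.5, 0.4, 0.1], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6]],   # action 1
])
R = np.array([[4.0, 3.0], [2.0, 1.0], [-1.0, 0.0]])
gamma = 0.95

policy = np.zeros(3, dtype=int)
while True:
    # Policy evaluation: solve the linear system (I - gamma P_pi) V = R_pi.
    P_pi = P[policy, np.arange(3)]          # rows of P under the policy
    R_pi = R[np.arange(3), policy]
    V = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily w.r.t. the one-step lookahead.
    improved = (R.T + gamma * P @ V).argmax(axis=0)
    if np.array_equal(improved, policy):
        break
    policy = improved
```

At termination `policy[s]` is the optimal action index for each weather state and `V` the corresponding optimal values; value iteration on the same matrices would reach the same answer without the linear solve.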
2
votes
1 answer

Dyna-Q Algorithm Reinforcement Learning

In step (f) of the Dyna-Q algorithm we plan by taking random samples from the experience/model for some number of steps. Wouldn't it be more efficient to construct an MDP from experience by computing the state transition probabilities and reward…
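For reference, a minimal sketch of the tabular Dyna-Q update, with step (f) as random replay from the learned model; the state/action encoding is hypothetical:

```python
import random

def dyna_q_update(Q, model, s, a, r, s2, alpha=0.1, gamma=0.95, n_planning=10):
    """One Dyna-Q step on tabular Q (dict: state -> list of action values)."""
    # (d) direct Q-learning update from the real transition
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
    # (e) model learning: remember the observed outcome
    # (deterministic-world assumption, as in tabular Dyna-Q)
    model[(s, a)] = (r, s2)
    # (f) planning: replay randomly chosen previously observed (s, a) pairs
    for _ in range(n_planning):
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        Q[ps][pa] += alpha * (pr + gamma * max(Q[ps2]) - Q[ps][pa])

# Toy usage: one real transition from state 0 to state 1 with reward 1.
random.seed(0)
Q = {0: [0.0, 0.0], 1: [0.0, 0.0]}
model = {}
dyna_q_update(Q, model, 0, 1, 1.0, 1)
```

Replacing the last-outcome table with estimated transition probabilities (the question's suggestion) is essentially certainty-equivalence planning; it can be more sample-efficient but costs more memory and compute per step, which is the trade-off Dyna-Q's cheap replay avoids.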
1
vote
0 answers

Fixed point of the Bellman operator for suboptimal policies

Consider an MDP and let the Bellman operator be defined as follows, $$ (T^\pi_\gamma V)(s) = \sum_{a\in A}\pi(a\mid s)\big(r(s,a) + \gamma \sum_{s' \in S} p(s'\mid s,a) V(s')\big) $$ where $\pi:S\to \Delta(A)$ is a policy, i.e., a function that maps…
Erik M
  • 111
  • 3
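For any fixed policy $\pi$, optimal or not, the operator above is a $\gamma$-contraction in the sup norm, so it has a unique fixed point, namely the value function of that policy:

```latex
\|T^\pi_\gamma V - T^\pi_\gamma W\|_\infty \le \gamma \,\|V - W\|_\infty ,
\qquad T^\pi_\gamma V^\pi = V^\pi .
```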
1
vote
0 answers

Bellman Optimality Operator fixed point

I'm reading Szepesvári's book on RL. My question concerns the proof of Theorem A.10 (p. 71). Theorem: Let $V$ be the fixed point of $T^*$ and assume that there is a policy $\pi$ which is greedy w.r.t. $V$: $T^\pi V = T^* V$. Then $V = V^*$ and $\pi$ is an…
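A sketch of the two halves of that argument (notation may differ slightly from Szepesvári's): greediness makes $V$ a fixed point of $T^\pi$, so $V$ is attained by $\pi$; monotonicity of $T^*$ then bounds every other policy from above by $V$:

```latex
V = T^* V = T^\pi V \;\Rightarrow\; V = V^\pi \le V^* ,
\qquad
V^{\pi'} = T^{\pi'} V^{\pi'} \le T^* V^{\pi'}
  \;\Rightarrow\; V^{\pi'} \le \lim_{n\to\infty} (T^*)^n V^{\pi'} = V
  \;\;\text{for every } \pi', \ \text{so } V^* \le V .
```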
1
vote
0 answers

Is a random policy a stochastic policy?

I'm a student starting to study RL. When I studied MDPs and looked at the gridworld example, I had one question. In the gridworld, we usually assume that we can take four actions in any state, e.g. up, down, left, right. In this case, if we have a…
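In the sense usually meant: yes, a uniform random policy is the stochastic policy that spreads probability evenly over the actions available in each state, e.g. $1/4$ each for up, down, left, right in the gridworld:

```latex
\pi(a \mid s) = \frac{1}{\lvert \mathcal{A}(s) \rvert}
\quad \text{for all } a \in \mathcal{A}(s).
```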
1
vote
0 answers

OpenAI Gym for the TSP problem?

In a previous question I asked about the use of OpenAI Gym as a vehicle for modeling business problems as MDPs. A comment suggested that I start a new question with a more refined scope. In general, I'm interested in RL for combinatorial optimization. As…
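Absent a ready-made Gym environment, the TSP maps naturally onto the reset/step interface. A dependency-free sketch following the classic Gym `(obs, reward, done, info)` convention, with all class and method names hypothetical:

```python
import math

class TSPEnv:
    """Minimal TSP environment in the classic Gym reset/step style,
    without depending on the gym package. State: current city plus the
    frozen set of unvisited cities; action: index of the next city."""

    def __init__(self, coords):
        self.coords = coords
        self.n = len(coords)

    def _dist(self, i, j):
        (x1, y1), (x2, y2) = self.coords[i], self.coords[j]
        return math.hypot(x1 - x2, y1 - y2)

    def reset(self):
        self.current = 0
        self.unvisited = set(range(1, self.n))
        return (self.current, frozenset(self.unvisited))

    def step(self, action):
        assert action in self.unvisited, "must move to an unvisited city"
        reward = -self._dist(self.current, action)   # negative travel cost
        self.current = action
        self.unvisited.discard(action)
        done = not self.unvisited
        if done:                                     # close the tour
            reward -= self._dist(self.current, 0)
        return (self.current, frozenset(self.unvisited)), reward, done, {}

# Toy instance: four cities on a unit square, visited in order.
env = TSPEnv([(0, 0), (0, 1), (1, 1), (1, 0)])
obs = env.reset()
total = 0.0
for a in (1, 2, 3):
    obs, reward, done, info = env.step(a)
    total += reward
```

Because the unvisited set is part of the observation, the state space is exponential in the number of cities, which is exactly the scaling obstacle the question is circling.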
1
vote
1 answer

What kind of model can optimize the allocation of a resource in the context of a time-to-event outcome?

I have a list of N patients that are competing for one treatment at each time point. A treatment becomes available at times t=1,...,T. I want to build a model that can take the time-varying characteristics of all the patients at time t, when a…
1
vote
0 answers

Optimal action-value as function of optimal value. Proof

Currently reading through Algorithms for Reinforcement Learning, I think these notes are good, but there are bits that are a bit unclear, and I have a few questions that I think are quite basic: Definition of the optimal value function…
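The relation the notes build toward, in the usual notation: the optimal action value is one step of lookahead on the optimal state value, and the optimal state value is its maximum over actions:

```latex
Q^*(s, a) = r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^*(s'),
\qquad
V^*(s) = \max_{a \in A} Q^*(s, a).
```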
1
vote
1 answer

Equivalent definitions of Markov Decision Process

I'm currently reading Sutton's Reinforcement Learning, where Chapter 3 defines the notion of an MDP. The author seems to be saying that an MDP is completely defined by means of the probability $p(s_{t+1},r_t | s_t,…
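In Sutton and Barto's notation, the four-argument distribution does determine everything else: both the state-transition kernel and the expected reward are marginals of it, which is why the two common definitions of an MDP are equivalent:

```latex
p(s_{t+1} \mid s_t, a_t) = \sum_{r} p(s_{t+1}, r \mid s_t, a_t),
\qquad
r(s_t, a_t) = \sum_{r} r \sum_{s_{t+1}} p(s_{t+1}, r \mid s_t, a_t).
```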