Questions tagged [markov-decision-process]
38 questions
3 votes, 1 answer
What is the difference between Reinforcement Learning (RL) and a Markov Decision Process (MDP)?
I believed I understood the principles of both, but now that I need to compare the two I feel lost. They mean almost the same to me. Surely they are…

Pluviophile
3 votes, 1 answer
States in Bandit Problems
I am wondering if there is an interpretation of the Bandit Problem with more than one state. I know that there are versions which view each slot machine as an independent Markovian machine, and as such the states evolve when an arm is pulled.…

dezdichado
3 votes, 1 answer
UCB Exploration in Reinforcement Learning
I have two questions regarding upper confidence bound (UCB) exploration in reinforcement learning:
UCB exploration is derived from Hoeffding's inequality, which assumes that the reward is bounded in the interval $[0,1]$. If the rewards are not…
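For context on the excerpt above: the Hoeffding-based bonus in UCB1 indeed presumes rewards in $[0,1]$. A minimal sketch of UCB1 action selection (the constant `c` and all names here are illustrative assumptions, not the asker's code):

```python
import math

def ucb1_action(counts, values, t, c=2.0):
    """Pick the arm maximizing the UCB1 score.

    counts[i] : number of times arm i was pulled
    values[i] : empirical mean reward of arm i
    t         : total number of pulls so far
    """
    # Pull every arm once before trusting the bound.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    # Hoeffding-based bonus assumes rewards in [0, 1]; for rewards
    # in [a, b], rescale them or widen the bonus by a factor (b - a).
    scores = [values[i] + math.sqrt(c * math.log(t) / counts[i])
              for i in range(len(counts))]
    return max(range(len(scores)), key=scores.__getitem__)
```

For unbounded (e.g. sub-Gaussian) rewards, the same template applies with a concentration bound matched to the reward distribution.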

gnikol
3 votes, 0 answers
Model or State Uncertainty in Queueing Model due to uncertain arrival rate
Introduction
I am currently modelling a scenario where two queues need to be served by a single server under a non-preemptive discipline. I can generate the optimal policy via Value or Policy Iteration when given the arrival…

Dylan Solms
3 votes, 2 answers
Uniqueness of the optimal value function for an MDP
Suppose we have a Markov decision process with a finite state set and a finite action set. We calculate the expected reward with a discount of $\gamma \in [0,1]$.
In chapter 3.8 of the book "Reinforcement Learning: An Introduction" (by Andrew Barto…
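A hedged sketch of why uniqueness holds for the setting in this question (standard contraction argument, not a quote from the book): the Bellman optimality operator $T^*$ is a $\gamma$-contraction in the sup norm,
$$
\|T^* V_1 - T^* V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty,
$$
so for $\gamma < 1$ the Banach fixed-point theorem gives a unique fixed point $V^*$; the edge case $\gamma = 1$ is exactly where this argument breaks down.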

jakab922
2 votes, 1 answer
Is a policy $\pi(s)$ on Markov decision process a random variable?
Citing Wikipedia:
The goal in a Markov decision process is to find a good "policy" for
the decision maker: a function $\pi$ that specifies the action
$\pi(s)$ that the decision maker will choose when in state $s$. Once
a Markov decision process…

Multivac
2 votes, 1 answer
How to solve a Markov Decision Problem with State Transition Matrix and Reward Matrix
I'm stuck solving a simple dynamic probabilistic model. I have three states {Sunny, Cloudy, Rainy}.
I have the transition probability matrix for the states transitioning to one another (e.g. Sunny -> Cloudy or Sunny -> Sunny). For the Action…
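Given transition and reward matrices like the ones this question describes, value iteration solves the MDP directly. A minimal sketch for a 3-state weather MDP; all the numbers below are made up for illustration, not taken from the question:

```python
import numpy as np

# Hypothetical 3-state weather MDP (Sunny, Cloudy, Rainy), 2 actions.
# P[a, s, s'] = probability of moving s -> s' under action a
# R[a, s]     = expected immediate reward for taking a in s
P = np.array([
    [[0.7, 0.2, 0.1],   # action 0
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]],
    [[0.5, 0.3, 0.2],   # action 1
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]],
])
R = np.array([
    [1.0, 0.0, -1.0],
    [0.5, 0.2, -0.5],
])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V <- max_a (R[a] + gamma * P[a] @ V) to convergence."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * P @ V            # Q[a, s], shape (A, S)
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=0)   # values and greedy policy
        V = V_new
```

The returned policy maps each state index to its greedy action; the loop is guaranteed to terminate because the backup is a $\gamma$-contraction.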

Sammy
2 votes, 1 answer
Dyna-Q Algorithm Reinforcement Learning
In step (f) of the Dyna-Q algorithm, we plan by taking random samples from the experience/model for some number of steps.
Wouldn't it be more efficient to construct an MDP from experience by computing the state transition probabilities and reward…
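For reference, the planning step the question refers to can be sketched as replaying random (state, action) pairs from a learned tabular model; the function and parameter names here are illustrative assumptions:

```python
import random
from collections import defaultdict

def dyna_q_planning(Q, model, n_planning, alpha=0.1, gamma=0.95):
    """Dyna-Q planning loop (step f): replay random (s, a) pairs.

    Q     : defaultdict(lambda: defaultdict(float)) of Q-values
    model : model[(s, a)] = (reward, next_state) from real experience
    """
    for _ in range(n_planning):
        s, a = random.choice(list(model.keys()))
        r, s_next = model[(s, a)]
        best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
        # Standard Q-learning backup on the simulated transition.
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q
```

Computing full transition probabilities instead, as the question proposes, trades these cheap sampled backups for expensive expected backups over all successor states.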

gnikol
1 vote, 0 answers
Fixed point of the Bellman operator for suboptimal policies
Consider an MDP and let the Bellman operator be defined as follows,
$$
(T^\pi_\gamma V)(s) = \sum_{a\in A}\pi(a \mid s)\big(r(s,a) + \gamma \sum_{s' \in S} p(s'\mid s,a) V(s')\big)
$$
where,
$\pi:S\to \Delta(A)$ is a policy, i.e., a function that maps…
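For a finite MDP, the operator defined above can be applied numerically; a minimal sketch in which all array shapes and names are assumptions made for illustration:

```python
import numpy as np

def bellman_policy_operator(V, pi, r, P, gamma):
    """Apply T^pi to V for a finite MDP.

    V  : (S,)      current value estimates
    pi : (S, A)    pi[s, a] = probability of action a in state s
    r  : (S, A)    r[s, a]  = immediate reward
    P  : (A, S, S) P[a, s, t] = transition probability s -> t under a
    """
    # continuation[s, a] = gamma * sum_t P[a, s, t] * V[t]
    continuation = gamma * np.einsum("ast,t->sa", P, V)
    # (T^pi V)(s) = sum_a pi(a|s) * (r(s, a) + continuation)
    return np.sum(pi * (r + continuation), axis=1)
```

Iterating this operator converges to the fixed point $V^\pi$ for $\gamma < 1$, since it is a $\gamma$-contraction in the sup norm.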

Erik M
1 vote, 0 answers
Bellman Optimality Operator fixed point
I'm reading Szepesvári's book on RL. My question is concerning the proof of Theorem A.10 (p. 71).
Theorem
Let $V$ be the fixed point of $T^*$ and assume that there is a policy $\pi$ which is greedy w.r.t. $V$: $T^\pi V = T^* V$. Then $V = V^*$ and $\pi$ is an…

Nick Halden
1 vote, 0 answers
Is random policy a stochastic policy?
I'm a student just starting to study RL.
When I studied MDPs and looked at the gridworld example, I had one question.
In the gridworld, we usually assume that we can take four actions in any state, e.g. up, down, left, right.
In this case, if we have a…
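On the terminology the question asks about: a uniform random policy is one special case of a stochastic policy $\pi(a \mid s)$, namely the one that weights every action equally. A minimal sketch (names are illustrative):

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def uniform_random_policy(state):
    """pi(a|s) = 1/|A| for every action, regardless of the state."""
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}

def sample_action(state, rng=random):
    # Sample an action according to the policy's distribution.
    probs = uniform_random_policy(state)
    return rng.choices(list(probs), weights=list(probs.values()))[0]
```

A general stochastic policy would let the dictionary of probabilities depend on `state`; the uniform policy simply ignores it.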

beef stew
1 vote, 0 answers
OpenAI Gym for the TSP problem?
In a previous question I asked about the use of OpenAI Gym as a vehicle for modeling business problems as MDPs. A comment suggested that I start a new question with a more refined scope. In general, I'm interested in RL for combinatorial optimization. As…

jbuddy_13
1 vote, 1 answer
What kind of model can optimize the allocation of a resource in the context of a time-to-event outcome?
I have a list of N patients competing for one treatment at each time point. A treatment becomes available at times t=1,...,T.
I want to build a model that can take the time-varying characteristics of all the patients at time t, when a…

Mery
1 vote, 0 answers
Optimal action-value as function of optimal value. Proof
Currently reading through Algorithms for Reinforcement Learning, I think these notes are good, but there are bits that are a bit unclear, and I have a few questions that I think are quite basic:
Definition of optimal value function…

user8469759
1 vote, 1 answer
Equivalent definitions of Markov Decision Process
I'm currently reading Sutton's Reinforcement Learning, where Chapter 3 defines the notion of an MDP.
It seems to me that the author is saying an MDP is completely defined by the probability
$p(s_{t+1},r_t | s_t,…

user8469759