Questions tagged [q-learning]

A popular reinforcement learning algorithm; an instance of TD (temporal difference) learning.

165 questions
24 votes, 4 answers
Why does Q-Learning use epsilon-greedy during testing?
In DeepMind's paper on Deep Q-Learning for Atari video games (here), they use an epsilon-greedy method for exploration during training. This means that when an action is selected in training, it is either chosen as the action with the highest…
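As a rough illustration of the selection rule the question refers to, here is a minimal epsilon-greedy sketch in Python; the Q-values and the two epsilon values below are placeholders, not numbers taken from the paper (which does keep a small nonzero epsilon at evaluation time).

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

q = np.array([0.1, 0.5, 0.2, 0.0])          # placeholder Q-values for one state
a_train = epsilon_greedy(q, epsilon=0.1)    # larger epsilon during training (exploration)
a_test = epsilon_greedy(q, epsilon=0.05)    # small but nonzero epsilon kept at test time
```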

Karnivaurus (5,909)
21 votes, 1 answer
What is the difference between an episode and an epoch in deep Q-learning?
I am trying to understand the famous paper "Playing Atari with Deep Reinforcement Learning" (pdf). I am unclear about the difference between an epoch and an episode. In Algorithm $1$, the outer loop is over episodes, while in Figure $2$ the x-axis is…
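As a sketch of the distinction: an episode is delimited by the environment terminating, whereas an epoch is typically a fixed budget of training steps used for periodic reporting, independent of episode boundaries. The toy termination rule and the reporting interval below are assumptions for illustration only.

```python
import random

random.seed(0)
STEPS_PER_EPOCH = 1_000        # hypothetical reporting interval, not the paper's value

total_steps = 0
for episode in range(200):                 # outer loop: one iteration per episode
    done = False
    while not done:                        # inner loop: one time step of the environment
        total_steps += 1
        # ... epsilon-greedy action, env.step(), replay-buffer sampling would go here ...
        done = random.random() < 0.01      # toy termination so the sketch actually runs
        if total_steps % STEPS_PER_EPOCH == 0:
            print(f"epoch {total_steps // STEPS_PER_EPOCH} completed during episode {episode}")
```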

A.D (2,114)
19 votes, 2 answers
How exactly to compute the Deep Q-Learning loss function?
I have a doubt about how exactly the loss function of a Deep Q-Learning network is computed during training. I am using a 2-layer feedforward network with a linear output layer and ReLU hidden layers.
Let's suppose I have 4 possible actions. Thus, the output of…
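A minimal sketch of how such a loss is commonly computed for a batch of transitions, assuming PyTorch and a 4-action network with ReLU hidden layers and a linear output, as described above; all sizes and the random data are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical network matching the question's setup: ReLU hidden layers,
# a linear output layer, and one output unit per action.
class QNet(nn.Module):
    def __init__(self, n_inputs, n_hidden, n_actions):
        super().__init__()
        self.h1 = nn.Linear(n_inputs, n_hidden)
        self.h2 = nn.Linear(n_hidden, n_hidden)
        self.out = nn.Linear(n_hidden, n_actions)

    def forward(self, x):
        x = torch.relu(self.h1(x))
        x = torch.relu(self.h2(x))
        return self.out(x)                      # shape: (batch, n_actions)

q_net = QNet(n_inputs=8, n_hidden=64, n_actions=4)

# One minibatch of (s, a, r, s', done) transitions; random placeholders stand in
# for samples drawn from a replay buffer.
batch = 32
s = torch.randn(batch, 8)
a = torch.randint(0, 4, (batch, 1))
r = torch.randn(batch)
s_next = torch.randn(batch, 8)
done = torch.zeros(batch)

gamma = 0.99
with torch.no_grad():
    # TD target: r + gamma * max_a' Q(s', a'), with no bootstrap on terminal states.
    target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values

# Only the Q-value of the action that was actually taken enters the loss.
q_taken = q_net(s).gather(1, a).squeeze(1)
loss = nn.functional.mse_loss(q_taken, target)
loss.backward()
```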

A.D (2,114)
16 votes, 2 answers
Why was the letter Q chosen in Q-learning?
Why was the letter Q chosen in the name of Q-learning?
Most letters are chosen as abbreviations, such as $\pi$ standing for policy and $v$ standing for value. But I don't think Q is an abbreviation of any word.

draw (261)
14 votes, 4 answers
Why don't we use importance sampling for one step Q-learning?
Why don't we use importance sampling for 1-step Q-learning?
Q-learning is off-policy, which means that we generate samples with a different policy than the one we try to optimize. Thus it should be impossible to estimate the expectation of the return for…
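One way the standard answer is usually phrased: the one-step Q-learning target is
$$ y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'), $$
and the only sampled quantities in it, $r_{t+1}$ and $s_{t+1}$, come from the environment given $(s_t, a_t)$, so their distribution does not depend on which policy generated the data. The next action is never sampled from the behaviour policy (the $\max$ is computed directly), so there is no distribution mismatch for an importance ratio $\pi(a'|s')/\mu(a'|s')$ to correct; such ratios only become necessary for multi-step or full-trajectory off-policy targets.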

siva (451)
12 votes, 2 answers
Is planning in Dyna-Q a form of experience replay?
In Richard Sutton's book on RL (2nd edition), he presents the Dyna-Q algorithm, which combines planning and learning.
In the planning part of the algorithm, the Dyna agent randomly samples $n$ state-action pairs $(s, a)$ previously seen by the agent,…
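For context, a compact version of the planning loop in question, assuming the deterministic tabular model used in the book; the variable names, the number of planning steps, and the single fake transition are illustrative only.

```python
import random

model = {}                 # (s, a) -> (r, s'), filled during real experience
Q = {}                     # (s, a) -> value
alpha, gamma, n = 0.1, 0.95, 10
actions = [0, 1, 2, 3]

def q(s, a):
    return Q.get((s, a), 0.0)

def planning_updates():
    """Replay n previously seen (s, a) pairs using the learned model."""
    if not model:
        return
    for _ in range(n):
        s, a = random.choice(list(model))          # uniformly over observed pairs
        r, s_next = model[(s, a)]                  # the model "simulates" the outcome
        target = r + gamma * max(q(s_next, b) for b in actions)
        Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))

# During real experience one would store model[(s, a)] = (r, s_next), do the direct
# Q-update on the real transition, and then call planning_updates().
model[(0, 1)] = (1.0, 2)   # a fake observed transition so the sketch runs
planning_updates()
print(Q)
```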

Julep (485)
12 votes, 5 answers
Epsilon-greedy policy improvement?
I am learning reinforcement learning from David Silver's open course and Richard Sutton's book. While I enjoy the course and the book very much, I am currently confused about $\epsilon$-greedy policy improvement.
Both the book and the open course have a…
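For reference, the result that section of the book establishes is roughly the following (notation as in Sutton and Barto). The $\epsilon$-greedy policy with respect to $q_\pi$ is
$$
\pi'(a \mid s) =
\begin{cases}
1 - \epsilon + \dfrac{\epsilon}{|\mathcal{A}(s)|} & \text{if } a = \arg\max_{a'} q_\pi(s, a'), \\
\dfrac{\epsilon}{|\mathcal{A}(s)|} & \text{otherwise,}
\end{cases}
$$
and the policy-improvement argument shows that, provided the original policy $\pi$ is itself $\epsilon$-soft,
$$
\sum_a \pi'(a \mid s)\, q_\pi(s, a) \;\ge\; v_\pi(s) \quad \text{for every state } s,
$$
so the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is at least as good as $\pi$.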

Mou (638)
11 votes, 2 answers
Reinforcement learning in a non-stationary environment
Q1: Are there common or accepted methods for dealing with non-stationary environments in reinforcement learning in general?
Q2: In my gridworld, I have the reward function changing when a state is visited. Every episode the rewards reset to the…
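As an aside on Q1, one commonly cited tool for non-stationary problems (discussed in Sutton and Barto's bandit chapter) is a constant step size, so recent experience outweighs old experience. The sketch below is generic and makes no claim about the asker's gridworld or reward schedule.

```python
# Exponential recency-weighted averaging with a constant step size.
def running_estimate(old_value, new_sample, alpha=0.1):
    """Constant-alpha update: recent samples are weighted more than old ones."""
    return old_value + alpha * (new_sample - old_value)

v = 0.0
for sample in [1, 1, 1, 5, 5, 5]:   # the underlying signal shifts halfway through
    v = running_estimate(v, sample)
print(v)                             # the estimate is pulled toward the recent samples
```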

Voltronika (213)
10 votes, 2 answers
Overview of Reinforcement Learning Algorithms
I'm currently searching for an overview of reinforcement learning algorithms and maybe a classification of them. But besides SARSA, Q-learning, and Deep Q-learning, I can't really find any popular algorithms.
Wikipedia gives me an overview of…

greece57 (201)
10 votes, 1 answer
How efficient is Q-learning with Neural Networks when there is one output unit per action?
Background:
I am using neural network Q-value approximation in my reinforcement learning task. The approach is exactly the same as the one described in this question; however, the question itself is different.
In this approach, the number of outputs is…
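As a sketch of the two architectures being compared, in plain NumPy with arbitrary layer sizes (everything here is illustrative, not the asker's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n_state, n_hidden, n_actions = 8, 32, 4

# Architecture A (as in the question): the state goes in, one output unit per action.
W1 = rng.normal(size=(n_state, n_hidden))
W2 = rng.normal(size=(n_hidden, n_actions))

def q_all_actions(s):
    h = np.maximum(0.0, s @ W1)              # ReLU hidden layer
    return h @ W2                            # all Q(s, a) from a single forward pass

# Architecture B (the alternative): (state, one-hot action) in, a single Q-value out.
V1 = rng.normal(size=(n_state + n_actions, n_hidden))
V2 = rng.normal(size=(n_hidden, 1))

def q_single(s, a):
    x = np.concatenate([s, np.eye(n_actions)[a]])
    h = np.maximum(0.0, x @ V1)
    return (h @ V2).item()

s = rng.normal(size=n_state)
print(q_all_actions(s))                            # one pass yields 4 values
print([q_single(s, a) for a in range(n_actions)])  # needs |A| = 4 separate passes
```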

Serhiy (959)
9 votes, 1 answer
Is Deep Q-learning inherently unstable?
I'm reading Barto and Sutton's Reinforcement Learning and in it (chapter 11) they present the "deadly triad":
Function approximation
Bootstrapping
Off-policy training
And they state that an algorithm which uses all 3 of these is unstable and…

enumaris (1,075)
8 votes, 1 answer
Proof of Convergence for SARSA/Q-Learning Algorithm
I would like to ask if someone can refer me to the paper containing the proof of convergence of $Q$-learning/SARSA (either/both), one of the learning algorithms in reinforcement learning.
The iterative algorithm for SARSA is as follows:
$$ Q(s_t,…
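The excerpt above is cut off; for reference, the standard one-step update rules whose convergence is usually cited (commonly attributed to Watkins and Dayan (1992) for Q-learning and Singh et al. (2000) for SARSA) are:
$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \quad \text{(SARSA)} $$
$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad \text{(Q-learning)} $$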

cgo (7,445)
8 votes, 3 answers
Why is there no transition probability in Q-learning (reinforcement learning)?
In reinforcement learning, our goal is to optimize the state-value function or the action-value function, which are defined as follows:
$V^{\pi}(s) = \sum p(s'|s,\pi(s))[r(s'|s,\pi(s))+\gamma V^{\pi}(s')]=E_{\pi}[r(s'|s,a)+\gamma…
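The point usually made in answers is that the Q-learning update is sample-based: the expectation over $s'$ is replaced by the single next state the environment actually produced, so $p(s'|s,a)$ never has to be known by the learner. A toy sketch (the environment and reward rule below are entirely made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(s, a):
    """Made-up environment: it samples the next state internally and returns a reward."""
    s_next = int(rng.integers(n_states))       # p(s'|s,a) lives inside the environment
    reward = 1.0 if (s_next == 1 and a == 1) else 0.0
    return s_next, reward

s = 0
for t in range(5_000):
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = step(s, a)
    # Sample-based update: the expectation over s' is replaced by this single sample,
    # so the transition probabilities never appear in the learner's code.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q)
```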

whatsname (113)
7 votes, 1 answer
MDP and State Value Finding?
I have a complex MDP (I think) as follows. Can anyone describe simply how the value $V^*(A)$ for state $A$ is found?
First update: for this solved question I really need a canonical answer, a step-by-step solution if any, for learning purposes.
Second…

Maryam Panahi (29)
7 votes, 1 answer
Q-learning: when to stop training?
I'm using Q-learning for my side project. After a few million episodes, I found that the cumulative reward seems to stabilize. I'm wondering if there is a scientific way to determine when to stop training, rather than just observing the cumulative rewards.
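There is no single standard criterion, but one informal heuristic (an assumption here, not something from the question) is to snapshot the value estimates every so many episodes and stop once the largest change stays below a tolerance, in addition to watching the reward curve:

```python
import numpy as np

def converged(q_before, q_after, tol=1e-4):
    """True if no Q-table entry moved by more than tol over the last window of episodes."""
    return float(np.max(np.abs(q_after - q_before))) < tol

snapshot = np.zeros((4, 2))
current = snapshot + 1e-5            # pretend the table barely changed over the window
print(converged(snapshot, current))  # True -> a candidate point at which to stop
```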

user2131907 (173)