
I have a simple game and want my agent to play it with the help of reinforcement learning. There is a board with a value in each cell. The goal is to go from the start cell to the finish cell with the highest score (the agent can move in 4 directions: up, down, left, right) within a given number of moves (the distance from start to finish with no extra steps).

[board image]

The issue is that my extracted policy doesn't give me the correct result (green = starting cell, red = finish cell):

[path image]

So I want to clarify all the parameters that I chose for my algorithm:

  1. State space size (number of cells on the board): 4 × 4 = 16
  2. Action space size (one for each direction): 4
  3. Transition probability to each available next state (uniform over the available neighbours): 1/4 = 0.25 for central cells; 1/3 ≈ 0.33 for border cells; 1/2 = 0.5 for corner cells
  4. Reward: the value of the cell entered, or -1 if the finish can no longer be reached from that cell (sketched in the code below)
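
To make points 3 and 4 concrete, get_available_states_from (used in the code below) returns (probability, next_state, reward) triples roughly like this simplified sketch. The BOARD values are placeholders, not the real board from the image, and the "-1 if the finish is unreachable" penalty is left out for brevity:

import numpy as np

N = 4                                            # the board is N x N
BOARD = np.array([                               # placeholder cell values
    [0, 1, 2, 1],
    [1, 3, 1, 0],
    [2, 1, 4, 1],
    [0, 2, 1, 5],
])

def get_available_states_from_sketch(s, a, game=None):
    # Returns (probability, next_state, reward) triples for the cells reachable
    # from state s. The probability is split equally among the available
    # neighbours (0.25 central, ~0.33 border, 0.5 corner), and the action a is
    # ignored here because point 3 makes the next state equally likely either way.
    row, col = divmod(s, N)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    neighbours = [(row + dr) * N + (col + dc)
                  for dr, dc in moves
                  if 0 <= row + dr < N and 0 <= col + dc < N]
    p = 1.0 / len(neighbours)
    return [(p, s_, BOARD[divmod(s_, N)]) for s_ in neighbours]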

But my value function does not converge (and it always should), so the issue is probably with the values I feed into it. Help me figure out what major mistake I have missed.

The code for the value-function calculation looks like this:

import numpy as np

# DIRECTIONS and get_available_states_from are defined elsewhere (see the full listing linked below).


def value_iteration(states_space_size, game):
    v = np.zeros(states_space_size)
    max_iterations = 1000
    eps = 1e-20
    last_dif = float('inf')

    for i in range(max_iterations):
        prev_v = np.copy(v)  # last value function
        for s in range(states_space_size):  # 16: size of the board
            q_sa = []
            for a in range(len(DIRECTIONS)):  # 4: up, down, left, right
                next_states_rewards = []
                for next_sr in get_available_states_from(s, a, game):
                    # (probability, next_state, reward) of the states you can go from (s,a)
                    p, s_, r = next_sr
                    # expected return of this transition: immediate reward
                    # plus the previous estimate of the next state's value
                    next_states_rewards.append(p * (r + prev_v[s_]))
                # store the sum of rewards for each pair (s,a)
                q_sa.append(np.sum(next_states_rewards))
            # choose the max reward of (s,a) pairs and put it on the actual value function for STATE s
            v[s] = max(q_sa)

        # check convergence
        if np.abs(np.abs(np.sum(prev_v - v)) - last_dif) < eps:
            print('Value-iteration converged at iteration %d' % (i+1))
            break
        last_dif = np.abs(np.sum(prev_v - v))
    return v

In case you want to refer to the whole listing, here is the link.
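
The extraction step afterwards follows the usual greedy one-step lookahead over the same transition triples; a simplified sketch (reusing DIRECTIONS, get_available_states_from, and game from above; the exact version is in the full listing) looks roughly like this:

def extract_policy_sketch(v, states_space_size, game, gamma=1.0):
    # Greedy one-step lookahead: for every state pick the direction with the
    # highest expected return under the value function v.
    policy = np.zeros(states_space_size, dtype=int)
    for s in range(states_space_size):
        q_sa = np.zeros(len(DIRECTIONS))
        for a in range(len(DIRECTIONS)):
            for p, s_, r in get_available_states_from(s, a, game):
                q_sa[a] += p * (r + gamma * v[s_])
        policy[s] = np.argmax(q_sa)              # index of the best direction in s
    return policy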

  • What is the "correct result"? According to the rewards that you have assigned, being in the states `(2, 0)`, `(2, 1)`, and `(1, 1)` is better than getting to the end state. So the policy will continue to stay in those states that have high reward. In order to get the desired result, try making each state (excluding the end state) have a non-positive reward. This way, the policy extracted from value iteration will not get stuck in an infinite loop. – ljeabmreosn Dec 06 '18 at 16:59
  • Do I need a Bellman equation to maximize the cumulative reward? Where should I apply it? Is it missing from the code above, or should it be done at the policy-extraction step? – Most Wanted Dec 11 '18 at 19:59
  • I don't think anything is missing from the code provided. The rewards that are shown don't make much sense. If some of the intermediate states have a higher reward than the desired "end" state, why wouldn't the agent with a maximal policy attempt to stay in the highest reward states? This is similar to trying to find a minimum cost path in a graph with a negative cost cycle. – ljeabmreosn Dec 11 '18 at 22:06
  • Makes perfect sense. What would you suggest doing? I've tried setting the reward at the final state to a very high value, but then the agent obviously won't find the best path to it and will only get to that point. – Most Wanted Dec 12 '18 at 13:40
  • Since all of the rewards are non-negative, you can convert all of the rewards `r` for each state `s` (excluding the "end" state) to `r'(s) = -d/(r(s)+c)` for positive constants `c` and `d`. For example, if `c = d = 1` then `r'(s) = -1/(r(s)+1)`. – ljeabmreosn Dec 12 '18 at 15:44
  • You're missing a learning rate and a discount factor in your Q-learning update rule. Both are necessary for convergence. – DaVinci Dec 10 '18 at 21:45
  • OK, the discount factor is equal to 1 in my example because the rewards stay the same at each step. Where should I apply my learning rate? Is it different from a factor of 1, and will it influence the resulting convergence? – Most Wanted Dec 11 '18 at 19:46
  • Could you elaborate on why using a discount factor of 1 and a learning rate will cause this problem? – Sycorax Dec 12 '18 at 21:18
  • This is being automatically flagged as low quality, probably because it is so short. At present it is more of a comment than an answer by our standards. Can you expand on it? You can also turn it into a comment. – gung - Reinstate Monica Dec 12 '18 at 21:31
  • In the general case, the value function with a discount factor of 1 is ill-defined. There may be infinitely long sequences that accrue infinite cumulative reward, hence there can be no guarantees for a discount factor of 1. It may still work, however. The learning rate, e.g. alpha in these update rules: https://en.wikipedia.org/wiki/Q-learning, is essential in the case where the transitions aren't deterministic. You want to average the future rewards you are getting, but with a learning rate of 1 you are always using the latest observation, which doesn't converge. – DaVinci Dec 13 '18 at 06:59
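
To illustrate the reward transformation ljeabmreosn suggests in the comments above, a minimal sketch with c = d = 1 (rewards holds placeholder non-negative cell values and end_state is a hypothetical index for the finish cell):

import numpy as np

rewards = np.array([0., 1., 2., 1.,      # placeholder non-negative cell values,
                    1., 3., 1., 0.,      # flattened row by row
                    2., 1., 4., 1.,
                    0., 2., 1., 5.])
end_state = 15                           # hypothetical index of the finish cell

shaped = -1.0 / (rewards + 1.0)          # r'(s) = -d / (r(s) + c) with c = d = 1
shaped[end_state] = rewards[end_state]   # the end state keeps its original reward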

1 Answer


OK, so I've slightly modified the initial example, and the code below gives me a working policy:

import random

import numpy as np

# DIRECTIONS, choose_an_action, perform_action and game are defined elsewhere (not shown).

states_space_size = 16  # 4x4 size of the board
actions_space_size = len(DIRECTIONS)
QSA = np.zeros(shape=(states_space_size, actions_space_size))
max_iterations = 80
gamma = 1  # discount factor
alpha = 0.9  # learning rate
eps = 0.99  # exploitation rate
s = 0  # initial state
for i in range(max_iterations):
    # explore the world?
    a = choose_an_action(actions_space_size)
    # or not?
    if random.random() > eps:
        # (TODO: decide on a criterion for decreasing epsilon)
        a = np.argmax(QSA[s])

    r, s_ = perform_action(s, a, game)
    qsa = QSA[s][a]
    qsa_ = np.argmax(QSA[s_])
    QSA[s][a] = qsa + alpha*(r + gamma*qsa_ - qsa)

    # change state
    s = s_

    # converge criterion instead of max iterations?
print(QSA)

I have introduced a learning-rate variable (how quickly to forget older results) and an exploration/exploitation rate (choose random actions vs. follow the existing policy), and the resulting policy seems to give the desired path.
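
The greedy policy can then be read off the learned Q-table and rolled out from the start cell, roughly like this sketch (start_state and end_state are placeholder indices for the green and red cells; perform_action and game are the same as above):

# Best action per state according to the learned Q-table.
policy = np.argmax(QSA, axis=1)

start_state, end_state = 0, 15                   # hypothetical green / red cell indices
s = start_state
path = [s]
while s != end_state and len(path) < states_space_size:
    _, s = perform_action(s, policy[s], game)    # perform_action returns (reward, next_state)
    path.append(s)
print(path)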

[correct path image]

  • SO is different from other online fora that you might have used in the past. Specifically, the answer box is reserved for answers, not additional questions or edits to your original question. If there are additional things that you would like to ask, you can ask a new question. Please see the [help] for more information. – Sycorax Dec 12 '18 at 21:14
  • @Sycorax thanks, I've updated the answer to contain only the working result. – Most Wanted Dec 12 '18 at 21:24
  • @MostWanted why is it `qsa_ = np.argmax(QSA[s_])` instead of `qsa_ = max(QSA[s_])`? – Dee Mar 16 '21 at 09:57