I have a simple game and want my agent to play it with the help of reinforcement learning. We have a board with a value in each cell. The goal is to go from the start cell to the finish cell with the highest score; the agent can move in 4 directions (up, down, left, right) within a given number of moves (the distance from start to finish with no extra steps).
The issue is that my extracted policy doesn't give me the correct result (green = starting cell; red = finish cell).
So I want to clarify all the parameters I chose for my algorithm:
- State space size (number of cells on the board): 4x4 = 16
- Action space size (one for each direction): 4
- Probability of the next state (equal for each available next state): 1/4 = 0.25 for central cells; 1/3 ≈ 0.33 for border cells; 1/2 = 0.5 for corner cells (see the sketch after this list)
- Reward: the value of the cell, or -1 if the finish can no longer be reached from that point
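For reference, here is a minimal sketch of how those probabilities follow from counting the legal moves in each cell. The row-major indexing of the 4x4 board is an assumption of this sketch, not necessarily how my actual helper works:

BOARD_SIZE = 4  # 4x4 board, cells indexed 0..15 row-major (assumption of this sketch)

def transition_probability(cell):
    # 1 / (number of legal moves) from the given cell index
    row, col = divmod(cell, BOARD_SIZE)
    moves = 0
    if row > 0:                # can move up
        moves += 1
    if row < BOARD_SIZE - 1:   # can move down
        moves += 1
    if col > 0:                # can move left
        moves += 1
    if col < BOARD_SIZE - 1:   # can move right
        moves += 1
    return 1.0 / moves

# corner cell 0 -> 0.5, border cell 1 -> 0.333..., central cell 5 -> 0.25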
But my value function does not converge (and value iteration is supposed to always converge), so the issue is probably with the values I feed into it. Help me figure out what major mistake I missed.
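By "does not converge" I mean the values keep changing between sweeps. A minimal sketch of the kind of stopping rule I expect to eventually trigger (the textbook max-norm check; my code below uses a slightly different test):

import numpy as np

def has_converged(v, prev_v, eps=1e-20):
    # largest per-state change between two sweeps falls below the threshold
    return np.max(np.abs(v - prev_v)) < eps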
The code for the value function calculation looks like this:
import numpy as np

def value_iteration(states_space_size, game):
    v = np.zeros(states_space_size)
    max_iterations = 1000
    eps = 1e-20
    last_dif = float('inf')
    for i in range(max_iterations):
        prev_v = np.copy(v)  # value function from the previous sweep
        for s in range(states_space_size):  # 16: size of the board
            q_sa = []
            for a in range(len(DIRECTIONS)):  # 4: up, down, left, right
                next_states_rewards = []
                for next_sr in get_available_states_from(s, a, game):
                    # (probability, next_state, reward) of the states you can reach from (s, a)
                    p, s_, r = next_sr
                    # expected return of the one-step-ahead state
                    next_states_rewards.append(p * (r + prev_v[s_]))
                # store the sum of expected returns for each pair (s, a)
                q_sa.append(np.sum(next_states_rewards))
            # choose the max over actions and use it as the new value for state s
            v[s] = max(q_sa)
        # check convergence
        if np.abs(np.abs(np.sum(prev_v - v)) - last_dif) < eps:
            print('Value-iteration converged at iteration %d' % (i + 1))
            break
        last_dif = np.abs(np.sum(prev_v - v))
    return v
In case you want to refer to the whole listing, here is the link.
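To make the snippet above self-contained, here is roughly what the helpers it calls are assumed to look like. These are hypothetical stand-ins with the same shapes and return types, not my exact implementation (that one is in the linked listing), and the -1 reward for cells from which the finish is unreachable is omitted to keep the sketch short:

BOARD_SIZE = 4
DIRECTIONS = ['up', 'down', 'left', 'right']
OFFSETS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def get_available_states_from(s, a, game):
    # Returns a list of (probability, next_state, reward) tuples for (s, a).
    # 'game' is assumed to be indexable by cell: game[s_] is that cell's value.
    row, col = divmod(s, BOARD_SIZE)
    dr, dc = OFFSETS[DIRECTIONS[a]]
    nr, nc = row + dr, col + dc
    if not (0 <= nr < BOARD_SIZE and 0 <= nc < BOARD_SIZE):
        return []  # the move would leave the board: no next state
    s_ = nr * BOARD_SIZE + nc
    legal_moves = sum(
        1 for r, c in OFFSETS.values()
        if 0 <= row + r < BOARD_SIZE and 0 <= col + c < BOARD_SIZE
    )
    return [(1.0 / legal_moves, s_, game[s_])]

# A call would then look like:
# game = {cell: value for cell, value in enumerate(board_values)}
# v = value_iteration(16, game)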