
I'm using Q-Learning to train an MDP-based form-filling dialogue manager. Right now it operates in a nearly toy setup with a total of 210 states (roughly corresponding to form-filling progress) and 6 available actions (e.g. greeting, asking for information, confirming information).

The learning is essentially trial-and-error against a simple user simulator that answers the agent's requests deterministically, with no errors. The reward function I handcrafted is -1 for each dialogue turn (when nothing else applies), +1 for form-filling progress (slot value filling/confirmation), +1000 for reaching the goal, and -1000 for greeting at the wrong time. I use the PyBrain RL framework for the implementation.
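To make the scheme concrete, here is a minimal sketch of that reward function; the state fields and action names are hypothetical placeholders, not my actual PyBrain task code:

```python
# Minimal sketch of the reward scheme described above (not the actual
# PyBrain task); state fields and action names are hypothetical placeholders.

GOAL_REWARD = 1000.0
WRONG_GREETING_PENALTY = -1000.0
PROGRESS_REWARD = 1.0
TURN_PENALTY = -1.0

def reward(prev_state, action, new_state):
    """Return the immediate reward for one dialogue turn."""
    if action == "greet" and prev_state.turn > 0:
        return WRONG_GREETING_PENALTY          # greeting at the wrong time
    if new_state.dialogue_finished and new_state.form_complete:
        return GOAL_REWARD                     # successful dialogue
    if (new_state.slots_filled > prev_state.slots_filled
            or new_state.slots_confirmed > prev_state.slots_confirmed):
        return PROGRESS_REWARD                 # slot filled or confirmed
    return TURN_PENALTY                        # plain per-turn cost
```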

The problem is that even after about 100,000 interaction episodes, Q-Learning doesn't seem to find the optimal policy. After learning, the agent acts somewhat suboptimally: it makes random -1 actions and avoids the major -1000 penalty, but it seemingly does nothing to pursue the +1000 reward at the end of a successful dialogue.

The perfect action-reward sequence for a 1-slot form would be:

greet -1 --> askSlotValue -1 --> fillSlot +1 --> askConfirmation -1 --> confirmSlot +1 --> quitDialogue +1000.
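The undiscounted return of that sequence is +999. As a sanity check on how much the discount factor shrinks the final +1000 over the six turns, here is a quick computation (the gamma values are just illustrative, not the settings I actually use):

```python
# Return of the ideal 1-slot sequence above for a few illustrative
# discount factors (assumed values, not the actual learner settings).
rewards = [-1, -1, +1, -1, +1, +1000]

for gamma in (1.0, 0.99, 0.9, 0.5):
    ret = sum(r * gamma ** t for t, r in enumerate(rewards))
    print(f"gamma={gamma}: return from the first turn = {ret:.1f}")
```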

So, what do I do with the learning setup in order to actually obtain the optimal policy in this case? Do I tweak the reward function somehow, or do I switch to learning from pre-scripted dialogues?

I believe that even a basic simulation-based trial-and-error approach should be able to handle a problem of this size. Is that reasonable?

Igor Shalyminov

1 Answer


Exploration might be the issue. Are you sure the algorithm tries all legal actions during training? Setting a very high initial estimate for all the Q-values will encourage exploration at the start of training. You could also try "soft selection", where some of the time you randomly select an action other than the one with the highest Q-value.
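For example, here is a sketch of both ideas on a plain tabular Q-learner. The table sizes match the 210-state/6-action setup from the question, but the optimistic initial value and epsilon are arbitrary choices, and this is independent of whatever action selection PyBrain provides:

```python
import random

N_STATES, N_ACTIONS = 210, 6

# Optimistic initialization: start every Q-value well above any reachable
# return so that untried actions look attractive and get explored.
OPTIMISTIC_INIT = 2000.0
Q = [[OPTIMISTIC_INIT] * N_ACTIONS for _ in range(N_STATES)]

def select_action(state, epsilon=0.1):
    """Epsilon-greedy ('soft') selection: usually pick the action with the
    highest Q-value, but with probability epsilon pick a random one."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    q_row = Q[state]
    return max(range(N_ACTIONS), key=lambda a: q_row[a])
```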

nsweeney