Deep Reinforcement Learning in Dialogue Systems

Maluuba envisions a world where machines can think, reason, and communicate with humans. While 2016 has seen much discussion of the transformative impact of chatbots, the reality is that most chatbot experiences are still very limited, often relying on menu systems and scripts.

In this post, we describe our work on a technology that is essential to building better conversational agents: Reinforcement Learning (RL).

The aspect of conversation that we focus on here is the conversation flow. In the context of conversational agents, this is called dialogue management: choosing, at each step of the conversation, what to say next to the user.

Current chatbots mostly have scripted dialogue management, i.e., they follow hand-written rules to decide what to say to the user. Scripting keeps the dialogue flow simple and the dialogues short. However, it has several drawbacks:

  • Information search is not optimized: if you are looking for an Italian restaurant and the system has found 100 places, it will simply return the list and leave you to go through it. Good dialogue management would instead ask you to refine your criteria so that it can propose only a few relevant restaurants.
  • The conversational agent does not adapt: it talks to every user in exactly the same way. Good dialogue management should adapt to the individual user.
  • From a developer’s point of view, the approach is not transferable: every time you add a new domain (e.g., cafes in addition to restaurants), you need to write new scripts for that domain.
  • The conversational agent is not robust: scripted conversational agents are not good at recovering from misunderstandings or handling unexpected cases.

To avoid these drawbacks and design conversational agents that adapt and learn to optimize their behaviour, we use RL. Before going into the details of what RL is, we explain how conversational agents are designed and where RL fits in.

Conversational Agents

The diagram below shows the traditional pipeline of a text-based, goal-oriented conversational agent:

Natural Language Understanding (NLU) takes the user utterance as input and extracts the user’s intents from it. These intents are expressed as dialogue acts. For instance, if NLU gets as input ‘I am looking for a plane ticket from San Francisco to Montreal.’, it might represent the user’s intents with the following dialogue acts: inform_origin_city and inform_destination_city. NLU also detects the named entities, here San Francisco and Montreal.
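For concreteness, here is a minimal Python sketch of what such an NLU output might look like. The dictionary layout and the stub function are illustrative inventions, not an actual NLU implementation, which would rely on a trained model rather than string matching:

```python
# Toy sketch of an NLU output: dialogue acts paired with the named
# entities (slot values) extracted from the utterance.
def parse_utterance_stub(utterance):
    """Return hand-written dialogue acts for the example sentence."""
    acts = []
    if "San Francisco" in utterance:
        acts.append({"act": "inform_origin_city", "value": "San Francisco"})
    if "Montreal" in utterance:
        acts.append({"act": "inform_destination_city", "value": "Montreal"})
    return acts

acts = parse_utterance_stub(
    "I am looking for a plane ticket from San Francisco to Montreal.")
```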

The state tracker is next in the pipeline. Its role is to keep track of the user’s goal throughout the dialogue, and it outputs a state representing this goal. This state is often a probability distribution over the possible values of the user’s goal. Based on the user utterance above, it may output, for instance, that the origin city is San Francisco with a probability of 0.9 and that the destination city is Montreal with a probability of 0.85. Based on this state, the dialogue manager decides what to say next to the user. This decision is also made at the intent level, meaning that the dialogue manager outputs dialogue acts.
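As a rough illustration, such a state can be represented as per-slot probability distributions. The slot names and the numbers below are illustrative, not the output of a real tracker:

```python
# Sketch of a dialogue state: for each slot of the user's goal, a
# probability distribution over candidate values.
state = {
    "origin_city":      {"San Francisco": 0.9, "Oakland": 0.1},
    "destination_city": {"Montreal": 0.85, "Toronto": 0.15},
}

def most_likely(state, slot):
    """Return the highest-probability value for a slot."""
    return max(state[slot], key=state[slot].get)
```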

In the diagram above, the dialogue manager decides to ask the user to refine her query by asking whether she has a budget for the trip. This decision is then rendered as a sentence by the response generation module, which outputs in this case ‘Sure, what is your budget for this trip?’.

Reinforcement Learning (RL)

In the RL setting, an agent is placed in an environment that it often doesn’t know about. This agent can act in the environment and it receives rewards based on its actions. The agent’s goal is to find the actions that maximise its cumulative reward over time. We apply this setting to dialogue management. In our case, the unknown environment is the users’ behaviour, the actions are the dialogue acts that the system can perform, and the rewards may be based on task completion or user satisfaction.
The RL setting fits dialogue quite well because RL is designed for situations where feedback may be delayed. When a conversational agent carries on a dialogue with a user, it will often only know at the end whether the dialogue was successful and the task was achieved.
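This delayed-reward structure can be sketched with the standard notion of a cumulative discounted return. The per-turn penalty and success bonus below are illustrative choices, not the paper's actual reward function:

```python
# Delayed reward in a dialogue: small per-turn costs (to encourage short
# dialogues) and a large signal only at the end, once success is known.
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward that the agent tries to maximise."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A five-turn dialogue: -1 per turn, +20 on success at the last turn.
rewards = [-1, -1, -1, -1, 20]
total = discounted_return(rewards, gamma=1.0)
```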

We have used a combination of RL and deep learning to train the dialogue manager of an agent that finds restaurants. We describe this work in the next section and explain the advantages of our approach. While we used these techniques in the restaurant domain, they can be applied to any task-oriented dialogue domain.

Applying Deep Reinforcement Learning to Dialogue Management

We have published an article that aims to improve the learning behaviour of a dialogue manager by applying deep RL. In this paper, an actor-critic algorithm based on two neural networks is shown to outperform a previous state-of-the-art RL algorithm based on Gaussian processes.

The Model

Our dialogue manager is based on an actor-critic architecture. We have two networks: one is the critic and the other is the actor. The critic’s role is to estimate the quality of each action in any given state. The actor’s role is to use this information to choose the next action to perform.
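To make the two roles concrete, here is a heavily simplified sketch with linear models standing in for the paper's neural networks; the feature and action counts are arbitrary:

```python
import math

N_FEATURES, N_ACTIONS = 4, 3  # arbitrary sizes for illustration

# Actor: scores each action for a state; softmax turns scores into a policy.
actor_w = [[0.0] * N_FEATURES for _ in range(N_ACTIONS)]
# Critic: estimates how good a state is (its expected return).
critic_w = [0.0] * N_FEATURES

def policy(state):
    """Probability of each action in the given state (softmax over scores)."""
    scores = [sum(w * x for w, x in zip(row, state)) for row in actor_w]
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def value(state):
    """Critic's estimate of the state's value."""
    return sum(w * x for w, x in zip(critic_w, state))
```

In training, the critic's value estimates are used to tell the actor which of its choices to reinforce.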

In simulations, we showed that our model learns an optimal behaviour faster than previous state-of-the-art methods based on Gaussian processes. This architecture also enabled us to leverage another type of learning, supervised learning, and to adopt a two-stage training approach.

Two-stage Training

A crucial benefit of this architecture is that the dialogue manager can be trained in two steps: first, it learns with supervised learning to emulate human decision making from recorded dialogues; then, it continues learning with reinforcement learning, carrying out conversations itself and learning from its own mistakes.

The first stage is done by training the actor to imitate human behaviour in recorded dialogues. From these dialogues, the actor learns to pick the same actions that a human would take. At the end of this phase, the dialogue manager is already capable of having some successful goal-oriented conversations with users. However, it still needs to learn to adapt to unseen situations; for this, we switch to reinforcement learning and let the actor and critic make their own decisions, learning from feedback whether those decisions were good or not. This can be done by having the dialogue manager hold dialogues with either a user simulator or real users.
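The two stages can be caricatured in a few lines. Here the "policy" is just a table of action preferences, the state and act names are invented, and the reinforcement step is far cruder than an actual actor-critic update:

```python
from collections import Counter, defaultdict

prefs = defaultdict(Counter)  # per-state action preferences

# Stage 1: supervised imitation -- count which action humans took per state.
human_dialogues = [
    ("no_constraints", "request_cuisine"),
    ("no_constraints", "request_cuisine"),
    ("cuisine_known", "request_budget"),
]
for state, human_action in human_dialogues:
    prefs[state][human_action] += 1

def act(state):
    """Greedy action from the current preference table."""
    return prefs[state].most_common(1)[0][0]

# Stage 2: reinforcement -- the agent's own choices are strengthened or
# weakened according to the reward observed at the end of a dialogue.
def rl_update(state, action, reward):
    prefs[state][action] += reward

rl_update("cuisine_known", "offer_restaurant", 5)  # a successful dialogue
```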


This two-stage training is crucial when developing a conversational agent. Indeed, when designing a dialogue system, it is standard practice to collect dialogues to train natural language understanding and state tracking. With the actor-critic architecture, we can leverage this data even further by partially training the dialogue manager on it as well.

Summary States and Actions

Another great advantage of our model is that there is no need to handcraft a summary state space and action space in order to train it. In comparison, previous algorithms summarized the state and action spaces for tractability.

The state space is composed of all the states that the state tracker may return. Since the state tracker returns probability distributions, this space is continuous and therefore infinite. The action space is composed of all the dialogue acts that the dialogue manager can perform.

Previous approaches simplified the state space by mapping the probability distributions returned by the state tracker to a binary vector. This mapping was performed through a heuristic. In contrast, our model does not require this handcrafted mapping and can be trained directly on the output of the state tracker.
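The difference can be sketched as follows, using invented slot names and a hypothetical confidence threshold for the heuristic summary:

```python
# Full tracker output: per-slot probability distributions.
state = {
    "origin_city":      {"San Francisco": 0.9, "Oakland": 0.1},
    "destination_city": {"Montreal": 0.85, "Toronto": 0.15},
}

def summary_state(state, threshold=0.8):
    """Handcrafted binary summary (previous work): 1 per confident slot."""
    return [int(max(dist.values()) >= threshold) for dist in state.values()]

def full_state(state):
    """Flat vector of all probabilities -- what our model consumes directly."""
    return [p for dist in state.values() for p in dist.values()]
```

The binary summary discards the actual confidence levels, while the full vector keeps them.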

Previous work also simplified the action space by outputting a dialogue act without considering the slot type. For instance, if the dialogue manager needed to request more information, it would only output the act request but not the slot type; the slot type was added based on a heuristic. In contrast, our dialogue manager directly outputs dialogue act-slot type combinations, e.g., request_budget. This also results in less handcrafting because there is no need for a heuristic that maps the dialogue act to a slot type a posteriori.
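Concretely, such an action space can be enumerated as act-slot combinations up front; the act and slot names here are illustrative:

```python
# Dialogue act / slot type combinations form the action space directly,
# so no heuristic is needed afterwards to attach a slot to an act.
acts_with_slot = ["request", "confirm"]
slots = ["budget", "cuisine", "area"]
actions = [f"{act}_{slot}" for act in acts_with_slot for slot in slots]
actions += ["offer", "bye"]  # acts that take no slot type
```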

To conclude, in this paper we proposed a dialogue manager that learns faster, leverages more data, and requires less handcrafting.

Read the paper referenced in this blog post

Policy Networks with Two-Stage Training for Dialogue Systems

Paul Gray