Maluuba releases two new datasets for Natural Language Understanding research


Maluuba’s latest contribution to the AI research community

In the era of deep learning and artificial intelligence, smart machines are only as good as the data you train them with. Consider humans: we are a product of the signals, information and stimuli that we take in every day. The quality of our environment heavily impacts our outcomes, from our intelligence to our ability to make decisions.

At Maluuba, we’re working toward a future where humans and machines interact naturally and intuitively. This requires machines that comprehend and communicate as people do, which in turn demands training data that reflects those abilities. Many recent AI breakthroughs have come about through the use of publicly available datasets; however, datasets in the field of language understanding have so far been limited.

Today, Maluuba introduces two sophisticated new datasets to advance research in natural language understanding. Created by a team of humans, rather than synthetically, these datasets explore fundamental aspects of human capabilities including information-seeking, exploration, curiosity, decision-making and memory.

A new machine reading comprehension dataset

Maluuba’s NewsQA was developed to train algorithms capable of answering complex questions: questions that require human-level comprehension and reasoning skills. Leveraging CNN articles from the DeepMind Q&A Dataset, we prepared a crowd-sourced machine reading corpus of over 100,000 question-answer pairs.

Our collection methodology was designed to solicit questions that require seeking out information, reasoning, and recognizing when information is incomplete. Questions were provided by human crowd-workers who only read the headline and some summary points of each article. This encouraged curiosity-based questions about what more the article described, and prevented queries that were leading or so specific that they gave the answer away.


Consequently, NewsQA contains many nuanced questions. A significant proportion of questions cannot be solved without reasoning. Some answers can only be inferred by synthesizing incomplete information or by recognizing conceptual overlaps. Paraphrase recognition may require synonymy and world knowledge. Additionally, some questions have no answer or no unique answer in the corresponding article, so systems must learn to recognize when information is insufficient.

A goal-oriented dialogue dataset

Voice assistants such as Siri, Cortana, and Google Now have become popular spoken dialogue systems. More recently, we have seen a rise in text-based conversational agents (also known as chatbots). Many users prefer text to voice for privacy reasons and to avoid poor speech recognition in noisy environments. These agents are also welcome as an alternative to downloading and installing applications. This makes a lot of sense when completing simple tasks such as ordering a cab or asking for the weather.

The problem is that, in most cases, these chatbots, much like voice assistants, support only very simple and sequential interactions. This is sufficient in cases where the user's goal is well-defined and the dialogue flow can be easily hand-crafted. However, there are other use cases, like customer service and travel booking, that require a multistage, potentially cyclic decision-making process.

Most dialogue systems implement goal-oriented conversations as a sequential, slot-filling process. Each dialogue state is either augmented with new information (left) or overwritten (right).

Solving frame tracking would enable dialogue systems to memorize all the information provided by the user and allow comparisons between items.
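To make the contrast concrete, here is a minimal sketch of the two approaches. It is an illustration under simplifying assumptions, not the formal frame tracking task or any Maluuba implementation: the class names, the slot names and the "open a new frame on conflict" rule are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SlotFillingState:
    """Single dialogue state: new values augment or overwrite old ones."""
    slots: dict = field(default_factory=dict)

    def update(self, **new_slots):
        # A slot seen for the first time augments the state; a slot that
        # already exists is overwritten and its previous value is lost.
        self.slots.update(new_slots)


@dataclass
class FrameTracker:
    """Hypothetical tracker that remembers every set of slot values."""
    frames: list = field(default_factory=list)

    def update(self, **new_slots):
        current = self.frames[-1] if self.frames else {}
        # Assumed rule: a value that conflicts with the current frame opens
        # a new frame instead of overwriting, so the earlier option survives.
        conflict = any(
            key in current and current[key] != value
            for key, value in new_slots.items()
        )
        if conflict or not self.frames:
            self.frames.append({**current, **new_slots})
        else:
            current.update(new_slots)

    def compare(self, slot):
        # Value of one slot across all remembered frames.
        return [frame.get(slot) for frame in self.frames]


state = SlotFillingState()
tracker = FrameTracker()
for turn in ({"destination": "Paris"}, {"budget": 2000}, {"destination": "Rome"}):
    state.update(**turn)
    tracker.update(**turn)

print(state.slots)                     # only the latest destination remains
print(tracker.compare("destination"))  # both destinations can be compared
```

After the third turn, the slot-filling state has forgotten the first destination, while the tracker still holds both options and can answer a comparison question about them.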

With that in mind, we developed the Frames dataset to model decision-making in complex conversational settings; in our case, booking a vacation package that includes flights and a hotel. Beyond simply returning results from a database, we believe the next generation of conversational agents should help users to explore databases, compare items, and decide between similar options. To that end, the Frames dataset is a crowd-sourced set of dialogues in which pairs of human users simulate the interaction between a vacation seeker and a travel agent. These conversations are complex and free-flowing, as opposed to the single chain of requests and responses that current voice assistants support, though still directed toward a useful final goal.


Our hope is that these sophisticated new datasets will push forward the field of AI and natural language understanding, so that collectively we can reach a world where machines communicate intuitively with humans.

Access Maluuba Datasets



Read the papers referenced in this post

NewsQA: A Machine Comprehension Dataset


Research
Paul Gray