Introduction to Question Answering and Machine Reading

Question Answering has always been a challenge in natural language processing. There has been a large body of work on question answering based on information retrieval, semantic parsing, and neural networks (e.g., MemNN). Information retrieval systems first retrieve a broad set of candidate answers by querying the search API of knowledge bases (KBs) with a transformation of the question into a valid query, and then use fine-grained detection heuristics to identify the exact answer. Semantic parsing methods focus on correctly interpreting the meaning of a question with a semantic parser. Recently, J. Weston et al. proposed training a neural network with a memory (MemNN) to answer various kinds of questions, and obtained quite promising results.

Memory Network (MemNN)

MemNN takes a set of inputs x_1, ..., x_n that are to be stored in memory and a question q, and outputs an answer a. Each of the x_i, q, and a contains words from a vocabulary V.
MemNN consists of three modules: the attention module (input memory representation), the reading module (output memory representation), and the response generation module:

Attention Module

Each input x_i will be projected into a continuous input vector m_i, in the simplest case using an embedding matrix A (for example, one derived from Skip-gram or GloVe embeddings). The question q will also be projected into a continuous vector u by another embedding matrix B. The MemNN then generates a weight p_i for each memory m_i; these weights will be used in the reading module to retrieve the input: p_i = Softmax(u^T m_i)
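As a concrete illustration, here is a minimal NumPy sketch of this attention step (the embedding dimension, the number of memories, and the vectors themselves are made-up placeholders):

import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d, n = 50, 4                 # embedding dimension and number of memories (assumed)
m = np.random.randn(n, d)    # m_i: input memory vectors (the x_i embedded by A)
u = np.random.randn(d)       # u: the question embedded by B
p = softmax(m @ u)           # p_i = Softmax(u^T m_i), one attention weight per memory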

Reading Module

Each input x_i will be projected into a continuous output vector c_i using another embedding matrix C. Then the reading module will output a sum of all output vectors c_i weighted by p_i: o=\sum_i p_i c_i

Response Generation Module

This module generates a prediction based on the question vector u as well as the output o of the reading module: \hat{a} = Softmax(W(o+u)).
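A small self-contained sketch of the reading and response steps along the same lines (the shapes, vocabulary size, and attention weights are again made up; in practice W projects into the answer vocabulary):

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d, n, V = 50, 4, 1000             # embedding dim, number of memories, vocabulary size (assumed)
p = softmax(np.random.randn(n))   # attention weights p_i from the attention module
c = np.random.randn(n, d)         # c_i: output memory vectors (inputs embedded by C)
u = np.random.randn(d)            # question vector produced by B
o = p @ c                         # reading module: o = sum_i p_i c_i
W = np.random.randn(V, d)         # final projection matrix
a_hat = softmax(W @ (o + u))      # response module: \hat{a} = Softmax(W(o + u))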

The overall model is shown in Fig. 1(a).

We can also stack multiple memory layers to obtain a multi-layer MemNN, using the sum of the question input and the output of the current layer, u^{k+1} = u^k + o^k, as the input to the next layer, as shown in Fig. 1(b).
During training, all embedding matrices A, B, and C, as well as W, are jointly learned by minimizing the cross-entropy loss between the prediction \hat{a} and the true label a.
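The following is a rough end-to-end sketch of a K-hop forward pass under these equations, assuming bag-of-words sentence encodings and a separate (random, untrained) embedding pair per layer; a real implementation would learn these matrices and typically tie weights across layers:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d, n, V, K = 50, 4, 1000, 3               # embedding dim, #memories, vocab size, #hops (assumed)
x = np.random.randint(V, size=(n, 6))     # toy inputs: n "sentences" of 6 word ids each
q = np.random.randint(V, size=7)          # toy question of 7 word ids

B = np.random.randn(V, d)
u = B[q].sum(axis=0)                      # question vector u (bag of words)

for k in range(K):
    A = np.random.randn(V, d)             # input embedding for this layer
    C = np.random.randn(V, d)             # output embedding for this layer
    m = A[x].sum(axis=1)                  # m_i: memory vectors
    c = C[x].sum(axis=1)                  # c_i: output vectors
    p = softmax(m @ u)                    # attention over the memories
    o = p @ c                             # read vector o^k
    u = u + o                             # u^{k+1} = u^k + o^k

W = np.random.randn(V, d)
a_hat = softmax(W @ u)                    # answer distribution from the top layer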

There are several improvements based on MemNN. A. Sordoni et al. propose an iterative alternating attention method which allows a fine-grained exploration of both the question and the document (the memory); A. Miller et al. propose the Key-Value MemNN, which makes reading documents more viable by using different encodings for lookup (keys) and reading (values).

Iterative Alternating Attention

Iterative alternating attention can be viewed as an inference chain that focuses on different parts of the question and the document at each search step, where the inference chain is modeled by a GRU. In particular, at search step t it (1) first performs an attentive read on the question, resulting in a question glimpse q_t, and (2) given the query glimpse, extracts a conditional document glimpse d_t representing the part of the document that is relevant to the current query glimpse. Both attentive reads are based on the previous hidden state of the inference GRU, s_{t-1}, which summarizes the information gathered so far up to search step t.

Question Attentive Reading

We denote \tilde{q_i} as the encoding of word i in the question q and q_{i,t} as the question attention weights; the question glimpse q_t at step t can then be calculated by:

q_{i, t} = softmax_{i = 1,\cdots,|Q|}\tilde{q_i}^T(A_q s_{t-1} + a_q)
q_t = \sum_i q_{i, t}\tilde{q_i}

where A_q and a_q are a learned matrix and bias that project the previous hidden state s_{t-1} of the inference GRU into the space of the question encodings \tilde{q_i}.
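A sketch of the question glimpse computation in NumPy (the encoding dimension, the GRU state dimension, and all vectors are made-up placeholders; \tilde{q_i} would normally come from a recurrent encoder over the question):

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

h, s, Q = 64, 128, 10                     # encoding dim, GRU state dim, question length (assumed)
q_tilde = np.random.randn(Q, h)           # \tilde{q_i}: per-word question encodings
s_prev = np.random.randn(s)               # s_{t-1}: previous inference-GRU state
A_q = np.random.randn(h, s)
a_q = np.random.randn(h)

scores = q_tilde @ (A_q @ s_prev + a_q)   # \tilde{q_i}^T (A_q s_{t-1} + a_q)
q_weights = softmax(scores)               # q_{i,t}
q_t = q_weights @ q_tilde                 # q_t = sum_i q_{i,t} \tilde{q_i}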

Document Attentive Reading

The notation mirrors that of the question attentive reading; note that the document attention is additionally conditioned on the query glimpse q_t. The document glimpse d_t at step t can then be calculated by:

d_{i, t} = softmax_{i=1,\cdots,|D|}\tilde{d_i}^T(A_d[s_{t-1}, q_t] + a_d)
d_t = \sum_i d_{i, t}\tilde{d_i}

where \tilde{d_i} is the encoding of word i in the document and [s_{t-1}, q_t] denotes the concatenation of the previous GRU hidden state and the current query glimpse.
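The corresponding document glimpse, conditioned on the query glimpse just computed (again with made-up dimensions and random placeholders):

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

h, s, D = 64, 128, 40                     # encoding dim, GRU state dim, document length (assumed)
d_tilde = np.random.randn(D, h)           # \tilde{d_i}: per-word document encodings
s_prev = np.random.randn(s)               # s_{t-1}: previous inference-GRU state
q_t = np.random.randn(h)                  # current question glimpse
A_d = np.random.randn(h, s + h)
a_d = np.random.randn(h)

cond = np.concatenate([s_prev, q_t])      # [s_{t-1}, q_t]
scores = d_tilde @ (A_d @ cond + a_d)     # \tilde{d_i}^T (A_d [s_{t-1}, q_t] + a_d)
d_weights = softmax(scores)               # d_{i,t}
d_t = d_weights @ d_tilde                 # d_t = sum_i d_{i,t} \tilde{d_i}

Both glimpses are then fed to the inference GRU to produce the next hidden state s_t for the following search step.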

Key-Value MemNN

Key-Value MemNN is a generalization of MemNN in the way that documents are stored in memory. In MemNN, the attention and reading are based on different vectors of the same document obtained with different embedding matrices. Key-Value MemNN generalizes this by using a key vector for attention and a value vector for reading, where the key and the value can be quite different: for example, we can split the document into windows of W words whose center is an entity, and use the whole window as the key while only the center entity is the value. The memory attention and reading can then be performed as follows.
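As a toy illustration of this key-value construction (the sentence, the entity set, and the window size are all made up):

# Build (key, value) pairs: windows of W words centered on an entity;
# the whole window is the key, the center entity is the value.
def build_kv_pairs(tokens, entities, W=5):
    half = W // 2
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in entities and i - half >= 0 and i + half < len(tokens):
            key = tokens[i - half : i + half + 1]   # the whole window as the key
            value = tok                             # the center entity as the value
            pairs.append((key, value))
    return pairs

tokens = "the movie Blade Runner was directed by Ridley Scott in 1982".split()
entities = {"Blade", "Ridley", "Scott"}             # toy entity set
print(build_kv_pairs(tokens, entities, W=5))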

Memory Attention

p_{m_i} = Softmax(A\Phi_Q(q) \cdot A\Phi_K(k_{m_i}))
where \Phi are feature maps of dimension D, A \in R^{d\times D}, q is the question, and k_{m_i} is the key vector of memory i.

Memory Reading

o = \sum_i p_{m_i} A\Phi_V(v_{m_i})
where v_{m_i} is the value vector of memory i.
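Putting the two steps together, a minimal NumPy sketch of Key-Value memory attention and reading (the feature maps \Phi here are just random placeholder vectors; in practice they would be, for example, bag-of-words or window features):

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

D, d, n = 200, 50, 6                      # feature dim, embedding dim, number of memory slots (assumed)
A = np.random.randn(d, D)                 # shared embedding matrix applied to all feature maps
phi_q = np.random.randn(D)                # \Phi_Q(q): question features
phi_k = np.random.randn(n, D)             # \Phi_K(k_{m_i}): key features, one row per slot
phi_v = np.random.randn(n, D)             # \Phi_V(v_{m_i}): value features, one row per slot

q_emb = A @ phi_q                         # A \Phi_Q(q)
k_emb = phi_k @ A.T                       # A \Phi_K(k_{m_i}) for every slot
v_emb = phi_v @ A.T                       # A \Phi_V(v_{m_i}) for every slot

p = softmax(k_emb @ q_emb)                # p_{m_i} = Softmax(A\Phi_Q(q) . A\Phi_K(k_{m_i}))
o = p @ v_emb                             # o = sum_i p_{m_i} A \Phi_V(v_{m_i})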

References:

[1]. J. Weston, S. Chopra, A. Bordes. Memory Networks. arXiv, 2015.
[2]. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. End-to-End Memory Networks. arXiv, 2015.
[3]. A. Sordoni, P. Bachman, Y. Bengio. Iterative Alternating Neural Attention for Machine Reading. arXiv, 2016.
[4]. A. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, J. Weston. Key-Value Memory Networks for Directly Reading Documents. arXiv, 2016.
[5]. A. Bordes, J. Weston, S. Chopra. Question Answering with Subgraph Embeddings. arXiv, 2014.