Question Answering has long been a challenge in natural language processing. There has been much work on question answering based on Information Retrieval, Semantic Parsing, and Neural Networks (MemNN). Information Retrieval systems first retrieve a broad set of candidate answers by querying the search API of knowledge bases (KBs) with a transformation of the question into a valid query, and then use fine-grained detection heuristics to identify the exact answer. Semantic parsing methods focus on correctly interpreting the meaning of a question with a semantic parsing system. Recently, J. Weston et al. proposed training a neural network with a memory (MemNN) to answer various kinds of questions, with quite promising results.

# Memory Neural Network (MemNN)

MemNN takes a set of inputs $x_1,\dots,x_n$ that are to be stored in memory and a question $q$, and outputs an answer $a$. Each of $x_i$, $q$, and $a$ contains words from a vocabulary $V$.
MemNN consists of three modules: an attention module (input memory representation), a reading module (output memory representation), and a response generation module.

## Attention Module

Each input $x_i$ will be projected into a continuous input vector $m_i$, in the simplest case using an embedding matrix $A$ (e.g., word embeddings from SkipGram or GloVe). The question $q$ will also be projected into a continuous vector $u$ by another matrix $B$. The MemNN then generates a weight $p_i$ for each memory $m_i$; these weights will be used in the reading module to retrieve the input: $p_i = Softmax(u^T m_i)$

## Reading Module

Each input $x_i$ will be projected into a continuous output vector $c_i$ using another embedding matrix $C$. The reading module then outputs the sum of all output vectors $c_i$ weighted by $p_i$: $o=\sum_i p_i c_i$

## Response Generation Module

This module generates a prediction based on the question vector $u$ as well as the output of the reading module $o$: $\hat{a} = Softmax(W(o+u))$.

The overall model is shown in Fig. 1(a). During training, all three embedding matrices $A$, $B$, and $C$, as well as $W$, are jointly learned by minimizing a cross-entropy loss between $\hat{a}$ and the true label $a$.
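The single-hop forward pass described above can be sketched in NumPy as follows. This is a minimal illustration, not the original implementation: the bag-of-words input representation, the toy dimensions, and the random parameters are all assumptions for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memnn_forward(x_bow, q_bow, A, B, C, W):
    """Single-hop MemNN forward pass over bag-of-words inputs."""
    m = x_bow @ A.T               # (n, d) input memory vectors m_i
    c = x_bow @ C.T               # (n, d) output memory vectors c_i
    u = B @ q_bow                 # (d,)   question vector
    p = softmax(m @ u)            # attention weights p_i = Softmax(u^T m_i)
    o = p @ c                     # reading module: o = sum_i p_i c_i
    return softmax(W @ (o + u))   # answer distribution over the vocabulary

rng = np.random.default_rng(0)
V, n, d = 12, 4, 6                        # toy vocabulary / memory / embedding sizes
x_bow = rng.random((n, V))                # n stored inputs as bag-of-words vectors
q_bow = rng.random(V)                     # question as a bag-of-words vector
A, B, C = (0.1 * rng.standard_normal((d, V)) for _ in range(3))
W = 0.1 * rng.standard_normal((V, d))
a_hat = memnn_forward(x_bow, q_bow, A, B, C, W)
```

Note that the same softmax appears twice: once over the $n$ memories (attention) and once over the vocabulary (answer prediction).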

We can also stack multiple MemNN layers to get a multi-layer MemNN by using the sum of the question input and the output of the current layer, $u^{k+1}=u^k+o^k$, as the input to the layer above, as shown in Fig. 1(b).
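The stacking rule $u^{k+1}=u^k+o^k$ can be sketched as a loop over hops. Per-hop embedding matrices are one common layer-wise scheme; the tying of parameters across hops is an assumption here, as is the toy setup.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memnn_multihop(x_bow, q_bow, A_hops, C_hops, B, W):
    """K-hop MemNN: each hop reads memory and updates u^{k+1} = u^k + o^k."""
    u = B @ q_bow
    for A, C in zip(A_hops, C_hops):      # one (A, C) pair per hop
        m = x_bow @ A.T                   # input memory vectors for this hop
        c = x_bow @ C.T                   # output memory vectors for this hop
        p = softmax(m @ u)                # attention conditioned on current u^k
        u = u + p @ c                     # residual update u^{k+1} = u^k + o^k
    return softmax(W @ u)                 # final prediction from u^{K+1} = u^K + o^K

rng = np.random.default_rng(1)
V, n, d, K = 12, 4, 6, 3
x_bow = rng.random((n, V))
q_bow = rng.random(V)
A_hops = [0.1 * rng.standard_normal((d, V)) for _ in range(K)]
C_hops = [0.1 * rng.standard_normal((d, V)) for _ in range(K)]
B = 0.1 * rng.standard_normal((d, V))
W = 0.1 * rng.standard_normal((V, d))
a_hat = memnn_multihop(x_bow, q_bow, A_hops, C_hops, B, W)
```

Because the final response is $Softmax(W(o^K+u^K))$ and $u^{K+1}=u^K+o^K$, the loop can fold the last reading output into $u$ before applying $W$.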

There are several improvements based on MemNN. A. Sordoni et al. propose an iterative alternating attention method which allows a fine-grained exploration of both the question and the document (memory); A. Miller et al. propose a Key-Value MemNN which makes reading documents more viable by utilizing different encodings for lookup and reading, respectively.

## Iterative Alternating Attention

The iterative alternating attention can be considered as an inference chain which focuses on different parts of the question and document at each search step, where the inference chain is modeled by a GRU. In particular, at search step $t$: (1) it first performs an attentive read of the question, resulting in a question glimpse $q_t$; and (2) given the query glimpse, it extracts a conditional document glimpse $d_t$, representing the part of the document that is relevant to the current query glimpse. Both attentive reads are based on the previous hidden state of the inference GRU, $s_{t-1}$, which summarizes the information gathered so far up to search step $t$.

We denote $\tilde{q}_i$ as the encoding of word $i$ in the question $q$ and $q_{i,t}$ as the question attention weights; the query glimpse $q_t$ at step $t$ can then be calculated by:

$q_{i, t} = softmax_{i = 1,\cdots,|Q|}\,\tilde{q}_i^T(A_q s_{t-1} + a_q)$
$q_t = \sum_i q_{i, t}\tilde{q}_i$

where $A_q$ and $a_q$ are learned parameters projecting the GRU state into the question-encoding space, and $s_{t-1}$ is the previous hidden state of the inference GRU.

The notation is very similar to that of the question attentive read; note that the document attention weights are additionally conditioned on the query glimpse $q_t$. The document glimpse $d_t$ at step $t$ can be calculated by:

$d_{i, t} = softmax_{i=1,\cdots,|D|}\,\tilde{d}_i^T(A_d[s_{t-1}, q_t] + a_d)$
$d_t=\sum_i d_{i, t}\tilde{d}_i$
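One search step of the alternating attention above can be sketched as follows. The parameter shapes for $A_q$, $a_q$, $A_d$, and $a_d$ are assumptions chosen so that the matrix products are well-defined, and the update of the GRU state from the glimpses is omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def alternating_attention_step(q_enc, d_enc, s_prev, A_q, a_q, A_d, a_d):
    """One search step t of iterative alternating attention.

    q_enc:  (|Q|, e) question word encodings (tilde q_i)
    d_enc:  (|D|, e) document word encodings (tilde d_i)
    s_prev: (h,)     previous inference-GRU state s_{t-1}
    A_q: (e, h), a_q: (e,); A_d: (e, h + e), a_d: (e,)  -- assumed shapes
    """
    # (1) attentive read of the question, conditioned on s_{t-1}
    q_w = softmax(q_enc @ (A_q @ s_prev + a_q))
    q_t = q_w @ q_enc                                   # question glimpse
    # (2) attentive read of the document, conditioned on s_{t-1} and q_t
    d_w = softmax(d_enc @ (A_d @ np.concatenate([s_prev, q_t]) + a_d))
    d_t = d_w @ d_enc                                   # document glimpse
    return q_t, d_t

rng = np.random.default_rng(2)
Q, D, e, h = 5, 9, 4, 3
q_t, d_t = alternating_attention_step(
    rng.random((Q, e)), rng.random((D, e)), rng.random(h),
    rng.random((e, h)), rng.random(e), rng.random((e, h + e)), rng.random(e))
```

In the full model, $q_t$ and $d_t$ would then be fed into the inference GRU to produce the next state $s_t$, and the step repeated.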

## Key-Value MemNN

Key-Value MemNN is a generalization of MemNN in the way that documents are stored in memory. In MemNN, attention and reading are based on different vectors of the same document, obtained with different embedding matrices. Key-Value MemNN generalizes this by using a key vector for attention and a value vector for reading, where the key and the value can be quite different. For example, we can split the document into windows of $W$ words whose center is an entity, use the whole window as the key, and use only the center entity as the value. The memory attention and reading can then be performed as follows.

### Memory Attention

$p_{m_i}= Softmax(A\Phi_Q(q) \cdot A\Phi_K(k_{m_i}))$
where the $\Phi$ are feature maps of dimension $D$, $A\in R^{d\times D}$, $q$ is the question, and $k_{m_i}$ is the key vector of memory $i$.

### Memory Reading

$o = \sum_i p_{m_i}A\Phi_V (v_{m_i})$
where $v_{m_i}$ is the value vector of memory $i$.
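A single key-value lookup combining the attention and reading equations above can be sketched as follows. The feature maps $\Phi$ are left abstract here (random features stand in for them), and the toy sizes are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kv_memory_read(q_feat, key_feats, val_feats, A):
    """One key-value memory lookup with a shared embedding matrix A.

    q_feat: (D,) feature map Phi_Q(q)
    key_feats, val_feats: (n, D) feature maps Phi_K(k_{m_i}), Phi_V(v_{m_i})
    Addressing uses the keys; the returned output mixes the values.
    """
    q = A @ q_feat                 # embedded question A Phi_Q(q)
    K = key_feats @ A.T            # embedded keys A Phi_K(k_{m_i})
    Vm = val_feats @ A.T           # embedded values A Phi_V(v_{m_i})
    p = softmax(K @ q)             # p_{m_i} = Softmax(A Phi_Q(q) . A Phi_K(k_{m_i}))
    return p @ Vm                  # o = sum_i p_{m_i} A Phi_V(v_{m_i})

rng = np.random.default_rng(3)
n, D, d = 6, 10, 5                 # memories, feature dim, embedding dim
o = kv_memory_read(rng.random(D), rng.random((n, D)), rng.random((n, D)),
                   rng.random((d, D)))
```

The split between `key_feats` and `val_feats` is exactly what distinguishes this from plain MemNN: with `key_feats == val_feats` (up to the embedding matrices) it reduces to the standard attention-then-read scheme.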