A Summary of "Reasoning With Neural Tensor Networks for Knowledge Base Completion"

Here, I summarize and try to explain in detail what I read and understood in the paper entitled "Reasoning With Neural Tensor Networks for Knowledge Base Completion" by Socher, Chen, Manning, and Ng from Stanford.

Main Ideas

The overall goal of the paper is to answer whether two entities, $(e_1, e_2)$, are in a given relation $R$.

Neural Tensor Network

This is a modified neural network architecture that has a bilinear tensor layer instead of a standard linear layer that directly relates the two entities. The aim of this model is compute a score that indicates how likely it is for the two entities to be in a given relationship. The function is defined by:

$$g(e_1, R, e_2) = u^{T}_{R} f \left(e_1^T W_{R}^{[1:k]}e_2 + V_{R}\begin{bmatrix}e_1 \\ e_2\end{bmatrix} + b_R\right)$$

$W_{R}^{[1:k]} \in \mathbb{R}^{d \times d \times k}$ is a tensor, and $e_{1}^{T}W_{R}^{[1:k]}e_2$ is what the paper calls a "bilinear tensor product" (I couldn't find a formal definition of this anywhere online), which is then added to the output of a standard layer, $V_R$, which is then added to the the bias, $b_R$. The whole sum is then passed through $f$, which is elementwise $\tanh$, and finally multiplied on the left by $u_R^{T}$, where $u_R$ is determines how the activated weights are combined to get a signle final score $g \in \mathbb{R}$.

This equation seemed a bit daunting to me at first, so here's a more careful examination of what is going on:

First, a reminder of what $\tanh$ looks like:

The following sum is fed into an elementwise $\tanh$ that operates on a vector in $\mathbb{R}^{k}$:

$$e_{1}^{T}W_{R}^{[1:k]}e_2 + V_{R}\begin{bmatrix}e_1 \\ e_2\end{bmatrix} + b_R$$

The easiest to identify thing here is the bias node, which is represented by the vector $b_R$.
The next easy thing to identify here is the regular neural network layer, represented by the product $V_{R}\begin{bmatrix}e_1 \\ e_2\end{bmatrix}$, where $V_{R} \in \mathbb{R}^{k \times 2d}$ represents the weight matrix that specifies how to linearly combine the input in $k$ different ways. The vector $\begin{bmatrix}e_1 \\ e_2 \end{bmatrix}$ is just a singular vector in $\mathbb{R}^{2d}$ constructed by vertically concatenating the entries of $e_1$ and $e_2$ into one vector. So far so good. The expression $V_{R}\begin{bmatrix}e_1 \\ e_2 \end{bmatrix} + b_R$ itself is taken straight out of the expression for a single neural network layer in a classical neural network, where this sum is then passed through an activation function and then through the remaining layers.
All that remains to parse is the most interesting and different part of the sum, the bilinear tensor product, $e_{1}^{T}W_{R}^{[1:k]}e_{2}$. This notation was slightly confusing, but a diagram from the paper was illustrative: this operation represents stacking $k$ bilinear forms $e_1^{T}W_{R}^{i}e_2, i \in \{1, \ldots, k\}$ on top of each other to get a vector in $\mathbb{R}^{k}$. The tensor $W_{R}^{[1:k]}$ can be thought of as $k$ slices put together, where each slice is a $d \times d$ matrix relating entries from $e_1$ to entries in $e_2$. A concrete example might be of use: Let $d, k = 2$, let $e_1 = \begin{bmatrix}a \\ b\end{bmatrix}$ and let $e_2 = \begin{bmatrix}c \\ d\end{bmatrix}$. Let $W^{1}_{R} = \begin{bmatrix} w_{11}^1 & w_{12}^1 \\ w_{21}^1 & w_{22}^1 \end{bmatrix}$ and $W_{R}^{2} = \begin{bmatrix} w_{11}^2 & w_{12}^2 \\ w_{21}^2 & w_{22}^2 \end{bmatrix}$. Then $$W_{R}^{1}e_2 = \begin{bmatrix}w_{11}^{1}c + w_{12}^{1}d \\ w_{21}^{1}c + w_{22}^{1}d\end{bmatrix} \text{ and } W_{R}^{2}e_2 = \begin{bmatrix}w_{11}^{2}c + w_{12}^{2}d \\ w_{21}^{2}c + w_{22}^{2}d\end{bmatrix}$$

Then

$$e_{1}^{T}W_{R}^{1}e_2 = a(w_{11}^{1}c + w_{12}^{1}d) + b(w_{21}^{1}c + w_{22}^{1}d) $$

and

$$e_{1}^{T}W_{R}^{2}e_2 = a(w_{11}^{2}c + w_{12}^{2}d) + b(w_{21}^{2}c + w_{22}^{2}d)$$

The interesting thing to note in both of these forms is that each entry in $e_1$ gets to be multiplied with each entry in $e_2$, and the product of any two individual entries is given a distinct weight. The final "bilinear tensor product" is then

$$e_{1}^{T}W_{R}^{[1:2]}e_{2} = \begin{bmatrix} w_{11}^{1}ac + w_{12}^{1}ad + w_{21}^{1}bc + w_{22}^{1}bd \\ w_{11}^{2}ac + w_{12}^{2}ad + w_{21}^{2}bc + w_{22}^{2}bd \end{bmatrix}$$

What I have understood through this example is that the bilinear tensor product term is just $k$ bilinear forms stacked on top of each other. What this means is that each entry in $e_1$ gets to be multiplied with each entry in $e_2$ $k$ times with $k$ different weights. Thus, we get to turn $k$ knobs, where a knob is a $d \times d$ matrix representing the strength of association between pairs of entries in $e_1$ and $e_2$. The paper explains that this bilinear term allows us the model to explicitly relate the two inputs multiplicatively, rather than just having an implict nonlinear association that we would get with this term removed.

In summary, not only do we get to control how the stacked input vector is recombined, we also get to control how pairwise products of the vector entries are weighted.

Finally, once the big sum is passed through the $\tanh$ activation function, the resulting $k$-vector gets multiplied by $u_{R}^{T}$, which is a row $k$-vector, thus giving us a single score at the very end.

The paper points out that the Neural Tensor Network model, as defined above, combines the ideas and strengths from several different model types.

Loss Function

The loss function or training objective in this paper is called a "contrastive max-margin" objective function. The paper descrbes one main idea used to motivate this objective function: if we have a training set $T^{(i)} = (e_{1}^{(i)}, R^{(i)}, e_{2}^{(i)})$, each triplet that actually belongs to the training set should receive a higher score than a triplet where one of the entities is replaced randomly with a new entity. This seems like a natural requirement, since the relationships defined by triplets in the training set are known to be true. The triplets where an entity has been replaced by a random entity is called a corrupted triplet. The set of corrupted triplets is denoted by $T_{c}^{(i)} = (e_{1}^{(i)}, R^{(i)}, e_c)$. Here, $e_c$ has been randomly sampled from the set of all entities that can appear at that position in the relation $R^{(i)}$. (‼ one point I was confused about here was whether or not $e_c$ is parameterized by $i$. It seems like it should be, since the possible choices of $e_c$ depends on the relation $R^{(i)}$, which itself is indexed by $i$). What I found a little bit interesting here is that the corruption only happens in one position. A relation $R$ doesn't have to be symmetric, which means that a corruption $(e_1, R, e_c)$ is different from a corruption $(e_c, R_, e_2)$. Why, then, do we only corrupt on the right?

As we saw earlier, the Neural Tensor Network model itself is parameterized by the choice of relation $R$, and in particular, each relation $R$ has its own set of weight matrices/tensors, $W_R, V_R, u_R, b_R$. Here, I faced another point of confusion. The paper defines $\mathbf{\Omega}$ to be the set of NTN parameters for all relationships, and it is comprised of $\mathbf{u}$, $\mathbf{W}$, $\mathbf{V}$, $\mathbf{b}$, and $\mathbf{E}$. While the first four of these are clear, I am a little confused about what $E$ is supposed to be. Is it the set of all entities? Finally, the paper defines the objective function as:

$$J(\mathbf{\Omega}) = \sum_{i = 1}^{N}\sum_{c = 1}^{C}\max\left(0, 1 - g(T^{(i)}) + g(T_{c}^{(i)})\right) + \lambda ||\mathbf{\Omega}||_{2}^{2}$$

Where $N$ is the number of training points, $C$ is the number of randomly sampled corrupted triplets of each given correct triplet (i.e. in the training set). The max in the summation forces the the minimizer to drive $g(T^{(i)})$ to be as much larger than $g(T_{c}^{(i)})$ as possible, up until it reaches exactly $1$ more than $g(T_{c}^{(i)})$, at which point any additional increase in $g(T^{(i)})$ is meaningless for the output $J$. The $\lambda ||\mathbf{\Omega}||_{2}^{2}$ summand is a standard $L_2$ regularization term that helps with overfitting.

This equation for the objective function was a little puzzling initially, since it isn't quite clear what it means to take the $2$-norm of $\mathbf{\Omega}$, which itself wasn't defined very precisely. Though reading onto the paragrah after that reveals that this ambiguous notation is actually defining a set of five different objective functions (perhaps we can the final objective as the minimization of their sum?) This is still a point of slight unclarity for me. The paper uses the L-BFGS nonlinear optimization method to find a local minimum of the cost function.

Vector Representations

In the framework being used for this paper, each entity has a vector representation $e \in \mathbb{R}^d$. It seems like this framework was being used in multiple papers in the early 2010s, including in "Learning Structured Embeddings of Knowledge Bases" by Bordes, Weston, Collobert, and Bengio, in which a way of assigning entities vector representations is discussed.

The NTN paper (the one currently being summarized) states that the NTN model works well with randomly initialized entity vectors, which are then learned for each entity through the training process (since the actually relationships between entity vectors are part of the traning data, which then translates to the learned function $g$). The paper also proposes a new scheme for representing entities using the composition of word vectors, which are vectors in $\mathbb{R}^d$. An entity is represnted by the average of the vectors of words that compose to it. For example, $v_{\textit{homo sapiens}} = 0.5(v_{\textit{homo}} + v_{\textit{sapiens}})$. This can then embed some similarities between entities before even training. The example used in the training is homo erectus. If this entity hasn't been seen before, a fact about homo sapiens can still be extended to it due to the fact that $v_{\textit{homo}}$ is in the word compositions for both vector representations, which means that $v_{\textit{homo erectus}}$ will start out relatively close to $v_{\textit{homo sapiens}}$ even though $v_{\textit{erectus}}$ is random.

The total number of entities is $N_E$ and the total number of unique words is $N_W$. If the training is done on words, the entity embedding is $E \in \mathbb{R}^{d \times N_W}$ and if the training is performed with whole vectors, the entity embedding is $E \in \mathbb{R}^{d \times N_E}$.

Experimental Results

The experiments performed in the paper were quite succesful, achieving accuracies of $86.2\%$ on the WordNet dataset and $90\%$ on the FreeBase dataset, though the improvement seemed marginal over an existing model called the Bilinear Model (not quite the same as the NTN, though it uses an idea that inspired the NTN).

Final Thoughts

This was my first look at Knowledge Base completion. I thought it was quite an interesting area and I might look further into it later. What brought me to this paper was the paper called End-To-End Differentiable Proving by Rocktäschel and Riedel, which I wanted to study as a part of my dive into automated and neurosymbolic reasoning. I will attempt to summarize that paper next.