I'm going to talk about the data structures created in Learning to Transduce with Unbounded Memory which came out of DeepMind last year. The conclusions says it best (emphasis mine):
Even in tasks for which benchmarks obtain high accuracies, the memory-enhanced LSTMs converge earlier, and to higher accuracies, while requiring considerably fewer parameters than all but the simplest of Deep LSTMs.
The crux of the paper is in the "unbounded memory" which is in the form of a stack, queue, or deque. These are very old and well understood data structures in computer science, the trick was how they were made to be differentiable (a requirement for backpropagation).
I'll explain the stack first in detail, and then extend it to the queue and deque.
Why? Normally neural nets have finite memory, so can't be used to translate even simple grammars beyond their training set length. With the Neural Stack, there is an unbounded memory the net can use to save state while processing, enabling arbitrarily long parsing. Example application: reversing a string.
The operations on our stack are:
- Push adds a new value to the stack.
- Pop removes value(s) from the stack.
- Read returns the top value(s) on the stack.
The major difference is that ours is continuous, and values on the stack have a “strength” between 0.0 and 1.0 of how much of them is on the stack. Our operations affect the “strength” of values on the stack. This leads to three departures from traditional stacks:
- Pop (between 0 and 1) removes strength from the stack, which may remove multiple values, partial values, or no values.
- Push (between 0 and 1) adds a single value with a given strength, which may be 0.
- Read always returns the value of the top 1.0 of strength, which may be a mixture of values.
Here are the equations for a full update for the Neural Stack. Don’t worry if they don’t make sense now, the point of this post is to explain what’s going on graphically.
Stack Inputs and Outputs
Like we saw in the equations, we have three inputs and one output:
- v is the value we’re adding to the stack.
- u is the “strength” of our pop operation.
- d is the “strength” of our push operation.
- r is the output read value, which returns the top of the stack.
The paper represents this as two side-by-side arrays, with the values in one column, and the strengths in a second. I think this can be misleading, because values with large strengths are displayed the same as values with small strengths.
This post’s representation uses continuous height to show the strength of a value in the stack. Values with zero strength are not shown since they are effectively “gone” from the stack.
With height corresponding to strength, illustrating our operations is easier.
Three Basic Operations
We discussed previously the three operations on a stack are Pop, Push, and Read. Now let’s relate them to our representation where height=strength.
Pop removes between 0.0 and 1.0 height from the stack, which could be multiple values, partial values, or none at all. It “chops” the top off without regard to where values start or end.
Push adds a single new value to the stack, with a specified strength between 0.0 and 1.0. This doesn’t affect the rest of the stack at all.
Read returns the top 1.0 of height, which probably includes multiple values. In that top 1.0 subsection, it returns an average weighted by the strengths (heights) of the values.
This ordering (Pop, Push, Read) is the order in which we will apply these operations.
Our first operation is the pop operation, which removes height (and thus values) from the stack. The strength to pop is specified as u and is a value between 0 and 1. The pop removes up to its strength in height, so if the pop is greater than the height of the stack, then the stack is empty.
The second operation applied is push, which adds a single value to the stack. Where pop could operate on multiple values on the stack, push only adds a value, and leaves the rest of the stack unchanged. The strength (height) of the new value is the push value d and is between 0 and 1.
The final operation doesn’t change the stack; it only reads the top 1.0 of height. We give the read value as r and it’s a weighted average of whatever values are in the top 1.0 of height. If the stack is less than 1.0 in height, than the remaining read height is counted as if it had a value of zero.
Putting them all together, our full update for a single timestep involves performing all three operations in order: Pop, Push, and Read. Every update involves all three — if the machine doesn’t want to push or pop during an update it can set u and d to 0.
Why? If we already got unbounded memory from the stack, why do we need a queue? It turns out queues are much more versatile than stacks, and a queue automaton is Turing Complete! In theory, a system with a Neural Queue could be trained to learn any computable function. Example application: copy a string.
Our equations are similar to those for the stack, but at first glance it’s hard to tell the differences. The equation for V_t is the same as equation (1) in the Neural Stack.
Queue Inputs and Outputs
The queue has the same inputs and outputs as the stack, though we’ll use them differently.
The Neural Queue is operated like the Stack, except that pop and read happen at the bottom, while push still happens on the top.
What if we want to push and pop from both sides? Combine them both to form a Double-Ended Queue (abbreviated DeQue).
Why? The DeQue is actually a unification, and can learn to be either a stack or a queue or both. A lazy programmer can just use a Neural DeQue and let the neural network learn how it wants to use unbounded memory. It is able to learn both the stack (reversal) and queue (copying) example applications.
The equations for this are complicated, but the graphical representation is just as straightforward as the stack and queue.
DeQue Inputs and Outputs
Our inputs and outputs are the same, but doubled; one each for top and bottom.
The DeQue has the same operations, but it has two of each applied at the same time. Both are applied simultaneously, so there is still one Pop operation that removes height from the top and bottom, one Push operation that adds a value to the top and bottom, and one Read operation that returns the top 1.0 and bottom 1.0 of height-averaged values.
This paper isn’t the first time stacks and queues have been used with neural networks. DeepMind’s paper references a few, including a 1998 paper introducing a Neural Network Pushdown Automaton which had many of the same features, including a differentiable stack. Section 2 of DeepMind’s paper discusses the differences between their respective machines.
Additionally, other research is being done to add external memory to Neural Network machines. Recently a paper demonstrated training of End-to-End Memory Networks, which are networks attached to large memories.
Finally a much more complex Neural Turing Machine was published by DeepMind in 2014. While it showed some preliminary results, it still seems too complicated to be useful. However, the Neural Turing Machine is potentially a superset of the machines in this paper (stack-, queue-, and deque-based).
Thanks to the Deep Learning SF Meetup for encouraging me to write this up, and to Duncan Sabien, Daniel Grossmann-Kavanagh, and Rafael Cosman for reading drafts of this.
If you have any comments or questions, reach out to me on Twitter at @machinaut.