The Differentiable Neural Computer (DNC) is a neural network that, loosely speaking, separates computation from memory. Often the best way to understand is to read the code directly – otherwise, the following should serve as a gentle introduction.
Motivation
LSTMs and other RNNs have their memory tied up in the activations of their neurons, and these typically account for only a small fraction of the network's total capacity – to grow the memory, you have to grow the whole network. For a number of tasks, we want to be able to scale memory without having to scale the rest of the network – enter differentiable memory structures.
The Differentiable Neural Computer is so called because it augments a neural network with a memory bank such that computation can proceed under a fully differentiable analogue of the von Neumann architecture, complete with memory allocation and deallocation.
To push the analogy further: an LSTM is a simple CPU whose activations play the role of register contents, whereas the DNC pairs a CPU (called the ‘controller’) with a separate, independently scalable, differentiable form of RAM (called the ‘memory’).
So, when is this approach useful? Well, many sequential tasks have implicit structure in them which is best exploited by operations on particular data structures. So, if we can learn to store the right data structures themselves and learn to operate on them in the right way, then these tasks become easy. It’s the same in normal programming: we choose the right data structure and right algorithm to deal with problems; in doing so we try to select the most desirable tradeoff between computation and memory. When we introduce RNNs with external memory, we can use backprop to learn that right tradeoff.
The problem is that this tradeoff is difficult to learn without a more explicit mechanism for allocating and deallocating memory. Given such a mechanism, the network can learn when to offload the right information to memory and when to perform a particular action on that memory. One of the key features of the DNC is its ability to learn to manage memory in exactly this way.
How it works
In practice, we don’t want to have to interact with the whole memory at once – this motivates the use of attention: every read and write is a soft, weighted operation over the rows of the memory matrix, and how those weightings are computed is the key to good performance in these types of architectures.
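Concretely, a read is just a weighted sum over the rows of memory. A minimal NumPy sketch (shapes and names here are illustrative, not taken from any particular implementation):

```python
import numpy as np

N, W = 4, 3  # N memory slots, each a vector of width W
memory = np.arange(N * W, dtype=float).reshape(N, W)

# A soft read weighting over the N rows (non-negative, sums to 1).
read_weighting = np.array([0.1, 0.7, 0.1, 0.1])

# The read vector is a convex combination of memory rows,
# so the whole operation stays differentiable.
read_vector = read_weighting @ memory
```

A hard index lookup would be `memory[1]`; the soft version above mostly reads row 1 but blends in the others, which is what makes gradients flow through the addressing.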
In the DNC the read/write weightings come from 3 core attention-based ideas: content, memory allocation and temporal order of memory interactions.
1) Content: find memories closest to a lookup key emitted by the controller, as measured by cosine similarity (in the DNC, the similarities are turned into a weighting via a softmax sharpened by a key-strength parameter).
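A sketch of content-based addressing under those rules (the shapes are hypothetical; `beta` is the key-strength parameter):

```python
import numpy as np

def content_weighting(memory, key, beta):
    """Soft attention over memory rows by cosine similarity to a key.

    memory: (N, W) matrix; key: (W,) lookup key;
    beta: scalar >= 1 that sharpens the resulting distribution.
    """
    eps = 1e-8  # avoid division by zero for all-zero rows
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    scores = beta * sims
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

memory = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
w = content_weighting(memory, key=np.array([1.0, 0.0]), beta=10.0)
# The row most similar to the key receives most of the weight.
```

Raising `beta` pushes the weighting towards a hard, one-hot lookup; lowering it spreads the read across similar rows.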
2) Memory allocation: with an external memory we inherit memory-management problems, and we don’t want to be tied to contiguous blocks (i.e. to raw indices into the memory matrix). The DNC therefore tracks the usage of each location, maintaining a differentiable analogue of a free list, and the write head can then choose between updating an existing location and writing somewhere new.
For example, once it’s read something and used it, it can learn to free up that memory slot.
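One way to compute a write location from usage is the allocation weighting described in the DNC paper: sort locations by ascending usage (the ‘free list’) and concentrate the weighting on the least-used ones. A sketch, with illustrative names:

```python
import numpy as np

def allocation_weighting(usage):
    """Allocation weighting: low-usage locations get high weight.

    usage: (N,) vector in [0, 1], one entry per memory slot.
    """
    free_list = np.argsort(usage)  # least-used locations first
    sorted_usage = usage[free_list]
    # Product of usages of all freer locations (1 for the freest slot).
    shifted = np.concatenate(([1.0], np.cumprod(sorted_usage)[:-1]))
    alloc_sorted = (1.0 - sorted_usage) * shifted
    alloc = np.empty_like(usage)
    alloc[free_list] = alloc_sorted  # undo the sort
    return alloc

usage = np.array([0.9, 0.1, 0.5])
a = allocation_weighting(usage)
# The freest slot (usage 0.1) receives the largest allocation weight.
```

Because the weighting is a smooth function of usage, the network can learn via backprop when to free a slot (drive its usage down) and when to allocate it.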
3) Temporal order: we want to be able to iterate through memories in the order that they were written. This is an important prior for computational tasks where the desired solution requires reading and writing large amounts of data in sequential order. The network maintains some helper variables to turn temporal ordering into a weighting: a ‘precedence weighting’ keeps track of the location that was just written to, and a ‘temporal link matrix’ records which location was written after which.
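A sketch of how those helper variables can be updated after each write, following the update rules in the DNC paper (`write_w` is the current step’s write weighting; variable names are illustrative):

```python
import numpy as np

def update_temporal_links(link, precedence, write_w):
    """Update the temporal link matrix and precedence weighting.

    link[i, j] ~ degree to which slot i was written just after slot j;
    precedence ~ degree to which each slot was written to most recently.
    """
    # Decay old links touched by this write, then add new links from
    # the previous precedence weighting to the locations written now.
    decay = 1.0 - write_w[:, None] - write_w[None, :]
    link = decay * link + write_w[:, None] * precedence[None, :]
    np.fill_diagonal(link, 0.0)  # no self-links
    # Precedence decays by the total amount written this step.
    precedence = (1.0 - write_w.sum()) * precedence + write_w
    return link, precedence

N = 3
link = np.zeros((N, N))
precedence = np.zeros(N)
# Write to slot 0, then slot 2: link should record '2 written after 0'.
for w in (np.eye(N)[0], np.eye(N)[2]):
    link, precedence = update_temporal_links(link, precedence, w)
```

Reading forwards along the write order is then just `link @ read_weighting`, and backwards is `link.T @ read_weighting`.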
For more in-depth information, please see the comments provided in the implementation.