DyNet: The Dynamic Neural Network Toolkit

Most deep learning libraries (Theano, Tensorflow, etc.) use what the authors call static declaration, i.e. they separate the declaration and execution of the network architecture. In contrast, the authors propose a new deep learning library, DyNet, based on a programming model which unifies model declaration and execution (dynamic declaration).

Shortcomings of static declaration

There are various use cases for which static declaration is not suited for. For example, when one is dealing with variable length sequences in RNNs or recursive neural networks.

Current libraries deal with these constructions by increasing the complexity of the computational graph formalism. For example, objects can be created with unspecified input size at declaration time and flow control (conditional execution, iteration, etc.) is incorporated in the graphs. However, adding flow control primitives as operations and variable sized tensors to the libraries makes the implementation of the graph more complex. Furthermore, complex control-flow logic is very unintuitive to implement in these libraries (e.g. Theano’s scan).

DyNet to the rescue

DyNet allows to build a different computational graph for each training example, which is very convenient when you are dealing with variable length sequences. Furhermore, all flow control is handled by the host language, which makes developing much easier.

In order to make rebuilding the graph at every iteration computationally feasible, DyNet’s C++ backend is heavily optimized to ease the burden of graph construction. This is made easier by the fact that flow control and facilities for dealing with variably sized inputs remain in the host language, so the graph is not as “heavy” in the first place.

Here’s a small toy example of DyNet in practice: a binary text classifier that at test time can make a decision before reading the whole document.

1  import dynet as dy
2  import numpy as np
3
4  model = dy.model()  # initialize a Model object, which holds all network parameters
5
6  # add parameters to the Model
7  W_p = model.add_parameters(50)
8  b_p = model.add_parameters(1)
9  E = model.add_lookup_parameters((20000, 50))
10
11 trainer = dy.SimpleSGDTrainer(model)  # a Trainer object will be in charge of 
12                                       # updating the parameters of the model
13
14 for epoch in range(num_epochs):
15     # train 
16     for in_words, out_label in training_data:
17         dy.renew_cg()  # clear the current computational graph
18
19         access the model parameters and add them to the graph, so they can be computed with
20         W = dy.parameter(W_p)
21         b = dy.parameter(b_p)
22
23         score_sym = W*sum([ E[word] for word in in_words ]) + b
24         loss_sym = dy.logistic( out_label*score_sym )
25         loss_sym.backward()  # compute loss and accumulate gradients in the 'model' variable
26         trainer.update()  # update the params and clear gradients from the model
27       
28     # test
29     correct_answers = 0 
30     for in_words, out_label in test_data:
31         dy.renew_cg()
32         W = dy.parameter(W_p)
33         b = dy.parameter(b_p)
34         score_sym = b
35    
36         for word in in_words:
37             score_sym = score_sym + W * E[word]
38             if abs(score_sym.value()) > threshold:  # .value() calculates the value of the  
39                                                     # computation by forwarding data through 
40                                                     # the graph
41                 break
42
43         if out_label * score_sym.value() > 0:
44             correct_answers += 1
45
46 print(correct_answers/len(test_data))

Line 38 is particularly interesting. It demonstrates Python’s control-flow logic interleaving the model declaration. Notice also that one needs to call the value() method of the symbolic variables (or Expression objects as they are called in the paper) to obtain the numerical results, unlike in Chainer, which performs the forward step automtically while it is performing the computation graph construction. This makes it theoretically possible to create graph optimization routines that run before performing actual computation in DyNet.

Additional features

DyNet comes with a set of Builder Classes, which provide convenient and efficient implementations of higher-level constructs that are implemented on top of the core DyNet auto-differentiation functionality. These include RNNs, tree-structured networks and various ways to calculate softmax distributions.

To improve computational efficiency, DyNet supports sparse updates of parameters (e.g. when dealing with word embeddings) and parallel processing to train the model across many CPU cores.

What is also interesting is that DyNet has specially designed batching operations which treat the number of mini-batch elements not as another standard dimension, but as a special dimension with particular semantics. This means that the user doesn’t have to keep track of extra batch dimensions explicitly. What follows is a comparison of non-batched and batched implementations of a classifier (notice the functions ending with “_batch” in the batched implementation).

Non-minibatched classification:

1 # in_words is a tuple (word_1, word_2)
2 word_1 = E[in_words[0]]
3 word_2 = E[in_words[1]]
4 scores_sym = dy.softmax(W*dy.concatenate([word_1, word_2])+b)
5 loss_sym = dy.pickneglogsoftmax(scores_sym, out_label)

Minibatched classification:

1 # in_words is a list [(word_{1,1}, word_{1,2}), (word_{2,1}, word_{2,2}), ...]
2 # out_labels is a list of output labels [label_1, label_2, ...]
3 word_1_batch = dy.lookup_batch(E, [x[0] for x in in_words])
4 word_2_batch = dy.lookup_batch(E, [x[1] for x in in_words])
5 scores_sym = dy.softmax(W*dy.concatenate([word_1_batch, word_2_batch])+b)
6 loss_sym = dy.sum_batches( dy.pickneglogsoftmax_batch(scores_sym, out_labels) )

Conclusion

I believe that DyNet is certainly worth exploring, even more so when considering that its performance for various recurrent and dynamic neural networks is on par with conventional libraries. In addition to making it easier to implement dynamic networks, unifying model declaration and execution makes debugging the model much easier. It’s also worth pointing out that since the computational graph is much more lightweight, the compilation happens considerably faster than in Theano or Tensorflow.