The encapsulation of model state in PyTorch is, to be frank, confusing. To train a model, the user is required to share its parameters and its gradient among multiple disconnected objects, including an optimization algorithm and a loss function. These objects are in turn called upon to mutate the gradient and the parameters in specific, unintuitive ways, via methods like loss.backward() and optimizer.step().
What's going on with state in PyTorch? Why are model parameters shared and mutated between three distinct objects? And what would PyTorch look like with an API that was more clearly inspired by functional programming?
Background: PyTorch and parameter state
To see how parameter state is currently handled in PyTorch, let's take a simple module container as an example. We'll initialize the container with 128 input nodes, a hidden layer with 64 nodes and ReLU activation, and an output layer with 10 nodes and log-softmax activation:
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
    torch.nn.LogSoftmax(dim=1)  # specify dim explicitly to avoid a deprecation warning
)
So far, this API feels intuitive. We define the network's layers declaratively, mirroring an interpretation of the network as a composition of linear transformations, and PyTorch returns a module container that maps inputs to outputs. No class inheritance, and no state to deal with.
Things get weird when it comes time to train the network. To perform training, PyTorch requires us to initialize an optimizer -- that is, an optimization algorithm, such as stochastic gradient descent (SGD). Suddenly, we need to share the model's parameter state with the optimizer object in order to initialize it:
# Initialize an optimizer using stochastic gradient descent
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
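Note that the optimizer doesn't copy the parameters: it holds references to the very same tensors that model owns. A quick check (a sketch, assuming the model and optimizer defined above) makes the sharing explicit:
# The optimizer's param_groups hold references to the exact tensors
# that live inside the module container -- not copies of them
first_param = next(model.parameters())
print(first_param is optimizer.param_groups[0]["params"][0])
>>> True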
PyTorch also requires us to initialize a second object, a loss function, to calculate the gradient of the network. This object is entirely decoupled from the module container model:
# Initialize a loss function using negative log-likelihood loss
loss_fn = torch.nn.NLLLoss()
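As a side note, NLLLoss expects log-probabilities as input, which is exactly why the network above ends in LogSoftmax. The loss function will happily operate on any appropriately-shaped tensors -- it knows nothing about our model. Here's a quick sketch (the dummy tensors are purely illustrative):
# NLLLoss operates on any log-probabilities and labels; it has no
# connection to the model we defined above
dummy_output = torch.log_softmax(torch.randn(1, 10), dim=1)
dummy_label = torch.tensor([3])
dummy_loss = loss_fn(dummy_output, dummy_label)  # a scalar tensor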
When training the model, these three objects -- the module container, the optimizer, and the loss function -- all come together to mutate the same state, the model's parameters and gradient. Here's what a simple training loop might look like:
for epoch in range(epochs):
    for feature, label in training_data:
        # Clear out the gradient
        optimizer.zero_grad()
        # Forward pass through the network
        output = model(feature)
        # Calculate the loss
        loss = loss_fn(output, label)
        # Compute the gradient
        loss.backward()
        # Update the weights
        optimizer.step()
Let's take these steps one-by-one to try to figure out how the parameter state is shared between the three objects.
1. Clear out the gradient
In PyTorch, gradients accumulate: each call to backward() adds its gradients to whatever is already stored, since autograd knows nothing about the training loop. Hence, to initiate a new forward and backward pass, we need to start by clearing out any gradients that may have previously been calculated:
optimizer.zero_grad()
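To see the accumulation in action, consider a minimal sketch with a single parameter w: calling backward() twice without an intervening zero_grad() sums the gradients.
# Two backward passes without zeroing: the gradients add up
w = torch.tensor([1.0], requires_grad=True)
(2 * w).sum().backward()
print(w.grad)
>>> tensor([2.])
(2 * w).sum().backward()  # no zero_grad() in between
print(w.grad)             # 2 + 2: the second gradient accumulated
>>> tensor([4.])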
This method is not particularly intuitive. Why should the optimizer own this method, instead of the module container model (which initialized the weights)? Why is the optimizer responsible for clearing the gradient, and not the model? The situation will become even more confusing later on, when we consider how the gradient is calculated.
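Strictly speaking, the optimizer doesn't store the gradient at all: each gradient lives on the corresponding parameter tensor, inside the model, as its .grad attribute -- and nn.Module even exposes a zero_grad() method of its own. A short check (assuming a backward pass has already run, say from a previous iteration) illustrates the point:
# Gradients are stored on the parameter tensors themselves
print(model[0].weight.grad.shape)
>>> torch.Size([64, 128])
# The module container can clear them too, which makes the optimizer's
# ownership of this step all the more puzzling
model.zero_grad()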
2. Forward pass through the network
The forward pass is straightforward. We can treat our module container model as a function and use it to generate the output of the network:
output = model(feature)
No funny business here, just input and output.
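For a concrete sense of the shapes involved, here's a sketch assuming a batch containing a single input vector:
feature = torch.randn(1, 128)  # a batch with one 128-feature input
output = model(feature)
print(output.shape)            # one row of 10 log-probabilities
>>> torch.Size([1, 10])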
3. Calculate the loss
Calculating the loss is also relatively straightforward. We can treat our NLLLoss
object as a function and apply it to the output and the labels to calculate
the loss:
loss = loss_fn(output, label)
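It's worth noting that the loss is itself a scalar tensor whose grad_fn ties it back into the computation graph running through the model's layers -- a detail that matters for the next step:
print(loss.shape)    # a scalar: torch.Size([])
print(loss.grad_fn)  # e.g. <NllLossBackward0 ...>; the exact name varies by version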
4. Compute the gradient
Here, things get particularly weird. Calling backward() on the loss function's output computes the gradient across the network's layers:
loss.backward()
From a mathematical perspective, it makes some sense that the output of the loss function owns the backward() method: after all, the gradient represents the partial derivative of the loss function with respect to the network's weights. Yet by assigning backward() to the loss function output, PyTorch obscures the fact that it is propagating the gradient backward through the layers contained by model -- layers that are connected to loss through the computation graph that autograd recorded during the forward pass. Instead of making this shared state clear, the API obscures it, returning None and mutating gradient state in-place.
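One way to make the hidden mutation visible is to watch a parameter's .grad before and after the call -- a sketch, standing in for the bare loss.backward() in the training loop above:
layer = model[0]
print(layer.weight.grad)        # None (or zeros) before the backward pass
loss.backward()
print(layer.weight.grad.shape)  # now populated, in place, on the model
>>> torch.Size([64, 128])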
5. Update the weights
With the gradient in hand, it's time to update the weights by taking a small step in the direction of the negative gradient. Which of our objects should be responsible for this step? You might guess the module container model, which stores the parameter state, but instead this duty falls on the optimizer:
optimizer.step()
Defining step() as an optimizer method with no inputs or outputs obfuscates the consequences of the operation. What was the shape of the gradient? How large were the weights that it updated? There's no way to know without inspecting the optimizer object, which has further obscured the underlying parameter state.
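To confirm that step() mutates the model's weights in place, we can snapshot a weight tensor before the call -- a sketch, assuming a nonzero gradient has already been computed:
before = model[0].weight.detach().clone()
optimizer.step()
after = model[0].weight.detach()
print(torch.equal(before, after))  # False: the weights were mutated in place
>>> False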
In sum: A lack of clear ownership of state
The full PyTorch training pass reveals that none of the required objects has full ownership of the model's weights or gradient. Instead, model, loss, and optimizer all mutate the parameter state in different ways.
Why might PyTorch have designed its API such that all three objects -- the module container, the loss function, and the optimization algorithm -- mutate the same model parameter state? Are there benefits to this approach, or is it simply bad API design?
The benefits of a composable API
One advantage to spreading ownership of state between the model, loss function, and optimizer is that it makes PyTorch more composable, and hence in some sense more flexible. In PyTorch, loss functions and optimization algorithms aren't tightly coupled to neural networks, since the three types of objects are fully separate from one another. One interesting consequence of this is that PyTorch can be used to compute gradients for any composition of functions that returns a scalar, not just neural networks, using the backward() method of a tensor and the autograd module:
# Set the random seed for PyTorch to enforce reproducibility
torch.manual_seed(0)
# Define two simple functions to compose, returning a scalar
def f(x): return torch.log(x)
def g(x): return torch.mean(x)
# Initialize a 3 x 5 tensor and mark it as requiring a gradient
x = torch.rand((3, 5), requires_grad=True)
# Calculate the output of the function composition for the input tensor
output = g(f(x))
# Compute the gradients for the function composition
output.backward()
Now, inspecting x shows the gradient:
print(x.grad)
>>> tensor([[0.1343, 0.0868, 0.7535, 0.5049, 0.2169],
[0.1051, 0.1360, 0.0744, 0.1463, 0.1054],
[0.1911, 0.1660, 2.9861, 0.3948, 0.2268]])
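We can even verify the result by hand. Since g(f(x)) is the mean of log(x) over 15 elements, the partial derivative with respect to each element is 1 / (15 * x):
# d/dx of mean(log(x)) over 15 elements is 1 / (15 * x), elementwise
print(torch.allclose(x.grad, 1.0 / (15 * x)))
>>> True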
This composability makes PyTorch useful for gradient computation in a variety of circumstances, not just neural network training. But it comes at a cost of clarity for working with neural networks -- supposedly the primary purpose of PyTorch.
Clearing up the API for neural nets
One simple way to clarify the API for neural nets would be to fully encapsulate the weights and the gradient in the module container object model. We could accomplish this by requiring the module container to be explicitly initialized with a loss function and an optimization algorithm:
# Initialize the model as before, but specify a loss function and optimizer
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
    torch.nn.LogSoftmax(dim=1),
    loss=torch.nn.NLLLoss(),
    optimizer=torch.optim.Adam(lr=0.01)
)
Then, the training loop would make it clear that the module container was in charge of updating the weights:
for epoch in range(epochs):
    for feature, label in training_data:
        # Run a forward pass, compute the gradient, and update the weights
        model.step(feature, label)
Notice how model now owns all of the state update methods: zeroing out the gradient with zero_grad() and taking the weight update step with step() both belong to the module container model. Implicitly, the step() method could use the module's loss function and optimizer to calculate the gradient and update the weights -- something along the lines of the following pseudocode:
class Module:
    def step(self, feature, label):
        # Clear out the old gradient, then run a full training step:
        # forward pass, loss, backward pass, and weight update
        self.optimizer.zero_grad()
        output = self.forward(feature)
        loss = self.loss(output, label)
        loss.backward()
        self.optimizer.step()
With this API, the weights and their respective gradients more clearly "belong" to the network, not to its loss function or optimization algorithm.
In fact, this API winds up being more similar to the one that scikit-learn uses for neural networks. In scikit-learn, the entire training loop above is replaced by the model.fit(X, y) method, which takes care of looping over the training features X and labels y. In this way, scikit-learn takes the abstraction one layer higher and encapsulates all training state in the model object, including the training epochs.
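In the hypothetical API sketched above, a scikit-learn-style fit() would be a thin wrapper around step() -- something like the following pseudocode, with all names hypothetical:
class Module:
    def fit(self, X, y, epochs=10):
        # Encapsulate the entire training loop, delegating each
        # weight update to step()
        for epoch in range(epochs):
            for feature, label in zip(X, y):
                self.step(feature, label)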
And yet, by encapsulating state in the module container model, the API still suffers from a lack of clarity in terms of how zero_grad(), backward(), and step() are adjusting the module container's internal parameter state. In order to inspect the gradient at a particular point in the training loop, for example, the user would need to drop some form of debugger at the right moment and introspect the internal state of model. Is there a way to make the state mutations even clearer, such that this kind of detailed debugging could receive first-class support?
Functional programming with PyTorch
One way to provide clearer state would be to make the module container methods more functional by returning output instead of producing side effects:
# Create a module container as before, by specifying the network's linear operations,
# but assume that it returns a factory function instead of a model
ModelFactory = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
    torch.nn.LogSoftmax(dim=1)
)
# Initialize the model with empty parameters
# (assume that Parameters() initializes empty weights)
model = ModelFactory(torch.nn.Parameters())
# Initialize a loss function
loss_fn = torch.nn.NLLLoss()
for epoch in range(epochs):
    for feature, label in training_data:
        # Forward pass through the network
        output = model(feature)
        # Calculate the loss
        loss = loss_fn(output, label)
        # Compute the gradient
        gradient = model.backward(loss)
        # Update the weights
        parameters = model.step(gradient)
        # Initialize a new model with the weights
        model = ModelFactory(parameters)
The training pass is now more verbose, but as a result, its output is inspectable at every step, and no state is shared between objects. Each step in the training loop -- passing input forward, calculating loss, computing gradients, and updating the weights -- produces immutable output that can be passed into the next step.
There are a few critical differences with this API:
- torch.nn.Sequential must be a factory function instead of a straightforward constructor. It produces functions (like ModelFactory) that will themselves produce module containers when initialized with a set of parameters.
- torch.nn.Parameter must be adjusted slightly such that it can automatically represent empty weights when passed into a module container constructor function.
- Parameters can now be made immutable. Instead of updating weights in-place, for instance, the step() method produces a new set of weights based on the gradient.
- The module container model must now own the backward() and step() methods, in order to produce the gradient and the updated parameters, respectively. This makes intuitive sense, since the gradient and the updated weights depend on the existing weights, which are fully encapsulated in the module container model (see the sketch after this list).
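To make the last point concrete, here's a rough pseudocode sketch of what such a module container might look like internally. The class and method bodies are hypothetical, but note that the real torch.autograd.grad function already returns gradients as values rather than mutating .grad in place, so autograd itself could support this style:
class FunctionalModule:
    def __init__(self, layers, parameters):
        self.layers = layers
        self.parameters = parameters  # treated as immutable

    def backward(self, loss):
        # Return the gradient as a value instead of mutating
        # each parameter's .grad in place
        return torch.autograd.grad(loss, self.parameters)

    def step(self, gradient, lr=0.01):
        # Produce a fresh set of weights via a plain SGD update;
        # the old weights are left untouched
        return [p - lr * g for p, g in zip(self.parameters, gradient)]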
In all, I think this API makes more sense than the current API for feedforward networks. Inspired by functional programming, it makes the ownership of state in the network more transparent; it converts the module parameters to immutable objects, making their transitions clearer and safer; and it provides transparent output for every step of the training pass, making each operation both more flexible and more open to inspection.
A few caveats
There are some reasons that you might not prefer this more functional PyTorch API to the one that currently exists.
For one thing, an API with fully immutable state would mean that memory would need to be reallocated for the weights during every pass through the network. Without careful garbage collection, this might cause an explosion in memory when training very deep networks with large hidden layers.
Another important consideration is that I've only examined simple feedforward networks here. It's possible that recurrent networks and networks with convolutional layers have substantially different requirements such that this API would need a major redesign to support them.
Finally, I'm still relatively new to PyTorch. There's a good chance that I'm missing some basic motivations behind the design of the existing API. If you think this is true, get in touch with me on Twitter! I'd love to know what I'm missing.