Rethinking Backpropagation: Thoughts on What's Wrong with Backpropagation
As a young researcher, I've often pondered the limitations of backpropagation, especially when compared with how learning occurs in the human brain. While backpropagation has been the workhorse of deep learning, it isn't without flaws. In this post, I aim to share some thoughts on these shortcomings from first principles.
The Backpropagation Blues
First off, what does backpropagation actually do? In essence, it's a learning algorithm that computes the gradients of a loss function with respect to a neural network's weights, allowing us to optimize those weights using gradient descent.
Mathematical Hitch
Consider a neuron j in a neural network, receiving inputs x_i with corresponding weights w_ij. The net input to neuron j is given by:
$$\mathrm{net}_j = \sum_i w_{ij} x_i$$
We then add a bias θ_j, which accounts for the neuron's threshold, and pass the result through an activation function ϕ to get the neuron's output o_j:
$$o_j = \phi(\mathrm{net}_j + \theta_j)$$
Backpropagation then computes the gradient of the loss L with respect to the weights w_ij:
$$\frac{\partial L}{\partial w_{ij}} = \delta_j x_i$$
where δ_j = ∂L/∂net_j is the error signal for neuron j, computed recursively from the output layer backwards.
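The single-neuron gradient described above can be checked numerically. The following is only a minimal sketch: it assumes a sigmoid for ϕ and a squared-error loss, neither of which is specified in this post, and the input values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, theta):
    """Return (net_j, o_j) for one neuron j with inputs x and weights w."""
    net = sum(w_i * x_i for w_i, x_i in zip(w, x))  # net_j = sum_i w_ij * x_i
    out = sigmoid(net + theta)                       # o_j = phi(net_j + theta_j)
    return net, out

def loss(out, target):
    # Assumed loss for this sketch: L = 0.5 * (o_j - target)^2
    return 0.5 * (out - target) ** 2

def backprop_grad(x, w, theta, target):
    """dL/dw_ij = delta_j * x_i, with delta_j = dL/dnet_j."""
    _, out = forward(x, w, theta)
    # delta_j = dL/do_j * phi'(net_j + theta_j); for a sigmoid, phi' = o * (1 - o)
    delta = (out - target) * out * (1.0 - out)
    return [delta * x_i for x_i in x]

def numeric_grad(x, w, theta, target, eps=1e-6):
    """Central finite differences, used only to cross-check the analytic gradient."""
    grads = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += eps
        w_minus = list(w)
        w_minus[i] -= eps
        l_plus = loss(forward(x, w_plus, theta)[1], target)
        l_minus = loss(forward(x, w_minus, theta)[1], target)
        grads.append((l_plus - l_minus) / (2 * eps))
    return grads

# Illustrative values, not taken from the post
x = [0.5, -1.0, 2.0]
w = [0.1, 0.4, -0.3]
theta = 0.2
target = 1.0

analytic = backprop_grad(x, w, theta, target)
numeric = numeric_grad(x, w, theta, target)
```

If the formula is right, `analytic` and `numeric` should agree to within numerical error. Note that the analytic version needs the derivative of ϕ in closed form, while the finite-difference version only needs to evaluate the loss.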
But what happens if the activation function ϕ is non-differentiable, or if part of the forward computation is an unknown function (a black box)? The error signal δ depends on the derivative ϕ′, so if we can't compute ϕ′, the entire process grinds to a halt.
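One way to see the problem is to swap in a hard threshold for the activation. This is a hedged toy example, not a claim about any particular network: the step function, the squared-error loss, and the numbers are all assumptions chosen for illustration. The derivative of a step is zero everywhere except at the jump, where it is undefined, so the gradient carries no information even when the output is wrong.

```python
def step(z):
    """Hard threshold: derivative is 0 away from z = 0 and undefined at z = 0."""
    return 1.0 if z >= 0.0 else 0.0

def neuron_loss(w, x, theta, target):
    net = sum(w_i * x_i for w_i, x_i in zip(w, x))
    out = step(net + theta)
    return 0.5 * (out - target) ** 2  # assumed squared-error loss

def finite_diff_grad(w, x, theta, target, eps=1e-4):
    """Estimate dL/dw_i by nudging each weight slightly in both directions."""
    grads = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += eps
        w_minus = list(w)
        w_minus[i] -= eps
        diff = neuron_loss(w_plus, x, theta, target) - neuron_loss(w_minus, x, theta, target)
        grads.append(diff / (2 * eps))
    return grads

# Illustrative values: the neuron outputs 0 but the target is 1
x = [0.5, -1.0, 2.0]
w = [0.1, 0.4, -0.3]
theta = 0.2
target = 1.0

current_loss = neuron_loss(w, x, theta, target)
step_grads = finite_diff_grad(w, x, theta, target)
```

Here the loss is non-zero, yet every entry of `step_grads` is zero, because small weight changes never move the output across the threshold. Gradient descent has nothing to follow.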
The Brain Does not Backpropagate
Here’s where things get interesting—and revealing. As a model for how the brain’s cortex learns, backpropagation is not biologically plausible.
"Despite considerable efforts to invent biologically plausible implementations of backpropagation, there is no convincing evidence that the cortex explicitly propagates error derivatives or stores neural activities for use in a subsequent backward pass" - Geoffrey Hinton (The Forward-Forward Algorithm)
Biological Implausibilities of Backpropagation
- Backpropagation assumes that the backward pass uses the same weights as the forward pass (the so-called weight transport problem). In biological terms, each feedback connection would need to know the strength of the corresponding forward synapse, and there's no evidence in neuroscience that neurons communicate synaptic weights backward like this.
- Backpropagation requires storing intermediate activations during the forward pass so that gradients can be computed during the backward pass. For large networks or long sequences (as in recurrent neural networks), this is memory-intensive, and there is no convincing evidence that the cortex stores neural activities this way.
- In tasks involving sequences, backprop through time (BPTT) unfolds the network over time, increasing computational and memory demands. The brain, however, processes temporal information seamlessly and in real-time, without needing to "unfold" computations.
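The storage requirement in the last two points can be made concrete with a tiny recurrent network. This is a sketch under stated assumptions: a single scalar hidden unit with a tanh activation, a loss defined only on the final hidden state, and hypothetical names (`bptt_grad_w`, `final_loss`) that don't come from any library.

```python
import math

def final_loss(xs, w, u, h0=0.0):
    """L = 0.5 * h_T^2 for the recurrence h_t = tanh(w * h_{t-1} + u * x_t)."""
    h = h0
    for x in xs:
        h = math.tanh(w * h + u * x)
    return 0.5 * h ** 2

def bptt_grad_w(xs, w, u, h0=0.0):
    """Return (dL/dw, number of stored hidden states) using BPTT."""
    # Forward pass: every hidden state is kept, because the backward pass needs it.
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))

    # Backward pass: walk the unrolled network from the last step to the first.
    grad_w = 0.0
    dL_dh = hs[-1]  # dL/dh_T for L = 0.5 * h_T^2
    for t in range(len(xs), 0, -1):
        dL_dpre = dL_dh * (1.0 - hs[t] ** 2)  # tanh'(pre) = 1 - tanh(pre)^2
        grad_w += dL_dpre * hs[t - 1]         # this step's contribution to dL/dw
        dL_dh = dL_dpre * w                   # pass the error to the previous step
    return grad_w, len(hs)

# Illustrative sequence and weights
xs = [0.5, -0.3, 0.8, 0.1]
w, u = 0.7, 1.2

grad, stored = bptt_grad_w(xs, w, u)
```

`stored` is one more than the sequence length: the memory needed grows linearly with the number of time steps, and nothing can be updated until the whole sequence has been seen and replayed backwards.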
How Does The Brain Learn?
Many neuroscientists believe that the brain relies on local learning rules such as Hebbian learning, often summarized as "cells that fire together, wire together." Synaptic strength changes are based on the simultaneous activation of pre- and post-synaptic neurons.
This local learning doesn't require global error signals or the precise knowledge of all synaptic weights throughout the network.
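In its simplest textbook form, a Hebbian rule changes each weight in proportion to the product of pre- and post-synaptic activity. The sketch below assumes that plain form, with a hypothetical learning rate `lr` and made-up activity values; real synaptic plasticity is far richer than this.

```python
def hebbian_update(weights, pre, post, lr=0.1):
    """Return new weights after one plain Hebbian step.

    weights[j][i] connects presynaptic unit i to postsynaptic unit j.
    Each change uses only the two activities at that synapse:
    delta_w[j][i] = lr * post[j] * pre[i]
    """
    return [
        [w_ji + lr * post[j] * pre[i] for i, w_ji in enumerate(row)]
        for j, row in enumerate(weights)
    ]

# Illustrative activities: presynaptic units 0 and 2 fire, postsynaptic unit 0 fires
pre = [1.0, 0.0, 1.0]
post = [1.0, 0.0]
weights = [[0.0, 0.0, 0.0] for _ in post]

updated = hebbian_update(weights, pre, post)
```

Only the synapses where both sides were active get stronger. No loss function, error signal, or knowledge of other weights appears anywhere in the update. One known caveat of this plain rule is that weights can only grow under repeated co-activation, so practical variants add some form of normalization or decay.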
Reinforcement Learning: Not the Hero We Need
When backprop fails, you might think reinforcement learning (RL) could save the day. After all, RL doesn't require gradient information from intermediate computations, right?
Well, not so fast.
The Variance Villain
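RL algorithms often estimate gradients from rewards alone, for example by perturbing weights or activities and correlating those perturbations with changes in reward. These estimates have high variance, and the variance grows with the number of parameters being perturbed at once, so for large neural networks it becomes unwieldy.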
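The variance problem can be illustrated with a toy experiment. This is not a full RL algorithm: it is a simple weight-perturbation estimator standing in for a reward-based gradient estimate, applied to an assumed quadratic loss L(w) = 0.5 * sum(w_i^2). The function names, sample count, and noise scale are all illustrative choices.

```python
import random

def perturbation_grad_sample(w, sigma, rng):
    """One noisy gradient estimate from a single random perturbation of all weights."""
    eps = [rng.gauss(0.0, 1.0) for _ in w]
    base = 0.5 * sum(w_i ** 2 for w_i in w)
    perturbed = 0.5 * sum((w_i + sigma * e) ** 2 for w_i, e in zip(w, eps))
    # Only a scalar change in loss is observed, like a scalar reward signal.
    scale = (perturbed - base) / sigma
    # Credit each weight in proportion to how it was perturbed.
    return [scale * e for e in eps]

def first_coord_variance(dim, n_samples=2000, sigma=0.01, seed=0):
    """Empirical variance of the estimate for the first weight's gradient."""
    rng = random.Random(seed)
    w = [1.0] * dim
    samples = [perturbation_grad_sample(w, sigma, rng)[0] for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    return sum((s - mean) ** 2 for s in samples) / n_samples

var_small = first_coord_variance(dim=2)
var_large = first_coord_variance(dim=200)
```

The true gradient for the first weight is the same in both cases, but the estimate is much noisier with 200 weights than with 2, because the single scalar signal is shared across every perturbed weight. Backpropagation, by contrast, gives each weight its own exact derivative.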
What's Next?
Backpropagation has served us well, but it's not without its warts. By rethinking our approach and taking closer inspiration from the brain, we might discover entirely new algorithms for training neural networks.
Could local, biologically plausible learning mechanisms replace backpropagation? How do we build algorithms that embrace uncertainty, unknowns, adaptability and real-world constraints?
If the brain doesn't backpropagate, why should neural nets? :)