Understanding how neural networks represent more features than they have dimensions
Superposition refers to the phenomenon where neural networks represent more features than they have neurons or dimensions. This occurs because the data contains far more potentially useful features than the network has dimensions available to give each one its own direction.
The concept was formalized in the paper "Toy Models of Superposition" by Anthropic, which demonstrated how networks can encode multiple features in a lower-dimensional space by exploiting the geometry of feature co-occurrence patterns.
When a neural network has fewer dimensions than the number of features it needs to represent, these features must "compete" for representation space. The network learns to allocate its limited representational capacity efficiently.
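To make this concrete, here is a small numpy sketch (with illustrative numbers, not taken from any cited experiment) showing that many more unit-norm feature directions than dimensions can coexist with only modest pairwise interference:

```python
# A minimal sketch: pack many nearly orthogonal feature directions
# into a lower-dimensional space and measure their interference.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 512, 64  # illustrative: 8x more features than dimensions

# Random unit vectors are close to orthogonal in high dimensions.
directions = rng.standard_normal((n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference = cosine similarity between distinct feature directions.
overlaps = directions @ directions.T
np.fill_diagonal(overlaps, 0.0)
print("max |interference|: ", np.abs(overlaps).max())    # well below 1
print("mean |interference|:", np.abs(overlaps).mean())   # on the order of 1/sqrt(n_dims)
```

Because real features tend to be sparse, the residual interference between two directions rarely matters in practice: interfering features are seldom active at the same time.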
Key factors that influence which features get represented include:

- Feature importance: how much representing the feature reduces the training loss.
- Feature sparsity: how rarely the feature is active; sparse features interfere with each other less often, making them cheaper to store in superposition.
- Feature correlation: features that frequently co-occur are harder to store in overlapping directions than features that rarely appear together.
Polysemanticity is a direct consequence of superposition. It refers to the phenomenon where individual neurons or network directions respond to multiple unrelated features.
In a polysemantic network:

- A single neuron may fire for several unrelated concepts.
- Features correspond to directions in activation space, typically linear combinations of many neurons, rather than to individual neurons.
- There is no clean one-to-one mapping between neurons and human-interpretable features.
This makes interpretation challenging, as we cannot simply analyze individual neurons to understand what the network is representing.
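As a rough illustration of why single neurons become hard to read, consider the same kind of random-direction embedding used above (an assumed setup, not taken from the source): almost every feature direction has a noticeable component along any given neuron's axis.

```python
# A toy illustration: when feature directions are not axis-aligned,
# a single neuron "responds" to many different features.
import numpy as np

rng = np.random.default_rng(1)
n_features, n_neurons = 64, 16
directions = rng.standard_normal((n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# "Neuron 0" is the first coordinate of the hidden space; its response to a
# feature is that feature direction's component along this coordinate.
neuron_0_weights = directions[:, 0]
responding = np.abs(neuron_0_weights) > 0.2  # arbitrary threshold for "responds"
print(f"neuron 0 responds to {responding.sum()} of {n_features} features")
```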
Researchers have developed simplified models to study superposition. These toy models help illustrate how networks can embed many features in lower-dimensional spaces.
A typical toy model might involve:

- Synthetic data with a known set of features, each active only a small fraction of the time.
- A small network that projects the features into a lower-dimensional hidden space and then tries to reconstruct them, for example by mapping back through the transpose of the same weight matrix followed by a ReLU.
- An importance-weighted reconstruction loss, so that some features matter more than others.
- Varying feature sparsity and importance to observe when, and in what geometric arrangement, the network stores features in superposition (a minimal training sketch follows below).
These experiments have revealed that networks can use clever geometric arrangements to encode features efficiently, often exploiting properties like sparsity (features rarely appearing simultaneously).
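The sketch below is a minimal PyTorch version of such a toy model, loosely following the ReLU-output setup described in "Toy Models of Superposition"; the particular values of n_features, n_hidden, sparsity, and the importance schedule are illustrative assumptions rather than numbers from the paper.

```python
# A minimal superposition toy model: reconstruct sparse features
# after squeezing them through a lower-dimensional hidden space.
import torch

n_features, n_hidden = 20, 5                                   # more features than dimensions
sparsity = 0.95                                                # each feature is inactive 95% of the time
importance = 0.9 ** torch.arange(n_features, dtype=torch.float32)  # earlier features matter more

W = torch.nn.Parameter(0.1 * torch.randn(n_features, n_hidden))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    # Synthetic sparse features: uniform in [0, 1] when active, zero otherwise.
    x = torch.rand(1024, n_features)
    x = x * (torch.rand(1024, n_features) > sparsity)

    h = x @ W                          # project into the low-dimensional space
    x_hat = torch.relu(h @ W.T + b)    # attempt to reconstruct every feature
    loss = (importance * (x - x_hat) ** 2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

# Off-diagonal entries of W W^T show which features share representational capacity.
print((W @ W.T).detach().round(decimals=2))
```

At high sparsity the learned embedding typically packs many more than n_hidden features into the hidden space, often as antipodal pairs or other near-orthogonal arrangements; at low sparsity it tends to dedicate the available dimensions to only the most important features.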
Superposition poses several challenges for interpretability research:

- Individual neurons are polysemantic, so neuron-by-neuron analysis gives a misleading picture of what the network represents.
- The features a network uses are not axis-aligned, so they must be discovered rather than read off directly.
- The number of represented features can far exceed the number of dimensions, so a complete account of a layer involves more features than there are neurons to inspect.
However, understanding superposition also provides opportunities:

- If features correspond to directions in activation space, they can in principle be recovered with dictionary-learning-style methods rather than neuron-level inspection.
- The theory predicts specific geometric structure, such as near-orthogonal feature directions and characteristic interference patterns, that can be tested for empirically.
- Sparsity, the property that makes superposition possible, is also what makes superposed features separable again, as the sketch below illustrates.
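As a rough preview of why disentanglement is possible at all, the following numpy sketch (an assumed random-direction setup, not a method from this chapter) shows that when only a few superposed features are active, each one can be read back off its own direction with only small interference:

```python
# Sparse, superposed features can be approximately recovered by projecting
# the hidden vector back onto each feature's direction.
import numpy as np

rng = np.random.default_rng(2)
n_features, n_dims = 256, 64
W = rng.standard_normal((n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse non-negative feature activations: only a handful are active at a time.
x = rng.random(n_features) * (rng.random(n_features) < 0.02)

h = x @ W                          # superposed hidden representation
x_hat = np.maximum(h @ W.T, 0)     # read each feature back off its own direction

# The estimates roughly match the true values; the small errors come from
# interference with the few other simultaneously active features.
active = x > 0
print("true activations:   ", np.round(x[active], 2))
print("recovered estimates:", np.round(x_hat[active], 2))
```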
Superposition is a fundamental property of neural networks that arises when they need to represent more features than they have dimensions. This leads to polysemantic neurons and distributed representations that complicate interpretability efforts.
Understanding superposition is essential for developing effective methods to interpret neural networks, especially large language models, whose billions of parameters may encode far more features than their activations have dimensions. The next chapter will explore how Sparse Autoencoders can help address this challenge by disentangling these superposed representations.