Introduction to Superposition

Superposition refers to the phenomenon where neural networks represent more features than they have neurons or dimensions. This occurs because networks often need to track far more features of their inputs than they have dimensions available to dedicate one to each feature individually.

The concept was formalized in the paper "Toy Models of Superposition" (Elhage et al., 2022) from Anthropic, which demonstrated how networks can encode multiple features in a lower-dimensional space by exploiting the geometry of feature co-occurrence patterns.
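To get a feel for the geometry involved, here is a tiny numeric illustration (not taken from the paper): three unit-length feature directions packed into a two-dimensional space. No pair can be orthogonal, but spacing the directions 120 degrees apart keeps the interference between any two features bounded.

    import numpy as np

    # Three unit-length feature directions squeezed into 2 dimensions,
    # spaced 120 degrees apart (an illustrative arrangement, not a trained model).
    angles = np.deg2rad([0.0, 120.0, 240.0])
    features = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

    # The Gram matrix shows the interference: 1.0 on the diagonal (each feature
    # with itself) and -0.5 between any two distinct features, rather than the
    # 0.0 that full orthogonality would give.
    print(np.round(features @ features.T, 2))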

Feature Competition

When a neural network has fewer dimensions than the features it needs to represent, these features must "compete" for representation space. The network learns to allocate its limited representational capacity efficiently.

Key factors that influence which features get represented include the following (a short sketch after the list shows how frequency and importance can enter a toy training objective):

  • Frequency: More common features are more likely to be represented
  • Importance: Features that strongly affect the loss function are prioritized
  • Correlation: How often features co-occur matters; features that rarely appear together can share representational space at little cost
  • Orthogonality: Features that the network can place in nearly orthogonal directions interfere less and are easier to separate
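The following minimal sketch uses invented numbers to show how the first two factors can enter a toy training objective: each feature fires with its own probability, and reconstruction errors are weighted by a per-feature importance, so frequent and important features dominate the loss and win the competition for the limited dimensions.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, n_dims, batch = 5, 2, 1024

    # Hypothetical per-feature frequencies and importance weights (illustrative values).
    freq = np.array([0.30, 0.20, 0.10, 0.05, 0.01])   # how often each feature fires
    importance = np.array([1.0, 0.8, 0.6, 0.4, 0.2])  # how much the loss cares about it

    # Sparse inputs: feature i is active with probability freq[i], value in [0, 1).
    active = rng.random((batch, n_features)) < freq
    x = active * rng.random((batch, n_features))

    # A candidate embedding W squeezes 5 features into 2 dimensions; the model must
    # reconstruct x from the 2-dimensional bottleneck as (x W) W^T.
    W = rng.normal(scale=0.5, size=(n_features, n_dims))
    x_hat = x @ W @ W.T

    # Importance-weighted reconstruction loss: errors on frequent, important features
    # cost more, so those features are the ones the limited space gets spent on.
    loss = np.mean(importance * (x - x_hat) ** 2)
    print(f"weighted reconstruction loss: {loss:.4f}")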

Polysemanticity

Polysemanticity is a direct consequence of superposition. It refers to the phenomenon where individual neurons or network directions respond to multiple unrelated features.

In a polysemantic network:

  • Single neurons may respond to multiple semantically distinct concepts
  • Feature representations are distributed across many neurons
  • There may not be a clear one-to-one mapping between network components and human-interpretable concepts

This makes interpretation challenging, as we cannot simply analyze individual neurons to understand what the network is representing.
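As a rough illustration (with made-up feature directions rather than a trained network), the sketch below checks how many feature directions project strongly onto each neuron axis; with more features than neurons, most neurons end up responding to several features.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical setup: 6 unit-norm feature directions embedded in a 2-neuron space.
    W = rng.normal(size=(6, 2))
    W /= np.linalg.norm(W, axis=1, keepdims=True)

    # Call a neuron "polysemantic" if several distinct features have a large
    # projection onto its axis (the 0.4 threshold is arbitrary, for illustration).
    threshold = 0.4
    for neuron in range(W.shape[1]):
        responding = np.where(np.abs(W[:, neuron]) > threshold)[0]
        print(f"neuron {neuron} responds to features {responding.tolist()}")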

Toy Models of Superposition

Researchers have developed simplified models to study superposition. These toy models help illustrate how networks can embed many features in lower-dimensional spaces.

A typical toy model might involve:

  1. Generating synthetic data with a known number of sparse features
  2. Training a model with fewer dimensions than features
  3. Analyzing how the model represents these features in its limited space

These experiments have revealed that networks can use clever geometric arrangements to encode features efficiently, often exploiting properties like sparsity (features rarely appearing simultaneously).
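The sketch below shows one way such a toy model can be set up in PyTorch, loosely following the three steps above; the sizes, hyperparameters, and the specific ReLU read-out are illustrative choices, not a definitive implementation. Sparse synthetic features are squeezed through a bottleneck smaller than the number of features, reconstructed with the transposed embedding, and the learned embedding is then inspected to see which features were represented and how much they interfere.

    import torch

    torch.manual_seed(0)
    n_features, n_dims, batch, steps = 20, 5, 1024, 3000
    sparsity = 0.05  # each feature is active roughly 5% of the time

    # W maps features into the smaller hidden space; its transpose maps back out.
    W = torch.nn.Parameter(0.1 * torch.randn(n_features, n_dims))
    b = torch.nn.Parameter(torch.zeros(n_features))
    opt = torch.optim.Adam([W, b], lr=1e-3)

    for step in range(steps):
        # Step 1: synthetic sparse data -- each feature fires independently.
        active = (torch.rand(batch, n_features) < sparsity).float()
        x = active * torch.rand(batch, n_features)

        # Step 2: a model with fewer dimensions than features.
        h = x @ W                        # (batch, n_dims) bottleneck
        x_hat = torch.relu(h @ W.T + b)  # reconstruction

        # Train by minimizing reconstruction error.
        loss = ((x - x_hat) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Step 3: analysis. A feature with a large embedding norm was "represented";
    # off-diagonal entries of W W^T measure interference between features that
    # share the 5 available dimensions.
    with torch.no_grad():
        print("feature embedding norms:", W.norm(dim=1))
        print("sample interference row (W W^T)[0]:", (W @ W.T)[0])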

Implications for Interpretability

Superposition poses several challenges for interpretability research:

  • Direct neuron analysis may reveal misleading or incomplete information (see the sketch after this list)
  • Features may be encoded in complex, distributed patterns across many neurons
  • Simple probing techniques, particularly probes of individual neurons, may fail to detect important features
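To make the first point concrete, here is a small synthetic sketch (the data and numbers are invented for illustration): a binary feature is written along a direction shared between two neurons, so a probe on either neuron alone detects it poorly, while a probe on the full two-dimensional activation recovers it far more reliably.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 2000

    # A binary feature written along the distributed direction (2, -2), on top of a
    # larger shared activation along (1, 1) plus per-neuron noise (values invented).
    feature = rng.integers(0, 2, size=n)
    shared = rng.normal(scale=3.0, size=(n, 1)) * np.array([[1.0, 1.0]])
    noise = rng.normal(scale=1.0, size=(n, 2))
    h = feature[:, None] * np.array([[2.0, -2.0]]) + shared + noise

    # Probing each neuron on its own misses most of the signal; probing the full
    # activation vector recovers the distributed direction easily.
    for name, inputs in [("neuron 0 only", h[:, :1]),
                         ("neuron 1 only", h[:, 1:]),
                         ("both neurons", h)]:
        acc = LogisticRegression().fit(inputs, feature).score(inputs, feature)
        print(f"probe on {name}: accuracy {acc:.2f}")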

However, understanding superposition also provides opportunities:

  • It suggests focusing on finding the right basis for analysis, rather than examining individual neurons
  • Techniques like Sparse Autoencoders (covered in the next chapter) can help extract features from superposed representations
  • Knowledge of superposition patterns can inform better training and architecture design

Conclusion

Superposition is a fundamental property of neural networks that arises when they need to represent more features than they have dimensions. This leads to polysemantic neurons and distributed representations that complicate interpretability efforts.

Understanding superposition is essential for developing effective methods to interpret neural networks, especially large language models, whose billions of parameters may encode far more features than the models have dimensions. The next chapter will explore how Sparse Autoencoders can help address this challenge by disentangling these superposed representations.