Using Sparse Autoencoders to disentangle superposition in neural networks
As we saw in the previous chapter, neural networks often represent information in a superposed manner, with many features sharing the same neurons or dimensions. This polysemanticity makes interpretation challenging. Sparse Autoencoders (SAEs) are a powerful tool designed to address this challenge by disentangling these superposed representations.
The goal of an SAE is to transform the polysemantic, distributed representations in a neural network into a monosemantic, sparse representation, where each feature corresponds to a specific, interpretable concept.
A Sparse Autoencoder consists of two parts: an encoder that projects a layer's activations into a much larger, overcomplete set of latent features, and a decoder that reconstructs the original activations from those features.
The key constraints in training an SAE are reconstruction fidelity, so that the features actually capture what the network represents, and sparsity, so that only a small fraction of features are active for any given input (typically enforced with an L1 penalty on the feature activations).
These constraints ensure that the SAE learns a dictionary of interpretable features that can be activated individually or in small groups to represent the network's internal states.
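To make this concrete, here is a minimal sketch of an SAE in PyTorch. The layer sizes, the L1 coefficient, and all names are illustrative assumptions rather than a reference implementation; the essential ingredients are an overcomplete encoder, a decoder, and a loss that combines reconstruction error with a sparsity penalty.

```python
# Minimal Sparse Autoencoder sketch (PyTorch). Sizes and hyperparameters are
# illustrative assumptions, not values from any particular paper.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Encoder: maps model activations into an overcomplete feature space.
        self.encoder = nn.Linear(d_model, d_hidden)
        # Decoder: reconstructs the activations from the sparse features.
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; combined with the
        # sparsity penalty, most of them end up exactly zero.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction term: the SAE must faithfully reproduce the activations.
    mse = torch.mean((reconstruction - activations) ** 2)
    # Sparsity term: an L1 penalty pushes most feature activations to zero.
    sparsity = l1_coeff * features.abs().sum(dim=-1).mean()
    return mse + sparsity

# Usage on a batch of stand-in activations (a real run would use activations
# recorded from a specific layer of the model being studied).
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
batch = torch.randn(64, 768)
reconstruction, features = sae(batch)
loss = sae_loss(reconstruction, batch, features)
loss.backward()
```

Under this framing, the decoder's weight matrix acts as the learned dictionary: each column (one per feature) is a direction in activation space, and the sparse feature activations say how much of each direction is present in a given activation.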
After training an SAE on a large dataset of activations from a neural network, we can analyze the features it has learned, for example by collecting the inputs that most strongly activate each feature and looking for a common theme.
Unlike analyzing individual neurons, SAE features often correspond to meaningful, human-interpretable concepts because they're designed to disentangle the superposed representations.
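One common way to assign meaning to a feature is to collect the dataset examples on which it fires most strongly and read them side by side. The sketch below assumes the SparseAutoencoder from the earlier example, plus a pre-computed tensor of activations paired with their source texts; the function and variable names are hypothetical.

```python
# Sketch: find the top-k dataset examples for one SAE feature. Assumes `sae`
# is a trained SparseAutoencoder and `activations[i]` was recorded while the
# model processed `texts[i]`.
import torch

def top_activating_examples(sae, activations, texts, feature_idx, k=10):
    with torch.no_grad():
        # Encode all stored activations into SAE feature space.
        features = torch.relu(sae.encoder(activations))  # [n_examples, d_hidden]
    # Rank examples by how strongly they fire the chosen feature.
    values, indices = features[:, feature_idx].topk(k)
    return [(texts[i], values[j].item()) for j, i in enumerate(indices.tolist())]

# Reading the returned snippets side by side is how a feature gets a human
# label, e.g. "fires on legal terminology" or "fires on base64 strings".
```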
Sparse Autoencoders can be used for various interpretability tasks:
By tracking which SAE features activate in response to specific inputs, researchers can identify the computational circuits within the network.
SAEs can help determine which features contribute to specific predictions, providing insight into how the model makes decisions.
Once interpretable features are identified, it's possible to modify the model's behavior by intervening on specific features, potentially enabling safer AI systems.
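As a rough illustration of such an intervention, the sketch below encodes an activation vector, clamps a single SAE feature to a chosen value, and decodes back into activation space; in a real experiment the edited activation would then be patched into the model's forward pass. The feature index and strength used here are hypothetical.

```python
# Sketch: steer the model by clamping one SAE feature. Assumes `sae` is a
# trained SparseAutoencoder as in the earlier sketches.
import torch

def steer_with_feature(sae, activation, feature_idx, value):
    with torch.no_grad():
        features = torch.relu(sae.encoder(activation))
        features[..., feature_idx] = value   # clamp the chosen feature
        return sae.decoder(features)         # edited activation vector

# Example (hypothetical feature index and strength):
# edited = steer_with_feature(sae, activation_vector, feature_idx=123, value=5.0)
# `edited` would then replace the original activation in the forward pass.
```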
Recent advances in SAE research have focused on scaling the technique from small models to frontier language models. Papers like "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" and "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" demonstrate the potential of SAEs for understanding increasingly complex models.
Sparse Autoencoders represent one of the most promising approaches for addressing the superposition problem in neural networks. By transforming polysemantic representations into monosemantic features, SAEs provide a powerful tool for mechanistic interpretability.
As research in this area continues to advance, SAEs may play a crucial role in building more transparent, trustworthy, and alignable AI systems. Understanding the internal workings of neural networks is not just an academic pursuit but a practical necessity as these systems become increasingly powerful and integrated into our society.