Understanding circuit-based interpretability in neural networks
The "circuits" framework is a key concept in mechanistic interpretability, offering a way to understand how neural networks process information through connected components that collectively implement specific functions.
This chapter explores how circuit-based interpretability helps us reverse-engineer the computations performed by neural networks, with a focus on transformer architectures used in large language models.
A neural circuit is a subset of a neural network that implements a specific, identifiable function. Circuits consist of neurons (or groups of neurons) connected across layers that collectively perform a computation.
Key properties of circuits include modularity (a circuit can be studied somewhat independently of the rest of the network), composability (simple circuits combine to implement more complex behaviors), and, in many cases, universality (similar circuits recur across independently trained models).
The induction circuit is one of the most well-studied circuits in transformer models. It enables a language model to continue repeated patterns: having seen "A B" earlier in the context, the model predicts "B" when it encounters "A" again, so a sequence like "A, B, A, B, A, ?" is completed with "B".
The circuit works by composing two kinds of attention heads across layers. A previous-token head in an earlier layer copies information about each token's predecessor into the current position's representation. An induction head in a later layer then uses that information to attend from the current token back to the position just after the earlier occurrence of the same token, and copies what it finds there into the prediction.
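A common diagnostic for this circuit is to run the model on a sequence of random tokens repeated twice and measure how strongly each position in the second half attends to the token just after its earlier copy. The sketch below is a minimal illustration in plain PyTorch, assuming you already have a single head's attention pattern as a matrix; the induction_score helper and the stand-in random pattern are illustrative, not part of any particular library.

```python
import torch

def induction_score(attention_pattern: torch.Tensor, n: int) -> float:
    """Mean attention from each position in the second half of a repeated
    sequence to its "induction target": the position just after the earlier
    copy of the current token.

    attention_pattern: (2n, 2n) attention weights for one head, computed on
        a sequence whose first n tokens are repeated as tokens n..2n-1.
    """
    seq_len = attention_pattern.shape[0]
    assert seq_len == 2 * n, "expects a sequence made of two identical halves"
    positions = torch.arange(n, seq_len)      # second-half query positions
    targets = positions - n + 1               # token after the earlier copy
    return attention_pattern[positions, targets].mean().item()

# Stand-in pattern for demonstration; a genuine induction head measured on a
# repeated random-token sequence scores close to 1, other heads much lower.
n = 8
fake_pattern = torch.softmax(torch.randn(2 * n, 2 * n), dim=-1)
print(f"induction score: {induction_score(fake_pattern, n):.3f}")
```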
Identifying circuits within neural networks involves several techniques:
Activation patching involves replacing activations at specific locations in the network when processing one input with activations recorded from another input, then observing how the output changes. It helps identify which components are causally important for a given task.
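As a minimal sketch of the mechanics, the PyTorch snippet below uses a toy MLP in place of a real transformer (the model, the patched layer, and the random inputs are all illustrative assumptions): it caches an activation from a run on one input, splices it into a run on a second input via a forward hook, and compares the outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real network; in practice the patched location would be
# e.g. a particular attention head's output or a residual-stream position.
model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 2),
)
patch_layer = model[2]                      # patch this layer's output

clean_input = torch.randn(1, 4)
corrupted_input = torch.randn(1, 4)

# 1. Run the "clean" input and cache the activation at the chosen location.
cached = {}
def cache_hook(module, inputs, output):
    cached["act"] = output.detach()

handle = patch_layer.register_forward_hook(cache_hook)
clean_out = model(clean_input)
handle.remove()

# 2. Run the "corrupted" input, but overwrite that activation with the cached
#    clean value; returning a tensor from a forward hook replaces the output.
def patch_hook(module, inputs, output):
    return cached["act"]

handle = patch_layer.register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

corrupted_out = model(corrupted_input)      # baseline, no patching
print("clean output:    ", clean_out)
print("corrupted output:", corrupted_out)
print("patched output:  ", patched_out)     # how far did it move toward clean?
```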
Path patching (path attribution) quantifies how much specific paths through the network contribute to a particular prediction, helping identify the most important connections between components.
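The chapter does not name a specific algorithm here, but one simple member of this family is direct logit attribution: when the final readout is linear in the residual stream, a target logit decomposes exactly into one term per component's direct path to the output, and the size of each term measures that path's contribution. The toy residual network below (all names, dimensions, and inputs are illustrative assumptions) shows that decomposition.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy residual network: each block adds an update to a shared residual stream,
# and the logits are a linear readout of the final stream (loosely mirroring a
# transformer's residual structure).
d_model, n_blocks, n_classes = 16, 3, 5
blocks = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_blocks)])
unembed = nn.Linear(d_model, n_classes, bias=False)

x = torch.randn(1, d_model)

# Forward pass, keeping each block's additive contribution to the stream.
contributions = []
resid = x
for block in blocks:
    update = torch.relu(block(resid))
    contributions.append(update)
    resid = resid + update

logits = unembed(resid)
target = logits.argmax(dim=-1)

# Because the readout is linear and bias-free, the target logit is exactly the
# sum of one term per direct path into the output: the input itself plus each
# block's update. Larger terms mark more important paths.
readout_dir = unembed.weight[target].squeeze(0)            # (d_model,)
terms = [(x.squeeze(0) @ readout_dir).item()]
terms += [(u.squeeze(0) @ readout_dir).item() for u in contributions]

print("input   -> logit:", terms[0])
for i, t in enumerate(terms[1:]):
    print(f"block {i} -> logit:", t)
print("sum of terms:", sum(terms), "  target logit:", logits[0, target].item())
```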
Attention pattern analysis examines which parts of the input sequence each attention head attends to, revealing which positions influence which others and providing insight into how information flows through the model.
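For example, Hugging Face's transformers library returns per-layer, per-head attention weights when a model is called with output_attentions=True. The sketch below pulls out one head's pattern for GPT-2 and lists the positions it attends to most; the choice of layer, head, and query position is an arbitrary illustration, and exact API details should be checked against your installed library version.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

text = "The cat sat on the mat because the cat was tired"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shape (batch, heads, query, key);
# entry [q, k] is how much query position q attends to key position k.
attn = torch.stack(outputs.attentions)        # (n_layers, 1, n_heads, q, k)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

layer, head = 5, 0                            # arbitrary head for illustration
query_pos = len(tokens) - 1                   # look from the last token
weights = attn[layer, 0, head, query_pos]

# Show where this head sends most of its attention from the chosen position.
for tok, w in sorted(zip(tokens, weights.tolist()), key=lambda p: -p[1])[:5]:
    print(f"{tok!r:>12}  {w:.3f}")
```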
Beyond induction, researchers have identified several other circuits, including the indirect object identification (IOI) circuit in GPT-2 Small, which tracks which name in a sentence should be repeated, and a "greater-than" circuit that lets the model complete spans like "from the year 1732 to the year 17.." with a plausibly larger year.
These examples demonstrate how complex language understanding emerges from specific circuit implementations.
The circuits framework provides a powerful approach for understanding how neural networks implement specific functions. By identifying and studying these circuits, researchers can gain insights into how models process information and make predictions.
This understanding is crucial for improving interpretability, enhancing model safety, and developing more reliable AI systems. The next chapter will explore how neural networks handle information when they have limited capacity through the concept of superposition.