This is a quick 4-session course designed to introduce the field of Mechanistic Interpretability for Large Language Models (LLMs). The course includes theoretical materials, introduces Python libraries for interpretability, discusses recent papers in the field (published by organizations such as Anthropic and Google DeepMind), and provides practical exercises.

The field of Mechanistic Interpretability has rapidly gained popularity in recent years, with more than 90 papers accepted at ICML 2024. Its main goal is to understand the logic behind the decisions of machine learning models. This knowledge can be applied to improve transparency and trust in existing models, as well as to better understand how these models learn.

Shortly after the completion of this course, BAISH will organize a Mechanistic Interpretability Hackathon at FGV — details to be announced! We strongly recommend that anyone interested in participating in the Hackathon complete this course as an introduction to the topic; doing so will also improve their chances of winning an award in the competition.

Course Schedule

Session 1 - Transformers and Interpretability

Date: August 23, Friday, starting at 2:30 PM

Location: Auditorium 537

Schedule:

  • Preparation: Handbook Chapters 2 and 3
  • 45 min: Introduction to transformers and attention
  • 20 min: Coffee break
  • 45 min: Introduction to Mechanistic Interpretability
  • 30 min: Coding: PyTorch and TransformerLens
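To give a flavor of the Session 1 coding block, here is a minimal sketch of scaled dot-product attention — the core operation covered in the transformers introduction — written in plain numpy. All names, shapes, and values are illustrative; the session itself works with PyTorch and TransformerLens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_head)) V."""
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)  # (seq, seq) attention scores
    pattern = softmax(scores, axis=-1)  # each row is a probability distribution
    return pattern @ V

rng = np.random.default_rng(0)
seq, d_head = 5, 8
Q, K, V = (rng.standard_normal((seq, d_head)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 8)
```

The attention pattern (each row of `softmax(scores)`) is exactly the object that much of mechanistic interpretability inspects head by head.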

Materials:

Session 2 - Circuits

Date: August 30, Friday, starting at 2:30 PM

Location: Auditorium 418

Schedule:

  • Preparation: Handbook Chapters 3 and 4
  • 40 min: Circuits and the induction circuit
  • 20 min: Exploration of the paper "A Mathematical Framework for Transformer Circuits"
  • 20 min: Coffee break
  • 60 min: Coding: circuit discovery
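As a hedged preview of the circuit-discovery exercise, the sketch below scores an attention pattern for "induction" behavior: on repeated text "A B ... A", an induction head attends from the second "A" to the token that followed the first "A". The helper names and the scoring rule are illustrative, not the session's actual code.

```python
import numpy as np

def induction_targets(tokens):
    """For each position, the position an ideal induction head attends to:
    the token right after the previous occurrence of the current token."""
    last_seen = {}
    targets = [-1] * len(tokens)
    for i, t in enumerate(tokens):
        if t in last_seen:
            targets[i] = last_seen[t] + 1
        last_seen[t] = i
    return targets

def induction_score(pattern, tokens):
    """Average attention mass a head places on its induction targets."""
    targets = induction_targets(tokens)
    hits = [pattern[i, j] for i, j in enumerate(targets) if 0 <= j <= i]
    return float(np.mean(hits)) if hits else 0.0

tokens = [5, 2, 7, 5, 2, 7]  # the sequence repeats, so induction is possible
# A perfect induction head puts all attention on the induction target:
pattern = np.zeros((len(tokens), len(tokens)))
for i, j in enumerate(induction_targets(tokens)):
    if 0 <= j <= i:
        pattern[i, j] = 1.0
print(induction_score(pattern, tokens))  # 1.0
```

Scores like this, computed over a real model's attention patterns, are one simple way to locate candidate induction heads before digging into the circuit.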

Materials:

Session 3 - Superposition

Date: September 6, Friday, starting at 2:30 PM

Location: Auditorium 537

Schedule:

  • Preparation: Handbook Chapter 5
  • 30 min: Superposition
  • 30 min: 3Blue1Brown video "How might LLMs store facts" and discussion
  • 20 min: Coffee break
  • 60 min: Coding: superposition in toy models
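The toy-model setup from the superposition literature can be sketched in a few lines: a model with more features than hidden dimensions, reconstructing inputs as ReLU(WᵀWx + b). The weights below are random and untrained — this only illustrates the architecture's shapes, not learned superposition.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_hidden = 5, 2  # more features than dimensions -> superposition

# Toy model architecture: x' = ReLU(W^T W x + b)
W = rng.standard_normal((d_hidden, n_features)) * 0.5
b = np.zeros(n_features)

def toy_model(x):
    h = W @ x                            # compress 5 features into 2 dims
    return np.maximum(W.T @ h + b, 0.0)  # reconstruct with ReLU

x = np.zeros(n_features)
x[3] = 1.0                   # a single sparse active feature
print(toy_model(x).round(2))

# Off-diagonal entries of W^T W measure interference between features
# forced to share the same hidden dimensions:
interference = W.T @ W
```

After training on sparse inputs, the columns of `W` arrange themselves into near-orthogonal directions, which is the phenomenon the session's exercises explore.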

Materials:

Session 4 - Sparse Autoencoders (SAE)

Date: September 13, Friday, starting at 3:00 PM

Location: Auditorium 418

Schedule:

  • Preparation: Handbook Chapter 6
  • 30 min: Sparse Autoencoders (SAE) and practical exploration
  • 30 min: Exploration of the papers "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" and "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet"
  • 20 min: Coffee break
  • 40 min: Coding: Using SAEs
  • 30 min: New areas of exploration and tips for the Hackathon!
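For orientation before the session, here is a minimal sketch of a sparse autoencoder's forward pass: an overcomplete ReLU encoder producing sparse feature activations, plus a linear decoder reconstructing the model activation. Weights are random and untrained; dimensions and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # overcomplete dictionary: d_sae >> d_model

# Untrained weights, shown only to illustrate the shapes involved.
W_enc = rng.standard_normal((d_sae, d_model)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_model, d_sae)) * 0.1
b_dec = np.zeros(d_model)

def sae(x):
    """Encode an activation vector into sparse features, then reconstruct."""
    f = np.maximum(W_enc @ (x - b_dec) + b_enc, 0.0)  # ReLU -> sparse features
    x_hat = W_dec @ f + b_dec                          # linear reconstruction
    return f, x_hat

x = rng.standard_normal(d_model)
f, x_hat = sae(x)
print(f.shape, x_hat.shape)  # (64,) (16,)
```

In training, the loss combines reconstruction error with an L1 penalty on `f` to encourage sparsity — the setup behind the "Towards Monosemanticity" dictionary-learning papers discussed in this session.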

Materials:

Additional Resources

Opportunities

🎓 ARENA

Opportunity to spend four weeks in London studying practical content relevant to AI Safety research.

🔬 MATS

Program for starting research in AI Safety under the guidance of mentors with extensive experience in the field. Several authors of papers used in this course are MATS mentors.
